Sharing chemical knowledge between humans and machines is a critical component of advancing chemistry research and innovation. This collaboration has the potential to improve the speed, accuracy, and efficiency of chemical research, drug development, materials science, and a variety of other applications.
Researchers have developed a framework that uses artificial neural networks to convert chemical structural formulae into machine-readable form. Using this platform, they have built a program that automatically feeds information from scientific papers into databases. Until recently, this had to be done by hand, which was time-consuming.
Structural formulae show how chemical compounds are built, i.e., which atoms they contain, how those atoms are arranged spatially, and how they are linked. From a structural formula, chemists can infer, for example, which molecules can and cannot react with each other, how complicated compounds can be synthesized, or which natural substances may have a therapeutic effect because they fit together with target molecules in cells.
The portrayal of molecules as structural formulae, which was developed in the nineteenth century, has withstood the test of time and is still employed in every chemistry textbook. What makes the chemical world intuitively understandable to humans, however, is simply a collection of black-and-white pixels for software. “The information from structural formulae must be translated into machine-readable code before it can be used in databases that can be searched automatically,” explains Christoph Steinbeck, Professor of Analytical Chemistry, Cheminformatics, and Chemometrics at the University of Jena.
An image becomes a code
That is exactly what the Artificial Intelligence tool “DECIMER,” developed by the team led by Prof. Steinbeck and his colleague Prof. Achim Zielesny from the Westphalian University of Applied Sciences, can do. DECIMER is an acronym that stands for “Deep Learning for Chemical Image Recognition.” It is an open-source platform that is freely available on the Internet and can be accessed with any standard web browser. Drag and drop scientific publications containing chemical structural formulae into the AI tool, and it will immediately begin working.
“First, the entire document is searched for images,” explains Steinbeck. The algorithm then analyzes each image it finds and classifies it as either a chemical structural formula or some other kind of image. Finally, the structural formulae it has recognized are translated into a chemical structure code or displayed in a structure editor, so that they can be processed further. “This step is the core of the project and the real achievement,” adds Steinbeck.
In this way, the chemical structural formula for the caffeine molecule becomes the machine-readable structure code CN1C=NC2=C1C(=O)N(C(=O)N2C)C. This can then be uploaded directly into a database and linked to further information on the molecule.
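The caffeine string above is written in SMILES notation, in which letters stand for atoms, digits mark ring closures, and parentheses mark branches. As a minimal sketch of why such a code is easier for software to handle than a picture, the Python below counts the non-hydrogen atoms in the string and stores the result in a searchable table. The table layout, the function names, and the simplified atom pattern are assumptions made for illustration; they are not part of DECIMER, and a real pipeline would rely on a cheminformatics toolkit rather than a regular expression.

```python
import re
import sqlite3
from collections import Counter

# The structure code quoted in the article for caffeine.
CAFFEINE = "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"

# Simplified atom pattern (an assumption of this sketch): two-letter halogens
# first, then one-letter organic atoms; lowercase letters are aromatic atoms.
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def heavy_atom_counts(smiles: str) -> Counter:
    """Count non-hydrogen atoms per element in a simple SMILES string."""
    if "[" in smiles:
        # Bracket atoms such as [Na+] need a real parser; refuse them here.
        raise ValueError("bracket atoms are not supported in this sketch")
    return Counter(token.capitalize() for token in ATOM.findall(smiles))


def build_database() -> sqlite3.Connection:
    """Create an in-memory table keyed by structure code (hypothetical layout)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE compounds ("
        "smiles TEXT PRIMARY KEY, name TEXT, "
        "carbons INTEGER, nitrogens INTEGER, oxygens INTEGER)"
    )
    return conn


def add_compound(conn: sqlite3.Connection, name: str, smiles: str) -> None:
    """Store a compound together with atom counts derived from its code."""
    counts = heavy_atom_counts(smiles)
    conn.execute(
        "INSERT INTO compounds VALUES (?, ?, ?, ?, ?)",
        (smiles, name, counts["C"], counts["N"], counts["O"]),
    )


if __name__ == "__main__":
    db = build_database()
    add_compound(db, "caffeine", CAFFEINE)
    # Search by a structural property instead of by name.
    for row in db.execute("SELECT name, smiles FROM compounds WHERE nitrogens = 4"):
        print(row)
```

Run as a script, this stores caffeine with eight carbon, four nitrogen, and two oxygen atoms, and the query for compounds with four nitrogen atoms returns it. None of this is possible with the pixel image alone, which is the gap the translation step closes.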
DECIMER was created using modern AI methods that have only recently become established and are also employed, for example, in the Large Language Models (such as ChatGPT) that are currently the subject of much debate. To train its AI tool, the team generated structural formulae from existing machine-readable databases and used them as training data; over 450 million structural formulae have been generated to date. In addition to researchers, companies are already making use of the AI tool, for example, to transfer structural formulae from patent specifications into databases.
Steinbeck and Zielesny came up with the idea of developing an AI tool for decoding chemical images a few years ago. The two chemists were interested in the development of AI methods in connection with the millennia-old Asian board game Go. In 2016, together with millions of people around the world, they watched the spectacular tournament between the best Go player at the time, the South Korean Lee Sedol, and the computer software “AlphaGo,” which the machine won 4:1.
“It was like a bolt from the blue that showed us how powerful AI can be,” Steinbeck remembers. Until then, it had been regarded as practically unimaginable that an algorithm could compete with human inventiveness and intuition in this game. “We realized that, with enough training data, these new methods could also solve other very complex problems when, a little later, an AI tool developed quasi-superhuman playing strength without being laboriously trained on countless human games (as was still the case with AlphaGo), but simply by playing against itself again and again and optimizing its playing style as it did. That was something we intended to use for our studies.”
Making scientific information sustainably usable
Steinbeck and his colleagues expect that with DECIMER, they will be able to machine-read any chemical literature of relevance to them, dating back to the 1950s, and transform it into open databases. After all, one of Steinbeck’s primary concerns as the coordinator of Germany’s National Research Data Infrastructure for Chemistry is to sustainably secure existing information and make it available to the global scientific community.