Researchers from IBM, Oxford University, and Diamond Light Source have demonstrated in a new study that an IBM generative AI model, CogMol, can produce antiviral molecules for a variety of target virus proteins, including those of SARS-CoV-2. This finding could speed up the drug discovery process and improve our preparedness for future pandemics.
The findings were published in Science Advances. At the time the paper was submitted, Oxford researchers had successfully validated the antiviral properties of eleven compounds. The approach could speed up the delivery of medications in an emergency and make treatments for urgent, life-threatening illnesses more accessible.
Early in the pandemic, a team of IBM computer scientists set out to investigate whether generative AI could be used to design novel compounds that block SARS-CoV-2, the virus that causes COVID-19.
David Stuart, Head of the Division of Structural Biology in the Department of Clinical Medicine at the University of Oxford and Life Sciences Director at Diamond Light Source, the UK’s national synchrotron, is an authority on pathogens including HIV, SARS, and Ebola. He explains that he was initially skeptical.
“The idea that you could take a protein sequence and, with AI, pluck out of thin air chemicals that would bind to a 3D site on the virus seemed very unlikely,” he said.
To show that generative AI could “pluck viable starting points for antivirals out of thin air,” he and Martin Walsh, an expert structural biologist and Diamond’s deputy director for life sciences, joined the IBM team and worked with Enamine Ltd., a Ukrainian chemical supplier, and other Oxford researchers over the course of three years.
Because it was also a foundation model, pre-trained on enormous volumes of raw data, the generative model was adaptable enough to produce new inhibitors for multiple protein targets without additional training or any knowledge of their 3D structures.
The Stuart and Walsh groups had already begun working on two essential SARS-CoV-2 proteins: the spike protein and the main protease. Using these targets, the scientists discovered four potential COVID-19 antivirals in a fraction of the time it would have taken with more traditional techniques. To see how a subset of the AI-generated compounds bound to the main protease, the team then used Diamond’s high-throughput macromolecular crystallography beamlines.
In addition to publishing a web-based interface for engaging with the model and other chemical foundation models like it in IBM Cloud, they have published a new paper showcasing their work in Science Advances.
The scientists cautioned that the validated compounds still have a long way to go, including clinical trials, before companies could turn them into medications. But even if the AI-generated “hits” are never turned into pharmaceuticals, the research suggests that generative AI could play a crucial role in drug development in the future, particularly during a crisis.
“It took time to develop and validate these methods, but now that we have a working pipeline in place, we can generate results much faster,” said study co-senior author, Payel Das, a researcher at IBM Research. “When the next virus emerges, generative AI could be pivotal in the search for new treatments.”
“Generating initial compounds that bind with high affinity to a drug target of interest accelerates the structure-based drug discovery pipeline and underpins our efforts to be better prepared for future pandemics,” said Martin Walsh, co-senior author of the study at Diamond.
The researchers built their model, Controlled Generation of Molecules (or CogMol), on a generative AI architecture known as a variational autoencoder, or VAE. A VAE compresses raw data into a compact representation, which is then decoded into a statistical variation on the original sample. They trained their model on a sizable dataset of compounds represented as strings of text, together with general knowledge about proteins and their binding characteristics.
However, they purposefully omitted details on the 3D structure of SARS-CoV-2 or the molecules that are known to attach to it. Their intention was to equip their generative foundation model with a substantial body of knowledge so that it could more easily be applied to molecular design challenges it had never encountered before.
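The compress-then-decode idea behind a VAE can be illustrated with a very small sketch. This is not CogMol's architecture: it is an untrained toy with linear maps in place of neural networks, and the function names, weights, and vector sizes are all assumptions made for illustration. It only shows the three steps: encode an input to a latent distribution, sample from it, and decode the sample.

```python
import math
import random


def encode(x, w_mu, w_logvar):
    """Map input x to the mean and log-variance of a latent Gaussian."""
    mu = [sum(w * xi for w, xi in zip(row, x)) for row in w_mu]
    logvar = [sum(w * xi for w, xi in zip(row, x)) for row in w_logvar]
    return mu, logvar


def sample_latent(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def decode(z, w_dec):
    """Map a latent vector back to the input space."""
    return [sum(w * zi for w, zi in zip(row, z)) for row in w_dec]


def kl_to_standard_normal(mu, logvar):
    """KL divergence from N(mu, sigma^2) to N(0, 1), summed over dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, logvar))


# Hypothetical 4-dimensional input compressed to a 2-dimensional latent code.
W_MU = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]
W_LOGVAR = [[-0.1, 0.0, 0.0, 0.0], [0.0, -0.1, 0.0, 0.0]]
W_DEC = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]

rng = random.Random(0)
x = [1.0, 2.0, 1.0, 2.0]
mu, logvar = encode(x, W_MU, W_LOGVAR)
variant_a = decode(sample_latent(mu, logvar, rng), W_DEC)
variant_b = decode(sample_latent(mu, logvar, rng), W_DEC)
```

Decoding two different samples from the same latent distribution gives two different variations on the same input, which is the property a generative model exploits to propose new candidates. In a trained VAE the weights are learned by balancing reconstruction error against the KL term.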
Their aim was to find drug-like compounds that would bind to two COVID protein targets: the main protease and the spike protein, which helps the virus enter the host cell. Although both proteins’ 3D structures had been solved by that point, the IBM researchers decided to use only their amino acid sequences. They imposed this restriction to see whether the model could generate molecules without knowing its target’s structure.
The researchers input only the amino acid sequence for each protein target into CogMol, which generated 875,000 candidate molecules in three days. To narrow the pool, the researchers ran the candidates through a retrosynthesis platform, IBM RXN for Chemistry, to understand what ingredients would be needed to synthesize the compounds.
They chose 100 molecules for each target based on the projected recipes provided by the platform. Enamine chemists further whittled the list down to four compounds for each target, choosing those they thought would be the simplest to produce.
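The narrowing process described above is a two-stage, per-target funnel: a large pool is cut to 100 per target, then to four per target. A minimal sketch follows. The `Candidate` fields, the scoring attributes, and the synthetic data are all hypothetical; in the study the first cut relied on retrosynthesis predictions and the second on chemists' judgment, neither of which is modeled here.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    target: str            # e.g. "main_protease" or "spike"
    predicted_steps: int   # hypothetical stand-in for a retrosynthesis estimate
    ease_score: float      # hypothetical stand-in for a chemist's judgment


def top_k_per_target(candidates, k, key):
    """Group candidates by target and keep the k best (lowest key) in each group."""
    groups = {}
    for cand in candidates:
        groups.setdefault(cand.target, []).append(cand)
    return {target: sorted(group, key=key)[:k]
            for target, group in groups.items()}


def run_funnel(candidates, first_cut=100, final_cut=4):
    """Cut to first_cut per target by predicted steps, then to final_cut by ease."""
    stage1 = top_k_per_target(candidates, first_cut,
                              key=lambda c: c.predicted_steps)
    survivors = [c for group in stage1.values() for c in group]
    return top_k_per_target(survivors, final_cut,
                            key=lambda c: -c.ease_score)


def make_fake_candidates(n_per_target, seed=0):
    """Generate synthetic candidates; the values carry no chemical meaning."""
    rng = random.Random(seed)
    return [Candidate(f"{target}-{i}", target, rng.randint(1, 12), rng.random())
            for target in ("main_protease", "spike")
            for i in range(n_per_target)]


pool = make_fake_candidates(1000)
final = run_funnel(pool)
```

Keeping the grouping by target inside the helper ensures that a strong set of candidates for one protein cannot crowd out the other, which matches the article's "for each target" wording.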
After synthesizing the eight novel molecules, Enamine shipped them to Oxford, where their ability to disrupt the functions of the two protein targets was tested in the labs of Prof Chris Schofield and Prof Gavin Screaton. The intense X-ray beams generated at Diamond, which are 10 billion times brighter than the sun, were used to visualize how the compounds interacted with the proteins to inactivate their function.
The new compounds underwent target inhibition tests and live-virus neutralization tests. Two of the validated antivirals target the main protease, while the other two target the spike protein and also neutralized all six major COVID variants.
“You get a map that shows exactly where things bind, and bang! you’ve got a confirmation,” said Stuart.
CogMol is one of several chemical foundation models that IBM has since developed. The largest, MoLFormer-XL, was trained on a database of more than 1.1 billion molecules and is currently being used by Moderna to design mRNA medicines.
“We created valid starting points for accelerated development of antivirals using a generative foundation model that knew relatively little about its protein targets,” said the study’s co-senior author, Jason Crain, a researcher at IBM Research and professor at Oxford. “I’m hopeful that these methods will allow us to create antivirals and other urgently needed compounds much faster and more inexpensively in the future.”
Although the researchers’ main focus was on validating COVID antivirals, they contend that the same techniques can be applied to viruses that already exist but are still evolving, such as the flu, or to viruses that have not yet been discovered.
“If you want to be prepared for the next pandemic, you want drugs that act on different sites of the protein,” concluded Stuart. “It becomes much harder for the virus to escape.”