Artificial intelligence researchers at Meta claim to have developed the largest protein folding model of its kind to date, one able to predict the structure of more than 600 million proteins.
The team on Tuesday released the model, based on the 15-billion-parameter ESM-2 transformer, and a database of its protein structure predictions, dubbed the ESM Metagenomic Atlas. This database includes forms of proteins that have not yet been observed by scientists.
Proteins are complex biological molecules built from up to 20 types of amino acids, and they perform all kinds of biological functions in organisms. Crucially, they fold into complex 3D structures whose shape is essential to their function; knowing that shape helps scientists understand how proteins work, and from there find ways to mimic, modify, or counter that behavior.
Unfortunately, you cannot simply take a protein's amino acid sequence and immediately determine its final structure. You can run simulations or experiments to eventually figure it out, but that takes time. These days you can feed properly trained machine learning software the chemical makeup of a protein, and the model will predict the structure quickly and accurately, relatively speaking.
Indeed, DeepMind demonstrated this with its AlphaFold model, which won the biennial CASP international computational protein folding competition in 2020. Given an input string of amino acids, AlphaFold and other machine learning software can generate the corresponding three-dimensional structure.
Researchers at London-based DeepMind have since improved their system to predict the structure of more than 200 million proteins known to science. Meta’s latest ESM system went further, predicting hundreds of millions more after being trained on millions of protein sequences.
A preprint paper by the Meta team, Lin et al, explains the design of ESM-2. Interestingly, according to the researchers, the system is actually a large language model designed to “learn evolutionary patterns and generate end-to-end accurate structure predictions directly from a protein’s sequence.” AlphaFold, on the other hand, is not a language model and uses a different approach.
As the boffins note in their paper, these large language models can be used for much more than handling human languages: “Modern language models containing tens to hundreds of billions of parameters develop abilities such as few-shot language translation, commonsense reasoning, and mathematical problem solving, all without explicit supervision.
“These observations raise the possibility that a parallel form of emergence may be exhibited by language models trained on protein sequences.”
The result is ESM-2, which, although a language model, has been trained to predict the physical shape of a protein from a string of text representing its amino acids.
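To make the "string of text" idea concrete, here is a minimal sketch of how a protein sequence might be turned into the integer tokens a language model consumes. The vocabulary, the `tokenize` helper, and the example fragment are all hypothetical and for illustration only; this is not ESM-2's actual tokenizer, which the article does not describe.

```python
# The 20 standard amino acids, written as their one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical toy vocabulary: each residue letter maps to an integer id.
# A real model's vocabulary would differ (special tokens, rare residues).
TOKEN_IDS = {letter: index for index, letter in enumerate(AMINO_ACIDS)}


def tokenize(sequence: str) -> list[int]:
    """Convert a protein sequence string into a list of integer token ids."""
    cleaned = sequence.strip().upper()
    unknown = sorted(set(cleaned) - set(AMINO_ACIDS))
    if unknown:
        raise ValueError(f"unrecognized residue codes: {unknown}")
    return [TOKEN_IDS[residue] for residue in cleaned]


# A made-up ten-residue fragment, not a real protein.
example = "MKTAYIAKQR"
tokens = tokenize(example)
print(tokens)
```

Each letter plays the role a word or sub-word plays in a human-language model, which is why techniques built for text can be applied to proteins at all.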
ESM-2 is the largest model of its kind and apparently predicts structures faster than similar systems; it’s up to 60 times faster than previous leading systems like AlphaFold or Rosetta, which can take more than ten minutes to generate output, according to Meta.
The model was used to create the ESM Metagenomic Atlas, predicting over 600 million structures from the MGnify90 protein database in just two weeks on 2,000 GPUs. On a single Nvidia V100 GPU, it takes only 14.2 seconds to predict the structure of a protein made up of 384 amino acids. It appears from the paper that Meta says its system mostly, but not entirely, matched AlphaFold in terms of accuracy, although its speed was the key element, allowing it to predict more proteins.
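The figures quoted above invite a quick back-of-the-envelope check. The sketch below assumes exactly 600 million structures, exactly 14 days, and all 2,000 GPUs busy the whole time; none of those assumptions is confirmed by Meta, so treat the result as a rough average rather than a measured number.

```python
# Figures quoted in the article, rounded; assumptions are noted inline.
STRUCTURES = 600_000_000      # "over 600 million", taken as exactly 600M
GPUS = 2_000
DAYS = 14                     # "two weeks", taken as exactly 14 days
SECONDS_PER_DAY = 24 * 60 * 60

# Total compute budget, assuming every GPU ran continuously.
gpu_seconds = GPUS * DAYS * SECONDS_PER_DAY

# Average GPU time spent per predicted structure.
gpu_seconds_per_structure = gpu_seconds / STRUCTURES

# Average number of structures each GPU produced per day.
structures_per_gpu_per_day = STRUCTURES / (GPUS * DAYS)

print(f"{gpu_seconds_per_structure:.3f} GPU-seconds per structure")
print(f"{structures_per_gpu_per_day:,.0f} structures per GPU per day")
```

Under these assumptions the average comes to roughly four GPU-seconds per structure, below the 14.2-second figure for a 384-residue protein. That would be consistent with many sequences in the database being shorter than 384 residues, or with faster hardware, though the article does not say which.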
“With today’s advanced computing tools, predicting the structures of hundreds of millions of protein sequences in a practical amount of time could take years, even using the resources of a large research institution. To make predictions at the scale of metagenomics, a breakthrough in prediction speed is essential,” said the Facebook owner.
Meta hopes that ESM-2 and the ESM Metagenomic Atlas will advance science by helping researchers study evolutionary history or fight disease and climate change. “To extend this work even further, we are investigating how language models can be used to design new proteins and help solve health, disease and environmental problems,” the company concluded. ®