AI tool predicts structures of 1 billion proteins in newly released atlas
Researchers at the Chan Zuckerberg Initiative's Biohub released an open-source database today containing predicted structures for more than 1 billion proteins. The ESM Atlas dwarfs existing databases by hundreds of millions of entries and was built using ESMFold2, an AI model the team says outperforms Google DeepMind's AlphaFold3.
The atlas includes 1.1 billion predicted protein structures and sequence information for 6.8 billion proteins. Most come from metagenomic sequences (genetic material from soil, ocean, and other environments) that weren't included in previous databases.
How this expands the protein universe
The AlphaFold Database, the previous standard, contains predictions for roughly 200 million proteins. ESMFold2's predecessor atlas held about 800 million entries. The new release more than doubles the available structural data.
Alex Rives, science head at Biohub and lead researcher, said the atlas reveals "the totality of protein biology and especially the parts that are most unknown." The team trained ESMFold2 on billions of proteins across the tree of life, including those metagenomic sequences absent from earlier databases.
Practical applications already demonstrated
Researchers used ESMFold2 to design new antibodies and proteins that bind to targets implicated in cancers and immune disorders. When tested in the lab, a high proportion of the designs functioned as predicted.
The team also identified structural similarities between CRISPR defense proteins and a gene-editing protein found in soil fungi, revealing connections between previously separate areas of protein biology.
ESMFold2 particularly excels at predicting the structures of interacting proteins, including antibody-antigen complexes, where it outperforms existing methods.
Reception and remaining questions
Computational biologists see the atlas as a significant resource. Gemma Atkinson at Lund University called it "an extraordinary resource for biology" and noted how large-scale protein language models capture fundamental rules of protein structure.
Christine Orengo at University College London said the predictions could help uncover new protein folds and functions, though they will first need evaluation.
Some researchers urged caution. Martin Steinegger at Seoul National University questioned how well ESMFold2 predicts proteins very different from known structures. His team found the original ESMFold struggled with unusual proteins, particularly those from metagenomic data.
Sergey Ovchinnikov at MIT views the ESM Atlas as a supplement to AlphaFold rather than a replacement. He noted that other proprietary and open-source models have also made gains in predicting protein interactions, though he expects that ESMFold2's fully open-source nature and unrestricted commercial use will attract wide adoption.
The atlas is freely accessible.