Open Molecules 2025: A Leap Forward in Molecular Simulations
Open Molecules 2025 (OMol25), a large-scale dataset of molecular simulations, has been released to the scientific community. This dataset enables the development of machine learning models capable of accurately simulating chemical reactions with real-world complexity. Created through a collaboration co-led by Meta and Lawrence Berkeley National Laboratory (Berkeley Lab), OMol25 offers new opportunities for research in materials science, biology, and energy technologies.
Samuel Blau, chemist and research scientist at Berkeley Lab, highlighted the importance of this release, noting its potential to change how atomistic simulations are conducted in chemistry. Larry Zitnick, research director at Meta’s Fundamental AI Research (FAIR) lab, expressed excitement about the dataset’s role in fostering new AI models.
The Dataset: Open Molecules 2025
OMol25 consists of over 100 million 3D molecular snapshots with properties calculated using density functional theory (DFT). DFT is a quantum mechanical method that models atomic interactions with high precision, predicting the forces on atoms and the energies of molecular systems. These quantities dictate how molecules move and react, and in turn influence larger-scale properties such as electrolyte behavior in batteries or drug-receptor binding.
While DFT provides high accuracy, it requires substantial computational resources, and the cost grows rapidly with molecular size. This limitation has historically made simulating complex molecular systems infeasible. Machine-learned interatomic potentials (MLIPs), trained on DFT data, offer a way around this: they deliver predictions of similar accuracy up to 10,000 times faster. MLIPs can thus extend simulations to much larger systems using standard computing resources.
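To make that drop-in relationship concrete, here is a minimal sketch using the Atomic Simulation Environment (ASE). ASE's built-in EMT calculator stands in for an MLIP trained on OMol25-style data, since any such potential exposes the same energy-and-forces interface; the specific calculator used here is only a placeholder.

```python
# Minimal sketch: an interatomic potential used as a drop-in replacement for DFT.
# ASE's toy EMT calculator stands in for an MLIP trained on OMol25-style DFT data;
# a real MLIP would expose the same interface (attach calculator, query energy/forces).
from ase.build import molecule
from ase.calculators.emt import EMT
from ase.optimize import BFGS

atoms = molecule("H2O")        # small test molecule from ASE's built-in library
atoms.calc = EMT()             # swap in an OMol25-trained MLIP calculator here

energy = atoms.get_potential_energy()  # total energy in eV
forces = atoms.get_forces()            # per-atom forces in eV/Angstrom, shape (n_atoms, 3)
print(f"Energy: {energy:.3f} eV, max |force|: {abs(forces).max():.3f} eV/Angstrom")

# Because each evaluation is cheap, tasks that are expensive with DFT,
# such as geometry optimization or long molecular dynamics, become routine.
BFGS(atoms, logfile=None).run(fmax=0.05)
```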
The value of an MLIP depends heavily on the quality, scale, and diversity of the training data. OMol25 addresses this need by providing the most chemically diverse molecular dataset available for training MLIPs.
Building a New Resource
Creating OMol25 required exceptional computing power and DFT expertise. The FAIR team leveraged Meta’s extensive global computing network, utilizing idle computational capacity worldwide. Previous datasets averaged 20-30 atoms per simulation with limited elemental variety. In contrast, OMol25 features configurations up to 350 atoms, spanning most of the periodic table, including heavy elements and metals that are complex to simulate.
The dataset captures a broad spectrum of molecular interactions and dynamics involving both organic and inorganic molecules. Samuel Blau underscored the scale, noting that OMol25 consumed six billion CPU hours, more than ten times the compute behind any previous dataset of its kind. To put this in perspective, running these calculations on 1,000 typical laptops would take more than 50 years.
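As a rough sanity check of that comparison (assuming laptops with about a dozen CPU cores running around the clock, a figure not stated in the article):

```python
# Back-of-the-envelope check of the laptop comparison.
# The cores-per-laptop figure is an assumption, not from the article.
cpu_hours = 6e9                  # total compute reported for OMol25
laptops = 1_000
cores_per_laptop = 12            # assumed typical modern laptop
hours_per_year = 24 * 365

years = cpu_hours / (laptops * cores_per_laptop * hours_per_year)
print(f"~{years:.0f} years")     # roughly 57 years, i.e. "more than 50 years"
```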
A Leap Forward in AI Models
Researchers worldwide can now train MLIPs using OMol25. Alongside the dataset, the FAIR lab released an open-access universal model trained on OMol25 and other datasets. This model is ready to use for many applications but is expected to improve as researchers refine training and usage methods.
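For researchers who want to try the released model, usage is expected to look roughly like the sketch below. The package path, class names, checkpoint tag, and task name are assumptions based on FAIR's fairchem tooling rather than confirmed details of this release, so the official documentation should be treated as the source of truth.

```python
# Hypothetical sketch of loading the OMol25-trained universal model as an ASE
# calculator. The fairchem import path, checkpoint tag, and task name below are
# assumptions, not confirmed API for this release; consult the official docs.
from ase.build import molecule
from fairchem.core import pretrained_mlip, FAIRChemCalculator  # assumed import path

predictor = pretrained_mlip.get_predict_unit("uma-s-1", device="cpu")  # assumed checkpoint tag
calc = FAIRChemCalculator(predictor, task_name="omol")                 # assumed molecular task name

atoms = molecule("CH3CH2OH")   # ethanol, a small organic test case
atoms.calc = calc
print(atoms.get_potential_energy())   # eV
print(atoms.get_forces())             # eV/Angstrom
```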
The collaboration also provides comprehensive evaluations: sets of challenges that assess model performance on practical tasks. These evaluations build confidence in MLIPs trained on OMol25 by testing whether they can handle complex chemistry, such as bonds breaking and reforming, as well as molecules with varying charges and spins.
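As an illustration of what such an evaluation measures, the sketch below computes the kind of error metrics typically reported, mean absolute errors on energies and forces against held-out DFT references; the arrays are random placeholders, not OMol25 data or the official benchmark suite.

```python
import numpy as np

# Generic sketch of scoring an MLIP against held-out DFT references using
# mean absolute errors (MAE) on energies and forces. The arrays below are
# random placeholders, not OMol25 data or the official evaluation suite.
rng = np.random.default_rng(0)

dft_energy = rng.normal(size=100)                        # reference energies (eV)
mlip_energy = dft_energy + rng.normal(scale=0.01, size=100)

dft_forces = rng.normal(size=(100, 30, 3))               # reference forces (eV/Angstrom)
mlip_forces = dft_forces + rng.normal(scale=0.02, size=(100, 30, 3))

energy_mae = np.mean(np.abs(mlip_energy - dft_energy))
force_mae = np.mean(np.abs(mlip_forces - dft_forces))
print(f"Energy MAE: {energy_mae * 1000:.1f} meV")
print(f"Force MAE:  {force_mae * 1000:.1f} meV/Angstrom")
```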
Evaluations also encourage progress through public rankings and friendly competition. Users can identify models that perform well and developers can benchmark their work effectively. Aditi Krishnapriyan, a key contributor to the evaluations, noted that trust in these models is critical, as scientific research depends on physically accurate results.
By the Community, For the Community
OMol25 was developed to meet the needs of the scientific community, with collaboration at its core. The team started by integrating existing datasets that represent important molecular configurations across chemistry fields. They then expanded coverage by adding new simulations to fill gaps—especially in biomolecules, electrolytes, and metal complexes.
Though comprehensive, the dataset currently lacks extensive polymer data. This will be addressed in a complementary project, Open Polymer data, involving collaborators from Lawrence Livermore National Laboratory.
The OMol25 team includes scientists from academia, industry, and national labs. Co-leads Samuel Blau and Brandon Wood connected through Berkeley Lab and Meta’s FAIR lab, recruiting experts from institutions such as UC Berkeley, Carnegie Mellon, Princeton, Stanford, Cambridge, and Genentech.
Brandon Wood emphasized the team effort behind OMol25 and their eagerness to see how the community applies this resource to advance AI modeling in molecular simulations.
Funding and Support
- Samuel Blau’s work on OMol25 was funded by Berkeley Lab’s Laboratory Directed Research and Development program.
- His contributions to electrolyte modeling were supported by the Energy Storage Research Alliance under the DOE Office of Science.
- Aditi Krishnapriyan’s efforts were funded by the DOE Office of Science through the Center for Ionomer-based Water Electrolysis.
For researchers interested in improving AI capabilities in scientific domains, OMol25 provides a foundational resource to accelerate innovation in molecular simulations.