AI Turns Materials Literature into a Lab Assistant
Cambridge and Argonne teams build AI to mine papers, structure data, and guide experiments. Fine-tuned Q&A models match larger ones, cutting compute and costs for labs.

AI That Reads the Literature So You Can Run the Experiments
Scientific papers are piling up faster than any team can read them. With support from supercomputers at the U.S. Department of Energy's Argonne National Laboratory, Jacqueline Cole and her University of Cambridge team are building AI systems that mine journals, extract structured data, and feed compact language models built for materials research.
The goal is simple: a lab-ready assistant that answers questions, offers feedback, and helps steer experiments. As Head of Molecular Engineering at Cambridge, Cole frames it plainly: a tool that complements scientists rather than replacing them.
From Text Mining to Lab-Ready Assistants
This work began nearly a decade ago at the Argonne Leadership Computing Facility (ALCF), where it included one of the first ALCF Data Science Program projects. Cole's team combined machine learning, simulations, and experimental results to build data-first workflows for materials discovery.
They developed ChemDataExtractor to automatically parse papers and create structured databases. That foundation enabled AI models that are smaller, faster, and easier to deploy in real labs.
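To give a feel for this kind of pipeline, here is a minimal sketch using ChemDataExtractor's documented high-level API to pull structured records from a paper. The input file path is a placeholder, and exact calls may differ between ChemDataExtractor versions.

```python
# Minimal sketch: extract structured chemical records from a paper
# with ChemDataExtractor. The file path is a placeholder; the exact
# API may vary between versions of the library.
from chemdataextractor import Document

with open("paper.html", "rb") as f:   # hypothetical input paper
    doc = Document.from_file(f)

# Serialize extracted records (compound names, properties, ...)
# into plain dicts ready for a structured database.
records = doc.records.serialize()
for record in records:
    print(record)
```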
Skip Costly Pretraining: Fine-Tune on Domain Q&A
Pretraining large language models on generic text requires massive compute. Cole's team took another path: generate a large, high-quality question-answer dataset directly from curated materials databases, then fine-tune compact models on that Q&A.
Using ChemDataExtractor and new algorithms, they converted a photovoltaic materials database into hundreds of thousands of Q&A pairs. As Cole explains, this shifts the knowledge burden off the model and into the data: give the model clean, structured Q&A, and skip pretraining while still getting domain-specific utility.
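The team's exact generation algorithms aren't spelled out here, but the core idea can be sketched as simple templating over database rows. The field names and values below are hypothetical stand-ins for whatever a curated photovoltaics database actually stores.

```python
# Minimal sketch of template-based Q&A generation from a structured
# materials database. Field names ("material", "bandgap_ev", ...) and
# values are hypothetical placeholders.
records = [
    {"material": "MAPbI3", "bandgap_ev": 1.55, "pce_percent": 22.1},
    {"material": "CsPbBr3", "bandgap_ev": 2.30, "pce_percent": 10.5},
]

templates = [
    ("What is the band gap of {material}?", "{bandgap_ev} eV"),
    ("What power conversion efficiency has been reported for {material}?",
     "{pce_percent}%"),
]

# Cross every record with every template to scale pair counts:
# a modest database times many templates yields a large Q&A set.
qa_pairs = [
    {"question": q.format(**rec), "answer": a.format(**rec)}
    for rec in records
    for q, a in templates
]

for pair in qa_pairs:
    print(pair["question"], "->", pair["answer"])
```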
The result: smaller models that match or beat much larger general models on materials tasks, with up to 20% higher accuracy in the target domain. While the study centered on solar-cell materials, the method generalizes.
Domain Models That Deliver
The team built a large database of stress-strain properties for materials used in aerospace and automotive applications. They then trained MechBERT to answer questions about those properties, predicting material behavior under load more accurately than standard tools.
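MechBERT's training setup isn't detailed here, but a generic extractive-QA workflow with a BERT-style model looks roughly like the sketch below. The checkpoint name is a publicly available stand-in for a domain model like MechBERT, and the passage is illustrative.

```python
# Minimal sketch of extractive question answering with a BERT-style
# model via Hugging Face transformers. The SQuAD-tuned checkpoint
# stands in for a domain-specific model like MechBERT.
from transformers import pipeline

qa = pipeline("question-answering",
              model="distilbert-base-cased-distilled-squad")

context = ("The alloy exhibited a yield strength of 350 MPa and an "
           "elastic modulus of 70 GPa at room temperature.")
result = qa(question="What is the yield strength of the alloy?",
            context=context)
print(result["answer"], f"(score={result['score']:.2f})")
```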
In optoelectronics, they adapted language models using 80% less compute than typical training methods, with no loss in performance. The throughline: focused data pipelines, compact models, and practical outputs for researchers.
Why This Matters for Your Lab
- Faster decisions mid-experiment. Ask targeted questions, interpret anomalies, and adjust setups without sifting through dozens of PDFs.
- Lower compute and cost. Fine-tune with a few GPUs (or even a personal workstation) using curated Q&A instead of full-model pretraining.
- More reproducible insights. Structured datasets and transparent Q&A generation make results easier to audit and extend.
- Broader access. Teams across materials domains can build their own assistants using their own databases.
How to Try This Approach
- Pick a domain (e.g., photovoltaics, stress-strain, optoelectronics) and assemble a high-quality, structured dataset.
- Use a text-mining pipeline (e.g., ChemDataExtractor) to expand and normalize entries from the literature.
- Programmatically generate question-answer pairs that reflect the queries your lab actually asks.
- Fine-tune a compact, open model on the Q&A (see the sketch after this list); validate against held-out papers and known benchmarks.
- Deploy behind a simple interface; log queries and outcomes to keep improving your dataset and model.
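The article doesn't specify the team's training stack; as one plausible way to implement the fine-tuning step, the sketch below uses Hugging Face transformers with a compact open model. The toy dataset, "distilgpt2", and the hyperparameters are all placeholders for your own data and choices.

```python
# Minimal sketch: fine-tune a compact open model on generated Q&A
# pairs with Hugging Face transformers. Model, data, and settings
# are illustrative placeholders, not the team's actual setup.
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import Dataset

# Toy Q&A pairs rendered as plain text for causal LM fine-tuning.
qa_pairs = [
    {"text": "Q: What is the band gap of MAPbI3? A: 1.55 eV"},
    {"text": "Q: What PCE has been reported for CsPbBr3? A: 10.5%"},
]

tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")

dataset = Dataset.from_list(qa_pairs).map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-finetune",
                           per_device_train_batch_size=2,
                           num_train_epochs=1),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice you would hold out a validation split of Q&A pairs drawn from unseen papers, exactly as step 4 suggests, before deploying the model behind an interface.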
Recognition and Scale
The team earned the Royal Society of Chemistry's 2025 Materials Chemistry Horizon Prize for work on panchromatic co-sensitized solar cells. With ALCF support, they continue to ship practical AI tools for energy materials, light-based technologies, and mechanical engineering.
The intent is democratization: you don't need to be an LLM specialist to build a useful assistant for your niche. Off-the-shelf models, plus your curated Q&A, can get you there.
Learn More
The ALCF is a DOE Office of Science user facility. Argonne National Laboratory, operated by UChicago Argonne, LLC for the U.S. Department of Energy's Office of Science, advances basic and applied research across scientific disciplines.