$152M OMAI Project to Build Transparent, Open AI for Science as UNM Social Scientist Joins

UNM's Sarah Dreier joins AI2-led OMAI, a $152M push for transparent, open AI built for reproducible science. Expect open models, provenance-first data, and extensible tools.

Published on: Sep 21, 2025


AI is only as good as its data. Most large models were trained on the open internet, which means noise, bias, and zero visibility into sources. That's a nonstarter for scientific work where reproducibility, provenance, and audit trails are non-negotiable.

Sarah Dreier, assistant professor of political science at the University of New Mexico, is joining the Open Multimodal AI Infrastructure to Accelerate Science (OMAI) project to fix that. She is the sole social scientist on the team, focusing on dataset curation and practical data needs for scientific workflows like literature analysis and code generation, backed by a $600,000 allocation.

Led by the Allen Institute for AI (AI2), OMAI is a $152 million effort to build a fully open suite of AI models and infrastructure for U.S. science. Funding includes $75 million from the U.S. National Science Foundation and $77 million from Nvidia, supporting the broader federal push for trustworthy, high-performance AI in research.

"The engineers training these models don't know what the data is," Dreier noted, pointing to the gap between opaque training corpora and the demands of rigorous science. Her goal: models that are more transparent, more open, and more flexible for real research pipelines.

Noah Smith, who directs natural language processing research at AI2 and teaches at the University of Washington, emphasized why openness matters. Many LLMs are closed: their data, training tools, and methods are private, which blocks inspection, adaptation, and reuse. "Open models are essential for transparency, reproducibility, and collaboration, the core of how scientific progress happens."

Other investigators on the five-year project include UW's Hanna Hajishirzi, University of Hawai'i at Hilo's Travis Mandel, and the University of New Hampshire's Samuel Carton. The team plans to release open models, tools, and compute infrastructure to help scientists move faster with fewer blind spots.

According to Smith, the models will help researchers parse vast literatures, generate code and visualizations, and connect new findings to prior work. Expect impact across materials science, protein function prediction, and energy research.

Why this matters to engineers and researchers

  • Auditability by design: Open data pipelines, model cards, and documented training sets allow you to verify sources, assess bias, and replicate results.
  • Domain-focused training: Curated corpora for political science, sociology, biology, and engineering ensure models are tuned for real tasks like hypothesis generation, code authoring, and data analysis.
  • Tooling you can extend: Open checkpoints, evaluation suites, and APIs let teams fine-tune, add retrieval, and integrate with lab systems without license constraints.
  • Data rights and provenance: Licensing, usage permissions, and lineage tracking reduce legal and compliance risk, especially for government- and grant-funded work.
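To make the provenance idea above concrete, here is a minimal sketch of a per-document lineage record. All field names and the `DatasetRecord` class are hypothetical illustrations, not OMAI's actual schema; the point is that license, origin, and every transformation travel with the data.

```python
from dataclasses import dataclass, field, asdict
from typing import List

@dataclass
class DatasetRecord:
    """Hypothetical provenance record for one training document."""
    source_url: str                  # where the document was collected
    license: str                     # usage permission, e.g. "CC-BY-4.0"
    collected_on: str                # ISO date of collection
    domain: str                      # e.g. "political science"
    transformations: List[str] = field(default_factory=list)  # cleaning steps applied

    def add_step(self, step: str) -> None:
        """Append a processing step so the lineage stays auditable."""
        self.transformations.append(step)

record = DatasetRecord(
    source_url="https://example.org/paper.pdf",
    license="CC-BY-4.0",
    collected_on="2025-01-15",
    domain="political science",
)
record.add_step("pdf_to_text")
record.add_step("dedupe")
print(asdict(record))
```

A record like this is what lets a downstream auditor answer "where did this training example come from, and are we allowed to use it?" without re-running the pipeline.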

What to expect over the next five years

  • Public releases of models, datasets, and benchmarks with transparent documentation.
  • Provenance-first data pipelines and clear governance for contributions from partner universities.
  • Support for multimodal inputs (text, code, visuals) to match real scientific artifacts and workflows.
  • Community collaboration: opportunities to contribute datasets, evaluations, and domain expertise.

How to prepare your team now

  • Inventory clean, licensed datasets. Add data statements that spell out collection methods, intended use, and limitations.
  • Stand up retrieval pipelines with strict provenance so model outputs cite sources. Avoid mixing unknown web corpora into training or fine-tuning.
  • Create evaluation protocols for literature synthesis, code generation, and scientific QA that reflect your lab's acceptance criteria.
  • Train your staff on open model workflows and governance.
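The retrieval-with-provenance step above can be sketched in a few lines. This is a toy illustration, not a production pipeline: the corpus, DOIs, and word-overlap scoring are placeholder assumptions. The design point is that every passage returned to the model carries its source identifier, so generated answers can cite it.

```python
def retrieve(query: str, corpus: list[tuple[str, str]], k: int = 2) -> list[dict]:
    """Rank (text, source_id) passages by word overlap with the query.

    Each hit keeps its source_id attached so downstream outputs can cite it.
    """
    q_words = set(query.lower().split())
    scored = []
    for text, source_id in corpus:
        overlap = len(q_words & set(text.lower().split()))
        scored.append({"text": text, "source": source_id, "score": overlap})
    scored.sort(key=lambda r: r["score"], reverse=True)
    return scored[:k]

# Toy corpus: (passage, source identifier) pairs with placeholder DOIs.
corpus = [
    ("Protein function prediction with open models", "doi:10.0000/aaa"),
    ("Energy grid optimization survey", "doi:10.0000/bbb"),
    ("Open models for materials science discovery", "doi:10.0000/ccc"),
]

hits = retrieve("open models for protein function", corpus)
for h in hits:
    print(f"{h['text']}  [cite: {h['source']}]")
```

In practice you would swap the overlap score for a real embedding index, but the invariant to preserve is the same: no passage enters the context window without a citable source attached.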

OMAI's promise is simple: make high-utility AI for science that anyone can inspect, extend, and trust. For updates, follow announcements from the Allen Institute for AI and program news from the U.S. National Science Foundation.