DOE plans public-private data-curation consortium for science and engineering AI
The Department of Energy is planning a public-private consortium to aggregate scientific data across national laboratories and train "self-improving" AI models for science and engineering. The agency issued a new request for information (RFI) asking how to structure the effort and make models available via cloud resources to government, academia, and industry.
The goal is straightforward: unify high-value datasets, modernize data preparation, and enable model access that accelerates discovery and engineering workflows. The RFI also invites responses from think tanks, investors, research organizations, and AI developers with advanced model capabilities.
What DOE is asking for
The RFI seeks practical guidance on how to build and run the consortium so it actually delivers usable models and data at scale. It centers on mobilizing the labs, fixing data quality at the source, and streamlining access.
- How to mobilize national labs to partner with industry without slowing research momentum.
- How to ensure data is structured, cleaned, and preprocessed for training and evaluation.
- How to design a consortium that supports many scientific and technical disciplines.
- How to provide AI models to the scientific community using cloud programs and infrastructure.
The RFI also asks for recommendations on developing leading-edge models that use DOE data, facilities, and expertise, plus a call for interested partners that can contribute proven capabilities.
Policy backdrop and scope
The administration's AI Action Plan, released in July, emphasized energy-focused initiatives, national lab collaboration, and a nationwide buildout of AI-ready data centers. It directs DOE, NSF, NIST, and other federal partners to invest in automated cloud-enabled labs and to encourage researchers to release more high-quality datasets.
That aligns with the RFI's emphasis on shared infrastructure, open access where possible, and clear incentives for data contribution. A recent DOE request also sought proposals to expand data center capacity and energy infrastructure at Oak Ridge National Laboratory, signaling the compute and power footprint this effort will demand.
Why this matters for scientists and engineers
Unified data and shared models reduce duplicated effort and make cross-domain research faster. Standardized preprocessing and metadata improve reproducibility and downstream integration with lab workflows.
Cloud access lowers the barrier for multidisciplinary teams, including smaller labs that lack on-prem resources. If done well, the consortium could set common benchmarks, streamline model evaluation, and shorten the path from raw data to publishable results or deployable engineering outputs.
How to prepare a high-value response
- Map your data assets: Inventory datasets, modalities, sizes, formats, and known data quality issues. Highlight unique or hard-to-recreate data.
- Commit to FAIR: Propose metadata schemas, identifiers, and provenance that support findability and reuse. The FAIR principles are a useful baseline.
- Define cleaning and preprocessing: Show how you will structure, normalize, and label data. Specify file formats (e.g., HDF5, NetCDF), ontologies, and versioning.
- Data governance and security: Classify data; address export controls, controlled unclassified information, and privacy. Propose access tiers and audit mechanisms.
- Licensing and IP: Recommend licenses for datasets and models that enable research use while respecting contributor rights.
- Model development and evaluation: Outline training plans, baselines, domain-specific metrics, and reporting (e.g., model cards, documentation for reproducibility).
- Compute and storage: Estimate training and inference needs; discuss cloud-HPC integration, data locality, and cost controls.
- Interoperability: Plan APIs and formats that support multi-lab, multi-discipline use without rigid coupling.
- Sustainability: Propose lifecycle plans for dataset refresh, model updates, and deprecation policies.
- Consortium operations: Suggest governance, partner onboarding, publication policies, and incentives for data contribution.
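As a concrete illustration of the data-mapping and FAIR items above, the sketch below emits a sidecar metadata record for a dataset file, carrying an identifier, creator, license, and a checksum for provenance. This is a minimal, hypothetical example: the field names, the `build_metadata` helper, and the CC-BY license choice are assumptions for illustration, not a DOE schema.

```python
import hashlib
import json
from pathlib import Path

def build_metadata(data_path: Path, identifier: str, creator: str) -> dict:
    """Build a FAIR-style sidecar record for one data file (illustrative fields)."""
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    return {
        "identifier": identifier,        # e.g. a DOI or lab accession ID
        "creator": creator,
        "license": "CC-BY-4.0",          # assumed research-friendly license
        "format": data_path.suffix.lstrip("."),
        "sha256": digest,                # integrity / provenance anchor
        "schema_version": "1.0.0",       # version the metadata layout itself
    }

# Synthetic measurement file standing in for a real lab dataset.
data = Path("run_0042.csv")
data.write_text("t_s,temp_K\n0,295.1\n1,295.3\n")

record = build_metadata(data, "doi:10.9999/example.0042", "Example Lab")
Path("run_0042.meta.json").write_text(json.dumps(record, indent=2))
```

Keeping the checksum and schema version alongside each file is one simple way to make "structured, cleaned, and preprocessed" auditable: downstream consumers can verify integrity and know which metadata layout to parse.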
Access and infrastructure
DOE intends to provide models through cloud programs to speed up experimentation and collaboration. Expect tight coordination with national lab HPC systems, data center expansions, and energy planning to support training and large-scale inference.
Standards and risk management
For research-grade deployments, tie your approach to recognized frameworks for documentation, safety, and assurance. NIST's AI Risk Management Framework is a solid reference point for methodical risk treatment across the AI lifecycle.
What to do next
- Assemble a cross-functional team (PI, data engineer, security lead, and program manager) to draft responses.
- Prioritize one or two high-impact use cases with clear datasets and measurable outcomes.
- Prepare short, testable pilots you can scale within the consortium.
- Monitor DOE channels for submission deadlines and technical annexes tied to the RFI.