Columbia University team releases open-source framework to standardize health data for AI research

Columbia University researchers released MEDS, an open-source framework that standardizes how hospital EHR data is formatted for AI research. It has already been adopted by 21 institutions across 12 countries.

Published on: May 30, 2026

Columbia University researchers have developed an open-source framework called MEDS to address a persistent problem in clinical AI: electronic health records stored in incompatible formats across hospitals and institutions. The framework standardizes how longitudinal clinical data is represented for machine learning, allowing researchers to train models on data from multiple sites without sharing sensitive patient information.

The study was published in NEJM AI.

Why standardization matters

Hospital EHR systems use institution-specific formats that require extensive preprocessing before researchers can use them for AI development. This creates redundant work, blocks collaboration between institutions, and makes it difficult to reproduce findings across studies.

Matthew McDermott, assistant professor of biomedical informatics at Columbia University and study leader, said MEDS solves this by creating a common language. "MEDS is a simple way to make all different sources of electronic health record data look the same to your code, regardless of what hospital or clinic or EHR software system the data came from."

The framework includes open-source tools for data transformation, preprocessing, benchmarking, and model development. Because it's open source, researchers at academic institutions, health systems, and companies can contribute extensions and improvements.
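To make the idea of data that "look the same to your code" concrete, here is a minimal sketch in Python. It is not the MEDS implementation: the `Event` fields, the two site export layouts, and the helper names are all hypothetical, chosen only to show how per-site adapters can feed one shared representation. The MEDS documentation defines the actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One clinical observation in a shared, site-independent shape (illustrative only)."""
    subject_id: int
    time: datetime
    code: str
    numeric_value: Optional[float] = None


def from_site_a(row: dict) -> Event:
    # Hypothetical site A export: string fields, ISO 8601 timestamps.
    raw_value = row.get("val")
    return Event(
        subject_id=int(row["pid"]),
        time=datetime.fromisoformat(row["ts"]),
        code=f"SITE_A//{row['test']}",
        numeric_value=float(raw_value) if raw_value not in (None, "") else None,
    )


def from_site_b(row: dict) -> Event:
    # Hypothetical site B export: typed fields, day/month/year timestamps.
    return Event(
        subject_id=row["patient"],
        time=datetime.strptime(row["date"], "%d/%m/%Y %H:%M"),
        code=f"SITE_B//{row['item']}",
        numeric_value=row.get("result"),
    )


def mean_value(events: list[Event], code: str) -> Optional[float]:
    """Downstream code reads only Event objects, never the raw site formats."""
    values = [e.numeric_value for e in events
              if e.code == code and e.numeric_value is not None]
    return sum(values) / len(values) if values else None
```

In this sketch the codes keep a site prefix instead of being mapped to one vocabulary, which mirrors the article's point that shared code does not require fully harmonized clinical vocabularies.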

What MEDS enables

Researchers can now share code that works across multiple sites without needing to fully harmonize clinical vocabularies or transfer sensitive patient data. This shifts time away from rebuilding data pipelines and toward answering clinical questions.
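As a rough illustration of code that travels while data stays put, the Python sketch below runs one shared function on each site's locally standardized events and collects only aggregate summaries. The `Event` shape, the example codes, and the aggregate-reporting pattern are assumptions made for this example; the article does not describe a specific workflow.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical common event shape; field names are illustrative."""
    subject_id: int
    code: str
    numeric_value: Optional[float]


def summarize(events: Iterable[Event]) -> dict:
    """Shared analysis code: identical at every site, reads only the common format."""
    events = list(events)
    return {
        "n_subjects": len({e.subject_id for e in events}),
        "events_per_code": dict(Counter(e.code for e in events)),
    }


# Each site runs `summarize` locally on its own standardized data (made-up rows).
site_a_events = [
    Event(1, "LAB//glucose", 5.4),
    Event(1, "LAB//glucose", 6.1),
    Event(2, "DX//diabetes", None),
]
site_b_events = [Event(10, "LAB//glucose", 7.0)]

# Only these aggregate summaries, not patient-level rows, would leave each site.
reports = {
    "site_a": summarize(site_a_events),
    "site_b": summarize(site_b_events),
}
```

Because `summarize` depends only on the common event shape, the same file could be sent to any participating site without modification.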

The framework supports multiple use cases: predictive modeling, representation learning, multimodal modeling, and large-scale benchmarking studies.

McDermott noted that the framework's success reflects a broader pattern in AI development. "The big successes in AI have always been driven by the community coming together and being able to collaborate, often in a decentralized, open-source manner, on tools, model parts, and ultimately ecosystems that let us build larger models that scale to massive datasets."

Adoption and next steps

MEDS has already been adopted across 21 institutions spanning 12 countries. The researchers emphasize that standardization supports reproducibility and transparency as machine learning models move toward clinical deployment.

The framework complements rather than replaces existing clinical data standards, such as the HL7 family of standards, which includes FHIR. It was designed specifically for machine learning workflows, not for clinical operations.


