Columbia team releases open-source framework to standardize health AI research
Columbia University researchers have developed an open-source framework called MEDS to address a persistent problem in clinical AI: electronic health records stored in incompatible formats across hospitals and institutions. The framework standardizes how longitudinal clinical data is represented for machine learning, allowing researchers to train models on data from multiple sites without sharing sensitive patient information.
The study was published in NEJM AI.
Why standardization matters
Hospital EHR systems use institution-specific formats that require extensive preprocessing before researchers can use them for AI development. This creates redundant work, blocks collaboration between institutions, and makes it difficult to reproduce findings across studies.
Matthew McDermott, assistant professor of biomedical informatics at Columbia University and study leader, said MEDS solves this by creating a common language. "MEDS is a simple way to make all different sources of electronic health record data look the same to your code, regardless of what hospital or clinic or EHR software system the data came from."
The framework includes open-source tools for data transformation, preprocessing, benchmarking, and model development. Because it's open source, researchers at academic institutions, health systems, and companies can contribute extensions and improvements.
What MEDS enables
Researchers can now share code that works across multiple sites without needing to fully harmonize clinical vocabularies or transfer sensitive patient data. This shifts time away from rebuilding data pipelines and toward answering clinical questions.
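To make the idea concrete, here is a minimal Python sketch of what "the same code at every site" could look like. It is an illustration, not the MEDS tooling itself. The field names (subject_id, time, code, numeric_value) reflect our understanding of the core MEDS event schema and should be checked against the project's documentation. The two site export formats, the converter functions, and the "LAB//creatinine" code string are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Event:
    """One clinical observation in a shared, site-independent shape (assumed schema)."""
    subject_id: int
    time: datetime
    code: str
    numeric_value: Optional[float] = None


def from_site_a(rows):
    # Hypothetical site A export: dicts with string fields and ISO timestamps.
    return [
        Event(int(r["pid"]), datetime.fromisoformat(r["ts"]),
              f"LAB//{r['test']}", float(r["val"]))
        for r in rows
    ]


def from_site_b(rows):
    # Hypothetical site B export: tuples with day/month/year timestamps.
    return [
        Event(pid, datetime.strptime(ts, "%d/%m/%Y %H:%M"),
              f"LAB//{concept}", float(result))
        for pid, ts, concept, result in rows
    ]


def latest_value(events, code):
    """Shared analysis code: most recent numeric value per subject for one code."""
    latest = {}
    for e in sorted(events, key=lambda e: e.time):
        if e.code == code and e.numeric_value is not None:
            latest[e.subject_id] = e.numeric_value
    return latest


site_a_rows = [
    {"pid": "1", "ts": "2024-01-05T08:00:00", "test": "creatinine", "val": "1.1"},
    {"pid": "1", "ts": "2024-01-06T08:00:00", "test": "creatinine", "val": "1.4"},
]
site_b_rows = [(7, "05/01/2024 09:30", "creatinine", 0.9)]

# Each site converts locally; only the analysis function is shared.
events_a = from_site_a(site_a_rows)
events_b = from_site_b(site_b_rows)
print(latest_value(events_a, "LAB//creatinine"))
print(latest_value(events_b, "LAB//creatinine"))
```

In this sketch each site runs its own converter on its own data, and only latest_value travels between institutions. For simplicity both sites are mapped to the same code string; in practice vocabularies may differ between sites, which is why the article says full harmonization is not required.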
The framework supports multiple use cases: predictive modeling, representation learning, multimodal modeling, and large-scale benchmarking studies.
McDermott noted that the framework's success reflects a broader pattern in AI development. "The big successes in AI have always been driven by the community coming together and being able to collaborate, often in a decentralized, open-source manner, on tools, model parts, and ultimately ecosystems that let us build larger models that scale to massive datasets."
Adoption and next steps
MEDS has already been adopted by 21 institutions in 12 countries. The researchers emphasize that standardization supports reproducibility and transparency as machine learning models move toward clinical deployment.
The framework complements rather than replaces existing clinical data standards such as those from HL7, including FHIR. It was designed specifically for machine learning workflows, not for clinical operations.
For researchers developing health AI models, understanding data standardization is foundational.