Columbia-Led Team Develops Open-Source Framework to Accelerate Health AI Research

Healthcare AI has lagged behind other areas of artificial intelligence for one major reason: hospitals store electronic health records in vastly different ways. As a result, researchers have been unable to fully harness the potential of medical data, because they cannot easily reproduce results, compare findings, or validate new AI tools across different sites.

DBMI Assistant Professor Matthew McDermott is leading an international team, including researchers from across the USA and seven nations in Europe and Asia, that has developed an open-source framework designed to help researchers collaborate seamlessly across institutions. This framework, called the Medical Event Data Standard (MEDS), helps researchers apply AI training methods across very different hospital data systems while allowing institutions to keep sensitive patient data local. Already, MEDS has been adopted across 21 institutions spanning 12 countries.

DBMI Assistant Professor Matthew McDermott leads an international research team that developed the Medical Event Data Standard (MEDS).

Detailed in a study published May 28 in NEJM AI, MEDS has the potential to remove a bottleneck that has long restricted collaborative healthcare research and help researchers more reliably reproduce and validate findings across institutions without risking the privacy of patients.

“MEDS is a simple way to make all different sources of electronic health record (EHR) data look the same to your code, regardless of what hospital or clinic or EHR software system the data came from,” McDermott said. “MEDS lets us share code that we can use to train models on many different sites of care without needing to share that data — and often without needing to even do the more challenging step of fully ‘harmonizing’ the data into a consistent clinical vocabulary.”

The Current AI Obstacle

Powerful AI models require massive amounts of data. In an ideal world, healthcare researchers could easily pool the billions of data points detailed throughout electronic health records around the world, giving them more than enough information to study the medical field’s biggest questions.

In practice, healthcare research has been far from ideal. The rigid, fragmented way data is stored is one major challenge, but privacy risks and institutional differences create an even larger barrier.

“Sharing patient data would be great, but it opens a lot more risk that the data could be accidentally leaked to parties who shouldn’t have access to it,” McDermott said. “In addition, different hospitals or healthcare clinics are very different — a cancer center is going to look at patient data very differently than a pediatric primary care center, so it doesn’t necessarily make sense to put all that data together before you train your models.”

Instead of forcing hospitals to pool raw data, MEDS prioritizes the transportability of the model-training algorithms themselves. It achieves this by organizing messy medical records into a universal format: a simple, chronological timeline.

Rather than trying to force different hospitals to agree on a complex, universal medical vocabulary — a process known as data harmonization — MEDS simply focuses on three basic questions to map a chronological sequence of health events: Who is the patient? When did an event happen? What medical event or observation was recorded?
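To make the "who, when, what" idea concrete, here is a minimal sketch in Python. It is not taken from the MEDS specification or its tooling: the field names (`subject_id`, `time`, `code`, `numeric_value`), the two source layouts, and the helper functions are all assumptions chosen for illustration. It shows two differently structured exports being mapped into one chronological list of events without translating their codes into a shared vocabulary.

```python
from datetime import datetime
from typing import NamedTuple, Optional


class Event(NamedTuple):
    """One row of a patient timeline (illustrative field names)."""
    subject_id: int                        # who is the patient?
    time: datetime                         # when did the event happen?
    code: str                              # what was recorded (site-local code is fine)
    numeric_value: Optional[float] = None  # optional measurement, if any


# Hypothetical export from source A: one dict per lab result.
source_a_labs = [
    {"mrn": 17, "drawn_at": "2024-03-02T08:15:00", "test": "LAB//creatinine", "result": 1.4},
    {"mrn": 17, "drawn_at": "2024-03-01T21:40:00", "test": "LAB//lactate", "result": 2.1},
]

# Hypothetical export from source B: (patient, date, diagnosis code) tuples.
source_b_diagnoses = [
    (42, "2024-01-15", "ICD10//N17.9"),
]


def from_source_a(row: dict) -> Event:
    """Map a source-A lab row onto the shared event shape."""
    return Event(row["mrn"], datetime.fromisoformat(row["drawn_at"]), row["test"], row["result"])


def from_source_b(row: tuple) -> Event:
    """Map a source-B diagnosis tuple onto the shared event shape."""
    patient, day, code = row
    return Event(patient, datetime.fromisoformat(day), code)


def build_timeline(events):
    """Order events by patient, then chronologically within each patient."""
    return sorted(events, key=lambda e: (e.subject_id, e.time))


timeline = build_timeline(
    [from_source_a(r) for r in source_a_labs]
    + [from_source_b(r) for r in source_b_diagnoses]
)
for event in timeline:
    print(event.subject_id, event.time.isoformat(), event.code, event.numeric_value)
```

Only the small per-source mapping functions differ between sites in this sketch; anything written against the `Event` shape could, in principle, run unchanged wherever the data lives.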

It is important to note that MEDS is not designed to replace widely used healthcare data standards such as OHDSI’s OMOP model, PCORnet, or i2b2. Those systems are highly sophisticated and are optimized for tasks such as querying clinical databases, identifying patient cohorts, and supporting large-scale observational research. Instead, MEDS acts as a bridge that allows researchers to work with data from those systems or from raw datasets for large-scale AI training and evaluation applications.

The Power of Collaboration

MEDS is not a standalone AI model; it is a foundational blueprint that enables the global medical research community to build and test AI tools more efficiently across institutions.

A visual overview of the design principles and schema of MEDS.

According to the NEJM AI study, the standard has already supported 27 academic papers and preprints, 17 datasets or dataset formats, 12 distinct AI model training algorithms, and a growing toolbox of at least 14 public developer tools.

Tools built within the MEDS ecosystem reported computational speed improvements ranging from 1.9 times faster to nearly 40,000 times faster than previous workflows or tools. Researchers found that projects using MEDS required 33% to 70% fewer lines of code, suggesting that the framework can dramatically simplify AI development workflows.

“The big successes in AI have always been driven by the community coming together and being able to collaborate, often in a decentralized, open-source manner, on tools, model parts, and ultimately ecosystems that let us build larger models that scale to massive datasets,” McDermott said. “These impressive results in MEDS are just reflecting the benefits you get when the community can share tools or abstract common parts of their pipelines out into a shared library and use them across everyone’s data.”

Empowering Research, Focusing on the Patient

Most people are already familiar with generative AI through tools like ChatGPT, which work by looking at a string of words and predicting what the next logical word in the sentence should be. By treating a patient’s medical history like a chronological sentence — where a doctor’s visit, a lab test, and a prescription are the words — MEDS allows a brand-new class of medical AI, known as “autoregressive foundation models,” to do the exact same thing.
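The "predict the next word" analogy can be illustrated with a deliberately tiny Python sketch. This is not how ETHOS or any other MEDS-based model works internally: those are neural networks trained on far larger datasets, whereas the toy below simply counts which event most often follows another. The event codes and timelines are invented for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical, already-ordered event codes for three deidentified patients.
timelines = [
    ["VISIT", "LAB//glucose", "RX//metformin", "VISIT"],
    ["VISIT", "LAB//glucose", "RX//metformin"],
    ["VISIT", "LAB//lipids", "RX//statin"],
]


def fit_bigram_counts(sequences):
    """Count how often each event code is immediately followed by each other code."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for current, following in zip(seq, seq[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, current):
    """Return the most frequently observed next code, or None if none was seen."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


counts = fit_bigram_counts(timelines)
print(predict_next(counts, "LAB//glucose"))  # RX//metformin
print(predict_next(counts, "VISIT"))         # LAB//glucose
```

A real autoregressive model conditions on the whole preceding history rather than only the latest event, but the training signal is the same in spirit: given the timeline so far, guess what comes next.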

These systems can analyze years of deidentified patient timelines to identify patterns in how diseases and medical events unfold over time, potentially helping researchers develop better predictive models. MEDS is increasingly being used as a framework to support the development of this new wave of frontier technologies, which includes health models like ETHOS, cehr-xgpt, and Curiosity.

While MEDS could significantly accelerate health AI research, it is important to note that it is built for research, not immediate hospital deployment. It will take time before these computational breakthroughs actively change how a routine doctor’s appointment feels. However, MEDS creates a new starting line for a much safer, faster, and more collaborative generation of medicine.

“For patients, these innovations will take a long time to translate into improvements in care,” McDermott said. “But we hope that with better AI capabilities on medical data, this will ultimately lead to better care for patients, especially for complex, longitudinal diseases where AI has a lot of potential to significantly improve patient outcomes and experience.”

More Information

The study, MEDS: An Emerging Data Standard and Ecosystem for Health AI Research, was published in NEJM AI on May 28, 2026.

Matthew McDermott is the lead author. The full list of authors includes Ethan Steinberg (Stanford University), Jason A Fries (Stanford University), Robin P. van de Water (University of Potsdam), Chao Pang (Columbia University), Patrick Rockenschaub (Medical University of Innsbruck), Pawel Renc (University of Krakow), Jungwoo Oh (KAIST), Kamilė Stankevičiūtė (University of Cambridge), Justin Xu (University of Oxford), Tom J. Pollard (MIT), Nassim Oufattole (MIT), Michael Wornow (Stanford University), Teya S. Bergamaschi (MIT), Hyewon Jeong (MIT), Simon A. Lee (UCLA), Vincent Jeanselme (Columbia University), Kiril V. Klein (University of Copenhagen), Mikkel Odgaard (University of Copenhagen), Maria E. Montgomery (University of Copenhagen), Arkadiusz Sitek (Harvard Medical School), Mads Nielsen (University of Copenhagen), Jeffrey N. Chiang (UCLA), Noa Dagan (Ben Gurion University), Isaac Kohane (Harvard Medical School), Shalmali Joshi (Columbia University), Edward Choi (KAIST), and Nigam H. Shah (Stanford University).