AI that separates shared vs. modality-specific cell data - so you can plan smarter experiments
Single-cell assays capture slices of the same cell state. Measure RNA and you see growth and transcriptional programs. Measure proteins or chromatin and you see signaling and regulation. The snag: integrating these modalities often blends signals, making it hard to tell which feature came from where.
A team from the Broad Institute of MIT and Harvard, MIT, and ETH Zurich/Paul Scherrer Institute built an AI framework that learns what information is shared across modalities and what is unique to each modality. The result is a clearer map of cell state that ties signals back to their source, helping researchers interrogate mechanisms and plan the right measurements.
Why current multimodal pipelines stall
Cells are multilayered systems. Proteins, RNA, chromatin, and morphology report on different aspects of the same biology. Traditional autoencoders compress each modality on its own, then mash the results together. You gain speed, but you lose attribution: which readout actually carries a biomarker or pathway signal?
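To make the attribution problem concrete, here is a minimal sketch of the conventional pipeline the paragraph describes, using PCA as a stand-in for a per-modality autoencoder and made-up dimensions (the data, sizes, and variable names are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 100 cells measured with two modalities.
rna = rng.normal(size=(100, 2000))    # e.g. log-normalized gene expression
atac = rng.normal(size=(100, 5000))   # e.g. chromatin accessibility peaks

def pca_embed(x, k):
    """Compress one modality on its own (a stand-in for a per-modality autoencoder)."""
    x = x - x.mean(axis=0)
    # Truncated SVD: keep the top-k components of this modality alone.
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    return u[:, :k] * s[:k]

# The conventional pipeline: embed each modality separately, then concatenate.
z_rna = pca_embed(rna, k=10)
z_atac = pca_embed(atac, k=10)
z_joint = np.hstack([z_rna, z_atac])  # 20-dim "integrated" representation

# The snag: every dimension of z_joint mixes shared biology with
# modality-specific signal, so nothing here tells you which readout
# a downstream biomarker actually came from.
print(z_joint.shape)  # (100, 20)
```

The concatenation is fast and easy, but the resulting axes have no built-in notion of "shared" versus "unique", which is exactly the gap the new framework targets.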
As one researcher put it, we only have one underlying cell state, yet many ways to measure it. Without separating shared from modality-specific signals, downstream conclusions blur. That slows decisions about what to assay next and how to track disease progression.
What's different about this framework
The model builds a shared latent space for overlapping biological signals and modality-specific spaces for features found in only one readout - think of it like a Venn diagram for cellular data. A two-step training routine helps the model decide what belongs in the shared bucket versus the modality-specific buckets, even for complex datasets.
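The Venn-diagram idea can be sketched structurally: each modality gets an encoder into a shared space and an encoder into its own private space, and each decoder sees only the shared code plus its own private code. This is a simplified forward-pass illustration with random weights and invented dimensions, not the paper's actual architecture or training procedure:

```python
import numpy as np

rng = np.random.default_rng(1)

def linear(in_dim, out_dim):
    """Random linear map standing in for a trained encoder/decoder layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim))

# Hypothetical dimensions (not from the paper): two modalities, one shared
# latent space, and one private latent space per modality.
D_RNA, D_ATAC = 2000, 5000
K_SHARED, K_PRIVATE = 8, 4

enc_rna_shared = linear(D_RNA, K_SHARED)
enc_rna_private = linear(D_RNA, K_PRIVATE)
enc_atac_shared = linear(D_ATAC, K_SHARED)
enc_atac_private = linear(D_ATAC, K_PRIVATE)

# Each decoder reconstructs its modality from [shared, own-private] only.
dec_rna = linear(K_SHARED + K_PRIVATE, D_RNA)
dec_atac = linear(K_SHARED + K_PRIVATE, D_ATAC)

def forward(rna, atac):
    # Shared code is pooled across modalities, so it can only carry
    # information both readouts agree on.
    z_shared = 0.5 * (rna @ enc_rna_shared + atac @ enc_atac_shared)
    z_rna_priv = rna @ enc_rna_private      # RNA-only signal
    z_atac_priv = atac @ enc_atac_private   # chromatin-only signal
    rna_hat = np.hstack([z_shared, z_rna_priv]) @ dec_rna
    atac_hat = np.hstack([z_shared, z_atac_priv]) @ dec_atac
    return z_shared, z_rna_priv, z_atac_priv, rna_hat, atac_hat

cells_rna = rng.normal(size=(16, D_RNA))
cells_atac = rng.normal(size=(16, D_ATAC))
z_shared, z_rp, z_ap, rna_hat, atac_hat = forward(cells_rna, cells_atac)
print(z_shared.shape, z_rp.shape, z_ap.shape)  # (16, 8) (16, 4) (16, 4)
```

The key design choice is the restricted decoder inputs: a feature that only one modality can reconstruct has nowhere to live except that modality's private space, which is what makes the attribution readable afterward.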
In practice, you feed in multimodal cell data, and the model returns which components are common across modalities and which are unique. That attribution holds on new, unseen cells.
Evidence it works
On synthetic datasets, the framework recovered the known partition between shared and modality-specific factors. On real single-cell data, it separated joint gene activity captured by transcriptomics and chromatin accessibility, while correctly flagging signals present in only one of those assays.
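The logic of the synthetic test can be mimicked in a few lines: simulate one factor that drives both modalities and one private factor each, then check that only the shared factor correlates across modalities. This is a toy mock-up of the evaluation idea, not the paper's actual simulation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

# Hypothetical generative process: one shared factor, one private factor
# per modality, plus small measurement noise.
shared = rng.normal(size=n)
priv_a = rng.normal(size=n)
priv_b = rng.normal(size=n)

mod_a = np.column_stack([shared + 0.1 * rng.normal(size=n),
                         priv_a + 0.1 * rng.normal(size=n)])
mod_b = np.column_stack([shared + 0.1 * rng.normal(size=n),
                         priv_b + 0.1 * rng.normal(size=n)])

# A model that recovers the right partition should find one latent that
# correlates across modalities (shared) and latents that do not (private).
r_shared = np.corrcoef(mod_a[:, 0], mod_b[:, 0])[0, 1]
r_private = np.corrcoef(mod_a[:, 1], mod_b[:, 1])[0, 1]
print(f"shared-factor correlation:  {r_shared:.2f}")   # near 1.0
print(f"private-factor correlation: {r_private:.2f}")  # near 0.0
```

Because the ground-truth partition is known by construction, a simulation like this gives a pass/fail signal that real single-cell data cannot.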
It also pinpointed which modality best captures a DNA damage protein marker in cancer samples - exactly the kind of guidance a clinical team needs to select the right assay.
Practical gains for your lab
- Run fewer assays by deciding what to measure and what to predict from other modalities.
- Trace biomarkers to the modality that carries them, improving assay selection for trials and longitudinal studies.
- Compare modalities to study how cellular components regulate each other, not just aggregate them.
- Track disease courses (e.g., cancer, neurodegeneration, metabolic disorders) with clearer mechanistic signals.
How to integrate this into your workflow
- Start with 2-3 modalities you already collect (e.g., RNA + ATAC; RNA + protein; morphology + protein).
- Standardize preprocessing and QC; misaligned pipelines will masquerade as "modality-specific" signals.
- Hold out cell types or conditions to test generalization of shared vs. unique features.
- Use the shared space for cross-modality tasks (imputation, denoising, batch assessment).
- Probe modality-specific spaces to prioritize assays, choose markers, and design perturbation follow-ups.
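The "measure one, predict the other" and hold-out steps above can be sketched together: if two modalities really share a low-dimensional signal, a map fit on training cells should impute one modality from the other on held-out cells. This uses simulated data and plain ridge regression as a stand-in for the framework's shared space; all names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical setup: RNA and protein both driven by k shared factors.
n_train, n_test, d_rna, d_prot, k = 200, 50, 100, 40, 5

z = rng.normal(size=(n_train + n_test, k))
w_rna = rng.normal(size=(k, d_rna))
w_prot = rng.normal(size=(k, d_prot))
rna = z @ w_rna + 0.1 * rng.normal(size=(n_train + n_test, d_rna))
prot = z @ w_prot + 0.1 * rng.normal(size=(n_train + n_test, d_prot))

# Fit on training cells only: ridge-regularized least squares RNA -> protein.
X, Y = rna[:n_train], prot[:n_train]
lam = 1.0
beta = np.linalg.solve(X.T @ X + lam * np.eye(d_rna), X.T @ Y)

# Evaluate imputation on the held-out cells.
pred = rna[n_train:] @ beta
resid = prot[n_train:] - pred
r2 = 1 - resid.var() / prot[n_train:].var()
# High held-out R^2 means the signal is genuinely shared, so the second
# assay could be predicted rather than run.
print(f"held-out imputation R^2: {r2:.2f}")
```

A low held-out score in a check like this is equally informative: it flags signal that lives only in the modality you were hoping to skip.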
What's next
The team is pushing for more interpretable outputs and wider clinical applications. The core idea stays the same: don't just integrate everything; compare modalities to see how cellular layers interact, and act on that map.
Learn more
See the journal hosting this work: Nature Computational Science. Explore the institute ecosystem behind the research: Broad Institute of MIT and Harvard.
For training and implementation paths, start here: AI Learning Path for Research Scientists and, for molecular and cellular focus, AI Learning Path for Biochemists.