MIT Researchers Cut AI Training Costs by Compressing Models During Learning
Researchers at MIT's Computer Science and Artificial Intelligence Laboratory have developed a method to shrink AI models while they train, rather than after. The technique, called CompreSSM, eliminates unnecessary components early in the training process and cuts compute costs without sacrificing accuracy.
The work targets state-space models, a family of AI architectures used in language processing, audio generation, and robotics. By applying mathematical tools from control theory, the team identifies which internal components matter and which don't, then removes the dead weight before training completes.
How the Method Works
The key insight is that the relative importance of different model components stabilizes early in training. Using a mathematical quantity called Hankel singular values, which measure how much each internal state contributes to overall performance, the researchers can reliably rank which dimensions are essential after just 10 percent of training.
Once those rankings are established, less-important components are discarded. The remaining 90 percent of training proceeds at the speed of a much smaller model.
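The ranking step is grounded in classical control-theoretic model reduction. As an illustrative sketch only (not the authors' code, and using a toy discrete-time linear state-space system rather than a trained network), Hankel singular values can be computed from a system's controllability and observability Gramians; states with small values contribute little to input-output behavior and are candidates for removal:

```python
import numpy as np

def hankel_singular_values(A, B, C):
    """Hankel singular values of a stable discrete-time LTI system
    x[k+1] = A x[k] + B u[k],  y[k] = C x[k]."""
    n = A.shape[0]
    I = np.eye(n * n)
    # Controllability Gramian P solves the Stein equation P = A P A^T + B B^T,
    # rewritten as a linear system via the Kronecker product (row-major vec).
    P = np.linalg.solve(I - np.kron(A, A), (B @ B.T).ravel()).reshape(n, n)
    # Observability Gramian Q solves Q = A^T Q A + C^T C.
    Q = np.linalg.solve(I - np.kron(A.T, A.T), (C.T @ C).ravel()).reshape(n, n)
    # Hankel singular values are the square roots of the eigenvalues of P Q;
    # clip tiny negative values caused by floating-point noise.
    eigs = np.linalg.eigvals(P @ Q).real
    return np.sort(np.sqrt(np.clip(eigs, 0.0, None)))[::-1]

# Toy system: a stable random 8-state, 2-input, 2-output model.
rng = np.random.default_rng(0)
n = 8
A = 0.9 * np.diag(rng.uniform(0.1, 1.0, n))  # eigenvalues inside the unit circle
B = rng.standard_normal((n, 2))
C = rng.standard_normal((2, n))

hsv = hankel_singular_values(A, B, C)
# One possible truncation rule: keep states within 5% of the largest value.
keep = int(np.sum(hsv > 0.05 * hsv[0]))
```

The 5 percent threshold and the toy system are arbitrary choices for illustration; the paper's contribution is showing that such rankings computed partway through training remain valid for the rest of it.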
"During learning, the models are getting rid of parts that are not useful to their development," said Makram Chahine, a PhD student at CSAIL and lead author of the paper.
Performance Gains
On image classification benchmarks, compressed models maintained nearly the same accuracy as full-sized versions while training up to 1.5 times faster. A model reduced to roughly a quarter of its original size achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to 81.8 percent for a model trained at that smaller size from scratch.
On Mamba, a widely used state-space architecture, the method achieved approximately 4x training speedups, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.
Advantages Over Existing Methods
Conventional approaches either train a full model and then prune it, paying the full computational cost upfront, or use knowledge distillation, which requires training both a large "teacher" model and a smaller "student" model, essentially doubling the effort.
CompreSSM avoids both costs by making compression decisions midstream. When benchmarked against Hankel nuclear norm regularization, a recent spectral technique for compact models, CompreSSM was more than 40 times faster while achieving higher accuracy. Against knowledge distillation on CIFAR-10, CompreSSM-compressed models maintained near-full performance while distilled models saw significant accuracy drops.
The researchers proved mathematically that the importance of individual model states changes smoothly during training, giving practitioners confidence that dimensions identified as negligible early on won't become critical later.
Practical Limitations
CompreSSM works best on models where internal state dimension correlates strongly with overall performance, a property that varies across tasks. The method is particularly effective on multi-input, multi-output models, where the relationship between state size and expressivity is strongest. For single-input, single-output architectures, gains are more modest.
The theory applies most cleanly to linear time-invariant systems, though the team has developed extensions for input-dependent, time-varying architectures like Mamba.
What's Next
Chahine and his collaborators see this as a stepping stone. The team has already extended the work to linear time-varying systems and plans to push further into matrix-valued dynamical systems used in linear attention mechanisms, architectures closer to the transformers that power most large AI systems today.
"This is where the theory is neat and the approach can stay principled," Chahine said. "It's the stepping stone to then extend to other architectures that people are using in industry today."
The work was accepted as a conference paper at the International Conference on Learning Representations 2026 and will be presented later this month. It was supported in part by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the U.S. Office of Naval Research.
For those working in generative AI and LLM development or research, the approach offers a concrete path to reducing training costs for state-space models without first paying for a full-sized model.