MIT researchers compress AI models during training using control theory, cutting compute costs without performance loss

MIT researchers built a method that shrinks AI models during training, not after, cutting compute costs without losing accuracy. Called CompreSSM, it drops unnecessary components after just 10% of training, achieving up to 4x speedups.

Published on: Apr 10, 2026

MIT Researchers Cut AI Training Costs by Compressing Models During Learning

Researchers at MIT's Computer Science and Artificial Intelligence Laboratory have developed a method to shrink AI models while they train, rather than after. The technique, called CompreSSM, eliminates unnecessary components early in the training process and cuts compute costs without sacrificing accuracy.

The work targets state-space models, a family of AI architectures used in language processing, audio generation, and robotics. By applying mathematical tools from control theory, the team identifies which internal components matter and which don't, then removes the dead weight before training completes.

How the Method Works

The key insight is that the relative importance of different model components stabilizes early in training. Using a mathematical quantity called Hankel singular values, which measure how much each internal state contributes to overall performance, the researchers can reliably rank which dimensions are essential after just 10 percent of training.

Once those rankings are established, less-important components are discarded. The remaining 90 percent of training proceeds at the speed of a much smaller model.
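The two steps described above, ranking internal states by Hankel singular values and then discarding the low-ranked ones, can be sketched for a plain linear time-invariant system. This is an illustrative balanced-truncation sketch in NumPy/SciPy, not the paper's implementation; the function names and the toy system are our own.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def hankel_singular_values(A, B, C):
    """Hankel singular values of a stable continuous-time LTI system
    x' = A x + B u, y = C x. Each value measures how strongly one
    internal direction couples inputs to outputs."""
    # Controllability Gramian P solves: A P + P A^T + B B^T = 0
    P = solve_continuous_lyapunov(A, -B @ B.T)
    # Observability Gramian Q solves: A^T Q + Q A + C^T C = 0
    Q = solve_continuous_lyapunov(A.T, -C.T @ C)
    # HSVs are the square roots of the eigenvalues of P Q
    hsv = np.sqrt(np.abs(np.linalg.eigvals(P @ Q).real))
    return np.sort(hsv)[::-1]

def truncate_states(A, B, C, keep):
    """Balanced truncation: keep only the `keep` states with the
    largest Hankel singular values."""
    P = solve_continuous_lyapunov(A, -B @ B.T)
    Q = solve_continuous_lyapunov(A.T, -C.T @ C)
    # Balancing transformation from Cholesky factors and an SVD
    Lp, Lq = np.linalg.cholesky(P), np.linalg.cholesky(Q)
    U, s, Vt = np.linalg.svd(Lq.T @ Lp)          # s = the HSVs
    T = Lp @ Vt.T @ np.diag(s ** -0.5)           # balances the Gramians
    Tinv = np.diag(s ** -0.5) @ U.T @ Lq.T
    Ab, Bb, Cb = Tinv @ A @ T, Tinv @ B, C @ T
    return Ab[:keep, :keep], Bb[:keep, :], Cb[:, :keep]

# Toy system: one dominant internal state, one nearly irrelevant one.
A = np.diag([-1.0, -20.0])
B = np.array([[2.0], [0.1]])
C = np.array([[3.0, 0.1]])
print(hankel_singular_values(A, B, C))   # one large value, one tiny
```

In this toy example the second state barely connects inputs to outputs, so its Hankel singular value is orders of magnitude smaller and it can be dropped with almost no effect on the input-output behavior.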

"During learning, the models are getting rid of parts that are not useful to their development," said Makram Chahine, a PhD student at CSAIL and lead author of the paper.

Performance Gains

On image classification benchmarks, compressed models maintained nearly the same accuracy as full-sized versions while training up to 1.5 times faster. A model reduced to roughly a quarter of its original size achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to 81.8 percent for a model trained at that smaller size from scratch.

On Mamba, a widely used state-space architecture, the method achieved approximately 4x training speedups, compressing a 128-dimensional model down to around 12 dimensions while maintaining competitive performance.

Advantages Over Existing Methods

Conventional approaches either train a full model and then prune it, paying the full computational cost upfront, or use knowledge distillation, which requires training both a large "teacher" model and a smaller "student" model, essentially doubling the effort.
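A back-of-the-envelope cost model makes this trade-off concrete. Assume (hypothetically) that per-step training cost scales linearly with state dimension and that truncation happens at the 10 percent mark mentioned in the article; real cost scaling depends on the architecture, which is why the reported speedups (roughly 1.5x to 4x) differ from this idealized estimate.

```python
def training_speedup(switch_frac, rel_cost):
    """End-to-end speedup versus training the full model the whole way.

    switch_frac: fraction of training done at full size before truncation
    rel_cost:    per-step cost of the compressed model relative to full size
    """
    total = switch_frac * 1.0 + (1.0 - switch_frac) * rel_cost
    return 1.0 / total

# Train-then-prune pays the full cost: speedup = 1x by definition.
# Truncating at 10% of training to a model with ~1/4 the per-step cost:
print(round(training_speedup(0.10, 0.25), 2))      # prints 3.08
# A more aggressive reduction (e.g., 128 -> 12 dims, linear scaling):
print(round(training_speedup(0.10, 12 / 128), 2))  # prints 5.42
```

Knowledge distillation, by contrast, adds the student's training on top of the teacher's, so under this model its "speedup" is below 1x.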

CompreSSM avoids both costs by making compression decisions mid-stream. When benchmarked against Hankel nuclear norm regularization, a recent spectral technique for compact models, CompreSSM was more than 40 times faster while achieving higher accuracy. Against knowledge distillation on CIFAR-10, CompreSSM-compressed models maintained near-full performance while distilled models saw significant accuracy drops.

The researchers proved mathematically that the importance of individual model states changes smoothly during training, giving practitioners confidence that dimensions identified as negligible early on won't become critical later.

Practical Limitations

CompreSSM works best on models where internal state dimension correlates strongly with overall performance-a property that varies across tasks. The method is particularly effective on multi-input, multi-output models, where the relationship between state size and expressivity is strongest. For single-input, single-output architectures, gains are more modest.

The theory applies most cleanly to linear time-invariant systems, though the team has developed extensions for input-dependent, time-varying architectures like Mamba.
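For reference, a linear time-invariant state-space model is the textbook system

```latex
\begin{aligned}
\dot{x}(t) &= A\,x(t) + B\,u(t),\\
y(t) &= C\,x(t) + D\,u(t),
\end{aligned}
```

where the matrices A, B, C, D are fixed. Input-dependent, time-varying architectures such as Mamba let these matrices change with the input at each step, which is what the team's extensions address.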

What's Next

Chahine and his collaborators see this as a stepping stone. The team has already extended the work to linear time-varying systems and plans to push further into matrix-valued dynamical systems used in linear attention mechanisms, architectures closer to the transformers that power most large AI systems today.

"This is where the theory is neat and the approach can stay principled," Chahine said. "It's the stepping stone to then extend to other architectures that people are using in industry today."

The work was accepted as a conference paper at the International Conference on Learning Representations 2026 and will be presented later this month. It was supported in part by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the U.S. Office of Naval Research.

For those working in generative AI and LLM development or research, this approach offers a concrete path to cutting training costs for state-space models without first paying for a full-sized training run.

