AI Meets Free Energy: Inverse Folding Turns Structure into Zero-Shot Protein Stability for Drug Design and Disease Insight

Inverse folding models track protein stability by linking sequence likelihoods to free energy. With a few principled tweaks, zero-shot scores can better flag risky mutations.

Categorized in: AI News, Science and Research
Published on: Jan 16, 2026

Drug Design & Disease Insight: Behind the Science of Protein Stability with AI

Inverse folding models weren't built to predict protein stability. They're trained to recover amino acid sequences from 3D structures. Yet in zero-shot tests, they consistently track experimental stability with correlations around 0.6-0.7. The obvious question: why does this work?

At SophIA Summit 2025, Jes Frellsen laid out a clean explanation that links model likelihoods to physical free-energy differences. The short version: under reasonable assumptions, the probabilities these models assign to sequences reflect the same forces that drive folding. With a few principled corrections, the estimates get even better.

Zero-shot learning means the model makes useful predictions without being trained on explicit stability labels. It uses what it learned from structure-sequence patterns and generalizes to mutation effects.

Why inverse folding models predict stability

Protein stability is governed by the Gibbs free-energy difference (ΔG) between the folded and unfolded states. The lower (more negative) that difference, the more strongly the folded state is favored. You can think of the distribution over structures as weighted by a Boltzmann factor, which ties probabilities directly to energy. See Gibbs free energy and the Boltzmann distribution for the physics.
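
For readers who want the bookkeeping spelled out, the standard relations look like this (textbook statistical mechanics, not anything specific to Frellsen's derivation), with k_B T the thermal energy and Z the partition function:

    p(x) = \frac{e^{-E(x)/k_B T}}{Z}, \qquad
    \Delta G = G_\text{folded} - G_\text{unfolded}
             = -k_B T \, \ln \frac{p(\text{folded})}{p(\text{unfolded})}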

Inverse folding models learn "amino acid preferences" conditioned on a structure. If you accept that observed structures are samples related to these energy-weighted ensembles, then model likelihoods and free energy are connected. That link explains why a simple likelihood comparison between mutant and wild type often tracks experimental ΔΔG.
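
Written out, the connection is roughly the following (a sketch of the usual argument, treating the model's sequence likelihood conditioned on the structure as a stand-in for the Boltzmann weight):

    \Delta\Delta G \approx -k_B T \left[ \ln p(\text{mutant} \mid \text{structure})
                                       - \ln p(\text{wild type} \mid \text{structure}) \right]

Under this convention, a mutant that looks less plausible for the structure gets a negative log-likelihood ratio, which maps to a positive, destabilizing ΔΔG.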

From heuristic to principled scoring

The field has leaned on a quick heuristic: compare the log-likelihood of the mutant to the wild type for the same structure. It works, but it's an approximation. Frellsen's team showed how to interpret that ratio through probability theory and statistical physics, then adjust it with assumptions that better reflect the thermodynamics.

  • Heuristic: log P(mutant sequence | structure) - log P(wild-type sequence | structure)
  • Issue: it ignores terms that matter for free-energy differences and can be biased by modeling choices.
  • Fix: apply principled corrections derived from the energy-probability relationship to better approximate ΔΔG.

The result: measurable gains over the baseline heuristic, without training on stability labels. In practice, that means stronger zero-shot performance for mutation scanning.
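
As a concrete sketch of that heuristic baseline (the corrected scores build on the same quantities), the snippet below computes the log-likelihood ratio for point mutations from per-position log-probabilities. The array is a random placeholder standing in for the output of an actual inverse folding model such as ProteinMPNN or ESM-IF, and the function names are illustrative rather than any library's API.

    import numpy as np

    # Placeholder for what an inverse folding model would emit: per-position
    # log-probabilities over the 20 amino acids, conditioned on a fixed backbone.
    # Shape: (sequence_length, 20). Random values here, purely for illustration.
    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    rng = np.random.default_rng(0)
    logits = rng.normal(size=(120, 20))
    log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)

    def heuristic_score(log_probs, position, wt_aa, mut_aa):
        """Log-likelihood ratio of mutant vs. wild type at one position.

        Under the energy-probability link discussed above, this ratio
        (up to a factor of -kT and the principled corrections) serves as
        a zero-shot proxy for the stability change of the mutation.
        """
        return (log_probs[position, AMINO_ACIDS.index(mut_aa)]
                - log_probs[position, AMINO_ACIDS.index(wt_aa)])

    # Scan all substitutions at one position; the lowest ratios are the ones
    # this proxy flags as most destabilizing.
    wt_aa = "L"  # assumed wild-type residue at this toy position
    scores = {aa: heuristic_score(log_probs, 42, wt_aa, aa)
              for aa in AMINO_ACIDS if aa != wt_aa}
    print(sorted(scores.items(), key=lambda kv: kv[1])[:3])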

Why this matters for disease research

Destabilizing mutations can break protein function and drive disease. If you can flag those variants early, you can prioritize experiments, refine hypotheses, and shorten feedback loops in functional genomics. "This is a physical quantity," as Frellsen put it. "Protein stability is something you can measure and quantify." Mapping model output to ΔG builds trust because the prediction has a clear physical interpretation.

Clinically, that credibility is key. You need to know what a score means before you use it in a variant triage pipeline or feed it into downstream models for risk assessment.

Missing data, uncertainty, and safe medical AI

Real clinical data are incomplete on purpose. Tests are ordered selectively; the absence of a measurement often encodes a decision signal. That's data missing not at random (MNAR), and it can bias models if treated as random dropout.

Frellsen's group develops generative approaches that model the missingness mechanism itself, making the assumptions explicit. At the same time, they emphasize interpretability and uncertainty. Doctors want to see why a recommendation was made and how confident the system is. "You should never be a thousand percent sure," he noted. Communicating margins of error is part of safe deployment.
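
A toy simulation makes the MNAR point concrete (an illustration of the bias, not the generative models Frellsen's group builds): if a lab value is measured mainly when clinicians already suspect it is high, a complete-case average overshoots the truth.

    import numpy as np

    rng = np.random.default_rng(1)
    true_values = rng.normal(loc=5.0, scale=2.0, size=100_000)

    # Missing NOT at random: the chance of a value being measured rises
    # with the value itself, mimicking selectively ordered tests.
    p_observed = 1.0 / (1.0 + np.exp(-(true_values - 5.0)))
    observed = rng.random(true_values.shape) < p_observed

    print(f"true mean          : {true_values.mean():.2f}")
    print(f"complete-case mean : {true_values[observed].mean():.2f}  (biased upward)")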

From stability to binding: implications for drug design

Inverse folding sits inside many protein design pipelines. The next step is extending these physics-aware ideas to interactions: protein-protein, protein-RNA, and protein-small molecule. If model likelihoods relate to energy, they can also inform binding affinity and complex stability.

That unlocks fast, label-free screening across vast design spaces. But scale without theory is risky. The value here is an explanation layer that ties predictions to measurable quantities, so teams can decide when to trust a score and when to run the assay.

Practical takeaways for researchers

  • Use inverse folding likelihood ratios as a starting point for ΔΔG, but prefer corrected, physics-consistent scores when available.
  • Report uncertainty. Confidence intervals and calibration plots beat single-number claims (a minimal example follows this list).
  • Treat missingness as signal, especially in clinical datasets. Model MNAR mechanisms or run sensitivity analyses.
  • For design tasks, align model outputs with biophysical endpoints you can measure later (stability, binding, expression).
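
On the uncertainty point, one minimal pattern is a bootstrap confidence interval on the rank correlation between predicted scores and measured ΔΔG values. The arrays below are synthetic placeholders for a real benchmark.

    import numpy as np
    from scipy.stats import spearmanr

    rng = np.random.default_rng(2)
    measured_ddg = rng.normal(size=300)                               # stand-in for assay values
    predicted = 0.7 * measured_ddg + rng.normal(scale=0.7, size=300)  # stand-in for model scores

    rho, _ = spearmanr(predicted, measured_ddg)

    # Nonparametric bootstrap over variants for a 95% interval on Spearman's rho.
    boot = []
    for _ in range(2000):
        idx = rng.integers(0, len(measured_ddg), size=len(measured_ddg))
        b_rho, _ = spearmanr(predicted[idx], measured_ddg[idx])
        boot.append(b_rho)
    low, high = np.percentile(boot, [2.5, 97.5])

    print(f"Spearman rho = {rho:.2f} (95% bootstrap CI {low:.2f} to {high:.2f})")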

Where this is headed

Expect tighter bridges between generative models and thermodynamics, better zero-shot tools for variant effect prediction, and extensions from single chains to complexes. The goal isn't just higher correlation; it's predictions that map to lab reality and hold up under distribution shift.

If you're upskilling for AI-heavy research programs, explore curated learning paths by job role at Complete AI Training.

