New technique reveals how AI models identify promising protein drug and vaccine targets

MIT researchers developed a new method to reveal which protein features AI models use to predict drug and vaccine targets. This transparency helps improve model accuracy and biological insights.

Categorized in: AI News, Science and Research
Published on: Aug 19, 2025

A New Method Reveals How AI Predicts Protein Targets for Drugs and Vaccines

In recent years, protein language models, which are built on the same architecture as large language models (LLMs), have become essential tools for predicting protein structure and function. These predictions help identify potential drug targets and design therapeutic antibodies with high accuracy. Yet the internal decision-making process of these models has remained a mystery, limiting researchers' ability to trust or optimize them fully.

Researchers at MIT have now developed a technique that opens up this "black box," revealing which protein features AI models use when making predictions. This insight can help select and improve models for specific biological tasks, streamlining drug and vaccine target discovery.

From Protein Sequences to Model Interpretability

Protein language models treat amino acid sequences like sentences, analyzing patterns to predict how proteins behave. Models such as ESM2 and OmegaFold, along with those that inspired AlphaFold, use neural networks to compress protein information into dense internal representations. These representations, however, are difficult to interpret, because each neuron in the network encodes multiple overlapping features.

To tackle this, the MIT team applied a sparse autoencoder algorithm—a method that expands the neural representation of proteins from a few hundred nodes to tens of thousands. This expansion, combined with a sparsity constraint, forces the network to "spread out" information, allowing individual neurons to capture specific protein features more cleanly.
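The expand-then-sparsify idea can be sketched in a few lines. This is a minimal illustration, not the researchers' implementation: the layer sizes, random weights, and the top-k rule for enforcing sparsity are all assumptions chosen for clarity (top-k is one common way to implement a sparsity constraint in sparse autoencoders).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: a 480-dim dense protein embedding expanded
# to 16,384 sparse features, of which at most 32 may be active.
D_DENSE, D_SPARSE, TOP_K = 480, 16384, 32

# Randomly initialized weights stand in for trained SAE parameters.
W_enc = rng.normal(0, 0.02, (D_DENSE, D_SPARSE))
b_enc = np.zeros(D_SPARSE)
W_dec = rng.normal(0, 0.02, (D_SPARSE, D_DENSE))

def encode_sparse(x, k=TOP_K):
    """Expand a dense embedding into a wide, mostly-zero feature vector,
    keeping only the k strongest ReLU activations."""
    pre = np.maximum(x @ W_enc + b_enc, 0.0)      # ReLU activations
    cutoff = np.partition(pre, -k)[-k]            # k-th largest value
    return np.where(pre >= cutoff, pre, 0.0)      # zero out the rest

def decode(z):
    """Reconstruct the original dense embedding from the sparse code."""
    return z @ W_dec

x = rng.normal(size=D_DENSE)       # stand-in for a protein embedding
z = encode_sparse(x)
x_hat = decode(z)
print(np.count_nonzero(z))         # at most TOP_K features fire
```

In training, the decoder's reconstruction error pulls the network toward preserving all the information, while the sparsity constraint forces each piece of it into a small number of dedicated neurons, which is what makes individual features readable.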

With sparse representations, it becomes easier to identify which neurons correspond to meaningful biological properties, such as protein family, molecular function, or cellular location.
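One simple way to match neurons to biological properties is to scan for the feature whose activations correlate best with a known annotation across many proteins. The sketch below uses synthetic data and a hand-planted feature; the protein counts, annotation, and feature index are illustrative assumptions, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: sparse activations for 200 proteins across 1,000 features,
# plus a hypothetical binary annotation (e.g., "plasma membrane protein").
n_proteins, n_features = 200, 1000
acts = np.maximum(rng.normal(size=(n_proteins, n_features)), 0.0)
labels = rng.integers(0, 2, n_proteins).astype(float)

# Plant one feature that tracks the annotation so the scan can find it.
acts[:, 123] = labels * 2.0 + rng.normal(0, 0.05, n_proteins)

def best_matching_feature(acts, labels):
    """Score every feature by |Pearson correlation| with the annotation
    and return the index of the strongest match."""
    a = acts - acts.mean(axis=0)
    l = labels - labels.mean()
    denom = np.sqrt((a**2).sum(axis=0) * (l**2).sum()) + 1e-12
    corr = (a * l[:, None]).sum(axis=0) / denom
    return int(np.abs(corr).argmax()), corr

idx, corr = best_matching_feature(acts, labels)
print(idx)  # → 123
```

In a real analysis the annotations would come from curated databases of protein family, function, and localization, and a feature with a strong, consistent match is a candidate "membrane protein" or "transporter" detector.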

Leveraging AI to Decode AI

After generating sparse protein representations, the researchers used an AI assistant, Claude, to analyze these patterns. Claude compared the neuron activations to known protein features and translated this information into plain language descriptions. For example, a neuron might be identified as detecting proteins involved in transmembrane transport located in the plasma membrane.
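The hand-off to the AI assistant can be pictured as assembling, for each sparse feature, a summary of its top-activating proteins and their known annotations, then asking the model for a one-sentence description. The sketch below only builds that summary prompt; the accession numbers and annotations are made up, and the actual Claude API call is omitted.

```python
# Hypothetical top-activating proteins for one sparse feature, with
# illustrative annotations of the kind a curated database would supply.
top_proteins = [
    ("P0AAF3", ["transmembrane transport", "plasma membrane"]),
    ("Q9XYZ1", ["transmembrane transport", "ion channel"]),
    ("P11223", ["plasma membrane", "transporter activity"]),
]

def build_labeling_prompt(feature_id, examples):
    """Format a feature's evidence into a question for an LLM assistant."""
    lines = [f"Feature {feature_id} activates strongly on these proteins:"]
    for accession, annotations in examples:
        lines.append(f"- {accession}: {', '.join(annotations)}")
    lines.append("In one sentence, what protein property does this feature detect?")
    return "\n".join(lines)

prompt = build_labeling_prompt(4711, top_proteins)
print(prompt)
```

Given evidence like this, the assistant's plain-language answer might be "proteins involved in transmembrane transport at the plasma membrane," which is exactly the kind of description the article reports.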

This process makes the inner workings of protein language models more transparent and interpretable. Knowing which features a model encodes can guide researchers in choosing or fine-tuning models for specific tasks, improving prediction accuracy and biological insight.

Implications for Biology and Drug Discovery

By revealing what features protein language models track, this approach opens the door to uncovering new biological knowledge hidden within AI representations. As models grow more powerful, this interpretability could help biologists discover novel protein functions or interactions that were previously inaccessible.

The research, supported by the National Institutes of Health, marks an important step in making AI tools in biology more explainable and practical for real-world applications.

