Explainable AI Tool Deciphers Protein Clumping Linked to Alzheimer’s and Drug Development

Researchers developed CANYA, an AI tool that explains how proteins form the clumps linked to diseases like Alzheimer’s. It analyzes chemical motifs to predict and explain aggregation.

Published on: Jun 07, 2025
Scientists Develop Explainable AI Tool to Decode Protein Aggregation

Researchers have created an AI tool called CANYA that reveals the chemical signals determining whether proteins form clumps. These clumps are linked to Alzheimer’s and about fifty other human disorders that together affect nearly half a billion people. By analyzing over 100,000 synthetic protein fragments tested in yeast cells, CANYA pinpoints the motifs that encourage or block aggregation.

This new approach stands apart from traditional “black-box” AI models because CANYA explains its predictions. It reveals the exact chemical patterns that promote or inhibit harmful protein clumping. The research, published in Science Advances, draws on the largest dataset ever collected on protein aggregation and provides fresh insights into the molecular causes behind these clumps.

Protein Aggregation and Its Impact

Protein aggregation, or amyloid formation, disrupts normal cell functions. It happens when parts of a protein become sticky, causing molecules to bind into fibrous, often toxic, structures. This process contributes to many diseases affecting hundreds of millions worldwide.

Implications for Medicine and Biotechnology

While this study advances understanding of neurodegenerative diseases, its immediate value lies in biotechnology. Many therapeutic proteins tend to clump, causing costly manufacturing failures. “Protein aggregation is a major headache for pharmaceutical companies,” says Dr. Benedetta Bolognesi from the Institute for Bioengineering of Catalonia (IBEC). CANYA can guide the design of antibodies and enzymes less prone to aggregation, reducing expensive setbacks.

Decoding the Protein Language

Proteins are built from 20 types of amino acids, forming sequences akin to words in a chemical language. Identifying which “words,” or motifs, cause clumping has been challenging because data were limited. To tackle this, researchers synthesized about 100,000 random protein fragments, each 20 amino acids long, and tested their aggregation in yeast cells. Of these, 21,936 fragments triggered clumping, providing an unprecedented dataset for AI training.
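To make the scale of the experiment concrete, the sketch below draws random 20-residue fragments from the 20 standard amino acids and computes the hit rate from the counts reported above. The uniform sampling, the function name `random_fragment`, the seed, and the small library size are assumptions for illustration only; the study's actual library design may differ.

```python
import random

# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
FRAGMENT_LENGTH = 20


def random_fragment(rng: random.Random, length: int = FRAGMENT_LENGTH) -> str:
    """Draw one fragment uniformly at random (an assumption, not the study's protocol)."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


rng = random.Random(0)  # fixed seed so the sketch is reproducible
library = [random_fragment(rng) for _ in range(1000)]  # small stand-in for the real library

# Counts as reported in the article (the library size is a round figure).
tested = 100_000
aggregating = 21_936
hit_rate = aggregating / tested

print(library[0])
print(f"Fraction of fragments that aggregated: {hit_rate:.1%}")
```

With the reported counts, roughly 22% of the tested fragments aggregated, so the dataset contains a large number of both positive and negative examples.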

Exploring a Vast Protein Universe

Dr. Mike Thompson from the Centre for Genomic Regulation (CRG) explains that evolution has sampled only a fraction of possible protein sequences. By testing random sequences, the team explored a broader “galaxy” of possibilities, enabling general rules of aggregation to emerge.

CANYA combines two kinds of neural-network components: convolution and attention. Convolution scans for local features such as motifs, much as image-recognition models pick out facial features. Attention weighs the importance of each motif in the context of the entire protein, much as language-translation models weigh key phrases within a sentence. This hybrid approach lets CANYA not only predict aggregation but also explain why certain motifs matter.
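The division of labor between the two components can be illustrated with a toy sketch. This is not CANYA's architecture: the filter here is hand-set rather than learned (a box filter over a hypothetical set of hydrophobic residues), and the attention step is reduced to a plain softmax pooling. All names, the window width, and the temperature are assumptions chosen for illustration.

```python
import math

# Hypothetical "sticky" residues, used only to build a toy hand-set filter.
HYDROPHOBIC = set("AVILMFWY")


def motif_scores(sequence: str, width: int = 3) -> list[float]:
    """Convolution-like step: slide a fixed window along the sequence and
    score each window by its fraction of hydrophobic residues."""
    return [
        sum(residue in HYDROPHOBIC for residue in sequence[i:i + width]) / width
        for i in range(len(sequence) - width + 1)
    ]


def attention_weights(scores: list[float], temperature: float = 0.25) -> list[float]:
    """Attention-like step: a softmax over window scores, so each window's
    weight depends on every other window in the sequence."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict(sequence: str) -> tuple[float, int]:
    """Return a toy aggregation score and the start of the most-attended window."""
    scores = motif_scores(sequence)
    weights = attention_weights(scores)
    pooled = sum(w * s for w, s in zip(weights, scores))
    top_window = max(range(len(weights)), key=weights.__getitem__)
    return pooled, top_window


score, start = predict("DKSEVILFGNKDESRKDEGS")
print(f"toy score={score:.2f}, most-attended window starts at position {start}")
```

In this toy sequence, the windows overlapping the hydrophobic stretch VILF receive the largest weights. Inspecting which windows carry the weight is the sense in which such a model can "explain" a prediction; a real model would learn both the filters and the attention from data.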

New Insights and Practical Findings

CANYA confirmed known trends, such as water-repelling amino acids promoting clumping and the influence of a motif’s position within a protein sequence. It also uncovered unexpected rules. For example, charged amino acids, usually thought to prevent aggregation, can promote it depending on their context.

Currently, CANYA only classifies whether a sequence aggregates. The next step is refining it to predict how fast aggregation occurs, which is crucial for diseases where timing affects progression.

Looking Ahead

“There are 1,024 quintillion ways to build a 20-amino-acid protein fragment. So far, we’ve trained AI on only 100,000,” notes Dr. Bolognesi. Expanding the dataset will improve predictions and deepen insights into protein aggregation. This will support both medical research and synthetic biology.
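The size of this sequence space can be checked with a short calculation. The sketch below assumes that each of the 20 positions can independently hold any of the 20 standard amino acids, which gives 20^20 possible fragments. Under that assumption the count is roughly 1.05 × 10^26, larger still than the quoted figure, and 100,000 tested fragments cover a vanishingly small share of it.

```python
ALPHABET_SIZE = 20     # standard amino acids
FRAGMENT_LENGTH = 20   # residues per fragment
TESTED = 100_000       # fragments measured so far, per the article

# Each position is an independent choice, so the count is 20 ** 20.
possible = ALPHABET_SIZE ** FRAGMENT_LENGTH
fraction_tested = TESTED / possible

print(f"Possible fragments: {possible:.3e}")
print(f"Fraction tested:    {fraction_tested:.1e}")
```

However the total is counted, the conclusion in the quote holds: the experiments so far have sampled only a minute fraction of the possible fragments.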

ICREA Research Professor Ben Lehner from CRG highlights the cost-effectiveness of combining large-scale experiments with AI. Using DNA synthesis and sequencing, hundreds of thousands of tests can run simultaneously, efficiently generating data to train AI models. This strategy aims to make biology more predictable and programmable.

Study Details and Collaboration

This work, published as “Massive experimental quantification allows interpretable deep learning of protein aggregation”, is a collaboration between labs at the Centre for Genomic Regulation (CRG), Institute for Bioengineering of Catalonia (IBEC), Cold Spring Harbor Laboratory (CSHL), and the Wellcome Sanger Institute. Funding came from ”La Caixa” Research Foundation, the European Research Council, and the Spanish Ministry of Science and Innovation.