AI Tool Sheds Light on Hidden Microproteins Linked to Disease

Researchers developed ShortStop, an AI tool that identifies microproteins hidden in noncoding DNA regions. It helps pinpoint functional microproteins, speeding up disease research.

Published on: Aug 01, 2025

New AI Tool Illuminates Overlooked Regions of the Human Genome

Researchers at the Salk Institute have developed ShortStop, a machine learning framework designed to identify microproteins hidden within the vast “noncoding” regions of DNA. These microproteins, typically fewer than 150 amino acids long, have historically been difficult to detect and study, both because they are so small and because genetic research has concentrated on larger proteins.

ShortStop lets scientists mine large genetic databases for small open reading frames (smORFs), the DNA sequences that can encode microproteins, and flag those likely to produce biologically relevant ones. By predicting which microproteins are likely to be functional, the tool cuts the time and resources spent on experimental validation.
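To make the idea of a smORF concrete, here is a minimal Python sketch that scans a DNA string for short open reading frames. It is an illustration only, not ShortStop's code: the function name `find_smorfs`, the forward-strand-only scan, the ATG-only start codon, and the `min_codons` floor are all assumptions made for this example. The 150-codon ceiling simply mirrors the size limit quoted above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_smorfs(dna, min_codons=10, max_codons=150):
    """Return (start, end, sequence) for short ORFs on the forward strand.

    Hypothetical helper: assumes ATG start codons and standard stop codons.
    An ORF's length counts the codons from ATG up to, not including, the stop.
    """
    dna = dna.upper()
    smorfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon in the frame set by this ATG.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                n_codons = (pos - start) // 3
                if min_codons <= n_codons <= max_codons:
                    smorfs.append((start, pos + 3, dna[start:pos + 3]))
                break  # first in-frame stop ends this ORF
    return smorfs
```

Note that an ATG nested in-frame inside a longer ORF is reported as its own, shorter candidate. A real pipeline would also scan the reverse strand and would typically work from assembled transcripts rather than raw strings.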

Why Microproteins Matter

Traditional studies have largely ignored the majority of the genome labeled as “noncoding” or “junk DNA.” Recent discoveries suggest that many of these regions contain smORFs encoding microproteins with important biological functions. Microproteins can influence health and disease, yet their small size makes them elusive to standard protein detection methods.

Instead of searching directly for these tiny proteins, researchers focus on the DNA sequences that encode them. However, existing experimental approaches to catalog smORFs are often costly, time-consuming, and limited in distinguishing functional microproteins from nonfunctional ones.

How ShortStop Changes the Game

ShortStop is a two-class classifier: it labels each smORF as either likely functional or likely nonfunctional. It is trained with computer-generated random smORFs serving as negative controls, which the model learns to tell apart from real smORFs found in genetic data.
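The negative-control idea can be sketched in a few lines of Python. The article does not say how ShortStop generates its random smORFs or which features it learns from, so everything below is an assumption for illustration: the helper names `random_smorf` and `build_training_set`, the uniform sampling of sense codons, and the choice to length-match each random sequence to a real one. The labels here (1 for real, 0 for random) only stand in for the two classes described above.

```python
import random
from itertools import product

STOP_CODONS = ("TAA", "TAG", "TGA")
# The 61 codons that encode an amino acid (all triplets minus the stops).
SENSE_CODONS = [
    "".join(c) for c in product("ACGT", repeat=3)
    if "".join(c) not in STOP_CODONS
]

def random_smorf(n_codons, rng):
    """Hypothetical generator: ATG, then random sense codons, then a stop."""
    body = "".join(rng.choice(SENSE_CODONS) for _ in range(n_codons - 1))
    return "ATG" + body + rng.choice(STOP_CODONS)

def build_training_set(real_smorfs, seed=0):
    """Pair each real smORF (label 1) with a length-matched random one (label 0)."""
    rng = random.Random(seed)  # seeded so the negatives are reproducible
    examples = []
    for seq in real_smorfs:
        examples.append((seq, 1))
        n_codons = len(seq) // 3 - 1  # codons before the stop codon
        examples.append((random_smorf(n_codons, rng), 0))
    return examples
```

Length-matching is a design choice made here so that a downstream classifier cannot separate the classes on sequence length alone. The resulting list of `(sequence, label)` pairs could then be turned into features and passed to any standard two-class model.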

This approach doesn’t guarantee biological relevance but efficiently narrows down candidates. When applied to existing datasets, ShortStop identified about 8% of smORFs as likely functional, focusing experimental efforts on the most promising targets. This filtering process accelerates discovery and reduces wasted resources.

Importantly, ShortStop works well with commonly available data types like RNA sequencing datasets, making it accessible to many research labs. This capability opens up the possibility of large-scale screening for microproteins across various tissues and disease states.

Microprotein Discovery in Lung Cancer

Using ShortStop, the team analyzed genetic data from human lung tumors and adjacent normal tissue. They identified 210 new microprotein candidates, with one microprotein notably upregulated in tumor tissue. This microprotein could serve as a biomarker or therapeutic target for lung cancer, illustrating ShortStop’s potential impact on disease research.
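As a rough illustration of what "upregulated in tumor tissue" means in practice, the sketch below computes a log2 fold change between tumor and normal expression values for one candidate. This is a generic calculation, not the analysis the Salk team ran: the function name `log2_fold_change` and the pseudocount of 1 are assumptions, and the article does not report the statistics the authors used.

```python
import math
from statistics import mean

def log2_fold_change(tumor, normal, pseudocount=1.0):
    """Log2 ratio of mean tumor to mean normal expression.

    Hypothetical helper: the pseudocount avoids division by zero when a
    candidate is undetected in one tissue. Positive values mean higher
    expression in tumor samples.
    """
    return math.log2((mean(tumor) + pseudocount) / (mean(normal) + pseudocount))
```

A fold change on its own says nothing about significance. A real comparison would also need replicate-aware statistics and correction for multiple testing across all candidates.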

Looking Ahead

The ability to efficiently identify functional microproteins from existing datasets is a major step forward. The approach can be applied to a wide range of diseases, including Alzheimer’s and obesity, to deepen our understanding of their molecular drivers.

As genetic databases continue to grow, tools like ShortStop help prioritize research efforts by highlighting the most promising microprotein candidates for further study.

Publication Details

  • Journal: BMC Methods
  • Title: ShortStop: A machine learning framework for microprotein discovery
  • Authors: Brendan Miller, Eduardo Vieira de Souza, Victor J. Pai, Hosung Kim, Joan M. Vaughan, Calvin J. Lau, Jolene K. Diedrich, Alan Saghatelian
  • Research Areas: Computational Biology
  • DOI: 10.1186/s44330-025-00037-4

Funding was provided by the National Institutes of Health and the Clayton Medical Research Foundation.

For researchers interested in machine learning applications in biology, this study demonstrates how AI can efficiently sift through complex genomic data to uncover previously hidden biological insights.

