Smarter Searching: NASA AI Makes Science Data Easier to Find
Smarter Tagging, Accelerated Discovery
Searching for scientific data can be as confusing as shopping for running shoes without standardized categories. Different researchers and data providers describe datasets using varied terms, making it difficult to locate relevant information quickly. To tackle this, NASA developed the Global Change Master Directory (GCMD), a controlled vocabulary that helps scientists tag datasets consistently and makes searching more straightforward.
As science advances, keeping metadata organized becomes increasingly challenging. To address this, NASA’s Office of Data Science and Informatics (ODSI) at the Marshall Space Flight Center created the GCMD Keyword Recommender (GKR), a smart tool that assists data providers and curators in assigning accurate keywords automatically.
Metadata Matchmaker
The upgraded GKR model handles an extreme multi-label classification problem: selecting multiple accurate labels from thousands of possible keywords for each dataset. For example, tagging a dataset may require identifying numerous relevant descriptors, some of which are rare or nuanced.
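To make the multi-label idea concrete, here is a minimal sketch of how a recommender can return several keywords for one dataset: each keyword gets its own independent probability, and every keyword that clears a threshold is kept. The keyword names, raw scores, and the 0.5 threshold are illustrative assumptions, not values taken from GKR.

```python
import math


def sigmoid(x):
    """Map a raw model score (logit) to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))


def select_keywords(logits, threshold=0.5):
    """Return every keyword whose independent probability clears the threshold.

    Unlike single-label classification, nothing forces the probabilities to
    sum to 1, so zero, one, or many keywords can be selected per dataset.
    """
    scored = {keyword: sigmoid(score) for keyword, score in logits.items()}
    chosen = [keyword for keyword, prob in scored.items() if prob >= threshold]
    # Most confident recommendations first.
    return sorted(chosen, key=lambda keyword: scored[keyword], reverse=True)


# Hypothetical raw scores for one dataset description (made up for illustration).
example_logits = {
    "PRECIPITATION": 2.1,
    "SEA SURFACE TEMPERATURE": -3.0,
    "CLOUDS": 0.4,
    "SOIL MOISTURE": -0.2,
}

selected = select_keywords(example_logits)
print(selected)
```

In a real system the same selection would run over thousands of candidate keywords, which is what makes the problem "extreme."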
The latest GKR version considers more than 3,200 keywords, up from 430 in the previous version. A vocabulary more than seven times as large demanded a more advanced model. The core of this upgrade is INDUS, a language model trained on 66 billion words from scientific literature spanning Earth science, biology, astronomy, and other fields.
INDUS allows the GKR to recognize the context behind keywords rather than relying on simple word similarities. For example, it can distinguish when "precipitation" refers to weather events versus climate variables in satellite data. Additionally, the new model was trained on more than 43,000 metadata records from NASA’s Common Metadata Repository, improving prediction accuracy.
Learning to Love Rare Words
One of the toughest challenges in this task is class imbalance: some keywords appear frequently, while others are rare. Traditional training methods tend to favor common labels, often overlooking less frequent but important ones.
NASA’s team applied a technique called focal loss, which down-weights the loss from examples the model already classifies confidently (typically the common keywords) so that training concentrates on rarer, harder cases. This approach improves performance across all keywords, especially those crucial for specialists searching for niche datasets.
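The effect of focal loss can be sketched in a few lines. The version below is the standard per-label (binary) form, which multiplies ordinary cross-entropy by a factor of (1 - p_t) raised to the power gamma, where p_t is the probability the model assigned to the correct answer. The function names and the defaults gamma = 2 and alpha = 0.25 are assumptions chosen for illustration; the article does not state the settings GKR uses.

```python
import math


def cross_entropy(prob, target):
    """Ordinary binary cross-entropy for one label of one example."""
    p_t = prob if target == 1 else 1.0 - prob
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    return -math.log(p_t)


def binary_focal_loss(prob, target, gamma=2.0, alpha=0.25):
    """Focal loss for one label of one example.

    prob   -- predicted probability that the label applies
    target -- 1 if the label truly applies, else 0
    gamma  -- focusing parameter; larger values shrink easy examples more
    alpha  -- weight on positive labels (negatives get 1 - alpha)
    """
    p_t = prob if target == 1 else 1.0 - prob
    alpha_t = alpha if target == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # guard against log(0)
    # (1 - p_t) ** gamma is near 0 for confident, correct predictions
    # and near 1 for badly missed ones.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An "easy" positive (model already confident) versus a "hard" one.
for prob in (0.95, 0.10):
    ce = cross_entropy(prob, 1)
    fl = binary_focal_loss(prob, 1)
    print(f"p={prob:.2f}  cross-entropy={ce:.4f}  focal={fl:.6f}")
```

With gamma = 2, a correct label predicted at 0.95 keeps only (1 - 0.95)^2 = 0.25% of its cross-entropy loss before the alpha weight is applied, while one predicted at 0.10 keeps 81%. Hard, often rare, keywords therefore dominate the training signal.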
From Metadata to Mission
Collecting data is only part of scientific progress; making it discoverable and usable is equally vital. The enhanced GKR tool plays a quiet but critical role by applying AI to metadata tagging, helping ensure that vast amounts of Earth observation data do not get lost or overlooked.
Beyond GKR, the INDUS language model supports other NASA Science Mission Directorate projects. For instance, it enhances the Science Discovery Engine by automating metadata curation and refining search result relevance. INDUS is becoming a foundational AI capability within NASA’s data science efforts.
INDUS is funded by NASA's Office of the Chief Science Data Officer, which promotes scientific discovery through innovative data science, advanced analytics, and artificial intelligence applications.