IBM and NASA’s Interagency Implementation and Advanced Concepts Team, or IMPACT, have developed a set of large language models designed to support scientific research by giving researchers better access to large volumes of specialized knowledge and helping them extract relevant information from diverse data sources.
NASA said Tuesday the INDUS suite of LLMs comprises encoder and sentence transformer models and covers five science domains: Earth science, astrophysics, planetary sciences, heliophysics, and biological and physical sciences.
According to the space agency, the INDUS encoder models were trained on a corpus of 60 billion tokens spanning the five science domains, and those encoders were then used to fine-tune the sentence transformer models on about 268 million text pairs.
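To make the pair-based fine-tuning step concrete, here is a minimal sketch using the open-source sentence-transformers library. The model ID, the example pairs and the loss choice are illustrative assumptions, not NASA's published training recipe.

```python
# Minimal sketch of pair-based sentence transformer fine-tuning,
# assuming the open-source sentence-transformers library; the model ID
# and the example pairs are placeholders, not the INDUS training data.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Start from a pretrained encoder (assumed ID; check Hugging Face for
# the actual INDUS model names).
model = SentenceTransformer("nasa-impact/nasa-smd-ibm-v0.1")

# Each InputExample holds one positive text pair, e.g. a question and
# a passage that answers it.
train_pairs = [
    InputExample(texts=[
        "What drives the solar wind?",
        "The solar wind is a stream of charged particles released from the Sun's corona.",
    ]),
    InputExample(texts=[
        "Mars dust storms",
        "Global dust storms on Mars can obscure the planet's surface for weeks.",
    ]),
]
loader = DataLoader(train_pairs, shuffle=True, batch_size=2)

# Multiple-negatives ranking loss treats the other pairs in a batch as
# negatives, a common choice when training on large sets of positive pairs.
loss = losses.MultipleNegativesRankingLoss(model)
model.fit(train_objectives=[(loader, loss)], epochs=1, warmup_steps=10)
```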
The IMPACT-IBM collaborative team designed the LLMs for retrieval-augmented generation and diverse linguistic tasks, enabling INDUS to process researchers' questions, generate answers and retrieve relevant documents.
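As an illustration of how an embedding model slots into the retrieval step of a RAG pipeline, the sketch below ranks a few toy documents against a question by cosine similarity. The model ID and documents are placeholders; a full system would hand the retrieved passage to a generator.

```python
# Minimal sketch of the retrieval step of a RAG pipeline, assuming the
# sentence-transformers library; the model ID and documents are placeholders.
from sentence_transformers import SentenceTransformer, util

# Stand-in embedding model; an INDUS sentence transformer would be used instead.
model = SentenceTransformer("all-MiniLM-L6-v2")

docs = [
    "SAGE III measures aerosols and ozone in Earth's stratosphere.",
    "The Parker Solar Probe studies the solar corona and the solar wind.",
    "OSIRIS-REx returned a sample from the asteroid Bennu.",
]
doc_emb = model.encode(docs, convert_to_tensor=True)

query = "Which mission observes the Sun's outer atmosphere?"
query_emb = model.encode(query, convert_to_tensor=True)

# Rank documents by cosine similarity and keep the best match; in a full
# RAG system the retrieved passage would be passed to a generator model.
hits = util.semantic_search(query_emb, doc_emb, top_k=1)[0]
print(docs[hits[0]["corpus_id"]], hits[0]["score"])
```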
The team trained the INDUS models on curated scientific corpora drawn from diverse data sources; the models are available on the Hugging Face platform.
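Because the models are distributed through Hugging Face, loading one follows the standard transformers workflow. Below is a minimal sketch for a masked-language-model encoder; the model ID is an assumption, so check the nasa-impact organization page on Hugging Face for the actual identifiers.

```python
# Minimal sketch of loading an INDUS-style encoder from Hugging Face;
# the model ID is an assumption, not a confirmed identifier.
from transformers import pipeline

fill = pipeline("fill-mask", model="nasa-impact/nasa-smd-ibm-v0.1")

# RoBERTa-style tokenizers use "<mask>" as the mask token; adjust if the
# chosen checkpoint uses a different one.
for pred in fill("The <mask> measures ozone in Earth's atmosphere."):
    print(pred["token_str"], round(pred["score"], 3))
```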
Kaylin Bugbee, team lead of NASA’s Science Discovery Engine, cited the benefits of the INDUS models to existing applications.
“Large language models are rapidly changing the search experience. The Science Discovery Engine, a unified, insightful search interface for all of NASA’s open science data and information, has prototyped integrating INDUS into its search engine. Initial results have shown that INDUS improved the accuracy and relevancy of the returned results,” noted Bugbee.