RAG in Drug Discovery
Introduction
Artificial intelligence (AI) in drug discovery is gaining momentum, with the market estimated to grow significantly over the next decade. While generative AI offers new ways to interpret complex biological data, integrating these technologies into life sciences comes with its own set of challenges—from data privacy to ensuring model accuracy.
At BioStrand, we address these challenges with LENSai™, powered by HYFT® technology—a bio-native AI framework designed to refine AI-driven insights with biological relevance. HYFT acts as a biological retrieval-augmented generation (RAG) system, integrating biological indexing, cross-modal retrieval, and multi-LLM stacking to ensure AI models operate with scientific precision rather than statistical inference.
This article examines the practical applications of Large Language Models (LLMs) and Retrieval Augmented Generation (RAG) in the field, exploring how our LENSai platform approaches these challenges to support effective AI-driven research in drug discovery.
LLM hallucination example highlighting the need for RAG in life sciences
1. Generative AI in Drug Discovery
The global market for artificial intelligence (AI) in drug discovery in 2023 was an estimated USD 1.5 billion. However, this rather modest figure is expected to grow exponentially, at a CAGR of almost 30 percent, to over USD 15 billion by 2032.
Early studies into AI’s potential in drug discovery have yielded some impressive findings. A BCG analysis of the research pipelines of 20 emerging AI-intensive pharmaceutical companies revealed two key statistics: one, between them these 20 AI startups had 158 drug candidates in discovery and preclinical development, compared with 333 at the top 20 pharma companies by revenue. And two, reconstructed development timelines showed that all of the startups' drug candidates had reached the clinical trial stage in less than a decade. A separate BCG-Wellcome report modeled time and cost savings of up to 50% for AI-driven drug discovery efforts.
The potential of AI in addressing several of the key pain points associated with traditional early-stage drug discovery is, understandably, driving investments and innovations in the pharmaceutical industry.
However, the industry is already moving toward the next big AI-powered paradigm in biomedical R&D: generative AI.
By one estimate, generative AI has over 200 potential use cases across the pharmaceutical value chain. This emerging technology is forecast to generate between $15 billion and $28 billion in additional annual value in the research and early discovery stage.
Retrieval augmented generation - Improving LLM performance in the life sciences
Generative AI represents a whole new opportunity to expand the horizons of computational drug discovery beyond predictive AI models. Its ability to decode complex biological and chemical languages and interpret the rules and relationships of DNA, RNA, and proteins can provide transformative insights into biomedical processes that were not previously possible.
Generative AI can analyze vast amounts of genomic data to identify patterns, mutations, and gene interactions not readily apparent through traditional methods to advance insights on gene function, predicting disease susceptibility, and identifying potential therapeutic targets.
It can model complex metabolic pathways and elucidate how various biological processes interact to provide new insights into disease mechanisms and potential interventions. Generative AI tools are already being used to predict gene expression, design synthetic sequences to safely and accurately modulate gene expression, identify therapeutic targets, discover new biomarkers, match target proteins with potential drug molecules, validate protein designs, and predict mutations that improve RNA function, to list just a few.
Despite this apparent value and versatility, several challenges still have to be addressed before generative AI solutions can be productively integrated into life sciences applications. Foundational Large Language Models (LLMs) are not inherently suited for complex biomedical applications and require extensive pretraining and fine-tuning before they can conform to scientific research standards. Then there are the inherent limitations of these models, including the tendency to hallucinate information, knowledge cutoffs, and a lack of interpretability. Finally, the successful and scalable deployment of AI technologies, both generative and analytical, will require an AI-focused data + information architecture that can support a high-outcome, future-proof AI strategy.
In this article, we take a closer look at some of the key challenges involved in the effective integration of LLMs into life science applications, the capabilities and components of Retrieval Augmented Generation (RAG) models that can help mitigate these challenges, and BioStrand's unique approach to designing the LENSai platform.
2. Retrieval-Augmented Generation (RAG): Enhancing AI in Life Sciences
Despite the immense potential of LLMs in the rapidly evolving field of artificial intelligence, they come with a significant challenge: hallucinations. In generalized applications, these hallucinations, which have been described as features rather than bugs, can be caused by a variety of factors and result in confidently stated but incorrect information, such as a claim that the Golden Gate Bridge was recently relocated.
In the context of life sciences and drug discovery, where accuracy is paramount, such hallucinations could pose a serious disruption risk with critical consequences.
Challenges in AI-driven drug discovery
Key limitations in applying LLMs to life sciences
LLMs have huge potential in life sciences, but they also come with challenges—hallucinations, data gaps, privacy concerns, and more. At BioStrand, we tackle these head-on with RAG for reliable outputs, neuro-symbolic reasoning for smarter retrieval, and a semantic-first approach for accuracy and traceability. Let’s break down the key limitations.
● Domain-specific knowledge gaps: Despite the versatility of general-purpose LLMs, they lack the deep, contextual understanding of complex life science topics required to capture the relational functions that are critical to scientific knowledge creation. Pre-training, therefore, will be key to the development of biological domain-specific LLMs.
● Data availability, quality, diversity & recency: Pre-training and fine-tuning life sciences LLMs require highly specialized datasets that are often scarce or difficult to access. Moreover, life sciences data is extremely diverse, ranging from genomic sequences to EHRs, necessitating multimodal models and access to multimodal training data that may not be readily available from high quality public sources. LLMs also have a recency problem that limits their ability to continuously learn from the rapidly evolving field of life sciences research and discoveries.
● Data privacy, security & ethics: Handling sensitive, confidential and proprietary life sciences data poses significant challenges involving data security and privacy, regulatory and compliance, and ethical use considerations.
● Integration & interoperability: To achieve domain-specificity at scale, LLMs must be seamlessly integrated with established life sciences research systems, processes and workflows, while ensuring interoperability with specialized life science software and internal/external data sources.
● Accuracy, explainability and interpretability: The "black box" nature of LLMs, their potential to hallucinate plausible-sounding but factually incorrect information, and the difficulty of explaining the reasoning behind their outputs are among the biggest challenges in the context of high-stakes life science applications. A concerted, multi-pronged strategy is needed to mitigate hallucinations and enhance the accuracy, explainability, and interpretability of LLMs.
Addressing these challenges is crucial for successfully leveraging the power of LLMs in life science applications while ensuring the highest standards of scientific integrity, security, compliance, and ethics.
AI challenges in target validation
The RAG framework: Making LLMs effective for drug discovery
Retrieval Augmented Generation (RAG) represents one of the most promising solutions to address many of the challenges associated with deploying LLMs in life sciences research.
Retrieval-augmented generation is an AI framework that combines the strengths of retrieval-based models and generative models to improve the quality, recency, and relevance of LLM-generated outputs. The key principle involved is the augmentation of the knowledge inherent in pre-trained language models with dynamically retrieved up-to-date and relevant information from external sources to generate more accurate and contextually appropriate responses.
3. BioStrand's Unique RAG System Explained
A RAG system broadly consists of four main components: one, a large language model for understanding and generating text, two, an external knowledge base as an information source, three, a retrieval system to identify relevant information from the knowledge base, and four, an integration mechanism to combine retrieved information with the language model's output.
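These four components can be illustrated with a minimal, self-contained sketch; the word-overlap retriever and placeholder generator below are illustrative stand-ins, not the LENSai implementation:

```python
# 2. External knowledge base: a toy list of documents (illustrative content).
KNOWLEDGE_BASE = [
    "Dopamine is a neurotransmitter that binds dopamine receptors.",
    "BRCA1 mutations are associated with breast cancer susceptibility.",
]

def retrieve(query: str, k: int = 1) -> list:
    """3. Retrieval system: rank documents by naive word overlap with the query."""
    q = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda d: len(q & set(d.lower().strip(".").split())),
        reverse=True,
    )
    return ranked[:k]

def generate(prompt: str) -> str:
    """1. Language model: a placeholder standing in for a real LLM call."""
    return f"[LLM answer grounded in context]\n{prompt}"

def rag_answer(query: str) -> str:
    """4. Integration: combine retrieved context with the query before generation."""
    context = "\n".join(retrieve(query))
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)
```

In a production system, `retrieve` would be backed by a real index and `generate` by an actual LLM; the point here is only how the four components hand off to one another.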
The BioStrand RAG-LLM framework comprises the following three key components — Semantic Understanding, Smart Retrieval with Semantic Similarity Safeguarding, and Controlled Answer Generation — with each stage playing a critical role in the overall performance of the system.
Key stages of retrieval-augmented generation
1. Semantic Understanding
Understanding is the crucial first step that defines the performance benchmarks for the subsequent retrieval and generation in the RAG process. The ‘understanding’ stage leverages the natural language understanding (NLU) capabilities of the underlying language model to process and comprehend the input query.
The BioStrand framework emphasizes semantic understanding: grasping the actual context of a query through a semantic-first approach. This is critical for two reasons: one, breaking down search queries semantically enables a more nuanced understanding of biological queries. In the life sciences context, for instance, "dopamine" could refer to a small molecule, a receptor, an antagonist, a gene, or a mutation. And two, this nuanced understanding provides the robust foundation required to optimize and improve the next stage, precise information retrieval.
The key BioStrand differentiator, therefore, is the ability to capture the semantics of an input query by effectively detecting its semantic word boundaries, thereby staying close to the meaning of the input. This is important because it improves relevance through close semantic matching and also improves downstream retrieval by making it as fine-grained as possible.
RAG - Correct understanding of your question
The key processes in this stage include:
a. Tokenization:
Breaking the input query into meaningful units, or tokens, and encoding them into a format suitable for the language model.
b. Contextual embedding:
Using pre-trained large language models to process the tokens and generate high-dimensional contextual embeddings that capture the semantic meaning of the input query.
c. Feature extraction/intent recognition:
Extracting key features and intent from the contextual embeddings to identify the key topics, entities, and relationships and determine the intent underlying the query to ensure that retrieval and generation processes are aligned with the user's needs.
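As a rough illustration of these three steps, the toy sketch below uses a hashed bag-of-words vector in place of real LLM contextual embeddings and a small hand-written lexicon in place of a biomedical ontology; all names and senses here are hypothetical:

```python
import hashlib

# Toy entity lexicon; in practice this would be a biomedical ontology,
# capturing e.g. that "dopamine" has multiple possible senses.
ENTITY_SENSES = {
    "dopamine": ["small molecule", "receptor", "antagonist", "gene", "mutation"],
    "brca1": ["gene", "protein"],
}

def tokenize(query: str) -> list:
    """a. Tokenization: split the query into normalized lowercase tokens."""
    return [t.strip(".,?!").lower() for t in query.split()]

def embed(tokens: list, dim: int = 8) -> list:
    """b. Contextual embedding: a hashed bag-of-words stand-in for the
    high-dimensional embeddings a pre-trained LLM would produce."""
    vec = [0.0] * dim
    for t in tokens:
        h = int(hashlib.md5(t.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    return vec

def extract_entities(tokens: list) -> dict:
    """c. Feature extraction: flag known entities and their candidate senses,
    the raw material for downstream intent resolution."""
    return {t: ENTITY_SENSES[t] for t in tokens if t in ENTITY_SENSES}
```

A real pipeline would use a trained tokenizer and contextual encoder; the sketch only shows how the three sub-steps feed into one another.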
2. Smart Retrieval, Safeguarding Semantic Similarity
In this stage, retrieval-based models are used to identify the information in an external knowledge base that is most relevant to the input query's encoding.
RAG - Smart retrieval of relevant content
The key processes in this stage include:
a. Indexing:
In RAG retrievers, indexing is the process of organizing data in external knowledge bases for the fast and efficient retrieval of relevant information. BioStrand indexes documents using the same semantic mapping techniques used in the "Semantic Understanding" stage. This universal semantic mapping model forms the basis for querying a harmonized and specialized knowledge layer that integrates an extensive life sciences knowledge base and unstructured information from life sciences literature into one unified knowledge graph.
Our approach to smart retrieval is based on a combined neuro-symbolic approach to identifying the embeddings closest in similarity to that of the query. This also enables full control over LLM input: ensuring specificity through semantics and knowledge, controlling the amount of information through filtering, and verifying origin through gated lookups.
Leveraging the same semantic concepts used for semantic understanding to index our knowledge layer ensures that the data itself is neatly “organized” for more accurate information retrieval. In addition, the ability to index any document ensures that the knowledge base is dynamic and can capture knowledge that is not directly accessible through standard out-of-the-box LLMs. This streamlines the process of integrating clinical information and privacy sensitive or proprietary information while also ensuring that user-specific access, privacy and authentication policies can be enforced.
Safeguard close semantic similarity to the question
b. Query expansion:
This refers to a set of techniques to expand the input query with semantically related data points that can help improve retrieval accuracy.
c. Retrieval algorithms:
These are algorithms that utilize the formulated query to search the knowledge base and retrieve relevant information. Retrieval methods can be broadly classified into traditional, keyword-based sparse methods (BM25, TF-IDF), contemporary dense neural retrieval models (DPR), and hybrid models that combine both approaches to significantly improve information retrieval in specialized domains.
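As an illustration of the sparse, keyword-based end of that spectrum, here is a minimal TF-IDF scorer over a toy document set; the documents and the plain TF-IDF weighting are purely illustrative:

```python
import math
from collections import Counter

# Toy corpus standing in for an external knowledge base.
DOCS = [
    "dopamine receptor antagonist binds the d2 receptor",
    "brca1 gene mutation raises cancer susceptibility",
    "dopamine is synthesized from tyrosine",
]

def tf_idf_scores(query: str, docs: list) -> list:
    """Score each document against the query with plain TF-IDF:
    term frequency in the document times log inverse document frequency."""
    tokenized = [d.split() for d in docs]
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query.split():
            df = sum(1 for d in tokenized if term in d)  # document frequency
            if df:
                idf = math.log(n / df)
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return scores
```

BM25 refines this scheme with term-frequency saturation and document-length normalization, and dense retrievers such as DPR replace term matching with learned embeddings; hybrid systems combine both signals.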
Given the limited context windows of LLMs, an additional result-processing phase is required to further filter and refine the initial set of retrieved documents and prepare them for LLM input.
Key processes in this substage include:
I. Embedding generation & comparison:
This process transforms both the original query and the retrieved passages into dense vector representations using sentence transformers or universal sentence encoders. The choice of embedding model significantly impacts the quality of semantic similarity assessments, with models fine-tuned on domain-specific data often performing better for specialized applications. These embeddings are then compared using cosine similarity or other distance metrics to compute similarity scores.
II. Relevance ranking:
The retrieved documents are subjected to a more thorough analysis and reranked by their relevance to the formulated query in terms of semantic similarity, contextual relevance, recency, source credibility, and other factors. The documents are then ordered by their similarity scores, with higher scores indicating greater relevance.
This filtering and reranking process enhances the relevance and accuracy of LLM inputs by prioritizing only the most pertinent information. Similarity thresholds are used to filter out less relevant information and ensure that only the most semantically related content is considered for generation.
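A minimal sketch of this compare-filter-rerank flow, assuming embeddings are already available as plain vectors and using an arbitrary illustrative similarity threshold:

```python
import math

def cosine(a: list, b: list) -> float:
    """I. Compare two embedding vectors by cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rerank(query_vec: list, passages: list, threshold: float = 0.3) -> list:
    """II. Score each (text, embedding) passage against the query vector,
    drop passages below the similarity threshold, and rerank the rest."""
    scored = [(cosine(query_vec, vec), text) for text, vec in passages]
    kept = [(s, t) for s, t in scored if s >= threshold]
    return sorted(kept, reverse=True)
```

Only the passages that survive the threshold, in ranked order, would be passed on to the generation stage.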
Neuro Symbolic Methodology of BioStrand
This methodology fuses deep learning with symbolic logic, a branch of mathematics and philosophical logic that uses symbols rather than words to represent logical expressions. The approach harnesses the data-driven strengths of LLMs and the reasoning capabilities of symbolic systems, offering both the adaptability of LLM methods and the transparency of symbolic logic, ensuring comprehensive and interpretable outcomes for inquiries.
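One way to picture the neuro-symbolic combination, without implying this is BioStrand's actual implementation: neural similarity scores propose candidates, and an explicit fact base accepts only those it can support. The ontology triples below are invented for illustration:

```python
# Symbolic layer: explicit, inspectable facts (subject, relation, object).
ONTOLOGY = {
    ("dopamine", "is_a", "neurotransmitter"),
    ("haloperidol", "antagonist_of", "dopamine receptor"),
}

def symbolic_supports(candidate: str, relation: str, target: str) -> bool:
    """Symbolic reasoning step: accept a candidate only if an explicit
    fact backs the claimed relation, making the decision traceable."""
    return (candidate, relation, target) in ONTOLOGY

def neuro_symbolic_rank(candidates: list, relation: str, target: str) -> list:
    """Neural layer proposes (name, similarity_score) candidates; the
    symbolic layer filters them, and survivors are ranked by score."""
    return sorted(
        [(score, name) for name, score in candidates
         if symbolic_supports(name, relation, target)],
        reverse=True,
    )
```

The design point is that a high neural score alone is not enough: a plausible-sounding but unsupported candidate is rejected, which is where the transparency of the symbolic side comes from.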
3. Controlled Answer Generation
This final stage involves combining the semantically and contextually rich information retrieved with the language model's inherent knowledge to generate a coherent and informative response.
BioStrand facilitates full control over the generative model, including dynamic selection of the LLM that researchers want to use to process the retrieved information for answer generation. It also offers more fine-grained control, such as using an LLM to summarize only the most relevant content. Our approach generates human-readable answers linked to massive numbers of data points, highlights the source references on which each answer is based, and enables reference traceback to sources at the sub-sentence level.
Answer generation
Key processes include:
a. Contextual input:
Concatenating retrieved and semantically filtered information with the original query as contextual input for the generative model.
b. Source retracing:
Citing the documents that were used to produce the answer, down to sub-sentence level detail. Regardless of which LLM has been used, the original source documents can be traced.
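These two steps can be sketched as follows; the prompt format and the source-tagging scheme are illustrative assumptions, not the platform's actual mechanism:

```python
def build_prompt(query: str, passages: list) -> str:
    """a. Contextual input: concatenate retrieved, filtered passages with the
    original query, tagging each passage with a source id for later traceback.
    passages: list of (source_id, text) pairs."""
    context = "\n".join(f"[{src}] {text}" for src, text in passages)
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer, citing the [source] tags you rely on:")

def trace_sources(answer: str, passages: list) -> list:
    """b. Source retracing: list the source ids actually cited in the
    generated answer, independent of which LLM produced it."""
    return [src for src, _ in passages if f"[{src}]" in answer]
```

Because the citations live in the prompt and the answer rather than inside the model, the traceback step works the same way whichever LLM is plugged in.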
4. BioStrand's RAG-LLM: Designed for Efficient Drug Discovery
At BioStrand, we empower AI-driven life sciences research with a bio-native approach, ensuring that AI models not only process biological data but truly understand it. Our LENSai platform, powered by HYFT technology, unifies retrieval-augmented generation (RAG), vector search, and multi-LLM stacking to refine AI outputs with verified life sciences-specific context. By integrating biological indexing, cross-modal retrieval, and contrastive learning, HYFT transforms AI from a bio-unaware tool into an adaptive system that retrieves, aligns, and contextualizes biological knowledge at scale.
This foundation enhances our semantic-first approach, ensuring precise retrieval, reducing hallucinations, and providing transparent, traceable AI-driven insights. With LLM stacking, neuro-symbolic reasoning, and a dynamically expanding knowledge graph, LENSai delivers actionable, biologically-grounded intelligence to accelerate drug discovery and biomedical research.
At BioStrand, we effectively leverage the power of Retrieval Augmented Generation (RAG) to address the hallucination problem and to improve LLM responses by incorporating external lookups, effectively grounding the model's outputs in verified life sciences-specific information.
- Our semantic-first approach emphasizes the importance of accurately understanding the context of a query. Accurate semantic breakdown is vital to the precise retrieval of specific and relevant documents.
- Our smart retrieval capabilities safeguard semantic similarity to the question by using a combined neuro-symbolic approach to define the closest embeddings. These smart capabilities are synced with a semantically organized and harmonized life science knowledge layer that integrates unstructured information and ontologies into a dynamic knowledge graph.
- Our Neuro Symbolic Methodology fuses deep learning and symbolic logic techniques to harness the data-driven strengths of LLMs and the reasoning capabilities of symbolic systems. This combined approach offers both adaptability from LLM methods and transparency from symbolic logic to provide comprehensive, interpretable outcomes.
- Researchers have complete control over the generative models, including LLM input, choice of LLM, specific content to be summarized, and token control to limit hallucinations. Generated content is human readable, and our white-box approach ensures that all content can be traced back to sources and references, right down to sub-sentence-level details.
- Our advanced foundation AI model uses "LLM stacking" to intelligently combine different generative models with the capability to integrate real-world data and evidence. The BioStrand foundation model is designed for flexible integration into existing architectures or fine-tuned models and for multiple use cases and interfaces.
Answer breakdown with LENSai
At BioStrand, our focus continues to be on bringing the absolute cutting edge in AI technologies together into one unified platform that supports holistic life sciences research. At the core of our LENSai platform is a comprehensive and continuously expanding knowledge graph that maps a remarkable 25 billion relationships across 660 million data objects, linking sequence, structure, function, and literature information from the entire biosphere into one comprehensive biomedical knowledge base. The platform also orchestrates the semantic proficiency of knowledge graphs, the reasoning capabilities of LLMs, and the advanced information retrieval capabilities of Vector Search and Retrieval Augmented Generation (RAG) models into one synergistic model that enables the integrated and intelligent analysis of biomedical data.
For more information about integrating our advanced AI model into your life sciences research and drug discovery workflows, please drop us a line here.