Protein Language Models
In 2020, AlphaFold, a deep-learning system that relies on multiple sequence alignments (MSAs) as inputs, set a new milestone in AI-driven computational biology by effectively solving the decades-old protein-folding challenge. The focus has since expanded to protein language models (PLMs) and their ability to understand the language of proteins to address a broader and more complex set of biotechnological applications, including protein modeling, design, and engineering.
Here is a primer on the evolution from AlphaFold to protein language models. For more information on large language models (LLMs) in computational biology and drug discovery, check out our previous posts.
- Everyone’s talking about LLMs: Notes from BioTechX Europe 2023
- Integrating knowledge graphs and large language models for next-generation drug discovery
- Knowledge graphs and black box LLMs
- How retrieval-augmented generation (RAG) can transform drug discovery
1. What is a protein language model?
Despite the high accuracy and performance that MSA-based algorithms deliver in protein structure prediction, they still have important limitations: building alignments is computationally expensive, and prediction quality degrades for orphan proteins with few known homologs. Protein language models represent a powerful new approach to MSA-free protein structure prediction.
Protein language models leverage massive sequence datasets to produce high-dimensional contextual embeddings for each amino acid in a sequence. They represent a natural extension of large language models into bioNLP applications, exploiting the conceptual similarities between natural language and the language of proteins. The key distinction is that instead of modeling the distribution of words and texts, these specialized models are trained to understand and predict the relationship between amino acid sequences and protein properties, such as structure, function, and interactions.
Protein language models, a type of large language model (LLM), use a transformer architecture with a specialized protein vocabulary. These models can learn, without supervision, the patterns and relationships within amino acid sequences across extensive protein datasets, providing detailed insights into protein structure and function.
As early as 2019, researchers showed that large language models trained with self-supervision on millions of natural protein sequences learn fundamental properties of proteins such as secondary structure, residue-residue contacts, and biological activity. More recently, researchers at Meta introduced the ESM-2 language model and the ESMFold structure prediction tool to demonstrate how scaling up language models enables an order-of-magnitude acceleration in atomic-resolution protein structure prediction.
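To make this concrete, the short sketch below shows how per-residue embeddings can be extracted from a publicly released ESM-2 checkpoint using the Hugging Face transformers library; the checkpoint name and example sequence are illustrative choices, not the original ESMFold pipeline.

```python
# Hedged sketch: per-residue embeddings from an ESM-2 checkpoint (illustrative choice).
import torch
from transformers import AutoTokenizer, EsmModel

model_name = "facebook/esm2_t33_650M_UR50D"  # 650M-parameter ESM-2 model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = EsmModel.from_pretrained(model_name)
model.eval()

# A protein sequence is treated like a sentence; each amino acid is a token.
sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
inputs = tokenizer(sequence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual embedding per token (plus the special start/end tokens).
per_residue_embeddings = outputs.last_hidden_state
print(per_residue_embeddings.shape)  # (1, sequence length + 2, 1280)
```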
2. Distinction from general LLMs
PLMs extend the capabilities of traditional large language models, such as OpenAI's GPT (Generative Pre-trained Transformer), to the domain of proteins. These specialized models, pre-trained on vast datasets of protein sequences, learn the language and patterns inherent in protein sequences and structures. By leveraging this pre-training, PLMs can perform various protein-related tasks, including structure prediction, function annotation, protein-protein interaction prediction, and protein design.
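As a hedged illustration of this pre-train/fine-tune pattern, the sketch below repurposes a small public ESM-2 checkpoint as a binary sequence classifier; the checkpoint, sequences, and labels are toy placeholders rather than a real benchmark or any specific published protocol.

```python
# Hedged sketch: fine-tuning a pre-trained protein language model for a downstream
# classification task (e.g. a binary protein property). Toy data only.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

checkpoint = "facebook/esm2_t12_35M_UR50D"  # small ESM-2 model, illustrative choice
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

sequences = ["MKTAYIAKQRQISFVKSHFSRQ", "MVLSPADKTNVKAAWGKVGAHA"]
labels = torch.tensor([0, 1])  # placeholder labels for the two classes

batch = tokenizer(sequences, padding=True, return_tensors="pt")
outputs = model(**batch, labels=labels)
outputs.loss.backward()  # in practice this sits inside a full training loop
print(float(outputs.loss))
```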
3. PLM applications in life sciences
PLMs offer unprecedented insights into the language of proteins, with the potential to accelerate discoveries across a range of life sciences applications.
Accurate protein structure prediction, one of the primary applications of PLMs, plays a critical role in elucidating the molecular mechanisms underlying diseases and enabling the design of targeted therapies.
By integrating various data sources, including protein sequences, structures, and evolutionary information, protein language model representations can improve high-level functional annotation, deepen the understanding of disease mechanisms, and support the identification of potential therapeutic targets.
Protein language models can deliver state-of-the-art performance in protein-protein interaction (PPI) prediction across diverse proteins and enhance both the accuracy and efficiency of protein-ligand interaction (PLI) prediction, thereby helping to elucidate complex biological networks and pathways.
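One common recipe, sketched below under stated assumptions, is to represent each protein by a mean-pooled PLM embedding, concatenate the embeddings of the two candidate partners, and train a simple classifier on interacting versus non-interacting pairs; random vectors stand in here for real embeddings such as those produced above.

```python
# Hedged sketch of embedding-based PPI prediction; placeholder data only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_pairs, dim = 200, 1280                    # dim chosen to match a typical PLM hidden size
emb_a = rng.normal(size=(n_pairs, dim))     # placeholder embeddings for protein A
emb_b = rng.normal(size=(n_pairs, dim))     # placeholder embeddings for protein B
labels = rng.integers(0, 2, size=n_pairs)   # 1 = interacting pair, 0 = non-interacting

features = np.concatenate([emb_a, emb_b], axis=1)
X_train, X_test, y_train, y_test = train_test_split(features, labels, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))  # ~0.5 on random placeholder data
```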
General protein language models have been successfully used to evolve human antibodies based on evolutionarily plausible mutations, even without information related to target antigen, binding specificity, or protein structure.
PLMs can advance protein engineering and design by generating novel protein sequences that are functional and have the desired properties.
Finally, PLMs have the potential to advance personalized medicine by integrating genomic data with protein sequence and structural information to identify genetic variants associated with disease susceptibility, drug response, and treatment outcomes.
4. How does protein modeling and design work?
An overview of protein sequences, structures, and functions
First used by the Swedish chemist Jöns Jacob Berzelius, the term ‘protein’ refers to long chains of 51 or more amino acids. These chains, linked by peptide bonds, fold into proteins that carry out both housekeeping and specialized functions essential to life. As per the sequence-structure-function paradigm of structural biology, the sequence of a protein determines its three-dimensional structure (fold), which in turn determines its function.
A protein sequence is the specific linear arrangement of amino acids within a protein molecule, and protein sequencing, the determination of that precise arrangement, is at the core of understanding a protein’s structure and function.
Protein structure refers to the three-dimensional arrangement of amino acids within a protein molecule. In the context of proteins, however, the term ‘structure’ is used at four levels: primary structure (the amino acid sequence), secondary structure (local folded elements such as alpha-helices and beta-sheets), tertiary structure (the overall three-dimensional shape of a single chain), and quaternary structure (the arrangement of multiple protein chains into a complex).
Proteins perform a diverse range of functions: they can be structural, regulatory, contractile, or protective; serve in transport, storage, or membranes; act as toxins or enzymes; and a single protein can perform more than one of these roles.
Protein sequences, structures, and functions are interconnected aspects that collectively define the molecular basis of life.
Techniques for Protein Modeling and Design
- Homology modeling
- Threading/fold recognition
- Ab initio modeling
- Molecular dynamics simulation
- Protein-ligand docking
- Directed evolution
- Rational protein design
- X-ray crystallography
- NMR spectroscopy
- The CASP experiments
- Deep learning
5. Integration of AI in Protein Modeling
The integration of AI, including PLMs, in protein modeling continues to facilitate innovative research in several fields such as structural biology and novel drug design. AI-based protein models have played a significant role in enhancing the accuracy of experimentally determined protein structures.
Additionally, AI-driven protein modeling enables the exploration of protein sequence space, facilitating the design of novel proteins with desired properties for various biotechnological applications. Deep graph neural networks have been successfully trained to rapidly design precise sequences that fold into a predetermined shape for specific functions.
Specialized deep protein language models have been shown to deliver state-of-the-art prediction performance across multiple protein properties, including structure, post-translational modifications, and biophysical attributes. As transformer-based architectures continue to evolve, protein language models are becoming increasingly sophisticated at sampling unexplored regions of protein space.
The computational power and sophistication of large protein language models also made possible the ESM Metagenomic Atlas, a database of more than 600 million high-resolution predicted structures that covers metagenomic proteins at scale.
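For readers who want to try it, the Atlas also exposes a public ESMFold API that returns a predicted structure for a query sequence; the sketch below assumes the endpoint documented at the time of writing is still available and not rate-limited, and the query sequence is illustrative.

```python
# Hedged sketch: request an ESMFold structure prediction from the public ESM Atlas API.
import requests

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQ"
response = requests.post(
    "https://api.esmatlas.com/foldSequence/v1/pdb/",  # endpoint assumed from public docs
    data=sequence,
    timeout=120,
)
response.raise_for_status()

# The response body is a PDB file; per-residue confidence (pLDDT) sits in the B-factor column.
with open("predicted_structure.pdb", "w") as handle:
    handle.write(response.text)
```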
6. Well-known protein language models
ProtTrans
ProtTrans, a family of transformer-based models that was among the first to study the grammar of proteins, was developed by researchers at the Technical University of Munich. The models were trained on large datasets comprising nearly 393 billion amino acids from UniRef and the BFD (Big Fantastic Database) and can perform various protein-related tasks, including protein sequence classification, protein function prediction, protein structure prediction, and protein-protein interaction prediction.
ProtTrans models leverage the self-attention mechanism of transformers to capture complex patterns and relationships in protein sequences for more accurate predictions. Their ability to handle diverse protein data types and tasks makes them versatile tools across domains. Because these pre-trained models are open source, researchers can fine-tune them on specific domains, datasets, or tasks, saving valuable time and computational resources compared to training models from scratch. On the flip side, training and fine-tuning ProtTrans models require significant high-performance computing resources and, for supervised tasks, large amounts of labeled protein data.
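The sketch below shows, under stated assumptions, how one of the openly released ProtTrans checkpoints (ProtBert on Hugging Face) can be used to embed a sequence; note the ProtTrans convention of space-separated residues with rare amino acids mapped to X, and treat the model id and sequence as illustrative.

```python
# Hedged sketch: sequence embeddings with the openly released ProtBert checkpoint.
import re
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("Rostlab/prot_bert", do_lower_case=False)
model = BertModel.from_pretrained("Rostlab/prot_bert")
model.eval()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
# ProtBert expects space-separated residues, with rare amino acids (U, Z, O, B) mapped to X.
spaced = " ".join(re.sub(r"[UZOB]", "X", sequence))
inputs = tokenizer(spaced, return_tensors="pt")

with torch.no_grad():
    embeddings = model(**inputs).last_hidden_state  # (1, sequence length + 2, 1024)
print(embeddings.shape)
```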
ProteinBERT
ProteinBERT, a deep-learning model designed for protein sequences, augments the classic Transformer/BERT architecture with several innovations, including global-attention layers with linear complexity in sequence length and a denoising autoencoding training objective. The model was pretrained on protein sequences and GO annotations extracted from UniRef90 (∼106M proteins) on two simultaneous tasks: bidirectional language modeling of protein sequences and Gene Ontology (GO) annotation prediction.
The ProteinBERT architecture is agnostic to the length of the processed sequences, meaning that it generalizes well across sequences of different lengths without the need to alter learned parameters. The ability to incorporate protein Gene Ontology (GO) annotations as an additional input enhances the model’s ability to infer biological context and make more accurate predictions. Compared to state-of-the-art protein language models, ProteinBERT requires considerably less compute and memory for pretraining and inference, yet delivers comparable performance on a diverse set of benchmarks. Pretraining from scratch is still a resource-intensive process, however, and the model may require fine-tuning to reach competitive accuracy on some downstream tasks.
ProGen
ProGen, a suite of protein language models developed by Salesforce AI Research, is designed to generate protein sequences with specific functions and desired characteristics across large protein families. These models were trained on 280 million nonredundant protein sequences, including associated taxonomic identifiers and keywords, curated from publicly available databases and spanning more than 19,000 protein families. Augmented with control tags specifying protein properties, ProGen can be further fine-tuned by additional training on curated sequences carrying a single control tag to improve protein generation.
ProGen models have been scaled from 151 million up to 6.4 billion parameters on different training datasets totaling around a billion protein sequences from genomic, metagenomic, and immune repertoire databases. These models demonstrate strong performance across a range of protein engineering tasks without the need for additional fine-tuning.
ProtGPT2
ProtGPT2 is an autoregressive transformer model with 738 million parameters capable of the high-throughput generation of de novo protein sequences. The model was trained on around 50 million non-annotated sequences from UniRef50 spanning the entire protein space, including dark regions where protein structures are not accessible by conventional experimental techniques; the final training dataset contained roughly 45 million sequences.
ProtGPT2 explores the dark protein space by expanding natural superfamilies with the generated sequences exhibiting predicted stabilities and dynamic properties similar to natural sequences. Since the model has been trained on the entire protein sequence space, including the dark proteome, it opens up new opportunities for protein design. Out of the box, this pre-trained model can design globular proteins on any standard workstation. Plus, it can be fine-tuned (e.g. SpikeGPT2) to a specific family, function, or fold.
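As a hedged illustration, the snippet below samples de novo sequences from the ProtGPT2 checkpoint released on Hugging Face; the model id and sampling parameters follow the authors' published example as we understand it and should be checked against the current model card.

```python
# Hedged sketch: sampling de novo protein sequences with the released ProtGPT2 checkpoint.
from transformers import pipeline

protgpt2 = pipeline("text-generation", model="nferruz/ProtGPT2")

# Sampling settings adapted from the model card; adjust for your own use case.
sequences = protgpt2(
    "<|endoftext|>",
    max_length=100,
    do_sample=True,
    top_k=950,
    repetition_penalty=1.2,
    num_return_sequences=5,
    eos_token_id=0,
)
for candidate in sequences:
    print(candidate["generated_text"])
```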
These are just a few popular examples from the ever-expanding protein language model universe, which also includes EMBER2, ESMFold, ProtFlash, DistilProtBert, and SaProt, among many others.
The rapid growth of the antibody-based therapeutics market has also triggered the widespread development of antibody language models, inspired by protein LLMs, for antigen-specific computational antibody design.
7. An overview of some popular antibody-specific LLMs
AbLang
AbLang is a large language model trained solely on the antibody sequences in the Observed Antibody Space (OAS) database to restore missing sequence residues, a key challenge in B-cell receptor repertoire sequencing. This antibody-specific model has proven to be much more effective, and up to seven times faster, than general protein language models at completing antibody sequences with missing amino acids in the OAS database.
The model has since evolved into AbLang-2, which addresses the germline bias present even in antibody-specific LLMs and the resulting limitations in these models’ ability to suggest relevant mutations.
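A minimal sketch of sequence restoration with the authors' ablang Python package is shown below; the package name, pretrained-model identifiers, and the convention of marking missing residues with '*' reflect our reading of the project documentation and should be verified against the current release.

```python
# Hedged sketch: restoring missing residues in an antibody heavy chain with AbLang.
# Assumes the authors' `ablang` package (pip install ablang); interface may differ by version.
import ablang

heavy_ablang = ablang.pretrained("heavy")  # use "light" for light chains
heavy_ablang.freeze()

# '*' marks missing residues to be restored (convention assumed from the package docs).
sequences = [
    "EV*LVESGGGLVQPGGSLRLSCAASGFTFS*YAMSWVRQAPGKGLEWVS",
]
restored = heavy_ablang(sequences, mode="restore")
print(restored)
```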
AntiBERTy
AntiBERTy is an antibody-specific transformer large language model pre-trained on a non-redundant set of 588 million natural antibody sequences from the OAS database. The model has been shown to cluster antibodies within repertoires into trajectories resembling affinity maturation, potentially providing new biological insights into that process. The AntiBERTy team also demonstrated that the model could be trained to predict highly redundant sequences and to identify key binding residues. AntiBERTy-based sequence features have also served as the basis for a recent computational approach to predicting the immunogenicity of therapeutic antibodies solely from the variable-region amino acid sequences.
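The authors distribute AntiBERTy as a pip-installable package; the sketch below, based on our reading of its documentation, shows how per-residue embeddings might be pulled for downstream analyses such as the immunogenicity work mentioned above (class and method names should be checked against the current release).

```python
# Hedged sketch: embedding antibody sequences with the antiberty package (pip install antiberty).
# Class and method names are assumed from the package README and may differ by version.
from antiberty import AntiBERTyRunner

antiberty = AntiBERTyRunner()
sequences = [
    "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMSWVRQAPGKGLEWVS",
]
embeddings = antiberty.embed(sequences)  # list of per-residue embedding tensors
print(embeddings[0].shape)
```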
AntiBERTa
AntiBERTa is a 12-layer antibody-specific transformer model that addresses the central challenge of predicting B cell receptor (BCR) properties from amino acid sequence alone. The model was pre-trained on 42 million unpaired heavy-chain and 15 million unpaired light-chain BCR sequences. Each BCR sequence is treated as the natural language equivalent of a sentence and tokenized at the amino-acid level, enabling residue-level downstream predictions, while the 12 attention heads in each of the 12 layers focus on different aspects of the sequence. Described as the deepest protein-family-specific large language model, AntiBERTa can be fine-tuned on a range of downstream tasks, including paratope prediction and, potentially, antibody structure prediction and humanization.
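To show what residue-level fine-tuning looks like in practice, the hedged sketch below frames paratope prediction as token classification; since the original AntiBERTa weights are not assumed to be publicly available, a small public ESM-2 checkpoint stands in, and the sequence and labels are purely illustrative.

```python
# Hedged sketch: paratope prediction as per-residue (token-level) classification.
# A public ESM-2 checkpoint stands in for AntiBERTa; data and labels are toy placeholders.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

checkpoint = "facebook/esm2_t6_8M_UR50D"  # stand-in encoder, not AntiBERTa itself
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint, num_labels=2)

sequence = "EVQLVESGGGLVQPGGSLRLSCAASGFTFSSYAMS"  # illustrative heavy-chain fragment
inputs = tokenizer(sequence, return_tensors="pt")

# Toy per-residue labels: 1 = paratope residue, 0 = non-paratope (special tokens left at 0).
labels = torch.zeros_like(inputs["input_ids"])
labels[0, 10:16] = 1

outputs = model(**inputs, labels=labels)
print(outputs.loss, outputs.logits.shape)  # logits: one 2-class score per token
```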
AbMAP
AbMAP (Antibody Mutagenesis-Augmented Processing) is a transfer learning framework to adapt and fine-tune foundational PLMs specifically for antibodies. The approach represents a middle path that seeks to address some of the limitations of general PLMs in modeling antibodies while bringing the full diversity of protein sequences that foundational PLMs access to antibody-specific LLMs. The AbMAP framework can be adapted to any PLM, as demonstrated by AbMAP-E (ESM-1b), AbMAP-P (ProtBert), and AbMAP-B (Bepler & Berger), with the choice of foundational PLM depending on the specific downstream prediction task, for example, AbMAP-B for structure prediction and AbMAP-E/P for function/property prediction.
8. The LENSᵃⁱ approach to protein language models
Large language models have proven to be extremely adept at extracting meaning from natural language data by relying on distributional patterns rather than on explicit, codified linguistic knowledge. In the language of proteins, however, the absence of any formal design principles or equations makes the application of LLMs essentially an exercise in data fitting.
At LENSai, we believe that any effective foundation AI model for life sciences research and development will require the holistic integration and management of multimodal data encompassing genetic, textual, and structural dimensions, including diverse embeddings drawn from different layers of LLMs that capture varying sources of information.
At the core of our Advanced Foundation AI Model is the patented HYFT technology, a sophisticated framework designed to identify and leverage universal fingerprint™ patterns across the biosphere. In the language of proteins, HYFTs serve as critical enablers, translating the previously abstract notions of protein semantics and grammar into tangible, meaningful linguistic principles that enable more precise mapping and analysis. The HYFT technology’s application of "word boundaries" to the language of proteins offers a groundbreaking new approach to unlocking the complexities of protein structure and function.
Another key innovation in our unique Foundation AI Model is the use of an advanced "LLM stacking" technique that intelligently combines different LLMs, with the HYFTs seamlessly linked to specific features found in various LLMs. Integrating the HYFT framework's capability to pinpoint unique 'fingerprints' in biological sequences with the vast knowledge base of the stacked LLMs enables more complex biological data analysis with greater specificity, leading to more accurate predictions and insights.