
Proteins

🚧 Under construction 🦺

Generating or modifying protein sequences to improve existing behavior, or to create novel behavior, is a powerful application for AI. Guided by evolutionary techniques, Bayesian optimization, and/or protein language models (PLMs), these approaches can vastly accelerate the development of biotechnological tools, as well as the identification of targets and avenues for therapeutics. Because of their ability to represent the 'language of proteins', PLMs are increasingly important in predicting protein structure and function.

Components

Protein optimization can be broken down into several components 1

  • Target property: the intended goal(s) of protein development
  • [Fitness Predictor] that uses sequence information to estimate the value of the optimization target, serving as a surrogate for laboratory measurement
  • Sequence Proposer that creates sequences to evaluate and explore
  • Prioritizer that uses sequence and predictor information to estimate the top candidates
  • Laboratory measurements that reveal the quality of the generated proteins with respect to the targets
  • Orchestrator that puts the pieces together in a functional and validated manner

Optimization systems combine these components into full solutions in two general ways.

  1. A pipeline that separates generation and evaluation, where the predictor model evaluates the quality of an input set of sequences (generated or otherwise defined)
  2. A model that directly proposes the best designs using adaptive sampling: proposing solutions, evaluating them with the predictor model, and then iterating (a minimal sketch of this loop follows below).
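As a rough sketch of that adaptive loop, with all function names as hypothetical placeholders rather than a specific library:

```python
def optimize_protein(seed_sequences, propose, predict, prioritize, measure,
                     rounds=5, batch_size=96):
    """Generic adaptive loop: propose -> predict -> prioritize -> measure -> repeat.

    `propose`, `predict`, `prioritize`, and `measure` are hypothetical callables
    standing in for a sequence proposer, fitness predictor, prioritizer, and
    laboratory assay, respectively.
    """
    observed = {}                                   # sequence -> measured target value
    pool = list(seed_sequences)
    for _ in range(rounds):
        candidates = propose(pool)                  # sequence proposer
        scores = predict(candidates)                # surrogate fitness predictor
        ranked = prioritize(candidates, scores)     # estimate top candidates
        results = measure(ranked[:batch_size])      # laboratory measurement
        observed.update(results)
        # re-seed the proposer with the best sequences measured so far
        pool = sorted(observed, key=observed.get, reverse=True)[:10]
    return observed
```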

These components can be cleanly seen in the box below:

Adaptive machine learning for protein engineering

An overview of ML for protein engineering:

image

Strategy

Protein optimization necessarily involves the creation of proteins and the evaluation of target characteristics. There are large volumes of databases of various forms that may be useful in creating foundation models. It remains essential to continue collecting observations to improve predictions of the optimization target through iterated feedback.

The volume of observations helps determine the architectures that can be used. Base models tend to be PLMs because of the large set of available sequence data. Unsupervised fine-tuning of those large models can then be performed on homologous or protein-family sequence sets. Final targets may then be optimized with simple networks, often involving regression to minimize overfitting, or with methods that include Bayesian or evolutionary approaches.

To successfully deliver on final target optimization, the greater the quantity of direct or surrogate data that can be obtained, the greater the chance that the resulting models will be sufficiently predictive of the fitness of future protein sequence candidates. That is why massive screening approaches, such as Ginkgo's platform screening thousands of candidates, are so valuable.

An example process by Ginkgo

Ginkgo shows, with foundry-scale protein screening, that with thousands of samples they were able to create an enzyme with a 10x improvement over their starting point. In their design they use structure-based (differential) estimates via Rosetta, evolutionary-scale modeling (PLMs), active-site-focused evolutionary models, as well as an in-house method called 'OWL'.

image

When it is possible to iteratively measure proposed sequences, new data can be used to improve subsequent sequence predictions. Selection can be done greedily, choosing the best predicted solutions, or with probabilistic methods such as [Bayesian Optimization], which searches for a protein that optimizes a target by combining estimated values with their uncertainties. Greedily selecting the sequences with the highest predicted target values may easily fail due to incorrect estimates from the predictor model. Alternatively, upper confidence bound (UCB) acquisition selects sequences based on a sum of the predicted target value and the predicted target uncertainty.
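As a minimal sketch (not tied to any particular library), greedy and UCB selection differ only in whether the predicted uncertainty is added to the score:

```python
import numpy as np

def ucb_select(mean, std, n_select=8, kappa=1.0):
    """Rank candidate sequences by an upper-confidence-bound score.

    mean, std: arrays of predicted target values and uncertainties
    (e.g. from an ensemble or a Gaussian-process surrogate).
    kappa trades off exploitation (0 -> greedy) against exploration.
    """
    score = mean + kappa * std
    return np.argsort(score)[::-1][:n_select]

# Toy usage with made-up predictions for 100 candidate sequences.
rng = np.random.default_rng(0)
mean = rng.normal(size=100)
std = rng.uniform(0.1, 1.0, size=100)
greedy_picks = ucb_select(mean, std, kappa=0.0)   # pure exploitation
ucb_picks = ucb_select(mean, std, kappa=1.0)      # exploration-aware
```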

Ways of prioritizing

image

Optimization Targets

There are a number of targets that protein optimization can focus on. For example, some targets primarily enable basic understanding, such as protein structure, while other targets relate to function, though it is generally understood that structure enables function.

In the canon of causal influence, a source has a sequence, the sequence creates a structure, and the structure enables the function. We can generally compartmentalize targets along these lines, though there is some crossover between them.

  • Source
  • Sequence
    • Alignment
    • [Remote homology] Similar function or structure
  • Structure
    • Contact prediction
    • Secondary and tertiary structure
    • (mis)Folding (missense)
  • Function

    • Enzymatic Catalysis: The ability of an enzyme to accelerate chemical processes
    • Thermocompatibility or thermostability: how well a protein remains stable or functional at varying temperatures
    • Fluorescence for visualization purposes

    • Protein Binding to...

      • Proteins
      • Nucleic Acids
      • Drug molecules
      • Metals

Though there are many examples where these classes cross over, the breakdown above remains a useful organizing frame.

Fitness prediction

Training a fitness model may first involve training an unsupervised foundation model on a high volume of data. These models can then be fine-tuned, or otherwise adapted, to incorporate protein sequences of higher relevance to the protein targets of interest.

Learning protein fitness models from evolutionary and assay-labeled data

The authors show in their paper a method that combines ridge regression with large language models, revealing the ability to effectively predict fitness from evolutionary and assay-labeled data. image
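As a hedged sketch of this kind of hybrid predictor (the embedding function and data below are placeholders, not the paper's exact pipeline), one can fit a ridge regression on fixed PLM embeddings of assay-labeled sequences:

```python
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

def embed(sequences):
    """Placeholder for a PLM embedding function (e.g. mean-pooled
    per-residue representations). Returns one vector per sequence."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sequences), 1280))   # hypothetical 1280-d embeddings

sequences = ["MKT", "MKA", "MRT"] * 20                # assay-labeled sequences (dummy)
fitness = np.random.default_rng(1).normal(size=60)    # measured fitness values (dummy)

X_train, X_test, y_train, y_test = train_test_split(embed(sequences), fitness, test_size=0.2)
model = RidgeCV(alphas=np.logspace(-3, 3, 13))        # L2 regularization limits overfitting
model.fit(X_train, y_train)
print("held-out R^2:", model.score(X_test, y_test))
```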

Sequence Proposer

With a fitness predictor made available, the next step is to create proposal sequences that may be evaluated with the predictor model, or potentially with direct measurement.

One way of doing this is to use generative models directly, seeding the generation with starting sequences related to the target, or even with a natural language prompt. Another method is activation maximization, which generates inputs to a model that maximize the model's output (assuming maximization is the desired target direction).
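A minimal sketch of activation maximization over a relaxed one-hot sequence representation; the `predictor` here is a stand-in, not a trained fitness model, and gradients are taken with respect to per-position logits:

```python
import torch

L, A = 50, 20                      # sequence length, amino-acid alphabet size
predictor = torch.nn.Sequential(   # stand-in for a trained differentiable fitness model
    torch.nn.Flatten(), torch.nn.Linear(L * A, 1)
)

logits = torch.zeros(1, L, A, requires_grad=True)   # trainable per-position logits
opt = torch.optim.Adam([logits], lr=0.1)

for _ in range(200):
    probs = torch.softmax(logits, dim=-1)           # relaxed (soft) one-hot sequence
    fitness = predictor(probs).mean()
    opt.zero_grad()
    (-fitness).backward()                           # ascend the predicted fitness
    opt.step()

designed = probs.argmax(dim=-1)                     # discretize to amino-acid indices
```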

Generative Models

📋
Sequence modeling and design from molecular to genome scale with Evo

The authors reveal in their paper that long-context genomic models can be powerful in their ability to yield state-of-the-art predictions on protein-related tasks, including zero-shot function prediction and multi-element sequence generation. Their model, known as Evo, uses the StripedHyena structured state space architecture. image

ZymCTRL: a conditional language model for the controllable generation of artificial enzymes

The authors describe ZymCTRL, a conditional language model trained on the BRENDA database of enzymes, which generates enzymes of a specified enzymatic class from a user prompt. ZymCTRL generates artificial enzymes distant from natural ones whose intended functionality matches predictions from orthogonal methods. image

Model

Low-N protein engineering with data-efficient deep learning

Site-directed mutagenesis with a PLM: apply random masks to the sequence and predict the distribution of mask fillers.
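A hedged sketch of this masking-and-filling idea using a masked protein language model via Hugging Face transformers; the checkpoint name is an assumption, and any ESM-style masked LM would do:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

name = "facebook/esm2_t6_8M_UR50D"           # assumed small ESM-2 checkpoint
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name).eval()

seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
pos = 10                                      # 0-based position to mutagenize
masked = seq[:pos] + tok.mask_token + seq[pos + 1:]

with torch.no_grad():
    inputs = tok(masked, return_tensors="pt")
    logits = model(**inputs).logits

mask_index = (inputs["input_ids"][0] == tok.mask_token_id).nonzero().item()
probs = torch.softmax(logits[0, mask_index], dim=-1)
top = torch.topk(probs, 5)                    # most likely residues at the masked site
print([(tok.convert_ids_to_tokens(i.item()), p.item()) for i, p in zip(top.indices, top.values)])
```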

Markov Chain Monte Carlo

📋
Plug & play directed evolution of proteins with gradient-based discrete MCMC (EvoProtGrad for MCMC)

A Python package for directed evolution of a protein sequence with gradient-based discrete Markov chain Monte Carlo (MCMC), based on the paper, blog, and docs. image
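A minimal, generic sketch of discrete MCMC directed evolution (plain Metropolis-Hastings over single substitutions with a stand-in fitness function); this is not EvoProtGrad's API and omits its gradient-informed proposals:

```python
import math
import random

AA = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq):
    """Stand-in fitness surrogate; replace with a trained predictor."""
    return sum(1.0 for a in seq if a in "ILVF") / len(seq)

def mcmc_evolve(seq, steps=500, temperature=0.1, seed=0):
    rng = random.Random(seed)
    current, f_current = list(seq), fitness(seq)
    for _ in range(steps):
        i = rng.randrange(len(current))               # pick a position
        proposal = current.copy()
        proposal[i] = rng.choice(AA)                  # propose a single substitution
        f_prop = fitness(proposal)
        # Metropolis acceptance: keep improvements, sometimes accept worse moves
        if f_prop >= f_current or rng.random() < math.exp((f_prop - f_current) / temperature):
            current, f_current = proposal, f_prop
    return "".join(current), f_current

best_seq, best_fit = mcmc_evolve("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```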

With Natural Large Language Models

Activation Maximization

📋
SeqProp: Stochastic Sequence Propagation - A Keras Model for optimizing DNA, RNA and protein sequences based on a predictor.

The authors reveal in their paper and arXiv preprint a method to optimize biological sequences against a predictor model. They use trainable logits that can be sampled from, while applying instance normalization. The repository provides a Python API for constructing generative DNA/RNA/protein sequence PWM models in Keras. It implements a PWM generator (with support for discrete sampling and straight-through (ST) gradient estimation), a predictor model wrapper, and a loss model. image image

Protein sequence design by conformational landscape optimization

The authors propose a Bayesian approach to optimizing a protein structure to yield a residue sequence. They use a loss of the form \(\mathrm{Loss} = -\log P(\text{contacts} \mid \text{sequence}) + D_{KL}(f_{20} \,\|\, f_{20}^{PDB})\), where \(D_{KL}\) is the Kullback-Leibler divergence, \(f_{20}\) is the average frequency of amino acids in the sequence, and \(f_{20}^{PDB}\) is the average frequency of amino acids in proteins in the PDB. Paper image
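As a small illustration of the second term (a sketch, not the authors' code), the KL regularizer compares the designed sequence's amino-acid frequencies against PDB background frequencies:

```python
import numpy as np

def aa_frequency_kl(seq_freqs, pdb_freqs, eps=1e-8):
    """D_KL(f20_sequence || f20_PDB) over the 20 amino-acid frequencies."""
    p = np.asarray(seq_freqs, dtype=float) + eps
    q = np.asarray(pdb_freqs, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy usage: uniform design frequencies vs. a slightly skewed background.
uniform = np.full(20, 1 / 20)
background = np.linspace(1, 2, 20); background /= background.sum()
print(aa_frequency_kl(uniform, background))
```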

📋
Structure-based scoring and sampling of 'Combinatorial Variant Effects from Structure' (CoVES)

The authors show in their paper, across seven different combinatorial mutation studies, the ability to design proteins by exploring the design space without needing a combinatorial number of mutations. They build a model that estimates a residue preference effect for each amino acid variant at each position, and sums these effects to predict combinatorial variants. Simple linear and logistic models using a mutation-effect preference matrix of size 20 (amino acids) x number of residues were able to predict the effect of variants. They could then use this to design sequences via Boltzmann sampling and generate variants that were substantially improved.
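A hedged sketch of the additive idea (with a made-up preference matrix, not CoVES's fitted values): per-position amino-acid preference scores are summed to score a combinatorial variant, and new variants can be drawn by Boltzmann sampling each position independently:

```python
import numpy as np

AA = list("ACDEFGHIKLMNPQRSTVWY")
n_positions = 10
rng = np.random.default_rng(0)
pref = rng.normal(size=(n_positions, len(AA)))   # residue-preference effects (placeholder)

def score_variant(variant):
    """Additive score: sum of per-position preference effects."""
    return sum(pref[i, AA.index(a)] for i, a in enumerate(variant))

def boltzmann_sample(temperature=1.0):
    """Sample each position independently from softmax(pref / T)."""
    seq = []
    for i in range(n_positions):
        p = np.exp(pref[i] / temperature)
        p /= p.sum()
        seq.append(rng.choice(AA, p=p))
    return "".join(seq)

designs = [boltzmann_sample(0.5) for _ in range(5)]
print([(d, round(score_variant(d), 2)) for d in designs])
```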

image image image In particular, the following image lends credence to the idea that these simple models of important sites can be useful in predicting protein fitness.

image

Data sources

BRENDA
ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants

The results are "curated to enable thorough comparisons of various mutation effect predictors in different regimes" image

Website Paper

Homologous Pairs of Low and High Temperature Originating Proteins Spanning the Known Prokaryotic Universe

Example Architectures

While there are many architectures and methods for creating and optimizing proteins, we focus here primarily on approaches that employ PLMs in some way. These create foundation models that can be fine-tuned and readily adapted to specific domains of interest.

The general method of creating protein foundation models uses Masked Language Modeling (MLM), or 'BERT-style' prediction, though next-token prediction, as done with GPT architectures, may also be used. We share a number of prominent models and their uses or derivatives below.

Evaluation Metrics

  • Spearman correlation coefficient
  • AUC (area under the ROC curve)
  • MCC (Matthews correlation coefficient)
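A small sketch of computing these metrics with scipy/scikit-learn on dummy predictions (the arrays here are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score, matthews_corrcoef

rng = np.random.default_rng(0)
y_true = rng.normal(size=200)                      # measured fitness values (dummy)
y_pred = y_true + rng.normal(scale=0.5, size=200)  # model predictions (dummy)

rho, _ = spearmanr(y_true, y_pred)                 # rank agreement for regression targets

labels = (y_true > 0).astype(int)                  # binarized labels, e.g. functional vs. not
auc = roc_auc_score(labels, y_pred)                # threshold-free ranking quality
mcc = matthews_corrcoef(labels, (y_pred > 0).astype(int))  # balanced binary agreement

print(f"Spearman={rho:.2f}  AUC={auc:.2f}  MCC={mcc:.2f}")
```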

Foundation models

ESM models

GitHub Repo stars Language models enable zero-shot prediction of the effects of mutations on protein function
Evolutionary-scale prediction of atomic-level protein structure with a language model (esm)

An end-to-end language model enabling structure-sequence pairing, coupled with an equivariant transformer structure model at the end. image

Science paper

Genome-wide prediction of disease variant effects with a deep protein language model

The authors show in their paper a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal.

Developments

Using established and newly trained protein language models, the authors demonstrate the ability to provide zero-shot predictions of the effect of a protein mutation on a protein's fluorescence.

image

They use a PLM to score mutations using a log odds ratio between the mutated and wild-type residues.

image
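One standard way to compute this log odds ratio with a masked LM (the masked-marginal formulation; written out here as a reconstruction, not a quote from the paper) sums over the set of mutated positions \(M\):

\( \text{score} = \sum_{i \in M} \left[ \log p\left(x_i = x_i^{\mathrm{mt}} \mid x_{\setminus M}\right) - \log p\left(x_i = x_i^{\mathrm{wt}} \mid x_{\setminus M}\right) \right] \)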

Data

They create ESM-1v, an unsupervised masked transformer model, by training on 98 million protein sequences from UniRef90 (2020-03).

They evaluate the model on a set of 41 deep mutational scans.

Paper

GitHub Repo stars MSA Transformer

The authors demonstrate in their paper the training of an unsupervised PLM that operates on sets of aligned sequences (MSAs). Self-supervision is used to reconstruct a corrupted MSA.

Developments

image

**Architecture**

The architecture 'interleaves attention across the rows and columns of the alignment as an axial attention' and ties the attention map across rows with 'tied row attention'. They use a single feed-forward layer for each block. For position embeddings, they use 1D learned position embeddings added independently to each row of the MSA, to distinguish aligned positions differently for each sequence.

The objective is the loss over the masked MSA, as follows

image

The probabilities are the output of the MSA transformer, softmax-normalized over the amino acid vocabulary independently per position in the sequence. Masking columns uniformly resulted in the best performance.
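Assuming the standard masked-language-modeling form (a reconstruction of what the loss image shows, not a quote), the objective over the set of masked MSA entries \(\mathcal{M}\) is:

\( \mathcal{L}_{\mathrm{MLM}} = - \sum_{(r,c) \in \mathcal{M}} \log p\left( x_{r,c} \mid x_{\setminus \mathcal{M}} \right) \)

where \(r\) indexes the sequence (row) and \(c\) the aligned position (column) of the MSA.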

The models have 12 layers, an embedding size of 768, and 12 attention heads, resulting in roughly 100M parameters.

Data

They use 26 million MSAs generated from UniRef50 by searching UniClust30 with HHblits.

Analysis

image

They show that a logistic regression with 144 parameters, fit on 20 training structures, could predict the contact maps of almost 15k other structures in an essentially unsupervised manner. They also show that a supervised contact prediction head can further improve the contact maps. They find the attention heads focus on highly variable columns, correlating with the per-column entropy of the MSA.

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

The authors used masked language prediction with transformer models to train a foundation model capable of multiple downstream tasks.

"To this end we use unsupervised learning to train a deep contextual language model on
86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The
resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone.
The learned representation space has a multi-scale
organization reflecting structure from the level
of biochemical properties of amino acids to remote homology of proteins. Information about
secondary and tertiary structure is encoded in the
representations and can be identified by linear projections. "

<img width="329" alt="image" src="https://github.com/ianderrington/genai/assets/76016868/27df578a-50ab-42ac-b675-58f7d740be4a">
Transformer protein language models are unsupervised structure learners

image

📋
Preference Optimization of Protein Language Models as a Multi-objective Binder Design Paradigm

The authors create a design paradigm using instruction fine-tuning and direct preference optimization of PLMs. Starting from ProtGPT2, binders can be designed based on receptor and drug-developability criteria. To do this, they perform two-step instruction tuning with receptor-binding 'chat templates', and then optimize the fine-tuned models to promote preferred binders. Specifically, they "propose an alignment method to transform pre-trained unconditional protein sequence models (p(s)), that autoregressively sample sequences (s) from an underlying data distribution (D), to conditional probability models (p(s|r; c)) that, given a target receptor (r), sample binders that satisfy constraints (c) encoded by preference datasets compiled from experiments and domain experts."

image

Notably, they fuse protein sequences with English-language prompts and use BPE encoding with a large vocabulary size (50k) instead of the smaller pLM vocabulary sizes (33) that are standard.

image

Multimodal

📋
🧬 GitHub Repo stars Protein function prediction as approximate semantic entailment

Developments

Current LLMs excel at predicting the structure and other attributes of biological sequences like proteins. However, their transferability is limited, capping their true potential. The DeepGO-SE model innovates 🚀 by integrating protein language models with specific knowledge of protein function, bridging the gap between knowledge graphs' explicit representations and next-token prediction's implicit representations, and thereby significantly improving model performance.

How it works

  • 🔄 First, DeepGO-SE reuses the ESM2 large language model to convert a protein sequence into a vector space embedding, prepping it for machine learning application.
  • 🧠 Next, an ensemble of fitted prediction models is trained to align ESM2 embeddings with an embedding space (ELEmbeddings) derived from GO axioms, creating a world model filled with geometric shapes and relations akin to a Σ algebra, which can verify the truth of a statement.
  • ✅ Finally, for statements such as "protein has function C", when the ensemble reaches a consensus on truth, the semantic truth estimation is then accepted as valid.

DeepGO-SE Model Overview

The authors demonstrate 📈 that this method improves molecular function prediction by a substantial margin. Moreover, they reveal that training with protein-protein interactions substantially benefits the understanding of complex biological processes. They suggest that predicting biological processes may only require knowledge of molecular functions, potentially paving the way for a more generalized approach that could be advantageous in other domains.

📋
ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

The authors show in their paper that the fusion of a natural language model with a protein language model can reasonably improve protein location prediction, fitness landscape prediction, and protein function annotation.

image

Data: They build a dataset called ProtDescribe to match protein sequences with text descriptions.

Models: Their models involve three losses: (1) an InfoNCE loss to maximize similarity between positive sequence-text pairs and minimize similarity between negative pairs; (2) a masked protein modeling cross-entropy loss to maintain unimodal information in the sequences; and (3) a multimodal mask prediction loss that uses self- and cross-attention on masked input sequence and text pairs to mutually recover the masked content in both modalities. They start with pretrained protein models (BERT, ESM-1b, and ESM-2) and pretrained language models (PubMedBERT-abs and PubMedBERT-full).
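A hedged sketch of the contrastive (InfoNCE) component only, with random stand-in embeddings in place of the sequence and text encoders:

```python
import torch
import torch.nn.functional as F

def info_nce(seq_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired sequence/text embeddings.

    Matched (diagonal) pairs are pulled together; all other in-batch
    combinations act as negatives.
    """
    seq_emb = F.normalize(seq_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = seq_emb @ txt_emb.T / temperature      # cosine similarities
    targets = torch.arange(len(seq_emb))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage: 8 paired (protein, description) embeddings of dimension 512.
loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
```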

image

The text data set looks like this: image

Other models

Single-sequence protein structure prediction using supervised transformer protein language models

The authors show in their [paper](https://nature.com/articles/s43588-022-00373-3) the ability to generate high-quality predictions outperforming AlphaFold2, with a model called trRosettaX-Single that uses ESM to generate representations and attention maps that can be trained to predict distance and energy maps.

image

Tasks Assessing Protein Embeddings (TAPE)

Architectures by Target

Enzymatic Catalysis
De novo design of luciferases using deep learning

image

📋
ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model

Developments The authors present ForceGen, an end-to-end algorithm for de novo protein generation based on nonlinear mechanical unfolding responses. Rooted in the physics of protein mechanics, this generative strategy provides a powerful way to design new proteins rapidly, including exquisite and rapid predictions about their dynamical behavior.

Proteins, like any other mechanical object, respond to forces in peculiar ways. Think of the different response you'd get from pulling on a steel cable versus pulling on a rubber band, or the difference between honey and glass. Now, we can design proteins with a set of desirable mechanical characteristics, with applications from health to sustainable plastics.

image

image

The key to solving this problem was to integrate a protein language model with denoising diffusion methods, using accurate atomistic-level physical simulation data to endow the model with a first-principles understanding. ForceGen can solve both forward and inverse tasks: in the forward task, it predicts how stable a protein is, how it will unfold, and what forces are involved, given just the sequence of amino acids. In the inverse task, it designs new proteins that meet complex nonlinear mechanical signature targets.

With this generative model they can directly design proteins to meet complex nonlinear mechanical property-design objectives, leveraging deep knowledge of protein sequences from a pretrained protein language model and mapping mechanical unfolding responses to new proteins.

Via full-atom molecular simulations for direct validation against physical and chemical principles, they demonstrate that the designed proteins are de novo and fulfill the targeted mechanical properties, including unfolding energy, mechanical strength, and detailed unfolding force-separation curves.

Thermostability

📋
ProLaTherm: Protein Language Model-based Thermophilicity Predictor

Developments The authors reveal in their paper a model that performs well at predicting thermophilicity, together with an augmented dataset that enables this predictive performance.

image

Data: Collected from multiple sources to create new sets: "9422 UniProt identifiers and 9363 corresponding amino acid sequences from 16 thermophilic and 16 mesophilic organisms", which were then filtered.

Models: They considered several comparison model classes: first, feature-based models that rely on manually engineered features, such as physicochemical properties; second, hybrid sequence-based models that use amino acid features to learn sequence embeddings; third, purely sequence-based approaches that, similarly to ProLaTherm, train sequence embeddings, but in contrast do so from scratch. The final model used a simplified transformer solution: 1024-dimensional sequence embeddings fed into a self-attention network, with the output embedding averaged, passed through a ReLU activation, then batch normalization and a logistic prediction of whether the protein is a thermophile.
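A hedged sketch of a head like the one described (dimensions and details are assumptions, not ProLaTherm's exact code):

```python
import torch
import torch.nn as nn

class ThermophilicityHead(nn.Module):
    """Self-attention over per-residue embeddings -> mean pool -> ReLU -> BatchNorm -> logit."""

    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.BatchNorm1d(dim)
        self.out = nn.Linear(dim, 1)

    def forward(self, x):                 # x: (batch, seq_len, dim) sequence embeddings
        h, _ = self.attn(x, x, x)         # self-attention over residues
        h = torch.relu(h.mean(dim=1))     # average over the sequence, then ReLU
        return torch.sigmoid(self.out(self.norm(h)))  # probability of being thermophilic

model = ThermophilicityHead()
probs = model(torch.randn(4, 200, 1024))  # 4 dummy proteins of length 200
```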

Training: From scratch.

Results: The PLM-based model achieved high performance (97% accuracy), outperforming the other models, though this accuracy drops when train/test set homology is reduced.

Candidate Identification

Particularly for evolutionary methods, it is essential to know where to start optimizing from. GenAI can be used to identify candidates based on databases of prior candidates.

Searching is essential to find similar sequences that may aid in the training or fine-tuning of models. This can be done with sequence-based alignment, as well as structure-based alignment. Here are a few references of highly-relevant tools for search/alignment.

Fast and accurate protein structure search with Foldseek

Foldseek "aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet". Paper

Candidate alignment

It is not necessarily enough to identify a potential candidate; there should also be a degree of alignment between the candidate and the starting or suggested candidates. This allows for a degree of interpretability by people.

Protein Binding

📋
Contrastive learning in protein language space predicts interactions between drugs and protein targets

The authors show in their paper the use of contrastive learning to co-locate proteins and potential drug molecules in a 'shared feature space', learning to map true drugs against non-binding 'decoy' molecules. image

📋
Robust deep learning based protein sequence design using ProteinMPNN

In their [paper](https://www.biorxiv.org/content/10.1101/2022.06.03.494563v1) the authors present a deep learning method for protein sequence design with improved sequence recovery. image

Tools

Evaluation Methods

BERTology meets biology: Interpreting attention in protein language models

Developments The authors show in their paper " that attention: (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We find this behavior to be consistent across three Transformer architectures (BERT, ALBERT, XLNet) and two distinct protein datasets. We also present a three-dimensional visualization of the interaction between attention and protein structure"

They see the following:

  • Attention aligns strongly with contact maps in the deepest layers.
  • Attention targets binding sites throughout most layers of the models.
  • Attention targets Post-translational modifications in a small number of heads.
  • Attention targets higher-level properties in deeper layers.
  • Attention heads specialize in particular amino acids.
  • Attention is consistent with substitution relationships

ColabDesign

Quality reviews and references

Deep Learning in Protein Structural Modeling and Design provides a thorough summary of deep learning approaches to optimizing proteins.

They emphasize that a Sequence → Structure → Function approach should be the focus. image

Companies

Here are several companies that focus on protein design. If you have one you'd like to suggest, please file an issue.