
Protein Optimization Using AI

Generating or modifying protein sequences to improve behavior or create novel behavior is a powerful application for AI. Guided through evolutionary techniques, Bayesian optimization, and/or with the use of protein language models (PLMs), AI can vastly accelerate the development of biotechnological tools and identify targets and avenues for therapeutics. Because of their ability to represent the 'language of proteins,' PLMs are increasingly important in predicting the structure and function of proteins.


Protein optimization can be broken down into several components [1]:

  • Target Property: The intended goal(s) for protein development.
  • Fitness Predictor: Uses sequence information to estimate the value of the optimization target, as a surrogate for laboratory measurement.
  • Sequence Proposer: Creates sequences to evaluate and explore.
  • Prioritizer: Uses sequence and predictor information to estimate the top candidates.
  • Laboratory Measurements: Reveal the quality of the generated proteins based on the targets.
  • Orchestrator: Puts the pieces together in a functional and validated manner.

Optimization systems may involve merging and combining these components for full solutions in two general manners:

  1. A model that separates generation and evaluation steps, where the predictor model evaluates the quality of an input set of sequences (generated or otherwise defined).
  2. A model that directly predicts the best designs using adaptive sampling, proposing solutions, evaluating them with the predictor model, and then iterating.

These components can be cleanly seen in the box below:

Adaptive Machine Learning for Protein Engineering

An overview of ML for protein engineering (figure).
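As a rough illustration of how the components listed earlier fit together in the adaptive (second) style of system, the sketch below wires a proposer, a predictor, and a prioritizer into an orchestration loop. Everything here is a toy assumption: `predict_fitness` scores similarity to a made-up target sequence and stands in for both the fitness predictor and the laboratory measurement, and the function names are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def predict_fitness(seq, target="ACDEFGHIKL"):
    """Toy fitness predictor (hypothetical): fraction of positions matching a made-up target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)

def propose(parents, n_children, rng):
    """Sequence proposer: random single point mutations of the parent sequences."""
    children = []
    for _ in range(n_children):
        parent = rng.choice(parents)
        pos = rng.randrange(len(parent))
        children.append(parent[:pos] + rng.choice(AMINO_ACIDS) + parent[pos + 1:])
    return children

def prioritize(candidates, k):
    """Prioritizer: keep the k unique candidates with the highest predicted fitness."""
    return sorted(set(candidates), key=lambda s: (-predict_fitness(s), s))[:k]

def orchestrate(start, rounds=30, n_children=50, k=5, seed=0):
    """Orchestrator: iterate propose -> predict -> prioritize, keeping parents in the pool."""
    rng = random.Random(seed)
    pool = [start]
    for _ in range(rounds):
        pool = prioritize(pool + propose(pool, n_children, rng), k)
    return pool[0]

best = orchestrate("AAAAAAAAAA")
print(best, predict_fitness(best))
```

In a real system the prioritized candidates would be sent for laboratory measurement, and those measurements would be used to retrain the predictor before the next round.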


Protein optimization necessarily involves creating those proteins and evaluating their target characteristics. Many large databases, in various forms, may be useful for creating foundation models. Continued observation remains essential for improving the optimization target through iterated prediction and feedback.

The volume of observations helps determine which architectures are practical. Base models tend to be PLMs because of the large amount of available sequence data. Unsupervised fine-tuning of those large models can then be done on homologous sequences or protein-family sets. Final targets may then be optimized with simple networks, often regression models chosen to minimize overfitting, or with Bayesian or evolutionary methods.

Delivering on final target optimization depends on data: the more direct or surrogate measurements that can be obtained, the more likely the resulting models are to predict the fitness of future sequence candidates well. That is why massive screening approaches, such as Ginkgo's platform, screen thousands of candidates.

An example process by Ginkgo

Using foundry-scale protein screening, Ginkgo reports that thousands of samples allowed them to create an enzyme with a 10x improvement over their starting point. Their design process uses (differential) structure estimates via Rosetta, evolutionary-scale modeling (PLMs), evolutionary models focused on the active site, and an in-house method called 'OWL.'

When it is possible to iteratively measure proposed sequences, new data can be used to improve subsequent sequence predictions. This can be done greedily, by choosing the best predicted solutions, or with probabilistic methods such as Bayesian optimization, which searches for a protein that optimizes a target by combining estimated values with their uncertainties. Greedy selection picks the sequences with the highest predicted target values and may easily fail when the predictor model's estimates are wrong. Alternatively, upper confidence bound (UCB) acquisition selects sequences based on the sum of the predicted target value and the predicted target uncertainty.
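The difference between greedy and UCB selection can be seen in a small sketch. The predicted means and uncertainties below are made-up numbers, and the weighting factor `beta` on the uncertainty is an assumed tuning parameter rather than something specified by the source.

```python
def greedy_select(means, k):
    """Pick the indices of the k candidates with the highest predicted value."""
    return sorted(range(len(means)), key=lambda i: means[i], reverse=True)[:k]

def ucb_select(means, stds, k, beta=2.0):
    """Pick the indices of the k candidates with the highest mean + beta * std."""
    scores = [m + beta * s for m, s in zip(means, stds)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# Hypothetical predictor outputs for four candidate sequences.
means = [0.90, 0.85, 0.60, 0.40]
stds = [0.01, 0.02, 0.30, 0.05]

print(greedy_select(means, 2))     # [0, 1]
print(ucb_select(means, stds, 2))  # [2, 0]
```

Greedy selection ignores candidate 2, while UCB ranks it first because its large uncertainty leaves room for a high true value.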

Ways of prioritizing


Optimization Targets

There are a number of targets that protein optimization can focus on. For example, some targets enable primarily basic understanding, such as protein structure, and other targets are related to function, though it is generally considered that structure enables the functions.

In the causal chain, a source provides a sequence, which creates a structure, which enables function (source → sequence → structure → function). We can generally group targets by these stages, though there is some crossover between them.

  • Source
  • Sequence
    • Alignment
    • Remote homology: Similar function or structure
  • Structure
    • Contact prediction
    • Secondary and tertiary structure
    • (mis)Folding (missense)
  • Function
    • Enzymatic Catalysis: The ability of an enzyme to accelerate chemical processes
    • Thermocompatibility or thermostability: How well a protein remains stable or functions at varying temperatures
    • Fluorescence for visualization purposes
    • Protein Binding to...
      • Proteins
      • Nucleic Acids
      • Drug molecules
      • Metals

Though there are many examples where these classes cross, these potential targets are essential for protein optimization.

Fitness Prediction

Training a fitness model may first involve training an unsupervised foundation model on a high volume of data. These models can then be fine-tuned, or otherwise adapted, on protein sequences of higher relevance to the protein targets of interest.

Learning protein fitness models from evolutionary and assay-labeled data

The authors show in their paper that combining ridge regression with large language models can effectively predict fitness from evolutionary and assay-labeled data.
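One way to picture this kind of combination is a ridge regression whose features are a one-hot encoding of the sequence plus a single score from an evolutionary or language model. The sketch below is an illustration under that assumption, not the authors' implementation. The sequences, scores, and labels are made up, no intercept is fitted, and the small linear solver is included only to keep the example dependency-free.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def featurize(seq, evo_score):
    """One-hot encode each position and append an evolutionary-model score."""
    onehot = [1.0 if aa == ref else 0.0 for aa in seq for ref in AMINO_ACIDS]
    return onehot + [evo_score]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: solve (X^T X + lam * I) w = X^T y."""
    d = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) + (lam if i == j else 0.0)
          for j in range(d)] for i in range(d)]
    b = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(d)]
    return solve(A, b)

def ridge_predict(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

# Hypothetical assay data: (sequence, evolutionary score, measured fitness).
data = [("ACD", -1.0, 0.2), ("ACE", -0.5, 0.6), ("GCD", -2.0, 0.1), ("ACF", -0.8, 0.4)]
X = [featurize(seq, score) for seq, score, _ in data]
y = [label for _, _, label in data]
w = ridge_fit(X, y, lam=1.0)
print(ridge_predict(w, featurize("ACE", -0.5)))
```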

Sequence Proposer

With a fitness predictor made available, the next step is to create proposal sequences that may be evaluated with the predictor model, or potentially with direct measurement.

One way of doing this is to use generative models directly, seeding generation with starting sequences related to the target, or even with a natural language prompt. Another is activation maximization, which generates an input that ideally maximizes a model's output (assuming maximization is the desired direction).
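A minimal sketch of activation maximization follows, assuming a toy linear predictor over a position weight matrix (PWM). The sequence is relaxed into per-position logits, the softmax of those logits is fed to the predictor, and gradient ascent on the logits increases the predicted score. The predictor weights are random placeholders; with a real neural predictor the gradient would come from automatic differentiation instead of the closed form used here.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def predictor(pwm, weights):
    """Toy differentiable predictor (hypothetical): a linear score over the PWM."""
    return sum(w * p for w_row, p_row in zip(weights, pwm) for w, p in zip(w_row, p_row))

def activation_maximization(weights, steps=200, lr=1.0):
    """Gradient ascent on per-position logits to maximize the predictor output."""
    logits = [[0.0] * len(AMINO_ACIDS) for _ in weights]
    for _ in range(steps):
        for i, w_row in enumerate(weights):
            p = softmax(logits[i])
            expected = sum(pa * wa for pa, wa in zip(p, w_row))
            # For a softmax row: d(score)/d(logit_a) = p_a * (w_a - E_p[w]).
            for a in range(len(AMINO_ACIDS)):
                logits[i][a] += lr * p[a] * (w_row[a] - expected)
    return logits

rng = random.Random(0)
weights = [[rng.random() for _ in AMINO_ACIDS] for _ in range(5)]
logits = activation_maximization(weights)
# Decode the relaxed design by taking the most likely residue at each position.
designed = "".join(AMINO_ACIDS[max(range(20), key=lambda a: row[a])] for row in logits)
print(designed, predictor([softmax(row) for row in logits], weights))
```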

Generative Models

Sequence modeling and design from molecular to genome scale with Evo

The authors show in their paper that long-context genomic models can yield state-of-the-art predictions in protein-related tasks, including zero-shot function prediction and multi-element sequence generation. Their model, known as Evo, uses the 'StripedHyena' structured state space model.

ZymCTRL: a conditional language model for the controllable generation of artificial enzymes

The authors describe ZymCTRL, a conditional language model trained on the BRENDA database of enzymes, which generates enzymes of a specific enzymatic class upon a user prompt. ZymCTRL generates artificial enzymes distant from natural ones, while their intended functionality matches predictions from orthogonal methods. Model

Low-N protein engineering with data-efficient deep learning

The authors demonstrate a standard workflow in which a PLM undergoes unsupervised pre-training, is then refined on evolutionarily related sequences, and is finally fine-tuned on assay-specific sequences. They use a Markov Chain Monte Carlo (MCMC) method to propose mutations and iteratively evaluate them to improve designs.
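The MCMC step can be illustrated with a basic Metropolis sampler over sequences. This is a generic sketch, not the authors' exact procedure: the `fitness` function is a toy stand-in for a fine-tuned predictor, and the target string, temperature, and step count are arbitrary assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def fitness(seq, target="MKTAYIAK"):
    """Toy stand-in for a fine-tuned fitness predictor (hypothetical)."""
    return sum(a == b for a, b in zip(seq, target))

def metropolis_design(start, steps=2000, temperature=0.5, seed=0):
    """Metropolis sampling with single point-mutation proposals."""
    rng = random.Random(seed)
    current, current_fit = start, fitness(start)
    best, best_fit = current, current_fit
    for _ in range(steps):
        pos = rng.randrange(len(current))
        proposal = current[:pos] + rng.choice(AMINO_ACIDS) + current[pos + 1:]
        proposal_fit = fitness(proposal)
        # Always accept improvements; accept worse moves with Boltzmann probability.
        delta = proposal_fit - current_fit
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_fit = proposal, proposal_fit
            if current_fit > best_fit:
                best, best_fit = current, current_fit
    return best, best_fit

best_seq, best_fit = metropolis_design("AAAAAAAA")
print(best_seq, best_fit)
```

Lower temperatures make the chain behave more like greedy hill climbing, while higher temperatures accept more deleterious mutations and explore more broadly.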

Markov Chain Monte Carlo

Plug & play directed evolution of proteins with gradient-based discrete MCMC (EvoProtGrad for MCMC)

A Python package for directed evolution on a protein sequence with gradient-based discrete Markov chain Monte Carlo (MCMC), described in the paper, blog, and docs.

With Natural Large Language Models

Activation Maximization

SeqProp: Stochastic Sequence Propagation - A Keras Model for optimizing DNA, RNA and protein sequences based on a predictor.

The authors describe in their paper (also on arXiv) a method to optimize biological sequences based on a predictor model. They use trainable logits that can be sampled from, and apply instance normalization to them. The package is a Python API for constructing generative DNA/RNA/protein sequence PWM models in Keras. It implements a PWM generator (with support for discrete sampling and straight-through (ST) gradient estimation), a predictor model wrapper, and a loss model.

Protein sequence design by conformational landscape optimization

The authors propose a Bayesian approach to optimizing a residue sequence for a target protein structure. They use a loss of the form \(\text{Loss} = -\log P(\text{contacts} \mid \text{sequence}) + D_{KL}(f_{20} \,\|\, f_{20}^{PDB})\), where \(D_{KL}\) is the Kullback-Leibler divergence, \(f_{20}\) is the average frequency of amino acids from the sequence, and \(f_{20}^{PDB}\) is the average frequency of amino acids from proteins in the PDB. Paper
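To make the Kullback-Leibler term concrete, the sketch below computes amino acid frequencies for a sequence and their divergence from a background distribution. The uniform background is a placeholder assumption and not the real PDB frequencies, and the contact term is passed in as a number because computing it requires a structure predictor.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aa_frequencies(seq):
    """Frequency of each of the 20 amino acids in the sequence (f_20)."""
    counts = Counter(seq)
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]

def kl_divergence(p, q):
    """D_KL(p || q); terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def design_loss(neg_log_p_contacts, seq, background):
    """Loss = -log P(contacts | sequence) + D_KL(f_20 || background)."""
    return neg_log_p_contacts + kl_divergence(aa_frequencies(seq), background)

# Placeholder background (assumption): uniform, NOT the actual PDB frequencies.
uniform_background = [1 / 20] * 20

print(kl_divergence(aa_frequencies("AAAA"), uniform_background))          # log(20)
print(design_loss(3.0, "ACDEFGHIKLMNPQRSTVWY", uniform_background))
```

The divergence term penalizes low-complexity designs such as poly-alanine, whose composition is far from the background.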

Structure-based scoring and sampling of 'Combinatorial Variant Effects from Structure' (CoVES)

The authors show in their paper and Nature article, across 7 different combinatorial mutation studies, the ability to design proteins by exploring the design space without needing to measure a combinatorial number of mutations. They build a model that estimates a residue preference effect for each amino acid variant at each position, and sum these effects to predict combinatorial variants. Simple linear and logistic models using a 'mutation effect preference' matrix of size 20 (amino acids) × number of residues were able to predict the effects of variants. They could then use this to design sequences using Boltzmann sampling and generate substantially better variants. In particular, the results lend credence to the idea that these simple models of important sites can be useful in predicting protein function.
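The additive scoring and Boltzmann sampling described here can be sketched as follows. The per-site effects are invented for illustration, and the temperature is an assumed parameter. Because the score is a sum over sites, the Boltzmann distribution over whole sequences factorizes, so each position can be sampled independently.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def combinatorial_score(seq, site_effects):
    """Additive model: sum the per-site preference of each residue in the variant."""
    return sum(site_effects[i][aa] for i, aa in enumerate(seq))

def boltzmann_sample(site_effects, temperature, rng):
    """Sample a sequence with probability proportional to exp(score / temperature)."""
    seq = []
    for effects in site_effects:
        top = max(effects.values())  # subtract the max for numerical stability
        weights = [math.exp((effects[aa] - top) / temperature) for aa in AMINO_ACIDS]
        seq.append(rng.choices(AMINO_ACIDS, weights=weights, k=1)[0])
    return "".join(seq)

# Hypothetical 20 x 3 preference table: one favored residue per site.
site_effects = [{aa: 0.0 for aa in AMINO_ACIDS} for _ in range(3)]
site_effects[0]["M"] = 2.0
site_effects[1]["K"] = 2.0
site_effects[2]["V"] = 2.0

rng = random.Random(0)
cold = boltzmann_sample(site_effects, temperature=0.01, rng=rng)
warm = boltzmann_sample(site_effects, temperature=5.0, rng=rng)
print(cold, combinatorial_score(cold, site_effects))
print(warm, combinatorial_score(warm, site_effects))
```

At low temperature the sampler concentrates on the highest-scoring variant, while higher temperatures trade score for diversity.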

Data Sources

ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design

ProteinGym is an extensive set of Deep Mutational Scanning (DMS) assays and annotated human clinical variants. The results are "curated to enable thorough comparisons of various mutation effect predictors in different regimes." Website Paper

Homologous Pairs of Low and High Temperature Originating Proteins Spanning the Known Prokaryotic Universe

Example Architectures

While there are many architectures and methods for creating and optimizing proteins, we focus here primarily on ways that employ PLMs in some way. These create foundation models that can be fine-tuned and readily adapted to specific domains of interest.

The general method of creating protein foundation models uses Masked Language Modeling (MLM), or 'BERT-based' predictions, though next-token prediction, as is done with GPT architectures, may also be used. We share a number of prominent models and their uses or derivatives.

Evaluation Metrics

  • Spearman Correlation Coefficient
  • AUC
  • MCC
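For reference, the three metrics listed above can be written out in a few lines. This is a plain-Python sketch for small inputs; in practice library implementations would normally be used. The AUC here uses the rank-based (Mann-Whitney) formulation.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def auc(labels, scores):
    """ROC AUC from ranks: probability a positive outranks a negative."""
    r = ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(ri for ri, label in zip(r, labels) if label == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mcc(y_true, y_pred):
    """Matthews correlation coefficient for binary labels (0 or 1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0

print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
print(mcc([1, 1, 0, 0], [1, 0, 0, 0]))
```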

Foundation Models

ESM Models

Language models enable zero-shot prediction of the effects of mutations on protein function
Evolutionary-scale prediction of atomic-level protein structure with a language model (esm)

End-to-end language model that pairs structure and sequence, coupled with an equivariant transformer structure model at the end. Science paper

Genome-wide prediction of disease variant effects with a deep protein language model

The authors show in their paper a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and made all predictions available on a web portal.

Developments: Using established and newly trained protein language models, the authors demonstrate the ability to provide zero-shot predictions of the effect of a protein mutation on a protein's fluorescence. They use a PLM to score mutations with a log odds ratio that compares the mutant residue against the wild-type residue.

Data: They create ESM-1v, an unsupervised masked transformer model, by training on 98 million protein sequences from UniRef90 2020-03. They evaluate the model on a set of 41 deep mutational scans.
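A log-odds mutation score of this kind can be sketched as below. The probability table is a made-up stand-in for what a masked PLM would output at each masked position, positions are 0-based, and the function name is hypothetical.

```python
import math

def log_odds_score(position_probs, wild_type, mutations):
    """Sum over mutated positions of log p(mutant aa) - log p(wild-type aa).

    position_probs[pos] maps amino acids to model probabilities at that
    (masked) position; mutations is a list of (pos, mutant_aa) pairs.
    """
    score = 0.0
    for pos, mutant_aa in mutations:
        probs = position_probs[pos]
        score += math.log(probs[mutant_aa]) - math.log(probs[wild_type[pos]])
    return score

# Hypothetical model output for position 1 of the wild-type sequence "MKV".
wild_type = "MKV"
position_probs = {1: {"K": 0.6, "R": 0.3, "E": 0.1}}

print(log_odds_score(position_probs, wild_type, [(1, "R")]))  # log(0.5), mildly disfavored
print(log_odds_score(position_probs, wild_type, [(1, "E")]))  # log(1/6), more disfavored
```

A score of zero means the model finds the mutant as plausible as the wild type, and negative scores suggest the substitution is less compatible with what the model learned.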

Paper

MSA Transformer

The authors demonstrate in their paper training an unsupervised PLM that operates on sets of aligned sequences. Self-supervision trains the model to reconstruct the corrupted MSA.

Architecture: The architecture 'interleaves attention across the rows and columns of the alignment as an axial attention' and ties the attention map across the rows with 'tied row attention'. They use a single feed-forward layer for each block. For position embeddings, they use 1D learned position embeddings added independently to each row of the MSA, which distinguishes aligned positions differently for each sequence. The objective is the masked language modeling loss over the masked MSA positions, with the probabilities being the output of the MSA Transformer, softmax-normalized over the amino acid vocabulary independently at each position in the sequence. Masking the columns uniformly resulted in the best performance. The models have 12 layers, an embedding size of 768, and 12 attention heads, resulting in 100M parameters.

Data: They use 26 million MSAs generated from UniRef50 by searching UniClust30 with HHblits.

Analysis: They show that a logistic regression with 144 parameters, fit on 20 training structures, could predict the contact maps of almost 15k other structures in an almost unsupervised manner. They also show that supervised contact prediction can improve the contact maps further. They find that the attention heads focus on highly variable columns, correlating with the per-column entropy of the MSA.

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

The authors used masked language prediction with transformer models to train a foundation model capable of multiple downstream tasks. "To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections."



Preference Optimization of Protein Language Models as a Multi-objective Binder Design Paradigm

The authors create a design paradigm using instruction fine-tuning and direct preference optimization (DPO) of PLMs. Aligning ProtGPT2 in this way allows binders to be designed based on receptor and drug developability criteria. To do this, they perform two-step instruction tuning with receptor-binding 'chat-templates', and then optimize the fine-tuned models to promote preferred binders. Specifically, they "propose an alignment method to transform pre-trained unconditional protein sequence models (p(s)), that autoregressively sample sequences (s) from underlying data distribution (D), to conditional probability models (p(s|r; c)) that given a target receptor (r) sample binders that satisfy constraints (c) encoded by preference datasets compiled from experiments and domain experts." Notably, they fuse protein sequences with English-language prompts and use BPE encoding with a large vocabulary size (50k) instead of the smaller PLM vocabulary sizes (33) that are standard.
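For orientation, the standard direct preference optimization loss for a single preference pair can be written as below. This is the generic DPO formulation rather than anything specific to the paper's code. The log-probabilities are made-up numbers, and `beta` is an assumed hyperparameter value.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one pair: -log sigmoid(beta * margin).

    margin compares how much the policy prefers the chosen binder over the
    rejected one, relative to the frozen reference model.
    """
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable -log(sigmoid(z)) = softplus(-z).
    return math.log1p(math.exp(-z)) if z > 0 else -z + math.log1p(math.exp(z))

# Hypothetical sequence log-probabilities under the policy and reference models.
print(dpo_loss(-1.0, -5.0, -3.0, -3.0))   # policy already prefers the chosen binder
print(dpo_loss(-3.0, -3.0, -3.0, -3.0))   # no preference yet: log(2)
```

Minimizing this loss pushes the fine-tuned model to assign relatively more probability to preferred binders than the reference model does.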


🧬 Protein function prediction as approximate semantic entailment

Developments: Current LLMs excel at predicting the structure and other attributes of biological sequences like proteins. However, their transferability is limited, capping their true potential. The DeepGO-SE model innovates 🚀 by integrating protein language models with specific knowledge on protein function, bridging the gap between knowledge graphs' explicit representations and next-token prediction's implicit representations, and thereby significantly improving model performance.

How it works:

  • 🔄 First, DeepGO-SE reuses the ESM2 large language model to convert a protein sequence into a vector space embedding, prepping it for machine learning application.
  • 🧠 Next, an ensemble of fitted prediction models is trained to align ESM2 embeddings with an embedding space (ELEmbeddings) derived from GO axioms, creating a world model filled with geometric shapes and relations akin to a Σ algebra, which can verify the truth of a statement.
  • ✅ Finally, for statements such as "protein has function C", when the ensemble reaches a consensus on truth, the semantic truth estimation is accepted as valid.

DeepGO-SE Model Overview

The authors demonstrate 📈 that this method improves molecular function prediction by a substantial margin. Moreover, they reveal that training with protein-protein interactions substantially benefits the understanding of complex biological processes. They suggest that predicting biological processes may only require knowledge of molecular functions, potentially paving the way for a more generalized approach that could be advantageous in other domains.

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

The authors show in their paper that fusing a natural language model with a protein language model can reasonably improve protein localization prediction, fitness landscape prediction, and protein function annotation.

Data: They build a dataset, ProtDescribe, that matches protein sequences with text descriptions.

Models: Their models involve three losses:

  1. An InfoNCE loss to maximize similarity between matched sequence-text pairs and minimize similarity between negative pairs.
  2. A masked protein modeling cross-entropy loss to maintain unimodal information in the sequences.
  3. A fused multimodal mask prediction loss that uses self- and cross-attention on masked sequence and text pairs to mutually recover the masked content in both.

They start with pre-trained protein models (ProtBert, ESM-1b, and ESM-2) and pre-trained language models (PubMedBERT-abs and PubMedBERT-full).
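The InfoNCE idea can be illustrated with a small similarity matrix between protein and text embeddings. This sketch computes the loss in one direction only (each sequence against all texts); a symmetric variant would average it with the same loss on the transposed matrix. The similarity values and the temperature are assumptions for illustration.

```python
import math

def info_nce(similarity, temperature=0.1):
    """InfoNCE over a square similarity matrix.

    similarity[i][j] is the similarity of sequence i with text j; matched
    pairs sit on the diagonal and all other entries act as negatives.
    """
    n = len(similarity)
    loss = 0.0
    for i in range(n):
        logits = [s / temperature for s in similarity[i]]
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denominator - logits[i]  # -log softmax at the matched pair
    return loss / n

aligned = [[1.0, 0.0], [0.0, 1.0]]      # matched pairs are most similar
misaligned = [[0.0, 1.0], [1.0, 0.0]]   # negatives are most similar

print(info_nce(aligned))
print(info_nce(misaligned))
```

The loss is small when each sequence is most similar to its own description and large when a mismatched description scores higher.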

Other Models

Single-sequence protein structure prediction using supervised transformer protein language models

The authors show in their paper the ability to generate high-quality predictions, outperforming AlphaFold2, with a model called trRosettaX-Single. It uses ESM to generate representations and attention maps, which are then used to train predictions of distance and energy maps.

Tasks Assessing Protein Embeddings (TAPE)

Architectures by Target

Enzymatic Catalysis
De novo design of luciferases using deep learning


ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model

Developments: The authors present ForceGen, an end-to-end algorithm for de novo protein generation based on nonlinear mechanical unfolding responses. Rooted in the physics of protein mechanics, this generative strategy provides a way to design new proteins rapidly, including fast predictions about their dynamical behavior. Proteins, like any other mechanical object, respond to forces in particular ways. Think of the different response you'd get from pulling on a steel cable versus pulling on a rubber band, or the difference between honey and glass. With ForceGen, proteins can be designed with a set of desirable mechanical characteristics, with applications from health to sustainable plastics.

The key to solving this problem was to integrate a protein language model with denoising diffusion methods, and to use accurate atomistic-level physical simulation data to give the model a first-principles understanding. ForceGen can solve both forward and inverse tasks. In the forward task, it predicts how stable a protein is, how it will unfold, and what forces are involved, all given just the sequence of amino acids. In the inverse task, it designs new proteins that meet complex nonlinear mechanical signature targets.

The generative model leverages knowledge of protein sequences from a pretrained protein language model and maps mechanical unfolding responses to proteins. Using full-atom molecular simulations for direct validation from physical and chemical principles, the authors demonstrate that the designed proteins are de novo and fulfill the targeted mechanical properties, including unfolding energy, mechanical strength, and detailed unfolding force-separation curves.


ProLaTherm: Protein Language Model-based Thermophilicity Predictor

Developments: The authors present in their paper a model that predicts thermophilicity well, along with an augmented dataset that enables this predictive performance.

Data: Collected from multiple sources to create new sets, then filtered: "9422 UniProt identifiers and 9363 corresponding amino acid sequences from 16 thermophilic and 16 mesophilic organisms".

Models: They compare against several model types. First, feature-based models that rely on manually engineered features, such as physicochemical properties. Second, hybrid sequence-based models that use amino acid features to learn sequence embeddings. Third, purely sequence-based approaches that, similarly to ProLaTherm, work from sequence alone but train sequence embeddings from scratch. The final model used a simplified transformer: 1024-dimensional sequence embeddings were fed into a self-attention network, and the output embedding was averaged, passed through a ReLU activation and batch norm, and used for a logistic prediction of whether the protein was from a thermophile.

Training: From scratch.

Results: The PLM-based model reached high performance (97% accuracy) compared with other models, though this accuracy drops when homology between the train and test sets is reduced.

Candidate Identification

Particularly for evolutionary methods, it is essential to know where to start optimizing from. GenAI can be used to identify candidates based on databases of prior candidates.

Searching is essential to find similar sequences that may aid in the training or fine-tuning of models. This can be done with sequence-based alignment, as well as structure-based alignment. Here are a few references of highly-relevant tools for search/alignment.

Fast and accurate protein structure search with Foldseek

Foldseek "aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet." Paper

Candidate Alignment

Identifying a potential candidate is not necessarily enough on its own. It also helps to align the candidate with the starting or suggested candidates, which gives people a degree of interpretability.

Contrastive learning on protein embeddings enlightens midnight zone

In their paper the authors demonstrate the use of contrastive optimization (like CLIP) to create embeddings that "optimize constraints captured by hierarchical classification of protein 3D structures."

Protein Binding

Contrastive learning in protein language space predicts interactions between drugs and protein targets

The authors show in their paper the use of contrastive learning to co-locate proteins and potential drug molecules in a 'shared feature space' and to learn to distinguish true drugs from non-binding 'decoy' molecules.

Robust deep learning based protein sequence design using ProteinMPNN

In their paper the authors present a deep learning method, ProteinMPNN, for designing protein sequences, and evaluate it in terms of sequence recovery.


Evaluation Methods


Developments: The authors show in their paper "that attention: (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We find this behavior to be consistent across three Transformer architectures (BERT, ALBERT, XLNet) and two distinct protein datasets. We also present a three-dimensional visualization of the interaction between attention and protein structure."

They see the following:

  • Attention aligns strongly with contact maps in the deepest layers.
  • Attention targets binding sites throughout most layers of the models.
  • Attention targets post-translational modifications in a small number of heads.
  • Attention targets higher-level properties in deeper layers.
  • Attention heads specialize in particular amino acids.
  • Attention is consistent with substitution relationships.

Colab Design

Quality Reviews and References

Deep Learning in Protein Structural Modeling and Design

Provides a thorough summary of deep learning methods for optimizing proteins. The authors emphasize that a Sequence → Structure → Function approach should be the focus.


Here are several companies that focus on protein design. If you have one you'd like to suggest, please file an issue.