🚧 Under construction 🦺

Generating or modifying protein sequences to improve behavior, or to create novel behavior, is a powerful application for AI. Guided by evolutionary techniques, Bayesian optimization, and/or protein language models (PLMs), these approaches can vastly accelerate the development of biotechnological tools and the identification of targets and avenues for therapeutics. Because of their ability to represent the 'language of proteins', PLMs are increasingly important in predicting the structure and function of proteins.


Protein optimization can be broken down into several components [1]:

  • Target property is the intended goal(s) for protein development
  • Fitness Predictor that uses sequence information to estimate the value of the optimization target, as a surrogate for laboratory measurement
  • Sequence Proposer that creates sequences to evaluate and explore
  • Prioritizer that uses sequence and predictor information to estimate the top candidates.
  • Laboratory measurements that reveal the quality of the generated proteins based on the targets

Optimization systems may combine these components into full solutions in two general ways.

  1. A model that separates the generation and evaluation steps, where the predictor model evaluates the quality of an input set of sequences (generated or otherwise defined).
  2. A model that directly predicts the best designs using adaptive sampling: proposing solutions, evaluating them with the predictor model, and then iterating.
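
The second, adaptive style can be sketched as a small loop. This is only an illustration under loud assumptions: `predict_fitness` is a hypothetical stand-in predictor (it simply counts tryptophan and tyrosine residues), the proposer makes random point mutations, and the prioritizer keeps the top-k candidates. None of these names come from a specific library.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def predict_fitness(seq):
    """Hypothetical stand-in predictor: counts W and Y residues."""
    return seq.count("W") + seq.count("Y")


def propose(parents, n_children, rng):
    """Sequence proposer: one random point mutation per child."""
    children = []
    for _ in range(n_children):
        parent = rng.choice(parents)
        pos = rng.randrange(len(parent))
        children.append(parent[:pos] + rng.choice(AMINO_ACIDS) + parent[pos + 1:])
    return children


def prioritize(candidates, k):
    """Prioritizer: keep the k unique candidates with the highest predicted fitness."""
    unique = sorted(set(candidates))  # deterministic order before ranking
    return sorted(unique, key=predict_fitness, reverse=True)[:k]


def adaptive_design(start, rounds=5, n_children=50, k=5, seed=0):
    """Propose, evaluate with the predictor, keep the best, and iterate."""
    rng = random.Random(seed)
    pool = [start]
    for _ in range(rounds):
        # Parents stay in the candidate set, so the best score never decreases.
        pool = prioritize(pool + propose(pool, n_children, rng), k)
    return pool


pool = adaptive_design("ACDEFGHIKL")
print(pool[0], predict_fitness(pool[0]))
```

In a real system the predictor would be a trained model and each round could also send the prioritized candidates to the laboratory, with the new measurements used to retrain the predictor.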

These components can be cleanly seen in the box below:

Adaptive machine learning for protein engineering

An overview of ML for protein engineering:



Protein optimization will necessarily involve creating those proteins and evaluating their target characteristics. There are large volumes of databases of various forms that may be useful in creating foundation models. It will still be essential to use continued observation to improve the optimization target based on predicted and iterated feedback.

The volume of observations will help determine the architectures that one could use. Base models tend to be PLMs because of the large set of available data. Unsupervised fine-tuning of those large models may be possible using homology or family sets. Final targets may then be optimized with simple networks, often involving regression to minimize overfitting, or with methods that include Bayesian or evolutionary approaches.

To deliver on the final target optimization, the greater the quantity of direct or surrogate data that can be obtained, the greater the chance that the resulting models will be sufficiently predictive of the fitness of future protein sequence candidates. That is why massive screening approaches are used, such as Ginkgo's platform for screening thousands of candidates.

An example process by Ginkgo

Ginkgo reports, with foundry-scale protein estimates, that with thousands of samples they were able to create an enzyme with a 10x improvement from where they started. In their design they use structure (differential) estimates via Rosetta, evolutionary-scale modeling (PLMs), active-site-focused evolutionary models, and an in-house method called 'OWL'.

When it is possible to iteratively measure proposed sequences, new data can be used to improve subsequent sequence predictions. This can be done greedily, choosing the best solutions, or using probabilistic methods, such as Bayesian optimization, which searches for a protein that optimizes a target by combining estimated values with their uncertainties. Selecting the sequences with the highest predicted target values is the greedy choice, and it may easily fail because of incorrect estimates from the predictor model. Alternatively, upper confidence bound (UCB) acquisition selects sequences based on a sum of the predicted target value and the predicted target uncertainty.
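
A minimal sketch of the difference between greedy and UCB selection, assuming the predictor returns a mean and a standard deviation per candidate. The function names, the example numbers, and the weight `beta` on the uncertainty term are illustrative assumptions, not part of any specific library.

```python
import numpy as np


def greedy_select(mean, k):
    """Pick the k candidates with the highest predicted value."""
    return np.argsort(-mean)[:k]


def ucb_select(mean, std, k, beta=1.0):
    """Pick the k candidates with the highest mean + beta * std."""
    return np.argsort(-(mean + beta * std))[:k]


# Hypothetical predictions for four candidate sequences.
mean = np.array([0.90, 0.85, 0.60, 0.40])
std = np.array([0.01, 0.02, 0.40, 0.05])

print(greedy_select(mean, 2))    # indices 0 and 1: the highest means
print(ucb_select(mean, std, 2))  # indices 2 and 0: the uncertain candidate is promoted
```

A larger `beta` favors exploration of uncertain candidates; `beta = 0` recovers the greedy choice.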

Ways of prioritizing


Optimization Targets

There are a number of targets that protein optimization can focus on. For example, some targets primarily enable basic understanding, such as protein structure, while other targets relate to function, though it is generally considered that structure enables function.

In the canonical causal chain, sequence creates structure, which enables function: sequence → structure → function.

There are several optimization targets of direct interest:

  • Structure

    • Contact prediction
    • Secondary and tertiary structure
    • (mis)Folding (missense)
  • Function

    • Enzymatic Catalysis: The ability of an enzyme to accelerate chemical processes
    • Thermocompatibility or thermostability, how well a protein remains stable or functions at varying temperatures
    • Fluorescence for visualization purposes
  • Protein Binding to...

    • Proteins
    • Nucleic Acids
    • Drug molecules
    • Metals
  • Candidate Identification
  • Alignment
  • Remote homology: Similar function or structure
  • Multimodal targets: Versions that optimize multiple targets at once

Fitness prediction

Training a fitness model may first involve training an unsupervised foundation model on a high volume of data. These models can then be fine-tuned, or otherwise adapted, to incorporate protein sequences of higher relevance to the protein targets of interest.

Sequence Proposer

With a fitness predictor made available, the next step is to create proposal sequences that may be evaluated with the predictor model, or potentially with direct measurement.

One way of doing this is to use activation maximization, a method that will generate input to a model that will ideally maximize the output for a given model (assuming maximization is the desired target direction.)
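
A minimal sketch of activation maximization on a relaxed sequence. To keep the gradient writable by hand, it assumes a hypothetical linear predictor whose score is the sum of per-position weights `W[i, a]` for the chosen residues; a real predictor would be a trained network differentiated with an autodiff framework. Each position is represented by logits, and gradient ascent raises the expected predictor output.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def maximize_activation(W, steps=200, lr=1.0, seed=0):
    """Gradient ascent on per-position logits for a linear predictor W (length x 20)."""
    rng = np.random.default_rng(seed)
    logits = rng.normal(scale=0.01, size=W.shape)
    for _ in range(steps):
        p = softmax(logits)
        expected = (p * W).sum(axis=-1, keepdims=True)
        # d(expected score)/d(logits) for a softmax-relaxed sequence.
        logits += lr * p * (W - expected)
    p = softmax(logits)
    sequence = "".join(AMINO_ACIDS[i] for i in p.argmax(axis=-1))
    return sequence, float((p * W).sum())


# Hypothetical predictor weights for a sequence of length 8.
W = np.random.default_rng(1).normal(size=(8, 20))
designed, score = maximize_activation(W)
print(designed, round(score, 3))
```

The final sequence is read out with an argmax per position; methods such as SeqProp instead sample discrete sequences from the logits during optimization.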

SeqProp: Stochastic Sequence Propagation - A Keras Model for optimizing DNA, RNA and protein sequences based on a predictor.

The authors present in their paper a method to optimize biological sequences, such as proteins, based on a predictor model. They use trainable logits that can be sampled from, and they apply instance normalization to those logits before sampling.

Example Architectures

While there are many architectures and methods for creating and optimizing proteins, we focus here primarily on those that employ PLMs in some way. These create foundation models that can be fine-tuned and readily adapted to specific domains of interest.

The general method of creating protein foundation models uses Masked Language Modeling (MLM) or 'BERT-based' predictions, though next-token prediction, as is done with GPT architectures, may also be used. We share a number of prominent models and their uses or derivatives.

Evaluation Metrics

Foundation models

ESM models

Evolutionary-scale prediction of atomic-level protein structure with a language model (ESM)

An end-to-end language model enabling structure-sequence pairing, coupled with an equivariant transformer structure model at the end.

Genome-wide prediction of disease variant effects with a deep protein language model

The authors show in their paper a workflow using ESM1b, a 650-million-parameter protein language model, to predict all ~450 million possible missense variant effects in the human genome, and make all predictions available on a web portal.

Language models enable zero-shot prediction of the effects of mutations on protein function


Using established and newly trained protein language models, the authors demonstrate the ability to provide zero-shot predictions of the effect of a protein mutation on a protein's fluorescence.


They use a PLM to score mutations using the log-odds ratio between the mutated and wild-type amino acids.
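
A minimal sketch of log-odds scoring for a single substitution. It assumes `logits` is a (sequence length x 20) array of per-position amino acid logits produced by a PLM with the position of interest masked; here the array is filled with hypothetical values rather than real model output, and the `A2W`-style mutation string and function names are illustrative.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def log_softmax(logits):
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def score_mutation(logits, wild_type, mutation):
    """Score a mutation such as 'C2W' (wild type, 1-based position, mutant)."""
    wt, pos, mt = mutation[0], int(mutation[1:-1]) - 1, mutation[-1]
    if wild_type[pos] != wt:
        raise ValueError(f"{mutation} does not match wild type {wild_type}")
    log_p = log_softmax(logits[pos])
    # Log-odds: log p(mutant) - log p(wild type) at the mutated position.
    return float(log_p[AA_INDEX[mt]] - log_p[AA_INDEX[wt]])


# Hypothetical logits for the wild-type sequence "ACD".
logits = np.zeros((3, 20))
logits[1, AA_INDEX["W"]] = 2.0  # the model prefers W at position 2
print(score_mutation(logits, "ACD", "C2W"))
```

A positive score means the model finds the mutant residue more likely than the wild type at that position; scores for multiple mutations are typically summed.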



They create ESM-1v, an unsupervised masked transformer model, by training on 98 million protein sequences from UniRef90 2020-03.

They evaluate the model on a set of 41 deep mutational scans.


MSA Transformer

The authors demonstrate in their paper the training of an unsupervised PLM that operates on sets of aligned sequences. The self-supervised task is to reconstruct a corrupted MSA.



Architecture

The architecture 'interleaves attention across the rows and columns of the alignment as in axial attention' and ties the attention map across the rows with 'tied row attention'. They use a single feed-forward layer for each block. For position embeddings, they use a 1D learned position embedding added independently to each row of the MSA, so that aligned positions can be distinguished differently for each sequence.
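
A minimal single-head sketch of tied row attention, assuming an MSA embedded as an array of shape (rows, columns, dim). The attention logits are summed over all rows so that every sequence shares one column-to-column attention map. The scaling by sqrt(rows x dim) and the weight matrices `Wq`, `Wk`, `Wv` are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def tied_row_attention(x, Wq, Wk, Wv):
    """x: (M rows, L columns, d). Returns the output (M, L, d) and the shared (L, L) map."""
    M, L, d = x.shape
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Sum query-key products over rows: one attention map for the whole alignment.
    logits = np.einsum("mid,mjd->ij", q, k) / np.sqrt(M * d)
    attn = softmax(logits)
    # Apply the same map to the values of every row.
    out = np.einsum("ij,mjd->mid", attn, v)
    return out, attn


rng = np.random.default_rng(0)
M, L, d = 3, 5, 4  # hypothetical sizes: 3 sequences, 5 aligned columns
x = rng.normal(size=(M, L, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out, attn = tied_row_attention(x, Wq, Wk, Wv)
print(out.shape, attn.shape)
```

Tying the map reflects the assumption that aligned sequences share structure, and it reduces the memory needed for row attention from one map per row to a single map.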

The objective is a masked language modeling loss over the MSA: the model is trained to recover the masked entries of the alignment from the corrupted input.

The probabilities are the output of the MSA Transformer, softmax-normalized over the amino acid vocabulary and independently normalized per position in the sequence. Masking the columns uniformly resulted in the best performance.
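
For reference, a masked MSA objective of this kind can be written in the standard masked-language-modeling form below. The notation is an assumption made for illustration: $x$ is the MSA, $\tilde{x}$ its corrupted (masked) version, $\mathcal{M}$ the set of masked (row, column) positions, and $\theta$ the model parameters.

```latex
\mathcal{L}_{\mathrm{MLM}}(x;\theta) = -\sum_{(m,i)\in\mathcal{M}} \log p\left(x_{m,i} \mid \tilde{x};\theta\right)
```

Each term is the negative log-probability that the model assigns to the true amino acid at a masked position, given the rest of the alignment.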

The models are 12 layers, with a 768 embedding size, and 12 attention heads resulting in 100M parameters.


They use 26 million MSA sequences generated from UniRef50 by searching UniClust30 with HHblits.



They show that a logistic regression with 144 parameters, fit on 20 training structures, could predict the contact maps of almost 15k other structures in a nearly unsupervised manner. They also show that supervised training for contact prediction can further improve the predicted contact maps. They find that the attention heads focus on highly variable columns, correlating with the per-column entropy of the MSA.

Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences

The authors used masked language prediction with transformer models to train a foundation model capable of multiple downstream tasks.

"To this end we use unsupervised learning to train a deep contextual language model on 86 billion amino acids across 250 million protein sequences spanning evolutionary diversity. The resulting model contains information about biological properties in its representations. The representations are learned from sequence data alone. The learned representation space has a multi-scale organization reflecting structure from the level of biochemical properties of amino acids to remote homology of proteins. Information about secondary and tertiary structure is encoded in the representations and can be identified by linear projections."



🧬 Protein function prediction as approximate semantic entailment


Current LLM models excel at predicting the structure and other attributes of biological sequences like proteins. However, their transferability is limited, capping their true potential. The DeepGO-SE model innovates 🚀 by integrating protein language models with specific knowledge on protein function, bridging the gap between knowledge-graphs' explicit representations and next-token prediction's implicit representations, and thereby significantly improving model performance.

How it works

  • 🔄 First, DeepGO-SE reuses the ESM2 large language model to convert a protein sequence into a vector space embedding, prepping it for machine learning application.
  • 🧠 Next, an ensemble of fitted prediction models is trained to align ESM2 embeddings with an embedding space (ELEmbeddings) derived from GO axioms, creating a world model filled with geometric shapes and relations akin to a Σ algebra, which can verify the truth of a statement.
  • ✅ Finally, for statements such as "protein has function C", when the ensemble reaches a consensus on truth, the semantic truth estimation is then accepted as valid.

DeepGO-SE Model Overview

The authors demonstrate 📈 that this method improves molecular function prediction by a substantial margin. Moreover, they reveal that training with protein-protein interactions substantially benefits the understanding of complex biological processes. They suggest that predicting biological processes may only require knowledge of molecular functions, potentially paving the way for a more generalized approach that could be advantageous in other domains.

ProtST: Multi-Modality Learning of Protein Sequences and Biomedical Texts

The authors show in their paper that fusing a natural language model with a protein language model can reasonably improve protein location prediction, fitness landscape prediction, and protein function annotation.


Data: They build ProtDescribe, a dataset that matches protein sequences with text descriptions.

Models: Their models involve three losses:

  1. An InfoNCE loss to maximize similarity between matched sequence-text pairs and minimize similarity between negative pairs.
  2. A masked protein modeling cross-entropy loss to maintain unimodal information in the sequences.
  3. A fusion multimodal mask prediction loss that uses self- and cross-attention on masked input sequence and text pairs to mutually recover the masked sequence and text.

They start with pre-trained protein models (Bert, ESM-1b, and ESM-2) and pre-trained language models (PubMedBERT-abs and PubMedBERT-full).
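
A minimal sketch of a symmetric InfoNCE loss, assuming the positive pairs are matched protein-sequence and text embeddings in the same batch and every other pairing in the batch is a negative. The function name, the temperature value, and the random embeddings are illustrative assumptions, not the authors' code.

```python
import numpy as np


def info_nce(seq_emb, text_emb, temperature=0.1):
    """Symmetric InfoNCE: row i of seq_emb is paired with row i of text_emb."""
    s = seq_emb / np.linalg.norm(seq_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = s @ t.T / temperature  # (N, N) cosine similarities

    def cross_entropy_on_diagonal(l):
        z = l - l.max(axis=1, keepdims=True)
        log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_p))

    # Average the sequence-to-text and text-to-sequence directions.
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(logits.T))


rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 16))  # hypothetical embeddings for 4 proteins
print(info_nce(emb, emb))        # well-aligned pairs: low loss
print(info_nce(emb, emb[::-1]))  # mismatched pairs: higher loss
```

Minimizing this loss pulls each sequence embedding toward its own description and away from the descriptions of the other proteins in the batch.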


The text data set looks like this: image

Other models

Single-sequence protein structure prediction using supervised transformer protein language models

The authors show in their paper the ability to generate high-quality predictions outperforming AlphaFold2 with a model called trRosettaX-Single, which uses ESM to generate representations and attention maps that can be trained for distance and energy maps.


Tasks Assessing Protein Embeddings (TAPE)

Architectures by Target

Enzymatic Catalysis
De novo design of luciferases using deep learning


ForceGen: End-to-end de novo protein generation based on nonlinear mechanical unfolding responses using a language diffusion model

Developments: The authors present ForceGen, an end-to-end algorithm for de novo protein generation based on nonlinear mechanical unfolding responses. Rooted in the physics of protein mechanics, this generative strategy provides a powerful way to design new proteins rapidly, including exquisite and rapid predictions about their dynamical behavior.

Proteins, like any other mechanical object, respond to forces in peculiar ways. Think of the different response you'd get from pulling on a steel cable versus pulling on a rubber band, or the difference between honey and glass. Now, we can design proteins with a set of desirable mechanical characteristics, with applications from health to sustainable plastics.



The key to solving this problem was to integrate a protein language model with denoising diffusion methods, and to use accurate atomistic-level physical simulation data to endow the model with a first-principles understanding. ForceGen can solve both forward and inverse tasks. In the forward task, we can predict how stable a protein is, how it will unfold, and what forces are involved, all given just the sequence of amino acids. In the inverse task, we can design new proteins that meet complex nonlinear mechanical signature targets.

With the new generative model, they can directly design proteins to meet complex nonlinear mechanical property-design objectives by leveraging deep knowledge of protein sequences from a pretrained protein language model and mapping mechanical unfolding responses to proteins.

Via full-atom molecular simulations for direct validation from physical and chemical principles, the authors demonstrate that the designed proteins are de novo and fulfill the targeted mechanical properties, including unfolding energy, mechanical strength, and detailed unfolding force-separation curves.


ProLaTherm: Protein Language Model-based Thermophilicity Predictor

Developments: The authors present in their paper a model that predicts thermal stability well, along with an augmented dataset that enables its predictive performance.


Data: Collected from multiple sources to create new sets. "9422 UniProt identifiers and 9363 corresponding amino acid sequences from 16 thermophilic and 16 mesophilic organisms" Filtered

Models: They considered several classes of comparison models. First, feature-based models that rely on manually engineered features, such as physicochemical properties. Second, hybrid sequence-based models that use amino acid features to learn sequence embeddings. Third, approaches that are purely sequence-based, similarly to ProLaTherm, but that in contrast train sequence embeddings from scratch. The final model used a simplified transformer solution in which 1024-dimensional sequence embeddings were fed into a self-attention network, producing an output embedding that was averaged, passed through a ReLU activation, and then through a batch norm and a logistic prediction of whether the protein was a thermophile.

Training: From scratch.

Results: The PLM-based model reached 97% accuracy, outperforming the other models, though this accuracy is reduced when train/test set homology is reduced.

Candidate Identification

Particularly for evolutionary methods, it is essential to know where to start optimizing from. GenAI can be used to identify candidates based on databases of prior candidates.

Searching is essential to find similar sequences that may aid in the training or fine-tuning of models. This can be done with sequence-based alignment, as well as structure-based alignment. Here are a few references of highly-relevant tools for search/alignment.

Fast and accurate protein structure search with Foldseek

Foldseek "aligns the structure of a query protein against a database by describing tertiary amino acid interactions within proteins as sequences over a structural alphabet". Paper

Candidate alignment

It is not necessarily enough just to identify a potential candidate; it also helps to have a degree of alignment of the candidate with starting or suggested candidates. This allows for a degree of interpretability by people.


Evaluation Methods


Developments: The authors show in their paper "that attention: (1) captures the folding structure of proteins, connecting amino acids that are far apart in the underlying sequence, but spatially close in the three-dimensional structure, (2) targets binding sites, a key functional component of proteins, and (3) focuses on progressively more complex biophysical properties with increasing layer depth. We find this behavior to be consistent across three Transformer architectures (BERT, ALBERT, XLNet) and two distinct protein datasets. We also present a three-dimensional visualization of the interaction between attention and protein structure".

They see the following:

  • Attention aligns strongly with contact maps in the deepest layers.
  • Attention targets binding sites throughout most layers of the models.
  • Attention targets Post-translational modifications in a small number of heads.
  • Attention targets higher-level properties in deeper layers.
  • Attention heads specialize in particular amino acids.
  • Attention is consistent with substitution relationships

Colab Design

Quality reviews and references

Deep Learning in Protein Structural Modeling and Design provides a thorough summary of DL methods for optimizing proteins.

They emphasize that a Sequence → Structure → Function approach should be the focus.


Here are several companies that focus on protein design. If you have one you'd like to suggest, please file an issue.