

Transformers are a powerful type of architecture that allows each element of an input sequence to be considered in the context of the whole input. They are built on the [self-attention] mechanism, which performs an \(O(N^2)\) computation on the input sequence and, in stacked layers, provides the ability to represent relations between inputs at different levels of abstraction.

Transformers can be used in three general ways: encoder-only, decoder-only and encoder-decoder.

In encoder-only networks, like BERT, the entire input text is used at once, and the model is primarily useful for classification-style outputs (sequence-to-value).

As in the original Transformer paper, encoder-decoder networks are used to convert sequences to sequences, for example in language translation. In these systems, an encoder first projects the input into a representation. The decoder then generates outputs from it, and each new output is fed back in to generate subsequent outputs.

In decoder-only networks, like GPT, the model performs next-token prediction, so it only requires information from words/tokens that have already been seen. The outputs are estimates of the probability of the next word/token. While each prediction yields a single token, prediction can be repeated iteratively, and with proper prompting the generated output sequences can perform a variety of sequence-to-sequence tasks, such as language translation.
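The iterative use of next-token prediction can be sketched as a simple greedy loop. This is a minimal illustration, not any particular model's decoding code: `next_token_logits` and `toy_model` are hypothetical stand-ins for a trained decoder, and greedy selection is only one of several possible sampling strategies.

```python
import numpy as np

def generate(next_token_logits, prompt_ids, max_new_tokens, eos_id=None):
    """Greedy autoregressive decoding.

    next_token_logits is a hypothetical callable that maps the list of token
    ids seen so far to a 1-D array of scores over the vocabulary.
    """
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = np.asarray(next_token_logits(ids))
        next_id = int(np.argmax(logits))  # greedy: take the highest-scoring token
        ids.append(next_id)               # the new token becomes part of the context
        if eos_id is not None and next_id == eos_id:
            break
    return ids

# Toy stand-in for a model: always prefers (last token + 1) modulo the vocabulary size.
VOCAB_SIZE = 5

def toy_model(ids):
    logits = np.zeros(VOCAB_SIZE)
    logits[(ids[-1] + 1) % VOCAB_SIZE] = 1.0
    return logits

generated = generate(toy_model, prompt_ids=[0], max_new_tokens=4)
stopped_early = generate(toy_model, prompt_ids=[0], max_new_tokens=4, eos_id=2)
```

Each pass through the loop only looks at tokens that already exist, which is why a decoder-only model needs no access to future positions.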

  • Attention: The token being predicted is mapped to a query vector, and the tokens in the context are mapped to key and value vectors. Inner products between queries and keys determine how the values are combined to extract information.
  • Bi-directional / unmasked self-attention
  • Unidirectional / masked self-attention
  • Cross attention: applies attention to the primary sequence and treats the second token sequence as the context.
  • Multi-head attention: multiple attention heads run in parallel.
  • Layer normalization: a computationally efficient version sets m = beta = 0, known as root mean square layer normalization or RMSNorm.
  • Unembedding: learns to convert vectors into vocabulary elements.
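The difference between the bi-directional and unidirectional variants listed above comes down to a mask applied to the attention scores before the softmax. The sketch below assumes a single square score matrix and uses uniform scores purely for illustration; the helper names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_weights(scores, causal):
    """Row i holds how strongly token i attends to each token j."""
    n = scores.shape[-1]
    if causal:
        # Mask positions j > i so a token cannot look at later tokens.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    return softmax(scores, axis=-1)

scores = np.zeros((4, 4))  # uniform scores, for illustration only
bidirectional = attention_weights(scores, causal=False)
unidirectional = attention_weights(scores, causal=True)
```

With uniform scores, every row of `bidirectional` spreads its weight over all four tokens, while row `i` of `unidirectional` spreads its weight only over tokens `0..i`.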

Visualizing The Structures

Visualizing Large Transformers

A very interesting visual representation of transformers.


  1. Positional Encoding
  2. Attention: Query, Key, Value vectors
  3. Layer Normalization

Initially, the word or subword is split out and represented as a lookup key used to find an 'embedding'. The embedding can be trained alongside the transformer model, or pre-trained from other models. It provides a vector representation of the input word.

To allow each token embedding to attend to, or share information with, the other inputs, a self-attention matrix is calculated. For a series of input token embeddings, attention uses three learned matrices:
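As a rough sketch of the lookup step, the embedding can be thought of as a table indexed by token id. The vocabulary, the `<unk>` fallback, and the random table below are all illustrative assumptions; in practice the table is learned and the tokenizer is far larger.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
vocab = {"the": 0, "cat": 1, "sat": 2, "<unk>": 3}
d_model = 8
embedding_table = rng.normal(size=(len(vocab), d_model))  # learned in practice

def embed(tokens):
    """Map each token to its id, then look up the matching row of the table."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in tokens]
    return embedding_table[ids]  # shape: (len(tokens), d_model)

X = embed(["the", "cat", "sat", "purred"])  # "purred" falls back to <unk>
```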

  1. A Query matrix \(W^Q\)
  2. A Key matrix \(W^K\)
  3. A Value matrix \(W^V\)

For each token/word \(i\), the embedding is multiplied by these matrices to yield a query vector, a key vector, and a value vector: \(Q_i\), \(K_i\) and \(V_i\).

Each query vector is then multiplied by each key vector, resulting in the matrix computation \(Q K^T\). Because the query-key product is supposed to describe how important an input combination is, it is scaled by the square root of the key dimension \(d_k\) to allow for similar behavior across different dimensions, and then passed through a softmax function:

\[ \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) \]

This is then multiplied by the value matrix to provide the attention output of head \(i\):

\[ Z_i = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V \]
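The single-head computation described above, projecting to queries, keys and values and then applying the scaled softmax, might look like the following in NumPy. The sizes and random weights are arbitrary assumptions chosen only to make the sketch runnable.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """Scaled dot-product attention for one head.

    X has shape (n_tokens, d_model); each W_* has shape (d_model, d_k).
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n_tokens, n_tokens)
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_k = 5, 16, 4  # illustrative sizes
X = rng.normal(size=(n_tokens, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z_head, weights = attention_head(X, W_q, W_k, W_v)
```

Each row of `weights` sums to one, so every output row is a weighted average of the value vectors.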

Multiple attention heads can be combined by concatenating their outputs and then multiplying by a final matrix \(W^O\) to produce the final output:

\[ Z = \mathrm{cat}(Z_i) \, W^O \]
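A minimal sketch of combining heads is shown below. It loops over the heads for clarity, whereas practical implementations usually batch them into one tensor operation. The number of heads, the sizes, and the random weights are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one per attention head."""
    outputs = []
    for W_q, W_k, W_v in heads:
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
        outputs.append(weights @ V)                   # Z_i for this head
    return np.concatenate(outputs, axis=-1) @ W_o     # cat(Z_i) W^O

rng = np.random.default_rng(0)
n_tokens, d_model, n_heads = 5, 16, 4  # illustrative sizes
d_k = d_model // n_heads
X = rng.normal(size=(n_tokens, d_model))
heads = [tuple(rng.normal(size=(d_model, d_k)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_k, d_model))
Z = multi_head_attention(X, heads, W_o)
```

Choosing `d_k = d_model // n_heads` keeps the concatenated width equal to `d_model`, so the output has the same shape as the input.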

Finally, this matrix is added to the input values to form a residual connection, and the result is layer-normalized.
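The residual connection followed by normalization can be sketched as below. This assumes the normalization is applied after the addition and uses a random array as a stand-in for the attention output; `gamma` and `beta` are the usual learned scale and shift, fixed here to one and zero.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each row to zero mean and unit variance, then scale and shift."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

def add_and_norm(x, sublayer_output, gamma, beta):
    """Residual connection followed by layer normalization."""
    return layer_norm(x + sublayer_output, gamma, beta)

rng = np.random.default_rng(0)
n_tokens, d_model = 5, 16
X = rng.normal(size=(n_tokens, d_model))  # layer input
Z = rng.normal(size=(n_tokens, d_model))  # stand-in for the attention output
gamma, beta = np.ones(d_model), np.zeros(d_model)
out = add_and_norm(X, Z, gamma, beta)
```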

This matrix can be passed to additional layers, or a final fully-connected projection layer.

Positional Encoding

Standard embeddings are position invariant, meaning the position of the token/word in the input is not represented. Because token/word positions carry meaning, position embeddings are also used. Generally additive, position embeddings are based on sinusoids of varying frequency or on trainable parameters.
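For the sinusoidal variant, a common formulation (the one from the original Transformer paper) alternates sine and cosine across embedding dimensions, with wavelengths that grow geometrically. The sketch below assumes an even `d_model` and uses zero token embeddings so that the additive position signal is easy to see.

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos of the same angle."""
    assert d_model % 2 == 0, "this sketch assumes an even embedding size"
    pos = np.arange(n_positions)[:, None]   # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]    # (1, d_model / 2)
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)            # even dimensions
    pe[:, 1::2] = np.cos(angles)            # odd dimensions
    return pe

token_embeddings = np.zeros((6, 8))         # placeholder embeddings
X = token_embeddings + sinusoidal_positions(6, 8)  # position is added, not concatenated
```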

TODO: Which is used more, why aren't trainable embeddings always used, and why aren't non-linear embeddings considered as opposed to just additive ones?

Layer Normalization

Layer normalization observably improves results; see On Layer Normalization in the Transformer Architecture.
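The RMSNorm variant of layer normalization drops the mean subtraction and the bias term, so each vector is only rescaled by its root mean square. A minimal sketch, assuming a learned per-dimension scale `gamma` that is fixed to one here:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Scale each row by its root mean square; no mean subtraction and no bias."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, size=(4, 16))    # deliberately not zero-mean
y = rms_norm(x, gamma=np.ones(16))
row_rms = np.sqrt(np.mean(y ** 2, axis=-1))
```

After normalization each row has a root mean square of approximately one, but unlike standard layer normalization the row mean is not forced to zero.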


The Illustrated Transformer
The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture provides a thorough exposition of transformer technology.

Useful References and Research

General Introductions

Seminal research


Enhancements and variations

Context length Improvements

In their vanilla state, Transformers have \(O(N^2)\) computational complexity in self-attention. This makes long context lengths increasingly costly for both training and generation. Improvements in context length have generally found ways to work around these limits. While there is ample research in this domain, we present a few of the most successful methods. They improve computational complexity in one of several ways:

  • Introducing sparsity that is:
    • Banded or fixed
    • Hierarchical
    • Banded to reduce full computation
    • Wedge shaped, with a banded window that also takes into account the observably important first tokens
  • Including recurrent, RNN-style components that permit memory to be retained
  • Memory retrieval systems
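The banded and wedge-shaped sparsity patterns listed above can be illustrated as boolean masks over the attention matrix, where `True` marks a query/key pair that is actually computed. The function names, the window size, and the number of retained first tokens (`n_sink`) are hypothetical choices for this sketch.

```python
import numpy as np

def banded_causal_mask(n, window):
    """Token i attends only to the `window` most recent tokens, itself included."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (i - j < window)

def wedge_mask(n, window, n_sink):
    """Banded window plus the first `n_sink` tokens, which stay visible to every later token."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return banded_causal_mask(n, window) | ((j < n_sink) & (j <= i))

band = banded_causal_mask(8, window=3)
wedge = wedge_mask(8, window=3, n_sink=2)
```

The number of `True` entries per row is bounded by `window + n_sink`, so the cost grows linearly with sequence length rather than quadratically.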

HyperAttention: Long-context Attention in Near-Linear Time

Developments The authors reveal a new method of attention that allows for very long context lengths, which they call 'HyperAttention'. The algorithm (1) finds the larger entries in the attention matrix using sorted locality-sensitive hashing, and then (2) performs column subsampling and rearranges the matrices to provide a block-diagonal approximation.

Results While not without a tradeoff for perplexity, the speedup for long context lengths can be considerable.



Generating Long Sequences with Sparse Transformers provides simple solutions to generate longer sequences.


Hierarchical Attention


Scaling Transformer to 1M tokens and beyond with RMT Uses a Recurrent Memory Transformer (RMT) architecture to extend understanding to large lengths.

MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers

MEGABYTE segments sequences into patches and uses a local submodel within patches and a global model between patches. A very nice demonstration that allows for \(O(N^{4/3})\) scaling directly on bytes, thereby bypassing the tokenization requirements found with traditional transformers.
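One way to see where the \(N^{4/3}\) exponent comes from is a rough attention-cost count. This is a sketch that ignores constants and non-attention costs; the patch size \(P\) is notation introduced here. With \(N\) bytes split into \(N/P\) patches, the global model attends across patches and the local model attends within each patch:

\[ \text{cost}(P) \approx \underbrace{\left(\frac{N}{P}\right)^2}_{\text{global}} + \underbrace{\frac{N}{P} \cdot P^2}_{\text{local}} = \frac{N^2}{P^2} + N P \]

The two terms balance when \(P \propto N^{1/3}\), which gives

\[ \frac{N^2}{N^{2/3}} + N \cdot N^{1/3} = 2 N^{4/3} = O(N^{4/3}) \]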


An open source version made by lucidrains: Megabyte GitHub implementation for PyTorch

Infinite Former Uses a representation of the input sequence as a continuous signal expressed as a combination of N radial basis functions.

Paper Infinity Former

LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models provides an O(n) time/space extension that allows LLMs to go to 32k tokens with a 2.7x speedup.

Efficient Streaming Language Models with Attention Sinks

Paper

Advanced Transformer Blocks

DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging

Developments The authors reveal in their paper a variation of the transformer that yields improved results by introducing 'Depth Weighted Averaging', which, after each block \(i\), computes a weighted average of (i) the output from the current block \(B_i\), (ii) the outputs of all previous blocks \(B_{j<i}\), and (iii) the embedded input \(X_0\).
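One plausible reading of the averaging step described above is sketched here. It is an interpretation under stated assumptions, not the authors' implementation: `denseformer_forward`, the `alphas` layout, and the toy blocks are hypothetical, and real blocks would be full transformer layers with learned averaging weights.

```python
import numpy as np

def denseformer_forward(x0, blocks, alphas):
    """Apply blocks in sequence, averaging over depth after each one.

    blocks: list of callables standing in for transformer blocks.
    alphas[i]: i + 2 weights for [X_0, output of block 1, ..., output of block i + 1].
    """
    history = [x0]                       # starts with the embedded input X_0
    current = x0
    for block, alpha in zip(blocks, alphas):
        history.append(block(current))   # output of the current block
        # Weighted average over the embedded input and all block outputs so far.
        current = sum(a * h for a, h in zip(alpha, history))
    return current

x0 = np.ones(3)
blocks = [lambda x: x + 1.0, lambda x: 2.0 * x]   # toy stand-ins for real blocks

# Putting all weight on the newest output recovers an ordinary stack of blocks.
identity_alphas = [[0.0, 1.0], [0.0, 0.0, 1.0]]
plain = denseformer_forward(x0, blocks, identity_alphas)

# Mixing in the embedded input after the first block changes what the second block sees.
mixed = denseformer_forward(x0, blocks, [[0.5, 0.5], [0.0, 0.0, 1.0]])
```

Because the identity weights reproduce a standard transformer exactly, they make a natural starting point from which the averaging weights can be learned.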

Computation Reduction

Simplified Transformers removes the 'value' parameter set to increase speed by 14% with potentially minimal accuracy reduction.

Herein the authors reveal a variation of transformers that removes the 'value' parameter to yield notable speed gains at the same performance level. Paper

Other modalities



Transformers Meet Directed Graphs introduces a variation of Transformer GNNs that uses 'direction-aware' positional encodings to help handle both undirected and directed graphs



Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model


In this work, we present VistaLLM, the first general-purpose vision model that addresses coarse- and fine-grained vision-language reasoning and grounding tasks over single and multiple input images. We unify these tasks by converting them into an instruction-following sequence-to-sequence format. We efficiently transform binary masks into a sequence of points by proposing a gradient-aware adaptive contour sampling scheme, which significantly improves over the naive uniform sampling technique previously used for sequence-to-sequence segmentation tasks.



Hugging Face Transformers An API to access a large number of pre-trained transformers. PyTorch based.
Fast Transformers A quality collection of a number of transformer implementations written in PyTorch.

Theory and Experiment


We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time.

Abstract Uses

Looped Transformers and Programmable Computers Shows that transformer networks can simulate complex algorithms when hardcoded with specific weights and placed in a loop.

'Machine Learning' 'Machine code'. "We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, non-linear functions, function calls, program counters, and conditional branches. Using these building blocks, we emulate a small instruction-set computer."