Transformers
Transformers are a powerful type of architecture that allows input sequences to be considered within their whole input context. They are built on the [self attention] mechanism, which performs an \(O(N^2)\) computation on the input sequence and, in successive stacks, provides the ability to represent relations between inputs at different levels of abstraction.
Transformers can be used in three general ways: encoder-only, decoder-only, and encoder-decoder.
Encoder-only networks, like BERT, consume the entire input text at once and are useful primarily for output classification tasks (sequence-to-value).
As in the original Transformer attention paper, encoder-decoder networks are used to convert sequences to sequences, as in language translation. In these systems, an encoder first projects information based on the input, the decoder generates new outputs, and those new outputs are fed back in a recurrent fashion to generate subsequent outputs.
Decoder-only networks, like GPT, perform next-token prediction, so they only require information from tokens that have been previously seen. The outputs are estimates of the probability of the next token. While each prediction is for a single next token, the process can be repeated iteratively, and with the proper prompting, the generation of output sequences can perform a variety of sequence-to-sequence tasks, such as language translation.
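The iterative next-token loop described above can be sketched with a toy stand-in for the model. `greedy_decode` and `toy_logits` are hypothetical names for illustration; a real decoder-only transformer would replace `toy_logits` with a call into the network.

```python
import numpy as np

def greedy_decode(logits_fn, prompt, max_new_tokens, eos_id=None):
    """Iteratively extend `prompt` by picking the most likely next token.

    `logits_fn` is any callable mapping a token sequence to a vector of
    next-token logits (a stand-in for a decoder-only transformer).
    """
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        logits = logits_fn(tokens)
        next_id = int(np.argmax(logits))  # greedy choice; sampling also common
        tokens.append(next_id)
        if eos_id is not None and next_id == eos_id:
            break
    return tokens

# Toy "model": always predicts (last token + 1) mod vocab_size.
vocab_size = 10
def toy_logits(tokens):
    logits = np.zeros(vocab_size)
    logits[(tokens[-1] + 1) % vocab_size] = 1.0
    return logits

print(greedy_decode(toy_logits, [3], max_new_tokens=4))  # [3, 4, 5, 6, 7]
```

In practice each iteration re-runs the model on the growing sequence (with key/value caching to avoid recomputation), which is exactly why the \(O(N^2)\) attention cost matters for long generations.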
- Attention: the token being predicted is mapped to a query vector and the tokens in context are mapped to key and value vectors. Inner products combine these to extract information.
- Bidirectional / unmasked self-attention
- Unidirectional / masked self-attention
- Cross attention: applies attention to the primary sequence and treats the second token sequence as the context.
- Multi-head attention: multiple attention heads run in parallel.
- Layer normalization: a computationally efficient variant sets \(m = \beta = 0\), known as root mean square layer normalization, or RMSNorm.
- Unembedding: learns to convert a vector into the vocabulary elements.
Visualizing The Structures¶
Visualizing Large Transformers
A very interesting visual representation of transformers.
Components¶
- Positional Encoding
- Attention: Query, Key, Value vectors
- Layer Normalization
Initially the word, or subword, is broken out and represented as a lookup key to find an 'embedding'. This can be trained alongside the transformer model, or pretrained from other models. It provides a vector representation of the input word.
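A minimal sketch of that lookup, assuming a tiny illustrative vocabulary (the names `vocab`, `embed`, and `embedding_table` are made up for this example; in a trained model the table holds learned weights):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "cat": 1, "sat": 2}   # toy vocabulary for illustration
d_model = 8
# Randomly initialized here; trainable (or pretrained) in practice.
embedding_table = rng.normal(size=(len(vocab), d_model))

def embed(tokens):
    """Map each token to its embedding row."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]          # shape (len(tokens), d_model)

X = embed(["the", "cat", "sat"])
print(X.shape)  # (3, 8)
```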
To allow a token embedding to attend to, or share information with, the other inputs, a self-attention matrix is calculated. For a series of input token embeddings, the attention mechanism uses three trained matrices:
- A Query matrix \(W^Q\)
- A Key matrix \(W^K\)
- A Value matrix \(W^V\)
For each token/word \(i\), the embedding is multiplied by each of these matrices to yield a query vector, a key vector, and a value vector: \(Q_i\), \(K_i\), and \(V_i\).
Each query vector is then multiplied by each key vector, resulting in the matrix computation \(Q K^T\). Because the query-key product is supposed to describe how important an input combination is, it is normalized by the square root of the key dimension to allow for similar behavior across different dimensions, and then passed through a softmax function.
This is then multiplied by the value matrix to provide the attention output.
\(Z_{\text{head } i} = \mathrm{softmax}\left(\frac{Q K^T}{\sqrt{d_k}}\right) V\)
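The single-head computation above can be sketched in a few lines of numpy (`attention_head` is a name chosen here; the weight matrices are random stand-ins for trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V):
    """Single-head self-attention over token embeddings X of shape (N, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (N, N) scaled query-key similarities
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # (N, d_v) attention output

rng = np.random.default_rng(0)
N, d_model, d_k = 4, 8, 8
X = rng.normal(size=(N, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Z = attention_head(X, W_Q, W_K, W_V)
print(Z.shape)  # (4, 8)
```

The (N, N) score matrix is where the \(O(N^2)\) cost noted earlier comes from.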
Multiple attention heads can be combined by stacking (concatenating) them together and then multiplying by a final matrix \(W^O\) that produces the combined output:
\(Z = \mathrm{cat}(Z_i) \, W^O\)
Finally, this matrix is added to the input values to form a residual connection, and the layer is normalized.
This matrix can be passed to additional layers, or to a final fully connected projection layer.
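Putting the pieces together, a self-contained sketch of one multi-head block with the residual add and layer norm (names and the parameter-free layer norm are simplifications; real implementations add learned gain/bias):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_block(X, Wq, Wk, Wv, Wo):
    """Multi-head self-attention followed by the residual add and layer norm.
    Wq/Wk/Wv are lists of per-head projection matrices; Wo projects the
    concatenated heads back to d_model."""
    heads = []
    for q, k, v in zip(Wq, Wk, Wv):
        Q, K, V = X @ q, X @ k, X @ v
        A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        heads.append(A @ V)
    Z = np.concatenate(heads, axis=-1) @ Wo   # Z = cat(Z_i) W^O
    out = X + Z                               # residual connection
    mu = out.mean(axis=-1, keepdims=True)
    sd = out.std(axis=-1, keepdims=True)
    return (out - mu) / (sd + 1e-5)           # layer norm without gain/bias

rng = np.random.default_rng(1)
N, d, h, dk = 4, 8, 2, 4
X = rng.normal(size=(N, d))
Wq = [rng.normal(size=(d, dk)) for _ in range(h)]
Wk = [rng.normal(size=(d, dk)) for _ in range(h)]
Wv = [rng.normal(size=(d, dk)) for _ in range(h)]
Wo = rng.normal(size=(h * dk, d))
Y = multi_head_block(X, Wq, Wk, Wv, Wo)
print(Y.shape)  # (4, 8)
```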
Positional Encoding¶
Standard embeddings are position-invariant, meaning the position of the token/word in the input has little importance. Because token/word positions do carry importance, position embeddings are also used. Generally additive, position embeddings are based on varying sinusoids or on trainable parameters.
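The sinusoidal variant from the original Transformer paper can be sketched as follows (`sinusoidal_pe` is a name chosen for this example):

```python
import numpy as np

def sinusoidal_pe(n_positions, d_model):
    """Sinusoidal positional encodings:
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))"""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / 10000 ** (2 * i / d_model)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions
    pe[:, 1::2] = np.cos(angles)   # odd dimensions
    return pe

pe = sinusoidal_pe(50, 16)
# Being additive, these are simply summed with the token embeddings:
# X = embeddings + pe[:len(tokens)]
print(pe.shape)  # (50, 16)
```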

A Gentle Introduction to Positional Encoding in Transformer Models, pt1

Transformer Language Models without Positional Encodings Still Learn Positional Information Indicates that causal LMs may derive positional awareness from more than the positional embeddings: they learn it from the causal mask.
TODO: Which is used more, why aren't trainable embeddings preferred, and why are nonlinear embeddings not considered, as opposed to just additive ones?
Layer Normalization¶
Layer normalization observably improves results: On Layer Normalization in the Transformer Architecture
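A sketch of standard layer norm next to the RMSNorm variant mentioned earlier, which drops the mean subtraction and bias (function names are chosen here; real implementations carry learned gain/bias parameters):

```python
import numpy as np

def layer_norm(x, gain=1.0, bias=0.0, eps=1e-5):
    """Standard layer norm: zero-mean, unit-variance per token vector."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gain * (x - mu) / np.sqrt(var + eps) + bias

def rms_norm(x, gain=1.0, eps=1e-5):
    """RMSNorm: divides by the root mean square only, skipping the mean
    subtraction and bias -- a cheaper variant that often performs comparably."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([[1.0, 2.0, 3.0, 4.0]])
print(layer_norm(x))  # zero-mean, unit-variance row
print(rms_norm(x))    # unit root-mean-square row
```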
Reviews¶
The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture provides a thorough exposition of transformer technology.
Useful References and Research¶
General Introductions¶
Seminal research¶

Neural Machine Translation by Jointly Learning to Align and Translate First paper introducing the notion of an 'attention'-style mechanism.

Attention Is All You Need Initial paper showing that attention is very powerful and a potential replacement for recurrent language-model architectures.

Formal Algorithms for Transformers Important discussion laying out the components of Transformers.
Modifications¶
- A Simple yet Effective Learnable Positional Encoding Method for Improving Document Transformer Model Introduces a learnable sinusoidal positional encoding feed-forward network, demonstrating significant improvements across several datasets.
Enhancements and variations¶
Context length Improvements¶
In their vanilla state, Transformers are \(O(N^2)\) in their self-attention computation. This makes long context lengths increasingly costly to train and generate. Improvements in context length, for both training and generation, have found ways to generally work around these limits. While there is ample research in this domain, we present a few of the most successful methods. They improve computational complexity in one of several ways:

- Introducing sparsity that is:
  - Banded or fixed, to reduce the full computation
  - Hierarchical
  - Wedge-shaped, with a banded window that also takes into account the observably important first tokens.

- Including a recurrent, RNN-style state that permits memory to be retained.
- Memory retrieval systems.
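The banded and wedge-shaped sparsity patterns above amount to restricting which key positions each query may attend to. A sketch of such a mask (the name `wedge_mask` and the sink/window parameters are illustrative; this mirrors the pattern used by sliding-window methods with "attention sink" first tokens):

```python
import numpy as np

def wedge_mask(n, window, n_sink):
    """Causal attention mask keeping only a sliding window of recent tokens
    plus the first `n_sink` tokens (wedge shape). True = attend."""
    i = np.arange(n)[:, None]   # query positions
    j = np.arange(n)[None, :]   # key positions
    causal = j <= i             # never attend to the future
    banded = (i - j) < window   # recent tokens within the band
    sink = j < n_sink           # always-visible first tokens
    return causal & (banded | sink)

m = wedge_mask(6, window=2, n_sink=1)
print(m.astype(int))
```

Each row now has at most `window + n_sink` allowed entries, so masked attention costs O(N) per query instead of O(N).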
HyperAttention: Long-context Attention in Near-Linear Time
Developments The authors present a new attention method, which they call 'HyperAttention', that allows for very long context lengths. The algorithm (1) finds large entries in the attention matrix using sorted locality-sensitive hashing, and then (2) performs column subsampling to rearrange the matrices into a block-diagonal approximation.
Results While not without a perplexity tradeoff, the speedup for long context lengths can be considerable.
Generating Long Sequences with Sparse Transformers provides simple solutions to generate longer sequences.
Scaling Transformer to 1M tokens and beyond with RMT Uses a Recurrent Memory Transformer (RMT) architecture to extend understanding to very long contexts.
MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers
MEGABYTE segments sequences into patches and uses a local sub-model within patches and a global model between patches. A very nice demonstration that achieves \(O(N^{4/3})\) scaling directly on bytes, thereby bypassing the tokenization requirements of traditional transformers.
An open source version made by lucidrains: Megabyte Github implementation for PyTorch
Infinite Former Uses a representation of the input sequence as a continuous signal expressed as a combination of N radial basis functions.
LM-Infinite: Simple On-the-Fly Length Generalization for Large Language Models Provides an O(n) time/space extension that gives LLMs the ability to go to 32k tokens with a 2.7x speedup.
Advanced Transformer Blocks¶
DenseFormer: Enhancing Information Flow in Transformers via Depth Weighted Averaging
Developments The authors present a transformer variation that yields improved results by introducing 'Depth Weighted Averaging', which averages, at layer \(i\): (i) the output of the current block \(B_i\), (ii) the outputs of all previous blocks \(B_{j<i}\), and (iii) the embedded input \(X_0\).
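A minimal sketch of that depth-weighted averaging, assuming toy blocks and made-up weights (`denseformer_forward` and `alphas` are illustrative names; in the paper the weights are learned per depth):

```python
import numpy as np

def denseformer_forward(x0, blocks, alphas):
    """After block i, combine its output with the embedded input and all
    previous block outputs using weights alphas[i] (length i + 2)."""
    reps = [x0]                 # representations seen so far, starting with X_0
    x = x0
    for i, block in enumerate(blocks):
        reps.append(block(x))
        w = np.asarray(alphas[i])
        x = sum(wj * r for wj, r in zip(w, reps))   # weighted average over depth
    return x

# Toy example: identity-like blocks and uniform weights, for illustration only.
x0 = np.ones((2, 3))
blocks = [lambda x: 2 * x, lambda x: x + 1]
alphas = [[0.5, 0.5], [1 / 3, 1 / 3, 1 / 3]]
y = denseformer_forward(x0, blocks, alphas)
```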
Computation Reduction¶
Simplified Transformers The authors present a variation of transformers that removes the 'value' parameter set to yield speed gains of 14% with potentially minimal accuracy reduction. Paper
Other modalities¶
Vision¶
Graphs¶
Transformers Meet Directed Graphs introduces a variation of Transformer GNNs that uses 'direction-aware' positional encodings to help handle both undirected and directed graphs.
Multimodal¶
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
"In this work, we present VistaLLM, the first general-purpose vision model that addresses coarse- and fine-grained vision-language reasoning and grounding tasks over single and multiple input images. We unify these tasks by converting them into an instruction-following sequence-to-sequence format. We efficiently transform binary masks into a sequence of points by proposing a gradient-aware adaptive contour sampling scheme, which significantly improves over the naive uniform sampling technique previously used for sequence-to-sequence segmentation tasks."
- Visual GPT
- Language is not all you need
- Meta-Transformer: A Unified Framework for Multimodal Learning The first framework to perform unified learning across 12 modalities with unpaired data. It does so by learning an embedding that can be shared across the modalities. Github
Code¶
Hugging Face Transformers An API to access a large number of pretrained transformers. PyTorch based.
Fast Transformers A quality collection of a number of transformer implementations written in PyTorch.
Theory and Experiment¶
A Mathematical Perspective on Transformers
"We develop a mathematical framework for analyzing Transformers based on their interpretation as interacting particle systems, which reveals that clusters emerge in long time."
Abstract Uses¶
Looped Transformers as Programmable Computers Shows that transformer networks can simulate complex algorithms when hardcoded with specific weights and made into a loop.
'Machine Learning' as 'machine code': "We demonstrate that a constant number of encoder layers can emulate basic computing blocks, including embedding edit operations, nonlinear functions, function calls, program counters, and conditional branches. Using these building blocks, we emulate a small instruction-set computer."