

The models of Generative AI are the computational components that are trained to generate outputs conditioned on given inputs. Traditional state machines that choose outputs based on hand-written heuristics can also generate impressive new content, but they differ from the data-informed models described here.

Architecture Genres

  • Encoder-decoder (EDT), also known as sequence-to-sequence.
  • Encoder-only (BERT).
  • Decoder-only (GPT), trained for next-token prediction.
  • Multi-domain decoder-only transformer (Gato).
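
One practical difference between these genres is which positions each token may attend to. The following is a minimal sketch, not taken from any of the cited models; the function name `attention_mask` is hypothetical. Encoder-only models typically let every token see every other token, decoder-only models restrict each token to itself and earlier tokens so that next-token prediction cannot look ahead, and encoder-decoder models use the first pattern in the encoder and the second in the decoder.

```python
import numpy as np

def attention_mask(seq_len: int, causal: bool) -> np.ndarray:
    """Return a boolean mask where mask[i, j] is True if position i may attend to j."""
    full = np.ones((seq_len, seq_len), dtype=bool)
    # Lower-triangular mask: each position sees itself and earlier positions only.
    return np.tril(full) if causal else full

encoder_mask = attention_mask(4, causal=False)  # encoder-only style: bidirectional
decoder_mask = attention_mask(4, causal=True)   # decoder-only style: causal
print(decoder_mask.astype(int))
```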

Model Classes

Model classes can often be used with multiple types of model learning. Because of their present quality, current model architectures tend to be transformer-based or diffusion-based, though any other sufficiently capable AI method can be used. While Generative Adversarial Networks (GANs) were initially the most successful, the challenges in training them can be difficult to surmount. Below we describe the model classes in greater detail.

Model Domains

While a great deal of work focuses on several primary domains of Generative AI (text, image, sound, and video), there are many other modalities of interest. Here we share prominent and interesting methods for these domains. These models often rely on tokenization. Once tokenized, the input is projected to an embedding vector that can be used by downstream LLMs as well as vector databases.
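
As a rough illustration of the tokenize-then-embed pipeline, the sketch below uses a toy whitespace tokenizer and a randomly initialized embedding table. The helper names (`build_vocab`, `tokenize`) and the mean-pooling step are assumptions made for this example; real systems use learned subword tokenizers and trained embedding weights.

```python
import numpy as np

def build_vocab(corpus):
    """Map each distinct whitespace-separated token to an integer id."""
    tokens = sorted({tok for text in corpus for tok in text.lower().split()})
    vocab = {"<unk>": 0}  # id 0 is reserved for unknown tokens
    for tok in tokens:
        vocab[tok] = len(vocab)
    return vocab

def tokenize(text, vocab):
    """Convert text to a list of token ids, falling back to <unk>."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in text.lower().split()]

corpus = ["the cat sat", "the dog ran"]
vocab = build_vocab(corpus)

rng = np.random.default_rng(0)
# Random here for illustration; in a real model this table is learned.
embedding_table = rng.normal(size=(len(vocab), 8))

token_ids = tokenize("the cat flew", vocab)    # "flew" is out of vocabulary
token_vectors = embedding_table[token_ids]     # one row per token, shape (3, 8)
sentence_vector = token_vectors.mean(axis=0)   # pooled vector, e.g. for a vector database
```

The per-token vectors are what a downstream LLM would consume, while a single pooled vector is the kind of object a vector database would index.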

Multi-Modal Models

Multi-modal Large Language Models (MLLMs) enable us to connect information from different domains, and bring us closer to artificial general intelligence.

It can be challenging to fuse different domains of data, such as text and images, for a number of reasons. Here are some essential concepts to consider when working with or building MLLMs.

There are two general methods to create MLLMs:

  1. Early Fusion: Combine the data modalities first and then train a single model on the combined data.
  2. Late Fusion: Create separate models for the different modalities and then combine the models under a fine-tuning objective.

Each of these offers different benefits and challenges.
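
The two approaches can be contrasted with a toy sketch. Everything here is an assumption made for illustration: the inputs are random stand-ins for already-embedded text tokens and image patches, `toy_encoder` is a hypothetical one-layer substitute for a trained model, and all weights are random rather than learned.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16

# Hypothetical inputs: 5 text tokens and 9 image patches, already embedded.
text_tokens = rng.normal(size=(5, d_model))
image_patches = rng.normal(size=(9, d_model))

def toy_encoder(x, weight):
    """Stand-in for a trained model: one linear layer with tanh, mean-pooled."""
    return np.tanh(x @ weight).mean(axis=0)

# Early fusion: join the modalities into one sequence so a single model sees both.
joint_weight = rng.normal(size=(d_model, d_model))
early_input = np.concatenate([text_tokens, image_patches], axis=0)  # (14, d_model)
early_output = toy_encoder(early_input, joint_weight)

# Late fusion: run a separate model per modality, then combine their outputs
# with a fusion layer (this is the part that would be fine-tuned).
text_weight = rng.normal(size=(d_model, d_model))
image_weight = rng.normal(size=(d_model, d_model))
text_feature = toy_encoder(text_tokens, text_weight)
image_feature = toy_encoder(image_patches, image_weight)
fusion_weight = rng.normal(size=(2 * d_model, d_model))
late_output = np.tanh(np.concatenate([text_feature, image_feature]) @ fusion_weight)
```

In the early-fusion path the model can relate text and image elements at every layer; in the late-fusion path the modality-specific models can be pretrained independently and only the combination step needs joint data.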

How to Bridge the Gap between Modalities: A Comprehensive Survey on Multi-modal Large Language Model

TODO: Clip paper

Meta-Transformer combines embeddings from 12 modalities by adjoining individual models and flattening them together.


Vision-Language Models

Vision-language models are among the most prominent models beyond language models. They are often based on transformers, though they have some unique requirements. There are some interesting ways of combining the different domains that may have applicability across models. Here are a few useful considerations.

SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs A really cool idea that uses pyramidal representations and compresses information into text tokens of different levels.

The image can be reconstructed from these tokens as needed. The tokens could then be used in novel image generation via semantic mapping with an LLM.

Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language Represents images as language and combines them with a frozen LLM to produce output.


Tabular Models

Model Fusion


Developments: FuseLLM provides a method for combining different LLMs, in which a new fused model is trained on the probabilistic outputs of each of the source LLMs.

Common Components


The components of model classes include a number of operations.


Softmax is an activation function that converts a vector of logits into a probability-like output. It is generally given in the form

\[ \text{softmax}(x)_i = \frac{\exp(x_i)}{\sum_j \exp(x_j)} \]
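
A minimal NumPy sketch of the formula above. Subtracting the maximum before exponentiating is a standard numerical-stability step that is not part of the formula itself; it leaves the result unchanged because the shift cancels between numerator and denominator.

```python
import numpy as np

def softmax(x, axis=-1):
    """Softmax along `axis`, computed in a numerically stable way."""
    x = np.asarray(x, dtype=float)
    # Shifting by the max avoids overflow in exp without changing the output.
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

probs = softmax([1.0, 2.0, 3.0])
print(probs, probs.sum())  # the entries are positive and sum to 1
```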

Is softmax Off by 1?

Based on some observations by Qualcomm, where "97%+ of outlier activations in LLMs occur in whitespace and punctuation positions," there was an indication that it is important to allow 'no attention' to be given to some tokens.

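Adding a \(1\) to the denominator, giving \( \exp(x_i) / (1 + \sum_j \exp(x_j)) \), allows for no attention to be had, because the outputs no longer have to sum to one. This is described here, discussed here, and already found in the flaxformer architecture.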
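
The effect of the extra \(1\) can be seen in a small sketch. The function name `softmax_plus_one` is made up for this example and is not the flaxformer API. When all scores are very negative, standard softmax would still spread a total weight of 1 across the tokens, whereas this variant lets the total weight fall towards 0.

```python
import numpy as np

def softmax_plus_one(x, axis=-1):
    """exp(x_i) / (1 + sum_j exp(x_j)), computed stably."""
    x = np.asarray(x, dtype=float)
    # Shift by max(x, 0) so neither exp(x - m) nor exp(-m) can overflow.
    m = np.maximum(x.max(axis=axis, keepdims=True), 0.0)
    exp = np.exp(x - m)
    # exp(-m) is the shifted form of the extra 1 in the denominator.
    return exp / (np.exp(-m) + exp.sum(axis=axis, keepdims=True))

uninformative = softmax_plus_one([-20.0, -20.0, -20.0])
informative = softmax_plus_one([10.0, 0.0, 0.0])
print(uninformative.sum())  # close to 0: almost no attention assigned
print(informative.sum())    # close to 1: behaves much like ordinary softmax
```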

A general conclusion is that the change is likely more important for highly quantized weights, while 32-bit and 16-bit dtypes are probably unaffected.


Embeddings play a key role in AI as they translate tokens into numerical representations that a model can process.

'What are Embeddings' is an essential read that explains the concept of embeddings in a digestible manner. For a deeper dive, check the accompanying GitHub page.

Position Embeddings

Position embedding is an essential aspect of transformer-based attention models -- without it the order of tokens in the sequence would not matter.

A common way of including positional embeddings is to add them to the text embeddings. There are other ways of including positional information.
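
The additive approach can be sketched as follows. The sinusoidal table is one well-known choice of positional embedding, used here as an assumption; learned position tables are added in the same way. The token embeddings are random placeholders, and `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position table of shape (seq_len, d_model); d_model must be even."""
    positions = np.arange(seq_len)[:, None]    # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]   # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    table = np.zeros((seq_len, d_model))
    table[:, 0::2] = np.sin(angles)  # even dimensions
    table[:, 1::2] = np.cos(angles)  # odd dimensions
    return table

rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(4, 8))
token_embeddings[2] = token_embeddings[0]  # the same token appears at positions 0 and 2

# Adding the position table makes identical tokens distinguishable by position.
model_input = token_embeddings + sinusoidal_positions(4, 8)
```

Without the addition, rows 0 and 2 of the input would be identical, so attention could not tell the two occurrences apart.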

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

The authors describe a way of including positional embeddings that enables position dependence but does not require adding them to the token embeddings.
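
A simplified, single-head sketch of the disentangled idea, written from the paper's description rather than from the released code: attention scores are the sum of content-to-content, content-to-position, and position-to-content terms, where the position terms use relative-position projections looked up by a clipped relative distance. The function names, shapes, and random inputs are assumptions made for this example.

```python
import numpy as np

def relative_bucket(seq_len, k):
    """delta[i, j]: relative distance i - j, shifted by k and clipped to [0, 2k - 1]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return np.clip(i - j + k, 0, 2 * k - 1)

def disentangled_scores(q_c, k_c, q_r, k_r, k):
    """Unnormalized attention scores from content (n, d) and relative-position (2k, d) projections."""
    n, d = q_c.shape
    delta = relative_bucket(n, k)
    c2c = q_c @ k_c.T                                        # content-to-content
    c2p = np.take_along_axis(q_c @ k_r.T, delta, axis=1)     # q_c[i] . k_r[delta(i, j)]
    p2c = np.take_along_axis(k_c @ q_r.T, delta, axis=1).T   # k_c[j] . q_r[delta(j, i)]
    return (c2c + c2p + p2c) / np.sqrt(3 * d)

rng = np.random.default_rng(0)
n, d, k = 5, 4, 2
q_c, k_c = rng.normal(size=(n, d)), rng.normal(size=(n, d))          # content projections
q_r, k_r = rng.normal(size=(2 * k, d)), rng.normal(size=(2 * k, d))  # position projections
scores = disentangled_scores(q_c, k_c, q_r, k_r, k)
```

Position information enters only through the score computation, so the token embeddings themselves are never summed with a position vector.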

General Literature

A Survey of Large Language Models