
Tokenizing

In generative AI, raw data, whether text or binary input, is divided into individual units called tokens. Each token is then mapped to an ID in a lookup table (the vocabulary), and these IDs are used in downstream learning to produce context-aware embedding representations.

Understanding Tokenization

Tokenization is the process of splitting data into these individual units. The choice of token largely depends on the data type and the expected outcome of the AI system. For text data, tokens often correspond to single words or subwords. These tokens can be represented as one-hot encodings or as integer IDs.
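
As a minimal illustration of the token-to-ID lookup described above (a toy sketch, not a production tokenizer; the whitespace splitting and the `<unk>` entry are assumptions for the example):

```python
# Toy word-level tokenizer: split text, build a vocabulary, map tokens to IDs.
text = "the cat sat on the mat"

tokens = text.split()                          # naive whitespace tokenization
vocab = {"<unk>": 0}                           # reserve an ID for unknown tokens
for tok in tokens:
    vocab.setdefault(tok, len(vocab))          # assign the next free ID

ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
print(vocab)   # {'<unk>': 0, 'the': 1, 'cat': 2, 'sat': 3, 'on': 4, 'mat': 5}
print(ids)     # [1, 2, 3, 4, 1, 5]
```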

Tokenization can have a pre-processing phase, called pre-tokenization, that uses regular expressions to define patterns for splitting text into initial segments. GPT-2 and GPT-4 use regex-based pre-tokenization; other common pre-tokenizers split on whitespace or punctuation (often referred to as "punct").
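
For example, the pattern below matches the pre-tokenization regex published in the original gpt-2 release; applying it with the `regex` package (needed for `\p{...}` Unicode classes) splits text into word, number, punctuation, and whitespace chunks before BPE runs. A small sketch, assuming the `regex` package is installed:

```python
import regex  # pip install regex; supports \p{L}/\p{N} Unicode classes

# GPT-2 style pre-tokenization pattern (from the gpt-2 repository's encoder.py)
GPT2_PAT = r"""'s|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

chunks = regex.findall(GPT2_PAT, "Hello world, it's 2024!")
print(chunks)  # ['Hello', ' world', ',', ' it', "'s", ' 2024', '!']
```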

There are many types of tokenizers, including Byte-Pair Encoding (BPE), WordPiece, and SentencePiece.

Pre-tokenization methods


Minimal BPE tokenizer (minbpe) by Andrej Karpathy provides an understandable and efficient demonstration of several modern tokenization methods, including basic BPE, regex-based BPE, and the GPT-4 tokenizer.
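
A short usage sketch, assuming the `BasicTokenizer` API shown in the minbpe README (train on raw text, then encode/decode):

```python
from minbpe import BasicTokenizer  # installed from the karpathy/minbpe repo

tokenizer = BasicTokenizer()
text = "aaabdaaabac"
tokenizer.train(text, vocab_size=256 + 3)  # 256 byte tokens plus 3 learned merges

ids = tokenizer.encode(text)
print(ids)                      # compressed sequence of token IDs
print(tokenizer.decode(ids))    # round-trips back to "aaabdaaabac"
```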

Character Tokenizers

Character tokenizers represent individual characters as tokens, so the vocabulary stays very small. They are not often used on their own, because sequences become much longer and individual characters carry little meaning, which shifts the burden of learning structure onto the model.
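
A character tokenizer is only a few lines of code; this toy sketch (the vocabulary construction is an assumption for the example) shows the small-vocabulary / long-sequence trade-off:

```python
# Toy character tokenizer: every distinct character becomes one token ID.
text = "tokenization"

vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
ids = [vocab[ch] for ch in text]

print(len(vocab))  # 8 distinct characters -> tiny vocabulary
print(ids)         # but one ID per character -> sequences as long as the text
```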

Word Tokenizers

Word tokenizers break text up in a rule-based fashion, splitting whole text into word-level units. Because of the large number of words and their variations, a very large vocabulary must be maintained, which causes memory and computational-complexity issues. spaCy and Moses are two common word tokenizers.
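
A brief sketch of rule-based word tokenization with spaCy (assuming the spaCy package is installed; a blank English pipeline is enough, since only the tokenizer is needed):

```python
import spacy  # pip install spacy

nlp = spacy.blank("en")  # blank pipeline: rule-based tokenizer only, no models
doc = nlp("Let's tokenize this sentence, shall we?")

print([token.text for token in doc])
# ['Let', "'s", 'tokenize', 'this', 'sentence', ',', 'shall', 'we', '?']
```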

Subword Tokenizers

A subword unit, or a part of a word, can be a token in itself.

Byte-Pair Encoding

The paper Neural Machine Translation of Rare Words with Subword Units introduced Byte-Pair Encoding for building subword vocabularies, allowing highly common character patterns to be merged into single tokens and thereby reducing vocabulary-size requirements.
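
The core of the algorithm fits in a few lines: repeatedly count adjacent symbol pairs and merge the most frequent pair into a new symbol. A minimal sketch in the spirit of the paper's reference implementation (the toy corpus and merge count are assumptions for the example):

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in corpus.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in corpus.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: words split into characters, with an end-of-word marker.
corpus = {tuple("low") + ("</w>",): 5,
          tuple("lower") + ("</w>",): 2,
          tuple("newest") + ("</w>",): 6,
          tuple("widest") + ("</w>",): 3}

for step in range(5):
    best = get_pair_counts(corpus).most_common(1)[0][0]
    corpus = merge_pair(corpus, best)
    print(step, best)   # e.g. ('e', 's'), ('es', 't'), ('est', '</w>'), ...
```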

Tiktoken is a fast BPE tokenizer for use with OpenAI models.
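
A quick usage sketch, assuming the tiktoken package is installed (the `cl100k_base` encoding is the one used by several recent OpenAI models):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")
ids = enc.encode("Tokenization turns text into integer IDs.")

print(ids)              # list of token IDs
print(enc.decode(ids))  # decoding recovers the original string
print(len(ids))         # token count, useful for estimating context usage
```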

Token Monster is an ungreedy subword tokenizer and vocabulary generator, enabling language models to run faster, cheaper, and smarter, and to generate longer streams of text.


SentencePiece implements subword units (e.g., byte-pair encoding (BPE)) and the unigram language model.
Unigram Language Model (Kudo) introduces subword regularization, which trains the model with multiple subword segmentations probabilistically sampled during training.

Effectively, the same text can be segmented into subwords in several different ways; subword regularization exposes the model to these alternative tokenizations during training, making it more robust to segmentation ambiguity.
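
A sketch of sampling alternative segmentations with the sentencepiece Python package (assuming a unigram model has already been trained and saved as `m.model`; the file name and sampling parameters are assumptions for the example):

```python
import sentencepiece as spm  # pip install sentencepiece

# Load a previously trained unigram model, e.g. one produced by
# spm.SentencePieceTrainer.train(input="corpus.txt", model_prefix="m",
#                                vocab_size=8000, model_type="unigram").
sp = spm.SentencePieceProcessor(model_file="m.model")

text = "subword regularization samples different segmentations"
for _ in range(3):
    # enable_sampling draws one segmentation from the unigram model's
    # distribution instead of always returning the single best split.
    print(sp.encode(text, out_type=str, enable_sampling=True,
                    alpha=0.1, nbest_size=-1))
```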

Fully open source tokenizer: Nomic

Nomic provides a disruptive tokenizer that is fully open source, with code and weights!

Special tokens

Special tokens are reserved tokens that tell the model, or higher-level decoding logic, how to handle the sequence, for example where it starts and ends.

| Token Name | Description |
| --- | --- |
| START_TOKEN or BOS_TOKEN | Indicates the beginning of a sequence. BOS stands for "Beginning Of Sequence". |
| STOP_TOKEN or EOS_TOKEN | Indicates the end of a sequence. EOS stands for "End Of Sequence". |
| MASK_TOKEN | Represents a masked value, which the model needs to predict. |
| MODALITY_TOKEN | Indicates the type of data in the sequence (such as text, images, etc.). |
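
A toy sketch of how such tokens are typically used: reserve IDs for them at the front of the vocabulary and wrap every sequence before feeding it to the model (the specific names and IDs here are assumptions for the example):

```python
# Reserve IDs for special tokens ahead of the regular vocabulary.
special = {"<bos>": 0, "<eos>": 1, "<mask>": 2}
vocab = {**special, "the": 3, "cat": 4, "sat": 5}

def wrap(token_ids):
    """Mark sequence boundaries so the model knows where a sequence starts/stops."""
    return [special["<bos>"]] + token_ids + [special["<eos>"]]

ids = [vocab[t] for t in "the cat sat".split()]
print(wrap(ids))  # [0, 3, 4, 5, 1]
```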

Other modalities

Speech tokenization

SpeechTokenizer is a unified speech tokenizer for speech language models, which adopts an encoder-decoder architecture with residual vector quantization (RVQ).
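
To give a feel for the RVQ idea (a toy sketch, not SpeechTokenizer's actual code; codebook sizes and dimensions below are made up for illustration), each quantizer stage encodes the residual left over by the previous stage:

```python
import numpy as np

def rvq_encode(frame, codebooks):
    """Residual vector quantization: stage k quantizes what stages 0..k-1 missed."""
    residual, codes = frame.copy(), []
    for cb in codebooks:                              # cb shape: (num_codes, dim)
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        codes.append(idx)                             # one discrete token per stage
        residual -= cb[idx]
    return codes

def rvq_decode(codes, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, codes))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]  # 4 quantizer stages
frame = rng.normal(size=8)                                 # one speech-frame embedding

codes = rvq_encode(frame, codebooks)
print(codes)                                                 # one code per stage
print(np.linalg.norm(frame - rvq_decode(codes, codebooks)))  # reconstruction error
```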


Multimodal Tokenization

Multimodal tokenization focuses on incorporating multiple data forms or modalities, and this facet of tokenization has seen remarkable strides. Bytes Are All You Need, a study in which a transformer consumes file bytes directly, demonstrates that a single byte-level model can handle multiple modalities with competitive accuracy. The researchers developed ByteFormer, a model based on the study's findings.

Tokenizing might not be necessary

Tokenization is often regarded as somewhat arbitrary and has its disadvantages. There are promising results from methods that skip it entirely: MEGABYTE: Predicting Million-byte Sequences with Multiscale Transformers reports experiments that "show that MEGABYTE allows byte-level models to perform competitively with subword models on long context language modeling".
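
Byte-level models sidestep the tokenizer by treating the raw UTF-8 bytes as the sequence, so the "vocabulary" is fixed at 256 symbols; a one-line sketch:

```python
text = "Tokenizer-free models read bytes directly: héllo 🌍"

byte_ids = list(text.encode("utf-8"))   # every value is in range(256)
print(len(text), len(byte_ids))         # byte sequences are longer than char counts
print(byte_ids[:12])
print(bytes(byte_ids).decode("utf-8"))  # lossless round trip back to the text
```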

Hierarchical Tokenization

Superbloom: Bloom filter meets Transformer

Wherein a Bloom filter is used to create tokens/embeddings.
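
As a rough, hedged illustration of the Bloom-filter-flavored idea (not the paper's exact method): hash each word with several hash functions into a small table, and represent the word by the combination of the hashed slots, so a huge vocabulary fits in a fixed-size embedding table.

```python
import hashlib

NUM_SLOTS = 4096   # size of the shared embedding table (assumption for the sketch)
NUM_HASHES = 3     # how many hash functions / slots per word

def bloom_slots(word):
    """Map a word to NUM_HASHES slot indices, Bloom-filter style."""
    slots = []
    for seed in range(NUM_HASHES):
        digest = hashlib.sha256(f"{seed}:{word}".encode()).hexdigest()
        slots.append(int(digest, 16) % NUM_SLOTS)
    return slots

# Any word, even one never seen before, maps to a few shared slots whose
# embeddings could be summed or averaged to form the word's representation.
print(bloom_slots("tokenization"))
print(bloom_slots("supercalifragilistic"))
```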

Interesting research

Getting the most out of your tokenizer for pre-training and domain adaptation

The authors highlight that sub-optimal tokenizers hurt the performance and efficiency of models, and present specialized Byte-Pair Encoding code tokenizers with a new pre-tokenizer that improve performance.

References