Mixture of Experts

Mixture of Experts (MoE) combines several smaller models, each of which performs better in certain domains. The approach is notable because it has been reported that GPT-4 is powered by 8 different expert models.

Scaling Expert Language Models with Unsupervised Domain Discovery

Developments: "Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference. This approach generalizes embarrassingly parallel training by automatically discovering the domains for each expert, and eliminates nearly all the communication overhead of existing sparse language models."
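The inference step described in the quote, combining cluster experts in a sparse ensemble, can be illustrated with a small sketch. It assumes the cluster centroids already exist (for example from k-means over document embeddings) and that each expert exposes a next-token distribution. The function names, the distance-based softmax weighting, and the `top_k` and `temperature` parameters are illustrative assumptions, not the paper's code.

```python
import math


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def ensemble_weights(context_vec, centroids, top_k=2, temperature=1.0):
    """Sparse expert weights: keep the top_k nearest centroids and apply a
    softmax over negative squared distance (assumed weighting scheme)."""
    dists = [squared_distance(context_vec, c) for c in centroids]
    nearest = sorted(range(len(centroids)), key=lambda i: dists[i])[:top_k]
    d_min = min(dists[i] for i in nearest)  # shift for numerical stability
    scores = {i: math.exp(-(dists[i] - d_min) / temperature) for i in nearest}
    total = sum(scores.values())
    return {i: s / total for i, s in scores.items()}


def mix_next_token(expert_probs, weights):
    """Weighted mixture of the selected experts' next-token distributions."""
    mixed = {}
    for i, w in weights.items():
        for token, p in expert_probs[i].items():
            mixed[token] = mixed.get(token, 0.0) + w * p
    return mixed


# Toy data: three hypothetical cluster centroids and one expert per cluster.
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 0.0)]
expert_probs = [
    {"the": 0.6, "gene": 0.4},
    {"the": 0.5, "court": 0.5},
    {"the": 0.7, "goal": 0.3},
]

weights = ensemble_weights((1.0, 0.5), centroids, top_k=2)
mixed = mix_next_token(expert_probs, weights)
```

Only the experts selected by `ensemble_weights` are evaluated, which is what keeps the ensemble sparse: experts whose clusters lie far from the context contribute nothing and need not be run.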


SwitchHead: Accelerating Transformers with Mixture-of-Experts Attention



Pushing Mixture of Experts to the Limit: Extremely Parameter Efficient MoE for Instruction Tuning

"The codebase is built on T5X, which defines the model and training loop; Flaxformer, which defines the model computation; Flax, which defines the low level model layers; and Jax, which provides the execution."

Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM

The authors demonstrate that randomly selecting, at generation time, which of several differently trained models produces each response can yield significant performance improvements for smaller models. Here is the algorithm:

Algorithm 1 Blended Algorithm

1. k ← 1
2. while true do
3.     uₖ ← user’s current input turn
4.     Sample model parameter θₙ ~ Pθ
5.     Generate response rₖ according to:
6.         rₖ ~ P(r | u₁:ₖ, r₁:ₖ₋₁; θₙ)
7.     k ← k + 1
8. end while
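A minimal Python sketch of Algorithm 1 follows. It assumes uniform sampling over the component models and uses stub callables in place of real chat models. `blended_chat`, `make_stub_model`, and the `(role, text)` history format are hypothetical names introduced for illustration, and the loop runs over a finite list of turns instead of `while true`.

```python
import random


def blended_chat(models, user_turns, rng=None):
    """Run a conversation in which each response comes from a randomly
    chosen model that conditions on the full shared history."""
    rng = rng or random.Random()
    history = []  # list of (role, text) pairs shared by all models
    for user_input in user_turns:
        history.append(("user", user_input))     # u_k: current user turn
        model = rng.choice(models)                # theta_n ~ P_Theta (uniform, assumed)
        response = model(history)                 # r_k ~ P(r | u_1:k, r_1:k-1; theta_n)
        history.append(("assistant", response))
    return history


def make_stub_model(name):
    """Hypothetical stand-in for a chat model: a callable taking the history."""
    def respond(history):
        n_user_turns = sum(1 for role, _ in history if role == "user")
        return f"[{name}] reply to turn {n_user_turns}"
    return respond


models = [make_stub_model(n) for n in ("model_a", "model_b", "model_c")]
history = blended_chat(
    models,
    ["hi", "tell me a story", "thanks"],
    rng=random.Random(0),  # seeded only to make the run repeatable
)
```

Because every model sees the whole history, including replies written by the other models, the conversation stays coherent even though the responder changes from turn to turn.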