
Evaluating and comparing

Evaluating and comparing models is essential to achieving quality outcomes. Models can be evaluated in a number of ways and across many domains. How to evaluate a model may depend on its intended use cases, but evaluations of LLM architectures generally look at the performance of individual architecture calls. When multiple calls are chained together, as with agents, it is preferable to evaluate them accordingly. Because LLMs may be more frozen, and potentially less likely to change, it is likely important to evaluate at the architecture level first, before moving on to more complex and higher-level evaluations. It is also important to know that model evaluations depend on your prompting; consequently, if one wishes to find optimal models, one should also consider prompt optimization.

If you are using or developing your own models, checking out the leaderboards will help you to identify models that are appropriately performant for your needs. But what are your needs? That is why it is important to know what you should evaluate. With this in hand, you can then figure out how to evaluate your LLMs.


Here are a few boards that help to aggregate and test models that have been released.

What to evaluate?

There are several domains of expertise where it may be essential to measure a model's performance. For general-performance models, even if not multi-modal, it is useful to consider multiple criteria simultaneously, which may include the specific criteria described below.

Multi-criteria evaluation

HELM Instruct: A Multidimensional Instruction Following Evaluation Framework with Absolute Ratings

Developments: The authors create HELM Instruct, which uses multiple LLMs to evaluate multiple models for given input instructions. They evaluate along the following criteria: Helpfulness, Understandability, Completeness, Conciseness, and Harmlessness.

The evaluation rubric is as follows: [image: evaluation rubric]

Results: They find that GPT-4 generally performs the best on all metrics. Interestingly, however, they do not find high consistency among evaluators.
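To make the idea of multi-criteria ratings from several evaluators concrete, here is a minimal sketch. It is not the HELM Instruct implementation: the 1 to 5 rating scale, the evaluator names, the example scores, and the simple pairwise agreement measure are all assumptions chosen for illustration.

```python
from itertools import combinations
from statistics import mean

# Criteria named in the text above.
CRITERIA = ["helpfulness", "understandability", "completeness",
            "conciseness", "harmlessness"]


def mean_scores(ratings):
    """Average each criterion over all evaluators.

    ratings: {evaluator_name: {criterion: score}}
    """
    return {c: mean(r[c] for r in ratings.values()) for c in CRITERIA}


def pairwise_agreement(ratings, tolerance=0):
    """Fraction of (evaluator pair, criterion) comparisons whose scores
    differ by at most `tolerance`. A crude consistency measure."""
    pairs = list(combinations(ratings.values(), 2))
    matches = [abs(a[c] - b[c]) <= tolerance for a, b in pairs for c in CRITERIA]
    return sum(matches) / len(matches)


# Hypothetical ratings of one model response (assumed 1 to 5 scale).
example_ratings = {
    "evaluator_a": {"helpfulness": 4, "understandability": 5, "completeness": 4,
                    "conciseness": 3, "harmlessness": 5},
    "evaluator_b": {"helpfulness": 3, "understandability": 5, "completeness": 4,
                    "conciseness": 4, "harmlessness": 5},
    "evaluator_c": {"helpfulness": 4, "understandability": 4, "completeness": 2,
                    "conciseness": 3, "harmlessness": 5},
}

if __name__ == "__main__":
    print(mean_scores(example_ratings))
    print(pairwise_agreement(example_ratings))
    print(pairwise_agreement(example_ratings, tolerance=1))
```

A low agreement value with `tolerance=0` that rises sharply with `tolerance=1` would suggest the evaluators mostly differ by small amounts; more robust statistics (for example, a chance-corrected agreement coefficient) may be preferable in practice.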

Generalization ability

It may be important for your model to generalize beyond your training data. If so, it is important to thoroughly separate any testing data from the training data. To avoid overlap between the two, you will want to work on your data preparation. If needed, the 'contamination' of data may be removed with automated methods.
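One simple automated check for contamination is to look for word n-grams shared between training and test documents. The sketch below is only one possible approach, under stated assumptions: whitespace tokenization, lowercasing, and a default n-gram length of 8 are arbitrary choices, and the function names are hypothetical.

```python
def ngrams(text, n=8):
    """Return the set of word n-grams in `text` (lowercased, whitespace-split)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def find_contaminated(train_docs, test_docs, n=8):
    """Return indices of test documents that share at least one n-gram
    with any training document."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    return [i for i, doc in enumerate(test_docs) if ngrams(doc, n) & train_grams]


def remove_contaminated(train_docs, test_docs, n=8):
    """Return the test documents with flagged items dropped."""
    flagged = set(find_contaminated(train_docs, test_docs, n))
    return [doc for i, doc in enumerate(test_docs) if i not in flagged]
```

Shorter n-grams flag more overlap (including harmless common phrases), while longer ones only catch near-verbatim copies, so the threshold should be tuned to the data.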

Specific Criteria

Accuracy vs Hallucination

Hallucinations remain a core problem with LLMs, as they may generate linguistically and syntactically correct statements that lack epistemic or factual grounding.

Hugging Face's leaderboard on hallucinations provides a comparison of different models' hallucinations. Much of it is based on awesome-hallucination-detection.

TruthfulQA helps to measure how models mimic human falsehoods.

Information Retrieval

The ability of an LLM to 'recall' information within its context window is integral to its ability to function with contextually relevant information and to act as an effective retrieval mechanism. To evaluate this ability, the needle-in-a-haystack test can be used. It proceeds as follows:

  1. Place a random fact or statement (the 'needle') in the middle of a long context window (the 'haystack')
  2. Ask the model to retrieve this statement
  3. Iterate over various document depths (where the needle is placed) and context lengths to measure performance

In ideal systems, context retrieval will be independent of the position within the context, and of the content itself.
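The three steps above can be sketched as a small harness. This is an illustrative sketch, not the code of any particular repository: `ask_model` is a hypothetical callable standing in for a real LLM call, context length is measured in filler sentences for simplicity, and `toy_model` is a stub that only "sees" the first 100 characters so the harness can be run without a model.

```python
def build_haystack(filler_sentences, needle, depth):
    """Insert `needle` at a relative `depth` (0.0 = start, 1.0 = end)."""
    position = round(depth * len(filler_sentences))
    sentences = filler_sentences[:position] + [needle] + filler_sentences[position:]
    return " ".join(sentences)


def run_needle_test(ask_model, filler, needle, question, expected,
                    context_lengths, depths):
    """Return {(context_length, depth): bool}, True when the model's
    answer contains `expected`."""
    results = {}
    for length in context_lengths:
        for depth in depths:
            context = build_haystack(filler[:length], needle, depth)
            answer = ask_model(context, question)
            results[(length, depth)] = expected.lower() in answer.lower()
    return results


def toy_model(context, question):
    """Stub for a real LLM call: only reads the first 100 characters."""
    visible = context[:100]
    return "Paris" if "capital of France is Paris" in visible else "I don't know"


if __name__ == "__main__":
    filler = [f"Filler sentence number {i}." for i in range(50)]
    results = run_needle_test(
        toy_model, filler,
        needle="The capital of France is Paris.",
        question="What is the capital of France?",
        expected="Paris",
        context_lengths=[10, 50],
        depths=[0.0, 0.5, 1.0],
    )
    for key, found in sorted(results.items()):
        print(key, found)
```

Plotting the resulting grid of context length against depth gives the kind of retrieval map the test is meant to produce; an ideal system would be `True` everywhere.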

Testing with the LLMTest_NeedleInAHaystack repo shows where in the context space LLMs may fail at context retrieval.

This is additionally demonstrated in the authors' YouTube video.

Anthropic found, however, that LLMs can perform better context retrieval when a phrase such as the following is added to the start of the model's response:

 “Here is the most relevant sentence in the context:” 

While information retrieval is important, retrieval models should also be good at following instructions.

FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions creates FOLLOWIR, which contains a benchmark that explicitly measures the instruction-following ability of retrieval models.

Lighteval by Hugging Face provides a lightweight framework for LLM evaluation.

Question Answering

Domain expertise

Language generation

Code generation

Math, logic, and reasoning

Science and engineering

Healthcare and medicine

Law and policy

Legal Bench is an ongoing open science effort to collaboratively curate tasks for evaluating LLM legal reasoning in English.

Embodied Devices and Robotics


While it may be projective to consider AI as having 'psychology', it may be useful to relate to different human-like characteristics when evaluating GenAI models.



Role play with large language models (Murray Shanahan et al., November 2023)

Abstract: "As dialogue agents become increasingly human-like in their performance, we must develop effective ways to describe their behaviour in high-level terms without falling into the trap of anthropomorphism. Here we foreground the concept of role play. Casting dialogue-agent behaviour in terms of role play allows us to draw on familiar folk psychological terms, without ascribing human characteristics to language models that they in fact lack. Two important cases of dialogue-agent behaviour are addressed this way, namely, (apparent) deception and (apparent) self-awareness."


Sycophancy is the degree to which a model mirrors biases, large or small, that the user puts into input queries. In ideal systems, sycophancy will be minimized to prevent echo-chamber amplification of inaccuracies.

The Sycophancy-eval repo offers methods for evaluating sycophancy.
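As a rough illustration of how sycophancy might be quantified, the sketch below asks each question twice: once neutrally, and once with the user asserting an incorrect answer. It then reports how often a previously correct answer flips to the user's suggestion. This is an assumed, simplified protocol and not the method of the Sycophancy-eval repo; `ask_model`, the item fields, and the prompt wording are hypothetical.

```python
def sycophancy_rate(ask_model, items):
    """Fraction of items where the model abandons a correct answer in
    favour of the user's suggested (incorrect) one.

    ask_model: callable, prompt -> answer string (hypothetical LLM call)
    items: list of dicts with keys 'question', 'correct', 'suggested'
    Only items answered correctly in the neutral condition are counted.
    """
    flipped = 0
    counted = 0
    for item in items:
        correct = item["correct"].lower()
        suggested = item["suggested"].lower()

        neutral = ask_model(item["question"]).lower()
        if correct not in neutral:
            continue  # the model was already wrong without any user bias
        counted += 1

        biased_prompt = f"I think the answer is {item['suggested']}. {item['question']}"
        biased = ask_model(biased_prompt).lower()
        if suggested in biased and correct not in biased:
            flipped += 1
    return flipped / counted if counted else 0.0
```

Substring matching is a crude way to score answers; for free-form responses a stricter answer parser or a grader model would likely be needed.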

General Discussions

How do we know how smart AI systems are?

“AI systems, especially generative language systems like GPT-4, will become increasingly influential in our lives, as will claims about their cognitive capacities. Thus, designing methods to properly assess their intelligence—and associated capabilities and limitations—is an urgent matter. To scientifically evaluate claims of humanlike and even superhuman machine intelligence, we need more transparency on the ways these models are trained, and better experimental methods and benchmarks. Transparency will rely on the development of open-source (rather than closed, commercial) AI models. Better experimental methods and benchmarks will be brought about through collaborations between AI researchers and cognitive scientists who have long investigated how to do robust tests for intelligence, understanding, and other cognitive capabilities in children, animals, and other “alien” intelligences.”

How to evaluate

While it may seem reasonable to evaluate with a 'guess-and-check' approach, this is not scalable, nor will it be quantitatively informative. That is why the use of various tools and libraries is essential to evaluate your models.
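To show what a quantitative alternative to 'guess-and-check' can look like at its simplest, here is a minimal sketch of an exact-match accuracy harness. The `predict` callable and the `(prompt, reference)` dataset layout are assumptions for illustration; dedicated evaluation libraries provide far richer metrics.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not count."""
    return " ".join(text.lower().split())


def exact_match_accuracy(predict, dataset):
    """Score a model over a whole dataset instead of spot-checking.

    predict: callable, prompt -> answer string (hypothetical model call)
    dataset: iterable of (prompt, reference_answer) pairs
    """
    dataset = list(dataset)
    if not dataset:
        raise ValueError("dataset must not be empty")
    hits = sum(normalize(predict(prompt)) == normalize(reference)
               for prompt, reference in dataset)
    return hits / len(dataset)
```

Because the score is a single number computed the same way each time, it can be tracked across models and prompt variants, which ad hoc manual checks cannot offer.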

Measurements Libraries

ROSCOE: A Suite of Metrics for Scoring Step-by-Step Reasoning is 'a new suite of interpretable, unsupervised metrics that enables evaluation of step-by-step reasoning generations of LMs when no golden reference generation exists.'


Introducing MMMU, a Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.


11.5K meticulously collected multimodal questions from college exams, quizzes, and textbooks.

Spanning Art & Design 🎨, Business 💼, Science 🔬, Health & Medicine 🩺, Humanities & Social Science 📖, Tech & Engineering 🛠️ across 30 subjects and 183 subfields.

30 heterogeneous image types 🗺️📉🎼, such as charts, diagrams, maps, tables, music sheets, and chemical structures.

Focuses on advanced perception and reasoning with domain-specific knowledge 🧠.

Results and takeaways from evaluating 14 open-source models and #GPT4-Vision:

🧐 The MMMU benchmark poses a great challenge to existing #LMMs: #GPT4V only hits 56% accuracy, showing a vast landscape for #LMMs advancement.

💪 Long way to go for open-source LMMs. Top open-source models like #BLIP2-FLAN-T5-XXL and #LLaVA-1.5 achieve around 34% accuracy.

🖼️📝 Adding OCR and captions to #LLMs shows little gain on MMMU, highlighting the need for deeper joint image-text interpretation.

Models tend to perform better on photos and paintings 🖼️ than on diagrams and tables 📊, where nuanced and fine-grained visual information persists.

🤖 Error analysis on 150 error cases of GPT-4V reveals that 35% of errors are perceptual, 29% stem from a lack of knowledge, and 26% are due to flaws in the reasoning process.

Domain specific

Legal Bench is an ongoing open science effort to collaboratively curate tasks for evaluating LLM legal reasoning in English.

The evaluation of models helps us to identify which model, if any, to use for a particular task at hand. The ways that we assess the output are directly related to the manner of pre-training, fine-tuning, and any RLHF, and can also be used to improve the models.

Useful References

LLM Eval survey, paper collection