

In generative models, higher quality is generally achieved through feedback methods. Token generation is greedy in the sense that it maximizes the likelihood of the immediate next token rather than of all subsequent tokens, so a complete generation can easily be biased by early tokens that do not lead to globally optimal responses. Feedback methods are designed to guide the generation of the whole sequence of tokens so that it better fulfills the intent of the calling prompt.

Navigating through a maze of tokens

The process of generating responses can be likened to navigating through a maze of tokens. The final generated token, the end-of-sequence token (often written 'EOS'), signifies the end of the output and the completion of a path through the maze, which is the 'destination'. The quality of this path depends on the individual steps taken while navigating the maze. It is possible to take wrong 'turns' in the maze, resulting in a 'wrong' or suboptimal path by the time the generation arrives at the final destination. This is where feedback comes into play, guiding the path through the maze towards a more correct destination.
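The maze analogy can be made concrete with a toy example. The sketch below uses a hypothetical two-step "model" (the `NEXT_TOKEN_PROBS` table and its numbers are invented for illustration) and compares the path chosen by always taking the most likely next token with the path that has the highest overall probability.

```python
# Hypothetical toy model: probability of each next token given the tokens so far.
# The numbers are made up purely to illustrate a greedy "wrong turn".
NEXT_TOKEN_PROBS = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.55, "y": 0.45},
    ("B",): {"x": 0.9, "y": 0.1},
}


def sequence_prob(tokens):
    """Probability of a full path: the product of its step probabilities."""
    prob = 1.0
    for i, tok in enumerate(tokens):
        prob *= NEXT_TOKEN_PROBS[tuple(tokens[:i])][tok]
    return prob


def greedy_decode(length=2):
    """Pick the single most likely token at every step."""
    tokens = []
    for _ in range(length):
        dist = NEXT_TOKEN_PROBS[tuple(tokens)]
        tokens.append(max(dist, key=dist.get))
    return tuple(tokens)


def best_sequence():
    """Exhaustively search all two-token paths for the most likely one."""
    candidates = [(first, second) for first in ("A", "B") for second in ("x", "y")]
    return max(candidates, key=sequence_prob)


if __name__ == "__main__":
    print("greedy:", greedy_decode(), sequence_prob(greedy_decode()))
    print("best:  ", best_sequence(), sequence_prob(best_sequence()))
```

Here the greedy path starts with "A" because it is the likelier first token, yet the path starting with "B" has the higher overall probability. Exhaustive search is only feasible in a toy setting like this one. Real vocabularies and sequence lengths make it intractable, which is part of why the generation process is shaped by training.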

Feedback can be provided by humans, referred to as human-feedback (HF), or by AI, known as AI-feedback (AIF), or a combination of both.

[^n1]Note: This is different from recurrent_training where a model is used to generate training examples to improve the training of a subsequent model.

Feedback-based model updates can be categorized into two types: those that use reinforcement learning (RL) and those that use RL-free feedback.

Reinforcement Learning from Human Feedback (RLHF) has enabled some of the most powerful models, including prominent ones like GPT-4.

Key Takeaway

Feedback is a technique that trains a model to predict a more optimal sequence of token outputs conditioned on a given input.


Feedback is generated from evaluations by people or AI of two or more outputs conditioned on an input prompt. These evaluations can be applied to the entirety of an output or specific portions of it. The evaluation results are then used to optimize the complete path.

In generative models, the quality of output is often enhanced through feedback mechanisms. This is because token-generation is typically a greedy process, maximizing the likelihood of the immediate token without considering the impact on subsequent tokens. As a result, the complete generation can be biased by tokens that do not lead to globally optimal responses. Feedback methods are designed to guide the generation of the entire set of next tokens to more effectively fulfill the intention of the calling prompts.

Reinforcement learning based feedback

Reinforcement Learning (RL) uses the outcomes of a game, also known as a roll-out, to determine how to improve the choices or moves made during the game. In the context of Language Models, these moves are discrete and correspond to the next tokens that are produced.

A policy decides what action or direction to take based on the current state or location. For a language model, the policy predicts a probability distribution over all possible next tokens given the tokens generated so far, and in doing so shapes the entire path of the outcome.

The policy model is trained to produce paths of tokens that end with as high a reward as possible. Feedback, generally from humans or other models, is used to update the policy model. However, the space of possible inputs is far too large for direct feedback to cover every case.

A reward model is created to estimate how humans would evaluate the output. This model allows general human-informed guidance to help improve the policy model iteratively.

One of the most successful examples of this is InstructGPT, which follows the process outlined above. This method forms the basis of the ChatGPT models built on GPT-3.5 and GPT-4.

Many RL methods use 'outcome' evaluations, but process reward models can be better

In Let's Verify Step By Step, the authors use human labelers to give feedback on intermediate reasoning steps. They demonstrate that this step-level feedback can yield a reward model that performs considerably better on various math tests than outcome-based reward models.

(Anthropic) Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback



Training language models to follow instructions with human feedback

InstructGPT trains a model to follow instructions, and it established a powerful paradigm for LLM performance.

Learning to summarize from human feedback provides early successful examples of using PPO and human feedback to improve summaries.

RLHF Diagram


Proximal policy optimization

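There are several policy gradient methods for optimizing the policy, a common one being proximal policy optimization, or PPO. Policy gradient methods estimate the gradient of the expected reward with respect to the policy parameters \(\theta\), using the advantage estimate \(\hat{A}_t\) of each action \(a_t\) taken in state \(s_t\):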

\[ \hat{g} = \hat{\mathbb{E}}_t \left[ \nabla_\theta \log \pi_\theta(a_t | s_t) \hat{A}_t \right] \]
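As a rough illustration of the estimator above, the sketch below computes \(\hat{g}\) for a deliberately simplified policy: a softmax over a small vocabulary whose parameters \(\theta\) are the logits themselves, with no dependence on the state. That simplification, the function names, and the default `epsilon=0.2` are assumptions made for this example. The second function shows the per-sample clipped surrogate term from the PPO paper, \(\min(r_t \hat{A}_t, \operatorname{clip}(r_t, 1-\epsilon, 1+\epsilon)\hat{A}_t)\), where \(r_t\) is the probability ratio between the new and old policies.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_estimate(logits, actions, advantages):
    """Empirical mean of grad_theta log pi_theta(a_t) * A_t.

    Assumes a state-independent softmax policy with theta = logits, for which
    d/d logit_k of log pi(a) equals 1[k == a] - pi(k).
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for action, advantage in zip(actions, advantages):
        for k in range(len(logits)):
            indicator = 1.0 if k == action else 0.0
            grad[k] += (indicator - probs[k]) * advantage
    return [g / len(actions) for g in grad]


def ppo_clipped_term(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate: min(r * A, clip(r, 1-eps, 1+eps) * A)."""
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    # One sampled token (index 0) with a positive advantage: its logit is pushed up.
    print(policy_gradient_estimate([0.0, 0.0, 0.0], actions=[0], advantages=[1.0]))
    # A ratio far above 1 gets no extra credit beyond 1 + epsilon.
    print(ppo_clipped_term(ratio=1.5, advantage=1.0))
```

A positive advantage raises the logit of the sampled token and lowers the others. The clipping removes the incentive to move the new policy far from the old one in a single update.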

TODO: Expand this based on Proximal Policy Optimization Algorithms

Reward Models

A reward model is used to approximate the quality, or reward, that a labeler (a person) might assign to an example output.

While multiple examples may be ranked and used simultaneously, the reward model may be trained by considering only a winning and a losing example. The reward model produces scores \(s_w\) and \(s_l\) for the winning and losing examples, respectively.

The reward model is trained with the objective of incentivizing the winning response to have a higher score than the losing response. More specifically, it minimizes

\[ -\mathbb{E}_x \left[ \log \sigma(s_w - s_l) \right] \]
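A minimal sketch of this objective, assuming the reward model has already produced scalar scores for each winning and losing example (the function names are hypothetical, and a real implementation would compute the scores with a neural network and backpropagate through them):

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_reward_loss(score_pairs):
    """Mean of -log(sigmoid(s_w - s_l)) over (s_w, s_l) score pairs."""
    total = sum(log_sigmoid(s_w - s_l) for s_w, s_l in score_pairs)
    return -total / len(score_pairs)


if __name__ == "__main__":
    # The loss shrinks as the winning score pulls ahead of the losing score.
    print(pairwise_reward_loss([(2.0, 0.0)]))
    print(pairwise_reward_loss([(0.0, 0.0)]))
    print(pairwise_reward_loss([(0.0, 2.0)]))
```

The loss depends only on the difference \(s_w - s_l\), so the absolute scale of the scores is not pinned down by this objective. Equal scores give a loss of \(\log 2\).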

TODO: Expand this to include more mathematics.

Process reward models

Much like the score partway through a ball game is an indicator of the eventual winner, a process reward model approximates the quality of the intermediate steps that lead to a final outcome.

Having intermediate rewards provides better guidance on how token generation is proceeding before the final token is reached.
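To illustrate the difference from a single outcome reward, the sketch below assumes a hypothetical process reward model has already assigned each reasoning step a probability of being correct. Taking the product of the step scores is one possible way to score the whole solution, and the `threshold` value is an arbitrary choice for this example.

```python
import math


def solution_score(step_scores):
    """Score a whole solution as the product of its per-step scores."""
    return math.prod(step_scores)


def first_weak_step(step_scores, threshold=0.5):
    """Return the index of the first step scored below threshold, or None."""
    for index, score in enumerate(step_scores):
        if score < threshold:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical per-step scores for a three-step solution.
    scores = [0.9, 0.2, 0.95]
    print(solution_score(scores))
    print(first_weak_step(scores))
```

An outcome-based reward would give only one number for the finished solution. The step-level scores also indicate where the path went wrong, here at the second step (index 1).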

Let's reward step by step: Step-Level Reward Model as the Navigators for Reasoning



Because AI-generated feedback can reduce the cost of collecting feedback, Reinforcement Learning from AI Feedback (RLAIF) has also proved valuable.

Starling-7B: Increasing LLM Helpfulness & Harmlessness with RLAIF provides a solid example, using feedback generated with GPT-4 to train a 7B model that is reported to perform close to GPT-4 on some benchmarks.

The authors also released a data set called Nectar with over 180k GPT-4-ranked outputs.

RL-free feedback

It is possible to provide feedback without using reinforcement learning. Using a technique called 'Direct Preference Optimization' (DPO), models can be optimized directly on preference data without explicitly training a separate reward model. This method reduces several challenges associated with RL, including the need to iteratively train reward models and the stability problems that are often associated with reinforcement learning.
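As a sketch of the DPO objective for a single preference pair, the function below takes the summed log-probabilities of the winning and losing responses under the model being trained and under a frozen reference model. The argument names and the default `beta=0.1` are assumptions for this example. The loss is \(-\log \sigma\big(\beta[(\log \pi_\theta(y_w|x) - \log \pi_{ref}(y_w|x)) - (\log \pi_\theta(y_l|x) - \log \pi_{ref}(y_l|x))]\big)\).

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss for one (winning, losing) response pair.

    Each argument is the total log-probability of a response given the prompt,
    under either the trained policy or the frozen reference model.
    """
    win_log_ratio = policy_logp_w - ref_logp_w
    lose_log_ratio = policy_logp_l - ref_logp_l
    return neg_log_sigmoid(beta * (win_log_ratio - lose_log_ratio))


if __name__ == "__main__":
    # The policy equals the reference: no preference has been learned yet.
    print(dpo_loss(-2.0, -2.0, -2.0, -2.0))
    # The policy has raised the winner and lowered the loser relative to the reference.
    print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The implicit reward here is the scaled log-ratio between the policy and the reference model, so no separate reward model is trained. The loss falls as the policy favors the winning response more strongly than the reference model does.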

TODO: Integrate this:


Literature to read and integrate :