Retrieval-Augmented Generation (RAG)

Trained and fine-tuned LLMs can generate high-quality results, though those results are generally confined to the information the models were trained on. Additionally, their responses can suffer from:

  • Hallucinations that create false or inaccurate information
  • Lack of attribution, making it difficult to ascertain validity
  • Staleness due to new or updated information

Retrieval-Augmented Generation (RAG) helps to solve these problems by coupling the model to external memory.

Here is a basic comparison of the two:

Comparison with/without RAG

graph LR
    style QueryEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
    style QueryOptimizer1 fill:#E7B4E1,stroke:#333,stroke-width:1px
    style Query fill:#FADAD2,stroke:#333,stroke-width:1px
    style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
    style Docs fill:#FADAD2,stroke:#333,stroke-width:1px
    style QueryOptimizer2 fill:#E7B4E1,stroke:#333,stroke-width:1px
    style DocEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
    style Retriever fill:#E1E7B4,stroke:#333,stroke-width:1px
    style Context fill:#B4E1E7,stroke:#333,stroke-width:1px
    style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
    style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px

    QueryEncoder --> |Retrieve\n from|Retriever
    Prompt --> Generator[LLM\n Generation]
    Query --> Generator
    Query --> QueryOptimizer1(Query\n Optimizer)
    QueryOptimizer1 --> QueryEncoder[Encoder]
    Docs --> QueryOptimizer2(Docs\n Optimizer)
    QueryOptimizer2 --> DocEncoder[Encoder]
    DocEncoder --> |Index\n to| Retriever[Database]

    Retriever --> Context

    Context --> Generator
    Generator --> Answer

graph LR
    style Query fill:#E1FAD2,stroke:#333,stroke-width:1px
    style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
    style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
    style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px

    Query --> Generator[LLM Generation]
    Prompt --> Generator
    Generator --> Answer

The original formulations of RAG used embedding-based lookups, though other retrieval mechanisms, including keyword search and other memory sources, are also possible.

RAG is still an active area of research, and a number of its components can be optimized.

These areas of optimization include:

  • How documents are encoded and chunked
  • How queries are encoded, and when and what to retrieve
  • How to combine the retrieved context with the prompt

One of the seminal papers on RAG, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, introduced a solution for end-to-end training in which document encoding, query encoding, and lookup are trained together, and demonstrated improved results over solutions where model components were frozen. For simplicity, however, the generally standard approach uses frozen models to embed documents and queries.

It is important to evaluate your system to ensure your RAG efforts are well spent.

Why use RAG?

Large foundation models are trained on large corpora of public (and sometimes private) data. Models may lose effective semantic grounding because of the breadth of implicit knowledge codified into their next-token predictions. To improve the groundedness and appropriateness of the desired output, RAG fetches relevant information that can be combined with the prompt context so the LLM can generate appropriate results. This is particularly important when information changes frequently and needs to be incorporated quickly.

Importantly, you can use RAG to help with data summarization and question-answering, and to make it somewhat more interpretable how the model 'knows' the information it generates.

Use RAG because:

  • You need knowledge beyond the LLM's training set
  • You want to minimize hallucinations
  • Your data can be highly dynamic
  • The results need to be interpretable
  • You don't have training data available

Why not use RAG?

The primary challenges with RAG tend to be organizational or functional.

Don't use RAG because:

  • You have latency requirements that the extra retrieval step may violate.
  • You don't want to pay for, maintain, and support a RAG database.
  • There are ethical or privacy concerns relating to sending data to a third-party API.

RAG vs Finetuning

Because finetuning can ingrain intrinsic knowledge in an LLM, it generally leads to improved performance.

Rag vs Finetuning reveals that fine-tuning boosts performance over RAG.

Paper

That said, using RAG to inform fine-tuning, as in Retrieval Augmented Fine Tuning (RAFT), or variations that use a mixture of experts, can lead to even better performance.

🦍 RAFT: Adapting Language Model to Domain Specific RAG

Blog post Paper

RAG in Detail

The RAG process can be divided into two main stages: Preparation (offline) and Retrieval and Generation (online).

Document Indexing (offline)

Indexing involves loading data, splitting it, embedding it, adding metadata, and storing it.

It is useful to perform indexing in a way that keeps track of which records have been put into the vector store.


Indexing

Indexing improves performance and saves time and money by not:

  • Re-processing unchanged content
  • Re-computing embeddings of unchanged content
  • Inserting duplicated content

The LangChain blog and docs on indexing provide good discussions of these topics.
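A minimal sketch of the bookkeeping idea in plain Python: hash each chunk's content and only embed and insert chunks whose hash has not been seen before. The `seen_hashes` set is an illustrative stand-in for whatever record store your pipeline uses.

    import hashlib

    def needs_indexing(chunk: str, seen_hashes: set[str]) -> bool:
        """Return True only if this exact content has not been indexed before."""
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return False  # unchanged or duplicate content: skip re-embedding and re-inserting
        seen_hashes.add(digest)
        return True

    # Usage: only new or changed chunks get embedded and stored.
    seen: set[str] = set()
    chunks = ["LLMs hallucinate.", "RAG adds retrieval.", "LLMs hallucinate."]
    to_index = [c for c in chunks if needs_indexing(c, seen)]
    # to_index == ["LLMs hallucinate.", "RAG adds retrieval."]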

Indexing process (clickable)

graph LR
    style DocumentSelection fill:#B4E1E7,stroke:#333,stroke-width:1px
    style LoadDocuments fill:#FAD2E1,stroke:#333,stroke-width:1px
    style SplitDocuments fill:#E1FAD2,stroke:#333,stroke-width:1px
    style EmbedDocumentSplits fill:#D2FAFA,stroke:#333,stroke-width:1px
    style StoringData fill:#FADAD2,stroke:#333,stroke-width:1px

    DocumentSelection[Select Documents] --> LoadDocuments[Load \nDocuments]
    LoadDocuments --> SplitDocuments[Split \n Documents]
    SplitDocuments --> EmbedDocumentSplits[Embed \n Document \n Splits]
    EmbedDocumentSplits --> StoringData[Store in \nDatabase]

    click DocumentSelection "#selecting-data"
    click LoadDocuments "#loading-data"
    click SplitDocuments "#splitting-data"
    click EmbedDocumentSplits "#embedding-data"
    click StoringData "#storing-data"

The preparation stage involves the following steps, performed offline:

  1. Data Selection: Choose the appropriate data to ingest.
  2. Loading Data: Load the data in a manner that can be consumed by the models.
  3. Splitting Data: Split the data into chunks that can be both consumed by the model and retrieved with a reasonable degree of relevance.
  4. Embedding Data: Embed the data.
  5. Storing Data: Store the embedding.

Selecting Data

Users should only access data that is appropriate for their application. However, including too much information may be unnecessary, or even harmful to retrieval if the retriever cannot handle the volume or complexity of the data. It is also crucial to ensure data privacy and not expose data that is inappropriate (or illegal) to access.

Loading Data

Different data types require different loaders. Raw text, PDFs, spreadsheets, and more proprietary formats need to be processed in a way that preserves the most relevant information. Text is easy to process, but some data, especially multimodal data like PDFs, may need to be formatted with a schema to allow for more effective searching.
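As an illustrative sketch, a PDF might be loaded with the pypdf package; other formats need their own loaders, and this is only one of many options.

    from pypdf import PdfReader  # assumes the pypdf package is installed

    def load_pdf_text(path: str) -> str:
        """Extract plain text from a PDF, page by page."""
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    # raw_text = load_pdf_text("report.pdf")  # hypothetical file path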

Splitting Data

Once data has been loaded in a way that a model can process it, it must be split. There are several ways of splitting data:

  1. By the max size a model can handle.
  2. By heuristic breaks, such as sentence boundaries (.), newline characters (\n), or paragraph breaks (a minimal sketch follows this list).
  3. In a manner that maximizes the topic coherence. In this case, splitting and embedding may happen simultaneously.
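A minimal sketch of heuristic splitting (option 2) in plain Python: split on paragraph breaks, then pack paragraphs into chunks under a size cap. Production systems typically use a library splitter that also adds overlap between chunks.

    def split_text(text: str, max_chars: int = 1000) -> list[str]:
        """Split on paragraph breaks, then pack paragraphs into chunks under max_chars."""
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, current = [], ""
        for para in paragraphs:
            # Start a new chunk when adding this paragraph would exceed the cap.
            # (A single oversized paragraph becomes its own chunk.)
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = ""
            current = f"{current}\n\n{para}".strip()
        if current:
            chunks.append(current)
        return chunks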

Embedding Data

Index Building - One of the most useful tricks is multi-representation indexing: decouple what you index for retrieval (e.g., table or image summary) from what you pass to the LLM for answer synthesis (e.g., the raw image, a table). Read more
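Embedding turns each chunk (or, for multi-representation indexing, its summary) into a dense vector. A hedged sketch using the sentence-transformers library; the model name is illustrative, and any embedding model or API works similarly.

    from sentence_transformers import SentenceTransformer  # assumes sentence-transformers is installed

    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative open-source embedding model
    chunks = [
        "RAG couples an LLM to external memory.",
        "Indexing avoids re-embedding unchanged content.",
    ]
    embeddings = model.encode(chunks)  # one dense vector per chunk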

Adding metadata

Information such as dates, chapters, or key words can allow for filtering and key-word lookup.

Storing Data

The embedded data is stored for future retrieval and use. This is done via standard database methods, with the embeddings serving as vector retrieval addresses and the metadata supporting more traditional (keyword) search methods.
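A hedged sketch of storing the vectors, using FAISS as the vector index and a plain Python list for the metadata; any vector database exposes equivalent operations.

    import numpy as np
    import faiss  # assumes the faiss-cpu package is installed

    vectors = np.asarray(embeddings, dtype="float32")  # embeddings from the previous step
    index = faiss.IndexFlatIP(vectors.shape[1])        # exact inner-product index (cosine if vectors are normalized)
    index.add(vectors)

    # Keep metadata alongside the vectors for filtering and keyword lookup.
    metadata = [{"source": "notes.md", "chunk_id": i} for i in range(len(chunks))]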

Retrieval and Generation (online)

The retrieval and generation stage involves the following steps:

  1. Retrieving Data: Retrieve the data based on input in such a way that relevant documents and chunks can be used in downstream chains.
  2. Generating Output: Generate an output using a prompt that integrates the query and retrieved data.

The decision to retrieve documents, and the act of retrieving them, will depend on the additional context the application or agent needs.

It might not always be necessary to retrieve documents. When it is, it is important to know where to retrieve from (routing), and then to match the query to the appropriately stored information. Both of these steps may involve rewriting the query so that retrieval is more effective.

Retrieval and generation (clickable)

    graph LR
        style C fill:#B4E1E7,stroke:#333,stroke-width:1px
        style T fill:#FAD2E1,stroke:#333,stroke-width:1px
        style RR fill:#E1FAD2,stroke:#333,stroke-width:1px
        style R fill:#FADAD2,stroke:#333,stroke-width:1px
        style F fill:#E7B4E1,stroke:#333,stroke-width:1px
        style G fill:#D2E1FA,stroke:#333,stroke-width:1px
        style H fill:#E1E7B4,stroke:#333,stroke-width:1px

        C[Query] --> T[Optimize]
        T --> RR[Route]
        RR --> R[Match and \nRank Documents]
        R --> F[Combine With\n Context]
        F --> G[LLM \nGeneration]
        G --> H[Answer]

        click T "#query-optimization"
        click RR "#routing"
        click R "#match-and-rank"
        click F "#CombineWithContext"
        click G "#LLMGeneration"
        click H "#Answer"

Query Optimization

In production settings, the queries that users ask are unlikely to be optimal for retrieval. This can be due to a combination of challenges, such as questions that are:

  • Irrelevant
  • Vague
  • Not related to retrieval
  • Made of multiple questions

Query optimization looks to improve these queries in several ways. Here are several, with more detailed descriptions in LangChain's query analysis.

Rewrite-Retrieve-Read

This approach involves rewriting the query for better retrieval and reading of the relevant documents.

Query Rewriting for Retrieval-Augmented Large Language Models


Step Back Prompting

This method generates an intermediate context that helps to 'abstract' the information. Once generated, the additional context can be used.

Step back
    You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

    {normal_context}
    {step_back_context}

    Original Question: {question}
    Answer:

Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models


Query Rephrasing

Particularly in chat settings, it's important to include all of the appropriate context to create an effective search query.

Rephrase question
    Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

    Chat History:
    {chat_history}
    Follow Up Input: {question}
    Standalone Question:

Query Decomposition

When questions are directly made of multiple questions, or when answering them effectively involves answering several sub-questions, breaking the question into multiple queries may be essential. This may involve performing sequential queries that are created based on retrieved information, or queries that can be run independently of other results. See LangChain's Query decomposition.
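A minimal sketch of LLM-based decomposition; `llm_complete` is a hypothetical helper that sends a prompt to whatever LLM client you use and returns its text.

    DECOMPOSE_PROMPT = (
        "Break the following question into the minimal set of standalone sub-questions "
        "needed to answer it. Return one sub-question per line.\n\nQuestion: {question}"
    )

    def decompose(question: str, llm_complete) -> list[str]:
        raw = llm_complete(DECOMPOSE_PROMPT.format(question=question))
        return [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]

    # Each sub-question can then be retrieved and answered independently,
    # or sequentially, feeding earlier answers into later retrievals.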

Query Expansion

Query expansion generates multiple rephrased versions of the query to increase the likelihood of a hit, or uses advanced retrieval methods to triangulate higher-quality hits.
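A minimal sketch, again with hypothetical `llm_complete` and `retrieve` helpers: generate a few rephrasings, retrieve for each, and union the hits by document id.

    EXPAND_PROMPT = (
        "Rewrite the question below in {n} different ways that preserve its meaning. "
        "Return one rewrite per line.\n\nQuestion: {question}"
    )

    def expanded_retrieve(question: str, llm_complete, retrieve, n: int = 3) -> list[str]:
        rewrites = llm_complete(EXPAND_PROMPT.format(n=n, question=question)).splitlines()
        queries = [question] + [r.strip() for r in rewrites if r.strip()]
        hits: dict[str, str] = {}
        for q in queries:
            for doc_id, text in retrieve(q):  # retrieve returns (doc_id, text) pairs
                hits[doc_id] = text           # de-duplicate across queries
        return list(hits.values())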

Query Clarifying

Particularly in chat settings when questions are vague, asking follow-up questions can be instrumental in ensuring the lookup can be as effective as possible.

Query structuring

When answers to queries can be 'filtered' using metadata derived from elements of the query, doing so can be highly valuable. This can include attributes such as date, location, or subject. See LangChain's Query construction for additional information.
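A minimal sketch of turning a question into a structured filter; the filter keys and the `llm_complete` helper are illustrative assumptions, not a specific library API.

    import json

    STRUCTURE_PROMPT = (
        "Extract search filters from the question as JSON with optional keys "
        '"date_after", "date_before", "location", and "subject". '
        "Use null for anything not mentioned. Return only JSON.\n\nQuestion: {question}"
    )

    def build_filters(question: str, llm_complete) -> dict:
        filters = json.loads(llm_complete(STRUCTURE_PROMPT.format(question=question)))
        return {k: v for k, v in filters.items() if v is not None}

    # The resulting dict can be passed to a vector store's metadata filter.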

Routing

Depending on the question asked, queries may need to be routed to different data sources or indexes. OpenAI's RAG strategies provide some guidance on question routing.
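A minimal sketch of an LLM-based router; the route names and the `llm_complete` helper are illustrative assumptions.

    ROUTES = {
        "product_docs": "Product features and how-to questions",
        "hr_policies": "Internal HR and company policy questions",
        "code_search": "Questions about the codebase or APIs",
    }

    ROUTER_PROMPT = (
        "Pick the single best data source for the question. "
        "Answer with exactly one of: {names}.\n\nQuestion: {question}"
    )

    def route(question: str, llm_complete) -> str:
        answer = llm_complete(ROUTER_PROMPT.format(names=", ".join(ROUTES), question=question)).strip()
        return answer if answer in ROUTES else "product_docs"  # fall back to a default index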

Matching and Ranking

Matching involves aligning the query with the appropriately stored information.
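For dense retrieval, matching typically reduces to a nearest-neighbor search over the stored embeddings. A minimal cosine-similarity sketch with NumPy:

    import numpy as np

    def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
        """Rank stored document vectors by cosine similarity to the query vector."""
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        scores = d @ q
        return np.argsort(-scores)[:k].tolist()  # indices of the k best-matching chunks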

Multi-Hop RAG

In order to effectively answer some queries, evidence may need to be retrieved from multiple documents. This is known as multi-hop RAG.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries provides a dataset for evaluating multi-hop RAG:

"MultiHop-RAG: a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications."


Small to big lookup

Small-to-big retrieval embeds and matches small chunks (such as single sentences) for precision, but passes the larger parent chunk or document that contains the match to the LLM, so generation has richer context. See the Advanced RAG small to big blog linked below.
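A minimal sketch, assuming each small chunk records the id of its parent chunk; `retrieve_small` and `parent_store` are illustrative placeholders for your retriever and document store.

    def small_to_big(query: str, retrieve_small, parent_store: dict[str, str], k: int = 5) -> list[str]:
        """Match on small chunks, then hand their (de-duplicated) parent chunks to the LLM."""
        small_hits = retrieve_small(query, k)            # e.g. [{"text": ..., "parent_id": ...}, ...]
        parent_ids = {hit["parent_id"] for hit in small_hits}
        return [parent_store[pid] for pid in parent_ids]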

Reranking

Reranking re-scores an initial set of retrieved candidates with a stronger model (often a cross-encoder, such as Cohere Rerank) so that the most relevant documents end up at the top before generation. See the CohereRerank example in the pain points section below.
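A hedged sketch of local cross-encoder reranking with sentence-transformers; the model name is illustrative, and a hosted reranker such as Cohere Rerank works the same way conceptually.

    from sentence_transformers import CrossEncoder  # assumes sentence-transformers is installed

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative reranking model

    def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
        """Score each (query, candidate) pair jointly and keep the best top_n."""
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]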

Generating responses

The final step is generating an output using a prompt that integrates the query and retrieved data.

Challenges in generating responses can involve:

  • Not having enough information: RAG can help minimize the generation of non-factual responses, but only if the retrieved information provides sufficient context to answer the question properly (see the sketch after this list). If the question cannot be answered with a reasonable degree of certainty, the response should be along the lines of "I don't know."
  • Conflicting information: When retrieved results contain different answers to the same question, a definitive response may not be possible.
  • Stale information: When retrieved information is no longer relevant.
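A minimal sketch of the final generation step, with an explicit instruction to abstain when the retrieved context is insufficient; `llm_complete` is again a hypothetical LLM call.

    ANSWER_PROMPT = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    def generate_answer(question: str, retrieved_chunks: list[str], llm_complete) -> str:
        context = "\n\n".join(retrieved_chunks)
        return llm_complete(ANSWER_PROMPT.format(context=context, question=question))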

Multimodal RAG

Natural-language lookup with RAG can be improved by incorporating other modalities, such as tables and images. There are several ways this may be accomplished, as described in LangChain's multi-modal RAG:

Option 1:

  • Use multimodal embeddings (such as CLIP) to embed images and text
  • Retrieve both using similarity search
  • Pass raw images and text chunks to a multimodal LLM for answer synthesis

Option 2:

  • Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
  • Embed and retrieve text
  • Pass text chunks to an LLM for answer synthesis

Option 3:

  • Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
  • Embed and retrieve image summaries with a reference to the raw image
  • Pass raw images and text chunks to a multimodal LLM for answer synthesis

  • Multi-Modal: This approach is used for RAG on a substack that has many images of densely packed tables and graphs. Here is an example implementation, and here is one that works with private data.

  • Semi-Structured: This approach is used for RAG on documents with tables, which can be split using naive RAG text-splitting that does not explicitly preserve them. Here is an example implementation.

Evaluating and Comparing

Because there are many ways of performing RAG, it is important to evaluate the quality of the implemented solution.

Rag Arena provides an interface with LangChain for a RAG chatbot experience where queries receive multiple responses.

Resources, Tutorials and Blogs

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks introduces a complete solution for enabling improved response generation with LLMs.

The authors reveal that allowing the models to be fine-tuned when equipped with RAG improved the results.

12 RAG Pain Points and Proposed Solutions

Things that might lead to failure of a RAG pipeline, mostly taken from the blog.

Pain points and proposed solutions:

1: Missing Content:

  • Clean your data
  • Better prompting

2: Missed the Top Ranked Documents

  • Hyperparameter tuning for chunk_size and similarity_top_k as in Hyperparameter Optimization for RAG.
  • The reranking notebook uses Improving Retrieval Performance by Fine-tuning Cohere Reranker with LlamaIndex and CohereRerank to rerank the results:
        import os
        from llama_index.postprocessor.cohere_rerank import CohereRerank

        api_key = os.environ["COHERE_API_KEY"]
        cohere_rerank = CohereRerank(api_key=api_key, top_n=2)  # return the top 2 nodes from the reranker

        # `index` is assumed to be an existing VectorStoreIndex built over the documents.
        query_engine = index.as_query_engine(
            similarity_top_k=10,  # set a high top_k to ensure maximal relevant retrieval
            node_postprocessors=[cohere_rerank],  # pass the reranker to node_postprocessors
        )

        response = query_engine.query(
            "What did Sam Altman do in this essay?",
        )
    

3: Not in Context — Consolidation Strategy Limitations

  • Tweak retrieval strategies
  • Finetune embeddings

4: Not Extracted

5: Wrong Format

  • Output Parsing
  • Pydantic

6: Incorrect Specificity

7: Incomplete and Impartial Responses

8: Data Ingestion Scalability

9: Structured Data QA

10: Data Extraction from Complex PDFs

  • Use pdf2htmlEX
  • Use EmbeddedTablesUnstructuredRetrieverPack in LlamaIndex

11: Fallback Model(s): Use a model router such as Neutrino or OpenRouter

    from llama_index.llms import Neutrino
    from llama_index.llms import ChatMessage

    llm = Neutrino(
        api_key="<your-Neutrino-api-key>",
        # "test" is a router configured in the Neutrino dashboard. Treat a router as an LLM:
        # use your own defined router, or "default" to include all supported models.
        router="test",
    )

    response = llm.complete("What is large language model?")
    print(f"Optimal model: {response.raw['model']}")
    from llama_index.llms import OpenRouter
    from llama_index.llms import ChatMessage

    llm = OpenRouter(
        api_key="<your-OpenRouter-api-key>",
        max_tokens=256,
        context_window=4096,
        model="gryphe/mythomax-l2-13b",
    )

    message = ChatMessage(role="user", content="Tell me a joke")
    resp = llm.chat([message])
    print(resp)

12: LLM Security

Advanced RAG small to big

Blog

Advanced Retrieval Augmented Generation from Theory to LlamaIndex

Blog

RAG vs finetuning