Retrieval-Augmented Generation (RAG)

Trained and fine-tuned LLMs can generate high-quality results, though those results are generally confined to the information the models were trained on. Additionally, their responses can suffer from:

  • Hallucinations that create false or inaccurate information
  • Lack of attribution, making it difficult to ascertain validity
  • Staleness due to new or updated information

Retrieval-Augmented Generation (RAG) helps to solve these problems by coupling the model to external memory.

Here is a basic comparison of the two:

Comparison with/without RAG

graph LR
    style QueryEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
    style QueryOptimizer1 fill:#E7B4E1,stroke:#333,stroke-width:1px
    style Query fill:#FADAD2,stroke:#333,stroke-width:1px
    style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
    style Docs fill:#FADAD2,stroke:#333,stroke-width:1px
    style QueryOptimizer2 fill:#E7B4E1,stroke:#333,stroke-width:1px
    style DocEncoder fill:#D2E1FA,stroke:#333,stroke-width:1px
    style Retriever fill:#E1E7B4,stroke:#333,stroke-width:1px
    style Context fill:#B4E1E7,stroke:#333,stroke-width:1px
    style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
    style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px

    QueryEncoder --> |Retrieve\n from|Retriever
    Prompt --> Generator[LLM\n Generation]
    Query --> Generator
    Query --> QueryOptimizer1(Query\n Optimizer)
    QueryOptimizer1 --> QueryEncoder[Encoder]
    Docs --> QueryOptimizer2(Docs\n Optimizer)
    QueryOptimizer2 --> DocEncoder[Encoder]
    DocEncoder --> |Index\n to| Retriever[Database]

    Retriever --> Context

    Context --> Generator
    Generator --> Answer

graph LR
    style Query fill:#E1FAD2,stroke:#333,stroke-width:1px
    style Prompt fill:#D2FAFA,stroke:#333,stroke-width:1px
    style Generator fill:#FAD2E1,stroke:#333,stroke-width:1px
    style Answer fill:#E1FAD2,stroke:#333,stroke-width:1px

    Query --> Generator[LLM Generation]
    Prompt --> Generator
    Generator --> Answer

The original formulations of RAG used embedding-based lookups, though other retrieval mechanisms, including keyword search and other memory sources, are also possible.

RAG is still an active area of research, and a number of its components can be optimized.

These areas of optimization include:

  • How documents are encoded and chunked
  • How queries are encoded, and when and what to retrieve
  • How to combine the retrieved context with the prompt

One of the seminal papers on RAG, Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, introduced a solution for end-to-end training in which document encoding, query encoding, and lookup are trained together, and demonstrated improved results over solutions where model components were frozen. For simplicity, however, the generally standard approach uses frozen models to embed documents and queries.

It is important to evaluate your system to ensure your RAG efforts are well spent.

Why use RAG?

Large foundation models are trained on large corpora of public (and sometimes private) data. Models may lose effective semantic grounding because of the breadth of implicit knowledge codified into their next-token predictions. To improve the groundedness and appropriateness of the desired output, RAG fetches relevant information that can be combined with the prompt context so the LLM can generate appropriate results. This is particularly important when information changes frequently and needs to be incorporated quickly.

Importantly, you can use RAG to help with data summarization and question-answering, and to make it somewhat more interpretable how the model 'knows' the information it generates.

Use RAG because:

  • You need knowledge beyond the LLM's training set
  • You want to minimize hallucinations
  • Your data can be highly dynamic
  • The results need to be interpretable
  • You don't have training data available

Why not use RAG?

The primary challenges with RAG tend to be organizational or functional.

Don't use RAG because:

  • You have latency requirements that the extra retrieval step may violate.
  • You don't want to pay for, maintain, and support a RAG database.
  • There are ethical or privacy concerns relating to sending data to a third-party API.

RAG vs Finetuning

Because finetuning can ingrain intrinsic knowledge in an LLM, it generally leads to improved performance.

Rag vs Finetuning reveals that fine-tuning boosts performance over RAG.

Paper

That said, using RAG to inform fine-tuning, as in Retrieval Augmented Fine Tuning (RAFT), or variations that use a mixture of experts, can lead to even better performance.

🦍 RAFT: Adapting Language Model to Domain Specific RAG

Blog post Paper

RAG in Detail

The RAG process can be divided into two main stages: Preparation (offline) and Retrieval and Generation (online).

Document Indexing (offline)

Indexing involves loading data, splitting it, embedding it, adding metadata, and storing it.

It is useful to perform indexing in a way that keeps track of which records have been put into the vector store.


Indexing

Indexing improves performance and saves time and money by not:

  • Re-processing unchanged content
  • Re-computing embeddings of unchanged content
  • Inserting duplicated content

The LangChain blog and docs on indexing provide good discussions of these topics.
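A minimal sketch of the bookkeeping idea in plain Python: hash each chunk's content and only embed and insert chunks whose hash has not been seen before. The `seen_hashes` set is an illustrative stand-in for whatever record store your pipeline uses.

    import hashlib

    def needs_indexing(chunk: str, seen_hashes: set[str]) -> bool:
        """Return True only if this exact content has not been indexed before."""
        digest = hashlib.sha256(chunk.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            return False  # unchanged or duplicate content: skip re-embedding and re-inserting
        seen_hashes.add(digest)
        return True

    # Usage: only new or changed chunks get embedded and stored.
    seen: set[str] = set()
    chunks = ["LLMs hallucinate.", "RAG adds retrieval.", "LLMs hallucinate."]
    to_index = [c for c in chunks if needs_indexing(c, seen)]
    # to_index == ["LLMs hallucinate.", "RAG adds retrieval."]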

Indexing process (clickable)

graph LR
    style DocumentSelection fill:#B4E1E7,stroke:#333,stroke-width:1px
    style LoadDocuments fill:#FAD2E1,stroke:#333,stroke-width:1px
    style SplitDocuments fill:#E1FAD2,stroke:#333,stroke-width:1px
    style EmbedDocumentSplits fill:#D2FAFA,stroke:#333,stroke-width:1px
    style StoringData fill:#FADAD2,stroke:#333,stroke-width:1px

    DocumentSelection[Select Documents] --> LoadDocuments[Load \nDocuments]
    LoadDocuments --> SplitDocuments[Split \n Documents]
    SplitDocuments --> EmbedDocumentSplits[Embed \n Document \n Splits]
    EmbedDocumentSplits --> StoringData[Store in \nDatabase]

    click DocumentSelection "#selecting-data"
    click LoadDocuments "#loading-data"
    click SplitDocuments "#splitting-data"
    click EmbedDocumentSplits "#embedding-data"
    click StoringData "#storing-data"

The preparation stage involves the following steps, performed offline:

  1. Data Selection: Choose the appropriate data to ingest.
  2. Loading Data: Load the data in a manner that can be consumed by the models.
  3. Splitting Data: Split the data into chunks that can be both consumed by the model and retrieved with a reasonable degree of relevance.
  4. Embedding Data: Embed the data.
  5. Storing Data: Store the embedding.

Selecting Data

Users should only access data that is appropriate for their application. However, including too much information may be unnecessary, or even harmful to retrieval if the retriever cannot handle the volume or complexity of the data. It is also crucial to ensure data privacy and not expose data that is inappropriate (or illegal) to access.

Loading Data

Different data types require different loaders. Raw text, PDFs, spreadsheets, and more proprietary formats need to be processed in a way that preserves the most relevant information. Text is easy to process, but some data, especially multimodal data like PDFs, may need to be formatted with a schema to allow for more effective searching.
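As an illustrative sketch, a PDF might be loaded with the pypdf package; other formats need their own loaders, and this is only one of many options.

    from pypdf import PdfReader  # assumes the pypdf package is installed

    def load_pdf_text(path: str) -> str:
        """Extract plain text from a PDF, page by page."""
        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)

    # raw_text = load_pdf_text("report.pdf")  # hypothetical file path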

Splitting Data

Once data has been loaded in a way that a model can process it, it must be split. There are several ways of splitting data:

  1. By the max size a model can handle.
  2. By heuristic breaks, such as sentence boundaries (.), newline characters (\n), or paragraph breaks (a minimal sketch follows this list).
  3. In a manner that maximizes the topic coherence. In this case, splitting and embedding may happen simultaneously.
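A minimal sketch of heuristic splitting (option 2) in plain Python: split on paragraph breaks, then pack paragraphs into chunks under a size cap. Production systems typically use a library splitter that also adds overlap between chunks.

    def split_text(text: str, max_chars: int = 1000) -> list[str]:
        """Split on paragraph breaks, then pack paragraphs into chunks under max_chars."""
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        chunks, current = [], ""
        for para in paragraphs:
            # Start a new chunk when adding this paragraph would exceed the cap.
            # (A single oversized paragraph becomes its own chunk.)
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append(current)
                current = ""
            current = f"{current}\n\n{para}".strip()
        if current:
            chunks.append(current)
        return chunks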

Embedding Data

Index Building - One of the most useful tricks is multi-representation indexing: decouple what you index for retrieval (e.g., table or image summary) from what you pass to the LLM for answer synthesis (e.g., the raw image, a table). Read more
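Embedding turns each chunk (or, for multi-representation indexing, its summary) into a dense vector. A hedged sketch using the sentence-transformers library; the model name is illustrative, and any embedding model or API works similarly.

    from sentence_transformers import SentenceTransformer  # assumes sentence-transformers is installed

    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative open-source embedding model
    chunks = [
        "RAG couples an LLM to external memory.",
        "Indexing avoids re-embedding unchanged content.",
    ]
    embeddings = model.encode(chunks)  # one dense vector per chunk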

Adding metadata

Information such as dates, chapters, or key words can allow for filtering and key-word lookup.

Storing Data

The embedded data is stored for future retrieval and use. This is done via standard database methods, with the embeddings serving as vector retrieval addresses and the metadata supporting more traditional (keyword) search methods.
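A hedged sketch of storing the vectors, using FAISS as the vector index and a plain Python list for the metadata; any vector database exposes equivalent operations.

    import numpy as np
    import faiss  # assumes the faiss-cpu package is installed

    vectors = np.asarray(embeddings, dtype="float32")  # embeddings from the previous step
    index = faiss.IndexFlatIP(vectors.shape[1])        # exact inner-product index (cosine if vectors are normalized)
    index.add(vectors)

    # Keep metadata alongside the vectors for filtering and keyword lookup.
    metadata = [{"source": "notes.md", "chunk_id": i} for i in range(len(chunks))]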

Retrieval and Generation (online)

The retrieval and generation stage involves the following steps:

  1. Retrieving Data: Retrieve the data based on input in such a way that relevant documents and chunks can be used in downstream chains.
  2. Generating Output: Generate an output using a prompt that integrates the query and retrieved data.

The decision to retrieve documents, and the act of retrieving them, will depend on the additional context the application or agent needs.

It might not always be necessary to retrieve documents. When it is, it is important to know where to retrieve from (routing), and then to match the query to the appropriately stored information. Both of these steps may involve rewriting the query so that retrieval is more effective.

Retrieval and generation (clickable)

    graph LR
        style C fill:#B4E1E7,stroke:#333,stroke-width:1px
        style T fill:#FAD2E1,stroke:#333,stroke-width:1px
        style RR fill:#E1FAD2,stroke:#333,stroke-width:1px
        style R fill:#FADAD2,stroke:#333,stroke-width:1px
        style F fill:#E7B4E1,stroke:#333,stroke-width:1px
        style G fill:#D2E1FA,stroke:#333,stroke-width:1px
        style H fill:#E1E7B4,stroke:#333,stroke-width:1px

        C[Query] --> T[Optimize]
        T --> RR[Route]
        RR --> R[Match and \nRank Documents]
        R --> F[Combine With\n Context]
        F --> G[LLM \nGeneration]
        G --> H[Answer]

        click T "#query-optimization"
        click RR "#routing"
        click R "#match-and-rank"
        click F "#CombineWithContext"
        click G "#LLMGeneration"
        click H "#Answer"

Query Optimization

In production settings, the queries that users ask are unlikely to be optimal for retrieval. This can be due to a combination of challenges, such as questions that are:

  • Irrelevant
  • Vague
  • Not related to retrieval
  • Made of multiple questions

Query optimization looks to improve these queries in several ways. Here are several, with more detailed descriptions in LangChain's query analysis.

Rewrite-Retrieve-Read

This approach involves rewriting the query for better retrieval and reading of the relevant documents.

Query Rewriting for Retrieval-Augmented Large Language Models


Step Back Prompting

This method generates an intermediate context that helps to 'abstract' the information. Once generated, the additional context can be used.

Step back
    You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.

    {normal_context}
    {step_back_context}

    Original Question: {question}
    Answer:

Take a Step Back: Evoking Reasoning via Abstraction in Large Language Models


Query Rephrasing

Particularly in chat settings, it's important to include all of the appropriate context to create an effective search query.

Rephrase question
    Given the following conversation and a follow up question, rephrase the follow up question to be a standalone question.

    Chat History:
    {chat_history}
    Follow Up Input: {question}
    Standalone Question:

Query Decomposition

When questions are directly made of multiple questions, or when answering them effectively involves answering several sub-questions, breaking the question into multiple queries may be essential. This may involve performing sequential queries that are created based on retrieved information, or queries that can be run independently of other results. See LangChain's Query decomposition.
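A minimal sketch of LLM-based decomposition; `llm_complete` is a hypothetical helper that sends a prompt to whatever LLM client you use and returns its text.

    DECOMPOSE_PROMPT = (
        "Break the following question into the minimal set of standalone sub-questions "
        "needed to answer it. Return one sub-question per line.\n\nQuestion: {question}"
    )

    def decompose(question: str, llm_complete) -> list[str]:
        raw = llm_complete(DECOMPOSE_PROMPT.format(question=question))
        return [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]

    # Each sub-question can then be retrieved and answered independently,
    # or sequentially, feeding earlier answers into later retrievals.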

Query Expansion

Query expansion generates multiple rephrased versions of the query to increase the likelihood of a hit, or uses advanced retrieval methods to triangulate higher-quality hits.
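A minimal sketch, again with hypothetical `llm_complete` and `retrieve` helpers: generate a few rephrasings, retrieve for each, and union the hits by document id.

    EXPAND_PROMPT = (
        "Rewrite the question below in {n} different ways that preserve its meaning. "
        "Return one rewrite per line.\n\nQuestion: {question}"
    )

    def expanded_retrieve(question: str, llm_complete, retrieve, n: int = 3) -> list[str]:
        rewrites = llm_complete(EXPAND_PROMPT.format(n=n, question=question)).splitlines()
        queries = [question] + [r.strip() for r in rewrites if r.strip()]
        hits: dict[str, str] = {}
        for q in queries:
            for doc_id, text in retrieve(q):  # retrieve returns (doc_id, text) pairs
                hits[doc_id] = text           # de-duplicate across queries
        return list(hits.values())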

Query Clarifying

Particularly in chat settings when questions are vague, asking follow-up questions can be instrumental in ensuring the lookup can be as effective as possible.

Query structuring

When answers to queries can be 'filtered' using metadata derived from elements of the query, doing so can be highly valuable. This can include attributes such as date, location, or subject. See LangChain's Query construction for additional information.
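A minimal sketch of turning a question into a structured filter; the filter keys and the `llm_complete` helper are illustrative assumptions, not a specific library API.

    import json

    STRUCTURE_PROMPT = (
        "Extract search filters from the question as JSON with optional keys "
        '"date_after", "date_before", "location", and "subject". '
        "Use null for anything not mentioned. Return only JSON.\n\nQuestion: {question}"
    )

    def build_filters(question: str, llm_complete) -> dict:
        filters = json.loads(llm_complete(STRUCTURE_PROMPT.format(question=question)))
        return {k: v for k, v in filters.items() if v is not None}

    # The resulting dict can be passed to a vector store's metadata filter.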

Routing

Depending on the question asked, queries may need to be routed to different data sources or indexes. OpenAI's RAG strategies provide some guidance on question routing.
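A minimal sketch of an LLM-based router; the route names and the `llm_complete` helper are illustrative assumptions.

    ROUTES = {
        "product_docs": "Product features and how-to questions",
        "hr_policies": "Internal HR and company policy questions",
        "code_search": "Questions about the codebase or APIs",
    }

    ROUTER_PROMPT = (
        "Pick the single best data source for the question. "
        "Answer with exactly one of: {names}.\n\nQuestion: {question}"
    )

    def route(question: str, llm_complete) -> str:
        answer = llm_complete(ROUTER_PROMPT.format(names=", ".join(ROUTES), question=question)).strip()
        return answer if answer in ROUTES else "product_docs"  # fall back to a default index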

Matching and Ranking

Matching involves aligning the query with the appropriately stored information.
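For dense retrieval, matching typically reduces to a nearest-neighbor search over the stored embeddings. A minimal cosine-similarity sketch with NumPy:

    import numpy as np

    def top_k(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5) -> list[int]:
        """Rank stored document vectors by cosine similarity to the query vector."""
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        scores = d @ q
        return np.argsort(-scores)[:k].tolist()  # indices of the k best-matching chunks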

Multi-Hop RAG

In order to effectively answer some queries, evidence may need to be retrieved from multiple documents. This is known as multi-hop RAG.

MultiHop-RAG: Benchmarking Retrieval-Augmented Generation for Multi-Hop Queries provides a dataset for evaluating multi-hop RAG:

"MultiHop-RAG: a QA dataset to evaluate retrieval and reasoning across documents with metadata in the RAG pipelines. It contains 2556 queries, with evidence for each query distributed across 2 to 4 documents. The queries also involve document metadata, reflecting complex scenarios commonly found in real-world RAG applications."


Small to big lookup

Small-to-big retrieval embeds and matches small chunks (such as single sentences) for precision, but passes the larger parent chunk or document that contains the match to the LLM, so generation has richer context. See the Advanced RAG small to big blog linked below.
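A minimal sketch, assuming each small chunk records the id of its parent chunk; `retrieve_small` and `parent_store` are illustrative placeholders for your retriever and document store.

    def small_to_big(query: str, retrieve_small, parent_store: dict[str, str], k: int = 5) -> list[str]:
        """Match on small chunks, then hand their (de-duplicated) parent chunks to the LLM."""
        small_hits = retrieve_small(query, k)            # e.g. [{"text": ..., "parent_id": ...}, ...]
        parent_ids = {hit["parent_id"] for hit in small_hits}
        return [parent_store[pid] for pid in parent_ids]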

Reranking

Reranking re-scores an initial set of retrieved candidates with a stronger model (often a cross-encoder, such as Cohere Rerank) so that the most relevant documents end up at the top before generation. See the CohereRerank example in the pain points section below.
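A hedged sketch of local cross-encoder reranking with sentence-transformers; the model name is illustrative, and a hosted reranker such as Cohere Rerank works the same way conceptually.

    from sentence_transformers import CrossEncoder  # assumes sentence-transformers is installed

    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # illustrative reranking model

    def rerank(query: str, candidates: list[str], top_n: int = 3) -> list[str]:
        """Score each (query, candidate) pair jointly and keep the best top_n."""
        scores = reranker.predict([(query, doc) for doc in candidates])
        ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
        return [doc for doc, _ in ranked[:top_n]]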

Generating responses

The final step is generating an output using a prompt that integrates the query and retrieved data.

Challenges in generating responses can involve:

  • Not having enough information: RAG can help minimize the generation of non-factual responses, but only if the retrieved information provides sufficient context to answer the question properly (see the sketch after this list). If the question cannot be answered with a reasonable degree of certainty, the response should be along the lines of "I don't know."
  • Conflicting information: When retrieved results contain different answers to the same question, a definitive response may not be possible.
  • Stale information: When retrieved information is no longer relevant.
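A minimal sketch of the final generation step, with an explicit instruction to abstain when the retrieved context is insufficient; `llm_complete` is again a hypothetical LLM call.

    ANSWER_PROMPT = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, reply \"I don't know.\"\n\n"
        "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

    def generate_answer(question: str, retrieved_chunks: list[str], llm_complete) -> str:
        context = "\n\n".join(retrieved_chunks)
        return llm_complete(ANSWER_PROMPT.format(context=context, question=question))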

Multimodal RAG

Natural-language lookup with RAG can be improved by incorporating other modalities, such as tables and images. There are several ways this may be accomplished, as described in LangChain's multi-modal RAG:

Option 1:

  • Use multimodal embeddings (such as CLIP) to embed images and text
  • Retrieve both using similarity search
  • Pass raw images and text chunks to a multimodal LLM for answer synthesis

Option 2:

  • Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
  • Embed and retrieve text
  • Pass text chunks to an LLM for answer synthesis

Option 3:

  • Use a multimodal LLM (such as GPT4-V, LLaVA, or FUYU-8b) to produce text summaries from images
  • Embed and retrieve image summaries with a reference to the raw image
  • Pass raw images and text chunks to a multimodal LLM for answer synthesis

  • Multi-Modal: This approach is used for RAG on a substack that has many images of densely packed tables and graphs. Here is an example implementation, and here is one that works with private data.

  • Semi-Structured: This approach is used for RAG on documents with tables, which can be split using naive RAG text-splitting that does not explicitly preserve them. Here is an example implementation.

Evaluating and Comparing

Because there are many ways of performing RAG, it is important to evaluate the quality of the implemented solution.

Rag Arena provides an interface with LangChain for a RAG chatbot experience where queries receive multiple responses.

Resources, Tutorials and Blogs

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks introduces a complete solution for enabling improved response generation with LLMs.

The authors reveal that allowing the models to be fine-tuned when equipped with RAG improved the results.

12 RAG Pain Points and Proposed Solutions

Things that might lead to failure of a RAG pipeline, mostly taken from the blog.

Pain points and proposed solutions:

1: Missing Content:

  • Clean your data
  • Better prompting

2: Missed the Top Ranked Documents

  • Hyperparameter tuning for chunk_size and similarity_top_k as in Hyperparameter Optimization for RAG.
  • The reranking notebook uses Improving Retrieval Performance by Fine-tuning Cohere Reranker with LlamaIndex and CohereRerank to rerank the results:
        import os
        from llama_index.postprocessor.cohere_rerank import CohereRerank

        api_key = os.environ["COHERE_API_KEY"]
        cohere_rerank = CohereRerank(api_key=api_key, top_n=2)  # return the top 2 nodes from the reranker

        # `index` is assumed to be an existing VectorStoreIndex built over the documents.
        query_engine = index.as_query_engine(
            similarity_top_k=10,  # set a high top_k to ensure maximal relevant retrieval
            node_postprocessors=[cohere_rerank],  # pass the reranker to node_postprocessors
        )

        response = query_engine.query(
            "What did Sam Altman do in this essay?",
        )
    

3: Not in Context — Consolidation Strategy Limitations

  • Tweak retrieval strategies
  • Finetune embeddings

4: Not Extracted

5: Wrong Format

  • Output Parsing
  • Pydantic

6: Incorrect Specificity

7: Incomplete and Impartial Responses

8: Data Ingestion Scalability

9: Structured Data QA

10: Data Extraction from Complex PDFs

  • Use pdf2htmlEX
  • Use EmbeddedTablesUnstructuredRetrieverPack in LlamaIndex

11: Fallback Model(s): Use a model router such as Neutrino or OpenRouter

    from llama_index.llms import Neutrino
    from llama_index.llms import ChatMessage

    llm = Neutrino(
        api_key="<your-Neutrino-api-key>",
        # "test" is a router configured in the Neutrino dashboard. Treat a router as an LLM:
        # use your own defined router, or "default" to include all supported models.
        router="test",
    )

    response = llm.complete("What is large language model?")
    print(f"Optimal model: {response.raw['model']}")
    from llama_index.llms import OpenRouter
    from llama_index.llms import ChatMessage

    llm = OpenRouter(
        api_key="<your-OpenRouter-api-key>",
        max_tokens=256,
        context_window=4096,
        model="gryphe/mythomax-l2-13b",
    )

    message = ChatMessage(role="user", content="Tell me a joke")
    resp = llm.chat([message])
    print(resp)

12: LLM Security

Advanced RAG small to big

Blog

Advanced Retrieval Augmented Generation from Theory to LlamaIndex

Blog

RAG vs finetuning