Retrieval Augmented Generation (RAG) - Empowering Generative AI with External Knowledge
Content
- Overview: The problem statement
- How RAG Works
- Advantages of RAG
- Embeddings: The Key to Semantic Understanding in RAG
- Vector Databases: The Backbone of Efficient Embedding Storage and Retrieval
- Endnotes
- References
Overview: The problem statement
Generative AI has taken the world by storm, capable of producing remarkably human-like text, images, and even music. Large language models (LLMs) like GPT-3 have showcased the potential of these systems, but they have a significant limitation: their knowledge is confined to the data they were trained on. This means they can struggle with questions requiring up-to-date information or specialized domain knowledge.
Enter Retrieval Augmented Generation (RAG), a powerful technique that enhances the capabilities of generative AI by incorporating external knowledge sources during the generation process. RAG marries the strengths of:
- Retrieval Systems: Efficiently fetching relevant information from vast databases, documents, or the web.
- Generative Models: Creating coherent and contextually relevant responses.
In essence, RAG empowers AI models to access a wealth of knowledge beyond their initial training data, making them far more versatile, accurate, and adaptable.
 
How RAG Works
At its core, RAG operates on a two-step process:
- Retrieval:
    - When a user submits a query, the RAG system employs a retrieval model to search through a knowledge base (e.g., a collection of documents, a database, or the web) and identify the most relevant passages or snippets of information.
- This retrieval process might involve techniques like semantic search, keyword matching, or even more sophisticated methods based on vector embeddings or neural networks.
 
- Generation:
    - The retrieved information is then fed into a generative AI model, along with the original user query.
- The model leverages this additional context to produce a response that is both informed by the retrieved knowledge and relevant to the user’s query.
 
RAG Code Example
Let’s see a simplified example to grasp the concept of RAG. We will be using Google Gemini Model API to make call to the LLM. Any other llm can be used instead. Visit Google AI Studio(Gemini) page to Get your Google AI Studio API key
# !pip install -q -U google-generativeai
import google.generativeai as genai
import os
#genai.configure(api_key=os.environ["YOUR_API_KEY"])
genai.configure(api_key="YOUR_API_KEY")
# list the various Gemini models to chose any one from
for m in genai.list_models():
  if 'generateContent' in m.supported_generation_methods:
    print(m.name)
Output:–
models/gemini-1.0-pro-latest
models/gemini-1.0-pro
models/gemini-pro
models/gemini-1.0-pro-001
models/gemini-1.0-pro-vision-latest
models/gemini-pro-vision
models/gemini-1.5-pro-latest
models/gemini-1.5-pro-001
models/gemini-1.5-pro
models/gemini-1.5-pro-exp-0801
models/gemini-1.5-pro-exp-0827
models/gemini-1.5-flash-latest
models/gemini-1.5-flash-001
models/gemini-1.5-flash-001-tuning
models/gemini-1.5-flash
models/gemini-1.5-flash-exp-0827
models/gemini-1.5-flash-8b-exp-0827
model = genai.GenerativeModel("gemini-1.5-flash-001",
                              system_instruction=[
                                  "Do not generate the output in markdown format"
    ])
# Formulate a question the LLM might struggle with
question = "Who won the most number of individual gold medals in paris summer olympics 2024?"
response = model.generate_content(question)
initial_response = response.text
print("Initial Response:", initial_response)
Output:–
Initial Response: I do not have access to real-time information, including the results of future events like the 2024 Paris Summer Olympics. 
To find out who won the most gold medals in the 2024 Paris Olympics, you'll have to wait until the games are over and the results are officially announced. You can follow the Olympics on official websites, news channels, and social media for updates. 
We can see that the llm is not able to provide the information about the recent events like olympics 2024. Let’s supply the knowledge base to the llm to equip it with the require information.
knowledge_base = [
    "French swimmer Leon Marchand won the most individual gold medals at the 2024 Paris Olympics.",
    "Chinese swimmer Yefei Zhang won the most individual medals overall with six, including five bronze and one silver.",
    "The United States and China tied for the most gold medals among countries with 40 each, but the US won the most overall medals with 126."
]
retrieved_knowledge = "\n".join(knowledge_base)
# If relevant knowledge is found, augment the prompt and get a new response
if retrieved_knowledge:
    augmented_prompt = f"{question}\n\nContext:\n {retrieved_knowledge}"
    print("Augmented Prompt:-\n\n", augmented_prompt)
    
    rag_response = model.generate_content(augmented_prompt, )
    
    print("\n\nRAG Response:", rag_response.text)
else:
    print("No relevant knowledge found in the dataset.")
Output:–
Augmented Prompt:-
 Who won the most number of individual gold medals in paris summer olympics 2024?
Context:
 French swimmer Leon Marchand won the most individual gold medals at the 2024 Paris Olympics.
Chinese swimmer Yefei Zhang won the most individual medals overall with six, including five bronze and one silver.
The United States and China tied for the most gold medals among countries with 40 each, but the US won the most overall medals with 126.
RAG Response: Leon Marchand 
In this example, we’ve simulated a very basic RAG system. The retrieved information about the top gold medals winning athlete in summer olympics 2024 is provided to the generative model, influencing its response. Of course, in real-world scenarios, the retrieval process would be far more complex and the knowledge base much larger.
Advantages of RAG
- Enhanced Accuracy and Relevance: RAG empowers generative AI to produce responses that are grounded in factual information, reducing the risk of “hallucinations” or generating incorrect information.
- Up-to-date Information: By incorporating external knowledge sources, RAG allows AI models to stay current with the latest developments, even beyond their initial training data.
- Domain Adaptability: RAG systems can be easily adapted to various domains by plugging in different knowledge bases, making them versatile tools for a range of applications.
Embeddings: The Key to Semantic Understanding in RAG
In RAG, the retrieval process involves finding the most relevant passages or information from a knowledge base given a user query. This is where embeddings come into play. Embeddings are a way of representing text (or other data) as dense vectors in a high-dimensional space. The key idea is that semantically similar pieces of text will have embeddings that are close to each other in this space.
How Embeddings Work in RAG
- Pre-processing:
    - Both the documents in your knowledge base and the user query are converted into embeddings using an embedding model.
- There are several pre-trained embedding models available (e.g., Sentence-BERT, Universal Sentence Encoder) that you can use directly or fine-tune on your specific data.
 
- Storage:
    - The embeddings of your knowledge base documents are typically stored in an efficient data structure like an approximate nearest neighbor (ANN) index (e.g., FAISS, Annoy). This allows for fast retrieval of the most similar embeddings even with large knowledge bases.
 
- Retrieval:
    - When a user query comes in, it’s also converted into an embedding.
- This query embedding is then compared to the embeddings in your ANN index to find the nearest neighbors, i.e., the documents whose embeddings are closest to the query embedding in the high-dimensional space.
- These nearest neighbors are assumed to be the most semantically relevant passages to the query and are retrieved from the knowledge base.
 
- Generation:
    - The retrieved passages are then provided as context to the generative AI model, along with the original query, to produce the final response.
 
Advantages of Using Embeddings in RAG
- Semantic Understanding: Embeddings capture the meaning and context of text, enabling the retrieval system to find relevant information even when there’s no exact keyword match.
- Efficiency: ANN indexes allow for fast retrieval even from large knowledge bases.
- Flexibility: You can fine-tune embedding models on your specific data to improve retrieval performance for your domain.
Illustrative Example
Imagine you have a knowledge base containing information about different animals.
- Query: “What is the largest land animal?”
- Documents in Knowledge Base:
    - “The African elephant is the largest land animal, weighing up to 6,000 kg.”
- “The giraffe is the tallest land animal, reaching heights of up to 5.5 meters.”
 
Using embeddings, the RAG system would be able to understand that the query is about the size of land animals and retrieve the relevant document about the African elephant, even though the query doesn’t explicitly mention “elephant.”
In Summary
Embeddings are a crucial component of RAG, allowing for semantic understanding and efficient retrieval of relevant information from large knowledge bases. They enable generative AI models to produce more accurate, informed, and contextually appropriate responses.
Key Considerations:
- Choice of Embedding Model: Select an embedding model suitable for your data and task.
- Fine-tuning: Consider fine-tuning the embedding model on your domain-specific data for improved performance.
- Index Choice: Choose an ANN index that balances speed and accuracy for your needs.
Embeddings Code Example
In this code we read the olympics information from the Wikipedia page for “2024”. That has the latest up-to-date information about the current events.
#!pip install -q -U sentence-transformers
#!pip install -q -U google-generativeai
#!pip install torch
# !pip install tiktoken
from dateutil.parser import parse
import pandas as pd
import requests
from sentence_transformers import SentenceTransformer, util
import numpy as np
import torch
import tiktoken
import google.generativeai as genai
import os
#genai.configure(api_key=os.environ["API_KEY"])
genai.configure(api_key="YOUR_API_KEY")
genai_model = genai.GenerativeModel("gemini-1.5-flash-001",
                              system_instruction=[
                                  "Do not generate the output in markdown format"
    ])
# Get the Wikipedia page for "2024". This will be our knowledge base for the LLM to refer for potential answers
resp = requests.get("https://en.wikipedia.org/w/api.php?action=query&prop=extracts&exlimit=1&titles=2024&explaintext=1&formatversion=2&format=json")
# Load page text into a dataframe
df = pd.DataFrame()
df["text"] = resp.json()["query"]["pages"][0]["extract"].split("\n")
# Clean up text to remove empty lines and headings
df = df[(df["text"].str.len() > 0) & (~df["text"].str.startswith("=="))]
# Prepare Dataset: Loading and Wrangling Data
prefix = ""
for (i, row) in df.iterrows():
    # If the row already has " - ", it already has the needed date prefix
    if " – " not in row["text"]:
        try:
            # If the row's text is a date, set it as the new prefix
            parse(row["text"])
            prefix = row["text"]
        except:
            # If the row's text isn't a date, add the prefix
            row["text"] = prefix + " – " + row["text"]
df = df[df["text"].str.contains(" – ")]
# Generate embeddings
from sentence_transformers import SentenceTransformer
embeddings_model = SentenceTransformer('paraphrase-MiniLM-L6-v2') # or any other suitable model
knowledge_base = df["text"].tolist()
all_embeddings = embeddings_model.encode(knowledge_base)
df['embeddings'] = [x for x in all_embeddings]
# Create a Function that Finds Related Pieces of Text from the knowledge base for a Given Question
def get_rows_sorted_by_relevance(question, df, embedding_model):
    # Get embeddings for the question text
    question_embeddings = embeddings_model.encode(question)
    # Make a copy of the dataframe and add a "distances" column containing
    # the cosine distances between each row's embeddings and the
    # embeddings of the question
    df_copy = df.copy()
    # 5. Find Most Similar Passage (Basic Similarity Search)
    all_embeddings = list(df_copy["embeddings"])
    cosine_similarities = util.cos_sim(question_embeddings, all_embeddings)
    df_copy["distances"] = torch.transpose(cosine_similarities, 0, 1)
    # Sort the copied dataframe by the distances and return it
    # (shorter distance = more relevant so we sort in ascending order)
    df_copy.sort_values("distances", ascending=False, inplace=True)
    return df_copy
# Create a Function that Composes a Text Prompt
# We want to fit as much of our dataset as possible into the "context" part of
# the prompt without exceeding the number of tokens allowed by our model, For a
# safe side we consider the model max context length as 4000 tokens (in reality
# its way more than this and its increasing probably with each day passing)
def create_prompt(question, df, max_token_count):
    """
    Given a question and a dataframe containing rows of text and their
    embeddings, return a text prompt to send to a Completion model
    """
    # Create a tokenizer that is designed to align with our embeddings
    tokenizer = tiktoken.get_encoding("cl100k_base")
    
    # Count the number of tokens in the prompt template and question
    prompt_template = """
Answer the question based on the context below, and if the question
can't be answered based on the context, say "I don't know"
Context: 
{}
---
Question: {}
Answer:"""
    
    current_token_count = len(tokenizer.encode(prompt_template)) + \
                            len(tokenizer.encode(question))
    
    context = []
    for text in get_rows_sorted_by_relevance(question, df, embeddings_model)["text"].values:
        
        # Increase the counter based on the number of tokens in this row
        text_token_count = len(tokenizer.encode(text))
        current_token_count += text_token_count
        
        # Add the row of text to the list if we haven't exceeded the max
        if current_token_count <= max_token_count:
            context.append(text)
        else:
            break
    return prompt_template.format("\n\n###\n\n".join(context), question)
# Create a Function that Answers a Question
def answer_question(
    question, df, max_prompt_tokens=1800, max_answer_tokens=150
):
    """
    Given a question, a dataframe containing rows of text, and a maximum
    number of desired tokens in the prompt and response, return the
    answer to the question according to an OpenAI Completion model
    
    If the model produces an error, return an empty string
    """
    
    prompt = create_prompt(question, df, max_prompt_tokens)
    
    try:
        response = genai_model.generate_content(prompt)
        return response.text
    except Exception as e:
        print(e)
        return ""
    
custom_answer = answer_question("Where were the Olympics held in 2024?", df)
print(custom_answer)    
Output:–
Paris, France
Vector Databases: The Backbone of Efficient Embedding Storage and Retrieval
As we’ve seen, embeddings play a vital role in RAG by representing text as vectors in a high-dimensional space. However, when dealing with large knowledge bases, efficiently storing and retrieving these embeddings becomes crucial. This is where vector databases shine.
Role of Vector Databases in RAG
- Efficient Storage: Vector databases are specifically designed to store and manage high-dimensional vectors. They employ optimized data structures and indexing techniques (like HNSW, IVFADC) to enable fast insertion, updates, and retrieval of embeddings.
- Similarity Search: The core functionality of a vector database is its ability to perform similarity search. Given a query embedding, it can quickly identify the most similar embeddings (i.e., the nearest neighbors) in the database, representing the most relevant passages or documents.
- Scalability: Vector databases are built to handle massive datasets containing millions or even billions of embeddings, making them ideal for RAG applications where the knowledge base can be vast.
Key Points:
- Vector Database Choice: Several vector databases are available, each with its strengths and weaknesses. Popular choices include FAISS, Pinecone, Milvus, Weaviate, and Qdrant. Consider factors like scalability, performance, ease of use, and cloud vs. on-premises deployment when choosing one.
- Integration with RAG: The vector database seamlessly integrates into the RAG pipeline. The retrieval model interacts with the vector database to fetch the relevant documents, which are then passed to the generative model.
In essence, vector databases act as the efficient storage and retrieval engine for embeddings, enabling RAG systems to scale to large knowledge bases and provide fast, accurate responses to user queries.
VectorDB Code Example
Let’s see a simple Python example using FAISS (Facebook AI Similarity Search), a popular open-source library for efficient similarity search and clustering of dense vectors.
# !pip install faiss-cpu
import faiss
import numpy as np
# 1. Sample Embeddings (replace with your actual embeddings)
#embeddings = np.random.rand(1000, 128).astype('float32')  # 1000 embeddings, each of dimension 128
embeddings_list = list(df['embeddings'])
embeddings = np.array(embeddings_list)
# 2. Build FAISS Index
index = faiss.IndexFlatL2(embeddings.shape[1])  # Create an index for L2 distance
index.add(embeddings)           # Add the embeddings to the index
# 3. Query Embedding 
query = "Where were the Olympics held in 2024?"
query_embedding = embeddings_model.encode(query)
query_embedding = np.array(query_embedding).reshape(1, -1)
# 4. Search for Nearest Neighbors
k = 5  # Number of nearest neighbors to retrieve
distances, indices = index.search(query_embedding, k)
# 5. Retrieve Relevant Documents (replace with your actual document retrieval)
#relevant_documents = [f"Document {i}" for i in indices[0]]
relevant_documents = [df.iloc[i]['text'] for i in indices[0]]
print(relevant_documents)
"""
Explanation:
1. We reuse the embeddings from the previous Embeddings code example
2. We create a FAISS index using L2 distance (other distance metrics are available) and add the embeddings to it.
3. We create a sample query embedding.
4. We perform a similarity search to find the k nearest neighbors (documents with the most similar embeddings) to the query embedding.
5. We retrieve the relevant documents based on the indices returned by the search.
"""
Output:–
['July 26 – August 11 – The 2024 Summer Olympics are held in Paris, France. The controversial opening ceremony and the boxing match of Luca Hámori and Imane Khelif spark international debate.',
 ' – 2024 (MMXXIV) is the current year, and is a leap year starting on Monday of the Gregorian calendar, the 2024th year of the Common Era (CE) and Anno Domini (AD) designations, the 24th  year of the 3rd millennium and the 21st century, and the  5th   year of the 2020s decade.  ',
 'August 28 – September 8 – The 2024 Summer Paralympics are held in Paris, France.',
 'June 20 – July 14 – The 2024 Copa América is held in the United States, and is won by Argentina.',
 "October 3–20 – The 2024 ICC Women's T20 World Cup is scheduled to be held in the United Arab Emirates."]
Endnotes
Retrieval Augmented Generation represents a significant step forward in the evolution of generative AI. By bridging the gap between vast external knowledge and the creative prowess of language models, RAG enables AI systems to generate responses that are not only creative but also informed, accurate, and contextually relevant.
As research and development in this field continue, we can expect RAG to play a pivotal role in shaping the future of AI-powered applications, from chatbots and virtual assistants to content creation and decision-support systems.
Let me know if you have any other questions or would like a deeper dive into a specific aspect of RAG!