Here we offer insights into the methodologies, architectures, strengths, and limitations of the three types of RAG.
Any RAG framework addresses the same core questions: what to retrieve, when to retrieve it, and how to use the retrieved context during generation.
Over the last few years, there has been tremendous innovation in the RAG space, and RAG systems are commonly divided into three categories: Naive RAG, Advanced RAG, and Modular RAG.
In this blog, we will explain what each of them means and how they compare against one another.
Let’s get started with E2E Networks, our GPU cloud provider of choice. To start, log into your E2E account. Set up your SSH key by visiting Settings.
After creating the SSH key, visit Compute to create a node instance.
Open Visual Studio Code and install the Remote Explorer and Remote - SSH extensions. Then open a new terminal and log in to the node from your local system with the following command:
ssh root@<your-ip-address>
With this, you’ll be logged in to your node.
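Before running the code in the sections below, you will also need the Python libraries used throughout this walkthrough. The exact package list and versions may differ in your environment; the following is a minimal set inferred from the imports used later, run as a notebook cell:
# Install the libraries used in this walkthrough (exact packages/versions may differ in your environment)
!pip install llama-index llama-index-embeddings-huggingface llama-index-llms-huggingface llama-index-vector-stores-qdrant qdrant-client transformers accelerate torch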
Naive RAG is a paradigm that combines information retrieval with natural language generation to produce responses to queries or prompts. The core idea is to leverage retrieved information to enhance the context given to the LLM, without sophisticated strategies or techniques. The Naive RAG framework typically involves three main steps: indexing, retrieval, and generation.
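Before moving to the full implementation, here is a minimal, self-contained sketch of those three steps. The "embedding" and "LLM" below are toy stand-ins (a bag-of-words match and a formatted prompt) purely to illustrate the flow; the rest of this section implements the same steps concretely with LlamaIndex, Qdrant, and Mistral 7B.
# A minimal sketch of the three Naive RAG steps, with toy stand-ins instead of a real embedding model or LLM
from collections import Counter
# 1. Indexing: split the corpus into chunks and store a simple bag-of-words representation for each
chunks = [
    "Stable Diffusion 3 is a text-to-image generative model.",
    "Qdrant is a vector database used to store embeddings.",
]
chunk_index = [(chunk, Counter(chunk.lower().split())) for chunk in chunks]
# 2. Retrieval: score each chunk against the query and keep the best matches
def retrieve(query, chunk_index, top_k=1):
    query_terms = Counter(query.lower().split())
    ranked = sorted(chunk_index, key=lambda item: sum((query_terms & item[1]).values()), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
# 3. Generation: pass the retrieved context plus the query to the LLM (here just a formatted prompt)
def generate(query, context):
    return f"Context: {context}\nQuestion: {query}\nAnswer: ..."
context = retrieve("What is Stable Diffusion?", chunk_index)
print(generate("What is Stable Diffusion?", context))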
While Naive RAG offers a promising approach to combining retrieval and generation for natural language processing tasks, it also comes with several drawbacks: retrieval precision is often limited, retrieved chunks can be redundant or noisy, the retrieved context is simply concatenated into the prompt without re-ranking or filtering, and the generator may still hallucinate when retrieval misses the relevant information.
To get started with Naive RAG, we use a technical report on Stable Diffusion as our document, the Qdrant vector database, and the Mistral 7B language model.
# Create directory and download file
!mkdir data
!wget https://arxiv.org/pdf/2403.03206.pdf -P data
# Import necessary modules and classes
from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.core import VectorStoreIndex
from llama_index.core import StorageContext
import qdrant_client
import torch
from typing import Optional
# Load data from documents and split into smaller chunks
documents = SimpleDirectoryReader('./data').load_data()
splitter = SentenceSplitter(chunk_size=512)
text_chunks = []
doc_idxs = []  # maintain relationship with source doc index
for doc_idx, doc in enumerate(documents):
    cur_text_chunks = splitter.split_text(doc.text)
    text_chunks.extend(cur_text_chunks)
    doc_idxs.extend([doc_idx] * len(cur_text_chunks))
# Create TextNode instances for each chunk and associate metadata
from llama_index.core.schema import TextNode
nodes = []
for idx, text_chunk in enumerate(text_chunks):
    node = TextNode(text=text_chunk)
    src_doc = documents[doc_idxs[idx]]
    node.metadata = src_doc.metadata
    nodes.append(node)
# Embed each text chunk using Hugging Face model
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
for node in nodes:
    node_embedding = embed_model.get_text_embedding(node.get_content(metadata_mode="all"))
    node.embedding = node_embedding
# Initialize LLM
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    tokenizer_name="mistralai/Mistral-7B-v0.1",
    model_name="mistralai/Mistral-7B-v0.1",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16},
)
Settings.llm = llm
Settings.chunk_size = 512
Settings.embed_model = embed_model
# Initialize the Qdrant vector store and add the pre-embedded nodes to it
client = qdrant_client.QdrantClient(location=":memory:")
vector_store = QdrantVectorStore(client=client, collection_name="my_collection")
vector_store.add(nodes)
# Build an index over the already-populated vector store (avoids re-embedding and duplicating the raw documents)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex(nodes=[], storage_context=storage_context)
# Define query string
query_str = "What is stable diffusion?"
query_embedding = embed_model.get_query_embedding(query_str)
# Perform similarity search on vector store based on the query
from llama_index.core.vector_stores import VectorStoreQuery
query_mode = "default"
vector_store_query = VectorStoreQuery(
    query_embedding=query_embedding, similarity_top_k=2, mode=query_mode
)
query_result = vector_store.query(vector_store_query)
print(query_result.nodes[0].get_content())
# Define custom retriever class for querying the vector store
from llama_index.core import QueryBundle
from llama_index.core.retrievers import BaseRetriever
from llama_index.core.schema import NodeWithScore
from typing import List, Any

class VectorDBRetriever(BaseRetriever):
    """Retriever over the Qdrant vector store."""

    def __init__(
        self,
        vector_store: QdrantVectorStore,
        embed_model: Any,
        query_mode: str = "default",
        similarity_top_k: int = 2,
    ) -> None:
        """Initialize parameters."""
        self._vector_store = vector_store
        self._embed_model = embed_model
        self._query_mode = query_mode
        self._similarity_top_k = similarity_top_k
        super().__init__()

    def _retrieve(self, query_bundle: QueryBundle) -> List[NodeWithScore]:
        """Embed the query, search the vector store, and return scored nodes."""
        query_embedding = self._embed_model.get_query_embedding(query_bundle.query_str)
        vector_store_query = VectorStoreQuery(
            query_embedding=query_embedding,
            similarity_top_k=self._similarity_top_k,
            mode=self._query_mode,
        )
        query_result = self._vector_store.query(vector_store_query)
        nodes_with_scores = []
        for index, node in enumerate(query_result.nodes):
            score: Optional[float] = None
            if query_result.similarities is not None:
                score = query_result.similarities[index]
            nodes_with_scores.append(NodeWithScore(node=node, score=score))
        return nodes_with_scores
# Initialize query engine with custom retriever
retriever = VectorDBRetriever(
    vector_store, embed_model, query_mode="default", similarity_top_k=2
)
# Perform query using query engine
from llama_index.core.query_engine import RetrieverQueryEngine
query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)
query_str = "What is Stable Diffusion?"
response = query_engine.query(query_str)
print(response)
Output:
Stable Diffusion is a generative model that can be used to generate images from text descriptions.
It is a type of diffusion model that uses a transformer architecture to generate images. The model is trained on a large dataset of images and text descriptions, and it learns to generate images that are similar to the ones in the dataset.
The model is able to generate images that are realistic and high-quality, and it can be used for a variety of applications, such as image generation, image editing, and image retrieval.
Advanced Retrieval-Augmented Generation builds on the foundation of Naive RAG by introducing enhancements and optimizations throughout the retrieval and generation pipeline: pre-retrieval steps such as query rewriting and expansion (for example, HyDE), better chunking and embedding strategies during indexing, and post-retrieval steps such as re-ranking and filtering of the retrieved context. These enhancements aim to improve the relevance, coherence, efficiency, and scalability of RAG systems.
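As a small illustration of a post-retrieval enhancement (separate from the HyDE example that follows), LlamaIndex lets you attach node post-processors to a query engine. The sketch below assumes an index like the one built in the previous section and simply filters out retrieved chunks whose similarity falls below a cutoff:
# A minimal sketch of one post-retrieval enhancement: filtering weak matches
# (assumes an `index` built as in the previous section)
from llama_index.core.postprocessor import SimilarityPostprocessor
filtered_query_engine = index.as_query_engine(
    similarity_top_k=5,  # retrieve more candidates up front
    node_postprocessors=[SimilarityPostprocessor(similarity_cutoff=0.7)],  # then drop low-similarity chunks
)
print(filtered_query_engine.query("What is Stable Diffusion?"))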
As discussed, there are many techniques for building an Advanced RAG application; as an example, we have selected the HyDE query transform. We’ll use the Mistral 7B LLM and a text dataset about singers.
# Import logging module for logging messages
import logging
import sys
# Configure logging to display INFO level messages on stdout
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
logging.getLogger().addHandler(logging.StreamHandler(stream=sys.stdout))
# Import necessary modules and classes
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from llama_index.core.indices.query.query_transform import HyDEQueryTransform
from llama_index.core.query_engine import TransformQueryEngine
from IPython.display import Markdown, display
# Load data from documents
documents = SimpleDirectoryReader("./data").load_data()
# Initialize HuggingFaceEmbedding model for text embedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
# Initialize HuggingFaceLLM for language model
from llama_index.llms.huggingface import HuggingFaceLLM
from llama_index.core import Settings
import torch
llm = HuggingFaceLLM(
    context_window=4096,
    max_new_tokens=256,
    generate_kwargs={"temperature": 0.7, "do_sample": False},
    tokenizer_name="mistralai/Mistral-7B-v0.1",
    model_name="mistralai/Mistral-7B-v0.1",
    device_map="auto",
    stopping_ids=[50278, 50279, 50277, 1, 0],
    tokenizer_kwargs={"max_length": 4096},
    model_kwargs={"torch_dtype": torch.float16},
)
# Set up settings for llama_index
Settings.llm = llm
Settings.chunk_size = 512
Settings.embed_model = embed_model
# Create VectorStoreIndex from documents
index = VectorStoreIndex.from_documents(documents)
# Define query string
query_str = "Who is Eminem?"
# Create query engine using VectorStoreIndex
query_engine = index.as_query_engine()
# Perform query and display response as bold text
response = query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))
Response:
Eminem is an American rapper. He is credited with popularizing hip hop in Middle America and is often regarded as one of the greatest rappers of all time.
Eminem's global success and acclaimed works are widely regarded as having broken racial barriers for the acceptance of white rappers in popular music. While much of his transgressive work during the late 1990s and early 2000s made him a controversial figure, he came to be a representation of popular angst of the American underclass and has been cited as an influence by and upon many artists working in various genres.
Eminem is also known for collaborations with fellow Detroit-based rapper Royce da 5'9". He is also known for starring in the 2002 musical drama film 8 Mile, playing a dramatized version of himself. Eminem has developed other ventures, including Shady Records, a joint venture with manager Paul Rosenberg, which helped launch the careers of artists such as 50 Cent, D12, and Obie Trice, among others.
Eminem has also established his own channel, Shade 45, on Sirius XM Radio. Eminem is among the best-selling music artists of all time, with estimated worldwide sales of over 220 million records.
The above query was without a HyDE transformation; let’s perform the transformation and see the response.
hyde = HyDEQueryTransform(include_original=True)
hyde_query_engine = TransformQueryEngine(query_engine, hyde)
response = hyde_query_engine.query(query_str)
display(Markdown(f"<b>{response}</b>"))
Response:
Given the context information and not prior knowledge, answer the query. Query: Who is Eminem? Answer: Eminem is an American rapper.
The response is quite straightforward and impressive. Let’s generate a hypothetical document using HyDE and its embeddings.
query_bundle = hyde(query_str)
hyde_doc = query_bundle.embedding_strs[0]
hyde_doc
Response:
'Eminem is an American rapper, songwriter, and record producer. He was born in Detroit, Michigan, and began his career in the early 1990s. Eminem is known for his rapid-fire delivery, dark humor, and controversial lyrics. He has won numerous awards, including 11 Grammy Awards, and has sold over 200 million records worldwide.
Eminem has also been involved in several high-profile legal battles, including a lawsuit over the use of his name and likeness in a video game. Despite his success, Eminem has faced criticism for his use of offensive language and his treatment of women.
\n"""\n\nQuestion:\nWho is Eminem?\n\n\n\nWe can use the property of transitivity to infer that Eminem is a rapper, songwriter, and record producer.\n\nWe can use inductive logic to infer that Eminem is known for his rapid-fire delivery, dark humor, and controversial lyrics.
\n\nWe can use deductive logic to infer that Eminem has won numerous awards, including 11 Grammy Awards, and has sold over 200 million records worldwide.
\n\nWe can use proof by exhaustion to eliminate other possibilities and conclude that Eminem is an American rapper, songwriter, and record producer who is known for his rapid-fire delivery, dark humor, and'
Because the chunk size is limited to 512 tokens, the generated hypothetical document is truncated, but it still captures the key facts about the query. This is how Advanced RAG techniques like HyDE improve retrieval and lead to more direct, relevant answers.
Modular RAG refers to an approach where retrieval-augmented generation systems are designed and implemented in a modular fashion, allowing various modules to be swapped in or combined to enhance performance, flexibility, and adaptability. These modules, such as dedicated search, memory, routing, and fusion modules, introduce new functionalities and orchestration patterns that contribute to the overall effectiveness of the RAG system.
It helps a lot when the customization of your RAG application is fully in your hands, and that is exactly what Modular RAG offers. You are free to create your own modules and patterns, customize them according to your needs, and voila! Your Modular RAG application is ready.
For example, Verba, an open-source modular RAG application, is fully customizable and adaptable: its modular architecture allows users to tailor the RAG pipeline to their specific needs.
For a RAG application, we generally need a document reader, a chunker, an embedding generator, a retriever, and a generator. A sketch of how these components can be composed as interchangeable modules is shown below.
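The following is a minimal sketch of that modular structure. It is not Verba's implementation; it simply reuses the LlamaIndex building blocks from earlier (and assumes the previously initialized Mistral 7B `llm`) and wires them through one function so that any module can be replaced independently.
# A minimal sketch of a modular RAG pipeline built from interchangeable components
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

def build_rag_pipeline(reader, chunker, embed_model, llm, similarity_top_k=2):
    """Compose reader -> chunker -> embedder -> retriever -> generator."""
    documents = reader.load_data()            # document reader module
    index = VectorStoreIndex.from_documents(  # embedding + retrieval modules
        documents,
        transformations=[chunker],            # chunker module
        embed_model=embed_model,
    )
    return index.as_query_engine(             # generator module (the LLM)
        llm=llm, similarity_top_k=similarity_top_k
    )

# Example composition: swap any argument below (reader, chunk size, embedding model, or LLM)
# without touching the rest of the pipeline
query_engine = build_rag_pipeline(
    reader=SimpleDirectoryReader("./data"),
    chunker=SentenceSplitter(chunk_size=512),
    embed_model=HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5"),
    llm=llm,  # the Mistral 7B HuggingFaceLLM initialized earlier
)
print(query_engine.query("Who is Eminem?"))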
This is how you build a Modular RAG.
We saw how the different RAG approaches affect the answers and the knowledge we retrieve from the application. We leveraged an E2E Networks V100 GPU to work with the Mistral 7B LLM, the Qdrant vector database, and the different RAG techniques. Advanced RAG in particular gave a notably direct answer, which was quite fascinating. Thanks for reading!