Want to extract knowledge from complex PDF documents? We explain the steps in this article.
In today's data-driven landscape, enterprises are inundated with unstructured data—emails, documents, images, and more—comprising approximately 80% of all data and growing at an annual rate of 55-65%. Despite its volume, a staggering 90% of this data remains unanalyzed, representing a significant untapped resource for organizations. The International Data Corporation (IDC), in a report titled Data Age 2025, has projected that the global datasphere will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025 (one Zettabyte is 10^21 bytes). The majority of this data will be unstructured, with only 10% being stored, and even less analyzed.
PDFs are a gold mine of unstructured data, containing immense untapped value in the form of annual reports, presentations, and research documents. These files often blend text with graphs, charts, tables, and other visuals that enrich the content. Extracting knowledge from such complex documents has long been a challenge. Traditional methods like OCR or PDF-to-text conversion tools frequently fall short, struggling with precision and accuracy when handling these intricately formatted elements.
Enter Multimodal AI, particularly Multimodal Retrieval-Augmented Generation (RAG) systems, which use visual embedding models like ColPali. These systems can process and understand multiple data types, such as text and images, providing a powerful way to unlock insights from unstructured information. By leveraging AI models trained on both textual and visual data, these systems can be used to interpret complex documents with far greater accuracy and effectiveness than traditional methods.
Multimodal RAG systems hold vast potential, with applications across a wide range of industries. In the Banking, Financial Services, and Insurance (BFSI) sector, unstructured data—such as balance sheets and financial reports—constitutes a significant portion of information assets. Manually processing these documents is slow and error-prone. Automating tasks like comparing and analyzing balance sheets with AI tools can dramatically enhance accuracy and efficiency. We've experienced these efficiency gains firsthand while developing a Multimodal RAG system capable of analyzing annual reports, extracting insights, and comparing companies based on their performance across various metrics.
Recognizing these opportunities, the BFSI sector is actively exploring AI integration to streamline operations and enhance decision-making. JPMorgan, for example, reportedly plans to increase its annual tech spending by $1.5 billion, bringing it to $17 billion in 2024. Similarly, Bank of America has allocated $4 billion this year for new tech initiatives, including generative AI development.
In this article, we explore Multimodal RAG and ColPali, an emerging approach to generating embeddings for complex visual documents. We will discuss the underlying technology, how it can be used for document analysis and knowledge extraction, the challenges of implementing it, and strategies for scaling. By applying this technology to unstructured data use cases within your organization, you can unlock critical insights and gain a competitive advantage.
All traditional RAG systems can be broken down into three key parts: ingestion, retrieval, and generation. During the ingestion process, data is converted into embeddings using an embedding model and inserted into vector search engines. When a user submits a query, the query is also converted into an embedding using the same embedding model. A similarity search is then performed in the vector space with the query embedding to identify data embeddings that closely match the query vector. The retrieved results are then optionally reranked or presented directly to the LLM for response generation.
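To make this concrete, here is a minimal sketch of that loop for plain text, using sentence-transformers and cosine similarity as illustrative stand-ins for a production embedding model and vector search engine (the model name and the toy in-memory "index" are assumptions, not part of any specific RAG stack):

from sentence_transformers import SentenceTransformer
import numpy as np

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedding model

# Ingestion: embed the documents and keep them in a toy in-memory "index".
docs = ["Revenue grew 12% year over year.", "The company opened three new plants."]
doc_embeddings = embedder.encode(docs, normalize_embeddings=True)

# Retrieval: embed the query with the same model and rank documents by cosine similarity.
query = "How did revenue change?"
query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
top_doc = docs[int(np.argmax(doc_embeddings @ query_embedding))]

# Generation: the retrieved context is handed to an LLM as part of the prompt.
prompt = f"Answer using this context:\n{top_doc}\n\nQuestion: {query}"

A real system would swap the in-memory list for a vector search engine and typically add reranking before generation.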
In traditional RAG systems, different embedding models are used for text, images, or audio, and each modality is embedded in its own vector space. This approach quickly breaks down for documents like PDFs, which may combine graphs, charts, and text on a single page or across multiple pages. OCR systems also struggle in such scenarios.
For instance, a chart within a document may visually present critical insights, while the accompanying text provides context or explanation. To extract information effectively, the semantic representation of the chart must align seamlessly with the meaning conveyed by the text. When this alignment is weak or missing, the result is fragmented or incomplete understanding, which hampers downstream tasks such as retrieval, summarization, and data analysis.
Multimodal RAG is an advanced AI approach that extends traditional RAG systems by integrating Multimodal Large Language Models (LLMs). While conventional AI systems are restricted to processing a single data type, such as text, the real world is inherently multimodal, comprising information conveyed through text, images, videos, audio, and more. Multimodal RAG harnesses generative AI's capabilities to process multiple data types simultaneously, mirroring the way humans perceive the world through different senses.
In Multimodal RAG, the embeddings encode information from various modalities—such as text, images, and audio—into a unified vector space. This unified representation allows the model to draw deeper connections between different types of data, enhancing its ability to understand and synthesize complex, interrelated information. As a result, you achieve a more comprehensive and accurate extraction of insights from multimedia-rich documents and datasets.
The key challenge in building Multimodal RAG systems is embedding generation.
One of the biggest business use cases for Multimodal RAG is building AI systems that can understand PDFs and other documents. Classic approaches to generating multimodal embeddings, such as CLIP or ImageBind, work well for general image-text associations (such as search-by-image), but they fall short on document data.
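To see what that looks like in practice, here is a minimal sketch of CLIP-style image-text matching using the Hugging Face transformers API; the checkpoint name, file path, and captions are illustrative:

from transformers import CLIPModel, CLIPProcessor
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("chart.png")  # hypothetical image file
texts = ["a bar chart of quarterly revenue", "a photo of a cat"]

# Score the image against each caption in CLIP's shared image-text space.
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
print(outputs.logits_per_image.softmax(dim=-1))

This works well for natural images and short captions, but a single pooled vector per page loses the fine-grained text and layout detail that dense, report-style documents depend on.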
This is where ColPali comes in, a groundbreaking approach to document retrieval that leverages Vision Language Models (VLMs) to efficiently index and retrieve information from documents based solely on their visual features.
ColPali works by processing document images through a vision encoder and language model, which then generates high-quality contextualized embeddings without relying on traditional OCR methods. ColPali has demonstrated superior performance on benchmarks like the Visual Document Retrieval Benchmark (ViDoRe), and outperforms existing retrieval pipelines. ColPali is also trainable and its low latency makes it suitable for real-time applications in various domains.
ColPali directly works with images of document pages. You convert your PDFs to images, and then use ColPali to generate embeddings.
To use ColPali, you first create an image of each page of a PDF instead of extracting text. Think of it as taking a screenshot of the page. ColPali then splits each page into small pieces (grids) to capture fine details: each image is divided into a 32x32 grid, resulting in 1024 small patches, like dividing the page into 1024 "mini-images."
Each grid patch is then processed by PaliGemma-3B, a powerful vision language model, and the resulting patch embeddings live in a unified vector space shared with text queries. This approach eliminates the need for OCR and allows ColPali to handle complex documents containing text, tables, figures, and layouts more effectively.
When a search query arrives, it is converted into the same kind of vector representation. ColPali then compares the query with the document using a late interaction model, meaning it looks at the relationship between each query token and every patch in the image. It calculates how well each patch aligns with each query token and scores the document accordingly, which lets it surface results that are highly relevant to the query.
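Here is a rough NumPy sketch of that late-interaction (MaxSim) scoring idea; the shapes are placeholders for a handful of query-token embeddings and the per-patch page embeddings, not ColPali's actual API:

import numpy as np

# Placeholder shapes: 6 query-token embeddings and 1024 patch embeddings, 128-dim each.
query_tokens = np.random.randn(6, 128)
page_patches = np.random.randn(1024, 128)

# Late interaction: for every query token, find its best-matching patch,
# then sum those best matches to get the page's relevance score.
similarities = query_tokens @ page_patches.T   # (6, 1024) token-patch similarities
score = similarities.max(axis=1).sum()
print(score)

Because every query token gets to pick its own best patch, a question about a specific figure can latch onto the exact region of the page where that figure appears.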
ColPali is built on Google's PaliGemma-3B, a Vision Language Model (VLM) trained on a diverse dataset. ColPali itself was then trained to distinguish between pages that are relevant and irrelevant to specific queries.
The training dataset consisted of approximately 127,460 query-page pairs, drawn from a mix of openly available academic datasets and a synthetic set of web-crawled PDF documents paired with VLM-generated queries.
This diverse training set enables ColPali to effectively handle various document structures and content types.
The key advantages of using ColPali are its accuracy and its simplified workflow.
In simple terms, think of ColPali as a tool that "sees" and "reads" a document page like a human would, using both the text and the visual layout to find relevant information. Instead of converting everything to plain text, it processes the document as an image, making it highly effective for complex, image-heavy documents like PDFs.
Let’s now go through a simple implementation of ColPali using Byaldi. Byaldi is a lightweight wrapper around the ColPali repository, designed to simplify the use of late-interaction multimodal models like ColPali by providing a familiar API.
In this implementation, we will use Google’s Gemini 1.5 Flash as the language model, and ColPali for embedding generation.
Let's start by installing the required libraries and packages. Apart from Byaldi, we will also install poppler-utils, which pdf2image needs to convert PDF pages into images.
!pip install byaldi
!sudo apt-get update
!sudo apt-get install -y poppler-utils
!python -m pip install git+https://github.com/huggingface/transformers
!pip install -qU pdf2image
!pip install google-generativeai
You will also need a Hugging Face token to download the model from Hugging Face, and a Gemini API key. Save them in a .env file or set them via os.environ.
import base64
import os
os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN" # to download the ColPali model
os.environ["GEMINI_API_KEY"] = "YOUR_GEMINI_API_KEY"
Now, let’s set up the Gemini model.
import google.generativeai as genai
generation_config = {
    "temperature": 0.0,
    "top_p": 0.95,
    "top_k": 64,
    "max_output_tokens": 1024,
    "response_mime_type": "text/plain",
}
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel(model_name="gemini-1.5-flash", generation_config=generation_config)
Now let’s set up ColPali.
from byaldi import RAGMultiModalModel
RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2", verbose=1)
Now we can index the documents using the utility function provided by Byaldi.
RAG.index(
    input_path="/home/ubuntu/multimodal_rag/multimodal_byaldi/VF_FY2023_Environmental_Social_Responsibility_Report_FINAL_removed.pdf",
    index_name="sustainability",
    store_collection_with_index=False,
    overwrite=True,
)
The key decision you need to make here is whether to set store_collection_with_index to True or False. Setting it to True simplifies your workflow significantly: the query results will include the base64-encoded versions of relevant documents, allowing you to directly feed them into your LLM.
However, this option increases memory and storage usage for your index. If you have limited resources, it's better to keep the default setting (False) and generate the base64-encoded versions on demand. We will follow the latter approach.
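If you later need base64-encoded pages anyway (for example, for an LLM API that expects base64 images rather than PIL objects), you can generate them on demand. Here is a minimal sketch, where pil_to_base64 is our own helper and not part of Byaldi; Gemini accepts PIL images directly, so we won't actually need it below:

import base64
from io import BytesIO

def pil_to_base64(image) -> str:
    # Serialize the PIL page image to an in-memory JPEG, then base64-encode it.
    buffer = BytesIO()
    image.save(buffer, format="JPEG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")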
Now you can launch a query and prompt in the following way:
query = 'What is described in the company purpose?'
prompt = "Explain what is shown in this image. List in bullet points."
Once you've created or loaded an index, you can start searching for relevant documents.
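If you come back to this index in a later session, Byaldi can also reload it by name instead of re-indexing; a quick sketch, assuming the default index root and the index name we created above:

RAG = RAGMultiModalModel.from_index("sustainability")

With the index built or loaded, the search itself is a single call: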
results = RAG.search(query, k=1)
# page_num is 1-based, while the list of page images we build below is 0-based
image_index = results[0]["page_num"] - 1
Since we did not save the collection, we need to get the right PDF page as an image, so that we can use the VLM Gemini 1.5 Flash for response generation.
We now use pdf2image to convert the PDF into a list of images, where each element corresponds to a page from the PDF.
from pdf2image import convert_from_path
images = convert_from_path("/home/ubuntu/multimodal_rag/multimodal_byaldi/VF_FY2023_Environmental_Social_Responsibility_Report_FINAL_removed.pdf")
Once this is done, we can also display the results our RAG search retrieved.
from IPython.display import display
display(images[image_index])
We can now use the retrieved page image to generate a response from the multimodal Gemini 1.5 Flash model:
def get_answer(prompt: str, image) -> str:
    # Send the prompt and the PIL page image to Gemini in a single multimodal request.
    response = model.generate_content([prompt, image])
    return response.text

answer = f"Gemini Response:\n{get_answer(prompt, images[image_index])}"
print(answer)
You can now try to tweak the above basic tutorial for your use-case, and implement a basic ColPali-powered RAG in your application.
Like any large retrieval system, ColPali has challenges with computing power and storage: each page is represented by roughly a thousand patch-level embeddings rather than a single vector, which inflates index size, and late-interaction scoring against all of those vectors is expensive at query time. To use ColPali in production, you would need additional tactics to make it scale to a large number of documents.
There are different tactics to manage the space and computation requirements. One of them is Binary Quantization.
Binary Quantization (BQ) converts high-dimensional floating-point vector embeddings into binary (0s and 1s) values. This is achieved by mapping all positive numbers to 1 and all zero or negative numbers to 0.
Here is an example:
Original vector: [0.58768, -0.37768, 1.29891, 0.0]
Binary representation: [1, 0, 1, 0]
In BQ, we essentially convert a 32-bit floating-point vector to just 1 bit per dimension. This is what leads to space saving.
For example, OpenAI embeddings with 1536 dimensions typically require 6 kB per vector (1536 dimensions × 4 bytes). After binary quantization, the same vector requires only 192 bytes (1536 bits), reducing storage by 32x.
Additionally, binary operations (AND, OR, XOR) are much faster than floating-point calculations, which can yield up to a 40x speed improvement in search and retrieval. This is particularly beneficial in large-scale vector search tasks like nearest neighbor searches.
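A minimal NumPy sketch of the idea: binarize a vector by sign, pack the bits, and compare two packed vectors with XOR plus a bit count (Hamming distance). The vectors here are random placeholders:

import numpy as np

def binarize(vec: np.ndarray) -> np.ndarray:
    # Map positive values to 1 and everything else to 0, then pack 8 bits into each byte.
    return np.packbits(vec > 0)

a = np.random.randn(1536).astype(np.float32)   # e.g. a 1536-dim embedding, 6144 bytes
b = np.random.randn(1536).astype(np.float32)

a_bits, b_bits = binarize(a), binarize(b)      # 192 bytes each after packing
# Hamming distance via XOR + bit count; a smaller distance means more similar vectors.
hamming = np.unpackbits(np.bitwise_xor(a_bits, b_bits)).sum()
print(a_bits.nbytes, hamming)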
Since we are collapsing 32-bit floating-point numbers down to single bits, doesn't binary quantization lead to a loss of precision? It does, to some extent, but in many scenarios it may not matter. Here are some reasons why:
Vector embeddings are often larger than necessary for search tasks, as they are optimized for ranking and clustering. BQ exploits this redundancy, reducing the precision stored per dimension without significantly impacting accuracy.
Using BQ, you can first perform a fast, approximate search using binary vectors. You can then refine the subset of results using the original high-precision vectors.
Example: if you ask for the top 100 results with an oversampling factor of 2.0, BQ first selects 200 candidates in binary space and then uses the original vectors to rerank them and keep the best 100, balancing speed and accuracy.
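A sketch of that two-stage search on placeholder data: a fast binary pass over Hamming distances selects an oversampled candidate set, and the original float vectors rescore only those candidates:

import numpy as np

def binarize(matrix: np.ndarray) -> np.ndarray:
    return np.packbits(matrix > 0, axis=1)

vectors = np.random.randn(10_000, 1536).astype(np.float32)   # placeholder corpus
query = np.random.randn(1536).astype(np.float32)

packed = binarize(vectors)
packed_query = binarize(query[None, :])[0]

# Stage 1: approximate search in binary space (oversampling 2.0 for a top-100 request).
hamming = np.unpackbits(np.bitwise_xor(packed, packed_query), axis=1).sum(axis=1)
candidates = np.argsort(hamming)[:200]

# Stage 2: rescore only the 200 candidates with the original float vectors.
scores = vectors[candidates] @ query
top_100 = candidates[np.argsort(-scores)[:100]]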
In scalar quantization, you reduce 32-bit floats to 8-bit integers, a 4x reduction that preserves more precision than binary quantization. If binary quantization isn't giving the results you need, scalar quantization is a good middle ground.
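For completeness, a small sketch of symmetric int8 scalar quantization; production vector databases calibrate the scale for you, so this is only to illustrate the idea:

import numpy as np

vec = np.random.randn(1536).astype(np.float32)      # placeholder embedding

# Map the float range onto int8 with a single scale factor (symmetric quantization).
scale = np.abs(vec).max() / 127.0
quantized = np.round(vec / scale).astype(np.int8)   # 1536 bytes instead of 6144
restored = quantized.astype(np.float32) * scale     # approximate reconstruction
print(quantized.nbytes, np.abs(vec - restored).max())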
Binary quantization can drastically reduce both the storage requirements and the computational complexity of a ColPali-powered RAG pipeline. ColPali stores many patch-level vectors per page rather than a single vector per document, so shrinking each vector to one bit per dimension compounds across the whole index, and the late-interaction comparisons become cheap bitwise operations.
As a next step, if you are facing challenges around implementing ColPali, try to use Binary Quantization or Scalar Quantization.
Several open source projects have emerged that leverage ColPali; Byaldi, the lightweight wrapper we used above, is one notable community example.
You can also directly work with ColPali, and skip using any framework.
The power of ColPali-powered Multimodal RAG lies in its ability to unlock insights from visually complex and multimodal documents. Here are some key use cases where this technology can deliver significant value:
In the BFSI sector, reports, balance sheets, and regulatory documents often contain a mix of text, tables, and charts. ColPali can automate the retrieval and analysis of these documents. For example, automating the analysis of annual reports can reduce the manual effort analysts need to put in and improve the accuracy of financial assessments.
Legal documents, such as contracts, case files, and court records, often combine text with tables, exhibits, and annotations. ColPali can help you search and analyze these documents without first flattening them to plain text.
Medical records, research papers, and diagnostic reports often contain a combination of text, charts, and medical images. With ColPali, you can retrieve and reason over all of these together.
For example, analyzing a patient’s medical history alongside diagnostic charts could aid in faster and more accurate diagnoses.
Technical manuals, schematics, and quality reports frequently include complex diagrams, charts, and instructions. ColPali enables you to search these materials directly, diagrams and all.
This capability can streamline operations and reduce downtime in manufacturing processes.
Market research reports often combine text with charts, infographics, and data tables. Using ColPali, you can query these reports directly, visuals included.
For instance, quickly retrieving insights from market trend analyses can give your organization a competitive edge.
Environmental, Social, and Governance (ESG) reports are rich with data visualizations, tables, and explanatory text. ColPali helps you retrieve and analyze this material without losing what the visuals convey.
Automating ESG analysis can ensure more accurate reporting and compliance tracking.
ColPali-powered Multimodal RAG represents a transformative step in document analysis and knowledge extraction. By leveraging vision-language models and unified vector spaces, you can overcome the limitations of traditional text-based retrieval methods. Whether you’re analyzing financial reports, legal documents, or medical records, ColPali allows you to unlock insights from multimodal data with unprecedented accuracy and efficiency.
By implementing ColPali in your SaaS application, and optimizing performance through techniques like Binary Quantization, you can scale this solution to handle vast amounts of complex data. As enterprises continue to generate more unstructured and multimodal data, adopting ColPali-powered Multimodal RAG will be key to maintaining a competitive advantage and driving informed decision-making.
If you are an enterprise or a SaaS organization looking to integrate ColPali-powered Multimodal RAG into your application or product, you can reach out to us. At Superteams.ai, we partner with your team to develop and deploy cutting-edge AI solutions tailored to your business needs. Our approach helps you build AI features or launch AI products without needing to build a full-scale in-house AI team. To learn more, reach out to us today.