Updated on Dec 17, 2024

How to Extract Knowledge from Complex PDF Documents Using Multimodal RAG Powered By ColPali

Want to extract knowledge from complex PDF documents? We explain the steps in this article.


In today's data-driven landscape, enterprises are inundated with unstructured data—emails, documents, images, and more—comprising approximately 80% of all data and growing at an annual rate of 55-65%. Despite its volume, a staggering 90% of this data remains unanalyzed, representing a significant untapped resource for organizations. The International Data Corporation (IDC), in a report titled Data Age 2025, has projected that the global datasphere will grow from 33 zettabytes in 2018 to 175 zettabytes by 2025 (one Zettabyte is 10^21 bytes). The majority of this data will be unstructured, with only 10% being stored, and even less analyzed.

PDFs are a gold mine of unstructured data, containing immense untapped value in the form of annual reports, presentations, and research documents. These files often blend text with graphs, charts, tables, and other visuals that enrich the content. Extracting knowledge from such complex documents has long been a challenge. Traditional methods like OCR or PDF-to-text conversion tools frequently fall short, struggling with precision and accuracy when handling these intricately formatted elements.

Image showing an example of a complex PDF, where text, graphics, and charts are inherently interwoven.

Enter Multimodal AI, particularly Multimodal Retrieval-Augmented Generation (RAG) systems, which use visual embedding models like ColPali. These systems can process and understand multiple data types, such as text and images, providing a powerful way to unlock insights from unstructured information. By leveraging AI models trained on both textual and visual data, these systems can be used to interpret complex documents with far greater accuracy and effectiveness than traditional methods.

Multimodal RAG systems hold vast potential, with applications across a wide range of industries. In the Banking, Financial Services, and Insurance (BFSI) sector, unstructured data—such as balance sheets and financial reports—constitutes a significant portion of information assets. Manually processing these documents is slow and error-prone. Automating tasks like comparing and analyzing balance sheets with AI tools can dramatically enhance accuracy and efficiency. We've experienced these efficiency gains firsthand while developing a Multimodal RAG system capable of analyzing annual reports, extracting insights, and comparing companies based on their performance across various metrics.

Recognizing these opportunities, the BFSI sector is actively exploring AI integration to streamline operations and enhance decision-making. JPMorgan, for example, has reportedly said that it plans to increase its annual tech spending by $1.5 billion, reaching $17 billion in 2024. Similarly, Bank of America has allocated $4 billion this year for new tech initiatives, including generative AI development.

In this article, we explore Multimodal RAG and an emerging approach to generating embeddings for complex visual documents using ColPali. We will discuss the underlying technology, how it can be used for document analysis and knowledge extraction, the challenges of its implementation, and strategies for scaling. By applying this technology to unstructured data use cases within your organization, you can unlock critical insights and achieve a competitive advantage.




What is Multimodal RAG?

All traditional RAG systems can be broken down into three key parts: ingestion, retrieval, and generation. During the ingestion process, data is converted into embeddings using an embedding model and inserted into vector search engines. When a user submits a query, the query is also converted into an embedding using the same embedding model. A similarity search is then performed in the vector space with the query embedding to identify data embeddings that closely match the query vector. The retrieved results are then optionally reranked or presented directly to the LLM for response generation.
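
To make that flow concrete, here is a minimal sketch of the three stages in Python; embed, vector_store, and llm are hypothetical placeholders rather than any specific library's API.

# Minimal sketch of a traditional RAG pipeline with hypothetical placeholder objects.

def ingest(documents, embed, vector_store):
    # Ingestion: embed each chunk and insert it into the vector search engine
    for doc_id, chunk in documents:
        vector_store.insert(doc_id, embed(chunk))

def answer(query, embed, vector_store, llm, k=5):
    # Retrieval: embed the query with the same model and find the k closest chunks
    hits = vector_store.search(embed(query), top_k=k)
    context = "\n\n".join(hit.text for hit in hits)
    # Generation: pass the retrieved context to the LLM (reranking could happen before this step)
    return llm(f"Answer using this context:\n{context}\n\nQuestion: {query}")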

In traditional RAG systems, we use different embedding models for text, image, or audio, and embed each modality in a separate vector space. However, this approach quickly breaks down for documents like PDFs, which may contain graphs, charts, and text on a single page or across multiple pages. OCR systems also fail in such scenarios.

Traditional approach to embedding generation

For instance, a chart within a document may visually present critical insights, while the accompanying text provides context or explanation. To extract information effectively, the semantic representation of the chart must align seamlessly with the meaning conveyed by the text. When this alignment is weak or missing, the result is fragmented or incomplete understanding, which hampers downstream tasks such as retrieval, summarization, and data analysis.

PDF Document Page with Complex Data.

Multimodal RAG is an advanced AI approach that extends traditional RAG systems by integrating Multimodal Large Language Models (LLMs). While conventional AI systems are restricted to processing a single data type, such as text, the real world is inherently multimodal, comprising information conveyed through text, images, videos, audio, and more. Multimodal RAG harnesses generative AI's capabilities to process multiple data types simultaneously, mirroring the way humans perceive the world through different senses.

In Multimodal RAG, the embeddings encode information from various modalities—such as text, images, and audio—into a unified vector space. This unified representation allows the model to draw deeper connections between different types of data, enhancing its ability to understand and synthesize complex, interrelated information. As a result, you achieve a more comprehensive and accurate extraction of insights from multimedia-rich documents and datasets.

The key challenge in building Multimodal RAG systems is the embedding generation. This is where ColPali comes in.




ColPali and Multimodal RAG Systems

One of the biggest business use cases of Multimodal RAG is building AI systems that can understand PDFs and documents. Classic approaches to generating multimodal embeddings, such as CLIP or ImageBind, work well for general image-text associations (such as search-by-image), but they don’t really help with document data.

This is where ColPali comes in: a groundbreaking approach to document retrieval that leverages Vision Language Models (VLMs) to efficiently index and retrieve information from documents based solely on their visual features.

ColPali works by processing document images through a vision encoder and language model, which then generates high-quality contextualized embeddings without relying on traditional OCR methods. ColPali has demonstrated superior performance on benchmarks like the Visual Document Retrieval Benchmark (ViDoRe), and outperforms existing retrieval pipelines. ColPali is also trainable and its low latency makes it suitable for real-time applications in various domains.  

ColPali directly works with images of document pages. You convert your PDFs to images, and then use ColPali to generate embeddings.

ViDoRe benchmark dataset. Source.



How Does ColPali Work?

To use ColPali, you first create an image of each page of a PDF instead of extracting text. Think of it as taking a screenshot of the page. ColPali then splits each page into small pieces (grids) to capture fine details. Each image is divided into a 32x32 grid, resulting in 1024 small patches, like dividing the page into 1024 "mini-images."

Each grid patch is then processed by PaliGemma-3B, a powerful multimodal vision language model whose Vision Transformer encoder produces patch features that the language model contextualizes, and the resulting embeddings live in a vector space shared with text queries. This approach eliminates the need for OCR and allows ColPali to handle complex documents containing text, tables, figures, and layouts more effectively.
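
As a rough illustration of the patching step (this is not ColPali's actual preprocessing code; the vision encoder does this internally), assuming PaliGemma's 448x448 input resolution and 14x14-pixel patches, a page image decomposes into exactly the 32x32 grid of 1024 patches described above:

import numpy as np

# Conceptual illustration only: the vision encoder inside PaliGemma patches images internally.
page = np.zeros((448, 448, 3), dtype=np.uint8)          # one rendered PDF page
patch = 14
grid = 448 // patch                                     # 32 patches per side
patches = page.reshape(grid, patch, grid, patch, 3).swapaxes(1, 2).reshape(-1, patch, patch, 3)
print(patches.shape)                                    # (1024, 14, 14, 3) -> one embedding per patch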

ColPali’s architecture. Source.

When a search query is presented, it is converted into the same kind of multi-vector representation. ColPali then compares the query with the document using a late-interaction model, meaning it looks at the relationship between each query token and every patch in the image. It calculates how well each patch aligns with each token and scores the document accordingly, which lets it surface results that are highly relevant to the query.
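
Under the hood, this is the MaxSim late-interaction scoring popularized by ColBERT. A minimal NumPy sketch with random placeholder embeddings (128 dimensions, the embedding size reported for ColPali) illustrates the idea:

import numpy as np

# MaxSim late-interaction scoring sketch with random placeholder embeddings.
rng = np.random.default_rng(0)
query_emb = rng.normal(size=(12, 128))     # one vector per query token (12 is arbitrary)
page_emb = rng.normal(size=(1024, 128))    # one vector per image patch

# Normalize so dot products behave like cosine similarities
query_emb /= np.linalg.norm(query_emb, axis=1, keepdims=True)
page_emb /= np.linalg.norm(page_emb, axis=1, keepdims=True)

sim = query_emb @ page_emb.T               # (12, 1024) token-to-patch similarities
score = sim.max(axis=1).sum()              # best patch per query token, summed over tokens
# Pages are ranked by this score: higher means more relevant to the query.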

Model Training

ColPali is built on Google's PaliGemma-3B, a Vision Language Model (VLM) trained on a diverse dataset. It was then fine-tuned to distinguish between pages that are relevant and irrelevant to specific queries.

The training dataset consisted of approximately 127,460 query-page pairs, sourced from:

  1. Academic Datasets (63%): Openly available academic documents, including research papers and scientific articles, providing complex layouts with text, tables, and figures.
  2. Synthetic Data (37%): Pages from web-crawled PDF documents, augmented with pseudo-questions generated by Vision Language Models (e.g., Claude-3 Sonnet), to simulate diverse query-page relevance scenarios.

This diverse training set enables ColPali to effectively handle various document structures and content types.

Advantages of Using ColPali

The key advantages of using ColPali are accuracy and a simplified workflow:

  • Simplified workflow: You don’t need complex preprocessing steps like OCR or advanced approaches to text chunking.
  • Preserves full context: Works directly with entire page images, maintaining visual and textual context.
  • Captures both text and visuals: Retrieves information based on both the content and layout of the document.
  • Efficient and accurate retrieval: Uses fine-grained matching to find the most relevant information in visually complex documents.

In simple terms, think of ColPali as a tool that "sees" and "reads" a document page like a human would, using both the text and the visual layout to find relevant information. Instead of converting everything to plain text, it processes the document as an image, making it highly effective for complex, image-heavy documents like PDFs.




Simple ColPali implementation using Byaldi

Let’s now go through a simple implementation of ColPali using Byaldi. Byaldi is a lightweight wrapper around the ColPali repository, designed to simplify the use of late-interaction multimodal models like ColPali by providing a familiar API.

In this implementation, we will use Google’s Gemini 1.5 Flash as the language model, and ColPali for embedding generation.

Step 1 - Install requirements:

Let’s start by installing the required libraries and packages. Apart from Byaldi, we will also install poppler-utils, which pdf2image needs for PDF-to-image conversion.

!pip install byaldi
!sudo apt-get update
!sudo apt-get install -y poppler-utils
!python -m pip install git+https://github.com/huggingface/transformers
!pip install -qU pdf2image
!pip install google-generativeai

Step 2 - Set environment variables:

You will also need a Hugging Face token to download the ColPali model from Hugging Face, and a Gemini API key. Save them in a .env file or set them via os.environ.

import base64
import os

os.environ["HF_TOKEN"] = "YOUR_HF_TOKEN" # to download the ColPali model
os.environ["GEMINI_API_KEY"] = "YOUR_GEMINI_API_KEY"

Step 3 - Set up the Multimodal Language Model:

Now, let’s set up the Gemini model.

import google.generativeai as genai
 
generation_config = {
  "temperature": 0.0,
  "top_p": 0.95,
  "top_k": 64,
  "max_output_tokens": 1024,
  "response_mime_type": "text/plain",
}
 
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
 
model = genai.GenerativeModel(model_name="gemini-1.5-flash", generation_config=generation_config)

Step 4 - Load the pretrained ColPali model:

Now let’s set up ColPali.

from byaldi import RAGMultiModalModel

RAG = RAGMultiModalModel.from_pretrained("vidore/colpali-v1.2", verbose=1)

Step 5 - Index Documents:

Now we can index the documents using the utility function provided by Byaldi.

RAG.index(
    input_path="/home/ubuntu/multimodal_rag/multimodal_byaldi/VF_FY2023_Environmental_Social_Responsibility_Report_FINAL_removed.pdf",
    index_name="sustainability",
    store_collection_with_index=False,
    overwrite=True
)

The key decision you need to make here is whether to set store_collection_with_index to True or False. Setting it to True simplifies your workflow significantly: the query results will include the base64-encoded versions of relevant documents, allowing you to directly feed them into your LLM.

However, this option increases memory and storage usage for your index. If you have limited resources, it's better to keep the default setting (False) and generate the base64-encoded versions on demand. We will follow the latter approach.
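
For comparison, here is a hedged sketch of the store_collection_with_index=True variant. The "base64" result field follows Byaldi's documented result format, but treat it as an assumption and verify it against the version you install:

# Sketch only: with store_collection_with_index=True, Byaldi keeps page images
# inside the index and returns them alongside each search result.
RAG.index(
    input_path="report.pdf",               # illustrative path
    index_name="sustainability_b64",
    store_collection_with_index=True,
    overwrite=True
)
results = RAG.search(query, k=1)
page_b64 = results[0]["base64"]            # base64-encoded page image, ready to pass to a VLM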

Step 6 - Define query and prompt:

Now you can launch a query and prompt in the following way:

query='What is described in the company purpose?'
prompt = "Explain what is shown in this image. List in bullet points."

Step 7 - Get results from the RAG pipeline:

Once you've created or loaded an index, you can start searching for relevant documents.

results = RAG.search(query, k=1)

# Get the index of the relevant image (page_num is 1-based, the images list is 0-based)
image_index = results[0]["page_num"] - 1

Since we did not save the collection, we need to get the right PDF page as an image, so that we can use the VLM Gemini 1.5 Flash for response generation.

Step 8 - Convert each page of the PDF into an image:

We now use pdf2image to convert the PDF into a list of images, where each element corresponds to a page from the PDF.

from pdf2image import convert_from_path

images = convert_from_path("/home/ubuntu/multimodal_rag/multimodal_byaldi/VF_FY2023_Environmental_Social_Responsibility_Report_FINAL_removed.pdf")

Once this is done, we can also display the results our RAG search retrieved.

from IPython.display import display
display(images[image_index])

Step 9 - Generate the response:

We can now use the images retrieved to generate response from the Multimodal Gemini 1.5 Flash model:

def get_answer(prompt: str, image) -> str:
    # Send the prompt and the retrieved page image (a PIL image) to Gemini in one request
    response = model.generate_content([prompt, image])
    return response.text

answer = f"Gemini Response:\n{get_answer(prompt, images[image_index])}"
print(answer)

You can now try to tweak the above basic tutorial for your use-case, and implement a basic ColPali-powered RAG in your application.




Challenges with ColPali

Like any large retrieval system, ColPali has some challenges with computing power and storage. In order to use ColPali in production, you would need to implement additional tactics to get it to scale and work for a large number of documents. Let’s first look at the challenges:

  • Computational Complexity: ColPali's computing needs increase quickly as the number of query tokens and patch vectors grows. As queries get longer or document images get larger and more complex, the computing demand rises rapidly.
  • Storage Requirements: ColPali requires much more storage than typical dense vector methods, using 10 to 100 times more space, because it stores one vector per patch (and per query token) rather than a single vector per page. The storage needed grows with three factors: the number of documents, the number of patches per document, and the size of each vector embedding (see the back-of-the-envelope estimate after this list).
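
As a back-of-the-envelope estimate (assuming the 1024 patches per page described earlier and the 128-dimensional patch embeddings reported for ColPali), the per-page footprint adds up quickly:

patches_per_page = 1024          # 32x32 grid of patches per page image
dims = 128                       # ColPali projects each patch embedding to 128 dimensions
bytes_per_float32 = 4

per_page = patches_per_page * dims * bytes_per_float32
print(per_page / 1024, "KB per page")                   # 512.0 KB per page
print(per_page * 100_000 / 1e9, "GB for 100k pages")    # ~52.4 GB before quantization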

There are different tactics to manage the space and computation requirements. One of them is Binary Quantization.




Binary Quantization and ColPali

Binary Quantization (BQ) converts high-dimensional floating-point vector embeddings into binary (0s and 1s) values. This is achieved by mapping all positive numbers to 1 and all zero or negative numbers to 0.

Here is an example:

Original vector: [0.58768, -0.37768, 1.29891, 0.0]

Binary representation: [1, 0, 1, 0]

In BQ, we essentially convert a 32-bit floating-point vector to just 1 bit per dimension. This is what leads to space saving.

For example, OpenAI embeddings with 1536 dimensions require about 6 kB per vector in float32 (1536 dimensions x 4 bytes = 6,144 bytes). After binary quantization, the same vector needs just 1536 bits = 192 bytes, a 32x reduction in storage.

Additionally, binary operations (AND, OR, XOR) are much faster than floating-point calculations, which can yield up to a 40x speed improvement in search and retrieval tasks. This is particularly beneficial in large-scale vector search tasks like nearest neighbor searches.

Why Binary Quantization Works

Since we are compressing 32-bit floating-point numbers down to single bits, doesn't binary quantization lead to a loss of precision? It does, to some extent, but in many scenarios it doesn't matter much. Here are some reasons why:

Over-parameterization

Vector embeddings are often larger than necessary for search tasks, as they are optimized for ranking and clustering. BQ exploits this redundancy, reducing dimensions without significantly impacting accuracy.

Oversampling for Accuracy

Using BQ, you can first perform a fast, approximate search using binary vectors. You can then refine the subset of results using the original high-precision vectors.

Example: If you want the top 100 results and the oversampling factor is 2.0, the search first selects 200 candidates using binary vectors and then reranks them with the original vectors to return the top 100, balancing speed and accuracy.
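
Here is a minimal NumPy sketch of that two-stage search; for simplicity it uses single-vector embeddings rather than ColPali's multi-vector ones:

import numpy as np

# Two-stage search sketch: cheap binary candidate selection, then float rescoring.
rng = np.random.default_rng(1)
doc_vectors = rng.normal(size=(10_000, 128))            # original float32 embeddings
doc_bits = (doc_vectors > 0).astype(np.uint8)           # binary quantization: 1 bit per dimension
query = rng.normal(size=128)
query_bits = (query > 0).astype(np.uint8)

k, oversampling = 100, 2.0
n_candidates = int(k * oversampling)                    # 200 candidates

# Stage 1: approximate search with Hamming distance on binary vectors
hamming = (doc_bits != query_bits).sum(axis=1)
candidates = np.argsort(hamming)[:n_candidates]

# Stage 2: rescore only the candidates with the original high-precision vectors
scores = doc_vectors[candidates] @ query
top_k = candidates[np.argsort(-scores)[:k]]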

Using Scalar Quantization instead of Binary Quantization

In scalar quantization, you reduce 32-bit floats to 8-bit integers, trading a smaller compression ratio (4x instead of 32x) for higher fidelity. If binary quantization isn't giving accurate enough results, scalar quantization is a good middle ground.
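
A minimal sketch of one common symmetric int8 scheme (vector databases implement variants of this), reusing the example vector from above:

import numpy as np

# Symmetric int8 scalar quantization: 4 bytes per dimension -> 1 byte per dimension.
vec = np.array([0.58768, -0.37768, 1.29891, 0.0], dtype=np.float32)
scale = np.abs(vec).max() / 127.0          # map the largest magnitude to 127
vec_int8 = np.round(vec / scale).astype(np.int8)
vec_restored = vec_int8 * scale            # approximate reconstruction for rescoring
print(vec_int8)                            # [ 57 -37 127   0]
print(vec_restored)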

Key Benefits of Quantization:

When binary quantization is used, you can drastically reduce the storage requirements and computation complexity of a ColPali-powered RAG pipeline. Here’s why:

  • Reduced Storage: A dataset requiring 900 MB in float32 format shrinks to roughly 28 MB with binary quantization (a 32x reduction).
  • Faster Search: Speeds up search operations while maintaining accuracy by leveraging binary operations and oversampling techniques.
  • Efficient Memory Usage: Minimizes RAM consumption by storing quantized vectors in memory and keeping full vectors on disk for occasional refinement.

As a next step, if you are facing challenges around implementing ColPali, try to use Binary Quantization or Scalar Quantization.




Open Source Projects around ColPali

Several open source projects have emerged that leverage ColPali. Here are a few notable community projects and resources.

Frameworks:

  • Byaldi: Byaldi is the ColPali equivalent of RAGatouille, leveraging the colpali-engine package to streamline indexing and embedding storage.
  • PyVespa: PyVespa allows seamless interaction with Vespa, a production-grade vector database, with comprehensive support for ColPali.
  • Candle: Candle is an efficient machine learning framework for Rust, enabling ColPali inference with high performance.

You can also directly work with ColPali, and skip using any framework.




Applications of ColPali-Powered Multimodal RAG

The power of ColPali-powered Multimodal RAG lies in its ability to unlock insights from visually complex and multimodal documents. Here are some key use cases where this technology can deliver significant value:

1. Banking, Financial Services, and Insurance (BFSI)

In the BFSI sector, reports, balance sheets, and regulatory documents often contain a mix of text, tables, and charts. ColPali can automate the retrieval and analysis of these documents, enabling you to:

  • Compare financial statements across different periods or organizations.
  • Extract insights from earnings reports that combine textual analysis with visual data.
  • Identify anomalies in financial data with greater accuracy.

For example, automating the analysis of annual reports can reduce the manual effort required from analysts and improve the accuracy of financial assessments.

2. Legal Tech

Legal documents, such as contracts, case files, and court records, often combine text with tables, exhibits, and annotations. ColPali can help you:

  • Efficiently search and retrieve case law or contract clauses relevant to specific legal queries.
  • Analyze large volumes of evidence that may include images, diagrams, and text.
  • Summarize key insights from multimodal legal documents, improving review efficiency.

3. Healthcare

Medical records, research papers, and diagnostic reports often contain a combination of text, charts, and medical images. With ColPali, you can:

  • Extract and correlate information from patient records that contain both diagnostic text and imaging data.
  • Analyze research papers that blend textual descriptions with graphs and diagrams.
  • Support clinical decision-making by retrieving relevant information faster and more accurately.

For example, analyzing a patient’s medical history alongside diagnostic charts could aid in faster and more accurate diagnoses.

4. Manufacturing and Engineering

Technical manuals, schematics, and quality reports frequently include complex diagrams, charts, and instructions. ColPali can enable you to:

  • Search through technical documentation to find specific procedures or schematics.
  • Retrieve design diagrams based on textual descriptions.
  • Identify quality control issues by correlating textual reports with visual data.

This capability can streamline operations and reduce downtime in manufacturing processes.

5. Market Research and Competitive Analysis

Market research reports often combine text with charts, infographics, and data tables. Using ColPali, you can:

  • Extract competitive insights from industry reports.
  • Analyze market trends by synthesizing information from both textual and visual elements.
  • Summarize and compare company performance metrics efficiently.

For instance, quickly retrieving insights from market trend analyses can give your organization a competitive edge.

6. Sustainability and ESG Compliance

Environmental, Social, and Governance (ESG) reports are rich with data visualizations, tables, and explanatory text. ColPali helps you:

  • Analyze ESG reports to assess sustainability metrics.
  • Retrieve compliance-related data from visually dense documents.
  • Compare ESG performance across different organizations.

Automating ESG analysis can ensure more accurate reporting and compliance tracking.




Conclusion

ColPali-powered Multimodal RAG represents a transformative step in document analysis and knowledge extraction. By leveraging vision-language models and unified vector spaces, you can overcome the limitations of traditional text-based retrieval methods. Whether you’re analyzing financial reports, legal documents, or medical records, ColPali allows you to unlock insights from multimodal data with unprecedented accuracy and efficiency.

By implementing ColPali in your SaaS application, and optimizing performance through techniques like Binary Quantization, you can scale this solution to handle vast amounts of complex data. As enterprises continue to generate more unstructured and multimodal data, adopting ColPali-powered Multimodal RAG will be key to maintaining a competitive advantage and driving informed decision-making.

If you are an enterprise or a SaaS organization looking to integrate ColPali-powered Multimodal RAG into your application or product, you can reach out to us. At Superteams.ai, we partner with your team to develop and deploy cutting-edge AI solutions tailored to your business needs. Our approach helps you build AI features or launch AI products without the need to build a full-scale in-house AI team. To learn more, reach out to us today.
