This tutorial walks you through deploying and using Pixtral-12B for invoice parsing tasks, and creating a chat-based invoice analysis system.
Invoice parsing and processing pose significant challenges for businesses of all sizes. In most cases, invoices lack a standardized format, which means companies looking to streamline invoice handling from different vendors must build automated systems capable of interpreting a wide variety of layouts.
An effective invoice parsing system should reliably extract key details such as payment terms, totals, and item descriptions. Traditionally, businesses have relied on OCR (Optical Character Recognition) models to accomplish this, but these systems often struggle with inconsistent formatting, complex tables, and handwritten elements. Additionally, OCR models are prone to errors when handling poor image quality or non-text elements, resulting in inaccuracies that require manual correction.
This is where large vision models (LVMs), like OpenAI’s GPT-4o, have shown great promise. LVMs work by combining image recognition capabilities with natural language understanding, allowing them to process both visual and textual data within the same model. These models are trained on internet-scale datasets, enabling them to handle various invoice formats, including complex layouts that traditional OCR models struggle with.
Among the LVMs recently released, the Pixtral-12B model by Mistral AI stands out. It is an open model that excels in multimodal tasks, making it highly effective for invoice parsing scenarios. The model, approximately 24GB in size, builds on Mistral’s text-focused Mistral NeMo 12B and integrates a vision adapter, allowing it to handle complex visual layouts such as tables, graphs, and embedded images within documents. Trained on a diverse range of image and text data, Pixtral-12B generalizes well across various document types and formats.
In this tutorial, we will walk you through the process of deploying and using Pixtral-12B and applying it to invoice parsing tasks. We will also build a chat-based invoice analysis system that allows you to query multiple invoices at the same time.
Let’s get started!
Before we commence, let’s take a quick look at Pixtral-12B.
Multimodal Capabilities: Pixtral-12B can process both text and images simultaneously, making it highly effective for tasks such as invoice parsing, document processing, and more.
12 Billion Parameters: The model boasts 12 billion parameters. Its size allows it to handle complex and large-scale tasks and offer superior performance compared to smaller models. However, it remains small enough to be deployed on a single A100 GPU.
High-Resolution Image Processing: Pixtral-12B can process high-resolution images (up to 1024 x 1024) with a deep understanding of spatial relationships between elements such as tables, graphs, and embedded images.
Contextual Understanding: The model is capable of understanding both textual and visual contexts within documents, enabling more accurate information extraction and parsing. This makes it a powerful candidate for invoice parsing.
Open-Source: Available on Hugging Face under the Apache 2.0 license, Pixtral-12B can be fine-tuned and used for both research and commercial applications.
These features make Pixtral-12B a robust solution for automating document workflows and handling complex multimodal tasks. In our tutorial, we will use it to process both computer-generated and handwritten invoices.
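As a rough back-of-the-envelope check on the single-GPU claim, weight memory is approximately the parameter count multiplied by the bytes per parameter. The sketch below assumes 16-bit weights (2 bytes each) and ignores the vision encoder, activations, and the KV cache, so treat the result as a lower bound. The function name is illustrative only.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for the model weights, in GB (10^9 bytes)."""
    return num_params * bytes_per_param / 1e9

# 12 billion parameters stored as 16-bit values (an assumption)
print(weight_memory_gb(12e9))
```

This lines up with the roughly 24GB model size mentioned earlier, which fits within the 40GB or 80GB of memory available on an A100.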
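Our stack will be vLLM for running the model, mistral_common for tokenization, Hugging Face for downloading the weights, and Gradio for the chat interface.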
Our first step is to create a virtual environment, and then install the required libraries. We will assume that you have done so, and launched a Jupyter Notebook on your chosen cloud or your laptop.
Pixtral requires the mistral_common library, so we install it along with vLLM.
!pip install vllm
!pip install --upgrade mistral_common
Next, let’s import the modules.
from vllm import LLM
from vllm.sampling_params import SamplingParams
from dotenv import load_dotenv
import os
import gradio as gr
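Here is what each import is used for: LLM loads the model and runs inference through vLLM; SamplingParams sets generation options such as the maximum number of output tokens; load_dotenv reads variables from a .env file into the environment; os provides access to those variables; and gradio builds the web interface.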
Now let’s load the environment variables for each use case.
load_dotenv()
To run Pixtral-12B locally, we will use vLLM. First, let’s import the required classes.
from vllm import LLM
from vllm.sampling_params import SamplingParams
You will need an access token from Hugging Face (https://huggingface.co). Get that first, and then download the model in the following way:
from huggingface_hub import notebook_login
notebook_login()
llm = LLM(
    model="mistral-community/pixtral-12b-240910",
    tokenizer_mode="mistral",
    max_model_len=4000
)
Let’s write a function that invokes the Pixtral-12B model with a prompt and an image. You can either pass the image URL directly or encode your image in Base64 format. Let’s do the former. Note that this version calls the hosted model through a Mistral API client object named client, rather than the local vLLM instance.
from mistralai import Mistral

# Assumption: the API key is stored in a MISTRAL_API_KEY variable loaded by load_dotenv()
client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

def generate_context(url):
    model = "pixtral-12b-2409"
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Extract the text from the image precisely, extract every text."
                },
                {
                    "type": "image_url",
                    "image_url": url
                }
            ]
        }
    ]
    # Get the chat response
    chat_response = client.chat.complete(
        model=model,
        messages=messages
    )
    # Return the extracted text as context
    return chat_response.choices[0].message.content
That’s all that’s needed!
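For images that live on disk rather than at a public URL, the Base64 route mentioned above is the alternative. The following is a minimal sketch of that encoding step, not part of the original walkthrough: the helper name image_to_data_url is hypothetical, and the JPEG fallback for unknown file types is an assumption.

```python
import base64
import mimetypes
from pathlib import Path

def image_to_data_url(path: str) -> str:
    """Encode a local image file as a Base64 data URL."""
    mime_type, _ = mimetypes.guess_type(path)
    if mime_type is None:
        # Fallback is an assumption; adjust it for your image type
        mime_type = "image/jpeg"
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can be passed wherever the function expects an image URL, for example generate_context(image_to_data_url("invoice.png")), where invoice.png stands in for your own file.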
Now, let’s try this out with a few bill images.
Here’s the first bill image we experimented with.
This is the extracted text using Pixtral-12B.
DOD FORM 1289
1 NOV 71
DOD PRESCRIPTION
FOR (Full name, address, & phone number) (If under 12, give age)
John R Doe, HM3, USN
U.S.S. Neverforgotten (DD 178)
MEDICAL FACILITY
U.S.S. Neverforgotten (DD 178) DATE 23 Jan 99
R (Superscription)
(Inscription)
Tm Belladonna 15 ml
Amphogel goat 120 ml
(Subscription)
M & FT Solution
(Signs)
Sig: 5 ml t.d. a.c.
MFGR: Wyeth EXP DATE: 12/02 LOT NO: P39X186 FILLED BY: RWT
Jack R Frost LCDR MD USNR
B NUMBER 10072 SIGNATURE RANK AND DEGREE
EDITION OF 1 JAN 60 MAY BE USED FOR S/N 0102 LF 012 8201
Extracted JSON using Pixtral-12B:
{
"DD": "1289",
"Form": "1289",
"Date": "1 NOV 71",
"Title": "DOD PRESCRIPTION",
"Patient": {
"Name": "John R Doe",
"Military Rank": "HM3",
"Military Service": "USN"
},
"Facility": {
"Name": "U.S.S. Neverforgotten",
"Code": "DD 178"
},
"Medical Facility": {
"Name": "U.S.S. Neverforgotten",
"Code": "OO 178",
"Date": "23 Jan 99"
},
"Prescription": {
"Superscription": "B",
"Incription": [
"Tm Belledonna - 15 ml",
"Amphogel hors 120ml"
],
"Subscription": "M + FT Solution",
"Signa": "Seq. 5 ml t.i.d a.c."
},
"Pharmacy Additional Info": {
"MFGR": "Wyeth",
"Lot No": "P39X106",
"Exp Date": "12/02",
"Filled By": "RMT"
},
"Prescriber": {
"Name": "Jack R Frost",
"Military Rank": "LCDR",
"Medical Degree": "MD",
"Military Service": "USNR",
"BN": "10072"
}
}
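As you can see, the extraction is largely accurate, although the two outputs do not agree on every field. For example, the lot number reads P39X186 in the extracted text and P39X106 in the JSON, and the "Filled By" initials read RWT and RMT respectively. Critical fields should therefore be checked against the source image.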
We will now use the Pixtral-12B model to analyze the parsed data from the invoice within a JSON schema.
What are we going to do?
Why JSON formatting?
The JSON format structures the parsed data in a machine-readable way. This allows us to skip the step of building a vector database over the data and to query the JSON extracted from the image directly.
def generate_context(image_url, prompt="Extract text from the image and give the response in JSON format"):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}}
            ]
        }
    ]
    outputs = llm.chat(
        messages,
        sampling_params=SamplingParams(max_tokens=8192)
    )
    return outputs[0].outputs[0].text
The generate_context function sends the image to the model and returns the extracted text, which we can then query.
The prompt argument has a default value, so you only need to pass a prompt when you want to change the extraction instructions.
We set max_tokens to 8192, which should be sufficient for our use case; you can choose a different value if needed. Keep in mind that the prompt and the generated tokens together must fit within max_model_len (4000 in the configuration above), so that setting is the effective upper bound on output length.
The message we send contains the prompt along with the URL of the image on which we run the extraction.
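The value returned by generate_context is a plain string, and chat models sometimes wrap JSON in a Markdown code fence or add a sentence around it. The sketch below shows one way to turn such a reply into a Python object; the helper name parse_model_json is hypothetical, and the cleanup rules are assumptions about typical model output rather than documented Pixtral behavior.

```python
import json
import re

def parse_model_json(text: str):
    """Parse JSON from a model reply, tolerating a Markdown code fence around it."""
    cleaned = text.strip()
    # Remove an opening fence (three backticks, optionally followed by a language tag)
    cleaned = re.sub(r"^`{3}[a-zA-Z]*\s*", "", cleaned)
    # Remove a closing fence
    cleaned = re.sub(r"\s*`{3}$", "", cleaned)
    try:
        return json.loads(cleaned)
    except json.JSONDecodeError:
        # Fall back to the outermost {...} span if extra prose surrounds the JSON
        start, end = cleaned.find("{"), cleaned.rfind("}")
        if start == -1 or end <= start:
            raise
        return json.loads(cleaned[start:end + 1])
```

Once parsed, fields can be read directly, for example parse_model_json(reply)["total"] for an invoice whose JSON contains a total key.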
def query_llm(context, query):
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "You are an answer generation agent, you'll be given context and query, generate answer in human readable form"},
                {"type": "text", "text": f"here is the question {query} and here is the context {context}"}
            ]
        }
    ]
    outputs = llm.chat(
        messages,
        sampling_params=SamplingParams(max_tokens=8192)
    )
    return outputs[0].outputs[0].text
Next, we use Pixtral-12B’s language capabilities to answer questions about the extracted data. This step is text-only: no image is sent to the model.
We provide the JSON-formatted context obtained from the previous generate_context function, along with the user’s query.
The model then returns the answer in a clear, human-readable format.
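Because query_llm only needs text, querying several invoices at once can be approached by concatenating their extracted contexts before asking the question. The following is a minimal sketch of that idea; combine_contexts is a hypothetical helper, and the labeling format is an assumption chosen to help the model tell the invoices apart.

```python
def combine_contexts(contexts: dict[str, str]) -> str:
    """Join several extracted invoice contexts into one labeled block."""
    sections = []
    for index, (source, context) in enumerate(contexts.items(), start=1):
        sections.append(f"Invoice {index} (source: {source}):\n{context.strip()}")
    return "\n\n".join(sections)
```

The combined string can then be passed as the context argument of query_llm. The combined prompt still has to fit within max_model_len, so this works best for a small number of invoices.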
import gradio as gr

def process_query(url, query):
    context = generate_context(url)
    response = query_llm(context, query)
    return response

if __name__ == "__main__":
    # Create the Gradio interface
    interface = gr.Interface(
        fn=process_query,
        inputs=[
            gr.Textbox(label="Enter the URL", placeholder="Enter image URL here"),
            gr.Textbox(label="Enter your query", placeholder="Ask a question about the content")
        ],
        outputs=gr.Textbox(label="Response"),
        title="Pixtral-12b RAG Application",
        description="Provide an image URL and ask questions based on the context generated from it."
    )
    # Launch the interface
    interface.launch(share=True)
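One thing to note about process_query is that it re-runs the image extraction for every question, even when the URL has not changed. A possible refinement, sketched here under the assumption that the image behind a URL stays the same during a session, is to cache the extracted context per URL. The wrapper name make_cached_extractor is hypothetical.

```python
from functools import lru_cache

def make_cached_extractor(extract_fn, maxsize: int = 32):
    """Wrap an extraction function so each URL is processed at most once."""
    @lru_cache(maxsize=maxsize)
    def cached(url: str) -> str:
        return extract_fn(url)
    return cached
```

To use it, create cached_context = make_cached_extractor(generate_context) once and call cached_context(url) inside process_query in place of generate_context(url). Follow-up questions about the same invoice then skip the extraction step.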
The interface above is built with Gradio. If you have not installed it yet, do so before running the code:
!pip install -q gradio
Tip: Use the -q flag when installing packages to keep your screen from being flooded with installation logs.
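Major components of the Gradio interface: gr.Interface wires the process_query function (fn) to two gr.Textbox inputs (the image URL and the query) and one gr.Textbox output, and launch(share=True) starts the app and creates a temporary public link.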
Bill 1:
JSON text extracted:
{
"table": {
"header": [
"Stock Name",
"Symbol",
"Shares",
"Purchase Price",
"Cost Basis",
"Current Price",
"Market Value",
"Gain/Loss",
"Dividend/share",
"Yield"
],
"rows": [
{
"Stock Name": "Apple",
"Symbol": "AAPL",
"Shares": 100,
"Purchase Price": "$90.00",
"Cost Basis": "$9,000.00",
"Current Price": "$144.13",
"Market Value": "$14,413.27",
"Gain/Loss": "$14,269.14",
"Dividend/share": "$2.28",
"Yield": "1.58%"
},
{
"Stock Name": "Microsoft",
"Symbol": "MSFT",
"Shares": 200,
"Purchase Price": "$62.00",
"Cost Basis": "$12,400.00",
"Current Price": "$64.57",
"Market Value": "$13,114.14",
"Gain/Loss": "$13,048.57",
"Dividend/share": "$1.56",
"Yield": "2.38%"
},
{
"Stock Name": "Salesforce",
"Symbol": "CRM",
"Shares": 150,
"Purchase Price": "$25.00",
"Cost Basis": "$3,750.00",
"Current Price": "$82.57",
"Market Value": "$12,385.50",
"Gain/Loss": "$12,302.83",
"Dividend/share": "$0.00",
"Yield": "0.00%"
},
{
"Stock Name": "Oracle",
"Symbol": "ORCL",
"Shares": 250,
"Purchase Price": "$50.00",
"Cost Basis": "$12,500.00",
"Current Price": "$44.56",
"Market Value": "$11,138.75",
"Gain/Loss": "$11,094.20",
"Dividend/share": "$0.64",
"Yield": "1.44%"
},
{
"Stock Name": "Hewlett Packard Enterprise",
"Symbol": "HPE",
"Shares": 500,
"Purchase Price": "$18.00",
"Cost Basis": "$9,000.00",
"Current Price": "$17.69",
"Market Value": "$8,842.50",
"Gain/Loss": "$8,824.82",
"Dividend/share": "$0.26",
"Yield": "1.47%"
},
{
"Stock Name": "Alphabet",
"Symbol": "GOOG",
"Shares": 100,
"Purchase Price": "$225.00",
"Cost Basis": "$22,500.00",
"Current Price": "$833.36",
"Market Value": "$83,336.00",
"Gain/Loss": "$82,502.64",
"Dividend/share": "$0.00",
"Yield": "0.00%"
},
{
"Stock Name": "Intel",
"Symbol": "INTC",
"Shares": 200,
"Purchase Price": "$22.00",
"Cost Basis": "$4,400.00",
"Current Price": "$36.07",
"Market Value": "$7,213.00",
"Gain/Loss": "$7,176.94",
"Dividend/share": "$1.09",
"Yield": "3.02%"
},
{
"Stock Name": "Cisco",
"Symbol": "CSCO",
"Shares": 225,
"Purchase Price": "$18.00",
"Cost Basis": "$4,050.00",
"Current Price": "$33.24",
"Market Value": "$7,478.78",
"Gain/Loss": "$7,445.54",
"Dividend/share": "$1.16",
"Yield": "3.49%"
},
{
"Stock Name": "Qualcomm",
"Symbol": "QCOM",
"Shares": 185,
"Purchase Price": "$65.00",
"Cost Basis": "$12,025.00",
"Current Price": "$56.48",
"Market Value": "$10,447.88",
"Gain/Loss": "$10,391.40",
"Dividend/share": "$2.12",
"Yield": "3.75%"
},
{
"Stock Name": "Amazon",
"Symbol": "AMZN",
"Shares": 50,
"Purchase Price": "$800.00",
"Cost Basis": "$40,000.00",
"Current Price": "$897.64",
"Market Value": "$44,882.00",
"Gain/Loss": "$43,984.36",
"Dividend/share": "$0.00",
"Yield": "0.00%"
},
{
"Stock Name": "Redhat",
"Symbol": "RHT",
"Shares": 100,
"Purchase Price": "$95.00",
"Cost Basis": "$9,500.00",
"Current Price": "$86.26",
"Market Value": "$8,626.00",
"Gain/Loss": "$8,539.74",
"Dividend/share": "$0.00",
"Yield": "0.00%"
},
{
"Stock Name": "Facebook",
"Symbol": "FB",
"Shares": 1000,
"Purchase Price": "$17.00",
"Cost Basis": "$17,000.00",
"Current Price": "$141.64",
"Market Value": "$141,640.00",
"Gain/Loss": "$141,498.36",
"Dividend/share": "$0.00",
"Yield": "0.00%"
},
{
"Stock Name": "Twitter",
"Symbol": "TWTR",
"Shares": 500,
"Purchase Price": "$45.00",
"Cost Basis": "$22,500.00",
"Current Price": "$14.61",
"Market Value": "$7,302.55",
"Gain/Loss": "$7,287.94",
"Dividend/share": "$0.00",
"Yield": "0.00%"
}
]
}
}
Q&A over image with Pixtral-12B LLM:
Response:
The dividend per share for Apple is $2.28. This means that for each share of Apple stock you own, you will receive $2.28 as a dividend.
Response:
Based on the provided context, here is the summary of the total profit and loss:
- **Total Profit:** From the given "GainLoss" values, the total profit looks as follows:
- Apple: $14,269.14
- Microsoft: $13,048.57
- Salesforce: $12,302.83
- Oracle: $10,994.20
- Hewlett Packard Enterprise: $9,824.82
- Alphabet: $82,502.64
- Intel: $7,176.94
- Cisco: $7,445.54
- Qualcomm: $10,391.40
- Amazon: $43,984.36
- Redhat: $8,539.74
- Facebook: $141,498.36
- Twitter: $7,287.94
Sum of gains: $437,227.43
- **Total Loss:** There are no losses indicated among the given stocks (none of the "GainLoss" values are negative).
Therefore, the total profit from the listed stocks is $437,227.43, and there is no total loss.
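Since the data is already in JSON, totals like this can also be computed deterministically instead of relying on the model's arithmetic. The sketch below sums the "Gain/Loss" strings exactly as they appear in the extracted JSON; parse_money is a hypothetical helper that assumes US-style formatting with a dollar sign and comma separators.

```python
from decimal import Decimal

def parse_money(value: str) -> Decimal:
    """Convert a string such as "$14,269.14" to a Decimal."""
    return Decimal(value.replace("$", "").replace(",", "").strip())

# "Gain/Loss" strings copied from the extracted JSON above
gain_loss = [
    "$14,269.14", "$13,048.57", "$12,302.83", "$11,094.20", "$8,824.82",
    "$82,502.64", "$7,176.94", "$7,445.54", "$10,391.40", "$43,984.36",
    "$8,539.74", "$141,498.36", "$7,287.94",
]

total = sum((parse_money(v) for v in gain_loss), Decimal("0"))
print(f"{total:,.2f}")
```

This prints 368,366.48, which differs from the $437,227.43 in the response. The response also lists Oracle and Hewlett Packard Enterprise with values that differ from the extracted JSON, so numeric answers from the model are worth verifying this way.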
Bill 2:
JSON text extracted:
{
"title": "ÉPICERIE, CONFISERIE, DROGUERIE EN GROS",
"subtitle": "PRODUITS DE GROIX",
"header": "Constances Guilbaud",
"address": "2.1, Rue Ledru-Rollin",
"signature": "M. Ladumaie-Sieux à Richelieu",
"digit": "Les Marchandises ci-dessus désignées ont été reçues ce 95",
"date": "Chantiers, le 4.12.Decembre.1919.",
"items": [
{
"quantity": "1",
"description": "Liquorice (Liqui.requirement)",
"quantity_unit": "",
"amount": "7.60",
"currency": "",
"total": "7.60"
},
{
"quantity": "1",
"description": "Liquorice (Liqui.requirement)",
"quantity_unit": "",
"amount": "7.60",
"currency": "",
"total": "7.60"
},
{
"quantity": "1",
"description": "Coch. Grain",
"quantity_unit": "",
"amount": "6.60",
"currency": "",
"total": "6.60"
},
{
"quantity": "1.8IG",
"description": "Maloush",
"quantity_unit": "IG",
"amount": "14.40",
"currency": "",
"total": "14.40"
},
{
"quantity": "1",
"description": "So Boudix Boudis",
"quantity_unit": "",
"amount": "1.80",
"currency": "",
"total": "1.80"
},
{
"quantity": "1.8",
"description": "So Mardi",
"quantity_unit": "",
"amount": "8.40",
"currency": "",
"total": "8.40"
},
{
"quantity": "1",
"description": "White Pill Gel",
"quantity_unit": "",
"amount": "3.60",
"currency": "",
"total": "3.80"
},
{
"quantity": "1",
"description": "Castrature ",
"quantity_unit": "",
"amount": "10.00",
"currency": "",
"total": "10.00"
}
],
" cultivated": " tot.",
"total": "54.98",
"stamp": {
"text": "Cahier",
"timestamp": "40"
}
}
Q&A over image with Pixtral-12B:
Response:
The final bill amount is **54.55 francs**.
Response:
The bill lists the following items:
1. Liquorice (Liqui.requirement) - Quantity: 1, Amount: 7.60
2. Liquorice (Liqui.requirement) - Quantity: 1, Amount: 7.60
3. Coch. Grain - Quantity: 1, Amount: 6.60
4. Maloush - Quantity: 1.8 IG, Amount: 14.40
5. So Boudix Boudis - Quantity: 1, Amount: 1.80
6. So Mardi - Quantity: 1.8, Amount: 8.40
7. White Pill Gel - Quantity: 1, Amount: 3.60
8. Castrature - Quantity: 1, Amount: 10.00
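For handwritten bills like this one, a simple consistency check on the extracted JSON can flag results that need manual review. The sketch below compares the sum of the line-item totals with the stated total, using the values from the JSON above. The function name reconcile_total is hypothetical, and the items and total keys follow this particular extraction rather than a fixed schema.

```python
from decimal import Decimal

def reconcile_total(invoice: dict) -> dict:
    """Compare the sum of line-item totals with the invoice's stated total."""
    computed = sum((Decimal(item["total"]) for item in invoice["items"]), Decimal("0"))
    stated = Decimal(invoice["total"])
    return {"computed": computed, "stated": stated, "matches": computed == stated}

# Line-item totals and stated total copied from the extracted JSON above
invoice = {
    "items": [{"total": t} for t in
              ["7.60", "7.60", "6.60", "14.40", "1.80", "8.40", "3.80", "10.00"]],
    "total": "54.98",
}
result = reconcile_total(invoice)
print(result["computed"], result["stated"], result["matches"])
```

Here the line items add up to 60.20 while the extracted total is 54.98 and the chat response reports 54.55, which suggests that at least some of the handwritten figures were misread.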
Bill 3:
JSON text extracted:
{
"invoice_number": "11473",
"invoice_date": "Mar 14, 2018",
"issued_to": "THE WEDDING ARRANGER",
"items": [
{
"item_code": "7005",
"description": "baby roller 4 pcs",
"quantity": "3",
"unit_price": "6.00",
"total_price": "18.00"
},
{
"item_code": "410",
"description": "bed pad grind size inchesX24 inches X150",
"quantity": "n",
"unit_price": "n",
"total_price": "n/a"
},
{
"item_code": "53",
"description": "share grinder 450 Watt",
"quantity": "n",
"unit_price": "6.30",
"total_price": "6.30"
},
{
"item_code": "n",
"description": "tag cloth for cloth",
"quantity": "n",
"unit_price": "48.50",
"total_price": "48.50"
},
{
"item_code": "n",
"description": " coalated expect Fogg Paper for steel",
"quantity": "n",
"unit_price": "n",
"total_price": "n/a"
},
{
"item_code": "500",
"description": "4 inch cloth for fogging the size of new bottles",
"quantity": "3",
"unit_price": "8.00",
"total_price": "24.00"
},
{
"item_code": "102",
"description": "brass brush roller with ring",
"quantity": "4",
"unit_price": "3.00",
"total_price": "12.00"
},
{
"item_code": "00",
"description": "cotton cloth",
"quantity": "2",
"unit_price": "7.50",
"total_price": "7.50"
},
{
"item_code": "20",
"description": "mesh NAC 4 meter",
"quantity": "4",
"unit_price": "4.150",
"total_price": "16.60"
},
{
"item_code": "00",
"description": "cotton cloth wash as per m",
"quantity": "2",
"unit_price": "2.75",
"total_price": "5.50"
}
],
"tax": "631.61",
"total": "5895.00"
}
Q&A over image with Pixtral-12B:
Response:
Based on the provided context and the question "All items," here is the answer:
"The invoice contains the following items:
1. Baby roller 4 culon - 5 units at Php 6.00 each (total Php 3,000.00)
2. Negative salient - 1 unit at Php 6.00 each (total Php 6.00)
3. Sandwig seder - 1 unit at Php 6.00 each (total Php 6.30)
4. Clear, Cross Lag. Ethnic - 1 unit at Php 4.88 each (total Php 4.88)
5. Plaet. Lag. Clear - 4 units at Php 2.05 each (total Php 2.05)
6. 8s crass. Lag. White Lag. - 2 units at Php 3.50 each (total Php 3.50)
7. Colt. yake - 2 units at Php 75.00 each (total Php 75.00)
8. V.K. Z.M. Paster 10 - 5 units at Php 25.00 each (total Php 25.00)
9. A150 LensVAT - 3 units at Php 1.50 each (total Php 7.50)
These items make up the total sales amounting to Php 4,362.00, excluding VAT."
Response:
Based on the provided context, the date of the bill is **March 14, 2015**.
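The answers for this bill diverge from the extracted JSON in several places (for example, the JSON gives an invoice date of Mar 14, 2018, while the response says 2015). A row-level check can help locate unreliable rows before they are queried. The following is a minimal sketch that flags rows with non-numeric placeholders or where quantity times unit price does not equal the row total; check_line_items is a hypothetical helper, and the key names follow the JSON above.

```python
from decimal import Decimal, InvalidOperation

def check_line_items(items: list[dict]) -> list[str]:
    """Return a note for every row that is unreadable or arithmetically inconsistent."""
    notes = []
    for item in items:
        try:
            quantity = Decimal(item["quantity"])
            unit_price = Decimal(item["unit_price"])
            total_price = Decimal(item["total_price"])
        except InvalidOperation:
            # Placeholders such as "n" or "n/a" cannot be parsed as numbers
            notes.append(f"{item['description']}: non-numeric field")
            continue
        if quantity * unit_price != total_price:
            notes.append(f"{item['description']}: {quantity} x {unit_price} != {total_price}")
    return notes
```

Applied to the items above, a check like this would flag the rows containing "n" placeholders as well as the first "cotton cloth" row, where 2 x 7.50 does not match the listed total of 7.50.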
Bill 4:
JSON extracted text:
{
"menu": {
"title": "PALMIYE RESTAURANT & CAFE",
"location": "Eyüpacolı Evler K starter 8",
"address": "İncirli Caddesi Sok. No:70, Sudanşhrie'stanbul - ISTANBUL",
"contact": "Telefon +90 212 641 76 76 - Faks: +90 212 641 76 77",
"stamp": "TURKISH CUISINE"
},
"admission": {
"name": "ADİSYON",
"ref no": " ballet 174528",
"date": "12.10.2019",
"Registration no": "SEN- A- 6723",
"commital date": "Görüşмой",
"no": "Il Koddu 34 - No: 115059"
},
"order": {
"cinsi": [
"D-described: 50",
"described: 80",
"described: 50",
"described: 30"
],
"mik": [
"",
"",
"",
""
],
"fiyati": [
"54",
"37",
"74",
"39"
],
"tutar": [
"27",
"26",
"70",
"19"
]
},
"notes": [
{
"title": "D-site",
"text": "Günaydın Ramazan",
"subtext": "Barkod"
},
{
"title": "Site",
"text": "KFrontwire",
"subtext": "Belusage"
}
]
}
Q&A over image with Pixtral-12B:
The address on the bill is:
Eyyüb Collier Road
Ince Caves Sektor No: 35
Subashim missiles / ISTANBUL
The ferries. 6471 / 376 512855
In this tutorial, we have guided you through the process of deploying and utilizing Pixtral-12B for invoice parsing tasks. Additionally, we have developed a chat-based invoice analysis system that enables you to query multiple invoices simultaneously.
Key takeaways: