PaliGemma 2 Mix: A Guide With Demo OCR Project
PaliGemma 2 Mix is a multimodal AI model developed by Google. It is an improved version of the PaliGemma vision language model (VLM), integrating advanced capabilities from the SigLIP vision model and Gemma 2 language models.
In this tutorial, I’ll explain how to use PaliGemma 2 Mix to build an AI-powered bill scanner and spending analyzer capable of:
- Extracting and categorizing expenses from bill receipts.
- Performing optical character recognition (OCR) to retrieve key information.
- Summarizing spending based on provided images.
While our focus will be on building a financial insights tool, you can use what you learn in this blog to explore other use cases of PaliGemma 2 Mix, such as image segmentation, object detection, and question answering.
What Is PaliGemma 2 Mix?
PaliGemma 2 Mix is an advanced vision-language model (VLM) that processes both images and text as input and generates text-based outputs. It is designed to handle a diverse range of multimodal AI tasks while supporting multiple languages.
PaliGemma 2 is designed for a wide array of vision-language tasks, including image and short video captioning, visual question answering, optical character recognition (OCR), object detection, and segmentation.
Source of the images used in the diagram: Google
The PaliGemma 2 Mix model is designed for:
- Image & short video captioning: Generating accurate and context-aware captions for static images and short videos.
- Visual question answering (VQA): Analyzing images and answering text-based questions based on visual content.
- Optical character recognition (OCR): Extracting and interpreting text from images, making it useful for documents, receipts, and scanned materials.
- Object detection & segmentation: Identifying, labeling, and segmenting objects within an image for structured analysis.
- Multi-language support: Understanding and generating text in multiple languages for global applications.
You can find more information about the PaliGemma 2 Mix model in the official release article.
Project Overview: Bill Scanner and Spending Analyzer With PaliGemma 2 Mix
Let’s outline the main steps that we’re going to take:
- Load and prepare the dataset: We load the receipt images that serve as input.
- Initialize the PaliGemma 2 Mix model: We configure and load the model for vision-language tasks.
- Process input images: We convert the images to RGB so they share a consistent format for analysis.
- Extract key information: We perform optical character recognition (OCR) to retrieve the total amount.
- Categorize expenses: We classify purchases into categories such as grocery, clothing, electronics, and other.
- Generate spending insights: We summarize the categorized expenses and generate a spending distribution chart.
- Build an interactive Gradio interface: Finally, we create a UI where users can upload multiple bills, extract data, and analyze spending visually.
Step 1: Prerequisites
Before we start, let’s ensure that we have the following tools and libraries installed:
- Python 3.8 or later
- torch
- Transformers
- bitsandbytes
- pandas
- PIL (Pillow)
- Matplotlib
- Gradio
Run the following command to install the necessary dependencies:
```bash
pip install -U -q gradio bitsandbytes transformers
```
Once the above dependencies are installed, run the following imports:
```python
import re

import gradio as gr
import matplotlib.pyplot as plt
import pandas as pd
import torch
from PIL import Image
from transformers import (
    BitsAndBytesConfig,
    PaliGemmaForConditionalGeneration,
    PaliGemmaProcessor,
)
```
Step 2: Model Initialization
We configure and load the PaliGemma 2 Mix model with quantization to reduce its memory footprint. For this demo, we’ll use the 10B-parameter model with 448 x 448 input image resolution. You need at least a T4 GPU with 40GB memory (Colab configuration) to run this model.
```python
device = "cuda" if torch.cuda.is_available() else "cpu"

# Model setup
model_id = "google/paligemma2-10b-mix-448"
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # Change to load_in_4bit=True for even lower memory usage
    llm_int8_threshold=6.0,
)

# Load model with quantization
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config
).eval()

# Load processor
processor = PaliGemmaProcessor.from_pretrained(model_id)

# Print success message
print("Model and processor loaded successfully!")
```
BitsAndBytes quantization reduces memory usage while largely maintaining performance, making it possible to run large models on limited GPU resources. In this implementation, we use 8-bit quantization (load_in_8bit=True); switching the configuration to load_in_4bit=True lowers memory usage further.
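As a rough back-of-the-envelope check on why quantization matters here, the sketch below estimates the memory needed for the weights alone at different precisions. It assumes a round 10 billion parameters and ignores activations, the KV cache, and any layers that bitsandbytes leaves unquantized, so treat the numbers as a lower bound rather than a measurement. The helper name weight_memory_gib is hypothetical.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights only, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 1024**3


# Assumption: ~10e9 parameters for the 10B checkpoint
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gib(10e9, bits):.1f} GiB")
```

Under these assumptions, the weights need roughly 19 GiB at 16-bit, 9 GiB at 8-bit, and 5 GiB at 4-bit, which is why the quantized model is far easier to fit on a single GPU.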
We load the model using the PaliGemmaForConditionalGeneration class from the transformers library by passing in the model ID and quantization configuration. Similarly, we load the processor, which preprocesses the inputs into tensors before passing them to the model.
Step 3: Image Processing
Once the model shards are loaded, we preprocess the images so that they all reach the model in a consistent, compatible format. Here, we convert every image to RGB:
```python
def ensure_rgb(image: Image.Image) -> Image.Image:
    # Convert grayscale, palette, or RGBA images to plain RGB
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image
```
Now, our images are ready for inference.
Step 4: Inference with PaliGemma
Now, we set up the main function for running inference with the model. This function takes in input images and questions, incorporates them into prompts, and passes them to the model via the processor for inference.
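```python
def ask_model(image: Image.Image, question: str) -> str:
    # Prompt layout: image placeholder, task prefix, language code, question
    prompt = f"<image> answer en {question}"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.inference_mode():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False,  # greedy decoding for repeatable answers
        )
    # Decode only the newly generated tokens so the prompt is not echoed back
    new_tokens = generated_ids[:, input_len:]
    result = processor.batch_decode(new_tokens, skip_special_tokens=True)
    return result[0].strip()
```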
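The prompt above follows a simple pattern: an image placeholder, a task prefix, and (for some tasks) a language code. As a minimal sketch, assuming the task prefixes documented for the PaliGemma mix checkpoints (caption, ocr, answer, detect, segment), a small helper could build prompts for the other use cases mentioned earlier. The build_prompt name and its parameters are my own and not part of the transformers API.

```python
# Hypothetical helper: builds PaliGemma-style prompts for a few common tasks.
# The task prefixes are assumed from the PaliGemma mix documentation.
LANGUAGE_TASKS = {"answer", "caption"}       # tasks that take a language code
PLAIN_TASKS = {"ocr", "detect", "segment"}   # tasks without a language code


def build_prompt(task: str, text: str = "", lang: str = "en") -> str:
    if task in LANGUAGE_TASKS:
        parts = ["<image>", task, lang, text]
    elif task in PLAIN_TASKS:
        parts = ["<image>", task, text]
    else:
        raise ValueError(f"Unsupported task: {task!r}")
    # Drop empty pieces so a bare "ocr" prompt has no trailing space
    return " ".join(part for part in parts if part)
```

For example, build_prompt("answer", "What is the total?") produces the same prompt format that ask_model() uses.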
Step 5: Extracting Key Information
Now that we have the main function ready, we’ll work next on extracting the key parameters from the image. In our case, these are the total amount and the goods category.
The extract_total_amount() function processes an image to extract the total amount from a receipt using OCR. It constructs a query (question) instructing the model to extract only numerical values, and then it calls the ask_model() function to generate a response from the model.
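The original listing for this function is not reproduced here, so the following is only a minimal sketch of how it might look. It assumes the model answers with a short string containing the amount, takes the question-answering function as a parameter (in the app, this would be ask_model), and treats commas as thousands separators. The parse_amount helper and the exact question wording are my own assumptions.

```python
import re
from typing import Any, Callable


def parse_amount(answer: str) -> float:
    """Return the first number in the model's answer, or 0.0 if there is none."""
    # Assumption: commas are thousands separators, e.g. "1,234.50"
    cleaned = answer.replace(",", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else 0.0


def extract_total_amount(image: Any, ask: Callable[[Any, str], str]) -> float:
    # Ask for numbers only so that the answer is easy to parse
    question = "What is the total amount on this receipt? Answer with the number only."
    return parse_amount(ask(image, question))
```

Keeping the parsing in its own function makes it easy to check against typical model answers without loading the model.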
The categorize_goods() function classifies the type of goods in an image by prompting the model with a predefined question listing possible categories: grocery, clothing, electronics, or other. The ask_model() function then processes the image and returns a textual response. If the processed response matches any of the predefined valid categories, it returns that category—otherwise, it defaults to the "Other" category.
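A sketch of this classification step, under the same assumptions as before (the question-answering function is passed in, and the question wording is hypothetical), could look like this. The matching here is a simple substring check, which is one possible reading of "matches any of the predefined valid categories".

```python
from typing import Any, Callable

VALID_CATEGORIES = ("grocery", "clothing", "electronics", "other")


def normalize_category(answer: str) -> str:
    """Map a free-text answer to one of the valid categories."""
    text = answer.strip().lower()
    for category in VALID_CATEGORIES:
        if category in text:
            return category.capitalize()
    return "Other"  # default when nothing matches


def categorize_goods(image: Any, ask: Callable[[Any, str], str]) -> str:
    question = (
        "Is this receipt for grocery, clothing, electronics, or other? "
        "Answer with one word."
    )
    return normalize_category(ask(image, question))
```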
Step 6: Analyzing Information
We have all the key functions ready, so let’s analyze the outputs.
For this, we use a helper function that creates a pie chart to visualize the spending distribution across categories. If no valid spending data exists, it generates a blank figure with the message "No Spending Data." Otherwise, it draws a pie chart with category labels and percentage values, using an equal aspect ratio so the chart stays proportional and well aligned.
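The chart helper described above might be sketched as follows. The function names, figure size, and title are assumptions, and Matplotlib is imported inside the plotting function only so that the data-preparation step can be used on its own.

```python
from typing import Dict, List, Tuple


def nonzero_categories(category_totals: Dict[str, float]) -> Tuple[List[str], List[float]]:
    """Keep only the categories that have spending above zero."""
    items = [(name, amount) for name, amount in category_totals.items() if amount > 0]
    return [name for name, _ in items], [amount for _, amount in items]


def create_spending_chart(category_totals: Dict[str, float]):
    import matplotlib.pyplot as plt  # local import: only needed for plotting

    labels, values = nonzero_categories(category_totals)
    fig, ax = plt.subplots(figsize=(6, 6))
    if not values:
        # Blank figure with a message when there is nothing to plot
        ax.text(0.5, 0.5, "No Spending Data", ha="center", va="center")
        ax.axis("off")
        return fig
    ax.pie(values, labels=labels, autopct="%1.1f%%", startangle=90)
    ax.axis("equal")  # keep the pie circular
    ax.set_title("Spending Distribution")
    return fig
```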
Step 7: Analyzing Multiple Bills Simultaneously
We typically have multiple bills to analyze, so let’s create a function that processes all of them in a single call.
For analyzing multiple bills at once, we perform the following steps:
- Initialize storage: We create lists for storing results and images, set total_spending to 0, and define a dictionary for category-wise totals.
- Process each bill:
  - Open and convert the image to RGB.
  - Append the image to the list.
  - Extract the total amount from the receipt.
  - Categorize the goods in the receipt.
  - Update total spending and category-wise totals.
  - Store the extracted data in a results list.
- Generate insights: We create a spending distribution pie chart along with a summary of total spending.
- Return results: Finally, we return the list of images, a DataFrame of bill summaries, the total spending summary, and the spending chart.
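The steps above can be sketched as a single loop. To keep this example independent of the model, the image loader, the amount extractor, and the categorizer are passed in as functions. The name summarize_bills, the column names, and the summary format are assumptions, and the chart and DataFrame creation are left to the caller.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def summarize_bills(
    paths: Iterable[str],
    load_image: Callable[[str], Any],
    extract_total: Callable[[Any], float],
    categorize: Callable[[Any], str],
) -> Tuple[List[Any], List[Dict[str, Any]], str, Dict[str, float]]:
    # Initialize storage
    images: List[Any] = []
    rows: List[Dict[str, Any]] = []
    total_spending = 0.0
    category_totals: Dict[str, float] = {
        "Grocery": 0.0, "Clothing": 0.0, "Electronics": 0.0, "Other": 0.0,
    }

    # Process each bill
    for path in paths:
        image = load_image(path)
        images.append(image)
        amount = extract_total(image)
        category = categorize(image)
        total_spending += amount
        category_totals[category] = category_totals.get(category, 0.0) + amount
        rows.append({"Bill": path, "Total Amount": amount, "Category": category})

    summary = f"Total spending: {total_spending:.2f}"
    return images, rows, summary, category_totals
```

In the app, load_image could be something like lambda p: ensure_rgb(Image.open(p)), the rows could be wrapped in pd.DataFrame(rows), and category_totals could be passed to the chart helper.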
Step 8: Build the Gradio Interface
Now that all the key logic functions are in place, we build an interactive UI with Gradio.
The interface code creates a structured Gradio UI with a file uploader for multiple images and a submit button to trigger processing. Upon submission, the uploaded bill images are displayed in a gallery, the extracted data is shown in a table, total spending is summarized in text, and a spending distribution pie chart is generated.
The submit button's click event connects the uploaded files to the process_multiple_bills() function, which handles data extraction and visualization. Finally, the demo.launch() method starts the Gradio app for real-time interaction.
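A minimal sketch of such an interface is shown below. The layout, labels, and the build_demo wrapper are assumptions. The processing function is passed in as a parameter and is expected to return a list of images, a table, a summary string, and a Matplotlib figure, in that order.

```python
from typing import Callable


def build_demo(process_fn: Callable):
    import gradio as gr  # local import: only needed when the UI is built

    with gr.Blocks() as demo:
        gr.Markdown("# Bill Scanner and Spending Analyzer")
        files = gr.File(
            file_count="multiple", file_types=["image"], label="Upload bills"
        )
        submit = gr.Button("Analyze bills")
        gallery = gr.Gallery(label="Uploaded bills")
        table = gr.Dataframe(label="Extracted data")
        summary = gr.Textbox(label="Total spending")
        chart = gr.Plot(label="Spending distribution")

        # Wire the button to the processing function
        submit.click(
            fn=process_fn,
            inputs=files,
            outputs=[gallery, table, summary, chart],
        )
    return demo
```

With the functions from the previous steps defined, build_demo(process_multiple_bills).launch() would start the app.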
I also tried this demo with two image-based bills (Amazon shopping invoices) and got the following results.
Note: VLMs can struggle to read numbers accurately, which may lead to incorrect results at times. For instance, the model extracted the wrong total amount for the second bill. This can often be improved by using a larger model or by fine-tuning the existing one.
Conclusion
In this tutorial, we built an AI-powered multiple bill scanner using PaliGemma 2 Mix, which can help us extract and categorize our expenses from receipts. We used PaliGemma 2 Mix’s vision-language capabilities for OCR and classification to analyze spending insights effortlessly. I encourage you to adapt this tutorial to your own use case.