PaliGemma 2 Mix: A Guide With Demo OCR Project

Table of Contents

What Is PaliGemma 2 Mix?

Project Overview: Bill Scanner and Spending Analyzer With PaliGemma 2 Mix

Step 1: Prerequisites

Step 2: Model Initialization

Step 3: Image Processing

Step 4: Inference with PaliGemma

Step 5: Extracting Key Information

Step 6: Analyzing Information

Step 7: Analyzing Multiple Bills Simultaneously

Step 8: Build the Gradio Interface

Conclusion

Christopher Nolan

Feb 28, 2025 pm 04:32 PM

PaliGemma 2 Mix is a multimodal AI model developed by Google. It is an improved version of the PaliGemma vision language model (VLM), integrating advanced capabilities from the SigLIP vision model and Gemma 2 language models.

In this tutorial, I’ll explain how to use PaliGemma 2 Mix to build an AI-powered bill scanner and spending analyzer capable of:

Extracting and categorizing expenses from bill receipts.
Performing optical character recognition (OCR) to retrieve key information.
Summarizing spending based on provided images.

While our focus will be on building a financial insights tool, you can use what you learn in this blog to explore other use cases of PaliGemma 2 Mix, such as image segmentation, object detection, and question answering.

What Is PaliGemma 2 Mix?

PaliGemma 2 Mix is an advanced vision-language model (VLM) that processes both images and text as input and generates text-based outputs. It is designed to handle a diverse range of multimodal AI tasks while supporting multiple languages.

PaliGemma 2 is designed for a wide array of vision-language tasks, including image and short video captioning, visual question answering, optical character recognition (OCR), object detection, and segmentation.

Source of the images used in the diagram: Google

The PaliGemma 2 Mix model is designed for:

Image & short video captioning: Generating accurate and context-aware captions for static images and short videos.
Visual question answering (VQA): Analyzing images and answering text-based questions based on visual content.
Optical character recognition (OCR): Extracting and interpreting text from images, which makes it useful for documents, receipts, and scanned materials.
Object detection & segmentation: Identifying, labeling, and segmenting objects within an image for structured analysis.
Multi-language support: Generating and understanding text in multiple languages for global applications.

You can find more information about the PaliGemma 2 Mix model in the official release article.

Project Overview: Bill Scanner and Spending Analyzer With PaliGemma 2 Mix

Let’s outline the main steps that we’re going to take:

Load and prepare the dataset: We begin by loading the receipt images that serve as input.
Initialize the PaliGemma 2 Mix model: We configure and load the model for vision-language tasks.
Process input images: We convert the images to RGB format so they are ready for analysis.
Extract key information: We perform optical character recognition (OCR) to retrieve the total amount.
Categorize expenses: We classify purchases into categories such as grocery, clothing, electronics, and other.
Generate spending insights: We summarize the categorized expenses and generate a spending distribution chart.
Build an interactive Gradio interface: Finally, we create a UI where users can upload multiple bills, extract data, and analyze spending visually.

Step 1: Prerequisites

Before we start, let’s ensure that we have the following tools and libraries installed:

Python 3.8 or later
torch
Transformers
PIL (Pillow)
pandas
Matplotlib
Gradio

Run the following command to install the necessary dependencies:

pip install -q -U gradio bitsandbytes transformers

Once the dependencies are installed, run the following imports:

import gradio as gr
import torch
import pandas as pd
import matplotlib.pyplot as plt
from transformers import PaliGemmaForConditionalGeneration, PaliGemmaProcessor, BitsAndBytesConfig
from PIL import Image
import re

Step 2: Model Initialization

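We configure and load the PaliGemma 2 Mix model with quantization to reduce its memory footprint. For this demo, we’ll use the 10B-parameter model with 448 x 448 input image resolution. You need a GPU runtime for this model: the setup used here is a Colab GPU configuration with 40GB of memory. Note that a T4 on its own has 16GB of VRAM, so on smaller cards the 4-bit option mentioned in the code comment may be necessary.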

device = "cuda" if torch.cuda.is_available() else "cpu"
# Model setup
model_id = "google/paligemma2-10b-mix-448" 
bnb_config = BitsAndBytesConfig(
    load_in_8bit=True,  # Change to load_in_4bit=True for even lower memory usage
    llm_int8_threshold=6.0,
)

# Load model with quantization
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, quantization_config=bnb_config
).eval()

# Load processor
processor = PaliGemmaProcessor.from_pretrained(model_id)
# Print success message
print("Model and processor loaded successfully!")


BitsAndBytes quantization reduces memory usage while largely preserving model quality, which makes it possible to run large models on limited GPU resources. In this implementation, we use 8-bit quantization (load_in_8bit=True); switching to load_in_4bit=True lowers memory usage further.
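To see why quantization matters here, a rough back-of-the-envelope estimate of the weight memory helps. The sketch below is only an approximation: it treats "10b" as 10 billion parameters, counts the weights alone (no activations, generation cache, or framework overhead), and the helper name weight_memory_gb is hypothetical.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 10_000_000_000  # assumption: "10b" taken as 10 billion parameters

for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(NUM_PARAMS, bits):.0f} GB")
# 16-bit: ~20 GB
# 8-bit: ~10 GB
# 4-bit: ~5 GB
```

Real usage will be higher than these figures, since inference also needs memory for activations and intermediate buffers.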

We load the model using the PaliGemmaForConditionalGeneration class from the transformers library by passing in the model ID and quantization configuration. Similarly, we load the processor, which converts the inputs into tensors before they are passed to the model.

Step 3: Image Processing

Once the model shards are loaded, we preprocess the images before passing them to the model so that every input has a consistent, compatible format. Here, that means converting each image to RGB:

def ensure_rgb(image: Image.Image) -> Image.Image:
    if image.mode != "RGB":
        image = image.convert("RGB")
    return image


Now, our images are ready for inference.

Step 4: Inference with PaliGemma

Now, we set up the main function for running inference with the model. This function takes in input images and questions, incorporates them into prompts, and passes them to the model via the processor for inference.

def ask_model(image: Image.Image, question: str) -> str:
    prompt = f"<image> answer en {question}"
    inputs = processor(text=prompt, images=image, return_tensors="pt").to(device)
    with torch.inference_mode():
        generated_ids = model.generate(
            **inputs,
            max_new_tokens=50,
            do_sample=False
        )
    result = processor.batch_decode(generated_ids, skip_special_tokens=True)
    return result[0].strip()


Step 5: Extracting Key Information

Now that we have the main inference function ready, we’ll extract the key parameters from each image. In our case, these are the total amount and the goods category.

The extract_total_amount() function processes an image to extract the total amount from a receipt using OCR. It constructs a query (question) instructing the model to extract only numerical values, and then it calls the ask_model() function to generate a response from the model.
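A minimal sketch of how such a function might look is shown below. It is not the original code: the question wording, the parse_amount helper, and the ask parameter are assumptions made for illustration. In the tutorial's setting, ask would be the ask_model() function defined in Step 4; passing it in as an argument keeps the sketch runnable without loading the model.

```python
import re
from typing import Callable


def parse_amount(answer: str) -> float:
    """Pull the first number out of a model answer; return 0.0 if none is found."""
    cleaned = answer.replace(",", "")  # assumption: commas are thousands separators
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return float(match.group()) if match else 0.0


def extract_total_amount(image, ask: Callable[[object, str], str]) -> float:
    """Ask the model for the receipt total and convert the reply to a float."""
    # Hypothetical prompt wording; adjust it to whatever works best for your receipts.
    question = "What is the total amount on this receipt? Answer with only the number."
    return parse_amount(ask(image, question))
```

Falling back to 0.0 when no number is found keeps a single unreadable receipt from breaking the rest of the pipeline, at the cost of silently under-counting that bill.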

The categorize_goods() function classifies the type of goods in an image by prompting the model with a predefined question listing the possible categories: grocery, clothing, electronics, or other. The ask_model() function then processes the image and returns a textual response. If the cleaned-up response matches one of the predefined valid categories, the function returns that category; otherwise, it defaults to the "Other" category.
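The classification step could be sketched roughly as follows. This is an illustration under assumptions, not the original implementation: the prompt text, the VALID_CATEGORIES tuple, the normalize_category helper, and the ask parameter (standing in for ask_model()) are all hypothetical.

```python
from typing import Callable

VALID_CATEGORIES = ("grocery", "clothing", "electronics", "other")


def normalize_category(answer: str) -> str:
    """Map a free-text model answer onto one of the valid categories."""
    text = answer.strip().lower()
    for category in VALID_CATEGORIES:
        if category in text:  # substring match tolerates answers like "It is clothing."
            return category.capitalize()
    return "Other"  # default when nothing matches


def categorize_goods(image, ask: Callable[[object, str], str]) -> str:
    """Ask the model which category the receipt belongs to."""
    # Hypothetical prompt wording listing the allowed categories.
    question = (
        "Is this receipt for grocery, clothing, electronics, or other goods? "
        "Answer with one word."
    )
    return normalize_category(ask(image, question))
```

Matching on substrings is a deliberate trade-off: it is forgiving of extra words in the answer, but a stricter exact match may be preferable if the model tends to mention several categories at once.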

Step 6: Analyzing Information

We have all the key functions ready, so let’s analyze the outputs by visualizing how spending is distributed across categories.

The charting function creates a pie chart to visualize spending distribution across different categories. If no valid spending data exists, it generates a blank figure with a message indicating "No Spending Data." Otherwise, it creates a pie chart with category labels and percentage values, ensuring a proportional and well-aligned visualization.
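One possible shape for such a charting function is sketched below. The names create_spending_chart and positive_spending, the figure size, and the title are assumptions for illustration; only the behavior (a pie chart with percentages, or a "No Spending Data" placeholder) comes from the description above.

```python
def positive_spending(category_totals: dict) -> dict:
    """Keep only categories with a positive total; empty slices add nothing to a pie."""
    return {name: total for name, total in category_totals.items() if total > 0}


def create_spending_chart(category_totals: dict):
    """Return a matplotlib Figure: a pie chart, or a placeholder if there is no data."""
    # Imported here so positive_spending() stays usable without matplotlib installed.
    import matplotlib.pyplot as plt

    spending = positive_spending(category_totals)
    fig, ax = plt.subplots(figsize=(5, 5))
    if not spending:
        # Blank figure with a centered message instead of an empty pie.
        ax.text(0.5, 0.5, "No Spending Data", ha="center", va="center")
        ax.axis("off")
        return fig

    ax.pie(
        list(spending.values()),
        labels=list(spending.keys()),
        autopct="%1.1f%%",  # percentage label on each slice
        startangle=90,
    )
    ax.axis("equal")  # equal aspect ratio keeps the pie circular
    ax.set_title("Spending Distribution")
    return fig
```

Returning the Figure object, rather than calling plt.show(), makes the result easy to hand to a UI component later.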

Step 7: Analyzing Multiple Bills Simultaneously

We typically have multiple bills to analyze, so let’s create a function to process all our bills simultaneously.

For analyzing multiple bills at once, we perform the following steps:

Initialize storage: We create lists for storing results and images, set total_spending to 0, and define a dictionary for category-wise totals.
Process each bill:
    Open and convert the image to RGB.
    Append the image to the list.
    Extract the total amount from the receipt.
    Categorize the goods in the receipt.
    Update total spending and category-wise totals.
    Store the extracted data in a results list.
Generate insights: We create a spending distribution pie chart along with a summary of total spending.
Return results: Finally, we return the list of images, a DataFrame of bill summaries, the total spending summary, and the spending chart.
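The aggregation loop described in these steps can be sketched as below. To keep the example dependency-free, it departs from the description in two labeled ways: the image loader and the two extraction functions are passed in as arguments (load_image, get_amount, and get_category are hypothetical parameter names), and it returns plain rows plus the category totals instead of a DataFrame and a chart. In the tutorial's setting, the rows would be wrapped with pd.DataFrame(rows) and the totals handed to the charting function.

```python
def process_multiple_bills(files, load_image, get_amount, get_category):
    """Aggregate per-bill results into images, rows, a summary, and category totals."""
    images, results = [], []
    total_spending = 0.0
    category_totals = {"Grocery": 0.0, "Clothing": 0.0, "Electronics": 0.0, "Other": 0.0}

    for f in files:
        # Upload widgets may hand over file objects (with .name) or plain paths.
        path = getattr(f, "name", f)
        image = load_image(path)  # e.g. open with PIL and convert to RGB
        images.append(image)

        amount = get_amount(image)
        category = get_category(image)

        total_spending += amount
        category_totals[category] = category_totals.get(category, 0.0) + amount
        results.append({"Bill": str(path), "Total Amount": amount, "Category": category})

    summary = f"Total spending: {total_spending:.2f}"
    return images, results, summary, category_totals
```

Note that the bills are handled one after another in a simple loop; each model call still runs sequentially.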

Step 8: Build the Gradio Interface

Now we have all the key logic functions in place. Next, we build an interactive UI with Gradio.

The interface code creates a structured Gradio UI with a file uploader for multiple images and a submit button to trigger processing. On submission, the uploaded bill images are displayed in a gallery, the extracted data is shown in a table, total spending is summarized as text, and a spending distribution pie chart is generated.

The button's click handler connects the uploaded files to the process_multiple_bills() function and routes its outputs to the display components. Finally, demo.launch() starts the Gradio app for real-time interaction.
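A layout along these lines could be assembled with Gradio's Blocks API, as in the sketch below. The build_demo wrapper, the component labels, and the title text are assumptions for illustration; process_fn stands in for process_multiple_bills() and is assumed to return four values in the order gallery images, table data, summary text, and figure.

```python
def build_demo(process_fn):
    """Build a Gradio Blocks app that sends uploaded bills to process_fn."""
    # Imported here so this module can be loaded without Gradio installed.
    import gradio as gr

    with gr.Blocks() as demo:
        gr.Markdown("# Bill Scanner and Spending Analyzer")

        file_input = gr.File(
            file_count="multiple",  # allow several bills in one upload
            file_types=["image"],
            label="Upload bill images",
        )
        submit_btn = gr.Button("Analyze bills")

        gallery = gr.Gallery(label="Uploaded bills")
        table = gr.Dataframe(label="Extracted data")
        summary = gr.Textbox(label="Total spending")
        chart = gr.Plot(label="Spending distribution")

        # One click runs the whole pipeline and fills all four outputs.
        submit_btn.click(
            fn=process_fn,
            inputs=file_input,
            outputs=[gallery, table, summary, chart],
        )
    return demo


# Usage, assuming process_multiple_bills is defined as described above:
# demo = build_demo(process_multiple_bills)
# demo.launch()
```

Wrapping the layout in a function is a design choice for this sketch only; defining the Blocks context at module level and calling demo.launch() directly works just as well in a notebook.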

I also tried this demo with two image-based bills (Amazon shopping invoices) and got the following results.

Note: VLMs can struggle to extract numbers accurately, which may lead to incorrect results at times. For instance, the model extracted the wrong total amount for the second bill below. This can often be mitigated by using a larger model or by fine-tuning the existing one.

Conclusion

In this tutorial, we built an AI-powered multiple-bill scanner using PaliGemma 2 Mix, which can help us extract and categorize our expenses from receipts. We used PaliGemma 2 Mix’s vision-language capabilities for OCR and classification to generate spending insights with little manual effort. I encourage you to adapt this tutorial to your own use case.
