
A Comprehensive Guide to Building Multimodal RAG Systems


Retrieval Augmented Generation (RAG) systems have become the de facto standard for building intelligent AI assistants that answer questions on custom enterprise data without the hassle of expensive fine-tuning of large language models (LLMs). A key advantage of RAG systems is that you can easily integrate your own data to augment your LLM's intelligence and get more contextual answers to your questions. The key limitation of most RAG systems, however, is that they work well only on text data, while a lot of real-world data is multimodal in nature, meaning a mixture of text, images, tables, and more. In this comprehensive hands-on guide, we will build a Multimodal RAG System that can handle mixed data formats using intelligent data transformations and multimodal LLMs.


Overview

  • Retrieval Augmented Generation (RAG) systems enable intelligent AI assistants to answer questions on custom enterprise data without needing expensive LLM fine-tuning.
  • Traditional RAG systems are constrained to text data, making them ineffective for multimodal data, which includes text, images, tables, and more.
  • Multimodal RAG systems integrate multimodal data processing (text, images, tables) and utilize multimodal LLMs, like GPT-4o, to provide more contextual and accurate answers.
  • Multimodal LLMs like GPT-4o, Gemini, and LLaVA-NeXT can process and generate responses from multiple data formats, handling mixed inputs like text and images.
  • This guide walks through building a Multimodal RAG system with LangChain, integrating intelligent document loaders, vector databases, and multi-vector retrievers.
  • It shows how to process complex multimodal queries by combining multimodal LLMs with intelligent retrieval, creating advanced AI systems capable of answering diverse data-driven questions.

Table of contents

  • Traditional RAG System Architecture
  • Traditional RAG System limitations
  • What is Multimodal Data?
  • What is a Multimodal Large Language Model?
  • Multimodal RAG System Workflow
    • End-to-End Workflow
    • Multi-Vector Retrieval Workflow
  • Detailed Multimodal RAG System Architecture
  • Hands-on Implementation of our Multimodal RAG System
    • Install Dependencies
    • Downloading Data
    • Extracting Document Elements with Unstructured
    • Enter OpenAI API Key
    • Setup Environment Variables
    • Load Connection to Multimodal LLM
    • Setup the Multi-vector Retriever
    • Index Documents and Summaries in the Multi-Vector Retriever
    • Test the Multi-vector Retriever
    • Build End-to-End Multimodal RAG Pipeline
    • Test the Multimodal RAG Pipeline
  • Conclusion
  • Frequently Asked Questions

Traditional RAG System Architecture

A retrieval augmented generation (RAG) system architecture typically consists of two major steps:

  1. Data Processing and Indexing
  2. Retrieval and Response Generation

In Step 1, Data Processing and Indexing, we focus on getting our custom enterprise data into a more consumable format by loading typically the text content from these documents, splitting large text elements into smaller chunks, converting them into embeddings using an embedder model and then storing these chunks and embeddings into a vector database as depicted in the following figure.

[Figure: Data processing and indexing workflow]
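
To make Step 1 concrete, here is a minimal sketch of a traditional text-only indexing pipeline in LangChain; the file name, chunk sizes, and embedding model choice are illustrative assumptions rather than part of this guide's setup.

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma

# load the raw text content of the document (hypothetical file name)
pages = PyPDFLoader('enterprise_report.pdf').load()

# split large text elements into smaller chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(pages)

# embed the chunks and store both chunks and embeddings in a vector database
vector_db = Chroma.from_documents(chunks,
                                  embedding=OpenAIEmbeddings(model='text-embedding-3-small'))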

In Step 2, the workflow starts with the user asking a question, relevant text document chunks which are similar to the input question are retrieved from the vector database and then the question and the context document chunks are sent to an LLM to generate a human-like response as depicted in the following figure.

[Figure: Retrieval and response generation workflow]
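
Here is a matching sketch of Step 2, reusing the vector_db from the snippet above; the prompt wording and the number of retrieved chunks are again illustrative assumptions.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

# retrieve the chunks most similar to the user question
retriever = vector_db.as_retriever(search_kwargs={"k": 4})
question = "What does the report say about annual wildfire trends?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))

# send the question plus retrieved context to the LLM for a grounded answer
prompt = ChatPromptTemplate.from_template(
    "Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
)
rag_chain = prompt | ChatOpenAI(model_name='gpt-4o', temperature=0) | StrOutputParser()
answer = rag_chain.invoke({"context": context, "question": question})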

This two-step workflow is commonly used in the industry to build a traditional RAG system; however, it does have its own set of limitations, some of which we discuss below in detail.

Traditional RAG System limitations

Traditional RAG systems have several limitations, some of which are mentioned as follows:

  • They cannot access real-time data
  • The system is only as good as the data in your vector database
  • Most RAG systems work only on text data, for both retrieval and generation
  • Traditional LLMs can only process text content to generate answers
  • They are unable to work with multimodal data

In this article, we will focus particularly on solving the inability of traditional RAG systems to work with multimodal content, and the limitation of traditional LLMs, which can only reason over and analyze text data to generate responses. Before diving into multimodal RAG systems, let's first understand what multimodal data is.

What is Multimodal Data?

Multimodal data is essentially data belonging to multiple modalities. The formal definition of modality comes from the context of Human-Computer Interaction (HCI) systems, where a modality is termed as the classification of a single independent channel of input and output between a computer and human (more details on Wikipedia). Common Computer-Human modalities include the following:

  • Text: Input and output through written language (e.g., chat interfaces).
  • Speech: Voice-based interaction (e.g., voice assistants).
  • Vision: Image and video processing for visual recognition (e.g., face detection).
  • Gestures: Hand and body movement tracking (e.g., gesture controls).
  • Touch: Haptic feedback and touchscreens.
  • Audio: Sound-based signals (e.g., music recognition, alerts).
  • Biometrics: Interaction through physiological data (e.g., eye-tracking, fingerprints).

In short, multimodal data is essentially data that has a mixture of modalities or formats, as seen in the sample document below, with some of the distinct formats highlighted in various colors.

[Figure: Sample document page with text, table, and image elements highlighted]

The key focus here is to build a RAG system that can handle documents with a mixture of data modalities, such as text, images, tables, and maybe even audio and video, depending on your data sources. This guide will focus on handling text, images, and tables. One of the key components needed to understand such data is multimodal large language models (LLMs).

What is a Multimodal Large Language Model?

Multimodal Large Language Models (LLMs) are essentially transformer-based LLMs that have been pre-trained and fine-tuned on multimodal data to analyze and understand various data formats, including text, images, tables, audio, and video. A true multimodal model should ideally be able not just to understand mixed data formats but also to generate them, as shown in the following workflow illustration of NExT-GPT, published in the paper NExT-GPT: Any-to-Any Multimodal Large Language Model.

[Figure: NExT-GPT any-to-any multimodal workflow]

From the paper on NExT-GPT, any true multimodal model would typically have the following stages:

  • Multimodal Encoding Stage. Leveraging existing well-established models to encode inputs of various modalities.
  • LLM Understanding and Reasoning Stage. An LLM is used as the core agent of NExT-GPT. Technically, they employ the Vicuna LLM, which takes as input the representations from the different modalities and carries out semantic understanding and reasoning over them. It outputs (1) textual responses directly and (2) signal tokens for each modality that instruct the decoding layers on whether to generate multimodal content and, if so, what content to produce.
  • Multimodal Generation Stage. Receiving the multimodal signals and specific instructions from the LLM (if any), the Transformer-based output projection layers map the signal token representations into ones that the downstream multimodal decoders can understand. Technically, they employ current off-the-shelf latent-conditioned diffusion models for the different modalities, i.e., Stable Diffusion (SD) for image synthesis, Zeroscope for video synthesis, and AudioLDM for audio synthesis.

However, most current Multimodal LLMs available for practical use are one-sided, which means they can understand mixed data formats but only generate text responses. The most popular commercial multimodal models are as follows:

  • GPT-4V & GPT-4o (OpenAI): GPT-4o can understand text, images, audio, and video, although audio and video analysis are still not open to the public.
  • Gemini (Google): A multimodal LLM from Google with true multimodal capabilities where it can understand text, audio, video, and images.
  • Claude (Anthropic): A highly capable commercial LLM that includes multimodal capabilities in its latest versions, such as handling text and image inputs.

You can also consider open or open-source multimodal LLMs in case you want to build a completely open-source solution or have concerns on data privacy or latency and prefer to host everything locally in-house. The most popular open and open-source multimodal models are as follows:

  • LLaVA-NeXT: An open-source multimodal model that can work with text, images, and also video, and is an improvement on top of the popular LLaVA model.
  • PaliGemma: A vision-language model from Google that integrates both image and text processing, designed for tasks like optical character recognition (OCR), object detection, and visual question answering (VQA).
  • Pixtral 12B: An advanced multimodal model from Mistral AI with 12 billion parameters that can process both images and text. Built on Mistral’s Nemo architecture, Pixtral 12B excels in tasks like image captioning and object recognition.

For our Multimodal RAG System, we will leverage GPT-4o, one of the most powerful multimodal models currently available.

Multimodal RAG System Workflow

In this section, we will explore potential ways to build the architecture and workflow of a multimodal RAG system. The following figure illustrates potential approaches in detail and highlights the one we will use in this guide.

[Figure: Possible multimodal RAG workflows, with the option used in this guide highlighted]

End-to-End Workflow

Multimodal RAG systems can be implemented in various ways. The figure above illustrates three possible workflows, as recommended in the LangChain blog. These include:

  • Option 1: Use multimodal embeddings (such as CLIP) to embed images and text together. Retrieve either using similarity search, but simply link to images in a docstore. Pass raw images and text chunks to a multimodal LLM for synthesis.
  • Option 2: Use a multimodal LLM (such as GPT-4o, GPT-4V, LLaVA) to produce text summaries from images. Embed and retrieve the text summaries using a text embedding model. Again, reference raw text chunks or tables from a docstore for answer synthesis by a regular LLM; in this case, we exclude images from the docstore.
  • Option 3: Use a multimodal LLM (such as GPT-4o, GPT-4V, LLaVA) to produce text, table, and image summaries (text chunk summaries are optional). Embed and retrieve the text, table, and image summaries with references to the raw elements, as in Option 1. Again, raw images, tables, and text chunks are passed to a multimodal LLM for answer synthesis. This option is sensible if we don't want to use multimodal embeddings, which don't work well with images that are mostly charts and visuals. However, we could also use multimodal embedding models here to embed images and summary descriptions together if necessary.

Option 1 is limited when the images are mostly charts and visuals, which is often the case in real documents: multimodal embedding models often cannot encode granular information, such as the numbers in these visuals, into meaningful embeddings. Option 2 is severely limited because images are never used in the system at all, even though they might contain valuable information, so it is not truly a multimodal RAG system.

Hence, we will proceed with Option 3 as our Multimodal RAG System workflow. In this workflow, we will create summaries out of our images, tables, and, optionally, our text chunks and use a multi-vector retriever, which can help in mapping and retrieving the original image, table, and text elements based on their corresponding summaries.

Multi-Vector Retrieval Workflow

Considering the workflow we will implement as discussed in the previous section, for our retrieval workflow, we will be using a multi-vector retriever as depicted in the following illustration, as recommended and mentioned in the LangChain blog. The key purpose of the multi-vector retriever is to act as a wrapper and help in mapping every text chunk, table, and image summary to the actual text chunk, table, and image element, which can then be obtained during retrieval.

[Figure: Multi-vector retrieval workflow]

The workflow illustrated above first uses a document parsing tool like Unstructured to extract the text, table, and image elements separately. Then we pass each extracted element to an LLM and generate a detailed text summary, as depicted above. Next, we store the summaries and their embeddings in a vector database using any popular embedding model, such as OpenAI's embedding models. We also store the corresponding raw document element (text, table, image) for each summary in a document store, which can be any database platform, like Redis.

The multi-vector retriever links each summary and its embedding to the original document's raw element (text, table, image) using a common document identifier (doc_id). When a user question comes in, the multi-vector retriever first retrieves the summaries that are semantically similar to the question (based on embedding similarity) and then, using the common doc_ids, returns the original text, table, and image elements, which are passed on to the RAG system's LLM as context to answer the question.
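
As a toy illustration of that linkage (plain Python dictionaries standing in for the vectorstore and docstore rather than the actual LangChain API we use later), the shared doc_id is what ties a searchable summary back to its raw element:

import uuid

summary = "Line chart of annual wildfires and acres burned, 1993-2022."  # what gets embedded and searched
raw_element = "<base64-encoded chart image>"                             # what the LLM ultimately receives

doc_id = str(uuid.uuid4())
vectorstore_records = {doc_id: summary}      # summaries and their embeddings live here
docstore_records = {doc_id: raw_element}     # raw text, table, and image elements live here

# retrieval: similarity search over summaries -> matching doc_id -> raw element as LLM context
top_hit_id = next(iter(vectorstore_records))  # pretend this summary was the best match
context_for_llm = docstore_records[top_hit_id]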

Detailed Multimodal RAG System Architecture

Now, let’s dive deep into the detailed system architecture of our multimodal RAG system. We will understand each component in this workflow and what happens step-by-step. The following illustration depicts this architecture in detail.

[Figure: Detailed multimodal RAG system architecture]

We will now discuss the key steps of the above-illustrated multimodal RAG System and how it will work. The workflow is as follows:

  1. Load all documents and use a document loader like unstructured.io to extract text chunks, images, and tables.
  2. If necessary, convert HTML tables to Markdown; Markdown tables tend to work very well with LLMs.
  3. Pass each text chunk, image, and table into a multimodal LLM like GPT-4o and get a detailed summary.
  4. Store the summaries in a vector DB and the raw document elements in a document DB like Redis.
  5. Connect the two databases with a common document_id using a multi-vector retriever, which identifies which summary maps to which raw document element.
  6. Connect this multi-vector retrieval system with a multimodal LLM like GPT-4o.
  7. Query the system; based on the summaries similar to the query, retrieve the raw document elements, including tables and images, as the context.
  8. Using the above context, generate a response to the question with the multimodal LLM.

It’s not too complicated once you see all the components in place and structure the flow using the above steps! Let’s implement this system now in the next section.

Hands-on Implementation of our Multimodal RAG System

We will now implement the Multimodal RAG system we have discussed so far using LangChain. We will load the raw text, table, and image elements from our documents into Redis, store the element summaries and their embeddings in our vector database (Chroma), and connect the two with a multi-vector retriever. Connections to LLMs and prompting will be handled with LangChain. For our multimodal LLM, we will use GPT-4o, a powerful multimodal LLM; however, you are free to use any other multimodal LLM, including the open-source options mentioned earlier. It is recommended to use a powerful multimodal LLM that can effectively understand images, tables, and text to generate quality responses.

Install Dependencies

We start by installing the necessary dependencies, which are the libraries we will use to build our system. This includes langchain and unstructured, as well as supporting packages like openai and chroma, plus utilities for data processing and for extracting tables and images.

!pip install langchain
!pip install langchain-openai
!pip install langchain-chroma
!pip install langchain-community
!pip install langchain-experimental
!pip install "unstructured[all-docs]"
!pip install htmltabletomd
# install OCR dependencies for unstructured
!sudo apt-get install tesseract-ocr
!sudo apt-get install poppler-utils

Downloading Data

We downloaded a report on wildfire statistics in the US from the Congressional Research Service Reports page, which provides open access to detailed reports. This document has a mixture of text, tables, and images, as shown in the Multimodal Data section above. We will build a simple "chat with my PDF" application here using our multimodal RAG system, but you can easily extend it to multiple documents as well.

!wget https://sgp.fas.org/crs/misc/IF10244.pdf

OUTPUT

--2024-08-18 10:08:54-- https://sgp.fas.org/crs/misc/IF10244.pdf
Connecting to sgp.fas.org (sgp.fas.org)|18.172.170.73|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 435826 (426K) [application/pdf]
Saving to: ‘IF10244.pdf’
IF10244.pdf     100%[===================>] 425.61K  732KB/s  in 0.6s
2024-08-18 10:08:55 (732 KB/s) - ‘IF10244.pdf’ saved [435826/435826]

Extracting Document Elements with Unstructured

We will now use the unstructured library, which provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and more. We will use it to extract and chunk text elements and extract tables and images separately using the following code snippet.

from langchain_community.document_loaders import UnstructuredPDFLoader

doc = './IF10244.pdf'
# takes 1 min on Colab
loader = UnstructuredPDFLoader(file_path=doc,
                               strategy='hi_res',
                               extract_images_in_pdf=True,
                               infer_table_structure=True,
                               # section-based chunking
                               chunking_strategy="by_title",
                               max_characters=4000, # max size of chunks
                               new_after_n_chars=4000, # preferred size of chunks
                               # smaller chunks are combined into larger ones
                               # (value assumed; the original snippet was truncated here)
                               combine_text_under_n_chars=2000,
                               # return individual elements (assumed; needed for the
                               # per-element output shown below)
                               mode='elements')
data = loader.load()
len(data)

OUTPUT

7

This tells us that unstructured has successfully extracted seven elements from the document and also downloaded the images separately into the `figures` directory. It has used section-based chunking to chunk text elements based on the section headings in the document, with a chunk size of roughly 4000 characters. Also, document-intelligence deep learning models have been used to detect and extract the tables and images separately. We can look at the types of elements extracted using the following code.

[doc.metadata['category'] for doc in data]

OUTPUT

['CompositeElement',
 'CompositeElement',
 'Table',
 'CompositeElement',
 'CompositeElement',
 'Table',
 'CompositeElement']

This tells us we have some text chunks and tables in our extracted content. We can now explore and check out the contents of some of these elements.

# This is a text chunk element
data[0]

OUTPUT

Document(metadata={'source': './IF10244.pdf', 'filetype': 'application/pdf', 'languages': ['eng'], 'last_modified': '2024-04-10T01:27:48', 'page_number': 1, 'orig_elements': 'eJzF...eUyOAw==', 'file_directory': '.', 'filename': 'IF10244.pdf', 'category': 'CompositeElement', 'element_id': '569945de4df264cac7ff7f2a5dbdc8ed'}, page_content='a. aa = Informing the legislative debate since 1914 Congressional Research Service\n\nUpdated June 1, 2023\n\nWildfire Statistics\n\nWildfires are unplanned fires, including lightning-caused fires, unauthorized human-caused fires, and escaped fires from prescribed burn projects. States are responsible for responding to wildfires that begin on nonfederal (state, local, and private) lands, except for lands protected by federal agencies under cooperative agreements. The federal government is responsible for responding to wildfires that begin on federal lands. The Forest Service (FS)—within the U.S. Department of Agriculture—carries out wildfire management ...... Over 40% of those acres were in Alaska (3.1 million acres).\n\nAs of June 1, 2023, around 18,300 wildfires have impacted over 511,000 acres this year.')

The following snippet depicts one of the table elements extracted.

# This is a table element
data[2]

OUTPUT

Document(metadata={'source': './IF10244.pdf', 'last_modified': '2024-04-10T01:27:48', 'text_as_html': '...... 2018 2019 2020 2021 2022 Number of Fires (thousands) Federal 12.5 10.9 14.4 14.0 11.7 FS 50.5 59.0 59.0 69.0 Acres Burned (millions) Federal 4.6 3.1 7.1 5.2 40 FS 1.6 3.1 Lg 3.6 Total 8.8 4.7 10.1 7.1 7.6 ......', 'filetype': 'application/pdf', 'languages': ['eng'], 'page_number': 1, 'orig_elements': 'eJylVW1.....AOBFljW', 'file_directory': '.', 'filename': 'IF10244.pdf', 'category': 'Table', 'element_id': '40059c193324ddf314ed76ac3fe2c52c'}, page_content='2018 2019 2020 Number of Fires (thousands) Federal 12.5 10......Nonfederal 4.1 1.6 3.1 1.9 Total 8.8 4.7 10.1 7.1')
We can see the text content extracted from the table using the following code snippet.

print(data[2].page_content)

OUTPUT

2018 2019 2020 Number of Fires (thousands) Federal 12.5 10.9 14.4 FS 5.6 5.3 6.7 DOI 7.0 5.3 7.6 2021 14.0 6.2 7.6 11.7 5.9 5.8 Other 0.1 0.2 0.1 Nonfederal 45.6 39.6 44.6 45.0 57.2 Total 58.1 Acres Burned (millions) Federal 4.6 FS 2.3 DOI 2.3 50.5 3.1 0.6 2.3 59.0 7.1 4.8 2.3 59.0 5.2 4.1 1.0 69.0 4.0 1.9 2.1 Other Total 8.8 4.7 10.1 7.1



While this can be fed into an LLM, the structure of the table is lost here, so we can instead focus on the HTML table content itself and apply some transformations later.



data[2].metadata['text_as_html']

OUTPUT

[HTML markup of the extracted table, truncated in the original]

We can view this as HTML as follows to see what it looks like.

from IPython.display import display, Markdown

display(Markdown(data[2].metadata['text_as_html']))

OUTPUT

[Output: rendered HTML table]

It does a pretty good job here of preserving the table structure, although some of the extracted values are not correct. You can often still get away with that when using a powerful LLM like GPT-4o, as we will see later; another option is to use a more powerful table extraction model. Let's now look at how to convert this HTML table into Markdown. While we could put the HTML text directly in prompts (LLMs understand HTML tables well), it is even better to convert the HTML tables to Markdown tables, as depicted below.

import htmltabletomd

md_table = htmltabletomd.convert_table(data[2].metadata['text_as_html'])
print(md_table)

OUTPUT

| | 2018 | 2019 | 2020 | 2021 | 2022 |
| :--- | :--- | :--- | :--- | :--- | :--- |
| Number of Fires (thousands) |
| Federal | 12.5 | 10.9 | 14.4 | 14.0 | 11.7 |
| FS | 5.6 | 5.3 | 6.7 | 6.2 | 59 |
| Dol | 7.0 | 5.3 | 7.6 | 7.6 | 5.8 |
| Other | 0.1 | 0.2 | | Nonfederal | 45.6 | 39.6 | 44.6 | 45.0 | $7.2 |
| Total | 58.1 | 50.5 | 59.0 | 59.0 | 69.0 |
| Acres Burned (millions) |
| Federal | 4.6 | 3.1 | 7.1 | 5.2 | 40 |
| FS | 2.3 | 0.6 | 48 | 41 | 19 |
| Dol | 2.3 | 2.3 | 2.3 | 1.0 | 2.1 |
| Other | | Nonfederal | 4.1 | 1.6 | 3.1 | Lg | 3.6 |
| Total | 8.8 | 4.7 | 10.1 | 7.1 | 7.6 |

This looks great! Let’s now separate the text and table elements and convert all table elements from HTML to Markdown.

docs = []
tables = []

for doc in data:
    if doc.metadata['category'] == 'Table':
        tables.append(doc)
    elif doc.metadata['category'] == 'CompositeElement':
        docs.append(doc)

for table in tables:
    table.page_content = htmltabletomd.convert_table(table.metadata['text_as_html'])

len(docs), len(tables)

OUTPUT

(5, 2)

We can also validate the tables extracted and converted into Markdown.

for table in tables:
    print(table.page_content)
    print()

OUTPUT

[Output: the two extracted tables printed as Markdown]

We can now view some of the extracted images from the document as shown below.

! ls -l ./figures

OUTPUT

total 144
-rw-r--r-- 1 root root 27929 Aug 18 10:10 figure-1-1.jpg
-rw-r--r-- 1 root root 27182 Aug 18 10:10 figure-1-2.jpg
-rw-r--r-- 1 root root 26589 Aug 18 10:10 figure-1-3.jpg
-rw-r--r-- 1 root root 26448 Aug 18 10:10 figure-2-4.jpg
-rw-r--r-- 1 root root 29260 Aug 18 10:10 figure-2-5.jpg

from IPython.display import Image

Image('./figures/figure-1-2.jpg')

OUTPUT

[Output: figure-1-2.jpg]

Image('./figures/figure-1-3.jpg')

OUTPUT

[Output: figure-1-3.jpg]

Everything looks to be in order; we can see that the images from the document, which are mostly charts and graphs, have been correctly extracted.

Enter OpenAI API Key

We enter our OpenAI API key using the getpass() function so we don't accidentally expose the key in our code.

from getpass import getpass

OPENAI_KEY = getpass('Enter Open AI API Key: ')

Setup Environment Variables

Next, we set up some system environment variables that will be used later when authenticating our LLM.

import os

os.environ['OPENAI_API_KEY'] = OPENAI_KEY

Load Connection to Multimodal LLM

Next, we create a connection to GPT-4o, the multimodal LLM we will use in our system.

from langchain_openai import ChatOpenAI

chatgpt = ChatOpenAI(model_name='gpt-4o', temperature=0)

Setup the Multi-vector Retriever

We will now build our multi-vector retriever to index the image, text chunk, and table element summaries, create their embeddings, and store them in the vector database, while storing the raw elements in a document store. Connecting the two lets us retrieve the raw image, text, and table elements for user queries.

Create Text and Table Summaries

We will use GPT-4o to produce the table and text summaries. Text summaries are advised when using large chunk sizes (e.g., the 4,000-character chunks we set above). The summaries are used later to retrieve the raw tables and/or raw chunks of text via the multi-vector retriever. Creating summaries of the text elements is optional.

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_core.runnables import RunnablePassthrough

# Prompt
prompt_text = """
You are an assistant tasked with summarizing tables and text particularly for semantic retrieval.
These summaries will be embedded and used to retrieve the raw text or table elements
Give a detailed summary of the table or text below that is well optimized for retrieval.
For any tables also add in a one line description of what the table is about besides the summary.
Do not add additional words like Summary: etc.
Table or text chunk:
{element}
"""
prompt = ChatPromptTemplate.from_template(prompt_text)

# Summary chain
summarize_chain = (
                    {"element": RunnablePassthrough()}
                      |
                    prompt
                      |
                    chatgpt
                      |
                    StrOutputParser() # extracts response as text
)

# Initialize empty summaries
text_summaries = []
table_summaries = []

text_docs = [doc.page_content for doc in docs]
table_docs = [table.page_content for table in tables]

text_summaries = summarize_chain.batch(text_docs, {"max_concurrency": 5})
table_summaries = summarize_chain.batch(table_docs, {"max_concurrency": 5})

The above snippet uses a LangChain chain to create a detailed summary of each text chunk and table; we can see the output for a few of them below.

# Summary of a text chunk element
text_summaries[0]

OUTPUT

Wildfires include lightning-caused, unauthorized human-caused, and escaped prescribed burns. States handle wildfires on nonfederal lands, while federal agencies manage those on federal lands. The Forest Service oversees 193 million acres of the National Forest System, ...... In 2022, 68,988 wildfires burned 7.6 million acres, with over 40% of the acreage in Alaska. As of June 1, 2023, 18,300 wildfires have burned over 511,000 acres.
# Summary of a table element
table_summaries[0]

OUTPUT

This table provides data on the number of fires and acres burned from 2018 to 2022, categorized by federal and nonfederal sources. \n\nNumber of Fires (thousands):\n- Federal: Ranges from 10.9K to 14.4K, peaking in 2020.\n- FS (Forest Service): Ranges from 5.3K to 6.7K, with an anomaly of 59K in 2022.\n- Dol (Department of the Interior): Ranges from 5.3K to 7.6K.\n- Other: Consistently low, mostly around 0.1K.\n- ....... Other: Consistently less than 0.1M.\n- Nonfederal: Ranges from 1.6M to 4.1M, with an anomaly of "Lg" in 2021.\n- Total: Ranges from 4.7M to 10.1M.

This looks pretty good; the summaries are quite informative and should generate good embeddings for retrieval later on.

Create Image Summaries

We will use GPT-4o to produce the image summaries. However, since images cannot be passed in directly, we will base64-encode the images as strings and pass those to the model. We start by creating a few utility functions to encode images and to generate a summary for any input image by passing it to GPT-4o.

import base64
import os
from langchain_core.messages import HumanMessage

# create a function to encode images
def encode_image(image_path):
    """Getting the base64 string"""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

# create a function to summarize the image by passing a prompt to GPT-4o
def image_summarize(img_base64, prompt):
    """Make image summary"""
    chat = ChatOpenAI(model="gpt-4o", temperature=0)
    msg = chat.invoke(
        [
            HumanMessage(
                content=[
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {"url": 
                                     f"data:image/jpeg;base64,{img_base64}"},
                    },
                ]
            )
        ]
    )
    return msg.content

The above functions serve the following purpose:

  • encode_image(image_path): Reads an image file from the provided path, converts it to a binary stream, and then encodes it to a base64 string. This string can be used to send the image over to GPT-4o.
  • image_summarize(img_base64, prompt): Sends a base64-encoded image along with a text prompt to the GPT-4o model and returns a summary of the image for the given prompt, invoking the model with a message in which both the text and the image are processed.

We now use the above utilities to summarize each of our images using the following function.

def generate_img_summaries(path):
    """
    Generate summaries and base64 encoded strings for images
    path: Path to list of .jpg files extracted by Unstructured
    """
    # Store base64 encoded images
    img_base64_list = []
    # Store image summaries
    image_summaries = []
    
    # Prompt
    prompt = """You are an assistant tasked with summarizing images for retrieval.
                Remember these images could potentially contain graphs, charts or 
                tables also.
                These summaries will be embedded and used to retrieve the raw image 
                for question answering.
                Give a detailed summary of the image that is well optimized for 
                retrieval.
                Do not add additional words like Summary: etc.
             """
    
    # Apply to images
    for img_file in sorted(os.listdir(path)):
        if img_file.endswith(".jpg"):
            img_path = os.path.join(path, img_file)
            base64_image = encode_image(img_path)
            img_base64_list.append(base64_image)
            image_summaries.append(image_summarize(base64_image, prompt))
    return img_base64_list, image_summaries

# Image summaries
IMG_PATH = './figures'
imgs_base64, image_summaries = generate_img_summaries(IMG_PATH) 

We can now look at one of the images and its summary just to get an idea of how GPT-4o has generated the image summaries.

# View one of the images
display(Image('./figures/figure-1-2.jpg'))

OUTPUT

[Output: figure-1-2.jpg]

# View the image summary generated by GPT-4o
image_summaries[1]

OUTPUT

Line graph showing the number of fires (in thousands) and the acres burned (in millions) from 1993 to 2022. The left y-axis represents the number of fires, peaking around 100,000 in the mid-1990s and fluctuating between 50,000 and 100,000 thereafter. The right y-axis represents acres burned, with peaks reaching up to 10 million acres. The x-axis shows the years from 1993 to 2022. The graph uses a red line to depict the number of fires and a grey shaded area to represent the acres burned.

Overall, the summary is quite descriptive; we can embed these summaries into a vector database shortly.

Index Documents and Summaries in the Multi-Vector Retriever

We are now going to add the raw text, table and image elements and their summaries to a Multi Vector Retriever using the following strategy:

  • Store the raw texts, tables, and images in the docstore (here we are using Redis).
  • Embed the text summaries (or text elements directly), table summaries, and image summaries using an embedder model and store the summaries and embeddings in the vectorstore (here we are using Chroma) for efficient semantic retrieval.
  • Connect the two using a common doc_id identifier in the multi-vector retriever.

Start Redis Server for Docstore

The first step is to get the docstore ready. For this, we use the following code to download the open-source version of Redis and start a Redis server locally as a background process.

%%sh
curl -fsSL https://packages.redis.io/gpg | sudo gpg --dearmor -o /usr/share/keyrings/redis-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb $(lsb_release -cs) main" | sudo tee /etc/apt/sources.list.d/redis.list
sudo apt-get update  > /dev/null 2>&1
sudo apt-get install redis-stack-server  > /dev/null 2>&1
redis-stack-server --daemonize yes

OUTPUT

deb [signed-by=/usr/share/keyrings/redis-archive-keyring.gpg] https://packages.redis.io/deb jammy main
Starting redis-stack-server, database path /var/lib/redis-stack

OpenAI Embedding Models

LangChain gives us access to the OpenAI embedding models, which include the newest models: a smaller, highly efficient text-embedding-3-small model and a larger, more powerful text-embedding-3-large model. We need an embedding model to convert our element summaries into embeddings before storing them in our vector database.

from langchain_openai import OpenAIEmbeddings

# details here: https://openai.com/blog/new-embedding-models-and-api-updates
openai_embed_model = OpenAIEmbeddings(model='text-embedding-3-small')

Implement the Multi-Vector Retriever Function

We now create a function that connects our vectorstore and docstore and indexes the documents, summaries, and embeddings.

import uuid
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_community.storage import RedisStore
from langchain_community.utilities.redis import get_client
from langchain_chroma import Chroma
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings

def create_multi_vector_retriever(
    docstore, vectorstore, text_summaries, texts, table_summaries, tables, 
    image_summaries, images
):
    """
    Create retriever that indexes summaries, but returns raw images or texts
    """
    id_key = "doc_id"
    
    # Create the multi-vector retriever
    retriever = MultiVectorRetriever(
        vectorstore=vectorstore,
        docstore=docstore,
        id_key=id_key,
    )
    
    # Helper function to add documents to the vectorstore and docstore
    def add_documents(retriever, doc_summaries, doc_contents):
        doc_ids = [str(uuid.uuid4()) for _ in doc_contents]
        summary_docs = [
            Document(page_content=s, metadata={id_key: doc_ids[i]})
            for i, s in enumerate(doc_summaries)
        ]
        retriever.vectorstore.add_documents(summary_docs)
        retriever.docstore.mset(list(zip(doc_ids, doc_contents)))
    
    # Add texts, tables, and images
    # Check that text_summaries is not empty before adding
    if text_summaries:
        add_documents(retriever, text_summaries, texts)
    
    # Check that table_summaries is not empty before adding
    if table_summaries:
        add_documents(retriever, table_summaries, tables)
    
    # Check that image_summaries is not empty before adding
    if image_summaries:
        add_documents(retriever, image_summaries, images)
    return retriever

Following are the key components in the above function and their role:

  • create_multi_vector_retriever(…): This function sets up a retriever that indexes text, table, and image summaries but retrieves raw data (texts, tables, or images) based on the indexed summaries.
  • add_documents(retriever, doc_summaries, doc_contents): A helper function that generates unique IDs for the documents, adds the summarized documents to the vectorstore, and stores the full content (raw text, tables, or images) in the docstore.
  • retriever.vectorstore.add_documents(…): Adds the summaries and embeddings to the vectorstore, where the retrieval will be performed based on the summary embeddings.
  • retriever.docstore.mset(…): Stores the actual raw document content (texts, tables, or images) in the docstore, which will be returned when a matching summary is retrieved.

Create the vector database

We will now create our vectorstore using Chroma as the vector database so we can index summaries and their embeddings shortly.

# The vectorstore to use to index the summaries and their embeddings
chroma_db = Chroma(
    collection_name="mm_rag",
    embedding_function=openai_embed_model,
    collection_metadata={"hnsw:space": "cosine"},
)

Create the document database

We will now create our docstore using Redis as the database platform, so we can shortly index the actual document elements, i.e., the raw text chunks, tables, and images. Here we just connect to the Redis server we started earlier.

# Initialize the storage layer - to store raw images, text and tables
client = get_client('redis://localhost:6379')
redis_store = RedisStore(client=client) # you can use filestore, memorystore, any other DB store also

Create the multi-vector retriever

We will now index our raw document elements, their summaries, and their embeddings in the docstore and vectorstore, and build the multi-vector retriever.

# Create retriever
retriever_multi_vector = create_multi_vector_retriever(
    redis_store,  chroma_db,
    text_summaries, text_docs,
    table_summaries, table_docs,
    image_summaries, imgs_base64,
)

Test the Multi-vector Retriever

We will now test the retrieval aspect of our RAG pipeline to see if our multi-vector retriever is able to return the right text, table, and image elements for user queries. Before we try it out, let's create a utility to visualize any retrieved images, since we need to convert them back from their base64-encoded format into raw images in order to view them.

from IPython.display import HTML, display, Image
from PIL import Image
import base64
from io import BytesIO

def plt_img_base64(img_base64):
    """Disply base64 encoded string as image"""
    # Decode the base64 string
    img_data = base64.b64decode(img_base64)
    # Create a BytesIO object
    img_buffer = BytesIO(img_data)
    # Open the image using PIL
    img = Image.open(img_buffer)
    display(img)

This function takes any base64-encoded string representation of an image, converts it back into an image, and displays it. Now let's test our retriever.

# Check retrieval
query = "Tell me about the annual wildfires trend with acres burned"
docs = retriever_multi_vector.invoke(query, limit=5)
# We get 3 relevant docs
len(docs)

OUTPUT

3

We can check out the documents retrieved as follows:

docs

OUTPUT

[b'a. aa = Informing the legislative debate since 1914 Congressional Research Service\n\nUpdated June 1, 2023\n\nWildfire Statistics\n\nWildfires are unplanned fires, including lightning-caused fires, unauthorized human-caused fires, and escaped fires from prescribed burn projects ...... and an average of 7.2 million acres impacted annually. In 2022, 68,988 wildfires burned 7.6 million acres. Over 40% of those acres were in Alaska (3.1 million acres).\n\nAs of June 1, 2023, around 18,300 wildfires have impacted over 511,000 acres this year.',
 b'|  | 2018 | 2019 | 2020 | 2021 | 2022 |\n| :--- | :--- | :--- | :--- | :--- | :--- |\n| Number of Fires (thousands) |\n| Federal | 12.5 | 10.9 | 14.4 | 14.0 | 11.7 |\n| FS | 5.6 | 5.3 | 6.7 | 6.2 | 59 |\n| Dol | 7.0 | 5.3 | 7.6 | 7.6 | 5.8 |\n| Other | 0.1 | 0.2 |  45.6 | 39.6 | 44.6 | 45.0 | $7.2 |\n| Total | 58.1 | 50.5 | 59.0 | 59.0 | 69.0 |\n| Acres Burned (millions) |\n| Federal | 4.6 | 3.1 | 7.1 | 5.2 | 40 |\n| FS | 2.3 | 0.6 | 48 | 41 | 19 |\n| Dol | 2.3 | 2.3 | 2.3 | 1.0 | 2.1 |\n| Other |  | 4.1 | 1.6 | 3.1 | Lg | 3.6 |\n| Total | 8.8 | 4.7 | 10.1 | 7.1 | 7.6 |\n',
 b'/9j/4AAQSkZJRgABAQAAAQABAAD/......RXQv gZB RrYooAx/']

For our given query, the first retrieved element is clearly a text chunk, the second is a table, and the last is an image. We can also use the utility function from above to view the retrieved image.

# view retrieved image
plt_img_base64(docs[2])

OUTPUT

[Output: retrieved chart image]

We can definitely see the right context being retrieved based on the user question. Let’s try one more and validate this again.

# Check retrieval
query = "Tell me about the percentage of residences burned by wildfires in 2022"
docs = retriever_multi_vector.invoke(query, limit=5)
# We get 2 docs
docs

OUTPUT

[b'Source: National Interagency Coordination Center (NICC) Wildland Fire Summary and Statistics annual reports. Notes: FS = Forest Service; DOI = Department of the Interior. Column totals may not sum precisely due to rounding.\n\n2022\n\nYear Acres burned (millions) Number of Fires 2015 2020 2017 2006 2007\n\nSource: NICC Wildland Fire Summary and Statistics annual reports. ...... and structures (residential, commercial, and other) destroyed. For example, in 2022, over 2,700 structures were burned in wildfires; the majority of the damage occurred in California (see Table 2).',
 b'|  | 2019 | 2020 | 2021 | 2022 |\n| :--- | :--- | :--- | :--- | :--- |\n| Structures Burned | 963 | 17,904 | 5,972 | 2,717 |\n| % Residences | 46% | 54% | 60% | 46% |\n']

This definitely shows that our multi-vector retriever is working quite well and is able to retrieve multimodal contextual data based on user queries!

import re
import base64

# helps in detecting base64 encoded strings
def looks_like_base64(sb):
    """Check if the string looks like base64"""
    return re.match("^[A-Za-z0-9 /] [=]{0,2}$", sb) is not None

# helps in checking if the base64 encoded image is actually an image
def is_image_data(b64data):
    """
    Check if the base64 data is an image by looking at the start of the data
    """
    image_signatures = {
        b"\xff\xd8\xff": "jpg",
        b"\x89\x50\x4e\x47\x0d\x0a\x1a\x0a": "png",
        b"\x47\x49\x46\x38": "gif",
        b"\x52\x49\x46\x46": "webp",
    }
    try:
        header = base64.b64decode(b64data)[:8]  # Decode and get the first 8 bytes
        for sig, format in image_signatures.items():
            if header.startswith(sig):
                return True
        return False
    except Exception:
        return False

# returns a dictionary separating images and text (with table) elements
def split_image_text_types(docs):
    """
    Split base64-encoded images and texts (with tables)
    """
    b64_images = []
    texts = []
    for doc in docs:
        # Check if the document is of type Document and extract page_content if so
        if isinstance(doc, Document):
            doc = doc.page_content.decode('utf-8')
        else:
            doc = doc.decode('utf-8')
        if looks_like_base64(doc) and is_image_data(doc):
            b64_images.append(doc)
        else:
            texts.append(doc)
    return {"images": b64_images, "texts": texts}

These utility functions help us separate the text (and table) elements from the image elements in the retrieved context documents. Their functionality is explained in a bit more detail as follows:

  • looks_like_base64(sb): Uses a regular expression to check if the input string follows the typical pattern of base64 encoding. This helps identify whether a given string might be base64-encoded.
  • is_image_data(b64data): Decodes the base64 string and checks the first few bytes of the data against known image file signatures (JPEG, PNG, GIF, WebP). It returns True if the base64 string represents an image, helping verify the type of base64-encoded data.
  • split_image_text_types(docs): Processes a list of documents, differentiating between base64-encoded images and regular text (which could include tables). It checks each document using the looks_like_base64 and is_image_data functions and then splits the documents into two categories: images (base64-encoded images) and texts (non-image documents). The result is returned as a dictionary with two lists.

We can quickly test this function on any retrieval output from our multi-vector retriever as shown below with an example.

# Check retrieval
query = "Tell me detailed statistics of the top 5 years with largest wildfire acres burned"
docs = retriever_multi_vector.invoke(query, limit=5)
r = split_image_text_types(docs)
r

OUTPUT

{'images': ['/9j/4AAQSkZJRgABAQAh......30aAPda8Kn/wCPiT/eP86PPl/56v8A99GpURSgJGTQB//Z'],
 'texts': ['Figure 2. Top Five Years with Largest Wildfire Acreage Burned Since 1960\n\nTable 1. Annual Wildfires and Acres Burned',
  'Source: NICC Wildland Fire Summary and Statistics annual reports.\n\nConflagrations Of the 1.6 million wildfires that have occurred since 2000, 254 exceeded 100,000 acres burned and 16 exceeded 500,000 acres burned. A small fraction of wildfires become .......']}

Looks like our function is working perfectly and separating out the retrieved context elements as desired.

Build End-to-End Multimodal RAG Pipeline

Now let’s connect our multi-vector retriever, prompt instructions and build a multimodal RAG chain. To begin, we create a multimodal prompt function that takes context text, tables, and images to structure a proper prompt in the correct format for GPT-4o.

from operator import itemgetter
from langchain_core.runnables import RunnableLambda, RunnablePassthrough
from langchain_core.messages import HumanMessage

def multimodal_prompt_function(data_dict):
    """
    Create a multimodal prompt with both text and image context.
    This function formats the provided context from `data_dict`, which contains
    text, tables, and base64-encoded images. It joins the text (with table) portions
    and prepares the image(s) in a base64-encoded format to be included in a 
    message.
    The formatted text and images (context) along with the user question are used to
    construct a prompt for GPT-4o
    """
    formatted_texts = "\n".join(data_dict["context"]["texts"])
    messages = []
    
    # Adding image(s) to the messages if present
    if data_dict["context"]["images"]:
        for image in data_dict["context"]["images"]:
            image_message = {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{image}"},
            }
            messages.append(image_message)
    
    # Adding the text for analysis
    text_message = {
        "type": "text",
        "text": (
            f"""You are an analyst tasked with understanding detailed information 
                and trends from text documents,
                data tables, and charts and graphs in images.
                You will be given context information below which will be a mix of 
                text, tables, and images usually of charts or graphs.
                Use this information to provide answers related to the user 
                question.
                Do not make up answers, use the provided context documents below and 
                answer the question to the best of your ability.
                
                User question:
                {data_dict['question']}
                
                Context documents:
                {formatted_texts}
                
                Answer:
            """
        ),
    }
    messages.append(text_message)
    return [HumanMessage(content=messages)]

This function helps in structuring the prompt to be sent to GPT-4o as explained here:

  • multimodal_prompt_function(data_dict): creates a multimodal prompt by combining text and image data from a dictionary. The function formats text context (with tables), appends base64-encoded images (if available), and constructs a HumanMessage to send to GPT-4o for analysis along with the user question.

We now construct our multimodal RAG chain using the following code snippet.

# Create RAG chain
multimodal_rag = (
        {
            "context": itemgetter('context'),
            "question": itemgetter('input'),
        }
            |
        RunnableLambda(multimodal_prompt_function)
            |
        chatgpt
            |
        StrOutputParser()
)

# Pass input query to retriever and get context document elements
retrieve_docs = (itemgetter('input')
                    |
                retriever_multi_vector
                    |
                RunnableLambda(split_image_text_types))

# Below, we chain `.assign` calls. This takes a dict and successively
# adds keys-- "context" and "answer"-- where the value for each key
# is determined by a Runnable (function or chain executing at runtime).
# This helps in having the retrieved context along with the answer generated by GPT-4o
multimodal_rag_w_sources = (RunnablePassthrough.assign(context=retrieve_docs)
                                               .assign(answer=multimodal_rag)
)

The chains created above work as follows:

  • multimodal_rag_w_sources: This chain, chains the assignments of context and answer. It assigns the context from the documents retrieved using retrieve_docs and assigns the answer generated by the multimodal RAG chain using multimodal_rag. This setup ensures that both the retrieved context and the final answer are available and structured together as part of the output.
  • retrieve_docs: This chain retrieves the context documents related to the input query. It starts by extracting the user's input, passes the query through our multi-vector retriever to fetch relevant documents, and then calls the split_image_text_types function we defined earlier via RunnableLambda to separate the base64-encoded images from the text (and table) elements.
  • multimodal_rag: This chain is the final step which creates a RAG (Retrieval-Augmented Generation) chain, where it uses the user input and retrieved context obtained from the previous two chains, processes them using the multimodal_prompt_function we defined earlier, through a RunnableLambda, and passes the prompt to GPT-4o to generate the final response. The pipeline ensures multimodal inputs (text, tables and images) are processed by GPT-4o to give us the response.

Test the Multimodal RAG Pipeline

Everything is set up and ready to go; let’s test out our multimodal RAG pipeline!

# Run multimodal RAG chain
query = "Tell me detailed statistics of the top 5 years with largest wildfire acres burned"
response = multimodal_rag_w_sources.invoke({'input': query})
response

OUTPUT

{'input': 'Tell me detailed statistics of the top 5 years with largest wildfire acres burned',
 'context': {'images': ['/9j/4AAQSkZJRgABAa.......30aAPda8Kn/wCPiT/eP86PPl/56v8A99GpURSgJGTQB//Z'],
  'texts': ['Figure 2. Top Five Years with Largest Wildfire Acreage Burned Since 1960\n\nTable 1. Annual Wildfires and Acres Burned',
   'Source: NICC Wildland Fire Summary and Statistics annual reports.\n\nConflagrations Of the 1.6 million wildfires that have occurred since 2000, 254 exceeded 100,000 acres burned and 16 exceeded 500,000 acres burned. A small fraction of wildfires become catastrophic, and a small percentage of fires accounts for the vast majority of acres burned. For example, about 1% of wildfires become conflagrations—raging, destructive fires—but predicting which fires will “blow up” into conflagrations is challenging and depends on a multitude of factors, such as weather and geography. There have been 1,041 large or significant fires annually on average from 2018 through 2022. In 2022, 2% of wildfires were classified as large or significant (1,289); 45 exceeded 40,000 acres in size, and 17 exceeded 100,000 acres. For context, there were fewer large or significant wildfires in 2021 (943)......']},
 'answer': 'Based on the provided context and the image, here are the detailed statistics for the top 5 years with the largest wildfire acres burned:\n\n1. **2015**\n   - **Acres burned:** 10.13 million\n   - **Number of fires:** 68.2 thousand\n\n2. **2020**\n   - **Acres burned:** 10.12 million\n   - **Number of fires:** 59.0 thousand\n\n3. **2017**\n   - **Acres burned:** 10.03 million\n   - **Number of fires:** 71.5 thousand\n\n4. **2006**\n   - **Acres burned:** 9.87 million\n   - **Number of fires:** 96.4 thousand\n\n5. **2007**\n   - **Acres burned:** 9.33 million\n   - **Number of fires:** 67.8 thousand\n\nThese statistics highlight the years with the most significant wildfire activity in terms of acreage burned, showing a trend of large-scale wildfires over the past few decades.'}

Looks like we are able to get the answer as well as the source context documents used to answer the question! Let's now create a function to format these results and display them in a nicer way.

from IPython.display import display, Markdown

def multimodal_rag_qa(query):
    # Run the full multimodal RAG pipeline for the given query
    response = multimodal_rag_w_sources.invoke({'input': query})
    print('=='*50)
    print('Answer:')
    # Render the LLM's markdown-formatted answer
    display(Markdown(response['answer']))
    print('--'*50)
    print('Sources:')
    # Retrieved text/table chunks and base64-encoded images used as context
    text_sources = response['context']['texts']
    img_sources = response['context']['images']
    for text in text_sources:
        display(Markdown(text))
        print()
    for img in img_sources:
        # plt_img_base64 (defined earlier) renders a base64 image string
        plt_img_base64(img)
        print()
    print('=='*50)

This simple function takes the dictionary output from our multimodal RAG pipeline and displays the results in a nicely readable format. Time to put this to the test!

query = "Tell me detailed statistics of the top 5 years with largest wildfire acres 
         burned"
multimodal_rag_qa(query)

OUTPUT

[Output: the formatted answer listing the top five years by wildfire acreage burned, followed by the retrieved text and image sources]

It does a pretty good job leveraging text and image context documents to answer the question correctly. Let’s try another one.

# Run RAG chain
query = "Tell me about the annual wildfires trend with acres burned"
multimodal_rag_qa(query)

OUTPUT

[Output: the formatted answer describing the annual wildfire trend and acres burned, followed by the retrieved table, text, and image sources]

It does a pretty good job here analyzing tables, images and text context documents to answer the user question with a detailed report. Let’s look at one more example of a very specific query.

# Run RAG chain
query = "Tell me about the number of acres burned by wildfires for the forest service in 2021"
multimodal_rag_qa(query)

OUTPUT

[Output: the formatted answer reporting about 4.1 million acres burned on Forest Service land in 2021, followed by the retrieved sources]

Here you can clearly see that even though the table elements were extracted incorrectly for some of the rows, especially the one needed to answer this question, GPT-4o is intelligent enough to look at the surrounding table elements and the retrieved text chunks and give the right answer of 4.1 million instead of 41 million. Of course, this will not always work, and that is where you may need to focus on improving your extraction pipeline; one possible starting point is sketched below.
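
As an illustration only (an assumption on our part, not code from the original notebook), one way to tighten table extraction is to re-run Unstructured’s partition_pdf with its layout-model-based "hi_res" strategy and table-structure inference, and then index the HTML rendering of each table instead of its flattened text; the file path below is hypothetical:

# Hedged sketch: re-extract tables with structure preserved (hypothetical file path)
from unstructured.partition.pdf import partition_pdf

elements = partition_pdf(
    filename='wildfire_stats_report.pdf',   # hypothetical path to the report PDF
    strategy='hi_res',                      # layout-model-based parsing, better for tables
    infer_table_structure=True,             # preserve row/column structure
)

# Collect the HTML rendering of each table (keeps cell boundaries intact),
# which can then be summarized and indexed in place of the flattened text
table_html = [el.metadata.text_as_html
              for el in elements
              if el.category == 'Table' and el.metadata.text_as_html]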

Conclusion

If you are reading this, I commend your effort in staying with us right to the end of this massive guide! Here, we went through an in-depth look at the current challenges of traditional RAG systems, especially in handling multimodal data. We then covered what multimodal data and multimodal large language models (LLMs) are, and discussed at length a detailed system architecture and workflow for a Multimodal RAG system with GPT-4o. Last but not least, we implemented this Multimodal RAG system with LangChain and tested it on various scenarios. Do check out this Colab notebook for easy access to the code, and try improving the system by adding more capabilities, such as support for audio, video, and more!

If you want to become a Generative AI expert, then explore the GenAI Pinnacle Program.

Frequently Asked Questions

Q1. What is a RAG system?

Ans. A Retrieval Augmented Generation (RAG) system is an AI framework that combines data retrieval with language generation, enabling more contextual and accurate responses without the need for fine-tuning large language models (LLMs).

Q2. What are the limitations of traditional RAG systems?

Ans. Traditional RAG systems primarily handle text data, cannot process multimodal data (like images or tables), and are limited by the quality of the stored data in the vector database.

Q3. What is multimodal data?

Ans. Multimodal data consists of multiple types of data formats such as text, images, tables, audio, video, and more, allowing AI systems to process a combination of these modalities.

Q4. What is a multimodal LLM?

Ans. A multimodal Large Language Model (LLM) is an AI model capable of processing and understanding various data types (text, images, tables) to generate relevant responses or summaries.

Q5. What are some popular multimodal LLMs?

Ans. Some popular multimodal LLMs include GPT-4o (OpenAI), Gemini (Google), Claude (Anthropic), and open-source models like LLaVA-NeXT and Pixtral 12B.

