In my other post/notebook, Using GPT Vision in RAG, we use vision to help enrich our content during ingestion, prior to chat. By using vision, we are able to create robust descriptions of our complex slides that contain charts and graphs, and bring some amazing value to our end users.
In this post, we’re going to explore using 4o vision in our chat itself, that is, send images in with the users’ questions, instead of relying on the vision descsription of that slide.
As with my last post, this post can be found in its full notebook form, code and all, on GitHub.
The R in multi-modal RAG
Retrieving the images is the most important piece: send GPT the right image for the question. This gets tricky when we’re searching for images. If we were to embed the images, we could retrieve images using vectors. My experience, vector searching doesn’t suffice for text search, a hybrid approach improves search relevancy. If I were to apply the same assumptions to embedding images, I think we’ll still need some level of textual description to help find the most relevant images. Maybe I’m wrong. I won’t be exploring multi-modal embeddings and retrieval here. I will in a later post.
For Retrieval, I’m using the output from my last post, all of the vision descriptions, along with the same questions, and performing basic text search, not using a full search engine.
The G in multi-modal RAG
Let’s generate some answers using images, not text! The following questions and generated answers are using only the images. No additional text or descriptions are provided to GPT.
For each question, you’ll see the images sent, and the answers below.
How many Marvel books are there?



The total number of Marvel books is approximately 750, as indicated by the bar under Marvel Comics in the top publishers chart on the second slide.
How many Marvel books are Star Wars related?






The pie chart titled “star wars comics” shows that 67% of the comics are Star Wars related. Assuming the total number under Marvel Comics in the graph is approximately 800 (as indicated in the “by publisher” slide), this means there are around 536 Star Wars related Marvel books.
What year did the collection start?






The collection started on July 4th, 2021.
What are the top brands?



The top brands, according to the charts, are:
- Marvel Comics
- DC Comics
- BOOM! Studios
- Dark Horse Comics
- IDW Publishing
These publishers are highlighted as top performers in all presented graphs. Marvel Comics has a notable lead due to the impact of Star Wars comics.
Not bad!
However, look at how it handled How many Marvel books are Star Wars related?:
The pie chart titled “star wars comics” shows that 67% of the comics are Star Wars related. Assuming the total number under Marvel Comics in the graph is approximately 800 (as indicated in the “by publisher” slide), this means there are around 536 Star Wars related Marvel books.
Awesome.
A picture is worth a thousand words
When we use multi-modal RAG (sending images and asking questions) the machine can better identify the key portions of those images and provide valuable answers. If we convert to text first, all the machine has to work off of is the text. That text, though generated by GPT, may not include all of the context that pertains to the user’s query. Let’s see if I can prove this.
Let’s create some new questions which may not be as easy from a textual summary.
[
{
“question”: “What are the top brands that have better stories?”,
“keywords”: [“publisher”, “original”]
},
{
“question”: “How many books are from 1975?”,
“keywords”: [“1975”, “release”]
},
{
“question”: “How many Marvel books are read?”,
“keywords”: [“Marvel”, “reading”]
}
]
I looped through these questions and performed two GPT calls: one using multi-modal, and one just using the vision description.
What are the top brands that have better stories?



Multi-modal Answer
The top brands that contain more original stories, according to the first slide, are Image Comics, with IDW Publishing and Dark Horse Comics contributing significantly due to their Star Wars content. Image Comics is highlighted as containing more original stories “in my opinion.”
Vision Description Answer
According to the information provided, the top brands with better stories are associated with the middle publishers in the bar chart under the section labeled “by publisher.” A note mentions that these publishers, which include those with bars slightly above and below 100, contain more original stories and are considered better in someone’s opinion. Specifically, Dark Horse Comics and other smaller publishers with bars below 100 are highlighted as having better, more original stories.
How many books are from 1975?

Multi-modal Answer
There are approximately 50 books from 1975.
Vision Description Answer
The graph does not specify exact values for individual years, including 1975. The numerical values on the y-axis are unspecified, and the graph only provides a general trend rather than detailed data points for each year. To determine the exact number of books from 1975, specific data points or additional details beyond the graph would be needed.
How many Marvel books are read?



Multi-modal Answer
To determine how many Marvel books are read, we can analyze the data from the slides:
- Total Marvel Books: The bar chart from the first slide shows that Marvel has the highest number of books among the publishers, close to 800.
- Reading Status for Marvel: On the second slide, the top publishers’ bar chart shows the read (blue) and unread (orange) portions for Marvel. It appears that the read portion is slightly more than half.
- Calculation: If Marvel has about 800 books in total and more than half are read, then approximately 400 to 450 Marvel books are read.
Hence, around 400-450 Marvel books are read.
Vision Description Answer
The context provided does not specify the exact number of Marvel books that are read. However, in the bar chart titled “Top Publishers,” the Marvel Comics bar is the tallest and divided into two sections: dark blue for read and orange for unread. To find the number of read Marvel books, one would need to determine the proportion of the dark blue section in the Marvel Comics bar relative to the total height of that bar. Unfortunately, without specific numerical values or additional data from the chart, the precise number cannot be calculated from the given information.
Multi-modal RAG is better
Let’s check out the questions and who gave a better answer.
What are the top brands that have better stories? Multi-modal is better, although the vision description answer feels more thoughtful…
How many books are from 1975? Multi-modal killed it. Vision description has no idea.
How many Marvel books are read? Multi-modal has some good reasoning, but it was the wrong answer. Vision description has no idea.
Is multi-modal better?

Multi-modal is clearly a better experience, although not perfect (nothing is perfect).
Implementing multi-modal RAG
As we walked through here, it’s not simply giving the model images, but the first important piece is finding the right images to give the LLM. You may still want to run vision on ingestion, as we discussed in the last post, so you have a better search experience, and then send those slides into the LLM.
Keep in mind, my example here is one PowerPoint file. If this was against a full corpus of knowledge, that could be millions of slides. I suggest including more context for each slide, so the machine can understand what the slide is above, in the context of the deck it’s in.
And yes, there’s a cost with all of this. As I shared in the last notebook, I’ll say it again here:
In reality, most of us live within budgets, have end-user performance expectations, speed to market requirements, blah blah blah. It’ll come down to the business stakeholders to see the value in using vision, compared to the current value you’re delivering, and decide the cost is worth it.
Subscribe to my blog and get posts like this in your inbox. Share your email below, or follow me on Threads, LinkedIn, or BlueSky.
