Using GPT Vision in RAG

This post is created off of a Jupyter notebook in Github, which you can access here. I copied all the content for this blog post. I’ve adapted this a little to read easier and include no code, just the good tid bits.

We’re going to explore using OpenAI’s GPT Vision with 4o model to gather more insight from complex slides, charts, and images, to improve our chat experiences in our RAG solutions.

The PowerPoint PPTX file

I’ll be using a PowerPoint file I made of my comic book collection. I didn’t want to use real content from work, and I had a little fun building this ;). The value of adding vision to your RAG pipeline will still be realized using this data, I promise.

You can review the PowerPoint file, in all its glory, here.

Getting text from the PPTX

I leveraged a light library to page through my slides and pull all of the text off the slides. We’ll explore that data below.

I manually exported the deck to JPEG files. Finding a library to do that programmatically was getting annoying. In production, we use ASPose, on Java. There are some Python libraries available, but decided not to implement as that’s outside of the goal for this exercise. So for today, I’ve saved my deck as .jpeg sized 1024×578.

Let’s quickly review the deck and confirm the text I pulled off it. And note how “useless” (from a plain text point of view) some of the content is!

 Slide 1

My Comic Book Collection As of Dec 14, 2024

 Slide 2

Comic Books Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.

 Slide 3

My Comic Books I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.

 Slide 4

My Comic Books As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.

 Slide 5

My Collection The following are some stats from my collection

 Slide 6

Chart Title: Series Name: Total Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] by publisher Contains more original stories (better in my opinion) This is so high due to Star Wars

 Slide 7

Series Name: Count of Series Name Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0] release years Started Collecting That was an expensive year

 Slide 8

Chart Title: Top Publishers Series Name: Read Values: [550.0, 100.0, 57.0, 43.0, 41.0] Series Name: Unread Values: [247.0, 87.0, 51.0, 64.0, 34.0] reading status Chart Title: All Comic Books Series Name: Total Values: [964.0, 816.0]

 Slide 9

Series Name: Star Wars Books Values: [554.0, 34.0, 67.0, 2.0] Series Name: Other Books Values: [243.0, 41.0, 40.0, 0.0] Series Name: Values: [657.0, 324.0] star wars comics Star Wars 67% All other books 33% 2

Thoughts? Reading the text on some has little value to what the slide actually says, right? Some slides it’s mostly numbers! Useless. Enter GPT 4o Vision!! Let’s see what we can get GPT to help us out with.

Getting value from GPT 4o Vision

Leveraging OpenAI’s documentation on using vision, I ran all of my slides through, asking for a description. (See the notebook for the actual prompt).

Drum roll please, let’s see what the plain text extraction looks like against the vision description!

Comic panel: You need to see this

Slide 1
Plain Text

My Comic Book Collection As of Dec 14, 2024
Vision Description:

The image is a collage of comic book covers with a central text box overlay. Each comic book cover is distinct, displaying various characters and art styles. Here’s a detailed description:Central Text Box: A large, dark gray rectangle is centered in the image, with the text “My Comic Book Collection” in bold white lettering. Below this, it says “As of Dec 14, 2024” in smaller white text.Comic Covers:Top Row:Left: A cover of “The Amazing Spider-Man” featuring Spider-Man in his iconic red and blue suit, swinging on a web.Middle Left: “Batman 89” shows Batman standing in front of a dark, urban backdrop with a full moon.Middle Right: “Nebula” depicts a futuristic character with mechanical elements and vibrant colors.Right: “Eve” shows a girl with eyes closed, inside a circular window, hinting at a science fiction theme.Bottom Row:Left: “Scarlet Witch” has a striking image of a woman in red, with intense green eyes.Middle Left: A comic with a woman in a pirate-like outfit, holding a sword.Middle Right: A vibrant and colorful comic cover with a group of characters in action poses.Right: A retro-style comic featuring “The Hulk” with a dynamic and bold illustration.Arrangement: The comic covers are arranged in two rows of four, framing the central text box. Each cover has vivid, engaging artwork, highlighting the variety and diversity in the comic collection.Overall, the image is a visually engaging showcase of the individual’s diverse comic book collection as of December 14, 2024.

Slide 2
Plain Text

Comic Books Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.
Vision Description:

The image is a slide titled “Comic Books.” The title appears at the top in large, bold black text. Below the title is a paragraph in smaller black text. It reads: “Comic books started in 1938 with the introduction of Action Comics #1 marking the debut of Superman. Over the years, comic books have sold for over $3.5M, making comic books investment-grade collectibles.”To the right of the text is an image of the cover of “Action Comics #1” from June 1938. It shows a colorful illustration of Superman lifting a green car. There are people in the foreground reacting with surprise and fear. The cover has a bold red and yellow background.Below this, there are additional single comic book covers displayed in a row from left to right:The Warlord: The cover features bold, vivid colors with an image of a muscular warrior wielding a sword against a dinosaur. The title is set against a dark background in an elaborate stylized font.The Thing: This cover features a group of heroes against a space-themed backdrop, with “THE THING” in a large red font at the top. The characters are dressed in futuristic outfits.Star Wars: The cover prominently displays the characters from the Star Wars universe, including Darth Vader, in front of starry space. The title “STAR WARS” is large, yellow, and bold.The overall layout presents a historical context alongside visual examples of comic book covers, highlighting their cultural and collectible importance.

Slide 3
Plain Text

My Comic Books I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.
Vision Description:

The image is a collage with a text portion on the left side and four comic book covers. Here’s a detailed description:Text:At the top left, there’s a title that reads “My Comic Books.”Below the title, a paragraph states: “I started collecting July 4th, 2021, and fell in love with the stories and artwork. I quickly learned certain artists and rarity of the covers could increase a book’s value ten-fold in the first day.”Comic Book Covers:Top Right Cover:Titled “We Don’t Kill Spiders.”Features an illustrated person holding a glowing red object with several dark shadows around. The background is blue with bright pink text.Bottom Row – Left Cover:Titled “BRZRKR.”Shows a monochrome figure in dark armor, holding a weapon, standing against a backdrop of large block letters in blue.Bottom Row – Middle Cover:Titled “Stray Dogs: Dog Days.”Has an image of a bloody envelope and a paw print, implying a mysterious or suspenseful theme, with mostly dark red and brown tones.Bottom Row – Right Cover:Titled “Darth Vader.”Depicts the iconic Star Wars character, Darth Vader, in full armor holding a lightsaber against a space-themed background with warm brown and gold colors.The comic book covers are arranged in two rows: one cover on the top right by itself, and three covers along the bottom. The text is adjacent to the bottom row. Each cover has its unique style and theme.

Slide 4
Plain Text

My Comic Books As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.
Vision Description:

The image is a slide titled “My Comic Books,” featuring a collection of Star Wars-themed comic book covers.At the top, there’s a heading in bold: “My Comic Books.” Below this, there’s a paragraph of text: “As a Star Wars fan, the stories from the comics fill much of the gaps between the movies and TV shows. We get to see our beloved heroes, and villains, in their natural elements like we never see on the screen.”Beneath the text, four comic book covers are displayed. From left to right:First Cover (Leftmost): Features a character in a white outfit holding a red lightsaber. The background is dark, with the words “Darth Vader” prominently displayed at the bottom.Second Cover: Shows Darth Vader holding a red lightsaber, standing against a smoky background. “Star Wars Darth Vader” is written at the top in bold.Third Cover: This is a classic-style comic cover with multiple characters including Darth Vader in the center. The title “Star Wars” is written in bright yellow at the top.Fourth Cover (Rightmost): Features the character Chewbacca, a large, furry being with a serious expression. It is titled “Han Solo & Chewbacca.” This cover is positioned separately to the right of the other three covers.The covers are aligned neatly, with the first three creating a row, and the fourth slightly offset to the right.

Slide 5
Plain Text

My Collection The following are some stats from my collection
Vision Description:

The image has a simple, minimalistic design. It consists of a plain white background with text aligned to the left side. The main heading, “My Collection,” is placed prominently in bold, black font. Below this heading, there is a subheading in smaller, gray font that reads, “The following are some stats from my collection.” There are no charts, graphs, or additional objects present in the image—just these two lines of text centered towards the top left portion.

Slide 6
Plain Text

Chart Title: Series Name: Total Values: [835.0, 203.0, 116.0, 113.0, 80.0, 50.0, 21.0, 18.0, 17.0, 17.0, 13.0, 10.0, 8.0, 7.0, 7.0, 6.0, 5.0, 4.0, 4.0, 2.0, 2.0, 2.0, 2.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0] by publisher Contains more original stories (better in my opinion) This is so high due to Star Wars
Vision Description:

The image is a bar chart with the title “by publisher” at the top left. The chart represents the total number of items attributed to various publishers along the x-axis, while the y-axis shows a numerical scale from 0 to 900.X-Axis: Represents different publishers. The list includes Marvel Comics at the far left, followed by DC Comics, Dark Horse Comics, and several others. Farther right, smaller publishers are also listed.Y-Axis: Ranges from 0 to 900, indicating the total count of items published.Bars: Each publisher has a vertical bar corresponding to the number of items. Notably:Marvel Comics: Has the tallest bar, reaching above 800, with a note stating, “This is so high due to Star Wars.”DC Comics: The second-highest bar, slightly above 200.Dark Horse Comics: Slightly above 100.The other publishers have significantly shorter bars, many below 100.Annotations:Next to the Marvel Comics bar, there is a text pointing at the height of the bar, explaining its height due to Star Wars.Above the middle publishers’ shorter bars, there is a note stating, “Contains more original stories (better in my opinion).”Overall Trend: There is a steep drop-off in height after Marvel Comics and DC Comics, with minor fluctuations among the remaining publishers.The visual presentation emphasizes the dominance of Marvel Comics in terms of volume, with DC Comics being a distant second, highlighting the contribution of Star Wars to Marvel’s count.

Slide 7
Plain Text

Series Name: Count of Series Name Values: [1.0, 10.0, 11.0, 9.0, 25.0, 15.0, 8.0, 2.0, 3.0, 2.0, 1.0, 6.0, 13.0, 14.0, 8.0, 14.0, 8.0, 2.0, 7.0, 6.0, 4.0, 3.0, 5.0, 1.0, 6.0, 2.0, 2.0, 2.0, 5.0, 12.0, 1.0, 10.0, 8.0, 27.0, 27.0, 11.0, 35.0, 21.0, 71.0, 369.0, 497.0, 298.0, 198.0] release years Started Collecting That was an expensive year
Vision Description:

The image is a line graph titled “release years.” The x-axis represents years ranging from 1969 to 2024, marked in increments of a few years, while the y-axis represents an unspecified numerical value ranging from 0 to 600.The data line starts low in 1969, with values remaining under 100 through to about 2014. There are minor fluctuations throughout these years. From around 2014, the line shows a steep upward trend, peaking sharply at around the year 2020, where it hits just below 500. After this peak, the line quickly descends, showing a sharp decline.Two labels are present on the graph. One is at the peak, indicating “That was an expensive year,” suggesting that 2020 was notable. Another is near the start of the upward trend, marked “Started Collecting,” which points around the year 2014.These labels provide context for the significant increase in numbers during this period, indicating critical points of interest in the timeline of data collection or events.

Slide 8
Plain Text

Chart Title: Top Publishers Series Name: Read Values: [550.0, 100.0, 57.0, 43.0, 41.0] Series Name: Unread Values: [247.0, 87.0, 51.0, 64.0, 34.0] reading status Chart Title: All Comic Books Series Name: Total Values: [964.0, 816.0]
Vision Description:

The image consists of two main parts: a pie chart on the left and a bar chart on the right. The overall topic is “reading status” of comic books.Pie Chart (Left Side):Title: “All Comic Books.”The pie chart is divided into two sections.The larger section is labeled “Read” and is represented in dark blue, making up 54% of the chart.The smaller section, labeled “Unread,” is in orange and accounts for 46% of the chart.The chart visually shows that more comic books have been read than are unread.Bar Chart (Right Side):Title: “Top Publishers.”There are five publishers along the horizontal axis, each with its own bar rising upward:Marvel ComicsDC ComicsBOOM! StudiosDark Horse ComicsIDW PublishingThe height of the bars corresponds to the number of comic books, with specific amounts divided into read and unread.The bars have two sections, similar in color to the pie chart: dark blue for read and orange for unread.Marvel Comics’ bar is the tallest, showing the highest number of comics, with a significant portion unread.The other publishers have smaller bars, with varying ratios of read to unread comics.This combination of charts provides an overview of comic book reading status, both generally and by specific publishers.

Slide 9
Plain Text

Series Name: Star Wars Books Values: [554.0, 34.0, 67.0, 2.0] Series Name: Other Books Values: [243.0, 41.0, 40.0, 0.0] Series Name: Values: [657.0, 324.0] star wars comics Star Wars 67% All other books 33% 2
Vision Description:

The image contains two main elements: a pie chart on the left and a bar graph on the right, with logos of comic publishers below the bar graph.Pie Chart (Left Side)Title: The chart is titled “Star Wars Comics.”Composition: The pie chart is divided into two segments.Larger Segment: This takes up 67% of the circle, colored in blue, and labeled as “Star Wars 67%.”Smaller Segment: This occupies 33% of the circle, colored in orange, and labeled as “All other books 33%.”Bar Graph (Right Side)Axes: The vertical axis represents quantity, with numerical labels from 0 to 900 at intervals of 100. The horizontal axis lacks numerical labels but is associated with publisher logos.Bars:The largest bar corresponds to the “Marvel Comics” logo. It is composed of a large blue lower section and a smaller orange upper section.The next two bars are significantly smaller and correspond to “IDW Publishing” and “Dark Horse Comics” logos. Each has a small orange segment on top of a blue base.There is also the “Viz” logo on the far right with a barely noticeable or absent bar.Publisher Logos (Below the Graph)The logos of four different comic publishers are displayed underneath the bar graph, aligned with the bars they represent:Marvel Comics: Identified with the largest bar.IDW Publishing: With a smaller bar.Dark Horse Comics: Also with a smaller bar.Viz: Associated with a negligible bar presence.The image presents a visual comparison and distribution of “Star Wars” comics versus other comic books, showcasing dominance by Star Wars comics in both charts and identifying the publishers involved in this context.
Comic panel of Han Solo yelling YEEE-HOOO!

Those are some really helpful descriptions! With these rich descriptions, we can provide our LLMs with greater content to help answer your users’ questions!

Does vision actually help?

Obviously, the above will work better, but since I have some time on my hands, let’s quickly find out 😉.

First, let’s identify some questions to ask about this content:

[
{
“question”: “How many Marvel books are there?”,
“keywords”: [“Marvel”]
},
{
“question”: “How many Marvel books are Star Wars related?”,
“keywords”: [“Marvel”, “Star Wars”]
},
{
“question”: “What year did the collection start?”,
“keywords”: [“start”, “collect”, “journey”]
},
{
“question”: “What are the top brands?”,
“keywords”: [“brands”, “publishers”]
}
]

I added search keywords to find the content. Any decent RAG system should transform user questions into semantic phrases, and possibly vectors, to return some really robust results. I’m not creating a search tool for this exercise, so I’m cheating a little bit here to find related content.

I loop through these questions, and call ChatGPT twice. Once with the plain text extraction and one with the vision descriptions.

How many Marvel books are there? 

Plain Answer

  • The context provided does not contain specific information about the total number of Marvel books. The data mentions Star Wars books and other books, but does not explicitly reference Marvel books or provide their count. Therefore, based on the data available, it is not possible to determine the number of Marvel books.

Vision Answer

  • In the context provided, the number of items attributed to Marvel Comics is represented in the bar chart titled “by publisher,” where the height of Marvel Comics’ bar reaches above 800. Therefore, there are more than 800 Marvel books.

How many Marvel books are Star Wars related? 

Plain Answer

  • Based on the context provided, there isn’t explicit information about how many Marvel books are specifically Star Wars related. The values given in the “Star Wars Books” category are [554.0, 34.0, 67.0, 2.0], but there is no direct reference associating these numbers with Marvel. To determine how many Marvel books are Star Wars related, you would need additional information linking the publishers to these values.

Vision Answer

  • The pie chart indicates that 67% of the total “Star Wars Comics” belong to Marvel Comics. Depending on the precise context and total volume of material available, this percentage suggests a substantial portion of Marvel’s catalog is Star Wars related, especially considering Marvel’s large bar in the related bar graph. However, specific numeric values are not provided within the context, so we cannot determine the exact number without additional data.

What year did the collection start? 

Plain Answer

  • The collection started on July 4th, 2021.

Vision Answer

  • The collection started around the year 2014.

What are the top brands? 

Plain Answer

  • Based on the context provided, it seems like the top brands or series are likely related to the highest recorded values. “Star Wars” is mentioned several times and associated with high values in the data, suggesting it might be one of the top brands. However, specific “top brands” are not directly listed in the charts, so it’s inferred from the context that “Star Wars” is a prominent brand. If this does not fully answer your question, please provide more context or clarify.

Vision Answer

  • Based on the bar charts provided in the contexts, the top brands in terms of comic book volume are:
  1. Marvel Comics: Consistently has the tallest bar, indicating it is the leading publisher across different charts.
  2. DC Comics: Typically appears as the second highest after Marvel Comics.
  3. Dark Horse Comics: Also features prominently, though with smaller volumes compared to Marvel and DC.
  4. IDW Publishing: Present with a notable quantity of comics, though less than Marvel and DC.
  5. BOOM! Studios: Listed as a top publisher in one of the charts.

These publishers are repeatedly highlighted as the major contributors in the comic book industry within the given contexts.

Comic panel of Chewie yelling

You can’t beat that, can you? Using vision with these slides significantly improved my answers (except for that one wrong answer, did you see it? More below). Now apply this to your content, how many slides and pages have charts, images, and other non-text context that is lost by simply scraping text?

Comiv panel of Mand saying you tell me

Check out the answers to “How many Marvel books are Star Wars related?” Not great, right? The challenge here is that the original vision description didn’t have this detail in it. So this approach is not perfect, but it certainly averages better than plain text! To really improve things we should explore multimodal RAG (next blog post).

Just use GPT 4o Vision!

It seems quite obvious to just say “Hey, let’s use vision on our content!”, and if you had a blank check, I’d say do it today!

In reality, most of us live within budgets, have end-user performance expectations, speed to market requirements, blah blah blah. It’ll come down to the business stakeholders to see the value in using vision, compared to the current value you’re delivering, and decide the cost is worth it.

How much did this cost me?

For my limited example above, vision cost me 8127 input tokens and 2692 output tokens. This cost me a total of $0.05.

Cheap enough for a dozen slides, but when you’re in the hundreds of thousands to millions of slides, this can add up quickly! There are a few things to check out to help curb the cost:

  • Only send slides that need vision. Check for specific element types on the slides, and only send if there are images and charts. An all text slide doesn’t need vision.
  • Explore setting the detail parameter to low, which should incur lower costs, but might impact quality of the output. Learn more on OpenAI’s site about Low or High fidelity image understanding. I’d guess most powerpoints can get away with low.
  • The size of the slide also impacts the cost. All of my examples here are 1024×578, which cost 0.001913perhigh−resslide,0.000213 per low-res (from OpenAI’s pricing page). A larger image, like 2048 wide, will cost twice as much. OpenAI does some resizing and calculations to determine the total tokens for an image. See their Vision pricing calculate for details.

Given the results we see above, and the vastly improved answers, adding vision to your RAG should be seriously considered! Figure out your scale of cost, perform some tests, and showcase the value to your team!

In my next post, we’ll explore multi-modal RAG, sending these images to GPT at the time of chat. Stay tuned!

See this entire post, with code, on github.


Subscribe to my blog and get posts like this in your inbox. Share your email below, or follow me on Threads, LinkedIn, or BlueSky.


One thought on “Using GPT Vision in RAG

Add yours

Leave a Reply

Blog at WordPress.com.

Up ↑