OpenAI recently released GPT-4o and, on its heels, GPT-4o mini. Of course, we all jumped on it! “Upgrading” to 4o through the ChatGPT website is easy, but what about upgrading your products and solutions? 4o is fast, it's cheap (4o mini is incredibly cheap!), and it's smart; it's certainly worth upgrading to.
There are some considerations to be aware of before we blindly change model names in our code. What exactly? Well, of course, “it depends”. Some POCs and solutions are hard-coded to specific LLMs and models. Others are dynamic and extensible, allowing for easy plug-and-play with other LLMs and models. Even if you can easily swap models, I think these tips still apply.
Switching LLM models should be treated as a code change
Before we dive into upgrading to 4o, let's quickly discuss what goes into a model change. Changing your model seems simple enough: just swap the model name. But that swap has impacts beyond a single string. Any time we change a model, we should treat it like a code change. (It literally is a code change, by the way.)
Understand what’s new in the model
New models introduce new features, capabilities, and limitations. Understanding these changes and assessing whether your solution is ready to support them is important. Let's take a small example: a newer model's context window is larger than the previous model's. Does your solution currently trim tokens to fit the older, smaller limit? If so, you'll want to update that code to take advantage of the larger window. Make sure to consider all of your services: user-facing chats, agentic workflows, ingestion enrichment, etc.
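If you have that kind of trimming code, it's worth hunting down the hard-coded limit. Here's a minimal sketch of what that logic often looks like, assuming tiktoken and a hand-tuned MAX_CONTEXT_TOKENS constant (both illustrative, not from any particular SDK):

```python
import tiktoken

# A limit hand-tuned for the older model; this is the kind of constant that
# quietly caps your prompt and needs revisiting when the context window grows.
MAX_CONTEXT_TOKENS = 8_000

def trim_to_context(text: str, model: str = "gpt-4") -> str:
    """Trim text so it fits within MAX_CONTEXT_TOKENS for the given model."""
    encoding = tiktoken.encoding_for_model(model)
    tokens = encoding.encode(text)
    if len(tokens) <= MAX_CONTEXT_TOKENS:
        return text
    return encoding.decode(tokens[:MAX_CONTEXT_TOKENS])
```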
Review your prompt.
When we move between models, it’s a good idea to review your prompt and ensure it’s optimal for that model. This is a new model, built on its predecessors, but there are changes and your prompt may need to change to accommodate it. I’ve seen model changes impact the deterministic side of the output, like formatting, citations, code, etc.
Test the change.
As we briefly reviewed above, changing the model is not a simple change; it can require code changes and produce different outputs from the LLM. Treat changing your model like code and validate the outputs from the model. Test it. Run users through it: are they satisfied with the machine's responses? Is it performant? Does the new model respond at the same speed or faster than before? This is where a robust evaluation framework can save you time and headaches.
Upgrade to GPT-4o
With the above considerations, let’s discuss moving into GPT 4o specifically.
Understand what’s new.
GPT 4o token limits
GPT-4o doesn't just bring a roomy 128k-token context window (up from 8k with GPT-4); it also adds a response limit. GPT-4o has a max response limit of 4k tokens. (I can't find documentation about this, but it's true. We ran up against it and discovered it through trial and error.) In some solutions, we've had basic budget calculations: take the system prompt tokens, the user query, and the RAG content, and hand whatever tokens remain to GPT to use for its response. This worked great with GPT-4, but 4o (and Turbo) have this 4k limit. This requires a code change.
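A rough sketch of that budgeting change (the constants and helper name are my assumptions, not values from the API docs):

```python
# Illustrative numbers -- verify against current model documentation.
CONTEXT_WINDOW = 128_000  # total tokens GPT-4o can see in a request
MAX_RESPONSE = 4_000      # the response cap we ran into with 4o

def response_budget(system_tokens: int, query_tokens: int, rag_tokens: int) -> int:
    """Tokens we can safely request via max_tokens, given the prompt pieces."""
    remaining = CONTEXT_WINDOW - (system_tokens + query_tokens + rag_tokens)
    # With GPT-4 we simply handed over whatever was left; with 4o we also have
    # to clamp to the response cap, or the request fails or truncates.
    return max(0, min(remaining, MAX_RESPONSE))
```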
GPT 4o tokenization
New with GPT-4o, they're using a different tokenizer! This looks promising, as it uses fewer tokens, which should result in lower costs. As a result, if your code calculates tokens using a library like tiktoken, you might be using the cl100k_base encoding. With 4o, you need to use o200k_base to get accurate token counts. (Shout out to Brandon for this find!)
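A small sketch of what that looks like with tiktoken (the helper is mine; encoding_for_model maps gpt-4o to o200k_base on a recent tiktoken version):

```python
import tiktoken

def count_tokens(text: str, model: str) -> int:
    """Count tokens using the encoding that matches the model."""
    try:
        encoding = tiktoken.encoding_for_model(model)  # gpt-4o -> o200k_base
    except KeyError:
        # Older tiktoken releases don't know the newest models; pick manually.
        name = "o200k_base" if "4o" in model else "cl100k_base"
        encoding = tiktoken.get_encoding(name)
    return len(encoding.encode(text))

print(count_tokens("Hello, GPT!", "gpt-4"))   # counted with cl100k_base
print(count_tokens("Hello, GPT!", "gpt-4o"))  # counted with o200k_base
```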
Keep in mind your usage logging as well, if you have it. Make sure token counts and associated costs for the models are accurate for the business.
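If that logging is just a lookup table of per-token rates, it needs a new row (and a re-check of the old ones). Something like this sketch, with placeholder prices you should replace with the current published rates:

```python
# Placeholder rates per 1M tokens -- replace with the current published pricing.
PRICING = {
    "gpt-4o":      {"input": 0.0, "output": 0.0},
    "gpt-4o-mini": {"input": 0.0, "output": 0.0},
}

def log_usage(model: str, prompt_tokens: int, completion_tokens: int) -> float:
    """Record token counts and estimated cost for a single request."""
    rates = PRICING[model]
    cost = (prompt_tokens * rates["input"] + completion_tokens * rates["output"]) / 1_000_000
    print(f"{model}: {prompt_tokens} in, {completion_tokens} out, ~${cost:.6f}")
    return cost
```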
Review your prompt.
Prompt review with 4o
There haven't been significant changes needed to my prompts moving to 4o. However, the team came across one little oddity that now makes me say: “review the prompt with a new model”. Revisiting your prompt may have been obvious when moving from 3.5 to 4, but even within the GPT-4 family, we should review the prompt and ensure it's optimal.
I'm seeing differences in how 4 and 4o handle specific formatting asks. Whether it's formatting text as HTML, laying out specific RAG content in a defined manner, producing code samples, or even plain JSON output, I'm finding I have to be more specific about what I want from GPT-4o.
For example, asking GPT to give me back some JSON. GPT-4 consistently output something like:
```
{
"name": "value,
"other": "other
}
```
And now with 4o I consistently see:
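```json
{
  "name": "value",
  "other": "other"
}
```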
See the difference? Proper markdown includes the language of the code, so it's now adding json to the first line of the fence. It's a small change, but depending on how my code parses the output, it can break things.
I prefer the latter as it's proper markdown anyway ;). GPT also has a JSON mode available, which you add to your request, that can tighten up the output: "response_format": { "type": "json_object" }.
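A minimal sketch of JSON mode with the openai Python client (the model, prompts, and key names are illustrative; note that JSON mode expects the word “JSON” to appear somewhere in your messages):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # JSON mode
    messages=[
        {"role": "system", "content": "Reply in JSON with keys 'name' and 'other'."},
        {"role": "user", "content": "Give me a sample record."},
    ],
)

# JSON mode returns the raw object -- no markdown fences to strip.
print(response.choices[0].message.content)
```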
Test the change.
Test GPT 4o
Nothing specific about 4o here. Follow your testing practices: run some ingestions, run user queries, chat with it, and try to force a question you know your RAG can't answer (I like using “Who is Darth Vader?”).
It’s a code change, treat it as such
Changing your model is a code change, so plan for it like a code change. Depending on your product and architecture, this could be a small change in one config file or a large change across various services. You also don't have to make the change everywhere, all at once; you can focus on ingestion first, or the chat experience first. Up to you. By the way, you certainly don't have to move to GPT-4o or GPT-4o mini. Keep in mind GPT-3.5 is still very applicable for some use cases.
and no, ChatGPT did not write this post.
