Hey friends,
I’ve started working on centralizing all my educational content in a more structured and guided way. I’m realizing the newsletter format isn’t ideal, since you’ve all subscribed at different points in time.
If you’re curious and want to be one of the first people to see/try it, check out this page to join the early access list.
Now, let’s talk about reasoning models.
– Fawzi
What’s a reasoning model and how does it work?
Almost a year ago, OpenAI introduced its new family of “reasoning” models with the inaugural o1 model. Since then, we’ve seen the likes of Google, Anthropic, and DeepSeek release reasoning models of their own.
But how are they different from your standard LLMs?
You may recall that LLMs answer questions by predicting and generating a series of tokens (text). If you need a quick refresher, you can revisit the explainer I made a few months ago.
Reasoning models answer questions a bit differently than your standard LLM.
A standard LLM would immediately output the first answer that (probabilistically) comes to mind, which may not always be correct.
A reasoning model spends more time “thinking” by breaking down a question into intermediate steps and solving each one before giving a final answer. This is called a “chain of thought” and it helps the model work through a problem, investigate different solutions, and refine its answer. If you’ve ever used the “think step-by-step” prompt engineering technique, this is pretty much the same thing.
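If you’re curious what that looks like in practice, here’s a minimal sketch. The call_llm() helper and the bat-and-ball question are placeholders I made up for illustration (you’d wire it up to whatever model provider you use); the interesting part is the difference between the two prompts:

```python
# A minimal sketch of the "think step-by-step" idea. call_llm() is a
# made-up placeholder for whatever LLM API you use; swap in your own client.

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a model and return its text reply."""
    # Stubbed out so the sketch runs on its own; replace with a real API call.
    return f"(model reply to a {len(prompt)}-character prompt would go here)"

question = (
    "If a bat and a ball cost $1.10 together and the bat costs "
    "$1.00 more than the ball, how much does the ball cost?"
)

# Standard LLM style: ask directly and take the first answer it generates.
direct_answer = call_llm(question)

# Chain-of-thought style: ask for intermediate steps before the final answer.
# This is roughly what reasoning models do on their own, behind the scenes.
cot_prompt = (
    question
    + "\n\nThink step-by-step: break the problem into smaller steps, "
    "solve each one, and only then state the final answer."
)
reasoned_answer = call_llm(cot_prompt)

print(direct_answer)
print(reasoned_answer)
```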
These types of models were designed for complex problem-solving in areas like maths, physics, and coding. They’re not meant for memory and recall questions, like asking about historical facts or events.
Because it’s breaking down a task, evaluating different solutions, and refining its answers, a reasoning model costs more and uses more energy than your standard LLM due to the higher token output. The chain-of-thought tokens also take up space in your context window, which means the model may start “forgetting” things sooner.
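To make the cost point concrete, here’s a rough back-of-the-envelope sketch. Every number in it is a made-up placeholder (not real pricing or real token counts); the point is simply that the hidden chain-of-thought tokens multiply the output you pay for:

```python
# Back-of-the-envelope cost comparison. All numbers are made-up
# placeholders for illustration, not real model pricing or token counts.

price_per_1k_output_tokens = 0.01  # assumed illustrative price, in dollars

standard_tokens = 200     # a standard LLM might answer in ~200 output tokens
reasoning_tokens = 2_000  # a reasoning model also emits chain-of-thought tokens

standard_cost = standard_tokens / 1_000 * price_per_1k_output_tokens
reasoning_cost = reasoning_tokens / 1_000 * price_per_1k_output_tokens

print(f"Standard LLM:    {standard_tokens:>5} tokens -> ${standard_cost:.4f}")
print(f"Reasoning model: {reasoning_tokens:>5} tokens -> ${reasoning_cost:.4f}")

# Those chain-of-thought tokens also sit in the context window, leaving
# less room for your documents and earlier turns of the conversation.
```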
A simple example
Imagine you asked a reasoning model:
Problem:
If a train leaves Station A at 3:00 PM traveling at 60 km/h, and another train leaves Station B at 4:00 PM traveling at 90 km/h toward Station A, and the distance between stations is 360 km, at what time do the trains meet?
The model would start by breaking down the problem into smaller steps:
Step 1: Understand the problem.
We are told:
Distance between Station A and B = 360 km
Train 1 leaves Station A at 3:00 PM at 60 km/h
Train 2 leaves Station B at 4:00 PM at 90 km/h
Both trains are moving toward each other
Goal: Find the time when they meet.
Step 2: Break down the timeline.
Between 3:00 PM and 4:00 PM, only Train 1 is moving.
In that 1 hour, Train 1 travels: 60 km/h × 1 h = 60 km
So, by 4:00 PM, the distance remaining between the two trains is:
360 km - 60 km = 300 km
Now both trains are moving toward each other.
Step 3: Set up relative speed.
Train 1 speed = 60 km/h
Train 2 speed = 90 km/h
Combined speed when moving toward each other = 60 + 90 = 150 km/h
Step 4: Calculate time to meet after 4:00 PM.
Time = Distance / Speed = 300 km / 150 km/h = 2 hours
So they will meet 2 hours after 4:00 PM, which is 6:00 PM.
Finally, it outputs the final answer:
✅ Final Answer: 6:00 PM
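If you want to double-check the arithmetic, the same steps fit in a few lines of Python:

```python
# Sanity check of the train example, following the model's steps.

distance_km = 360
speed_train1 = 60  # km/h, leaves Station A at 3:00 PM
speed_train2 = 90  # km/h, leaves Station B at 4:00 PM

# Step 2: between 3:00 PM and 4:00 PM only Train 1 is moving.
head_start_km = speed_train1 * 1            # 60 km
remaining_km = distance_km - head_start_km  # 300 km

# Step 3: combined closing speed once both trains are moving.
combined_speed = speed_train1 + speed_train2  # 150 km/h

# Step 4: time to meet after 4:00 PM.
hours_after_4pm = remaining_km / combined_speed  # 2.0

print(f"They meet {hours_after_4pm:.0f} hours after 4:00 PM, i.e. at 6:00 PM.")
```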
Are reasoning models good enough?
It’s unclear.
A recent research paper from Apple exposed some of the limitations that reasoning models face. You can read the full paper here.
According to the Apple researchers, AI benchmarks aren’t a valuable measure of reasoning quality because they’re limited and suffer from data contamination. Data contamination is when an AI model is trained on the same data that it’s evaluated on, meaning it has “memorized” specific answers instead of developing generalized intelligence and problem-solving abilities.
To test the capabilities of reasoning models on new and complex problems, they used puzzle games like Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World.

Here’s what they learned:
Reasoning models fail to develop generalizable problem-solving capabilities, and their performance collapses to zero past a certain complexity threshold.
For simple problems, standard LLMs were more efficient and accurate than reasoning models. As complexity increased, reasoning models gained an advantage. When problems became highly complex, both types of models experienced complete performance collapse.
Reasoning models have a tendency to “overthink” simple problems. They often identify the correct solutions early but inefficiently continue exploring incorrect alternatives.
Even when models were provided the solution algorithm for the complex problems, they still failed (the Tower of Hanoi sketch below shows what such an algorithm looks like).
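To give a sense of what “provided the solution algorithm” means, here’s the textbook recursive procedure for Tower of Hanoi (my own sketch, not code from the paper). Solving n disks takes 2^n - 1 moves, which is why the puzzle gets out of hand so quickly as the researchers added disks:

```python
# Textbook recursive solution to Tower of Hanoi (illustrative sketch,
# not code from the Apple paper). Moving n disks takes 2**n - 1 moves.

def hanoi(n: int, source: str, target: str, spare: str, moves: list) -> None:
    """Append the moves that transfer n disks from source to target."""
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)  # park n-1 disks on the spare peg
    moves.append((source, target))              # move the largest disk
    hanoi(n - 1, spare, target, source, moves)  # stack the n-1 disks back on top

for disks in (3, 7, 10, 15):
    moves = []
    hanoi(disks, "A", "C", "B", moves)
    print(f"{disks} disks -> {len(moves)} moves (2^{disks} - 1 = {2**disks - 1})")
```

Even with a procedure like this spelled out in the prompt, the models still broke down once the number of disks, and therefore the number of required moves, got large enough.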
Final thoughts
As new model types emerge, an important skill to have is choosing the right model for the right task. Our first instinct is to assume that something called a “reasoning” model must be better than every other type of model because it’s more expensive and spends more time “thinking”. I’ve seen many AI influencers online telling people to use reasoning models at all times.
I’m using quotation marks on words like “reasoning” and “thinking” because whether these models are actually doing either is a whole other debate. I’m learning that I need to be more careful with the language I use to describe AI because it can set false expectations about how these systems work and what they can do.
Share this post with someone
Share this post in your group chats with friends, family, and coworkers.
If you’re not a free subscriber yet, join to get my latest work directly in your inbox.