The Reasoning Models Substack, Issue #10
Grok 4 has arrived, OpenAI o3-pro deep(ish) dive, Apple poses the question - do AI models really think? And more that you'll just need to get in here to read about!
Okay, major milestone today - the 10th issue of the Reasoning Models Substack. I’ll be honest in saying it took me quite a bit longer to get here than I was hoping, but that’s okay. I’m a CTO, not a writer, so my bandwidth is pretty darn limited, and I decided to hold off on publishing this issue until Grok 4 came out because it felt like a big enough moment in Reasoning Model history to wait.
That being said, it does make it a little more fun to write this less frequently because there’s more to cover, and I can cover topics that have matured a bit, like o3-Pro, which is very likely the most powerful reasoning model on the planet at this time.
What has also made the last couple of months a little more interesting, dare I say - spicy, is Apple’s ML Team releasing a very thought-provoking paper titled “The Illusion of Thinking” that calls into question whether AI models are really as close to thinking and reasoning as we all think they are.
Oh and one more important(ish) note - I write this Substack the old-fashioned way, no AI, just me, sitting here in front of my computer typing away like a ~~psychopath~~ normal person. It takes me about an hour to write each of these issues, and I still don’t quite have an idea why I’m doing this beyond an insane interest in Reasoning Models and our progression towards AGI. I have no eBook to sell, no course for you to sign up for, and as you’ve probably noticed, this Substack is completely free.
For me, I guess I just know we’re living in such a fascinating time, and at the core of all this wonder is Reasoning Models, so I just can’t help myself. If you enjoy this Substack, all I can ask is that you share it with others and help me grow, because I suck at marketing so I certainly need all the help I can get!
Okay, enough preamble, let’s get to the good stuff, Issue #10 is here - let’s rock 🤘
Grok 4 has arrived, here’s the dets
Last night, and I mean late last night, midnight for everyone on the East Coast, luckily 9pm for me, xAI announced Grok 4. The event took place in a dark room with Musk sharing some opening thoughts as he talked about AI’s move towards human-level intelligence. At one point he touched on the potential that AI could become evil, which he admitted was possible, and said if that did happen, he’d hope to be alive to see it.
The stakes are high for Grok with GPT-5 just around the corner, and based on everything that was shared last night - it does sound like this new model is going to deliver.
“With respect to academic questions, Grok 4 is better than PhD level in every subject, no exceptions,” (Elon Musk)
The x-axis on this chart is important to understand because it traces the technological progression that has carried reasoning models like Grok forward. Here’s a rundown of each stage (with a quick code sketch of next-token prediction after the list) - and of course, I couldn’t write this part myself, I just had to use Grok 4 to do it ⬇️
Next-token prediction refers to the fundamental training objective of autoregressive language models, where the AI learns to forecast the most likely next word or token in a sequence based on prior context, forming the basis for generating coherent text.
Pre-training compute denotes the computational resources, such as processing power and time, allocated to the initial unsupervised learning phase where the model is exposed to vast amounts of data to build general knowledge and patterns.
Pre-training + RL combines the pre-training phase with Reinforcement Learning (RL), a technique that fine-tunes the model using rewards and penalties—often from human feedback or simulated environments—to enhance alignment, safety, and specific skills like reasoning.
RL compute specifically highlights the computational investment in the Reinforcement Learning stage alone, focusing on iterative optimization to improve the model's decision-making and performance beyond initial pre-training.
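Since “next-token prediction” can sound abstract, here’s the quick code sketch I promised above. It’s a minimal, hedged example using a small open model (GPT-2 via HuggingFace Transformers - obviously not Grok, whose weights aren’t public) just to show what “predict the next token” actually looks like in practice:

```python
# Minimal sketch of next-token prediction with a small open model (GPT-2),
# purely for illustration - the same basic objective underpins frontier models.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# The logits at the last position score every candidate *next* token.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item())!r:>10}  {prob.item():.3f}")
```

Generating text is just this step repeated: pick (or sample) a token, append it to the prompt, and predict again.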
In the case of Grok, and Grok 4 specifically, the team took 10x more compute power than anything else out there (200,000 GPUs) and poured essentially all of it into RL.
And of course, one of the core focuses for Musk and team is HLE, Humanity’s Last Exam, since it is, as its creators say themselves, a benchmark at the frontier of human knowledge.
The problems in HLE aren’t just hard, they’re crazy hard, which is why it gets to have such a dramatic name. Here are a few examples of the problems that the xAI team shared last night.
Musk said that an actual human taking this exam, and a wicked smart one at that, could only get “maybe 5%, optimistically.”
As for everything that Grok 4 can do, I think Deedy from Menlo Ventures did a stellar job summarizing it all in this tweet:
And along with all of these benchmarks, we’re already seeing people use this in real-world use cases, specifically coding, and it’s doing things even Claude 4 Opus can’t do, which is wild.
Musk currently has this tweet pinned to his profile on X:
And here’s what one early Grok 4 user shared:
Okay, so I could go on and on and on about the Grok 4 announcement last night, but I think I covered a lot of the highlights above. If you really want to go deep, I would recommend watching the video from the livestream last night, it’s under an hour and well worth a watch:
o1-Pro is out, long live o3-Pro
OpenAI announced o3 and o4-mini on the same day back in mid-April. Not surprisingly, o3 is OpenAI’s newest reasoning model, replacing o1 and, yes, skipping o2, because we all know OpenAI has wacky logic when naming things 🤪
Then on June 10th, OpenAI updated their original blog post with the following edit ⬇️
I would have thought there would have been a bit more fanfare for o3-Pro, but to learn more about it you had to click that “release notes” link at the end. And if you just want the high-level overview, here you go:
Like o1-pro, o3-pro is a version of our most intelligent model, o3, designed to think longer and provide the most reliable responses. Since the launch of o1-pro, users have favored this model for domains such as math, science, and coding—areas where o3-pro continues to excel, as shown in academic evaluations. Like o3, o3-pro has access to tools that make ChatGPT useful—it can search the web, analyze files, reason about visual inputs, use Python, personalize responses using memory, and more. Because o3-pro has access to tools, responses typically take longer than o1-pro to complete. We recommend using it for challenging questions where reliability matters more than speed, and waiting a few minutes is worth the tradeoff. (Source - OpenAI)
Like most reasoning models, o3-Pro isn’t the fastest, but holy moly, after using it for a couple of weeks I can tell you, it’s insanely impressive and a big step forward from o1-Pro IMO.
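If you want to poke at o3-Pro outside of ChatGPT, here’s roughly what a call looks like through OpenAI’s Python SDK. A couple of hedges on my part: I’m assuming the model id is “o3-pro” and that your account has API access to it (it’s served through the Responses API rather than Chat Completions, as far as I can tell), so double-check OpenAI’s docs before treating this as gospel:

```python
# A hedged sketch of calling o3-pro via OpenAI's Python SDK (Responses API).
# Assumptions: model id "o3-pro" and that your API key actually has access to it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from your environment

response = client.responses.create(
    model="o3-pro",
    input="Prove that the square root of 2 is irrational, step by step.",
)

# Expect to wait a bit - the whole point of o3-pro is thinking longer.
print(response.output_text)
```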
Of course, looking at the benchmarks, it beats o1-Pro across the board, with some very impressive gains in the coding arena.
While the benchmarks are impressive, o3-Pro still struggles with hallucinations more than Gemini and Claude, which seem to take a more cautious approach. Lin Li does some great analysis on his blog that I highly recommend taking a look at if you want to do a deeper dive - https://treelli.github.io/blog/2025/reasoning-hallucination/
This is probably the area where OpenAI has the most catching up to do, but I don’t think it makes o3-Pro unusable. Like I said above, I’ve been really impressed with it, as have many other people; you just have to be on the lookout for hallucinations and adjust accordingly.
If I were to guess, and guess I will - I bet OpenAI is thinking long and hard about how to reduce hallucinations and o5-Pro will probably be a big step forward…because they’ll skip o4-Pro right?
Apple poses the trillion dollar question - do AI models really think?
Back in June, Apple’s Machine Learning Research group released a paper that I’ve read twice now, and might read a third time because it’s just so crazy interesting. And, if what the researchers are positing is accurate, it might make this whole Substack null and void…just kidding, I’m not going anywhere, but it sure is interesting.
The core issue Apple’s research team points out is the current model evaluation techniques:
Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. (Source - Apple)
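To make “controllable puzzle environments” a bit more concrete, here’s a toy sketch in the spirit of one of the puzzles the paper uses, Tower of Hanoi. The only complexity knob is the number of disks N, and the environment can verify any proposed move sequence exactly. To be clear, this is my own illustration, not the paper’s actual code:

```python
# Toy "controllable puzzle environment" in the spirit of the paper: Tower of Hanoi.
# Complexity is dialed up by adding disks; solutions can be verified exactly.

def hanoi_moves(n, source=0, target=2, spare=1):
    """Return the optimal move list for n disks as (disk, from_peg, to_peg) tuples."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, source, spare, target)
        + [(n, source, target)]
        + hanoi_moves(n - 1, spare, target, source)
    )

def is_valid_solution(n, moves):
    """Replay a move list and check it legally moves all n disks to peg 2."""
    pegs = [list(range(n, 0, -1)), [], []]  # peg 0 starts with disks n..1, largest at bottom
    for disk, src, dst in moves:
        if not pegs[src] or pegs[src][-1] != disk:
            return False  # that disk isn't on top of the source peg
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # can't put a larger disk on a smaller one
        pegs[dst].append(pegs[src].pop())
    return pegs[2] == list(range(n, 0, -1))

# The minimum solution length explodes (2^N - 1 moves) as N grows - this is the
# kind of complexity curve the paper walks reasoning models along.
for n in range(1, 11):
    moves = hanoi_moves(n)
    assert is_valid_solution(n, moves)
    print(f"N={n:2d} disks -> {len(moves):4d} moves")
```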
In evaluating what Apple’s team calls the Collapse of Reasoning Models, they do some pretty interesting analysis; here’s a highlight from page 8 of the paper:
We next examine how different specialized reasoning models equipped with thinking tokens respond to increasing problem complexity. Our experiments evaluate five state-of-the-art thinking models: o3-mini (medium and high configurations), DeepSeek-R1, DeepSeek-R1-Qwen-32B, and Claude-3.7-Sonnet (thinking). Fig. 6 demonstrates these models’ performance in terms of accuracy (top) and thinking token usage (bottom) across varying complexity levels.
What you can see from the data above, and the notes on Figure 6, is that Reasoning Models exhibit some strange behavior, a bit like a person burning out after being stumped on a hard problem for too long. Essentially the model collapses, throws up its hands, and its reasoning effort actually decreases.
The entire report is 30 pages long, and it’s dense. And while I think anyone who is as jazzed about Reasoning Models as I am should read it…I did have Dia’s awesome chat agent summarize the paper in five bullet points, so here ya go:
The paper investigates the strengths and limitations of Large Reasoning Models (LRMs)—a new class of language models designed for complex reasoning—by testing them on controlled puzzle environments where problem complexity can be precisely adjusted.
Results show that LRMs and standard Large Language Models (LLMs) each excel in different regimes: standard LLMs outperform LRMs on simple tasks, LRMs do better on moderately complex tasks, but both types fail completely as complexity increases further.
As puzzle complexity rises, LRMs initially increase their reasoning effort (measured by the number of tokens used), but after a certain threshold, their effort and accuracy collapse, even when more computational resources are available.
The study finds that LRMs often “overthink” simple problems (exploring unnecessary solutions) and struggle to follow explicit algorithms or maintain consistent reasoning on more complex tasks, revealing fundamental inefficiencies and limitations.
These findings challenge the current evaluation methods for reasoning models and suggest that, despite recent advances, LRMs still face major barriers to generalizable, reliable reasoning—especially as problem complexity grows.
If you want to read the entire paper yourself, just click here and go to town.
DIY: Run models locally using HuggingFace + LangChain
One section I want to make sure to include in each issue of this Substack is a DIY section. My thinking is, after reading all of this, some of you might want to start tinkering around, and I want to make sure to share everything you need to know to do this, quickly, and easily.
Right now, IMHO, one of the best ways to run models locally is with a powerful combo - HuggingFace + LangChain, and while I could walk you through this process, plenty of people have already put together great walkthroughs on YouTube.
One of my personal favorite YouTubers is Tim from Tech with Tim, so his video is the one I’m sharing with all of you.
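That said, if you’d like a taste before firing up the video, here’s a minimal sketch of the combo using the langchain-huggingface integration package (pip install langchain-huggingface transformers torch). The model ID below is just an example - swap in whatever fits on your machine:

```python
# Minimal local-inference sketch with LangChain's HuggingFace integration.
# Assumption: the example model (Qwen/Qwen2.5-0.5B-Instruct) is small enough
# for your hardware - any text-generation model from the Hub should slot in.
from langchain_huggingface import HuggingFacePipeline

llm = HuggingFacePipeline.from_model_id(
    model_id="Qwen/Qwen2.5-0.5B-Instruct",
    task="text-generation",
    pipeline_kwargs={"max_new_tokens": 128},
)

print(llm.invoke("Explain what a reasoning model is in two sentences."))
```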
Okay, that’s a wrap for issue #10 of the Reasoning Models Substack. Like I said in the beginning, I’m doing this for free, and I have no book, course, or anything else to sell you. I’m just an engineer who is insanely excited about reasoning models, happens to own ReasoningModels.com, and can’t seem to stop myself from writing about all the amazing stuff going on in the reasoning models space.
So that’s it, done, fin. See you next time!