The Reasoning Models Substack, Issue #9
OpenAI ups the ante, small but mighty models get mightier, and more
Hello and welcome to the ninth issue of the Reasoning Models Substack. I’m writing this as I’m getting ready to head to SF for the Bolt.new party, so this issue is especially short and sweet.
Still, I didn’t want to leave everyone hanging. Heck, this is ReasoningModels.com, so you come here to stay in the know on everything happening in the world of Reasoning Models, right?!
So on that note, let’s get to the good stuff. Here’s what happened in the world of Reasoning Models over the last week ⬇️
OpenAI Releases o3 and o4-mini
On April 16, OpenAI announced two new models: o3 (their new “big brain” that can literally think with images) and the trimmed-down o4-mini. Both slot into ChatGPT and promise deeper chain-of-thought, but early testers noticed o3 hallucinates more than its older siblings, so… growing pains.
From a benchmarking standpoint, both models outperform o1, but in some tests, like AIME 2025 competition math and GPQA Diamond PhD-level science questions, the improvement isn’t anything to write home about.
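If you want to poke at the new models yourself, here’s a minimal sketch (mine, not from OpenAI’s announcement) of calling them through the official Python SDK. The model names come from the release; the reasoning_effort value and the prompt are just illustrative.

```python
# Minimal sketch: calling the new o-series models via the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o4-mini",          # or "o3" for the big brain
    reasoning_effort="high",  # o-series knob: "low" | "medium" | "high"
    messages=[
        {"role": "user", "content": "How many primes are there below 100?"}
    ],
)
print(response.choices[0].message.content)
```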
Still, despite some hallucination issues, people have been loving o3, and it has become my daily driver. The problem I had with o1 was that, while I typically liked its output the most, it took a long time to answer. o3 is snappy and does a better job of searching the web than any other model I’ve seen… honestly, I’m a little worried for Perplexity now 👀
I think Dan Shipper did a great job breaking down what makes o3 so special in his tweet, so I’d give that a read if you want to do a deeper dive.
Google DeepMind answered with a “reasoning dial.”
Gemini 2.5 Flash landed in mid-April with a little dial, and I think “reasoning dial” is a pretty neat way to describe it: you twist it to decide how hard the model thinks. Crank it up for thorny logic puzzles, dial it down for quick autocomplete.
MIT Technology Review did a solid article on Google’s new model; here’s a good nugget from it ⬇️
“We’ve been really pushing on ‘thinking,’” says Jack Rae, a principal research scientist at DeepMind. Such models, which are built to work through problems logically and spend more time arriving at an answer, rose to prominence earlier this year with the launch of the DeepSeek R1 model. They’re attractive to AI companies because they can make an existing model better by training it to approach a problem pragmatically. That way, the companies can avoid having to build a new model from scratch. (Source: MIT Technology Review)
It’s basically “turbo mode” for LLMs, and devs on X immediately started comparing token costs in screenshots.
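For the curious, here’s roughly what twisting the dial looks like in code. This is a minimal sketch assuming Google’s google-genai SDK, where the dial surfaces as a “thinking budget” measured in tokens; the budget value and the prompt are illustrative, so check the Gemini docs for the supported range.

```python
# Minimal sketch: the "reasoning dial" as a thinking budget in google-genai.
from google import genai
from google.genai import types

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="If all bloops are razzies and all razzies are lazzies, are all bloops lazzies?",
    config=types.GenerateContentConfig(
        # Crank up for thorny puzzles; set to 0 to switch thinking off.
        thinking_config=types.ThinkingConfig(thinking_budget=2048)
    ),
)
print(response.text)
```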
The “small‑but‑mighty” model wave kept rolling 🌊
Microsoft pushed out Phi-4-reasoning-plus on April 30 and bragged that it matches o3-mini on math while being tiny (there’s a quick sketch of trying it out after this list).
A Cornell/Together AI crew showed off M1, a Mamba-style hybrid that ditches full transformers and still nails state-of-the-art reasoning with 3× faster inference.
Azure published docs for an entire o‑series lineup (o1 → o4‑mini) so enterprises can pick a reasoning tier like they pick VM sizes.
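If you want to kick the tires on the Phi release, here’s a minimal local-inference sketch using Hugging Face transformers. The repo id below is my assumption of where the checkpoint lives, so double-check the model card before relying on it; you’ll also want a decent GPU for a 14B-class model.

```python
# Minimal sketch: running Phi-4-reasoning-plus locally via transformers.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="microsoft/Phi-4-reasoning-plus",  # assumed repo id; verify on the Hub
    device_map="auto",
)

messages = [{"role": "user", "content": "Is 1001 prime? Think it through."}]
out = generator(messages, max_new_tokens=512)
print(out[0]["generated_text"][-1]["content"])  # last message is the reply
```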
Researchers were busy poking the hype balloon.
DeepMind rolled out QuestBench, a benchmark that checks whether models know when they’re missing info before answering; a toy version of the idea is sketched below. Meanwhile, an Ars Technica write-up highlighted a study showing today’s “simulated reasoning” still flops on Olympiad-style proofs. Translation: great at short math, still shaky on long logic chains.
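To make the QuestBench idea concrete, here’s a toy sketch (mine, not the actual benchmark harness): feed a model underspecified questions and score whether it asks for the missing fact instead of guessing. ask_model is a hypothetical stand-in for any chat-model call that returns a string.

```python
# Toy illustration of the QuestBench idea, not the real benchmark.
UNDERSPECIFIED = [
    # (prompt, the fact the model should notice is missing)
    ("Alice is 3 years older than Bob. How old is Alice?", "Bob's age"),
    ("A train travels at 60 mph. How far does it go?", "the travel time"),
]

def asks_for_clarification(answer: str) -> bool:
    # Crude heuristic: a model that spots the gap should ask a question.
    lowered = answer.lower()
    return "?" in answer and any(
        w in lowered for w in ("what", "which", "how", "need to know")
    )

def questbench_ish_score(ask_model) -> float:
    # Fraction of underspecified prompts where the model asked instead of guessing.
    hits = sum(asks_for_clarification(ask_model(q)) for q, _missing in UNDERSPECIFIED)
    return hits / len(UNDERSPECIFIED)
```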
Meta‑trend: evaluation & safety.
Every release came with a PDF on “preparedness” or “reasoning safety,” and vendors keep reminding us that our private docs don’t train the model, which is clearly a response to enterprise lawyers side-eyeing chain-of-thought logging.
TL;DR
April was the month reasoning models got options: big vs. mini, dial-a-depth, transformer vs. Mamba. They’re sharper, occasionally weirder, and definitely cheaper to run—perfect timing for anyone building agents or complex QA, as long as you keep an eye on those hallucination stats 😵‍💫
And that’s a wrap, thanks for reading and I’ll see you next time.