Last month, AI founders and investors told TechCrunch that we're now in the “second age of scaling laws,” noting how established methods of improving AI models were showing diminishing returns. One promising new method they suggested could keep the gains coming was “test-time scaling,” which appears to be behind the performance of OpenAI's o3 model, though it comes with drawbacks of its own.
Much of the AI world took the announcement of OpenAI's o3 model as proof that progress in scaling AI hasn't “hit a wall.” The o3 model does well on benchmarks, significantly outperforming all other models on a test of general ability called ARC-AGI and scoring 25% on a difficult math test on which no other AI model scored higher than 2%.
Of course, we at TechCrunch are taking all of this with a grain of salt until we can test o3 ourselves (very few have so far). But even before the release of o3, the AI world was already convinced that something big had changed.
Noam Brown, co-creator of OpenAI's o series of models, noted Friday that the startup announced o3's impressive gains just three months after it announced o1, a relatively short time frame for such a jump in performance.
“We have every reason to believe this trajectory will continue,” Brown said in a tweet.
Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is proof that AI “progress will be faster in 2025 than in 2024.” (Note that it benefits Anthropic, and specifically its ability to raise capital, to suggest that AI scaling laws are continuing, even if Clark is complimenting a competitor.)
In the coming year, Clark says, the AI world will combine test-time scaling with traditional pre-training scaling methods to wring even greater returns out of AI models. Perhaps he's suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just as Google did last week.
Test-time scaling means that OpenAI is using more compute during ChatGPT's inference phase, the period after you press enter on a prompt. It's not clear exactly what's going on behind the scenes: OpenAI is either using more computer chips to answer a user's question, running more powerful inference chips, or running those chips for longer periods (10 to 15 minutes in some cases) before the AI produces an answer. We don't know all the details of how o3 was created, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models.
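OpenAI hasn't disclosed how o3 actually spends that extra inference compute. One technique from the research literature that captures the basic idea is best-of-N sampling paired with a verifier: sample several candidate answers and keep the one a scoring model likes best. The Python sketch below is a toy illustration of that idea, not OpenAI's method; the model and verifier are hypothetical stand-in stubs.

```python
import random

# Toy illustration of one hypothesized form of test-time scaling:
# "best-of-N" sampling. The "model" below is a stub that answers
# correctly only some of the time; spending more inference compute
# (a larger N) raises the odds that at least one sample is correct.

random.seed(0)

def sample_answer(question: str) -> str:
    """Stand-in for one forward pass of a model; right ~20% of the time."""
    return "42" if random.random() < 0.2 else str(random.randint(0, 100))

def verifier_score(question: str, answer: str) -> float:
    """Stand-in for a learned verifier scoring a candidate answer."""
    return 1.0 if answer == "42" else 0.0

def best_of_n(question: str, n: int) -> str:
    """Spend n model calls (more test-time compute), keep the best answer."""
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: verifier_score(question, a))

question = "What is 6 * 7?"
for n in (1, 8, 64):  # more samples = more inference compute per query
    hits = sum(best_of_n(question, n) == "42" for _ in range(200))
    print(f"n={n:>2}: correct on {hits}/200 trials")
```

In this toy setup, one sample is right about 20% of the time, while 64 samples with a reliable verifier are right almost always. The catch is that cost grows linearly with the number of samples, which is exactly the tradeoff the rest of this piece is about.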
While o3 may renew some faith in the progress of AI scaling laws, OpenAI's newest model also uses a never-before-seen level of compute, which means a higher price per answer.
“Perhaps the only important caveat here is understanding that one reason why o3 is so much better is that it costs more money to run at inference time; the ability to utilize test-time compute means on some problems you can turn compute into a better answer,” Clark writes in his blog. “This is interesting because it has made the costs of running AI systems somewhat less predictable; previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output.”
Clark, and others, pointed to o3's performance on the ARC-AGI benchmark—a difficult test used to assess advances in AGI—as an indicator of its progress. It is worth noting that passing this test, according to its creators, does not mean an AI model has reached AGI, but rather it is a way to measure progress towards a nebulous goal. That said, the o3 model outperformed all previous AI models that took the test, scoring 88% on one of its attempts. OpenAI's next best AI model, o1, scored only 32%.
But note the logarithmic x-axis on this chart, which may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 models used around $5 of compute per task, and o1-mini used just a few cents.
The creator of the ARC-AGI benchmark, François Chollet, writes in a blog post that OpenAI used roughly 170 times more compute to generate that 88% score than the high-efficiency version of o3, which scored just 12 percentage points lower. The high-scoring version of o3 used more than $10,000 worth of compute to complete the test, which makes it too expensive to compete for the ARC Prize, an unbeaten competition for AI models to beat the ARC test.
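Chollet's figures allow a rough back-of-the-envelope comparison. The numbers below are the ones cited in this article; the per-task cost of the high-efficiency configuration is our own derived estimate, not a reported figure.

```python
# Back-of-the-envelope math using figures cited in this article.
# The high-efficiency per-task cost is derived, not reported.
high_compute_cost_per_task = 1_000   # dollars; "more than $1,000" per task
compute_ratio = 170                  # high-compute o3 vs. high-efficiency o3
score_gap = 12                       # percentage points between the two runs

low_compute_cost_per_task = high_compute_cost_per_task / compute_ratio
extra_cost_per_point = (high_compute_cost_per_task - low_compute_cost_per_task) / score_gap
print(f"High-efficiency o3: roughly ${low_compute_cost_per_task:.2f} per task")
print(f"Each extra point cost about ${extra_cost_per_point:,.0f} per task")
```

That puts the high-efficiency configuration at roughly $6 per task, in the same ballpark as o1's $5, while each additional percentage point from the high-compute run cost on the order of $80 per task.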
However, Chollet says o3 was still a breakthrough for AI models.
“o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain,” Chollet said in the blog post. “Of course, such generality comes at a steep cost, and wouldn't quite be economical yet: You could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy.”
It's too early to speculate on the exact price of all this – we've seen prices for AI models drop significantly in the past year, and OpenAI has yet to announce what the o3 will actually cost. However, these prices show how much computation is required to break, even slightly, the performance barriers imposed by today's leading AI models.
This raises several questions. What is o3 actually for? And how much more compute will be necessary to squeeze further gains out of o4, o5, or whatever OpenAI names its future reasoning models?
It doesn't seem like o3, or its successors, would be anyone's “daily driver” like GPT-4o or Google Search might be. These models simply use a lot of calculations to answer little questions throughout your day, like, “How can the Cleveland Browns still make the 2024 playoffs?”
Instead, it seems like AI models that use scaled test-time compute may only be good for big-picture prompts such as, “How can the Cleveland Browns become a Super Bowl franchise in 2027?” Even then, it's probably only worth the high compute costs if you're the general manager of the Cleveland Browns and you're using these tools to make big decisions.
Institutions with deep pockets may be the only ones who can afford o3, at least to begin with, as Wharton professor Ethan Mollick notes in a tweet.
We've already seen OpenAI release a $200 tier to use a high-computing version of o1, but the startup is said to be considering creating subscription plans that cost up to $2,000. When you see how much the o3 calculation uses, you can understand why OpenAI would consider it.
But there are drawbacks to using o3 for high-impact work. As Chollet notes, o3 is not AGI, and it still fails at some very easy tasks that a human would breeze through.
This isn't necessarily surprising, as large language models still have a massive hallucination problem, which o3 and test-time compute don't seem to have solved. That's why ChatGPT and Gemini include disclaimers under every answer they produce, warning users not to trust answers at face value. Presumably AGI, if it's ever achieved, wouldn't need such a disclaimer.
One way to unlock more gains in test-time scaling could be better AI inference chips. There's no shortage of startups tackling just this, such as Groq or Cerebras, while other startups design more cost-efficient AI chips, such as MatX. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch that he expects these startups to play a bigger role in test-time scaling moving forward.
While o3 is a significant improvement in the performance of AI models, it raises some new questions about usage and costs. That said, o3's performance adds credence to the claim that test-time computing is the tech industry's next best way to scale AI models.