This is a weekly newsletter about the business of the technology industry. To receive Tanay’s Newsletter in your inbox, subscribe here for free:
Hi friends,
This week I’ll be discussing OpenAI’s recently released o1 model, which represents a new leap in the reasoning capabilities of LLMs, and the consequences of the new inference-time scaling laws that underpin it. But first, a bit of background.
Pre-training Scaling Laws
Scaling laws for LLMs are fairly well understood at this point: as training compute, dataset size, and model parameter count increase, model performance improves predictably.
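For the curious, these laws are often written in a Chinchilla-style form, where expected loss falls smoothly as parameter count N and training tokens D grow. The formula below is an illustrative shape only; the constants are placeholders rather than fitted values:

```latex
% Chinchilla-style pre-training scaling law (illustrative form, not fitted constants)
% L = expected loss, N = parameter count, D = training tokens
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```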
Hundreds of millions of dollars are already being spent on pre-training models, and the expectation is that this number will only go up, as noted by Mark Zuckerberg on Meta’s spend and by Dario Amodei of Anthropic.
“The amount of compute needed to train Llama 4 will likely be almost 10 times more than what we used to train Llama 3, and future models will continue to grow beyond that.” — Mark Zuckerberg, Meta CEO
“Right now, 100 million. There are models in training today that are more like a billion. I think if we go to ten or a hundred billion, and I think that will happen in 2025, 2026, maybe 2027, and the algorithmic improvements continue apace, and the chip improvements continue apace, then I think there is in my mind a good chance that by that time we'll be able to get models that are better than most humans at most things.” — Dario Amodei, Anthropic CEO
At the same time, there are questions around how much better models will get with scale alone, and whether new approaches will be needed, since we’re running out of good data and the models ultimately still feel like “next token predictors.” That’s where OpenAI’s o1 comes in.
OpenAI’s o1
OpenAI’s o1, named after the US visa for aliens with extraordinary ability, is a large language model launched in September 2024 and trained with reinforcement learning to perform complex reasoning.
The model uses a “chain of thought” approach, where it essentially thinks before it answers. It breaks down the query, reasons through it, potentially iterates on its response to ensure accuracy, and then answers the question.
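To make the idea concrete, here is a minimal sketch of an explicit think-then-answer loop. To be clear, this is not how o1 works under the hood (its reasoning is learned via reinforcement learning and happens internally); the `call_model` function is a hypothetical stand-in for any LLM API.

```python
# Minimal sketch of an explicit "think before answering" loop.
# NOTE: o1 does this internally via RL-trained reasoning; this is an external
# approximation using a hypothetical call_model(prompt) -> str helper.

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM API call."""
    raise NotImplementedError

def answer_with_reasoning(question: str, max_revisions: int = 2) -> str:
    # 1. Break the problem down and reason step by step (a hidden "scratchpad").
    thoughts = call_model(
        f"Think step by step about how to solve this. Do not answer yet.\n\n{question}"
    )
    # 2. Draft an answer based on that reasoning.
    draft = call_model(
        f"Question: {question}\n\nReasoning:\n{thoughts}\n\nNow give a final answer."
    )
    # 3. Optionally iterate: check the draft and revise if problems are found.
    for _ in range(max_revisions):
        critique = call_model(
            f"Question: {question}\nProposed answer: {draft}\n"
            "List any errors, or reply OK if it is correct."
        )
        if critique.strip().upper() == "OK":
            break
        draft = call_model(
            f"Question: {question}\nPrevious answer: {draft}\nIssues found: {critique}\n"
            "Write a corrected answer."
        )
    return draft
```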
This approach allows the model to outperform GPT-4o on a wide variety of reasoning tasks, as shown in the chart below.
Inference-time scaling
One of the most interesting aspects of OpenAI’s data on o1 is that as the inference-time compute available to the model increases (i.e., it can think for longer and harder), the model’s performance improves, as shown in the right-hand chart below.
The premise of inference-time (or test-time) scaling is that for harder problems, one can improve the model’s performance simply by spending more compute thinking about the problem at inference time.
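One simple, well-known flavor of this is self-consistency: sample several independent attempts and take a majority vote over the final answers. This is not what o1 does internally, but it illustrates why more inference compute can buy more accuracy. As before, `call_model` is a hypothetical stand-in for an LLM API.

```python
from collections import Counter

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for a sampling (non-deterministic) LLM API call."""
    raise NotImplementedError

def self_consistency_answer(question: str, n_samples: int = 16) -> str:
    # Spend more inference compute by sampling many independent attempts...
    answers = [
        call_model(f"Solve step by step, then give only the final answer.\n\n{question}")
        for _ in range(n_samples)
    ]
    # ...and return the most common final answer. Accuracy tends to improve
    # as n_samples (i.e., inference compute) grows, up to a point.
    return Counter(a.strip() for a in answers).most_common(1)[0][0]
```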
We’re still extremely early in understanding inference-time scaling, but it represents an exciting potential avenue for improving model performance beyond just relying on pre-training scaling laws. A few reasons why it matters:
I. Better models: We now have another way to get more performant models. Rather than spending 10x or more making them larger at training time, we can give them more time to think at inference time. It’s possible that over time we get to a point where small pre-trained models that are good at reasoning are simply given all the information and tools they need at inference time to solve whatever problem is in front of them.
II. A shift to inference compute: Today, in the context of LLMs, a lot of the compute is spent ahead of time in the pre-training and post-training phases rather than at inference. With inference-time scaling, we’ll see compute shift toward inference faster. This has one major benefit for model providers: it turns what was capital expenditure to train these models into an operating expense that can be directly measured and potentially charged for.
Jim Fan, who leads NVIDIA’s embodied AI team, summarized it well with the graphic below.
There was also a discussion between Brad Gerstner of Altimeter and Jensen Huang of NVIDIA around what this does to the aggregate shift of spend:
Brad: Inference is about 40% of your revenue today, because of chain of reasoning [what happens]?
Jensen: It’s about to go up a billion times.
III. Unlocks harder problems / problems with more consequences: It seems almost a bit silly that the same model that can write a funny poem is expected, with the same latency, to prove a new mathematical conjecture. It’s clear that certain problems are harder, or simply carry more consequences, and so need to be thought through more carefully and checked properly. Scaling compute at inference time gives models a way to tackle these problems, much as humans simply take longer on more challenging tasks. In this way, it could unlock a whole new set of use cases for models, and allow them to be used for highly consequential tasks that they aren’t considered suitable for today, given hallucinations and the risk of being wrong.
While it’s still very early, over time, depending on the task at hand, certain problems may be worth answering immediately (think simple questions, fact-based recall, or simple search-style queries), while others may be worth spending varying amounts of time reasoning about (from minutes to hours to even days), as Jensen Huang alludes to below:
“The idea that a model, before it answers your question, has done internal inference 10,000 times on that is not unreasonable. It’s also done tree search, it’s done some simulation, some reflection, it probably has looked up some data … this type of intelligence, it’s what we do.” — Jensen Huang, NVIDIA CEO
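As a rough sketch of what that routing could look like in practice, a system might estimate how hard (or how consequential) a query is and assign it a corresponding “thinking budget.” Everything below is hypothetical: the `estimate_difficulty` scorer, the model names, and the budget tiers are illustrative placeholders, not anything OpenAI has shipped.

```python
# Hypothetical router: decide how much inference-time "thinking" a query deserves.
# estimate_difficulty() and the budget tiers below are illustrative placeholders.

def estimate_difficulty(query: str) -> float:
    """Hypothetical difficulty scorer in [0, 1]; could itself be a small, fast model."""
    raise NotImplementedError

def pick_thinking_budget(query: str) -> dict:
    difficulty = estimate_difficulty(query)
    if difficulty < 0.2:
        # Simple recall or search-style queries: answer immediately.
        return {"model": "fast-model", "max_reasoning_tokens": 0}
    elif difficulty < 0.7:
        # Moderate reasoning: think for a bounded number of tokens.
        return {"model": "reasoning-model", "max_reasoning_tokens": 10_000}
    else:
        # Hard, high-consequence problems: allow long, possibly asynchronous reasoning.
        return {"model": "reasoning-model", "max_reasoning_tokens": 1_000_000}
```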
Closing Thoughts
It’s an exciting time to be building AI or building with AI, and inference-time scaling laws open up new sets of problems that AI wasn’t able to solve or wasn’t considered suitable for. For those building with AI, a few things to potentially ponder:
Scope: What new or adjacent problems in the domain you’re building in are now solvable with these advances in AI? More generally, which problems require significant reasoning and could potentially be solved now or soon, given the right tools and the latest advances in these models?
Interface: If certain tasks can be solved with AI but require a lot of reasoning and therefore take longer, what are the right interfaces to make them available to users? Are chatbots right, or should there be more asynchronous interfaces, such as email? How do you set the right expectations with users?
Orchestration and Productization: A new class of problems can now, in theory, be solved with these models, but how do you put that into practice? Which problems do you use o1-type models for, and which not? How do you decide how long to let the model think and how many iterations to run? What tools do you need to give the model to get the system working well in production? How do you optimize your costs and keep them reasonable?