So-called AI reasoning models are becoming easier – and cheaper – to develop.
On Friday, NovaSky, a team of researchers based out of UC Berkeley’s Sky Computing Lab, released Sky-T1-32B-Preview, a reasoning model that is competitive with an earlier version of OpenAI’s o1 on a number of key benchmarks. Sky-T1 appears to be the first open-source reasoning model in the sense that it can be replicated from scratch; the team released the dataset they used to train it, as well as the necessary training code.
“Remarkably, Sky-T1-32B-Preview was trained for less than $450,” the team wrote in a blog post, “demonstrating that it is possible to replicate high-level reasoning skills in a way affordable and efficient.”
$450 may not sound that affordable. But not so long ago, the price of training a model with comparable performance often ran into the millions of dollars. Synthetic training data — that is, training data generated by other models — has helped drive costs down. Palmyra X 004, a model recently released by AI company Writer that was trained almost entirely on synthetic data, reportedly cost just $700,000 to develop.
Unlike most AI, reasoning models effectively fact-check themselves, which helps them avoid some of the pitfalls that normally trip models up. Reasoning models take slightly longer — usually seconds to minutes longer — to arrive at solutions compared to a typical non-reasoning model. The upside is that they tend to be more reliable in domains such as physics, science, and math.
The NovaSky team says it used another reasoning model, Alibaba’s QwQ-32B-Preview, to generate the initial training data for Sky-T1, then “curated” the data mixture and used OpenAI’s GPT-4o-mini to refactor the data into a more workable format. Training the 32-billion-parameter Sky-T1 took about 19 hours on a rack of eight Nvidia H100 GPUs. (Parameter count roughly corresponds to a model’s problem-solving ability.)
According to the NovaSky team, Sky-T1 outperforms an early preview version of o1 on MATH500, a collection of “competition-level” math challenges. The model also beats the o1 preview on a set of difficult problems from LiveCodeBench, a coding benchmark.
However, Sky-T1 falls short of the o1 preview on GPQA-Diamond, which contains physics, biology, and chemistry questions a PhD graduate would be expected to be able to answer.
It’s also worth noting that OpenAI’s GA release of o1 is a stronger model than the preview version of o1, and that OpenAI is expected to release an even better-performing reasoning model, o3, in the coming weeks.
But the NovaSky team says Sky-T1 marks just the beginning of their journey to develop open-source models with advanced reasoning capabilities.
“Moving forward, we will focus on developing more efficient models that maintain robust reasoning performance and exploring advanced techniques that further increase the efficiency and accuracy of the models at test time,” the team wrote in the post. “Stay tuned as we make progress on these exciting initiatives.”