OpenAI's recently launched o3 and o4-mini models are state of the art in many respects. However, the new models still hallucinate, or make things up. In fact, they hallucinate more than several of OpenAI's older models.
Hallucinations have proven to be one of the biggest and most difficult problems to solve in AI, affecting even today's best-performing systems. Historically, each new model has improved slightly in the hallucination department, hallucinating less than its predecessor. But that doesn't seem to be the case for o3 and o4-mini.
According to OpenAI's internal tests, o3 and o4-mini, which are so-called reasoning models, hallucinate more often than the company's previous reasoning models (o1, o1-mini, and o3-mini) as well as OpenAI's traditional, “non-reasoning” models, such as GPT-4o.
Perhaps more concerning, the ChatGPT maker doesn't really know why it's happening.
In its technical report for o3 and o4-mini, OpenAI writes that “more research is needed” to understand why hallucinations get worse as it scales up reasoning models. O3 and o4-mini perform better in some areas, including coding and math tasks. But because they “make more claims overall,” they are often led to make “more accurate claims, as well as more inaccurate/hallucinated claims,” according to the report.
OpenAI found that o3 hallucinated in response to 33% of questions on PersonQA, the company's in-house benchmark for measuring the accuracy of a model's knowledge about people. That's roughly double the hallucination rate of OpenAI's previous reasoning models, o1 and o3-mini, which scored 16% and 14.8%, respectively. O4-mini did even worse on PersonQA, hallucinating 48% of the time.
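For readers unfamiliar with how such figures are produced, a hallucination rate on a benchmark like PersonQA is simply the share of graded answers that contain fabricated claims. The sketch below illustrates that bookkeeping with made-up data; the items, grades, and numbers are hypothetical, not OpenAI's actual evaluation harness.

```python
# Illustrative only: computing a hallucination rate from graded benchmark answers.
# The items and grades below are made up; PersonQA's real data and grader are not public.
from dataclasses import dataclass

@dataclass
class GradedAnswer:
    question: str
    answer: str
    hallucinated: bool  # whether a grader flagged the answer as containing fabricated claims

def hallucination_rate(graded: list[GradedAnswer]) -> float:
    """Fraction of graded answers flagged as hallucinated."""
    if not graded:
        return 0.0
    return sum(a.hallucinated for a in graded) / len(graded)

# Toy run: 2 of 4 answers flagged, reported as a 50% hallucination rate.
sample = [
    GradedAnswer("Where was person A born?", "Correct city", False),
    GradedAnswer("Who employs person B?", "An invented company", True),
    GradedAnswer("What did person C study?", "Correct subject", False),
    GradedAnswer("When did person D retire?", "A fabricated year", True),
]
print(f"{hallucination_rate(sample):.1%}")  # 50.0%
```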
Third-party testing by Transluce, a nonprofit AI research lab, also found evidence that o3 has a tendency to make up actions it took in the process of arriving at answers. In one example, Transluce observed o3 claiming that it ran code on a 2021 MacBook Pro “outside of ChatGPT,” then copied the numbers into its answer. While o3 has access to some tools, it can't do that.
“Our hypothesis is that the kind of reinforcement learning used for o-series models may amplify issues that are usually mitigated (but not fully erased) by standard post-training pipelines,” Neil Chowdhury, a Transluce researcher and former OpenAI employee, said in an email to TechCrunch.
Sarah Schwettmann, co-founder of Transluce, added that o3's hallucination rate may make it less useful than it otherwise would be.
Kian Katanforoosh, a Stanford adjunct professor and CEO of the upskilling startup Workera, told TechCrunch that his team is already testing o3 in their coding workflows, and that they've found it to be a step above the competition. However, Katanforoosh says o3 tends to hallucinate broken website links: the model will supply a link that, when clicked, doesn't work.
Hallucinations may help models arrive at interesting ideas and be creative in their “thinking,” but they also make some models a tough sell for businesses in markets where accuracy is paramount. For example, a law firm likely wouldn't be pleased with a model that inserts lots of factual errors into client contracts.
One promising approach to boosting the accuracy of models is giving them web search capabilities. OpenAI's GPT-4o with web search achieves 90% accuracy on SimpleQA, another of OpenAI's accuracy benchmarks. Potentially, search could improve reasoning models' hallucination rates as well, at least in cases where users are willing to expose prompts to a third-party search provider.
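To make the idea concrete, here is a minimal retrieval-grounded sketch: fetch a few search snippets first, then ask the model to answer only from them. It assumes the `openai` Python package (v1.x); `search_web` is a hypothetical placeholder for whatever search provider is used, and this is not OpenAI's built-in search feature, just the general pattern.

```python
# Hedged sketch of grounding answers in web search results before generation.
# Assumes the `openai` Python package (v1.x) and an OPENAI_API_KEY in the environment.
# `search_web` is a hypothetical stand-in for a real search provider's API.
from openai import OpenAI

client = OpenAI()

def search_web(query: str, k: int = 3) -> list[str]:
    """Hypothetical: return the top-k text snippets from a search provider of your choice."""
    raise NotImplementedError("plug in a real search API here")

def grounded_answer(question: str) -> str:
    snippets = search_web(question)
    context = "\n\n".join(snippets)
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer using only the provided snippets. "
                    "If they do not contain the answer, say you don't know."
                ),
            },
            {"role": "user", "content": f"Snippets:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```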
If scaling up reasoning models really does continue to worsen hallucinations, it will make the hunt for a solution all the more urgent.
“Addressing hallucinations across all our models is an ongoing area of research, and we're continually working to improve their accuracy and reliability,” OpenAI spokesperson Niko Felix said in an email to TechCrunch.
Over the past year, the broader AI industry has pivoted to focus on reasoning models after techniques for improving traditional AI models started showing diminishing returns. Reasoning improves model performance on a variety of tasks without requiring massive amounts of computing and data during training. Yet it seems reasoning may also lead to more hallucinating, presenting a challenge.