Every Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much prior knowledge, the brainteasers are usually challenging even for skilled contestants.
That's why some experts think they're a promising way to test the limits of AI's problem-solving abilities.
In a recent study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test revealed surprising behaviors, such as reasoning models (OpenAI's o1, among others) sometimes "giving up" and providing answers they know aren't correct.
"We wanted to develop a benchmark with problems that humans can solve with only general knowledge," Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors of the study, told TechCrunch.
The AI industry is in a strange spot at the moment. Most of the tests commonly used to evaluate AI models probe for skills, such as competency on PhD-level math and science questions, that aren't relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased so that models can't draw on "rote memory" to solve them, explained Guha.
"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it; that's when everything clicks together all at once," Guha said. "That requires a combination of insight and a process of elimination."
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it's possible that models trained on them could "cheat" in a sense, though Guha says he hasn't seen evidence of this.
"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."
On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take somewhat longer to arrive at solutions, typically seconds to minutes longer.
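Benchmarks like this one boil down to comparing a model's answer against a known solution for each riddle and reporting overall accuracy. The sketch below illustrates that scoring loop; the riddle data and answers here are invented for illustration and are not from the actual ~600-question benchmark.

```python
# Minimal sketch of scoring model answers against gold answers.
# The riddles and predictions below are hypothetical examples only.

def normalize(answer: str) -> str:
    """Lowercase and strip whitespace so comparison ignores formatting."""
    return answer.strip().lower()

def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Return the fraction of riddles answered correctly (exact match
    after normalization), i.e. the benchmark accuracy."""
    correct = sum(
        normalize(predictions[qid]) == normalize(gold[qid]) for qid in gold
    )
    return correct / len(gold)

# Toy data: two made-up riddles, one answered correctly.
gold = {"riddle-1": "lettuce", "riddle-2": "stone"}
predictions = {"riddle-1": " Lettuce ", "riddle-2": "rock"}
print(score(predictions, gold))  # → 0.5
```

A real harness would also have to handle riddles with multiple acceptable answers and free-form model output, which is why published benchmarks often pair exact-match scoring with manual or model-assisted grading.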
At least one model, DeepSeek's R1, gives a solution it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim that it "gives up," followed by an incorrect answer seemingly chosen at random, behavior a human can certainly relate to.
The models make other bizarre choices, like giving a wrong answer only to retract it immediately, try to tease out a better one, and fail again. They also get stuck "thinking" forever, give nonsensical explanations for answers, or arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.
"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model mimics what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."
The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). (R1 scored 35%.) As a next step, the researchers plan to expand their testing to additional reasoning models, which they hope will help identify areas where these models could be improved.

"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are and aren't capable of."