Every Sunday, NPR host Will Shortz, The New York Times' crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much prior knowledge, the brainteasers are usually challenging even for skilled contestants.
That's why some experts think they're a promising way to test the limits of AI's problem-solving abilities.
In a recent study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and the startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test revealed surprising behaviors, such as reasoning models (OpenAI's o1, among others) sometimes "giving up" and providing answers they know aren't correct.
"We wanted to develop a benchmark with problems that humans can solve with only general knowledge," Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors of the study, told TechCrunch.
The AI industry is in a strange spot at the moment. Most of the tests commonly used to evaluate AI models probe for skills, such as competency on PhD-level math and science questions, that aren't relevant to the average user. Meanwhile, many benchmarks, even benchmarks released relatively recently, are quickly approaching the saturation point.
The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn't test for esoteric knowledge, and the challenges are phrased so that models can't draw on "rote memory" to solve them, explained Guha.
"I think what makes these problems hard is that it's really difficult to make meaningful progress on a problem until you solve it; that's when everything clicks together all at once," Guha said. "That requires a combination of insight and a process of elimination."
No benchmark is perfect, of course. The Sunday Puzzle is U.S.-centric and English-only. And because the quizzes are publicly available, it's possible that models trained on them could "cheat" in a sense, though Guha says he hasn't seen evidence of this.
"New questions are released every week, and we can expect the latest questions to be truly unseen," he added. "We intend to keep the benchmark fresh and track how model performance changes over time."
On the researchers' benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek's R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take somewhat longer to arrive at solutions, typically seconds to minutes longer.
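Benchmarks like this one boil down to comparing a model's answer against a known solution for each riddle and reporting overall accuracy. The sketch below illustrates that scoring loop; the riddle data and answers here are invented for illustration and are not from the actual ~600-question benchmark.

```python
# Minimal sketch of scoring model answers against gold answers.
# The riddles and predictions below are hypothetical examples only.

def normalize(answer: str) -> str:
    """Lowercase and strip whitespace so comparison ignores formatting."""
    return answer.strip().lower()

def score(predictions: dict[str, str], gold: dict[str, str]) -> float:
    """Return the fraction of riddles answered correctly (exact match
    after normalization), i.e. the benchmark accuracy."""
    correct = sum(
        normalize(predictions[qid]) == normalize(gold[qid]) for qid in gold
    )
    return correct / len(gold)

# Toy data: two made-up riddles, one answered correctly.
gold = {"riddle-1": "lettuce", "riddle-2": "stone"}
predictions = {"riddle-1": " Lettuce ", "riddle-2": "rock"}
print(score(predictions, gold))  # → 0.5
```

A real harness would also have to handle riddles with multiple acceptable answers and free-form model output, which is why published benchmarks often pair exact-match scoring with manual or model-assisted grading.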
At least one model, DeepSeek's R1, gives a solution it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim that it "gives up," followed by an incorrect answer seemingly chosen at random, behavior a human can certainly relate to.
The models make other bizarre choices, like giving a wrong answer only to retract it immediately, try to tease out a better one, and fail again. They also get stuck "thinking" forever, give nonsensical explanations for answers, or arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.
"On hard problems, R1 literally says that it's getting 'frustrated,'" Guha said. "It was funny to see how a model mimics what a human might say. It remains to be seen how 'frustration' in reasoning can affect the quality of model results."
The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high "reasoning effort" (47%). (R1 scored 35%.) As a next step, the researchers plan to expand their testing to additional reasoning models, which they hope will help identify areas where these models could be improved.

"You don't need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don't require PhD-level knowledge," Guha said. "A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are and aren't capable of."