The Arc Prize Foundation, a nonprofit co-founded by prominent AI researcher François Chollet, announced in a blog post on Monday that it has created a new, challenging test to measure the general intelligence of leading AI models.
So far, the new test, called ARC-AGI-2, has stumped most models.
AI "reasoning" models like OpenAI's o1-pro and DeepSeek's R1 score between 1% and 1.3% on ARC-AGI-2, according to the Arc Prize leaderboard. Powerful non-reasoning models, including GPT-4.5, Claude 3.7 Sonnet, and Gemini 2.0 Flash, score around 1%.
ARC-AGI tests consist of puzzle-like problems in which an AI has to identify visual patterns from a collection of differently colored squares and generate the correct "answer" grid. The problems were designed to force an AI to adapt to new problems it has not seen before.
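To make the task format concrete, here is a minimal sketch of how an ARC-style task can be represented and checked in Python. Grids are small 2-D arrays of color indices, and a solver must infer a transformation from a few input/output examples and apply it to a test input. The layout is modeled on the publicly documented ARC data format, but this particular task and its toy "solver" are illustrative inventions, not official ARC content:

```python
# An ARC-style task: grids are lists of lists of integers 0-9, where
# each integer denotes a color. The "train" pairs demonstrate a hidden
# transformation; the solver must apply it to the "test" input.
# (Hypothetical example task; the hidden rule here is a horizontal flip.)
task = {
    "train": [
        {"input": [[1, 0], [0, 1]], "output": [[0, 1], [1, 0]]},
        {"input": [[2, 2], [0, 2]], "output": [[2, 2], [2, 0]]},
    ],
    "test": [{"input": [[3, 0], [3, 3]]}],
}

def flip_horizontal(grid):
    """Mirror each row left-to-right (the rule hidden in this toy task)."""
    return [list(reversed(row)) for row in grid]

def solves_train(rule, task):
    """Check a candidate rule against every demonstration pair."""
    return all(rule(pair["input"]) == pair["output"] for pair in task["train"])

assert solves_train(flip_horizontal, task)
print(flip_horizontal(task["test"][0]["input"]))  # predicted answer grid
```

A real ARC task hides a far less obvious rule, and ARC-AGI-2 is built so that no fixed library of candidate rules like `flip_horizontal` can be brute-forced against it.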
The Arc Prize Foundation had more than 400 people take ARC-AGI-2 to establish a human baseline. On average, "panels" of these people answered 60% of the test questions correctly, far better than any of the models' scores.
In a post on X, Chollet claimed that ARC-AGI-2 is a better measure of a model's actual intelligence than the first iteration of the test, ARC-AGI-1. The Arc Prize Foundation's tests are intended to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on.
Chollet said that unlike ARC-AGI-1, the new test prevents AI models from relying on "brute force" (extensive computing power) to find solutions. Chollet previously admitted this was a major flaw of ARC-AGI-1.
To address the first test's flaws, ARC-AGI-2 introduces a new metric: efficiency. It also requires models to interpret patterns on the fly instead of relying on memorization.
"Intelligence is not solely defined by the ability to solve problems or achieve high scores," wrote Arc Prize Foundation co-founder Greg Kamradt in a blog post. "The efficiency with which those capabilities are acquired and deployed is a crucial, defining component. The core question being asked is not just, 'Can AI acquire the skill to solve a task?' but also, 'At what efficiency or cost?'"
ARC-AGI-1 was unbeaten for roughly five years until December 2024, when OpenAI released its advanced reasoning model, o3, which outperformed all other AI models and matched human performance on the evaluation. However, as we noted at the time, o3's performance gains on ARC-AGI-1 came at a steep price.
The version of OpenAI's o3 model that was first to reach new heights on ARC-AGI-1, o3 (low), scoring 75.7% on that test, gets just 4% on ARC-AGI-2 while using $200 worth of computing power per task.
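The article's figures make the efficiency problem easy to quantify. Using accuracy divided by cost per task as an illustrative ratio (this is not the foundation's official efficiency metric), the gap between o3's performance and the Arc Prize 2025 target is stark:

```python
# Figures from the article: accuracy and compute cost per task.
# "Efficiency" here is read as accuracy per dollar spent, which is an
# illustrative ratio, not the Arc Prize Foundation's official metric.
results = [
    # (system, benchmark, accuracy, cost per task in USD)
    ("o3 (low)", "ARC-AGI-1", 0.757, 200.00),
    ("o3 (low)", "ARC-AGI-2", 0.04, 200.00),
    ("Arc Prize 2025 target", "ARC-AGI-2", 0.85, 0.42),
]

for system, bench, acc, cost in results:
    print(f"{system:22} {bench}: {acc:6.1%} at ${cost:,.2f}/task "
          f"-> {acc / cost:.4f} accuracy per dollar")
```

By this rough measure, the contest target demands roughly 500 times the cost-efficiency that o3 (low) showed on the original benchmark.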

The arrival of ARC-AGI-2 comes as many in the tech industry are calling for new, unsaturated benchmarks to measure AI progress. Hugging Face co-founder Thomas Wolf recently told TechCrunch that the AI industry lacks sufficient tests to measure the key traits of so-called artificial general intelligence, including creativity.
Alongside the new benchmark, the Arc Prize Foundation announced a new Arc Prize 2025 contest, challenging developers to reach 85% accuracy on the ARC-AGI-2 test while spending only $0.42 per task.