Not even Pokémon is safe from AI benchmarking controversy.
Last week, a post on X went viral claiming that Google's latest Gemini model had surpassed Anthropic's flagship Claude model in the original Pokémon video games. Gemini had reportedly reached Lavender Town in a developer's livestream; Claude had been stuck at Mount Moon since late February.
Gemini is literally ahead of Claude atm in Pokémon after reaching Lavender Town

119 live viewers only btw, genuinely independent run pic.twitter.com/8avsovi4x

— you (@you21e8) April 10, 2025
But what the post failed to mention is that Gemini had an advantage.
As users on Reddit pointed out, the developer behind the Gemini stream built a custom minimap that helps the model identify "tiles" in the game, such as cuttable trees. This reduces the need for Gemini to analyze screenshots before making gameplay decisions.
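The post doesn't describe how the minimap actually works, but one way such a scaffold could operate is to label each on-screen tile and hand the model a compact text grid instead of raw pixels. The sketch below is purely illustrative; the tile labels, function name, and layout are assumptions, not the stream's actual code.

```python
# Hypothetical sketch of a "minimap" scaffold: rather than asking the model to
# interpret raw screenshots, the harness labels each tile on the visible grid
# and passes the model a small text map. All names here are illustrative.

# Example tile labels the harness might assign (not from the actual stream).
TILE_LABELS = {
    0: ".",   # walkable ground
    1: "#",   # wall / impassable
    2: "T",   # cuttable tree
    3: "D",   # door / warp
    4: "N",   # NPC
}

def render_minimap(tile_grid, player_pos):
    """Convert a 2D grid of tile IDs into a text minimap the model can read.

    tile_grid:  list of rows, each a list of tile IDs (ints above)
    player_pos: (row, col) of the player sprite
    """
    lines = []
    for r, row in enumerate(tile_grid):
        chars = [TILE_LABELS.get(tile, "?") for tile in row]
        if r == player_pos[0]:
            chars[player_pos[1]] = "@"  # mark the player
        lines.append("".join(chars))
    return "\n".join(lines)

# A toy 5x5 screen: the model can see at a glance that a cuttable tree ("T")
# sits northeast of the player ("@"), with no image analysis needed.
screen = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 2, 1],
    [1, 0, 0, 0, 3],
    [1, 4, 0, 0, 1],
    [1, 1, 1, 1, 1],
]
print(render_minimap(screen, player_pos=(2, 1)))
```

Whatever the real implementation looks like, the effect is the same: the model gets structured game state rather than having to parse it out of pixels, which is exactly the kind of help a stock setup wouldn't provide.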
Now, Pokémon is a semi-serious benchmark at best; few would argue it's a very informative test of a model's capabilities. But it is an instructive example of how different implementations of a benchmark can influence the results.
For example, Anthropic reported two scores for its latest model, Claude 3.7 Sonnet, on SWE-bench Verified, a benchmark designed to evaluate a model's coding abilities. Claude 3.7 Sonnet achieved 62.3% accuracy on SWE-bench Verified, but 70.3% with a "custom scaffold" that Anthropic developed.
More recently, Meta tuned a version of one of its newer models, Llama 4 Maverick, to perform well on a particular benchmark, LM Arena. The vanilla version of the model scores significantly worse on the same evaluation.
Given that AI benchmarks, Pokémon included, are imperfect measures to begin with, custom and non-standard implementations threaten to muddy the waters further. That is to say, it doesn't seem likely that comparing models will get any easier as they're released.