One of the new flagship models Meta released on Saturday, Maverick, ranks second on LM Arena, a test in which human raters compare the outputs of models and choose which they prefer. But it appears that the version of Maverick Meta deployed to LM Arena differs from the version that is widely available to developers.
As several AI researchers pointed out in posts on X, Meta noted in its announcement that the Maverick on LM Arena is an "experimental chat version." A chart on the official Llama website, meanwhile, discloses that the LM Arena testing was conducted using "Llama 4 Maverick optimized for conversationality."
As we have written before, for various reasons, LM Arena has never been the most reliable measure of a model's performance. But AI companies generally have not customized or otherwise tuned their models to score better on LM Arena, or at least have not admitted to doing so.
The problem with tailoring a model to a benchmark, withholding that version, and then releasing a "vanilla" variant of the same model is that it makes it hard for developers to predict exactly how well the model will perform in particular contexts. It is also misleading. Ideally, benchmarks, inadequate as they are, provide a snapshot of a single model's strengths and weaknesses across a range of tasks.
Indeed, researchers on X have observed stark differences in the behavior of the publicly downloadable Maverick compared with the model hosted on LM Arena. The LM Arena version seems to use a lot of emojis and gives very long-winded answers.
Okay Llama 4 is def a little cooked lol, what is this yap city pic.twitter.com/y3GVHBVZ65
– Nathan Lambert (@natolambert) 6 April 2025
For some reason, the Llama 4 model in the arena uses a lot more emojis
on together.ai, it seems better: pic.twitter.com/f74odx4ztt
– Tech Dev Notes (@techdevnotes) 6 April 2025
We have reached out to Meta and Chatbot Arena, the organization that maintains LM Arena, for comment.