Earlier this week, Meta landed in hot water for using an experimental, unreleased version of its Llama 4 Maverick model to achieve a high score on a crowdsourced benchmark, LM Arena. The incident prompted LM Arena's maintainers to apologize, change their policies, and score the unmodified, vanilla Maverick.
It turns out, it is not very competitive.
The unmodified Maverick, "Llama-4-Maverick-17B-128E-Instruct," ranked below models including OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro as of Friday. Many of those models are months old.
The release version of Llama 4 has been added to LMArena after it was revealed that they cheated, but you probably didn't see it because you have to scroll down to 32nd place, which is where it ranks pic.twitter.com/a0bxkdx4lx
– (@pigeon__s) April 11, 2025
Why the poor performance? Meta's experimental Maverick, Llama-4-Maverick-03-26-Experimental, was "optimized for conversationality," the company explained in a chart published last Saturday. Those optimizations evidently played well on LM Arena, which has human raters compare models' outputs and choose which they prefer.
As we've written before, LM Arena has never been the most reliable measure of an AI model's performance, for a variety of reasons. But tailoring a model to a benchmark, besides being misleading, makes it challenging for developers to predict exactly how well the model will perform in different contexts.
In a statement, a Meta spokesperson told TechCrunch that Meta experiments with "all types of custom variants."
"'Llama-4-Maverick-03-26-Experimental' is a chat-optimized version we experimented with that also performs well on LMArena," the spokesperson said. "We have now released our open source version and will see how developers customize Llama 4 for their own use cases. We're excited to see what they will build and look forward to their ongoing feedback."