When a company releases a new AI video generator, it’s not long before someone uses it to make a video of actor Will Smith eating spaghetti.
It’s become something of a meme, as well as a benchmark: seeing whether a new video generator can convincingly render Smith slurping down a bowl of noodles. Smith himself parodied the trend in an Instagram post in February.
Google’s Veo 2 is the latest to take its turn.
We are finally eating spaghetti. pic.twitter.com/AZO81w8JC0
— Jerrod Lew (@jerrod_lew) December 17, 2024
Will Smith and pasta is just one of several weird “unofficial” benchmarks to grip the AI community in 2024. A 16-year-old developer built an app that gives AI control over Minecraft and tests its ability to design structures. Elsewhere, a British programmer created a platform where AIs play games like Pictionary and Connect 4 against each other.
It’s not as if there aren’t more academic tests of an AI’s performance. So why have the weirdest ones taken off?
For one, many of the AI industry’s benchmarks don’t tell the average person much. Companies often cite their AI’s ability to answer questions on Math Olympiad exams, or to find reliable solutions to Ph.D.-level problems. Yet most people – including yours truly – use chatbots for things like answering emails and basic research.
Crowdsourced industry benchmarks are not necessarily better or more informative.
Take, for example, Chatbot Arena, a public benchmark that many AI enthusiasts and developers follow obsessively. Chatbot Arena lets anyone on the web rate how well AI models perform on specific tasks, such as creating a web application or generating an image. But raters tend to be unrepresentative – most come from AI and tech industry circles – and cast their votes based on personal, hard-to-pin-down preferences.

Ethan Mollick, a management professor at Wharton, recently pointed out in a post on X another problem with many AI industry benchmarks: they don’t compare a system’s performance to that of an average person.
“The fact that there aren’t 30 different benchmarks from different organizations in medicine, law, quality of advice, and so on is a real shame, since people are using AI systems for these things regardless,” Mollick wrote.
Weird AI benchmarks like Connect 4, Minecraft, and Will Smith eating spaghetti are certainly not empirical – or even all that generalizable. Just because an AI passes the Will Smith test doesn’t mean it will convincingly generate, say, a burrito.

One expert I spoke to about AI benchmarks suggested that the AI community focus on AI’s downstream impacts rather than its abilities in narrow domains. That’s reasonable. But I have a feeling weird benchmarks aren’t going away anytime soon. Not only are they fun – who doesn’t love watching AI build Minecraft castles? – but they’re easy to understand. And as my colleague Max Zeff recently wrote, the industry continues to struggle with distilling a technology as complex as AI into digestible marketing.
The only question on my mind is: which weird new benchmarks will go viral in 2025?