
Can Pictionary and Minecraft test the ingenuity of AI models?

Most AI benchmarks don’t tell us much. They ask questions that can be solved by rote memorization or cover topics that are not relevant to the majority of users.

Some AI enthusiasts are therefore turning to games to test the problem-solving capabilities of AIs.

Paul Calcraft, an independent AI developer, created an application in which two AI models can play a Pictionary-like game together. One model doodles, while the other model tries to guess what the doodle represents.
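
Calcraft hasn’t published the internals here, so the following is only a minimal sketch of what such a drawer/guesser loop could look like in Python, assuming an OpenAI-compatible API; the model names, prompts, and the ask helper are illustrative placeholders, not his actual code:

```python
# Minimal two-model Pictionary loop (illustrative sketch, not Calcraft's code).
# Assumes an OpenAI-compatible API; model names and prompts are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    """Send one prompt to a model and return its text reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def play_round(secret_word: str, drawer: str, guesser: str) -> bool:
    # The drawer must convey the word as an image, never as text.
    svg = ask(drawer, f"Draw '{secret_word}' as a simple SVG image. "
                      "Reply with SVG markup only, with no text labels.")
    # The guesser sees only the drawing, never the secret word.
    guess = ask(guesser, "This SVG is a Pictionary doodle. In one word, "
                         f"what does it depict?\n\n{svg}")
    return secret_word.lower() in guess.lower()

if __name__ == "__main__":
    won = play_round("pelican", drawer="gpt-4o", guesser="gpt-4o-mini")
    print("Guessed it!" if won else "Missed it.")
```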

“I thought it looked super fun and potentially interesting from a model capabilities standpoint,” Calcraft told TechCrunch in an interview. “So I sat inside on a cloudy Saturday and did it.”

Calcraft was inspired by a similar project from British programmer Simon Willison, which tasked models with rendering a vector drawing of a pelican riding a bicycle. Willison, like Calcraft, chose a challenge he believed would require the models to “think” beyond the contents of their training data.

LLM Pictionary
Image credits: Paul Calcraft

“The idea is to have a benchmark that is impossible to game,” Calcraft said. “A benchmark that cannot be beaten by memorizing specific answers or simple patterns already seen in training.”

Minecraft also falls into this “ungameable” category, or at least that’s what 16-year-old Adonis Singh believes. He created a tool, MCBench, which gives a model control of a Minecraft character and tests its ability to design structures, in the vein of Microsoft’s Project Malmö.
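
The article doesn’t describe MCBench’s internals, but a harness of this kind typically asks the model for structured build steps and has a bridge into the game execute them. Here is a hypothetical Python sketch, where llm_build_plan stands in for a real model call and the executor is stubbed out with print statements:

```python
# Hypothetical MCBench-style harness (not Singh's actual code): the model
# returns a JSON build plan, and a stubbed executor applies each step.
import json

def llm_build_plan(task: str) -> str:
    """Stand-in for a real model call; returns a JSON list of build steps."""
    return json.dumps([
        {"block": "stone", "x": 0, "y": 64, "z": 0},
        {"block": "stone", "x": 1, "y": 64, "z": 0},
    ])

def apply_plan(plan_json: str) -> None:
    """Execute each step; a real harness would call a Minecraft bridge here."""
    for step in json.loads(plan_json):
        print(f"place {step['block']} at ({step['x']}, {step['y']}, {step['z']})")

apply_plan(llm_build_plan("build a two-block wall"))
```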

“I think Minecraft tests the ingenuity of models and gives them more freedom,” he told TechCrunch. “It’s not as narrow and saturated as the (other) benchmarks.”

Using games to benchmark AI is nothing new. The idea dates back decades: mathematician Claude Shannon argued in 1949 that games like chess were a worthy challenge for “intelligent” software. More recently, Alphabet’s DeepMind developed a model that could play Pong and Breakout; OpenAI trained an AI to compete in Dota 2 matches; and Meta designed an algorithm that could hold its own against professional Texas hold’em players.

But what’s different now is that enthusiasts are connecting large language models (LLMs) – models that can analyze text, images and more – to games to probe how well they reason.

There is an abundance of LLMs, from Gemini to Claude to GPT-4o, and they all have different “vibes,” so to speak. They “feel” different from one interaction to the next – a phenomenon that can be difficult to quantify.

MCBench
Note the typo; there is no such model as Claude 3.6 Sonnet. Image credits: Adonis Singh

“LLMs are known to be sensitive to the particular way questions are asked, and generally unreliable and difficult to predict,” Calcraft said.

Unlike text-based benchmarks, games provide a visual and intuitive way to compare a model’s performance and behavior, said Matthew Guzdial, an AI researcher and professor at the University of Alberta.

“We can think of each benchmark as giving us a different simplification of reality, focused on particular types of problems, like reasoning or communication,” he said. “Games are just another way to have AIs make decisions, so people use them like any other approach.”

Those familiar with the history of generative AI will notice how similar Pictionary is to generative adversarial networks (GANs), in which a generator model sends images to a discriminator model, which then evaluates them.
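
For readers who haven’t encountered GANs, the structure the analogy points to looks roughly like the toy PyTorch sketch below; it is purely illustrative, and real GANs are trained jointly with gradient updates, which is where the analogy to Pictionary breaks down:

```python
# Toy GAN skeleton (illustrative): a generator proposes images and a
# discriminator scores them, structurally akin to Pictionary's
# drawer/guesser split.
import torch
import torch.nn as nn

latent_dim, image_dim = 64, 28 * 28

generator = nn.Sequential(            # "drawer": noise -> fake image
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, image_dim), nn.Tanh(),
)
discriminator = nn.Sequential(        # "guesser": image -> real/fake score
    nn.Linear(image_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)

noise = torch.randn(16, latent_dim)    # a batch of random latent vectors
fake_images = generator(noise)         # the generator "draws"
verdicts = discriminator(fake_images)  # the discriminator "judges"
print(verdicts.shape)                  # torch.Size([16, 1])
```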

Calcraft believes that Pictionary can capture an LLM’s ability to understand concepts such as shapes, colors and prepositions (for example, the meaning of “in” versus “on”). He wouldn’t go so far as to say the game is a reliable test of reasoning, but he argued that winning requires strategy and the ability to interpret clues – two skills that don’t come easily to models.

“I also really like the almost adversarial nature of the Pictionary game, similar to GANs, where you have two different roles: one draws and the other guesses,” he said. “The best drawing is not the most artistic, but the one that can most clearly convey the idea to the audience of other LLMs (including faster and much less capable models!).”

“Pictionary is a toy problem that is not immediately practical or realistic,” Calcraft warned. “That said, I think spatial understanding and multimodality are essential for the advancement of AI, so LLM Pictionary could be a small first step on this journey.”

MCBench
Image credits: Adonis Singh

Singh believes Minecraft is also a useful benchmark, one that can measure reasoning in LLMs. “From the models I’ve tested so far, the results match up perfectly with how much I trust each model with something reasoning-related,” he said.

Others aren’t so sure.

Mike Cook, a researcher at Queen Mary University of London who specializes in AI, doesn’t think Minecraft is particularly special as a testbed for AI.

“I think part of the fascination with Minecraft comes from people outside the gaming sphere who perhaps think that, because it resembles the ‘real world,’ it has a closer connection to reasoning or real-world action,” Cook told TechCrunch. “From a problem-solving perspective, it’s not that different from a video game like Fortnite, Stardew Valley or World of Warcraft. There’s just a different skin on top that makes it look more like a set of everyday tasks, such as building things or exploring.”

According to Cook, even the best gaming AI systems typically don’t adapt well to new environments and can’t easily solve problems they’ve never encountered before. A model that excels at Minecraft, for example, is unlikely to be able to play Doom with any real skill.

“I think the good qualities of Minecraft from an AI perspective are extremely sparse reward signals and a procedural world, which means unpredictable challenges,” Cook continued. “But it’s not really any more representative of the real world than any other video game.”

That said, there is certainly something fascinating about watching LLMs build castles.