Google had its LLM murder itself in Werewolf to test its AI smarts
At GDC 2024, Google engineers explained how its Large Language Model played the popular party game. It has a long way to go.
At GDC 2024, Google AI senior engineers Jane Friedhoff (UX) and Feiyang Chen (Software) showed off the results of their Werewolf AI experiment, in which all the innocent villagers and devious, murdering wolves are Large Language Models (LLMs).
Friedhoff and Chen trained each LLM chatbot to generate dialogue with a unique personality, strategize gambits based on its role, reason out what other players (AI or human) were hiding, and then vote for the most suspicious person (or the werewolf's scapegoat).
They then set the Google AI bots loose, testing how good they were at spotting lies and how susceptible they were to gaslighting. They also tested how the LLMs performed when specific capabilities, like memory or deductive reasoning, were removed, to see how each affected the results.
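To make that setup concrete, here's a purely illustrative Python sketch of how such an agent loop might look; none of it comes from Google's actual implementation. The `query_llm` function is a hypothetical stand-in for a real model call, and the two flags mimic the capability ablations the team described.

```python
from dataclasses import dataclass


def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call; returns canned text here."""
    return "placeholder response"


@dataclass
class WerewolfAgent:
    """One LLM-backed player, with toggles mimicking the described ablations."""
    name: str
    role: str           # "villager", "seer", or "werewolf"
    personality: str    # e.g., "paranoid" or "theatrical Shakespearean"
    use_memory: bool = True      # ablation: remember the whole discussion
    use_reasoning: bool = True   # ablation: weigh evidence before voting

    def speak(self, transcript: list[str]) -> str:
        """Generate one in-character line of dialogue."""
        # Without memory, the agent only "hears" the most recent line.
        context = transcript if self.use_memory else transcript[-1:]
        prompt = (
            f"You are {self.name}, a {self.personality} {self.role} in Werewolf.\n"
            "Discussion so far:\n" + "\n".join(context) +
            "\nSay one line of dialogue in character."
        )
        return query_llm(prompt)

    def vote(self, transcript: list[str], players: list[str]) -> str:
        """Name the player this agent currently finds most suspicious."""
        # Without reasoning, the agent votes on gut feel with no evidence.
        evidence = "\n".join(transcript) if self.use_reasoning else "(none)"
        prompt = (
            f"You are {self.name}, a {self.role} in Werewolf.\n"
            f"Evidence:\n{evidence}\n"
            f"Which of {', '.join(players)} is most likely the werewolf? "
            "Answer with one name only."
        )
        return query_llm(prompt)
```

Toggling either flag off starves the prompt of context, which is roughly the kind of ablation that would produce the accuracy drop described below.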
The Google engineering team was frank about the experiment's successes and shortcomings. In ideal situations, the villagers came to the right conclusion nine times out of 10; without proper reasoning and memory, that rate fell to three out of 10, as the bots became too cagey to reveal useful information and too skeptical of any claims, leading to random dogpiling on unlucky targets.
Even at full mental capacity, though, these bots tended to be too skeptical of anyone (like seers) who made bold claims early on. The engineers tracked the bots' intended end-of-round votes after each line of dialogue and found that their opinions rarely changed after those initial suspicions formed, regardless of what was said.
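That finding implies an evaluation loop that polls each bot's intended vote after every single line of dialogue. A minimal sketch of that instrumentation, reusing the hypothetical WerewolfAgent above:

```python
def track_vote_drift(agents: list[WerewolfAgent], num_rounds: int) -> list[dict]:
    """Poll every agent's intended end-of-round vote after each line of dialogue.

    The returned history shows whether opinions shift as the discussion
    unfolds; per the GDC talk, they rarely did once suspicions formed.
    """
    players = [a.name for a in agents]
    transcript: list[str] = []
    history: list[dict] = []
    for _ in range(num_rounds):
        for speaker in agents:
            transcript.append(f"{speaker.name}: {speaker.speak(transcript)}")
            # Snapshot everyone's current pick after this line.
            history.append({a.name: a.vote(transcript, players) for a in agents})
    return history
```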
Google's human testers said playing Werewolf with the AI bots was a blast, but they rated the bots 2/5 or 3/5 for reasoning and found that the best strategy for winning was to stay silent and let certain bots take the fall.
As Friedhoff explained, it's a legitimate strategy for a werewolf but not necessarily a fun one or the point of the game. The players had more fun messing with the bots' personalities; in one example, they told the bots to talk like pirates for the rest of the game, and the bots obliged — while also getting suspicious, asking, "Why ye be doing such a thing?"
That aside, the test showed the limits of the bots' reasoning. The engineers gave bots personalities (like a paranoid bot suspicious of everyone, or a theatrical bot that spoke like a Shakespearean actor), and the other bots reacted to those personalities without any context, finding the theatrical bot suspicious for how wordy and roundabout it was, even though that was simply its default personality.
In real-life Werewolf, the goal is to catch people speaking or behaving differently than usual. That's where these LLMs fall short.
Friedhoff also provided a hilarious example of a bot hallucination leading the villagers astray. When Isaac (the seer bot) accused Scott (the werewolf bot) of being suspicious, Scott responded that Isaac had accused the innocent "Liam" of being a werewolf and gotten him unfairly exiled. Isaac responded defensively, and suspicion turned to him — even though Liam didn't exist and the scenario was made up.
Google's AI efforts, like Gemini, have become smarter over time. Another GDC panel showcased Google's vision of generative AI video games that auto-respond to player feedback in real time and feature "hundreds of thousands" of LLM-backed NPCs that remember player interactions and respond organically to their questions.
Experiments like this one, though, cut through Google execs' bold plans and show how far artificial intelligence has to go before it's ready to replace actual written dialogue or real-life players.
Chen and Friedhoff managed to imitate the complexity of dialogue, memory, and reasoning that goes into a party game like Werewolf, and that's genuinely impressive! But these LLM bots need to go back to school before they're consumer-ready.
In the meantime, Friedhoff says these kinds of LLM experiments are a great way for game developers to "contribute to machine learning research through games," and that their experiment shows players are more excited about building and teaching LLM personalities than about playing with them.
Down the line, mobile games with text-based characters that respond organically to your typed replies are an intriguing idea, especially for interactive fiction, which typically requires hundreds of thousands of words of dialogue to give players enough choices.
If the best Android phones, with NPUs capable of AI processing, could deliver speedy LLM responses for these organic games, it would be truly transformative for gaming. This Generative Werewolf experiment is a good reminder that such a future is a ways off, however.