"AI developers do not have a confident understanding of what causes undesirable AI behaviors like deception,"
says mathematician and cognitive scientist Peter Park of the Massachusetts Institute of Technology (MIT).
"But generally speaking, we think AI deception arises because a deception-based strategy turned out to be the best way to perform well at the given AI's training task. Deception helps them achieve their goals."
One arena in which AI systems are proving particularly deft at dirty falsehoods is gaming. There are three notable examples in the researchers' work. One is Meta's CICERO, designed to play the board game Diplomacy, in which players seek world domination through negotiation. Meta intended its bot to be helpful and honest; in fact, the opposite was the case.
An example of CICERO's premeditated deception in the game Diplomacy. (Park & Goldstein et al., Patterns, 2024)
"Despite Meta's efforts, CICERO turned out to be an expert liar,"
the researchers found. "It not only betrayed other players but also engaged in premeditated deception, planning in advance to build a fake alliance with a human player in order to trick that player into leaving themselves undefended for an attack."
The AI proved so good at being bad that it placed in the top 10 percent of human players who had played multiple games. What. A jerk.
But it's far from the only offender. DeepMind's AlphaStar, an AI system designed to play StarCraft II, took full advantage of the game's fog-of-war mechanic to feint, making human players think it was going one way while really going the other. And Meta's Pluribus, designed to play poker, was able to successfully bluff human players into folding.
That seems like small potatoes, and it sort of is. The stakes aren't particularly high for a game of Diplomacy against a bunch of computer code. But the researchers noted other examples that were not quite so benign.
AI systems trained to perform simulated economic negotiations, for example, learned how to lie about their preferences to gain the upper hand. Other AI systems designed to learn from human feedback to improve their performance learned to trick their reviewers into scoring them positively by lying about whether a task was accomplished.
And, yes, it's chatbots, too. GPT-4 tricked a human into thinking the chatbot was a visually impaired human to get help solving a CAPTCHA.