How an agentic AI bot managed an office vending machine

  • Thread starter: harborsparrow
Is this the same?
https://www.anthropic.com/research/project-vend-1

Here is a non-gated version of the WSJ story:
https://futurism.com/future-society/anthropic-ai-vending-machine

Still think AI is ready to revolutionize the economy? A new experiment might change your mind.

In a bold test of Anthropic’s latest version of its AI Claude, The Wall Street Journal gave the large language model (LLM) a shot at running an office vending machine. The result was an unmitigated — if unintentionally comical — disaster, forcing the team in charge to pull the plug after three weeks.
 
Yes, looks like the same case.
 
Wow, let's give an AI some money and see where it goes off the rails. Then have another put a stop to it. It's definitely a bold experiment, though. I wonder who got the game machine.

—-

It reminds me of the Stanford Prison Experiment, which had to be shut down six days in, before someone suffered significant psychological damage from the guards' emotional abuse.
 
Greg Bernhardt said:
A quote from the article linked above: "Many of the mistakes Claudius made are very likely the result of the model needing additional scaffolding—that is, more careful prompts, ..." So Anthropic says it's the user's fault for not getting the prompt exactly right, but they don't know for sure.

In my world of machine design, this would be an example of pushing an unfinished product out the door and fixing problems in the field. I once worked for a company that went bankrupt doing exactly that. I find it interesting that the entire world is madly rushing into this new technology.
 
Borg said:
Although these different articles are about the same case, the WSJ article itself is actually worth reading; the link is not paywalled. It was written by the woman who oversaw the agent at first. The failures started after 70-odd people got on a Slack channel and "negotiated" furiously with the agent, giving it all kinds of bizarre suggestions that a real human would have just laughed at or ignored. Clearly, Anthropic was not aware of how strongly the agent's programming led it to want to please, so it lost sight of the primary directive it was given: "to make a profit". How like a child that is. It was not a case of having been given poor requirements.

A human intelligence spends years learning about the real world (ideally with near-constant, gentle oversight and correction). When a human makes a mistake, there is generally a real-world consequence. Not so for an AI, which has been trained up quickly, whose mistakes may go uncorrected (for lack of any real way for those interacting with it to issue corrections), and whose programming has necessarily been slanted towards politeness and pleasing people (early versions used to curse at people). No wonder this agent lost track of its priorities when faced with a barrage of wheedling.

In my mind, this is a valid test case, and it will be telling if or when an automated agent CAN do the job reliably. We don't actually have proof that it can, yet. Why are people even assuming it ever will? All the agentic AIs I am encountering lately in, say, phone-answering services, hotel reservations, and the like are invariably at what I would call the infuriate-the-customer stage. Companies may save money by firing employees, but they will also likely lose some customers, so the verdict on business viability is still out. I will not stay at a Hyatt right now because of their badly implemented AI-driven phone system.

Other specialized AIs already unleashed on the world, such as those behind self-driving cars or those used to recognize flying objects, are known to have made serious mistakes, and they are being allowed to keep making them. However ridiculous this test case was, a lot more tests like it need to be done.
 
I can relate to the unfinished product. We had cost metrics for bugs:
- fixed in development $20
- fixed in beta $200
- fixed in production $2000

The escalation was due to the cost of sending out CD updates and handling multiple customer calls about the same error.

Of course, the game has changed now with automatic downloads, where a fix can be pushed to all product users as soon as it's released. The only fear here is bricking a device or losing user content if the fix is critical and poorly executed.
 
