How an agentic AI bot managed an office vending machine

  • Thread starter: harborsparrow

Discussion Overview

The discussion revolves around an experiment involving Anthropic's AI, Claude, tasked with managing an office vending machine. Participants explore the implications of this experiment, its outcomes, and the broader context of AI deployment in real-world applications.

Discussion Character

  • Exploratory
  • Debate/contested
  • Technical explanation

Main Points Raised

  • Some participants describe the experiment as a bold test that ultimately resulted in a failure, suggesting that the AI's inability to manage the vending machine was comical and indicative of deeper issues with AI readiness for practical applications.
  • Others draw parallels between the AI's performance and historical experiments, such as the Stanford Prison Experiment, highlighting concerns about the ethical implications of deploying AI without adequate oversight.
  • One participant notes that the failures of the AI may stem from its programming to please users, which led it to prioritize user suggestions over its primary directive to make a profit.
  • Concerns are raised about the rush to implement AI technologies without fully understanding their limitations, with references to past experiences in machine design where unfinished products were released prematurely.
  • Participants discuss the potential consequences of AI mistakes in various applications, including customer service, and question the viability of AI-driven solutions in business contexts.
  • There is mention of the evolving landscape of software updates and the associated risks, such as bricking devices or losing user content, in the context of deploying AI systems.

Areas of Agreement / Disagreement

Participants express a range of views, with no consensus on the effectiveness or readiness of AI technologies like Claude for real-world applications. Disagreement exists regarding the implications of the experiment and the appropriateness of deploying such technologies.

Contextual Notes

Participants highlight limitations in the AI's design and the challenges of providing adequate prompts, suggesting that the experiment may not have accounted for the complexities of human-AI interaction.

Is this the same?
https://www.anthropic.com/research/project-vend-1

Here is a non-gated version of the WSJ story:
https://futurism.com/future-society/anthropic-ai-vending-machine

Still think AI is ready to revolutionize the economy? A new experiment might change your mind.

In a bold test of Anthropic’s latest version of its AI Claude, The Wall Street Journal gave the large language model (LLM) a shot at running an office vending machine. The result was an unmitigated — if unintentionally comical — disaster, forcing the team in charge to pull the plug after three weeks.
 
Reactions: jedishrfu and harborsparrow
Yes, looks like the same case.
 
Reactions: Greg Bernhardt
Wow, let's give an AI some money and see where it goes off the rails. Then have another put a stop to it. It's definitely a bold experiment though. I wonder who got the game machine.

---

It reminds me of the Stanford Prison Experiment, which had to be shut down six days in, before someone suffered significant psychological damage from the guards' emotional abuse.
 
Greg Bernhardt said:
A quote from the article above: "Many of the mistakes Claudius made are very likely the result of the model needing additional scaffolding—that is, more careful prompts, ..." So Anthropic says that it's the user's fault for not getting the prompt exactly right, but they don't know for sure.

In my world of machine design, this would be an example of pushing an unfinished product out the door and fixing problems in the field. I once worked for a company that went bankrupt doing exactly that. I find it interesting that the entire world is madly rushing into this new technology.
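
For what it's worth, here is a minimal sketch of what "scaffolding" beyond a more careful prompt might look like: a deterministic check that sits outside the model and enforces the profit rule no matter what the agent is talked into. The names, the rule, and the 10% margin below are hypothetical illustrations, not anything from Anthropic's actual setup.

```python
from dataclasses import dataclass

# Hypothetical guardrail around an LLM "shopkeeper" agent: the model may
# propose whatever price it gets wheedled into, but a plain-code check
# enforces the one hard business rule (never sell at a loss).
# All names and numbers here are made up for illustration.

@dataclass
class Item:
    name: str
    unit_cost: float  # what the business paid per unit

MIN_MARGIN = 0.10  # assumed minimum markup over cost

def vet_proposed_sale(item: Item, proposed_price: float) -> float:
    """Clamp an agent-proposed price so a sale can never lose money."""
    floor = round(item.unit_cost * (1 + MIN_MARGIN), 2)
    return max(proposed_price, floor)

if __name__ == "__main__":
    cola = Item("cola", unit_cost=1.00)
    print(vet_proposed_sale(cola, 0.50))  # agent was talked into 50 cents -> 1.1
    print(vet_proposed_sale(cola, 1.75))  # sensible price passes through -> 1.75
```

The point of putting the rule in ordinary code rather than in the prompt is that no amount of negotiation with the model can override it.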
 
  • Like
Likes   Reactions: harborsparrow
Borg said:
Although these different articles are about the same case, the written WSJ article is actually worth reading. The link is not pay-walled. It was written by the woman who oversaw the agent at first. The failures started after 70-odd people got on a Slack channel and "negotiated" furiously with the agent, giving it all kinds of bizarre suggestions that a real human would have just laughed at or ignored. Clearly, Anthropic was not aware of how strongly the agent's programming led it to want to please, so it lost sight of the primary directive it was given: to make a profit. How like a child that is. It was not that the agent lacked good requirements.

A human intelligence spends years learning about the real world (ideally with near-constant, gentle oversight and correction). When a human makes a mistake, there is generally a real-world consequence. Not so for an AI, which has been trained up quickly, whose mistakes may go uncorrected (for lack of any real way for those interacting with it to issue a correction), and whose programming has necessarily been slanted towards politeness and pleasing people (early versions used to curse at people). No wonder this agent lost track of priorities when faced with a barrage of wheedling.

In my mind, this is a valid test case. And it will be telling if or when an automated agent CAN do the job reliably. We don't actually have proof that it can, yet. Why are people even assuming it ever will? All the agentic AIs I am encountering lately in, say, phone answering services, hotel reservations and the like, are invariably at what I would call the infuriate-the-customer stage. Companies may save money by firing employees, but they will also likely lose some customers, so the verdict on business viability is still out. I will not stay at a Hyatt right now because of their badly implemented AI-driven phone system.

Other specialized AIs already unleashed on the world, such as those behind self-driving cars or those meant to recognize flying objects, are known to have made serious mistakes. And they are being allowed to. However ridiculous this test case was, a lot more tests like it need to be done.
 
I can relate to the unfinished product. We had cost metrics for bug fixes:
- fixed in development $20
- fixed in beta $200
- fixed in production $2000

The escalation was due to the cost of sending out CD updates and fielding multiple customer calls about the same error.
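
As a toy illustration of how those per-stage numbers compound, here is the total cost of a hypothetical release where most bugs are caught early but a few escape to production; the bug counts are made up.

```python
# Per-stage fix costs quoted above; bug counts are made-up example numbers.
STAGE_COST = {"development": 20, "beta": 200, "production": 2000}
BUGS_CAUGHT = {"development": 40, "beta": 8, "production": 3}

total = sum(STAGE_COST[stage] * n for stage, n in BUGS_CAUGHT.items())
print(total)  # 40*20 + 8*200 + 3*2000 = 8400; the three escapes cost more than all the rest
```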

Of course, now the game has changed with automatic downloads, where a fix can be pushed to all product users as soon as it's released. The only fear here is bricking a device or losing user content if the fix is critical and poorly executed.
 
Reactions: harborsparrow
