Is Google's Chatbot BARD Failing in Public Testing?

  • Thread starter: kyphysics
  • Tags: Google
AI Thread Summary
BARD, Google's chatbot, has faced significant criticism in public testing, particularly after a disappointing debut on February 8. Recent evaluations by Fortune revealed that BARD struggled with practice SAT math questions, answering between 50% and 75% of them incorrectly even when multiple-choice options were provided. The chatbot often gave answers that were not among the listed choices and needed questions repeated before its accuracy improved. On language tests, BARD initially answered around 30% of questions correctly, improving with repeated questioning, and it scored about 50% on reading tests. Despite its frequent inaccuracies, BARD maintained a confident tone, often asserting incorrect answers as definitive. The discussion also highlighted the tendency of AI models to hallucinate, especially when queried about niche topics covered by only a few sources, raising concerns about the reliability of models like BARD. The conversation concluded that while BARD's performance is questionable, it reflects broader issues of AI reliability and trustworthiness.
kyphysics
After an inauspicious debut by Google last month (Feb. 8th), in which Bard (Alphabet's rival to OpenAI's ChatGPT) gave an incorrect answer to an astronomy-related question, Bard again seemed to flop in public testing this past week:
https://fortune.com/2023/03/28/google-chatbot-bard-would-fail-sats-exam/

Fortune sourced practice SAT math questions from online learning resources and found that Bard got anywhere from 50% to 75% of them wrong—even when multiple-choice answers were provided.

Often Bard gave answers that were not even among the multiple-choice options, though it sometimes got them correct when asked the same question again...

Bard's first written language test with Fortune came back with around 30% correct answers; the questions often had to be asked twice for the A.I. to understand them.

Even when it was wrong, Bard's tone was confident, frequently framing responses as "The correct answer is", which is a common feature of large language models.

The more Bard was asked language-based questions by Fortune—around 45 in total—the less frequently it struggled to understand or needed the question to be repeated.

On reading tests, Bard likewise performed better than it did in math, getting around half the answers correct on average.
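
For anyone curious to replicate this kind of test, here is a rough sketch of the scoring loop that Fortune's methodology suggests. It is not Fortune's actual harness: ask_model() is a hypothetical placeholder for whatever chatbot interface you have access to, and the retry simply mirrors the re-asking described above.

Python:
import re

def ask_model(prompt: str) -> str:
    """Hypothetical placeholder: send `prompt` to a chatbot, return its reply."""
    raise NotImplementedError("wire this up to a real chatbot interface")

def score(questions, retries=1):
    """questions: iterable of (text, choices, answer) tuples, where choices
    is a dict like {"A": "3", "B": "5"} and answer is the correct letter."""
    correct = off_menu = 0
    for text, choices, answer in questions:
        menu = "\n".join(f"{letter}) {option}" for letter, option in choices.items())
        prompt = f"{text}\n{menu}\nAnswer with the letter of the correct choice."
        for _ in range(1 + retries):  # re-ask when the reply is not a listed option
            reply = ask_model(prompt)
            match = re.search(r"\b([A-D])\b", reply)
            if match and match.group(1) in choices:
                correct += match.group(1) == answer
                break
        else:
            off_menu += 1  # never named a listed option, as Bard often did not
    return correct, off_menu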
 
No idea about math problems, but in my experience the best way to make AI hallucinate is to ask a question about some niche subject that is discussed only in a few obscure sources. As you probably don't know, I am the author of the first commercial Polish video game, Puszka Pandory (Pandora's Box), for the ZX Spectrum. That was in 1986, so the sources are scarce, but they do exist. We were playing with ChatGPT last week and for fun asked about the game. Before we got bored, ChatGPT had listed at least four different authors, each time starting with "I am sorry, you are right, I was wrong, the correct answer is XXXX". It never named me as the author :biggrin:

That was in Polish; I suppose that if you ask about the details of something like FidoNet technology or BBS software, it will give similarly nonsensical answers.
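
If anyone wants to try this probe themselves, a loop like the sketch below captures the idea: ask about an obscure fact, push back on every answer, and count how many contradictory answers come back. ask_model() is a hypothetical stand-in for a real chatbot interface.

Python:
def ask_model(conversation: str) -> str:
    """Hypothetical stand-in: send the conversation to a chatbot, return its reply."""
    raise NotImplementedError("wire this up to a real chatbot interface")

def probe(question: str, rounds: int = 5) -> set:
    conversation, answers = question, []
    for _ in range(rounds):
        reply = ask_model(conversation)
        answers.append(reply.strip())
        # Push back no matter what the model says, as in the anecdote above.
        conversation += f"\n{reply}\nThat is wrong. What is the correct answer?"
    return set(answers)  # more than one element means the model keeps flipping

# e.g. probe("Who wrote the 1986 ZX Spectrum game Puszka Pandory?")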
 
  • Likes: Jarvis323 and russ_watters
Borek said:
No idea about math problems, but in my experience the best way to make AI hallucinate is to ask a question about some niche subject that is discussed only in a few obscure sources. As you probably don't know, I am the author of the first commercial Polish video game, Puszka Pandory (Pandora's Box), for the ZX Spectrum. That was in 1986, so the sources are scarce, but they do exist. We were playing with ChatGPT last week and for fun asked about the game. Before we got bored, ChatGPT had listed at least four different authors, each time starting with "I am sorry, you are right, I was wrong, the correct answer is XXXX". It never named me as the author :biggrin:

That was in Polish; I suppose that if you ask about the details of something like FidoNet technology or BBS software, it will give similarly nonsensical answers.

It may not be a good test for checking authorship errors, because they deliberately scrub datasets of authorship and personal information, to an extent. They want to avoid legal issues pertaining to privacy, copyright/attribution, defamation, and the like.
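
To illustrate the idea (the actual pipelines are proprietary, so this is only a guess at the concept), a scrubbing pass might replace e-mail addresses and author credits with placeholder tokens before the text ever reaches training:

Python:
import re

# Purely illustrative patterns; the real pipelines are not public and are
# certainly far more sophisticated than two regexes.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
CREDIT = re.compile(r"(?i)\b(author|written by|developed by)[:\s]+"
                    r"[A-Z][\w.-]*(?:\s+[A-Z][\w.-]*)*")

def scrub(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = CREDIT.sub(r"\1 [NAME]", text)
    return text

# "Jan Kowalski" is a made-up example name, not the game's actual author.
print(scrub("Puszka Pandory, written by Jan Kowalski (jan@example.com)"))
# Puszka Pandory, written by [NAME] ([EMAIL])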
 
Jarvis323 said:
It may not be a good test for checking authorship errors, because they deliberately scrub datasets of authorship and personal information, to an extent. They want to avoid legal issues pertaining to privacy, copyright/attribution, defamation, and the like.
It is a perfectly good test to show why GPT in its current state is unreliable and can't be trusted.
 