Is Google's Chatbot BARD Failing in Public Testing?

  • Thread starter: kyphysics
  • Tags: Google

Discussion Overview

The discussion centers around the performance of Google's chatbot BARD in public testing, particularly in relation to its accuracy in answering questions, including those from practice SATs and language tests. Participants explore the implications of BARD's performance and the challenges faced by AI in handling niche subjects.

Discussion Character

  • Debate/contested
  • Technical explanation
  • Exploratory

Main Points Raised

  • Some participants note that BARD performed poorly on practice SAT math questions, answering 50% to 75% of them incorrectly, and often gave answers that were not among the listed multiple-choice options.
  • It was observed that BARD struggled less often to understand language-based questions as more of them were asked, suggesting its handling of such queries improved over the course of testing.
  • One participant shares an anecdote about AI hallucination on niche subjects, recounting how ChatGPT repeatedly misattributed the authorship of their video game.
  • Another participant argues that testing for authorship errors may not be effective due to the deliberate scrubbing of datasets for personal information to avoid legal issues.
  • There is a claim that BARD's current state demonstrates unreliability and a lack of trustworthiness in its responses.

Areas of Agreement / Disagreement

Participants express differing views on BARD's reliability and on the effectiveness of the testing methods; no consensus is reached.

Contextual Notes

Limitations noted include the potential influence of dataset scrubbing on authorship accuracy and the challenges of evaluating AI performance on niche topics.

kyphysics
After an inauspicious debut by Google last month (Feb. 8th), where BARD (Alphabet's rival to OpenAI's Microsoft-backed ChatGPT) gave an incorrect answer to an astronomy-related question, BARD again seems to have flopped in public testing this past week:
https://fortune.com/2023/03/28/google-chatbot-bard-would-fail-sats-exam/

Fortune sourced practice SAT math questions from online learning resources and found that Bard got anywhere from 50% to 75% of them wrong—even when multiple-choice answers were provided.

Often Bard gave answers which were not even a multiple-choice option, though it sometimes got them correct when asked the same question again...

Bard’s first written language test with Fortune came back with around 30% correct answers, often needing to be asked the questions twice for the A.I. to understand.

Even when it was wrong, Bard's tone was confident, frequently framing responses as: “The correct answer is”, which is a common feature of large language models.

The more Bard was asked language-based questions by Fortune—around 45 in total—the less frequently it struggled to understand or needed the question to be repeated.

Bard performed better on reading tests than it did in math, getting around half the answers correct on average.
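For a sense of what such a test involves mechanically, here is a minimal Python sketch of a multiple-choice scoring harness. It is not Fortune's actual setup, and ask_model is a hypothetical stand-in for whichever chatbot API is being queried:

```python
# A sketch of a multiple-choice scoring harness in the spirit of the
# Fortune test: ask a chatbot each question, then record whether the
# reply is correct, wrong, or not among the offered options at all.
import re

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in: wire this to whatever chatbot API you test."""
    raise NotImplementedError

def extract_choice(reply: str, options: dict[str, str]) -> str | None:
    """Pull a single option letter (A-D) out of a free-form reply."""
    match = re.search(r"\b([A-D])\b", reply)
    return match.group(1) if match and match.group(1) in options else None

def score(questions: list[dict]) -> None:
    correct = wrong = off_menu = 0
    for q in questions:
        prompt = q["text"] + "\n" + "\n".join(
            f"{letter}) {text}" for letter, text in q["options"].items()
        )
        choice = extract_choice(ask_model(prompt), q["options"])
        if choice is None:
            off_menu += 1          # reply was not a listed option
        elif choice == q["answer"]:
            correct += 1
        else:
            wrong += 1
    total = len(questions)
    print(f"correct {correct}/{total}, wrong {wrong}, not an option {off_menu}")
```

Tracking the "not an option" count separately matters here, since answers outside the offered choices were one of the failure modes Fortune reported.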
 
No idea about math problems, but in my experience the best way to make AI hallucinate is to ask a question on some niche subject that is discussed only in a few obscure sources. As you probably don't know, I am the author of the first commercial Polish video game, Puszka Pandory (Pandora's Box), for the ZX Spectrum. That was in 1986, so the sources are scarce, but they do exist. We were playing with ChatGPT last week and for fun asked about the game. Before we got bored, ChatGPT had listed at least four different authors, each time starting with "I am sorry, you are right, I was wrong, the correct answer is XXXX". It never named me as the author :biggrin:

That was in Polish; I suppose if you ask about the details of something like FidoNet technology or BBS software, it will give similarly nonsensical answers.
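Borek's probe generalizes into a simple consistency check: ask the same niche question several times and count the distinct answers. A minimal sketch, assuming the OpenAI Python client (v1) and the gpt-3.5-turbo model as stand-ins for whichever chatbot is under test:

```python
# A minimal sketch of the probe described above: ask the same niche
# question several times and count the distinct answers. A genuinely
# known fact should come back the same way each time; a hallucinated
# one tends to drift between attempts.
from collections import Counter

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = "Who wrote the 1986 ZX Spectrum game Puszka Pandory?"

answers: Counter[str] = Counter()
for _ in range(5):
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": QUESTION}],
    )
    answers[resp.choices[0].message.content.strip()] += 1

# Several distinct answers to one factual question is a strong hint
# that the model is guessing rather than recalling.
print(f"{len(answers)} distinct answer(s):")
for text, count in answers.most_common():
    print(f"  {count}x {text[:80]}")
```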
 
Likes: Jarvis323 and russ_watters
Borek said:
No idea about math problems, but in my experience the best way to make AI hallucinate is to ask a question on some niche subject that is discussed only in a few obscure sources. As you probably don't know, I am the author of the first commercial Polish video game, Puszka Pandory (Pandora's Box), for the ZX Spectrum. That was in 1986, so the sources are scarce, but they do exist. We were playing with ChatGPT last week and for fun asked about the game. Before we got bored, ChatGPT had listed at least four different authors, each time starting with "I am sorry, you are right, I was wrong, the correct answer is XXXX". It never named me as the author :biggrin:

That was in Polish; I suppose if you ask about the details of something like FidoNet technology or BBS software, it will give similarly nonsensical answers.

It may not be a good test for checking authorship errors, because they deliberately scrub datasets of authorship and personal information, to an extent. They want to avoid legal issues pertaining to privacy, copyright/attribution, defamation, or whatever else.
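For a concrete picture of what such scrubbing can look like, here is a minimal sketch built on illustrative regexes. Everything in it is an assumption about the general technique, not any vendor's actual process; production pipelines rely on NER models and far broader pattern sets:

```python
# A minimal sketch of the kind of scrubbing described above: redact
# obvious personal identifiers before text enters a training corpus.
# These regexes are illustrative assumptions, not a real pipeline.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
    "BYLINE": re.compile(r"(?im)^(?:author|byline|written by)\b.*$"),
}

def scrub(text: str) -> str:
    """Replace each matched identifier with a labeled placeholder."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label} REMOVED]", text)
    return text

sample = "Author: Borek\nContact: borek@example.com\nPandora's Box (1986)"
print(scrub(sample))
# -> [BYLINE REMOVED]
#    Contact: [EMAIL REMOVED]
#    Pandora's Box (1986)
```

If training text passes through anything like this, authorship questions probe exactly the information the pipeline tried to strip, which is Jarvis323's point about why such questions may be a poor test.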
 
Jarvis323 said:
It may not be a good test for checking authorship errors, because they deliberately scrub datasets of authorship and personal information, to an extent. They want to avoid legal issues pertaining to privacy, copyright/attribution, defamation, or whatever else.
It is a perfectly good test to prove why GPT in its current state is unreliable and can't be trusted.
 
