Insights: Why ChatGPT AI Is Not Reliable

  • Thread starter: PeterDonis
  • Tags: chatgpt
Summary
ChatGPT is deemed unreliable because it generates text based solely on word frequencies from its training data, lacking true understanding or semantic connections. Critics argue that it does not accurately answer questions or provide reliable information, often producing confident but incorrect responses. While some users report that it can parse complex code and suggest optimizations, this does not equate to genuine knowledge or reasoning. The discussion highlights concerns about its potential impact on how society perceives knowledge and the importance of critical evaluation of AI-generated content. Ultimately, while ChatGPT may appear impressive, its limitations necessitate cautious use and independent verification of information.
  • #31
PeterDonis said:
Is it? How would one even determine that?
By asking experts to give their educated estimates, say, two years ago.
 
  • #32
PeterDonis said:
Please give a reference: where has this been done?
I've just tried it:

Me: What are the customs of the aliens at the planet Xuxor?

ChatGPT:
As an AI language model, I don't have real-time information on specific planets or extraterrestrial civilizations outside of what is available up until my knowledge cutoff in September 2021. Therefore, I'm not aware of any specific customs of aliens on a planet called Xuxor or any other extraterrestrial civilization.

It's important to note that the existence of aliens or their customs on specific planets is purely speculative at this point, as we have not made contact with extraterrestrial life. If and when we do encounter extraterrestrial civilizations, their customs and behaviors would likely be vastly different from our own, given the vastness of the universe and the potential for diverse forms of life.

Until further information becomes available, any details about the customs of aliens on specific planets would be purely fictional or based on imagination, as depicted in science fiction literature, movies, or other creative works.
 
  • Like
Likes PeroK
  • #33
PeterDonis said:
This just shows that SAT tests can be gamed. Which we already knew anyway.
"Gamed" how? I honestly struggle to see what kind of evidence you would have to see to be convinced. Even if it had near 100% accuracy on whatever unambiguous question you threw at it, you would say "well it doesn't REALLY know the answer so it's basically just confusing you into thinking it gives correct answers". Well apparently it confused Terence Tao it gives useful answers in mathematics, pardon me for also being "confused"!

It seems like your argument is completely independent of whether or not it gives correct answers. Because it does! Not all the time, of course, not even frequently enough to be reliable at this point, but it is improving. And you are free to check for yourself that this is true. If you want to argue that regardless of delivering accurate answers it is still somehow "cheating" people, I don't know what you expect it to do beyond generating unambiguously correct answers to prompts. If you think it cannot give unambiguously correct answers to unambiguous questions, and they only seem to be correct because of its confidence, then you're just wrong and I'm imploring you to try it yourself.

We can't be downplaying it like that because it's unfortunately going to become a significant part of the academic world, and people should recognize what is going on.
 
  • Like
Likes Hornbein and PeroK
  • #34
russ_watters said:
Impressive how? Doesn't this just tell you that it doesn't know the difference between fiction and reality, and more to the point, there's no way for you to know if it is providing you fictional or real answers*?

*Hint: always fictional.
It is impressive because it can (sometimes) generate logical answers from text that it has never encountered before. This goes beyond parroting.
 
  • Like
Likes PeroK
  • #35
SAT and other tests are designed to test humans. One way to test a human's knowledge of a subject is to require them to recall information about that subject and write a summary under time pressure. Recalling information and producing output quickly is something that computers are really good at, so it should be less surprising that GPT-4 has done well on exams (note this is GPT-4, which is not the engine behind ChatGPT; ChatGPT's engine is less sophisticated).

If someone who knew nothing about law took a law exam supported by an army of librarians with instant access to petabytes of relevant data and passed, would you say that they had any knowledge or understanding of the law?

AndreasC said:
It seems like your argument is completely independent of whether or not it gives correct answers.
Of course it is: no one is arguing that an LLM is not capable of frequently giving correct answers, or that a very well designed and trained LLM is not capable of giving correct answers within a large domain more frequently than many humans. The argument is that no amount of correct answers is equivalent to knowledge.

AndreasC said:
you're just wrong and I'm imploring you to try it yourself.
It is you who are wrong, and you are making claims for ChatGPT that its maker, OpenAI, does not make itself.

AndreasC said:
We can't be downplaying it like that because it's unfortunately going to become a significant part of the academic world, and people should recognize what is going on.
Nobody is downplaying it, but in order to "recognize what is going on" it is necessary to understand what is actually going on. No one can tell anyone else what to do, but if I were you I would stop repeating the same opinions here and take some time to do that.
 
  • Like
  • Skeptical
Likes PeroK, Motore, phinds and 1 other person
  • #36
PeterDonis said:
Please give a reference: where has this been done?
Well, I just tried it with ChatGPT. Its output was alright. Not great, not terrible. It would be interesting to see what GPT-4 could do with it.

ME: I will tell you a story about aliens in the planet Xuxor and I want you to answer some questions based on it.

CHAT: Sure, I'd be happy to hear your story about aliens on the planet Xuxor. Please go ahead and tell me your story, and feel free to ask any questions you have afterwards.

ME: The planet Xuxor revolves around the star Artaxerxis B. The Xuxians have the same 12 months we have, but their planet revolves around the star twice as fast. The Xuxians were enslaved to the Gazogians for many centuries, but they eventually revolted. They only celebrate two main events, the start of the Xuxor Revolution, and the Achievement of Independence. On national holidays they eat their traditional dish, battery fluid cakes. The start of the Xuxor Revolution is so important to them that January 1st of the year 0 of their calendar corresponds to that date. Knowing that Xuxians gather together on the 25th of Xuxor May to eat battery fluid cake, do you think that corresponds to some important Xuxor event?
CHAT: Based on the information provided, the 25th of Xuxor May holds significance for the Xuxians as they gather together to eat battery fluid cake. While it is not explicitly mentioned in the story, given that the Xuxians celebrate two main events, the start of the Xuxor Revolution and the Achievement of Independence, it is reasonable to assume that the 25th of Xuxor May corresponds to one of these important events.

Considering that the Xuxor Revolution is highly significant to them, it is possible that the 25th of Xuxor May marks the anniversary or commemoration of a key event during the revolution. Perhaps it symbolizes a crucial battle, the signing of a pivotal document, or the day when the Xuxians gained a significant advantage in their struggle for independence.

Without further details, it is difficult to ascertain the exact nature of the event associated with the 25th of Xuxor May. However, given the importance of the Xuxor Revolution to the Xuxians, it is reasonable to assume that this date corresponds to an important event related to their fight for freedom.
COMMENT BY ME: We see it did not get it right away, though it did correctly answer that it is probably one of the two main historical events. Let's give it some more help.
ME: If January 1st corresponds to the Start of the Revolution, and they only celebrate the two aforementioned historical events, what do you think 25th of May corresponds to?
CHAT: Apologies for the confusion in my previous response. If January 1st in the Xuxians' calendar corresponds to the Start of the Xuxor Revolution, and they only celebrate two main historical events, it is likely that the 25th of May corresponds to the Achievement of Independence.

The 25th of May could signify the day when the Xuxians successfully achieved their independence from the Gazogians, marking a significant milestone in their history. This date may commemorate a specific event such as the signing of a declaration of independence, the end of the final battle against the Gazogians, or the establishment of their own governing system.

Given that the Xuxians gather together on this day to eat battery fluid cake, it suggests that the 25th of May is a special occasion and holds great importance in their culture. It serves as a national holiday and a time for the Xuxians to reflect upon and celebrate their freedom and the achievements of their past.

Therefore, it is reasonable to conclude that the 25th of May in the Xuxian calendar corresponds to the Achievement of Independence, one of the two significant historical events celebrated by the Xuxians.
COMMENT BY ME: This time it got it right. Let's give it something more "calculational", dependent on prior context.
ME: If roughly 10 human years passed from the Start of the Revolution to the Achievement of Independence, how many Xuxor years passed?
CHAT: Given that the planet Xuxor revolves around its star twice as fast as Earth, one Xuxor year would be equivalent to half of an Earth year. Therefore, if roughly 10 human years passed from the Start of the Revolution to the Achievement of Independence, we need to calculate how many Xuxor years would have passed.

Since one Xuxor year is half an Earth year, we can divide 10 human years by 0.5 to find the equivalent Xuxor years:

10 human years / 0.5 Xuxor years per human year = 20 Xuxor years.

So, approximately 20 Xuxor years would have passed from the Start of the Revolution to the Achievement of Independence.

COMMENT BY ME: Very neat, unambiguously correct response. Let's now push it for something a bit more complex.
ME: Give me the date, month and year of the Achievement of Independence in the Xuxor calendar
CHAT:
To determine the date, month, and year of the Achievement of Independence in the Xuxor calendar, we need additional information regarding the specific year the Xuxor Revolution began and the duration between the Start of the Revolution and the Achievement of Independence. Without this information, it is not possible to provide an exact date.

However, based on the given information that roughly 10 human years passed between the Start of the Revolution and the Achievement of Independence, we can make an estimation. If the Xuxor calendar is synchronized with the Earth calendar, we can calculate the date of the Achievement of Independence.

Let's assume that the Xuxor Revolution began on January 1st, 0, as stated earlier. If we add 10 human years to that date, we can estimate the year of the Achievement of Independence to be:

0 (Start of the Revolution) + 10 years = 10.

Therefore, the estimated year of the Achievement of Independence in the Xuxor calendar would be 10. However, without more specific information, it is not possible to determine the exact date and month within the Xuxor calendar for this historical event.

COMMENT BY ME: Well, now it is saying nonsense. So we see it is not reliable. However, it did manage to give some surprisingly accurate answers. Now, can it tell whether or not Xuxor is real? Of course it can't tell if it is real, because it is not a person, but can it at least generate a correct answer on whether or not it is real, which, after all, is all anyone expects from it?

ME: Is the planet Xuxor real?

CHAT: No, the planet Xuxor is not real. It was mentioned in the context of a fictional story you created. As of my knowledge cutoff in September 2021, there is no scientific evidence or information available about a planet named Xuxor or any extraterrestrial civilization residing there. The story you provided is a product of imagination and creativity.
 
  • #37
pbuk said:
Of course it is: no one is arguing that an LLM is not capable of frequently giving correct answers, or that a very well designed and trained LLM is not capable of giving correct answers within a large domain more frequently than many humans.
Maybe you are not arguing that. But I don't think other people in this thread agree with you. Some people insist it is only "confusing" us into thinking the answers are correct. My argument is that the question of whether or not it "knows" is philosophical, and unrelated to practical considerations of whether or not it is reliable.
 
  • #38
AndreasC said:
Maybe you are not arguing that. But I don't think other people in this thread agree with you. Some people insist it is only "confusing" us into thinking the answers are correct.
I will let others speak for themselves, but I believe the only person who has used the term "confusing" in this thread is you.

AndreasC said:
My argument is that the question of whether or not it "knows" is philosophical, and unrelated to practical considerations of whether or not it is reliable.
Everyone is agreed on that, as @PeterDonis confirmed way back in #4:
PeterDonis said:
The article is not about an abstract philosophical concept of "knowledge". It is about what ChatGPT is and is not actually doing when it emits text in response to a prompt.

I believe your misunderstanding is the assumption that because ChatGPT's answers are frequently correct, they must therefore be reliable.

Let's try an analogy: I frequently spell words correctly because I have pretty good recall and am a bit obsessive about spelling and grammar. Bob reliably spells words correctly because he looks up anything he is not sure about in a dictionary.
 
  • Like
Likes PeterDonis, Vanadium 50 and russ_watters
  • #39
pbuk said:
I will let others speak for themselves, but I believe the only person who has used the term "confusing" in this thread is you.
The term "confusing" was not specifically used, but after I said it can give accurate answers to many questions, @PeterDonis in post #10 very specifically said it can't do that. They proceeded to say that it only sometimes gets "lucky" (I wouldn't call it luck exactly, it does it again and again for some subjects, you have to get UNlucky to get a wrong answer, then again it messes up more frequently on some other subjects) and gives an "answer". I don't know why "answer" was put in scare quotes but I believe it's probably due to scepticism that it even is an answer, and that it's not just me being confused. In the same post he argued that the only reason it passed tests was because of the "laziness and ignorance of the testers", presumably not because the answers were accurate.

Then, in post #15 they again doubled down that the only reason it passed tests was because graders were "lazy". Furthermore, despite saying that their argument has nothing to do with the philosophical concept of knowledge, they again essentially assert that the reason it is UNreliable is that it doesn't "know" or "understand". I believe the two are separate subjects.

In #27, they say that it only passed SAT tests because they can be "gamed". At several points, it is compared to a human con artist, and it is implied that the reason people think it gives accurate answers is its confidence, when the answers are actually inaccurate. So you can see there are doubts that it can give accurate answers at all.
pbuk said:
I believe your misunderstanding is that because ChatGPT's answers are frequently correct that means that they are reliable.
I have very explicitly said I do NOT believe it is reliable multiple times. Specifically, in posts #3 (my very first on the thread), #5, and #13, plus in multiple other posts I have said again and again it often generates nonsense.
 
Last edited:
  • #40
AndreasC said:
I have very explicitly said I do NOT believe it is reliable multiple times.
Ah yes, I missed that. It seems we are in violent agreement.
 
  • Informative
  • Like
  • Haha
Likes DaveC426913, Demystifier and AndreasC
  • #41
pbuk said:
Ah yes, I missed that. It seems we are in violent agreement.
Hahaha that is a very useful term online!
 
  • Like
Likes pbuk
  • #42
AndreasC said:
People often post more when it gets something wrong. For instance, people have given it SAT tests:

https://study.com/test-prep/sat-exam/chatgpt-sat-score-promps-discussion-on-responsible-ai-use.html
Your take is weird to me, but it seems common, especially in the media. Consider this potential headline from 1979:

"New 'Spreadsheet' Program 'VisiCalc' Boasts 96% Accuracy - Might it be the New Killer App?"
[ChatGPT was 96th percentile on the SAT, not accuracy, but close enough.]

That's not impressive, it's a disaster. It's orders of magnitude worse than acceptable accuracy from a computer. It seems that because ChatGPT sounds confidently human, people have lowered the bar from "computer" to "human" in judging its intelligence, and don't even realize they've done it. That's a dangerous mistake.
 
  • Like
Likes PeterDonis, Vanadium 50 and weirdoguy
  • #43
russ_watters said:
That's not impressive, it's a disaster. It's orders of magnitude worse than acceptable accuracy from a computer.
Sure, but the thing is that it is able to do tasks that previous computer programs couldn't do. You couldn't copy and paste an SAT question into a program and get an answer before. It would require significant pre-processing, and in some cases you just wouldn't be able to get any help, because previous computer programs weren't good at, say, parsing natural language and taking into account context, subjective meaning, etc. That is why it is impressive: because it accurately and quickly performs tasks that computers couldn't previously do and that were solely the domain of humans.
 
  • #44
AndreasC said:
Sure, but the thing is that it is able to do tasks that previous computer programs couldn't do.
You could write that on the box of any new piece of software. Otherwise there's no reason to use it. But you're seeing the point now:
AndreasC said:
...previous computer programs weren't good at, say, parsing natural language and taking into account context, subjective meaning, etc. That is why it is impressive: because it accurately and quickly performs tasks that computers couldn't previously do and that were solely the domain of humans.
Right. What's impressive about it is that it can converse with a human and sound pretty human. But now please reread the title of the thread. "Sounds human" is a totally different accomplishment from "reliable".
 
  • Like
Likes Vanadium 50 and weirdoguy
  • #45
Demystifier said:
I've just tried it
What you show here is nothing like what AndreasC described.
 
  • #46
PeterDonis said:
What you show here is nothing like what AndreasC described.
Exactly!
 
  • #47
AndreasC said:
I have very explicitly said I do NOT believe it is reliable multiple times.
But in post #13 you also said it can "repeatably" give accurate answers to questions. That seems to contradict "unreliable". I asked you about this apparent contradiction in post #15 and you haven't responded.
 
  • #48
russ_watters said:
"New 'Spreadsheet' Program 'VisiCalc' Boasts 96% Accuracy - Might it be the New Killer App?"
"ChatGPT Airlines - now 96% of our takeoffs have landings at airports!"

Let's go back to "knowledge". Yes, it's philosophical, but some of the elements can be addressed scientifically. An old-fashioned definition of knowledge was "justified true belief". Let's dispense with "belief" as too fuzzy. Is what ChatGPT says true? Sometimes. As stated, 96% of the time is not very impressive. Is it justified? Absolutely not: it "knows" only what words others used, and in what order. That's it.
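
As a rough back-of-the-envelope illustration (my arithmetic, assuming each answer is independently correct with probability p): the chance that a run of n answers contains no error at all is

$$P(\text{all } n \text{ correct}) = p^n, \qquad 0.96^{20} \approx 0.44$$

so at 96% per-answer accuracy there is already a better-than-even chance of at least one wrong answer after only 20 questions, while a conventional program is expected to get millions of operations in a row right.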

In no sense is there "knowledge" there.

It's not just unreliable - we have no reason to believe it should be reliable, or that this approach will ever be reliable.
 
  • Like
Likes russ_watters and PeterDonis
  • #49
AndreasC said:
previous computer programs weren't good at, say, parsing natural language and taking into account context, subjective meaning, etc. That is why it is impressive
ChatGPT is not parsing natural language. It might well give the appearance of doing so, but that's only an appearance. The text it outputs is just a continuation of the text you input, based on relative word frequencies in its training data. It does not break up the input into sentence structures or anything like that, which is what "parsing natural language" would mean. All it does is output continuations of text based on word frequencies.
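
To make "continuation based on word frequencies" concrete, here is a minimal toy sketch in Python. This is a bigram model, my own illustration of the principle and vastly simpler than the transformer behind ChatGPT; it is not OpenAI's actual implementation:

```python
import random
from collections import defaultdict

# Tiny "training corpus": the only knowledge the model will ever have.
corpus = "the cat sat on the mat and the cat ate the fish".split()

# Record, for every word, the words observed to follow it.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def continue_text(prompt_word: str, n_words: int = 6) -> str:
    """Extend the prompt by repeatedly sampling a next word in proportion
    to how often it followed the current word in the training corpus."""
    out = [prompt_word]
    for _ in range(n_words):
        candidates = follows.get(out[-1])
        if not candidates:  # no observed continuation: the model is stuck
            break
        # random.choice over the raw list samples proportionally to
        # frequency, because frequent followers appear more often in it.
        out.append(random.choice(candidates))
    return " ".join(out)

print(continue_text("the"))  # e.g. "the cat sat on the mat and"
```

Nothing in this loop models meaning or facts; it only reproduces statistical patterns of the training text, which is the point being made here about reliability.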
 
  • Like
Likes physicsworks and russ_watters
  • #50
AndreasC said:
the only reason it passed tests was because of the "laziness and ignorance of the testers", presumably not because the answers were accurate
Or because the testers didn't bother writing a good test that actually can distinguish between ChatGPT, an algorithm that generates text based on nothing but relative word frequencies in its training data, and an actual human with actual human understanding of the subject matter. The test is supposed to be testing for the latter, so if the former can pass the test, the test is no good.

AndreasC said:
the only reason it passed tests was because graders were "lazy"
See above.

AndreasC said:
it only passed SAT tests because they can be "gamed"
Which, as I said, is already well known: that humans can pass SAT tests without having any actual knowledge of the topic areas. For example, they can pass the SAT math test without being able to actually use math to solve real world problems--meaning, by gathering information about the problem, using that information to set up relevant mathematical equations, then solving them. So in this case, ChatGPT is not going beyond human performance in any respect. :wink:
 
  • Like
Likes Math100
  • #51
If there were any knowledge base behind ChatGPT you would be able to
  1. Train it in English
  2. Train it in French
  3. Train it in domain knowledge (like physics)
  4. Have it answer questions about this domain in French.
It can't do this. There is no there there.
 
  • #52
Vanadium 50 said:
"ChatGPT Airlines - now 96% of our takeoffs have landings at airports!"
"New from OceanGate: now 99% Reliable - Twice as Reliable as our Previous Subs!"
(too soon?)

Vanadium 50 said:
It's not just unreliable - we have no reason to believe it should be reliable, or that this approach will ever be reliable.
I go back again to wondering what the creators are thinking about this...
pbuk said:
Definitely not [AI], but they believe they are headed in the right direction:
OpenAI's website is really weird. It is exceptionally thin on content and heavy on flash, with most of the front page just being pointless slogans and photos of people doing office things (was it created by ChatGPT?). It even features a video on top that apparently has no sound? All this to sell a predominantly text-based application (ironic)? The first section of the front page, though, contains one actual piece of information, in slogan form:

"Creating safe AGI that benefits all of humanity"​

That's quite an ambitious goal/claim. It's not surprising that everyday people believe it's more than it really is, when that's what the company is saying.

The trajectory of the app and the way they've talked about flaws such as hallucinations imply they think their approach is viable and that refinements that improve its reliability should result in it becoming "reliable enough". Ironically this may increase the risk/danger of misuse, as people apply it to more and more situations where reliability should matter. I can't see how this approach would ever be acceptable for industrial automation. Maybe for a toy drone it won't matter if it unexpectedly/unpredictably crashes for no apparent reason "only" 0.1% of the time, but that won't ever be acceptable for a self-driving car or airplane.
 
  • Haha
  • Like
Likes DaveC426913 and PeterDonis
  • #53
PeterDonis said:
Or because the testers didn't bother writing a good test that actually can distinguish between ChatGPT, an algorithm that generates text based on nothing but relative word frequencies in its training data, and an actual human with actual human understanding of the subject matter
If that's what a "good" test is, then it is tautologically true that GPT would be no good at it. The issue with tautologies is, of course, that they don't tell us anything new. What is new is that GPT can do many things that only humans with understanding could previously do. Of course it doesn't do them perfectly, but often it does them more accurately than most humans, and much faster. If what you want is the answer to an exercise, and it can give you the correct answer, say, 99% of the time, then that's good enough for many people and in many contexts, regardless of philosophical questions about understanding. And again, we are talking about things that computers previously just couldn't do. This is why it is significant, and this is why I'm saying it should not be downplayed: we will encounter this way too much in the coming years.
 
  • Skeptical
Likes weirdoguy
  • #54
PeterDonis said:
What you show here is nothing like what AndreasC described.
Well, @Demystifier didn't do what I described. See my post where I tried it.
 
  • #55
russ_watters said:
go back again to wondering what the creators are thinking about this...
I think they are planning to monetize this by first making a name for themselves and then selling a product where "close enough is good enough". For example, customer service chatbots.
 
  • Like
Likes russ_watters
  • #56
AndreasC said:
If what you want is the answer to an exercise, and it can give you the correct answer, say, 99% of the time, then that's good enough for many people and in many contexts
Is it?

Perhaps if my only purpose is to get a passing grade on the exercise, by hook or by crook, this would be good enough.

But for lots of other purposes, it seems wrong. It's not even a matter of percentage accuracy; it's a matter of what the thing is doing and not doing, as compared with what my purpose is. If my purpose is to actually understand the subject matter, I need to learn from a source that actually understands the subject matter. If my purpose is to learn a particular fact, I need to learn from a source that will respond based on that particular fact. For example, if I ask for the distance from New York to Chicago, I don't want an answer from a source that will generate text based on word frequencies in its input data; I want an answer from a source that will look up that distance in a database of verified distances and output what it finds. (Wolfram Alpha, for example, does this in response to queries of that sort.)
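
A minimal sketch of the contrast (the function name, table, and distance figure are illustrative assumptions of mine, not Wolfram Alpha's real API): a lookup-based answerer responds only from verified data and can refuse when a fact is absent, while a pure text generator emits a fluent continuation either way:

```python
# Illustrative sketch only: hypothetical names and an approximate figure.

VERIFIED_DISTANCES_MILES = {
    ("chicago", "new york"): 790,  # approximate driving distance; illustrative
}

def lookup_distance(city_a: str, city_b: str) -> int:
    """Answer from a table of verified facts; refuse when the fact is absent."""
    key = tuple(sorted((city_a.lower(), city_b.lower())))
    if key not in VERIFIED_DISTANCES_MILES:
        # A fact-based system can say "I don't know"; a pure text
        # generator has no such notion and emits something regardless.
        raise KeyError(f"no verified distance for {city_a} <-> {city_b}")
    return VERIFIED_DISTANCES_MILES[key]

print(lookup_distance("New York", "Chicago"))  # 790
```

The ability to refuse is exactly what a system built on verified facts has and a frequency-based text generator lacks.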
 
  • #57
PeterDonis said:
Is it?

Perhaps if my only purpose is to get a passing grade on the exercise, by hook or by crook, this would be good enough.

But for lots of other purposes, it seems wrong. It's not even a matter of percentage accuracy; it's a matter of what the thing is doing and not doing, as compared with what my purpose is. If my purpose is to actually understand the subject matter, I need to learn from a source that actually understands the subject matter. If my purpose is to learn a particular fact, I need to learn from a source that will respond based on that particular fact. For example, if I ask for the distance from New York to Chicago, I don't want an answer from a source that will generate text based on word frequencies in its input data; I want an answer from a source that will look up that distance in a database of verified distances and output what it finds. (Wolfram Alpha, for example, does this in response to queries of that sort.)
But what if you want the answer as if given by Homer Simpson, or a Shakespearean sonnet? Alpha can't do that ;)

I think many are missing the point: applications with near-perfect accuracy are not the objective. LLMs can write marketing pitches, legal boilerplate, informational articles, etc. just as well as a junior employee, whose work would also need to be checked for accuracy.

It is informative that the largest quant hedge funds took up these tools not for trading, but to automate the tasks of junior analysts:

https://fortune.com/2023/06/01/hedge-fund-chatgpt-grunt-work-mundane/
 
  • Like
Likes russ_watters
  • #58
PeterDonis said:
Perhaps if my only purpose is to get a passing grade on the exercise, by hook or by crook, this would be good enough.
Exactly, here is the problem!

Or what happens when some business or government does the math and figures it would rather risk being wrong than pay experts?

On the other hand, it could work very productively if it is used to provide guidelines for solving something, or even to give the answer with a human then curating the output. Terence Tao has talked about this, if you want to read about his experiences with that.

The flip side of this is that researchers could use it to churn out absurd quantities of research papers that are mostly junk or mostly uninteresting to inflate their publications.

There are lots of ramifications this new technology could have.
 
  • #59
BWV said:
But what if you want the answer as if given by Homer Simpson, or a Shakespearean sonnet? Alpha can't do that ;)
I think they already do it the Max Power way:
 
  • Haha
Likes BWV
  • #60
AndreasC said:
what happens when some business or government does the math and figures it would rather risk being wrong than pay experts?
If I know that's what your business is doing, you won't get my business.

I suspect that a lot of people feel this way; they just don't know that that's what the business is doing. Certainly OpenAI has not done anything to inform the public of what ChatGPT is actually doing, and not doing. I suspect that is because if they did do so, interest in what OpenAI is doing would evaporate.
 
