Insights Why ChatGPT AI Is Not Reliable

PeterDonis · Jul 5, 2023

I’ll start with the simple fact: ChatGPT is not a reliable answerer to questions.
To try to explain why from scratch would be a heavy lift, but fortunately, Stephen Wolfram has already done the heavy lifting for us in his article, “What is ChatGPT Doing… and Why Does It Work?” [1] In a PF thread discussing this article, I tried to summarize the key message of Wolfram’s article as briefly as I could. Here is what I said in my post there [2]:
ChatGPT does not make use of the meanings of words at all. All it is doing is generating text word by word based on relative word frequencies in its training data. It is using correlations between words, but that is not the same as correlations in the underlying information that the words represent (much less causation). ChatGPT literally has no idea that the words it strings together represent anything.
In other words, ChatGPT is not designed to answer questions or provide information. It is explicitly designed not to do those things, because...

Bystander · Jul 5, 2023

Call it what it is, "Artificial 'William's Syndrome.'" https://www.google.com/search?q=wil...99i465i512.19081j1j7&sourceid=chrome&ie=UTF-8

..., pre-politically correct characteristics included "often precocious vocabulary with no apparent 'real understanding/ability' for use/application/reasoning." That is my recollection from Googling ten-fifteen years ago; ymmv.

This is
https://www.physicsforums.com/threa...an-appropriate-source-for-discussion.1053525/
another/one more case; some wiki/google sources lack "shelf life."

AndreasC · Jul 5, 2023

How do we know at what point it "knows" something? There are non-trivial philosophical questions here... These networks are getting so vast and their training so advanced that I can see someone eventually arguing they have somehow formed a decent representation of what things "are" inside them.

Of course chatGPT is not reliable but honestly I was surprised at some of the things that it can do. I was really surprised when I fed it some relatively long and complicated code and asked what it did. It was able to parse it rather accurately, suggest what problem it was supposed to solve, and then suggest specific optimizations. And now it is said GPT-4 significantly improves over it. It's pretty impressive, and somewhat disconcerting given that people always look for the worst possible way to use something first.

PeterDonis · Jul 5, 2023

AndreasC said:

How do we know at what point it "knows" something? There are non-trivial philosophical questions here

Perhaps, but they are irrelevant to this article. The article is not about an abstract philosophical concept of "knowledge". It is about what ChatGPT is and is not actually doing when it emits text in response to a prompt.

AndreasC said:

I can see someone eventually arguing they have somehow formed a decent representation of what things "are" inside them

Not as long as there are no semantic connections between the network and the world. No entity forms "representations" of actual things just by looking at relative word frequencies in texts. There has to be two-way interaction with the actual world. That's how, for example, we humans form our mental representations of things. We interact with them and learn how they work.

AndreasC · Jul 6, 2023

PeterDonis said:

Perhaps, but they are irrelevant to this article. The article is not about an abstract philosophical concept of "knowledge". It is about what ChatGPT is and is not actually doing when it emits text in response to a prompt.Not as long as there are no semantic connections between the network and the world. No entity forms "representations" of actual things just by looking at relative word frequencies in texts. There has to be two-way interaction with the actual world. That's how, for example, we humans form our mental representations of things. We interact with them and learn how they work.

We definitely learn about lots of things by just reading about them...

I think lots of people don't give enough credit to what it does. It can already give accurate answers about a wide range of questions, pass tests etc and, importantly, answer new problems it has not been specifically trained on. I always thought somebody knows something if they can not only recall the facts, but also apply them in new contexts.

Of course you can argue that it doesn't really know things because
, well, it doesn't have a consciousness, and it doesn't reason or learn in the exact same sense that people do. But imo this is not related much to whether or not it is reliable. It is NOT reiable, but it may well become significantly more reliable. Allegedly, GPT-4 already is much more reliable. In a few years, I expect that it would be no more unreliable than asking a human expert (who are, of course, not completely reliable). At that point, would you still say it is unreliable because it doesn't really know, or that it now knows?

We should pay more attention and be a little more concerned, because honestly I didn't believe it would reach this point yet. Not because of any "AI singularity" nonsense but because it may very well affect the way society views and uses knowledge in radical ways. Plus because it has a sizeable environmental footprint.

AndreasC · Jul 6, 2023

Ok, I think I should probably qualify the "as reliable as a human expert in a few years" a bit, because I think stated like that it is a bit too much. I meant to say as reliable when it comes to factual recollection that involves only a little bit (but still a non-trivial amount) of actual reasoning.

Demystifier · Jul 6, 2023

In my view, the right question is not why ChatGPT is not reliable. Given the general principles how it works, the right question is: Why is it more reliable than one would expect? I think even the creators of it were surprised how good it was.

pintudaso · Jul 6, 2023

ChatGPT, like any AI language model, has certain limitations that can affect its reliability in certain situations. Here are some reasons why ChatGPT may not always be considered reliable:

Lack of real-time information
Dependence on training data
Inability to verify sources
Limited context understanding
Biased and offensive content

It's important to approach AI language models like ChatGPT with a critical mindset and to independently verify information obtained from such models when accuracy is crucial. While ChatGPT can be a valuable tool for generating ideas, providing general information, or engaging in casual conversation, it's always advisable to cross-reference and fact-check important or sensitive information from reliable sources.

Rive · Jul 6, 2023

Demystifier said:

Given the general principles how it works, the right question is: Why is it more reliable than one would expect?

I would push that a bit further: if that thing (working as-is) looks so reliable almost in a humane way, then how many people might live off on the same principles? Confidence tricking through most communication?
What about our performance?

PeterDonis · Jul 6, 2023

AndreasC said:

We definitely learn about lots of things by just reading about them...

That's because our minds have semantic connections between words and things in the world. When we read words, we make use of those connections--in other words, we know that the words have meanings, and what those meanings are. If we get the meanings of words wrong, we "learn" things that are wrong.

ChatGPT has none of this. It has no connections between words and anything else. It doesn't even have the concept of there being connections between words and anything else. The only information it uses is relative word frequencies in its training data.

AndreasC said:

It can already give accurate answers about a wide range of questions

No, it can't. It can get lucky sometimes and happen to give an "answer" that happens to be accurate, but, as you will quickly find out if you start looking, it also happily gives inaccurate answers with the same level of confidence. That's because it's not designed to give accurate answers to questions; that's not what it's for.

AndreasC said:

pass tests

Only because the "tests" are graded so poorly that even the inaccurate but confident-sounding responses that ChatGPT gives "pass" the tests. That is a reflection of the laziness and ignorance of the test graders, not of the knowledge of ChatGPT.

AndreasC said:

answer new problems it has not been specifically trained on

Sure, because it can generate text in response to any prompt whatever. But the responses it gives will have no reliable relationship to reality. Sometimes they might happen to be right, other times they will be wrong, often egregiously wrong. But all of the responses seem just as confident.

AndreasC said:

I always thought somebody knows something if they can not only recall the facts, but also apply them in new contexts.

ChatGPT does not and cannot do these things. What it does do is, as a side effect of its design, produce text that seems, to a naive observer, to be produced by something that does these things. But the illusion is quickly shattered when you start actually checking up on its responses.

PeterDonis · Jul 6, 2023

Demystifier said:

Why is it more reliable than one would expect?

Is it? How would one even determine that?

PeterDonis · Jul 6, 2023

Rive said:

how many people might live off on the same principles? Confidence tricking through most communication?

Yes, I think one way of describing ChatGPT is that it is crudely simulating a human con artist: it produces statements that seem to come from an entity that is knowledgeable, but actually don't.

AndreasC · Jul 6, 2023

PeterDonis said:

That's because our minds have semantic connections between words and things in the world. When we read words, we make use of those connections--in other words, we know that the words have meanings, and what those meanings are. If we get the meanings of words wrong, we "learn" things that are wrong.

ChatGPT has none of this. It has no connections between words and anything else. It doesn't even have the concept of there being connections between words and anything else. The only information it uses is relative word frequencies in its training data.No, it can't. It can get lucky sometimes and happen to give an "answer" that happens to be accurate, but, as you will quickly find out if you start looking, it also happily gives inaccurate answers with the same level of confidence. That's because it's not designed to give accurate answers to questions; that's not what it's for.Only because the "tests" are graded so poorly that even the inaccurate but confident-sounding responses that ChatGPT gives "pass" the tests. That is a reflection of the laziness and ignorance of the test graders, not of the knowledge of ChatGPT.Sure, because it can generate text in response to any prompt whatever. But the responses it gives will have no reliable relationship to reality. Sometimes they might happen to be right, other times they will be wrong, often egregiously wrong. But all of the responses seem just as confident.ChatGPT does not and cannot do these things. What it does do is, as a side effect of its design, produce text that seems, to a naive observer, to be produced by something that does these things. But the illusion is quickly shattered when you start actually checking up on its responses.

The semantic connections you are talking about are connections between sensory inputs and pre-existing structure inside our brains. You're just reducing what it's doing to the bare basics of its mechanics, but its impressive behavior comes about because of how massively complex the structure is.

I don't know if you've tried it out, but it doesn't just "get lucky". Imagine a student passing one test after another, would you take someone telling you they only "got lucky" seriously, and if yes, how many tests would it take? Plus, it can successfully apply itself to problems it never directly encountered before. Yes, not reliably, but enough that it's beyond "getting lucky".

You talk about it like you haven't actually tried it out. It's not at all the same as previous chatbots, it has really impressive capabilities. It can give you correct answers to unambiguous questions that are non-trivial and that it has not specifically encountered before in its training. And it can do that a lot, repeatably. Nothing to do with how confident it sounds, I am talking about unambiguously correct answers.

Again, I'm not saying it is reliable, but you are seriously downplaying its capabilities if you think that's all it does and I encourage you to try it out for yourself. Especially when it comes to programming, it is incredible. You can put in it complicated code that is undocumented, and it can explain to you what the code does exactly, what problem it probably was intended for, and how to improve it, and it works a lot of the time, much more frequently than "luck".

If all you want to say is that it isn't right all the time, then yeah, that's true. It's very, very frequently wrong. But that has little to do with what you are describing. It could (and will) improve significantly on accuracy, using the same mechanism. And practically, what you are saying doesn't matter. A database doesn't "know" what something is either in your sense of the word, neither does a web crawler, or anything like that. That doesn't make them unreliable. Neither is a human reliable because they "know" something (again going by your definition).

ChatGPT is unreliable because we observe it to be unreliable. That requires no explanation. What does require explanation is why, as @Demystifier said, it is so much more reliable (especially at non trivial, "reasoning" type problems) than you would naively expect.

AndreasC · Jul 6, 2023

PeterDonis said:

Is it? How would one even determine that?

Try it. Feed it questions which have unambiguous answers. You'll see that even though sometimes it generates nonsense, very, VERY frequently it gives right answers. Amusingly, one thing it does struggle with a bit is arithmetic. But it is getting better. Seriously though, try it.

PeterDonis · Jul 6, 2023

AndreasC said:

The semantic connections you are talking about are connections between sensory inputs and pre-existing structure inside our brains.

Not necessarily pre-existing. We build structures in our brains to represent things in the world as a result of our interactions with them. ChatGPT does not. (Nor does ChatGPT have any "pre-existing" structures that are relevant for this.)

AndreasC said:

Imagine a student passing one test after another, would you take someone telling you they only "got lucky" seriously

If the reason they passed was that their graders were lazy and didn't actually check the accuracy of the answers, yes. And that is exactly what has happened in cases where ChatGPT supposedly "passed" tests. If you think graders would never be so lazy, you have led a very sheltered life. It's just a more extreme version of students getting a passing grade on a book report without ever having read the book, and I can vouch for that happening from my own personal experience.

AndreasC said:

It can give you correct answers to unambiguous questions that are non-trivial and that it has not specifically encountered before in its training. And it can do that a lot, repeatably.

Please produce your evidence for this claim. It is contrary to both the analysis of how ChatGPT actually works, which I discuss in the Insights article, and the statements of many, many people who have used it. Including many posts here at PF where people have given ChatGPT output that is confident-sounding but wrong.

AndreasC said:

ChatGPT is unreliable because we observe it to be unreliable.

Doesn't this contradict your claim quoted above?

AndreasC said:

That requires no explanation.

The fact that it is observed to be unreliable is just a fact, yes. But in previous discussions of ChatGPT here at PF, it became clear to me that many people do not understand how ChatGPT works and so do not understand both that it is unreliable and why it is unreliable. That is why I wrote this article.

AndreasC said:

What does require explanation is why, as @Demystifier said, it is so much more reliable (especially at non trivial, "reasoning" type problems) than you would naively expect.

And I have already responded to @Demystifier that such a claim is meaningless unless you can actually quantify what "you would naively expect" and then compare ChatGPT's actual accuracy to that. Just saying that subjectively it seems more accurate than you would expect is meaningless.

PeterDonis · Jul 6, 2023

AndreasC said:

Try it. Feed it questions which have unambiguous answers. You'll see that even though sometimes it generates nonsense, very, VERY frequently it gives right answers.

This does not seem consistent with many posts here at PF by people who have tried ChatGPT and posted the output. The general sense I get from those posts is that ChatGPT was less reliable than they expected--because they did not realize what it is actually doing and not doing. For example, apparently many people expected that when you asked it a factual question about something in its training data, it would go look in its training data to find the answer. But it doesn't, even if the right answer is in its training data. Wolfram's article, referenced in my Insights article, makes all this clear.

AndreasC · Jul 6, 2023

PeterDonis said:

This does not seem consistent with many posts here at PF by people who have tried ChatGPT and posted the output. The general sense I get from those posts is that ChatGPT was less reliable than they expected--because they did not realize what it is actually doing and not doing. For example, apparently many people expected that when you asked it a factual question about something in its training data, it would go look in its training data to find the answer. But it doesn't, even if the right answer is in its training data. Wolfram's article, referenced in my Insights article, makes all this clear.

Have YOU tried it? People often post more when it gets something wrong. For instance, people have given it SAT tests:

https://study.com/test-prep/sat-exam/chatgpt-sat-score-promps-discussion-on-responsible-ai-use.html

Try giving it a SAT test yourself if you don't trust that.

phinds · Jul 6, 2023

pintudaso said:

Limited context understanding

That is incorrect. "Limited understanding" implies that there is at least SOME understanding but chatGPT has zero understanding of anything.

russ_watters · Jul 6, 2023

I suspect ChatGPT has infiltrated this thread...

Edit: Btw, While I'm not certain of this, here's how I can often tell: it's the lack of focus in the responses. When the content is dumped into the middle of an ongoing conversation, it doesn't acknowledge or respond to the ongoing conversation, it just provides generic information that is often not useful for/connected to the discussion.

russ_watters · Jul 6, 2023

Demystifier said:

In my view, the right question is not why ChatGPT is not reliable. Given the general principles how it works, the right question is: Why is it more reliable than one would expect?

PeterDonis said:

Is it? How would one even determine that?

I think it's just a qualitative feeling, but I feel the same way. When first learning about it, it never occurred to me that it didn't access stored information (either its own or 3rd party) to form its replies*. Now that I know it doesn't, it surprises me that it gets so much right. If it's just doing word association and statistical analysis, I'm surprised that asking about Independence Day doesn't return "On July 4, 1776 Will Smith fought a group of alien invaders before signing the Declaration of Independence in Philadelphia..." It seems that through statistical analysis it is able to build a model that approximates or simulates real information. To me, surprisingly well.

*I don't know the intent of the designers, but I can't imagine this is an oversight. Maybe the intent was always to profit from 3rd parties using it as an interface for their data sources (some of which they are doing it appears)?

But whatever the real goals of the company, I think it is wrong and risky that it's been hyped (whether by the media or the company) to make people think that it is a general purpose AI with real knowledge. As a result, people have their guard down and are likely to mis/over-use it.

I wonder if the developers really believe it qualifies for the title "AI" or that complexity = intelligence?

BWV · Jul 6, 2023

Good article, perhaps worth mentioning that this is the same way that language translation engines like Google Translate work - obviously Google translate does not understand English or Mandarin, it just has sufficient training data to statistically match phrases. The immediate applications seem to be as a 'word calculator' to generate prose where accuracy is less important or can be easily checked - this is no different of where ML gets used today (and this is just another ML tool). Recommending items to Amazon.com customers or targeting adds on Facebook has a wide margin for error, unlike, say, driving a car.

russ_watters said:

If it's just doing word association and statistical analysis, I'm surprised that asking about Independence Day doesn't return "On July 4, 1776 Will Smith fought a group of alien invaders before signing the Declaration of Independence in Philadelphia..." It seems that through statistical analysis it is able to build a model that approximates or simulates real information. To me, surprisingly well.

Well if only IMDB was used for the training set ;) - My real guess is the volume of data in the training set matters - the right answer for July 4 just shows up with a higher frequency.

AndreasC · Jul 6, 2023

russ_watters said:

I think it's just a qualitative feeling, but I feel the same way. When first learning about it, it never occurred to me that it didn't access stored information (either its own or 3rd party) to form its replies*. Now that I know it doesn't, it surprises me that it gets so much right. If it's just doing word association and statistical analysis, I'm surprised that asking about Independence Day doesn't return "On July 4, 1776 Will Smith fought a group of alien invaders before signing the Declaration of Independence in Philadelphia..." It seems that through statistical analysis it is able to build a model that approximates or simulates real information. To me, surprisingly well.

*I don't know the intent of the designers, but I can't imagine this is an oversight. Maybe the intent was always to profit from 3rd parties using it as an interface for their data sources (some of which they are doing it appears)?

But whatever the real goals of the company, I think it is wrong and risky that it's been hyped (whether by the media or the company) to make people think that it is a general purpose AI with real knowledge. As a result, people have their guard down and are likely to mis/over-use it.

I wonder if the developers really believe it qualifies for the title "AI" or that complexity = intelligence?

This isn't even what surprises me that much. You could say that it has learned that the correct date follows these prompts. But the thing is, you can make up an alien planet, tell gpt about it and their customs, and it will answer understanding questions on your text, plus it may even manage to infer when their alien independence day is, given enough clues. It's really impressive.

kith · Jul 6, 2023

FWIW, here are the results of ChatGPT taking two university-level exams in labor economics and quantum computing (graded and commented on by the lecturers who created the exams):
https://betonit.substack.com/p/chatgpt-takes-my-midterm-and-gets (GPT-3, D)
https://betonit.substack.com/p/gpt-retakes-my-midterm-and-gets-an (GPT-4, A)
https://scottaaronson.blog/?p=7209 (GPT-4, B)

Filip Larsen · Jul 6, 2023

I haven't read all the posts in this thread so perhaps someone already mentioned it, but since I have started explaining LLM, like ChatGPT, as akin to a stochastic parrot to family and non-tech friends who cared to ask me I sense my points about the quality of its output gets across much easier. Probably because most already have an idea what (some) parrots are capable of language-wise so I only have to explain a little about statistics and randomness. Of course, the analogy does not work to explain anything about how LLM work.

pbuk · Jul 6, 2023

russ_watters said:

Maybe the intent was always to profit from 3rd parties using it as an interface

Ya think?

russ_watters said:

But whatever the real goals of the company, I think it is wrong and risky that it's been hyped (whether by the media or the company) to make people think that it is a general purpose AI with real knowledge.

Unfortunately "people" tend to believe what they want to believe, like @AndreasC here, despite evidence and information to the contrary.

russ_watters said:

I wonder if the developers really believe it qualifies for the title "AI"

Definitely not, but they believe they are headed in the right direction:

https://openai.com/research/overview said:

We believe our research will eventually lead to artificial general intelligence, a system that can solve human-level problems.

russ_watters · Jul 6, 2023

AndreasC said:

This isn't even what surprises me that much. You could say that it has learned that the correct date follows these prompts. But the thing is, you can make up an alien planet, tell gpt about it and their customs, and it will answer understanding questions on your text, plus it may even manage to infer when their alien independence day is, given enough clues. It's really impressive.

Impressive how? Doesn't this just tell you that it doesn't know the difference between fiction and reality, and more to the point, there's no way for you to know if it is providing you fictional or real answers*?

*Hint: always fictional.

PeterDonis · Jul 6, 2023

AndreasC said:

people have given it SAT tests

This just shows that SAT tests can be gamed. Which we already knew anyway.

PeterDonis · Jul 6, 2023

russ_watters said:

It seems that through statistical analysis it is able to build a model that approximates or simulates real information.

Yes, because while the information that is contained in the relative word frequencies in the training data is extremely sparse compared to the information that a human reader could extract from the same data, it is still not zero information. There is information contained in those word frequencies. For example, "Thomas Jefferson" is going to appear correlated with "july 4, 1776" in the training data to a much greater degree than "Will Smith" does.

russ_watters said:

I can't imagine this is an oversight

It's not; it was an intentional feature of the design that only the relative word frequencies in the training data would be used. The designers, from what I can tell, actually believe that piling up enough training data with such word frequencies can lead to actual "knowledge" of subject matter.

PeterDonis · Jul 6, 2023

AndreasC said:

you can make up an alien planet, tell gpt about it and their customs, and it will answer understanding questions on your text, plus it may even manage to infer when their alien independence day is, given enough clues.

Please give a reference: where has this been done?

Rive · Jul 7, 2023

AndreasC said:

But the thing is, you can make up an alien planet, tell gpt about it and their customs, and it will answer understanding questions on your text, plus it may even manage to infer when their alien independence day is, given enough clues. It's really impressive.

It's indeed impressive that a limited set of text (training data) can hold so much hidden information through the encoding of the language that even not thoroughly trash responses/reflections can be extracted for extremely weird questions.

But still, ChatGPT is fundamentally a static machine so it cannot have any 'understanding' about your input.Somewhere I wrote that I expect some accidents/cases to happen in the following decades which retrospectively might be characterized as preliminary conscience or something like that: and actually I think these language models might be some preliminary parts of those preliminary cases, but - still, just parts. Nothing more but pieces.

Ps.: the closest thing to 'understanding' in this case would be some apparently wired-in linguistics rules like composition of sentences and such. But that's also the static kind of 'understanding'.
I wonder whether it can be tweaked to make linguistics mistakes. How deep is that 'wiring' o0)

Insights Why ChatGPT AI Is Not Reliable

Similar threads

Is AI Overhyped?

On Progress Toward AGI

How to disable AI responses in Google Searches?

If you think having a backup is too expensive, try not having one

Is this a good deal (laptop)?

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers