Insights: Why ChatGPT AI Is Not Reliable

  • Thread starter: PeterDonis
  • Tags: chatgpt

Summary:
ChatGPT is deemed unreliable because it generates text based solely on word frequencies from its training data, lacking true understanding or semantic connections. Critics argue that it does not accurately answer questions or provide reliable information, often producing confident but incorrect responses. While some users report that it can parse complex code and suggest optimizations, this does not equate to genuine knowledge or reasoning. The discussion highlights concerns about its potential impact on how society perceives knowledge and the importance of critical evaluation of AI-generated content. Ultimately, while ChatGPT may appear impressive, its limitations necessitate cautious use and independent verification of information.
  • #121
If one has access to, say, OpenAI's GPT via the full API, would there be a way to drill down and get some sense of how a particular input prompt gave rise to a corresponding response? I do realise that huge amounts of data would be involved, which could overwhelm a human. But could one build some kind of graphical visualization tool that would help explore the connection from a specific input to a specific output, to the extent possible?
 
  • #122
Swamp Thing said:
Can one break down how a blind statistical process can do that?
Well, the word 'whale' already has a statistically higher chance of having words like 'bigger', 'biggest', 'large', ... associated with it, so I don't see a problem.

Let's try a more subtle case:
Me: In the next input I will give you a list of words. In the input after that I will give you a sentence containing the placeholder "xyz". In response to that, please give me the sentence with the most appropriate word from the list.
ChatGPT3.5: Sure, I'm ready to assist you. Please provide the list of words in the next input, and in the following input, give me the sentence containing the placeholder "xyz". I'll then replace the placeholder with the most appropriate word from the provided list. Go ahead and provide the list of words.
Me: Horse, zebra, donkey.
ChatGPT3.5: Got it! Please provide the sentence containing the placeholder "xyz" that you'd like me to fill in using the words "horse", "zebra", or "donkey".
Me: The xyz is the biggest animal from the list.
ChatGPT3.5: The zebra is the biggest animal from the list.
In fact zebras are smaller than horses.
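(As a toy illustration of what "statistical association" means here, and emphatically not how GPT works internally: even plain co-occurrence counts over a made-up corpus already bias "whale" toward the "biggest"-type words. The corpus and window size below are invented for the example.)
Python:
# Count which words co-occur with a target word within a small window.
from collections import Counter

corpus = [
    "the blue whale is the biggest animal on earth",
    "a whale is a very large marine mammal",
    "the minnow is a small freshwater fish",
    "a tiny minnow darted under the rock",
]

def cooccurrences(target, window=4):
    counts = Counter()
    for sentence in corpus:
        words = sentence.split()
        for i, w in enumerate(words):
            if w == target:
                for neighbor in words[max(0, i - window): i + window + 1]:
                    if neighbor != target:
                        counts[neighbor] += 1
    return counts

print(cooccurrences("whale").most_common(5))   # size words like "biggest", "large" show up
print(cooccurrences("minnow").most_common(5))  # "small", "tiny" show up instead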

Swamp Thing said:
If one has access to, say, OpenAI's GPT via the full API, would there be a way to drill down and get some sense of how a particular input prompt gave rise to a corresponding response? I do realise that huge amounts of data would be involved, which could overwhelm a human.
Not really. ChatGPT 3.5 has circa 180 billion parameters, which are densely connected, so it would be really hard to see how the choices are being made.
The OpenAI team will have to resolve this issue somehow (perhaps by exposing information about important choices in the tree, or some snippets, or ...), because eventually the regulatory agencies will need it to be in place.
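The closest thing the API currently exposes, as far as I know, is per-token log-probabilities: for each generated token you can get the handful of alternatives the model considered and how likely each was. That is nowhere near an explanation, but it is raw material for the kind of visualization tool you describe. A rough sketch with the openai Python package (v1-style client; the model name and exact response fields are my assumptions, so check the current docs):
Python:
# Hedged sketch: the API does not expose the internal computation, but it can
# return per-token log-probabilities, the closest thing to a per-step
# "why this word" signal. Field names below follow the v1 client as I understand it.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # hypothetical model choice
    messages=[{"role": "user",
               "content": "Which is the biggest animal: horse, zebra or donkey?"}],
    logprobs=True,
    top_logprobs=5,  # also return the top alternatives considered at each step
)

for token_info in response.choices[0].logprobs.content:
    alternatives = {alt.token: round(alt.logprob, 2) for alt in token_info.top_logprobs}
    print(repr(token_info.token), alternatives)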
 
Likes: Swamp Thing
  • #123
Swamp Thing said:
Can one break down how a blind statistical process can do that?
Is there any mathematical logic in it at all? Obviously search engines and chatbots use word association, so "big" = "mass", and from that you get a list of masses. Can it then do <, >, =? Seems messy, but I'd think that by statistical analysis/word frequency it would associate a 5-digit number with being "bigger" than a 4-digit number, even without doing the math.

For what it's worth, translating your question into searchenginese doesn't give a one-word answer but does give a table of masses in the 5th hit.

I think this is a good exercise, because people so vastly over-estimate the capabilities of chat-bots. It's amazing how convincing you can make it with a large enough data set without using actual intelligence.
 
  • #124
russ_watters said:
Is there any mathematical logic in it at all?
Not in the version of ChatGPT that my Insights article, and the Wolfram article it references, are based on. Future versions might add something along these lines.
 
  • #125
Large language models "use statistics about written text" (on which they have been trained) as a proxy for "something else" that we call "the real world".

Biological brains/nervous systems "use (something that we humans ourselves can describe as) temporal and spatial statistical distributions of neuronal spikes" as a proxy for "something else" that we like to call "the real world".

There are many differences between how biological nervous systems work, and how the computers that execute the algorithms behind large language models work.

Some people seem to think that "what we can describe as temporal and spatial statistical distributions of neuronal spikes" (or a subset of it) is the only thing/system/process in the Universe that can also be described as "phenomenological experience", and so they deny any possibility that a different type of system (or part of its processes) could also be described as having "phenomenological experiences".

But the reality is that we just don't know...
 
Likes: gentzen
  • #126
mattt said:
There are many differences between how biological nervous systems work, and how the computers that execute the algorithms behind large language models work.
Yes, and a major difference, at least for the version of ChatGPT that this article discusses, is that biological nervous systems have two-way real-time connections with the outside world, whereas ChatGPT does not--it has a one-way, "frozen" connection from its training data, and that connection is much more limited than the connections biological systems have since it only consists of relative word frequencies in text.
 
Likes: gentzen
  • #127
In my experience, GPT is not 100% reliable, especially the models that have arisen from it (ChatGPT).
 
  • #128
A recent article in Nature describes situations where LLMs become unreliable (i.e. fail to provide the correct answer), especially in visual abstraction tasks and with rephrased texts that humans in general have no trouble with:

https://www.nature.com/articles/d41586-023-02361-7

I like the title of the printed (PDF) version of the article better though: "The Easy Intelligence Tests that AI Chatbots Fail". While some of the emergent behaviors of LLMs are indeed surprising, it is my opinion that people in general are too easily fooled into perceiving this as a sign of "intelligent thinking" when it is (so far) mostly just a very good language pattern matcher.
 
Likes: russ_watters
  • #129
PeterDonis said:
About the article, I don't think it contains particularly new information, but it presents thoughts that some people had not considered.
I agree that ChatGPT is unreliable, so it needs humans; I consider it something better than "an Excel sheet", that's all.
Some people have blindly used this tool as if it were a human head working for them, and that's not what it is. In this context, I believe the article has value in highlighting this point and discouraging that way of using it.
I believe ChatGPT is a tool that is here to stay at work, and it is inevitable that it will be used to optimize some tasks.
 
  • #130
PeterDonis said:
We should be skeptical of the information provided by ChatGPT, but it is excellent when you have a question that is difficult to compose for a regular search engine. Its value is that it can interpret your question better than Google or Bing, and you can build context. If it misunderstands you, you can even clarify your meaning in a way that is really only possible with another human. You can always verify the facts.
 
  • #131
neobaud said:
Its value is that it can interpret your question
But it doesn't "interpret" at all. It has no semantics.

neobaud said:
you can even clarify your meaning in a way that is really only possible with another human
This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.
 
Likes: russ_watters
  • #132
One thing I have noticed interacting with ChatGPT is that it is no good at all at explaining how it comes up with the answers it does. Recently I asked it a question that required mathematical reasoning (something it is really terrible at for some reason), and it did an excellent calculation that involved invoking physics equations that were not at all obvious from the prompt. So that was impressive, and it arrived at a correct expression that involved simply multiplying a chain of different terms that varied widely in magnitude. After the remarkably correct analysis it did, it got that final multiplication wrong by many orders of magnitude!

So I told it that it was wrong, but the rest was right, so just do the multiplication again. This time it got a different answer, but still completely wrong. So I asked it to describe how it carried out the multiplication, step by step, and it couldn't do it. So I told it to carry out the multiplication of the first two numbers, report the answer, then multiply the third one, and so on, and it then did get the answer correct. Then I used Bard and it got the answer wrong also; it could report the Python code it used, but the Python code did not give the answer Bard gave! So those AIs cannot seem to track how they arrive at their own answers, and I think that may be closely related to why their numerical results are horrendously unreliable. At some point along the way, they seem to do something that in human terms would be called "guessing", but they cannot distinguish that from any other type of analysis they do, including following Python code.
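For anyone who wants to do the same check, the chained multiplication itself is trivial to verify outside the chatbot. A minimal sketch (the factors here are made up for illustration; substitute whatever terms the model's expression actually contains):
Python:
import math

# Hypothetical factors of very different magnitudes, standing in for the terms
# in the (otherwise correct) expression the chatbot produced.
factors = [6.674e-11, 5.97e24, 1.99e30, 3.8e-3]

# Step by step, the way I eventually had to walk ChatGPT through it:
running = 1.0
for f in factors:
    running *= f
    print(f"x {f:.3e} -> {running:.3e}")

print(f"final product: {math.prod(factors):.3e}")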
 
Likes: AndreasC
  • #133
I admittedly only read the first page of posts, and then skimmed through the rest to arrive here, so please forgive me if someone else has asked, but: all of you are aware that there are many, many more LLMs besides ChatGPT? I understand that it is probably the most well known, but in my experience (which is vast, but also shallow, and from a humble interactor's perspective) there are better ones out there that are suited to your needs. Google Bard, for instance. I think that ChatGPT is faster and better at creative text generation, but Bard is leaps and bounds better at providing accurate answers. It can also summarize any link you provide to it, such as from Google Scholar or anywhere else for that matter. Its accuracy in this, however, is about like ChatGPT's accuracy at being able to cite its sources.

Further, there are now LLMs that can be run locally. I have a few installed, but they require about what my system has as a bare minimum, so my usage of them is limited. For anyone curious, it's called gpt4all and can be found on GitHub. It allows you to download quite a few LLMs directly from the app. The only plugin currently available is one that allows you to give the models access to folders or files on your local system. It can do a few things with this ability.

Anyway, this has turned into a rant. Perhaps someone better at coding, with vastly more patience and a better PC, will look into the locally run LLMs and, after using various models, realize that LLMs are about as good as a textbook. If you do not know what you're looking for, neither will the LLM... very well, anyway.
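For reference, besides the app, the gpt4all project also ships Python bindings, so a locally run model can be scripted. A rough sketch based on their documented usage (the specific model filename is just an example, and the exact API may have changed since I tried it):
Python:
# Hedged sketch of running a local model via the gpt4all Python bindings.
# The model file is downloaded on first use and needs a machine with enough RAM.
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # example model filename

with model.chat_session():
    reply = model.generate(
        "Summarize why large language models can sound confident but be wrong.",
        max_tokens=200,
    )
    print(reply)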
 
Likes: russ_watters
  • #134
https://arxiv.org/abs/2308.02312
Who Answers It Better? An In-Depth Analysis of ChatGPT and Stack Overflow Answers to Software Engineering Questions

Over the last decade, Q&A platforms have played a crucial role in how programmers seek help online. The emergence of ChatGPT, however, is causing a shift in this pattern. Despite ChatGPT's popularity, there hasn't been a thorough investigation into the quality and usability of its responses to software engineering queries. To address this gap, we undertook a comprehensive analysis of ChatGPT's replies to 517 questions from Stack Overflow (SO). We assessed the correctness, consistency, comprehensiveness, and conciseness of these responses. Additionally, we conducted an extensive linguistic analysis and a user study to gain insights into the linguistic and human aspects of ChatGPT's answers. Our examination revealed that 52% of ChatGPT's answers contain inaccuracies and 77% are verbose. Nevertheless, users still prefer ChatGPT's responses 39.34% of the time due to their comprehensiveness and articulate language style. These findings underscore the need for meticulous error correction in ChatGPT while also raising awareness among users about the potential risks associated with seemingly accurate answers.

Users get tricked by appearance.
Our user study results show that users prefer ChatGPT answers 34.82% of the time. However, 77.27% of these preferences are incorrect answers. We believe this observation is worth investigating. During our study, we observed that only when the error in the ChatGPT answer is obvious, users can identify the error. However, when the error is not readily verifiable or requires external IDE or documentation, users often fail to identify the incorrectness or underestimate the degree of error in the answer. Surprisingly, even when the answer has an obvious error, 2 out of 12 participants still marked them as correct and preferred that answer. From semi-structured interviews, it is apparent that polite language, articulated and text-book style answers, comprehensiveness, and affiliation in answers make completely wrong answers seem correct. We argue that these seemingly correct-looking answers are the most fatal. They can easily trick users into thinking that they are correct, especially when they lack the expertise or means to readily verify the correctness. It is even more dangerous when a human is not involved in the generation process and generated results are automatically used elsewhere by another AI. The chain of errors will propagate and have devastating effects in these situations. With the large percentage of incorrect answers ChatGPT generates, this situation is alarming. Hence it is crucial to communicate the level of correctness to users.
 
  • #135
AngryBeavers said:
all of you are aware that there are many, many more LLMs besides ChatGPT?
Yes, and no doubt how these models work will continue to evolve. The reason for specifically considering ChatGPT in the Insights article under discussion is that that specific one (because of its wide public accessibility) has been the subject of quite a few PF threads, and in a number of those threads it became apparent that there are common misconceptions about how ChatGPT works, which the Insights article was intended to help correct.
 
Likes: russ_watters
  • #136
I see. I was wondering. From what I read, it seemed the only LLM being discussed was ChatGPT, which I now understand why. I do not understand exactly why ChatGPT gets so much flak though, even still. At the end of the day, it's merely a tool. I've never seen so much hesitance, anger, and irrational fear over a tool... well, ever. Not here so much, I mean in the world in general. Frankly, I have other things to be worried about, such as property taxes, or the grapefruit-sized hernia protruding from my stomach.

Sometimes inaccurate information, and the upcoming downfall of mankind to our AI overlords is pretty low on my list of things to get emotional about.

:)
 
  • #137
In my rather limited experience, I found Bard was even worse than ChatGPT when it comes to mathematical reasoning that results in a quantitative answer to a physics or astronomy question. ChatGPT often uses the correct physics equation but does the calculation wrong, whereas Bard just makes the physics up completely. If there are any free LLMs that are better at these types of questions than those two, I'd like to know about them, because those two are, quite frankly, just awful if you need reliability. That said, I will say that ChatGPT can often be prompted into a correct answer if you check it carefully, whereas I found Bard to be hopeless even when corrected. Hence I do think ChatGPT can be a useful physics tool, but only when used interactively and from a position of some prior knowledge. On the other hand, it is probably correct often enough to be able to give a student a passing grade on most physics exams, though I should not think it would ever result in an A, at least in my experience with it.

Ten years from now? Well of course it will have improved enough to give excellent results to the kinds of exams that are currently given. Does that point to a problem in the exams, if they can be answered correctly by a language manipulation model that does not have any comprehension of its source material? Possibly, yes, it may mean that we are not asking the right kinds of questions to our students if we want them to be critical thinkers and not semantic parrots.
 
Likes: AndreasC
  • #138
Well, LLMs do not have computational algorithms (yet); they deal with text pattern recognition, so I don't know why it's so surprising that they cannot do calculations.
Here is a proposal for a math extension for LLMs: https://aclanthology.org/2023.acl-industry.4.pdf

Anyway, testing ChatGPT (or Bard) a little more, I find it useful for initial code generation, but to get a properly functioning script I found myself going to StackOverflow 70% of the time. The explanations and examples are all already there; with LLMs you have to ask a lot of questions (which means typing and waiting) and still don't get the right answers some of the time. And mind you, this is not complex code (for that, I never use LLMs), just some small scripts for everyday use.
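Just to illustrate the general idea behind such a math extension (a toy tool-use sketch of my own, not the mechanism from the linked paper): have the surrounding system detect a plain arithmetic expression and evaluate it with real code, instead of letting the LLM predict the digits:
Python:
# Safely evaluate a purely numeric arithmetic expression, the kind of thing an
# LLM could hand off to a "calculator tool" instead of guessing the result.
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow, ast.USub: operator.neg}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.operand))
        raise ValueError("not a plain arithmetic expression")
    return walk(ast.parse(expr, mode="eval").body)

print(safe_eval("6.674e-11 * 5.97e24 / (6.371e6 ** 2)"))  # ~9.8, computed, not predicted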
 
  • #139
Ken G said:
Possibly, yes, it may mean that we are not asking the right kinds of questions to our students if we want them to be critical thinkers and not semantic parrots.
I think that just about sums up why LLMs are not a risk to most people's sense of security. I try to think of any LLM as being an interactive search engine. I can search for whatever my little heart desires, but I still have to filter the results myself. Honestly, I get more entertainment value from them in their current state. A lot of frustration also. For the purposes of genealogy, LLMs are often completely wrong and generate responses that merely parrot your initial input in a weird way. However, the "playground" version of ChatGPT gets so close that I find it interesting. The results are off, but if it were a human, I'd probably consider it the result of "Uncle GPT's bad memory."

More than I meant to reply... heh. Basically, LLMs have strengths and weaknesses, but none are going to stun an intellectual community in any area that might be relevant.
 
  • #140
AngryBeavers said:
I see. I was wondering. From what I read, it seemed the only LLM being discussed was ChatGPT, which I now understand why. I do not understand exactly why ChatGPT gets so much flak though, even still. At the end of the day, it's merely a tool.
The issue, on this forum at least, is not actually the tool so much as people's misunderstanding about how it works and the reasons for its behavior. People give it too much credit for "intelligence" and so forth.
 
  • #141
Motore said:
Well, LLMs do not have computational algorithms (yet); they deal with text pattern recognition, so I don't know why it's so surprising that they cannot do calculations.
It's because I have seen them report the Python code they used to do the calculation, and the Python code does not yield the quantitative result they report. So that's pretty odd: they seem to be able to associate their prompts with actual Python code that is correct, and still get the answer wrong.
Motore said:
Here is a proposal for a math extension for LLMs: https://aclanthology.org/2023.acl-industry.4.pdf
Yes, this is the kind of thing that is needed, and is what I'm expecting will be in place in a few years, so it seems likely that ten years from now, LLMs will be able to answer physics questions fairly well, as long as they only require associating the question with a formula without conceptual analysis first. It will then be interesting to see how much LLMs have to teach us about what we do and do not comprehend about our own physics, and what physics understanding actually is. This might be pedagogically significant for our students, or something much deeper.
Motore said:
Anyway, testing ChatGPT (or Bard) a little more, I find it useful for initial code generation, but to get a properly functioning script I found myself going to StackOverflow 70% of the time. The explanations and examples are all already there; with LLMs you have to ask a lot of questions (which means typing and waiting) and still don't get the right answers some of the time. And mind you, this is not complex code (for that, I never use LLMs), just some small scripts for everyday use.
Then the question is, why do you not use LLMs for complex code, and will that still be true in ten years? That might be the coding equivalent of using LLMs to solve physics questions, say on a graduate level final exam.
 
  • #142
AngryBeavers said:
More than I meant to reply... heh. Basically, LLMs have strengths and weaknesses, but none are going to stun an intellectual community in any area that might be relevant.
My question is, how much of this is due to the fact that these are just early-generation attempts, versus how much is fundamental to the way LLMs must work? If we fix up their ability to recognize logical contradictions, and enhance their ability to do mathematical logic, will we get to a point where it is very hard to distinguish their capabilities from the capabilities of the physics teachers who pose the questions in the first place? And if we did, what would that mean for our current ideas about what conceptual understanding is, since physics seems like a place where conceptual understanding plays a crucial role in achieving expertise? These kinds of AI-related questions always remind me of B. F. Skinner's great point: "The real question is not whether machines think but whether men do. The mystery which surrounds a thinking machine already surrounds a thinking man." (Or woman.)
 
  • #143
neobaud said:
.... If it misunderstands you, you can even clarify your meaning in a way that is really only possible with another human.

PeterDonis said:
This seems to be vastly overstating ChatGPT's capabilities. You can "clarify" by giving it another input, but it won't have any semantic meaning to ChatGPT any more than your original input did.

My understanding is that it is designed to* accept criticism/clarification/correction. That makes such follow-up mostly useless, since it will simply be riffing on what you tell it, regardless of whether it is accurate or not. In other words, you'll always win a debate with it, even if you're wrong.

*Whether actual design or "emergent" behavior I don't know, but I don't think it matters.
 
  • #144
Also, it should perhaps be noted that this question goes way beyond whether we should shun or accept AI in the physics classroom; the question of "what does an LLM really know" goes to the heart of what this type of AI will be useful for in the future. Let us not forget IBM's painful lesson that was Watson! They thought that because it was the best Jeopardy contestant ever, it could analyze patient data and help suggest improved cancer treatments. So far that has been a dismal failure, because of the problem of connecting the machine and its capabilities with the necessary data. A human can better tell what matters and what doesn't, and can integrate disparate forms of information, whereas Watson had vastly more computing power but could not find a way to use it, unlike when it was trying to win a chess game or a Jeopardy match, games with simple rules and relatively straightforward connections between the data. To get AI to live up to its potential, we may have to first understand better what knowledge even is, and what separates it from some vast pile of disjoint information. What will that knowledge tell us about ourselves?
 
  • #145
russ_watters said:
it is designed to* accept criticism/clarification/correction
No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.
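To make that concrete in a very reduced form, here is a bigram-level sketch of "generate text from relative word frequencies": given the current word, sample the next one in proportion to how often it followed that word in the training text. Real models condition on much longer contexts with a neural network rather than a lookup table, but the point that there are no semantics, only conditional frequencies, is the same:
Python:
import random
from collections import Counter, defaultdict

# Tiny "training corpus", invented for the illustration.
training_text = ("the whale is the biggest animal . the horse is a big animal . "
                 "the zebra is a striped animal .").split()

# Count how often each word follows each other word.
follows = defaultdict(Counter)
for current, nxt in zip(training_text, training_text[1:]):
    follows[current][nxt] += 1

def generate(start, length=8):
    word, out = start, [start]
    for _ in range(length):
        choices = follows.get(word)
        if not choices:
            break
        word = random.choices(list(choices), weights=list(choices.values()))[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))  # plausible-looking word salad, chosen by frequency alone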
 
Likes: PeroK, Motore and russ_watters
  • #146
PeterDonis said:
No, it's designed to accept any input whatever and generate response text based on the relative word frequency algorithm that Wolfram describes in his article (referenced in the Insights article). It has no idea that some input is "criticism/clarification" based on its previous responses. It has no semantics.
Yet it gives the prompt some kind of special status, so when you correct something it said, it tends to become quite obsequious. It is not normal in human language to be that obsequious; I don't think it could get that from simply predicting the next word. When it has made a mistake and you tell it that it has, it will typically search its database a little differently, placing some kind of emphasis on your correction. If you tell it that one plus one equals three, however, it has enough contrary associations in its database that it sticks to its guns, but it will still not tell you that you have made a mistake (which surely would be the norm in its database of reactions); it will suggest you might be joking or using an alternative type of mathematics. The status it gives to the prompt must be an important element of how it works, and it is constrained to be polite in the extreme, which amounts to "riffing on" your prompt.
 
  • #147
Ken G said:
it gives the prompt some kind of special status, so when you correct something it said, it tends to become quite obsequious
I don't think that's anything explicitly designed, at least not in the version of ChatGPT that was reviewed in the Insights article and the Wolfram article it references. It is just a side effect of its algorithm.
 
  • #148
Ken G said:
it will typically search its database
It doesn't search any database; that was one of the main points of the Insights article. All it does is generate text based on relative word frequencies, using the prompt given as input as its starting point.
 
  • #149
PeterDonis said:
It doesn't search any database; that was one of the main points of the Insights article. All it does is generate text based on relative word frequencies, using the prompt given as input as its starting point.
Yes, but it has a database to inform those frequencies. That database must be "primed" in some way to establish what ChatGPT is, and how it should relate to prompts. For example, if you prompt it with "describe yourself", it will say "I am ChatGPT, a creation of OpenAI. I'm a language model powered by the GPT-3.5 architecture, designed to understand and generate human-like text based on the input I receive. I have been trained on a diverse range of text sources up until September 2021, so I can provide information, answer questions, assist with writing, generate creative content, and more. However, it's important to note that I don't possess consciousness, emotions, or personal experiences. My responses are based on patterns in the data I've been trained on, and I aim to be a helpful and informative tool for various tasks and conversations." So that's a highly specialized set of data to look for word associations with "describe yourself"; it has been trained to favor certain word frequencies in response to certain prompts.

Also, if you correct it, it will invariably apologize obsequiously. So it is in some sense "programmed to accept corrections," in the sense that it uses a word association database that expects to be corrected and is trained to respond to that in certain ways.

It would seem that its training also expects it to provide certain types of responses. For example, if you just give it the one-word prompt "poem", it will write a poem. Also, since you did not specify a subject, it will write a poem about poetry! I think that was a conscious decision by its programmers; there are built-in expectations about what a prompt is trying to accomplish, including corrections. It could be said that ChatGPT inherits some elements of the intelligence of its trainers.
 
  • #150
Ken G said:
it has a database to inform those frequencies
It has a database of relative word frequencies. That's the only database it has. (At least, that's the case for the version that was reviewed in the articles being discussed here. Later versions might have changed some things.)
 
