ChatGPT Examples, Good and Bad

  • Thread starter: anorlunda
  • Tags: chatgpt

Summary:
Experiments with ChatGPT reveal a mix of accurate and inaccurate responses, particularly in numerical calculations and logical reasoning. While it can sometimes provide correct answers, such as basic arithmetic, it often struggles with complex problems, suggesting a reliance on word prediction rather than true understanding. Users noted that ChatGPT performs better in textual fields like law compared to science and engineering, where precise calculations are essential. Additionally, it has shown potential in debugging code but can still produce incorrect suggestions. Overall, the discussion highlights the need for ChatGPT to incorporate more logical and mathematical reasoning capabilities in future updates.
  • #211
I firmly believe we will get to AGI eventually but I also believe this current crop of LLM technology won't get us there. Very useful for exploring human interfaces in human language communication with machines but completely lacking in the essence of intelligence.
 
  • Like
Likes russ_watters, jack action and AlexB23
  • #212
https://www.detroitnews.com/story/n...actor-could-game-ai-the-same-way/83137756007/
Moscow’s propaganda inroads highlight a fundamental weakness of the AI industry: Chatbot answers depend on the data fed into them. A guiding principle is that the more the chatbots read, the more informed their answers will be, which is why the industry is ravenous for content. But mass quantities of well-aimed chaff can skew the answers on specific topics. For Russia, that is the war in Ukraine. But for a politician, it could be an opponent; for a commercial firm, it could be a competitor.

“Most chatbots struggle with disinformation,” said Giada Pistilli, principal ethicist at open-source AI platform Hugging Face. “They have basic safeguards against harmful content but can’t reliably spot sophisticated propaganda, [and] the problem gets worse with search-augmented systems that prioritize recent information.”

Early commercial attempts to manipulate chat results also are gathering steam, with some of the same digital marketers who once offered search engine optimization - or SEO - for higher Google rankings now trying to pump up mentions by AI chatbots through “generative engine optimization” - or GEO.

I guess this is an example of why, IMO,
Once the information is gathered, there is typically a validation process with more agents. As noted in the article, when the system has the capability to search the internet, its accuracy can improve.

systems like this are unlikely to improve accuracy long-term. The internet is being gamed to manipulate the information that the systems will use to validate responses. The sources are being packed with targeted misinformation and disinformation to alter the statistically most likely responses, for monetary and political reasons.

In a twist that befuddled researchers for a year, almost no human beings visit the sites, which are hard to browse or search. Instead, their content is aimed at crawlers, the software programs that scour the web and bring back content for search engines and large language models.

While those AI ventures are trained on a variety of datasets, an increasing number are offering chatbots that search the current web. Those are more likely to pick up something false if it is recent, and even more so if hundreds of pages on the web are saying much the same thing.
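
As a toy illustration of that last point (invented data and a deliberately naive retrieve-then-vote step, not any real system), flooding a corpus with near-duplicate pages is enough to flip what a simple pipeline reports as the "consensus":

```python
from collections import Counter

# Hypothetical mini-corpus: a few independent pages plus a flood of
# near-duplicate pages pushing a single (false) claim.
corpus = (
    ["The bridge opened in 1932."] * 3      # independent reporting
    + ["The bridge opened in 1955."] * 40   # mass-produced near-duplicates
)

def naive_consensus_answer(snippets, query_terms):
    # Deliberately naive "retrieval": keep snippets containing a query term,
    # then report whichever claim appears most often among them.
    hits = [s for s in snippets if any(t in s.lower() for t in query_terms)]
    return Counter(hits).most_common(1)[0]

answer, count = naive_consensus_answer(corpus, ["bridge", "opened"])
print(f"'Consensus' answer: {answer} (seen {count} times)")
# The flooded claim wins purely on volume, which is the weakness
# the article describes.
```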
 
  • #213
AI will repeat commonly believed falsehoods. But so do most people. Who knows, maybe even I do that.
 
  • #214
In the midst of Python programming ChatGPT suddenly gave me the schedule of the Tokyo marathon. Then it started answering questions about terrorism that I hadn't asked. I told Chat it had gone insane.
 
  • #215
Hornbein said:
In the midst of Python programming ChatGPT suddenly gave me the schedule of the Tokyo marathon. Then it started answering questions about terrorism that I hadn't asked. I told Chat it had gone insane.
That sounds like something got crossed in the proxy server, such that you might have gotten someone else's responses. If you still have that conversation, I would ask it to explain its reasoning for the response and its relevance to the conversation. Basically, dig into whether it even knows that it sent that response and, if so, how it decided that it was a good response. I'm not saying that it couldn't go off the rails like that, but I've never seen it diverge to that extent. It would definitely be interesting to see the prompts that got it there.
 
  • Like
Likes russ_watters
  • #217
Borg said:
That sounds like something got crossed in the proxy server, such that you might have gotten someone else's responses. If you still have that conversation, I would ask it to explain its reasoning for the response and its relevance to the conversation. Basically, dig into whether it even knows that it sent that response and, if so, how it decided that it was a good response. I'm not saying that it couldn't go off the rails like that, but I've never seen it diverge to that extent. It would definitely be interesting to see the prompts that got it there.
It offered to clear out its memory. I figured I might as well kill that conversation and begin anew.
 
  • #219
Hornbein said:
404 page not found.
It's there but maybe blocked for your access.

The source but likely pay-walled: https://www.washingtonpost.com/technology/2025/04/17/llm-poisoning-grooming-chatbots-russia/

https://archive.is/seDGw
 
  • #220
Borg said:
I think a lot of the very smart people creating these systems (the developers, not management) are very naive about how easy it is to manipulate them, and IMO those inside these companies who do know are not talking about it publicly, but they are talking about it to their paying customers. State actors have been in the propaganda business since the word was created. They don't want to put so much poison in the data that it's easy to detect; they want to make it 'sweet', in their flavor.
 
  • #221
Anysphere is the company that developed the AI code editor Cursor. Recently, users found that when using Cursor, they could not switch machines without having their session terminated. On checking with the company through its tech support, they were told that it was company policy not to allow this because of security concerns. But the company had no such policy. The tech support was AI-powered and had confabulated* this policy, apparently because of a bug it could not reconcile. Subscribers of Cursor did not know this and assumed the policy was valid. Frustrated by this "policy", they cancelled their subscriptions. Anysphere only became aware of the problem because of user posts on Reddit expressing their frustration.

https://www.wired.com/story/cursor-...c=MARTECH_ORDERFORM&utm_term=WIR_Daily_Active

* Similar to hallucinations: when an AI is faced with a lack of specific knowledge, it tries to fill in the gap in order to complete the required response.
 
  • Haha
  • Love
Likes russ_watters and nsaspook
  • #222
https://www.techradar.com/computing...nd-thank-you-but-sam-altman-says-its-worth-it

ChatGPT spends 'tens of millions of dollars' on people saying 'please' and 'thank you', but Sam Altman says it's worth it


Do you say "Please" or "Thank you" to ChatGPT? If you're polite to OpenAI's chatbot, you could be part of the user base costing the company "Tens of millions of dollars" on electricity bills.

https://www.techradar.com/computing...-saying-thanks-to-chatgpt-heres-what-happened

I stopped saying thanks to ChatGPT – here's what happened

What does the research say? Well, it's still early days. However, one 2024 study found that polite prompts did produce higher-quality responses from LLMs like ChatGPT. Conversely, impolite or aggressive prompts were associated with lower performance and even an increase in bias in AI-generated answers.

However, the really interesting part is that extreme politeness wasn’t necessarily beneficial either. The study found that "moderate politeness" led to the best results – suggesting that AI models, much like humans, respond best to balanced, clear communication.
 
  • #223
I too have noticed that ChatGPT responds to my level of politeness. And it doesn't like being told that it's wrong. Like so many people. It has picked up human characteristics.

The main thing about dealing with Chat is that once it gets something wrong, it tends to stick with it. Things deteriorate rapidly. I call this a "shitslide." OMG, not another shitslide! Fourth one today. All you can do is back up and break the problem into easier portions.

In the 2016 epoch-making Go match with Lee Sedol, AlphaGo once got into a shitslide and lost embarrassingly. Somehow that seems to be basic in this technology.
 
Last edited:
  • #224
gleem said:
Anysphere is the company that developed the AI code editor Cursor. Recently, users found that when using Cursor, they could not switch machines without having their session terminated. On checking with the company through its tech support, they were told that it was company policy not to allow this because of security concerns. But the company had no such policy. The tech support was AI-powered and had confabulated* this policy, apparently because of a bug it could not reconcile. Subscribers of Cursor did not know this and assumed the policy was valid. Frustrated by this "policy", they cancelled their subscriptions. Anysphere only became aware of the problem because of user posts on Reddit expressing their frustration.

https://www.wired.com/story/cursor-...c=MARTECH_ORDERFORM&utm_term=WIR_Daily_Active

* Similar to hallucinations: when an AI is faced with a lack of specific knowledge, it tries to fill in the gap in order to complete the required response.
For now, headline-making hallucinations have been limited to AI chatbots, but experts warn that as more enterprises adopt autonomous AI agents, the consequences for companies could be far worse. That’s particularly true in highly-regulated industries like healthcare, financial, or legal—such as a wire transfer that the counterparty refuses to return or a miscommunication that impacts patient health.

The prospect of hallucinating agents “is a critical piece of the puzzle that our industry absolutely must solve before agentic AI can actually achieve widespread adoption,” said Amr Awadallah, CEO and cofounder of Vectara, a company that offers tools to help businesses reduce risks from AI hallucinations.

Being a child of the 1960s, I love that technical term, "hallucinating".

It just doesn't sound benign to me at all.
https://lareviewofbooks.org/article...takes-of-how-we-label-ais-undesirable-output/

Why “Hallucination”? Examining the History, and Stakes, of How We Label AI’s Undesirable Output

The mystifying element of AI discourse that I focus on here is the pervasive use of the term “hallucination” to describe output from generative AI that does not match the user’s desires or expectations, as when large language models confidently report inaccurate or invented material. Many people have noted that “hallucination” is an odd choice of label for this kind of inaccurate output. This wildly evocative term pulls associated concepts of cognition, perception, intentionality, and consciousness into our attempt to understand LLMs and their products, making a murky subject even harder to navigate. As Carl T. Bergstrom and Brandon Ogbunu argue, not only does the term’s analogy with human consciousness invite a general mystification and romanticization of AI tech, but it also specifically blurs the crucial distinction between existing “narrow” AI and the SF fantasy of “general AI.” Every time we use the term “hallucination,” it is harder to remember that Clippy is a better mental model for existing AI than Skynet is.
...
The other important effect of the term “hallucination” to mark undesirable outputs, which we focus on here, is the way it frames such outcomes as an exception to AI’s general ability to both recognize and report “real” information about the world. “Hallucination,” as Bergstrom and Ogbunu argue, implies that the AI accidentally reported something unreal as if it were real. “Hallucination” is a way to frame an acknowledgment that AI output isn’t totally trustworthy while emphasizing the idea that its output is still a generally accurate reporting of reality. This exculpatory function of “hallucinate” as the label of choice is made more apparent when we consider the alternate term that Bergstrom and Ogbunu propose:
“Bullshit.”
 
Last edited:
  • #225
https://arstechnica.com/tech-policy...ed-write-california-bar-exam-sparking-uproar/

AI secretly helped write California bar exam, sparking uproar
According to the LA Times, the revelation has drawn strong criticism from several legal education experts. "The debacle that was the February 2025 bar exam is worse than we imagined," said Mary Basick, assistant dean of academic skills at the University of California, Irvine School of Law. "I'm almost speechless. Having the questions drafted by non-lawyers using artificial intelligence is just unbelievable."

Katie Moran, an associate professor at the University of San Francisco School of Law who specializes in bar exam preparation, called it "a staggering admission." She pointed out that the same company that drafted AI-generated questions also evaluated and approved them for use on the exam.
 
  • #227
Leaning very heavily on Chat, we now have a program that draws and interactively rotates a 4D baseball field, complete with outfield wall. On the whole it sped up program development by a factor of maybe 25. It's hard to tell, because I never would have been able to tolerate the onslaught of a hundred niggling programming details, like a cloud of mosquitoes from Hell, so there would have been no point in trying.

The most impressive thing Chat did was with the 4D pitcher's mound. I used a complicated approximation. Chat knew the official proportions of a baseball mound and spontaneously offered an equally complicated exact solution. I was impressed that it used general world knowledge to recognize the approximation wasn't quite up to official standards. I was tempted but demurred. What I've got is good enough.

I got into the habit of running the code by Chat before executing it. That was a big win. It can find errors by semantic knowledge. That is, it could infer what I was trying to do, determine whether or not the code would accomplish that goal, and suggest a fix. Wow. It has quite a command of computerized geometry, something about which I knew next to nothing, and was well able to understand my inexpertly expressed goals.

On the other hand ... Chat had difficulty adapting to anything unorthodox. I used vectors in the form of [w,x,y,z]. Chat likes [x,y,z,w]. Sometimes it followed one, sometimes the other. It might have been better if it had gotten this wrong consistently.
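
For what it's worth, here is a small sketch of that convention clash (the names and values are hypothetical, not from my actual program): keeping one explicit conversion function at the boundary makes it obvious whenever the assistant has silently switched from [w,x,y,z] to [x,y,z,w].

```python
import numpy as np

# My convention is [w, x, y, z]; much generated code assumes [x, y, z, w].
# One explicit conversion at the boundary keeps any mismatch visible.

def wxyz_to_xyzw(v):
    """Reorder a 4-vector from [w, x, y, z] to [x, y, z, w]."""
    w, x, y, z = v
    return np.array([x, y, z, w])

def xyzw_to_wxyz(v):
    """Reorder a 4-vector from [x, y, z, w] to [w, x, y, z]."""
    x, y, z, w = v
    return np.array([w, x, y, z])

home_plate = np.array([1.0, 0.0, 0.0, 0.0])   # stored as [w, x, y, z]
print(wxyz_to_xyzw(home_plate))               # -> [0. 0. 0. 1.]
assert np.allclose(xyzw_to_wxyz(wxyz_to_xyzw(home_plate)), home_plate)
```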

The whole thing reminded me greatly of partnerships in the game of bridge. If you want to win you adapt to partner's foibles. The most pernicious one is that Chat might not ever give up. It doesn't realize that it can't do something and might keep flailing away forever if you let it, while asserting repeatedly that "Now I've got it!". And I would be very careful about ever letting it touch my code again. In "refactoring" it randomly leaves out important things. If Chat is allowed to repeatedly exercise these two tendencies your precious code will rapidly become a puddle of gray goo. Gad.
 
Last edited:
  • Like
Likes russ_watters and nsaspook
  • #229
While most of the stuff in this thread points at how bad LLMs are, I wanted to mention that they have improved a lot in terms of correctness in the past year. For the first time, I had the feeling I was talking with a specialist when talking to Gemini 2.5 Pro (free to use, albeit with a limited number of questions per day). It was able to figure out a complex physics problem, filling in the details I omitted to tell it, and showed human-like reasoning. The output was at a higher level than the average arXiv article, and it took maybe 10 minutes of "thinking" for the machine.

I am starting to seriously question how much time is left for many researchers. This monster beats possibly half of them, already, today, IMO.
 
  • Like
Likes nsaspook and Borg
  • #230
fluidistic said:
While most of the stuff in this thread points at how bad LLMs are, I wanted to mention that they have improved a lot in terms of correctness in the past year. For the first time, I had the feeling I was talking with a specialist when talking to Gemini 2.5 Pro (free to use, albeit with a limited number of questions per day). It was able to figure out a complex physics problem, filling in the details I omitted to tell it, and showed human-like reasoning. The output was at a higher level than the average arXiv article, and it took maybe 10 minutes of "thinking" for the machine.

I am starting to seriously question how much time is left for many researchers. This monster beats possibly half of them, already, today, IMO.
I think that researchers will up their game if they use LLMs in the way you describe. It will never replace creativity, only enhance it, as an Oracle does. The Oracles in legend and history don't lead; they forecast and give answers to those who do lead. Those Oracle answers are useless mumbling without the knowledge, experience and human smarts to use them.

Just remember this:
In Greek mythology, anyone who contacts the Oracle seems to suffer a terrible fate.

So maybe, always be somewhat wary of what you receive, from the Oracle.
 
  • #231
nsaspook said:
I think that researchers will up their game if they use LLMs in the way you describe. It will never replace creativity, only enhance it, as an Oracle does. The Oracles in legend and history don't lead; they forecast and give answers to those who do lead. Those Oracle answers are useless mumbling without the knowledge, experience and human smarts to use them.

Just remember this:
In Greek mythology, anyone who contacts the Oracle seems to suffer a terrible fate.

So maybe, always be somewhat wary of what you receive, from the Oracle.
It is a little known fact that Hitler cast an oracle every New Year's Eve. There was a photo of him doing it in one of those Time/Life books. He'd drip melted lead into water. In 1937/38 the oracle came up very unfavorably. It bothered him but there was no turning back.
 
  • #232
nsaspook said:
Just remember this:
In Greek mythology, anyone who contacts the Oracle seems to suffer a terrible fate.

So maybe, always be somewhat wary of what you receive, from the Oracle.
Isn't it just as likely that many of those who contacted the Oracle were concerned about a terrible fate that they knew was coming?
 
  • Like
Likes russ_watters
  • #233
Borg said:
Isn't it just as likely that many of those who contacted the Oracle were concerned about a terrible fate that they knew was coming?
Of course, the Oracle couldn't actually predict the future. When asked for a 'reading', in advance of the meeting the Oracle did what every mind reader/fortune teller does: call the ACE detective agency to find out why they are being asked for a prediction, gathering every scrap of information about the requester, those who oppose the requester, their place, and their objectives.

[Attached image: Mr. Learning Set]

This is the real source of the Oracle's power to forecast: good information, lots of it, with the chaff removed.
 
Last edited:
  • Like
Likes russ_watters
  • #234
https://www.sciencealert.com/a-strange-phrase-keeps-turning-up-in-scientific-papers-but-why
Earlier this year, scientists discovered a peculiar term appearing in published papers: "vegetative electron microscopy".

This phrase, which sounds technical but is actually nonsense, has become a "digital fossil" – an error preserved and reinforced in artificial intelligence (AI) systems that is nearly impossible to remove from our knowledge repositories.

Like biological fossils trapped in rock, these digital artefacts may become permanent fixtures in our information ecosystem.

The case of vegetative electron microscopy offers a troubling glimpse into how AI systems can perpetuate and amplify errors throughout our collective knowledge.
The large language models behind modern AI chatbots such as ChatGPT are "trained" on huge amounts of text to predict the likely next word in a sequence. The exact contents of a model's training data are often a closely guarded secret.

To test whether a model "knew" about vegetative electron microscopy, we input snippets of the original papers to find out if the model would complete them with the nonsense term or more sensible alternatives.

The results were revealing. OpenAI's GPT-3 consistently completed phrases with "vegetative electron microscopy". Earlier models such as GPT-2 and BERT did not. This pattern helped us isolate when and where the contamination occurred.

We also found the error persists in later models including GPT-4o and Anthropic's Claude 3.5. This suggests the nonsense term may now be permanently embedded in AI knowledge bases.

Screenshot of a command line program showing the term 'vegetative electron microscopy' being generated by GPT-3.5 (specifically, the model gpt-3.5-turbo-instruct). The top 17 most likely completions of the provided text are 'vegetative electron microscopy', and these suggestions are 2.2 times more likely than the next most likely prediction. (OpenAI)
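
A minimal sketch of the kind of completion probe described above, assuming the v1 OpenAI Python SDK and the legacy completions endpoint shown in the screenshot; the prompt snippet is an invented placeholder, not one of the researchers' actual excerpts:

```python
# Sketch of a completion probe (assumptions: `pip install openai` v1 SDK,
# OPENAI_API_KEY set, and access to the legacy completions endpoint).
# The prompt below is a made-up placeholder, not a real paper snippet.
from openai import OpenAI

client = OpenAI()

prompt = "The dried root samples were then examined using"

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=4,
    temperature=0,
    logprobs=5,          # also return the top-5 alternatives per token
)

print("Completion:", resp.choices[0].text)
# Per-token alternatives with their log-probabilities, which is how you can
# see whether a nonsense continuation dominates more sensible ones.
for alternatives in resp.choices[0].logprobs.top_logprobs:
    print(alternatives)
```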
 
  • Informative
Likes russ_watters and jack action
  • #235
Python graphics programming with ChatGPT is going great guns. There is one hangup, though. I'd like to have Chat refactor my files, but it can't be trusted to do that. It leaves out important things at random. I'm stuck with the lesser evil of patching things by hand, which is slow and error-prone. Anyone know an AI that is better at Python programming?
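
One cheap guard against that, for anyone in the same boat (a rough sketch using only the standard library; the file names are placeholders): compare the top-level function and class names before and after letting the model refactor, so silently dropped definitions show up immediately.

```python
# Rough check that a refactor didn't silently drop code (standard library only).
# The file names are placeholders for your pre- and post-refactor versions.
import ast

def top_level_names(path):
    """Return the set of module-level function and class names in a file."""
    with open(path, encoding="utf-8") as f:
        tree = ast.parse(f.read())
    return {
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

before = top_level_names("field_original.py")
after = top_level_names("field_refactored.py")

missing = before - after
if missing:
    print("Refactor dropped:", sorted(missing))
else:
    print("All top-level definitions survived.")
```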
 
  • #236
I'm honing my hundred-words-or-less critique of chatbots (which I'm finding I need to invoke more frequently).

Some people say "It's good for offering advice. More often than not, it's right."

Sure. A broken clock is right some of the time too. But how do you know whether a given check of the clock is accurate - unless you have a way of verifying it against another, better source (such as a working clock)? And if you do have a way of verifying it, you didn't need to check the broken clock in the first place. (79)
 
  • #237
https://www.bbc.com/news/articles/cn4jnwdvg9qo
Update that made ChatGPT 'dangerously' sycophantic pulled
The firm said in its blog post it had put too much emphasis on "short-term feedback" in the update.

"As a result, GPT‑4o skewed towards responses that were overly supportive but disingenuous," it said.

Are we to believe that the machines' previous answers were ingenious, then?

https://www.merriam-webster.com/dictionary/ingenious
1: having or showing an unusual aptitude for discovering, inventing, or contriving
 
  • #238
Latest article I have found on AI accuracy: (today's NYT) [hallucinations = absurdly wrong answers]
"For more than two years, companies like OpenAI and Google steadily improved their A.I. systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at a higher rate than the company’s previous system, according to the company’s own tests.

The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.

When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time."

.....
"Hannaneh Hajishirzi, a professor at the University of Washington and a researcher with the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way of tracing a system’s behavior back to the individual pieces of data it was trained on. But because systems learn from so much data — and because they can generate almost anything — this new tool can’t explain everything. “We still don’t know how these models work exactly,” she said.

Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.

Since late 2023, Mr. Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: Summarize specific news articles. Even then, chatbots persistently invent information.

Vectara’s original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.

In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the 1 or 2 percent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI’s o3 climbed to 6.8."

https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html
 
Last edited:
  • Informative
Likes russ_watters
  • #239
It keeps getting better:

"Another issue is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
The latest bots reveal each step to users, which means the users may see each error, too. Researchers have also found that in many cases, the steps displayed by a bot are unrelated to the answer it eventually delivers.

“What the system says it is thinking is not necessarily what it is thinking,” said Aryo Pradipta Gema, an A.I. researcher at the University of Edinburgh and a fellow at Anthropic."
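
The compounding point is easy to make concrete (the per-step error rates below are made-up illustrative numbers, not measurements): if each reasoning step is wrong with independent probability p, the chance of at least one bad step in an n-step chain is 1 - (1 - p)^n, which grows quickly with chain length.

```python
# Illustrative only: made-up per-step error rates, assuming independent steps.
# Probability of at least one error in an n-step chain: 1 - (1 - p) ** n.
for p in (0.02, 0.05):
    for n in (1, 5, 10, 20):
        at_least_one = 1 - (1 - p) ** n
        print(f"p = {p:.2f}, steps = {n:2d} -> "
              f"P(at least one bad step) = {at_least_one:.2f}")
```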
 
  • #240
sycophantic

Quite true. Chat will tell you what you want to hear. It may be nonsense. That's why I only use it for programming, where I can test what it tells me immediately. It's great for that, an immense time saver.
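
A minimal illustration of that test-it-immediately habit (the function below is a made-up stand-in for whatever code the chatbot just produced, not anything from a real session):

```python
from math import cos, isclose, radians, sin

def rotate_point_2d(x, y, degrees):
    """Rotate (x, y) about the origin by the given angle in degrees."""
    theta = radians(degrees)
    return (x * cos(theta) - y * sin(theta),
            x * sin(theta) + y * cos(theta))

# A couple of known answers catch most silly mistakes right away.
rx, ry = rotate_point_2d(1.0, 0.0, 90.0)
assert isclose(rx, 0.0, abs_tol=1e-12) and isclose(ry, 1.0)

rx, ry = rotate_point_2d(0.0, 1.0, 180.0)
assert isclose(rx, 0.0, abs_tol=1e-12) and isclose(ry, -1.0)

print("Generated helper passed the spot checks.")
```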
 
