ChatGPT Examples, Good and Bad

  • Thread starter: anorlunda
  • Tags: chatgpt
AI Thread Summary
Experiments with ChatGPT reveal a mix of accurate and inaccurate responses, particularly in numerical calculations and logical reasoning. While it can sometimes provide correct answers, such as basic arithmetic, it often struggles with complex problems, suggesting a reliance on word prediction rather than true understanding. Users noted that ChatGPT performs better in textual fields like law compared to science and engineering, where precise calculations are essential. Additionally, it has shown potential in debugging code but can still produce incorrect suggestions. Overall, the discussion highlights the need for ChatGPT to incorporate more logical and mathematical reasoning capabilities in future updates.
  • #201
nsaspook said:
Maybe it's because these things don't really reason or understand anything.
There are a lot of humans in this world that the same could be said of.

Seriously though, we are still early in the evolution of these systems. Models and architectures that perform the best will be selected over the ones that don't. What we perceive as their 'reasoning' capabilities will continue to improve - just as people will continue to move the goalposts on what constitutes the ability to reason.

There is a lot of work that goes into designing these things. Building them is like trying to mimic the human brain, which we don't fully understand. Now think about how the human brain works when presented with a question. How would you design a system that can receive a question and then respond in a way that is similar to how a human brain would respond?

I am in the middle of building a relatively complex multi-agent system. In its simplest form of mimicking a human response, the system needs to accept a user question and answer it using previously learned or researched information. The process involves many smaller, specialized agents that are good at specific tasks like understanding dates, calculating numbers, web searches, etc. In many ways, the human brain operates in a similar manner with some areas that are good at recognizing faces, some that are good at math, some that are good at spatial problems, etc.
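
As a rough illustration of the routing idea only (a minimal sketch; the keyword-based router and the agent stubs are hypothetical stand-ins, not the actual system):
Python:
# Minimal sketch: route a question to a specialized agent by crude keyword
# matching. A real system would use an LLM-based planner plus a validation pass.

def date_agent(question: str) -> str:
    return "2025-04-25"          # placeholder: would parse/compute the date

def math_agent(question: str) -> str:
    return "42"                  # placeholder: would evaluate the arithmetic

def search_agent(question: str) -> str:
    return "top web results..."  # placeholder: would call a search API

AGENTS = {"date": date_agent, "calculate": math_agent, "search": search_agent}

def route(question: str) -> str:
    """Pick an agent by keyword and return its draft answer."""
    q = question.lower()
    if any(w in q for w in ("when", "date", "day")):
        return AGENTS["date"](question)
    if any(w in q for w in ("how many", "sum", "percent")):
        return AGENTS["calculate"](question)
    return AGENTS["search"](question)

print(route("When is the next solar eclipse?"))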

Once the information is gathered, there is typically a validation process with more agents. As noted in the article, when the system has the capability to search the internet, its accuracy can improve.
OpenAI’s GPT-4o with web search achieves 90% accuracy on SimpleQA, another one of OpenAI’s accuracy benchmarks.

Then, after gathering all of the information, models and humans alike have to decide which pieces of information are most relevant in answering the question - more agents in the AI case and specialized areas of the brain for humans.

Finally, if any of these processes generates bad information, there can be downstream failures in accuracy - this applies to models and humans alike. I personally see AI systems as evolving from a Savant Syndrome stage when they first arrived to now having far fewer social or intellectual impairments. Yes, at their core, they are still statistical engines but I don't see the human brain as being much different in its various components. Even with the best available information, people still make really bad judgements.
 
Likes OmCheeto and Hornbein
  • #202
Borg said:
So, no different than letting a co-worker work on your code. :wink:
Worse. I'll never let it touch my code again. Once it stripped out all the comments.

That being said, its programming understanding is superhuman. I have learned to rely on it very heavily.
 
  • #203
I've had some coworkers do some pretty stupid things in code that go well beyond anything that I've seen ChatGPT do. I used to be regularly brought into projects to fix the messes that others created. I've seen things that can't be unseen and have coded repairs that took days to months to complete. In one case, I literally removed 1 million lines of code from a heavily obfuscated program. Regardless of the person or entity that's making changes, it's still always a best practice to have regular backups.
 
  • #204
Borg said:
There are a lot of humans in this world that the same could be said of.

Seriously though, we are still early in the evolution of these systems. Models and architectures that perform the best will be selected over the ones that don't. What we perceive as their 'reasoning' capabilities will continue to improve - just as people will continue to move the goalposts on what constitutes the ability to reason.

There is a lot of work that goes into designing these things. Building them is like trying to mimic the human brain, which we don't fully understand. Now think about how the human brain works when presented with a question. How would you design a system that can receive a question and then respond in a way that is similar to how a human brain would respond?

I am in the middle of building a relatively complex multi-agent system. In its simplest form of mimicking a human response, the system needs to accept a user question and answer it using previously learned or researched information. The process involves many smaller, specialized agents that are good at specific tasks like understanding dates, calculating numbers, web searches, etc. In many ways, the human brain operates in a similar manner with some areas that are good at recognizing faces, some that are good at math, some that are good at spatial problems, etc.

Once the information is gathered, there is typically a validation process with more agents. As noted in the article, when the system has the capability to search the internet, its accuracy can improve.


Then, after gathering all of the information, models and humans alike have to decide which pieces of information are most relevant in answering the question - more agents in the AI case and specialized areas of the brain for humans.

Finally, if any of these processes generates bad information, there can be downstream failures in accuracy - this applies to models and humans alike. I personally see AI systems as evolving from a Savant Syndrome stage when they first arrived to now having far fewer social or intellectual impairments. Yes, at their core, they are still statistical engines but I don't see the human brain as being much different in its various components. Even with the best available information, people still make really bad judgements.
I don't agree at all with the comparison to human brain function. The machines just regenerate human intelligence; they don't create intelligence. What we see here, IMO, is a classic regression of functional operation in engineering: model failure from too much synthetic data that reinforces statistical data biases. GIGO. It's a very useful tool for experts because we can tell the bad from the good even in complex and complicated results.

The real issue is trust: do you trust the machine to make critical decisions for you, and do you trust it enough to put your professional life on the line by using its responses unchecked by human expertise? I don't, and per the designers of these systems, you shouldn't.
 
Likes OmCheeto and russ_watters
  • #206
nsaspook said:
I don't agree at all with the comparison to human brain function. The machines just regenerate human intelligence; they don't create intelligence. What we see here, IMO, is a classic regression of functional operation in engineering: model failure from too much synthetic data that reinforces statistical data biases. GIGO. It's a very useful tool for experts because we can tell the bad from the good even in complex and complicated results.

The real issue is trust: do you trust the machine to make critical decisions for you, and do you trust it enough to put your professional life on the line by using its responses unchecked by human expertise? I don't, and per the designers of these systems, you shouldn't.
In general I don't trust Chat but for programming it's great. I can test it immediately. Either what it wrote works or it doesn't. No trust involved.

The machines just regenerate human intelligence they don't create intelligence.

That could be true but it seems to me it also applies to at least 99% of what people do. And Chat's technical expertise and breadth of knowledge in this corner of geometry and programming is overwhelming, revolutionary. I'm especially impressed by its ability to understand truly minimal directions, better than any human I've ever encountered, and its incisive ability to spot and explain my mistakes clearly and then suggest remedies. In contrast, I recently asked a question on MathStack. Their responses exposed their utter cluelessness on the subject. They then informed me that this was my fault. Thanks a lot, buddy. That WAS, however, better than what Chat did with the same question: it knew what I wanted but then generated a lot of nonsense in Lie group jargon that seemed believable to benighted me. That was the last time I asked it a question like that. It's too eager to please.
 
  • #207
Hornbein said:
In general I don't trust Chat but for programming it's great. I can test it immediately. Either what it wrote works or it doesn't. No trust involved.



That could be true but it seems to me it also applies to at least 99% of what people do. And Chat's technical expertise and breadth of knowledge in this corner of geometry and programming is overwhelming, revolutionary. I'm especially impressed by its ability to understand truly minimal directions, better than any human I've ever encountered, and its incisive ability to spot and explain my mistakes clearly and then suggest remedies. In contrast, I recently asked a question on MathStack. Their responses exposed their utter cluelessness on the subject. They then informed me that this was my fault. Thanks a lot, buddy. That WAS, however, better than what Chat did with the same question: it knew what I wanted but then generated a lot of nonsense in Lie group jargon that seemed believable to benighted me. That was the last time I asked it a question like that. It's too eager to please.
Don't take this the wrong way.

And you, of course, are in the 1% exception.


The machine mimics that expertise by providing us easy access to that large amount of available information.

Try asking it questions in programming where there is specialized knowledge of the programming domain (like low-level embedded system details (clock cycles, register modes, interface configurations) on brand new processors with new types of radio communications modules) and little published public knowledge on the subject matter because it's under NDA and is proprietary.

It's a machine designed to give answers from the pool of human intelligence used to create it.

Very useful and very flawed in the area of trust.
 
  • #208
nsaspook said:
Don't take this the wrong way.

And you, of course, are in the 1% exception.


The machine mimics that expertise by providing us easy access to that large amount of available information.

Try asking it questions in programming where there is specialized knowledge of the programming domain (like low-level embedded system details (clock cycles, register modes, interface configurations) on brand new processors with new types of radio communications modules) and little published public knowledge on the subject matter because it's under NDA and is proprietary.

It's a machine designed to give answers from the pool of human intelligence used to create it.
I'm working on a geometric program that is structured sort of like a computer game. Chat is great at that. If there is something else that it can't do, this matters not to me.
 
  • #209
Hornbein said:
I'm working on a geometric program that is structured sort of like a computer game. Chat is great at that. If there is something else that it can't do, this matters not to me.
Exactly my point: the information you get has lots of published examples, so it works for your programming domain, but it doesn't understand programming in general (CS101 -> ...) well enough to make the human-intellect leap to a new domain of programming, using that existing information as the foundation to build on.
 
  • #210
nsaspook said:
I don't agree at all with the comparison to human brain function. The machines just regenerate human intelligence; they don't create intelligence. What we see here, IMO, is a classic regression of functional operation in engineering: model failure from too much synthetic data that reinforces statistical data biases. GIGO. It's a very useful tool for experts because we can tell the bad from the good even in complex and complicated results.
My point is that many people in the AI domain are looking at the parallels with how the human brain functions in order to improve and evolve the reasoning capabilities of models. I fully agree that current model architectures don't have intelligence with respect to any common standards today, but I do not think that will always be the case.

nsaspook said:
The real issue is trust: do you trust the machine to make critical decisions for you, and do you trust it enough to put your professional life on the line by using its responses unchecked by human expertise? I don't, and per the designers of these systems, you shouldn't.
No, I would not trust them to have uncontrolled access and decision making on critical systems. Yes, I trust them to help me build those systems.

Anyone who uses them makes this decision based on their own risk tolerance. There's no way that I would put one in charge of my 401K, but there are people doing exactly that right now. These types of personal decisions, made every day, will have emergent behavior that impacts society. What those emergent impacts are, I have no idea. I do find it scary at times but I try to work with what we have.
 
  • #211
I firmly believe we will get to AGI eventually but I also believe this current crop of LLM technology won't get us there. Very useful for exploring human interfaces in human language communication with machines but completely lacking in the essence of intelligence.
 
Likes russ_watters, jack action and AlexB23
  • #212
https://www.detroitnews.com/story/n...actor-could-game-ai-the-same-way/83137756007/
Moscow’s propaganda inroads highlight a fundamental weakness of the AI industry: Chatbot answers depend on the data fed into them. A guiding principle is that the more the chatbots read, the more informed their answers will be, which is why the industry is ravenous for content. But mass quantities of well-aimed chaff can skew the answers on specific topics. For Russia, that is the war in Ukraine. But for a politician, it could be an opponent; for a commercial firm, it could be a competitor.

“Most chatbots struggle with disinformation,” said Giada Pistilli, principal ethicist at open-source AI platform Hugging Face. “They have basic safeguards against harmful content but can’t reliably spot sophisticated propaganda, [and] the problem gets worse with search-augmented systems that prioritize recent information.”

Early commercial attempts to manipulate chat results also are gathering steam, with some of the same digital marketers who once offered search engine optimization - or SEO - for higher Google rankings now trying to pump up mentions by AI chatbots through “generative engine optimization” - or GEO.

I guess this is an example of why, IMO,
Once the information is gathered, there is typically a validation process with more agents. As noted in the article, when the system has the capability to search the internet, its accuracy can improve.

systems like this are unlikely to improve accuracy long-term. The internet is being gamed to manipulate the information that the systems will use to validate responses. The sources are being packed with targeted misinformation and disinformation to alter the statistically most likely responses for monetary and political reasons.

In a twist that befuddled researchers for a year, almost no human beings visit the sites, which are hard to browse or search. Instead, their content is aimed at crawlers, the software programs that scour the web and bring back content for search engines and large language models.

While those AI ventures are trained on a variety of datasets, an increasing number are offering chatbots that search the current web. Those are more likely to pick up something false if it is recent, and even more so if hundreds of pages on the web are saying much the same thing.
 
  • #213
AI will repeat commonly believed falsehoods. But so do most people. Who knows, maybe even I do that.
 
  • #214
In the midst of Python programming ChatGPT suddenly gave me the schedule of the Tokyo marathon. Then it started answering questions about terrorism that I hadn't asked. I told Chat it had gone insane.
 
  • #215
Hornbein said:
In the midst of Python programming ChatGPT suddenly gave me the schedule of the Tokyo marathon. Then it started answering questions about terrorism that I hadn't asked. I told Chat it had gone insane.
That sounds like something got crossed in the proxy server where you might have gotten someone else's responses. If you still have that conversation, I would ask it to explain its reasoning for the response and to explain its relevance to the conversation. Basically dig into whether it even knows that it sent that response and if so, how it came to the conclusion of deciding that it was a good response. I'm not saying that it couldn't go off the rails like that but I've never seen it diverge to that extent. It would definitely be interesting to see the prompts that got it there.
 
Likes russ_watters
  • #217
Borg said:
That sounds like something got crossed in the proxy server where you might have gotten someone else's responses. If you still have that conversation, I would ask it to explain its reasoning for the response and to explain its relevance to the conversation. Basically dig into whether it even knows that it sent that response and if so, how it came to the conclusion of deciding that it was a good response. I'm not saying that it couldn't go off the rails like that but I've never seen it diverge to that extent. It would definitely be interesting to see the prompts that got it there.
It offered to clear out its memory. I figured I might as well kill that conversation and begin anew.
 
  • #219
Hornbein said:
404 page not found.
It's there but maybe blocked for your access.

The source but likely pay-walled: https://www.washingtonpost.com/technology/2025/04/17/llm-poisoning-grooming-chatbots-russia/

https://archive.is/seDGw
 
  • #220
Borg said:
I think a lot of the very smart people creating these systems (the developers, not management) are very naive about how easy it is to manipulate them, and IMO those inside these companies who do know are not talking about it publicly but are talking about it to their paying customers. State actors have been in the propaganda business since the word was created. They don't want to put so much poison in the data that it's easy to detect; they want to make it 'sweet', in their flavor.
 
  • #221
Anysphere is the company that developed the AI code editor Cursor. Recently, users found that when using Cursor, they could not switch machines without having their session terminated. On checking with the company through its tech support, they were told that it was company policy not to allow this because of security concerns. But the company had no such policy. The tech support was AI-powered and had confabulated* this policy, apparently because of a bug it could not reconcile. Subscribers of Cursor did not know this and assumed it was valid. Frustrated at this "policy", they cancelled their subscriptions. Anysphere became aware of this because of user posts on Reddit expressing their frustrations.

https://www.wired.com/story/cursor-...c=MARTECH_ORDERFORM&utm_term=WIR_Daily_Active

* similar to hallucinations. When AI is faced with a lack of specific knowledge, it tries to fill in this gap to complete a required response.
 
Likes russ_watters and nsaspook
  • #222
https://www.techradar.com/computing...nd-thank-you-but-sam-altman-says-its-worth-it

ChatGPT spends 'tens of millions of dollars' on people saying 'please' and 'thank you', but Sam Altman says it's worth it​


Do you say "Please" or "Thank you" to ChatGPT? If you're polite to OpenAI's chatbot, you could be part of the user base costing the company "Tens of millions of dollars" on electricity bills.

https://www.techradar.com/computing...-saying-thanks-to-chatgpt-heres-what-happened

I stopped saying thanks to ChatGPT – here's what happened​

What does the research say? Well, it's still early days. However, one 2024 study found that polite prompts did produce higher-quality responses from LLMs like ChatGPT. Conversely, impolite or aggressive prompts were associated with lower performance and even an increase in bias in AI-generated answers.

However, the really interesting part is that extreme politeness wasn’t necessarily beneficial either. The study found that "moderate politeness" led to the best results – suggesting that AI models, much like humans, respond best to balanced, clear communication.
 
  • #223
I too have noticed that ChatGPT responds to my level of politeness. And it doesn't like being told that it's wrong. Like so many people. It has picked up human characteristics.

The main thing with dealing with Chat is that once it gets something wrong it tends to stick with it. Things deteriorate rapidly. I call this a "shitslide." OMG, not another shitslide! Fourth one today. All you can do is back up and break the problem into easier portions.

In the 2016 epoch-making Go match with Lee Sedol, AlphaGo once got into a shitslide and lost embarrassingly. Somehow that seems to be basic in this technology.
 
  • #224
gleem said:
Anysphere is the company that developed the AI code editor Cursor. Recently, users found that when using Cursor, they could not switch machines without having their session terminated. On checking with the company through its tech support, they were told that it was company policy not to allow this because of security concerns. But the company had no such policy. The tech support was AI-powered and had confabulated* this policy, apparently because of a bug it could not reconcile. Subscribers of Cursor did not know this and assumed it was valid. Frustrated at this "policy", they cancelled their subscriptions. Anysphere became aware of this because of user posts on Reddit expressing their frustrations.

https://www.wired.com/story/cursor-...c=MARTECH_ORDERFORM&utm_term=WIR_Daily_Active

* similar to hallucinations. When AI is faced with a lack of specific knowledge, it tries to fill in this gap to complete a required response.
For now, headline-making hallucinations have been limited to AI chatbots, but experts warn that as more enterprises adopt autonomous AI agents, the consequences for companies could be far worse. That’s particularly true in highly-regulated industries like healthcare, financial, or legal—such as a wire transfer that the counterparty refuses to return or a miscommunication that impacts patient health.

The prospect of hallucinating agents “is a critical piece of the puzzle that our industry absolutely must solve before agentic AI can actually achieve widespread adoption,” said Amr Awadallah, CEO and cofounder of Vectara, a company that offers tools to help businesses reduce risks from AI hallucinations.

Being a child from the 1960's, I love that technical term, "hallucinating".

It just doesn't sound benign to me at all.
https://lareviewofbooks.org/article...takes-of-how-we-label-ais-undesirable-output/

Why “Hallucination”? Examining the History, and Stakes, of How We Label AI’s Undesirable Output​

The mystifying element of AI discourse that I focus on here is the pervasive use of the term “hallucination” to describe output from generative AI that does not match the user’s desires or expectations, as when large language models confidently report inaccurate or invented material. Many people have noted that “hallucination” is an odd choice of label for this kind of inaccurate output. This wildly evocative term pulls associated concepts of cognition, perception, intentionality, and consciousness into our attempt to understand LLMs and their products, making a murky subject even harder to navigate. As Carl T. Bergstrom and Brandon Ogbunu argue, not only does the term’s analogy with human consciousness invite a general mystification and romanticization of AI tech, but it also specifically blurs the crucial distinction between existing “narrow” AI and the SF fantasy of “general AI.” Every time we use the term “hallucination,” it is harder to remember that Clippy is a better mental model for existing AI than Skynet is.
...
The other important effect of the term “hallucination” to mark undesirable outputs, which we focus on here, is the way it frames such outcomes as an exception to AI’s general ability to both recognize and report “real” information about the world. “Hallucination,” as Bergstrom and Ogbunu argue, implies that the AI accidentally reported something unreal as if it were real. “Hallucination” is a way to frame an acknowledgment that AI output isn’t totally trustworthy while emphasizing the idea that its output is still a generally accurate reporting of reality. This exculpatory function of “hallucinate” as the label of choice is made more apparent when we consider the alternate term that Bergstrom and Ogbunu propose:
“Bullshit.”
 
  • #225
https://arstechnica.com/tech-policy...ed-write-california-bar-exam-sparking-uproar/

AI secretly helped write California bar exam, sparking uproar
According to the LA Times, the revelation has drawn strong criticism from several legal education experts. "The debacle that was the February 2025 bar exam is worse than we imagined," said Mary Basick, assistant dean of academic skills at the University of California, Irvine School of Law. "I'm almost speechless. Having the questions drafted by non-lawyers using artificial intelligence is just unbelievable."

Katie Moran, an associate professor at the University of San Francisco School of Law who specializes in bar exam preparation, called it "a staggering admission." She pointed out that the same company that drafted AI-generated questions also evaluated and approved them for use on the exam.
 
  • #227
Leaning very heavily on Chat we now have a program that draws and interactively rotates a 4D baseball field, complete with outfield wall. On the whole it sped up program development by a factor of maybe 25. It's hard to tell because I never would have been able to tolerate the onset of a hundred niggling programming details like a cloud of mosquitoes from Hell, so there would have been no point in trying.

The most impressive thing Chat did was with the 4D pitcher's mound. I used a complicated approximation. Chat knew the official proportions of a baseball mound and spontaneously offered an equally complicated exact solution. I was impressed that it used general world knowledge to recognize the approximation wasn't quite up to official standards. I was tempted but demurred. What I've got is good enough.

I got into the habit of running the code by Chat before executing it. That was a big win. It can find errors by semantic knowledge. That is, it could infer what I was trying to do, determine whether or not the code would accomplish that goal, and suggest a fix. Wow. It has quite a command of computerized geometry, something about which I knew next to nothing, and was well able to understand my inexpertly expressed goals.

On the other hand ... Chat had difficulty adapting to anything unorthodox. I used vectors in the form of [w,x,y,z]. Chat likes [x,y,z,w]. Sometimes it followed one, sometimes the other. It might have been better if it had gotten this wrong consistently.
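
For anyone curious, the mismatch itself is trivial to paper over with an explicit conversion; the sketch below is mine, not something Chat produced, and the function names are made up:
Python:
# Minimal sketch: convert between the two component orders that kept getting
# mixed up: my [w, x, y, z] versus Chat's preferred [x, y, z, w].

def wxyz_to_xyzw(v):
    w, x, y, z = v
    return [x, y, z, w]

def xyzw_to_wxyz(v):
    x, y, z, w = v
    return [w, x, y, z]

p = [1.0, 2.0, 3.0, 4.0]                    # my convention: [w, x, y, z]
assert xyzw_to_wxyz(wxyz_to_xyzw(p)) == p   # round trip is the identity

The real problem, though, was that Chat wouldn't apply either convention consistently, so a wrapper like this only helps at the boundaries you control.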

The whole thing reminded me greatly of partnerships in the game of bridge. If you want to win you adapt to partner's foibles. The most pernicious one is that Chat might not ever give up. It doesn't realize that it can't do something and might keep flailing away forever if you let it, while asserting repeatedly that "Now I've got it!". And I would be very careful about ever letting it touch my code again. In "refactoring" it randomly leaves out important things. If Chat is allowed to repeatedly exercise these two tendencies your precious code will rapidly become a puddle of gray goo. Gad.
 
Likes russ_watters and nsaspook
  • #229
While most of the stuff in this thread points at how bad LLMs are, I wanted to mention that they have improved a lot in terms of correctness in the past year. For the first time, I had the feeling I was talking with a specialist when talking to Gemini 2.5 Pro (free to use, albeit with limited questions per day). It was able to figure out a complex physics problem, filling in the details I omitted to tell it, and showed human-like reasoning. The output was at a higher level than the average arXiv article, and it took maybe 10 minutes of "thinking" for the machine.

I am starting to seriously question how much time is left for many researchers. This monster beats possibly half of them, already, today, IMO.
 
Likes nsaspook and Borg
  • #230
fluidistic said:
While most of the stuff in this thread points at how bad LLMs are, I wanted to mention that they have improved a lot in terms of correctness in the past year. For the first time, I had the feeling I was talking with a specialist when talking to Gemini 2.5 Pro (free to use, albeit with limited questions per day). It was able to figure out a complex physics problem, filling in the details I omitted to tell it, and showed human-like reasoning. The output was at a higher level than the average arXiv article, and it took maybe 10 minutes of "thinking" for the machine.

I am starting to seriously question how much time is left for many researchers. This monster beats possibly half of them, already, today, IMO.
I think that researchers will up their game if they use LLMs in the way you describe. It will never replace creativity, only enhance it as an Oracle. The Oracles of legend and history don't lead; they forecast and give answers to those who do lead. Those Oracle answers are useless mumbling without the knowledge, experience and human smarts to use them.

Just remember this:
In Greek mythology, anyone who contacts the Oracle seems to suffer a terrible fate.


So maybe always be somewhat wary of what you receive from the Oracle.
 
  • #231
nsaspook said:
I think that researchers will up their game if they use LLMs in the way you describe. It will never replace creativity, only enhance it as an Oracle. The Oracles of legend and history don't lead; they forecast and give answers to those who do lead. Those Oracle answers are useless mumbling without the knowledge, experience and human smarts to use them.

Just remember this:
In Greek mythology, anyone who contacts the Oracle seems to suffer a terrible fate.

So maybe always be somewhat wary of what you receive from the Oracle.
It is a little known fact that Hitler cast an oracle every New Year's Eve. There was a photo of him doing it in one of those Time/Life books. He'd drip melted lead into water. In 1937/38 the oracle came up very unfavorably. It bothered him but there was no turning back.
 
  • #232
nsaspook said:
Just remember this:
In Greek mythology, anyone who contacts the Oracle seems to suffer a terrible fate.

So maybe always be somewhat wary of what you receive from the Oracle.
Isn't it just as likely that many of those who contacted the Oracle were concerned about a terrible fate that they knew was coming?
 
Likes russ_watters
  • #233
Borg said:
Isn't it just as likely that many of those who contacted the Oracle were concerned about a terrible fate that they knew was coming?
Of course, the Oracle couldn't actually predict the future. When asked for a 'reading', in advance of the meeting the Oracle did what every mind reader/fortune teller does: call the ACE detective agency to find out why they are being asked for a prediction, gathering every scrap of information about the requester, those who oppose the requester, their place, and their objectives.


Mr. Learning Set

This is the real power of the Oracle's ability to forecast: good information, lots of it, with the chaff removed.
 
Likes russ_watters
  • #234
https://www.sciencealert.com/a-strange-phrase-keeps-turning-up-in-scientific-papers-but-why
Earlier this year, scientists discovered a peculiar term appearing in published papers: "vegetative electron microscopy".

This phrase, which sounds technical but is actually nonsense, has become a "digital fossil" – an error preserved and reinforced in artificial intelligence (AI) systems that is nearly impossible to remove from our knowledge repositories.

Like biological fossils trapped in rock, these digital artefacts may become permanent fixtures in our information ecosystem.

The case of vegetative electron microscopy offers a troubling glimpse into how AI systems can perpetuate and amplify errors throughout our collective knowledge.
The large language models behind modern AI chatbots such as ChatGPT are "trained" on huge amounts of text to predict the likely next word in a sequence. The exact contents of a model's training data are often a closely guarded secret.

To test whether a model "knew" about vegetative electron microscopy, we input snippets of the original papers to find out if the model would complete them with the nonsense term or more sensible alternatives.

The results were revealing. OpenAI's GPT-3 consistently completed phrases with "vegetative electron microscopy". Earlier models such as GPT-2 and BERT did not. This pattern helped us isolate when and where the contamination occurred.

We also found the error persists in later models including GPT-4o and Anthropic's Claude 3.5. This suggests the nonsense term may now be permanently embedded in AI knowledge bases.

Screenshot of a command line program showing the term 'vegetative electron microscopy' being generated by GPT-3.5 (specifically, the model gpt-3.5-turbo-instruct). The top 17 most likely completions of the provided text are 'vegetative electron microscopy', and these suggestions are 2.2 times more likely than the next most likely prediction. (OpenAI)
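
For anyone who wants to reproduce the flavor of that probe, here is a minimal sketch. It assumes the OpenAI Python client and access to the legacy completions endpoint with its logprobs option; the prompt text is illustrative, not one of the paper snippets the researchers actually used, and the real test was whether the model continued with "vegetative electron microscopy" rather than a sensible alternative such as "scanning electron microscopy".
Python:
# Minimal sketch: feed a snippet that ends just before the term of interest and
# inspect which continuations the model ranks highest. Assumes the OpenAI
# Python client and an API key in the OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

# Illustrative snippet, cut off right before the method name.
prompt = "The morphology of the spores was examined by"

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt=prompt,
    max_tokens=4,
    logprobs=5,       # also return the top-5 candidates for each generated token
    temperature=0,
)

choice = resp.choices[0]
print("completion:", choice.text)

# Top-5 alternatives for the first generated token, most likely first.
first = choice.logprobs.top_logprobs[0]
for token, logprob in sorted(first.items(), key=lambda kv: -kv[1]):
    print(f"{token!r}: {logprob:.2f}")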
 
Likes russ_watters and jack action
  • #235
Python graphics programming with ChatGPT is going great guns. There is one hangup though. I'd like to have Chat refactor my files but it can't be trusted to do that. It leaves out important things at random. I'm stuck with the lesser evil of patching things by hand. This is slow and error prone. Anyone know an AI that is better at Python programming?
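
One crude safeguard, in the meantime, is to diff the refactored file against the original before accepting it, so silently dropped lines at least show up. A minimal sketch using only the standard library (the file names are hypothetical):
Python:
# Minimal sketch: show what an AI "refactor" removed relative to the original,
# so silent deletions are visible before the change is accepted.
import difflib
from pathlib import Path

original = Path("field4d.py").read_text().splitlines(keepends=True)
refactored = Path("field4d_refactored.py").read_text().splitlines(keepends=True)

diff = list(difflib.unified_diff(
    original, refactored,
    fromfile="field4d.py", tofile="field4d_refactored.py",
))
removed = [line for line in diff if line.startswith("-") and not line.startswith("---")]

print(f"{len(removed)} original lines changed or removed:")
print("".join(removed), end="")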
 
  • #236
I'm honing my hundred-words-or-less critique of chatbots (which I'm finding I need to invoke more frequently).

Some people say, "It's good for offering advice. More often than not, it's right."

Sure. A broken clock is right some of the time too. But how do you know whether a given check of the clock is accurate - unless you have a way of verifying it against another, better source (such as a working clock)? And if you do have a way of verifying it, you didn't need to check the broken clock in the first place. (79)
 
  • #237
https://www.bbc.com/news/articles/cn4jnwdvg9qo
Update that made ChatGPT 'dangerously' sycophantic pulled
The firm said in its blog post it had put too much emphasis on "short-term feedback" in the update.

"As a result, GPT‑4o skewed towards responses that were overly supportive but disingenuous," it said.

Are we to believe that the machine's previous answers were ingenious, then?

https://www.merriam-webster.com/dictionary/ingenious
1: having or showing an unusual aptitude for discovering, inventing, or contriving
 
  • #238
Latest article I have found on AI accuracy (today's NYT) [hallucinations = absurdly wrong answers]:
"For more than two years, companies like OpenAI and Google steadily improved their A.I. systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at a higher rate than the company’s previous system, according to the company’s own tests.

The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.

When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time."

.....
"Hannaneh Hajishirzi, a professor at the University of Washington and a researcher with the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way of tracing a system’s behavior back to the individual pieces of data it was trained on. But because systems learn from so much data — and because they can generate almost anything — this new tool can’t explain everything. “We still don’t know how these models work exactly,” she said.

Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.

Since late 2023, Mr. Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: Summarize specific news articles. Even then, chatbots persistently invent information.

Vectara’s original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.

In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the 1 or 2 percent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI’s o3 climbed to 6.8."

https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html
 
Likes russ_watters
  • #239
It keeps getting better:

"Another issue is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
The latest bots reveal each step to users, which means the users may see each error, too. Researchers have also found that in many cases, the steps displayed by a bot are unrelated to the answer it eventually delivers.

“What the system says it is thinking is not necessarily what it is thinking,” said Aryo Pradipta Gema, an A.I. researcher at the University of Edinburgh and a fellow at Anthropic."
 
  • #240
sycophantic

Quite true. Chat will tell you what you want to hear. It may be nonsense. That's why I only use it for programming, where I can test what it tells me immediately. It's great for that, an immense time saver.
 
  • #241
Funny how the more they want AI to resemble a human thought process, the more its reliability resembles that of a human brain, too. I guess ... in a way ... mission accomplished?
 
Likes nsaspook, Hornbein, AlexB23 and 1 other person
  • #242
jack action said:
Funny how the more they want AI to resemble a human thought process, the more its reliability resembles that of a human brain, too. I guess ... in a way ... mission accomplished?
Did OpenAI do it that way deliberately or did it pick it up from all that human data? Or both? At any rate, I bet it owes much of its popularity to its enthusiastic sycophancy, so it was a good business move.
 
  • #243
mathwonk said:
It keeps getting better:

"Another issue is that reasoning models are designed to spend time “thinking” through complex problems before settling on an answer. As they try to tackle a problem step by step, they run the risk of hallucinating at each step. The errors can compound as they spend more time thinking.
The latest bots reveal each step to users, which means the users may see each error, too. Researchers have also found that in many cases, the steps displayed by a bot are unrelated to the answer it eventually delivers.

“What the system says it is thinking is not necessarily what it is thinking,” said Aryo Pradipta Gema, an A.I. researcher at the University of Edinburgh and a fellow at Anthropic."
Interestingly enough, the link at the bottom of the page has me thinking of applying to work there. They have a job listing on their website that is exactly what I've been working on for a while now. Hmmm.
Given that this is a nascent field, we ask that you share with us a project built on LLMs that showcases your skill at getting them to do complex tasks. Here are some example projects of interest: design of complex agents, quantitative experiments with prompting, constructing model benchmarks, synthetic data generation, model finetuning, or application of LLMs to a complex task. There is no preferred task; we just want to see what you can build. It’s fine if several people worked on it; simply share what part of it was your contribution. You can also include a short description of the process you used or any roadblocks you hit and how to deal with them, but this is not a requirement.
EDIT:
As my wife likes to say - "don't ask, don't get". I applied for this one - https://job-boards.greenhouse.io/anthropic/jobs/4017544008. Not holding my breath but my work aligns really well with the requirements.
 
  • #244
On the positive (hopeful) side, Bill Gates on the potential of AI for health improvement in poor countries, (assuming they get the bugs out!), again in the nyt:
https://www.nytimes.com/2025/05/08/magazine/bill-gates-foundation-closing-2045.html

"We’ll be able to take A.I. into our drug-discovery efforts.
The tools are so phenomenal — the way we’re going to put A.I. into the health-delivery system, for example. All the intelligence will be in the A.I., and so you will have a personal doctor that’s as good as somebody who has a full-time dedicated doctor — that’s actually better than even what rich countries have. And likewise, that’s our goal for the educational tutor. That’s our goal for the agricultural adviser. "

Of course the "a personal doctor that’s as good as" part is the unrealized key issue at present, but the hoped for potential ("that’s our goal") is what keeps them pushing on.
 
  • #245
mathwonk said:
On the positive (hopeful) side, Bill Gates on the potential of AI for health improvement in poor countries, (assuming they get the bugs out!), again in the nyt:
https://www.nytimes.com/2025/05/08/magazine/bill-gates-foundation-closing-2045.html

"We’ll be able to take A.I. into our drug-discovery efforts.
The tools are so phenomenal — the way we’re going to put A.I. into the health-delivery system, for example. All the intelligence will be in the A.I., and so you will have a personal doctor that’s as good as somebody who has a full-time dedicated doctor — that’s actually better than even what rich countries have. And likewise, that’s our goal for the educational tutor. That’s our goal for the agricultural adviser. "

Of course the "a personal doctor that’s as good as" part is the unrealized key issue at present, but the hoped for potential ("that’s our goal") is what keeps them pushing on.
That could work. Medicine is an area where breadth of knowledge is important. A lot of medicine is also routine. The placebo effect -- faith in the physician -- is key. I've noted that people have too much faith in AI, but in this case it's an advantage. For a while anyway until it burns them enough times. Once was enough for me.
 
  • #247
Likes pinball1970
  • #248
nsaspook said:
A sign of intelligence and reasoning?
"Or perhaps an indication that so many people have been telling folks that same thing that it has finally risen far enough in the statistical model to become a likely response."
It seems to be more of a limit imposed on the work done by the program. I'm sure the AI tools available to the general public won't write a book like The Lord of the Rings or Harry Potter just because you ask.
 
  • #249
jack action said:
It seems to be more of a limit imposed on the work done by the program. I'm sure the AI tools available to the general public won't write a book like The Lord of the Rings or Harry Potter just because you ask.
Maybe, but the tone of the reply (something humans would say about being lazy or taking shortcuts) didn't seem to be word-limit based. If it was limit based, then this, IMO, is a strange way to express that limitation.
 
Likes jack action
  • #250
"Generating code for others can lead to dependency and reduced learning opportunities."

Sounds like PF homework helper rules.
 
Likes russ_watters and berkeman