Latest article I have found on AI accuracy (today's NYT). [hallucinations = absurdly wrong answers]
"For more than two years, companies like OpenAI and Google steadily improved their A.I. systems and reduced the frequency of these errors. But with the use of new reasoning systems, errors are rising. The latest OpenAI systems hallucinate at a higher rate than the company’s previous system, according to the company’s own tests.
The company found that o3 — its most powerful system — hallucinated 33 percent of the time when running its PersonQA benchmark test, which involves answering questions about public figures. That is more than twice the hallucination rate of OpenAI’s previous reasoning system, called o1. The new o4-mini hallucinated at an even higher rate: 48 percent.
When running another test called SimpleQA, which asks more general questions, the hallucination rates for o3 and o4-mini were 51 percent and 79 percent. The previous system, o1, hallucinated 44 percent of the time."
.....
"Hannaneh Hajishirzi, a professor at the University of Washington and a researcher with the Allen Institute for Artificial Intelligence, is part of a team that recently devised a way of tracing a system’s behavior back to the individual pieces of data it was trained on. But because systems learn from so much data — and because they can generate almost anything — this new tool can’t explain everything. “We still don’t know how these models work exactly,” she said.
Tests by independent companies and researchers indicate that hallucination rates are also rising for reasoning models from companies such as Google and DeepSeek.
Since late 2023, Mr. Awadallah’s company, Vectara, has tracked how often chatbots veer from the truth. The company asks these systems to perform a straightforward task that is readily verified: Summarize specific news articles. Even then, chatbots persistently invent information.
Vectara’s original research estimated that in this situation chatbots made up information at least 3 percent of the time and sometimes as much as 27 percent.
In the year and a half since, companies such as OpenAI and Google pushed those numbers down into the 1 or 2 percent range. Others, such as the San Francisco start-up Anthropic, hovered around 4 percent. But hallucination rates on this test have risen with reasoning systems. DeepSeek’s reasoning system, R1, hallucinated 14.3 percent of the time. OpenAI’s o3 climbed to 6.8."
https://www.nytimes.com/2025/05/05/technology/ai-hallucinations-chatgpt-google.html
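
For anyone who wants to put the quoted SimpleQA numbers side by side, here is a rough sketch of the arithmetic. It only uses the three SimpleQA figures quoted above (o1 at 44 percent, o3 at 51 percent, o4-mini at 79 percent). The names SIMPLEQA and compare are my own, made up for illustration, and not anything from the article or from OpenAI.

```python
# Hallucination rates in percent, copied from the SimpleQA figures quoted above.
# SIMPLEQA and compare() are hypothetical names used only for this illustration.
SIMPLEQA = {"o1": 44.0, "o3": 51.0, "o4-mini": 79.0}


def compare(rates, baseline):
    """Return {model: (percentage-point change, ratio)} relative to the baseline model."""
    base = rates[baseline]
    return {
        name: (rate - base, rate / base)
        for name, rate in rates.items()
        if name != baseline
    }


if __name__ == "__main__":
    for name, (delta, ratio) in compare(SIMPLEQA, "o1").items():
        # Show the gap both as percentage points and as a multiple of the o1 rate.
        print(f"{name}: {delta:+.0f} points vs o1, {ratio:.2f}x the o1 rate")
```

Showing both the point gap and the ratio is deliberate: a jump from 44 to 51 percent is 7 points but only a modest multiple, while the PersonQA comparison in the quote is stated as a multiple ("more than twice"), so the two framings are easy to mix up.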