ChatGPT Was Asked the Same Question 10 Times. The Answers Kept Changing

ChatGPT may sound confident, but when tested on complex scientific claims, it often guesses and even contradicts itself. Researchers found it struggles especially with spotting false information. Credit: Shutterstock

ChatGPT can sound convincing, but this study shows it still struggles to tell what’s actually true.

Washington State University professor Mesut Cicek and his team repeatedly evaluated ChatGPT by giving it hypotheses drawn from scientific studies. The AI was asked to decide whether each statement was supported by research, essentially judging whether it was true or false.

In total, the researchers tested more than 700 hypotheses and submitted each one 10 times to examine how consistent the responses would be.

Accuracy Results and Performance Limits

In the initial 2024 experiment, ChatGPT answered correctly 76.5% of the time. When the study was repeated in 2025, accuracy rose slightly to 80%. However, once the results were adjusted for random guessing, the performance looked far less reliable. The AI was only about 60% better than chance, which the researchers described as closer to a low D than to strong performance.

The system had particular difficulty identifying false statements, correctly labeling them only 16.4% of the time. It was also inconsistent: when given the exact same prompt 10 times, ChatGPT produced consistent results for only about 73% of the hypotheses.
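To make these two measures concrete, here is a minimal Python sketch. It assumes that "consistent" means all 10 repeated answers to a hypothesis agree, and that the false-statement figure is the share of individual answers to false hypotheses that correctly say "false"; the article does not define either measure precisely. The function names and the four toy hypotheses are hypothetical illustrations, not data from the study.

```python
# Hypothetical sketch: the definitions below are assumptions, not the study's method.

def consistency_rate(runs: list[list[bool]]) -> float:
    """Share of hypotheses whose repeated answers all agree (assumed definition of 'consistent')."""
    consistent = sum(1 for answers in runs if len(set(answers)) == 1)
    return consistent / len(runs)


def accuracy_on_false(labels: list[bool], runs: list[list[bool]]) -> float:
    """Share of individual answers to false hypotheses that correctly say 'false'."""
    answers_to_false = [
        answer
        for label, answers in zip(labels, runs)
        if not label
        for answer in answers
    ]
    correct = sum(1 for answer in answers_to_false if not answer)
    return correct / len(answers_to_false)


# Toy data for four hypothetical hypotheses, each asked 10 times (True = "supported").
labels = [True, True, False, False]
runs = [
    [True] * 10,               # consistent and correct
    [True] * 5 + [False] * 5,  # a five-true, five-false split
    [True] * 10,               # consistent but wrong
    [False] * 3 + [True] * 7,  # inconsistent, mostly wrong
]

print(consistency_rate(runs))           # 0.5
print(accuracy_on_false(labels, runs))  # 0.15
```

As the third toy hypothesis shows, an answer can be perfectly consistent and still wrong, so consistency and accuracy measure different things.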

Inconsistent Answers to Identical Questions

“We’re not just talking about accuracy, we’re talking about inconsistency, because if you ask the same question again and again, you come up with different answers,” said Cicek, an associate professor in the Department of Marketing and International Business in WSU’s Carson College of Business and lead author of the new publication.

“We used 10 prompts with the same exact question. Everything was identical. It would answer true. Next, it says it’s false. It’s true, it’s false, false, true. There were several cases where there were five true, five false.”

AI Fluency Versus Real Understanding

The study, published in the Rutgers Business Review, highlights the need for caution when using AI for important decisions, especially those involving nuance or complex reasoning. While generative AI can produce fluent and convincing language, it does not necessarily demonstrate true understanding.

Cicek said the findings suggest that artificial general intelligence capable of genuine reasoning may still be further away than some expect.

“Current AI tools don’t understand the world the way we do — they don’t have a ‘brain,’” Cicek said. “They just memorize, and they can give you some insight, but they don’t understand what they’re talking about.”

Study Design and Methods

Cicek worked alongside Sevincgul Ulu of Southern Illinois University, Can Uslay of Rutgers University, and Kate Karniouchina of Northeastern University.

The team analyzed 719 hypotheses from scientific papers published in business journals since 2021. Determining whether research supports a hypothesis is often complex, involving multiple factors that can influence the outcome. Reducing that complexity to a simple true-or-false decision requires careful reasoning.

The researchers tested the free version of ChatGPT-3.5 in 2024 and the updated ChatGPT-5 mini in 2025. Overall, results were similar across both versions. After adjusting for random guessing, which would produce a correct answer 50% of the time on a true-or-false question, the AI's performance was only about 60% better than chance in both years.
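The article does not spell out how the adjustment was calculated. One common correction for guessing on a two-choice task, assumed here rather than taken from the study, rescales raw accuracy so that the 50% chance level maps to 0 and perfect accuracy maps to 1. Under that assumption, the 80% accuracy from 2025 works out to a chance-adjusted score of 0.6, in line with the "about 60%" figure; the study's exact method may differ. A minimal Python sketch, where the function name `chance_adjusted` is hypothetical:

```python
def chance_adjusted(accuracy: float, chance: float = 0.5) -> float:
    """Rescale accuracy so chance-level guessing scores 0 and perfect accuracy scores 1.

    Assumed formula: (accuracy - chance) / (1 - chance).
    """
    if not 0 <= chance < 1:
        raise ValueError("chance must be in [0, 1)")
    return (accuracy - chance) / (1 - chance)


# 2025 raw accuracy reported in the article: 80%.
print(round(chance_adjusted(0.80), 3))  # 0.6
```

A score of 0 under this formula means no better than coin-flipping, which is why a raw 80% looks much weaker once guessing is accounted for.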

Key Weakness in AI Reasoning

The findings reveal an important limitation of large language model AI systems. Although they can generate polished and persuasive responses, they often struggle with deeper reasoning. This can lead to answers that sound convincing but are actually incorrect, Cicek said.

Why Experts Urge Caution

Based on these results, the researchers recommend that business leaders verify AI-generated outputs and approach them with skepticism. They also emphasize the importance of training users to understand both the strengths and limitations of AI tools.

While this study focused on ChatGPT, Cicek noted that similar tests with other AI systems have shown comparable outcomes. The research also builds on earlier work highlighting concerns about AI hype. A 2024 national survey found that consumers were less likely to purchase products when they were marketed with a focus on AI.

“Always be skeptical,” he said. “I’m not against AI. I’m using it. But you need to be very careful.”

Reference: “Unstable Intelligence: GenAI Struggles with Accuracy and Consistency” by Mesut Cicek, Sevincgul Ulu, Can Uslay and Kate Karniouchina, Rutgers Business Review, 2025, Vol. 10, No. 2, pp. 266-277.
https://rbr.business.rutgers.edu/article/unstable-intelligence-genai-struggles-accuracy-and-consistency