Is It Worth Researching AI Chat Answers? The Data Says Yes

May 27, 2026

Operator question I get from founders every week: 'should I actually spend time researching what ChatGPT and Perplexity say about my brand, or is this whole GEO thing overcooked?' Short verdict: yes, and the published data makes it embarrassing not to. The case is not vibes. It is four large studies from groups with no axe to grind. Here is what they actually show, and the 30-minute monthly audit I run on my own brands.

  • 45% - EBU/BBC: AI answers with a significant issue
  • 60%+ - Tow Center: AI search citation failures
  • 35% - NewsGuard: chatbot false-claim rate on news
  • 17% - Stanford: legal AI hallucination rate

What the research actually says

Four studies, no overlap in funding or methodology, same conclusion: AI chat answers are wrong often enough that not checking is malpractice.

  1. EBU/BBC News Integrity in AI Assistants, October 2025. 22 public service media organisations across 18 countries and 14 languages evaluated more than 3,000 responses from ChatGPT, Copilot, Gemini and Perplexity. 45% had at least one significant issue, 31% had serious sourcing problems, 20% had major accuracy issues including hallucinated details and outdated information. Gemini was the worst performer with 76% of responses flagged. Source: EBU research report, News Integrity in AI Assistants (PDF).
  2. Columbia Tow Center 'AI Search Has a Citation Problem', March 2025. Ran 200 queries on each of 8 AI search engines (1,600 in total), including ChatGPT Search, Perplexity, Gemini and DeepSeek. More than 60% of queries got an incorrect answer. Failure modes included fabricated sources, citations to syndicated or copied versions of articles, and misattributed publishers. The detail that should keep you up: ChatGPT got 134 of its 200 answers wrong and used hedging language only 15 times. Wrong with confidence is the failure mode. Source: Columbia Journalism Review.
  3. NewsGuard one-year AI Audit, September 2025. Monthly benchmark on controversial news topics. Average false-claim rate across 10 leading chatbots: 35%, up from 18% a year earlier. Worst offenders Inflection at 57% and Perplexity at 47%. Claude held flat at a low rate, Gemini rose from 7% to 17%. The reason it got worse: vendors stopped refusing to answer and added live web access. More useful, more confidently wrong, same quarter. Source: NewsGuard.
  4. Stanford HAI / RegLab 'Hallucination-Free?', May 2024, peer-reviewed in the Journal of Empirical Legal Studies in 2025. Tested LexisNexis Lexis+ AI and Thomson Reuters Westlaw AI-Assisted Research, both marketed to lawyers as retrieval-grounded and hallucination-resistant. Lexis+ AI hallucinated on more than 17% of queries. Westlaw AI on roughly 33%. The 'we have RAG, we are safe' pitch from any vendor needs to die. Source: Stanford HAI.

Line the headline numbers up and the floor is roughly a 1-in-6 problem (17%); the ceiling is worse than 1-in-2 (60%+). Pick your favorite number. None of them justify treating AI chat answers as ground truth.
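
To make the headline percentages easier to feel, here is a minimal sketch that converts each rate into "1 in N answers" odds. The labels and the `one_in_n` helper are illustrative names, and the Tow Center "60%+" figure is treated as exactly 60% for the arithmetic.

```python
# Headline rates quoted above, as fractions.
# Assumption: "60%+" is treated as exactly 0.60, so the real odds are at least this bad.
headline_rates = {
    "EBU/BBC significant issue": 0.45,
    "Tow Center incorrect answers": 0.60,
    "NewsGuard false claims": 0.35,
    "Stanford Lexis+ AI hallucinations": 0.17,
}


def one_in_n(rate: float) -> float:
    """Convert an error rate into the N of '1 in N answers'."""
    if not 0 < rate <= 1:
        raise ValueError("rate must be in (0, 1]")
    return 1 / rate


# Print from the lowest rate to the highest.
for label, rate in sorted(headline_rates.items(), key=lambda kv: kv[1]):
    print(f"{label}: {rate:.0%} is about 1 in {one_in_n(rate):.1f}")
```

The studies measure different things (news answers, citations, legal queries), so the odds are a way to compare scale, not a single combined error rate.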

The thing that should worry you most

It is not the error rate. It is the confidence rate. The Tow Center finding (134 wrong citations, only 15 hedges) is the operator's nightmare. Your customer asks ChatGPT about your shipping policy, gets a wrong answer in a confident tone, never sees an 'I am not sure' disclaimer. They just decide.
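
As a rough back-of-envelope check on that finding, the sketch below works out the share of wrong answers that could have carried a hedge. It assumes the most generous reading, namely that every one of the 15 hedges landed on one of the 134 wrong answers; the variable names are illustrative.

```python
wrong_answers = 134   # ChatGPT answers the Tow Center marked incorrect
hedged_answers = 15   # answers that carried any hedging language

# Most generous reading (an assumption): every hedge landed on a wrong answer.
max_hedged_share = hedged_answers / wrong_answers
min_confident_share = 1 - max_hedged_share

print(f"hedged: at most {max_hedged_share:.1%} of wrong answers")
print(f"confidently wrong: at least {min_confident_share:.1%} of wrong answers")
```

Even on that generous reading, close to nine in ten wrong answers arrive with no warning attached.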

Why this matters for your brand specifically

Two failure modes, both expensive:

  1. The model misrepresents you. Old prices, killed features, wrong return policy, a competitor's domain credited for your product. I have audited 50+ brands in the last year and not one came back clean across all four models. Most common pattern: pricing on an archived 2023 blog post overrides the current pricing page, most likely because the old post has more inbound links and the retrieval layer appears to weight link count over freshness.
  2. The model omits you. Worse than wrong. You are not in the candidate set at all. The user gets a clean confident answer with three competitors named, ends the session, buys from one of them. You never see the lost search because there is no referrer string on a query that never left the chat window.

Both are recoverable, but only if you know they are happening. You cannot fix what you do not measure. The brands that started auditing in 2024 are now 18 months ahead of the ones still arguing whether this matters.

The 30-minute monthly audit I actually run

Not theoretical. This is what I open my laptop and do on the first Monday of every month for the brands I work with. Tools named, URLs included.

  1. Pull your top 10 buyer prompts. Not keywords - prompts. 'What is the return policy for [your brand]?', 'Best [your category] for [their use case] under $X?', 'Is [your brand] better than [competitor]?'. Real sentences a customer types or speaks. Five minutes with a notepad.
  2. Run them in 4 places, same day. ChatGPT with search on, Perplexity, Gemini, and Claude. Same prompt, no personalisation tricks, fresh chat each time. Screenshot every answer. The answers shift next week, you will need the receipts.
  3. Fact-check every claim against your own source pages. Price, stock, shipping window, warranty, feature list, comparison points. Anything wrong gets a row in a spreadsheet: model, query, claim, correct answer, suspected bad source URL.
  4. Click every cited source the model surfaced. If the linked source does not contain the claim the model attributed to it, log it as a citation hallucination - the Tow Center failure mode. If the cited source is a competitor when the claim is about you, that is a worse problem (the model picked the wrong entity entirely).
  5. Trace the bad source and fix at the root. Wrong price came from a 2023 blog post? Email the publisher and ask for an update or a noindex. Wrong feature came from a Reddit thread? Reply in the thread with the correct info and a link to the current page. Models recrawl - in my experience the fix usually lands within weeks, not months.
  6. Run GEOlikeaPro's Visibility Vitals on the same pages. Free, paste a URL, get bot-allowance + schema + extractability + freshness flags in 60 seconds. Half the misrepresentation problems I see trace back to GPTBot or PerplexityBot being blocked, or to schema missing the fields the agent reads first. This surfaces both in one pass.
  7. File the diff month over month. The point is not one snapshot. The point is the trend. Are your prompts getting cleaner, or are new errors appearing as models update? Without the diff you cannot tell whether the work is paying off.
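
The spreadsheet row from step 3 and the month-over-month diff from step 7 can be sketched in a few lines. This is a minimal illustration, not a finished tool: the `Finding` class, the `diff_months` function, the brand ExampleCo, and the example.com URLs are all hypothetical, and it assumes an error counts as "the same" when the model, prompt, and claim all match.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One wrong claim logged during an audit (the columns from step 3)."""
    model: str
    query: str
    claim: str
    correct_answer: str
    suspected_source_url: str

    @property
    def key(self) -> tuple[str, str, str]:
        # Assumption: same model + same prompt + same claim = same error.
        return (self.model, self.query, self.claim)


def diff_months(previous: list[Finding], current: list[Finding]) -> dict[str, list[Finding]]:
    """Split findings into fixed, persisting, and new errors."""
    prev_keys = {f.key for f in previous}
    curr_keys = {f.key for f in current}
    return {
        "fixed": [f for f in previous if f.key not in curr_keys],
        "persisting": [f for f in current if f.key in prev_keys],
        "new": [f for f in current if f.key not in prev_keys],
    }


# Hypothetical audit data for two consecutive months.
pricing_error = Finding(
    "Perplexity", "How much is ExampleCo Pro?", "$29/month",
    "$39/month", "https://example.com/blog/2023-pricing",
)
last_month = [
    Finding(
        "ChatGPT", "What is the return policy for ExampleCo?",
        "Returns accepted within 14 days", "30 days",
        "https://example.com/blog/2023-returns",
    ),
    pricing_error,
]
this_month = [
    pricing_error,
    Finding(
        "Gemini", "Is ExampleCo better than OtherCo?",
        "ExampleCo has no mobile app", "Mobile app is available",
        "https://example.org/forum/thread",
    ),
]

report = diff_months(last_month, this_month)
for status, findings in report.items():
    print(f"{status}: {len(findings)}")
    for f in findings:
        print(f"  [{f.model}] {f.claim!r} (should be {f.correct_answer!r})")
```

Matching on the exact claim text is deliberately strict. Models rephrase, so in practice you may want to normalise claims by hand before diffing; otherwise a reworded error shows up as one "fixed" and one "new".
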

The rule I'd ship

If you sell anything to a human who might ask an AI about it before buying, you owe yourself one 30-minute audit per month. Not weekly, not daily - the answers do not move that fast. Skip a month and you lose a data point in the trend. Skip a quarter and you are flying blind in a channel that EBU's data says misrepresents news content 45% of the time.

The 'they will fix themselves' counterargument

The NewsGuard one-year report kills this cleanly. False-claim rates nearly doubled, from 18% to 35%, in twelve months. The fix-themselves theory predicts the opposite. What actually happened: vendors made the bots more eager to answer and gave them live web access, so the bots got more useful and more wrong at the same time. You cannot wait this out.

The Reuters Institute Digital News Report 2025 notes 7% of online news consumers already use AI assistants to get news, rising to 15% of under-25s. The audience is moving while accuracy slips. That is the worst possible combination for a brand that ignores the channel.

When NOT to bother

To be honest about the limit:

  • Pure offline business. Brick-and-mortar only, no online discovery layer, 95% of customers walk in. Low priority. Spend the 30 minutes elsewhere.
  • Pre-product brands with no public surface yet. The model has nothing to misrepresent. Come back when you ship.
  • Enterprise sales with $100k+ deal sizes. Your buyer is in a call with your sales team within 24 hours of the first query. The AI answer is a footnote, not the decision. Audit anyway for accuracy, but the urgency is lower.

Everyone else - DTC, SaaS under $10k ACV, local service businesses, marketplaces, anyone with a public-search-driven funnel - is in the population where the 45% / 35% / 17% numbers above are about to bite you, not someone else.

What I am watching next

  1. Whether the EBU/BBC repeat the study in 2026 with the same protocol. If the 45% number drops below 30% across all four models in one year, the audit cadence can stretch to quarterly. If it does not, monthly stays.
  2. Whether NewsGuard's monthly monitor breaks out non-news categories (commerce, health, finance) separately. The 35% figure is news-specific. Commerce queries may be cleaner or dirtier - we do not know yet, and the answer changes how aggressively brands need to audit shopping prompts.
  3. Whether vendors start labelling low-confidence answers visibly. Right now the confidence problem (Tow Center) is silent. A small 'low confidence' badge would change user behavior overnight and reduce the brand-damage surface area. No vendor has shipped this. I am not holding my breath.

GEOlikeaPro tracks how ChatGPT, Perplexity, Claude, Gemini and Google AI Mode answer questions about your brand - scored on accuracy, citation quality, and entity recognition. The monthly audit above takes 30 minutes manually or runs automatically in one dashboard. See where you stand - free.

FAQ

Is it actually worth researching what AI chats say about my brand?

Yes for any brand with a public-search-driven funnel. Four independent studies (EBU/BBC, Columbia Tow Center, NewsGuard, Stanford HAI) put error rates between 17% and 45%, across general-purpose assistants such as ChatGPT, Perplexity, Gemini and Copilot and, in the Stanford study, specialist legal research tools. Citation accuracy is worse - the Tow Center found over 60% of AI search citations were wrong, and only 15 out of 134 incorrect ChatGPT citations included any hedging language. Wrong with confidence is the failure mode brands actually pay for.

How often are AI chatbot answers wrong?

Depends on category, but the published floors are alarming. The October 2025 EBU/BBC study of 3,000+ news responses found 45% had a significant issue, 31% had serious sourcing problems, 20% had major accuracy issues with hallucinated or outdated content. NewsGuard's one-year audit puts the false-claim rate on controversial news topics at 35% across 10 leading chatbots, nearly double the 18% of twelve months earlier. Stanford HAI found hallucination rates above 17% even in retrieval-grounded legal AI (Lexis+ AI) and roughly 33% in Westlaw AI.

Which AI chatbot is most accurate?

By NewsGuard's September 2025 one-year audit on controversial news topics, Claude had the lowest false-claim rate and held flat year over year. Inflection (57%) and Perplexity (47%) had the highest. Gemini rose from 7% to 17%. By the EBU/BBC October 2025 study, Gemini was worst on news with 76% of responses flagged for significant issues. There is no single 'most accurate' chatbot across all categories - the rankings flip by topic and benchmark, which is exactly why you need to test against your own prompts.

How often should I audit how AI chats answer questions about my brand?

Once a month for any brand with a public funnel. The answers do not move fast enough to justify weekly audits, and quarterly is too sparse to catch a regression before it damages conversion. The audit takes 30 minutes if you have the spreadsheet template ready. Skip a quarter and you are flying blind in a channel where the EBU/BBC data says 45% of answers contain a significant issue.

What should I do when an AI chat gets my brand wrong?

Trace the bad source and fix at the root, do not just complain at the model. If the wrong price came from a 2023 blog post, email the publisher and ask for an update or a noindex. If a wrong feature claim came from a Reddit thread, reply in the thread with the corrected info and a link to your current page. Models recrawl, and in my experience the fix usually lands within weeks. One caution: if you block GPTBot, ClaudeBot or PerplexityBot in robots.txt, you opt out of the corresponding model's ability to ever quote you correctly.

Are AI chatbots getting more accurate over time?

No, at least not on news. NewsGuard's one-year audit (September 2025) showed average false-claim rates nearly doubled, from 18% to 35%, in twelve months. The reason is structural: vendors made the bots more eager to answer instead of declining, and gave them live web access. Both changes made the bots more useful and more confidently wrong at the same time. The 'wait for them to fix it' strategy already has a year of evidence against it.

What is the single highest-leverage step to fix bad AI chat answers about my brand?

Make sure GPTBot, OAI-SearchBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User and Google-Extended are allowed in your robots.txt, then put a real Last Updated date and a human author byline on your top 20 pages. The bot list closes the door on the most common 'why is the model citing my competitor?' problem, the freshness signals push stale 2023 content out of the answer. Both together, one afternoon. GEOlikeaPro's Visibility Vitals checker surfaces missing bot allowances in 60 seconds.
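
If you want to check the bot list yourself, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `blocked_agents` helper, the sample robots.txt, and the example.com URL are illustrative assumptions. Python's parser is also only an approximation of how each vendor's crawler reads robots.txt, so treat the result as a first pass.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in the answer above.
AI_USER_AGENTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
]


def blocked_agents(robots_txt: str, page_url: str) -> list[str]:
    """Return the AI user agents that robots_txt disallows for page_url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [ua for ua in AI_USER_AGENTS if not parser.can_fetch(ua, page_url)]


# Hypothetical robots.txt: GPTBot is blocked site-wide, everyone else
# is only kept out of /admin/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /admin/
"""

print(blocked_agents(SAMPLE_ROBOTS_TXT, "https://example.com/pricing"))
# prints ['GPTBot']
```

To run it against a live site, fetch the robots.txt text with whatever HTTP client you already use and pass it in, or construct `RobotFileParser` with the robots.txt URL and call its `read()` method instead of `parse()`.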
