Why One AI Audit Isn't Enough (And What We Did About It)

June 19, 2026

A reader emailed me last week, annoyed. Our audit told him his brand barely showed up in AI answers. Then he opened ChatGPT himself, asked the same thing, and there he was, named in the first line. "Your tool is wrong," he said. Then he mentioned the part that actually matters: he ran our audit twice and got two different scores.

He was right to be annoyed, and the second complaint is the real story.

One LLM check is closer to a coin flip than a measurement

Ask a large language model the same question twice and you can get two different answers. Not subtly different - different brands, different ordering, different recommendations.

"Large language models are fundamentally non-deterministic, which means you'll get a different response for the same input."

That is not a flaw in the model or in your tool. It is how these systems work.

This is well documented.

"Our results indicate that, while both models achieve perfect in-domain response accuracy across all prompt scenarios, their token-level probability and entropy values consistently diverge from the corresponding theoretical distributions."

Models pick each token by sampling from a probability distribution, so the output varies by design. And it is not a setting you can simply switch off: researchers at Thinking Machines Lab show that even at temperature 0 the answers can still vary, because of how requests get batched and how floating-point rounding depends on the order of operations in a forward pass. An empirical study of ChatGPT found the same thing: repeat an identical prompt and the responses drift from run to run.

So if an audit asks each model once and hands you a single number, that number is one sample from a noisy distribution. Run it again and it moves. We hit this in our own data, and wrote it up when we asked four AIs which brands win and they barely agreed.

Why a single number misleads

Say a tool reports "50% AI visibility." With four models asked once each, that is two saying yes and two saying no. Flip one model on the next run and you read 25% or 75%. Same brand, same query, same afternoon.

You cannot plan content work against a number that swings 25 points on noise. You cannot tell whether last month's effort moved anything. The metric is measuring the dice, not your brand.
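The arithmetic behind that swing is small enough to check by hand. Here is a minimal sketch, assuming each model gives a plain yes or no and the score is the share of yeses; the function name `visibility` is ours for illustration, not something from the tool.

```python
def visibility(votes):
    """Share of models that named the brand, as a percentage."""
    return 100 * sum(votes) / len(votes)

baseline = (True, True, False, False)  # two of four models say yes
print(visibility(baseline))            # 50.0

# Flip each model's answer in turn and collect the resulting headline scores.
one_flip_scores = sorted({
    visibility(baseline[:i] + (not baseline[i],) + baseline[i + 1:])
    for i in range(len(baseline))
})
print(one_flip_scores)                 # [25.0, 75.0]
```

With only four yes/no readings, the score can only move in 25-point steps, so a single changed answer is the whole swing.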

The fix is not clever, it is basic measurement

When a signal is noisy, you do not take one reading. You take many and report the distribution. That is the standard guidance for working with these models: run a prompt several times and report the average with its spread, not a lone score. Think of a political poll - "47%, plus or minus 3" - instead of stopping one person on the street and calling it the result.

So Deep Audit does not ask once. It expands your query into a spread of real buyer-intent variations - the way people actually phrase things, from "best X" to "alternatives to Y" to pricing and comparison questions - and runs that whole set across all four models. The reasoning is older than generative AI: decades of search-behavior research show people reach the same need through very different queries, so one phrasing never represents the full picture.
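To make the fan-out concrete, here is a rough sketch of expanding one brand query into several phrasings and pairing each with every model. The templates, the placeholder model names, and the function `build_checks` are all assumptions for illustration; the actual variation set Deep Audit uses is not shown here.

```python
from itertools import product

# Hypothetical intent templates, loosely based on the phrasings named above.
INTENT_TEMPLATES = [
    "best {category}",
    "alternatives to {competitor}",
    "{category} pricing comparison",
    "{brand} vs {competitor}",
]

MODELS = ["model_a", "model_b", "model_c", "model_d"]  # placeholder names

def build_checks(brand, category, competitor):
    """Return every (model, query) pair to run for one audit."""
    queries = [
        template.format(brand=brand, category=category, competitor=competitor)
        for template in INTENT_TEMPLATES
    ]
    return list(product(MODELS, queries))

checks = build_checks("Acme", "running shoes", "Globex")
print(len(checks))  # 16: four models times four query variations
```

Each pair is one independent check, which is what turns a single yes or no into something you can count.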

The output is a mention rate: how often you actually appear across all those checks, with a confidence band that tells you how much to trust it. A tight band means the result is stable. A wide band means the models disagree, so treat the headline as soft and look at which ones miss you.
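For readers who want to see what a band like that looks like in numbers, here is a minimal sketch using the Wilson score interval. That choice is an assumption on our part for illustration: this post does not say which formula the tool uses, and `wilson_interval` is a hypothetical helper name.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Wilson score interval for a proportion; z=1.96 gives roughly 95%."""
    if n <= 0:
        raise ValueError("need at least one check")
    p = hits / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)

# Example: named in 12 of 40 checks, versus 2 of 4 in a one-shot audit.
deep_low, deep_high = wilson_interval(hits=12, n=40)
quick_low, quick_high = wilson_interval(hits=2, n=4)
print(f"40 checks: {deep_low:.0%} to {deep_high:.0%}")
print(f" 4 checks: {quick_low:.0%} to {quick_high:.0%}")
```

The band from four checks comes out far wider than the band from forty, which is the practical reason to run the larger set before acting on the number.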

A mention has to be real, not self-reported

There is a second trap, and it is worse. Models will cheerfully claim they recommended a brand they never named. We caught ours doing exactly that - tagging a "partial mention" for a brand that appears nowhere in the answer text. So we stopped trusting the self-report: if the brand name is not literally in the response, it does not count. The honest answer to "did the model name you" is simply whether the model named you.
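The rule is simple enough to sketch in a few lines. This version assumes case-insensitive, whole-word matching, which is our guess at a sensible default and not a description of the production check; `is_named` and `count_mention` are hypothetical names.

```python
import re

def is_named(brand, answer_text):
    """True only if the brand name literally appears in the answer text."""
    # Lookarounds stop "Acme" from matching inside a longer word like "Acmeology".
    pattern = r"(?<!\w)" + re.escape(brand) + r"(?!\w)"
    return re.search(pattern, answer_text, flags=re.IGNORECASE) is not None

def count_mention(brand, answer_text, self_reported_label):
    """Decide whether a check counts, ignoring the model's own label."""
    # self_reported_label is accepted but deliberately unused: only the text counts.
    return is_named(brand, answer_text)

print(count_mention("Acme", "Try Globex or Initech.", "partial mention"))  # False
print(count_mention("Acme", "We recommend Acme for trail running.", "none"))  # True
```

The first call is the failure case described above: the model tags a partial mention, the name is nowhere in the text, and the check returns False.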

That standard is not arbitrary. The foundational Generative Engine Optimization paper (Aggarwal et al., presented at ACM KDD 2024) measures visibility by what actually appears in the generated answer, and found that content changes such as citing sources and adding quotations and statistics raise it. What shows up in the text is the unit that matters, so a mention has to be earned, not assumed.

What we deliberately did not do

Now the honest limitation. Deep Audit reduces the noise. It does not make the models browse your live site: they answer from training data, which leans on existing fame. We tested a real brand a friend has promoted hard for a year: our tool showed near-zero, and Google's live AI Mode showed the same near-zero for that query. Two very different methods, one answer for that query. The broader gap between training memory and live retrieval is still the next problem to solve, and we would rather ship a number you can reproduce than one that looks live and is not.

How to use it

  1. Quick Audit - one check across four models. Fast and rough. Good for a gut read or a single spot-check.
  2. Deep Audit - ten intent variations across four models, with a mention rate and confidence band. Use it for any number you plan to report or act on.
  3. Read the per-variant breakdown. You will often lose the head term and win narrower ones. That is where the real opportunities sit, and it is the whole point of the fan-out.
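Reading the breakdown amounts to grouping the checks by query variation and comparing rates. A small sketch with made-up results follows; the data, the model names, and `per_variant_rates` are illustrative assumptions, not output from the tool.

```python
from collections import defaultdict

def per_variant_rates(results):
    """Map each query variation to the share of checks that named the brand."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for variant, _model, named in results:
        totals[variant] += 1
        hits[variant] += int(named)
    return {variant: hits[variant] / totals[variant] for variant in totals}

# Made-up results for illustration only: (variant, model, brand was named).
results = [
    ("best running shoes", "model_a", False),
    ("best running shoes", "model_b", False),
    ("best running shoes", "model_c", True),
    ("best running shoes", "model_d", False),
    ("running shoes for flat feet", "model_a", True),
    ("running shoes for flat feet", "model_b", True),
    ("running shoes for flat feet", "model_c", True),
    ("running shoes for flat feet", "model_d", False),
]

rates = per_variant_rates(results)
# Print weakest variants first, since those are the ones to work on.
for variant, rate in sorted(rates.items(), key=lambda item: item[1]):
    print(f"{rate:.0%}  {variant}")
```

In this toy data the head term loses and the narrower query wins, which is the pattern step 3 tells you to look for.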

Measure once and you are guessing. Measure many times and you have a signal. For the harder work of actually earning those mentions, our write-up on how source citations boost AI visibility and the patterns from our 50+ brand audit are the place to start.

See where you stand

Run a free brand check on the GEOlikeaPro home page - no signup needed - or run a full Quick or Deep Audit inside the tool. Either way you see the exact queries where you are losing; for a number you plan to report or act on, use Deep Audit.

FAQ

Why do AI visibility scores change every time I run them?

Because large language models are non-deterministic - they pick words by sampling, so the same prompt can return different brands and recommendations, even at temperature 0. A single check is one draw from a noisy distribution, which is why a one-shot audit can swing 25 points on noise alone. The fix is to measure many times and report the average with a confidence band.

What is a mention rate?

It is how often your brand actually appears across many checks, instead of a single yes or no. Deep Audit runs ten buyer-intent variations of your query across four models and reports the share where you were genuinely named, with a 95% confidence band so you know how stable the result is.

Does the audit count a mention if the model only implies my brand?

No. We check whether the brand name is literally in the answer text. Models tend to over-report - tagging a partial mention for a brand they never actually named - so we ignore the self-report. If the name is not there, it does not count. That keeps the mention rate honest.

Does Deep Audit check my live website?

No. The models answer from their training data, not a live crawl, so the score reflects what they already know about you. We deliberately deferred live grounding rather than ship an unpredictable, expensive agent loop. Deep Audit's job is to reduce run-to-run noise so the number you get is reproducible.
