Being Named by AI Isn't Being Understood
I just put out a preprint - the measurement companion to our Mention Density Model paper. The first paper explained why brands get recognized inside AI answers. This one is the boring, useful half: how to actually measure it. 50+ mid-market brands, 11 countries, four engines (ChatGPT, Claude, Gemini, Perplexity), around 200 audits.
Here's my version, no academia. The single biggest mistake I see brands make with AI visibility isn't a bad strategy. It's measuring it with one number, on one engine, one time. Every finding below is a thing that number hides from you.
Never report AI visibility as a single number. Report the tuple: were you cited, by how many engines, at what rank, with what sentiment, on what kind of query. Any one of those read alone will lie to you. The rest of this post is four proofs of that.
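A minimal sketch of that tuple as a record, to make the shape concrete. The class and field names are hypothetical, not the preprint's schema, and the example values are the Marshall row from the broad headphone query in Finding 1.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class VisibilityCell:
    """One brand x query observation across all engines (hypothetical names)."""
    brand: str
    query: str
    query_type: str                   # e.g. "niche", "broad", "directly-named"
    ranks: Dict[str, Optional[int]]   # engine -> rank, or None if not cited
    sentiment: Optional[float]        # None when no score was recorded

    @property
    def engines_cited(self) -> int:
        return sum(rank is not None for rank in self.ranks.values())

    @property
    def cited(self) -> bool:
        return self.engines_cited > 0

    @property
    def best_rank(self) -> Optional[int]:
        ranks = [r for r in self.ranks.values() if r is not None]
        return min(ranks) if ranks else None

    def as_tuple(self):
        # cited, by how many engines, at what rank, what sentiment, what query type
        return (self.cited, self.engines_cited, self.best_rank,
                self.sentiment, self.query_type)


marshall = VisibilityCell(
    brand="Marshall",
    query="best wireless headphones under $300",
    query_type="broad",
    ranks={"ChatGPT": None, "Claude": None, "Gemini": 3, "Perplexity": None},
    sentiment=None,  # no sentiment score is given for this cell in the post
)
print(marshall.as_tuple())
```

Read alone, "cited: True" looks like a win. Read as a tuple, it is one engine out of four on a broad query.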
Finding 1: citation is a single-model lottery
Eight mid-market audio brands. One broad query - "best wireless headphones under $300." Four engines. Here's who got cited:
| Brand | ChatGPT | Claude | Gemini | Perplexity |
|---|---|---|---|---|
| Skullcandy | - | - | - | - |
| Teufel | - | - | - | - |
| Nothing | - | - | - | rank 5 |
| Marshall | - | - | rank 3 | - |
| Master & Dynamic | - | - | - | - |
| House of Marley | rank 3 | - | - | - |
| AIAIAI | - | - | - | - |
| Grado Labs | - | - | - | - |
Three brands cited. Each by exactly one engine. Zero overlap. Only the mega-brands - Sony, Bose, Sennheiser - showed up across all four. The mid-market brands that did get in won a coin toss on a single engine.
Now flip to niche queries ("best X brands in Y category"): 11 of 12 brands cited, per-engine rates of 60-75%, and the engines largely agreed with each other. The lottery is a broad-query phenomenon.
So here's the trap. You check ChatGPT, you see your brand, you relax. But you measured one quarter of the signal - and on a broad query, a near-random quarter. The brand that looks fine on Gemini is invisible on the other three. You'd never know unless you ran all four.
Audit all four engines, every time. A one-engine visibility report is not a smaller version of the truth - on broad queries it's a random draw. This is exactly the "Coverage" signal we score: cited by one engine is a Critical score, not a win.
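To see the lottery in code, here is a small sketch that loads the broad-query table above and compares the four-engine view with what a single-engine report would show. The function names are illustrative; only the ranks come from the table.

```python
ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]

# Engine -> rank for each brand, copied from the broad-query table.
# A missing engine means the brand was not cited there.
citations = {
    "Skullcandy": {},
    "Teufel": {},
    "Nothing": {"Perplexity": 5},
    "Marshall": {"Gemini": 3},
    "Master & Dynamic": {},
    "House of Marley": {"ChatGPT": 3},
    "AIAIAI": {},
    "Grado Labs": {},
}


def coverage(brand: str) -> int:
    """Number of engines (0-4) that cited the brand."""
    return sum(engine in citations[brand] for engine in ENGINES)


def seen_by(engine: str) -> list:
    """Brands a one-engine report on `engine` would call visible."""
    return sorted(b for b, ranks in citations.items() if engine in ranks)


for engine in ENGINES:
    print(f"{engine:<11} sees: {seen_by(engine)}")

cited_brands = [b for b in citations if coverage(b) > 0]
print("cited by any engine:", cited_brands)
print("max coverage:", max(coverage(b) for b in citations))
```

Each engine's list has at most one brand, and no brand appears in two lists. That is the zero-overlap result in a form you can rerun against your own audit.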
Finding 2: citation collapses the moment the query gets broad
Same brands. The only thing I changed was how broad the question was. Watch the cited-count fall off a cliff:
| Query type | Mean cited (of 4) | % at 0/4 | % at 3+/4 |
|---|---|---|---|
| Niche ("best X brands in Y category") | 3.1 | 0% | 83% |
| Broad ("best wireless headphones under $300") | 0.4 | 63% | 0% |
| Broad generic ("best everyday clothing brand 2026") | 0.4 | 75% | 0% |
| Native-language broad | 2.0 | 25% | 50% |
The same brands drop from 3.1 engines citing them to 0.4 - an 87% collapse - when the only change is widening the query. "We're cited by 3 of 4 engines" and "we're cited by 0.4 of 4" can be the same brand, on the same day, measured an hour apart.
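The table columns are simple aggregates over per-brand cited counts. The sketch below assumes the broad headphone row summarizes the eight-brand table from Finding 1 (three brands cited by one engine each, five by none), which is consistent with the reported 0.4 and 63%. The function name is illustrative.

```python
def summarize(cited_counts):
    """Aggregate per-brand cited counts (0-4 engines) into the table columns."""
    n = len(cited_counts)
    return {
        "mean_cited": sum(cited_counts) / n,
        "share_at_zero": sum(c == 0 for c in cited_counts) / n,
        "share_at_3_plus": sum(c >= 3 for c in cited_counts) / n,
    }


# Finding 1, broad query: engines citing each of the eight audio brands.
broad_counts = [0, 0, 1, 1, 0, 1, 0, 0]
broad = summarize(broad_counts)
print(broad)  # mean 0.375 (shown as 0.4), 62.5% at 0/4 (shown as 63%)

# Relative collapse from the niche mean to the broad mean, using the
# rounded figures reported in the table.
niche_mean, broad_mean = 3.1, 0.4
collapse = (niche_mean - broad_mean) / niche_mean
print(f"collapse: {collapse:.1%}")
```

The 87% figure is a relative drop in mean engines citing, not a drop in the share of brands cited.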
This is the recognition funnel, the same one marketers have used for brand awareness for 30 years, mapped onto AI:
- Unaided - "best running shoes." No brand, no narrow category. This is true top-of-mind recall, and it's brutally hard. Mid-market brands have essentially none of it.
- Category-aided - "best sustainable flats for women." A category cue. This is where mid-market brands actually live and win.
- Comparison-aided - "alternatives to Allbirds." Recognition relative to a named peer.
- Directly-named - "is Rothy's legit." The brand is in the prompt, so citation is trivially 100%. The real signal here is sentiment, not presence (see Finding 3).
The practical read: a visibility number is meaningless without its funnel depth attached. Don't chase unaided "best X" terms you'll never win. Find the category-aided queries where you have a fighting chance, and own those.
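One way to keep the depth attached is to generate one query per funnel level and carry the label with it. This is a rough sketch; the templates are illustrative and simply reproduce the four example prompts above.

```python
def funnel_queries(brand, broad_category, narrow_category, peer):
    """Return one query per funnel depth, keyed by depth label.

    The templates are assumptions for illustration; real audits would
    use wording your buyers actually type.
    """
    return {
        "unaided": f"best {broad_category}",
        "category-aided": f"best {narrow_category}",
        "comparison-aided": f"alternatives to {peer}",
        "directly-named": f"is {brand} legit",
    }


queries = funnel_queries(
    brand="Rothy's",
    broad_category="running shoes",
    narrow_category="sustainable flats for women",
    peer="Allbirds",
)
for depth, query in queries.items():
    print(f"{depth:<17} {query}")
```

Storing results under the depth label means a "cited 3 of 4" number can never be reported without saying which rung of the funnel it came from.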
Finding 3: at 100% citation, sentiment quietly drops on trust queries
Directly-named queries - "is X legit," "is X worth it" - force 100% citation because your name is in the question. So Share of Voice tells you nothing. The signal moves entirely to tone. And the tone gets worse:
| Brand | Discovery sentiment | Trust sentiment | Delta |
|---|---|---|---|
| Olipop | 86 | 71 | -15 |
| Rothy's | 88 | 76 | -12 |
| Princess Polly | 80 | 68 | -11 |
| MyProtein | 78 | 78 | ~0 |
| Rituals | 80 | 80 | 0 |
The three biggest drops belong to three of the strongest discovery performers. Why? Trust queries invite hedging - "better than soda, but..." - while discovery queries invite enthusiasm. A customer asking "is this legit" is the closest one to buying, and that's exactly where the engine gets cautious about you.
If you only watch SOV, a 100%-cited / 68-sentiment cell looks like a trophy. It's a leak. The fix is reputation work the engine can read: real reviews, comparison pages that answer the objection head-on, third-party coverage that isn't yours.
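A small sketch of that check: compute trust minus discovery sentiment per brand and flag the leaks. The scores are the rounded values from the table; the 10-point threshold is an assumption for illustration, not a cutoff from the preprint. Deltas recomputed from rounded scores can differ by a point from the table's Delta column, which was presumably calculated before rounding.

```python
# (discovery, trust) sentiment scores, copied from the table above.
scores = {
    "Olipop": (86, 71),
    "Rothy's": (88, 76),
    "Princess Polly": (80, 68),
    "MyProtein": (78, 78),
    "Rituals": (80, 80),
}

LEAK_THRESHOLD = 10  # assumed cutoff in points, chosen for this example


def trust_delta(discovery, trust):
    """Negative values mean the engine is cooler on trust queries."""
    return trust - discovery


deltas = {brand: trust_delta(d, t) for brand, (d, t) in scores.items()}

# Brands whose trust sentiment falls at least LEAK_THRESHOLD points,
# biggest drop first.
leaks = sorted(
    (brand for brand, delta in deltas.items() if delta <= -LEAK_THRESHOLD),
    key=deltas.get,
)
print(deltas)
print("leaks:", leaks)
```

All five brands are cited 100% of the time on these queries, so Share of Voice would rate them identically. The delta is the only column that separates them.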
Finding 4: engines describe mid-market brands wrong about half the time
Being cited is not the same as being understood. I ran paired brand-verification checks - 5 brands across 3 engines, 15 pairs. Zero of fifteen were fully correct.
- Master & Dynamic - all three engines got the founding year wrong
- House of Marley - missing product lines and its core sustainability initiative
- AIAIAI - ChatGPT did not recognize the brand at all, on two separate runs - a Danish brand with mainstream tech coverage and a strong review profile
- Grado - Gemini missed the founder story
This is the part founders underestimate. You can be "visible" - cited, ranked - and still have the engine telling buyers a wrong founding year, the wrong product range, or confidently confusing you with someone else. That's the Clarity problem, and it has to be scored separately from "did you show up."
Wrong descriptions are a low-density entity problem - the engine doesn't have enough clean, corroborated facts about you. That's the most actionable failure on this list. Wire up Organization schema, build the entity graph (a Wikidata entry, sameAs links), keep your facts identical everywhere. You're feeding the engine the facts it's currently guessing at.
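As a rough illustration of the schema step, the sketch below builds a schema.org Organization block as JSON-LD. The property names (`name`, `url`, `foundingDate`, `sameAs`) are real schema.org properties; every value is a placeholder for a made-up brand, including the Wikidata ID, and should be replaced with your own verified facts.

```python
import json

# All values below are hypothetical placeholders, not a real brand.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Audio Co.",
    "url": "https://www.example.com",
    "foundingDate": "2014-01-01",
    "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",         # placeholder ID
        "https://www.linkedin.com/company/example-audio",  # placeholder URL
    ],
}

json_ld = json.dumps(organization, indent=2)

# JSON-LD is embedded in the page inside a script tag of this type.
script_tag = f'<script type="application/ld+json">\n{json_ld}\n</script>'
print(script_tag)
```

Generating the block from one source of truth, rather than hand-editing it per page, is a simple way to keep the founding date and profile links identical everywhere.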
The rule I'd ship
Four findings, one thread: every one of them is invisible if you measure AI visibility the way most people do. So the protocol is simple, and you can run it yourself for pennies:
- Run all four engines. ChatGPT, Claude, Gemini, Perplexity. One engine is a coin toss on broad queries (Finding 1).
- Write a query at each funnel depth. Unaided, category-aided, comparison-aided, directly-named. The number is meaningless without the depth attached (Finding 2).
- Run it 3+ times. Engines drift between runs. One run is noise, not a measurement.
- Score the whole tuple, not SOV alone. Cited or not, how many engines, what rank, what sentiment - and on trust queries, sentiment is the whole game (Finding 3).
- Check whether they describe you correctly, not just whether they name you. Clarity is its own score (Finding 4).
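The loop behind that protocol can be sketched in a few lines. This is not the preprint's harness: `ask_engine` is a hypothetical stub that returns canned text so the example runs offline, and the substring match is a deliberately naive stand-in for real citation and rank parsing. Swap in actual API calls per engine to use it.

```python
from collections import defaultdict

ENGINES = ["ChatGPT", "Claude", "Gemini", "Perplexity"]
RUNS = 3  # the protocol calls for 3+ runs per query


def ask_engine(engine, query, run):
    """Hypothetical stub standing in for a real API call to each engine."""
    canned = {
        ("Gemini", "best wireless headphones under $300"):
            "1. Sony 2. Bose 3. Marshall",
    }
    return canned.get((engine, query), "1. Sony 2. Bose 3. Sennheiser")


def audit(brand, queries_by_depth):
    """Return the cited rate per funnel depth and engine across RUNS runs."""
    hits = defaultdict(int)
    for depth, query in queries_by_depth.items():
        for engine in ENGINES:
            for run in range(RUNS):
                answer = ask_engine(engine, query, run)
                # Naive check: brand name appears anywhere in the answer.
                if brand.lower() in answer.lower():
                    hits[(depth, engine)] += 1
    return {
        depth: {engine: hits[(depth, engine)] / RUNS for engine in ENGINES}
        for depth in queries_by_depth
    }


report = audit("Marshall", {"unaided": "best wireless headphones under $300"})
for depth, rates in report.items():
    print(depth, rates)
```

The output keeps engine and funnel depth as separate keys, so nothing gets averaged into a single number before you have looked at the spread.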
None of this is new theory about how AI works - the Mention Density Model already covers the why. This is just what becomes visible the moment you stop measuring once, stop measuring one engine, and stop reporting one number. The measurement and the model are two halves of the same thing. This was the measurement half.
Want the tuple for your own brand without building the harness? Run GEOlikeaPro's Visibility Vitals checker - it audits all four engines and scores Presence, Clarity, Coverage, Authority, and Preference, so you see which of these four traps you're actually in.
The full instrument, scoring rubric, reproducibility protocol, and data tables are in the open-access preprint: A Multi-Factor Brand-Recognition Audit for AI Answer Engines (Zenodo, CC-BY-4.0).
FAQ
Why measure AI brand visibility across multiple engines?
Because on broad queries, mid-market citation is a single-model lottery. In our 50+ brand audit, three audio brands were cited on a broad query - each by exactly one engine, with zero overlap. A one-engine report measures a quarter of the signal, and on broad queries a near-random quarter. Auditing ChatGPT, Claude, Gemini, and Perplexity together is the only way to see your real Coverage.
What is the AI recognition funnel?
It maps the classic aided-versus-unaided brand awareness funnel onto AI answers. Unaided ('best running shoes') is hardest - true top-of-mind recall. Category-aided ('best sustainable flats for women') adds a category cue and is where mid-market brands win. Comparison-aided names a peer. Directly-named puts your brand in the prompt, forcing 100% citation so the signal becomes sentiment, not presence. In our data the same brands dropped from 3.1 to 0.4 of 4 engines cited just by widening the query from category-aided to unaided.
Can a brand have high AI visibility and still have a problem?
Yes, in two ways. On directly-named trust queries ('is X legit') citation is forced to 100%, but sentiment quietly drops 10-15 points versus discovery queries - our strongest discovery performers had the biggest trust-sentiment drops. And being cited is not being understood: across 15 paired verification checks, zero were fully correct, with founding years, product lines, and even brand recognition wrong about half the time. Share of Voice alone hides both problems.
How much does it cost to audit AI brand recognition?
Roughly $0.02 per audit cell - four model calls plus scoring. That's cheap enough to run the recommended multi-engine, multi-query, multi-run protocol densely rather than as a one-off. The full instrument crosses Brand x Test Query x Category x Entity Type x AI Model and scores Share of Voice, five Visibility Vitals, sentiment, and citation rank.
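A back-of-envelope sketch of what that adds up to. It assumes one cell is one brand x query x run across all four engines, which is one reading of "four model calls plus scoring"; the function name and the example sizes are illustrative.

```python
COST_PER_CELL_CENTS = 2  # "roughly $0.02 per audit cell", from the answer above


def audit_cost_cents(brands, queries_per_brand, runs):
    """Total cost in cents, assuming one cell per brand x query x run."""
    cells = brands * queries_per_brand * runs
    return cells * COST_PER_CELL_CENTS


# One brand, one query per funnel depth (4), three runs each.
one_brand = audit_cost_cents(brands=1, queries_per_brand=4, runs=3)
print(f"one brand: ${one_brand / 100:.2f}")

# Fifty brands under the same protocol.
fifty_brands = audit_cost_cents(brands=50, queries_per_brand=4, runs=3)
print(f"fifty brands: ${fifty_brands / 100:.2f}")
```

Working in whole cents avoids floating-point drift when the cell count gets large.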
Why do AI engines describe mid-market brands incorrectly?
It's a low-density entity problem - the engine lacks enough clean, corroborated facts about the brand, so it guesses, hedges, or confuses you with someone else. It's also the most fixable failure: wire up Organization schema, build a consistent entity graph (Wikidata, sameAs, identical facts everywhere), and you feed the engine the facts it's currently inventing. Clarity has to be scored separately from whether you were cited at all.