Site Navigation for AI Crawlability — Training Crawlers and Retrieval Crawlers Need Different Things

April 27, 2026

Almost every "AI crawlability" guide treats AI crawlers as a single audience and gives one set of advice: flatten your site, fix breadcrumbs, kill JS nav. That advice is half right and half actively harmful, because there are two categories of AI crawler hitting your store and they want opposite things from your navigation. Optimize for one and you punish the other.

The thesis in one sentence

Training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended) sweep your whole catalog once and care about graph reachability. Retrieval crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User, Claude-User, Gemini grounding) fetch a single page on demand and care about chunk addressability. Sitemap depth, faceted-URL rules, JS rendering, and breadcrumb schema have different correct answers for each.

AI crawlers serve two fundamentally different purposes: they either train models (by extracting your content to improve AI capabilities) or retrieve content for generative answers (by pulling your information to answer user questions directly within AI interfaces).

  • 2 crawler categories: training and retrieval
  • #fragment anchors: what retrieval crawlers cite, not bare URLs
  • /llms.txt: a separate nav file most stores still don't ship

The two-category map

The user-agent string tells you which mode the crawler is in. The mode tells you which of your navigation choices it actually uses.

Training
  • Examples: GPTBot, ClaudeBot, CCBot, Google-Extended, Applebot-Extended
  • Reads: the whole site over weeks/months, batched, stateless, mostly no JS
  • Cares about: sitemap, click depth, server-rendered links, canonicals, robots.txt budget
  • Doesn't care about: anchor IDs, fragment URLs, page latency under 2s

Retrieval
  • Examples: OAI-SearchBot, PerplexityBot, ChatGPT-User, Claude-User, Gemini grounding fetch
  • Reads: one URL at a time in response to a live query, JS sometimes rendered
  • Cares about: anchor IDs (citation targets), HTTP Link headers, time to first byte, llms.txt priorities
  • Doesn't care about: click depth, breadcrumb schema, faceted-URL bloat (it's been handed a URL)

The split is documented at the source: OpenAI's bot directory separates GPTBot (training) from OAI-SearchBot and ChatGPT-User (retrieval), and Anthropic publishes the same split for ClaudeBot vs Claude-User. Cloudflare's 2025 crawler study shows the two categories driving very different traffic patterns across its network. For more on the OpenAI side specifically, see our crawler comparison and the GPTBot traffic data piece.

Click depth: a training-crawler concept, not a retrieval one

Pages four or more clicks from the homepage are statistically less likely to be discovered by a training crawler walking your link graph. That is a real effect, and it's the only piece of "flatten your nav" advice that survives scrutiny.

Distance From Index (DFI) measures how many clicks separate a page from the homepage. It remains one of the strongest signals influencing how search crawlers and training-mode AI bots prioritize your content.

It does not apply to retrieval crawlers at all. When a shopper asks ChatGPT "which Allbirds Tree Runner is on sale," OAI-SearchBot is handed a URL that's already three layers deep into your collection structure. It doesn't traverse from the homepage — it fetches the deep URL directly. Click depth is irrelevant to that fetch.

Practical translation:

  • Keep canonical product URLs reachable in 2–3 clicks from the homepage so training crawlers find them at all. This is what determines whether a product enters the corpus that next year's models are trained on.
  • Stop optimizing variant URLs and filtered combinations for crawl depth. They're for retrieval crawlers, which arrive directly. Don't link them from the homepage just to flatten depth — you'll dilute your link graph for training and gain nothing.
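
If you want to measure click depth on your own catalog instead of eyeballing the menu, a rough breadth-first walk of the internal link graph is enough. The sketch below is a minimal audit, assuming Node 18+ (global fetch); the origin, depth budget, and page limit are placeholders, and the naive href regex stands in for a real HTML parser.

// click-depth.ts: rough DFI audit. BFS the internal link graph from the homepage
// and report product pages more than MAX_DEPTH clicks deep. Node 18+ (global fetch).
const ORIGIN = "https://acme.com";  // placeholder store
const MAX_DEPTH = 3;                // the 2-3 click budget for training crawlers
const PAGE_LIMIT = 500;             // stop early so the sketch stays polite

async function crawl(): Promise<void> {
  const depth = new Map<string, number>([[`${ORIGIN}/`, 0]]);
  const queue: string[] = [`${ORIGIN}/`];

  while (queue.length > 0 && depth.size < PAGE_LIMIT) {
    const url = queue.shift()!;
    const d = depth.get(url)!;
    if (d > MAX_DEPTH) continue;  // record pages one level past the budget, don't expand them

    let html: string;
    try {
      const res = await fetch(url, { headers: { "User-Agent": "depth-audit/0.1" } });
      if (!res.ok || !(res.headers.get("content-type") ?? "").includes("text/html")) continue;
      html = await res.text();
    } catch {
      continue;
    }

    // Naive link extraction: same-origin hrefs, ignoring fragments and query strings.
    for (const match of html.matchAll(/href="([^"#?]+)"/g)) {
      let next: URL;
      try { next = new URL(match[1], url); } catch { continue; }
      if (next.origin !== ORIGIN || depth.has(next.href)) continue;
      depth.set(next.href, d + 1);
      queue.push(next.href);
    }
  }

  // Canonical product URLs past the budget are the ones training crawlers may never reach.
  for (const [url, d] of depth) {
    if (d > MAX_DEPTH && url.includes("/products/")) console.log(`${d} clicks  ${url}`);
  }
}

crawl();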

Anchor IDs: the missing layer for retrieval crawlers

Retrieval crawlers don't cite URLs. They cite chunks. Open any Perplexity answer and inspect the citation links — they end in #:~:text=... (a text fragment, the browser-supported URL syntax for deep-linking arbitrary text) or #section-id (a named anchor). ChatGPT's inline citations and Google AI Overviews do the same. The model wants to point its user at the exact paragraph that supports the claim, not the page top.

For that to work, your H2/H3/H4 tags need stable, descriptive id attributes. Most Shopify themes, blog templates, and CMS exports omit them entirely — which means a retrieval crawler can index the page, but can't deep-cite anything inside it, so it picks a different source that can.

<!-- Bad: invisible to chunk-level citation -->
<h2>Sizing guide</h2>

<!-- Good: addressable, stable, linkable -->
<h2 id="sizing-guide">Sizing guide</h2>

<!-- Better: paragraph-level for long-form -->
<p id="return-window">Returns are accepted within 30 days...</p>

The same principle applies to FAQ answers. Each <dt>/<summary> block should carry an id derived from the question slug. That's the difference between "the model knew your store had a return policy" and "the model quoted your return policy verbatim and linked the exact line."
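
How the id gets derived is up to your template; one minimal approach is to slugify the question text at render time. The helper below is an illustrative sketch, not lifted from any particular theme; the slug rules and the <details>/<summary> shape are assumptions to adapt.

// Derive a stable anchor id from an FAQ question so retrieval crawlers can deep-cite the answer.
function slugify(question: string): string {
  return question
    .toLowerCase()
    .replace(/[^a-z0-9\s-]/g, "")  // drop punctuation
    .trim()
    .replace(/\s+/g, "-");         // spaces to hyphens
}

// Illustrative renderer; adapt to your theme's FAQ template.
function renderFaqItem(question: string, answerHtml: string): string {
  const id = slugify(question);    // e.g. "what-is-your-return-window"
  return `<details>
  <summary id="${id}">${question}</summary>
  <div>${answerHtml}</div>
</details>`;
}

console.log(renderFaqItem("What is your return window?", "Returns are accepted within 30 days."));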

Quick test

Open your top product page, view source, and grep for id= on H2/H3 tags. If you see only the auto-generated theme IDs (shopify-section-...) and no semantic IDs on content headings, retrieval crawlers can't deep-cite this page. Add semantic IDs once at the template level — it propagates to every product.
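
If view-source-and-grep gets tedious across templates, the same spot check can be scripted. A throwaway sketch, assuming Node 18+ with global fetch; the example URL is a placeholder and the regex parsing is only good enough for a quick audit.

// anchor-audit.ts: list the id attributes on a page's H2/H3 headings.
async function main(): Promise<void> {
  const url = process.argv[2] ?? "https://acme.com/products/cloudrunner-pro"; // placeholder
  const html = await (await fetch(url)).text();
  const headings = [...html.matchAll(/<h[23]\b([^>]*)>/gi)];

  let semantic = 0;
  for (const [, attrs] of headings) {
    const id = /\bid="([^"]+)"/.exec(attrs)?.[1];
    if (!id) console.log("NO ID        (not deep-citable)");
    else if (id.startsWith("shopify-section-")) console.log(`THEME ID     #${id}`);
    else { semantic++; console.log(`SEMANTIC ID  #${id}`); }
  }
  console.log(`${semantic}/${headings.length} headings carry semantic IDs on ${url}`);
}

main();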

llms.txt: a navigation file most stores still don't ship

Sitemap.xml was built for traditional search crawlers — a flat list of every URL with last-modified timestamps. It's a poor fit for AI agents, which want a small, prioritized index of the documents that actually answer questions. The community-proposed answer is /llms.txt (and the longer /llms-full.txt), a markdown file at the site root that lists your highest-value pages with one-sentence descriptions. The full proposal lives at llmstxt.org.

# Acme Running

Specialty running shoe retailer. Independent sizing and gait advice.

## Products
- [CloudRunner Pro](https://acme.com/products/cloudrunner-pro): Daily trainer, 4mm drop, 280g
- [TrailMax 7](https://acme.com/products/trailmax-7): Aggressive lugged trail shoe, waterproof

## Policies
- [Returns](https://acme.com/policies/returns): 30-day window, free return shipping
- [Sizing guide](https://acme.com/pages/sizing): How our sizing compares to Nike, Adidas, Hoka

Two things to keep in mind. First, llms.txt is not yet a ratified standard, so don't expect it to do heavy lifting; treat it as cheap insurance rather than a primary channel. Second, ship /llms.txt at the apex domain (not /blog/llms.txt) and link it from the HTML head with <link rel="alternate" type="text/markdown" href="/llms.txt"> so retrieval crawlers can discover it without guessing the path.
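
If the page list already lives in a product feed or CMS, generating the file at build time keeps it from going stale. A minimal sketch; the data structure is hand-curated for illustration and the public/ output path assumes your host serves that directory from the apex.

// generate-llms.ts: emit /llms.txt at build time from a curated page list.
import { writeFileSync } from "node:fs";

interface Entry { title: string; url: string; note: string; }

const sections: Record<string, Entry[]> = {
  Products: [
    { title: "CloudRunner Pro", url: "https://acme.com/products/cloudrunner-pro", note: "Daily trainer, 4mm drop, 280g" },
    { title: "TrailMax 7", url: "https://acme.com/products/trailmax-7", note: "Aggressive lugged trail shoe, waterproof" },
  ],
  Policies: [
    { title: "Returns", url: "https://acme.com/policies/returns", note: "30-day window, free return shipping" },
  ],
};

let out = "# Acme Running\n\nSpecialty running shoe retailer. Independent sizing and gait advice.\n";
for (const [section, entries] of Object.entries(sections)) {
  out += `\n## ${section}\n`;
  for (const e of entries) out += `- [${e.title}](${e.url}): ${e.note}\n`;
}

writeFileSync("public/llms.txt", out); // served as https://acme.com/llms.txt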

HTTP Link headers: a markdown side door

Retrieval crawlers parse HTML, but they parse markdown faster and more reliably (no boilerplate stripping, no nav-bar deduplication). You can serve a markdown rendering of any page alongside the HTML using HTTP Link headers, with no impact on what humans see:

HTTP/2 200
Content-Type: text/html
Link: </products/cloudrunner-pro.md>; rel="alternate"; type="text/markdown"
Link: <https://acme.com/products/cloudrunner-pro>; rel="canonical"

The crawler that prefers markdown follows the alternate; the crawler that doesn't ignores the header and reads the HTML. ChatGPT's Atlas browser and several agentic frameworks already prefer markdown alternates when offered. The build cost is small — most static-site generators emit markdown anyway, and Shopify/Next.js/Astro stores can render .md via the same product loader as HTML.
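
What this looks like in practice depends on the stack. As one example, a Next.js store could attach the headers in middleware; the sketch below assumes product pages live under /products/ and that a separate route (not shown here) actually renders the .md alternate.

// middleware.ts: advertise a markdown alternate for product pages via HTTP Link headers.
import { NextResponse } from "next/server";
import type { NextRequest } from "next/server";

export function middleware(request: NextRequest) {
  const response = NextResponse.next();
  const { pathname, origin } = request.nextUrl;

  if (pathname.startsWith("/products/") && !pathname.endsWith(".md")) {
    response.headers.set(
      "Link",
      [
        `<${pathname}.md>; rel="alternate"; type="text/markdown"`,
        `<${origin}${pathname}>; rel="canonical"`,
      ].join(", ")
    );
  }
  return response;
}

export const config = { matcher: "/products/:path*" };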

Faceted navigation: the advice you've heard is wrong for half your audience

The standard advice — "block ?color= in robots.txt, canonicalize filtered URLs to the unfiltered collection" — is correct for training crawlers and harmful for retrieval crawlers. A filtered collection URL is exactly what a retrieval crawler arrives at when a user asks "red running shoes size 10 under $150." Block it and you remove yourself from that query. Google's canonicalization guidance covers the mechanics; the part the guides skip is that the right rule depends on whether the filter is decorative or semantic.

The right configuration distinguishes by intent, not by URL pattern (a sketch of the decision logic follows the list):

  1. Use noindex, follow on filtered pages, not robots.txt blocks. Training crawlers see "don't add this to the corpus, but the links inside are fine to walk." Retrieval crawlers see a perfectly servable page when their user requests it directly.
  2. Reserve robots.txt blocks for true infinite spaces: session IDs, sort-order combos, paginated facets beyond page 50. Anything a human would never deep-link to.
  3. Set rel="canonical" from filtered to unfiltered only when the filter is decorative (sort order, view mode). Keep the canonical self-referential when the filter is semantic (color, size, price band) — the retrieval crawler's user asked for that filter.
  4. Don't nofollow internal facet links. That's a 2014 SEO reflex. It hides paths from training crawlers without doing anything useful.
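
Here is a minimal sketch of rules 1 and 3 as a single decision function. Which query parameters count as semantic versus decorative is store-specific; the parameter lists below are placeholders.

// facet-rules.ts: decide robots meta and canonical for a filtered collection URL.
const SEMANTIC_PARAMS = new Set(["color", "size", "price_min", "price_max"]);  // placeholder names
const DECORATIVE_PARAMS = new Set(["sort_by", "view", "grid"]);                // placeholder names

interface FacetDecision {
  robots: "index, follow" | "noindex, follow";
  canonical: string;
}

function decideFacetHandling(rawUrl: string): FacetDecision {
  const url = new URL(rawUrl);
  const params = [...url.searchParams.keys()];
  const hasSemantic = params.some((p) => SEMANTIC_PARAMS.has(p));
  const onlyDecorative = params.length > 0 && params.every((p) => DECORATIVE_PARAMS.has(p));
  const unfiltered = `${url.origin}${url.pathname}`;

  return {
    // Rule 1: filtered pages stay walkable but out of the index.
    robots: params.length === 0 ? "index, follow" : "noindex, follow",
    // Rule 3: canonical points at the unfiltered collection only when every filter is decorative;
    // semantic filters keep a self-referential canonical so retrieval crawlers can serve them.
    canonical: onlyDecorative && !hasSemantic ? unfiltered : rawUrl,
  };
}

console.log(decideFacetHandling("https://acme.com/collections/shoes?color=red&size=10"));
// robots: "noindex, follow", canonical stays self-referential (semantic filter)
console.log(decideFacetHandling("https://acme.com/collections/shoes?sort_by=price"));
// robots: "noindex, follow", canonical: "https://acme.com/collections/shoes" (decorative filter)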

Diagnose from server logs, not from theory

Stop guessing. Your access logs already tell you whether each crawler category is reaching what you want. The diagnostic is a one-liner you can run on any nginx/Cloudflare log export:

# Which URLs is each crawler category hitting? (assumes combined log format: the request line is the first quoted field)
awk -F'"' '/GPTBot|ClaudeBot|CCBot/                               {split($2, req, " "); print "TRAIN  " req[2]}
           /OAI-SearchBot|PerplexityBot|ChatGPT-User|Claude-User/ {split($2, req, " "); print "RETR   " req[2]}' \
  access.log | sort | uniq -c | sort -rn | head -50

Read the output against this checklist:

  • Training crawlers hitting only your homepage and sitemap.xml. Your link graph is broken — they can't traverse beyond the entry. Check JS-only navigation and orphan collections.
  • Retrieval crawlers hitting deep filtered URLs you've been blocking. Your robots.txt is killing your AI search visibility. Switch those URLs to noindex, follow.
  • AI answers citing fragment anchors that don't exist on your pages. Fragments are stripped before the request reaches your server, so they won't show up in logs; spot-check the cited URLs in the answers themselves. If the model links #sizing but your heading id is shopify-section-12345, it's hallucinating fragment IDs because there's no canonical list. Ship llms.txt with stable anchors.
  • Training crawlers stuck on faceted URLs (10x your product count). Standard facet bloat — apply the noindex, follow + selective canonical fix above.
  • Zero hits from one category. Something is hard-blocking the user-agent. Check WAF, Cloudflare bot rules, and robots.txt. See Cloudflare's default AI-crawler blocks for the most common cause.

The actual ship list

If you take only one pass at navigation for AI crawlers, do these in order. Each one is cheap, and each one closes a gap that exists on roughly nine out of ten Shopify and DTC stores we audit.

  1. Add semantic id attributes to every H2/H3 in product, collection, policy, and FAQ templates. One template change, full-catalog impact.
  2. Ship /llms.txt at the apex domain. Five high-value pages with one-sentence descriptions is enough to start.
  3. Replace robots.txt blocks on filter URL patterns with <meta name="robots" content="noindex, follow">. Keep blocks only for true infinite spaces.
  4. Verify breadcrumbs render server-side and emit BreadcrumbList JSON-LD (a minimal payload builder follows this list). This is the one piece of the old "flatten your nav" advice that still earns its keep, because it helps both crawler categories.
  5. Run the awk one-liner above against this week's access logs. Decide what to fix based on what crawlers are actually doing, not what guides assume they do.
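
For item 4, the JSON-LD payload itself is small. A minimal builder sketch; the trail here is illustrative, so feed it from your real collection hierarchy and render the output server-side inside a script type="application/ld+json" tag.

// breadcrumbs.ts: build the BreadcrumbList JSON-LD payload for a product page.
interface Crumb { name: string; url: string; }

function breadcrumbJsonLd(trail: Crumb[]): string {
  return JSON.stringify({
    "@context": "https://schema.org",
    "@type": "BreadcrumbList",
    itemListElement: trail.map((crumb, i) => ({
      "@type": "ListItem",
      position: i + 1,
      name: crumb.name,
      item: crumb.url,
    })),
  });
}

console.log(breadcrumbJsonLd([
  { name: "Home", url: "https://acme.com/" },
  { name: "Running shoes", url: "https://acme.com/collections/running" },
  { name: "CloudRunner Pro", url: "https://acme.com/products/cloudrunner-pro" },
]));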

GEOlikeaPro's Crawler View separates training-mode and retrieval-mode hits, surfaces the URLs each category is reaching (and the ones it isn't), and flags missing anchor IDs and llms.txt at the template level. Sign up free to run the audit on your store.

FAQ

Why split AI crawlers into training and retrieval categories?

Because they make opposite navigation demands. Training crawlers (GPTBot, ClaudeBot, CCBot, Google-Extended) sweep your whole catalog and care about graph reachability — sitemap, click depth, server-rendered links. Retrieval crawlers (OAI-SearchBot, PerplexityBot, ChatGPT-User, Claude-User, Gemini grounding) arrive on a single deep URL chosen by a user query and care about chunk-level addressability — anchor IDs, llms.txt, fast first byte. Advice that helps one often hurts the other.

Does click depth still matter?

Only for training crawlers, and only for canonical product/collection URLs. Keep those reachable in 2–3 clicks from the homepage so they enter the training corpus. Click depth is irrelevant to retrieval crawlers because they arrive directly at the URL the user's query produced — a four-clicks-deep filtered collection page is fine for OAI-SearchBot if a shopper asks for that exact filter.

What are anchor IDs and why do they matter for AI citations?

Retrieval crawlers don't cite URLs — they cite chunks. Perplexity, ChatGPT, and Google AI Overviews link to fragment URLs (#section-id or #:~:text=...) so the user lands on the exact paragraph that supports the claim. If your H2/H3 tags don't carry stable, descriptive id attributes, the model can index your page but can't deep-cite it, so it picks a different source. Add semantic IDs at the template level.

What is llms.txt and should I ship it?

It's a markdown file at the apex domain (/llms.txt) listing your highest-value pages with one-sentence descriptions — a small, prioritized index for AI agents that's distinct from sitemap.xml. It's a community-proposed convention, not a ratified standard, so treat it as cheap insurance rather than a primary channel. Ship it, link it from the HTML head with rel=alternate type=text/markdown, and keep it under 50 entries.

Should I block faceted URLs in robots.txt?

No, except for true infinite spaces (session IDs, sort-order combos, deep pagination beyond ~50 pages). Filtered URLs like /shoes?color=red&size=10 are exactly what retrieval crawlers fetch when a user asks 'red running shoes size 10' — a robots.txt block removes you from that query. Use meta robots noindex, follow instead: training crawlers skip indexing, retrieval crawlers can still serve the page directly to a user.

How do I tell if my navigation is actually working for AI crawlers?

Check your access logs, not theory. Filter for GPTBot/ClaudeBot/CCBot (training) and OAI-SearchBot/PerplexityBot/ChatGPT-User/Claude-User (retrieval) and look at which URLs each category hits. Training crawlers stuck on the homepage means a broken link graph. Retrieval crawlers fetching filtered URLs you've been blocking means your robots.txt is killing AI search visibility. Both categories absent means a WAF or Cloudflare bot rule is hard-blocking the user-agent.

Stay ahead of AI search changes

Join store owners getting weekly GEO insights, AI search updates, and optimisation tips.

Get GEO tips →

Free tier · No credit card required