How AI Crawlers Index Your Site (And Why Your robots.txt May Be the Problem)

Before any of your AEO (answer engine optimization) content work pays off, a bot has to fetch your page.

No crawl, no citation. It does not matter how well your content answers the question: if GPTBot or PerplexityBot cannot reach the page, they cannot use it. I keep seeing the same pattern. Teams put real work into structuring content for AI answers (adding FAQs, fixing schema, reformatting headings), then the citations still do not come, and nobody knows why. Usually it is because the crawlers were blocked before they ever reached the content. Nothing in GA4 shows it, because GA4 only records visits that run its JavaScript tag and these bots do not. It just stays quiet.

The Three Types of AI Crawler (and They Are Not Interchangeable)

Most people treat AI crawlers as a single category. They are three distinct things, and the distinction matters for how you set up robots.txt.

Training crawlers are the bulk collectors. GPTBot (OpenAI) and ClaudeBot (Anthropic) crawl the web to improve future model versions. Blocking them keeps your content out of training datasets but has no direct effect on whether ChatGPT or Claude cites you in live answers today.

Retrieval and search crawlers are what actually drive live citations. OAI-SearchBot powers ChatGPT Search. Claude-SearchBot feeds Anthropic's retrieval pipeline. PerplexityBot indexes pages that Perplexity cites with live links. These are the bots you need to allow if you want to show up in AI-generated answers.

On-demand fetchers work differently again. ChatGPT-User and Claude-User fire when a real user asks something that needs a current page. They visit one URL, in real time, because a person triggered it. A lot of robots.txt configs accidentally catch these alongside the bulk bots.

You can block training crawlers and allow retrieval crawlers at the same time. OpenAI documents this explicitly. Block GPTBot to stay out of training data, allow OAI-SearchBot to keep your pages in ChatGPT answers. Two decisions, two separate directives.

AI Crawlers Do Not Run JavaScript

Many brand sites render content client-side: React, Vue, or a Next.js app that fetches its data in the browser. A browser executes JavaScript, fills the DOM, and a user sees the full page. AI crawlers skip that step.

Vercel published analysis of how GPTBot and ClaudeBot interact with Next.js applications. Both crawlers fetch JavaScript files but do not execute them. They read the raw HTML the server sends. If your main content only exists after JavaScript runs, those crawlers get an empty container.

Gemini is the notable exception, because it crawls through Googlebot's infrastructure, which does render JavaScript. Most other major AI crawlers work from initial HTML only. If you are running a React SPA with client-side data fetching, a significant portion of your content is likely invisible to GPTBot and PerplexityBot right now. Server-side rendering or static generation for key pages fixes this.
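
To make the "empty container" problem concrete, here is a minimal Python sketch that checks whether phrases you care about appear in a page's raw HTML once script and style contents are ignored. The helper names (visible_text, missing_phrases, fetch_raw_html) and the two sample pages are hypothetical, made up for illustration, and this only approximates what a non-rendering crawler reads.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class _VisibleText(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text a crawler that does not run JavaScript could read."""
    parser = _VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def missing_phrases(html, phrases):
    """List the phrases that are absent from the server-sent HTML text."""
    text = visible_text(html).lower()
    return [p for p in phrases if p.lower() not in text]


def fetch_raw_html(url, user_agent="GPTBot"):
    """Fetch the initial HTML only; nothing here executes JavaScript."""
    request = Request(url, headers={"User-Agent": user_agent})
    with urlopen(request, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Hypothetical pages: a client-rendered shell and a server-rendered page.
SPA_SHELL = (
    '<html><body><div id="root"></div>'
    '<script>render("Pricing starts at $10")</script></body></html>'
)
SSR_PAGE = "<html><body><h1>Pricing starts at $10</h1></body></html>"

if __name__ == "__main__":
    print(missing_phrases(SPA_SHELL, ["Pricing starts at $10"]))
    print(missing_phrases(SSR_PAGE, ["Pricing starts at $10"]))
```

Against a live page, the same check would be missing_phrases(fetch_raw_html("https://yourdomain.com/your-page/"), ["a sentence you expect to be cited"]). A non-empty result suggests that sentence only exists after JavaScript runs.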

What Your robots.txt Should Actually Say

Here is a clean starting point for a brand that wants AI search visibility while keeping content out of training datasets:

# Retrieval and search crawlers — allow for citations
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

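# Google-Extended is a control token, not a separate crawler. It governs use of
# your content for Gemini training and grounding, so decide which goal wins here.
User-agent: Google-Extended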
Allow: /

# Training crawlers — block to stay out of training data
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: CCBot
Disallow: /

# Standard search — leave these open
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

User-agent: *
Allow: /

Sitemap: https://yourdomain.com/sitemap.xml

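Each crawler is matched by its own user-agent token. A crawler follows the group that names it and falls back to the * group only when no group does, so a named Disallow is not undone by the wildcard Allow. ClaudeBot and Claude-SearchBot are independently controllable: blocking ClaudeBot does not affect Claude-SearchBot. OpenAI follows the same pattern: block GPTBot, allow OAI-SearchBot, and both decisions hold independently.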
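
One way to confirm that these decisions really are independent is to run the rules through a parser. The sketch below uses Python's standard urllib.robotparser on a shortened copy of the file above. The check_access helper is a hypothetical name, and real crawlers may interpret edge cases differently from this parser, so treat the result as a sanity check rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the configuration above, kept inline so the check runs offline.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Allow: /
"""


def check_access(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["GPTBot", "OAI-SearchBot", "ClaudeBot", "Claude-SearchBot", "SomeOtherBot"]
    results = check_access(ROBOTS_TXT, agents, "https://yourdomain.com/your-page/")
    for agent, allowed in results.items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

To test your live file instead, call set_url() with your robots.txt address and then read() on the parser in place of parse().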

Three Checks to Run Now

Fetch your key pages as plain HTML. Run curl -A "GPTBot" https://yourdomain.com/your-page/ and read what comes back. If the content you want AI to cite is not in that output, JavaScript is hiding it.
Test your robots.txt against each bot individually. Look specifically for wildcard Disallow rules that may be catching retrieval bots you want to allow.
Connect crawl activity to citation data. A bot visiting your page and that page appearing in an AI answer are two separate events. The NotioncCue AI Crawler Audit links crawl frequency data to citation tracking across ChatGPT, Perplexity, Claude, and Google AI Overviews.

The Blocker Nobody Checks: CDN and WAF Rules

robots.txt is not the only layer that can stop AI crawlers. CDN and WAF rules are applied to every request before your server responds, including the request for robots.txt itself. AI crawlers mostly operate from US-based cloud infrastructure, so firewall rules targeting "unknown bots" or "datacenter IPs" can catch them as collateral damage. Cloudflare's Bot Fight Mode is a common example.

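If your logs show zero AI crawler traffic and your robots.txt is clean, check your WAF and CDN bot management settings. The fix is allowlisting the specific crawlers at that layer, not just in robots.txt. User-agent strings are easy to spoof, so where a vendor publishes its crawler IP ranges, match on those as well.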
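
A rough way to see whether AI crawlers reach you at all is to count their requests and response codes in your access logs. The sketch below assumes the common "combined" log format. The token list, the crawler_status_counts helper, and the sample lines are hypothetical, and the user-agent strings in the sample are illustrative rather than exact copies of what each vendor sends.

```python
import re
from collections import Counter, defaultdict

# User-agent tokens named in this article; extend the list as needed.
AI_CRAWLER_TOKENS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "CCBot",
]

# Combined log format: "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'"[A-Z]+ \S+ [^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def crawler_status_counts(lines):
    """Return {crawler token: {status code: hits}} for known AI crawlers."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LOG_LINE.search(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        agent = match.group("agent").lower()
        for token in AI_CRAWLER_TOKENS:
            if token.lower() in agent:
                counts[token][match.group("status")] += 1
                break
    return {token: dict(statuses) for token, statuses in counts.items()}


# Made-up sample lines using documentation IP ranges.
SAMPLE_LOG = [
    '203.0.113.5 - - [10/Jan/2025:10:00:00 +0000] "GET /pricing/ HTTP/1.1" 200 5120 "-" '
    '"Mozilla/5.0 (compatible; GPTBot/1.1; +https://openai.com/gptbot)"',
    '203.0.113.9 - - [10/Jan/2025:10:00:05 +0000] "GET /pricing/ HTTP/1.1" 403 312 "-" '
    '"Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.7 - - [10/Jan/2025:10:00:09 +0000] "GET / HTTP/1.1" 200 8044 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]

if __name__ == "__main__":
    for token, statuses in crawler_status_counts(SAMPLE_LOG).items():
        print(token, statuses)
```

A crawler that shows only 403 responses is probably being blocked above robots.txt. A crawler that never appears may be blocked at the CDN edge, in which case the origin logs will not show it and you would need the CDN's own logs.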

Run the NotioncCue AI Crawler Audit to see which crawlers are accessing your site, which pages are being ignored, and where citation gaps are costing you AI search visibility.