How LLMs Choose Which Source to Cite: The RAG Pipeline Every AEO Practitioner Must Understand

Your content does not compete at the level of pages. It competes at the level of 256-token chunks. Understanding that single fact changes every AEO decision you make.

When a user asks Perplexity or ChatGPT a question, the engine does not retrieve ten web pages and read them. It retrieves dozens of passages — small text segments, typically 200 to 500 tokens each — scores them against the query semantics, reranks the best candidates, and feeds three to five passages to the language model as context. The model generates a response grounded in those passages and cites the sources they came from.

This is Retrieval-Augmented Generation, or RAG. Lewis et al. introduced the architecture in a 2020 paper at NeurIPS. Every major AI search system in 2026 — Perplexity, ChatGPT Search, Google AI Overviews, Claude in web search mode — runs some variant of it. The specific implementations differ. The underlying pipeline stages are consistent. Knowing those stages tells you exactly what your content needs to do at each step to get cited.

What Are the Four Stages of the RAG Pipeline That Determine Citation Selection?

The pipeline has four sequential stages. Your content must pass all four to earn a citation. Failing any one stage removes you from the candidate pool regardless of how well you perform on the others.

Stage 1: Crawl and chunking — how your content enters the retrieval pool

Before any query is run, AI systems index content by crawling it and splitting it into chunks. Chunking is not arbitrary splitting at a fixed character count. Production RAG systems in 2026 use sentence-aware chunking — splitting at semantic boundaries like sentence ends and paragraph breaks — to preserve meaning within each chunk. Chunks that split mid-sentence produce retrieval failures: the passage lacks the subject or verb that gives it meaning, and the embedding model cannot represent it accurately.

The crawl precedes the chunking. If your page is blocked to GPTBot, ClaudeBot, or PerplexityBot, whether by robots.txt, by a WAF rule, or by client-side JavaScript rendering that leaves the initial HTML response without the content, the page never enters the retrieval pool at all. There is no Stage 2, 3, or 4 for a page that cannot be crawled. This is the most common fixable AEO failure: technically well-written content sitting behind an access barrier that makes it invisible to the retrieval system. The AI crawlers guide covers every crawler user-agent and every common blocking pattern.
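The robots.txt part of that check can be sketched with Python's standard library. This is a minimal illustration, not a full crawler audit: the function name crawler_access and the sample rules are assumptions made for the example, and it says nothing about WAF rules or JavaScript rendering, which need a real HTTP request to test.

```python
from urllib.robotparser import RobotFileParser

# User-agent names taken from the article; extend the list as needed.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]

def crawler_access(robots_txt: str, url: str, agents=AI_CRAWLERS) -> dict[str, bool]:
    """Report which user-agents a robots.txt body allows to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}

# Hypothetical robots.txt that blocks GPTBot and allows everyone else.
SAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    print(crawler_access(SAMPLE_ROBOTS, "https://example.com/guide"))
```

In practice you would fetch your own /robots.txt and pass its body in. A False for any crawler means that engine never reaches Stage 2 for that URL.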

For AEO, the chunking implication is direct: each section of your content should be self-contained enough to be citable when extracted without the surrounding context. A paragraph that begins "As mentioned above, this technique also..." cannot stand alone as a chunk, because its subject lives in a paragraph the retriever may never see. A paragraph that begins "FAQPage schema improves AI citation rates by making Q&A pairs independently extractable at the passage level" is self-contained and can be retrieved as a standalone chunk that answers a specific query.
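To make sentence-aware chunking concrete, here is a minimal sketch. It assumes a naive regex sentence splitter and uses whitespace-separated words as a stand-in for tokens; production systems use real tokenizers and more careful boundary detection, and the function name is hypothetical.

```python
import re

def sentence_aware_chunks(text: str, max_tokens: int = 256) -> list[str]:
    """Greedily pack whole sentences into chunks without splitting mid-sentence."""
    # Naive split: break after ., ! or ? when followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0
    for sentence in sentences:
        n_tokens = len(sentence.split())  # word count as a rough token proxy
        # Close the current chunk if adding this sentence would exceed the budget.
        if current and current_len + n_tokens > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Because the split only ever happens between sentences, a paragraph whose first sentence names its subject keeps that subject inside whichever chunk it lands in.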

Stage 2: Embedding and vector similarity — the semantic matching layer

Every chunk is passed through an embedding model that converts it to a high-dimensional vector — typically 768 to 1,536 floating-point numbers. The query is converted to a vector by the same model. The system then finds chunks whose vectors are close to the query vector in that high-dimensional space, using approximate nearest-neighbour search across potentially billions of stored vectors.
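The similarity step can be sketched with cosine similarity over toy vectors. The three-dimensional vectors below are made up for illustration, and the search is exact brute force rather than the approximate nearest-neighbour index a production system would use.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], chunk_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k chunks closest to the query vector."""
    ranked = sorted(
        chunk_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [chunk_id for chunk_id, _ in ranked[:k]]

# Hypothetical embeddings: real ones have hundreds of dimensions or more.
TOY_CHUNKS = {
    "a": [0.9, 0.1, 0.0],
    "b": [0.0, 1.0, 0.0],
    "c": [0.5, 0.5, 0.0],
}
```

With the query vector [1.0, 0.0, 0.0], chunk "a" ranks first and chunk "c" second, because their vectors point closest to the query's direction.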

This is semantic matching, not keyword matching. Two passages can have zero word overlap and still score as highly similar if they cover the same concept. "How to improve AI citation rates" and "strategies for getting your content cited by ChatGPT" will match each other at the embedding level even though they share no keywords. This matters for AEO because keyword density — the traditional SEO concern — has almost no bearing on retrieval at this stage. Entity density does. The embedding model encodes conceptual meaning. A passage that covers the concept thoroughly, with named entities, specific mechanisms, and clear relationships, embeds into a more information-rich vector than a passage that covers the same surface area with vague language.
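A quick way to see why keyword matching alone misses that pair is to compute the exact word overlap between the two phrases. The helper below is a hypothetical illustration using lowercase whitespace tokens, not how any engine tokenizes text.

```python
def keyword_overlap(a: str, b: str) -> set[str]:
    """Words that appear in both phrases after lowercasing."""
    return set(a.lower().split()) & set(b.lower().split())

PHRASE_1 = "How to improve AI citation rates"
PHRASE_2 = "strategies for getting your content cited by ChatGPT"

if __name__ == "__main__":
    print(keyword_overlap(PHRASE_1, PHRASE_2))
```

The overlap is empty, so a purely lexical matcher sees the two phrases as unrelated. An embedding model is what connects "citation" to "cited" and "AI" to "ChatGPT".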

In 2026, production RAG systems use hybrid search by default. Hybrid search combines vector similarity (which catches semantic equivalence) with BM25 sparse retrieval (which catches exact keyword matches). The results from both methods are merged using Reciprocal Rank Fusion before being passed to Stage 3. For AEO, hybrid search means you need both semantic density (to score in vector search) and exact-phrase matching on the specific query terms buyers use (to score in BM25). Writing only in synonyms fails BM25. Writing only in exact query phrases fails the semantic depth check. Both layers require substance in the same passage.
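Reciprocal Rank Fusion is simple enough to sketch in a few lines. Each chunk scores 1 / (k + rank) in every list it appears in, and the scores are summed. The constant k = 60 is the value conventionally used with RRF and is an assumption here, since the article does not state which value any given engine uses.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk ids into one fused ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

# Hypothetical result lists from the two retrievers.
vector_results = ["a", "b", "c"]
bm25_results = ["b", "d", "a"]

if __name__ == "__main__":
    print(reciprocal_rank_fusion([vector_results, bm25_results]))
```

Chunks "b" and "a" appear in both lists and outrank "d" and "c", which appear in only one. That is the mechanical reason a passage needs to score in both vector search and BM25.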

Stage 3: Cross-encoder reranking — the precision filter that sets the final shortlist

Initial vector retrieval is fast but approximate. It returns 20 to 50 candidate chunks, some of which are semantically adjacent to the query but not actually the best answer. A cross-encoder reranker then scores each candidate against the original query with full attention — meaning it reads both the query and the passage together and produces a relevance score far more accurate than the embedding comparison.

Cross-encoders are expensive. They process one query-passage pair at a time. Running them across all indexed content would be prohibitively slow. The two-stage pipeline — fast approximate retrieval, then precise reranking — gives production systems both speed and accuracy. In 2026, Cohere Rerank 3.5 and Jina Reranker v2 are the most widely used rerankers in production RAG pipelines (Let's Data Science, March 2026). Perplexity uses a custom variant of this architecture in its Sonar model.

The reranking stage is where content quality separates winners from losers. A chunk that scored well in Stage 2 on semantic similarity but does not actually answer the query precisely gets demoted. A chunk that answers the exact query — with specific claims, named evidence, and a direct opening sentence — gets promoted even if it scored lower on raw embedding similarity. This is the mechanical explanation for why the BLUF writing structure improves citation rates: the direct-answer opening sentence is what the cross-encoder rewards. A passage that opens with context, hedges, or prerequisite explanation before reaching the answer scores lower than a passage where sentence one is the answer.
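The two-stage shape of retrieval and reranking can be sketched with pluggable scoring functions. Both scorers here are placeholders: cheap_score stands in for embedding similarity and precise_score stands in for a cross-encoder. Neither is a real model, and the function name is hypothetical.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    chunks: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    n_candidates: int = 20,
    n_final: int = 5,
) -> list[str]:
    """Stage one narrows the pool cheaply; stage two reorders only the survivors."""
    # Fast, approximate pass over every chunk.
    candidates = sorted(chunks, key=lambda c: cheap_score(query, c), reverse=True)[:n_candidates]
    # Expensive, precise pass over the shortlist only.
    return sorted(candidates, key=lambda c: precise_score(query, c), reverse=True)[:n_final]
```

The expensive scorer runs n_candidates times instead of once per indexed chunk, which is the whole point of the design. It also means a chunk that misses the stage-one shortlist can never be rescued by the reranker.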

Stage 4: Context assembly and generation — what goes into the LLM's context window

The top three to five passages from Stage 3 are assembled into a context window and passed to the language model along with the user query. The model generates a response grounded in that context and produces citations pointing back to the source URLs of the passages it used.

Context window assembly introduces a final selection decision: which passages fit. Context windows are finite. GPT-4o has a 128,000-token context window, but production RAG pipelines do not fill the entire context with retrieved passages, because doing so degrades response quality. Most production systems use 2,000 to 4,000 tokens of retrieved context. At 256 tokens per chunk, that budget holds roughly 7 to 15 passages; at around 500 tokens per chunk it holds only four to eight. Passage quality at Stage 3 determines inclusion. Once included, passage position within the context window affects how strongly the model attends to it, a mechanism covered in detail in the next article in this series on attention mechanics and position bias.
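The budget arithmetic and the packing decision can be sketched as follows. This assumes a simple greedy strategy over passages already ranked by the reranker. Real systems may pack differently, and the function names are hypothetical.

```python
def chunks_that_fit(budget_tokens: int, chunk_tokens: int = 256) -> int:
    """How many whole chunks of a given size fit in the retrieved-context budget."""
    return budget_tokens // chunk_tokens

def assemble_context(ranked_passages: list[tuple[str, int]], budget_tokens: int = 4000) -> list[str]:
    """Greedily keep passages in rank order while they fit the token budget.

    ranked_passages holds (passage_id, token_count) pairs, best first.
    """
    selected: list[str] = []
    used = 0
    for passage_id, n_tokens in ranked_passages:
        if used + n_tokens > budget_tokens:
            continue  # too long for the remaining space; a shorter one may still fit
        selected.append(passage_id)
        used += n_tokens
    return selected
```

chunks_that_fit(2000) gives 7 and chunks_that_fit(4000) gives 15. In the greedy pass, a long passage can lose its slot to a shorter, lower-ranked one, which is another reason to keep answers compact.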

What Does Chunking Strategy Mean for How You Write AEO Content?

Chunk boundaries are invisible to you when you publish. You cannot specify where the RAG system will cut your content. But you can write in a way that makes every natural paragraph boundary a viable chunk boundary — meaning every paragraph can be extracted and cited independently without losing meaning.

NVIDIA's chunking research found that page-level chunking achieved accuracy of 0.648, and that the optimal chunk size varied by query type: factoid queries (specific single-fact answers) performed best with 256 to 512 tokens, while analytical queries needed 1,024 or more tokens (Metricus, April 2026, citing NVIDIA technical research). For AEO, this means FAQ sections, which answer short factoid queries, should have dense, standalone answers of roughly 60 to 100 words per question. Technical explanation sections should be longer, 200 to 300 words, to give the embedding model enough context to represent the concept accurately.

Three concrete writing rules that match how chunking actually works:

Name the entity in every paragraph. A chunk extracted from the middle of your article needs its subject to be identifiable without the preceding paragraph. "It improves citation rates by 40%" fails as a standalone chunk. "FAQPage schema improves AI citation rates by 40% in Princeton GEO study analysis" succeeds. Entity-first writing is not a stylistic preference — it is a chunking requirement. The schema types guide applies this principle to structured data; it applies equally to prose.

Avoid mid-thought paragraph breaks. If a sentence begins a thought that continues in the next paragraph, the chunking model may split them. Each paragraph should complete one point, not carry one point across two. This is different from making every paragraph short — it is making every paragraph structurally complete.

Use specific, verifiable claims rather than hedged generalisations. Embedding models encode specificity as information density. "AI citations favour fresh content" has low information density. "ChatGPT cited 76.4% of pages updated within the last 30 days, per ICODA's April 2026 citation decay analysis" has high information density. The second embeds into a richer vector that scores better against specific queries.

How Does Query Fan-Out Change the Retrieval Dynamics?

Most user queries are too complex to answer from a single passage. "What AEO tool is best for a B2B SaaS team that needs Perplexity tracking and GA4 integration?" is not a single-passage query. It is three sub-queries: what AEO tools track Perplexity, which of those have GA4 integration, and which are suited for B2B SaaS teams.

AI engines handle this through query fan-out: the system decomposes the original query into sub-queries, runs each sub-query through the RAG pipeline independently, retrieves passages for each, and synthesises a response from the combined context. ChatGPT Search averages 2.1 sub-queries per prompt. Google AI Mode fans out across 8 or more sub-queries on complex questions (My Web Audit, citing query fan-out research from 2026). Perplexity uses a more contained sub-query model but still fans out on multi-part queries.

Query fan-out has a specific AEO implication that most content teams have not operationalised: your content earns citations through sub-query matches, not just full-query matches. A page that answers one specific sub-query well — "NotioncCue tracks Perplexity citations" — can be cited in the synthesised response to a much broader query, even if the full query would not match your page on direct comparison. This is why the prompt engineering guide recommends designing multi-requirement test prompts: they expose which sub-queries your content is and is not winning, which maps directly back to the specific passages in your content that need to be written or strengthened.
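The fan-out flow can be sketched as a loop over sub-queries with de-duplication. The decomposition step is assumed to have already happened (in a real engine the model does it), and retrieve is a placeholder for the full retrieval pipeline, so everything named here is hypothetical.

```python
from typing import Callable

def fan_out_retrieve(
    sub_queries: list[str],
    retrieve: Callable[[str], list[str]],
    per_query: int = 3,
) -> list[str]:
    """Run each sub-query separately and merge the passage ids, keeping first-seen order."""
    merged: list[str] = []
    seen: set[str] = set()
    for sub_query in sub_queries:
        for passage_id in retrieve(sub_query)[:per_query]:
            if passage_id not in seen:
                seen.add(passage_id)
                merged.append(passage_id)
    return merged

# Hypothetical index: each sub-query maps to its ranked passage ids.
TOY_INDEX = {
    "which AEO tools track Perplexity": ["p1", "p2"],
    "which AEO tools have GA4 integration": ["p2", "p3"],
}

if __name__ == "__main__":
    print(fan_out_retrieve(list(TOY_INDEX), lambda q: TOY_INDEX.get(q, [])))
```

Passage "p1" reaches the combined context purely by matching one sub-query, even though it never had to match the user's full question.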

How NotioncCue Targets Each Stage of the RAG Pipeline

Most AEO tracking tools measure the output of the RAG pipeline — whether your brand appeared in the final answer. NotioncCue tools target the inputs at each stage where the decision is actually made.

The NotioncCue AI Crawler Audit targets Stage 1. It checks whether your content enters the retrieval pool at all by verifying that each AI crawler or crawler token (GPTBot, OAI-SearchBot, PerplexityBot, ClaudeBot, Claude-SearchBot, Google-Extended, and Bingbot) can access your content in the server-rendered HTML response. It flags JavaScript-rendered content that crawlers receive as an empty page, WAF rules that block AI crawler IP ranges, and robots.txt configurations that accidentally exclude specific AI user-agents. A crawler that cannot reach your content cannot chunk it. The audit surfaces exactly which of your high-value pages are failing Stage 1 before they fail Stages 2, 3, and 4.

The NotioncCue Prompt Tracker measures Stage 4 output — which passages made it into the generated response and which source URLs were cited. By running the same prompt set weekly across ChatGPT, Perplexity, Claude, Gemini, and Google AI Overviews, the Prompt Tracker shows you week-over-week citation rate changes that correspond to upstream pipeline changes. When you add schema, rewrite an opening paragraph for BLUF structure, or fix a crawler access issue, the Prompt Tracker is what tells you whether the change moved citation rate on the target prompts. Without that measurement, you cannot distinguish content improvements from external factors like competitor changes or model updates.

Start your free NotioncCue trial and run the AI Crawler Audit first. Until you know your content is in the retrieval pool, no content improvement can produce citation results — and the crawler check takes 15 minutes, not the weeks that content improvements require.

One practical test reveals whether your content is being chunked effectively without any tools. Paste any paragraph from your highest-value AEO page into ChatGPT with the prompt: "What specific question does this passage answer?" If ChatGPT cannot identify a specific question — if it returns "this passage covers X broadly" rather than "this passage answers Y" — the paragraph is not structured as a retrievable chunk. Rewrite it with a direct claim in sentence one and run the test again. Chunkable paragraphs answer specific questions. Unchunkable paragraphs cover topics.

Frequently Asked Questions About How LLMs Choose Which Sources to Cite

Does domain authority affect RAG retrieval the same way it affects Google rankings?
Not directly. Vector similarity and cross-encoder reranking score passage quality, not page authority. A high-authority domain with a vague, low-information-density passage loses to a low-authority domain with a specific, information-dense passage in the retrieval stage. Domain authority enters the RAG pipeline indirectly: high-authority pages are more likely to be indexed and crawled frequently, which means their content enters the retrieval pool more reliably and stays fresher. The authority signal is about access and indexing confidence, not about passage-level scoring.

How does structured data like schema affect the RAG pipeline?
Schema is read by AI crawlers during Stage 1. FAQPage schema makes Q&A pairs independently extractable as chunks with explicit question-answer structure that the chunking model recognises. HowTo schema makes each step a named, self-contained chunk. Organization schema establishes entity identity that embedding models use to resolve ambiguous references. Schema does not affect the embedding or reranking scores directly, but it does improve chunking quality, which means the passages that enter Stage 2 are better structured and score more consistently at Stages 2 and 3.

Can the same passage be cited by multiple AI engines from the same page?
Yes. Each engine runs its own RAG pipeline independently. The same passage can score in the top three on Perplexity's reranker and also score in Google AI Overview's context assembly on the same day. Different engines chunk differently, embed with different models, and use different rerankers — so the same content can score differently across engines and be cited on some but not others. This is why the AEO measurement guide recommends tracking citation rate per engine rather than as an aggregate: a passage consistently cited on Perplexity but never on ChatGPT has a specific gap at Stage 2 or 3 that is addressable without changing the underlying content quality.

How long does it take for a content change to work through the RAG pipeline and appear in citations?
Stage 1 (crawling and re-indexing) is the rate-limiting step. An IndexNow ping triggers Bingbot within minutes. Perplexity typically re-crawls and re-indexes within 24 to 72 hours. Google's AI systems follow Google's standard crawl schedule, typically one to two weeks for most content. Once a chunk is in the index, it is immediately eligible for retrieval at Stages 2, 3, and 4 on the next relevant query. The practical measurement window: check Perplexity citation rate three to five days after a content change to get the fastest feedback signal. Check ChatGPT and Google AI Overviews four to six weeks after to confirm the change is holding across slower-updating engines.
