How Perplexity Selects Sources: The Citation Algorithm Decoded

Two websites publish nearly identical articles about the same topic on the same day. Same domain authority, similar writing quality, comparable backlink profiles. Three weeks later, one gets cited in Perplexity answers every few days. The other has zero citations.

This is not random. Perplexity's source selection runs a documented six-stage pipeline that filters candidates at every step. Most brands optimise for a single stage and wonder why nothing works.

Perplexity processes roughly 780 million queries per month, up 239% from 230 million in August 2024. At that scale, it is a meaningful B2B distribution channel, not a novelty. Getting cited in a Perplexity response carries the same reach as coverage in a tier-one trade publication — except it happens in real time, every day, for the queries your buyers are actually running.

What Is the Six-Stage Perplexity Citation Pipeline?

Perplexity generates cited answers through a Retrieval-Augmented Generation pipeline with six distinct stages. A page must clear every stage to earn a citation. Failing any one of them removes it from consideration regardless of content quality.

Stage 1: Query intent parsing. Perplexity does not pass the raw user question to a search index. It uses a language model to parse the semantic structure of the query first. For complex questions, Pro Search and Deep Research modes break the query into three to five sub-queries and execute each separately. A user asking "what is the best AEO tracking tool for a B2B team" triggers sub-queries like "AEO tracking tools comparison," "B2B AI visibility platforms," and "citation monitoring software." Your page needs to be a candidate for at least one sub-query, not necessarily the full prompt.

Stage 2: Real-time retrieval. Perplexity combines BM25 keyword matching with dense embedding search to cast a wide net of candidate documents. Unlike ChatGPT, which leans on training data and activates web search on demand, Perplexity runs a live web search for every single query. Pages that Perplexity cannot crawl at query time are invisible at this stage. If PerplexityBot is blocked in your robots.txt or WAF, nothing else matters.
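As a rough illustration of what combining BM25 with dense search can look like, the sketch below scores a few documents with a textbook BM25 formula, scores them again with cosine similarity over embedding vectors, and merges the two rankings with reciprocal rank fusion. Everything here is an assumption made for illustration: Perplexity's actual scoring, fusion method, and parameters are not public, the function names are hypothetical, and the vectors are hand-written placeholders standing in for the output of a real embedding model.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; real systems use something richer."""
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each document against the query with the standard BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine_scores(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each document vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return [cos(query_vec, v) for v in doc_vecs]


def reciprocal_rank_fusion(score_lists, k=60):
    """Merge several score lists into one ranking of document indices."""
    fused = Counter()
    for scores in score_lists:
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        for rank, idx in enumerate(order, start=1):
            fused[idx] += 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


DOCS = [
    "AEO tracking tools comparison for B2B teams",
    "football season results and scores",
    "citation tracking software overview",
]
# Placeholder embeddings (assumption): a real pipeline would compute these.
QUERY_VEC = [1.0, 0.0]
DOC_VECS = [[0.9, 0.1], [0.0, 1.0], [0.7, 0.3]]


def hybrid_ranking(query):
    """Return document indices ordered by the fused keyword and dense rankings."""
    keyword = bm25_scores(query, DOCS)
    dense = cosine_scores(QUERY_VEC, DOC_VECS)
    return reciprocal_rank_fusion([keyword, dense])
```

Calling hybrid_ranking("AEO tracking tools") ranks the comparison page first because it leads both lists. The practical point is that a page can enter the candidate set through either route, exact keyword overlap or semantic similarity.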

Stage 3: Three-layer ML reranking. The initial retrieval set gets pushed through a three-layer reranker. Layer one scores basic relevance. Layer two evaluates domain authority through a machine learning model that weighs E-E-A-T signals, cross-platform presence, and structured data. Layer three — the most important for competitive brands — applies a "topic multiplier." Content in AI, technology, science, and business gets amplified. Entertainment and sports content gets suppressed. This is why Perplexity citations are not proportional to web traffic; niche B2B content on technical topics punches far above its domain authority weight.

Stage 4: Freshness scoring. Perplexity applies an aggressive time decay. The freshness sweet spot sits at roughly 30 days. Content published or last updated more than 30 days ago loses visibility rapidly unless it sits on a high-authority source. This is not about republishing the same article with a new date. Perplexity's systems detect that. The decay resets only when you update the evidence, add new data points, and make the publication date visible and accurate.
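One way to picture a time decay with a 30-day sweet spot is an exponential curve that halves every 30 days. This is a sketch under stated assumptions: the exponential shape and the 30-day half-life are illustrative choices, not a published Perplexity formula, and freshness_score is a hypothetical name.

```python
from datetime import date


def freshness_score(last_updated: date, today: date, half_life_days: float = 30.0) -> float:
    """Return a 0..1 multiplier that halves every half_life_days (assumed model)."""
    age_days = max((today - last_updated).days, 0)  # future dates count as age 0
    return 0.5 ** (age_days / half_life_days)


if __name__ == "__main__":
    today = date(2025, 3, 2)
    for updated in (date(2025, 3, 2), date(2025, 1, 31), date(2025, 1, 1)):
        age = (today - updated).days
        print(f"{age:>3} days old -> {freshness_score(updated, today):.2f}")
```

Under this assumed curve, a page keeps half its freshness weight at 30 days and a quarter at 60, which matches the qualitative claim that visibility drops quickly once the 30-day window closes.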

Stage 5: Context assembly. Before any text is generated, citations are embedded into the prompt structure. The model is constrained to synthesise from the selected sources, not from general training data. This is why citation selection determines answer quality — and why a page that survives to this stage has already won the hard part.

Stage 6: LLM synthesis. The language model generates the answer from the assembled context. At this stage, extractability determines whether your content shapes the answer or just appears as a footnote. A page cited in the answer is different from a page cited at the bottom. The difference is passage extractability: whether your key claim appears in the first sentence of a section, standing on its own.

Why Most Perplexity Optimisation Advice Is Wrong

Most guidance treats Perplexity as one problem: get your page selected. It is two problems. Source selection — whether Perplexity retrieves and cites your page at all — is separate from answer absorption — whether your page's evidence actually shapes the generated response.

A page can be listed as a source without any of its content influencing the answer. A page can shape an answer even when it is not prominently featured. These require different fixes. Optimising content structure helps absorption. Fixing crawl access helps selection. Strengthening off-site authority affects both, but through different mechanisms in the pipeline.

Teams that conflate these two outcomes build for the wrong stage. They rewrite content structure when PerplexityBot cannot reach the page. Or they focus entirely on crawl access when the real problem is that their evidence is buried in paragraph four.

What Are the Actual Ranking Factors by Weight?

Independent analysis of Perplexity's source selection suggests approximate factor weights that shift by query type. For informational queries: content relevance accounts for roughly 30%, visual placement on the page for 20%, domain authority for 15%, content freshness for 15%, source diversity for 10%, and structured data for 10%. For commercial queries, trust signals, review platform presence, and G2 or Capterra listings gain additional weight while pure content relevance drops.
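To make the informational-query weights concrete, the sketch below treats them as a simple weighted sum over per-factor signals scored from 0 to 1. The weights are the approximate figures quoted above; the linear combination, the 0 to 1 signal scale, and the names INFORMATIONAL_WEIGHTS and weighted_score are assumptions for illustration, not Perplexity's actual scoring function.

```python
# Approximate informational-query weights quoted in the text above.
INFORMATIONAL_WEIGHTS = {
    "content_relevance": 0.30,
    "visual_placement": 0.20,
    "domain_authority": 0.15,
    "content_freshness": 0.15,
    "source_diversity": 0.10,
    "structured_data": 0.10,
}


def weighted_score(signals, weights=INFORMATIONAL_WEIGHTS):
    """Combine per-factor signals (each 0..1) into one score via a weighted sum."""
    missing = set(weights) - set(signals)
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    return sum(weights[name] * signals[name] for name in weights)


# Hypothetical page: strong relevance and freshness, weak authority,
# key answer placed mid-page, no structured data.
EXAMPLE_PAGE = {
    "content_relevance": 0.9,
    "visual_placement": 0.5,
    "domain_authority": 0.4,
    "content_freshness": 1.0,
    "source_diversity": 0.5,
    "structured_data": 0.0,
}
```

With these made-up signals the page scores about 0.63. Moving the key answer to the top of the page (visual_placement of 1.0) would add 0.10, more than a full jump in structured data would under the same assumptions.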

The visual placement figure is counterintuitive but consistent across studies. "Visual placement" means where your key answer appears on the page. Content at the top of the page, in the first paragraph under each heading, scores higher than equivalent content deeper in the same article. Perplexity's crawler spends more time on above-the-fold content, and its scoring reflects that allocation.

How Do You Pass the Crawl Gate First?

Nothing else matters if PerplexityBot cannot reach your page. Check three things before anything else.

First, your robots.txt. Confirm PerplexityBot is explicitly allowed, not just un-blocked by omission. An explicit allow rule is more reliable than relying on default behaviour.

User-agent: PerplexityBot
Allow: /
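A quick way to verify a rule like this is to run the robots.txt text through Python's standard-library parser. The sketch below is a minimal check, not a full audit: is_allowed is a hypothetical helper, the sample files and the example.com URL are made up, and a real check would load your live robots.txt instead.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A site that blocks all bots by default but explicitly allows PerplexityBot.
EXPLICIT_ALLOW = """\
User-agent: *
Disallow: /

User-agent: PerplexityBot
Allow: /
"""

# A site that blocks PerplexityBot outright.
BLOCKED = """\
User-agent: PerplexityBot
Disallow: /
"""

if __name__ == "__main__":
    url = "https://example.com/guide"
    print("explicit allow:", is_allowed(EXPLICIT_ALLOW, "PerplexityBot", url))
    print("blocked:", is_allowed(BLOCKED, "PerplexityBot", url))
```

The first sample shows why the explicit rule matters: a blanket Disallow for all user agents would otherwise catch PerplexityBot too.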

Second, your server logs. Search for PerplexityBot activity in the past 30 days. If you see zero hits on pages you care about, the most likely cause is that your WAF or CDN bot rules are blocking the crawler before the request ever reaches your server. Cloudflare's Bot Fight Mode and many security plugins block AI crawlers by default.
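The log search can be scripted. The sketch below assumes access logs in the common "combined" format used by Nginx and Apache, and counts requests whose user agent mentions PerplexityBot, grouped by path and status code. The function name, the sample lines, and the user agent strings in them are illustrative only; adjust the pattern if your log format differs.

```python
import re
from collections import Counter

# Combined Log Format:
# ip - user [time] "METHOD path PROTOCOL" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "(?P<method>[A-Z]+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def perplexity_hits(lines):
    """Count (path, status) pairs for requests whose user agent names PerplexityBot."""
    hits = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if match and "perplexitybot" in match["agent"].lower():
            hits[(match["path"], int(match["status"]))] += 1
    return hits


# Made-up sample lines for illustration.
SAMPLE_LOG = [
    '203.0.113.7 - - [12/May/2025:10:01:22 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.7 - - [12/May/2025:10:05:40 +0000] "GET /pricing HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '198.51.100.4 - - [12/May/2025:10:06:02 +0000] "GET /guide HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

if __name__ == "__main__":
    for (path, status), count in sorted(perplexity_hits(SAMPLE_LOG).items()):
        print(f"{path} {status} x{count}")
```

A 403 next to a path means the request reached your server and was refused there. No entries at all for a page points upstream, to the WAF or CDN.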

Third, your JavaScript dependency. Perplexity's crawler does not execute JavaScript. Critical content that loads client-side after the initial HTML response is invisible to Perplexity's retrieval system. Server-side rendering or static generation for key pages is not optional if you want Perplexity citations.
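A simple way to test this is to look at the raw HTML response, before any script runs, and check whether a key sentence is present as visible text. The sketch below does that with the standard-library HTML parser. The helper names and both sample pages are hypothetical, and in practice you would feed it the response body fetched from your own URL.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside <script> and <style> elements."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def text_in_initial_html(html: str, phrase: str) -> bool:
    """Return True if phrase appears in the visible text of the raw HTML."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return phrase.lower() in text.lower()


# Server-rendered page: the claim is in the markup itself.
SERVER_RENDERED = "<html><body><h1>Pricing</h1><p>Plans start at $49 per month.</p></body></html>"

# Client-rendered page: the claim only exists inside a script.
CLIENT_RENDERED = (
    '<html><body><div id="root"></div>'
    '<script>render("Plans start at $49 per month.")</script></body></html>'
)
```

The client-rendered sample fails the check even though a browser would show the same sentence, which is the gap a crawler that does not execute JavaScript would fall into.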

The NotioncCue AI Crawler Audit shows which pages PerplexityBot is fetching, how often, and which pages return empty content because of JavaScript rendering issues. Run it before anything else. Fixing crawl access is the highest-leverage single action in Perplexity optimisation.

What Content Structure Earns Absorption?

Getting into Perplexity's source list is the first problem. Getting your evidence into the generated answer is the second. They require different things.

The absorption rate correlates almost entirely with one variable: whether your key claim appears in the first sentence of each section. Not in the second paragraph. Not after context-setting. The first sentence. Perplexity's synthesis model pulls passages directly when they are self-contained. A section that starts with setup before the point gets scored lower than an identical section that leads with the conclusion.
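An editor can approximate this check by pulling the first sentence of each section and asking whether it speaks to the heading. The sketch below uses a deliberately crude heuristic, which is entirely an assumption of this example: the lead sentence passes if it shares at least one non-stopword term with the heading. The stopword list and function names are hypothetical, and a human read is still needed to confirm the sentence stands on its own.

```python
import re

# Small illustrative stopword list (assumption); extend it for real use.
STOPWORDS = {
    "what", "is", "the", "a", "an", "how", "does", "do", "of", "for",
    "to", "in", "and", "are", "you", "your", "why", "can", "as",
}


def first_sentence(body: str) -> str:
    """Return the text up to the first sentence-ending punctuation mark."""
    stripped = body.strip()
    match = re.search(r"(.+?[.!?])(?:\s|$)", stripped, flags=re.S)
    return match.group(1) if match else stripped


def words(text: str) -> set:
    """Lowercased word set for rough term matching."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def lead_answers_heading(heading: str, body: str) -> bool:
    """Heuristic: does the lead sentence share a key term with the heading?"""
    key_terms = words(heading) - STOPWORDS
    return bool(key_terms & words(first_sentence(body)))


HEADING = "How often does Perplexity re-crawl pages?"
ANSWER_FIRST = (
    "Perplexity retrieves live web content at query time, not on a fixed "
    "crawl schedule. This means a page updated this morning can appear in "
    "citations this afternoon."
)
SETUP_FIRST = (
    "Before we get into it, some background is useful. Perplexity retrieves "
    "live web content at query time."
)
```

Here lead_answers_heading(HEADING, ANSWER_FIRST) passes and lead_answers_heading(HEADING, SETUP_FIRST) does not, even though both sections contain the same answer. That is the pattern to look for when a page is listed as a source but rarely shapes the response.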

The formats that earn the highest citation rates across studies are FAQ pages, definition pages, and data-dense articles with visible source citations. The common factor is extractability. Each section answers exactly one question. The answer comes first. Supporting evidence follows.

How Does Off-Site Authority Affect Perplexity Citations?

Perplexity's layer-two reranker weighs external validation heavily. News placements and journalism coverage from tier-one publications carry structural advantages that website content alone cannot replicate. For B2B companies, this makes earned media strategy (pitching to the niche trade publications your buyers read) a direct Perplexity optimisation tactic, not a separate brand activity.

Roughly 60% of Perplexity citations overlap with Google's top-ten results, per Search Engine Land data. The remaining 40% go to pages that Google does not rank in its top ten. These are typically smaller, more specific pages with strong topical focus and recent publication dates. Competing for that 40% means building content that is more current, more specific, and more externally validated than what the high-DA generalist sites are publishing.

Reddit participation in relevant subreddits, third-party review platform profiles, and expert mentions in industry publications all feed Perplexity's entity authority signals. The same work that builds E-E-A-T for Google improves Perplexity citation probability through a different mechanism — the ML reranker at layer two weighs these cross-platform signals directly.

Frequently Asked Questions

How often does Perplexity re-crawl pages?
Perplexity retrieves live web content at query time, not on a fixed crawl schedule. This means a page updated this morning can appear in citations this afternoon. The flip side is that outdated content loses citation velocity rapidly. The 30-day freshness sweet spot applies to most informational queries.

Does Perplexity citation correlate with Google rankings?
Partly. Roughly 60% of Perplexity citations overlap with Google's top-ten results, per Search Engine Land, but the other 40% go to pages that do not rank in Google's top ten. Perplexity's algorithm applies its own evaluation criteria (specifically freshness, topical specificity, and cross-platform entity signals) that differ from Google's link-based ranking signals.

Can a small site get cited as often as a high-authority domain?
Yes, in specific conditions. A detailed technical analysis on a specific industry topic from a smaller but authoritative site can be cited alongside, or instead of, major publication content if it better addresses the query intent at that moment. Freshness, specificity, and extractability can overcome traditional authority gaps for informational queries.

What is the fastest change that moves Perplexity citation rates?
Fixing PerplexityBot access, where it was blocked, is the single highest-leverage change. If the crawler is already reaching your pages, moving your key answer to the first sentence of each section produces the fastest absorption improvement. Both are structural fixes that do not require new content.