Getting cited by AI: AEO and GEO for ChatGPT, Perplexity and AI Overviews
For two decades the goal of SEO was a blue link in position one. That goal is now necessary but no longer sufficient. The new prize is being the source an answer engine quotes, links and names when it composes a reply, and the mechanics of winning it are different enough that treating it as "SEO with extra steps" will quietly cost you visibility you can't even see in your rank tracker.
What actually happens when an AI answers a query
Before tactics, you need an accurate mental model, because most AEO advice is built on a wrong one. A generative answer is not retrieved from a ranking like a SERP. It is composed. When someone asks ChatGPT with search enabled, queries Perplexity, or triggers a Google AI Overview, roughly four things happen in sequence: the engine rewrites the user's prompt into one or more search queries (query fan-out), it retrieves a candidate set of documents from an index, it grounds a draft answer against passages pulled from those documents, and it attributes specific sentences back to specific URLs. Each of those stages is a place you can win or lose, and they are not the same stage.
The retrieval stage still looks a lot like classic search: it leans on a conventional index (Bing for ChatGPT and Copilot, Google's own index for AI Overviews and AI Mode, a blend of indices plus its own crawl for Perplexity). So your organic foundations matter. But the grounding and attribution stages are governed by passage-level extractability, not page-level authority. A page can rank fifth organically and still be the passage the model quotes, because it phrased one specific claim in a way the model could lift cleanly. That asymmetry is the entire opportunity in GEO.
The practical consequence: optimise the page to rank, then optimise individual passages to be extracted. Those are two distinct jobs and the second one is where most sites are leaving citations on the table.
Query fan-out is why long pages beat thin ones now
Google's AI Mode and, to a lesser degree, AI Overviews decompose a single user question into a sheaf of sub-queries, fetch results for each, and synthesise across them. A question like "best CRM for a 12-person agency" silently becomes "CRM pricing for small teams", "CRM with agency client management", "CRM integrations with invoicing", and several more. The model then assembles an answer from passages that each satisfy one sub-query.
This rewards comprehensive, well-segmented pages over a constellation of thin ones. A single page that genuinely answers the pricing question in one section, the integrations question in another, and the team-size fit in a third can be cited multiple times in one synthesised answer. Five thin pages each covering a fragment will more often lose to one consolidated competitor page because the model would rather ground several sentences against one trustworthy source than stitch fragments from five.
The diagnostic I run: take your top twenty commercial queries, expand each into the five to ten sub-questions a buyer actually has, and check whether a single coherent page on your site answers all of them in clearly delineated sections. If the answer is scattered, consolidate. This is also where the old "topic cluster" model gets sharper: the pillar page is no longer just an internal-linking hub; it is the unit a generative engine grounds against.
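The diagnostic above can be tracked with a small script once the sub-questions have been mapped to pages by hand. This is a minimal sketch, not a crawler: the function name `coverage_report`, the page paths and the sample audit are all hypothetical, and the mapping itself still has to come from a human reading the pages.

```python
from collections import Counter
from typing import Dict, Optional


def coverage_report(sub_question_to_url: Dict[str, Optional[str]]) -> dict:
    """Summarise which page answers each sub-question of one query.

    Keys are sub-questions; values are the path of the page that answers
    the sub-question, or None when nothing on the site answers it.
    """
    pages = Counter(url for url in sub_question_to_url.values() if url)
    gaps = [q for q, url in sub_question_to_url.items() if not url]
    top_page, top_count = pages.most_common(1)[0] if pages else (None, 0)
    total = len(sub_question_to_url)
    return {
        "top_page": top_page,                                    # best consolidation target
        "top_page_share": top_count / total if total else 0.0,  # 1.0 means one page covers everything
        "distinct_pages": len(pages),                            # how scattered the answers are
        "gaps": gaps,                                            # sub-questions with no answer at all
    }


# Hypothetical audit for "best CRM for a 12-person agency", filled in by hand.
audit = {
    "CRM pricing for small teams": "/crm-pricing",
    "CRM with agency client management": "/crm-for-agencies",
    "CRM integrations with invoicing": "/crm-for-agencies",
    "CRM onboarding time": None,
}
report = coverage_report(audit)
print(report)
```

A low `top_page_share` with several distinct pages is the "scattered" case that suggests consolidation; entries in `gaps` are content that does not exist yet.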
Write for extraction: the claim-and-evidence pattern
The single highest-leverage change is structural. Models extract well from prose that states a claim plainly and immediately backs it with a specific, checkable fact. They extract badly from throat-clearing, buried ledes, and hedged corporate copy. Compare these two sentences answering "how long does onboarding take":
- Weak: "Our onboarding experience is designed to be as smooth and rapid as possible, and most customers find they're up and running before they know it."
- Strong: "Onboarding takes 7 to 10 business days for a standard plan and up to 3 weeks for an enterprise migration with custom SSO."
The second sentence is self-contained, factual, and quotable without surrounding context. That property, a passage that survives being cut out and pasted into an answer, is what I call extractability, and it is the core writing discipline of GEO. Lead each section with the direct answer in the first sentence (the inverted-pyramid structure journalists have used for a century), then add nuance. Put numbers, dates, ranges and named entities inline rather than relying on the reader to infer them. Use a question as the section heading when the query is a question, because heading-to-passage proximity helps the model align the right passage to the right sub-query.
A few patterns that measurably increase the chance of being quoted: a one-to-two sentence definition immediately under a heading; comparison facts expressed as explicit statements ("X supports 10,000 events per second; Y supports 2,000") rather than vague superlatives; and clear sourcing of any statistic ("according to the 2025 Stack Overflow Developer Survey…"). Academic research into generative engine optimisation has repeatedly found that adding citations, quotations from credible sources, and statistics to a passage raises its inclusion rate in generated answers, sometimes by double-digit percentages. Models are trained to prefer grounded, attributable claims, so hand them passages that already look grounded.
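A rough lint can flag passages that are likely to fail the claim-and-evidence test before an editor reviews them. The sketch below is a heuristic of my own for illustration, not a measure any engine is known to use: the phrase list and the function name `extractability_flags` are assumptions, and a clean result does not mean a sentence will be quoted.

```python
import re

# Illustrative phrase list only; tune it to your own house style.
VAGUE_PHRASES = (
    "as possible",
    "designed to",
    "before they know it",
    "best-in-class",
    "seamless",
)


def extractability_flags(sentence: str) -> list:
    """Return reasons a sentence may be hard to quote out of context."""
    flags = []
    lowered = sentence.lower()
    # A claim with no digit at all rarely carries a checkable number, date or range.
    if not re.search(r"\d", sentence):
        flags.append("no number, date or range")
    hits = [phrase for phrase in VAGUE_PHRASES if phrase in lowered]
    if hits:
        flags.append("vague phrasing: " + ", ".join(hits))
    return flags


weak = ("Our onboarding experience is designed to be as smooth and rapid as "
        "possible, and most customers find they're up and running before they know it.")
strong = ("Onboarding takes 7 to 10 business days for a standard plan and up to "
          "3 weeks for an enterprise migration with custom SSO.")

print(extractability_flags(weak))
print(extractability_flags(strong))
```

Run against the two onboarding sentences above, the weak version trips both checks and the strong version trips neither.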
Classic ranking is the entry ticket, not the prize
I want to be precise about the relationship, because "AI is killing SEO" is lazy and "AEO is just SEO" is complacent, and both are wrong. The candidate set a generative engine grounds against is drawn from a conventional index. If you are not in the top ten or so for the underlying sub-queries, you are usually not in the retrieval pool at all, and no amount of beautiful extractable prose helps a page the engine never fetched. So technical health, crawlability, internal linking, page experience and topical authority remain prerequisites.
What changes is that ranking is now table stakes rather than the finish line. Two sites can both rank on page one; the one whose passages are cleaner, more factual and better structured gets cited, and the one cited gets the referral, the brand mention, and increasingly the click. I treat the work as a funnel: classic SEO gets you into the consideration set, GEO determines whether you're the source quoted from it. Skipping either half is the common failure mode: technical teams who nail crawlability but write hedgy copy, and content teams who write beautifully but sit on a slow, poorly linked site that never makes the retrieval pool.
Entity SEO: be a thing the model already knows
Generative engines reason over entities, not just keywords. Before an engine can confidently cite you as an authority on "local-first AI tooling", it needs to recognise you as a known entity with stable attributes: what you are, what you make, who's behind you, and how you relate to other entities. This is where structured data and consistency across the web do real work, and where most mid-market brands are sloppy.
Concretely, this is the entity SEO checklist I work through:
- Mark up your Organization (or Person) with JSON-LD, including name, url, logo, description, and crucially a complete sameAs array linking to every authoritative profile: LinkedIn, Crunchbase, GitHub, Wikidata, Wikipedia if you have it, and your Companies House or equivalent registry entry.
- Keep that entity's facts identical everywhere. If your founding date, founder name, headquarters, or product category differs between your homepage, your LinkedIn, and your Crunchbase entry, you are feeding the model conflicting signals and it will hedge or omit you.
- Build or claim a Wikidata item if you have any notability; it is a structured source many systems reconcile against, and it is editable far more readily than Wikipedia.
- Use specific schema types where they fit: Product, SoftwareApplication, FAQPage, HowTo, Article with a real author entity that itself links out via sameAs.
- Establish author entities, not anonymous bylines. A named author with a consistent profile across your site and external publications strengthens the topical association the model needs to treat that author (and your domain) as a credible source.
Here is the shape of an organisation entity I'd ship, trimmed for clarity:
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Acme Local AI",
  "url": "https://acme.example",
  "logo": "https://acme.example/logo.png",
  "description": "Local-first AI tooling for regulated teams.",
  "foundingDate": "2023-04-01",
  "sameAs": [
    "https://www.linkedin.com/company/acme-local-ai",
    "https://github.com/acme-local-ai",
    "https://www.crunchbase.com/organization/acme-local-ai",
    "https://www.wikidata.org/wiki/Q000000"
  ]
}
This is not a ranking hack; it is identity disambiguation. The payoff is that when a model assembles an answer in your category, it can attach your name to a stable concept rather than guessing. I see this firsthand in my own work: because I publish technical SEO and mathematics books under a consistent author identity and ship local-first AI software under the same name, those facts reinforce each other as entity signals, and that coherence is exactly what you're engineering for a brand.
Crawler access and llms.txt: don't lock the door you want open
You cannot be cited by a system you've blocked. The AI crawler landscape now has distinct user agents for distinct purposes, and conflating them is a common, expensive mistake. There are broadly three categories: training controls (the GPTBot and CCBot crawlers, plus Google-Extended, which is a robots.txt token governing AI training use rather than a separate crawler), live retrieval/search fetchers (OAI-SearchBot, PerplexityBot, Google's standard Googlebot feeding AI Overviews), and on-demand user-action fetchers (ChatGPT-User, Perplexity-User) that fetch a page because a user asked about it in real time.
The trade-off is genuine and you should make it deliberately. Blocking GPTBot and Google-Extended opts you out of model training, a defensible IP decision, with little downside to live citation. But blocking the retrieval and user-action agents opts you out of being cited in answers, which for most commercial sites is self-defeating. Audit your robots.txt and your WAF/CDN bot rules together: I regularly find sites that allow the bots in robots.txt while Cloudflare or a security vendor silently 403s them. Test with the actual user-agent string, not just by reading the rules.
An illustrative split that allows citation while declining training:
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
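Rules like these are easy to misread, so it is worth evaluating them programmatically. The sketch below feeds the split above to Python's standard-library `urllib.robotparser` and asks what each agent may fetch; the URL is a placeholder. Two caveats: Python's parser is only an approximation of how each vendor interprets robots.txt, and this checks the rules alone, so it will not reveal a WAF or CDN block.

```python
from urllib.robotparser import RobotFileParser

# The illustrative split from above: allow citation, decline training.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

URL = "https://acme.example/pricing"  # placeholder URL
AGENTS = ("OAI-SearchBot", "PerplexityBot", "ChatGPT-User", "GPTBot", "Google-Extended")

for agent in AGENTS:
    verdict = "allowed" if parser.can_fetch(agent, URL) else "blocked"
    print(f"{agent}: {verdict}")
```

ChatGPT-User comes back as allowed here because no group names it and there is no catch-all `*` group; if you add a restrictive catch-all later, re-run the check, because agents you never listed will inherit it.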
On llms.txt: it is a proposed convention, a Markdown file at your site root that points AI systems to your most important, cleanest content; conceptually, a sitemap for LLMs. I treat it as low-cost insurance rather than a proven ranking factor. No major engine has confirmed using it for retrieval today, and you should be honest with stakeholders about that. But it costs an afternoon, it forces you to articulate your canonical sources, and if adoption grows you're already positioned. Ship a concise llms.txt that links your definitive product, pricing, docs and key explainer pages with one-line descriptions, and keep it current. Do not treat it as a substitute for the real work of extractable on-page content; it is a pointer, not a payload.
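Generating the file from a single list of canonical pages makes it easier to keep current. The sketch below renders the layout the public llms.txt proposal describes as I understand it (a title heading, a one-line summary as a blockquote, then sections of annotated links); treat that layout as an assumption and check the current proposal before shipping. The function name `render_llms_txt`, the site name and every URL are hypothetical.

```python
def render_llms_txt(site_name: str, summary: str, sections: dict) -> str:
    """Render a concise llms.txt body as Markdown text.

    sections maps a section heading to a list of (title, url, note) tuples.
    """
    lines = [f"# {site_name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            # One line per page: link plus a one-line description.
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


llms_txt = render_llms_txt(
    "Acme Local AI",
    "Local-first AI tooling for regulated teams.",
    {
        "Product": [
            ("Pricing", "https://acme.example/pricing", "Plans and per-seat prices."),
            ("Onboarding", "https://acme.example/onboarding", "Timelines by plan."),
        ],
        "Docs": [
            ("Quick start", "https://acme.example/docs/quick-start", "Install and first run."),
        ],
    },
)
print(llms_txt)
```

Write the result to `/llms.txt` at the site root as part of the normal deploy, so the file changes whenever the canonical page list does.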
Measuring AI visibility when the click often never comes
The hardest part of this discipline is that the value is partly invisible to your existing analytics. A user can read your brand name and your quoted fact inside an AI Overview or a ChatGPT answer and never click through. That is a brand impression and a buying signal you can't see in GA4, so you need to instrument three layers.
First, referral traffic: the part that does click. In GA4, build segments for the known AI referrers, meaning a source/medium containing chatgpt.com, perplexity.ai, gemini.google.com, copilot.microsoft.com and the like. This traffic is typically low in volume but high in intent and conversion rate, so treat the conversion-rate signal seriously even when the sessions are few. Also watch for sessions that arrive with no referrer or UTM tags yet show unusually specific intent; that is often AI-assisted discovery landing as "direct".
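GA4 segments are built in its interface, but the same matching logic is useful for exported session rows or raw referrer headers. This is a minimal sketch: the host list is just the referrers named above, the labels and the function name `classify_referrer` are my own, and the list will need maintaining as engines change domains.

```python
from typing import Optional
from urllib.parse import urlparse

# Referrer hosts named above, mapped to illustrative labels.
AI_REFERRER_HOSTS = {
    "chatgpt.com": "ChatGPT",
    "perplexity.ai": "Perplexity",
    "gemini.google.com": "Gemini",
    "copilot.microsoft.com": "Copilot",
}


def classify_referrer(referrer: str) -> Optional[str]:
    """Return the AI engine label for a referrer URL, or None."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, label in AI_REFERRER_HOSTS.items():
        # Match the domain itself or any subdomain, never a lookalike suffix.
        if host == domain or host.endswith("." + domain):
            return label
    return None


for ref in (
    "https://chatgpt.com/",
    "https://www.perplexity.ai/search?q=crm",
    "https://www.google.com/",
    "",
):
    print(repr(ref), "->", classify_referrer(ref))
```

Matching on the parsed hostname rather than a substring avoids counting lookalike domains, and an empty referrer returns None so "direct" sessions stay in their own bucket.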
Second, citation visibility: whether you're being quoted at all. This is the genuinely new measurement category. You can do it manually at small scale by running your priority queries through ChatGPT, Perplexity, Google AI Mode and Copilot and logging whether your domain is cited, what passage was used, and which competitors appear. At scale, dedicated AI-visibility tools automate this query-set monitoring (the category is young; vendors include Profound, Otterly, and AI-tracking features now bolted onto established rank trackers). The metrics that matter: share of citations versus competitors for your core query set, which specific URLs and passages get pulled, and sentiment of how you're described.
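If you log citations manually, the share-of-citations metric is a few lines of arithmetic. The sketch below assumes each logged observation is an (engine, query, cited domains) tuple; the function names, the domains and the sample log are all invented for illustration, not real results.

```python
from collections import Counter


def citation_share(observations) -> dict:
    """Each domain's share of all citations across the logged answers."""
    counts = Counter()
    for _engine, _query, cited in observations:
        counts.update(set(cited))  # count a domain once per answer
    total = sum(counts.values())
    return {domain: n / total for domain, n in counts.items()} if total else {}


def appearance_rate(observations, domain: str) -> float:
    """Fraction of logged answers that cite the given domain at all."""
    if not observations:
        return 0.0
    return sum(domain in cited for _e, _q, cited in observations) / len(observations)


QUERY = "best CRM for a 12-person agency"
# Invented sample log: one row per engine answer.
log = [
    ("ChatGPT", QUERY, ["acme.example", "rival.example"]),
    ("Perplexity", QUERY, ["rival.example"]),
    ("Google AI Mode", QUERY, ["acme.example"]),
    ("Copilot", QUERY, ["rival.example"]),
]
shares = citation_share(log)
rate = appearance_rate(log, "acme.example")
print(shares, rate)
```

The two numbers answer different questions: share compares you with competitors, while appearance rate tells you how often you show up at all. Track both per query set over time rather than reading a single run.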
Third, server-log analysis: the ground truth of access. Parse your logs for the AI user agents above to confirm they're successfully fetching (200s, not 403s or soft-404s), how often, and which pages. This is the only way to verify that your crawler-access decisions are actually being honoured at the edge, and it catches the silent CDN block before it costs you a quarter of citations.
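A first pass at that log check needs nothing beyond the standard library. The sketch below assumes logs in the common "combined" access-log format and matches agents by substring; the sample lines use documentation-range IPs and simplified user-agent strings that are illustrative, not the vendors' exact strings. It does not verify that a request genuinely came from the named vendor, which would need an IP or reverse-DNS check.

```python
import re
from collections import Counter

# Agents discussed above; matched case-insensitively as substrings.
AI_AGENTS = ("OAI-SearchBot", "PerplexityBot", "ChatGPT-User", "Perplexity-User", "GPTBot")

# Combined log format: ip ident user [time] "method path proto" status bytes "referer" "ua"
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def ai_fetch_summary(lines) -> Counter:
    """Count requests per (agent, status code) for known AI agents."""
    summary = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # skip lines in another format
        ua = match.group("ua").lower()
        for agent in AI_AGENTS:
            if agent.lower() in ua:
                summary[(agent, match.group("status"))] += 1
                break
    return summary


# Illustrative sample lines, not real traffic.
sample = [
    '203.0.113.7 - - [10/Mar/2025:12:00:01 +0000] "GET /pricing HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"',
    '203.0.113.8 - - [10/Mar/2025:12:00:05 +0000] "GET /pricing HTTP/1.1" 403 312 "-" "Mozilla/5.0 (compatible; PerplexityBot/1.0)"',
    '203.0.113.9 - - [10/Mar/2025:12:00:09 +0000] "GET /docs HTTP/1.1" 200 8800 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]
summary = ai_fetch_summary(sample)
for (agent, status), count in sorted(summary.items()):
    print(agent, status, count)
```

Any 403 row for a retrieval or user-action agent is the silent edge block described above and is worth raising with whoever owns the CDN or WAF rules.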
In checklist form:
- GA4 AI-referrer segment: sessions, conversion rate, and assisted conversions.
- Manual or tooled citation tracking across the four major engines for your top 20-50 queries.
- Server-log confirmation that retrieval and user-action bots get 200s on priority URLs.
- Branded-search lift in Search Console as a proxy, since AI mentions often drive later branded queries.
Where this is heading, and what to do this quarter
The direction of travel is clear, even if the specifics keep moving. More queries will be answered without a click, retrieval will get more agentic with deeper query fan-out, and the brands that win will be the ones recognised as stable, well-described entities whose content is clean enough to quote. None of that is exotic: it is disciplined information architecture, honest factual writing, and consistent identity, executed with the knowledge that a machine, not just a human, is the reader.
If I had one quarter and a single team, I'd sequence it like this: fix crawler access and verify it in the logs first, because everything else is wasted if the bots can't fetch you; ship complete, consistent entity markup with a real sameAs graph; rewrite your top commercial pages into claim-and-evidence structure with question-shaped headings; consolidate thin fragments into comprehensive pages aligned to query fan-out; and stand up citation tracking so you can actually see the needle move. Do those five things well and you stop guessing whether AI search is "worth it" and start watching your name appear in the answers your buyers are already reading.