AI Overviews Passage Extraction - What Sentence Structures Get Lifted

May 20, 2026

From auditing 50+ pages, the thing that jumps out is not schema or page-level signals - it is the sentence itself. A page can hold perfect Product schema, sit at position one in organic results, and still contribute zero passages to an AI Overview. Meanwhile a page ranked fifth with one precisely structured paragraph gets lifted verbatim. That gap is what this post is about.

The broader question of page-level eligibility for AI Overviews - domain authority, E-E-A-T signals, schema completeness - is a separate problem I have covered elsewhere. FAQ block structure and AI citation is its own territory too. What I have not seen covered clearly is the sub-page problem: once a page is eligible, which specific sentences get pulled?

Why passage extraction is a different problem from page-level GEO

AI Overviews and other AI search systems do not retrieve pages. Based on how retrieval systems generally work, they chunk content into passages - typically a few sentences each - score each chunk independently against the query, and surface the highest-scoring chunks. The page is just the container. The chunk is the unit of competition.
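To make the chunk-as-unit idea concrete, here is a minimal sketch of chunking and independent scoring. It assumes a naive sentence splitter, two-sentence chunks, and plain keyword overlap as the score. The function names and example URLs are hypothetical, and nothing here reflects how Google or any vendor actually scores passages.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def chunk(text: str, size: int = 2) -> list[str]:
    """Group consecutive sentences into passages of `size` sentences each."""
    sentences = split_sentences(text)
    return [" ".join(sentences[i:i + size]) for i in range(0, len(sentences), size)]


def tokens(text: str) -> set[str]:
    """Lowercased word set, used for a crude keyword-overlap score."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def score(passage: str, query: str) -> float:
    """Fraction of query terms that appear in the passage (0.0 to 1.0)."""
    query_terms = tokens(query)
    if not query_terms:
        return 0.0
    return len(query_terms & tokens(passage)) / len(query_terms)


def best_passage(pages: dict[str, str], query: str) -> tuple[str, str, float]:
    """Score every chunk from every page independently and return the winner."""
    scored = [
        (url, passage, score(passage, query))
        for url, text in pages.items()
        for passage in chunk(text)
    ]
    return max(scored, key=lambda item: item[2])


# Hypothetical pages: the page's organic rank plays no part in the score.
pages = {
    "top-ranked.example/boots": (
        "We love the outdoors. Our boots are great for any adventure. "
        "Quality matters to us."
    ),
    "fifth-ranked.example/boots": (
        "AcmeTrek Waterproof Hiking Boot carries a two-year guarantee "
        "from AcmeTrek. It is rated IP67."
    ),
}

url, passage, value = best_passage(pages, "AcmeTrek boot guarantee")
print(url, value)
```

In this toy setup the chunk from the lower-ranked page wins because it contains every query term, which is the page-versus-passage gap described above in miniature.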

That framing matters because it changes where you spend your editing effort. Page-level GEO asks: is this domain trusted? Does this page have the right signals to be in the candidate pool? Passage-level GEO asks: is this specific sentence self-contained enough, query-matched enough, and entity-dense enough to score higher than competing chunks from other pages?

"Statistics are the currency of trust in generative search. A single data point outperforms three paragraphs of explanation."

Dr. Shivani Aggarwal, Lead Researcher, Princeton NLP Group (xSeek, 2025)

Aggarwal et al. (arXiv:2311.09735, 2023) showed that specific writing and sourcing strategies can improve a source's visibility in generative engine responses by up to ~40% for the top-performing GEO approaches. The paper does not decompose that figure by individual signal - but the implication for practitioners is that sentence-level rewriting is not a marginal gain. It is a primary lever.

~40%: citation lift, top GEO strategies
50+: pages audited for passage patterns
20-40: words in extracted single sentences
60-120: words in extracted multi-sentence passages

The practical takeaway from our 50+ brand audit: semantic chunking means your page competes at the passage level, not the page level. A single extractable paragraph on a mid-ranking page beats five generic paragraphs on a top-ranking one.

The four structural signals that make a sentence extractable

I want to be clear that what follows is grounded in how retrieval scoring generally works - keyword overlap, semantic similarity, chunk self-sufficiency - not in confirmed Google AI Overviews internal design. These are the patterns that show up consistently in our audit data.

Signal 1 - Definition-first structure

Definition-first sentence structure means the subject is named and defined in the opening clause, not buried mid-sentence. Retrieval systems need to know what a chunk is about from its first few tokens. If the subject appears late, the chunk scores weakly against definition queries.

Before: "When it comes to the way our returns process works, customers are able to initiate a label-free return within 30 days."

After: "Label-free returns at AcmeStore are open for 30 days from the delivery date, with no printer required."

The after version names the subject (label-free returns, AcmeStore) in the first clause. It answers "how do AcmeStore returns work?" without any surrounding context.
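A rough way to check this during an audit is to measure how many words a reader has to get through before the subject appears. The sketch below assumes you supply the subject term yourself. The three-word cutoff is my own assumption for illustration, not a known retrieval threshold.

```python
import re
from typing import Optional


def subject_position(sentence: str, subject: str) -> Optional[int]:
    """Return the 0-based word index where `subject` starts, or None if absent."""
    words = re.findall(r"[\w-]+", sentence.lower())
    target = re.findall(r"[\w-]+", subject.lower())
    for i in range(len(words) - len(target) + 1):
        if words[i:i + len(target)] == target:
            return i
    return None


def is_definition_first(sentence: str, subject: str, max_index: int = 3) -> bool:
    """True if the subject starts within the first `max_index` words (assumed cutoff)."""
    position = subject_position(sentence, subject)
    return position is not None and position < max_index


before = ("When it comes to the way our returns process works, customers are "
          "able to initiate a label-free return within 30 days.")
after = ("Label-free returns at AcmeStore are open for 30 days from the "
         "delivery date, with no printer required.")

print(subject_position(before, "label-free"))  # 16
print(subject_position(after, "label-free"))   # 0
```

The before sentence makes a scorer wait sixteen words for the subject. The after sentence opens with it.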

Signal 2 - Self-containment

Self-containment means the sentence or short passage answers a question without requiring surrounding paragraphs for context. Pronoun references - "it", "they", "this" - that resolve outside the passage are the most common self-containment failure in our audits.

Before: "It ships within 24 hours on business days and includes free tracking."

After: "The AcmeRun Pro Trail Shoe ships within 24 hours on business days and includes free tracking via Royal Mail."

The before version is useless lifted from context. The after version is a complete, attributable fact.
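A pronoun that opens a sentence cannot have its antecedent inside that sentence, so it is the easiest self-containment failure to flag automatically. This is a minimal sketch of that single check. The pronoun list and the function name are my own choices, and the check will also flag determiner uses such as "This boot", so treat hits as prompts for a manual look.

```python
import re

# Pronouns that commonly resolve to a noun in an earlier sentence.
PRONOUNS = {"it", "they", "this", "these", "those"}


def opening_pronouns(sentence: str, window: int = 1) -> list[str]:
    """Return any listed pronouns found within the first `window` words."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [word for word in words[:window] if word in PRONOUNS]


before = "It ships within 24 hours on business days and includes free tracking."
after = ("The AcmeRun Pro Trail Shoe ships within 24 hours on business days "
         "and includes free tracking via Royal Mail.")

print(opening_pronouns(before))  # ['it']
print(opening_pronouns(after))   # []
```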

Signal 3 - Named-entity density

Named-entity density refers to the number of grounded entities - brand names, product names, standards, locations, persons - within a single sentence. Based on how BM25 and semantic retrieval scoring generally work, sentences with two or more named entities match a wider range of specific queries than generic prose does.

Before (0 entities): "Our waterproof boots are built for tough conditions and come with a two-year guarantee."

After (3 entities): "AcmeTrek Waterproof Hiking Boot, rated IP67 and built on a Vibram outsole, carries a two-year guarantee from AcmeTrek."

The after version is query-matchable for "AcmeTrek waterproof boot guarantee", "Vibram sole hiking boot", and "IP67 footwear warranty" simultaneously.
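Counting entities by hand gets tedious across a full page, so here is a minimal sketch that counts them against a list you maintain yourself. It assumes you already know which brand, product, and standard names matter for the page. It does no entity recognition of its own, and the function name is hypothetical.

```python
def entity_count(sentence: str, known_entities: list[str]) -> int:
    """Count how many distinct known entities appear in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return sum(1 for entity in set(known_entities) if entity.lower() in lowered)


entities = ["AcmeTrek", "IP67", "Vibram"]

before = ("Our waterproof boots are built for tough conditions and come with "
          "a two-year guarantee.")
after = ("AcmeTrek Waterproof Hiking Boot, rated IP67 and built on a Vibram "
         "outsole, carries a two-year guarantee from AcmeTrek.")

print(entity_count(before, entities))  # 0
print(entity_count(after, entities))   # 3
```

Each entity counts once however often it repeats, which matches how the before and after examples above are labeled.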

Signal 4 - Statistic placement

Statistic placement means putting a number, its unit, and its source attribution in the first or second sentence of a paragraph - not buried in a subordinate clause. A stat in sentence three rarely survives chunking as part of the lead claim; it either gets cut or orphaned.

Before: "Customers love the comfort, and after independent lab testing, the insole was found to reduce foot fatigue by 31% compared to a standard foam insert, according to a 2024 study by FootLab UK."

After: "The AcmeTrek insole reduced foot fatigue by 31% compared to standard foam, according to a 2024 FootLab UK lab test. Customers consistently cite comfort as the primary repurchase driver."

The stat is now the lede. The source is inline. The sentence is extraction-ready on its own.
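One way to quantify "buried" is to count how many words come before the first percentage figure. The sketch below does that for the two versions above. It only recognises percentages written with a % sign, which is a simplifying assumption, and the function name is hypothetical.

```python
import re
from typing import Optional

# A whitespace-delimited token such as "31%" or "7.5%," (trailing punctuation allowed).
PERCENT_TOKEN = re.compile(r"\d+(?:\.\d+)?%[.,;:]?")


def stat_word_offset(paragraph: str) -> Optional[int]:
    """Return the 0-based word index of the first percentage, or None if absent."""
    for index, token in enumerate(paragraph.split()):
        if PERCENT_TOKEN.fullmatch(token):
            return index
    return None


before = ("Customers love the comfort, and after independent lab testing, the "
          "insole was found to reduce foot fatigue by 31% compared to a "
          "standard foam insert, according to a 2024 study by FootLab UK.")
after = ("The AcmeTrek insole reduced foot fatigue by 31% compared to standard "
         "foam, according to a 2024 FootLab UK lab test. Customers consistently "
         "cite comfort as the primary repurchase driver.")

print(stat_word_offset(before))  # 18
print(stat_word_offset(after))   # 7
```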

The rule I'd ship

Fix the definition-first signal first. A buried definition fails at the first scoring step - a retrieval system cannot match a chunk to a definition query if the definition is in clause three. Fix the lede and you unlock the rest.

Sentence length and complexity: the data range that matters

Aggarwal et al. (arXiv:2311.09735) frame their GEO methodology around optimizing content for generative engine citation - the up-to ~40% improvement figure covers their top-performing strategies as a group, not individual signals. I will not invent a per-signal breakdown the paper does not state. What I will give you is the observed range from our own audit work.

Single extracted sentences tend to fall in the 20-40 word range. Multi-sentence extracted passages tend to run 60-120 words total. These are observed patterns from our audit data, not confirmed thresholds - treat them as calibration, not rules.

Sentence complexity is a clearer signal than word count alone. Subordinate clauses, passive constructions, and parenthetical asides all make a sentence harder to lift without editing. If a retrieval system has to restructure a sentence to make it grammatical in isolation, it will not lift it.

Here is a worked example. The original sentence below runs 42 words, uses passive voice, and buries the named entity:

Before (42 words, passive, 1 entity): "Given the increasing demand that has been observed across the industry for footwear that can be used in multiple conditions, the AcmeTrek boot was developed by our design team to meet both trail and urban use cases without requiring a separate purchase."

After (19 words, active, 2 entities): "AcmeTrek designed the AcmeTrek Dual-Terrain Boot to handle both trail and urban use, eliminating the need for two pairs."

Changes made: passive to active, definition-first, word count from 42 to 19, named entities from 1 to 2, subordinate clause removed. The rewritten sentence can be dropped into any AI Overview about multi-terrain boots without editing.
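Word counts and passive constructions are easy to misjudge by eye, so it helps to measure them. This is a minimal sketch: words are counted by splitting on whitespace, and the passive check is a regex heuristic I am assuming for illustration. It misses irregular participles such as "was built" and can misfire on adjectives ending in "ed".

```python
import re

# Heuristic: an auxiliary followed by a word ending in "ed".
PASSIVE = re.compile(r"\b(?:was|were|is|are|been|be)\s+\w+ed\b", re.IGNORECASE)


def word_count(sentence: str) -> int:
    """Count whitespace-separated words; a hyphenated compound counts as one."""
    return len(sentence.split())


def has_passive(sentence: str) -> bool:
    """True if the sentence contains a likely passive construction."""
    return bool(PASSIVE.search(sentence))


before = ("Given the increasing demand that has been observed across the "
          "industry for footwear that can be used in multiple conditions, the "
          "AcmeTrek boot was developed by our design team to meet both trail "
          "and urban use cases without requiring a separate purchase.")
after = ("AcmeTrek designed the AcmeTrek Dual-Terrain Boot to handle both "
         "trail and urban use, eliminating the need for two pairs.")

for label, sentence in (("before", before), ("after", after)):
    print(label, word_count(sentence), has_passive(sentence))
```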

Sophisticated vocabulary does not help. If a sentence requires a reader - or a retrieval system - to decode vocabulary before understanding the claim, that decoding step lowers its extraction score.

Where to place statistics so AI systems lift them

The Google Search Quality Rater Guidelines treat factual accuracy and source attribution as core E-E-A-T signals. That principle extends to how statistics land in a passage: a number without a source is weaker than the same number with an inline attribution. And a sourced number buried in sentence three is weaker than a sourced number in sentence one.

A statistic needs three components to be extraction-ready: the number, the unit or context, and the source attribution - ideally all within the same sentence.
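The three components can be checked mechanically, at least roughly. The sketch below reports which of them a sentence is missing. The attribution phrases and the regex for "unit or context" are assumptions on my part, so extend them to match how your own content cites sources.

```python
import re

NUMBER = re.compile(r"\d")
# A number followed directly by % or by a word (for example "74%" or "1,200 participants").
UNIT_OR_CONTEXT = re.compile(r"\d[\d,.]*\s*(?:%|[A-Za-z]+)")
# Assumed attribution phrases; not an exhaustive list.
SOURCE_PHRASES = ("according to", "reported by", "source:")


def missing_components(sentence: str) -> list[str]:
    """Return the extraction-ready components the sentence lacks."""
    missing = []
    if not NUMBER.search(sentence):
        missing.append("number")
    if not UNIT_OR_CONTEXT.search(sentence):
        missing.append("unit or context")
    if not any(phrase in sentence.lower() for phrase in SOURCE_PHRASES):
        missing.append("source attribution")
    return missing


ready = ("Sole durability drives 74% of repurchase decisions among UK runners, "
         "according to a 2025 RunReport UK survey of 1,200 participants.")
unattributed = "74% said sole durability was their top repurchase factor."

print(missing_components(ready))         # []
print(missing_components(unattributed))  # ['source attribution']
```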

Before: "Running shoes are a major purchase for most customers. They consider durability, comfort, and price. In a 2025 survey of 1,200 UK runners by RunReport UK, 74% said sole durability was their top repurchase factor."

After: "Sole durability drives 74% of repurchase decisions among UK runners, according to a 2025 RunReport UK survey of 1,200 participants. Comfort and price ranked second and third."

The stat is now the lede sentence. The source is named inline. The passage is self-contained and directly answers "what drives running shoe repurchase in the UK?" without context from a surrounding paragraph. This is different from a footnote-style citation approach - for more on that distinction, see citation strategy for AI search visibility. Here the attribution is structural, not appended.

In an e-commerce context, the same principle applies to product specs. If your category page for trail shoes includes "waterproof rating: IP67", "sole type: Vibram Megagrip", and "weight: 340g", those are statistics. Put them in sentence one of the product description, not sentence four.

Google AI Overviews, Perplexity AI, and ChatGPT Search all surface inline-attributed statistics more reliably than unattributed ones - based on observed citation behavior across platforms, not confirmed vendor architecture.

Before/after rewrite table: six passage types

From our 50+ brand audit, these are the six sentence patterns we rewrote most often. Each entry shows the original, the rewrite, and the specific signal that changed.

Type 1 - Buried definition
Before: "When thinking about what makes a good trail shoe, it helps to consider that a heel drop is the height difference between heel and forefoot."
After: "Heel drop is the height difference between the heel and forefoot of a running shoe, typically measured in millimetres."
Signal fixed: Definition-first

Type 2 - Passive lede
Before: "The AcmeRun Pro was designed to be used by runners who need stability on loose terrain."
After: "AcmeRun designed the AcmeRun Pro for trail runners who need stability on loose and technical terrain."
Signal fixed: Active voice, entity density +1

Type 3 - Pronoun-dependent sentence
Before: "It is available in sizes 4-13 UK and ships free to mainland addresses."
After: "The AcmeRun Pro Trail Shoe is available in UK sizes 4-13 and ships free to mainland UK addresses."
Signal fixed: Self-containment (pronoun resolved)

Type 4 - Stat in trailing clause
Before: "Customers who buy trail shoes often return, and our data shows that 68% reorder the same model within 18 months."
After: "68% of AcmeRun trail shoe customers reorder the same model within 18 months, according to AcmeRun internal purchase data."
Signal fixed: Stat promoted to sentence 1

Type 5 - Multi-clause run-on
Before: "Whether you are running on wet roots, dry rock, or muddy fire roads, the Vibram Megagrip outsole, which was developed in partnership with Vibram and tested across 12 terrain types in the Italian Alps, offers exceptional grip in all conditions."
After: "The Vibram Megagrip outsole delivers grip across wet roots, dry rock, and muddy fire roads. AcmeRun tested it across 12 terrain types in the Italian Alps."
Signal fixed: Run-on split, entity density maintained

Type 6 - Generic adjective-heavy opener
Before: "Our amazing, award-winning collection of premium waterproof boots is perfect for adventurous outdoor enthusiasts."
After: "AcmeTrek Waterproof Boots won the 2024 OutdoorGear UK Editor's Choice Award for best waterproof hiking boot under 400g."
Signal fixed: Entity density +3, adjectives replaced with verifiable claims

What I kept finding is that the definition-first fix produces the biggest single lift. Not because the other fixes are weak, but because a buried definition fails at the first scoring step - a retrieval system cannot match a chunk to a definition query if the definition is in clause three. Fix the lede and you unlock the rest. How expert quotes function as an extraction signal follows the same logic: an attributed quote with the expert named in the opening clause outperforms the same quote with attribution at the end.

How AI Overviews passage extraction differs from featured snippets

Featured snippets pulled a single contiguous block, selected largely on heading-to-paragraph proximity and word count match against the query. You optimised one block per page and the rest of the page was irrelevant to that snippet.

AI Overviews work differently. Based on observed citation behavior across platforms, they pull multiple non-contiguous passages from multiple pages and stitch them into a synthesised response. When Google launched AI Overviews in May 2024, the multi-source synthesis model was explicit in their framing. The extraction unit is smaller and more granular than a featured snippet - often a single strong sentence rather than a full paragraph block.

Common misread

A weak introduction and a strong paragraph three are not averaged together. Paragraph three can get lifted while your introduction contributes nothing. You do not need to rewrite an entire page - you need to identify the two or three paragraphs most likely to match high-value queries and restructure those first.

Perplexity AI and ChatGPT Search appear to operate on the same multi-source passage retrieval model - based on observed citation behavior, not confirmed internal design. OAI-SearchBot is the crawler OpenAI uses to surface pages in ChatGPT Search; how the retrieved passages are scored is not publicly documented, but observed citations are consistent with a similar chunk-scoring approach. The sentence-level patterns in this post apply across platforms, not just to Google AI Overviews.

For a full account of which pages get into the candidate pool in the first place, see page-level eligibility for AI Overviews. Passage extraction is the second problem - eligibility is the first.

A sentence-level audit checklist you can run in 15 minutes

Take any existing page and work through the top 10 paragraphs. For each paragraph, run these eight checks:

  1. Identify the main claim. What is the single most useful fact this paragraph contains? Write it down in one sentence before touching the page.
  2. Check if that claim is in sentence 1. If it appears in sentence 2, 3, or later, move it. The Google Search Quality Rater Guidelines place a premium on content that answers the query directly - the same principle applies to passage scoring.
  3. Count named entities per sentence. Flag any sentence with no named entity at all (brand, product, standard, location, person), and aim for two or more in the sentences you most want lifted. Generic sentences score weakly against specific queries.
  4. Check for pronoun dependencies. Underline every "it", "they", "this", "these", and "those". If any of those pronouns resolve to a noun in a previous sentence, the passage breaks when lifted. Replace with the explicit noun.
  5. Find all statistics. For each number, check which sentence it sits in. If it is in sentence 3 or later, promote it to sentence 1 with inline source attribution.
  6. Passive voice audit. Find every instance of "was [verb]ed", "is [verb]ed", "were [verb]ed". Rewrite to active: name the actor in the subject position.
  7. Word count per sentence. Flag any sentence over 45 words. Split or restructure. Aim for 20-35 words for a sentence you want extracted as a standalone chunk.
  8. Isolation test. Read each sentence out loud with no surrounding context. Does it answer a realistic user query? Does it make grammatical sense on its own? If not, it will not get lifted by Google AI Overviews or any equivalent retrieval system.
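Several of the checks above are mechanical enough to script for a first pass. The sketch below covers checks 3 to 7 for one paragraph under simplifying assumptions: entities come from a list you supply, only sentence-opening pronouns are flagged, any digit counts as a statistic, and passive voice is a regex guess. The function names are hypothetical, and checks 1, 2, and 8 still need a human.

```python
import re

PRONOUNS = {"it", "they", "this", "these", "those"}
PASSIVE = re.compile(r"\b(?:was|were|is|are)\s+\w+ed\b", re.IGNORECASE)


def split_sentences(paragraph: str) -> list[str]:
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def audit_paragraph(paragraph: str, entities: list[str], max_words: int = 45) -> list[str]:
    """Return human-readable flags for one paragraph (checks 3 to 7)."""
    flags = []
    for number, sentence in enumerate(split_sentences(paragraph), start=1):
        words = sentence.split()
        lowered = sentence.lower()
        # Check 3: at least one known entity per sentence.
        if not any(entity.lower() in lowered for entity in entities):
            flags.append(f"sentence {number}: no named entity")
        # Check 4 (simplified): a pronoun as the first word.
        first_word = re.sub(r"[^a-z]", "", words[0].lower())
        if first_word in PRONOUNS:
            flags.append(f"sentence {number}: opens with pronoun '{first_word}'")
        # Check 5: any figure that sits in sentence 3 or later.
        if number >= 3 and re.search(r"\d", sentence):
            flags.append(f"sentence {number}: statistic in sentence 3 or later")
        # Check 6: likely passive construction.
        if PASSIVE.search(sentence):
            flags.append(f"sentence {number}: possible passive voice")
        # Check 7: sentence length.
        if len(words) > max_words:
            flags.append(f"sentence {number}: {len(words)} words (over {max_words})")
    return flags


original = "It is available in sizes 4-13 UK and ships free to mainland addresses."
rewritten = ("The AcmeRun Pro Trail Shoe is available in UK sizes 4-13 and "
             "ships free to mainland UK addresses.")

print(audit_paragraph(original, ["AcmeRun"]))
print(audit_paragraph(rewritten, ["AcmeRun"]))
```

The original sentence comes back with two flags (no named entity, opening pronoun) and the rewrite comes back clean, which mirrors the manual result for the same pair.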

This checklist covers the manual pass. If you want the automated version - E-E-A-T signal scoring, entity density flags, and passage-level readiness scores across your full site - GEOlikeaPro's AI Readiness audit runs the same logic across your content automatically. See how it works.

The rule I'd ship: if a sentence needs its neighbors to make sense, it will not get lifted.


Ready to see which sentences on your pages are extraction-ready and which are invisible to AI systems? See where you stand with GEOlikeaPro's AI Readiness audit.

FAQ

What sentence structure gets extracted by Google AI Overviews?

Sentences that are extracted tend to be definition-first (the subject named and defined in the opening clause), self-contained (no pronoun references that resolve outside the sentence), entity-dense (two or more named brands, products, or standards), and stat-forward (any number placed in the first sentence with inline attribution). Based on how retrieval systems generally work, a chunk that answers the query without surrounding context scores higher than one that requires it.

How long should a sentence be to get cited in AI Overviews?

From our audit observations, single extracted sentences tend to fall in the 20-40 word range. Sentences over 45 words, especially those with subordinate clauses and passive constructions, are harder to lift cleanly and are less likely to appear as extracted passages. These are observed patterns, not confirmed Google thresholds - treat them as calibration guidance.

Does putting the definition first help AI Overviews extract my content?

Yes - definition-first structure is the single most impactful sentence-level change in our 50+ brand audit data. If the subject of a sentence is buried in the second or third clause, retrieval systems cannot match the chunk to a definition query at the opening tokens. Moving the definition to the first clause makes the passage matchable and liftable without editing.

Where should I place statistics in a paragraph to get lifted by AI?

Statistics should appear in the first sentence of a paragraph, not the third or fourth. An extraction-ready statistic has three components in the same sentence: the number, the unit or context, and the source attribution. A number buried in a trailing clause is often orphaned when the retrieval system chunks the paragraph, and the chunk that contains it may not be the one that gets scored for the query.

How is AI Overviews passage extraction different from featured snippets?

Featured snippets pulled one contiguous block from one page, selected based on heading-paragraph proximity and word count match. AI Overviews can pull multiple non-contiguous passages from different pages and stitch them into a synthesised response - based on observed citation behavior, not confirmed vendor architecture. This means each paragraph on your page competes independently, and a strong paragraph three can get lifted even if your introduction is weak.

What is the ideal passage length for AI citation?

Based on observed extraction patterns from our audit data, single extracted sentences tend to run 20-40 words, and multi-sentence extracted passages tend to run 60-120 words total. Passages shorter than 20 words often lack enough context to be query-matchable; passages longer than 120 words often contain subordinate clauses or qualifications that make them harder to lift without editing.
