Agent search failure modes
LLM agents that search the web inherit the backend’s opacity. A raw SERP is 10 URLs + titles + descriptions with no quality discrimination — the agent treats them uniformly. Specific, recurring failure modes follow, and sift’s design is shaped around them. Each feature in sift maps to a concrete weakness it offsets.
1. Vocabulary mismatch
Symptom: agents pass the user’s natural language straight to the search backend. The query what makes a good leader returns a SERP dominated by consulting-firm blogs and online-university lead magnets. The same concept phrased as transformational leadership meta-analysis effect size returns 10/10 peer-reviewed journals.
Measured: mean_authoritative_weight = 0.38 (general phrasing) vs 0.87 (academic phrasing). Identical topic.
How sift addresses it: mean_authoritative_weight and tier_distribution make the shift visible. Below ~0.3, the agent has explicit grounds to re-query with more academic vocabulary, caveat its answer, or defer.
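A minimal agent-side sketch of that check, assuming the summary exposes mean_authoritative_weight as a number in [0, 1]. The interface and helper names are illustrative; sift’s real types live in src/types.ts.

```ts
// Assumed shape -- the authoritative definitions are in src/types.ts.
interface SerpSummary {
  mean_authoritative_weight: number;         // mean over the SERP, 0..1
  tier_distribution: Record<string, number>; // tier name -> result count
}

// The ~0.3 floor described above. Below it, the agent re-queries with
// more academic vocabulary, caveats its answer, or defers.
const REQUERY_FLOOR = 0.3;

function shouldRequery(summary: SerpSummary): boolean {
  return summary.mean_authoritative_weight < REQUERY_FLOOR;
}
```

With the measurements above, the general-phrasing SERP (0.38) sits just above the floor, so the signal is a caveat rather than an automatic re-query.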
2. Structural vendor dominance (no triangulation target)
Symptom: for SaaS operational queries like series B magic number benchmark, the entire SERP is vendor content marketing. There is no independent editorial or peer-reviewed alternative in the SERP to triangulate against. The agent happily synthesizes “industry benchmarks show…” from 10 VC blog posts.
How sift addresses it: the structural-dominance summary hint fires when peer_reviewed + independent_editorial + regulated_primary == 0:
“No peer-reviewed, independent-editorial, or regulated-primary source in SERP — triangulation against non-commercial sources is not possible here. Treat aggregated claims as commercial positioning, not research.”
The agent is expected to incorporate this verbatim into its response.
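The firing condition is simple enough to restate as a predicate. A sketch, assuming tier_distribution maps tier names to counts (the field name is an assumption; the three tier names come from the rule above):

```ts
// Sketch of the structural-dominance condition; the real check lives in
// sift's summary layer, not here.
function structuralDominance(tierDistribution: Record<string, number>): boolean {
  const count = (tier: string) => tierDistribution[tier] ?? 0;
  // peer_reviewed + independent_editorial + regulated_primary == 0
  return (
    count("peer_reviewed") +
      count("independent_editorial") +
      count("regulated_primary") === 0
  );
}
```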
3. Affiliate listicle contamination
Symptom: reputable publishers (PCMag, CNET, Wired, The Verge) produce “Best X 2026” listicles that share editorial voice with commercial incentive. An agent pattern-matching on brand reputation treats these articles identically to the same publisher’s independent reporting.
How sift addresses it: the classifier applies tier=affiliate to the specific article regardless of publisher. The default recommended_action on affiliate is block. The tier definitions call this out explicitly.
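As a sketch of what that means downstream, assuming a per-result shape carrying tier and recommended_action (those two field names match the prose; the third action value, pass, is an assumption):

```ts
type RecommendedAction = "pass" | "tag" | "block"; // "pass" is assumed

interface ClassifiedResult {
  url: string;
  tier: string;                          // "affiliate" regardless of publisher
  recommended_action: RecommendedAction; // "block" is the default for affiliate
}

// Blocked results (affiliate listicles by default) never reach the agent's
// context; tagged results survive but carry an explicit caveat.
function usableResults(results: ClassifiedResult[]): ClassifiedResult[] {
  return results.filter((r) => r.recommended_action !== "block");
}
```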
4. Parasite SEO and domain-content mismatch
Symptom: a law firm publishing crypto price predictions, a hospital-linen supplier with weight-loss advice, a pet-food domain writing about mortgages. These look like legitimate niche expertise to an agent reading only title + description.
How sift addresses it: domain_content_mismatch=true is a first-class flag on the quality vector. It downgrades recommended_action to tag by default, and a summary hint is emitted when any result in the SERP is flagged.
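A sketch of both effects, under the same assumed result shape; the real policy is in src/vectorize.ts, and the hint wording here is illustrative:

```ts
interface Result {
  url: string;
  domain_content_mismatch: boolean;
  recommended_action: "pass" | "tag" | "block"; // "pass" is assumed
}

// Per-result: a mismatch caps the action at "tag"; it never upgrades one.
function applyMismatchPolicy(r: Result): Result {
  if (r.domain_content_mismatch && r.recommended_action === "pass") {
    return { ...r, recommended_action: "tag" };
  }
  return r;
}

// SERP-level: emit a hint when any result is flagged.
function mismatchHint(results: Result[]): string | undefined {
  const n = results.filter((r) => r.domain_content_mismatch).length;
  return n > 0
    ? `${n} result(s) publish outside their domain's subject area.`
    : undefined;
}
```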
5. TLD-anchored false authority
Symptom: .edu, .gov, and .ac.xx domains host a mix of peer-reviewed papers, popular explainers, degree-program marketing, and regulator content. Agents that equate .edu with “academic source” get misled; e.g., waldenu.edu/programs/business/resource/what-makes-a-good-leader-... cited as research.
How sift addresses it: boundary rule 6 in the classifier prompt requires classification by URL path and content, not TLD. The Walden U marketing page lands as vendor_content_marketing. A files.eric.ed.gov/fulltext/...pdf paper lands as peer_reviewed. A sec.gov/rules/... guidance document lands as regulated_primary.
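Restating boundary rule 6 as data, using the three prose examples above. This is illustrative only; the real rule lives in the classifier prompt in src/llm-judge.ts.

```ts
// URL path -> expected tier, per boundary rule 6. A hostname heuristic like
// endsWith(".edu") would put the first two hosts in one "academic" bucket;
// the path is what separates marketing from papers from regulator guidance.
const expectedTier: Record<string, string> = {
  "waldenu.edu/programs/business/resource/what-makes-a-good-leader-...":
    "vendor_content_marketing",
  "files.eric.ed.gov/fulltext/....pdf": "peer_reviewed",
  "sec.gov/rules/...": "regulated_primary",
};
```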
6. Polluted independent-editorial slot
Symptom: in B2B SaaS, VC firms (a16z, SaaStr, First Round Review) and consulting firms (Bain, Deloitte/KPMG/McKinsey “insights”) have historically filled the “independent editorial” slot with thought leadership dressed as neutral analysis. Agents trained to trust editorial-looking long-form writing mis-classify it as neutral.
How sift addresses it: the classifier tiers these explicitly as vendor_content_marketing. They are vendors of a position — their incentive is to drive business back to their firm, not to produce research. Trade associations publishing “research” about their own industry fall under the same rule, with an additional domain_content_mismatch=true.
7. Source-uniformity blindness
Symptom: an agent reading 10 search results rarely notices that all 10 come from the same source category. This is hard to catch from individual URLs alone.
How sift addresses it: diversity_entropy (Shannon entropy of the tier distribution) quantifies tier concentration. When it’s low AND mean_authoritative_weight < 0.7, a summary hint fires. (Suppressed when mean_auth is high — 10/10 peer-reviewed is a feature, not a bug.)
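A minimal sketch of the entropy computation and the combined condition. The 0.7 mean-auth bound is from the text above; the entropy threshold is an assumption, and the real computation lives in src/vectorize.ts.

```ts
// Shannon entropy in bits over the tier distribution: H = -sum(p * log2 p).
// Zero when every result shares one tier; higher as tiers spread out.
function diversityEntropy(tierDistribution: Record<string, number>): number {
  const counts = Object.values(tierDistribution).filter((n) => n > 0);
  const total = counts.reduce((a, b) => a + b, 0);
  if (total === 0) return 0;
  return -counts.reduce((h, n) => {
    const p = n / total;
    return h + p * Math.log2(p);
  }, 0);
}

function uniformityHintFires(
  tierDistribution: Record<string, number>,
  meanAuthoritativeWeight: number,
): boolean {
  const LOW_ENTROPY = 0.5; // assumed threshold, for illustration only
  return (
    diversityEntropy(tierDistribution) < LOW_ENTROPY &&
    meanAuthoritativeWeight < 0.7 // suppressed when mean_auth is high
  );
}
```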
8. Hosted-API opacity
Symptom: hosted AI-search APIs (Tavily, Exa) return opaque relevance scores. The agent can’t explain in a downstream summary why a specific source was surfaced or trusted, and can’t defend its conclusions if a reviewer asks.
How sift addresses it: every classification is reasoned. Per-result reason (under 80 characters) plus normalized signals[] (each carrying origin / type / match / weight) make the decision inspectable and auditable. The LLM prompt is in src/llm-judge.ts, the tier definitions are in src/types.ts, and the recommend policy is in src/vectorize.ts. Nothing is hidden.
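To make “inspectable” concrete, a sketch of the per-result shape the prose implies. The field names come from the text above; the types are assumptions, and the authoritative definitions are in src/types.ts.

```ts
// Assumed shapes only -- see src/types.ts for the real definitions.
interface Signal {
  origin: string; // which rule or model pass produced the signal
  type: string;   // signal category
  match: string;  // what in the URL/title/snippet triggered it
  weight: number; // contribution to the authoritative weight
}

interface Classification {
  tier: string;
  reason: string;    // under 80 characters, per the constraint above
  signals: Signal[]; // normalized; each carries origin/type/match/weight
}
```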
Where this originates
These failure modes weren’t theoretical. Each was observed empirically, by running queries across three business-specialty layers (regulated / academic / SaaS-operational) plus a vocabulary-shift comparison; the tier and aggregate outputs were what surfaced the patterns.
sift’s value is not that it prevents these failures — it’s that it makes them visible, so the downstream agent can decide how to respond (caveat, re-query, defer) rather than silently laundering commercial content as research.