What sift does

sift has one job: make the augmentation layer between a search backend and an AI agent explicit.

A raw SERP is 10 URLs + titles + descriptions. From an agent’s perspective that’s undifferentiated text — the agent has no built-in signal for whether atlassian.com/blog should be weighed differently from nature.com/articles. sift inserts a classification step between the backend and the agent, and the classification is fully visible in both source and output.

The design is deliberately shaped around specific, recurring failure modes of agent-driven search — vocabulary mismatch, vendor-dominated SERPs, affiliate contamination, parasite SEO, TLD-anchored false authority. Each sift feature maps to one of these. See Agent search failure modes for the full mapping.

Every result carries a quality vector:

  • tier — one of 9 categorical labels (regulated_primary, peer_reviewed, independent_editorial, vendor_primary, vendor_content_marketing, affiliate, content_farm, ugc, unknown)
  • editorial_standards, commercial_intent, self_promoting, third_party, domain_content_mismatch — orthogonal axes
  • authoritative_weight — 0..1 scalar derived from tier + editorial standards + confidence
  • confidence — 0..1, the classifier’s own certainty
  • reason — under 80 characters explaining the tier
  • signals[] — normalized contribution log (which inputs drove the decision)

See Quality vector.
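The per-result fields above can be sketched as a data structure. The field names and ranges come from the list; the tier base weights and the exact derivation of authoritative_weight are illustrative assumptions, not sift's actual formula:

```python
from dataclasses import dataclass, field

TIERS = (
    "regulated_primary", "peer_reviewed", "independent_editorial",
    "vendor_primary", "vendor_content_marketing", "affiliate",
    "content_farm", "ugc", "unknown",
)

@dataclass
class QualityVector:
    tier: str                       # one of the 9 categorical labels
    editorial_standards: float      # orthogonal axis, assumed 0..1 here
    commercial_intent: float
    self_promoting: bool
    third_party: bool
    domain_content_mismatch: bool
    confidence: float               # classifier's own certainty, 0..1
    reason: str                     # under 80 characters explaining the tier
    signals: list = field(default_factory=list)  # contribution log

# Illustrative only: per-tier base weights and the blend with
# editorial standards and confidence are assumptions for this sketch.
TIER_BASE = dict(zip(TIERS, (1.0, 0.95, 0.8, 0.5, 0.35, 0.15, 0.1, 0.3, 0.2)))

def authoritative_weight(qv: QualityVector) -> float:
    raw = TIER_BASE[qv.tier] * (0.5 + 0.5 * qv.editorial_standards)
    return round(raw * qv.confidence, 3)  # stays in 0..1
```

The point of the structure, whatever the real derivation, is that tier, editorial standards, and confidence combine into a single 0..1 scalar the agent can weigh directly.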

Per-query landscape metrics:

  • tier_distribution — counts per tier over the full SERP
  • mean_authoritative_weight — the SERP’s overall trust level
  • diversity_entropy — Shannon entropy of tier distribution
  • vendor_dominance_ratio — fraction of vendor-published results

See Aggregate vector.
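The four landscape metrics are simple functions of the per-result tiers and weights. A minimal sketch, assuming results arrive as (tier, authoritative_weight) pairs and that "vendor-published" means the two vendor tiers:

```python
import math
from collections import Counter

def aggregate(results):
    """results: list of (tier, authoritative_weight) pairs for one SERP."""
    n = len(results)
    dist = Counter(tier for tier, _ in results)
    mean_w = sum(w for _, w in results) / n
    # Shannon entropy (bits) over the tier distribution:
    # 0 when one tier dominates entirely, higher when tiers are mixed
    entropy = -sum((c / n) * math.log2(c / n) for c in dist.values())
    vendor = {"vendor_primary", "vendor_content_marketing"}
    vdr = sum(dist[t] for t in vendor) / n
    return {
        "tier_distribution": dict(dist),
        "mean_authoritative_weight": round(mean_w, 3),
        "diversity_entropy": round(entropy, 3),
        "vendor_dominance_ratio": round(vdr, 3),
    }
```

A SERP split evenly between two tiers yields an entropy of exactly 1 bit, which is why entropy works as a compact diversity signal.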

Summary hints are deterministic meta-observations the agent must incorporate when synthesizing. They fire on specific structural conditions: heavy vendor dominance, parasite-SEO mismatch, structurally commercial SERPs (no non-commercial source to triangulate against), and so on.

See Summary hints.
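Because hints fire on structural conditions rather than model judgment, they reduce to plain threshold rules over the aggregate vector. A sketch of two such rules; the threshold value, hint strings, and the tier set counted as non-commercial are illustrative assumptions:

```python
def summary_hints(agg, vendor_threshold=0.6):
    """Deterministic rules over the aggregate vector. Thresholds and
    hint text are illustrative, not sift's actual rule set."""
    hints = []
    if agg["vendor_dominance_ratio"] >= vendor_threshold:
        hints.append("vendor_dominated: most results are vendor-published")
    # Assumed non-commercial tiers for the triangulation check.
    noncommercial = {"regulated_primary", "peer_reviewed",
                     "independent_editorial", "ugc"}
    if not noncommercial & agg["tier_distribution"].keys():
        hints.append("structurally_commercial: no non-commercial "
                     "source to triangulate against")
    return hints
```

The same aggregate vector always produces the same hints, which is what makes them safe to treat as obligations on the agent rather than suggestions.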

What sift does not do

  • Rank or re-order. Brave’s original ranking is preserved.
  • Fetch page content. Classification uses SERP metadata only (URL, title, description).
  • Filter hard by default. recommended_action (keep / tag / block) is advisory — the agent decides.
  • Replace Google Safe Browsing. safety_flag is a parallel axis, not a tier. Classification and safety are orthogonal.
  • In-process fine-tuning. Prompt refinement is an offline process, informed by the learning-loop observation log.
  • Pre-filter below a threshold. Opacity would defeat the entire design. recommended_action is made visible precisely so it can be overridden.
  • Generate sources that aren’t in the SERP. sift is diagnostic, not generative. If a SERP is 100% vendor content, sift reports that — it cannot invent peer-reviewed work that wasn’t indexed.
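The advisory nature of recommended_action means the filtering policy lives on the agent side. A hypothetical agent-side policy, assuming each result is a dict with recommended_action and authoritative_weight fields (field names from the text; the override logic is an illustration, not part of sift):

```python
def apply_policy(results, min_weight=0.3):
    """Agent-side policy over sift's advisory recommended_action.
    Keeps non-blocked results, but overrides a 'block' when the
    weight clears a floor, and never returns an empty SERP."""
    kept = [r for r in results
            if r["recommended_action"] != "block"
            or r["authoritative_weight"] >= min_weight]
    # Fall back to the full SERP rather than silently hiding everything,
    # consistent with the no-opaque-pre-filtering design.
    return kept or results
```

Because recommended_action is visible rather than enforced, a policy like this can be as strict or as permissive as the agent's task requires.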