Pipeline
Brave Search ──► sift vectorizer ──► your agent
                    (raw SERP)
                        │
                        ├─ 1. safety     Google Safe Browsing (orthogonal axis)
                        ├─ 2. cache      known-trusted / known-good presets (LLM skipped)
                        ├─ 3. llm        judgeResultVector() → per-result QualityVector
                        ├─ 4. aggregate  tier distribution, diversity, vendor dominance
                        ├─ 5. recommend  policy applied per-result (keep / tag / block)
                        └─ 6. hints      deterministic meta-observations for the agent

Stage 1 — Fetch
search_vectorized forwards the query to Brave Search, requesting max_results + 5 to give the aggregate a broader landscape view. Country / search_lang localization is passed through.
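As a rough sketch, the over-fetch might look like this; buildBraveQuery and the option shape are hypothetical, and the real query-parameter names come from Brave's Search API:

```typescript
interface SearchOptions {
  maxResults: number;
  country?: string;    // localization pass-through
  searchLang?: string; // localization pass-through
}

// Request 5 extra results so the aggregate (Stage 5) sees a broader
// landscape than the trimmed view returned to the caller.
const AGGREGATE_HEADROOM = 5;

function buildBraveQuery(q: string, opts: SearchOptions): URLSearchParams {
  const params = new URLSearchParams({ q });
  params.set("count", String(opts.maxResults + AGGREGATE_HEADROOM));
  if (opts.country) params.set("country", opts.country);
  if (opts.searchLang) params.set("search_lang", opts.searchLang);
  return params;
}
```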
Stage 2 — Safety (orthogonal)
All URLs are batched against Google Safe Browsing. A hit sets safety_flag with the threat category (MALWARE / SOCIAL_ENGINEERING / UNWANTED_SOFTWARE) and forces recommended_action=block regardless of tier.
Safety is a parallel axis, not a tier. A high-quality editorial brand compromised by a supply-chain attack still gets blocked; a content farm that isn’t actively malicious still gets tier-classified normally.
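A minimal sketch of the orthogonal safety axis, assuming a hypothetical per-result record shape (the real schema is sift's own); the threat categories mirror Google Safe Browsing's:

```typescript
type ThreatCategory = "MALWARE" | "SOCIAL_ENGINEERING" | "UNWANTED_SOFTWARE";

// Hypothetical per-result shape for this sketch.
interface ResultRecord {
  url: string;
  tier: string;
  safety_flag?: ThreatCategory;
  recommended_action?: "keep" | "tag" | "block";
}

// Safety is orthogonal to tier: a hit forces a block even on a
// high-quality editorial domain; clean results pass through untouched.
function applySafety(
  r: ResultRecord,
  hits: Map<string, ThreatCategory>,
): ResultRecord {
  const threat = hits.get(r.url);
  return threat
    ? { ...r, safety_flag: threat, recommended_action: "block" }
    : r;
}
```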
Stage 3 — Allow-cache (cost optimization)
Before calling the LLM, sift checks two lists:
- known-trusted (src/known-trusted.ts) — authoritative editorial brands (BBC, Reuters, RTINGS, Wirecutter, Consumer Reports, etc.). Synthetic vector injected: tier=independent_editorial, editorial_standards=high, authoritative_weight=0.85.
- known-good (src/known-good.ts) — community, reference, dev-platform, and encyclopedic domains, categorized into buckets (ugc_community, reference_docs, encyclopedic, academic_preprint, dev_platform, video_ugc). Each bucket has a preset vector.
Both are cost caches, not allowlists. A safety threat on a known-trusted domain still forces a block.
.gov / .edu / .ac.xx TLDs are deliberately not cached — the same domain can host peer-reviewed papers, popular press, degree-program marketing, and regulatory guidance. The LLM prompt (boundary rule 6) classifies them correctly per-URL.
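The lookup and the TLD carve-out can be sketched like this; the domain lists, the known-good preset values, and the helper names here are illustrative stand-ins (only the known-trusted preset values are quoted from above):

```typescript
interface PresetVector {
  tier: string;
  editorial_standards?: string;
  authoritative_weight: number;
}

// Illustrative stand-ins for src/known-trusted.ts / src/known-good.ts.
const KNOWN_TRUSTED = new Set(["bbc.com", "reuters.com", "rtings.com"]);
const KNOWN_GOOD: Record<string, PresetVector> = {
  "stackoverflow.com": { tier: "ugc_community", authoritative_weight: 0.6 },
  "en.wikipedia.org": { tier: "encyclopedic", authoritative_weight: 0.7 },
};

// Academic / government TLDs are deliberately uncacheable: the same domain
// hosts mixed content, so those URLs always go to the LLM judge.
function isUncacheableTld(domain: string): boolean {
  return /\.(gov|edu)$|\.ac\.[a-z]{2}$/.test(domain);
}

function cacheLookup(domain: string): PresetVector | null {
  if (isUncacheableTld(domain)) return null;
  if (KNOWN_TRUSTED.has(domain)) {
    // Synthetic vector for known-trusted editorial brands.
    return {
      tier: "independent_editorial",
      editorial_standards: "high",
      authoritative_weight: 0.85,
    };
  }
  return KNOWN_GOOD[domain] ?? null; // null → fall through to the LLM
}
```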
Stage 4 — LLM judge
For everything else, judgeResultVector() sends {title, description, domain, url_path} to the configured LLM with the vectorizer system prompt (in src/llm-judge.ts#VECTOR_SYSTEM_PROMPT). The prompt covers:
- Definitions of the 9 tiers with concrete examples
- Six boundary rules (affiliate listicles, vendor blog paths, sponsored/contributor sub-paths, industry-front lobbies, domain/content mismatch, academic-TLD discipline)
- Commercial-intent, editorial-standards, and self-promotion axes
Returned JSON is validated (validateVectorShape) and assembled into a full QualityVector with authoritative_weight derived from tier + editorial_standards + confidence + mismatch.
If the LLM call fails (timeout, HTTP error, non-JSON), sift emits a fallback vector with tier=unknown and a TLD-aware authoritative_weight (0.40 for academic/gov TLDs, 0.15 otherwise). The agent can tell from the reason field that classification failed.
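The fallback can be sketched as follows; the helper name is hypothetical, while the tier, weights, and reason field follow the description above:

```typescript
interface FallbackVector {
  tier: "unknown";
  authoritative_weight: number;
  reason: string;
}

// Emitted when the LLM call fails (timeout, HTTP error, non-JSON).
function fallbackVector(domain: string, reason: string): FallbackVector {
  // Academic / government TLDs get the benefit of the doubt: 0.40 vs 0.15.
  const academicOrGov = /\.(gov|edu)$|\.ac\.[a-z]{2}$/.test(domain);
  return {
    tier: "unknown",
    authoritative_weight: academicOrGov ? 0.4 : 0.15,
    reason, // e.g. "llm_timeout" — tells the agent classification failed
  };
}
```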
Stage 5 — Aggregate
computeAggregate() walks the per-result vectors and produces tier_distribution, mean_authoritative_weight, diversity_entropy, vendor_dominance_ratio, and mean_editorial_standards. The aggregate is always computed over the full fetched SERP, not the trimmed view.
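Two of these metrics are easy to sketch; the function names and exact formulas below are assumptions (Shannon entropy over the tier distribution, and the share of the single most frequent domain):

```typescript
function tierDistribution(tiers: string[]): Record<string, number> {
  const dist: Record<string, number> = {};
  for (const t of tiers) dist[t] = (dist[t] ?? 0) + 1;
  return dist;
}

// Shannon entropy of the tier distribution: 0 when one tier dominates,
// higher when the SERP is diverse across tiers.
function diversityEntropy(tiers: string[]): number {
  const dist = tierDistribution(tiers);
  let h = 0;
  for (const count of Object.values(dist)) {
    const p = count / tiers.length;
    h -= p * Math.log2(p);
  }
  return h;
}

// Share of results coming from the single most frequent domain.
function vendorDominanceRatio(domains: string[]): number {
  const counts = tierDistribution(domains);
  return Math.max(...Object.values(counts)) / domains.length;
}
```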
Stage 6 — Recommend action
recommendAction() applies the default policy (src/vectorize.ts#DEFAULT_RECOMMEND_POLICY):
- block — tier ∈ {affiliate, content_farm}, OR safety_flag is set
- tag — tier ∈ {vendor_content_marketing, unknown}, OR domain_content_mismatch is true
- keep — everything else
The policy lives in a single source file. Customize it by editing that file, or, in a future version, by passing your own policy.
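A runnable sketch of the default policy rules as listed above (the real DEFAULT_RECOMMEND_POLICY in src/vectorize.ts may be shaped differently):

```typescript
type Action = "keep" | "tag" | "block";

// Minimal view of a per-result vector for this sketch.
interface VectorView {
  tier: string;
  safety_flag?: string;
  domain_content_mismatch?: boolean;
}

function recommendAction(v: VectorView): Action {
  // Safety hits and the lowest-value tiers are blocked outright.
  if (v.safety_flag || v.tier === "affiliate" || v.tier === "content_farm") {
    return "block";
  }
  // Vendor marketing, failed classification, and mismatches get tagged.
  if (
    v.tier === "vendor_content_marketing" ||
    v.tier === "unknown" ||
    v.domain_content_mismatch
  ) {
    return "tag";
  }
  return "keep";
}
```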
Stage 7 — Summary hints
generateSummaryHints() inspects the aggregate and the per-result mismatch/safety flags, producing deterministic hint strings. See Summary hints.
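Hint generation might look like this; the thresholds and hint wording below are invented for illustration, and only the deterministic aggregate-inspection pattern comes from the description above:

```typescript
// Illustrative aggregate slice; field names follow the Stage 5 output.
interface AggregateView {
  vendor_dominance_ratio: number;
  diversity_entropy: number;
  mean_authoritative_weight: number;
}

// Deterministic hints: plain string observations derived from thresholds,
// with no LLM involved. Thresholds and wording are invented here.
function summaryHints(agg: AggregateView): string[] {
  const hints: string[] = [];
  if (agg.vendor_dominance_ratio > 0.5) {
    hints.push("a single domain dominates this SERP");
  }
  if (agg.diversity_entropy < 1.0) {
    hints.push("low tier diversity across results");
  }
  if (agg.mean_authoritative_weight < 0.3) {
    hints.push("few authoritative sources found");
  }
  return hints;
}
```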
Stage 8 — Verbosity projection
The caller’s verbosity parameter (full / concise / summary) selects which fields to serialize. See Verbosity modes.
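The projection can be sketched as a per-mode field whitelist; the payload shape and field sets below are illustrative, not sift's actual per-mode selection:

```typescript
type Verbosity = "full" | "concise" | "summary";

// Hypothetical serialized result shape for this sketch.
interface ResultPayload {
  url: string;
  tier: string;
  recommended_action: string;
  quality_vector?: object;
  hints?: string[];
}

// Illustrative field sets per verbosity mode.
const FIELDS: Record<Verbosity, (keyof ResultPayload)[]> = {
  full: ["url", "tier", "recommended_action", "quality_vector", "hints"],
  concise: ["url", "tier", "recommended_action"],
  summary: ["url", "recommended_action"],
};

function project(r: ResultPayload, mode: Verbosity): Partial<ResultPayload> {
  const out: Record<string, unknown> = {};
  for (const f of FIELDS[mode]) {
    if (r[f] !== undefined) out[f] = r[f]; // copy only selected fields
  }
  return out as Partial<ResultPayload>;
}
```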
Stage 9 — Observation
Regardless of verbosity, the full payload is appended to data/observations.jsonl (if SIFT_OBSERVATIONS=on) and optionally mirrored to a remote HTTP store. See Learning loop.
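A sketch of the gated append, assuming the SIFT_OBSERVATIONS=on gate and JSONL path described above (the function name and serialization details are illustrative; the remote HTTP mirror is omitted):

```typescript
import { appendFileSync, mkdirSync } from "node:fs";
import { dirname } from "node:path";

// Returns true when an observation was written, false when the gate is off.
function recordObservation(
  payload: object,
  path = "data/observations.jsonl",
): boolean {
  if (process.env.SIFT_OBSERVATIONS !== "on") return false; // gated off
  mkdirSync(dirname(path), { recursive: true });
  appendFileSync(path, JSON.stringify(payload) + "\n"); // one object per line
  return true;
}
```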
Why no domain blocklist or keyword filter?
Prior versions layered a plain-text data/blocklists/ directory and a harvested keyword list in front of the LLM. Empirically these caught only what Brave’s own ranking had already demoted, while adding taxonomy ambiguity. The redesign removed that scaffolding so classification quality is attributable to the LLM prompt alone — and so it can be refined offline via the observation log.