Pipeline

Brave Search  ──►  sift vectorizer  ──►  your agent
 (raw SERP)            │
                       ├─ 1. safety      Google Safe Browsing (orthogonal axis)
                       ├─ 2. cache       known-trusted / known-good presets (LLM skipped)
                       ├─ 3. llm         judgeResultVector() → per-result QualityVector
                       ├─ 4. aggregate   tier distribution, diversity, vendor dominance
                       ├─ 5. recommend   policy applied per-result (keep / tag / block)
                       └─ 6. hints       deterministic meta-observations for the agent

Stage 1 — Fetch

search_vectorized forwards the query to Brave Search and requests max_results + 5 results, so the aggregate stage sees a broader slice of the SERP than the caller receives. The country and search_lang localization parameters are passed through unchanged.
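The over-fetch can be sketched as a small URL builder. This is an illustration under assumptions, not sift's actual code: the endpoint and the q / count / country / search_lang parameter names follow Brave's public Web Search API, while FetchOptions, OVERFETCH, and buildBraveSearchUrl are hypothetical names. Any upper limit Brave enforces on count is not handled here.

```typescript
// Hypothetical helper names; only the endpoint and query parameter names
// come from Brave's public Web Search API.
const BRAVE_ENDPOINT = "https://api.search.brave.com/res/v1/web/search";
const OVERFETCH = 5; // extra results requested beyond max_results

interface FetchOptions {
  query: string;
  maxResults: number;
  country?: string;
  searchLang?: string;
}

function buildBraveSearchUrl(opts: FetchOptions): string {
  const params: Array<[string, string]> = [
    ["q", opts.query],
    ["count", String(opts.maxResults + OVERFETCH)],
  ];
  // Localization is passed through only when the caller supplied it.
  if (opts.country) params.push(["country", opts.country]);
  if (opts.searchLang) params.push(["search_lang", opts.searchLang]);

  const qs = params
    .map(([k, v]) => `${encodeURIComponent(k)}=${encodeURIComponent(v)}`)
    .join("&");
  return `${BRAVE_ENDPOINT}?${qs}`;
}
```

The request itself would also need Brave's X-Subscription-Token header, which is omitted here to keep the sketch free of I/O.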

Stage 2 — Safety (orthogonal)

All URLs are batched against Google Safe Browsing. A hit sets safety_flag with the threat category (MALWARE / SOCIAL_ENGINEERING / UNWANTED_SOFTWARE) and forces recommended_action=block regardless of tier.

Safety is a parallel axis, not a tier. A high-quality editorial brand compromised by a supply-chain attack still gets blocked; a content farm that isn’t actively malicious still gets tier-classified normally.
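A minimal sketch of the batching step, assuming the Safe Browsing v4 Lookup API (threatMatches:find); the source does not say which API version sift uses. The request and response field names follow that API, while buildLookupBody, indexMatches, and the clientVersion value are hypothetical.

```typescript
type ThreatType = "MALWARE" | "SOCIAL_ENGINEERING" | "UNWANTED_SOFTWARE";

// Subset of a v4 threatMatches:find response entry.
interface ThreatMatch {
  threatType: ThreatType;
  threat: { url: string };
}

// Build one request body covering every URL in the SERP.
function buildLookupBody(urls: string[], clientId: string) {
  return {
    client: { clientId, clientVersion: "0.0.0" }, // placeholder version
    threatInfo: {
      threatTypes: ["MALWARE", "SOCIAL_ENGINEERING", "UNWANTED_SOFTWARE"],
      platformTypes: ["ANY_PLATFORM"],
      threatEntryTypes: ["URL"],
      threatEntries: urls.map((url) => ({ url })),
    },
  };
}

// Map each flagged URL to the first threat category reported for it.
// URLs absent from the result have no safety_flag.
function indexMatches(matches: ThreatMatch[] = []): Record<string, ThreatType> {
  const flags: Record<string, ThreatType> = {};
  for (const m of matches) {
    if (!(m.threat.url in flags)) flags[m.threat.url] = m.threatType;
  }
  return flags;
}
```

Keeping the flag in its own map, separate from the tier, mirrors the point above: the safety result never feeds into classification, it only overrides the final action.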

Stage 3 — Allow-cache (cost optimization)

Before calling the LLM, sift checks two lists:

known-trusted (src/known-trusted.ts) — authoritative editorial brands (BBC, Reuters, RTINGS, Wirecutter, Consumer Reports, etc.). Synthetic vector injected: tier=independent_editorial, editorial_standards=high, authoritative_weight=0.85.
known-good (src/known-good.ts) — community, reference, dev-platform, and encyclopedic domains. Categorized into buckets (ugc_community, reference_docs, encyclopedic, academic_preprint, dev_platform, video_ugc). Each bucket has a preset vector.

Both are cost caches, not allowlists. A safety threat on a known-trusted domain still forces a block.

.gov / .edu / .ac.xx TLDs are deliberately not cached — the same domain can host peer-reviewed papers, popular press, degree-program marketing, and regulatory guidance. The LLM prompt (boundary rule 6) classifies them correctly per-URL.
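The lookup order described above might look like the following sketch. The list entries are placeholders chosen for illustration, not the contents of src/known-trusted.ts or src/known-good.ts, and lookupCache, isInstitutionalTld, and CacheHit are hypothetical names. The preset vectors for the known-good buckets are not given in this section, so only the bucket name is returned.

```typescript
type Bucket =
  | "ugc_community" | "reference_docs" | "encyclopedic"
  | "academic_preprint" | "dev_platform" | "video_ugc";

// Placeholder entries only; the real lists live in the source files named above.
const KNOWN_TRUSTED: string[] = ["bbc.com", "reuters.com"];
const KNOWN_GOOD: Record<string, Bucket> = {
  "en.wikipedia.org": "encyclopedic",
  "github.com": "dev_platform",
};

type CacheHit =
  | { source: "known-trusted"; tier: "independent_editorial";
      editorial_standards: "high"; authoritative_weight: number }
  | { source: "known-good"; bucket: Bucket };

// .gov / .edu / .ac.xx domains are never served from the cache.
function isInstitutionalTld(domain: string): boolean {
  return /\.(gov|edu)$/.test(domain) || /\.ac\.[a-z]{2}$/.test(domain);
}

// Returns a cache hit, or null when the result must go to the LLM judge.
function lookupCache(domain: string): CacheHit | null {
  const d = domain.toLowerCase();
  if (isInstitutionalTld(d)) return null;
  if (KNOWN_TRUSTED.indexOf(d) >= 0) {
    return {
      source: "known-trusted",
      tier: "independent_editorial",
      editorial_standards: "high",
      authoritative_weight: 0.85,
    };
  }
  if (Object.prototype.hasOwnProperty.call(KNOWN_GOOD, d)) {
    return { source: "known-good", bucket: KNOWN_GOOD[d] };
  }
  return null;
}
```

A null result only means "not cached"; it says nothing about quality, which is why the safety check still runs for cached domains.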

Stage 4 — LLM judge

For everything else, judgeResultVector() sends {title, description, domain, url_path} to the configured LLM with the vectorizer system prompt (in src/llm-judge.ts#VECTOR_SYSTEM_PROMPT). The prompt covers:

Definitions of the 9 tiers with concrete examples
Six boundary rules (affiliate listicles, vendor blog paths, sponsored/contributor sub-paths, industry-front lobbies, domain/content mismatch, academic-TLD discipline)
Commercial-intent, editorial-standards, and self-promotion axes

Returned JSON is validated (validateVectorShape) and assembled into a full QualityVector with authoritative_weight derived from tier + editorial_standards + confidence + mismatch.

If the LLM call fails (timeout, HTTP error, non-JSON), sift emits a fallback vector with tier=unknown and a TLD-aware authoritative_weight (0.40 for academic/gov TLDs, 0.15 otherwise). The agent can tell from the reason field that classification failed.
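The fallback can be sketched as below. The tier and the two weights come from the paragraph above; the function name, the set of TLD patterns treated as academic/gov, and the wording of the reason string are assumptions made for illustration.

```typescript
interface FallbackVector {
  tier: "unknown";
  authoritative_weight: number;
  reason: string;
}

type FailureCause = "timeout" | "http_error" | "non_json";

// Hypothetical helper: builds the vector emitted when the LLM judge fails.
function fallbackVector(domain: string, cause: FailureCause): FallbackVector {
  // Assumed pattern set: .gov, .edu, and .ac.xx count as academic/gov TLDs.
  const institutional =
    /\.(gov|edu)$/i.test(domain) || /\.ac\.[a-z]{2}$/i.test(domain);
  return {
    tier: "unknown",
    authoritative_weight: institutional ? 0.4 : 0.15,
    // Assumed format; the point is that the failure is visible to the agent.
    reason: `llm_judge_failed: ${cause}`,
  };
}
```

For example, fallbackVector("nih.gov", "timeout") yields a weight of 0.4, while fallbackVector("example.com", "non_json") yields 0.15.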

Stage 5 — Aggregate

computeAggregate() walks the per-result vectors and produces tier_distribution, mean_authoritative_weight, diversity_entropy, vendor_dominance_ratio, and mean_editorial_standards. The aggregate is always computed over the full fetched SERP (up to max_results + 5 results), not the trimmed view returned to the caller.
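A rough sketch of three of these statistics follows. The exact definitions are not given in this section, so the code assumes Shannon entropy (in bits) over the tier distribution for diversity_entropy, and the share of results tiered vendor_content_marketing for vendor_dominance_ratio. VectorLite and computeAggregateSketch are hypothetical names, and mean_editorial_standards is omitted because its numeric encoding is not specified.

```typescript
interface VectorLite {
  tier: string;
  authoritative_weight: number;
}

function computeAggregateSketch(vectors: VectorLite[]) {
  const n = vectors.length;
  const tier_distribution: Record<string, number> = {};
  let weightSum = 0;

  // Count results per tier and accumulate weights in one pass.
  for (const v of vectors) {
    tier_distribution[v.tier] = (tier_distribution[v.tier] ?? 0) + 1;
    weightSum += v.authoritative_weight;
  }

  // Assumed definition: Shannon entropy of the tier shares, in bits.
  let diversity_entropy = 0;
  for (const tier of Object.keys(tier_distribution)) {
    const p = (tier_distribution[tier] ?? 0) / n;
    diversity_entropy -= p * (Math.log(p) / Math.LN2);
  }

  // Assumed definition: fraction of results classified as vendor content.
  const vendorCount = tier_distribution["vendor_content_marketing"] ?? 0;

  return {
    tier_distribution,
    mean_authoritative_weight: n > 0 ? weightSum / n : 0,
    diversity_entropy,
    vendor_dominance_ratio: n > 0 ? vendorCount / n : 0,
  };
}
```

Under these assumptions, a SERP split evenly between two tiers has an entropy of 1 bit, and a SERP with a single tier has an entropy of 0.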

Stage 6 — Recommend

recommendAction() applies the default policy (src/vectorize.ts#DEFAULT_RECOMMEND_POLICY) to each result:

block — tier ∈ {affiliate, content_farm}, OR safety_flag is set
tag — tier ∈ {vendor_content_marketing, unknown}, OR domain_content_mismatch is true
keep — everything else

The policy lives in a single source file. To customize it today, edit that file; passing your own policy at call time is planned but not yet supported.
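The three rules can be sketched as a small function. The shape of DEFAULT_RECOMMEND_POLICY is not shown in this section, so RecommendPolicy, POLICY_SKETCH, Judged, and recommendActionSketch are hypothetical stand-ins; the rule order (block before tag before keep) is assumed from the safety override described in Stage 2.

```typescript
type Action = "keep" | "tag" | "block";

// Hypothetical policy shape; the real one is in src/vectorize.ts.
interface RecommendPolicy {
  blockTiers: string[];
  tagTiers: string[];
}

const POLICY_SKETCH: RecommendPolicy = {
  blockTiers: ["affiliate", "content_farm"],
  tagTiers: ["vendor_content_marketing", "unknown"],
};

interface Judged {
  tier: string;
  domain_content_mismatch: boolean;
  safety_flag?: string; // threat category, present only on a Safe Browsing hit
}

function recommendActionSketch(
  v: Judged,
  policy: RecommendPolicy = POLICY_SKETCH,
): Action {
  // A safety hit blocks regardless of tier.
  if (v.safety_flag || policy.blockTiers.indexOf(v.tier) >= 0) return "block";
  if (v.domain_content_mismatch || policy.tagTiers.indexOf(v.tier) >= 0) return "tag";
  return "keep";
}
```

Taking the policy as a parameter is one way the planned "pass your own" customization could work; it is shown here only to make the rules easy to vary in a test.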

Stage 7 — Summary hints

generateSummaryHints() inspects the aggregate and the per-result mismatch/safety flags, producing deterministic hint strings. See Summary hints.

Stage 8 — Verbosity projection

The caller’s verbosity parameter (full / concise / summary) selects which fields to serialize. See Verbosity modes.

Stage 9 — Observation

Regardless of verbosity, the full payload is appended to data/observations.jsonl (if SIFT_OBSERVATIONS=on) and optionally mirrored to a remote HTTP store. See Learning loop.
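The local logging step might be sketched as follows. The path and the SIFT_OBSERVATIONS=on switch come from the paragraph above; recordObservation, Env, and AppendFn are hypothetical, and the file write is injected as a callback so the sketch stays free of I/O. The remote mirror is not covered.

```typescript
type Env = Record<string, string | undefined>;
type AppendFn = (path: string, line: string) => void;

const OBSERVATIONS_PATH = "data/observations.jsonl";

// Appends the full payload as one JSON line when observation logging is on.
// Returns true if a line was written.
function recordObservation(payload: unknown, env: Env, append: AppendFn): boolean {
  if (env["SIFT_OBSERVATIONS"] !== "on") return false;
  // JSONL: exactly one serialized object per line.
  append(OBSERVATIONS_PATH, JSON.stringify(payload) + "\n");
  return true;
}
```

In Node, the callback could be fs.appendFileSync and env could be process.env. The payload passed in should be the full result, not the verbosity-projected one.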

Why no domain blocklist or keyword filter?

Prior versions layered a plain-text data/blocklists/ directory and a harvested keyword list in front of the LLM. Empirically these caught only what Brave’s own ranking had already demoted, while adding taxonomy ambiguity. The redesign removed that scaffolding so classification quality is attributable to the LLM prompt alone — and so it can be refined offline via the observation log.