Quality vector

Every result in a sift response carries a quality_vector. It is the core transparent artifact: each field is derived from a specific input and logged in signals[] so downstream agents (and you) can trace why a classification landed where it did.

interface QualityVector {
  schema_version: "1.0";
  tier:
    | "regulated_primary"        // SEC / EDGAR / gov / court records
    | "peer_reviewed"            // arXiv / PubMed / .edu papers
    | "independent_editorial"    // BBC / Reuters / Ars Technica (sans sponsored)
    | "vendor_primary"           // vendor's own product / docs / homepage
    | "vendor_content_marketing" // vendor blog / strategic advice
    | "affiliate"                // "Best X" listicle + commercial intent
    | "content_farm"             // templated / AI / low-effort aggregator
    | "ugc"                      // Reddit / forums / community
    | "unknown";
  editorial_standards: "high" | "medium" | "low" | "unknown";
  self_promoting: boolean;
  third_party: boolean;
  commercial_intent: "high" | "medium" | "low" | "none";
  domain_content_mismatch: boolean; // parasite-SEO flag
  authoritative_weight: number;     // 0..1
  confidence: number;               // 0..1
  reason: string;                   // under 80 chars
  signals: Array<{
    origin: "safety" | "authoritative" | "llm";
    type: string;
    match: string;
    weight: number;
  }>;
}
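The range comments in the interface (0..1 for the two scalars, under 80 characters for reason) can be checked at runtime before an agent trusts a vector. The following is a minimal sketch; `RangeCheckedFields` and `checkRanges` are hypothetical names for illustration and are not part of sift.

```typescript
// Hypothetical helper: only the range-constrained fields of QualityVector are checked.
interface RangeCheckedFields {
  authoritative_weight: number; // documented as 0..1
  confidence: number;           // documented as 0..1
  reason: string;               // documented as under 80 chars
}

// Returns a list of problems; an empty list means the documented ranges hold.
function checkRanges(v: RangeCheckedFields): string[] {
  const problems: string[] = [];
  const inUnitInterval = (n: number): boolean =>
    Number.isFinite(n) && n >= 0 && n <= 1;

  if (!inUnitInterval(v.authoritative_weight)) {
    problems.push("authoritative_weight outside 0..1");
  }
  if (!inUnitInterval(v.confidence)) {
    problems.push("confidence outside 0..1");
  }
  if (v.reason.length >= 80) {
    problems.push("reason is not under 80 chars");
  }
  return problems;
}
```

A consumer might log the returned problems and treat the result as `unknown` rather than reject the whole response; that policy is a design choice, not documented behaviour.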

See Tier definitions for the full set of examples and boundary rules. Briefly:

| Tier | Example | Typical use by agent |
| --- | --- | --- |
| regulated_primary | SEC filings, government publications, court records | Cite as authoritative |
| peer_reviewed | arXiv, PubMed, SAGE/Elsevier journal articles | Cite as authoritative |
| independent_editorial | BBC, Reuters, Ars Technica (main articles, not sponsored) | Cite with attribution |
| vendor_primary | Product homepage, API docs, official changelog | Cite for vendor facts |
| vendor_content_marketing | HubSpot blog, Stripe blog, VC firm thought leadership | Treat as positioning |
| affiliate | “Best X 2026” listicles | Commercial; do not use for objective comparison |
| content_farm | Templated AI / SEO churn | Do not cite |
| ugc | Reddit, Stack Overflow, Quora | Treat as opinion |
| unknown | Cannot determine from signals | Flag uncertainty |
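An agent could encode the "typical use" column as a lookup keyed by tier. This is a sketch only: the handling strings are copied from the table, while `handlingByTier` and `handlingFor` are hypothetical names, not sift exports.

```typescript
type Tier =
  | "regulated_primary"
  | "peer_reviewed"
  | "independent_editorial"
  | "vendor_primary"
  | "vendor_content_marketing"
  | "affiliate"
  | "content_farm"
  | "ugc"
  | "unknown";

// Record<Tier, string> makes the compiler reject the map if any tier is missing.
const handlingByTier: Record<Tier, string> = {
  regulated_primary: "Cite as authoritative",
  peer_reviewed: "Cite as authoritative",
  independent_editorial: "Cite with attribution",
  vendor_primary: "Cite for vendor facts",
  vendor_content_marketing: "Treat as positioning",
  affiliate: "Commercial; do not use for objective comparison",
  content_farm: "Do not cite",
  ugc: "Treat as opinion",
  unknown: "Flag uncertainty",
};

// Hypothetical accessor an agent might call before quoting a source.
function handlingFor(tier: Tier): string {
  return handlingByTier[tier];
}
```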

tier is the primary categorical dimension, but several orthogonal flags give the agent finer-grained context:

  • editorial_standards: high / medium / low / unknown. A reputable domain’s /sponsored/ path has medium or low standards even if the parent brand is high.
  • self_promoting — true when the publisher has a direct stake in recommending itself (e.g., a hosting company’s “Best Web Hosting” list).
  • third_party — true when the publisher is independent of the subject.
  • commercial_intent: high / medium / low / none. Orthogonal to tier; a vendor_primary page can be none (docs) or high (pricing).
  • domain_content_mismatch — true when the domain’s implied business strongly differs from the content topic. A parasite-SEO flag.
  • authoritative_weight is a scalar in 0..1 derived from tier + editorial_standards + confidence + domain_content_mismatch. It’s what you’d multiply source claims by when aggregating. The exact formula lives in src/vectorize.ts#authoritativeWeightFromLlm.
  • confidence is the classifier’s own certainty about the whole classification. Low confidence + unusual signals is a good trigger for an agent to caveat the result.
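To illustrate how the two scalars might be consumed, the sketch below weights each source's vote on a claim by its authoritative_weight and flags low-confidence classifications for a caveat. It does not reproduce the formula in src/vectorize.ts. `ScoredSource`, `supportsClaim`, `weightedSupport`, `needsCaveat`, and the 0.5 threshold are all assumptions made for illustration.

```typescript
// Hypothetical shape: one search result plus the agent's own reading of it.
interface ScoredSource {
  url: string;
  supportsClaim: boolean;       // assumption: the agent decided this upstream
  authoritative_weight: number; // 0..1, taken from the quality vector
  confidence: number;           // 0..1, taken from the quality vector
}

// Share of total authoritative weight that backs the claim (0 when there is no weight).
function weightedSupport(sources: ScoredSource[]): number {
  let total = 0;
  let supporting = 0;
  for (const s of sources) {
    total += s.authoritative_weight;
    if (s.supportsClaim) {
      supporting += s.authoritative_weight;
    }
  }
  return total === 0 ? 0 : supporting / total;
}

// Assumed threshold: classifications below it get a caveat in the agent's answer.
function needsCaveat(source: ScoredSource, minConfidence: number = 0.5): boolean {
  return source.confidence < minConfidence;
}
```

Dividing by the total weight keeps the score in 0..1, so one high-weight filing can outweigh several low-weight listicles.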

Every classification is backed by at least one signal. Three origins:

  • safety — Google Safe Browsing (threat type as match)
  • authoritative — known-trusted / known-good cache hit (domain as match)
  • llm — LLM judge output (tier + reason as match)

A result can have multiple signals. For example, a reddit.com URL that Google Safe Browsing also flagged would carry both authoritative (known-good:ugc_community) and safety signals.
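The reddit.com case can be sketched by grouping a result's signals by origin. The `Signal` shape follows the interface above. The `type` strings and weights in the sample data are assumed values for illustration, and `groupByOrigin` is a hypothetical helper.

```typescript
type SignalOrigin = "safety" | "authoritative" | "llm";

interface Signal {
  origin: SignalOrigin;
  type: string;
  match: string;
  weight: number;
}

// Buckets signals by origin so an agent can ask "was there any safety signal?"
function groupByOrigin(signals: Signal[]): Record<SignalOrigin, Signal[]> {
  const grouped: Record<SignalOrigin, Signal[]> = {
    safety: [],
    authoritative: [],
    llm: [],
  };
  for (const s of signals) {
    grouped[s.origin].push(s);
  }
  return grouped;
}

// Illustrative data only: type strings and weights are assumptions.
const redditSignals: Signal[] = [
  { origin: "authoritative", type: "known-good:ugc_community", match: "reddit.com", weight: 1 },
  { origin: "safety", type: "safe_browsing", match: "SOCIAL_ENGINEERING", weight: 1 },
];

const groupedRedditSignals = groupByOrigin(redditSignals);
const flaggedBySafeBrowsing = groupedRedditSignals.safety.length > 0;
```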

The quality vector exists because agents need to explain their reasoning. If an agent writes “According to industry data, the SaaS magic number benchmark is 0.7-1.0”, the user (or a reviewer) should be able to ask “where did ‘industry data’ come from?”, and the agent should be able to answer “10 vendor_content_marketing blog posts from VC firms, with mean authoritative_weight 0.23; no peer-reviewed source was in the SERP.” That transparency is what the quality vector enables.
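A provenance answer of that kind comes down to a count and a mean weight per tier. The sketch below is one way to compute it; `summarizeByTier` and `TierSummary` are hypothetical names, and the input is assumed to be the quality vectors of the results the agent relied on.

```typescript
interface TierSummary {
  count: number;
  meanWeight: number;
}

// Counts sources per tier and averages their authoritative_weight.
function summarizeByTier(
  sources: Array<{ tier: string; authoritative_weight: number }>
): Map<string, TierSummary> {
  const totals = new Map<string, { count: number; total: number }>();
  for (const s of sources) {
    const entry = totals.get(s.tier) || { count: 0, total: 0 };
    entry.count += 1;
    entry.total += s.authoritative_weight;
    totals.set(s.tier, entry);
  }

  const summary = new Map<string, TierSummary>();
  totals.forEach((entry, tier) => {
    summary.set(tier, { count: entry.count, meanWeight: entry.total / entry.count });
  });
  return summary;
}
```

An agent could render each entry as a sentence and state explicitly when a tier such as peer_reviewed is absent from the map.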