Quality vector

Every result in a sift response carries a quality_vector. It is the core transparent artifact: each field is derived from a specific input and logged in signals[] so downstream agents (and you) can trace why a classification landed where it did.

Schema

interface QualityVector {
  schema_version: "1.0",
  tier:
    | "regulated_primary"        // SEC / EDGAR / gov / court records
    | "peer_reviewed"            // arXiv / PubMed / .edu papers
    | "independent_editorial"    // BBC / Reuters / Ars Technica (sans sponsored)
    | "vendor_primary"           // vendor's own product / docs / homepage
    | "vendor_content_marketing" // vendor blog / strategic advice
    | "affiliate"                // "Best X" listicle + commercial intent
    | "content_farm"             // templated / AI / low-effort aggregator
    | "ugc"                      // Reddit / forums / community
    | "unknown",
  editorial_standards: "high" | "medium" | "low" | "unknown",
  self_promoting: boolean,
  third_party: boolean,
  commercial_intent: "high" | "medium" | "low" | "none",
  domain_content_mismatch: boolean,    // parasite-SEO flag
  authoritative_weight: number,        // 0..1
  confidence: number,                  // 0..1
  reason: string,                      // under 80 chars
  signals: Array<{
    origin: "safety" | "authoritative" | "llm",
    type: string,
    match: string,
    weight: number
  }>
}

The 9 tiers

See Tier definitions for the full set of examples and boundary rules. Briefly:

Tier	Example	Typical use by agent
`regulated_primary`	SEC filings, government publications, court records	Cite as authoritative
`peer_reviewed`	arXiv, PubMed, SAGE/Elsevier journal articles	Cite as authoritative
`independent_editorial`	BBC, Reuters, Ars Technica (main articles, not sponsored)	Cite with attribution
`vendor_primary`	Product homepage, API docs, official changelog	Cite for vendor facts
`vendor_content_marketing`	HubSpot blog, Stripe blog, VC firm thought leadership	Treat as positioning
`affiliate`	”Best X 2026” listicles	Commercial — do not use for objective comparison
`content_farm`	Templated AI / SEO churn	Do not cite
`ugc`	Reddit, Stack Overflow, Quora	Treat as opinion
`unknown`	Cannot determine from signals	Flag uncertainty

Orthogonal axes

tier is the primary categorical dimension, but several orthogonal flags give the agent finer-grained context:

editorial_standards — high / medium / low / unknown. A reputable domain’s /sponsored/ path has medium or low standards even if the parent brand is high.
self_promoting — true when the publisher has a direct stake in recommending itself (e.g., a hosting company’s “Best Web Hosting” list).
third_party — true when the publisher is independent of the subject.
commercial_intent — high / medium / low / none. Orthogonal to tier; a vendor_primary page can be none (docs) or high (pricing).
domain_content_mismatch — true when the domain’s implied business strongly differs from the content topic. A parasite-SEO flag.

`authoritative_weight` and `confidence`

authoritative_weight is a scalar in 0..1 derived from tier + editorial_standards + confidence + domain_content_mismatch. It’s what you’d multiply source claims by when aggregating. The exact formula lives in src/vectorize.ts#authoritativeWeightFromLlm.
confidence is the classifier’s own certainty about the whole classification. Low confidence + unusual signals is a good trigger for an agent to caveat the result.

`signals[]`

Every classification is backed by at least one signal. Three origins:

safety — Google Safe Browsing (threat type as match)
authoritative — known-trusted / known-good cache hit (domain as match)
llm — LLM judge output (tier + reason as match)

A result can have multiple signals. For example, a reddit.com URL that Google Safe Browsing also flagged would carry both authoritative (known-good:ugc_community) and safety signals.

Why expose this much?

Because agents need to explain their reasoning. If an agent writes “According to industry data, the SaaS magic number benchmark is 0.7-1.0”, the user (or a reviewer) should be able to ask “where did ‘industry data’ come from?” — and the agent should be able to say “10 vendor_content_marketing blog posts from VC firms, with mean authoritative_weight 0.23; no peer-reviewed source was in the SERP.” That transparency is what the quality vector enables.