
Learning loop

Every vectorized result — plus the per-query aggregate — is appended to data/observations.jsonl. This file is local by default and never leaves your machine unless you explicitly configure a mirror.

The log is the substrate for the learning loop: rather than pre-specifying every tier boundary, sift emits a best-effort classification per call, records it, and makes the log available for post-hoc analysis. Four practical uses:

Do the same domains get the same tier over time? If atlassian.com/blog started as vendor_content_marketing and is now sometimes independent_editorial, the prompt has drifted — or the model behind the endpoint has changed.

Terminal window
jq -r 'select(.domain) | [.domain, .quality_vector.tier] | @tsv' \
data/observations.jsonl | sort | uniq -c | sort -rn | head -30

Search for low-confidence classifications:

Terminal window
jq -r 'select(.url and .quality_vector.confidence < 0.6) | .url + " → " + .quality_vector.tier + " (" + .quality_vector.reason + ")"' \
data/observations.jsonl

These are the edges where the prompt needs more rules — exactly where taxonomy extensions tend to come from.

The same concept expressed in different vocabulary yields very different SERP quality. For "transformational leadership meta-analysis effect size", mean authoritative_weight is ~0.87. For "what makes a good leader", the same concept drops to ~0.38: Brave surfaces vendor content marketing and UGC for the general phrasing.
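
A quick spot-check of one phrasing straight from the log, as a sketch that assumes the per-result quality_vector exposes authoritative_weight as a number (the query string is just the example above):

Terminal window
jq -s 'map(select(.query == "what makes a good leader"
  and .quality_vector.authoritative_weight != null)
  | .quality_vector.authoritative_weight) | add / length' \
  data/observations.jsonl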

Collecting observations over time lets you see which user-facing queries consistently trigger low-authority SERPs. Those are candidates for sift-side re-query suggestions (a planned v0.5 feature).
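
A sketch of that survey with jq, under the same assumption about authoritative_weight and with an arbitrary limit of 15 rows, lowest-authority queries first:

Terminal window
jq -s 'map(select(.domain and .quality_vector.authoritative_weight != null))
  | group_by(.query)
  | map({query: .[0].query, results: length,
         mean_authority: (map(.quality_vector.authoritative_weight) | add / length)})
  | sort_by(.mean_authority) | .[:15]' \
  data/observations.jsonl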

If a specific domain consistently gets the same tier with high confidence across many calls, it’s a candidate for promotion to the known-trusted or known-good cache — skipping the LLM entirely and saving cost.
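
One way to surface those candidates, as a sketch with illustrative thresholds (at least 20 calls, 0.8 minimum confidence) rather than anything sift defines:

Terminal window
jq -s 'map(select(.domain))
  | group_by(.domain)
  | map(select(length >= 20))
  | map({domain: .[0].domain, calls: length,
         tiers: (map(.quality_vector.tier) | unique),
         min_confidence: (map(.quality_vector.confidence) | min)})
  | map(select((.tiers | length) == 1 and .min_confidence >= 0.8))' \
  data/observations.jsonl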

Set SIFT_OBSERVATIONS=off to disable the log entirely. No data is written.

For drift review, long-running QA, or sharing observations across multiple machines, the log can be mirrored to any HTTP PUT-capable endpoint. Mirroring is opt-in: leave the sync variables below unset and no network traffic is generated.

Terminal window
# .env — Obsidian Local REST API example
SIFT_OBSERVATION_SYNC_URL=https://127.0.0.1:27124/vault
SIFT_OBSERVATION_SYNC_AUTH=Bearer <your-obsidian-api-token>
SIFT_OBSERVATION_SYNC_PATH=observations/sift/
SIFT_OBSERVATION_SYNC_INSECURE_TLS=true # Obsidian ships a self-signed cert

When configured:

  • Each MCP call appends to a daily UTC file at {URL}/{PATH}YYYY-MM-DD.jsonl
  • The write is GET-merge-PUT (read current content, append lines, PUT back); see the sketch after this list
  • Errors are fire-and-forget (logged once, never block the MCP response)
  • The local JSONL file is always written regardless of sync state — losing the remote doesn’t lose data
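
For illustration of those read-modify-write semantics only (sift performs this internally; the paths and token mirror the Obsidian example above, and none of this is sift's actual implementation):

Terminal window
# Rough equivalent of one sync write against the Obsidian Local REST API config above.
# -k mirrors SIFT_OBSERVATION_SYNC_INSECURE_TLS=true (self-signed cert).
FILE="observations/sift/$(date -u +%F).jsonl"
URL="https://127.0.0.1:27124/vault/${FILE}"
AUTH="Authorization: Bearer <your-obsidian-api-token>"
LINE='{"ts":"...","kind":"aggregate"}'   # placeholder observation line

# 1. read whatever is already in today's file (-f → empty output if it doesn't exist yet)
existing=$(curl -skf -H "$AUTH" "$URL")
# 2. append the new line to the current content
[ -n "$existing" ] && merged="${existing}"$'\n'"${LINE}" || merged="${LINE}"
# 3. PUT the merged file back
curl -skf -X PUT -H "$AUTH" -H "Content-Type: text/plain" \
  --data-binary "$merged" "$URL"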

Any HTTP PUT-capable store with the same read-modify-write semantics works:

  • Obsidian Local REST API
  • WebDAV servers (Nextcloud, ownCloud)
  • S3-compatible object storage with pre-signed URLs (short-lived URLs complicate the model)
  • Gitea / GitLab raw file API with auth
  • Custom log collectors that accept PUT

Each per-result line:

{
  "ts": "2026-04-23T12:49:28.610Z",
  "query": "dynamic capabilities theory Teece empirical validation",
  "safety": "standard",
  "url": "https://tandfonline.com/doi/full/...",
  "domain": "tandfonline.com",
  "quality_vector": { /* full vector including signals[] */ },
  "safety_flag": null,
  "recommended_action": "keep"
}

One additional line per query, marked "kind": "aggregate", carries the full aggregate vector.
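
To pull only those aggregate lines back out, for example for per-query drift analysis:

Terminal window
jq -c 'select(.kind == "aggregate")' data/observations.jsonl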

A first-class analysis toolkit is planned for v0.5 — drift dashboards, domain-clustered tier stability reports, automated boundary-case surfacing. Until then, the log is a plain JSONL file and jq / Python / Obsidian Dataview are all viable for inspection.