# Choosing an LLM
sift is provider-agnostic. Any endpoint that speaks the OpenAI chat-completions API will work: OpenRouter, OpenAI, Anthropic (via a proxy), Groq, LM Studio, Ollama, llama-server, vLLM, and so on.
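All of these speak the same wire format, so switching backends is just a matter of pointing the base URL somewhere else. As a minimal sketch, using the env variables sift reads (the URL, key, and model here are placeholders):

```sh
# Any OpenAI-compatible backend accepts this request shape; only the
# base URL, key, and model name change between providers.
export LLM_ENDPOINT=https://openrouter.ai/api/v1   # or http://localhost:1234/v1
export LLM_API_KEY=<your-key>                      # local servers usually need none

curl "$LLM_ENDPOINT/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -d '{
    "model": "meta-llama/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "ping"}]
  }'
```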
## Measured models

On a 12-case hand-picked validation suite (boundary classifications from the tier taxonomy):
| Model | Provider | Tier accuracy | Avg per-call latency | Cost / 1K calls |
|---|---|---|---|---|
| openai/gpt-oss-20b | local (LM Studio) | 12/12 | ~15s | $0 |
| meta-llama/llama-3.3-70b-instruct | OpenRouter auto-route | 11–12/12 | ~3s (Groq) / ~5s (auto) | ~$0.22 |
| google/gemini-2.5-flash | OpenRouter | 12/12 | ~1.3s | ~$1.05 |
| google/gemma-4-e4b | local (LM Studio) | 12/12 | ~9s | $0 |
## Default recommendation

Use meta-llama/llama-3.3-70b-instruct via OpenRouter with auto-routing. Reasoning:
- Accurate enough — 11–12/12 on the validation suite
- Fast enough — 3–5s per call amortizes well under LLM_CONCURRENCY=4
- Cheap — ~$0.22 per 1000 calls is negligible for typical use
- Route-resilient — OpenRouter’s fallback keeps the service alive when any single provider throttles
```sh
LLM_ENDPOINT=https://openrouter.ai/api/v1
LLM_MODEL=meta-llama/llama-3.3-70b-instruct
LLM_API_KEY=<your-openrouter-key>
```

## When to go local
Pick a local model if any of these apply:
- You don’t want per-call cost at all
- You’re testing a prompt change and want deterministic replay
- Your queries involve privacy-sensitive topics you don’t want to send to a hosted API
- You’re testing an MCP workflow offline
LM Studio hosts gpt-oss-20b and gemma-4-e4b well. LLM_CONCURRENCY=2 is the sweet spot for a single RTX 5060 Ti 16GB.
```sh
LLM_ENDPOINT=http://localhost:1234/v1
LLM_MODEL=openai/gpt-oss-20b
LLM_CONCURRENCY=2
# LLM_API_KEY can be omitted
```

## When to go faster
google/gemini-2.5-flash via OpenRouter at ~1.3s per call is the fastest hosted option measured. It costs about 5× as much as Llama 70B but still comes in under $2 per 1000 calls. Pick it if per-query latency matters (interactive UX, long-running agent loops).
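A matching config, as a sketch by analogy with the examples above (the model name is from the table; the rest mirrors the default OpenRouter setup):

```sh
LLM_ENDPOINT=https://openrouter.ai/api/v1
LLM_MODEL=google/gemini-2.5-flash
LLM_API_KEY=<your-openrouter-key>
```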
## Concurrency tuning

LLM_CONCURRENCY caps the number of in-flight LLM requests. Higher is faster up to the provider's rate limit; too high triggers timeouts and degrades accuracy. Starting points (sketched as env settings after the list):
- OpenRouter hosted: 4–8 works well
- Local LM Studio: 2 (GPU-bound)
- Groq direct: 4 is the sweet spot; higher saturates single-provider routing
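The same starting points as the single env setting; pick the line that matches your backend and tune from there:

```sh
LLM_CONCURRENCY=4    # OpenRouter hosted; raise toward 8 if your rate limit allows
#LLM_CONCURRENCY=2   # local LM Studio (GPU-bound)
#LLM_CONCURRENCY=4   # Groq direct; higher saturates single-provider routing
```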
## LLM_PROVIDER_ORDER on OpenRouter

OpenRouter accepts a provider preference list via provider.order in the request body. sift exposes this as LLM_PROVIDER_ORDER=groq,cerebras. Set it only when you want to pin latency; leave it unset to benefit from auto-route fallback, which is more resilient.
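For reference, a sketch of what that setting corresponds to on the wire, following OpenRouter's documented provider-routing shape (how sift serializes it internally is an assumption here):

```sh
# LLM_PROVIDER_ORDER=groq,cerebras maps to a provider.order list
# in the chat-completions request body:
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -d '{
    "model": "meta-llama/llama-3.3-70b-instruct",
    "provider": {"order": ["groq", "cerebras"]},
    "messages": [{"role": "user", "content": "ping"}]
  }'
```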
## Thinking mode

LLM_THINKING=true prepends a reasoning preamble and raises the token budget. Useful for small local models (gemma-4-e4b) that benefit from explicit reasoning traces. Unnecessary — and slower — for modern hosted models. Default: false.
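A minimal sketch of a local setup with thinking mode on, combining settings already shown on this page (the combination itself is an assumption, not a measured configuration):

```sh
LLM_ENDPOINT=http://localhost:1234/v1
LLM_MODEL=google/gemma-4-e4b
LLM_CONCURRENCY=2
LLM_THINKING=true
```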