# Choosing an LLM
sift is provider-agnostic. Any endpoint that speaks the OpenAI chat-completions API will work: OpenRouter, OpenAI, Anthropic (via a proxy), Groq, LM Studio, Ollama, llama-server, vLLM, and so on.
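All of these speak the same wire format, so switching backends is just a matter of pointing the base URL somewhere else. As a minimal sketch, using the env variables sift reads (the URL, key, and model here are placeholders):

```sh
# Any OpenAI-compatible backend accepts this request shape; only the
# base URL, key, and model name change between providers.
export LLM_ENDPOINT=https://openrouter.ai/api/v1   # or http://localhost:1234/v1
export LLM_API_KEY=<your-key>                      # local servers usually need none

curl "$LLM_ENDPOINT/chat/completions" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -d '{
    "model": "meta-llama/llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "ping"}]
  }'
```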
## Measured models

On a 12-case hand-picked validation suite (boundary classifications from the tier taxonomy):
| Model | Provider | Tier accuracy | Avg per-call latency | Cost / 1K calls |
|---|---|---|---|---|
| openai/gpt-oss-20b | local (LM Studio) | 12/12 | ~15s | $0 |
| meta-llama/llama-3.3-70b-instruct | OpenRouter auto-route | 11–12/12 | ~3s (Groq) / ~5s (auto) | ~$0.22 |
| google/gemini-2.5-flash | OpenRouter | 12/12 | ~1.3s | ~$1.05 |
| google/gemma-4-e4b | local (LM Studio) | 12/12 | ~9s | $0 |
## Default recommendation

Use meta-llama/llama-3.3-70b-instruct via OpenRouter with auto-routing. Reasoning:
- Accurate enough — 11–12/12 on the validation suite
- Fast enough — 3–5s per call amortizes well under LLM_CONCURRENCY=4
- Cheap — ~$0.22 per 1000 calls is negligible for typical use
- Route-resilient — OpenRouter’s fallback keeps the service alive when any single provider throttles
```sh
LLM_ENDPOINT=https://openrouter.ai/api/v1
LLM_MODEL=meta-llama/llama-3.3-70b-instruct
LLM_API_KEY=<your-openrouter-key>
```

## When to go local
Pick a local model if any of these apply:
- You don’t want per-call cost at all
- You’re testing a prompt change and want deterministic replay
- Your queries involve privacy-sensitive topics you don’t want to send to a hosted API
- You’re testing an MCP workflow offline
LM Studio hosts gpt-oss-20b and gemma-4-e4b well. LLM_CONCURRENCY=2 is the sweet spot for a single RTX 5060 Ti 16GB.
```sh
LLM_ENDPOINT=http://localhost:1234/v1
LLM_MODEL=openai/gpt-oss-20b
LLM_CONCURRENCY=2
# LLM_API_KEY can be omitted
```

## When to go faster
google/gemini-2.5-flash via OpenRouter at ~1.3s per call is the fastest hosted option measured. It costs about 5× as much as Llama 70B but still comes in under $2 per 1000 calls. Pick it if per-query latency matters (interactive UX, long-running agent loops).
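A matching config, as a sketch by analogy with the examples above (the model name is from the table; the rest mirrors the default OpenRouter setup):

```sh
LLM_ENDPOINT=https://openrouter.ai/api/v1
LLM_MODEL=google/gemini-2.5-flash
LLM_API_KEY=<your-openrouter-key>
```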
## Concurrency tuning

LLM_CONCURRENCY caps the number of in-flight LLM requests. Higher is faster up to the provider's rate limit; too high triggers timeouts and degrades accuracy. Starting points (sketched as env settings after the list):
- OpenRouter hosted: 4–8 works well
- Local LM Studio: 2 (GPU-bound)
- Groq direct: 4 is the sweet spot; higher saturates single-provider routing
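The same starting points as the single env setting; pick the line that matches your backend and tune from there:

```sh
LLM_CONCURRENCY=4    # OpenRouter hosted; raise toward 8 if your rate limit allows
#LLM_CONCURRENCY=2   # local LM Studio (GPU-bound)
#LLM_CONCURRENCY=4   # Groq direct; higher saturates single-provider routing
```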
## LLM_PROVIDER_ORDER on OpenRouter

OpenRouter accepts a provider preference list via provider.order in the request body. sift exposes this as LLM_PROVIDER_ORDER=groq,cerebras. Set it only when you want to pin latency; leave it unset to benefit from auto-route fallback, which is more resilient.
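For reference, a sketch of what that setting corresponds to on the wire, following OpenRouter's documented provider-routing shape (how sift serializes it internally is an assumption here):

```sh
# LLM_PROVIDER_ORDER=groq,cerebras maps to a provider.order list
# in the chat-completions request body:
curl https://openrouter.ai/api/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $LLM_API_KEY" \
  -d '{
    "model": "meta-llama/llama-3.3-70b-instruct",
    "provider": {"order": ["groq", "cerebras"]},
    "messages": [{"role": "user", "content": "ping"}]
  }'
```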
## Thinking mode

LLM_THINKING=true prepends a reasoning preamble and raises the token budget. Useful for small local models (gemma-4-e4b) that benefit from explicit reasoning traces. Unnecessary — and slower — for modern hosted models. Default: false.
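A minimal sketch of a local setup with thinking mode on, combining settings already shown on this page (the combination itself is an assumption, not a measured configuration):

```sh
LLM_ENDPOINT=http://localhost:1234/v1
LLM_MODEL=google/gemma-4-e4b
LLM_CONCURRENCY=2
LLM_THINKING=true
```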