Choosing an LLM

sift is provider-agnostic. Any endpoint that speaks the OpenAI chat-completions API will work: OpenRouter, OpenAI, Anthropic (via a proxy), Groq, LM Studio, Ollama, llama-server, vLLM, and so on.
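
For example, pointing sift at a local server is just a matter of changing the endpoint. The values below assume each server's default port and OpenAI-compatible route; adjust them to your own setup:

# Ollama (default port, OpenAI-compatible route)
LLM_ENDPOINT=http://localhost:11434/v1
# llama-server (llama.cpp, default port)
LLM_ENDPOINT=http://localhost:8080/v1
# vLLM (default port)
LLM_ENDPOINT=http://localhost:8000/v1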

On a 12-case hand-picked validation suite (boundary classifications from the tier taxonomy):

| Model | Provider | Tier accuracy | Avg per-call latency | Cost / 1K calls |
|---|---|---|---|---|
| openai/gpt-oss-20b | local (LM Studio) | 12/12 | ~15s | $0 |
| meta-llama/llama-3.3-70b-instruct | OpenRouter auto-route | 11–12/12 | ~3s (Groq) / ~5s (auto) | ~$0.22 |
| google/gemini-2.5-flash | OpenRouter | 12/12 | ~1.3s | ~$1.05 |
| google/gemma-4-e4b | local (LM Studio) | 12/12 | ~9s | $0 |

The recommended default is meta-llama/llama-3.3-70b-instruct via OpenRouter with auto-routing. Reasoning:

  • Accurate enough — 11–12/12 on the validation suite
  • Fast enough — 3–5s per call amortizes well under LLM_CONCURRENCY=4
  • Cheap — ~$0.22 per 1000 calls is negligible for typical use
  • Route-resilient — OpenRouter’s fallback keeps the service alive when any single provider throttles
LLM_ENDPOINT=https://openrouter.ai/api/v1
LLM_MODEL=meta-llama/llama-3.3-70b-instruct
LLM_API_KEY=<your-openrouter-key>

Pick a local model if any of these apply:

  • You don’t want per-call cost at all
  • You’re testing a prompt change and want deterministic replay
  • Your queries involve privacy-sensitive topics you don’t want to send to a hosted API
  • You’re testing an MCP workflow offline

LM Studio hosts gpt-oss-20b and gemma-4-e4b well. LLM_CONCURRENCY=2 is the sweet spot for a single RTX 5060 Ti 16GB.

LLM_ENDPOINT=http://localhost:1234/v1
LLM_MODEL=openai/gpt-oss-20b
LLM_CONCURRENCY=2
# LLM_API_KEY can be omitted

google/gemini-2.5-flash via OpenRouter at ~1.3s per call is the fastest measured hosted option. About 5× the cost of Llama 70B but still under $2 per 1000 calls. Pick it if per-query latency matters (interactive UX, long-running agent loops).
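
A minimal configuration for that choice, mirroring the OpenRouter block above:

LLM_ENDPOINT=https://openrouter.ai/api/v1
LLM_MODEL=google/gemini-2.5-flash
LLM_API_KEY=<your-openrouter-key>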

LLM_CONCURRENCY gates the maximum in-flight LLM requests. Higher is faster up to the provider’s rate limit; too high triggers timeouts and degrades accuracy.

  • OpenRouter hosted: 4–8 works well
  • Local LM Studio: 2 (GPU-bound)
  • Groq direct: 4 is the sweet spot; higher saturates single-provider routing
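
For instance, a hosted OpenRouter run might sit in the middle of that range; the value here is only a starting point, tune it against your own rate limits:

LLM_CONCURRENCY=6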

OpenRouter accepts a provider preference list via provider.order in the request body. sift exposes this as LLM_PROVIDER_ORDER=groq,cerebras. Set only when you want to pin latency; leave unset to benefit from auto-route fallback, which is more resilient.
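
For example, pinning Groq first with Cerebras as the fallback, on top of the recommended Llama 70B configuration:

LLM_ENDPOINT=https://openrouter.ai/api/v1
LLM_MODEL=meta-llama/llama-3.3-70b-instruct
LLM_API_KEY=<your-openrouter-key>
LLM_PROVIDER_ORDER=groq,cerebras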

LLM_THINKING=true prepends a reasoning preamble and raises the token budget. Useful for local small models (gemma-4-e4b) that benefit from explicit reasoning traces. Unnecessary — and slower — for modern hosted models. Default: false.
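
For example, enabling it for the local gemma-4-e4b setup, with the endpoint and concurrency from the local block above:

LLM_ENDPOINT=http://localhost:1234/v1
LLM_MODEL=google/gemma-4-e4b
LLM_CONCURRENCY=2
LLM_THINKING=true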