v0.9.4 · 5 models
/ llm infrastructure

Open-source LLMs,
served fast. Priced honestly.

Production inference for Llama, Mistral, Qwen, and Phi behind one API. Flat monthly plans from $15/mo, plus a free tier with no card required.

One endpoint.
Any open model.

Plain HTTP and JSON. Pick a model, pass your key, go. No SDKs to install, no vendor lock-in, no hidden token sampling.

No SDK required
Any HTTP client works — curl, requests, fetch.
Streaming by default
Server-Sent Events, tokens as they're generated.
Self-hosted open weights
Llama, Mistral, Qwen, Phi — served on infrastructure we control.
curl https://llmrack.com/v1/chat/completions \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-3.1-8b",
    "messages": [{"role":"user","content":"Explain RAG in 2 sentences."}]
  }'
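
For anyone not using curl, the same request can be sketched in Python with only the standard library. The URL, headers, model name, and body come from the curl example. The helper names `build_chat_request` and `send` are made up for illustration, and the response is returned as raw text because its exact shape (plain JSON or an event stream) is not shown on this page.

```python
import json
import os
import urllib.request

API_URL = "https://llmrack.com/v1/chat/completions"  # endpoint from the curl example


def build_chat_request(prompt, model="llama-3.1-8b", api_key=None):
    """Build (but do not send) the same POST request as the curl example."""
    # Falls back to the LLMRACK_API_KEY environment variable used by the curl call.
    key = api_key or os.environ.get("LLMRACK_API_KEY", "")
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the response body as text (hypothetical helper)."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Usage (performs a real network call, so it is left commented out):
# print(send(build_chat_request("Explain RAG in 2 sentences.")))
```

Building the request separately from sending it keeps the payload easy to inspect before any tokens are spent.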

Built for production
from the first token.

Quantized Q4 serving

Q4_K_M weights, CPU-optimized. Warm keep-alive between requests.

Multiple open models

Llama, Mistral, Qwen, Phi. Pick by changing a string.

Flat-rate pricing

Predictable monthly plans with daily token caps. No surprises on the invoice.

Plain HTTP API

REST + JSON. Streaming via Server-Sent Events. No SDK required.
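
Reading a Server-Sent Events stream without an SDK mostly comes down to parsing `data:` lines. The sketch below assumes the stream uses the OpenAI-style chunk layout (`choices[].delta.content`, ending with `data: [DONE]`); that layout is an assumption based on the OpenAI compatibility described on this page, not something spelled out here. `iter_sse_text` is a hypothetical helper name.

```python
import json


def iter_sse_text(lines):
    """Yield text fragments from an iterable of SSE lines (already decoded to str)."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":  # assumed end-of-stream sentinel
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# A hand-written sample stream in the assumed format, not real server output.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_sse_text(sample)))  # prints "Hello"
```

Because the function takes any iterable of lines, it can be fed from a `requests` response, a socket wrapper, or a test fixture.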

No third-party model hops

All inference runs on infrastructure we operate. Your prompts don't get forwarded to another vendor.

Private by default

Prompts and completions are never stored — only token counts for billing. No training pipeline on your data.

Flat monthly plans.
Daily caps, no overage.

Each tier sets a daily token allowance and a per-minute request rate. Hit the daily cap and requests return HTTP 429 until midnight UTC. No surprise invoices.
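
A client that hits the daily cap may want to know how long to pause before trying again. The sketch below only computes the time left until the next UTC midnight, which is when the page says the cap resets. `seconds_until_utc_midnight` is a hypothetical helper, and the sketch does not try to tell a daily-cap 429 apart from a per-minute one.

```python
from datetime import datetime, timedelta, timezone


def seconds_until_utc_midnight(now=None):
    """Return the number of seconds from `now` until the next 00:00 UTC."""
    if now is None:
        now = datetime.now(timezone.utc)
    now = now.astimezone(timezone.utc)  # normalize any timezone-aware input to UTC
    next_midnight = (now + timedelta(days=1)).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    return (next_midnight - now).total_seconds()


late = datetime(2025, 1, 1, 23, 59, 30, tzinfo=timezone.utc)
print(seconds_until_utc_midnight(late))  # prints 30.0
```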

Free
$0 forever
For testing and light experimentation
10,000 tokens / day
10 requests / minute
All open models
1 API key
Community support
Pro (popular)
$15 / month
For developers and daily use
550,000 tokens / day
100 requests / minute
Unlimited API keys
Usage analytics + invoices
All open models
Business
$65 / month
For production and high-volume usage
5,000,000 tokens / day
500 requests / minute
Unlimited API keys
Priority email response
All open models
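
To compare the tiers, it can help to turn the daily caps into an effective price per million tokens. The sketch below uses the prices and caps listed above and assumes a 30-day month with the cap fully used every day, so real per-token cost will be higher for lighter usage. `PLANS`, `usd_per_million_tokens`, and `cheapest_plan` are illustrative names, not part of any API.

```python
# Prices and caps copied from the plan list above.
PLANS = {
    "Free": {"price_usd": 0, "tokens_per_day": 10_000, "requests_per_min": 10},
    "Pro": {"price_usd": 15, "tokens_per_day": 550_000, "requests_per_min": 100},
    "Business": {"price_usd": 65, "tokens_per_day": 5_000_000, "requests_per_min": 500},
}
DAYS_PER_MONTH = 30  # assumption for illustration


def usd_per_million_tokens(plan):
    """Effective price per 1M tokens if the daily cap is used in full every day."""
    p = PLANS[plan]
    monthly_tokens = p["tokens_per_day"] * DAYS_PER_MONTH
    return p["price_usd"] / monthly_tokens * 1_000_000


def cheapest_plan(tokens_per_day, requests_per_min=0):
    """Return the lowest-priced plan whose caps cover the given usage, or None."""
    for name, p in sorted(PLANS.items(), key=lambda kv: kv[1]["price_usd"]):
        if p["tokens_per_day"] >= tokens_per_day and p["requests_per_min"] >= requests_per_min:
            return name
    return None


print(round(usd_per_million_tokens("Pro"), 2))       # prints 0.91
print(round(usd_per_million_tokens("Business"), 2))  # prints 0.43
print(cheapest_plan(200_000))                        # prints Pro
```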

Questions we
actually get asked.

Is the API OpenAI-compatible?
Yes. Point your OpenAI client at https://llmrack.com/v1, swap the API key, and chat.completions.create works unchanged. That includes stream=True, tools (on models that support them), and response_format for JSON output. The one exception is n > 1; call the endpoint multiple times instead. See the API capabilities table in /docs for the full field-by-field breakdown.
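
Since n > 1 is not supported, several samples for one prompt means several calls. A minimal sketch of that workaround follows. `create_n` is a hypothetical helper that accepts any callable, so it works with `client.chat.completions.create` or with a stub in tests. The commented usage assumes the official `openai` Python package and its `base_url` option.

```python
def create_n(create, n, **params):
    """Call create(**params) n times and return the responses in order."""
    # n is consumed here and never forwarded, since the endpoint rejects n > 1.
    return [create(**params) for _ in range(n)]


# Usage with the OpenAI Python client (requires `pip install openai`; not run here):
# from openai import OpenAI
# client = OpenAI(base_url="https://llmrack.com/v1", api_key="YOUR_LLMRACK_KEY")
# replies = create_n(
#     client.chat.completions.create,
#     3,
#     model="llama-3.1-8b",
#     messages=[{"role": "user", "content": "Suggest a project name."}],
# )
# print([r.choices[0].message.content for r in replies])
```

Each call counts against the per-minute request rate and the daily token cap separately, so three samples cost three requests.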
/ ship today

Start building.
Free tier, no card.