Quickstart

LLMRack serves open-weight LLMs over a plain HTTP+JSON API at https://llmrack.com/v1. No SDK required — any HTTP client works.

1. Get an API key

Sign up at llmrack.com/signup (free tier, no card).
Open Dashboard → API Keys.
Click Generate new key, name it, pick permissions, click Generate.
Copy the key from the reveal modal immediately — it's shown once, then only a SHA-256 hash is stored. If lost, revoke and regenerate.

Keys look like rl_live_… (production) or rl_test_… (testing). Treat them like passwords; never commit them to git or ship them in client code.

export LLMRACK_API_KEY="rl_live_..."
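
If you read the key from Python, a small guard catches a missing or mistyped variable before the first request does. This is a minimal sketch: load_api_key and auth_headers are hypothetical helper names, and the prefix check only assumes the rl_live_ / rl_test_ formats described above.

```python
import os

VALID_PREFIXES = ("rl_live_", "rl_test_")

def load_api_key(env=None):
    """Return the LLMRack key from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    key = env.get("LLMRACK_API_KEY", "").strip()
    if not key:
        raise RuntimeError("LLMRACK_API_KEY is not set")
    if not key.startswith(VALID_PREFIXES):
        raise RuntimeError("LLMRACK_API_KEY does not start with rl_live_ or rl_test_")
    return key

def auth_headers(key):
    """Headers used by every example in this guide."""
    return {"Authorization": f"Bearer {key}", "Content-Type": "application/json"}
```

Call load_api_key() once at startup and pass the result around, rather than reading os.environ in every module.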

2. Send your first request

Plain cURL, Python requests, or Node fetch all work; no client library is needed. Here's cURL:

curl https://llmrack.com/v1/chat/completions \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini",
    "messages": [{"role":"user","content":"hi"}]
  }'

Start with phi-3-mini — it's the fastest model on CPU. Full model list below.
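
The same request in Python, using only the standard library to underline that no SDK is involved. Treat it as a sketch: build_chat_request and chat are hypothetical helper names, and the response is read using the shape shown under "Verified response shapes".

```python
import json
import os
import urllib.request

API_URL = "https://llmrack.com/v1/chat/completions"

def build_chat_request(api_key, prompt, model="phi-3-mini"):
    """Build (but do not send) the same POST the cURL command issues."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        API_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def chat(api_key, prompt, model="phi-3-mini", timeout=60):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(api_key, prompt, model)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["choices"][0]["message"]["content"]

# Only touches the network when run as a script with a key configured.
if __name__ == "__main__" and os.environ.get("LLMRACK_API_KEY"):
    print(chat(os.environ["LLMRACK_API_KEY"], "hi"))
```

Splitting "build" from "send" keeps the payload easy to inspect or unit-test without a live key.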

Verified response shapes

These are the exact bodies the server returns. Every snippet below was captured from a live request against this instance.

Non-streaming chat completion — phi-3-mini, "Reply with the single word PONG":

{
  "id": "chatcmpl-68247b8b",
  "object": "chat.completion",
  "created": 1776587088,
  "model": "phi-3-mini",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "PONG." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 18, "completion_tokens": 4, "total_tokens": 22 }
}

Tool call (llama-3.1-8b, "What is the weather in Paris?" with a get_weather tool). Note that finish_reason is "tool_calls" and content is an empty string:

{
  "choices": [{
    "index": 0,
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": "",
      "tool_calls": [{
        "id": "call_69e491a8_0",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Paris\"}"
        }
      }]
    }
  }],
  "usage": { "prompt_tokens": 159, "completion_tokens": 16, "total_tokens": 175 }
}

Error envelope — the exact shape every 4xx/5xx returns:

{
  "error": {
    "message": "n > 1 is not supported; call the endpoint multiple times",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}
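
Because every 4xx/5xx uses this envelope, one small helper can turn any failed response into an exception. A sketch, assuming only the four fields shown above; LLMRackError and error_from_body are hypothetical names, not part of any LLMRack library.

```python
import json

class LLMRackError(Exception):
    """Hypothetical exception type carrying the fields of the error envelope."""

    def __init__(self, status, message, type_=None, param=None, code=None):
        super().__init__(f"{status} {type_}: {message}")
        self.status, self.message = status, message
        self.type, self.param, self.code = type_, param, code

def error_from_body(status, body_text):
    """Turn a 4xx/5xx response body into an LLMRackError.

    Falls back to the raw text if the body is not the documented envelope
    (for example, an HTML page from a proxy in front of the API).
    """
    try:
        err = json.loads(body_text)["error"]
        return LLMRackError(status, err["message"], err.get("type"),
                            err.get("param"), err.get("code"))
    except (ValueError, KeyError, TypeError):
        return LLMRackError(status, body_text.strip() or "unknown error")
```

With requests, that would be used as: if not r.ok: raise error_from_body(r.status_code, r.text).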

3. Stream tokens

Add "stream": true. The server returns Server-Sent Events — each data: line is one JSON chunk, the stream terminates with data: [DONE]. Captured chunks from llama-3.1-8b streaming "Count from 1 to 5":

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587146,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":"Here"},"finish_reason":null}]}

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587146,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":" it"},"finish_reason":null}]}

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587147,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":" goes"},"finish_reason":null}]}

…

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587153,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Every chat.completion.chunk carries the same id across the whole stream. The final chunk carries an empty delta and a non-null finish_reason. Then data: [DONE] closes the stream.

import os, json, requests

with requests.post(
    "https://llmrack.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLMRACK_API_KEY']}"},
    json={
        "model": "phi-3-mini",
        "messages": [{"role": "user", "content": "Haiku about SSDs."}],
        "stream": True,
    },
    stream=True, timeout=60,
) as r:
    for raw in r.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue
        payload = raw[6:]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
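
If you want the full text and the finish_reason rather than a live printout, the same parsing can be folded into a pure function. A sketch: collect_stream is a hypothetical helper, and it relies only on the chunk shape captured above.

```python
import json

def collect_stream(lines):
    """Fold an iterable of raw SSE lines (bytes) into (text, finish_reason)."""
    parts, finish_reason = [], None
    for raw in lines:
        if not raw.startswith(b"data: "):
            continue                      # blank separator lines and anything else
        payload = raw[len(b"data: "):]
        if payload == b"[DONE]":
            break                         # end-of-stream sentinel
        choice = json.loads(payload)["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        # Only the final chunk carries a non-null finish_reason.
        finish_reason = choice.get("finish_reason") or finish_reason
    return "".join(parts), finish_reason
```

Inside the with block, text, reason = collect_stream(r.iter_lines()) would replace the for loop. A reason of "length" means the reply was cut off by max_tokens.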

4. Embeddings

Text embeddings come from nomic-embed — 768 dimensions, 8k context. Use them for semantic search, RAG, clustering.

curl https://llmrack.com/v1/embeddings \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": ["the quick brown fox", "hello world"]
  }'
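
For semantic search you typically embed the query and the documents, then rank by cosine similarity. The sketch below makes one assumption not confirmed on this page: that the response body follows OpenAI's embeddings shape, {"data": [{"index": i, "embedding": [...]}, ...]}. Check a real response before relying on it. All function names are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def vectors_from_response(resp):
    """ASSUMPTION: OpenAI-style body {"data": [{"index": i, "embedding": [...]}, ...]}."""
    items = sorted(resp["data"], key=lambda d: d["index"])  # restore input order
    return [d["embedding"] for d in items]

def rank(query_vec, doc_vecs):
    """Return document indices, most similar to the query first."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
```

For more than a few thousand vectors, a vector index or NumPy will be much faster than this pure-Python loop.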

5. Tool calling, end-to-end

Define tools in the first call. If the model picks one, parse the tool_calls and dispatch. Send the tool's output back with role: "tool". The final turn returns prose.

Important: the value of function.arguments on the model's output is a JSON string, not an object. json.loads() it before using. Tool-call-capable models: llama-3.1-8b, qwen-2.5-7b, mistral-7b.

import os, json, requests

BASE = "https://llmrack.com/v1"
HDR  = {"Authorization": f"Bearer {os.environ['LLMRACK_API_KEY']}"}

# 1) Ask with tools attached
r = requests.post(f"{BASE}/chat/completions", headers=HDR, json={
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}).json()

call = r["choices"][0]["message"]["tool_calls"][0]
args = json.loads(call["function"]["arguments"])   # → {"city": "Paris"}
tool_result = {"temp_c": 18, "condition": "cloudy"}  # do your real lookup here

# 2) Send the tool result back so the model can answer.
r2 = requests.post(f"{BASE}/chat/completions", headers=HDR, json={
    "model": "llama-3.1-8b",
    "messages": [
        {"role": "user", "content": "Weather in Paris?"},
        {"role": "assistant", "content": None, "tool_calls": [call]},
        {"role": "tool",  "tool_call_id": call["id"],
                          "content": json.dumps(tool_result)},
    ],
}).json()

print(r2["choices"][0]["message"]["content"])
# → "The weather in Paris is currently cloudy with a temperature of 18 degrees Celsius."
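
The example handles exactly one tool call. Once you have several tools, a name-to-function table keeps the dispatch step generic. A sketch under stated assumptions: TOOLS, get_weather, and run_tool_calls are hypothetical names, and get_weather returns canned data in place of a real lookup.

```python
import json

def get_weather(city):
    """Stand-in for a real lookup; returns canned data."""
    return {"city": city, "temp_c": 18, "condition": "cloudy"}

TOOLS = {"get_weather": get_weather}   # tool name -> local callable

def run_tool_calls(message):
    """Run every tool call in an assistant message; return the role:"tool" replies."""
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        try:
            if name not in TOOLS:
                raise ValueError(f"unknown tool {name}")
            args = json.loads(call["function"]["arguments"])  # JSON string -> dict
            output = TOOLS[name](**args)
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop.
            output = {"error": str(exc)}
        results.append({"role": "tool", "tool_call_id": call["id"],
                        "content": json.dumps(output)})
    return results
```

Append the assistant message and then these results to messages before the second request. Returning errors as tool output lets the model apologize or retry with different arguments.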

6. What the API actually supports

LLMRack's /v1/chat/completions speaks the OpenAI Chat Completions wire shape. The table below lists support field by field; use it to decide whether LLMRack fits your tool.

Request field	Status	Notes
model	required	From the catalog below. Plain id — no `openai/` prefix.
messages	required	Roles `system`, `user`, `assistant`, `tool`.
stream	yes	Server-Sent Events, terminated by `data: [DONE]`.
temperature	yes	Forwarded to Ollama.
top_p	yes	Forwarded.
max_tokens	yes	Maps to Ollama `num_predict`.
stop	yes	String or list of strings.
presence_penalty / frequency_penalty	yes	Forwarded (not all model families honor these).
tools / tool_choice	yes	Forwarded to Ollama. Works on `llama-3.1-8b`, `qwen-2.5-7b`, `mistral-7b`. Small models (`phi-3-mini`) often ignore tools; that's a model limitation, not an API one.
response_format	yes	`{type: "json_object"}` → Ollama `format: "json"`. JSON Schema (`type: "json_schema"`) is passed to Ollama's schema-constrained decoding.
n	1 only	`n > 1` returns `400`. Call the endpoint multiple times.
logprobs / top_logprobs	no	Not emitted. Ollama doesn't expose these.
seed	no	Not forwarded yet.
logit_bias	no	Not forwarded.
user	accepted	Stored in logs; has no behavioral effect (your API key already identifies you).

Response fields: id, object, created, model, choices[0].message.{role,content,tool_calls?}, choices[0].finish_reason (values: stop, length, tool_calls), usage.{prompt_tokens, completion_tokens, total_tokens}. Tokens come from Ollama's prompt_eval_count / eval_count (real, not estimated) for chat and text completions. For /v1/embeddings, usage.prompt_tokens is an approximation (len(text)/4) since Ollama's embeddings endpoint doesn't return a token count.
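
If you are porting an existing OpenAI integration, a pre-flight check built from the request-field table can flag problems before a request goes out. A sketch: check_request is a hypothetical helper, and it only encodes rows marked "required", "1 only", or "no" above. How the server treats unsupported fields (ignore or reject) is not stated here, so the helper just reports them.

```python
# Fields the table above marks as "no".
UNSUPPORTED = {"logprobs", "top_logprobs", "seed", "logit_bias"}

def check_request(payload):
    """Return a list of human-readable problems; an empty list means nothing was flagged."""
    problems = []
    for field in ("model", "messages"):
        if field not in payload:
            problems.append(f"missing required field: {field}")
    if payload.get("n", 1) != 1:
        problems.append("n > 1 is rejected with 400; send separate requests")
    for field in sorted(UNSUPPORTED.intersection(payload)):
        problems.append(f"{field} is not supported")
    return problems
```

Running it in a test suite over your recorded OpenAI payloads gives a quick compatibility report.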

7. Available models

All prices are in USD per 1M tokens. All models use Q4_K_M quantization except nomic-embed, which is F16.

Model id	Params	Context	In $/1M	Out $/1M
phi-3-mini	3.8B	128k	$0.04	$0.04
mistral-7b	7B	32k	$0.08	$0.10
llama-3.1-8b	8B	128k	$0.09	$0.11
qwen-2.5-7b	7B	32k	$0.10	$0.12
nomic-embed	137M	8k	$0.02	n/a

Live list at GET https://llmrack.com/v1/models (or the Models page).
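
To turn a response's usage block into a rough dollar figure, multiply by the per-million prices. This is an estimate only: the PRICES dict is copied by hand from the table above, and the sketch assumes billing is linear in tokens with no rounding or minimums, which this page does not confirm.

```python
# USD per 1M tokens as (input, output), copied from the table above.
PRICES = {
    "phi-3-mini":   (0.04, 0.04),
    "mistral-7b":   (0.08, 0.10),
    "llama-3.1-8b": (0.09, 0.11),
    "qwen-2.5-7b":  (0.10, 0.12),
    "nomic-embed":  (0.02, 0.0),   # embeddings have no output tokens
}

def estimate_cost(model, usage):
    """Estimated USD cost of one response, given its "usage" object."""
    in_price, out_price = PRICES[model]
    return (usage.get("prompt_tokens", 0) * in_price
            + usage.get("completion_tokens", 0) * out_price) / 1_000_000
```

For the tool-call response shown earlier (159 prompt tokens, 16 completion tokens on llama-3.1-8b) this comes to about $0.000016.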

8. Authentication

Pass the API key as a bearer token:

Authorization: Bearer rl_live_...

Missing, malformed, revoked, or expired keys return 401 authentication_error.

9. Rate limits

Tier	Requests / min	Tokens / day	Price / month
Free (for testing and light experimentation)	10	10,000	$0
Pro (for developers and daily use)	100	550,000	$15
Business (for production and high-volume usage)	500	5,000,000	$65

Hitting a limit returns 429 with Retry-After (seconds). Upgrade at Dashboard → Billing.

10. Error codes

HTTP	Type	When
400	invalid_request_error	Unknown model, malformed body, bad params.
401	authentication_error	Missing / bad / revoked API key.
429	rate_limit_exceeded	RPM or daily token budget hit.
502	api_error	Upstream model error — retry is safe.
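
Putting the table together with the Retry-After header from the rate-limit section: 429 and 502 are worth retrying, 400 and 401 are not. A sketch of one possible retry policy. The exponential backoff fallback is my choice, not something LLMRack prescribes, and retry_delay / call_with_retries are hypothetical names.

```python
import time

RETRYABLE = {429, 502}

def retry_delay(status, headers, attempt, base=1.0, cap=30.0):
    """Seconds to wait before the next attempt, or None if the status is not retryable."""
    if status not in RETRYABLE:
        return None
    if status == 429:
        try:
            return max(0.0, float(headers.get("Retry-After", "")))
        except ValueError:
            pass                          # header missing or not a number of seconds
    return min(cap, base * 2 ** attempt)  # fallback: exponential backoff

def call_with_retries(send, max_attempts=4, sleep=time.sleep):
    """`send` is any zero-argument callable returning (status, headers, body)."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        delay = retry_delay(status, headers, attempt)
        # Stop on success, on a non-retryable error, or when attempts run out.
        if status < 400 or delay is None or attempt == max_attempts - 1:
            return status, body
        sleep(delay)
```

With requests, send could be lambda: (lambda r: (r.status_code, r.headers, r.text))(requests.post(url, headers=hdr, json=payload)). Injecting sleep makes the policy testable without real waiting.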

11. Use with agents & tools

Anything that supports a custom OpenAI-compatible endpoint works with LLMRack. Configure the tool with:

Base URL: https://llmrack.com/v1

API key: rl_live_…

Model: phi-3-mini · mistral-7b · llama-3.1-8b · qwen-2.5-7b · nomic-embed

Provider: openai (when the tool asks which provider, pick OpenAI, then override the URL)

Why you might see openai/ in configs. LLMRack's own API uses plain model names — llama-3.1-8b, phi-3-mini, etc. No prefix. A few router-style tools route by <provider>/<model> — the prefix names the transport protocol, not the model's creator. Where the tool lets you register a custom provider or alias (OpenClaw, LiteLLM), the examples below register llmrack as the provider name, so your code only sees llmrack/llama-3.1-8b. Where it doesn't (Aider), openai/ stays on the command line.
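
To make the prefix idea concrete, here is roughly what a router does with a <provider>/<model> reference before calling LLMRack. This is an illustration only, not the code of OpenClaw, LiteLLM, or Aider; both function names are hypothetical.

```python
def split_model_ref(ref, default_provider="llmrack"):
    """Split 'provider/model' into (provider, model); bare ids get the default provider."""
    provider, sep, model = ref.partition("/")
    return (provider, model) if sep else (default_provider, ref)

def wire_model_id(ref):
    """The plain id that ends up in the request body's "model" field."""
    return split_model_ref(ref)[1]
```

Whatever the prefix says, the id sent to https://llmrack.com/v1 is always the plain one, for example llama-3.1-8b.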

Open WebUI

Settings → Connections → OpenAI API. Set the endpoint to https://llmrack.com/v1 and paste your key. Models appear automatically.

LibreChat

Add a custom endpoint to librechat.yaml:

endpoints:
  custom:
    - name: "LLMRack"
      apiKey: "${LLMRACK_API_KEY}"
      baseURL: "https://llmrack.com/v1"
      models:
        default: ["llama-3.1-8b", "mistral-7b", "phi-3-mini", "qwen-2.5-7b"]
      titleModel: "phi-3-mini"
      iconURL: "https://llmrack.com/favicon.ico"

Continue (VS Code / JetBrains)

Edit ~/.continue/config.json:

{
  "models": [
    {
      "title": "LLMRack · Llama 3.1 8B",
      "provider": "openai",
      "model": "llama-3.1-8b",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    },
    {
      "title": "LLMRack · Phi-3 Mini (fast)",
      "provider": "openai",
      "model": "phi-3-mini",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    }
  ],
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed",
    "apiBase": "https://llmrack.com/v1",
    "apiKey": "rl_live_..."
  }
}

Cursor

Cursor → Settings → Models → OpenAI API Key → enable Override OpenAI Base URL, set it to https://llmrack.com/v1, paste your LLMRack key. Add model names (e.g. llama-3.1-8b) in the custom models list.

LangChain (Python)

import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    model="llama-3.1-8b",
    streaming=True,
)

# Works with every LangChain agent, chain, and graph — CrewAI, LangGraph, etc.
for chunk in llm.stream("Summarize RAG in one paragraph."):
    print(chunk.content, end="", flush=True)

LlamaIndex

import os

from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding

Settings.llm = OpenAILike(
    model="llama-3.1-8b",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    is_chat_model=True,
)
Settings.embed_model = OpenAILikeEmbedding(
    model_name="nomic-embed",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)

AnythingLLM

Settings → LLM Provider → Generic OpenAI. Base URL https://llmrack.com/v1, API key rl_live_…, model llama-3.1-8b. For embeddings choose Generic OpenAI with model nomic-embed.

OpenClaw (openclaw.ai) — end-to-end

Full path from empty machine to "I typed a message and LLMRack answered." OpenClaw lets us register llmrack as a first-class provider, so your agents reference models as llmrack/llama-3.1-8b with no openai/ prefix visible anywhere.

1. Install OpenClaw & get a key.

# macOS / Linux
curl -fsSL https://openclaw.ai/install.sh | bash

# Get your LLMRack key
export LLMRACK_API_KEY="rl_live_..."        # from llmrack.com/dashboard/keys

2. Drop a config file that registers LLMRack as a provider.

Write this to ~/.openclaw/config.json5:

{
  agents: {
    defaults: {
      // Pick the default model for new agents. No "openai/" in sight.
      model: { primary: "llmrack/llama-3.1-8b" },
    },
  },
  models: {
    providers: {
      llmrack: {
        baseUrl: "https://llmrack.com/v1",
        apiKey: "${LLMRACK_API_KEY}",    // resolves from env

        // "api" names the WIRE PROTOCOL, not the vendor. OpenClaw supports a few
        // dialects (openai-completions, anthropic-messages, google-generateContent,
        // …). Our /v1/chat/completions endpoint implements the OpenAI Chat
        // Completions shape — request: {model, messages[], stream, temperature,
        // tools, response_format, …}; response: {choices[0].message.{content,
        // tool_calls}, usage.{prompt_tokens, completion_tokens}}; streaming as
        // SSE with "data: [DONE]" terminator. So the correct value here is:
        api: "openai-completions",

        models: [
          { id: "llama-3.1-8b", name: "Llama 3.1 8B",  contextWindow: 128000, maxTokens: 4096 },
          { id: "mistral-7b",   name: "Mistral 7B",    contextWindow: 32000,  maxTokens: 4096 },
          { id: "qwen-2.5-7b",  name: "Qwen 2.5 7B",   contextWindow: 32000,  maxTokens: 4096 },
          { id: "phi-3-mini",   name: "Phi-3 Mini",    contextWindow: 128000, maxTokens: 4096 },
          { id: "nomic-embed",  name: "Nomic Embed",   contextWindow: 8192,   input: ["text"] },
        ],
      },
    },
  },
}

3. Run the onboarding wizard & start the gateway.

openclaw onboard --install-daemon     # one-time setup
openclaw gateway status                # verify it's live

4. Talk to it.

openclaw dashboard                     # opens the Control UI
# Type a message → reply streams back from llmrack/llama-3.1-8b.

# Or swap the default model live:
openclaw models set llmrack/phi-3-mini

What just happened.

  you type in dashboard
      ↓
  openclaw gateway  (localhost:18789)
      ↓   matches model "llmrack/llama-3.1-8b" → provider config above
      ↓   POST https://llmrack.com/v1/chat/completions
      ↓     Authorization: Bearer $LLMRACK_API_KEY
      ↓     {"model":"llama-3.1-8b","messages":[...],"stream":true}
      ↓
  llmrack backend → rate-limit gate → Ollama inference → SSE stream
      ↑
  openclaw streams tokens back into the UI

Want to wire it into Slack / Discord / Telegram instead of the dashboard? Re-run the wizard: openclaw onboard, pick the channel, paste the channel bot token when prompted. The LLM leg (LLMRack) stays identical — only the inbound transport changes.

Reference: Model providers, Configuration reference, Channels.

Hermes Agent (Nous Research)

Edit ~/.hermes/config.yaml and put secrets in ~/.hermes/.env:

# ~/.hermes/config.yaml
model:
  provider: custom
  base_url: "https://llmrack.com/v1"
  api_key: ${LLMRACK_API_KEY}
  model: "llama-3.1-8b"

# ~/.hermes/.env
LLMRACK_API_KEY=rl_live_...

# Alternative: use the OpenAI env vars Hermes also recognizes.
OPENAI_BASE_URL=https://llmrack.com/v1
OPENAI_API_KEY=rl_live_...

Paperclip (paperclip.ing)

Paperclip is an orchestration plane — it doesn't call LLMs directly, it invokes adapters (Claude Code, OpenAI Codex, shell, HTTP, etc.). To wire an agent to LLMRack, use an adapter whose runtime does the LLM call. Two paths:

Codex / CLI adapters: most underlying CLIs read OPENAI_API_BASE / OPENAI_API_KEY. Set those in the adapter's environment and they'll flow through to LLMRack.
HTTP adapter: configure it to POST to https://llmrack.com/v1/chat/completions with the Authorization: Bearer rl_live_… header in adapterConfig.

// paperclip agent adapterConfig (HTTP adapter)
{
  adapterType: "http",
  adapterConfig: {
    url: "https://llmrack.com/v1/chat/completions",
    method: "POST",
    headers: {
      "Authorization": "Bearer rl_live_...",
      "Content-Type": "application/json"
    },
    body: {
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "{{prompt}}" }]
    }
  }
}

See docs.paperclip.ing/adapters/overview for the full adapter list. If you need a dedicated OpenAI-style adapter, Paperclip is open-source (MIT) and adding one is straightforward.

Aider (AI pair programmer)

export OPENAI_API_KEY=rl_live_...
export OPENAI_API_BASE=https://llmrack.com/v1

aider --model openai/llama-3.1-8b
# or for a fast helper:
aider --model openai/phi-3-mini

Agent Zero (agent-zero.ai)

Agent Zero is configured from the Web UI, not a .env file.

Open the Web UI (usually http://localhost:50001) → Settings.
Go to the External Services tab → under Other OpenAI-compatible API keys paste your rl_live_… key.
Switch to Chat Model Settings → set provider to OpenAI Compatible, model name to llama-3.1-8b (or any from the catalog), and API URL to https://llmrack.com/v1.
Repeat for Utility Model (try phi-3-mini for speed) and Embeddings (nomic-embed).
Save.

AutoGen (Microsoft)

import os

from autogen import AssistantAgent, UserProxyAgent

config_list = [{
    "model": "llama-3.1-8b",
    "api_key": os.environ["LLMRACK_API_KEY"],
    "base_url": "https://llmrack.com/v1",
    "api_type": "openai",
}]

assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
user = UserProxyAgent("user", code_execution_config=False)
user.initiate_chat(assistant, message="Plan a CPU benchmark run.")

CrewAI

CrewAI uses LangChain's ChatOpenAI underneath — pass it as the agent LLM:

import os

from langchain_openai import ChatOpenAI
from crewai import Agent, Task, Crew

llm = ChatOpenAI(
    model="llama-3.1-8b",
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)

researcher = Agent(role="Researcher", goal="Find sources", llm=llm)
crew = Crew(agents=[researcher], tasks=[Task(description="…", agent=researcher)])
crew.kickoff()

Flowise (no-code agent flows)

In any ChatOpenAI, OpenAI, or OpenAI Embeddings node: click the node → Additional Parameters → set BasePath to https://llmrack.com/v1, paste your key in the credential, and pick a model (llama-3.1-8b, phi-3-mini, etc.).

n8n

In the OpenAI or OpenAI Chat Model node, open the credential, enable Custom API base URL, set it to https://llmrack.com/v1, and paste your LLMRack key.

LiteLLM proxy

If you front multiple providers through LiteLLM, add LLMRack as a model backend. The user-facing model_name is whatever you want — downstream code only sees it:

# config.yaml
model_list:
  - model_name: llmrack/llama-3.1-8b          # how your apps refer to it
    litellm_params:
      model: openai/llama-3.1-8b              # LiteLLM's internal transport slug
      api_base: https://llmrack.com/v1
      api_key: os.environ/LLMRACK_API_KEY
  - model_name: llmrack/phi-3-mini
    litellm_params:
      model: openai/phi-3-mini
      api_base: https://llmrack.com/v1
      api_key: os.environ/LLMRACK_API_KEY

# Your apps then call LiteLLM with model="llmrack/llama-3.1-8b" — the "openai/"
# prefix never leaves config.yaml.

Anything else — the general pattern

Any tool that accepts a custom OpenAI base URL works with LLMRack. The knob is usually labeled:

OPENAI_API_BASE, OPENAI_BASE_URL, or OPENAI_API_HOST (env var)
"Base URL", "API Host", "Endpoint override", "OpenAI-compatible server"
"Custom API base", "Proxy URL" (n8n, Flowise, Cursor, Continue)

Set it to https://llmrack.com/v1, paste your rl_live_… key, pick a model from the catalog above. Done.

If your tool has a quirk that isn't covered above, email stevenkingit02@gmail.com and I'll add it.

12. Already using OpenAI?

LLMRack's request/response shapes match OpenAI's, so any OpenAI client library works as a drop-in — point it at our base URL and pass an LLMRack key. This is optional; the plain-HTTP examples above are the canonical path.

# if you already have the openai package installed and want to reuse it
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)

r = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "hi"}],
)

Model names differ (llama-3.1-8b vs gpt-4o), but streaming, tool calls, and JSON mode all behave the same way.

Try it without writing code. The Playground lets you hit any model with your key from the browser.

Need help? stevenkingit02@gmail.com