Quickstart

LLMRack serves open-weight LLMs over a plain HTTP+JSON API at https://llmrack.com/v1. No SDK required — any HTTP client works.

1. Get an API key

  1. Sign up at llmrack.com/signup (free tier, no card).
  2. Open Dashboard → API Keys.
  3. Click Generate new key, name it, pick permissions, click Generate.
  4. Copy the key from the reveal modal immediately — it's shown once, then only a SHA-256 hash is stored. If lost, revoke and regenerate.

Keys look like rl_live_… (production) or rl_test_… (testing). Treat them like passwords; never commit them to git or ship them in client code.

Export the key so the examples on this page can read it from the environment:

export LLMRACK_API_KEY="rl_live_..."

2. Send your first request

Plain cURL, Python requests, or Node fetch all work; no client library is needed. Here's cURL:

curl https://llmrack.com/v1/chat/completions \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini",
    "messages": [{"role":"user","content":"hi"}]
  }'

Start with phi-3-mini — it's the fastest model on CPU. Full model list below.
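
For readers who would rather start from Python, here is a minimal sketch of the same request using only the standard library (urllib), so it runs without installing anything. The helper names build_chat_request and chat are illustrative and not part of any LLMRack package; the URL, headers, and body mirror the cURL call above.

```python
import json
import os
import urllib.request

BASE_URL = "https://llmrack.com/v1"  # base URL given on this page


def build_chat_request(prompt, model="phi-3-mini", api_key=None):
    """Build (but do not send) a POST to /chat/completions."""
    key = api_key or os.environ["LLMRACK_API_KEY"]
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def chat(prompt, model="phi-3-mini"):
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(prompt, model)
    with urllib.request.urlopen(request, timeout=60) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling chat("hi") with LLMRACK_API_KEY exported should return the assistant's text. Building the request separately from sending it keeps the first half easy to inspect or unit-test without network access.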

Verified response shapes

The exact bodies the server returns. Every snippet below was captured from a live request against this instance.

Non-streaming chat completion (phi-3-mini, "Reply with the single word PONG"):
{
  "id": "chatcmpl-68247b8b",
  "object": "chat.completion",
  "created": 1776587088,
  "model": "phi-3-mini",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "PONG." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 18, "completion_tokens": 4, "total_tokens": 22 }
}
Tool call (llama-3.1-8b, "What is the weather in Paris?" with a get_weather tool). Note that finish_reason is "tool_calls" and content is empty:
{
  "choices": [{
    "index": 0,
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": "",
      "tool_calls": [{
        "id": "call_69e491a8_0",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Paris\"}"
        }
      }]
    }
  }],
  "usage": { "prompt_tokens": 159, "completion_tokens": 16, "total_tokens": 175 }
}
Error envelope — the exact shape every 4xx/5xx returns:
{
  "error": {
    "message": "n > 1 is not supported; call the endpoint multiple times",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}
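
Because every 4xx/5xx shares this envelope, a client can funnel all failures through one parser. The sketch below is one way to do that; LLMRackError and parse_error are hypothetical names invented for illustration, and the fallback for non-JSON bodies is an assumption about defensive handling rather than documented server behavior.

```python
import json


class LLMRackError(Exception):
    """Hypothetical client-side exception; not part of any LLMRack SDK."""

    def __init__(self, status, message, type_, param=None, code=None):
        super().__init__(f"{status} {type_}: {message}")
        self.status = status
        self.message = message
        self.type = type_
        self.param = param
        self.code = code


def parse_error(status, body_text):
    """Turn an error response body into an LLMRackError.

    Falls back to a generic api_error when the body is not the
    documented envelope (for example, an HTML page from a proxy).
    """
    try:
        err = json.loads(body_text)["error"]
        return LLMRackError(
            status, err["message"], err["type"], err.get("param"), err.get("code")
        )
    except (ValueError, KeyError, TypeError):
        return LLMRackError(status, body_text, "api_error")
```

A caller would typically do raise parse_error(resp.status_code, resp.text) whenever the status is 400 or higher, then branch on the exception's type attribute.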

3. Stream tokens

Add "stream": true. The server returns Server-Sent Events — each data: line is one JSON chunk, the stream terminates with data: [DONE]. Captured chunks from llama-3.1-8b streaming "Count from 1 to 5":

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587146,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":"Here"},"finish_reason":null}]}

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587146,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":" it"},"finish_reason":null}]}

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587147,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":" goes"},"finish_reason":null}]}

…

data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587153,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

Every chat.completion.chunk carries the same id across the whole stream. The final chunk carries an empty delta and a non-null finish_reason. Then data: [DONE] closes the stream.

import os, json, requests

with requests.post(
    "https://llmrack.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLMRACK_API_KEY']}"},
    json={
        "model": "phi-3-mini",
        "messages": [{"role": "user", "content": "Haiku about SSDs."}],
        "stream": True,
    },
    stream=True, timeout=60,
) as r:
    r.raise_for_status()  # surface 4xx/5xx before reading the stream
    for raw in r.iter_lines():
        # Skip blank separator lines and anything that isn't a data field.
        if not raw or not raw.startswith(b"data: "):
            continue
        payload = raw[6:]  # strip the "data: " prefix
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        # The final chunk has an empty delta, so default to "".
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
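
The parsing half of that loop can also be factored into a small generator that takes any iterable of lines, which makes it testable offline against captured chunks like the ones shown above. This is a sketch under that assumption; iter_stream_text is an illustrative name, not an LLMRack function.

```python
import json


def iter_stream_text(lines):
    """Yield content fragments from an iterable of SSE lines (bytes or str)."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data: "):
            continue  # blank separators and non-data fields are skipped
        payload = line[len("data: "):]
        if payload == "[DONE]":
            return  # end-of-stream sentinel
        delta = json.loads(payload)["choices"][0]["delta"]
        text = delta.get("content")
        if text:  # the final chunk carries an empty delta
            yield text
```

With requests, "".join(iter_stream_text(r.iter_lines())) would collect the whole reply; iterating instead lets you print tokens as they arrive.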

4. Embeddings

Text embeddings come from nomic-embed — 768 dimensions, 8k context. Use them for semantic search, RAG, clustering.

curl https://llmrack.com/v1/embeddings \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": ["the quick brown fox", "hello world"]
  }'
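
For semantic search, the returned vectors are usually compared by cosine similarity. The sketch below assumes the response follows the OpenAI embeddings shape ({"data": [{"index": i, "embedding": [...]}, ...]}); this page does not show an embeddings response, so treat that shape as an assumption and check a live reply. Both function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # define similarity with a zero vector as 0
    return dot / (norm_a * norm_b)


def vectors_from_response(response):
    """Return embeddings in input order.

    Assumes an OpenAI-style body: {"data": [{"index": i, "embedding": [...]}]}.
    """
    items = sorted(response["data"], key=lambda item: item["index"])
    return [item["embedding"] for item in items]
```

Sorting by index guards against the server returning items out of order, so the i-th vector always corresponds to the i-th input string.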

5. Tool calling, end-to-end

Define tools in the first call. If the model picks one, parse the tool_calls and dispatch. Send the tool's output back with role: "tool". The final turn returns prose.

Important: the value of function.arguments on the model's output is a JSON string, not an object. json.loads() it before using. Tool-call-capable models: llama-3.1-8b, qwen-2.5-7b, mistral-7b.

import os, json, requests

BASE = "https://llmrack.com/v1"
HDR = {"Authorization": f"Bearer {os.environ['LLMRACK_API_KEY']}"}

# 1) Ask with tools attached
r = requests.post(f"{BASE}/chat/completions", headers=HDR, json={
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}).json()

# Assumes the model chose the tool; in real code, check that
# finish_reason == "tool_calls" before indexing into tool_calls.
call = r["choices"][0]["message"]["tool_calls"][0]
args = json.loads(call["function"]["arguments"])  # → {"city": "Paris"}
tool_result = {"temp_c": 18, "condition": "cloudy"}  # do your real lookup here

# 2) Send the tool result back so the model can answer.
r2 = requests.post(f"{BASE}/chat/completions", headers=HDR, json={
    "model": "llama-3.1-8b",
    "messages": [
        {"role": "user", "content": "Weather in Paris?"},
        {"role": "assistant", "content": None, "tool_calls": [call]},
        {"role": "tool", "tool_call_id": call["id"],
         "content": json.dumps(tool_result)},
    ],
}).json()

print(r2["choices"][0]["message"]["content"])
# → "The weather in Paris is currently cloudy with a temperature of 18 degrees Celsius."
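
With more than one tool, the "parse and dispatch" step is easier to manage as a lookup table from tool name to a local function. The following is a minimal sketch of that idea; TOOLS, run_tool_call, and the stand-in get_weather are all hypothetical, and returning an error object for unknown tool names is a design choice here, not something the API requires.

```python
import json


def get_weather(city):
    """Stand-in implementation; replace with a real lookup."""
    return {"city": city, "temp_c": 18, "condition": "cloudy"}


TOOLS = {"get_weather": get_weather}  # hypothetical registry: tool name -> callable


def run_tool_call(call, tools=TOOLS):
    """Execute one entry of message.tool_calls and build the role "tool" reply."""
    name = call["function"]["name"]
    if name not in tools:
        result = {"error": f"unknown tool: {name}"}
    else:
        # function.arguments is a JSON string, so decode it first.
        args = json.loads(call["function"]["arguments"])
        result = tools[name](**args)
    return {
        "role": "tool",
        "tool_call_id": call["id"],
        "content": json.dumps(result),
    }
```

Each returned dict can be appended to messages as-is, after the assistant message that carried the tool_calls.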

6. What the API actually supports

LLMRack's /v1/chat/completions speaks the OpenAI Chat Completions wire shape. Here's the honest field-by-field support — use it when choosing whether LLMRack fits your tool.

Request field | Status | Notes
model | required | From the catalog below. Plain id, no openai/ prefix.
messages | required | Roles system, user, assistant, tool.
stream | yes | Server-Sent Events, terminated by data: [DONE].
temperature | yes | Forwarded to Ollama.
top_p | yes | Forwarded.
max_tokens | yes | Maps to Ollama num_predict.
stop | yes | String or list of strings.
presence_penalty / frequency_penalty | yes | Forwarded (not all model families honor these).
tools / tool_choice | yes | Forwarded to Ollama. Works on llama-3.1-8b, qwen-2.5-7b, mistral-7b. Small models (phi-3-mini) often ignore tools; that's a model limitation, not an API one.
response_format | yes | {type: "json_object"} → Ollama format: "json". JSON Schema (type: "json_schema") is passed to Ollama's schema-constrained decoding.
n | 1 only | n > 1 returns 400. Call the endpoint multiple times.
logprobs / top_logprobs | no | Not emitted. Ollama doesn't expose these.
seed | no | Not forwarded yet.
logit_bias | no | Not forwarded.
user | accepted | Stored in logs; has no behavioral effect (your API key already identifies you).

Response fields: id, object, created, model, choices[0].message.{role,content,tool_calls?}, choices[0].finish_reason (values: stop, length, tool_calls), usage.{prompt_tokens, completion_tokens, total_tokens}. Tokens come from Ollama's prompt_eval_count / eval_count (real, not estimated) for chat and text completions. For /v1/embeddings, usage.prompt_tokens is an approximation (len(text)/4) since Ollama's embeddings endpoint doesn't return a token count.
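
If you are porting code written against OpenAI, a client-side pre-flight check can flag fields from the support list before a request goes out. The sketch below encodes only what this section states (required fields, n limited to 1, and the fields marked "no"); preflight is a hypothetical helper, and it does not claim to reproduce the server's own validation.

```python
# Fields marked "no" in the request-field support list.
UNSUPPORTED = {"logprobs", "top_logprobs", "seed", "logit_bias"}


def preflight(body):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for field in ("model", "messages"):
        if field not in body:
            problems.append(f"missing required field: {field}")
    if body.get("n", 1) != 1:
        problems.append("n > 1 is rejected with HTTP 400; send separate requests")
    for field in sorted(UNSUPPORTED.intersection(body)):
        problems.append(f"{field} is listed as not supported")
    return problems
```

Logging the returned list during a migration is usually enough; whether to block the request on a warning is up to the caller.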

7. Available models

All prices in USD per 1M tokens. Q4_K_M quantization (except Nomic, F16).

Model id | Params | Context | In $/1M | Out $/1M
phi-3-mini | 3.8B | 128k | $0.04 | $0.04
mistral-7b | 7B | 32k | $0.08 | $0.10
llama-3.1-8b | 8B | 128k | $0.09 | $0.11
qwen-2.5-7b | 7B | 32k | $0.10 | $0.12
nomic-embed | 137M | 8k | $0.02 | $

Live list at GET https://llmrack.com/v1/models (or the Models page).
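
To turn a response's usage block into a dollar figure, multiply each token count by the matching per-million price. The sketch below assumes billing is a straight per-token product of the listed prices, which this page implies but does not spell out; PRICES is copied by hand from the chat models in this section and estimate_cost_usd is an illustrative name.

```python
# USD per 1M tokens as (input, output), copied from the model list in this section.
PRICES = {
    "phi-3-mini": (0.04, 0.04),
    "mistral-7b": (0.08, 0.10),
    "llama-3.1-8b": (0.09, 0.11),
    "qwen-2.5-7b": (0.10, 0.12),
}


def estimate_cost_usd(model, usage):
    """Estimate the cost of one chat response from its usage block."""
    in_price, out_price = PRICES[model]
    total = usage["prompt_tokens"] * in_price + usage["completion_tokens"] * out_price
    return total / 1_000_000
```

For the PONG example earlier (18 prompt tokens, 4 completion tokens on phi-3-mini) this works out to 0.88 millionths of a dollar. nomic-embed is left out because the sketch only covers chat models.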

8. Authentication

Pass the API key as a bearer token:

Authorization: Bearer rl_live_...

Missing, malformed, revoked, or expired keys return 401 authentication_error.

9. Rate limits

Tier | Requests / min | Tokens / day | Monthly
Free (for testing and light experimentation) | 10 | 10,000 | $0
Pro (for developers and daily use) | 100 | 550,000 | $15
Business (for production and high-volume usage) | 500 | 5,000,000 | $65

Hitting a limit returns 429 with Retry-After (seconds). Upgrade at Dashboard → Billing.
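
A client can honor Retry-After by sleeping for the advertised number of seconds before retrying. The helper below is a sketch of that decision only; retry_delay is a hypothetical name, and the exponential fallback and the 60-second cap are assumptions chosen for illustration, not documented LLMRack behavior.

```python
def retry_delay(status, headers, attempt, cap=60.0):
    """Seconds to wait before retrying, or None if the response is not a 429.

    headers is a plain dict here; real HTTP libraries usually provide a
    case-insensitive mapping, so the lookup key may need adjusting.
    """
    if status != 429:
        return None
    value = headers.get("Retry-After")
    try:
        delay = float(value)
    except (TypeError, ValueError):
        # Header missing or not a number: fall back to exponential backoff.
        delay = float(2 ** attempt)
    return min(max(delay, 0.0), cap)
```

In a request loop you would call time.sleep(delay) when the result is not None and give up after a fixed number of attempts.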

10. Error codes

HTTP | Type | When
400 | invalid_request_error | Unknown model, malformed body, bad params.
401 | authentication_error | Missing / bad / revoked API key.
429 | rate_limit_exceeded | RPM or daily token budget hit.
502 | api_error | Upstream model error; retry is safe.
11. Use with agents & tools

Anything that supports a custom OpenAI-compatible endpoint works with LLMRack. Configure the tool with:

Base URL: https://llmrack.com/v1
API key: rl_live_…
Model: phi-3-mini · mistral-7b · llama-3.1-8b · qwen-2.5-7b · nomic-embed
Provider: openai (whenever the tool asks which provider, pick OpenAI, then override the URL)

Why you might see openai/ in configs. LLMRack's own API uses plain model names (llama-3.1-8b, phi-3-mini, etc.) with no prefix. A few router-style tools route by <provider>/<model>; there, the prefix names the transport protocol, not the model's creator. Where the tool lets you register a custom provider or alias (OpenClaw, LiteLLM), the examples below register llmrack as the provider name, so your code only sees llmrack/llama-3.1-8b. Where it doesn't (Aider), openai/ stays on the command line.
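
If your own code receives router-style names and calls LLMRack directly, the prefix has to come off before the request is sent. A minimal sketch, with to_llmrack_model_id as a hypothetical helper name:

```python
def to_llmrack_model_id(name):
    """Drop a leading "<provider>/" segment, if any."""
    provider, sep, model = name.partition("/")
    return model if sep else name
```

So both "openai/llama-3.1-8b" and "llmrack/llama-3.1-8b" reduce to the plain id the API expects, and an id without a prefix passes through unchanged.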

Open WebUI

Settings → Connections → OpenAI API. Set the endpoint to https://llmrack.com/v1 and paste your key. Models appear automatically.

LibreChat

Add a custom endpoint to librechat.yaml:

endpoints:
  custom:
    - name: "LLMRack"
      apiKey: "${LLMRACK_API_KEY}"
      baseURL: "https://llmrack.com/v1"
      models:
        default: ["llama-3.1-8b", "mistral-7b", "phi-3-mini", "qwen-2.5-7b"]
      titleModel: "phi-3-mini"
      iconURL: "https://llmrack.com/favicon.ico"

Continue (VS Code / JetBrains)

Edit ~/.continue/config.json:

{
  "models": [
    {
      "title": "LLMRack · Llama 3.1 8B",
      "provider": "openai",
      "model": "llama-3.1-8b",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    },
    {
      "title": "LLMRack · Phi-3 Mini (fast)",
      "provider": "openai",
      "model": "phi-3-mini",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    }
  ],
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed",
    "apiBase": "https://llmrack.com/v1",
    "apiKey": "rl_live_..."
  }
}

Cursor

Cursor → Settings → Models → OpenAI API Key → enable Override OpenAI Base URL, set it to https://llmrack.com/v1, paste your LLMRack key. Add model names (e.g. llama-3.1-8b) in the custom models list.

LangChain (Python)

import os
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    model="llama-3.1-8b",
    streaming=True,
)

# The same llm object can be passed to LangChain chains and agents, LangGraph, etc.
for chunk in llm.stream("Summarize RAG in one paragraph."):
    print(chunk.content, end="", flush=True)

LlamaIndex

import os
from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding

Settings.llm = OpenAILike(
    model="llama-3.1-8b",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    is_chat_model=True,
)
Settings.embed_model = OpenAILikeEmbedding(
    model_name="nomic-embed",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)

AnythingLLM

Settings → LLM Provider → Generic OpenAI. Base URL https://llmrack.com/v1, API key rl_live_…, model llama-3.1-8b. For embeddings choose Generic OpenAI with model nomic-embed.

OpenClaw (openclaw.ai) — end-to-end

Full path from empty machine to "I typed a message and LLMRack answered." OpenClaw lets us register llmrack as a first-class provider, so your agents reference models as llmrack/llama-3.1-8b with no openai/ prefix visible anywhere.

1. Install OpenClaw & get a key.
# macOS / Linux
curl -fsSL https://openclaw.ai/install.sh | bash

# Get your LLMRack key
export LLMRACK_API_KEY="rl_live_..."        # from llmrack.com/dashboard/keys
2. Drop a config file that registers LLMRack as a provider.

Write this to ~/.openclaw/config.json5:

{
  agents: {
    defaults: {
      // Pick the default model for new agents. No "openai/" in sight.
      model: { primary: "llmrack/llama-3.1-8b" },
    },
  },
  models: {
    providers: {
      llmrack: {
        baseUrl: "https://llmrack.com/v1",
        apiKey: "${LLMRACK_API_KEY}",    // resolves from env

        // "api" names the WIRE PROTOCOL, not the vendor. OpenClaw supports a few
        // dialects (openai-completions, anthropic-messages, google-generateContent,
        // …). Our /v1/chat/completions endpoint implements the OpenAI Chat
        // Completions shape — request: {model, messages[], stream, temperature,
        // tools, response_format, …}; response: {choices[0].message.{content,
        // tool_calls}, usage.{prompt_tokens, completion_tokens}}; streaming as
        // SSE with "data: [DONE]" terminator. So the correct value here is:
        api: "openai-completions",

        models: [
          { id: "llama-3.1-8b", name: "Llama 3.1 8B",  contextWindow: 128000, maxTokens: 4096 },
          { id: "mistral-7b",   name: "Mistral 7B",    contextWindow: 32000,  maxTokens: 4096 },
          { id: "qwen-2.5-7b",  name: "Qwen 2.5 7B",   contextWindow: 32000,  maxTokens: 4096 },
          { id: "phi-3-mini",   name: "Phi-3 Mini",    contextWindow: 128000, maxTokens: 4096 },
          { id: "nomic-embed",  name: "Nomic Embed",   contextWindow: 8192,   input: ["text"] },
        ],
      },
    },
  },
}
3. Run the onboarding wizard & start the gateway.
openclaw onboard --install-daemon     # one-time setup
openclaw gateway status                # verify it's live
4. Talk to it.
openclaw dashboard                     # opens the Control UI
# Type a message → reply streams back from llmrack/llama-3.1-8b.

# Or swap the default model live:
openclaw models set llmrack/phi-3-mini
What just happened.
  you type in dashboard
      ↓
  openclaw gateway  (localhost:18789)
      ↓   matches model "llmrack/llama-3.1-8b" → provider config above
      ↓   POST https://llmrack.com/v1/chat/completions
      ↓     Authorization: Bearer $LLMRACK_API_KEY
      ↓     {"model":"llama-3.1-8b","messages":[...],"stream":true}
      ↓
  llmrack backend → rate-limit gate → Ollama inference → SSE stream
      ↑
  openclaw streams tokens back into the UI

Want to wire it into Slack / Discord / Telegram instead of the dashboard? Re-run the wizard: openclaw onboard, pick the channel, paste the channel bot token when prompted. The LLM leg (LLMRack) stays identical — only the inbound transport changes.

Reference: Model providers, Configuration reference, Channels.

Hermes Agent (Nous Research)

Edit ~/.hermes/config.yaml and put secrets in ~/.hermes/.env:

# ~/.hermes/config.yaml
model:
  provider: custom
  base_url: "https://llmrack.com/v1"
  api_key: ${LLMRACK_API_KEY}
  model: "llama-3.1-8b"
# ~/.hermes/.env
LLMRACK_API_KEY=rl_live_...

# Alternative: use the OpenAI env vars Hermes also recognizes.
OPENAI_BASE_URL=https://llmrack.com/v1
OPENAI_API_KEY=rl_live_...

Paperclip (paperclip.ing)

Paperclip is an orchestration plane — it doesn't call LLMs directly, it invokes adapters (Claude Code, OpenAI Codex, shell, HTTP, etc.). To wire an agent to LLMRack, use an adapter whose runtime does the LLM call. Two paths:

  1. Codex / CLI adapters: most underlying CLIs read OPENAI_API_BASE / OPENAI_API_KEY. Set those in the adapter's environment and they'll flow through to LLMRack.
  2. HTTP adapter: configure it to POST to https://llmrack.com/v1/chat/completions with the Authorization: Bearer rl_live_… header in adapterConfig.
// paperclip agent adapterConfig (HTTP adapter)
{
  adapterType: "http",
  adapterConfig: {
    url: "https://llmrack.com/v1/chat/completions",
    method: "POST",
    headers: {
      "Authorization": "Bearer rl_live_...",
      "Content-Type": "application/json"
    },
    body: {
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "{{prompt}}" }]
    }
  }
}

See docs.paperclip.ing/adapters/overview for the full adapter list. If you need a dedicated OpenAI-style adapter, Paperclip is open-source (MIT) and adding one is straightforward.

Aider (AI pair programmer)

export OPENAI_API_KEY=rl_live_...
export OPENAI_API_BASE=https://llmrack.com/v1

aider --model openai/llama-3.1-8b
# or for a fast helper:
aider --model openai/phi-3-mini

Agent Zero (agent-zero.ai)

Agent Zero is configured from the Web UI, not a .env file.

  1. Open the Web UI (usually http://localhost:50001) → Settings.
  2. Go to the External Services tab → under Other OpenAI-compatible API keys paste your rl_live_… key.
  3. Switch to Chat Model Settings → set provider to OpenAI Compatible, model name to llama-3.1-8b (or any from the catalog), and API URL to https://llmrack.com/v1.
  4. Repeat for Utility Model (try phi-3-mini for speed) and Embeddings (nomic-embed).
  5. Save.

AutoGen (Microsoft)

import os
from autogen import AssistantAgent, UserProxyAgent  # pyautogen 0.2-style API

config_list = [{
    "model": "llama-3.1-8b",
    "api_key": os.environ["LLMRACK_API_KEY"],
    "base_url": "https://llmrack.com/v1",
    "api_type": "openai",
}]

assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
user = UserProxyAgent("user", code_execution_config=False)
user.initiate_chat(assistant, message="Plan a CPU benchmark run.")

CrewAI

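CrewAI releases that accept LangChain chat models take ChatOpenAI as the agent LLM (newer releases route through LiteLLM, so check the CrewAI docs for your version):

import os
from langchain_openai import ChatOpenAI
from crewai import Agent, Task, Crew

llm = ChatOpenAI(
    model="llama-3.1-8b",
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)

# backstory and expected_output are required by CrewAI; fill in your own text.
researcher = Agent(role="Researcher", goal="Find sources", backstory="…", llm=llm)
task = Task(description="…", expected_output="…", agent=researcher)
crew = Crew(agents=[researcher], tasks=[task])
crew.kickoff()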

Flowise (no-code agent flows)

In any ChatOpenAI, OpenAI, or OpenAI Embeddings node: click the node → Additional Parameters → set BasePath to https://llmrack.com/v1, paste your key in the credential, and pick a model (llama-3.1-8b, phi-3-mini, etc.).

n8n

In the OpenAI or OpenAI Chat Model node, open the credential, enable Custom API base URL, set it to https://llmrack.com/v1, and paste your LLMRack key.

LiteLLM proxy

If you front multiple providers through LiteLLM, add LLMRack as a model backend. The user-facing model_name is whatever you want — downstream code only sees it:

# config.yaml
model_list:
  - model_name: llmrack/llama-3.1-8b          # how your apps refer to it
    litellm_params:
      model: openai/llama-3.1-8b              # LiteLLM's internal transport slug
      api_base: https://llmrack.com/v1
      api_key: os.environ/LLMRACK_API_KEY
  - model_name: llmrack/phi-3-mini
    litellm_params:
      model: openai/phi-3-mini
      api_base: https://llmrack.com/v1
      api_key: os.environ/LLMRACK_API_KEY

# Your apps then call LiteLLM with model="llmrack/llama-3.1-8b" — the "openai/"
# prefix never leaves config.yaml.

Anything else — the general pattern

Any tool that accepts a custom OpenAI base URL works with LLMRack. The knob is usually labeled:

  • OPENAI_API_BASE, OPENAI_BASE_URL, or OPENAI_API_HOST (env var)
  • "Base URL", "API Host", "Endpoint override", "OpenAI-compatible server"
  • "Custom API base", "Proxy URL" (n8n, Flowise, Cursor, Continue)

Set it to https://llmrack.com/v1, paste your rl_live_… key, pick a model from the catalog above. Done.

If your tool has a quirk that isn't covered above, email stevenkingit02@gmail.com and I'll add it.

12. Already using OpenAI?

LLMRack's request/response shapes match OpenAI's, so any OpenAI client library works as a drop-in — point it at our base URL and pass an LLMRack key. This is optional; the plain-HTTP examples above are the canonical path.

# if you already have the openai package installed and want to reuse it
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)

r = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "hi"}],
)

Model names differ (llama-3.1-8b vs gpt-4o), but streaming, tool calls, and JSON mode all behave the same way.

Try it without writing code. The Playground lets you hit any model with your key from the browser.