LLMRack serves open-weight LLMs over a plain HTTP+JSON API at https://llmrack.com/v1. No SDK required — any HTTP client works.
Keys start with rl_live_… (production) or rl_test_… (testing). Treat them like passwords; never commit them to git or ship them in client code.
export LLMRACK_API_KEY="rl_live_..."
Plain cURL, Python requests, or Node fetch all work; no client library is needed. Here's cURL:
curl https://llmrack.com/v1/chat/completions \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini",
    "messages": [{"role":"user","content":"hi"}]
  }'
Start with phi-3-mini — it's the fastest model on CPU. Full model list below.
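The same request can be built in Python with only the standard library. This is a minimal sketch: build_chat_request is a hypothetical helper name, the key is a placeholder, and the request is only constructed here, not sent.

```python
import json
import urllib.request

BASE_URL = "https://llmrack.com/v1"


def build_chat_request(api_key, prompt, model="phi-3-mini"):
    """Build (but do not send) a POST to /chat/completions."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# "rl_test_example" is a placeholder, not a real key.
req = build_chat_request("rl_test_example", "hi")
print(req.get_method(), req.full_url)

# To actually send it (needs a real key and network access):
#   with urllib.request.urlopen(req, timeout=60) as resp:
#       print(json.load(resp)["choices"][0]["message"]["content"])
```

Keeping request construction separate from sending makes the payload easy to inspect before any key is spent on it.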
The exact bodies the server returns. Every snippet below was captured from a live request against this instance.
phi-3-mini, "Reply with the single word PONG":
{
  "id": "chatcmpl-68247b8b",
  "object": "chat.completion",
  "created": 1776587088,
  "model": "phi-3-mini",
  "choices": [{
    "index": 0,
    "message": { "role": "assistant", "content": "PONG." },
    "finish_reason": "stop"
  }],
  "usage": { "prompt_tokens": 18, "completion_tokens": 4, "total_tokens": 22 }
}
llama-3.1-8b, "What is the weather in Paris?" with a get_weather tool. Note finish_reason: "tool_calls" and content is empty:
{
  "choices": [{
    "index": 0,
    "finish_reason": "tool_calls",
    "message": {
      "role": "assistant",
      "content": "",
      "tool_calls": [{
        "id": "call_69e491a8_0",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Paris\"}"
        }
      }]
    }
  }],
  "usage": { "prompt_tokens": 159, "completion_tokens": 16, "total_tokens": 175 }
}
{
  "error": {
    "message": "n > 1 is not supported; call the endpoint multiple times",
    "type": "invalid_request_error",
    "param": null,
    "code": null
  }
}
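A client can turn an error body like the one above into an exception. This is a minimal sketch: LLMRackError and raise_for_error are hypothetical names, and it assumes every error response uses the same four-field error envelope as the captured body.

```python
import json


class LLMRackError(Exception):
    """Hypothetical exception wrapping the "error" envelope of a response body."""

    def __init__(self, message, type_=None, param=None, code=None):
        super().__init__(message)
        self.type = type_
        self.param = param
        self.code = code


def raise_for_error(body: dict) -> dict:
    """Return body unchanged, or raise LLMRackError if it carries an error envelope."""
    err = body.get("error")
    if err is None:
        return body
    raise LLMRackError(
        err.get("message", "unknown error"),
        type_=err.get("type"),
        param=err.get("param"),
        code=err.get("code"),
    )


# The captured error body, used as sample input.
sample = json.loads(
    '{"error": {"message": "n > 1 is not supported; call the endpoint multiple times",'
    ' "type": "invalid_request_error", "param": null, "code": null}}'
)

try:
    raise_for_error(sample)
except LLMRackError as exc:
    print(exc.type, "-", exc)
```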
Add "stream": true. The server returns Server-Sent Events — each data: line is one JSON chunk, the stream terminates with data: [DONE]. Captured chunks from llama-3.1-8b streaming "Count from 1 to 5":
data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587146,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":"Here"},"finish_reason":null}]}
data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587146,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":" it"},"finish_reason":null}]}
data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587147,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{"role":"assistant","content":" goes"},"finish_reason":null}]}
…
data: {"id":"chatcmpl-6dffd80d","object":"chat.completion.chunk","created":1776587153,"model":"llama-3.1-8b","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}
data: [DONE]
Every chat.completion.chunk carries the same id across the whole stream. The final chunk carries an empty delta and a non-null finish_reason. Then data: [DONE] closes the stream.
import os, json, requests

with requests.post(
    "https://llmrack.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLMRACK_API_KEY']}"},
    json={
        "model": "phi-3-mini",
        "messages": [{"role": "user", "content": "Haiku about SSDs."}],
        "stream": True,
    },
    stream=True, timeout=60,
) as r:
    for raw in r.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue
        payload = raw[6:]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
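The chunk rules described earlier (one id for the whole stream, an empty delta with a non-null finish_reason on the last chunk, then data: [DONE]) can also be sketched as a pure function over lines that have already been received, which makes the parsing testable without a network call. collect_stream is a hypothetical helper name, and the sample lines are abbreviated versions of the captured chunks.

```python
import json


def collect_stream(lines):
    """Assemble streamed text from SSE lines (bytes).

    Returns (text, finish_reason, stream_id) and stops at "data: [DONE]".
    """
    parts, finish_reason, stream_id = [], None, None
    for raw in lines:
        if not raw or not raw.startswith(b"data: "):
            continue  # skip blank separator lines and non-data fields
        payload = raw[len(b"data: "):]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        stream_id = stream_id or chunk["id"]
        choice = chunk["choices"][0]
        parts.append(choice["delta"].get("content") or "")
        if choice["finish_reason"] is not None:
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason, stream_id


sample_lines = [
    b'data: {"id":"chatcmpl-6dffd80d","choices":[{"index":0,"delta":{"role":"assistant","content":"Here"},"finish_reason":null}]}',
    b"",
    b'data: {"id":"chatcmpl-6dffd80d","choices":[{"index":0,"delta":{"content":" it"},"finish_reason":null}]}',
    b'data: {"id":"chatcmpl-6dffd80d","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    b"data: [DONE]",
]

text, reason, sid = collect_stream(sample_lines)
print(repr(text), reason, sid)
```

With requests, the same function should accept r.iter_lines() directly, since that yields one bytes object per line.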
Text embeddings come from nomic-embed — 768 dimensions, 8k context. Use them for semantic search, RAG, clustering.
curl https://llmrack.com/v1/embeddings \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": ["the quick brown fox", "hello world"]
  }'
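For semantic search, the returned vectors are usually compared by cosine similarity. The response body for /v1/embeddings is not shown in this document, so the sketch below assumes the OpenAI-style shape (data[i].embedding) and uses tiny made-up vectors in place of real 768-dimensional ones. cosine and rank are hypothetical helper names.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity to the query."""
    scores = [cosine(query_vec, v) for v in doc_vecs]
    return sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)


# ASSUMPTION: OpenAI-style response shape; the vectors are toy data.
fake_response = {"data": [
    {"index": 0, "embedding": [1.0, 0.0, 0.0]},
    {"index": 1, "embedding": [0.0, 1.0, 0.0]},
    {"index": 2, "embedding": [0.9, 0.1, 0.0]},
]}
vectors = [row["embedding"] for row in fake_response["data"]]
query = [1.0, 0.0, 0.0]

print(rank(query, vectors))
```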
Define tools in the first call. If the model picks one, parse the tool_calls and dispatch. Send the tool's output back with role: "tool". The final turn returns prose.
Important: the value of function.arguments on the model's output is a JSON string, not an object. json.loads() it before using. Tool-call-capable models: llama-3.1-8b, qwen-2.5-7b, mistral-7b.
import os, json, requests

BASE = "https://llmrack.com/v1"
HDR = {"Authorization": f"Bearer {os.environ['LLMRACK_API_KEY']}"}

# 1) Ask with tools attached
r = requests.post(f"{BASE}/chat/completions", headers=HDR, json={
    "model": "llama-3.1-8b",
    "messages": [{"role": "user", "content": "Weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
}).json()

call = r["choices"][0]["message"]["tool_calls"][0]
args = json.loads(call["function"]["arguments"])      # → {"city": "Paris"}
tool_result = {"temp_c": 18, "condition": "cloudy"}   # do your real lookup here

# 2) Send the tool result back so the model can answer.
r2 = requests.post(f"{BASE}/chat/completions", headers=HDR, json={
    "model": "llama-3.1-8b",
    "messages": [
        {"role": "user", "content": "Weather in Paris?"},
        {"role": "assistant", "content": None, "tool_calls": [call]},
        {"role": "tool", "tool_call_id": call["id"],
         "content": json.dumps(tool_result)},
    ],
}).json()

print(r2["choices"][0]["message"]["content"])
# → "The weather in Paris is currently cloudy with a temperature of 18 degrees Celsius."
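When more than one tool is registered, the "parse the tool_calls and dispatch" step is often written as a small lookup table. The sketch below runs offline against the captured tool-call message. TOOLS and run_tool_calls are hypothetical names, and get_weather is a stand-in that returns fixed data.

```python
import json


def get_weather(city: str) -> dict:
    """Stand-in tool; a real implementation would call a weather service."""
    return {"city": city, "temp_c": 18, "condition": "cloudy"}


# Hypothetical registry: tool name -> Python callable.
TOOLS = {"get_weather": get_weather}


def run_tool_calls(assistant_message: dict) -> list:
    """Run each tool call and return the role "tool" messages to send back."""
    results = []
    for call in assistant_message.get("tool_calls") or []:
        fn = TOOLS[call["function"]["name"]]
        # function.arguments is a JSON string, so decode it before use.
        args = json.loads(call["function"]["arguments"])
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(fn(**args)),
        })
    return results


# The assistant message from the captured tool-call response.
message = {
    "role": "assistant",
    "content": "",
    "tool_calls": [{
        "id": "call_69e491a8_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": "{\"city\": \"Paris\"}"},
    }],
}

for tool_message in run_tool_calls(message):
    print(tool_message)
```

An unknown tool name raises KeyError here; production code would more likely return an error string as the tool content so the model can recover.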
LLMRack's /v1/chat/completions speaks the OpenAI Chat Completions wire shape. Here's the honest field-by-field support — use it when choosing whether LLMRack fits your tool.
Response fields: id, object, created, model, choices[0].message.{role,content,tool_calls?}, choices[0].finish_reason (values: stop, length, tool_calls), usage.{prompt_tokens, completion_tokens, total_tokens}. Tokens come from Ollama's prompt_eval_count / eval_count (real, not estimated) for chat and text completions. For /v1/embeddings, usage.prompt_tokens is an approximation (len(text)/4) since Ollama's embeddings endpoint doesn't return a token count.
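To make the usage fields concrete, here is a small sketch that sums usage across responses and mirrors the len(text)/4 approximation for embeddings. total_usage and approx_tokens are hypothetical names. Integer division and summing per input are assumptions, since the server's exact rounding is not documented here.

```python
def total_usage(responses):
    """Sum the usage counters across a list of response bodies."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for resp in responses:
        usage = resp.get("usage", {})
        for key in totals:
            totals[key] += usage.get(key, 0)
    return totals


def approx_tokens(text: str) -> int:
    # ASSUMPTION: integer division; the server may round differently.
    return len(text) // 4


# Usage objects from the two captured chat responses.
captured = [
    {"usage": {"prompt_tokens": 18, "completion_tokens": 4, "total_tokens": 22}},
    {"usage": {"prompt_tokens": 159, "completion_tokens": 16, "total_tokens": 175}},
]
print(total_usage(captured))

inputs = ["the quick brown fox", "hello world"]
print(sum(approx_tokens(t) for t in inputs))
```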
All prices in USD per 1M tokens. Q4_K_M quantization (except Nomic, F16).
Live list at GET https://llmrack.com/v1/models (or the Models page).
Pass the API key as a bearer token:
Authorization: Bearer rl_live_...
Missing, malformed, revoked, or expired keys return 401 authentication_error.
Hitting a limit returns 429 with Retry-After (seconds). Upgrade at Dashboard → Billing.
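A client can honor Retry-After with a short retry loop. This is a minimal sketch, not a prescribed policy: call_with_retry and FakeResponse are hypothetical names, the attempt limit of 3 and the 1-second fallback are arbitrary choices, and the demo uses fake responses so it runs offline.

```python
import time


def call_with_retry(send, max_attempts=3, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    send is any zero-argument callable returning an object with
    .status_code and .headers (for example a lambda wrapping requests.post).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        resp = send()
        if resp.status_code != 429 or attempt == max_attempts:
            return resp
        # Retry-After is in seconds; fall back to 1s if absent or malformed.
        try:
            delay = float(resp.headers.get("Retry-After", 1))
        except ValueError:
            delay = 1.0
        sleep(delay)


class FakeResponse:
    """Offline stand-in for an HTTP response object."""

    def __init__(self, status_code, headers=None):
        self.status_code = status_code
        self.headers = headers or {}


queue = [FakeResponse(429, {"Retry-After": "2"}), FakeResponse(200)]
waits = []  # record requested delays instead of actually sleeping
final = call_with_retry(lambda: queue.pop(0), sleep=waits.append)
print(final.status_code, waits)
```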
Anything that supports a custom OpenAI-compatible endpoint works with LLMRack. Configure the tool with:
openai/ in configs. LLMRack's own API uses plain model names (llama-3.1-8b, phi-3-mini, etc.) with no prefix. A few router-style tools route by <provider>/<model>; there the prefix names the transport protocol, not the model's creator. Where the tool lets you register a custom provider or alias (OpenClaw, LiteLLM), the examples below register llmrack as the provider name, so your code only sees llmrack/llama-3.1-8b. Where it doesn't (Aider), openai/ stays on the command line.
Settings → Connections → OpenAI API. Set the endpoint to https://llmrack.com/v1 and paste your key. Models appear automatically.
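The provider-prefix convention described above comes down to splitting on the first slash: the router keeps the provider part, and only the plain model name goes into the request body sent to LLMRack. This is an illustrative sketch with hypothetical helper names, not the actual code of any router tool.

```python
def split_model(ref: str):
    """Split "provider/model" into (provider, model); provider is None if unprefixed."""
    provider, sep, model = ref.partition("/")
    return (provider, model) if sep else (None, ref)


def wire_model_name(ref: str) -> str:
    """Model name to put in the request body sent to LLMRack (no prefix)."""
    return split_model(ref)[1]


for ref in ("llmrack/llama-3.1-8b", "openai/phi-3-mini", "phi-3-mini"):
    print(ref, "->", split_model(ref), "->", wire_model_name(ref))
```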
Add a custom endpoint to librechat.yaml:
endpoints:
  custom:
    - name: "LLMRack"
      apiKey: "${LLMRACK_API_KEY}"
      baseURL: "https://llmrack.com/v1"
      models:
        default: ["llama-3.1-8b", "mistral-7b", "phi-3-mini", "qwen-2.5-7b"]
      titleModel: "phi-3-mini"
      iconURL: "https://llmrack.com/favicon.ico"
Edit ~/.continue/config.json:
{
  "models": [
    {
      "title": "LLMRack · Llama 3.1 8B",
      "provider": "openai",
      "model": "llama-3.1-8b",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    },
    {
      "title": "LLMRack · Phi-3 Mini (fast)",
      "provider": "openai",
      "model": "phi-3-mini",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    }
  ],
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed",
    "apiBase": "https://llmrack.com/v1",
    "apiKey": "rl_live_..."
  }
}
Cursor → Settings → Models → OpenAI API Key → enable Override OpenAI Base URL, set it to https://llmrack.com/v1, paste your LLMRack key. Add model names (e.g. llama-3.1-8b) in the custom models list.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    model="llama-3.1-8b",
    streaming=True,
)

# Works with every LangChain agent, chain, and graph (CrewAI, LangGraph, etc.).
for chunk in llm.stream("Summarize RAG in one paragraph."):
    print(chunk.content, end="", flush=True)
import os

from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding

Settings.llm = OpenAILike(
    model="llama-3.1-8b",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    is_chat_model=True,
)
Settings.embed_model = OpenAILikeEmbedding(
    model_name="nomic-embed",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)
Settings → LLM Provider → Generic OpenAI. Base URL https://llmrack.com/v1, API key rl_live_…, model llama-3.1-8b. For embeddings choose Generic OpenAI with model nomic-embed.
Full path from empty machine to "I typed a message and LLMRack answered." OpenClaw lets us register llmrack as a first-class provider, so your agents reference models as llmrack/llama-3.1-8b with no openai/ prefix visible anywhere.
# macOS / Linux
curl -fsSL https://openclaw.ai/install.sh | bash

# Get your LLMRack key
export LLMRACK_API_KEY="rl_live_..."   # from llmrack.com/dashboard/keys
Write this to ~/.openclaw/config.json5:
{
  agents: {
    defaults: {
      // Pick the default model for new agents. No "openai/" in sight.
      model: { primary: "llmrack/llama-3.1-8b" },
    },
  },
  models: {
    providers: {
      llmrack: {
        baseUrl: "https://llmrack.com/v1",
        apiKey: "${LLMRACK_API_KEY}",   // resolves from env
        // "api" names the WIRE PROTOCOL, not the vendor. OpenClaw supports a few
        // dialects (openai-completions, anthropic-messages, google-generateContent,
        // …). Our /v1/chat/completions endpoint implements the OpenAI Chat
        // Completions shape. Request: {model, messages[], stream, temperature,
        // tools, response_format, …}; response: {choices[0].message.{content,
        // tool_calls}, usage.{prompt_tokens, completion_tokens}}; streaming as
        // SSE with "data: [DONE]" terminator. So the correct value here is:
        api: "openai-completions",
        models: [
          { id: "llama-3.1-8b", name: "Llama 3.1 8B", contextWindow: 128000, maxTokens: 4096 },
          { id: "mistral-7b", name: "Mistral 7B", contextWindow: 32000, maxTokens: 4096 },
          { id: "qwen-2.5-7b", name: "Qwen 2.5 7B", contextWindow: 32000, maxTokens: 4096 },
          { id: "phi-3-mini", name: "Phi-3 Mini", contextWindow: 128000, maxTokens: 4096 },
          { id: "nomic-embed", name: "Nomic Embed", contextWindow: 8192, input: ["text"] },
        ],
      },
    },
  },
}
openclaw onboard --install-daemon   # one-time setup
openclaw gateway status             # verify it's live
openclaw dashboard   # opens the Control UI
# Type a message → reply streams back from llmrack/llama-3.1-8b.

# Or swap the default model live:
openclaw models set llmrack/phi-3-mini
you type in dashboard
↓
openclaw gateway (localhost:18789)
↓ matches model "llmrack/llama-3.1-8b" → provider config above
↓ POST https://llmrack.com/v1/chat/completions
↓ Authorization: Bearer $LLMRACK_API_KEY
↓ {"model":"llama-3.1-8b","messages":[...],"stream":true}
↓
llmrack backend → rate-limit gate → Ollama inference → SSE stream
↑
openclaw streams tokens back into the UI
Want to wire it into Slack / Discord / Telegram instead of the dashboard? Re-run the wizard: openclaw onboard, pick the channel, paste the channel bot token when prompted. The LLM leg (LLMRack) stays identical; only the inbound transport changes.
Reference: Model providers, Configuration reference, Channels.
Edit ~/.hermes/config.yaml and put secrets in ~/.hermes/.env:
# ~/.hermes/config.yaml
model:
  provider: custom
  base_url: "https://llmrack.com/v1"
  api_key: ${LLMRACK_API_KEY}
  model: "llama-3.1-8b"
# ~/.hermes/.env
LLMRACK_API_KEY=rl_live_...

# Alternative: use the OpenAI env vars Hermes also recognizes.
OPENAI_BASE_URL=https://llmrack.com/v1
OPENAI_API_KEY=rl_live_...
Paperclip is an orchestration plane — it doesn't call LLMs directly, it invokes adapters (Claude Code, OpenAI Codex, shell, HTTP, etc.). To wire an agent to LLMRack, use an adapter whose runtime does the LLM call. Two paths:
OPENAI_API_BASE / OPENAI_API_KEY. Set those in the adapter's environment and they'll flow through to LLMRack.
https://llmrack.com/v1/chat/completions with the Authorization: Bearer rl_live_… header in adapterConfig.
// paperclip agent adapterConfig (HTTP adapter)
{
  adapterType: "http",
  adapterConfig: {
    url: "https://llmrack.com/v1/chat/completions",
    method: "POST",
    headers: {
      "Authorization": "Bearer rl_live_...",
      "Content-Type": "application/json"
    },
    body: {
      model: "llama-3.1-8b",
      messages: [{ role: "user", content: "{{prompt}}" }]
    }
  }
}
See docs.paperclip.ing/adapters/overview for the full adapter list. If you need a dedicated OpenAI-style adapter, Paperclip is open-source (MIT) and adding one is straightforward.
export OPENAI_API_KEY=rl_live_...
export OPENAI_API_BASE=https://llmrack.com/v1
aider --model openai/llama-3.1-8b

# or for a fast helper:
aider --model openai/phi-3-mini
Agent Zero is configured from the Web UI, not a .env file.
http://localhost:50001) → Settings.
rl_live_… key.
llama-3.1-8b (or any from the catalog), and API URL to https://llmrack.com/v1.
phi-3-mini for speed) and Embeddings (nomic-embed).
import os

from autogen import AssistantAgent, UserProxyAgent

config_list = [{
    "model": "llama-3.1-8b",
    "api_key": os.environ["LLMRACK_API_KEY"],
    "base_url": "https://llmrack.com/v1",
    "api_type": "openai",
}]

assistant = AssistantAgent("assistant", llm_config={"config_list": config_list})
user = UserProxyAgent("user", code_execution_config=False)
user.initiate_chat(assistant, message="Plan a CPU benchmark run.")
CrewAI uses LangChain's ChatOpenAI underneath — pass it as the agent LLM:
import os

from langchain_openai import ChatOpenAI
from crewai import Agent, Task, Crew

llm = ChatOpenAI(
    model="llama-3.1-8b",
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)

researcher = Agent(role="Researcher", goal="Find sources", llm=llm)
crew = Crew(agents=[researcher], tasks=[Task(description="…", agent=researcher)])
crew.kickoff()
In any ChatOpenAI, OpenAI, or OpenAI Embeddings node: click the node → Additional Parameters → set BasePath to https://llmrack.com/v1, paste your key in the credential, and pick a model (llama-3.1-8b, phi-3-mini, etc.).
In the OpenAI or OpenAI Chat Model node, open the credential, enable Custom API base URL, set it to https://llmrack.com/v1, and paste your LLMRack key.
If you front multiple providers through LiteLLM, add LLMRack as a model backend. The user-facing model_name is whatever you want — downstream code only sees it:
# config.yaml
model_list:
  - model_name: llmrack/llama-3.1-8b      # how your apps refer to it
    litellm_params:
      model: openai/llama-3.1-8b          # LiteLLM's internal transport slug
      api_base: https://llmrack.com/v1
      api_key: os.environ/LLMRACK_API_KEY
  - model_name: llmrack/phi-3-mini
    litellm_params:
      model: openai/phi-3-mini
      api_base: https://llmrack.com/v1
      api_key: os.environ/LLMRACK_API_KEY

# Your apps then call LiteLLM with model="llmrack/llama-3.1-8b"; the "openai/"
# prefix never leaves config.yaml.
Any tool that accepts a custom OpenAI base URL works with LLMRack. The knob is usually labeled:
OPENAI_API_BASE, OPENAI_BASE_URL, or OPENAI_API_HOST (env var)
Set it to https://llmrack.com/v1, paste your rl_live_… key, pick a model from the catalog above. Done.
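If you don't know which of those variable names a tool reads, one option is to set all of them. The sketch below only builds the environment mapping and prints export lines. env_for_tool is a hypothetical helper, and it assumes that setting all three base-URL names at once is harmless for your tool.

```python
LLMRACK_BASE_URL = "https://llmrack.com/v1"
ENV_NAMES = ("OPENAI_API_BASE", "OPENAI_BASE_URL", "OPENAI_API_HOST")


def env_for_tool(api_key: str, names=ENV_NAMES) -> dict:
    """Environment entries that point an OpenAI-compatible tool at LLMRack."""
    env = {name: LLMRACK_BASE_URL for name in names}
    env["OPENAI_API_KEY"] = api_key
    return env


# "rl_live_..." is the same placeholder used throughout this page.
for name, value in sorted(env_for_tool("rl_live_...").items()):
    print(f"export {name}={value}")
```

To launch a tool with these values from Python, the mapping can be merged into os.environ and passed as the env argument of subprocess.run.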
If your tool has a quirk that isn't covered above, email stevenkingit02@gmail.com and I'll add it.
LLMRack's request/response shapes match OpenAI's, so any OpenAI client library works as a drop-in — point it at our base URL and pass an LLMRack key. This is optional; the plain-HTTP examples above are the canonical path.
# if you already have the openai package installed and want to reuse it
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)
r = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "hi"}],
)
Model names differ (llama-3.1-8b vs gpt-4o), but streaming, tool calls, and JSON mode all behave the same way.