LLMRack serves open-weight LLMs over a plain HTTP+JSON API at https://llmrack.com/v1. No SDK required — any HTTP client works.
rl_live_… (production) or rl_test_… (testing). Treat them like passwords; never commit them to git or ship them in client code.

export LLMRACK_API_KEY="rl_live_..."
Plain cURL, Python requests, or Node fetch — no client library needed. The code rail on the right shows all three. Here's cURL:
curl https://llmrack.com/v1/chat/completions \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-3-mini",
    "messages": [{"role": "user", "content": "hi"}]
  }'
Start with phi-3-mini — it's the fastest model on CPU. Full model list below.
Add "stream": true. The server returns Server-Sent Events — each data: line is one JSON chunk, and the stream terminates with data: [DONE].
import os, json, requests

with requests.post(
    "https://llmrack.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['LLMRACK_API_KEY']}"},
    json={
        "model": "phi-3-mini",
        "messages": [{"role": "user", "content": "Haiku about SSDs."}],
        "stream": True,
    },
    stream=True, timeout=60,
) as r:
    for raw in r.iter_lines():
        if not raw or not raw.startswith(b"data: "):
            continue
        payload = raw[6:]
        if payload == b"[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"].get("content", "")
        print(delta, end="", flush=True)
Text embeddings come from nomic-embed — 768 dimensions, 8k context. Use them for semantic search, RAG, clustering.
curl https://llmrack.com/v1/embeddings \
  -H "Authorization: Bearer $LLMRACK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nomic-embed",
    "input": ["the quick brown fox", "hello world"]
  }'
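To rank documents against a query with those vectors, cosine similarity is the usual metric. A minimal stdlib-only sketch (the `/v1/embeddings` call itself is shown above; here we assume you've already pulled the vectors out of the response's `data[i]["embedding"]` fields, the standard OpenAI-compatible shape):

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two vectors, divided by the product of their norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Rank candidate documents by similarity to a query embedding.
def rank(query_vec, doc_vecs):
    scored = [(cosine_similarity(query_vec, v), i) for i, v in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)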
All prices in USD per 1M tokens. Q4_K_M quantization (except Nomic, F16).
Live list at GET https://llmrack.com/v1/models (or the Models page).
Pass the API key as a bearer token:
Authorization: Bearer rl_live_...

Missing, malformed, revoked, or expired keys return 401 authentication_error.
Hitting a limit returns 429 with Retry-After (seconds). Upgrade at Dashboard → Billing.
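A client should honor that Retry-After header before retrying, falling back to exponential backoff when the header is absent. A stdlib-only sketch (the helper names `backoff_delay` and `post_with_retry` are illustrative, not part of the API):

```python
import json
import time
import urllib.error
import urllib.request

def backoff_delay(attempt, retry_after):
    # Prefer the server's Retry-After (seconds); otherwise back off exponentially, capped at 30s.
    if retry_after is not None:
        return float(retry_after)
    return min(2 ** attempt, 30.0)

def post_with_retry(url, headers, payload, max_attempts=5):
    for attempt in range(max_attempts):
        req = urllib.request.Request(
            url, data=json.dumps(payload).encode(), headers=headers, method="POST"
        )
        try:
            with urllib.request.urlopen(req, timeout=60) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as e:
            if e.code != 429:
                raise  # not a rate limit; surface the error
            time.sleep(backoff_delay(attempt, e.headers.get("Retry-After")))
    raise RuntimeError("still rate limited after retries")
```

The same pattern works with `requests`; only the header access changes (`r.headers.get("Retry-After")`).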
Anything that supports a custom OpenAI-compatible endpoint works with LLMRack. Configure the tool with:
Settings → Connections → OpenAI API. Set the endpoint to https://llmrack.com/v1 and paste your key. Models appear automatically.
Add a custom endpoint to librechat.yaml:
endpoints:
  custom:
    - name: "LLMRack"
      apiKey: "${LLMRACK_API_KEY}"
      baseURL: "https://llmrack.com/v1"
      models:
        default: ["llama-3.1-8b", "mistral-7b", "phi-3-mini", "qwen-2.5-7b"]
      titleModel: "phi-3-mini"
      iconURL: "https://llmrack.com/favicon.ico"
Edit ~/.continue/config.json:
{
  "models": [
    {
      "title": "LLMRack · Llama 3.1 8B",
      "provider": "openai",
      "model": "llama-3.1-8b",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    },
    {
      "title": "LLMRack · Phi-3 Mini (fast)",
      "provider": "openai",
      "model": "phi-3-mini",
      "apiBase": "https://llmrack.com/v1",
      "apiKey": "rl_live_..."
    }
  ],
  "embeddingsProvider": {
    "provider": "openai",
    "model": "nomic-embed",
    "apiBase": "https://llmrack.com/v1",
    "apiKey": "rl_live_..."
  }
}
Cursor → Settings → Models → OpenAI API Key → enable Override OpenAI Base URL, set it to https://llmrack.com/v1, paste your LLMRack key. Add model names (e.g. llama-3.1-8b) in the custom models list.
import os

from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    model="llama-3.1-8b",
    streaming=True,
)

# Works with every LangChain agent, chain, and graph — CrewAI, LangGraph, etc.
for chunk in llm.stream("Summarize RAG in one paragraph."):
    print(chunk.content, end="", flush=True)
import os

from llama_index.core import Settings
from llama_index.llms.openai_like import OpenAILike
from llama_index.embeddings.openai_like import OpenAILikeEmbedding

Settings.llm = OpenAILike(
    model="llama-3.1-8b",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
    is_chat_model=True,
)
Settings.embed_model = OpenAILikeEmbedding(
    model_name="nomic-embed",
    api_base="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)
Settings → LLM Provider → Generic OpenAI. Base URL https://llmrack.com/v1, API key rl_live_…, model llama-3.1-8b. For embeddings choose Generic OpenAI with model nomic-embed.
In the OpenAI or OpenAI Chat Model node, open the credential, enable Custom API base URL, set it to https://llmrack.com/v1, and paste your LLMRack key.
If the tool accepts a custom OpenAI base URL, LLMRack works. If the option is labeled differently ("OpenAI-compatible server", "OAI proxy", "custom endpoint", etc.), look for the place to override the URL — https://llmrack.com/v1 — then paste your rl_live_… key.
LLMRack's request/response shapes match OpenAI's, so any OpenAI client library works as a drop-in — point it at our base URL and pass an LLMRack key. This is optional; the plain-HTTP examples above are the canonical path.
# if you already have the openai package installed and want to reuse it
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://llmrack.com/v1",
    api_key=os.environ["LLMRACK_API_KEY"],
)
r = client.chat.completions.create(
    model="llama-3.1-8b",
    messages=[{"role": "user", "content": "hi"}],
)
Model names differ (llama-3.1-8b vs gpt-4o), but streaming, tool calls, and JSON mode all behave the same way.
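Since JSON mode follows OpenAI's convention, you request it with the `response_format` field and parse the returned message content as JSON. A sketch of the request body and the parsing step (the sample response string below is illustrative, not a captured API response; as with OpenAI, the prompt should explicitly ask for JSON):

```python
import json

# OpenAI-style JSON-mode request body.
payload = {
    "model": "llama-3.1-8b",
    "messages": [
        {"role": "user",
         "content": "Return a JSON object with keys city and country for Paris."}
    ],
    "response_format": {"type": "json_object"},
}

# In JSON mode the assistant message content is a JSON document,
# so choices[0].message.content parses directly. Illustrative example:
sample_content = '{"city": "Paris", "country": "France"}'
parsed = json.loads(sample_content)
```

Tool calls work the same way: pass a `tools` array in the request and read `tool_calls` off the returned message, exactly as with OpenAI.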