How to Cut OpenClaw API Costs with Ollama and Local Models

The Cost Problem
If you're self-hosting OpenClaw with cloud APIs like OpenAI or Gemini, you've probably noticed the bills adding up fast. We've seen users reporting $20+ per week on basic search and browsing tasks — and that's with moderate usage.
The reason is token volume. OpenClaw's tool-use architecture means a single web search can consume millions of tokens. The agent reasons about which tools to call, processes search results, summarizes content, and formats a response. Each step burns tokens. Multiply that by dozens of queries per day and you're looking at serious monthly spend.
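To see how the per-step token burn adds up, here is a back-of-envelope sketch. Every number in it is an illustrative assumption (token counts per step and the per-million-token price), not a measured OpenClaw figure:

```python
# Rough cost sketch for one agentic web-search query.
# All numbers are illustrative assumptions, not measured values.
tokens_per_step = {
    "tool_selection": 5_000,   # agent reasons about which tool to call
    "search_results": 20_000,  # raw results fed back into context
    "summarization": 10_000,   # condensing pages into an answer
    "formatting": 5_000,       # final response
}
price_per_million_input = 1.50  # USD, assumed mid-tier cloud rate

tokens_per_query = sum(tokens_per_step.values())
cost_per_query = tokens_per_query / 1_000_000 * price_per_million_input
monthly = cost_per_query * 50 * 30  # assuming ~50 queries/day

print(f"{tokens_per_query:,} tokens, ${cost_per_query:.3f}/query, ${monthly:.0f}/month")
```

Even with these modest assumptions the monthly figure lands squarely in the "pure cloud" range of the comparison table further down; heavier usage or pricier models push it higher.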
The good news: you have options. OpenClaw supports custom model providers, which means you can run local models for free, use cheaper cloud alternatives, or set up a hybrid approach that gives you the best of both worlds.
Important: Minimum Model Size
Before you get started, a critical caveat: OpenClaw requires models with strong tool-calling and reasoning capabilities. Its agentic architecture involves multi-step planning, tool selection, structured output parsing, and long context windows. Small models simply can't handle this reliably.
Minimum recommended sizes:
- 32B+ parameters — the practical minimum for reliable OpenClaw usage (e.g. DeepSeek R1 32B, Qwen 2.5 32B)
- 14B parameters — may work for simple tasks but will frequently fail on multi-step workflows
- 7–8B parameters — not recommended. Models like `llama3.1:8b` or `mistral:7b` lack the reasoning depth for OpenClaw's tool-use chains and will produce errors, hallucinated tool calls, or get stuck in loops
If you don't have the hardware to run 32B+ models locally, consider OpenRouter or the hybrid approach instead.
Option 1: Run Local Models with Ollama
Ollama is an open-source tool that lets you run large language models locally on your own hardware. It exposes an OpenAI-compatible API, which means OpenClaw can use it as a drop-in replacement for cloud models.
Installing Ollama
Install Ollama on the same machine as OpenClaw (or a separate machine — more on that later):
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Then pull a model. We recommend starting with a 32B+ model:
```shell
ollama pull qwen3:32b
```
Configuring OpenClaw to Use Ollama
In your `openclaw.json` config, add Ollama as a custom provider under `models.providers`:

```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "api": "openai-completions",
        "apiKey": "ollama",
        "models": [
          {
            "id": "qwen3:32b",
            "name": "Qwen 3 32B",
            "reasoning": true,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```
The key settings:
- `baseUrl` — points to Ollama's OpenAI-compatible endpoint (port 11434 by default, with the `/v1` path)
- `api` — must be `"openai-completions"` (the OpenAI Chat Completions API adapter)
- `apiKey` — required by OpenClaw even though Ollama doesn't need authentication. Use any placeholder value like `"ollama"`
- `models` — list the models you've pulled with `ollama pull`. Each model entry requires `id`, `name`, `reasoning`, `input`, `cost`, `contextWindow`, and `maxTokens`
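Because a provider entry with missing fields means another restart-and-debug round trip, it can be worth sanity-checking the config first. The helper below is a hypothetical script, not part of OpenClaw; it only checks that each model entry carries the required fields:

```python
# Sanity-check a provider's model entries before restarting OpenClaw.
# Hypothetical helper, not part of OpenClaw itself.
REQUIRED_MODEL_FIELDS = {"id", "name", "reasoning", "input",
                         "cost", "contextWindow", "maxTokens"}

def check_provider_models(config, provider):
    """Return a list of problems found in a provider's model entries."""
    problems = []
    entries = (config.get("models", {})
                     .get("providers", {})
                     .get(provider, {})
                     .get("models", []))
    if not entries:
        problems.append(f"provider '{provider}' defines no models")
    for entry in entries:
        missing = REQUIRED_MODEL_FIELDS - entry.keys()
        if missing:
            problems.append(f"{entry.get('id', '?')}: missing {sorted(missing)}")
    return problems

# Sample config mirroring the JSON above (as a Python dict).
config = {
    "models": {"providers": {"ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "api": "openai-completions",
        "apiKey": "ollama",
        "models": [{
            "id": "qwen3:32b", "name": "Qwen 3 32B", "reasoning": True,
            "input": ["text"],
            "cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
            "contextWindow": 32768, "maxTokens": 8192,
        }],
    }}}
}

print(check_provider_models(config, "ollama") or "config looks ok")
```

To check your real config, load it with `json.load(open("openclaw.json"))` instead of the inline dict.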
Setting a Local Model as Default
To make OpenClaw use your local model by default instead of a cloud API, configure `agents.defaults`:

```json
{
  "agents": {
    "defaults": {
      "model": "ollama/qwen3:32b"
    }
  }
}
```
Now every new conversation starts with your local model — zero API cost.
Hardware Requirements for Local Models
Local models run on your CPU or GPU. The limiting factor is almost always memory — the model needs to fit entirely in RAM (or VRAM for GPU inference).
| RAM / VRAM | Model Size | OpenClaw Compatibility | Examples |
|---|---|---|---|
| 8 GB | 7B parameters | Not compatible — too small for tool-use | Llama 3.1 8B, Mistral 7B |
| 16 GB | 14B parameters | Limited — simple tasks only | Qwen 2.5 14B |
| 32 GB+ | 32B+ parameters | Recommended — reliable for most tasks | DeepSeek R1 32B, Qwen 3 32B |
| 48 GB+ VRAM (GPU) | 70B+ parameters | Excellent — comparable to mid-tier cloud models | Llama 3.1 70B, Mixtral 8x22B |
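You can check the table against your own hardware with the arithmetic behind it: weight memory is roughly parameters times bytes per parameter, plus overhead for the KV cache and runtime. The 4-bit default and the 20% overhead factor below are rough assumptions (Ollama typically pulls 4-bit quantized weights by default):

```python
def approx_memory_gb(params_billion, bits_per_param=4, overhead=1.2):
    """Rough estimate: quantized weights plus ~20% for KV cache and runtime."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

for size in (8, 14, 32, 70):
    print(f"{size}B @ 4-bit: about {approx_memory_gb(size):.0f} GB")
```

This lines up with the table: a 4-bit 32B model needs roughly 19 GB (fits in 32 GB RAM or a 24 GB GPU), while a 70B model needs around 42 GB, which is why 48 GB+ of VRAM is the threshold for running it without heavy quantization.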
GPU Recommendations
Running models on a GPU is 5–10x faster than CPU inference. If you're serious about local models:
- RTX 3090 (24 GB VRAM) — runs 14B models at full speed, 32B models with quantization
- RTX 4090 (24 GB VRAM) — same capacity, faster inference
- Dual GPUs or 48 GB+ VRAM — needed for 70B+ models without heavy quantization
CPU-only inference works but expect slower response times — around 5–15 tokens per second depending on your hardware and model size.
Running Ollama on a Separate Machine
Your VPS probably doesn't have a GPU. But your desktop at home might. You can run Ollama on a home machine with a GPU and connect your VPS to it securely using Tailscale.
Setup
- Install Tailscale on both your VPS and your home machine
- Install Ollama on your home machine (the one with the GPU)
- Start Ollama with network access enabled:

```shell
OLLAMA_HOST=0.0.0.0 ollama serve
```
- Update OpenClaw config on your VPS to point to the Tailscale IP:
```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://100.x.x.x:11434/v1",
        "api": "openai-completions",
        "apiKey": "ollama",
        "models": [
          {
            "id": "deepseek-r1:32b",
            "name": "DeepSeek R1 32B",
            "reasoning": true,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 65536,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```
Replace `100.x.x.x` with your home machine's Tailscale IP. The connection is encrypted and doesn't require opening any ports on your home network.
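One caveat: if you installed Ollama with the install script on Linux, it runs as a systemd service, so an `OLLAMA_HOST` variable set in an interactive shell won't survive a reboot. A systemd override keeps the setting persistent (this follows the approach in the Ollama FAQ; service and editor details may differ on your distro):

```shell
# Open an override file for the ollama systemd service
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
# Then reload and restart so the setting takes effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```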
Option 2: Use Cheaper Cloud Models via OpenRouter
Not everyone has GPU hardware at home. OpenRouter aggregates dozens of AI models and lets you pay per token — often at a fraction of the cost of direct API access.
Models like Gemini Flash, Llama 3.1 70B, and Mistral Large are available at significantly lower rates than GPT or Claude, and work well for routine OpenClaw tasks.
OpenRouter Config
```json
{
  "models": {
    "providers": {
      "openrouter": {
        "baseUrl": "https://openrouter.ai/api/v1",
        "api": "openai-completions",
        "apiKey": "sk-or-your-key-here",
        "models": [
          {
            "id": "google/gemini-2.0-flash-001",
            "name": "Gemini 2.0 Flash",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0.1, "output": 0.4, "cacheRead": 0.025, "cacheWrite": 0.1 },
            "contextWindow": 1048576,
            "maxTokens": 8192
          },
          {
            "id": "meta-llama/llama-3.1-70b-instruct",
            "name": "Llama 3.1 70B",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0.39, "output": 0.39, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 131072,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```
OpenRouter pricing varies by model, but expect to pay 50–90% less than equivalent OpenAI or Anthropic models for routine tasks.
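As a concrete sanity check, here is the arithmetic for one hypothetical query (40,000 input tokens, 2,000 output tokens) at the per-million-token rates shown in the config above. Rates change frequently, so treat the numbers as a worked example, not a quote:

```python
def query_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD for one query; prices are per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1e6

usage = (40_000, 2_000)  # assumed input/output tokens per query

flash = query_cost(*usage, 0.10, 0.40)  # Gemini 2.0 Flash rates
llama = query_cost(*usage, 0.39, 0.39)  # Llama 3.1 70B rates

print(f"Gemini Flash: ${flash:.4f}/query")  # $0.0048/query
print(f"Llama 70B:    ${llama:.4f}/query")  # $0.0164/query
```

At half a cent per query, even 50 queries a day stays in the single-digit dollars per month for Gemini Flash, which is how the OpenRouter row in the cost table below gets so low.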
Option 3: The Hybrid Approach
The smartest strategy combines local and cloud models. Use cheap or free models for everyday tasks, and reserve expensive cloud models for when you actually need them.
How It Works
OpenClaw lets you assign different models to different agents. The idea:
- Routine tasks (web search, summarization, formatting) → Ollama local model or cheap OpenRouter model
- Complex reasoning (code generation, multi-step analysis, creative writing) → GPT, Claude, or Gemini Pro
Configure per-agent model selection in your `openclaw.json`:

```json
{
  "agents": {
    "defaults": {
      "model": "ollama/qwen3:32b"
    },
    "overrides": {
      "coder": { "model": "anthropic/claude-sonnet-4-5-20250929" },
      "researcher": { "model": "openrouter/google/gemini-2.0-flash-001" }
    }
  }
}
```
This way, your default agent uses a free local model, but specialized agents like the coder can use a more capable cloud model when the task demands it.
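The override mechanics boil down to a map lookup with a fallback. The sketch below mirrors the config above in plain Python; it is a hypothetical illustration of the selection logic, not OpenClaw's actual implementation:

```python
def resolve_model(agent, agents_config):
    """Pick the model for an agent: explicit override, else the default."""
    overrides = agents_config.get("overrides", {})
    if agent in overrides:
        return overrides[agent]["model"]
    return agents_config["defaults"]["model"]

# Mirrors the "agents" section of the config above.
agents_config = {
    "defaults": {"model": "ollama/qwen3:32b"},
    "overrides": {
        "coder": {"model": "anthropic/claude-sonnet-4-5-20250929"},
        "researcher": {"model": "openrouter/google/gemini-2.0-flash-001"},
    },
}

print(resolve_model("coder", agents_config))  # cloud model for code tasks
print(resolve_model("chat", agents_config))   # falls back to the free local default
```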
Cost Comparison
Here's what typical monthly costs look like for moderate usage (~50 queries/day) across different setups:
| Setup | AI Model Cost | Infrastructure | Total Monthly |
|---|---|---|---|
| Pure cloud (GPT / Gemini Pro) | $60–120 | $5–10 VPS | $65–130 |
| OpenRouter (Gemini Flash / Llama) | $10–30 | $5–10 VPS | $15–40 |
| Pure local (Ollama 32B+) | $0 | $5–10 VPS + electricity | $5–15 |
| Hybrid (local default + cloud for complex) | $10–25 | $5–10 VPS | $15–35 |
The difference is dramatic. A hybrid setup can cut your monthly AI spend by 70–85% compared to using cloud APIs exclusively.
The Trade-offs
Local models don't eliminate costs so much as trade money for other things:
- Hardware requirements — you need 32 GB+ RAM or a decent GPU for OpenClaw-compatible models
- Slower responses compared to cloud APIs, especially on CPU-only hardware
- Lower quality on complex reasoning tasks compared to frontier cloud models
- More setup work and occasional troubleshooting
- Power consumption if running a GPU 24/7
For many users, the hybrid approach hits the sweet spot: fast and cheap for routine work, high quality when it matters.
Getting Started
- Check your hardware — you need 32 GB+ RAM or a GPU with 24 GB+ VRAM for reliable results
- Install Ollama and pull a 32B+ model: `ollama pull qwen3:32b` or `ollama pull deepseek-r1:32b`
- Update your config — add the provider with `"api": "openai-completions"` and a placeholder `apiKey`
- Evaluate quality — run your typical tasks and see if the output is good enough
- Go hybrid — configure per-agent models once you know which tasks need cloud quality
The OpenClaw community on Discord is a great place to share configs and get recommendations for which models work best for specific tasks.
Want the convenience of managed hosting without the server management? ClawNest handles infrastructure, updates, and backups so you can focus on configuring your AI assistants. Start with a free 3-day trial — no credit card required.