How to Cut OpenClaw API Costs with Ollama and Local Models

The Cost Problem
If you're self-hosting OpenClaw with cloud APIs like OpenAI or Gemini, you've probably noticed the bills adding up fast. We've seen users reporting $20+ per week on basic search and browsing tasks — and that's with moderate usage.
The reason is token volume. OpenClaw's tool-use architecture means a single web search can consume millions of tokens. The agent reasons about which tools to call, processes search results, summarizes content, and formats a response. Each step burns tokens. Multiply that by dozens of queries per day and you're looking at serious monthly spend.
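To see how the per-step token burn adds up, here is a back-of-envelope sketch. Every number in it is an illustrative assumption (token counts per step and the per-million-token price), not a measured OpenClaw figure:

```python
# Rough cost sketch for one agentic web-search query.
# All numbers are illustrative assumptions, not measured values.
tokens_per_step = {
    "tool_selection": 5_000,   # agent reasons about which tool to call
    "search_results": 20_000,  # raw results fed back into context
    "summarization": 10_000,   # condensing pages into an answer
    "formatting": 5_000,       # final response
}
price_per_million_input = 1.50  # USD, assumed mid-tier cloud rate

tokens_per_query = sum(tokens_per_step.values())
cost_per_query = tokens_per_query / 1_000_000 * price_per_million_input
monthly = cost_per_query * 50 * 30  # assuming ~50 queries/day

print(f"{tokens_per_query:,} tokens, ${cost_per_query:.3f}/query, ${monthly:.0f}/month")
```

Even with these modest assumptions the monthly figure lands squarely in the "pure cloud" range of the comparison table further down; heavier usage or pricier models push it higher.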
The good news: you have options. OpenClaw supports custom model providers, which means you can run local models for free, use cheaper cloud alternatives, or set up a hybrid approach that gives you the best of both worlds.
Important: Minimum Model Size
Before you get started, a critical caveat: OpenClaw requires models with strong tool-calling and reasoning capabilities. Its agentic architecture involves multi-step planning, tool selection, structured output parsing, and long context windows. Small models simply can't handle this reliably.
Minimum recommended sizes:
- 32B+ parameters — the practical minimum for reliable OpenClaw usage (e.g. DeepSeek R1 32B, Qwen 2.5 32B)
- 14B parameters — may work for simple tasks but will frequently fail on multi-step workflows
- 7–8B parameters — not recommended. Models like `llama3.1:8b` or `mistral:7b` lack the reasoning depth for OpenClaw's tool-use chains and will produce errors, hallucinated tool calls, or get stuck in loops
If you don't have the hardware to run 32B+ models locally, consider OpenRouter or the hybrid approach instead.
Option 1: Run Local Models with Ollama
Ollama is an open-source tool that lets you run large language models locally on your own hardware. It exposes an OpenAI-compatible API, which means OpenClaw can use it as a drop-in replacement for cloud models.
Installing Ollama
Install Ollama on the same machine as OpenClaw (or a separate machine — more on that later):
```shell
curl -fsSL https://ollama.com/install.sh | sh
```
Then pull a model. We recommend starting with a 32B+ model:
```shell
ollama pull qwen3:32b
```
Configuring OpenClaw to Use Ollama
In your `openclaw.json` config, add Ollama as a custom provider under `models.providers`:

```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "api": "openai-completions",
        "apiKey": "ollama",
        "models": [
          {
            "id": "qwen3:32b",
            "name": "Qwen 3 32B",
            "reasoning": true,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 32768,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```
The key settings:
- `baseUrl` — points to Ollama's OpenAI-compatible endpoint (port 11434 by default, with the `/v1` path)
- `api` — must be `"openai-completions"` (the OpenAI Chat Completions API adapter)
- `apiKey` — required by OpenClaw even though Ollama doesn't need authentication. Use any placeholder value like `"ollama"`
- `models` — list the models you've pulled with `ollama pull`. Each model entry requires `id`, `name`, `reasoning`, `input`, `cost`, `contextWindow`, and `maxTokens`
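Because a provider entry with missing fields means another restart-and-debug round trip, it can be worth sanity-checking the config first. The helper below is a hypothetical script, not part of OpenClaw; it only checks that each model entry carries the required fields:

```python
# Sanity-check a provider's model entries before restarting OpenClaw.
# Hypothetical helper, not part of OpenClaw itself.
REQUIRED_MODEL_FIELDS = {"id", "name", "reasoning", "input",
                         "cost", "contextWindow", "maxTokens"}

def check_provider_models(config, provider):
    """Return a list of problems found in a provider's model entries."""
    problems = []
    entries = (config.get("models", {})
                     .get("providers", {})
                     .get(provider, {})
                     .get("models", []))
    if not entries:
        problems.append(f"provider '{provider}' defines no models")
    for entry in entries:
        missing = REQUIRED_MODEL_FIELDS - entry.keys()
        if missing:
            problems.append(f"{entry.get('id', '?')}: missing {sorted(missing)}")
    return problems

# Sample config mirroring the JSON above (as a Python dict).
config = {
    "models": {"providers": {"ollama": {
        "baseUrl": "http://localhost:11434/v1",
        "api": "openai-completions",
        "apiKey": "ollama",
        "models": [{
            "id": "qwen3:32b", "name": "Qwen 3 32B", "reasoning": True,
            "input": ["text"],
            "cost": {"input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0},
            "contextWindow": 32768, "maxTokens": 8192,
        }],
    }}}
}

print(check_provider_models(config, "ollama") or "config looks ok")
```

To check your real config, load it with `json.load(open("openclaw.json"))` instead of the inline dict.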
Setting a Local Model as Default
To make OpenClaw use your local model by default instead of a cloud API, configure `agents.defaults`:

```json
{
  "agents": {
    "defaults": {
      "model": "ollama/qwen3:32b"
    }
  }
}
```
Now every new conversation starts with your local model — zero API cost.
Hardware Requirements for Local Models
Local models run on your CPU or GPU. The limiting factor is almost always memory — the model needs to fit entirely in RAM (or VRAM for GPU inference).
| RAM / VRAM | Model Size | OpenClaw Compatibility | Examples |
|---|---|---|---|
| 8 GB | 7B parameters | Not compatible — too small for tool-use | Llama 3.1 8B, Mistral 7B |
| 16 GB | 14B parameters | Limited — simple tasks only | Qwen 2.5 14B |
| 32 GB+ | 32B+ parameters | Recommended — reliable for most tasks | DeepSeek R1 32B, Qwen 3 32B |
| 48 GB+ VRAM (GPU) | 70B+ parameters | Excellent — comparable to mid-tier cloud models | Llama 3.1 70B, Mixtral 8x22B |
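You can check the table against your own hardware with the arithmetic behind it: weight memory is roughly parameters times bytes per parameter, plus overhead for the KV cache and runtime. The 4-bit default and the 20% overhead factor below are rough assumptions (Ollama typically pulls 4-bit quantized weights by default):

```python
def approx_memory_gb(params_billion, bits_per_param=4, overhead=1.2):
    """Rough estimate: quantized weights plus ~20% for KV cache and runtime."""
    weight_bytes = params_billion * 1e9 * bits_per_param / 8
    return weight_bytes * overhead / 1e9

for size in (8, 14, 32, 70):
    print(f"{size}B @ 4-bit: about {approx_memory_gb(size):.0f} GB")
```

This lines up with the table: a 4-bit 32B model needs roughly 19 GB (fits in 32 GB RAM or a 24 GB GPU), while a 70B model needs around 42 GB, which is why 48 GB+ of VRAM is the threshold for running it without heavy quantization.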
GPU Recommendations
Running models on a GPU is 5–10x faster than CPU inference. If you're serious about local models:
- RTX 3090 (24 GB VRAM) — runs 14B models at full speed, 32B models with quantization
- RTX 4090 (24 GB VRAM) — same capacity, faster inference
- Dual GPUs or 48 GB+ VRAM — needed for 70B+ models without heavy quantization
CPU-only inference works but expect slower response times — around 5–15 tokens per second depending on your hardware and model size.
Running Ollama on a Separate Machine
Your VPS probably doesn't have a GPU. But your desktop at home might. You can run Ollama on a home machine with a GPU and connect your VPS to it securely using Tailscale.
Setup
- Install Tailscale on both your VPS and your home machine
- Install Ollama on your home machine (the one with the GPU)
- Start Ollama with network access enabled:

```shell
OLLAMA_HOST=0.0.0.0 ollama serve
```
- Update OpenClaw config on your VPS to point to the Tailscale IP:
```json
{
  "models": {
    "providers": {
      "ollama": {
        "baseUrl": "http://100.x.x.x:11434/v1",
        "api": "openai-completions",
        "apiKey": "ollama",
        "models": [
          {
            "id": "deepseek-r1:32b",
            "name": "DeepSeek R1 32B",
            "reasoning": true,
            "input": ["text"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 65536,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```
Replace `100.x.x.x` with your home machine's Tailscale IP. The connection is encrypted and doesn't require opening any ports on your home network.
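One caveat: if you installed Ollama with the install script on Linux, it runs as a systemd service, so an `OLLAMA_HOST` variable set in an interactive shell won't survive a reboot. A systemd override keeps the setting persistent (this follows the approach in the Ollama FAQ; service and editor details may differ on your distro):

```shell
# Open an override file for the ollama systemd service
sudo systemctl edit ollama.service
# In the editor that opens, add:
#   [Service]
#   Environment="OLLAMA_HOST=0.0.0.0"
# Then reload and restart so the setting takes effect:
sudo systemctl daemon-reload
sudo systemctl restart ollama
```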
Option 2: Use Cheaper Cloud Models via OpenRouter
Not everyone has GPU hardware at home. OpenRouter aggregates dozens of AI models and lets you pay per token — often at a fraction of the cost of direct API access.
Models like Gemini Flash, Llama 3.1 70B, and Mistral Large are available at significantly lower rates than GPT or Claude, and work well for routine OpenClaw tasks.
OpenRouter Config
```json
{
  "models": {
    "providers": {
      "openrouter": {
        "baseUrl": "https://openrouter.ai/api/v1",
        "api": "openai-completions",
        "apiKey": "sk-or-your-key-here",
        "models": [
          {
            "id": "google/gemini-2.0-flash-001",
            "name": "Gemini 2.0 Flash",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0.1, "output": 0.4, "cacheRead": 0.025, "cacheWrite": 0.1 },
            "contextWindow": 1048576,
            "maxTokens": 8192
          },
          {
            "id": "meta-llama/llama-3.1-70b-instruct",
            "name": "Llama 3.1 70B",
            "reasoning": false,
            "input": ["text"],
            "cost": { "input": 0.39, "output": 0.39, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 131072,
            "maxTokens": 8192
          }
        ]
      }
    }
  }
}
```
OpenRouter pricing varies by model, but expect to pay 50–90% less than equivalent OpenAI or Anthropic models for routine tasks.
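As a concrete sanity check, here is the arithmetic for one hypothetical query (40,000 input tokens, 2,000 output tokens) at the per-million-token rates shown in the config above. Rates change frequently, so treat the numbers as a worked example, not a quote:

```python
def query_cost(in_tokens, out_tokens, in_price, out_price):
    """Cost in USD for one query; prices are per million tokens."""
    return (in_tokens * in_price + out_tokens * out_price) / 1e6

usage = (40_000, 2_000)  # assumed input/output tokens per query

flash = query_cost(*usage, 0.10, 0.40)  # Gemini 2.0 Flash rates
llama = query_cost(*usage, 0.39, 0.39)  # Llama 3.1 70B rates

print(f"Gemini Flash: ${flash:.4f}/query")  # $0.0048/query
print(f"Llama 70B:    ${llama:.4f}/query")  # $0.0164/query
```

At half a cent per query, even 50 queries a day stays in the single-digit dollars per month for Gemini Flash, which is how the OpenRouter row in the cost table below gets so low.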
Option 3: The Hybrid Approach
The smartest strategy combines local and cloud models. Use cheap or free models for everyday tasks, and reserve expensive cloud models for when you actually need them.
How It Works
OpenClaw lets you assign different models to different agents. The idea:
- Routine tasks (web search, summarization, formatting) → Ollama local model or cheap OpenRouter model
- Complex reasoning (code generation, multi-step analysis, creative writing) → GPT, Claude, or Gemini Pro
Configure per-agent model selection in your `openclaw.json`:

```json
{
  "agents": {
    "defaults": {
      "model": "ollama/qwen3:32b"
    },
    "overrides": {
      "coder": { "model": "anthropic/claude-sonnet-4-5-20250929" },
      "researcher": { "model": "openrouter/google/gemini-2.0-flash-001" }
    }
  }
}
```
This way, your default agent uses a free local model, but specialized agents like the coder can use a more capable cloud model when the task demands it.
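The override mechanics boil down to a map lookup with a fallback. The sketch below mirrors the config above in plain Python; it is a hypothetical illustration of the selection logic, not OpenClaw's actual implementation:

```python
def resolve_model(agent, agents_config):
    """Pick the model for an agent: explicit override, else the default."""
    overrides = agents_config.get("overrides", {})
    if agent in overrides:
        return overrides[agent]["model"]
    return agents_config["defaults"]["model"]

# Mirrors the "agents" section of the config above.
agents_config = {
    "defaults": {"model": "ollama/qwen3:32b"},
    "overrides": {
        "coder": {"model": "anthropic/claude-sonnet-4-5-20250929"},
        "researcher": {"model": "openrouter/google/gemini-2.0-flash-001"},
    },
}

print(resolve_model("coder", agents_config))  # cloud model for code tasks
print(resolve_model("chat", agents_config))   # falls back to the free local default
```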
Cost Comparison
Here's what typical monthly costs look like for moderate usage (~50 queries/day) across different setups:
| Setup | AI Model Cost | Infrastructure | Total Monthly |
|---|---|---|---|
| Pure cloud (GPT / Gemini Pro) | $60–120 | $5–10 VPS | $65–130 |
| OpenRouter (Gemini Flash / Llama) | $10–30 | $5–10 VPS | $15–40 |
| Pure local (Ollama 32B+) | $0 | $5–10 VPS + electricity | $5–15 |
| Hybrid (local default + cloud for complex) | $10–25 | $5–10 VPS | $15–35 |
The difference is dramatic. A hybrid setup can cut your monthly AI spend by 70–85% compared to using cloud APIs exclusively.
The Trade-offs
Local models don't eliminate costs so much as trade money for other things:
- Hardware requirements — you need 32 GB+ RAM or a decent GPU for OpenClaw-compatible models
- Slower responses compared to cloud APIs, especially on CPU-only hardware
- Lower quality on complex reasoning tasks compared to frontier cloud models
- More setup work and occasional troubleshooting
- Power consumption if running a GPU 24/7
For many users, the hybrid approach hits the sweet spot: fast and cheap for routine work, high quality when it matters.
Getting Started
- Check your hardware — you need 32 GB+ RAM or a GPU with 24 GB+ VRAM for reliable results
- Install Ollama and pull a 32B+ model: `ollama pull qwen3:32b` or `ollama pull deepseek-r1:32b`
- Update your config — add the provider with `"api": "openai-completions"` and a placeholder `apiKey`
- Evaluate quality — run your typical tasks and see if the output is good enough
- Go hybrid — configure per-agent models once you know which tasks need cloud quality
The OpenClaw community on Discord is a great place to share configs and get recommendations for which models work best for specific tasks.
Want the convenience of managed hosting without the server management? ClawNest handles infrastructure, updates, and backups so you can focus on configuring your AI assistants. Start with a free 3-day trial — no credit card required.