OpenClaw + Ollama: How to Run Your AI Agent With Free Local LLMs
Guides

ClawHosters by Daniel Samer
6 min read

Zero API costs. Your data never leaves your machine. And since Ollama 0.17 shipped in February 2026, the setup takes one command. That's the pitch for running OpenClaw with a local LLM, and it's mostly true.

Mostly. There are two gotchas that will waste your afternoon if nobody warns you first.

One Command to Start

Ollama 0.17 introduced native OpenClaw support. If you've got Ollama installed, this is it:

```shell
ollama launch openclaw --model qwen3-coder:32b
```

That pulls the model, configures the connection, and starts OpenClaw pointed at your local Ollama instance. No API key. No account. No cloud.

For users who want more control, the Ollama integration docs cover manual configuration with JSON config files and Docker setups.

Pick the Right Model (This Actually Matters)

Not every model in the Ollama library works with OpenClaw. The reason? Tool calling. OpenClaw agents don't just chat. They read files, run shell commands, and call APIs. Models without reliable tool calling support turn your agent into a chatbot that can't do anything.

Here's what actually works, based on community benchmarks:

| VRAM | Model | What to expect |
| --- | --- | --- |
| 8GB | qwen3:8b | Barely usable. Simple tasks only. |
| 16GB | qwen2.5-coder:14b | Decent for routine work. |
| 24GB | qwen3-coder:32b | The sweet spot. Recommended. |
| 48GB+ | llama3.3:70b | Near cloud quality. |
| Mac, 32GB unified | qwen3-coder:32b | Excellent on Apple Silicon. |

I'd skip the 8B models unless you're just testing the waters. Start at 14B minimum, and if you can run 32B, do that.
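The table above maps cleanly to a small helper. This is just a sketch of that mapping, using the same thresholds and model names as the community-benchmark table; `pick_model` is a hypothetical function name, not part of any tool:

```python
def pick_model(vram_gb: float, unified_memory: bool = False) -> str:
    """Suggest an Ollama model for OpenClaw based on the VRAM table above."""
    if unified_memory and vram_gb >= 32:
        return "qwen3-coder:32b"    # excellent on Apple Silicon
    if vram_gb >= 48:
        return "llama3.3:70b"       # near cloud quality
    if vram_gb >= 24:
        return "qwen3-coder:32b"    # the sweet spot
    if vram_gb >= 16:
        return "qwen2.5-coder:14b"  # decent for routine work
    if vram_gb >= 8:
        return "qwen3:8b"           # barely usable; simple tasks only
    raise ValueError("Under 8GB VRAM: local agent use isn't practical")

print(pick_model(24))  # qwen3-coder:32b
```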

Gotcha #1: The Context Window Trap

This one catches almost everyone.

Ollama defaults to a 4,096-token context window. OpenClaw needs at least 16,000 tokens, and the official docs recommend 64,000. Without fixing this, your agent silently loses context: it looks like it's working and responds to your messages, but has no memory of what happened ten minutes ago.

The fix: create a Modelfile.

```
FROM qwen3-coder:32b
PARAMETER num_ctx 32768
```

Then build it:

```shell
ollama create qwen3-coder-32k -f Modelfile
```
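If you'd rather script both steps, here's the same thing as a copy-paste block. The `ollama` commands are left as comments since they need a running Ollama daemon:

```shell
# Write the Modelfile from the section above
cat > Modelfile <<'EOF'
FROM qwen3-coder:32b
PARAMETER num_ctx 32768
EOF

# With Ollama running, build the custom model:
#   ollama create qwen3-coder-32k -f Modelfile
# and confirm the parameter took:
#   ollama show qwen3-coder-32k
```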

Or use the native Ollama API ("api": "ollama") instead of the OpenAI-compatible endpoint. The native API handles context settings correctly. The OpenAI-compatible endpoint at /v1 has documented issues with context truncation.
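As a sketch of the native-API route: the `"api": "ollama"` key comes from the docs, but the surrounding structure and the `baseUrl` value below are assumptions that may differ by OpenClaw version (11434 is Ollama's default port):

```json
{
  "models": {
    "qwen3-coder-32k": {
      "api": "ollama",
      "baseUrl": "http://localhost:11434"
    }
  }
}
```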

Gotcha #2: Tool Calls Disappearing

OpenClaw sends stream: true to all models by default. Ollama's streaming implementation doesn't properly return tool call chunks. So the model decides to read a file or run a command, but that decision vanishes. You get a text response and nothing happens.

The latest OpenClaw versions auto-detect Ollama and disable streaming for tool calls. If you're on an older version, add this to your model config:

```
"params": { "streaming": false }
```
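Where exactly that block lives depends on your OpenClaw version's config layout. As a sketch, assuming a per-model entry keyed by the model name:

```json
{
  "models": {
    "ollama/qwen3-coder:32b": {
      "params": { "streaming": false }
    }
  }
}
```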

Problem gone. GitHub issue #5769 has the full technical details if you're curious why streaming and tool calling don't play nice with Ollama.

Performance: What to Honestly Expect

An RTX 4090 running a 32B model generates around 55 tokens per second. A Mac M3 Max with 32GB unified memory hits roughly 35 tokens per second, according to independent benchmarks by Till Freitag. That's fast enough for most agent tasks, but noticeably slower than cloud models for long, complex operations.
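To put those throughput numbers in wall-clock terms, here's the simple arithmetic on the figures above (the 500-token response size is an illustrative assumption for a short agent step):

```python
def generation_time(tokens: int, tokens_per_second: float) -> float:
    """Seconds to generate a response at a given throughput."""
    return tokens / tokens_per_second

# A ~500-token response:
print(round(generation_time(500, 55), 1))  # RTX 4090 @ 55 tok/s -> 9.1s
print(round(generation_time(500, 35), 1))  # M3 Max  @ 35 tok/s -> 14.3s
```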

Hardware break-even versus cloud API costs? Somewhere between 7 and 15 months depending on your setup and usage. If you're running agents heavily, local pays for itself. If you use it a few times a week, cloud APIs through free LLM tiers or a managed host are probably cheaper.
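The break-even math itself is simple division. The hardware and API prices below are illustrative placeholders (not quotes), chosen to show how the 7-to-15-month range falls out:

```python
def breakeven_months(hardware_cost: float, monthly_api_spend: float) -> float:
    """Months until local hardware pays for itself vs. cloud API spend."""
    return hardware_cost / monthly_api_spend

# e.g. a $2,000 GPU vs. heavy usage ($280/mo) or moderate usage ($135/mo)
print(round(breakeven_months(2000, 280), 1))  # 7.1 months
print(round(breakeven_months(2000, 135), 1))  # 14.8 months
```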

The Hybrid Approach

The smartest setup I've seen? Local for the 80% of tasks that are routine, cloud for the 20% that need serious reasoning. OpenClaw's config supports model routing:

```json
{
  "model": {
    "primary": "ollama/qwen3-coder:32b",
    "fallbacks": ["anthropic/claude-sonnet-4-20250514"]
  }
}
```

Analysis by LaoZhang AI shows this hybrid approach cuts costs by 55-67% compared to running everything through a cloud provider. That's real money if you're a heavy user.
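Applied to a concrete bill, the savings range looks like this. The $200/month cloud-only baseline is an illustrative assumption; the 55-67% range is from the analysis cited above:

```python
def hybrid_cost(cloud_only_monthly: float, savings_pct: float) -> float:
    """Monthly spend under the hybrid setup, given a savings percentage."""
    return cloud_only_monthly * (1 - savings_pct / 100)

# On an illustrative $200/month cloud-only baseline:
print(round(hybrid_cost(200, 55), 2))  # 90.0  (55% savings)
print(round(hybrid_cost(200, 67), 2))  # 66.0  (67% savings)
```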

For more ways to reduce your API spend, check our token cost optimization guide.

Or Just Skip the Setup

All of the above assumes you want to manage models, configure context windows, and debug tool calling issues yourself. Some people enjoy that. Others would rather their AI agent just work.

That's what ClawHosters does. We handle the hosting, model selection, and configuration. Plans start at $19/month. No hardware required, no debugging context windows, and you can always connect your own Ollama instance later if you want the hybrid approach. See the self-hosted vs managed comparison if you're weighing the options.

Frequently Asked Questions

**Is running OpenClaw with Ollama actually free?**

Yes. Ollama is free, OpenClaw is open source, and local models have no per-token charges. Your only cost is the hardware you already own. The `ollama launch openclaw` command handles the full setup.

**Which local model works best with OpenClaw?**

qwen3-coder:32b if you have 24GB+ VRAM or 32GB unified memory on a Mac. It handles tool calling reliably and generates code at roughly 92% on the HumanEval benchmark. For tighter hardware, qwen2.5-coder:14b is the minimum I'd recommend.

**Why do I need to change Ollama's context window?**

Ollama defaults to a 4,096-token context window. OpenClaw needs 16,000+ tokens minimum. Create a custom Modelfile with `PARAMETER num_ctx 32768` or switch to the native Ollama API (`"api": "ollama"`) instead of the OpenAI-compatible endpoint.

**Can a local model fully replace cloud models?**

For routine tasks like file operations, message handling, and simple coding, a 32B local model is surprisingly capable. For complex reasoning, multi-step debugging, or architectural decisions, cloud models still win. The hybrid approach gives you both.

**How much VRAM do I need?**

8GB is the bare minimum (with an 8B model), but you won't get reliable tool calling. 24GB is where it gets genuinely useful with 32B models. Mac users with 32GB unified memory are in an excellent spot since Apple Silicon handles these models efficiently.
*Last updated: March 2026*

Sources

  1. Ollama integration docs
  2. Community benchmarks
  3. Documented issues with context truncation
  4. GitHub issue #5769
  5. Independent benchmarks by Till Freitag
  6. Free LLM tiers
  7. Analysis by LaoZhang AI
  8. Token cost optimization guide
  9. ClawHosters
  10. Self-hosted vs managed comparison