A stuck OpenClaw session ran overnight on a test server we manage. Nobody noticed until the morning. By then, it had burned through $47 in API tokens doing absolutely nothing useful. That's when we stopped treating monitoring as a nice-to-have.
If you're running an AI agent in production, LLM observability isn't optional. OpenClaw agents act autonomously; they don't wait for you to click something. Sessions can stall, rate limits can silently drop messages, and context windows can fill up until your costs spike with zero warning. There are at least five documented silent failure modes that won't show up in your logs unless you're actively looking for them.
So here's how to set up proper OpenClaw monitoring in about 15 minutes.
Enable the Built-in OpenTelemetry Exporter
OpenClaw ships with a plugin called diagnostics-otel. It's disabled by default. To turn it on, add this to your ~/.openclaw/openclaw.json:
{
  "diagnostics": {
    "otel": {
      "enabled": true,
      "endpoint": "http://127.0.0.1:4318",
      "serviceName": "openclaw-prod",
      "traces": true,
      "metrics": true,
      "logs": true,
      "sampleRate": 1.0,
      "flushIntervalMs": 5000
    }
  }
}
Two things to note. Set sampleRate to 1.0 for single-instance deployments so you don't lose traces. And drop flushIntervalMs to 5000 (five seconds) instead of the default 60000. As the SigNoz engineering team found, the default 60-second interval makes your dashboard nearly useless for real-time debugging.
One gotcha that'll cost you time: only http/protobuf works. If your collector expects gRPC, the plugin silently sends nothing. No error, no warning. Just silence. Check the official logging docs if you run into this.
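To check which protocols your collector actually accepts, look at the receivers block in its config. Both can be enabled side by side; this is standard OpenTelemetry Collector syntax, and only the http entry matters for diagnostics-otel:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4317   # OpenClaw's exporter can't send to this
      http:
        endpoint: 127.0.0.1:4318   # http/protobuf lands here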
The Four Metrics That Actually Matter
You'll get a wall of telemetry data. Ignore most of it at first. These four tell you whether your agent is healthy:
openclaw.cost.usd tracks spend per session. Set an alert for anything over your expected daily budget. This catches runaway sessions before they drain your API credits.
openclaw.run.duration_ms measures LLM response latency. A p95 above 5 seconds usually means something is wrong: either a provider slowdown or a context window that's grown too large.
openclaw.context.tokens shows how much of the model's context window is consumed. When this creeps toward the limit, response quality drops and costs climb.
openclaw.queue.depth reveals message backlog. If depth keeps growing, your agent can't keep up with incoming requests. Messages may get dropped depending on your queueOverflow setting.
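Once these reach Prometheus, the OTel dots become underscores, so openclaw.cost.usd shows up as openclaw_cost_usd. A couple of recording rules make the derived numbers easy to graph and alert on. This is a sketch: the histogram buckets and the 200,000-token context limit are assumptions, so swap in whatever your model actually allows:

groups:
  - name: openclaw-derived
    rules:
      # p95 LLM response latency, assuming duration is exported as a histogram
      - record: openclaw:run_duration_ms:p95
        expr: histogram_quantile(0.95, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le))
      # fraction of the context window in use, assuming a 200k-token limit
      - record: openclaw:context_utilization:ratio
        expr: openclaw_context_tokens / 200000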
Architecture: How the Pieces Fit
The data pipeline looks like this:
OpenClaw gateway sends OTLP/HTTP to an OTel Collector on port 4318. The collector exposes a Prometheus scrape endpoint on 127.0.0.1:9464. Prometheus scrapes that endpoint. Grafana queries Prometheus and renders dashboards.
Keep the collector endpoint on loopback only. You don't want telemetry data exposed to the internet.
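Wiring the otlp receiver from the previous section through to Prometheus takes two more blocks in the collector config. A sketch, assuming the opentelemetry-collector-contrib distribution, which bundles the prometheus exporter:

exporters:
  prometheus:
    endpoint: 127.0.0.1:9464   # loopback only, per the note above

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]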
The LumaDock VPS monitoring guide recommends running node_exporter alongside the OpenClaw metrics on the same Grafana dashboard. That way you can tell whether a latency spike is the LLM provider being slow or your VPS running out of RAM.
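On the Prometheus side, both targets fit in one scrape config. Job names here are arbitrary; 9100 is node_exporter's default port:

scrape_configs:
  - job_name: openclaw
    static_configs:
      - targets: ['127.0.0.1:9464']   # OTel Collector's Prometheus endpoint
  - job_name: node
    static_configs:
      - targets: ['127.0.0.1:9100']   # node_exporter host metrics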
Health Endpoints: /health vs /readyz
OpenClaw exposes two probes on port 18789. /health (or /healthz) is a shallow liveness check: it returns {"ok": true} as long as the process is running. /ready (or /readyz) goes deeper, checking whether your messaging channels (Telegram, Discord, etc.) are actually connected. If a channel drops, /readyz returns 503.
For Docker Compose or Kubernetes health checks, use /readyz. Using /health for readiness probes means your container reports healthy even when your Telegram bot is disconnected. The health endpoint docs cover this in detail.
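In Docker Compose, that looks like the healthcheck below; a sketch that assumes curl is available inside the container image:

services:
  openclaw:
    # ... image, ports, volumes
    healthcheck:
      test: ["CMD", "curl", "-fsS", "http://127.0.0.1:18789/readyz"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 30s   # give channels time to connect after boot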
Setting Up Alerts
Three Prometheus alert rules will save you from most production surprises; a complete rules file follows the list:
High error rate: rate(openclaw_gateway_errors_total[5m]) > 0.1 fires when the gateway averages more than 0.1 errors per second over a five-minute window. (Note that rate() returns errors per second, not a percentage; to alert on a true 10% error ratio, divide by the matching request rate.) Catches gateway crashes and webhook failures.
Slow responses: Alert when p95 latency stays above 5 seconds for more than two minutes. This is usually a provider issue or a bloated context window.
Agent down: openclaw_agent_status == 0 fires when the agent process stops responding entirely. Pair this with an automatic restart policy in your systemd unit or Docker config.
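As a rules file, the three alerts might look like this. The p95 expression assumes the latency metric arrives as a histogram; adjust the metric names to whatever your collector actually emits:

groups:
  - name: openclaw-alerts
    rules:
      - alert: OpenClawHighErrorRate
        expr: rate(openclaw_gateway_errors_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: "Gateway error rate above 0.1/s for five minutes"
      - alert: OpenClawSlowResponses
        expr: histogram_quantile(0.95, sum(rate(openclaw_run_duration_ms_bucket[5m])) by (le)) > 5000
        for: 2m
        annotations:
          summary: "p95 latency above 5s, check provider and context size"
      - alert: OpenClawAgentDown
        expr: openclaw_agent_status == 0
        for: 1m
        annotations:
          summary: "Agent process has stopped responding"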
These thresholds are starting points. Tune them after a week of baseline observation.
The Lightweight Alternative: ClawMetry
If Prometheus and Grafana feel like too much infrastructure for a single instance, look at ClawMetry. It's an open-source Python dashboard (23,000+ installs) that auto-detects your OpenClaw workspace. Installation is one command: pip install clawmetry.
ClawMetry understands OpenClaw concepts natively: channels, sub-agents, memory files, cron jobs. For a single VPS on ClawHosters, it's probably the right starting point.
For teams already running Grafana, or anyone who needs to correlate AI agent metrics with host performance data, the full Prometheus stack is worth the setup time.
What ClawHosters Handles for You
If you're on a ClawHosters managed instance, host-level monitoring (uptime, disk space, restarts) is already covered. What you still need to configure yourself is application-level observability: token cost tracking, response latency, and channel readiness. The diagnostics-otel config above works on any ClawHosters plan. Our setup guide walks through the full process.
For optimizing token spend once you have visibility, check our guide on OpenClaw token cost optimization.