A team tweaked one system prompt. Looked fine in testing. In production, average reasoning chains jumped from 4 steps to 11. Token spend doubled. No error logs, no alerts, no HTTP 500s. They found out when the invoice arrived. That story, documented by Agentix Labs, is probably the best argument for ai agent observability you'll ever hear.
If you're running an AI agent in production, standard uptime monitoring won't save you. Your agent can be "up" and still burn through your budget, loop on the same tool call for 40 iterations, or quietly lose instructions as the context window fills up.
Here's what to watch instead.
Why Traditional Monitoring Breaks for AI Agents
Web apps are predictable. Same request, same response, same code path. AI agents aren't like that. The same input can trigger completely different chains of tool calls depending on what the model decides in that moment.
That creates problems traditional APM can't catch.
Silent failures. Your agent returns a confident-sounding wrong answer. No error code. No stack trace. Just a user who got bad information.
Cost spirals. A single agent turn can consume 30,000 input tokens when you add the system prompt, conversation history, tool schemas, and retrieved documents. Multiply that across dozens of turns per task.
Loop storms. The agent retries the same failing tool call over and over. Each retry costs tokens. No timeout, no circuit breaker, just a growing bill.
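A loop storm is cheap to guard against in application code, even before the observability stack is in place. The sketch below is illustrative, not part of any particular agent framework; the class name, threshold, and call signature are assumptions:

```python
class LoopBreaker:
    """Trips after too many identical tool calls in one agent run.

    Illustrative sketch: names and the default threshold are
    assumptions, not an API from any real agent framework.
    """

    def __init__(self, max_repeats: int = 5):
        self.max_repeats = max_repeats
        self.counts: dict[tuple, int] = {}

    def check(self, tool_name: str, args_key: str) -> None:
        """Call before each tool invocation; raises when a loop is detected."""
        key = (tool_name, args_key)
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] > self.max_repeats:
            raise RuntimeError(
                f"circuit breaker: {tool_name} called "
                f"{self.counts[key]} times with identical arguments"
            )


breaker = LoopBreaker(max_repeats=3)
for _ in range(3):
    breaker.check("search_docs", '{"query": "pricing"}')  # allowed
# a fourth identical call would raise RuntimeError instead of burning tokens
```

The key design choice is keying on tool name plus serialized arguments: retries with *different* arguments are normal agent behavior, while identical ones almost never are.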
You need ai agent observability that tracks behavior, not just availability.
The Open-Source Stack: OpenTelemetry, Prometheus, Grafana
The industry has settled on a clear winner for llm observability in self-hosted setups. OpenTelemetry's GenAI semantic conventions define standard metric names that work across any AI framework. Contributors include Amazon, Google, IBM, and Microsoft.
The data flow looks like this:
| Component | Role |
|---|---|
| OpenTelemetry SDK | Instruments your agent, emits traces and metrics |
| OTel Collector | Receives OTLP data, exports to backends |
| Prometheus | Scrapes the collector, stores time-series metrics |
| Grafana | Visualizes everything, sends alerts |
All four run alongside your agent in Docker Compose. No SaaS vendor required. Your conversation data never leaves your server. Compared to paid ai observability tools like Datadog or Honeycomb, you trade a slick UI for full data ownership.
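A minimal Docker Compose layout for this stack might look like the following. Service names, image tags, ports, and config file paths are illustrative assumptions, not a prescribed setup:

```yaml
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    volumes:
      - ./otel-collector.yaml:/etc/otelcol-contrib/config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC in from the agent
      - "8889:8889"   # metrics out, scraped by Prometheus

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
```

Point your agent's OTLP exporter at `otel-collector:4317` and everything stays on your own host.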
Six AI Agent Observability Metrics You Should Track
Not everything needs a dashboard. These are the six ai agent observability metrics that catch real problems, based on what we've seen across ClawHosters deployments.
| Metric | OTel Name | Why It Matters |
|---|---|---|
| Token usage | `gen_ai.client.token.usage` | Cost control. Split by input vs. output, by model |
| Operation duration | `gen_ai.client.operation.duration` | End-to-end latency per LLM call |
| Time to first token | `gen_ai.server.time_to_first_token` | User experience for streaming responses |
| Tool call success rate | Custom counter | Failing tools cause retry loops |
| Agent loop iterations | Custom counter | Catches runaway reasoning chains |
| Context window utilization | Custom gauge | Above 80%, reasoning quality drops off a cliff |
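The three custom metrics in the table are simple to compute yourself before exporting them. Here's a framework-agnostic sketch; the helper names, the model list, and the context limit are hypothetical, so substitute your own:

```python
# Illustrative sketch: names and the context-limit table are assumptions.
MODEL_CONTEXT_LIMITS = {"gpt-4o": 128_000}  # tokens


def context_utilization(used_tokens: int, model: str) -> float:
    """Fraction of the model's context window currently in use.

    Export this as a gauge; alert when it crosses 0.8.
    """
    return used_tokens / MODEL_CONTEXT_LIMITS[model]


class ToolStats:
    """Tracks per-tool call and failure counts for export as counters."""

    def __init__(self):
        self.calls: dict[str, int] = {}
        self.failures: dict[str, int] = {}

    def record(self, tool: str, ok: bool) -> None:
        self.calls[tool] = self.calls.get(tool, 0) + 1
        if not ok:
            self.failures[tool] = self.failures.get(tool, 0) + 1

    def error_rate(self, tool: str) -> float:
        return self.failures.get(tool, 0) / self.calls[tool]


stats = ToolStats()
stats.record("web_search", ok=True)
stats.record("web_search", ok=False)
print(context_utilization(102_400, "gpt-4o"))  # 0.8 -- right at the alert line
print(stats.error_rate("web_search"))          # 0.5
```

In a real deployment you would feed these values into OTel counters and gauges rather than printing them, but the arithmetic is the whole metric.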
According to LangChain's 2024 State of AI Agents report, 89% of organizations running agents in production have already adopted observability tools. It's not a nice-to-have anymore. It's the baseline.
OpenClaw's Built-in OTel Plugin
If you're running OpenClaw on a ClawHosters plan, good news. OpenClaw ships with a diagnostics-otel plugin that handles the instrumentation side for you.
Enable it with one command:
```
openclaw plugins enable diagnostics-otel
```
Configure your OTLP endpoint in the OpenClaw config, restart, and your instance starts exporting traces, metrics, and structured logs in standard OTel format. Token counts, cost attribution, execution duration, queue depth, session states. All of it.
From there, point a Prometheus scrape job at the OTel Collector's metrics endpoint and connect Grafana. The SigNoz OpenClaw monitoring guide walks through the full setup path if you want step-by-step instructions.
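The Collector config that bridges those two steps is short. This is a sketch of a standard OTLP-in, Prometheus-out pipeline; ports are assumptions you should match to your own setup:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

With this in place, a single Prometheus scrape job against port 8889 picks up everything the agent emits.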
For an even simpler start, the community-built `openclaw-metrics` package adds a `/metrics` Prometheus endpoint directly to the gateway. It ships with a pre-built Grafana dashboard JSON covering around 30 metrics across seven categories.
Grafana Alerting: What to Actually Alert On
Dashboards are great for exploration. But the real value of this stack is grafana alerting that wakes you up before costs spiral.
Set up these alerts:
Token budget threshold. Alert when daily token spend exceeds 120% of your 7-day average. Catches prompt regressions and loop storms early.
Context window warning. Alert when any session hits 80% of the model's context limit. Beyond that point, the model starts losing early instructions and quality degrades fast. Context compression can reduce token usage by 90%, but only if you know it's needed.
Tool failure rate spike. Alert when any tool's error rate crosses 15% over a 5-minute window. A broken tool is the most common trigger for retry loops.
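The first and third alerts translate into Prometheus rules along these lines. Treat the metric names as assumptions: the exact series Prometheus sees depends on how your collector translates OTel names, and `agent_tool_calls_total` / `agent_tool_errors_total` are hypothetical custom counters you would emit yourself:

```yaml
groups:
  - name: agent-alerts
    rules:
      # Daily token spend above 120% of the 7-day average.
      - alert: TokenBudgetExceeded
        expr: >
          sum(increase(gen_ai_client_token_usage_sum[1d]))
            > 1.2 * (sum(increase(gen_ai_client_token_usage_sum[7d])) / 7)
        for: 15m

      # Any single tool's error rate above 15% over a 5-minute window.
      - alert: ToolFailureSpike
        expr: >
          sum by (tool) (rate(agent_tool_errors_total[5m]))
            / sum by (tool) (rate(agent_tool_calls_total[5m])) > 0.15
        for: 5m
```

Check the series names on your collector's `/metrics` endpoint first and adjust the expressions to match.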
Want to learn what else you can do to protect your instance? Check our security hardening guide and our breakdown of how to cut token costs by 77%.
Common Pitfalls
A few things that trip people up.
Logging full prompts. Don't. Prompts contain user data. Log token counts and metadata, not content. Redact if you must trace full conversations.
Static latency thresholds. LLM response times vary by model, load, and prompt length. Use percentile-based thresholds (P95, P99) relative to your own baselines, not fixed numbers.
Missing model version labels. When you upgrade from GPT-4o to a newer release, your metrics need to reflect that. Label every metric with the model name and version so you can spot regressions.
Ignoring silent retries. Some agent frameworks retry failed LLM calls automatically without logging them. If your framework does this, instrument the retry path separately.
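The percentile-based thresholds mentioned above are easy to derive from your own latency history. A minimal stdlib sketch using the nearest-rank method (function name and sample data are illustrative):

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile; good enough for setting alert baselines."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank - 1, 0)]


# Hypothetical per-call latencies in milliseconds, with two slow outliers.
latencies_ms = [820, 950, 1100, 1300, 4200, 900, 1050, 980, 1200, 8700]
p95 = percentile(latencies_ms, 95)
print(p95)  # 8700 with the samples above
```

Recompute the baseline per model and alert on, say, sustained latency above 1.25x your own P95, rather than on a fixed number copied from someone else's deployment.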