Your OpenClaw agent reads, writes, searches the web, calls tools. But it can't hear you. And it can't talk back. Voice changes that, and it's not as complicated as you'd think.
An OpenClaw voice assistant turns a text chatbot into something closer to a hands-free coworker. Ask questions while cooking. Get briefings during your commute. Let your Telegram group fire voice notes at the bot instead of typing. The setup takes about 10 minutes if you know which pieces go where.
Three Layers, Three Choices
Voice in OpenClaw isn't one monolithic feature. It's built from three independent layers, and you only configure what you need.
STT (Speech-to-Text) converts incoming audio into text your agent can process. Someone sends a voice note on Telegram, STT transcribes it.
TTS (Text-to-Speech) turns your agent's text replies into spoken audio. The agent types "meeting at 3pm," TTS makes it say it out loud.
Talk Mode ties both together into a continuous, bidirectional voice loop. VAD (Voice Activity Detection) detects when you start and stop speaking. The captured audio is then transcribed, run through the LLM, spoken back as the response, and the loop repeats. Think Siri or Alexa, except running on your own server with your own model.
You can run STT alone (transcribe voice messages, reply in text). You can run TTS alone (type to the agent, hear audio back). Or wire up the full loop with Talk Mode.
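The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not OpenClaw's implementation: `talk_loop`, `is_speech`, and the three stage callables are hypothetical names, and each stage is a stub standing in for a real VAD, STT, LLM, or TTS component.

```python
from typing import Callable, Iterable, List


def talk_loop(
    chunks: Iterable[bytes],
    is_speech: Callable[[bytes], bool],   # VAD: does this chunk contain speech?
    transcribe: Callable[[bytes], str],   # STT layer
    respond: Callable[[str], str],        # LLM layer
    speak: Callable[[str], bytes],        # TTS layer
) -> List[bytes]:
    """Run one pass over an audio stream and return the spoken replies."""
    replies: List[bytes] = []
    utterance = b""
    for chunk in chunks:
        if is_speech(chunk):
            utterance += chunk            # still talking: keep buffering
        elif utterance:
            # Silence after speech marks the end of a turn.
            text = transcribe(utterance)
            replies.append(speak(respond(text)))
            utterance = b""
    if utterance:                         # stream ended mid-utterance
        replies.append(speak(respond(transcribe(utterance))))
    return replies


# Stub stages so the sketch runs without any provider (all hypothetical).
demo = talk_loop(
    chunks=[b"hi", b" there", b"", b"bye", b""],
    is_speech=lambda c: len(c) > 0,
    transcribe=lambda audio: audio.decode(),
    respond=lambda text: text.upper(),
    speak=lambda text: text.encode(),
)
print(demo)  # [b'HI THERE', b'BYE']
```

Because each layer is just a callable, dropping `speak` gives you the STT-only setup and dropping `transcribe` gives you TTS-only, which mirrors how the three layers are configured independently.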
Setting Up STT
Three practical options, ordered by how much you want to spend.
| Provider | Cost | Latency | Best For |
|---|---|---|---|
| OpenAI Whisper | $0.006/min | ~2s | General use, best accuracy |
| Deepgram | $0.0077/min ($200 free credit) | ~1s | Real-time conversation |
| Local Whisper | $0 | 4-8s (CPU dependent) | Privacy, zero ongoing cost |
Whisper is the safe default. If you're building a real-time voice assistant for Discord or Talk Mode, Deepgram's streaming endpoint shaves roughly a second off each turn. That sounds small. In conversation, it's the gap between responsive and awkward.
Local Whisper costs nothing but needs decent hardware. A 4GB GPU handles the base model fine. The large-v3 model needs 10GB+ VRAM and patience.
# config.yaml - STT with Whisper
stt:
  provider: openai
  model: whisper-1
Setting Up TTS with ElevenLabs
ElevenLabs is the community default for a reason. The voices sound natural enough that people on the other end of a Discord call sometimes don't realize they're talking to an agent.
Models to pick from:
- eleven_turbo_v2.5 is the go-to. Fast, ~$0.05 per 1,000 characters, good enough for 90% of use cases.
- eleven_multilingual_v2 if your agent speaks German, Spanish, or any of the 29 supported languages.
- eleven_v3 adds emotional range. The agent can sound excited, calm, or serious based on context. Premium tier only.
Popular voices: Rachel (warm, professional), Adam (clear, neutral). Browse all voices in ElevenLabs' voice library before committing.
Free tier gets you 10,000 characters per month. That's roughly 8 minutes of spoken output, maybe enough for testing. You'll probably want a paid plan for anything real.
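To see how far a character budget goes, here is a rough back-of-the-envelope sketch. It only uses the two figures quoted above (10,000 characters is roughly 8 minutes, and ~$0.05 per 1,000 characters for the turbo model). The function names are made up for illustration, and real speaking rates vary by voice and language.

```python
# Speaking rate implied by "10,000 characters is roughly 8 minutes".
CHARS_PER_MINUTE = 10_000 / 8          # 1,250 characters per minute
# Approximate turbo-model price quoted in the model list above.
COST_PER_1K_CHARS = 0.05


def tts_minutes(characters: int) -> float:
    """Estimate minutes of spoken audio for a given character count."""
    return characters / CHARS_PER_MINUTE


def tts_cost(characters: int) -> float:
    """Estimate the pay-as-you-go cost in dollars, rounded to cents."""
    return round(characters / 1000 * COST_PER_1K_CHARS, 2)


print(tts_minutes(10_000))  # 8.0
print(tts_cost(10_000))     # 0.5
```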
# config.yaml - TTS with ElevenLabs
tts:
  provider: elevenlabs
  model: eleven_turbo_v2.5
  voice: Rachel
Two alternatives worth knowing: Microsoft Edge TTS (free, no API key needed, decent quality, limited voice control) and OpenAI TTS ($15 per million characters, six voices, reliable).
Streaming vs Realtime: Pick One
These sound similar but work very differently. They're mutually exclusive.
Streaming mode is the pipeline approach. Audio goes to STT, transcript goes to LLM, response goes to TTS, audio comes back. Each step runs separately. Latency sits around 1.7 to 4.9 seconds depending on your providers. But your agent keeps full access to tools, memory, and MCP servers. This is what most people should use.
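If you want to know where your own seconds go, you can time each stage of the pipeline separately. The sketch below is a generic illustration rather than anything OpenClaw ships: `timed_turn` is a hypothetical helper, and the three callables stand in for whatever STT, LLM, and TTS providers you have wired up.

```python
import time
from typing import Callable, Dict, Tuple


def timed_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    respond: Callable[[str], str],
    speak: Callable[[str], bytes],
) -> Tuple[bytes, Dict[str, float]]:
    """Run one STT -> LLM -> TTS turn and record seconds spent per stage."""
    timings: Dict[str, float] = {}

    start = time.perf_counter()
    text = transcribe(audio)
    timings["stt"] = time.perf_counter() - start

    start = time.perf_counter()
    reply = respond(text)
    timings["llm"] = time.perf_counter() - start

    start = time.perf_counter()
    out = speak(reply)
    timings["tts"] = time.perf_counter() - start

    # Stages run one after another, so the turn latency is their sum.
    timings["total"] = timings["stt"] + timings["llm"] + timings["tts"]
    return out, timings


# Stub stages (hypothetical) so the sketch runs without any provider.
out, timings = timed_turn(
    b"what time is the meeting",
    transcribe=lambda a: a.decode(),
    respond=lambda t: "meeting at 3pm",
    speak=lambda t: t.encode(),
)
print(out, sorted(timings))
```

Since the stages are sequential, the slowest one sets the floor. That is why swapping only the STT provider can change how a conversation feels.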
Realtime mode uses a single full-duplex WebSocket connection (OpenAI Realtime API or Gemini Live). Latency drops to 300-800ms. Conversations feel instant. But here's the catch: your agent loses access to tools and skills. No web search, no MCP calls, no calendar lookups. It's voice-only with the base model. If your agent needs to actually do things, Realtime mode isn't ready yet.
I think most people are better off with streaming for now. The latency penalty is real but the tool access matters more. Probably.
Platform Support
| Platform | How Voice Works | Setup Difficulty |
|---|---|---|
| Telegram | Voice notes auto-transcribed, replies as audio bubbles | Low |
| Discord | Agent joins voice channels via /vc join | Medium |
| Web (WebRTC) | Browser-based Talk Mode, push-to-talk or VAD | Medium |
| Phone (Twilio/Telnyx) | Inbound/outbound phone calls via plugin | Higher |
Telegram is the easiest starting point. Users send voice notes, OpenClaw transcribes and replies. No special setup beyond the STT/TTS config.
For a deeper look at choosing the right AI model behind your voice agent, see our separate guide on that topic.
What It Actually Costs
| Setup | Monthly Estimate | Notes |
|---|---|---|
| STT only (Whisper) | ~$3.60 | 10 hrs of voice input |
| STT + TTS (Whisper + ElevenLabs Starter) | $5-22 | Depends on output volume |
| Realtime (OpenAI) | $5-8 | Per-minute billing, no tools |
| Zero-cost (local Whisper + Edge TTS) | $0 | Needs hardware, higher latency |
These numbers assume moderate personal use. A busy Telegram group or Discord server with 50 active voice users will push STT costs up fast. We cover cost optimization strategies across the board in a separate guide.
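The STT line in the table is plain per-minute arithmetic, so it is easy to rerun for your own usage. This is a small sketch using the per-minute rates from the STT table. `monthly_stt_cost` is a hypothetical helper, and the 50-users-at-10-hours-each scenario is an assumption picked purely to show the scaling.

```python
# Per-minute rates from the STT provider table.
WHISPER_PER_MIN = 0.006
DEEPGRAM_PER_MIN = 0.0077


def monthly_stt_cost(hours: float, rate_per_min: float) -> float:
    """Dollars per month for a given number of transcribed hours."""
    return round(hours * 60 * rate_per_min, 2)


print(monthly_stt_cost(10, WHISPER_PER_MIN))        # 3.6
print(monthly_stt_cost(10, DEEPGRAM_PER_MIN))       # 4.62
# Assumed scenario: 50 users each sending 10 hours of audio per month.
print(monthly_stt_cost(50 * 10, WHISPER_PER_MIN))   # 180.0
```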
On ClawHosters, voice config lives in your dashboard. No ffmpeg installs, no Docker volume mounts, no hunting for codec libraries. Your instance is always on, so your OpenClaw voice assistant picks up voice notes at 3am the same as 3pm.