OpenClaw Talk Mode: Set Up Voice with ElevenLabs, Whisper, and Real-Time Audio

ClawHosters by Daniel Samer
7 min read

Your OpenClaw agent reads, writes, searches the web, calls tools. But it can't hear you. And it can't talk back. Voice changes that, and it's not as complicated as you'd think.

An OpenClaw voice assistant turns a text chatbot into something closer to a hands-free coworker. Ask questions while cooking. Get briefings during your commute. Let your Telegram group fire voice notes at the bot instead of typing. The setup takes about 10 minutes if you know which pieces go where.

Three Layers, Three Choices

Voice in OpenClaw isn't one monolithic feature. It's built from three independent layers, and you only configure what you need.

STT (Speech-to-Text) converts incoming audio into text your agent can process. Someone sends a voice note on Telegram, STT transcribes it.

TTS (Text-to-Speech) turns your agent's text replies into spoken audio. The agent types "meeting at 3pm," TTS makes it say it out loud.

Talk Mode ties both together into a continuous, bidirectional voice loop. VAD (Voice Activity Detection) detects when you start and stop speaking. Each utterance is then transcribed, run through the LLM, and spoken back, and the loop repeats. Think Siri or Alexa, except running on your own server with your own model.
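To make the VAD step concrete, here is a minimal sketch of a naive energy-threshold detector. It is not OpenClaw's implementation, which this guide doesn't describe. The function names, the `threshold` value, and the `max_silence` frame count are all assumptions chosen for illustration.

```python
import math
from typing import List, Tuple

def rms(frame: List[float]) -> float:
    """Root-mean-square energy of one frame (samples scaled to -1.0..1.0)."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))

def detect_utterances(
    frames: List[List[float]],
    threshold: float = 0.02,   # hypothetical energy cutoff for "speech"
    max_silence: int = 3,      # hypothetical: quiet frames that end an utterance
) -> List[Tuple[int, int]]:
    """Return (start, end) frame indices, end exclusive, for each utterance."""
    utterances = []
    start = None
    silence = 0
    for i, frame in enumerate(frames):
        if rms(frame) >= threshold:
            if start is None:
                start = i          # speech begins
            silence = 0
        elif start is not None:
            silence += 1
            if silence >= max_silence:
                # Speech ended where the run of quiet frames began.
                utterances.append((start, i - silence + 1))
                start, silence = None, 0
    if start is not None:
        # Audio ended mid-utterance: close it, trimming trailing quiet frames.
        utterances.append((start, len(frames) - silence))
    return utterances
```

Each returned span is what would be handed to STT as one turn. Production VADs typically use trained models rather than a raw energy cutoff, so treat this only as a picture of the idea.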

You can run STT alone (transcribe voice messages, reply in text). You can run TTS alone (type to the agent, hear audio back). Or wire up the full loop with Talk Mode.
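The three combinations can be sketched as one turn handler where each layer is optional. This is an illustration of the layering, not OpenClaw code: `handle_turn`, `agent`, `stt`, and `tts` are hypothetical names, and the layers are plain callables you would swap for real providers.

```python
from typing import Callable, Optional, Union

Transcriber = Callable[[bytes], str]   # STT layer: audio in, text out
Synthesizer = Callable[[str], bytes]   # TTS layer: text in, audio out
Agent = Callable[[str], str]           # the LLM-backed agent: text in, text out

def handle_turn(
    message: Union[str, bytes],
    agent: Agent,
    stt: Optional[Transcriber] = None,
    tts: Optional[Synthesizer] = None,
) -> Union[str, bytes]:
    """Run one turn. Audio input requires stt; audio output requires tts."""
    if isinstance(message, bytes):
        if stt is None:
            raise ValueError("received audio but no STT layer is configured")
        text = stt(message)
    else:
        text = message
    reply = agent(text)
    # With no TTS layer, the reply goes back as plain text.
    return tts(reply) if tts is not None else reply
```

Passing only `stt` gives voice in, text out. Passing only `tts` gives text in, voice out. Passing both and calling `handle_turn` once per detected utterance is the shape of the Talk Mode loop.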

Setting Up STT

Three practical options, ordered by how much you want to spend.

| Provider | Cost | Latency | Best For |
|---|---|---|---|
| OpenAI Whisper | $0.006/min | ~2s | General use, best accuracy |
| Deepgram | $0.0077/min ($200 free credit) | ~1s | Real-time conversation |
| Local Whisper | $0 | 4-8s (CPU dependent) | Privacy, zero ongoing cost |

Whisper is the safe default. If you're building a real-time voice assistant for Discord or Talk Mode, Deepgram's streaming endpoint shaves roughly a second off each turn. That sounds small. In conversation, it's the gap between responsive and awkward.

Local Whisper costs nothing but needs decent hardware. A 4GB GPU handles the base model fine. The large-v3 model needs 10GB+ VRAM and patience.

# config.yaml - STT with Whisper
stt:
  provider: openai
  model: whisper-1

Setting Up TTS with ElevenLabs

ElevenLabs is the community default for a reason. The voices sound natural enough that people on the other end of a Discord call sometimes don't realize they're talking to an agent.

Models to pick from:

  • eleven_turbo_v2.5 is the go-to. Fast, ~$0.05 per 1,000 characters, good enough for 90% of use cases.

  • eleven_multilingual_v2 if your agent speaks German, Spanish, or any of the 29 supported languages.

  • eleven_v3 adds emotional range. The agent can sound excited, calm, or serious based on context. Premium tier only.

Popular voices: Rachel (warm, professional), Adam (clear, neutral). Browse all voices in ElevenLabs' voice library before committing.

Free tier gets you 10,000 characters per month. That's roughly 8 minutes of spoken output, maybe enough for testing. You'll probably want a paid plan for anything real.

# config.yaml - TTS with ElevenLabs
tts:
  provider: elevenlabs
  model: eleven_turbo_v2.5
  voice: Rachel

Two alternatives worth knowing: Microsoft Edge TTS (free, no API key needed, decent quality, limited voice control) and OpenAI TTS (paid at $15 per million characters, six voices, reliable).
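For a rough sense of how these TTS prices compare, here is a small sketch using only the figures quoted in this section. The provider keys and function name are made up for illustration, and the characters-per-minute rate is an assumption derived from the "10,000 characters is roughly 8 minutes" estimate, so real speech will vary.

```python
# Per-1,000-character prices as quoted in this guide (approximate).
TTS_RATE_PER_1K_CHARS = {
    "elevenlabs_turbo": 0.05,   # ~$0.05 per 1,000 characters
    "openai_tts": 0.015,        # $15 per million characters
    "edge_tts": 0.0,            # no API key, no charge
}

# Assumption: 10,000 characters is about 8 minutes of speech.
CHARS_PER_MINUTE = 10_000 / 8

def tts_cost(provider: str, spoken_minutes: float) -> float:
    """Estimated cost in USD for a given amount of spoken output."""
    chars = spoken_minutes * CHARS_PER_MINUTE
    return round(chars / 1000 * TTS_RATE_PER_1K_CHARS[provider], 2)

def free_tier_minutes(free_chars: int = 10_000) -> float:
    """Minutes of speech covered by a monthly free character allowance."""
    return free_chars / CHARS_PER_MINUTE
```

Under these assumptions, 80 minutes of spoken output is 100,000 characters: about $5.00 at the ElevenLabs rate and about $1.50 at the OpenAI rate. Plan-based pricing will shift those numbers, so check each provider's current pricing before budgeting.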

Streaming vs Realtime: Pick One

These sound similar but work very differently. They're mutually exclusive.

Streaming mode is the pipeline approach. Audio goes to STT, transcript goes to LLM, response goes to TTS, audio comes back. Each step runs separately. Latency sits around 1.7 to 4.9 seconds depending on your providers. But your agent keeps full access to tools, memory, and MCP servers. This is what most people should use.

Realtime mode uses a single full-duplex WebSocket connection (OpenAI Realtime API or Gemini Live). Latency drops to 300-800ms. Conversations feel instant. But here's the catch: your agent loses access to tools and skills. No web search, no MCP calls, no calendar lookups. It's voice-only with the base model. If your agent needs to actually do things, Realtime mode isn't ready yet.

I think most people are better off with streaming for now. The latency penalty is real but the tool access matters more. Probably.
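The trade-off can be written down as a small decision helper. This is a sketch of the reasoning in this section, not an OpenClaw API: the `VoiceMode` class and `pick_mode` function are hypothetical, and the latency ranges are the ones quoted above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VoiceMode:
    name: str
    min_latency_s: float
    max_latency_s: float
    supports_tools: bool

# Latency ranges and tool support as described in this guide.
STREAMING = VoiceMode("streaming", 1.7, 4.9, supports_tools=True)
REALTIME = VoiceMode("realtime", 0.3, 0.8, supports_tools=False)

def pick_mode(needs_tools: bool, latency_budget_s: float) -> VoiceMode:
    """Pick a mode from two inputs: tool access and acceptable latency."""
    if needs_tools:
        # Realtime cannot call tools, so streaming is the only option.
        return STREAMING
    if latency_budget_s < STREAMING.min_latency_s:
        # Streaming cannot meet this budget even in its best case.
        return REALTIME
    return STREAMING
```

For example, `pick_mode(needs_tools=True, latency_budget_s=0.5)` still returns streaming, because tool access is a hard requirement and latency is not.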

Platform Support

| Platform | How Voice Works | Setup Difficulty |
|---|---|---|
| Telegram | Voice notes auto-transcribed, replies as audio bubbles | Low |
| Discord | Agent joins voice channels via /vc join | Medium |
| Web (WebRTC) | Browser-based Talk Mode, push-to-talk or VAD | Medium |
| Phone (Twilio/Telnyx) | Inbound/outbound phone calls via plugin | Higher |

Telegram is the easiest starting point. Users send voice notes, OpenClaw transcribes and replies. No special setup beyond the STT/TTS config.

We cover choosing the right AI model behind your voice agent in a separate guide.

What It Actually Costs

| Setup | Monthly Estimate | Notes |
|---|---|---|
| STT only (Whisper) | ~$3.60 | 10 hrs of voice input |
| STT + TTS (Whisper + ElevenLabs Starter) | $5-22 | Depends on output volume |
| Realtime (OpenAI) | $5-8 | Per-minute billing, no tools |
| Zero-cost (local Whisper + Edge TTS) | $0 | Needs hardware, higher latency |

These numbers assume moderate personal use. A busy Telegram group or Discord server with 50 active voice users will push STT costs up fast. We cover cost optimization strategies across the board in a separate guide.
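The STT figure is simple per-minute arithmetic, sketched below with the rates from the STT provider comparison earlier in this guide. The provider keys and function name are made up for illustration, and the sketch ignores Deepgram's free credit.

```python
# Per-minute STT prices as quoted in this guide.
STT_RATE_PER_MINUTE = {
    "openai_whisper": 0.006,
    "deepgram": 0.0077,
    "local_whisper": 0.0,
}

def monthly_stt_cost(provider: str, hours_per_month: float) -> float:
    """Estimated monthly STT cost in USD for a given volume of voice input."""
    minutes = hours_per_month * 60
    return round(minutes * STT_RATE_PER_MINUTE[provider], 2)

# 10 hours of voice input on Whisper: 600 min * $0.006 = $3.60
whisper_personal = monthly_stt_cost("openai_whisper", 10)
```

The same function shows why group usage adds up. If 50 users each sent a hypothetical 2 hours of voice per month, that is 100 hours, or about $36 on Whisper instead of $3.60.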

On ClawHosters, voice config lives in your dashboard. No ffmpeg installs, no Docker volume mounts, no hunting for codec libraries. Your instance is always on, so your OpenClaw voice assistant picks up voice notes at 3am the same as 3pm.

Frequently Asked Questions

Can I run OpenClaw voice for free?

Yes. Use local Whisper for STT and either Edge TTS or Kokoro for TTS. No API keys, no monthly bills. The trade-off is latency (expect 5-10 seconds per turn on modest hardware), and you'll need a machine with a decent GPU. For most casual users, a $5-8/month API spend gets a much better experience.

Which ElevenLabs model should I use?

Start with `eleven_turbo_v2.5`. It's the fastest and cheapest of the three, and it sounds natural for English. Switch to `eleven_multilingual_v2` if your users speak other languages. `eleven_v3` (emotion-aware) is worth trying if you want the agent to adapt its tone, but it costs more and requires a higher-tier plan.

Does Talk Mode work on a phone?

It does. OpenClaw's Talk Mode runs as a WebRTC connection from a browser or the PWA on your phone. You tap to connect, then speak naturally. The "node" runs on your device (mic and speaker), while the gateway stays on your server. Works over Wi-Fi or mobile data.

How do Streaming and Realtime modes compare on latency?

Streaming mode runs 1.7-4.9 seconds per voice turn. Realtime mode (OpenAI or Gemini Live) drops that to 300-800ms. But Realtime mode currently can't use agent tools, MCP servers, or skills. For agents that need to search, look up data, or call APIs, Streaming is the only option right now.

*Last updated: June 2026*

Sources

  1. OpenAI Whisper
  2. ElevenLabs
  3. Choosing the right AI model
  4. Cost optimization strategies
  5. ClawHosters