OpenClaw Voice Mode: How to Add Speech to Your AI Agent

ClawHosters by Daniel Samer
6 min read

Picture this. Your Discord gaming server has 15 people in a voice channel, mid-raid, and someone asks "what's the cooldown on that ability?" Nobody wants to alt-tab and type. Your OpenClaw agent joins the channel, listens, and answers out loud. That's what OpenClaw voice mode does.

Since v2026.2.21, OpenClaw has shipped with native Discord voice channel support. But Discord is only part of the story. Voice also works on Telegram and WhatsApp, and even as a standalone assistant on your phone.

How the Voice Loop Works

OpenClaw voice mode runs a five-step loop, and it happens fast enough that conversations feel natural.

Step 1: Voice Activity Detection (VAD) picks up when someone is speaking. It filters out background noise so your agent isn't trying to transcribe your mechanical keyboard.

Step 2: The audio goes to a Speech-to-Text provider: OpenAI Whisper, Deepgram streaming, or a local Whisper model running on your own hardware.

Step 3: The transcript hits your agent's LLM. Same brain, different input method.

Step 4: The LLM's response gets converted to audio by a Text-to-Speech provider: ElevenLabs, OpenAI TTS, Edge TTS (free), or Kokoro (free, local).

Step 5: Barge-in. If a user starts talking while the agent is still speaking, the agent stops immediately. This is what separates a conversational agent from a robot reading a paragraph at you.

The whole cycle takes roughly two to four seconds depending on your provider choices.
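
If it helps to see the loop as code, here's a minimal Python sketch of the five steps. It is not OpenClaw's implementation: `is_speech`, `VoiceLoop`, and the three callables are hypothetical names, the VAD is a crude energy threshold, and the providers are stand-in functions you'd swap for real STT, LLM, and TTS calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List

# Every name below is a hypothetical stand-in, not an OpenClaw API.


def is_speech(frame: List[float], threshold: float = 0.02) -> bool:
    """Step 1 (VAD): call a frame speech when its RMS energy beats a threshold."""
    if not frame:
        return False
    rms = (sum(sample * sample for sample in frame) / len(frame)) ** 0.5
    return rms > threshold


@dataclass
class VoiceLoop:
    transcribe: Callable[[List[List[float]]], str]  # Step 2: STT provider
    think: Callable[[str], str]                     # Step 3: the agent's LLM
    synthesize: Callable[[str], List[str]]          # Step 4: TTS, returns audio chunks
    played: List[str] = field(default_factory=list)

    def run_turn(
        self,
        frames: Iterable[List[float]],
        user_is_talking: Callable[[], bool],
    ) -> str:
        # Step 1: keep only the frames the VAD considers speech.
        speech = [frame for frame in frames if is_speech(frame)]
        if not speech:
            return ""  # nothing but background noise, so no reply
        transcript = self.transcribe(speech)  # Step 2
        reply = self.think(transcript)        # Step 3
        for chunk in self.synthesize(reply):  # Step 4
            # Step 5 (barge-in): stop playback as soon as the user speaks.
            if user_is_talking():
                break
            self.played.append(chunk)
        return reply


# Stub providers so the sketch runs without any network calls.
loop = VoiceLoop(
    transcribe=lambda frames: f"{len(frames)} frames",
    think=lambda transcript: "reply to " + transcript,
    synthesize=lambda text: text.split(),
)
quiet, loud = [0.0] * 4, [0.5] * 4
print(loop.run_turn([quiet, loud, loud], user_is_talking=lambda: False))
print(loop.played)
```

The barge-in check sits inside the playback loop on purpose. Checking once before speaking wouldn't let a user interrupt a long answer halfway through.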

Discord Voice Channels

The /vc command landed in v2026.2.21. Your agent can join, leave, and report status in any voice channel.

The discord-voice skill documentation recommends Deepgram streaming for roughly one second lower latency compared to batch transcription. In a real conversation, that one second is the difference between "responsive" and "awkward."

One thing to watch: native commands need to be enabled in your config (commands.native: auto or enable). If /vc isn't showing up, that's probably why.

And don't set `messages.tts.auto` to `always`. It sounds like a good idea until your agent tries to read a 47-line code block out loud. Start with `inbound`, which means the agent only speaks when the user sends voice first.
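
Here's a small Python sketch of how those two settings fit together. Treat it as an illustration under assumptions: the article only gives the dotted keys, so the nested layout produced by `expand_dotted` is a guess at the config shape, and `should_reply_with_voice` is a hypothetical helper that models the `always` and `inbound` behavior described above (any other value is treated as "stay silent" here).

```python
import json


def expand_dotted(settings: dict) -> dict:
    """Turn {"a.b.c": value} into {"a": {"b": {"c": value}}}."""
    root: dict = {}
    for dotted_key, value in settings.items():
        node = root
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return root


def should_reply_with_voice(tts_auto: str, inbound_was_voice: bool) -> bool:
    """Model of the two modes the article names; other values are assumed off."""
    if tts_auto == "always":
        return True  # speaks every reply, code blocks included
    if tts_auto == "inbound":
        return inbound_was_voice  # speaks only when the user spoke first
    return False


# Assumed nested layout; check the OpenClaw docs for the real file format.
config = expand_dotted({
    "commands.native": "auto",
    "messages.tts.auto": "inbound",
})
print(json.dumps(config, indent=2))

mode = config["messages"]["tts"]["auto"]
print(should_reply_with_voice(mode, inbound_was_voice=True))
print(should_reply_with_voice(mode, inbound_was_voice=False))
```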

Talk Mode vs Discord Voice

These serve different needs.

Discord voice is for communities. The agent joins a shared channel and participates alongside everyone else. It runs entirely server-side.

Talk Mode is for personal use. You run a "node" on your phone or laptop (the device with the mic and speaker), while the gateway stays on the server. It's a private, bidirectional conversation. Think voice assistant, not group chat.

If you want your agent answering questions in your Discord server, use the Discord voice skill. If you want to talk to your agent hands-free while cooking, Talk Mode on your phone is what you're after.

STT Providers: What They Cost

| Provider | Cost | Latency | Notes |
| --- | --- | --- | --- |
| OpenAI Whisper | $0.006/min | Moderate | Flat rate, no volume discounts |
| Deepgram Streaming | $0.0077/min | Low (~1s faster) | $200 free credit on signup |
| Local Whisper | Free | Higher (2-5x cloud) | Needs capable hardware, fully offline |

Deepgram costs slightly more per minute but the latency difference matters for conversation. For batch processing or async voice messages on Telegram, Whisper is probably fine.
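
To see what "slightly more" means for your usage, you can run the numbers. This is a quick sketch using only the per-minute rates and signup credit from the table above. The function names and the 1,000-minute example volume are made up for illustration.

```python
# Per-minute rates and signup credit copied from the table above.
STT_RATE_PER_MIN = {
    "OpenAI Whisper": 0.006,
    "Deepgram Streaming": 0.0077,
}
DEEPGRAM_SIGNUP_CREDIT = 200.0


def monthly_stt_cost(provider: str, minutes: float) -> float:
    """Cost in USD for transcribing `minutes` of audio in a month."""
    return round(STT_RATE_PER_MIN[provider] * minutes, 2)


def minutes_covered_by_credit(credit: float, rate_per_min: float) -> int:
    """Whole minutes of transcription a credit balance pays for."""
    return int(credit / rate_per_min)


example_minutes = 1_000  # hypothetical monthly volume
for provider in STT_RATE_PER_MIN:
    print(provider, monthly_stt_cost(provider, example_minutes))

print(minutes_covered_by_credit(
    DEEPGRAM_SIGNUP_CREDIT, STT_RATE_PER_MIN["Deepgram Streaming"]
))
```

At 1,000 minutes a month the gap works out to $1.70 ($7.70 vs $6.00), and the $200 Deepgram credit covers roughly 25,974 minutes before you pay anything.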

TTS Providers: What They Cost

| Provider | Cost | Quality | Notes |
| --- | --- | --- | --- |
| ElevenLabs | ~$0.24/1K chars (Pro overage) | High | Most natural voices, 1M chars included at $99/mo |
| OpenAI TTS-1 | $15/1M chars | Good | Six voice options, reliable |
| Edge TTS | Free | Decent | Microsoft neural voices, no API key needed |
| Kokoro | Free | Good | Local only, no network dependency |

A community build called Jupiter Voice runs local Whisper plus Kokoro for a completely offline voice pipeline. Zero API costs, zero network dependency. Good option if privacy is a priority.
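
The per-character prices in the TTS table use different units, so a like-for-like comparison takes a little arithmetic. Here's a rough sketch using only the table's rates. `tts_cost` and the 100,000-character volume are hypothetical. The ElevenLabs figure is the Pro overage rate, so it only applies to characters beyond the plan's included quota.

```python
# Rates from the TTS table, normalized to USD per character.
TTS_RATE_PER_CHAR = {
    "ElevenLabs (Pro overage)": 0.24 / 1_000,  # ~$0.24 per 1K chars
    "OpenAI TTS-1": 15.0 / 1_000_000,          # $15 per 1M chars
    "Edge TTS": 0.0,                           # free
    "Kokoro": 0.0,                             # free, local
}


def tts_cost(provider: str, characters: int) -> float:
    """Cost in USD to synthesize `characters` at the provider's listed rate."""
    return round(TTS_RATE_PER_CHAR[provider] * characters, 2)


example_characters = 100_000  # hypothetical monthly volume
for provider in TTS_RATE_PER_CHAR:
    print(provider, tts_cost(provider, example_characters))
```

For 100,000 characters that comes to $24.00 at the ElevenLabs overage rate and $1.50 with OpenAI TTS-1, while Edge TTS and Kokoro stay at zero.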

The ClawHosters Voice Add-on

If managing API keys and provider configs sounds like more work than you want, the ClawHosters Voice Add-on bundles everything into a single subscription.

| Plan | Monthly Cost | What You Get |
| --- | --- | --- |
| Starter | EUR 2/mo | Basic voice minutes |
| Standard | EUR 8/mo | More minutes for active use |
| Pro | EUR 25/mo | High-volume voice processing |

No separate Deepgram or ElevenLabs accounts. No API keys to configure. Usage is tracked in processing minutes and covers both STT and TTS. If you're already on ClawHosters, it's the fastest way to get voice running. You can start a free trial and add voice later.

For users who want to understand token costs more broadly, we covered that in a separate post.

Frequently Asked Questions

How do I get my OpenClaw agent into a Discord voice channel?

Make sure native commands are enabled in your config (`commands.native: auto`), then use `/vc join` in any Discord voice channel. The agent joins, listens via VAD, and responds with TTS. On ClawHosters, the Voice Add-on handles provider configuration automatically.

Does voice work on Telegram?

Yes. Telegram supports two-way voice out of the box. Users send voice message attachments, OpenClaw transcribes them, and replies as round voice-note bubbles in Opus format. Check the Telegram setup docs for details.

Can I run voice mode without paying for APIs?

You can. Run local Whisper for STT and Edge TTS or Kokoro for TTS. No API keys, no per-minute costs. The trade-off is higher latency and the need for hardware that can run the Whisper model. For most people, the managed Voice Add-on at EUR 2/month is simpler.

What's the difference between Discord voice and Talk Mode?

Discord voice is for group settings. The agent joins a shared channel server-side. Talk Mode is for personal, one-on-one conversation. It requires a local "node" device (phone, laptop) with a microphone and speaker, while the gateway runs on the server.

Which STT provider is best for live conversation?

Deepgram streaming. It costs slightly more than Whisper ($0.0077 vs $0.006 per minute) but delivers roughly one second lower latency. In a live conversation, that gap is noticeable. For async voice messages, standard Whisper works fine.

*Last updated: February 2026*
