Picture this. Your Discord gaming server has 15 people in a voice channel, mid-raid, and someone asks "what's the cooldown on that ability?" Nobody wants to alt-tab and type. Your OpenClaw agent joins the channel, listens, and answers out loud. That's what OpenClaw voice mode does.
Since v2026.2.21, OpenClaw ships with native Discord voice channel support. But Discord is only part of the story. Voice works across Telegram, WhatsApp, and even as a standalone assistant on your phone.
How the Voice Loop Works
OpenClaw voice mode runs a five-step loop, and it happens fast enough that conversations feel natural.
Step 1: Voice Activity Detection (VAD) picks up when someone is speaking. It filters out background noise so your agent isn't trying to transcribe your mechanical keyboard.
Step 2: The audio goes to a Speech-to-Text provider: OpenAI Whisper, Deepgram streaming, or a local Whisper model running on your own hardware.
Step 3: The transcript hits your agent's LLM. Same brain, different input method.
Step 4: The LLM's response gets converted to audio by a Text-to-Speech provider: ElevenLabs, OpenAI TTS, Edge TTS (free), or Kokoro (free, local).
Step 5: Barge-in. If a user starts talking while the agent is still speaking, the agent stops immediately. This is what separates a conversational agent from a robot reading a paragraph at you.
The whole cycle takes roughly two to four seconds depending on your provider choices.
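To make the five steps concrete, here is a minimal Python sketch of the loop. It is not OpenClaw's actual implementation. The VoiceLoop class, its field names, and the stub providers are all hypothetical, and real providers would stream audio rather than pass whole byte strings around. The point is only to show where each step sits and how barge-in cuts a reply short.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VoiceLoop:
    """Illustrative only: each callable stands in for a real provider."""
    is_speech: Callable[[bytes], bool]         # Step 1: VAD
    transcribe: Callable[[bytes], str]         # Step 2: Speech-to-Text
    ask_llm: Callable[[str], str]              # Step 3: the agent's LLM
    synthesize: Callable[[str], List[bytes]]   # Step 4: Text-to-Speech, as audio chunks
    played: List[bytes] = field(default_factory=list)

    def speak(self, chunks: List[bytes], user_is_talking: Callable[[], bool]) -> bool:
        """Step 5: play chunk by chunk, stopping as soon as the user talks."""
        for chunk in chunks:
            if user_is_talking():
                return False  # barge-in: drop the rest of the reply
            self.played.append(chunk)
        return True

    def handle(self, audio: bytes,
               user_is_talking: Callable[[], bool] = lambda: False) -> Optional[bool]:
        if not self.is_speech(audio):
            return None  # background noise, nothing to transcribe
        reply = self.ask_llm(self.transcribe(audio))
        return self.speak(self.synthesize(reply), user_is_talking)


# Stub providers (assumptions) so the sketch runs without any API keys.
loop = VoiceLoop(
    is_speech=lambda audio: audio != b"",
    transcribe=lambda audio: audio.decode(),
    ask_llm=lambda text: f"You asked: {text}",
    synthesize=lambda reply: [word.encode() for word in reply.split()],
)

finished = loop.handle(b"cooldown?")  # True when the reply played to the end
```

Swapping a stub for a real provider only changes one field, which is roughly why the same loop can sit behind Discord, Telegram, or Talk Mode.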
Discord Voice Channels
The /vc command landed in v2026.2.21. Your agent can join, leave, and report status in any voice channel.
The discord-voice skill documentation recommends Deepgram streaming for roughly one second lower latency compared to batch transcription. In a real conversation, that one second is the difference between "responsive" and "awkward."
One thing to watch: native commands need to be enabled in your config (commands.native: auto or enable). If /vc isn't showing up, that's probably why.
And don't set messages.tts.auto to always. It sounds like a good idea until your agent tries to read a 47-line code block out loud. Start with inbound, which means the agent only speaks when the user sent voice first.
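The difference between the two messages.tts.auto values mentioned above can be sketched as a small decision function. This is an illustration of the policy as described here, not OpenClaw's code: should_speak is a hypothetical name, and only the two values named in this post are modeled, so any other value raises instead of guessing at behavior.

```python
def should_speak(tts_auto: str, user_sent_voice: bool) -> bool:
    """Decide whether a reply is read aloud (hypothetical helper)."""
    if tts_auto == "always":
        # Every reply is spoken, including that 47-line code block.
        return True
    if tts_auto == "inbound":
        # Only answer with voice when the user spoke first.
        return user_sent_voice
    # Other values may exist in OpenClaw; this sketch does not model them.
    raise ValueError(f"mode not covered by this sketch: {tts_auto!r}")


typed_message_reply = should_speak("inbound", user_sent_voice=False)
voice_message_reply = should_speak("inbound", user_sent_voice=True)
```

With inbound, a typed question gets a text answer and a spoken question gets a spoken one, which is usually what people expect.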
Talk Mode vs Discord Voice
These serve different needs.
Discord voice is for communities. The agent joins a shared channel and participates alongside everyone else. It runs entirely server-side.
Talk Mode is for personal use. You run a "node" on your phone or laptop (the device with the mic and speaker), while the gateway stays on the server. It's a private, bidirectional conversation. Think voice assistant, not group chat.
If you want your agent answering questions in your Discord server, use the Discord voice skill. If you want to talk to your agent hands-free while cooking, Talk Mode on your phone is what you're after.
STT Providers: What They Cost
| Provider | Cost | Latency | Notes |
|---|---|---|---|
| OpenAI Whisper | $0.006/min | Moderate | Flat rate, no volume discounts |
| Deepgram Streaming | $0.0077/min | Low (~1s faster) | $200 free credit on signup |
| Local Whisper | Free | Higher (2-5x cloud) | Needs capable hardware, fully offline |
Deepgram costs slightly more per minute, but the latency difference matters for live conversation. For batch processing or async voice messages on Telegram, OpenAI Whisper is probably fine.
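For a rough sense of what those per-minute rates add up to, here is a quick back-of-the-envelope calculation in Python. The rates come straight from the table above. The 600-minute monthly figure is an assumed example, and the function and variable names are made up for illustration.

```python
# Per-minute STT rates in USD, copied from the table above.
STT_RATE_PER_MIN = {
    "openai_whisper": 0.006,
    "deepgram_streaming": 0.0077,
    "local_whisper": 0.0,
}


def stt_cost(provider: str, minutes: float) -> float:
    """Transcription cost in USD for a given number of audio minutes."""
    return round(STT_RATE_PER_MIN[provider] * minutes, 2)


# Assumed workload: 600 minutes (10 hours) of transcribed speech per month.
monthly_minutes = 600
whisper_monthly = stt_cost("openai_whisper", monthly_minutes)
deepgram_monthly = stt_cost("deepgram_streaming", monthly_minutes)

# How many minutes the $200 Deepgram signup credit would cover at this rate.
deepgram_free_minutes = int(200 / STT_RATE_PER_MIN["deepgram_streaming"])
```

At 10 hours a month, that works out to about $3.60 for Whisper versus $4.62 for Deepgram, so the latency gain costs roughly a dollar. The signup credit covers a little under 26,000 minutes at the listed rate.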
TTS Providers: What They Cost
| Provider | Cost | Quality | Notes |
|---|---|---|---|
| ElevenLabs | ~$0.24/1K chars (Pro overage) | High | Most natural voices, 1M chars included at $99/mo |
| OpenAI TTS-1 | $15/1M chars | Good | Six voice options, reliable |
| Edge TTS | Free | Decent | Microsoft neural voices, no API key needed |
| Kokoro | Free | Good | Local only, no network dependency |
A community build called Jupiter Voice runs local Whisper plus Kokoro for a completely offline voice pipeline. Zero API costs, zero network dependency. Good option if privacy is a priority.
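The same kind of arithmetic works for the paid TTS rows, which bill per character rather than per minute. The sketch below uses only the figures from the table. It assumes the ElevenLabs Pro plan is a flat $99 with 1M characters included and overage billed at the listed rate, and the 1.2M-character workload is an invented example.

```python
# Figures from the TTS table above (USD).
OPENAI_TTS1_PER_MILLION_CHARS = 15.00
ELEVENLABS_PRO_BASE = 99.00
ELEVENLABS_PRO_INCLUDED_CHARS = 1_000_000
ELEVENLABS_OVERAGE_PER_1K_CHARS = 0.24


def openai_tts_cost(chars: int) -> float:
    """Pay-as-you-go cost for OpenAI TTS-1."""
    return round(chars / 1_000_000 * OPENAI_TTS1_PER_MILLION_CHARS, 2)


def elevenlabs_pro_cost(chars: int) -> float:
    """Flat plan fee plus overage beyond the included characters (assumed billing model)."""
    overage_chars = max(0, chars - ELEVENLABS_PRO_INCLUDED_CHARS)
    overage = overage_chars / 1_000 * ELEVENLABS_OVERAGE_PER_1K_CHARS
    return round(ELEVENLABS_PRO_BASE + overage, 2)


# Assumed workload: 1.2M characters of spoken replies in a month.
monthly_chars = 1_200_000
openai_monthly = openai_tts_cost(monthly_chars)
elevenlabs_monthly = elevenlabs_pro_cost(monthly_chars)
```

For that example month, OpenAI TTS-1 comes to $18 and ElevenLabs Pro to $147, so the more natural voices carry a real premium at volume. Edge TTS and Kokoro stay at zero either way.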
The ClawHosters Voice Add-on
If managing API keys and provider configs sounds like more work than you want, the ClawHosters Voice Add-on bundles everything into a single subscription.
| Plan | Monthly Cost | What You Get |
|---|---|---|
| Starter | EUR 2/mo | Basic voice minutes |
| Standard | EUR 8/mo | More minutes for active use |
| Pro | EUR 25/mo | High-volume voice processing |
No separate Deepgram or ElevenLabs accounts. No API keys to configure. Usage is tracked in processing minutes and covers both STT and TTS. If you're already on ClawHosters, it's the fastest way to get voice running. You can start a free trial and add voice later.
For users who want to understand token costs more broadly, we covered that in a separate post.