Voice
Handle inbound phone calls with an AI voice agent that books appointments.
Overview
Empfio's voice channel answers inbound calls using a real-time speech pipeline. Callers speak naturally; the voice service transcribes their speech, passes the transcript through the same booking flow used by the text channels, and speaks the reply back, all in real time.
The voice service runs as a separate Docker container (voice/) alongside the main backend.
Telephony provider: Telnyx (primary) — German local numbers from €0.86/month, 17× cheaper than Twilio.
How it works
- Customer calls the Empfio AI phone number (provisioned via Telnyx)
- Telnyx webhook — Telnyx sends the call event to the voice service
- Speech-to-text — the caller's speech is transcribed in real time via Deepgram Nova-3
- Agent processing — the transcript is sent to the LangGraph AI agent (same agent as text channels)
- Text-to-speech — the agent's text reply is converted to speech via ElevenLabs and streamed back to the caller
- Booking confirmation — when a booking is made, a confirmation SMS is sent to the caller's phone number
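
The steps above can be read as one conversational turn. The following is a minimal sketch of that turn, not Empfio's actual implementation: `handle_turn`, `TurnResult`, the four callables, and the SMS wording are all hypothetical names introduced for illustration. In the real service each callable would wrap Deepgram, the LangGraph agent, ElevenLabs, and the SMS sender respectively.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TurnResult:
    transcript: str
    reply_text: str
    reply_audio: bytes
    sms_sent: bool


def handle_turn(
    caller_audio: bytes,
    caller_number: str,
    transcribe: Callable[[bytes], str],  # stand-in for the speech-to-text stage
    run_agent: Callable[[str], tuple[str, Optional[str]]],  # returns (reply, booking id or None)
    synthesize: Callable[[str], bytes],  # stand-in for the text-to-speech stage
    send_sms: Callable[[str, str], None],  # stand-in for the SMS sender
) -> TurnResult:
    """Run one caller utterance through STT, the agent, and TTS."""
    transcript = transcribe(caller_audio)
    reply_text, booking_id = run_agent(transcript)
    reply_audio = synthesize(reply_text)

    # A confirmation SMS is only sent when the agent reports a booking.
    sms_sent = False
    if booking_id is not None:
        send_sms(caller_number, f"Your booking {booking_id} is confirmed.")
        sms_sent = True

    return TurnResult(transcript, reply_text, reply_audio, sms_sent)
```

Passing the stages in as callables keeps the sketch independent of any provider SDK and makes each stage easy to replace with a stub in tests.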
Setup
Voice is provisioned through Empfio's AI Number feature, which provides a single phone number that handles voice, SMS, and WhatsApp:
- Go to Settings → AI Number in the Empfio dashboard
- Provision a new AI phone number
- The number is automatically configured for voice calls, SMS, and WhatsApp
- Test by calling the number
Speech pipeline
| Stage | Provider | Notes |
|---|---|---|
| Telephony | Telnyx | TeXML + Media Stream WebSocket, PCMU/8000 |
| Speech-to-text | Deepgram Nova-3 | ~100ms latency, accepts mulaw natively |
| AI agent | LangGraph (GPT-4o) | Same agent as WhatsApp/Telegram |
| Text-to-speech | ElevenLabs Turbo v2.5 | ~200ms first-byte latency |
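
PCMU/8000 means G.711 mu-law audio at 8,000 samples per second, one byte per sample. Because Deepgram accepts mulaw natively, no conversion is needed on the transcription path. The sketch below is only relevant if some other component needs linear PCM, for example to inspect signal levels. It implements the standard G.711 mu-law expansion in pure Python; the function names are illustrative and are not part of the voice service.

```python
def decode_mulaw_byte(value: int) -> int:
    """Expand one G.711 mu-law byte to a signed 16-bit linear PCM sample."""
    value = ~value & 0xFF                      # mu-law bytes are stored inverted
    magnitude = ((value & 0x0F) << 3) + 0x84   # mantissa plus bias
    magnitude <<= (value & 0x70) >> 4          # apply the 3-bit exponent
    # The top bit is the sign bit.
    return (0x84 - magnitude) if (value & 0x80) else (magnitude - 0x84)


def decode_mulaw(data: bytes) -> list[int]:
    """Decode a PCMU payload into a list of linear PCM samples."""
    return [decode_mulaw_byte(b) for b in data]


def frame_duration_ms(num_bytes: int, sample_rate: int = 8000) -> float:
    """Duration of a PCMU frame; mu-law uses one byte per sample."""
    return num_bytes * 1000 / sample_rate
```

At 8,000 samples per second, a 160-byte payload corresponds to 20 ms of audio.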
Streaming for low latency
Voice conversations need low latency to feel natural. The agent runs in streaming mode (/chat/stream), so the first words of the reply are spoken before the full response has been generated. The caller's perceived wait is then bounded by the time to the first token rather than by the time needed to generate the whole reply.
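
One common way to take advantage of a streamed reply is to group tokens into sentences and hand each sentence to text-to-speech as soon as it is complete. The sketch below shows that idea only; it is an assumption about how such buffering could work, not a description of how Empfio's voice service does it, and `sentences_from_tokens` is a hypothetical helper.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Flush when the buffered text ends with sentence punctuation.
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush whatever is left when the stream ends.
    if buffer.strip():
        yield buffer.strip()


stream = ["Sure", ",", " I", " can", " help", ".", " What", " time", " works", "?"]
for sentence in sentences_from_tokens(stream):
    print(sentence)  # each sentence could be sent to TTS at this point
```

Splitting on punctuation alone is deliberately naive: abbreviations and decimal numbers can trigger an early flush if a token happens to end on the period, so a production splitter would need extra rules.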
Typical latency breakdown:
| Stage | Time |
|---|---|
| Speech-to-text (STT) | ~100ms |
| LLM first token | ~500ms |
| Text-to-speech (TTS) | ~200ms |
| Total perceived delay | ~800ms |
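
The total in the table is the sum of the three stages. As a small worked example, the figures can be kept in a dictionary so that the total and the dominant stage are computed rather than hard-coded. The key names are illustrative; the values are the approximate figures from the table.

```python
# Approximate per-stage latencies from the table above, in milliseconds.
STAGE_LATENCY_MS = {
    "stt": 100,
    "llm_first_token": 500,
    "tts_first_byte": 200,
}


def perceived_delay_ms(stages: dict[str, int]) -> int:
    """Total delay before the caller hears the first audio of the reply."""
    return sum(stages.values())


def dominant_stage(stages: dict[str, int]) -> str:
    """Name of the stage that contributes the most latency."""
    return max(stages, key=stages.get)


print(perceived_delay_ms(STAGE_LATENCY_MS))  # 800
print(dominant_stage(STAGE_LATENCY_MS))      # llm_first_token
```

The LLM's first token accounts for more than half of the budget, which is why slow replies usually point to the agent service first.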
Barge-in
When a caller speaks while the AI is talking, Empfio immediately stops playback and listens. This makes conversations feel natural rather than robotic. Barge-in is enabled by default and can be controlled via ENABLE_BARGE_IN in the voice service configuration.
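
The behaviour described above can be sketched as a small playback controller. This is an illustration under stated assumptions: that `ENABLE_BARGE_IN` is read from the environment and that values such as `true` or `1` mean enabled. The accepted values, the `PlaybackController` class, and its methods are hypothetical and may differ from the actual voice service.

```python
import os
from typing import Iterable, Mapping, Optional


def barge_in_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read ENABLE_BARGE_IN; defaults to enabled, matching the documented default."""
    env = os.environ if env is None else env
    raw = env.get("ENABLE_BARGE_IN", "true")
    return raw.strip().lower() in ("1", "true", "yes", "on")  # assumed value format


class PlaybackController:
    """Tracks queued reply audio and drops it when the caller interrupts."""

    def __init__(self, barge_in: bool) -> None:
        self.barge_in = barge_in
        self.speaking = False
        self.pending_audio: list[bytes] = []

    def start_reply(self, chunks: Iterable[bytes]) -> None:
        self.pending_audio = list(chunks)
        self.speaking = bool(self.pending_audio)

    def on_caller_speech(self) -> bool:
        """Return True if playback was interrupted by the caller."""
        if self.barge_in and self.speaking:
            self.pending_audio.clear()  # discard audio that has not been played yet
            self.speaking = False
            return True
        return False
```

With barge-in disabled, `on_caller_speech` leaves the queue untouched and the reply plays to the end.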
Limitations
- Voice requires a Telnyx account and a provisioned AI Number
- Conference calls and multi-party calls are not supported
- The agent cannot process DTMF (keypad/tone input) — speech only
- Speech recognition works best in English and German
Troubleshooting
| Problem | Fix |
|---|---|
| No audio when calling | Check that the Telnyx TeXML Application voice_url points to the voice service |
| Agent not responding | Verify the voice service is running (GET /health) and TELNYX_API_KEY is set |
| Wrong language in replies | Check your organization's language setting in Settings → General |
| Long pauses before replies | Usually caused by high LLM latency; check the agent service logs |
| Call drops after a few seconds | Verify the Telnyx number is active and the TeXML Application is correctly assigned |
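
For the "Agent not responding" row, the health check can be scripted. The sketch below only assumes that `GET /health` returns HTTP 200 when the service is up; the response body is not inspected because its format is not documented here. The function name and the `opener` parameter are illustrative, and the base URL must be replaced with the address of your own voice service.

```python
import urllib.error
import urllib.request


def voice_service_healthy(
    base_url: str,
    timeout: float = 3.0,
    opener=urllib.request.urlopen,  # injectable so the check can be tested offline
) -> bool:
    """Return True if GET <base_url>/health answers with HTTP 200."""
    url = base_url.rstrip("/") + "/health"
    try:
        with opener(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx response.
        return False
```

Calling `voice_service_healthy("http://localhost:8080")` (a hypothetical address) returns `False` both when the container is down and when it responds with an error status, so check the container logs to tell the two cases apart.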