Empfio Docs

Voice

Handle inbound phone calls with an AI voice agent that books appointments.

Overview

Empfio's voice channel answers inbound calls using a real-time speech pipeline. Callers speak naturally; the AI agent transcribes their speech, runs it through the same booking flow as the text channels, and speaks the reply back, all in real time.

The voice service runs as a separate Docker container (voice/) alongside the main backend.

Telephony provider: Telnyx (primary) — German local numbers from €0.86/month, 17× cheaper than Twilio.

How it works

  1. Customer calls the Empfio AI phone number (provisioned via Telnyx)
  2. Telnyx webhook — Telnyx sends the call event to the voice service
  3. Speech-to-text — the caller's speech is transcribed in real time via Deepgram Nova-3
  4. Agent processing — the transcript is sent to the LangGraph AI agent (same agent as text channels)
  5. Text-to-speech — the agent's text reply is converted to speech via ElevenLabs and streamed back to the caller
  6. Booking confirmation — when a booking is made, a confirmation SMS is sent to the caller's phone number
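
The steps above can be sketched as a single conversational turn. This is an illustrative outline only, not Empfio's implementation: `handle_turn`, `AgentReply`, and the four injected callables are hypothetical names, and the real service streams audio continuously rather than handling one complete utterance at a time.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AgentReply:
    """Hypothetical shape of an agent reply; not Empfio's actual type."""
    text: str
    booking_confirmed: bool = False


def handle_turn(
    caller_number: str,
    audio: bytes,
    transcribe: Callable[[bytes], str],          # step 3: speech-to-text
    run_agent: Callable[[str], AgentReply],      # step 4: agent processing
    synthesize: Callable[[str], bytes],          # step 5: text-to-speech
    send_sms: Callable[[str, str], None],        # step 6: confirmation SMS
) -> bytes:
    """Run one caller utterance through the pipeline and return reply audio."""
    transcript = transcribe(audio)
    reply = run_agent(transcript)
    speech = synthesize(reply.text)
    if reply.booking_confirmed:
        send_sms(caller_number, f"Booking confirmed: {reply.text}")
    return speech


# Usage with stand-in stages (no real providers are called).
sent = []
speech = handle_turn(
    "+490000000000",                 # placeholder caller number
    b"\xff" * 160,                   # placeholder audio frame
    transcribe=lambda audio: "I'd like an appointment tomorrow at 10",
    run_agent=lambda text: AgentReply("Booked for 10:00 tomorrow.", True),
    synthesize=lambda text: text.encode("utf-8"),
    send_sms=lambda number, body: sent.append((number, body)),
)
```

Passing the stages in as callables keeps the sketch independent of any provider SDK, so each stage can be swapped for a stub when testing.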

Setup

Voice is provisioned through Empfio's AI Number feature, which provides a single phone number that handles voice, SMS, and WhatsApp:

  1. Go to Settings → AI Number in the Empfio dashboard
  2. Provision a new AI phone number
  3. The number is automatically configured for voice calls, SMS, and WhatsApp
  4. Test by calling the number

Speech pipeline

| Stage | Provider | Notes |
|---|---|---|
| Telephony | Telnyx | TeXML + Media Stream WebSocket, PCMU/8000 |
| Speech-to-text | Deepgram Nova-3 | ~100ms latency, accepts mulaw natively |
| AI agent | LangGraph (GPT-4o) | Same agent as WhatsApp/Telegram |
| Text-to-speech | ElevenLabs Turbo v2.5 | ~200ms first-byte latency |
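
PCMU/8000 is G.711 μ-law audio at 8000 samples per second, one byte per sample. Because the speech-to-text stage accepts mulaw natively, the pipeline does not need to decode it; the sketch below is only meant for inspecting or debugging raw audio frames. The function names are illustrative and not part of the voice service.

```python
from typing import List


def mulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit linear PCM sample."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_pcmu(frame: bytes) -> List[int]:
    """Decode a whole PCMU frame into linear PCM samples."""
    return [mulaw_to_pcm16(b) for b in frame]


def frame_duration_ms(frame: bytes, sample_rate: int = 8000) -> float:
    """One byte per sample, so duration follows directly from the length."""
    return len(frame) * 1000 / sample_rate
```

At this rate a 160-byte frame carries 20 ms of audio, and a frame consisting entirely of 0xFF bytes decodes to silence.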

Streaming for low latency

Voice conversations require low latency to feel natural. The agent uses streaming mode (/chat/stream) so the first words of the reply are spoken before the full response is generated. This significantly reduces the perceived wait time.
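
One common way to benefit from a streamed reply is to pass text to the text-to-speech stage sentence by sentence instead of waiting for the whole response. The sketch below assumes the stream arrives as an iterable of text chunks; the wire format of /chat/stream is not described here, and `sentences_from_tokens` is a hypothetical helper rather than Empfio's code.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace. Requiring the
# whitespace avoids splitting inside tokens such as "10.30".
_SENTENCE_BREAK = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        while True:
            match = _SENTENCE_BREAK.search(buffer)
            if match is None:
                break
            yield buffer[:match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        # Flush whatever remains once the stream ends.
        yield buffer.strip()


chunks = ["Sure", ", I can", " help.", " What", " time", " works?"]
spoken = list(sentences_from_tokens(chunks))
```

With this approach the first sentence can be synthesized while the model is still generating the second one.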

Typical latency breakdown:

| Stage | Time |
|---|---|
| Speech-to-text (STT) | ~100ms |
| LLM first token | ~500ms |
| Text-to-speech (TTS) | ~200ms |
| Total perceived delay | ~800ms |
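
The total is the sum of the three stages, which run one after another for the first audible word. A small worked calculation using the approximate figures above (the dictionary keys are illustrative names):

```python
# Approximate per-stage latencies in milliseconds, taken from the table.
LATENCY_MS = {
    "stt": 100,
    "llm_first_token": 500,
    "tts_first_byte": 200,
}

total_ms = sum(LATENCY_MS.values())

# Share of the total delay contributed by each stage.
share = {stage: ms / total_ms for stage, ms in LATENCY_MS.items()}
slowest_stage = max(LATENCY_MS, key=LATENCY_MS.get)
```

Under these figures the LLM's first token accounts for more than half of the perceived delay, so it is the first place to look when replies feel slow.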

Barge-in

When a caller speaks while the AI is talking, Empfio immediately stops playback and listens. This makes conversations feel natural rather than robotic. Barge-in is enabled by default and can be controlled via ENABLE_BARGE_IN in the voice service configuration.
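
The behaviour described above can be pictured as a small playback controller. This is a minimal sketch under stated assumptions: `PlaybackController` and `barge_in_enabled` are hypothetical names, and the set of values accepted for ENABLE_BARGE_IN is a guess, since only the setting's name and its enabled-by-default behaviour are documented.

```python
import os
from typing import Iterable, List, Mapping, Optional


def barge_in_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Read ENABLE_BARGE_IN; enabled unless explicitly switched off."""
    env = os.environ if env is None else env
    value = env.get("ENABLE_BARGE_IN", "true").strip().lower()
    return value not in {"0", "false", "no", "off"}  # assumed spellings


class PlaybackController:
    """Tracks queued reply audio and drops it when the caller interrupts."""

    def __init__(self, barge_in: bool = True) -> None:
        self.barge_in = barge_in
        self.speaking = False
        self.queue: List[bytes] = []

    def start_reply(self, chunks: Iterable[bytes]) -> None:
        self.queue = list(chunks)
        self.speaking = True

    def on_caller_speech(self) -> bool:
        """Return True if playback was interrupted by the caller."""
        if self.barge_in and self.speaking:
            self.queue.clear()       # discard audio not yet played
            self.speaking = False
            return True
        return False
```

With barge-in disabled, `on_caller_speech` leaves the queue untouched and the reply plays to the end.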

Limitations

  • Voice requires a Telnyx account and a provisioned AI Number
  • Conference calls and multi-party calls are not supported
  • The agent cannot process DTMF (keypad/tone input) — speech only
  • Voice recognition works best in English and German

Troubleshooting

| Problem | Fix |
|---|---|
| No audio when calling | Check that the Telnyx TeXML Application voice_url points to the voice service |
| Agent not responding | Verify the voice service is running (GET /health) and TELNYX_API_KEY is set |
| Wrong language in replies | Check your organization's language setting in Settings → General |
| Long pauses before replies | Likely high LLM latency; check the agent service logs |
| Call drops after a few seconds | Verify the Telnyx number is active and the TeXML Application is correctly assigned |
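
The "Agent not responding" checks can be scripted. The sketch below uses only the Python standard library; `check_voice_service` is a hypothetical helper, the base URL is whatever host and port your deployment exposes, and the response body of GET /health is not assumed, only its status code. Note that it reads TELNYX_API_KEY from the environment it runs in, so run it where the voice service's configuration is visible.

```python
import os
import urllib.request
from typing import Callable, List, Mapping, Optional


def check_voice_service(
    base_url: str,
    env: Optional[Mapping[str, str]] = None,
    open_url: Callable = urllib.request.urlopen,
    timeout: float = 3.0,
) -> List[str]:
    """Return a list of detected problems; an empty list means both checks passed."""
    env = os.environ if env is None else env
    problems: List[str] = []

    if not env.get("TELNYX_API_KEY"):
        problems.append("TELNYX_API_KEY is not set")

    url = base_url.rstrip("/") + "/health"
    try:
        with open_url(url, timeout=timeout) as response:
            if response.status != 200:
                problems.append(f"GET /health returned HTTP {response.status}")
    except OSError as exc:
        # urllib's URLError and HTTPError are OSError subclasses, so this
        # covers refused connections, timeouts, and non-2xx responses.
        problems.append(f"GET /health failed: {exc}")

    return problems
```

Calling `check_voice_service("http://localhost:8080")` (a placeholder address) returns the problems found, which makes the helper easy to wire into a deployment smoke test.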
