Empfio Docs

Voice Provider Costs

Telephony, speech-to-text, and text-to-speech pricing comparison for the Empfio voice channel.

Overview

Running an AI voice agent involves three cost layers: telephony (the phone number and call minutes), speech-to-text (transcribing the caller), and text-to-speech (speaking the AI's reply). Empfio selects providers at each layer for the best cost-to-quality ratio for European SMEs.


Telephony — Phone Numbers

Empfio uses Telnyx as its primary telephony provider, replacing Twilio.

Number monthly costs

| Provider | German local | German mobile | US local | Notes |
|---|---|---|---|---|
| Telnyx (default) | €0.86/mo | €0.65/mo | $1.15/mo | Primary provider |
| Twilio | ~€15/mo | ~€15/mo | $1.15/mo | Legacy / fallback |
| Vonage | ~€3/mo | ~€5/mo | ~$1/mo | |
| Sinch | ~€3/mo | | ~$1/mo | |

Telnyx German local numbers are roughly 17× cheaper than Twilio (€0.86/mo vs. ~€15/mo) for the same capability (voice + SMS).

Voice and SMS rates (Telnyx)

| Type | Rate |
|---|---|
| Inbound voice | $0.002/min |
| Outbound voice (US) | $0.01/min |
| Inbound SMS | $0.004/message |
| Outbound SMS | $0.006/message |

Speech-to-Text (STT)

Converts the caller's speech to text in real time.

| Provider | Price/min | Streaming | Notes |
|---|---|---|---|
| Deepgram Nova-3 (default) | $0.0077/min ($0.46/hr) | Yes | Accepts mulaw/8000 natively, no audio conversion needed |
| AssemblyAI Universal-Streaming | $0.0025/min ($0.15/hr) | Yes | Cheapest option listed, about 3× cheaper than Deepgram |
| OpenAI Whisper API | $0.006/min | No | Batch-only, so not suitable for real-time calls |
| Google Cloud STT (standard) | $0.024/min | Yes | 15-second billing rounding |
| Azure Speech-to-Text | $0.017/min | Yes | |
| Telnyx STT (in-house) | $0.025/min | Yes | About 3× more expensive than Deepgram |
| Telnyx STT (Google engine) | $0.050/min | Yes | Most expensive option |

Current choice: Deepgram Nova-3 — best balance of accuracy, real-time streaming latency (~100ms), and cost. Accepts mulaw/8000 audio natively so no conversion step is needed.

Best cost alternative: AssemblyAI Universal-Streaming at $0.15/hr is about 3× cheaper than Deepgram with comparable real-time performance. It can be enabled by setting STT_PROVIDER=assemblyai in the voice service once the provider implementation has been added.
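As a quick check of the hourly figures and the "about 3×" ratio quoted above, the arithmetic can be sketched as follows. The rates are copied from the STT table; the function name `per_hour` is only illustrative and is not part of the voice service.

```python
# Per-minute STT rates in USD, copied from the table on this page.
DEEPGRAM_PER_MIN = 0.0077
ASSEMBLYAI_PER_MIN = 0.0025


def per_hour(rate_per_min: float) -> float:
    """Convert a per-minute rate to a per-hour rate."""
    return rate_per_min * 60


deepgram_hr = round(per_hour(DEEPGRAM_PER_MIN), 2)      # 0.46
assemblyai_hr = round(per_hour(ASSEMBLYAI_PER_MIN), 2)  # 0.15
ratio = round(DEEPGRAM_PER_MIN / ASSEMBLYAI_PER_MIN, 2)  # 3.08

print(f"Deepgram: ${deepgram_hr}/hr, AssemblyAI: ${assemblyai_hr}/hr, ratio: {ratio}x")
```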


Text-to-Speech (TTS)

Converts the AI agent's text reply to audio and streams it to the caller.

| Provider | Price/1K chars | Latency | Voice quality |
|---|---|---|---|
| ElevenLabs Turbo v2.5 (default) | ~$0.10/1K chars | ~200ms | Best |
| Cartesia Sonic 3 | $0.030/1K chars | Very low | Excellent; purpose-built for voice agents |
| Google Cloud Neural2 | $0.016/1K chars | ~300ms | Good |
| Azure Neural HD | $0.015/1K chars | ~300ms | Good |
| Telnyx TTS (Azure HD bundled) | $0.045/1K chars | | Same as Azure at higher cost |
| OpenAI TTS-1 | $0.015/1K chars | ~300ms | Average |

Current choice: ElevenLabs Turbo v2.5. It offers the most natural voice quality and is purpose-built for low-latency streaming. The cost is higher, but the quality difference is clearly audible to callers, which matters for SME trust.

Best cost alternative: Cartesia Sonic 3 at $0.030/1K chars is about 3× cheaper than ElevenLabs and is built specifically for real-time voice agents with very low latency. It can be enabled via TTS_PROVIDER=cartesia.


Cost per call — example

Assumptions: 5-minute call, 200 words spoken by caller (~1,200 chars), 150 words spoken by AI (~900 chars).

| Component | Current stack | Optimized stack |
|---|---|---|
| Telnyx inbound (5 min) | $0.01 | $0.01 |
| STT, Deepgram (5 min) | $0.039 | |
| STT, AssemblyAI (5 min) | | $0.013 |
| TTS, ElevenLabs (900 chars) | $0.090 | |
| TTS, Cartesia (900 chars) | | $0.027 |
| Total per call | ~$0.14 | ~$0.05 |

The optimized stack (AssemblyAI + Cartesia) is ~3× cheaper per call while maintaining good quality. ElevenLabs voices are noticeably more natural, which may be worth the premium for customer-facing businesses.
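The per-call totals can be reproduced with a short calculation. This is a minimal sketch using the rates listed on this page; the function and dictionary names are illustrative only and do not come from the voice service. Note that STT is billed on call duration, so the caller's character count does not enter the formula.

```python
# Rates in USD, copied from the tables on this page.
TELNYX_INBOUND_PER_MIN = 0.002
STT_PER_MIN = {"deepgram": 0.0077, "assemblyai": 0.0025}
TTS_PER_1K_CHARS = {"elevenlabs": 0.10, "cartesia": 0.030}


def cost_per_call(minutes: float, ai_chars: int, stt: str, tts: str) -> float:
    """Estimate the cost of one inbound call in USD."""
    telephony = minutes * TELNYX_INBOUND_PER_MIN
    stt_cost = minutes * STT_PER_MIN[stt]               # billed per call minute
    tts_cost = ai_chars / 1000 * TTS_PER_1K_CHARS[tts]  # billed per 1K characters spoken
    return telephony + stt_cost + tts_cost


# Assumptions from the example: 5-minute call, ~900 characters spoken by the AI.
current = cost_per_call(5, 900, stt="deepgram", tts="elevenlabs")
optimized = cost_per_call(5, 900, stt="assemblyai", tts="cartesia")

print(f"current:   ${current:.4f}")    # $0.1385
print(f"optimized: ${optimized:.4f}")  # $0.0495
print(f"ratio:     {current / optimized:.1f}x")  # 2.8x
```

With these inputs TTS dominates the current stack (about $0.09 of $0.14), so the TTS provider choice has the largest effect on per-call cost.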


Switching providers

All STT and TTS providers are swappable via environment variables in the voice service. No code changes required for supported providers.

# .env (voice service)
STT_PROVIDER=deepgram       # or: assemblyai, whisper, google
TTS_PROVIDER=elevenlabs     # or: cartesia, google, piper

To add a new provider:

  1. Create voice/app/stt/your_provider.py implementing BaseSTTEngine
  2. Add an elif branch in voice/app/stt/factory.py
  3. Add settings in voice/app/core/config.py
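The three steps above might look roughly like the sketch below. The real `BaseSTTEngine` interface is not shown on this page, so the base class here is a stand-in, and the method name `transcribe_chunk`, the class names `DeepgramEngine` and `YourProviderEngine`, and the function `create_stt_engine` are all assumptions for illustration. Check the actual code in voice/app/stt/ before copying anything.

```python
import os
from abc import ABC, abstractmethod
from typing import Optional


class BaseSTTEngine(ABC):
    """Stand-in for the real base class; its actual methods may differ."""

    @abstractmethod
    def transcribe_chunk(self, audio: bytes) -> str:
        """Return the transcript for one chunk of audio (hypothetical method)."""


class DeepgramEngine(BaseSTTEngine):
    """Placeholder for the existing default engine."""

    def transcribe_chunk(self, audio: bytes) -> str:
        return ""  # a real engine would call the provider here


# Step 1: voice/app/stt/your_provider.py
class YourProviderEngine(BaseSTTEngine):
    def transcribe_chunk(self, audio: bytes) -> str:
        return ""  # call your provider's streaming API here


# Step 2: the extra elif branch in voice/app/stt/factory.py
def create_stt_engine(provider: Optional[str] = None) -> BaseSTTEngine:
    # Step 3 would normally read this from voice/app/core/config.py;
    # reading the environment directly keeps the sketch self-contained.
    name = (provider or os.environ.get("STT_PROVIDER", "deepgram")).lower()
    if name == "deepgram":
        return DeepgramEngine()
    elif name == "your_provider":
        return YourProviderEngine()
    raise ValueError(f"Unknown STT_PROVIDER: {name!r}")
```

Raising on an unknown name, rather than silently falling back to the default, makes a typo in the .env file fail at startup instead of during a live call.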
