Every phone call to a business has a 3-second window. If the AI doesn't respond within that window, the call feels broken. Unnatural. Like talking to a machine.

Everything in the stack (receiving the call, transcribing speech, generating a response, converting it back to voice) has to complete inside that window, and we aim for under 2 seconds so the conversation feels natural.

That constraint rules out most obvious approaches. Here's how we solved it.

The Problem With "Wait and Respond"

The intuitive way to build a voice AI is simple: wait for the person to finish speaking, transcribe what they said, send it to an AI, get a response, and read it back.

The problem is that this approach is too slow. Each step takes time, and each step waits for the one before it to finish. By the time you add up transcription, AI processing, and voice synthesis, you're well over 3 seconds, and the caller already thinks something is wrong.
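To make the arithmetic concrete, here is a minimal sketch of how the delays stack up in a sequential pipeline. The per-stage numbers are hypothetical placeholders chosen for illustration, not measurements from our system.

```python
# Hypothetical per-stage latencies in seconds (illustrative only, not measured).
STAGE_LATENCY_S = {
    "transcription": 0.8,
    "ai_response": 1.5,
    "voice_synthesis": 0.9,
}


def sequential_latency(stages: dict[str, float]) -> float:
    """Total delay when every stage waits for the previous one to finish."""
    return sum(stages.values())


total = sequential_latency(STAGE_LATENCY_S)
print(f"Silence before the caller hears anything: {total:.1f}s")
```

Because the stages run back to back, the caller waits for the sum of all of them, so shaving one stage rarely helps enough on its own.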

The Solution: Do Everything at Once

Instead of a sequential pipeline, we built a streaming system where every layer starts working as soon as it has enough to go on.

Transcription starts processing your words as you speak them—not after you stop. The AI starts forming a response as soon as there's enough context to work with—not after transcription finishes. Voice synthesis starts converting the first sentence to audio while the AI is still writing the second one.

By the time the caller finishes speaking, most of the work is already done, so the response can start playing almost immediately.
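The overlap between writing and speaking can be sketched with two concurrent tasks joined by a queue. This is a toy model under stated assumptions: the function names, the sleep-based delays, and the two-sentence reply are all invented for illustration, and no real speech or model service is called.

```python
import asyncio

DONE = None  # sentinel that marks the end of the sentence stream


async def generate(sentences_out: asyncio.Queue, log: list[str]) -> None:
    """Stand-in for a streaming model: emits one sentence at a time."""
    for i in (1, 2):
        await asyncio.sleep(0.02)  # pretend each sentence takes 20 ms to write
        log.append(f"generated sentence {i}")
        await sentences_out.put(f"sentence {i}")
    await sentences_out.put(DONE)


async def synthesize(sentences_in: asyncio.Queue, log: list[str]) -> None:
    """Stand-in for voice synthesis: starts on each sentence as soon as it arrives."""
    while (sentence := await sentences_in.get()) is not DONE:
        log.append(f"synthesizing {sentence}")
        await asyncio.sleep(0.1)  # pretend synthesis takes 100 ms
        log.append(f"playing {sentence}")


async def run_pipeline() -> list[str]:
    log: list[str] = []
    queue: asyncio.Queue = asyncio.Queue()
    # Both stages run concurrently; the queue hands sentences across.
    await asyncio.gather(generate(queue, log), synthesize(queue, log))
    return log


events = asyncio.run(run_pipeline())
for event in events:
    print(event)
```

In the printed log, synthesis of the first sentence begins before the second sentence has been generated, which is the overlap a sequential pipeline cannot give you.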

The Details That Matter

Warm connections. Establishing a connection to an external service takes 200–400ms before any real work happens. We keep persistent connections open at all times so the first thing that happens when a call arrives is actual processing, not setup.
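A rough sketch of the difference, assuming a simulated 300ms handshake (inside the range quoted above). `FakeConnection` and `WarmPool` are hypothetical stand-ins, not a real provider client.

```python
import asyncio
import time

HANDSHAKE_S = 0.3  # assumed setup cost; simulated, not measured


class FakeConnection:
    """Stand-in for a connection to an external speech or model service."""

    @classmethod
    async def open(cls) -> "FakeConnection":
        await asyncio.sleep(HANDSHAKE_S)  # simulated connection setup
        return cls()

    async def send(self, payload: str) -> str:
        await asyncio.sleep(0.01)  # simulated request time
        return f"ack:{payload}"


class WarmPool:
    """Holds one connection open so requests can skip the handshake."""

    def __init__(self) -> None:
        self._conn = None

    async def warm_up(self) -> None:
        self._conn = await FakeConnection.open()

    async def send(self, payload: str) -> str:
        if self._conn is None:  # cold path: pay the handshake now
            await self.warm_up()
        return await self._conn.send(payload)


async def time_first_request(prewarm: bool) -> float:
    pool = WarmPool()
    if prewarm:
        await pool.warm_up()  # done before the call arrives
    start = time.perf_counter()
    await pool.send("hello")
    return time.perf_counter() - start


cold = asyncio.run(time_first_request(prewarm=False))
warm = asyncio.run(time_first_request(prewarm=True))
print(f"cold first request: {cold * 1000:.0f} ms, warm: {warm * 1000:.0f} ms")
```

The handshake cost does not disappear. It just moves to a moment when nobody is waiting on the line.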

A smart cache for common questions. "What are your hours?" "Do you accept Delta Dental?" These questions come up constantly. We pre-generate responses for the most common ones. When the system recognizes the intent, the answer plays in under 200ms instead of the usual 1.5 seconds.
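The idea can be sketched as a lookup that runs before the slow path. The keyword matcher, intent names, and placeholder audio bytes below are assumptions made for illustration. A production system would likely use a proper intent classifier and real pre-rendered audio.

```python
from typing import Callable, Optional

# Hypothetical keyword lists; a stand-in for real intent recognition.
INTENT_KEYWORDS = {
    "office_hours": ("hours",),
    "insurance": ("accept", "insurance"),
}

# Placeholder bytes standing in for audio rendered ahead of time.
PREGENERATED_AUDIO = {
    "office_hours": b"<pre-rendered audio: office hours>",
    "insurance": b"<pre-rendered audio: accepted insurance plans>",
}


def classify_intent(utterance: str) -> Optional[str]:
    """Return the first intent whose keywords appear in the utterance."""
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return None


def respond(utterance: str, slow_path: Callable[[str], bytes]) -> tuple[bytes, bool]:
    """Return (audio, cache_hit). Only call the slow path on a miss."""
    intent = classify_intent(utterance)
    if intent in PREGENERATED_AUDIO:
        return PREGENERATED_AUDIO[intent], True
    return slow_path(utterance), False


def full_pipeline(utterance: str) -> bytes:
    """Stand-in for the normal model plus synthesis path."""
    return b"<freshly generated audio>"


print(respond("What are your hours?", full_pipeline))
print(respond("I need to reschedule my cleaning", full_pipeline))
```

A cache hit skips both the model call and synthesis, which is where the time savings come from.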

Graceful fallbacks. If the AI takes longer than expected, a brief placeholder keeps the line active. Dead air—even a second of it—makes callers hang up. A natural "Let me check that for you" buys the system time without alarming the caller.
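One way to sketch this is a timeout around the model call that plays the filler line without cancelling the call itself. The threshold, `fake_model`, and the reply text are hypothetical, and the delays are scaled down so the example runs quickly.

```python
import asyncio
from typing import Awaitable, Callable

FILLER = "Let me check that for you."


async def answer_with_filler(
    reply: Awaitable[str],
    play: Callable[[str], None],
    filler_after_s: float,
) -> None:
    """Play the reply, preceded by a filler line if the reply is slow."""
    task = asyncio.ensure_future(reply)
    try:
        # shield() keeps the model call running even if the wait times out.
        text = await asyncio.wait_for(asyncio.shield(task), timeout=filler_after_s)
    except asyncio.TimeoutError:
        play(FILLER)  # fill the silence
        text = await task  # then keep waiting for the real answer
    play(text)


async def fake_model(delay_s: float) -> str:
    """Stand-in for the model call; sleeps instead of doing real work."""
    await asyncio.sleep(delay_s)
    return "Here is the answer."


def run_call(model_delay_s: float) -> list[str]:
    spoken: list[str] = []
    asyncio.run(
        answer_with_filler(fake_model(model_delay_s), spoken.append, filler_after_s=0.1)
    )
    return spoken


fast_call = run_call(0.01)  # reply arrives before the threshold
slow_call = run_call(0.3)   # reply arrives after the threshold
print(fast_call)
print(slow_call)
```

The fast call speaks only the answer, while the slow call speaks the filler first and then the answer, so the caller never sits in silence.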

Choosing the right AI model. A more powerful model produces better responses but adds 300–500ms to every reply. We use a version of Claude tuned for speed, which hits our latency target while still handling the full range of dental scheduling conversations naturally.

The Results

Under 2 seconds average response time
97% accuracy on transcribing caller speech
$0.12 per minute all-in cost

Why It Matters

Speed isn't just a technical metric here—it's what makes the product work.

A caller who hears a half-second pause might not notice. A caller who hears three seconds of silence assumes the line dropped or the system is broken. They hang up. That's the missed call you were trying to avoid in the first place.

Getting the timing right is what turns voice AI from a curiosity into something that actually runs a front desk.

Want to see it in action? Check out Velyn Dental.
