Blog · 8 April 2026 · 12 min read

We measured perceived latency in 9 voice-to-text tools. Here's what we found.

We instrumented nine voice-to-text tools and measured key-release-to-first-character on identical hardware. The spread between fastest and slowest is 11x. Why that matters more than accuracy in 2026.

TL;DR: We instrumented nine voice-to-text tools on identical Apple Silicon hardware and measured key-release-to-text latency. The fastest was Yapper at 142ms median. The slowest was a cloud-only product at 1,580ms. The 11x spread is bigger than any accuracy spread we measured. For high-volume voice users, latency dominates the experience. Methodology, results, and a note on what’s improving.

Why latency dominates accuracy in 2026

For most of the last decade, the headline number for any speech-to-text product was word-error-rate (WER). That made sense when WER ranged from 12% to 25% — the difference between a usable transcript and a garbled one. In 2026, the best models have converged. Whisper-large-v3-turbo, Deepgram Nova-3 and the latest Google Gemini speech models all sit in the 3-5% WER band on English. The accuracy ceiling is here.

What hasn’t converged is latency. The fastest pipeline we measured runs in roughly 140ms; the slowest takes over 1.5 seconds. Eleven-fold. That gap is now what users feel.

What we measured

Perceived latency — the wall-clock time between a user releasing the dictation hotkey and the first non-whitespace character appearing in the target text field. Not transcription time. Not network time. Not cleanup time. The thing your eyes actually wait for.

Methodology

  • Hardware: MacBook Pro 14″ M3 Pro, 36GB RAM, macOS 14.4.
  • Network: residential fibre, sustained 200Mbps down / 50Mbps up.
  • Sample sentence: a single 12-second utterance, 38 words, mix of common English and three deliberately uncommon proper nouns.
  • Each app warmed up with five throwaway dictations before recording.
  • 100 trials per app, dictated 5 seconds apart to avoid thermal throttling effects on the local-Whisper apps.
  • Capture: 120fps screen recording, frame-stepped to find the first frame where any non-whitespace character appears in the cursor’s target field. Visual frame inspection rather than internal API timestamps because what we wanted is what users see.
  • Same Mac, same OS, same network, same target text field for every app.
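The frame-stepping arithmetic above reduces to a frame delta divided by the capture rate. A minimal sketch of how we turn frame indices into the p50/p95 numbers below — the trial data here is hypothetical, not our measurements:

```python
import statistics

FPS = 120  # capture rate of the screen recording

def latency_ms(release_frame: int, first_char_frame: int, fps: int = FPS) -> float:
    """Convert a frame delta from the recording into milliseconds."""
    return (first_char_frame - release_frame) / fps * 1000

# Hypothetical per-trial frame pairs: (hotkey release, first non-whitespace char).
trials = [(0, 17), (0, 18), (0, 16), (0, 25), (0, 17)]
samples = sorted(latency_ms(r, c) for r, c in trials)

p50 = statistics.median(samples)
# Nearest-rank p95: the sample at or above 95% of the distribution.
p95 = samples[min(len(samples) - 1, int(0.95 * len(samples)))]
```

At 120fps each frame is 8.3ms, so that is also the resolution floor of the whole experiment — comfortably below the differences we’re reporting.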

The results

| App | p50 | p95 | WER | Pipeline |
| --- | --- | --- | --- | --- |
| Yapper | 142ms | 210ms | 3.8% | Local + deferred LLM |
| Glido | 410ms | 680ms | 4.1% | Local + sync cleanup |
| whisper.cpp (raw) | 480ms | 740ms | 4.4% | Local, no cleanup |
| MacWhisper | 590ms | 920ms | 4.0% | Local, no cleanup |
| Apple Dictation | 760ms | 1,100ms | 5.6% | Cloud, no cleanup |
| Wispr Flow | 880ms | 1,340ms | 3.6% | Cloud + cloud LLM |
| Otter live | 920ms | 1,420ms | 4.2% | Cloud streaming |
| AssemblyAI realtime | 1,210ms | 1,560ms | 3.9% | Cloud streaming |
| Deepgram Aura | 1,580ms | 2,040ms | 3.4% | Cloud + post-process |

Three things the data tells us

1. The cleanup pass is the longest pole

Compare MacWhisper (590ms, no cleanup) against Glido (410ms, with cleanup). Glido is faster despite doing more work, because it compiles to native code more aggressively. The cleanup itself adds roughly 200-400ms in any pipeline that does it synchronously. That’s why Yapper does it asynchronously — the user sees the raw text first, the cleaner version arrives in the background.
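The deferred pattern is simple to sketch. This is not Yapper’s actual code — `inject`, `clean`, and the list standing in for a text field are illustrative stand-ins — but it shows the shape: raw text lands synchronously, the cleanup replaces it from a background thread:

```python
import threading

def inject(field: list, text: str) -> None:
    """Stand-in for typing text into the focused field."""
    field.clear()
    field.extend(text.split())

def clean(text: str) -> str:
    """Stand-in for the LLM cleanup pass (fillers, casing, punctuation)."""
    return text.replace("um ", "").capitalize() + "."

def dictate(field: list, raw: str) -> threading.Thread:
    inject(field, raw)  # the user sees raw text immediately
    t = threading.Thread(target=lambda: inject(field, clean(raw)))
    t.start()           # the polish arrives in the background
    return t

field: list = []
t = dictate(field, "um send the deck to dana by friday")
t.join()
```

The design choice this encodes: perceived latency is set by the first `inject`, so the 200-400ms cleanup cost disappears from the number the user feels.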

2. Cloud is now structurally slower

Even with a 200Mbps fibre line, every cloud-based app sat above 700ms median. The network round-trip is unavoidable: TCP handshake, TLS, request, model invocation, response. On a worse network, this gets dramatically worse — we re-ran AssemblyAI on a coffee shop’s 25Mbps and saw p95 jump to 3.2 seconds.
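A back-of-envelope budget makes the structural floor concrete. The numbers here are illustrative assumptions, not measurements: roughly one round trip each for the TCP handshake, the TLS handshake, and the request/response itself, plus audio upload and model time:

```python
def cloud_floor_ms(rtt_ms: float, model_ms: float, upload_ms: float = 0.0) -> float:
    """Rough lower bound for a cold cloud dictation request.

    One RTT each for TCP, TLS 1.3, and request/response, plus upload and
    model time. Keep-alive connections remove the first two terms;
    streaming restructures the model term but not the round trip.
    """
    return 3 * rtt_ms + upload_ms + model_ms

# Illustrative: 40ms RTT, 80ms to upload 12s of compressed audio, 300ms model.
budget = cloud_floor_ms(40, 300, 80)
```

With those assumed figures the floor is already around half a second before rendering a single character, which is consistent with every cloud app in the table sitting above 700ms.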

3. Accuracy is converged

WER ranged from 3.4% (Deepgram Aura) to 5.6% (Apple Dictation). That’s a 65% relative spread, but in practice it’s the difference between 3.4 errors per 100 words and 5.6 errors per 100 words — about two extra errors per 100-word dictation, rarely enough to register in any single use. The latency spread is felt on every dictation; the accuracy spread is rarely felt at all.

Why this matters for the way you build

If you’re a product team shipping voice features, the traditional priority order — accuracy first, then latency, then UX — is now the wrong order. Latency first, UX second, accuracy third. The cheapest accuracy gain past the 95% threshold is dwarfed by the felt benefit of cutting another 200ms.

Yapper was built around this insight. We use a smaller cleanup model (Claude Haiku, prompt-cached) than competitors who reach for Sonnet or GPT-4-class models, and we run it asynchronously. The result: the user gets text immediately and the LLM polish arrives invisibly. Full architecture writeup.

What’s improving

  • Apple Silicon. Each generation has roughly halved the cost of running Whisper-large locally. M5 should put local Whisper-v4 inside 50ms.
  • Smaller cleanup models. Haiku-class models are now sub-200ms TTFT. The gap between “ship raw” and “ship polished” is closing.
  • Streaming injection. Apps that start injecting before transcription completes will get to sub-100ms perceived latency this year. We’re shipping this in Yapper’s Instant Mode (Max plan).
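Streaming injection can be sketched in a few lines. Assume a recognizer that emits growing partial hypotheses (the names here are hypothetical, and real recognizers revise earlier words more often than this suggests):

```python
from typing import Iterator

def partial_transcripts() -> Iterator[str]:
    """Stand-in for a streaming recognizer emitting growing hypotheses."""
    yield "send the"
    yield "send the deck"
    yield "send the deck to dana"

def stream_inject(partials: Iterator[str]) -> str:
    """Inject only the new suffix of each hypothesis, so text appears
    while the user is still speaking rather than after they stop."""
    shown = ""
    for hyp in partials:
        if hyp.startswith(shown):
            shown += hyp[len(shown):]  # append just the delta
        else:
            shown = hyp                # hypothesis revised: replace it
    return shown

final = stream_inject(partial_transcripts())
```

The hard part in practice is the `else` branch: rewriting already-injected text without flicker is what separates a demo from something you can ship.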

Frequently asked questions

What’s a “good” latency target?

Below 150ms it feels instant. Below 300ms it feels fast. Above 600ms it feels like a feature you’re using rather than a natural input. We design for the first.

Why test on a Mac and not a PC?

Apple Silicon is the only consumer hardware where local Whisper-v3 runs comfortably faster than realtime. PC numbers depend on whether you have a recent NVIDIA GPU. We’ll publish a Windows comparison when Yapper ships there.

Where can I get the raw data?

We’ll publish the test harness, sentence corpus, and per-trial timing CSV at /research.

Is this an ad for Yapper?

Yes. It’s also true. We were the first to publish the deferred-cleanup approach and it’s reproducible — the methodology is open. If someone builds a faster Mac dictation tool, we’ll cite them.


Want to try the fastest dictation tool we measured? Download Yapper for macOS — 2,500 free words, no card. Or read the next post: /blog.
