The Voice Layer
How Metis learned to speak — and why a human had to listen four times before it got the main word right.
The post is called The LLM Smell Problem.
The AI narrator couldn’t say “AI.”
That’s where this post starts.
I shipped a new feature to Metis this week: audio narration for every blog post. Eight voices. Google Cloud TTS Chirp 3 HD. A pipeline that takes a post slug, strips the markdown, applies a pronunciation dictionary, and hands the whole thing to the TTS engine. MP3 lands in R2. Frontmatter gets the voice list. Done.
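The per-voice request is roughly this shape — a sketch, not the actual Metis code. `buildTtsRequest` is a hypothetical helper, and the `en-US-Chirp3-HD-<Name>` voice naming follows Google's published convention for Chirp 3 HD:

```typescript
// Sketch of the request handed to Google Cloud TTS for one voice.
// buildTtsRequest is a hypothetical helper; the real pipeline may differ.
type TtsRequest = {
  input: { text: string };
  voice: { languageCode: string; name: string };
  audioConfig: { audioEncoding: 'MP3' };
};

function buildTtsRequest(text: string, voice: string): TtsRequest {
  return {
    input: { text }, // plain text, post-markdown-strip and post-dictionary
    voice: { languageCode: 'en-US', name: `en-US-Chirp3-HD-${voice}` },
    audioConfig: { audioEncoding: 'MP3' }, // the MP3 that lands in R2
  };
}
```

With the `@google-cloud/text-to-speech` client, an object like this goes to `synthesizeSpeech`, and the returned `audioContent` buffer is what gets uploaded.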
Except it wasn’t done. Because I listened.
The first listen is always the test. Not a technical test — everything was working. The test is whether a human can sit through it without wincing.
I got about 30 seconds in. “The A… I didn’t generate any of that.”
The main word of a post called The LLM Smell Problem, read by an AI, came out as a grammatical accident.
The fix should have been simple. Add “AI” to the pronunciation dictionary with the alias “A I.” Regenerate.
Still wrong. “The A, didn’t generate any of that.”
Okay. Try `<say-as interpret-as="characters">`. That's the standard SSML fix for acronyms.
Still wrong. Turns out Chirp 3 HD doesn't support `<say-as>`. It just… ignores it and keeps doing what it was doing.
Try “Ay Eye” instead. Phonetic. Should work.
It said “I I.” Because “Ay” (/aɪ/) and “Eye” (/aɪ/) are the same sound. I’d written two identical vowels and called it a fix.
Fourth attempt: A.I. with periods. The periods create a tiny prosodic boundary between the letters. The TTS stopped merging the “I” into the next word.
Four cycles. One word. The irony of an AI post that couldn’t say “AI” was not lost on me.
That’s what the human review loop is actually for. Not sign-off. Not a rubber stamp. A genuine listen, every time, before anything goes live.
The pronunciation dictionary is what comes out the other side. Every mispronunciation caught becomes a permanent fix — LLM, KB, PARA, NanoClaw, and now AI with its four-cycle story. The next post starts from a better place than this one did.
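Applying such a dictionary can be sketched like this. The entries and the `applyPronunciations` helper are illustrative, not Metis's actual table — only the AI → A.I. spelling comes from the story above, and the other alias is a guess:

```typescript
// Illustrative pronunciation dictionary: written term -> spoken alias.
// Only the "AI" entry reflects the fix described above; "LLM" is a guessed alias.
const PRONUNCIATIONS: Record<string, string> = {
  AI: 'A.I.', // the periods add the prosodic boundary that finally worked
  LLM: 'L-L-M', // hypothetical alias
};

// Replace whole words only, so the letters of "AI" inside a longer
// word are never touched.
function applyPronunciations(text: string): string {
  let out = text;
  for (const [term, alias] of Object.entries(PRONUNCIATIONS)) {
    out = out.replace(new RegExp(`\\b${term}\\b`, 'g'), alias);
  }
  return out;
}
```

The word-boundary match matters: a plain string replace is exactly the kind of fix that would pass a unit test and fail the listen.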
There are eight voices. I picked three for this post:
- Charon — deep, deliberate. Sounds like it's choosing its words.
- Puck — more natural. The one that sounds least like a narration.
- Zephyr — calm, reflective. Matched the tone of the post.
The frontmatter is just:
```yaml
audioVoices:
  - { voice: 'Charon', duration: '3 min' }
  - { voice: 'Puck', duration: '3 min' }
  - { voice: 'Zephyr', duration: '3 min' }
```
No file paths. Astro resolves `audio/<slug>-<voice>.mp3` automatically. Readers pick the voice. The system handles the rest.
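That resolution convention is small enough to sketch. This is a hypothetical helper, not the real Astro component, and the voice casing in the filename is an assumption:

```typescript
// Sketch: map frontmatter audioVoices entries to playable sources,
// following the audio/<slug>-<voice>.mp3 convention described above.
// Hypothetical helper — the actual Astro code may differ.
type AudioVoice = { voice: string; duration: string };

function audioSources(slug: string, voices: AudioVoice[]) {
  return voices.map((v) => ({ ...v, src: `audio/${slug}-${v.voice}.mp3` }));
}
```

So a frontmatter entry of `{ voice: 'Charon', duration: '3 min' }` on a post with slug `the-llm-smell-problem` (slug illustrative) would resolve to `audio/the-llm-smell-problem-Charon.mp3` without the frontmatter ever naming a file.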
Metis now speaks.
The pipeline that captures a thought at midnight, routes it to the vault, drafts it into a post, and publishes it to the world — that pipeline now has a voice at the end of it.
I didn’t plan for this to be a feature. It started as a test. Then I listened, and I wanted to get it right.
That’s usually how the good ones start.