Main Branch

Fundamentals first, always

Articles

I Gave My Newsletter a Voice (Literally)

I added a microphone button to my newsletter's chat widget. Click it and you're in a real-time voice conversation with an AI agent that knows every issue I've ever written. Here's how LiveKit, RAG, and a single Railway container make it work.

Andrea Griffiths · 5 min read
Tags: LiveKit · Voice AI · RAG · GitHub Models · Python · FastAPI

My newsletter site has a chat widget now. You type a question, it searches through every issue I’ve ever written, and gives you an answer with sources.

That took an evening. Cool, but not interesting enough to write about.

What made me write this: I added a microphone button next to the text input. Click it, and you’re in a real-time voice conversation with an AI agent that knows my content. You talk, it listens, it talks back. Not a recording. Not text-to-speech over a chat response. An actual voice conversation.

The stack behind it is LiveKit, and I want to walk through how it works because it’s simpler than I expected.

What LiveKit Actually Does

LiveKit is real-time communication infrastructure. Think “Zoom but programmable.” It handles all the WebRTC complexity — rooms, audio routing, codecs, latency optimization — so you don’t have to.

The part that matters for AI voice agents: LiveKit has an agent framework. You write a Python worker that connects to their cloud service and waits. When a user joins a room, LiveKit dispatches your agent into that room. The agent listens to the user’s microphone, processes speech, thinks, and talks back. All in real time.

The latency is wild. It feels like talking to someone, not waiting for a computer.

The Architecture

Three pieces:

The API — A FastAPI server that handles text chat (RAG search over my newsletter content) and generates LiveKit room tokens when someone clicks the mic button.

The voice agent — A Python worker running LiveKit’s agent SDK. It connects outbound to LiveKit Cloud and waits for rooms. When someone joins, it gets dispatched. Inside the agent: voice activity detection (Silero VAD), speech-to-text (Azure), an LLM (GPT-4.1-mini via GitHub Models), and text-to-speech (Azure). Before every response, it searches my knowledge base for relevant context — same RAG pipeline the text chat uses.

The frontend — An Astro component with a mic button. Clicking it loads the LiveKit client SDK, requests a room token from my API, and connects to the room via WebRTC. The agent joins, and they’re talking.

Both the API and the voice agent run in a single Railway container. A bash script starts both processes — if either dies, the container exits and Railway restarts it.
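The startup script can be as small as this. A sketch — the module paths and the agent's `start` subcommand are assumptions:

```shell
#!/usr/bin/env bash

# Start the FastAPI server and the LiveKit agent worker side by side.
uvicorn app.main:app --host 0.0.0.0 --port "${PORT:-8000}" &
python agent.py start &

# `wait -n` returns as soon as EITHER child exits; exiting non-zero
# takes the container down, and Railway restarts it.
wait -n
exit 1
```

The `wait -n` trick (bash 4.3+) is what gives you the "if either dies, restart both" behavior without a real process supervisor.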

The RAG Part

Every time the user says something, the agent:

  1. Transcribes the speech (Azure Speech STT)
  2. Embeds the transcript using GitHub Models API (text-embedding-3-small, 1536 dimensions)
  3. Searches a SQLite vector database (sqlite-vec) for the most relevant newsletter chunks
  4. Rebuilds the system prompt with fresh context
  5. Generates a response (GPT-4.1-mini via GitHub Models)
  6. Speaks it back (Azure Speech TTS)

This happens per utterance. The agent’s knowledge stays current with whatever the user is asking about, not stuck on whatever the first question was.
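Step 4 above is plain string assembly. A sketch with hypothetical field names (`source`, `text`) for the retrieved chunks:

```python
def build_system_prompt(question: str, chunks: list[dict], content_index: str) -> str:
    # Each retrieved chunk is assumed to carry its source URL and matched text.
    excerpts = "\n\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return (
        "You are a voice assistant for the Main Branch newsletter. "
        "Answer briefly and conversationally.\n\n"
        f"{content_index}\n\n"
        f"Context retrieved for the user's last utterance:\n{excerpts}\n\n"
        "If the context doesn't cover the question, say so."
    )
```

Because this runs on every utterance, the prompt the LLM sees always reflects the current question rather than the conversation's opening topic.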

The hybrid retrieval trick: Vector search alone can’t answer “what’s the latest issue?” because semantic similarity doesn’t understand ordering. The solution: at startup, the agent queries the database for all newsletter issue URLs, extracts the numbers, and injects a content index into every system prompt:

Available newsletter issues: issue-1, issue-2, ..., issue-20
The latest/most recent issue is issue-20
Total issues: 20

Now the LLM gets both semantic context from vector search and structural metadata it can’t learn from embeddings. Ask “what’s the latest issue?” and it knows. Ask “tell me about GitHub Copilot” and vector search finds the right chunks. Hybrid retrieval.
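Building that index is a one-time scan at startup. A sketch, assuming issue URLs contain an `issue-N` slug:

```python
import re


def build_content_index(urls: list[str]) -> str:
    # Pull issue numbers out of stored URLs like ".../issue-12".
    nums = sorted({int(m.group(1)) for u in urls if (m := re.search(r"issue-(\d+)", u))})
    if not nums:
        return ""
    return (
        "Available newsletter issues: "
        + ", ".join(f"issue-{n}" for n in nums)
        + f"\nThe latest/most recent issue is issue-{nums[-1]}"
        + f"\nTotal issues: {len(nums)}"
    )
```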

What Surprised Me

LiveKit agents are workers, not servers. They don’t listen on a port. They connect outbound to LiveKit Cloud and get dispatched into rooms. This threw me at first — I kept trying to think of it as another HTTP service. It’s not. It’s a background worker that happens to handle real-time audio.

The voice pipeline has real latency requirements. Text chat can take 2-3 seconds and nobody cares. Voice? If there’s a 2-second gap after someone finishes talking, it feels broken. LiveKit’s streaming architecture handles this — the TTS starts speaking before the full LLM response is complete.

sqlite-vec is underrated. I’m running vector search in SQLite. No Pinecone, no Weaviate, no managed vector database. For a knowledge base of ~130 newsletter chunks (all 20 issues, articles, and GitHub blog posts), this is more than enough. The query takes single-digit milliseconds. Embeddings come from GitHub Models API during ingestion — free during preview, high quality (text-embedding-3-small, 1536 dims), and no local model loading headaches.

Debugging production is different. The voice agent worked locally but failed silently in Railway. The agent would listen, transcribe perfectly, but always respond with “I don’t have that information.” Turned out the embeddings API was returning 400 errors because an old environment variable (LIVEBRAIN_EMBEDDING_MODEL) was still set to a local model name (all-MiniLM-L6-v2) that the API didn’t recognize. The fix: delete the variable and let it default to text-embedding-3-small. Real-time logging made this visible — without print() statements showing chunk retrieval counts and similarities, I would have been guessing for hours.

What’s Next

I’m extracting the reusable parts of this into an open-source framework. The idea: point it at a YAML file with your content sources, run an ingestion script, and you get a voice agent that knows your stuff. Newsletter, documentation, blog — whatever you feed it.

It’s not ready yet. The mainbranch-agent version works, but the generic framework needs cleanup before anyone else can use it. I’ll open-source it when it’s actually good, not when it’s “minimum viable.”

If you want to see it in action, go to mainbranch.dev and click the chat bubble. The mic button is right there.

About the Author: Andrea Griffiths is a Senior Developer Advocate at GitHub, where she helps engineering teams adopt and scale developer technologies. She's passionate about making technical concepts accessible — to both humans and AI agents. Connect with her on LinkedIn, GitHub, or Twitter/X.