A field guide for building a meeting co-pilot that actually participates — not a delayed transcript, not a post-call summary. Your agent listens live, speaks when you say so, and answers when addressed. You keep the strings.
There are dozens of AI meeting tools, and nearly all of them do the same thing: transcribe your call, summarize it afterward, maybe extract action items. Useful, but after the fact.
The gap is the during: an agent that's actually in the meeting, listening in real time, that you can puppet in real time. Make it answer the question that just got asked. Deliver the pricing slide verbally. Push back politely on the wrong assumption. Take the meeting while you go grab coffee.
Three operating modes: transcription, Puppet, and Auto-answer. You toggle between them mid-call.
The win is Puppet mode. Auto-answer is a bonus; transcription is table stakes. If your tool stops at "we transcribe your meetings", you're playing the same game as everyone else.
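One way to picture the mid-call toggle is a small state holder that decides whether the bot may speak. This is a minimal sketch, not the author's implementation: the class `MeetingBot`, the method names, and the mode identifier `transcribe` are assumptions made for illustration.

```python
from enum import Enum


class Mode(Enum):
    TRANSCRIBE = "transcribe"    # listen only, never speak (name is an assumption)
    PUPPET = "puppet"            # speak only what the operator sends
    AUTO_ANSWER = "auto_answer"  # also reply when the bot is addressed


class MeetingBot:
    """Hypothetical holder for the current operating mode."""

    def __init__(self) -> None:
        self.mode = Mode.TRANSCRIBE  # start silent; speaking is opt-in

    def set_mode(self, name: str) -> Mode:
        """Switch modes mid-call from an operator command such as 'puppet'."""
        self.mode = Mode(name.strip().lower())  # raises ValueError on unknown names
        return self.mode

    def may_speak(self, operator_initiated: bool) -> bool:
        """Return True if an utterance is allowed in the current mode."""
        if self.mode is Mode.TRANSCRIBE:
            return False
        if self.mode is Mode.PUPPET:
            return operator_initiated
        return True  # AUTO_ANSWER: operator lines and addressed replies
```

Starting in the silent mode is a deliberate choice in this sketch: a bot that speaks by accident costs more than one that stays quiet until told otherwise.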
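Every meeting bot is the same five boxes plus you: audio capture, speech recognition (ASR), a brain (LLM), speech synthesis (TTS), and audio injection. Most existing tools fuse them into one vertical stack so you can't swap pieces, and that's the trap.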
Trade-offs happen per box. Below: real options for each.
| Path | How | Cost | Notes |
|---|---|---|---|
| Local sink | PipeWire (Linux), BlackHole (macOS), VB-Cable (Windows) | $0 | Loopback the meeting app's output |
| Tab capture | Chromium getDisplayMedia({audio:true}) via CDP | $0 | Works headless, no OS audio plumbing |
| Platform API | Zoom RTMS, Meet Media API | varies | Cleanest but platform-locked + permission-gated |
| Path | Tool | Cost | Latency |
|---|---|---|---|
| Local Whisper | whisper.cpp, faster-whisper, WhisperX | $0 + GPU | 200–800ms |
| Local Q3-ASR | Qwen3-ASR-Flash (multilingual, fast) | $0 + GPU | 200–500ms |
| Cloud stream | Deepgram Nova-3, AssemblyAI, Speechmatics | $0.004–0.02/min | 100–300ms |
| Cloud batch | OpenAI Whisper API, Google STT | $0.006/min | 1–3s |
For Puppet mode, ASR latency doesn't really matter — you're typing. For Auto-answer, budget <500ms end-to-end or the agent talks over people.
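To make the 500 ms figure concrete, it can help to add up the stages and see which one eats the budget. The sketch below is illustrative only: the stage names and the LLM and TTS numbers are assumptions, and only the ASR figure is taken from the cloud-stream range in the table.

```python
from typing import Mapping

AUTO_ANSWER_BUDGET_MS = 500.0  # end-to-end target quoted in the text


def total_latency_ms(stages: Mapping[str, float]) -> float:
    """Sum the per-stage latencies of one Auto-answer round trip."""
    return float(sum(stages.values()))


def over_budget(stages: Mapping[str, float],
                budget_ms: float = AUTO_ANSWER_BUDGET_MS) -> bool:
    """True if the pipeline as a whole misses the budget."""
    return total_latency_ms(stages) > budget_ms


def slowest_stage(stages: Mapping[str, float]) -> str:
    """Name of the stage to optimize first."""
    return max(stages, key=lambda name: stages[name])


# Assumed example numbers, not measurements.
pipeline = {"asr": 200.0, "llm_first_token": 250.0, "tts_first_audio": 150.0}

if __name__ == "__main__":
    print(total_latency_ms(pipeline), over_budget(pipeline), slowest_stage(pipeline))
```

With these assumed numbers the total is 600 ms, so the pipeline misses the target even though every individual stage looks fast on its own.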
| Path | Tool | Cost | Notes |
|---|---|---|---|
| Local LLM | Ollama / vLLM / llama.cpp + Qwen3, Llama 3.3, Mistral, DeepSeek | $0 + GPU | Tool-calling capable models only |
| API | Claude, GPT-4o, Gemini | $0.50–15/M tokens | Best quality, fastest iteration |
| Hybrid | Local for routine, API for hard | varies | Route by question complexity |
The brain needs persistent context — the rolling transcript IS the system prompt. Most failures come from not feeding it the last 2–5 minutes before each generation. Don't be clever; just paste the transcript.
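A rolling window like that can be kept with a deque and pasted into the prompt before every generation. This is a minimal sketch under stated assumptions: `RollingTranscript`, `Utterance`, the timestamp format, and the prompt wording are all hypothetical, and the 300-second default simply sits at the top of the 2–5 minute range.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Utterance:
    t: float       # seconds since the call started
    speaker: str
    text: str


class RollingTranscript:
    """Keeps only the most recent window of the meeting."""

    def __init__(self, window_s: float = 300.0) -> None:
        self.window_s = window_s
        self._items = deque()  # Utterance objects, oldest first

    def add(self, t: float, speaker: str, text: str) -> None:
        self._items.append(Utterance(t, speaker, text))
        # Drop everything older than the window, measured from the newest line.
        while self._items and self._items[0].t < t - self.window_s:
            self._items.popleft()

    def as_system_prompt(self, persona: str) -> str:
        """Paste the window verbatim under the persona text."""
        lines = [
            f"[{int(u.t) // 60:02d}:{int(u.t) % 60:02d}] {u.speaker}: {u.text}"
            for u in self._items
        ]
        return persona + "\n\nRecent transcript:\n" + "\n".join(lines)
```

Rebuilding the prompt from scratch on each call keeps the logic trivial, which matches the advice above: no summarization step, no retrieval, just the recent transcript.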
| Path | Tool | Cost | Notes |
|---|---|---|---|
| Local neural | Piper, Kokoro, XTTS-v2, F5-TTS, Qwen3-TTS | $0 + GPU/CPU | Kokoro + Piper run on CPU |
| Cloud premium | ElevenLabs, Cartesia Sonic, PlayHT | $0.10–0.30/1K chars | Best naturalness, voice clones |
| Cloud commodity | OpenAI TTS, Google TTS, Azure | $0.015–0.03/1K chars | Good enough, cheap |
Voice cloning matters more than you think. In your voice, people forget it's a bot within 30 seconds. Generic stock voice — every utterance breaks the spell.
| Path | How | Notes |
|---|---|---|
| Virtual mic | PipeWire module-null-sink as the meeting's microphone | Linux default |
| Aggregate device | macOS aggregate of real mic + BlackHole | macOS default |
| Browser inject | MediaStreamAudioSourceNode in Chromium via CDP | No OS plumbing |
| PSTN dial-in | Telnyx, Twilio, Vonage — bot dials the phone bridge | Universal but $0.005–0.01/min |
Two postures the bot can take. Same five boxes underneath — different identity model.
Separate participant: the bot has its own seat, name tag, and video tile (avatar or static image). It joins via its own browser profile or account.
✓ Clearly identified as the AI · ✓ Can stay after you leave · ✗ Needs its own account · ✗ Some hosts auto-eject unknowns
Co-pilot: the bot's audio is mixed into your mic feed. From the meeting's view, it's all coming from you.
✓ No extra participant · ✓ Works when bots are banned · ✗ Needs a voice clone of you · ✗ Harder to leave alone
How do you drive it in real time without breaking eye contact and staring at a terminal? This is the part most tutorials skip.
| Channel | Setup | Best for |
|---|---|---|
| Terminal | `bot say "the price is forty-nine dollars"` in a tmux pane | Solo / dev workflow |
| Hotkey + voice | Push-to-talk hotkey → ASR → bot speaks the transcript | Hands-on-keyboard, eyes-on-meeting |
| Phone DM | Type into Telegram/Slack DM → bot speaks it | Phone-as-puppet, invisible to camera |
| Mention trigger | "Buddy, what's our SLA?" → RAG-backed reply | Auto-answer mode |
Most people land on terminal + phone in practice. The phone is the killer channel because it's invisible to the camera and you can type on it while keeping eye contact.
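The mention trigger from the table can be prototyped with a regular expression over each finalized ASR line. This is a rough sketch, assuming the bot is named "Buddy" as in the example row; the function name `extract_mention` and the optional "hey" prefix are assumptions, and a real system would also need to cope with the ASR mishearing the name.

```python
import re
from typing import Optional


def extract_mention(utterance: str, name: str = "Buddy") -> Optional[str]:
    """Return the request that follows the bot's name, or None if not addressed."""
    pattern = re.compile(
        rf"\b(?:hey\s+)?{re.escape(name)}\b[\s,:!?.-]*(.*)",
        re.IGNORECASE | re.DOTALL,
    )
    match = pattern.search(utterance)
    if match is None:
        return None
    request = match.group(1).strip()
    return request or None  # a bare "Thanks, Buddy." carries no request
```

Note the limits: any sentence containing the word "buddy" triggers this, so a common word makes a poor bot name, and the returned text still has to go through whatever retrieval and generation step sits behind Auto-answer mode.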
Pick the path that matches what you already have. All three produce the same outcome.
Resist the urge to start with the brain. Audio plumbing is where attempts die.
… .wav into the virtual mic and confirming the other participant hears it. Do not move on until this is rock solid.
… say(prompt) command.
Demo video (coming soon, recorded with Buddy on Google Meet): 60–90 seconds in which Buddy joins a meeting, gets puppeted by Brian, switches to auto-answer, and exits on command.
If the bot's TTS reaches your real mic, the bot will transcribe itself and respond to itself. Mute your real mic in puppet mode, or route the bot's audio to a sink the mic physically can't pick up.
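The fixes above are about audio routing. As a software backstop (a different technique, not a replacement for them), the bot can also drop ASR segments that closely match something it said a few seconds ago. The sketch below is an assumption-heavy illustration: `EchoGuard`, the 15-second lifetime, and the 0.8 similarity threshold are hypothetical and would need tuning.

```python
import time
from collections import deque
from difflib import SequenceMatcher
from typing import Optional


class EchoGuard:
    """Flags transcribed text that looks like the bot's own recent speech."""

    def __init__(self, ttl_s: float = 15.0, threshold: float = 0.8) -> None:
        self.ttl_s = ttl_s
        self.threshold = threshold
        self._spoken = deque()  # (timestamp, normalized text), oldest first

    @staticmethod
    def _norm(text: str) -> str:
        # Lowercase, turn punctuation into spaces, collapse whitespace.
        cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
        return " ".join(cleaned.split())

    def note_spoken(self, text: str, now: Optional[float] = None) -> None:
        """Call this whenever the bot sends text to TTS."""
        now = time.monotonic() if now is None else now
        self._spoken.append((now, self._norm(text)))

    def is_echo(self, heard: str, now: Optional[float] = None) -> bool:
        """True if an ASR segment resembles something the bot just said."""
        now = time.monotonic() if now is None else now
        while self._spoken and now - self._spoken[0][0] > self.ttl_s:
            self._spoken.popleft()  # forget utterances past their lifetime
        h = self._norm(heard)
        return any(
            SequenceMatcher(None, h, said).ratio() >= self.threshold
            for _, said in self._spoken
        )
```

This only catches echoes that the ASR transcribes more or less faithfully, which is why it belongs behind the routing fix rather than in place of it.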
Google Meet has an adaptive-audio feature that suppresses one mic when it detects two participants on the same Workspace domain in the same room. Use a different domain for the bot's account than yours. (Cost us a day.)
pactl unload-module doesn't always free a virtual node. If you created it with object.linger=true, you also have to run pw-cli destroy <global-id>. Otherwise you accumulate duplicate sinks across runs.
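A quick way to notice the pile-up is to parse the tab-separated output of `pactl list short sinks` and report any sink name that appears more than once. This is a minimal sketch: the function names and the sink name `bot_mic` in the usage note are hypothetical, and whether the index printed by pactl is the id that pw-cli expects is an assumption worth checking against pw-cli's own object listing before destroying anything.

```python
import subprocess
from collections import defaultdict
from typing import Dict, List


def duplicate_sinks(pactl_output: str) -> Dict[str, List[int]]:
    """Map sink name -> indexes, for names listed more than once."""
    by_name = defaultdict(list)
    for line in pactl_output.splitlines():
        fields = line.split("\t")  # index, name, driver, sample spec, state
        if len(fields) < 2 or not fields[0].isdigit():
            continue  # skip blank or unexpected lines
        by_name[fields[1]].append(int(fields[0]))
    return {name: ids for name, ids in by_name.items() if len(ids) > 1}


def current_sinks() -> str:
    """Run pactl and return its raw output (requires pactl on PATH)."""
    result = subprocess.run(
        ["pactl", "list", "short", "sinks"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

Calling `duplicate_sinks(current_sinks())` at bot startup and refusing to continue when it returns, say, two `bot_mic` entries turns a silent routing bug into an explicit error.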
If you're building a mobile puppet controller: URLSessionWebSocketTask silently refuses ws://100.x.x.x Tailscale addresses with no error. Use Starscream or NWConnection.
Meeting platforms increasingly detect headless Chromium and shadow-ban the bot. Run a real browser (windowed, even if you never look at it) for any external meeting.
If you augment your own mic with a generic TTS voice, every utterance is jarring. Clone your voice (XTTS, F5, ElevenLabs) before going live in co-pilot mode.
Everything above is free. If you want help shipping it — voice cloning at scale, hosted brain, meeting-platform update channel, the boring parts — that's what we do.