SCRAPPYLABS / FIELD GUIDE

Put your agent in the meeting.
Stay the puppet master.

A field guide for building a meeting co-pilot that actually participates — not a delayed transcript, not a post-call summary. Your agent listens live, speaks when you say so, and answers when addressed. You keep the strings.

2026-05-26 · v1.0 · By ScrappyLabs · Stack-agnostic · 3 reference paths

01 The gap nobody fills

There are dozens of AI meeting tools. All of them do the same thing — transcribe your call, summarize it after, maybe extract action items. Useful, but after the fact.

Otter · Fireflies · Granola · Read.ai · Fathom · tl;dv · Zoom AI Companion · MS Copilot · Gemini Notes

The gap is the during: an agent that's actually in the meeting, listening in real time, that you can puppet while the conversation is still happening. Make it answer the question that just got asked. Deliver the pricing slide verbally. Push back politely on the wrong assumption. Take the meeting while you go grab coffee.

The pieces have existed for two years.
Nobody's shipped a clean recipe. So here's one.

02 What "puppet master mode" actually means

Three operating modes. You toggle between them mid-call:

Mode A

Listen

Agent silently transcribes and contextualizes. You're talking. It's reading the room.

controlled by: you (silent)

Mode B — the win

Puppet

You type or whisper a prompt. The agent speaks it in your voice (or its own). Every utterance, your call.

controlled by: you (every word)

Mode C

Auto-answer

Agent responds when @-mentioned, addressed by name, or a wake phrase fires. Inside the scope you set.

controlled by: agent (in your guardrails)

The win is Puppet mode. Auto-answer is a bonus; transcription is table stakes. If your tool stops at "we transcribe your meetings", you're playing the same game as everyone else.
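The three modes boil down to one question: is the agent allowed to speak right now? Here is a minimal sketch of that gate. The names `Mode` and `may_speak` are ours for illustration, and letting operator prompts still work in Auto-answer is an assumption, not something the modes above require.

```python
from enum import Enum


class Mode(Enum):
    LISTEN = "listen"      # Mode A: transcribe only
    PUPPET = "puppet"      # Mode B: speak only what the operator sends
    AUTO_ANSWER = "auto"   # Mode C: may also answer when addressed


def may_speak(mode: Mode, operator_prompt: bool, addressed: bool) -> bool:
    """Return True if the agent is allowed to speak for this event.

    operator_prompt: the operator just typed or whispered a prompt.
    addressed: the agent was @-mentioned, named, or a wake phrase fired.
    """
    if mode is Mode.LISTEN:
        return False
    if mode is Mode.PUPPET:
        return operator_prompt
    # AUTO_ANSWER (assumption): operator prompts keep working here too.
    return operator_prompt or addressed
```

Keeping this check in one place makes the mid-call toggle a single variable change instead of logic scattered across the pipeline.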

03 The five-box architecture

Every meeting bot is these five boxes plus you. Most existing tools collapse them into one vertically integrated product, so you can't swap pieces. That's the trap.

Audio In (capture the meeting audio) → ASR (speech → text, streaming) → Brain (LLM + rolling transcript context) → TTS (text → voice) → Audio Out (inject back into the meeting)

CONTROL CHANNEL: you steering, in real time

Trade-offs happen per box. Below: real options for each.
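To keep the boxes swappable, it helps to wire them as plain callables behind one small object. The sketch below assumes that shape; `MeetingBot` and its field names are hypothetical, and every box is a stand-in you would replace with a real capture, ASR, LLM, TTS, or output backend.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MeetingBot:
    asr: Callable[[bytes], str]            # Box 2: audio chunk -> text
    brain: Callable[[str, str], str]       # Box 3: (transcript, prompt) -> reply
    tts: Callable[[str], bytes]            # Box 4: text -> audio
    audio_out: Callable[[bytes], None]     # Box 5: inject audio into the meeting
    transcript: str = ""

    def hear(self, audio_chunk: bytes) -> None:
        """Box 1 hands captured audio here; the transcript grows as ASR returns text."""
        text = self.asr(audio_chunk)
        if text:
            self.transcript += text + "\n"

    def say(self, prompt: str) -> str:
        """Control channel entry point: generate a reply and speak it."""
        reply = self.brain(self.transcript, prompt)
        self.audio_out(self.tts(reply))
        return reply


if __name__ == "__main__":
    # Stub boxes so the wiring can be exercised without any audio stack.
    played = []
    bot = MeetingBot(
        asr=lambda chunk: chunk.decode(),
        brain=lambda transcript, prompt: prompt,
        tts=lambda text: text.encode(),
        audio_out=played.append,
    )
    bot.hear(b"what is the price?")
    bot.say("The price is forty-nine dollars.")
    print(played)
```

Swapping a cloud ASR for a local one then means changing one constructor argument, which is the point of not collapsing the boxes.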

04 Pick a path per box

Box 1 — Audio in (capture)

Path	How	Cost	Notes
Local sink	PipeWire (Linux), BlackHole (macOS), VB-Cable (Windows)	$0	Loopback the meeting app's output
Tab capture	Chromium `getDisplayMedia({audio:true})` via CDP	$0	Works headless, no OS audio plumbing
Platform API	Zoom RTMS, Meet Media API	varies	Cleanest but platform-locked + permission-gated

Box 2 — ASR (speech to text)

Path	Tool	Cost	Latency
Local Whisper	`whisper.cpp`, `faster-whisper`, WhisperX	$0 + GPU	200–800ms
Local Q3-ASR	Qwen3-ASR-Flash (multilingual, fast)	$0 + GPU	200–500ms
Cloud stream	Deepgram Nova-3, AssemblyAI, Speechmatics	$0.004–0.02/min	100–300ms
Cloud batch	OpenAI Whisper API, Google STT	$0.006/min	1–3s

For Puppet mode, ASR latency doesn't really matter — you're typing. For Auto-answer, budget <500ms end-to-end or the agent talks over people.
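One way to sanity-check that budget is to add up the stages before you build. This sketch reads "end-to-end" as the sum of every stage between the end of speech and the first audio out, which is our assumption; the per-stage numbers in the usage are illustrative placeholders, not measurements.

```python
AUTO_ANSWER_BUDGET_MS = 500  # the Auto-answer budget stated above


def within_budget(stage_ms: dict[str, int],
                  budget_ms: int = AUTO_ANSWER_BUDGET_MS) -> tuple[int, bool]:
    """Sum per-stage latencies and report whether the total is under budget."""
    total = sum(stage_ms.values())
    return total, total < budget_ms


if __name__ == "__main__":
    # Placeholder numbers: measure your own stack before trusting any of them.
    stages = {"asr": 150, "brain_first_token": 250, "tts_first_audio": 120}
    print(within_budget(stages))
```

If the total lands over budget, the stage with the biggest number is the box to swap first.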

Box 3 — Brain (the LLM)

Path	Tool	Cost	Notes
Local LLM	Ollama / vLLM / llama.cpp + Qwen3, Llama 3.3, Mistral, DeepSeek	$0 + GPU	Tool-calling capable models only
API	Claude, GPT-4o, Gemini	$0.50–15/M tokens	Best quality, fastest iteration
Hybrid	Local for routine, API for hard	varies	Route by question complexity

The brain needs persistent context — the rolling transcript IS the system prompt. Most failures come from not feeding it the last 2–5 minutes before each generation. Don't be clever; just paste the transcript.
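"Just paste the transcript" can be sketched as a time-windowed buffer that renders itself into the system prompt. `RollingTranscript`, its method names, and the prompt wording are all hypothetical; the 300-second default is the upper end of the 2–5 minute window above.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RollingTranscript:
    window_s: float = 300.0                      # keep the last 5 minutes
    _lines: deque = field(default_factory=deque)  # (timestamp, speaker, text)

    def add(self, t: float, speaker: str, text: str) -> None:
        """Append one ASR segment stamped with its time in seconds."""
        self._lines.append((t, speaker, text))
        self._trim(t)

    def _trim(self, now: float) -> None:
        # Drop segments older than the window, oldest first.
        while self._lines and now - self._lines[0][0] > self.window_s:
            self._lines.popleft()

    def as_system_prompt(self, now: float) -> str:
        """Render the surviving window as the system prompt for the next generation."""
        self._trim(now)
        body = "\n".join(f"{speaker}: {text}" for _, speaker, text in self._lines)
        return "You are in a live meeting. Transcript of the last few minutes:\n" + body
```

Call `as_system_prompt` right before every generation so the brain never answers from a stale view of the room.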

Box 4 — TTS (text to voice)

Path	Tool	Cost	Notes
Local neural	Piper, Kokoro, XTTS-v2, F5-TTS, Qwen3-TTS	$0 + GPU/CPU	Kokoro + Piper run on CPU
Cloud premium	ElevenLabs, Cartesia Sonic, PlayHT	$0.10–0.30/1K chars	Best naturalness, voice clones
Cloud commodity	OpenAI TTS, Google TTS, Azure	$0.015–0.03/1K chars	Good enough, cheap

Voice cloning matters more than you think. In your voice, people forget it's a bot within 30 seconds. Generic stock voice — every utterance breaks the spell.

Box 5 — Audio out (inject)

Path	How	Notes
Virtual mic	PipeWire `module-null-sink` as the meeting's microphone	Linux default
Aggregate device	macOS aggregate of real mic + BlackHole	macOS default
Browser inject	`MediaStreamAudioSourceNode` in Chromium via CDP	No OS plumbing
PSTN dial-in	Telnyx, Twilio, Vonage — bot dials the phone bridge	Universal but $0.005–0.01/min

05 Two ways to be in the meeting

Two postures the bot can take. Same five boxes underneath — different identity model.

Posture 1 — Separate participant

The bot has its own seat, name tag, video tile (avatar or static image). Joins via its own browser profile or account.

✓ Clearly identified as the AI · ✓ Can stay after you leave · ✗ Needs its own account · ✗ Some hosts auto-eject unknowns

Best for: external — sales, discovery, anything where consent + transparency matter.

Posture 2 — Co-pilot (your mic)

The bot's audio is mixed into your mic feed. From the meeting's view, it's all coming from you.

✓ No extra participant · ✓ Works when bots are banned · ✗ Needs a voice clone of you · ✗ Harder to leave alone

Best for: internal — backup brain on standups, technical reviews, anywhere a third name on screen is weird.

06 The control channel (the puppet strings)

How do you drive it in real-time without breaking eye contact and staring at a terminal? This is the part most tutorials skip.

Channel	Setup	Best for
Terminal	`bot say "the price is forty-nine dollars"` in a tmux pane	Solo / dev workflow
Hotkey + voice	Push-to-talk hotkey → ASR → bot speaks the transcript	Hands-on-keyboard, eyes-on-meeting
Phone DM	Type into Telegram/Slack DM → bot speaks it	Phone-as-puppet, invisible to camera
Mention trigger	"Buddy, what's our SLA?" → RAG-backed reply	Auto-answer mode

Most people land on terminal + phone in practice. The phone is the killer channel: it's invisible to the camera, and you can use it while holding eye contact.
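Whichever channel you pick, it helps if the terminal and the phone DM feed the same tiny command parser. A minimal sketch, assuming a line-based protocol: `say` comes from the terminal example above, while the `mode` verb and the `parse_command` name are hypothetical.

```python
KNOWN_VERBS = ("say", "mode")  # "mode" is a hypothetical verb for toggling modes


def parse_command(line: str) -> tuple[str, str]:
    """Parse 'bot say "text"' or a bare 'say text' DM into (verb, argument)."""
    words = line.strip().split(maxsplit=1)
    # Terminal input carries the leading program name; DMs usually won't.
    if words and words[0] == "bot":
        words = words[1].split(maxsplit=1) if len(words) > 1 else []
    if not words:
        raise ValueError("empty command")
    verb = words[0]
    arg = words[1].strip() if len(words) > 1 else ""
    # Strip one pair of matching quotes wrapped around the whole argument.
    if len(arg) >= 2 and arg[0] == arg[-1] and arg[0] in "\"'":
        arg = arg[1:-1]
    if verb not in KNOWN_VERBS:
        raise ValueError(f"unknown command: {verb}")
    return verb, arg
```

Plain string splitting is used instead of shell-style parsing so that an apostrophe typed on a phone ("what's our SLA?") doesn't count as an unclosed quote.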

07 Three reference stacks

Pick the path that matches what you already have. All three produce the same outcome.

PATH A

Full Local

$0/min · needs hardware

Audio: PipeWire null-sink
ASR: Whisper.cpp / Q3-ASR
Brain: Qwen3 / Llama 3.3
TTS: Piper / XTTS / Q3-TTS
Out: PipeWire null-source
Control: tmux + phone over LAN/Tailscale

Hardware floor: one 24GB GPU runs the full stack (Whisper-large + Qwen3-30B Q4 + Kokoro/Piper).

PATH B — RECOMMENDED

Hybrid

~$0.05–0.15/min · no GPU needed

Audio: BlackHole (Mac) / PipeWire
ASR: Deepgram Nova-3 stream
Brain: Claude Sonnet 4.6 API
TTS: Cartesia Sonic-2 / ElevenLabs
Out: Aggregate device → meeting
Control: tmux + phone

Best quality-per-effort ratio for anyone without a GPU. This is what we'd build first if we were starting over.

PATH C

All Cloud

~$0.10–0.20/min · rented VM

Audio: Headless Chromium tab capture
ASR: AssemblyAI Universal-Streaming
Brain: GPT-4o
TTS: OpenAI TTS
Out: Chromium MediaStream
Control: Web puppet panel (you build it)

Runs on a rented VM, zero local install. Useful when you're building this as a hosted product for others.

08 Build in this order (the only one that works)

Resist the urge to start with the brain. Audio plumbing is where attempts die.

1. Audio loopback proven first. Virtual mic the meeting app sees + a virtual sink that captures meeting audio. Test by playing a `.wav` into the virtual mic and confirming the other participant hears it. Do not move on until this is rock solid.
2. TTS into the meeting. Pipe TTS output to the virtual mic. You now have a "type → others hear it" loop. This alone is a useful tool.
3. ASR off the meeting audio. Capture the sink, feed to ASR, see live transcript in your terminal. Now you can read what's happening.
4. Brain + rolling transcript context. Wrap an LLM, give it the last 2–5 minutes of transcript as context, expose a `say(prompt)` command.
5. Puppet mode. Add one control channel (terminal, phone, or Slack; pick one). Make it work well before adding more.
6. Auto-answer (optional, last). Wake-word or @-mention detection → auto-trigger. Add this last: it's the most likely to embarrass you.
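For the first step on Linux, the two virtual devices can be created with `pactl` through PipeWire's PulseAudio compatibility layer. The sketch below only builds the commands; the device names are hypothetical, and the `media.class=Audio/Source/Virtual` argument is the form we recall from PipeWire's virtual-devices documentation, so verify it against your PipeWire version before relying on it.

```python
import subprocess


def capture_sink_cmd(name: str) -> list[str]:
    """Null sink that the meeting app plays into, so its monitor can be captured."""
    return ["pactl", "load-module", "module-null-sink",
            f"sink_name={name}",
            f"sink_properties=device.description={name}"]


def virtual_mic_cmd(name: str) -> list[str]:
    """Virtual source that the meeting app can select as its microphone (assumed form)."""
    return ["pactl", "load-module", "module-null-sink",
            "media.class=Audio/Source/Virtual",
            f"sink_name={name}",
            "channel_map=front-left,front-right"]


def load(cmd: list[str]) -> str:
    """Run a load-module command and return the module index that pactl prints."""
    out = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return out.stdout.strip()


if __name__ == "__main__":
    # Hypothetical device names; keep the indices so you can unload the modules later.
    for cmd in (capture_sink_cmd("meet_capture"), virtual_mic_cmd("bot_mic")):
        print(" ".join(cmd), "->", load(cmd))
```

Building the argument lists separately from running them keeps the plumbing testable on a machine with no audio stack at all.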

09 What it looks like

A 60–90 second demo: Buddy joins a meeting, gets puppeted by Brian, switches to auto-answer, and exits on command.

DEMO VIDEO — COMING SOON

Recorded with Buddy on Google Meet

10 Pitfalls we hit so you don't have to

Echo loops

If the bot's TTS reaches your real mic, the bot will transcribe itself and respond to itself. Mute your real mic in puppet mode, or route the bot's audio to a sink the mic physically can't pick up.
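Routing is the real fix, but a software guard is a cheap second line of defence: ignore ASR segments that arrive while the bot is speaking. This is a sketch of that half-duplex gate, not a technique the setup above depends on. `SpeakingGate` and the 0.3-second tail are assumptions, and the gate also drops humans who talk over the bot.

```python
class SpeakingGate:
    """Reject ASR segments that overlap the bot's own speech."""

    def __init__(self, tail_s: float = 0.3):
        self.tail_s = tail_s          # extra quiet time after playback ends
        self._quiet_after = 0.0       # segments before this time are dropped

    def bot_will_speak(self, now: float, duration_s: float) -> None:
        """Call just before playing TTS audio of the given duration."""
        self._quiet_after = max(self._quiet_after, now + duration_s + self.tail_s)

    def accept(self, segment_time: float) -> bool:
        """True if a segment stamped at segment_time should reach the brain."""
        return segment_time >= self._quiet_after
```

Timestamps are passed in rather than read from a clock, so the same gate works on live audio and on recorded test sessions.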

Same-domain Meet "adaptive audio"

Google Meet has an adaptive-audio feature that suppresses one mic when it detects two participants on the same Workspace domain in the same room. Use a different domain for the bot's account than yours. (Cost us a day.)

PipeWire linger nodes

`pactl unload-module` doesn't always free a virtual node. If you created it with `object.linger=true`, you also have to run `pw-cli destroy <global-id>`; otherwise you accumulate duplicate sinks across runs.
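Finding the leftovers can be scripted from the JSON that `pw-dump` prints. A minimal sketch, assuming your virtual nodes share a name prefix and that each node object carries `id`, `type`, and `info.props["node.name"]` as in the `pw-dump` output we have seen; `lingering_node_ids` is a hypothetical helper.

```python
import json


def lingering_node_ids(pw_dump_json: str, name_prefix: str) -> list[int]:
    """Return global ids of PipeWire nodes whose node.name starts with name_prefix."""
    ids = []
    for obj in json.loads(pw_dump_json):
        # pw-dump lists every object type; only nodes are of interest here.
        if obj.get("type") != "PipeWire:Interface:Node":
            continue
        props = (obj.get("info") or {}).get("props") or {}
        if str(props.get("node.name", "")).startswith(name_prefix):
            ids.append(obj["id"])
    return ids
```

Feed it the output of `pw-dump`, then pass each returned id to `pw-cli destroy <global-id>`. Printing the ids and reviewing them before destroying anything is the safer first run.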

iOS WebSocket to Tailscale CGNAT

If you're building a mobile puppet controller: `URLSessionWebSocketTask` silently refuses `ws://100.x.x.x` Tailscale addresses with no error. Use Starscream or `NWConnection`.

Headless browsers reveal themselves

Meeting platforms increasingly detect headless Chromium and shadow-ban the bot. Run a real browser (windowed, even if you never look at it) for any external meeting.

TTS voice mismatch in Posture 2

If you augment your own mic with a generic TTS voice, every utterance is jarring. Clone your voice (XTTS, F5, ElevenLabs) before going live in co-pilot mode.

11 What's NOT in this guide

Persona / prompt engineering — separate problem. This guide is plumbing for getting the agent's voice into the room.
Compliance / consent — recording and AI participation have jurisdiction-specific rules. Two-party consent states, GDPR, EU AI Act. Don't ship to customers without legal review.
Avatar / video — audio only. Adding a tile is doable (D-ID, HeyGen, or a static image + name tag) but a different scope.

Build it yourself, or ship faster.

Everything above is free. If you want help shipping it — voice cloning at scale, hosted brain, meeting-platform update channel, the boring parts — that's what we do.

Talk to us · scrappylabs.ai · github