Database as Debugging Loop#
A day spent fixing bugs in a personal AI voice assistant where the telemetry database became the shared context with the agent. The pattern that fell out — "telemetry in a queryable shape" — turned out to be more portable than the specific stack: the same loop works with local logs, a local database, or an observability stack, as long as the agent has a programmatic connector.
I spent today doing something I hadn't done before: I played a game, the assistant answered a few questions wrong, and Claude debugged itself by reading the telemetry database directly. No log paste. No "can you send me the relevant lines?"
A few numbers from the day:
- 29 PRs merged (#283–#311)
- About 20 of those were bug fixes by my count; the rest were telemetry adds and refactors that fed the next loop
- Most fixes landed correctly the first try; a minority needed a follow-up PR to refine
- 18 voice sessions logged, 178 turns of voice telemetry collected
- The hardest bug — "why does the assistant keep saying it's not sure?" — took 5 PRs of misdiagnosis before a diagnostic-log PR (#294) finally surfaced the real cause and the next PR fixed it
- The fastest bug — today's "want me to look that up?" failure — took one SQL query and one PR, ~30 minutes from symptom to merged fix
The pattern that fell out of the day is worth writing down because it's the actual point — not the specific bugs.
The loop#
- I play the game. A voice companion: I speak, it answers through my headset, that's the loop.
- Every voice turn writes to a database. A row per turn — transcript, response, classifier intent, route attribution, grounding fact count, web_search attached y/n, STT path + per-stage timing, TTFA, and so on. Consent-gated; the writer is fire-and-forget so it never blocks the voice path. Lives in Supabase.
- I notice something off. "It told me Citizen Sleeper has only one class — there are at least two." Or: "It said it would look something up but didn't actually search until my third attempt."
- I either paste the session_id (copied from the logs) to Claude, or I tell Claude to grab the latest session from the database. That's it.
- Claude pulls the rows. Through the Supabase MCP, it runs SQL against the database — and reads the answer back like any other tool result:
  SELECT turn_ix, route, web_search_attached, transcript, response_text FROM voice_turns WHERE session_id = '...' ORDER BY turn_ix;
- Claude identifies the actual bug from the telemetry. Often within a single query.
- Claude ships a PR + applies whatever schema migration the next iteration needs. I review, then merge after CI passes, and we're set up for the next loop.
That's the whole thing. It compounds because each loop can add telemetry that makes the next loop sharper. By the end of today, we had added extra columns for routing, routing reason, grounding count, whether web search was attached, and per-stage speech-to-text timings. Tomorrow's debugging will use them.
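For concreteness, here is roughly the shape a per-turn row ended up with. This is an illustrative sketch, not the real schema: the table and the columns quoted elsewhere in this post (voice_turns, turn_ix, route, web_search_attached, transcript, response_text) are real; the rest of the names and types are my approximations.

```sql
-- Illustrative sketch of the per-turn telemetry table. Column names beyond the
-- ones quoted in the post are approximations, not the actual schema.
CREATE TABLE voice_turns (
  id                  bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
  session_id          uuid NOT NULL,          -- groups turns into a voice session
  turn_ix             int  NOT NULL,          -- position of the turn within the session
  transcript          text,                   -- what the STT heard
  response_text       text,                   -- what the assistant said back
  intent              text,                   -- classifier intent
  route               text,                   -- which pipeline answered (attribution)
  route_reason        text,                   -- why the router picked that route
  grounding_count     int,                    -- how many local facts were attached
  web_search_attached boolean,                -- was the web_search tool in the tool list?
  stt_path            text,                   -- which speech-to-text path handled the turn
  stt_ms              int,                    -- per-stage timing: STT
  ttfa_ms             int,                    -- TTFA, assumed to be in milliseconds
  created_at          timestamptz DEFAULT now()
);
```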
A worked example#
Around 12:13pm I asked the assistant "What does it mean when I fail a Drives?" in Citizen Sleeper. The assistant said "want me to look that up for you?" I said "yes please." Nothing happened. I asked again. Still nothing. Eventually it just guessed.
Here's what Claude saw when I pasted the session_id:
| Turn | Transcript | web_search_attached | Response |
|---|---|---|---|
| 12 | What does it mean when I fail a Drives? | true | "…want me to look that up for you?" |
| 13 | Yes. Yes, please. | false | "Let me look that up for you!" (no search) |
| 14 | (silence) | — | — |
| 15 | Did you look it up? | false | "Sorry, my search didn't go through…" |
Two compounding bugs, both visible in one query:
- Turn 12: the model had web_search in its tools and asked permission instead of using it. The system prompt didn't forbid the permission-ask pattern, so the polite default won.
- Turn 13: my "Yes please" was three words of pure filler. The minimal-substantive gate that decides whether to attach web_search filtered it out — so on the very turn I authorized the lookup, the tool was no longer attached.
Without the web_search_attached column we couldn't have separated those two failures. With it, the diagnosis was thirty seconds of reading.
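That kind of cross-cutting question is also easy to make standing. A sketch of a query that finds every promise-then-confirm pair where the tool dropped off, using the column names from the query above; the promise-text pattern is a simplification:

```sql
-- Sketch: confirm turns where the previous turn promised a lookup but
-- web_search was no longer attached. The ILIKE pattern is illustrative.
SELECT session_id, turn_ix, transcript, response_text
FROM (
  SELECT *,
         LAG(response_text) OVER w AS prev_response
  FROM voice_turns
  WINDOW w AS (PARTITION BY session_id ORDER BY turn_ix)
) t
WHERE prev_response ILIKE '%look that up%'
  AND NOT web_search_attached;
```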
The fix was equally direct. Two changes in PR #311:
- A new rule in the voice system prompt: "CALL TOOLS DIRECTLY — DO NOT ASK PERMISSION." The model was running with the SDK's "let the model decide whether to use a tool" mode (Anthropic calls this tool_choice="auto", OpenAI "auto", Gemini AUTO), which is what allowed the permission-ask pattern in the first place. A stricter alternative would have been the "force tool use" mode (Anthropic "any", OpenAI "required", Gemini ANY); I picked the prompt rule because it preserves model judgment for queries that genuinely don't need a tool. Calling the failure mode out by name in the prompt also makes future prompt trims less likely to re-introduce it.
- A "sticky" web_search across promise→confirm turns: if the previous assistant message contained a "let me look that up" pattern, the tool stays attached on the next turn regardless of substantive-gate filtering.
Both shipped. The next time the promise→confirm pattern shows up, the new stickyFromPromise column will disambiguate which fix is doing the work: if it's true, fix 2 (sticky attach) is what kept web_search on the confirm turn; if it's false and web_search is still attached, fix 1 (the prompt rule) is doing the work because the model stopped asking permission in the first place.
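Checking that is one query once the data arrives. A sketch, assuming the column is stored under that name (quoted because it's camelCase); a real version would also restrict to confirm-after-promise turns, for example with the LAG pattern from the earlier sketch:

```sql
-- Sketch: of the turns that kept web_search attached, how many were kept by the
-- sticky-attach path vs. attached normally (the model never asked permission)?
SELECT "stickyFromPromise", count(*) AS turns_with_tool_attached
FROM voice_turns
WHERE web_search_attached
GROUP BY 1;
```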
Three feedback modes after the fix#
The seven-step loop above is the diagnosis half. The other half is verification — and the dynamic between human and agent in that half is where the pattern really shows up. After every shipped PR, the next play session is a test, and my response to it falls into one of three modes.
Confirmed. I play, the bug is gone, I say "fixed" and move on. The clean case. Examples from today: PR #288's phonetic stop-words ("should" stopped becoming "Sealed" in the next session). PR #287's empty-graph fallthrough (the polite-decline disappeared on the retry). These happen often when the bug is well-isolated and the fix touches the actual cause.
Not fully fixed. I play, something is better but the original symptom — or a near-cousin — is still there. I paste the new session_id, Claude pulls the rows, and the data shows what's left to fix. This is the iteration mode. Today's examples:
- #287 added a fallthrough for empty grounding. Next session: still dodging with "I'm not sure" — those replies were passing the success check. Needed #289 to fall through on insufficient facts too.
- #302 wired fact-graph grounding into the legacy pipeline. Next session: 0 hits — the search required every transcript word in the same fact (AND) instead of any (OR). #303 fixed it.
- #304 added a new web_search attachment path. Next session: a runtime error from a scope bug in the new code. #306 fixed it.
These are the most informative loops. Each "not fully fixed" feedback taught the next iteration something the first one missed. The cost is one extra round-trip. The benefit is that the second fix tends to land cleanly because the surface area has shrunk.
Fixed, and now I see the next thing. This is the most interesting outcome and the one I didn't expect. The fix lands correctly, the original symptom is gone — and a downstream symptom that was hidden by the upstream bug becomes visible. Today's chain:
- #298 fixed the bug where descriptions weren't reaching the answerer — the pipeline component that turns retrieved facts into a response. Next session: a new failure became visible — the answerer was finding entities whose descriptions were too thin (one source, one sentence), and the model hallucinated to fill in. The grounding wasn't the problem anymore — the thin source data was.
- That drove #304 (attach web_search when local grounding is thin) — a feature whose need had been hidden by the upstream bug.
This third mode is what made the loop generative rather than just maintenance. Bugs weren't a backlog discovered randomly over time; they were a stack where the top item changed each time I fixed it. The telemetry is what made each new top item visible immediately rather than weeks later. The value of the loop wasn't just that the next bug became easier to see — it was that the same loop surfaced it the same way every time. Visibility plus repeatability is what made this a reliable process.
The dynamic this creates between me and Claude is something I hadn't experienced with a coding partner before: I describe the symptom in plain language; Claude pulls the actual telemetry, names the failure, ships the fix (this isn't a commercial app, so it's ok if the fix is wrong), and asks me to verify in the next session. I don't translate the symptom into terms a debugger would understand. Claude doesn't ask me to send logs. There's no translation step between describing a bug and diagnosing it — both of us read directly from the same database, and the diagnosis happens in plain language on top of structured data.
Misdiagnosis is the cost of bad telemetry#
The most expensive bug today was the assistant dodging "What's a Stabilizer?" with "I'm not sure" — five wrong fixes before the right one shipped:
- #287: empty-graph fallthrough — thought the graph was empty. It wasn't.
- #289: insufficient-facts fallthrough — thought the model was dodging because it had too few facts. The dodge was a symptom, not a cause.
- #292: strengthened the response-generation prompt — thought the model needed clearer instructions to combine facts. Made the prompt better, didn't fix the bug.
- #293: bumped the response-generation model from a smaller to a larger one — thought it was a model-quality floor. Wrong; reverted in #298.
- #294: added a diagnostic log surfacing the model's actual dodge text + the facts it received.
- #298: root cause found. A wrong field path in the prompt builder meant descriptions never reached the model. It dodged because it had nothing to work with.
Each "fix" before #294 was diagnosing a symptom because the model's actual input was opaque. The moment one PR (#294) made the input visible, the next PR (#298) found the bug from a five-line look at the output.
By contrast, every bug after #294 — once we had the "what's actually in the prompt" signal — landed in 1–2 PRs. Same shape every time, once the data is there: the loop's hardest job is the first time it sees a class of bug — building the column that will make the next instance of that bug obvious.
This is the same shape the writing-verification post describes for editorial discipline: a mechanical check turns "active discipline that fails because the writer forgets" into "habit triggered by a visible surface pattern." Here the surface pattern is a column in a database row instead of a trigger word in prose, but the loop is identical.
Why this works#
A few properties of the loop I didn't appreciate until I was inside it.
The telemetry has to live in a queryable place. Files-on-disk logs require copy-paste. JSON-line logs require grep/jq. The moment the data is in a queryable store, the cost of asking a precise question drops to writing one query. SQL with the right schema is good at providing a shape for the question Claude actually wants to ask — "what's different about the turns where the bad behavior happened?"
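Here is a sketch of what that question can look like as SQL. The grouping column (route) is real; the "bad behavior" predicate is illustrative, and you swap in whatever symptom you just saw:

```sql
-- Sketch: compare turns where the dodge showed up against turns where it didn't.
SELECT (response_text ILIKE '%not sure%') AS dodged,
       route,
       count(*)                        AS turns,
       avg(grounding_count)            AS avg_facts,
       avg(web_search_attached::int)   AS share_with_web_search
FROM voice_turns
GROUP BY 1, 2
ORDER BY 1, 2;
```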
Attribution columns matter more than total rows. I'd been logging voice turns since week one. The early schema was sparse — transcript, response, latency — so the interesting cross-cutting questions ("which pipeline answered?" / "did grounding fire?" / "which STT path?") needed columns that didn't exist yet. Every cross-cutting question started with a schema migration, then the question. Each loop added one or two columns. The slow accumulation through the feedback loop made each iteration a form of continuous improvement — every column added sharpened the diagnosis the next time around.
Schema migrations are part of the loop. Claude applies the migration to my dev database via the Supabase MCP, captures the SQL in a migrations folder, updates the canonical schema doc, and ships the code that writes the new columns — all in one PR. So the database, the code, and the docs stay aligned automatically.
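A loop-sized migration is usually one or two columns. A sketch of what the #311 column might have looked like as a migration file; the exact name, type, and constraints are assumptions:

```sql
-- Sketch of a loop-sized migration: add the attribution column, let new rows
-- start carrying it, and document what it means. Name and constraints assumed.
ALTER TABLE voice_turns
  ADD COLUMN IF NOT EXISTS "stickyFromPromise" boolean NOT NULL DEFAULT false;

COMMENT ON COLUMN voice_turns."stickyFromPromise" IS
  'web_search stayed attached on this turn because the previous turn promised a lookup';
```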
The model's own bugs are also debuggable this way. Half of today's bugs were "the model hallucinated wrong info" / "the model asked permission instead of using a tool." Those used to feel un-debuggable — what's the bug, the model is wrong sometimes? But once the prompt + grounding context + tool inventory are all collected in the telemetry database tables, "model behavior" becomes "given inputs X, the model produced Y, was Y a reasonable output of X?" That's a regular bug in my app. This connects to the two kinds of things the AI models were getting wrong: both Bucket One (knows-but-pads) and Bucket Two (doesn't-know) failures leave fingerprints in the prompt + grounding context, so the telemetry reduces the vague "model bug" to a specific, fine-grained "input/output bug."
The shape, not the storage#
A clarification worth making explicit: the loop is "telemetry in a queryable shape," not "must be SQL." The database is one instantiation; the property that makes the loop work is that the agent can ask precise questions about specific events without you copy-pasting anything in.
For my own setup, Supabase + the Supabase MCP is what makes the debugging loop between me and the agent fast enough to be practical. The agent writes SQL, runs it against the database, and reads the rows back as a regular tool result. SQL is also genuinely well-suited to the kinds of questions debugging tends to ask — "what's different about the failing rows compared to the passing ones?", "which column changed across these two turns?", "give me percentiles for this metric over yesterday's session." That's data extraction, comparison, and aggregation — SQL's home turf, and it composes cleanly across multi-line, nested-event logs in a way that grep and jq don't.
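The aggregation case is the clearest win. Percentiles over yesterday's sessions is one percentile_cont call; the timing and timestamp columns here are the assumed names from the schema sketch above:

```sql
-- Sketch: latency percentiles per STT path over the last day.
SELECT stt_path,
       percentile_cont(0.5)  WITHIN GROUP (ORDER BY ttfa_ms) AS p50_ttfa_ms,
       percentile_cont(0.95) WITHIN GROUP (ORDER BY ttfa_ms) AS p95_ttfa_ms,
       count(*) AS turns
FROM voice_turns
WHERE created_at > now() - interval '1 day'
GROUP BY stt_path;
```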
But the shape generalizes — any queryable telemetry surface the agent has a programmatic way to talk to fits the loop:
- Local JSONL logs + an MCP that can read them. Same loop. Slower per query because the agent has to do parsing work a database does for free, but functionally the same.
- A local database (SQLite, embedded Postgres) with a connector. Same loop, no network hop, often the right call for small or offline projects.
- A telemetry endpoint (custom HTTP, gRPC) the agent can call through an MCP. Same loop — the data source can be anything that returns structured rows when asked.
- An observability stack with an MCP connector (Datadog, Honeycomb, Grafana Loki). Same loop. These often ship with polished connectors; whether the loop runs as smoothly depends on whether the query language is rich enough for cross-cutting questions like "what's different about the failing rows?"
- Files on disk + a Read/Grep tool. Still works at the simplest end. The friction is per-question, and the agent has to do parsing work a database does for free, so the loop runs slower and the agent's context fills up faster — but for a small project or a single session, it's enough.
The load-bearing requirement is that the agent has a programmatic connector to whatever holds the data. Without one, every debugging round-trip needs you to copy-paste the relevant rows in by hand; with one, the agent reads directly. SQL is one query language that fits the kinds of questions debugging asks (data extraction, filtering, comparison, aggregation); other query languages (KQL, LogQL, Datadog's query language, an HTTP endpoint that returns rows) slot into the same loop as long as an MCP connector exists for them. What you lose moving away from a database is mostly composition: the ability for the agent to ask one question, get a result, and ask a follow-up that joins or filters on what it just saw. SQL supports that natively; simpler stacks need a script per follow-up, which means the agent does more synthesis in its own context window and the loop's per-question cost goes up.
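That composition is concrete: the follow-up can reuse the first question as a CTE instead of starting over. A sketch, reusing the illustrative predicate from earlier:

```sql
-- Sketch: question one finds the sessions where the dodge appeared; the follow-up
-- pulls the full context of those sessions without re-deriving the first result.
WITH failing AS (
  SELECT DISTINCT session_id
  FROM voice_turns
  WHERE response_text ILIKE '%not sure%'
)
SELECT v.session_id, v.turn_ix, v.route, v.grounding_count,
       v.web_search_attached, v.response_text
FROM voice_turns v
JOIN failing USING (session_id)
ORDER BY v.session_id, v.turn_ix;
```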
For me, the database is what made the cycle time short enough to fit in a single focused session — notice the bug, diagnose it, ship the fix, all without context-switching away. Three of today's PRs went from "I noticed a thing → Claude pulled the data → fix shipped" in under thirty minutes. The same flow with paste-the-log would have been an evening. The 30-minute version lets you stay on one bug from start to finish; the evening version sprawls across breaks, and you stop doing it.
The pattern this echoes — and the one I keep noticing — is the same one the runtime-vs-harvest post names for attribution: the shape of your pipeline is the shape of your downstream behavior. Here, the shape of your telemetry is the shape of your debugging. When the data is queryable from where the agent already lives, debugging is conversational. When it's not, debugging is translation — symptom in your head, log in a file, snippet pasted into chat, agent guessing at causation. Every handoff in that chain — searching the log, extracting the relevant lines, pasting them into chat, waiting for the agent to infer cause from a snippet — adds cycle time.
What this doesn't do#
Worth being honest about the caveats of this approach.
It doesn't make your first guess at the cause correct. When I described the symptom today, my initial hypothesis ("the main process is starving STT") came up twice and was wrong both times. But the data caught me before I shipped a fix based on either wrong guess. The loop adds a data-backed validation step to whatever symptom-correlation you start with; it doesn't replace correlation-thinking, but it does mean a wrong guess turns into a corrected guess instead of a wrong fix.
It doesn't replace running the app. The telemetry tells you what happened on a turn. It doesn't tell you what the user heard. "This response sounds robotic" / "the wave glyph isn't visible" / "the answer was technically correct but missed the player's actual question" — only a human playing the game catches those.
It doesn't make every bug small. Some of today's bugs were one-line fixes; others were architectural. The loop just makes diagnosis faster — implementation is still implementation.
It depends on consent. The telemetry is gated on user consent — this worked for me because I gave myself consent to log my voice session data in the database. Without consent, no rows.
See also Writing isn't verification for the same "build the system that catches the failure mode" pattern applied to writing, the two kinds of things an AI gets wrong for what's in those input/output rows when the model misbehaves, and the runtime-vs-harvest post for a different domain where pipeline shape determines downstream behavior.