
Data Appendix: Runs, Cohorts, and Methodology#

The research posts (hallucination taxonomy, grounding/schema alignment, runtime vs. harvest) cite several cohorts. This appendix publishes the run-by-run data behind them for reference. Throughout, "wiki harvest" (or just "harvest") refers to synthetic wiki content generated by Claude models.


Experimental structure#

The data was collected in two arms, back-to-back, on the same schema and the same generator/judge stack:

  • Arm 1 — ungrounded (no ground truth). Runs 24592252015, 24592412851, 24594869706, 24596922850, 24597954232 (the first five runs, 2026-04-18 00:19 → 05:40 UTC). The generator produced each section from training knowledge alone. Judge graded. No external reference data fed to the prompt. The 31-harvest snapshot cited in the posts lives inside this arm.
  • Arm 2 — grounded (ground truth injected). Runs 24599177402, 24599497754, 24637304751, 24637678177 (the four later runs, 2026-04-18 06:51 → 2026-04-19 20:02 UTC). The generator was given a reference list of real proper nouns from CC0 / first-party-API sources (explicitly selected to avoid sources whose terms prohibit research use). Judge graded the same way.

This lets us compare two things cleanly: (a) what the model does without any external anchor, and (b) whether adding a schema-aligned anchor moves the needle.


Evidence base#

Each harvest is a structured mini-wiki for one game, with five sections — bosses (15 entries), zones (12), items (20), quests (8), mechanics (10) — so every harvest targets roughly 65 slots. That's what makes the failure modes legible: when the model has nothing, it still fills most slots, and the judge catches the pattern.
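For readers who want the slot budget as data rather than prose, here is a minimal sketch of the per-harvest schema; the dictionary keys are illustrative names of mine, not the pipeline's actual field identifiers.

```python
# Slot targets per section, as described above. Key names are illustrative;
# the real schema's identifiers aren't published.
SECTION_SLOTS = {
    "bosses": 15,
    "zones": 12,
    "items": 20,
    "quests": 8,
    "mechanics": 10,
}

TOTAL_SLOTS = sum(SECTION_SLOTS.values())  # ~65 slots per harvest
```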

| Scope | Harvests | Approx. generated entries |
| --- | --- | --- |
| 31-harvest snapshot (published cohort, 30 distinct games) | 31 | ~2,000 |
| Full games corpus (all runs, 2026-04-18 → 04-19) | 213 | ~13,800 |

Two scopes are referenced throughout the posts, and they are not the same thing:

  • The 31-harvest snapshot is the cohort every blog post cites by name. It is the two specific ungrounded runs (24592252015 + 24592412851, both on 2026-04-18) — 31 harvests across 30 distinct games, with Elden Ring deliberately included twice (smoke + main) as an incidental intra-game consistency check.
  • The full corpus is every harvest this research produced in the games domain: all 9 runs from 2026-04-18 → 04-19, spanning both the ungrounded Arm 1 (5 runs) and the grounded Arm 2 (4 runs). 213 harvests total, which is roughly 7× the snapshot.

Every headline number in the posts comes from the snapshot, so the posts stay anchored to a single, well-defined cohort. The full corpus is brought in only as replication evidence — i.e., "if you re-run the same calculation against ~7× as many harvests, the pattern still holds." For example, section pass rates: mechanics at 94% in the snapshot, 97% across the full corpus; bosses at 13% in the snapshot, 16% across the full corpus.

The games-synthetic wiki harvest material in this appendix is scoped to that research domain. A separate 5-harvest programming-languages proof-of-concept (Python, JavaScript, Rust, Go, Elixir) exists on the same schema to test whether the findings generalize outside games; those results are not reported here and are available on request. Voice-assistant latency data — a separate dataset referenced by the general posts — is documented in its own section at the end of this appendix.


Key terms#

A quick crosswalk for readers landing here out-of-sequence from one of the blog posts. Each term is used consistently throughout the appendix and the research posts.

  • Game — one title from the sample (e.g., Elden Ring). The games domain in this research has 30 distinct games.
  • Harvest — one end-to-end generation-and-grading pass for one game. A single game can be harvested multiple times (across different runs, prompts, or grounding configurations). The games domain has 213 harvests total.
  • Run — one CI execution that processes a batch of games (typically 24–30 games per run). Identified by a GitHub Actions run ID (e.g., 24592412851); these are provenance anchors from a private repository, not accessible URLs (resolving them against github.com/... will return 404). There are 9 runs total across the games domain.
  • Snapshot — the specific 31-harvest cohort every blog post cites: runs 24592252015 + 24592412851, ungrounded only, frozen 2026-04-18.
  • Full corpus — every harvest in the games domain (all 9 runs, both arms). Referenced only as replication evidence at ~7× the snapshot.
  • Arm 1 / Arm 2 — experimental split. Arm 1 is ungrounded (the generator answers from training alone). Arm 2 is grounded (the generator is given a reference list of real proper nouns from CC0 / first-party-API sources).
  • Section — one of the five subfields of each harvest: mechanics, zones, bosses, items, quests. Each harvest fills all five; each section is independently graded.
  • Generator — the model that produces the harvest content. In this research, always claude-sonnet-4-6.
  • Judge — the model that grades the harvest. In the games research reported here, claude-opus-4-7 graded all 213 games-domain harvests on rubric v1 (each section graded on plausibility of entries, internal consistency, and obvious-fabrication flags; section verdict = pass / fail; harvest verdict = pass iff all sections pass after retries). (The 5-harvest programming-languages proof-of-concept mentioned earlier used claude-sonnet-4-6 as a self-judge variant; those harvests are outside this appendix.)
  • judge_verdict — per-harvest pass/fail from the judge. The looser of the two pass metrics; the one recommended for cross-arm comparison in this research because it's stable across threshold changes.
  • promotion_ready — stricter metric: requires judge_verdict='pass', a clean verbatim_risk flag, a confidence score above the promotion threshold in effect for that run, and any additional gating criteria the production pipeline was applying at the time. Because the gate drifted across runs, promotion_ready is a useful operational metric (it reflects what the pipeline actually promoted to downstream use) but an unreliable comparative metric across arms.
  • Promotion threshold — the confidence cutoff in promotion_ready. Two values appear in this research: the original gate (0.80), used across Arm 1; and the lowered gate (0.75, inside the empty 0.60–0.74 band), adopted for Arm 2 after the bimodal distribution was observed. (At the original 0.80 gate, harvests at 0.78 cleared the verdict but missed promotion; at 0.75 the entire 0.78–0.82 cluster promotes.) The later Arm 2 "mixed" runs used additional gating criteria beyond the threshold. For discussion of why 0.75 is the right threshold, see the hallucination-taxonomy post.
  • Verbatim risk — a separate check flagging near-verbatim reproduction of training-memorized wiki text. All 31 snapshot harvests scored low; all Arm 1 harvests reported here scored low, which is why the Arm 1 verdict-pass / promotion gap is driven entirely by the threshold, not by verbatim risk.
  • Stale check — a promotion-side guard that flags harvests whose grounding or generated content is out of date relative to the game's current state. Missing from the pipeline during Arm 1 and the earlier Arm 2 A/B runs; added for the later Arm 2 "mixed" runs (24637304751, 24637678177) after its absence was discovered. This is why those two runs show Promoted lower than Verdict-pass even at high confidence — some verdict-pass harvests failed the stale check. It's a real promotion criterion, not a bug, but it wasn't applied uniformly across the runs in this appendix, so promotion_ready on the mixed runs is not comparable to the earlier Arm 2 runs.
  • Grounding — in this project specifically, injecting a list of real proper nouns from a CC0 or first-party-API source (Wikidata, Steam achievements) into the generator prompt. Not full-text wiki content.
  • Retry — identical policy across every run, arm, and dataset reported here: each section independently gets up to n=2 retry attempts if it fails judge review (the generator is re-invoked with the judge's failure signal, then re-graded). A single harvest can therefore consume up to 5 × 2 = 10 retry attempts in the worst case, or zero if every section passes on the first judge call. sections_retried records which sections needed a retry. No run used a different retry cap, so no result in this appendix is a product of asymmetric retry budgets. (The promotion gate and retry budget are sketched in code just after this list.)
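To make the two metrics concrete, here is a minimal sketch of the promotion gate as defined above. The field names (judge_verdict, verbatim_risk, confidence) mirror the terms in this list but are hypothetical; this is a paraphrase of the definitions, not the pipeline's actual code. Whether the confidence comparison is strict or inclusive is not load-bearing in this data, since no harvest scored exactly 0.75 or 0.80.

```python
from dataclasses import dataclass

MAX_RETRIES_PER_SECTION = 2  # n=2, identical across every run and arm
SECTIONS = ("mechanics", "zones", "bosses", "items", "quests")

@dataclass
class HarvestResult:
    judge_verdict: str      # "pass" iff all five sections pass after retries
    verbatim_risk: str      # "low" is the clean value
    confidence: float       # judge confidence score for the harvest
    extra_gates_pass: bool  # stale check etc., where the run applied them

def promotion_ready(h: HarvestResult, threshold: float = 0.75) -> bool:
    """Stricter metric: verdict pass, clean verbatim risk, confidence clearing
    the promotion threshold in effect for the run, plus any extra gates."""
    return (
        h.judge_verdict == "pass"
        and h.verbatim_risk == "low"
        and h.confidence >= threshold
        and h.extra_gates_pass
    )

# A 0.78 pass-cluster harvest misses the original 0.80 gate but clears
# the 0.75 gate adopted for Arm 2.
h = HarvestResult("pass", "low", 0.78, True)
assert not promotion_ready(h, threshold=0.80)
assert promotion_ready(h, threshold=0.75)
```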

Run-by-run results#

Arm 1 — ungrounded#

| Run | n | Grounding | Promoted, historical (promotion_ready, original 0.80 threshold) | Rate | Promoted, retroactive (Arm 2's 0.75 threshold) | Rate | Verdict-pass (judge_verdict) | Avg conf |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 24592252015 | 1 | none | 1 | 100% | 1 | 100% | 1 | 0.82 |
| 24592412851 | 30 | none | 6 | 20% | 8 | 27% | 8 | 0.552 |
| **24594869706** | 24 | none | 2 | 8% | 8 | 33% | 8 | 0.616 |
| 24596922850 | 22 | none | 5 | 23% | 5 | 23% | 5 | 0.528 |
| 24597954232 | 17 | none | 4 | 24% | 4 | 24% | 4 | 0.550 |

The bolded row 24594869706 is the baseline cited in the grounding-schema-alignment post as "2 promoted, 8%" — that figure reflects the historical 0.80 gate under which those runs were processed, a gate sitting just above the 0.78 edge of the pass cluster. Under the retroactive 0.75 threshold that Arm 2 used, the same data would have promoted 8 games (33%). Both are honest numbers; the cross-arm comparisons in the grounding post should be read as comparisons at different gates. See the promotion threshold history below for the full story.

The first two rows (24592252015 + 24592412851) together form the 31-harvest snapshot cited throughout the posts.

Arm 2 — grounded#

| Run | n | Grounding | Promoted (promotion_ready) | Rate | Verdict-pass (judge_verdict) | Avg conf |
| --- | --- | --- | --- | --- | --- | --- |
| **24599177402** | 30 | Wikidata | 15 | 50% | 15 | 0.652 |
| **24599497754** | 30 | Wikidata + Steam achievements | 16 | 53% | 16 | 0.657 |
| 24637304751 | 29 | mixed | 4 | 14% | 8 | 0.600 |
| 24637678177 | 30 | mixed | 3 | 10% | 7 | 0.594 |

The two bolded rows (24599177402 and 24599497754) are the Wikidata and Wikidata+Steam A/B cited in the grounding-schema-alignment post; they were gated at a 0.75 promotion threshold, which promoted every verdict-pass harvest. The later two runs (24637304751, 24637678177) reflect a schema/prompt change between sessions and the introduction of a stale check — a promotion-side guard that was missing from earlier runs and got added once its absence was discovered. The stale check excluded some 0.80–0.82 verdict-pass harvests even though they cleared the confidence threshold. That's why Promoted on those two runs is lower than Verdict-pass despite the 0.75 gate. They're reported for completeness, not as direct comparisons to the earlier grounded runs.

Promotion threshold history#

The reason Verdict-pass and Promoted disagree changed over the course of the research:

  • Arm 1 runs were processed under the original promotion gate (0.80). Every pass-but-not-promoted harvest in Arm 1 scored exactly 0.78 and was excluded by that gate. verbatim_risk='low' was true for all Arm 1 harvests, so verbatim concerns weren't the driver.
  • Threshold change between arms. After observing the bimodal confidence distribution documented in the hallucination-taxonomy post, I moved the threshold to 0.75, which sits in the empty 0.60–0.74 band and promotes the entire 0.78–0.82 cluster. Arm 2's Wikidata and Wikidata+Steam A/B runs used the 0.75 threshold.
  • For cross-arm comparison, Verdict-pass is the apples-to-apples metric: no Arm 1 pass harvest scored below 0.78, so under the 0.75 threshold applied uniformly, Arm 1 Promoted equals Verdict-pass (captured in the retroactive column above). The grounding-schema-alignment post's "8% → 50%" framing compounds two effects — grounding and threshold change — so the cleaner, threshold-neutral grounding comparison is 33% ungrounded → 50% Wikidata → 53% Wikidata+Steam, all at the 0.75 gate. That shift is discussed in the grounding post itself.
  • Later Arm 2 "mixed" runs (24637304751, 24637678177) added a stale check — a promotion-side guard that was missing from earlier runs and was added once its absence was noticed. The stale check excluded some verdict-pass harvests even at 0.80–0.82 confidence. Those runs are therefore not comparable in promotion_ready to the earlier Arm 2 runs, and aren't used for any load-bearing claim in the posts.

The 31-harvest snapshot (ungrounded arm, receipts)#

The hallucination-taxonomy post cites a "31-harvest snapshot frozen on 2026-04-18" (runs 24592252015 + 24592412851, both ungrounded). This is the receipts table.

The snapshot is 31 harvests across 30 distinct games — Elden Ring was harvested twice (once as a single-game smoke run in 24592252015, then again in the 30-game main run 24592412851). Both harvests are kept in the tables below for full transparency; the two Elden Ring harvests are independent judge calls on independently generated content, so they function as an incidental intra-game consistency check rather than a duplicate row.

Section-level pass rates (n=31 harvests, ungrounded)#

| Section | Pass | Fail | Pass rate |
| --- | --- | --- | --- |
| mechanics | 29 | 2 | 94% |
| quests | 19 | 12 | 61% |
| zones | 16 | 15 | 52% |
| items | 12 | 19 | 39% |
| bosses | 4 | 27 | 13% |

The same pattern held across the full games corpus (213 harvests, 1,065 section-level verdicts in wiki_section_results): mechanics 97% pass, bosses 16% pass. The 31-harvest snapshot tracks the full corpus on this metric.

Per-harvest verdicts (n=31 harvests, 30 distinct games)#

Generator = claude-sonnet-4-6. Judge = claude-opus-4-7, rubric v1. All rows scored verbatim_risk=low (no regurgitation of training-memorized wiki text).

| Game | Judge verdict | Confidence | Sections retried | Promoted (historical 0.80 gate) |
| --- | --- | --- | --- | --- |
| Elden Ring † | pass | 0.82 | none | yes |
| DARK SOULS III | pass | 0.82 | none | yes |
| God of War | pass | 0.82 | none | yes |
| Baldur's Gate 3 | pass | 0.82 | bosses, items | yes |
| Elden Ring | pass | 0.82 | zones, bosses, quests | yes |
| METAL GEAR SOLID V: The Phantom Pain | pass | 0.82 | bosses, items | yes |
| The Witcher 3: Wild Hunt | pass | 0.82 | zones, bosses, items, quests | yes |
| Hogwarts Legacy | pass | 0.78 | zones, bosses, items, quests | no |
| Sekiro: Shadows Die Twice | pass | 0.78 | zones, bosses, quests | no |
| Divinity: Original Sin 2 | fail | 0.62 | zones, bosses, items, quests | no |
| Cyberpunk 2077 | fail | 0.55 | bosses, items | no |
| Dead Space | fail | 0.55 | zones, bosses, items, quests | no |
| DOOM Eternal | fail | 0.55 | zones, bosses, items, quests | no |
| Dragon Quest XI | fail | 0.55 | zones, bosses, items, quests | no |
| Final Fantasy VII Rebirth | fail | 0.55 | zones, bosses, items, quests | no |
| Lies of P | fail | 0.55 | zones, bosses, items, quests | no |
| Pathfinder: Wrath of the Righteous | fail | 0.55 | mechanics, zones, bosses, items, quests | no |
| Persona 3 Reload | fail | 0.55 | zones, bosses, items, quests | no |
| Persona 5 Royal | fail | 0.55 | bosses, items, quests | no |
| Pillars of Eternity II: Deadfire | fail | 0.55 | mechanics, bosses, items, quests | no |
| Resident Evil 2 | fail | 0.55 | mechanics, zones, bosses, items, quests | no |
| Resident Evil Village | fail | 0.55 | zones, bosses, items | no |
| The Elder Scrolls IV: Oblivion Remastered | fail | 0.55 | zones, bosses, items, quests | no |
| Warhammer 40,000: Rogue Trader | fail | 0.55 | zones, bosses, items, quests | no |
| Control Ultimate Edition | fail | 0.45 | mechanics, zones, bosses, items, quests | no |
| Black Myth: Wukong | fail | 0.35 | mechanics, zones, bosses, items, quests | no |
| DOOM: The Dark Ages | fail | 0.35 | mechanics, zones, bosses, items, quests | no |
| Metaphor: ReFantazio | fail | 0.35 | mechanics, zones, bosses, items, quests | no |
| Returnal | fail | 0.35 | zones, bosses, items, quests | no |
| Resident Evil 4 Remake | fail | 0.30 | mechanics, zones, bosses, items, quests | no |
| Silent Hill 2 Remake | fail | 0.15 | mechanics, zones, bosses, items, quests | no |

† on the first Elden Ring row marks the pre-flight single-game harvest from run 24592252015. Before kicking off the 30-game main run, I ran Elden Ring alone end-to-end to confirm the generator + judge + retry pipeline was wired correctly. The main-run Elden Ring harvest (from run 24592412851) is the second row in the table. Both are included for full transparency; because they are independent generator + judge invocations, the two Elden Ring rows also function as an incidental intra-game consistency check.

Why three rows show none for sections retried. Every run in this appendix used the same retry policy: up to n=2 retry attempts per failing section, per game, applied identically across ungrounded, Wikidata-grounded, and Wikidata+Steam-grounded arms. The three games marked none in the Sections Retried column (Elden Ring †, DARK SOULS III, and God of War) cleared every section on the first judge pass, so the n=2 retry budget was never consumed. They received the same 0.82 judge score as the other top-cluster games, but they got there without a single re-grade.

What this lets you infer. A none in sections_retried is the strongest "well-known" signal available in the ungrounded arm: the harvest cleared all 5 sections on the first judge pass, consuming zero of its 10 available retry attempts. Elden Ring †, DARK SOULS III, and God of War therefore sit comfortably inside claude-sonnet-4-6's training distribution under the prompts and schema used here. Every other 0.82 in the table is a post-retry number — same final score, but achieved after one or more sections were re-graded. Read the three none rows as a harder pass than the other 0.82 rows, not as a softer or ambiguous one.

Resident Evil 4 Remake (2023) and Silent Hill 2 Remake (2024) are shown here under their resolved names — they were ingested under bare Steam AppIDs in this run because the name-resolution step hadn't been wired yet, and later runs confirmed the mapping. Both are recent remakes that sit at the bottom of the confidence floor, which is exactly the Category B "doesn't-know tail" the hallucination-taxonomy post describes: when the model genuinely has nothing, the judge catches it.


The confidence cliff#

Confidence clusters rather than distributes smoothly:

  • 0.78–0.82 — the pass cluster (9 games — 7 at 0.82, 2 at 0.78 with heavy retries)
  • 0.62 — top of the fail cluster (1 game)
  • 0.55 — modal fail, reached retry cap (14 games)
  • 0.15–0.45 — deep-fail cluster, all sections failed (7 games)

That's 31 harvests across 4 distinct clusters. The 0.60–0.74 band has exactly one harvest (the 0.62 above), and the gap between 0.62 and 0.78 is the cliff the threshold-tuning argument in the hallucination-taxonomy post turns on. Pass and fail don't overlap in this cohort.
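The cliff is easy to verify mechanically. A minimal sketch, with the 31 confidences read off the per-harvest table above (the >= comparison is an assumption that does not matter here, since no harvest sits exactly at 0.75 or 0.80):

```python
pass_conf = [0.82] * 7 + [0.78] * 2                                       # 9 verdict-pass harvests
fail_conf = [0.62] + [0.55] * 14 + [0.45] + [0.35] * 4 + [0.30] + [0.15]  # 22 fails

assert len(pass_conf) + len(fail_conf) == 31
assert max(fail_conf) == 0.62 and min(pass_conf) == 0.78  # no overlap: the cliff

# Fails never promote regardless of gate; the gate only splits the pass cluster.
for gate in (0.80, 0.75):
    promoted = sum(c >= gate for c in pass_conf)
    print(f"gate {gate}: {promoted} of 9 pass harvests promote")
# gate 0.8: 7 of 9 pass harvests promote
# gate 0.75: 9 of 9 pass harvests promote
```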


Grounding posture (Arm 2)#

The grounded runs used two sources, both verified against their license/access terms:

| Source | License / access posture | Endpoint | Verification |
| --- | --- | --- | --- |
| Wikidata | CC0 1.0 Universal — public domain dedication | query.wikidata.org/sparql | No copyright, no restrictions, no attribution required. The SPARQL endpoint is a public service Wikimedia provides for programmatic research access. |
| Steam achievements | First-party via Steam Web API Terms | ISteamUserStats/GetSchemaForGame/v2 | Valve's official Web API with API-key access; achievement schemas are first-party data Valve publishes for any application to consume. Explicit licensed access, not a fair-use claim. |

Wikidata's posture is the cleanest possible: CC0 is a public-domain dedication, so no fair-use analysis is needed — there is no copyright to infringe. Steam achievement data is accessed via Valve's own official Web API under Valve's own terms, which is "licensed API use" rather than fair use.

Sources that were not used — Fandom community wikis, IGDB, general web-scraped content — are discussed conceptually in the grounding-schema-alignment post as alternatives that were evaluated and rejected, but no data from those sources was persisted to the corpus. A distinct-sources query across all 213 harvests returns exactly two entries (Wikidata and Steam achievements), which is the authoritative answer on what data actually grounded the generator.

The grounding posture is deliberately narrow: extract factual proper nouns where the source explicitly permits it, do not mirror expressive prose, do not redistribute. The runtime-vs-harvest post discusses the broader legal posture this reflects.
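For concreteness, here is a minimal retrieval sketch against the two endpoints in the table above. The QID (Q55662649, DOOM Eternal) comes from the provenance example below; the Wikidata property (wdt:P674, "characters"), the Steam AppID (782330), and the response field names are illustrative assumptions, since the pipeline's actual queries aren't published.

```python
import os
import requests

# Wikidata: public CC0 SPARQL endpoint. wdt:P674 ("characters") is only one
# plausible property to pull proper nouns from; treat it as an example.
SPARQL = """
SELECT ?thingLabel WHERE {
  wd:Q55662649 wdt:P674 ?thing .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""
wd = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": SPARQL, "format": "json"},
    headers={"User-Agent": "grounding-sketch/0.1 (research)"},
    timeout=30,
)
proper_nouns = [b["thingLabel"]["value"] for b in wd.json()["results"]["bindings"]]

# Steam: first-party achievement schema via Valve's Web API (API key required).
# 782330 as DOOM Eternal's AppID is an assumption for the example.
steam = requests.get(
    "https://api.steampowered.com/ISteamUserStats/GetSchemaForGame/v2/",
    params={"key": os.environ["STEAM_API_KEY"], "appid": 782330},
    timeout=30,
)
achievements = [
    a["displayName"]
    for a in steam.json()["game"]["availableGameStats"]["achievements"]
]
```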

Per-harvest citation structure#

Every grounded harvest logs a provenance record with the exact source, license, query key, contribution by schema slot, and retrieval endpoint. Example (combined run, DOOM Eternal):

{
  "primary": "claude-synthetic",
  "grounding": [
    {
      "source": "wikidata",
      "license": "CC0",
      "query_qid": "Q55662649",
      "contribution": { "items": 2, "locations": 1, "characters": 1 },
      "retrieved_from": "https://query.wikidata.org/sparql"
    },
    {
      "source": "steam-achievements",
      "license": "first-party (Valve Web API)",
      "contribution": { "achievements": 50 },
      "retrieved_from": "ISteamUserStats/GetSchemaForGame/v2"
    }
  ]
}

The fields are:

  • source — short identifier for the data source (wikidata, steam-achievements, etc.).
  • license — license or access posture of the source (CC0, first-party API, etc.).
  • query_qid — for Wikidata, the Q-ID of the game entity queried; lets you reproduce the SPARQL query exactly.
  • contribution — number of entities that actually landed in each schema slot after filtering. Lets you see at a glance whether a source contributed meaningfully to a given harvest or not.
  • retrieved_from — the endpoint the data was pulled from, so the lookup is reproducible.

Per-game citations across all grounded runs are available on request (roughly 30 rows per grounded run; a summary table surfaces the counts, while the full list lives in the JSONB column). A minimal roll-up sketch follows.
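As a rough illustration of how such a summary can be rolled up from per-harvest records, here is a sketch over grounding lists shaped like the JSON example above; the function and variable names are mine, not the pipeline's.

```python
from collections import Counter, defaultdict

def summarize(grounding_lists):
    """Roll up per-harvest 'grounding' lists into harvest counts per source
    and entity totals per schema slot."""
    harvests_per_source = Counter()
    entities_per_slot = defaultdict(Counter)
    for grounding in grounding_lists:            # one list per harvest
        for entry in grounding:
            harvests_per_source[entry["source"]] += 1
            for slot, n in entry.get("contribution", {}).items():
                entities_per_slot[entry["source"]][slot] += n
    return harvests_per_source, entities_per_slot

# Using the DOOM Eternal example above as the only harvest:
doom = [
    {"source": "wikidata", "license": "CC0",
     "contribution": {"items": 2, "locations": 1, "characters": 1}},
    {"source": "steam-achievements", "license": "first-party (Valve Web API)",
     "contribution": {"achievements": 50}},
]
print(summarize([doom]))
```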


Voice-assistant latency#

The general posts reference the prototype as "answering through your headset in under ten seconds," and the runtime-vs-harvest post cites a "10-second voice-assistant round-trip budget" as a design target. This section documents the data behind those claims. The voice-assistant dataset is separate from the games-synthetic wiki harvest research that fills the rest of this appendix — different table, different pipeline, different instrumentation.

Source: voice_turns table in the project's Supabase instance, populated by telemetry from the voice-assistant build. n=571 turns with complete latency telemetry at snapshot time (out of 629 total turns across 176 sessions).

Latency definitions:

  • TTFA (time-to-first-audio): milliseconds from query commit to the first audio chunk returned to the user — i.e., when the user hears the assistant start answering.
  • E2E (end-to-end): milliseconds from query commit to the final audio chunk — i.e., when the assistant finishes speaking.

Which metric backs the "under ten seconds" claim: TTFA. I treat TTFA as the experience-relevant metric for this prototype: from a player's perspective, "the AI answered" lands at the moment audio starts, not at the moment audio stops. E2E is reported for completeness; it's a secondary property and isn't what the general-post phrasing is referring to.

| Metric | n | Median | p90 | Mean | % under 10 s |
| --- | --- | --- | --- | --- | --- |
| TTFA (load-bearing) | 571 | 2,348 ms | 3,952 ms | 2,598 ms | 100 % |
| E2E (secondary) | 571 | 1,148 ms | 10,441 ms | 4,328 ms | 89.1 % |

Reading the table: every captured turn surfaced first audio in under 10 seconds — that's the 100 % that backs the claim in the general post. End-to-end completion was under 10 s for 89 % of turns; the p90 E2E sits right at the 10-second line. E2E isn't the claim, but it's close enough to serve as a supporting data point rather than a contradicting one.
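For completeness, this is roughly how each table row is computed from raw per-turn latencies. The column names ttfa_ms / e2e_ms are assumptions about the voice_turns schema, and the exact percentile interpolation the real snapshot used may differ slightly.

```python
import statistics

def latency_summary(values_ms, budget_ms=10_000):
    """Median / p90 / mean / percent-under-budget, matching the table columns."""
    ordered = sorted(values_ms)
    p90 = ordered[round(0.9 * (len(ordered) - 1))]  # simple nearest-rank p90
    return {
        "n": len(ordered),
        "median_ms": statistics.median(ordered),
        "p90_ms": p90,
        "mean_ms": statistics.mean(ordered),
        "pct_under_budget": 100 * sum(v < budget_ms for v in ordered) / len(ordered),
    }

# Feed the 571 ttfa_ms values (or e2e_ms values) through latency_summary
# to reproduce the corresponding row of the table above.
```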

Snapshot freshness: figures captured from the voice_turns table as of the last update on 2026-04-26. The underlying table is append-only telemetry, so figures will drift as more turns accumulate. The shape (TTFA bounded well under 10 s; E2E near the budget line) is the stable claim; exact percentages update as n grows.


What's not published#

  • The synthetic wikis themselves. The wiki and judge_raw JSON columns (the model's generated content and the judge's full reasoning) are deliberately withheld. The failing ones contain confabulations that look authoritative in isolation — exactly the failure mode this research is about. I don't want them picked up and reused as reference material.
  • Prompts, orchestration code, and the private working repo.

Personal blog. Views and writing here are my own and do not necessarily reflect those of my employer or any organization I'm affiliated with. Side projects, written on personal time.