Two Kinds of Hallucination, a Discrete Confidence Cliff, and What It Costs#
Empirical data from 200+ harvests across 30 distinct games, run through an LLM-generated reference-content pipeline, splits the fabrications I observed cleanly into two buckets with different responses to intervention. Both buckets sit beneath a judge-side confidence distribution that clusters at a handful of discrete values. Published hallucination taxonomies (e.g., Ji et al. 2023) already distinguish intrinsic from extrinsic fabrication on a different axis; this post adds a knowledge-presence axis specific to schema-driven generation. Together the two buckets explain why prompt engineering moves some things and not others, and they determine the cost per usable output when scaling synthetic knowledge in this kind of pipeline. Scope caveats are in the Provenance section.
TL;DR#
- Hallucination in this research splits into two empirically distinct buckets. "Knows-but-pads" — the model has strong recall but invents to hit a count target — responds to prompting. "Doesn't-know" — the model has no real recall — doesn't. A single prompt rule moved 4 of 17 games from fail (0.45–0.55) to pass (0.78–0.82) without moving the second bucket at all. (The split is specific to this research; see Provenance for scope limits.)
- LLM-as-judge rubrics look continuous (0.0–1.0) but cluster at ~5 discrete values. Across 200+ harvests (30 distinct games, multiple prompt and retry configurations), scores pile up at five discrete values (0.35, 0.55, 0.62, 0.78, 0.82); the gap between 0.62 and 0.78 is empty — zero harvests across the full corpus. Tuning a promotion threshold through this cliff moves whole buckets of harvests at once.
- The economics fall out of the taxonomy. Cost per usable output is a function of promotion rate × retry rate × per-call cost. The two hallucination buckets have different economics because they need different interventions.
This post walks through the data behind each claim, with the checkpoints and queries that back it up.
The setup#
I've been running a research project that produces LLM-generated structured reference content for video games. Pipeline: generator (Claude Sonnet 4.6) fills a schema per game, judge (Claude Opus 4.7) grades it on a 0.0–1.0 rubric. (All 213 games-domain harvests in this research used the Sonnet-generates / Opus-judges pairing; a separate 5-harvest programming-languages proof-of-concept used a Sonnet self-judge variant for judge-reliability probing, but those harvests are outside this post.) The current promotion threshold is 0.75; games above get promoted, games below go to quarantine for retry or grounding. (Earlier in the project the threshold was 0.80; the change is discussed below in Finding 2.)
Schema has five sections: mechanics, zones, bosses, items, quests. Each game gets ~65 entries total across those sections. Games come from a representative sample of recent releases and classics — Elden Ring, Control, Returnal, Black Myth Wukong, Doom: The Dark Ages, Metaphor: ReFantazio, Pathfinder: Wrath of the Righteous, and so on, with up to 30 games per large-scale run.
All numbers below come from real runs. Run IDs are cited and SQL queries against the synthetic wiki_harvests table are included where they're load-bearing — so any specific number can be traced back to the run that produced it.
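For readers who want to trace the queries, here is a minimal sketch of the wiki_harvests columns they touch. Only game_key, run_id, judge_confidence, and judge_notes appear in the queries in this post; the types below are illustrative assumptions, not the project's actual DDL.

```sql
-- Minimal sketch of the columns the queries in this post rely on.
-- Types are illustrative assumptions, not the project's real schema.
CREATE TABLE IF NOT EXISTS wiki_harvests (
    run_id           text,     -- pipeline run identifier, e.g. '24596922850'
    game_key         text,     -- stable per-game key, e.g. 'control'
    judge_confidence numeric,  -- judge rubric score on the 0.0-1.0 scale
    judge_notes      jsonb     -- JSONB array of per-issue records from the judge
);
```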
After about 200 harvests across 30 games, a pattern became visible in the failing runs. The judge produces per-fabrication notes alongside its score (short reasons explaining each issue), and reading those notes across failing harvests, the fabrications didn't all look like the same kind of mistake. Some were small inventions inside games the model otherwise knew well. Others were the model confidently producing major characters and locations in games it apparently didn't know well. That observation is what the rest of this post unpacks.
Finding 1: Two hallucination buckets#
The initial observation#
Looking at the judge's quality.issues output on Run 24596922850, the fabrications split cleanly into two categories:
Category A: Padding around real knowledge. The schema asks for a fixed number of entries per section (e.g., 12 weapons, 5 zones). When the model genuinely knows N–1 of those, it fills the last slot with a plausible-adjacent invention rather than leaving it blank or returning a shorter list.
- Resident Evil Village: invented weapons "Magnum WCX," "F2 Sniper Rifle" (neither real) mixed in with the correctly-named arsenal.
- Oblivion Remastered: invented location names "Ganonah," "Faithful Queendom" among the correct Imperial Province names.
Category B: Plausible-sounding fabrication without real recall. The model has low-fidelity memory of the game and is surfacing names that sound right but aren't.
- Control: boss named "Salvador" (doesn't exist), quest named "Self-Reflection" (doesn't exist), "Essej is Helen Marshall" (conflates two distinct characters).
- Silent Hill 2 Remake: places the Pyramid Head final fight at "Toluca Lake lighthouse" (correct answer: Lakeview Hotel — a different location).
- Dragon Quest XI S: quests "The Door to Digitopia," "A Walk on the Wild Side" (neither exists).
The framing#
Category A is responsive to prompting: adding "omit when uncertain rather than invent" to the system prompt changes what the model emits, which means the relevant signal exists internally and can be addressed at the prompt layer. Category B isn't responsive: the fabrications are statistically adjacent to training data (names that look like Control bosses, Silent Hill locations, Dragon Quest quests), so the model's uncertainty signal doesn't surface as a "decline to answer" completion. The judge catches them only by reasoning against an external ground truth referent ("does this match what I know about Control?"), which the generator can't do for itself.
The test: one prompt rule#
A single change shipped to the generator prompt:
NEVER invent proper nouns. If you aren't certain a name exists, OMIT the entry rather than fabricate it.
The hypothesis: this should move Category A (padded content — the model can choose to omit rather than invent) but not Category B (the model's uncertainty signal doesn't surface for those completions, so an honesty rule has nothing to grab onto). Both runs in the comparison below are ungrounded — this test isolates the prompt rule. Grounding is a separate lever, tested in different runs documented in the grounding-schema-alignment post.
Run 24597954232 was the post-change run. The clean same-game overlap against the pre-change run:
| Bucket | Count | Examples |
|---|---|---|
| Fail → pass | 4 | Oblivion 0.55→0.82, Persona 3R 0.55→0.78, SH2 Remake 0.55→0.78, RE Village 0.45→0.78 |
| Confidence up, still fail | 5 | DOS2 0.55→0.62, Dead Space Remake 0.45→0.55, Rogue Trader 0.35→0.55, Pillars Deadfire 0.45→0.55, Pathfinder WotR 0.50→0.55 |
| Flat confidence, fewer judge issues | 5 | Returnal 14→10 issues, Metaphor 11→8, Lies of P 18→15, DQXI 14→12, DOOM TDA 10→9 |
| Flat confidence, same or more judge issues | 3 | Black Myth 13→13, FF7 Rebirth 12→12, Control 12→14 |
24% (4/17) flipped from fail to pass on a prompt-only change. Across the full overlap, 9 games saw confidence rise and 8 held flat — none regressed. Issue counts fell on 10 of 17 games and held flat on 3 more.
The same-game overlap query that produced the underlying rows:
SELECT
game_key,
MAX(CASE WHEN run_id = '24596922850' THEN judge_confidence END) AS pre_conf,
MAX(CASE WHEN run_id = '24597954232' THEN judge_confidence END) AS post_conf,
MAX(CASE WHEN run_id = '24596922850' THEN jsonb_array_length(judge_notes) END) AS pre_issues,
MAX(CASE WHEN run_id = '24597954232' THEN jsonb_array_length(judge_notes) END) AS post_issues
FROM wiki_harvests
WHERE run_id IN ('24596922850', '24597954232')
GROUP BY game_key
HAVING COUNT(DISTINCT run_id) = 2
ORDER BY pre_conf, post_conf;
(judge_notes is a JSONB array of issue records; jsonb_array_length over it gives the issue count.)
What moved vs. what didn't#
Look at the games that moved: Oblivion Remastered, Persona 3 Reload, SH2 Remake, RE Village. Before the prompt rule, fabrications on these games were Category A: the model had strong recall and was padding ("Ganonah" alongside real Imperial cities, made-up weapon models in an otherwise-accurate RE Village arsenal). After the rule, it omitted rather than padded, the judge gave it credit for honest blanks, and the overall score cleared.
Look at the games that didn't move: DOOM: The Dark Ages, Black Myth: Wukong, Metaphor: ReFantazio, Returnal, FF7 Rebirth. Before the prompt rule, fabrications on these were Category B: wholesale invented boss rosters, plausible-sounding location names, transplants from other games. After the rule, the judge noted marginally fewer fabrications (Category A padding decreased), but the Category B failures dominate and the score stayed parked at 0.35.
Think of the model's per-game knowledge as a distribution: the middle is games the model knows well (Category A — moved by the prompt rule above); the tail is games the model doesn't know (Category B — not moved by prompting, and the focus of a separate experiment on grounding).
In this research, prompting flipped Category A games from fail to pass. Category B games stayed parked at 0.35.
Finding 2: The discrete confidence cliff#
The distribution#
A snapshot of 31 harvests frozen on 2026-04-18 (runs 24592252015 + 24592412851, ungrounded): 30 distinct games, with Elden Ring harvested twice (smoke + main). Judge confidence across all 31 harvests:
| Confidence bucket | Count |
|---|---|
| < 0.50 | 7 |
| 0.50–0.59 | 14 |
| 0.60–0.69 | 1 |
| 0.70–0.74 | 0 |
| 0.75–0.79 | 2 |
| 0.80–0.89 | 7 |
Bimodal. A tight fail cluster around 0.55, a tight pass cluster at 0.78–0.82, nearly nothing in between. The 0.60–0.74 band has exactly one harvest across 31 samples.
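The bucketed counts above can be reproduced with a query along these lines. The bucket edges match the table; the query itself is a sketch against the columns described in the setup section.

```sql
-- Judge-confidence histogram for the frozen 31-harvest snapshot.
-- No harvest in this snapshot scored above 0.89, so the ELSE label is safe.
SELECT
  CASE
    WHEN judge_confidence < 0.50 THEN '< 0.50'
    WHEN judge_confidence < 0.60 THEN '0.50-0.59'
    WHEN judge_confidence < 0.70 THEN '0.60-0.69'
    WHEN judge_confidence < 0.75 THEN '0.70-0.74'
    WHEN judge_confidence < 0.80 THEN '0.75-0.79'
    ELSE '0.80-0.89'
  END AS confidence_bucket,
  COUNT(*) AS harvests
FROM wiki_harvests
WHERE run_id IN ('24592252015', '24592412851')
GROUP BY 1
ORDER BY MIN(judge_confidence);
```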
Looking across the full set of runs (213 games-domain harvests spanning 30 distinct games and 9 runs, multiple prompt and retry configurations; generator = Sonnet 4.6 throughout, judge = Opus 4.7 throughout for the games research), scores cluster at a handful of discrete values:
- 0.35 — the "doesn't-know" fabrication baseline
- 0.55 — the "padded content, issues flagged" cluster
- 0.62 — "some padding resolved but structural issues remain"
- 0.78–0.82 — the pass cluster, above the 0.75 promotion_ready gate
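The pile-up is visible directly by grouping on the raw score. A sketch (the cast and ROUND are just a safety net against near-identical float emissions):

```sql
-- Distinct judge scores, showing discrete clusters rather than a continuous
-- spread. Scope to the games-domain run_ids as needed; omitted here for brevity.
SELECT
  ROUND(judge_confidence::numeric, 2) AS score,
  COUNT(*)                            AS harvests
FROM wiki_harvests
GROUP BY 1
ORDER BY 1;
```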
Why continuous rubrics collapse to discrete values#
The judge prompt asks for a 0.0–1.0 confidence score, but it anchors that score with rubric text like:
- Award 0.9 for excellent coverage.
- Award 0.8 for good coverage with minor issues.
- Award 0.7 for acceptable coverage, some gaps.
- Award 0.5 for significant gaps or inaccuracies.
- Award 0.3 for extensive fabrication.
The model lands on rubric anchors. It interprets "good coverage with minor issues" as 0.78–0.82 because that's the anchor pull. It rarely emits 0.71 or 0.84 because the rubric doesn't have a natural anchor there. In this setup, continuous emission collapsed to discrete clusters around the rubric anchors. I'd expect the same shape from any rubric that anchors specific text at specific score values, but that's a prediction from the mechanism, not a measurement across setups.
The threshold-tuning implication#
Our promotion threshold started at 0.80. Looking at the distribution, 0.80 is inside a cluster (the 0.78–0.82 pass cluster). Games scoring 0.78 are structurally similar to games scoring 0.82 — both are clean passes, rubric-wise. The 0.80 threshold was excluding about half of the clean-pass cluster arbitrarily.
Moving the threshold to 0.75 promoted the entire 0.78–0.82 cluster without letting any marginal cases through. 0.75 sits just above the empty 0.60–0.74 band, in the gap between the fail clusters and the pass cluster.
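What the move buys can be counted directly. A sketch against the same snapshot, comparing the two candidate thresholds:

```sql
-- Harvests promoted at each candidate threshold, and the ones picked up
-- by moving the gate from 0.80 to 0.75.
SELECT
  COUNT(*) FILTER (WHERE judge_confidence >= 0.80) AS promoted_at_0_80,
  COUNT(*) FILTER (WHERE judge_confidence >= 0.75) AS promoted_at_0_75,
  COUNT(*) FILTER (WHERE judge_confidence >= 0.75
                     AND judge_confidence <  0.80) AS picked_up_by_move
FROM wiki_harvests
WHERE run_id IN ('24592252015', '24592412851');
```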
Section-level failure concentration#
The same 31-harvest snapshot broken down by schema section:
| Section | Pass | Fail | Fail rate |
|---|---|---|---|
| mechanics | 29 | 2 | 6% |
| zones | 16 | 15 | 48% |
| bosses | 4 | 27 | 87% |
| items | 12 | 19 | 61% |
| quests | 19 | 12 | 39% |
Bosses fail 13.5× more often than mechanics (27 boss failures vs. 2 mechanics failures across 31 harvests). My working hypothesis for why: mechanics (dodging, cooldowns, RPG stats) recur across many games and tutorials in training data, while boss-level content — a specific game's bosses, their locations, their drops, their attack patterns — is per-game-specific and lives in the long tail of training-data frequency. ("Long tail" describes the rare-events region of a distribution: a small number of items appear very often in training data, and a much larger number of items appear only a handful of times — facts in the long tail are seen rarely, and the model recalls them unreliably as a result.) The general pattern of LLM factual recall correlating with training-data frequency is documented (Kandpal et al. 2023 on long-tail knowledge in LLMs); mapping that mechanism onto specific schema sections in my pipeline is inference, not direct measurement.
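For reference, the shape of the query behind a per-section breakdown like the one above. The section_verdicts column name is hypothetical (the post's other queries only touch judge_confidence and judge_notes); treat this as a sketch of the aggregation, not the pipeline's real storage.

```sql
-- Per-section pass/fail counts, assuming per-section verdicts are stored as a
-- JSONB object like {"bosses": "fail", "mechanics": "pass", ...}.
-- The column name section_verdicts is a hypothetical stand-in.
SELECT
  s.key AS section,
  COUNT(*) FILTER (WHERE s.value = '"pass"'::jsonb) AS pass,
  COUNT(*) FILTER (WHERE s.value = '"fail"'::jsonb) AS fail,
  ROUND(100.0 * COUNT(*) FILTER (WHERE s.value = '"fail"'::jsonb) / COUNT(*), 0) AS fail_pct
FROM wiki_harvests AS h
CROSS JOIN LATERAL jsonb_each(h.section_verdicts) AS s(key, value)
WHERE h.run_id IN ('24592252015', '24592412851')
GROUP BY s.key
ORDER BY fail_pct DESC;
```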
Retry-with-judge-feedback doesn't close the gap alone#
Retry distribution on the same snapshot:
| Retries used | Games |
|---|---|
| 0 | 3 |
| 1 | 3 |
| 2 | 25 |
MAX_SECTION_RETRIES was 2. 80.6% of games exhausted the retry budget, and most of those still failed. Judge feedback passed back to the generator produces marginal lift — some fabrications get corrected on retry — but it's not enough on its own for most games.
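The retry histogram above is the same kind of one-line aggregation; retries_used is a hypothetical column name standing in for wherever the pipeline records retry counts.

```sql
-- Retry usage per game in the snapshot. retries_used is a hypothetical
-- column name; the real pipeline may record this differently.
SELECT retries_used, COUNT(*) AS games
FROM wiki_harvests
WHERE run_id IN ('24592252015', '24592412851')
GROUP BY retries_used
ORDER BY retries_used;
```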
Finding 3: Economics of LLM-generated reference content at scale#
The cost function#
Per-game cost breaks down as:
cost_per_game = (cost_generator + cost_judge) × (1 + avg_retries) × n_sections
For Sonnet-gen + Opus-judge at April 2026 prices (quoted per MTok of input/output), ~65 entries per game across 5 sections, and the retry distribution from above:
- Generator cost per section: ~$0.03–0.08 depending on output length
- Judge cost per section: ~$0.12–0.20 (Opus is pricier than Sonnet)
- Average retries: 1.8
- Per-game cost: ~$2.50–$5.00 end-to-end (full retry budget consumed by most games)
Token assumptions behind these ranges: generator (Sonnet) ≈ 1.5–3K input tokens + 1–2K output tokens per section; judge (Opus) ≈ 3–5K input tokens (generator output + rubric) + 400–800 output tokens per section. Priced at published Anthropic model pricing for Sonnet 4.6 and Opus 4.7; your numbers will drift as prices or model mix change.
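To make the formula concrete, a mid-range worked instance using the midpoints of the per-section ranges above (illustrative constants, not measured per-call costs):

```sql
-- Mid-range instance of cost_per_game = (gen + judge) * (1 + avg_retries) * sections.
-- 0.055 and 0.16 are midpoints of the per-section ranges quoted above.
SELECT ROUND((0.055 + 0.16) * (1 + 1.8) * 5, 2) AS cost_per_game_usd;  -- ~ 3.01
```

Pushing the per-section costs toward the top of their ranges and exhausting the retry budget moves the figure toward the upper end of the quoted $2.50–$5.00 span.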
That's the gross cost. The cost that matters is cost per promoted game — the games that actually pass and become usable output.
Per-promoted-game economics by bucket#
Splitting the sample into the two hallucination buckets:
| Bucket | Promotion rate (prompt-alone) | Cost/game | Cost/promoted game |
|---|---|---|---|
| "Knows-but-pads" (middle) | ~70% after prompt change | ~$4.00 | ~$5.70 |
| "Doesn't-know" (tail) | ~0% | ~$4.00 | not viable |
The middle bucket costs ~$5.70 per promoted game. The tail bucket yields nothing on this lever: the games don't promote, so each ~$4 spent generating produces no usable output. This is the empirical basis for "don't try to generate-at-scale over the tail; use a different lever."
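The ~$5.70 figure is just the flat per-game cost divided by the bucket's promotion rate, using the values from the table above:

```sql
-- cost_per_promoted_game = cost_per_game / promotion_rate.
-- For the doesn't-know bucket the promotion rate is ~0, so the division has
-- no finite answer; that is what "not viable" means in the table above.
SELECT ROUND(4.00 / 0.70, 2) AS cost_per_promoted_game_usd;  -- ~ 5.71
```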
Where generate-at-scale wins vs. where retrieval wins#
From the economics + the taxonomy, the decision matrix falls out:
| Game bucket | Best strategy | Why |
|---|---|---|
| Model knows the game well | Pure synthesis | High promotion rate, low cost/promoted-game. Judge is cheap insurance. |
| Model knows but pads | Synthesis + prompt discipline | Cheap prompt rules lift the whole bucket across the cliff. |
| Model doesn't know | Don't synthesize. Fetch at runtime. | Synthesis doesn't produce usable output for this bucket on the prompt-alone lever. Transient web search answers one query at a time for pennies. |
This is a concrete economic argument for hybrid architectures, grounded in this project's measured costs. It's not "runtime fetch is faster to market" — it's "runtime fetch is the lever that produces answers on the tail when synthesis on its own doesn't." I'd expect the same mechanism (promotion rate going to zero on a doesn't-know tail under prompt-alone synthesis, so cost-per-promoted-output is undefined) to recur in other pipelines with a similar bucket split, but that's a prediction from the mechanism, not a measurement across pipelines.
The corollary: what scales and what doesn't#
Synthesis-at-scale works for the head of the distribution. Popular, well-documented games. AAA titles with rich training-data footprints. The model knows these; the judge validates cheaply; you ship with a known-good confidence profile.
Synthesis-at-scale does not work for the tail. Obscure releases, recent patches, live-service seasonal content. The model doesn't have the knowledge; no amount of prompt engineering manifests it; per-promoted-game economics aren't viable on this lever, because the promotion rate is effectively zero. For the tail, some other mechanism — runtime search, authorized first-party APIs, licensed content partnerships — has to do the work.
This is exactly the shape of the architecture I ended up with: synthetic knowledge for the head, runtime fetch for the tail. The economics and the hallucination taxonomy both point to the same architecture independently. That's the kind of convergence that suggests it's the right answer for this problem shape.
Putting it together#
The three findings compose into a single picture of this pipeline's synthetic-knowledge-generation economics:
- Hallucination in this pipeline split into two buckets. Knows-but-pads vs. doesn't-know were empirically distinct in my data, responded to different interventions, and scaled differently.
- The judge produced clustered scores in this pipeline, not a continuous quality signal — which let the 0.75 promotion threshold sit in the empty gap above the fail clusters and below the pass cluster.
- The economics of scaling depend on which hallucination bucket the games fall into. Head and middle games are cheap per promoted output (~$5.70/promoted game on the prompt-rule lever); tail games don't promote on synthesis at all, so synthesizing them produces no usable output for the per-game cost.
The key learning for me: don't try to get the model to generate reference content for games it doesn't know. Let it generate cleanly for the games it does, gate on a threshold that sits in the empty band between clusters, and handle the tail with transient runtime fetch. That's the architecture the empirical data points at — and it happens to coincide with the architecture that minimizes copyright/redistribution exposure (no bulk third-party caching required) and preserves attribution (runtime fetch couples to source). ToS compliance on the runtime side is a separate constraint handled by the choice of fetch primitive.
Provenance#
All runs are numbered (run_id in the synthetic wiki_harvests table). The SQL queries in this post are the actual queries I ran against the project's database.
The two-bucket split isn't a universal property — it's specific to domains where the model has varying depth of training-data coverage across items.
This is the flagship empirical piece in a series on building LLM-based reference-content pipelines. See also: why "universal coverage" grounding sources can hurt, the runtime-vs-harvest design pattern that resolves the tail-bucket problem, the data appendix for the run-by-run cohort structure behind these findings, and — for the general-audience version of the two-bucket framing — Two Kinds of Things an AI Gets Wrong.