Universal Coverage Isn't Enough: Grounding Sources Must Align With Schema Slots
A counterintuitive result from trying to make an LLM-authored synthetic game wiki more accurate: the source with the widest coverage didn't help on the games that needed it. Three consecutive negative results across Steam, Wikidata, and IGDB in this research pointed at the same diagnosis — and the implication for retrieval-augmented generation over a structured schema, at least in my setup, is "fit matters more than size."
The setup
I've been running a research project that asks a large language model to generate structured reference content for video games: a schema with sections for mechanics, zones, bosses, items, and quests, each entry a row with specific fields. A judge model then grades the output. Games that clear the quality bar get promoted; games that don't stay in quarantine.
The model can do this pretty well from training knowledge alone for popular, well-documented games. For obscure or recent ones, it hallucinates — invents boss names, conflates locations, transplants details from adjacent games. "Salvador" appears as a boss in Control, which has no such character; Oblivion gets NPCs from Skyrim; Doom: The Dark Ages gets entirely fabricated item names that sound plausible.
The obvious fix is to ground the generation: give the model a reference list of real proper nouns from an external source so it has something to anchor on instead of inventing. This is the standard retrieval-augmented generation pattern, and the intuition going in is "more grounding = better."
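The mechanics of that injection can be as simple as prepending per-slot name lists to the generation prompt. A minimal sketch, with slot names from the schema above; the function name and prompt wording are illustrative, not the production pipeline's:

```python
# Minimal sketch of grounding injection. Slot names are the schema's; the
# function name and prompt wording are illustrative, not the production
# pipeline's.

SLOTS = ["mechanics", "zones", "bosses", "items", "quests"]

def build_grounded_prompt(game_title: str, grounding: dict[str, list[str]]) -> str:
    """Prepend real proper nouns, per slot, to the generation instructions."""
    lines = [f"Generate structured wiki content for {game_title}."]
    for slot in SLOTS:
        names = grounding.get(slot, [])
        if names:
            lines.append(
                f"Real {slot} names from an external source (use these, do not invent): "
                + ", ".join(sorted(names))
            )
    lines.append("If a slot has no grounded names and you are unsure, leave it blank.")
    return "\n".join(lines)

# e.g. Wikidata-style grounding for one game:
prompt = build_grounded_prompt(
    "Elden Ring",
    {"bosses": ["Margit the Fell Omen"], "zones": ["Stormveil Castle"]},
)
```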
So I went shopping for grounding sources.
What I expected vs. what happened
The hypothesis was straightforward: Wikidata would cover the games editors had bothered to curate; Steam's achievement schema would serve as a near-universal backup covering most games on the platform (achievement support is optional for developers, but widely adopted); and the combination should beat either one alone.
Three runs, 24–30 games each, varying only the grounding source (full per-run table, plus the threshold history explaining the numbers here, is in the data appendix):
| Run | Grounding | Games | Promoted (at 0.75 threshold) | Avg confidence |
|---|---|---|---|---|
| Baseline | none (pure synthetic) | 24 | 8 (33%) | 0.616 |
| Wikidata | Wikidata only | 30 | 15 (50%) | 0.652 |
| Combined | Wikidata + Steam achievements | 30 | 16 (53%) | 0.657 |
Note on the baseline number. The ungrounded baseline was originally processed under a stricter promotion gate (0.80) and promoted only 2 games (8%) at that gate. Because Arm 2 used a lowered 0.75 threshold, I'm reporting every row above at the 0.75 threshold — otherwise the comparison confounds grounding effect with threshold change. Under 0.75 applied uniformly, the baseline is 8 of 24 games (33%). The historical 2-of-24 figure is preserved in the data appendix.
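Re-reporting at a uniform gate is just re-applying the threshold to the stored per-game judge confidences. A small sketch:

```python
# Re-scoring promotions at a uniform gate: re-apply the threshold to stored
# per-game judge confidences.

def promotion_rate(confidences: list[float], gate: float = 0.75) -> tuple[int, float]:
    promoted = sum(c >= gate for c in confidences)
    return promoted, promoted / len(confidences)

# promotion_rate(baseline_scores, gate=0.80) -> (2, 0.08)   # historical gate
# promotion_rate(baseline_scores, gate=0.75) -> (8, 0.33)   # as reported above
```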
On the combined run, I broke it down further by cohort — which grounding source each game actually received:
| Cohort | Games | Promoted | Rate | Avg confidence |
|---|---|---|---|---|
| wd + steam | 20 | 13 | 65% | 0.692 |
| steam only | 8 | 2 | 25% | 0.566 |
| none | 2 | 1 | 50% | 0.665 |
Check the pass rate. The steam-only cohort — the one where the only grounding available was the "universal backup" — did not beat the ungrounded baseline at all. 25% on n=8 is actually nominally below the 33% ungrounded baseline (at the same 0.75 threshold), though with n=8 the gap sits inside the noise floor of either cohort. The pass rate on those eight games is also well below the 65% achieved by the wd+steam cohort on the same run (where Wikidata did have entities). The games Wikidata couldn't cover were the same games Steam couldn't rescue — that's what "universal coverage didn't help on the games that needed it" means concretely.
And the same-game A/B across Wikidata-only → Wikidata+Steam (the 30 games that appeared in both runs) showed 3 games crossing the verdict gate from fail → pass and 2 crossing the other way, with an average confidence delta of +0.005 across all 30. Noise.
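For reference, that A/B is computed per title across the two runs rather than by comparing aggregates. A sketch, assuming each run is a map from game title to judge confidence:

```python
# Same-game A/B: compare the same titles across two runs at a fixed gate.
# Each run maps game title -> judge confidence; names here are illustrative.

def same_game_ab(run_a: dict[str, float], run_b: dict[str, float], gate: float = 0.75):
    common = sorted(run_a.keys() & run_b.keys())
    fail_to_pass = sum(run_a[g] < gate <= run_b[g] for g in common)
    pass_to_fail = sum(run_b[g] < gate <= run_a[g] for g in common)
    avg_delta = sum(run_b[g] - run_a[g] for g in common) / len(common)
    return fail_to_pass, pass_to_fail, avg_delta
```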
Wikidata is doing essentially all the useful work. Steam achievements, the source I thought would backfill the games Wikidata missed, wasn't backfilling them at all — it was slightly confusing the generator on the games that got them exclusively.
Why universal coverage wasn't enough
The wiki schema has specific slots: bosses, zones, items, quests, mechanics. Each slot is asking for proper nouns of a particular kind — Margit the Fell Omen (boss), Stormveil Castle (zone), Mimic Tear (boss or item depending on context).
Wikidata's linked entities map one-to-one onto those slots. A Wikidata character entity is a boss or NPC name. A location entity is a zone. An item entity is, well, an item. The grounding prompt says "these are real character names from this game" and the generator's next boss hallucination turns into "actually, use one of these real names instead."
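That routing is mechanical. A sketch in which the entity kinds come from the SPARQL type filters and the routing table itself is illustrative:

```python
# Routing typed Wikidata entities into schema slots. The entity "kind" comes
# from the P31 / P31-P279* filters in the SPARQL queries; the routing table
# itself is illustrative.

WIKIDATA_KIND_TO_SLOT = {
    "character": "bosses",   # character entities ground boss / NPC names
    "location": "zones",
    "item": "items",
}

def route_entities(entities: list[tuple[str, str]]) -> dict[str, list[str]]:
    """entities: (label, kind) pairs, already typed by the SPARQL filter."""
    grounding: dict[str, list[str]] = {}
    for label, kind in entities:
        slot = WIKIDATA_KIND_TO_SLOT.get(kind)
        if slot:
            grounding.setdefault(slot, []).append(label)
    return grounding
```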
Steam achievements are titles like "Victory Royale", "First Blood", "100% Completionist". They're near-universal (most Steam games ship with them; achievement support is optional but widely adopted), they're creative-team-written (not auto-generated), and they're exactly as numerous as you need — 30 to 150 per game. All the properties that made them look like a good grounding source on paper.
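For reference, those titles come straight from Steam's public schema endpoint. A sketch, assuming the ISteamUserStats/GetSchemaForGame/v2 Web API endpoint; the response path below is from memory and worth double-checking against the current docs:

```python
import requests

def steam_achievement_titles(appid: int, api_key: str) -> list[str]:
    """Fetch the human-readable achievement titles for one Steam app."""
    resp = requests.get(
        "https://api.steampowered.com/ISteamUserStats/GetSchemaForGame/v2/",
        params={"key": api_key, "appid": appid},
        timeout=30,
    )
    resp.raise_for_status()
    achievements = (
        resp.json().get("game", {}).get("availableGameStats", {}).get("achievements", [])
    )
    # displayName is the creative-team-written title ("Victory Royale", etc.);
    # `name` is the internal ID and is usually not human-readable.
    return [a["displayName"] for a in achievements]
```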
But none of those titles map to any of the schema's slots. An achievement title isn't a boss name. It isn't a zone name. It isn't an item name. It's a marketing artifact that names a player accomplishment, and the semantic distance between "Dragonslayer" (achievement) and "Valstrax" (the dragon boss you killed to earn it) is close to the distance between "Breakfast of Champions" and "cornflakes."
When the generator gets Steam achievements as grounding, it's getting a list of proper nouns that look formally similar to what it needs — capitalized, multi-word, in-game-universe-adjacent — but don't actually fit the slots it's filling. The result is predictable in retrospect: the generator either ignores the grounding (best case) or tries to thread achievement titles awkwardly into boss/zone/item descriptions (worst case, which hurts the output).
Coverage was broad. Alignment was wrong. The grounding source didn't fit the schema's shape.
The broader pattern: three negative results in a row
I didn't believe this from one experiment. I tested it three more ways, and each time the diagnosis held.
Broader Wikidata queries didn't rescue the tail. The original query used narrow type filters — P31 (instance-of) constrained to character, P31/P279* (subclass-of transitive closure) constrained to location. The obvious next move was to widen the SPARQL: drop strict filters, add P527 (has-part) for DLC and chapters, include broader location subtypes. Same eight games across three quality buckets, production queries vs. four broadened variants. Net across all games: +2 entities on Pathfinder WotR, +2 on DOOM: The Dark Ages, zero anywhere else. And the two "new" entities on each game were cross-slot duplicates — the same entity surfacing twice as a character and an item because the P31 filter was doing the disambiguation work. Stripping the filter actively hurt the grounding, because the generator got the same name twice, making the output worse than the production queries would have. The ceiling wasn't query precision. It was editor coverage. If Wikidata editors hadn't written the page, no SPARQL rewrite manifested one.
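For a concrete sense of the narrow shape, here is a query in the spirit of the production character query, runnable against the public Wikidata SPARQL endpoint. The game QID is a placeholder and Q95074 ("fictional character") stands in for whatever character class the production query actually constrained on:

```python
import requests

WDQS = "https://query.wikidata.org/sparql"

# Characters present in the game (P1441), typed by a plain P31 filter.
# The broadened variants dropped the type filter and added P527 (has part)
# from the game item; on these games that mostly surfaced duplicates.
NARROW_CHARACTERS = """
SELECT DISTINCT ?entity ?entityLabel WHERE {
  VALUES ?game { wd:Q_PLACEHOLDER }      # bind the game's QID here
  ?entity wdt:P1441 ?game .              # present in work: the game
  ?entity wdt:P31 wd:Q95074 .            # instance of: fictional character (stand-in)
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def run_query(sparql: str) -> list[dict]:
    resp = requests.get(
        WDQS,
        params={"query": sparql, "format": "json"},
        headers={"User-Agent": "grounding-probe/0.1 (replace with a contact address)"},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]
```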
IGDB as an alternative catalog didn't rescue the tail either. IGDB (Twitch-owned, game-aware, comprehensive on the metadata layer) seemed like the natural complement to Wikidata. The probe showed IGDB has near-zero character curation — zero character entries for every game I tested. IGDB is a catalog of games, not a wiki of entities within games. Its real yield was DLC and expansion titles, which is useful for the quests slot but doesn't touch bosses or items — and crucially, still returns zero on the doesn't-know tail. Same eight-game matrix, same result: if the catalog wasn't written, no lookup fixes it.
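A probe like that takes two requests against the IGDB v4 API (Twitch OAuth app token, Apicalypse query bodies). Endpoint and field names below are from my reading of the docs and worth verifying:

```python
import requests

IGDB = "https://api.igdb.com/v4"

def igdb_probe(game_id: int, client_id: str, token: str) -> dict:
    """Probe IGDB for per-game entities; endpoints and fields per the v4 docs."""
    headers = {"Client-ID": client_id, "Authorization": f"Bearer {token}"}

    # Characters linked to the game. This is the lookup that comes back empty
    # on the games that need grounding: IGDB catalogs games, not their casts.
    chars = requests.post(
        f"{IGDB}/characters",
        headers=headers,
        data=f"fields name; where games = ({game_id}); limit 500;",
        timeout=30,
    ).json()

    # DLC and expansion titles: the one slot-relevant thing IGDB did yield,
    # useful for the quests slot but not for bosses or items.
    games = requests.post(
        f"{IGDB}/games",
        headers=headers,
        data=f"fields name, dlcs.name, expansions.name; where id = {game_id};",
        timeout=30,
    ).json()

    return {"characters": [c["name"] for c in chars], "games": games}
```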
Web search ground out the tail but still didn't clear the quality bar. Unbounded web search via Anthropic's web_search tool returned 245 credible proper nouns across the four hardest games where Wikidata had returned 3. An 80× yield, citations on the usual mix of game-wiki and gaming-news domains. This looked like the breakthrough — until it went through the downstream judge. The promoted rate on those games stayed at 0/12. The judge was catching a different failure mode at this point: the generator, given rich grounding, was producing structured relationships between entities that weren't in the grounding (wrong boss-location pairings, wrong item-location pairings, wrong drop tables). Names weren't the bottleneck any more. The relationships between names were. Which brings me to the meta-finding.
The meta-finding
Across four separate grounding experiments in this research — Steam achievements, broader Wikidata queries, IGDB, and unbounded web search — the consistent diagnosis was:
In this research, grounding sources didn't rescue a schema they weren't shaped like. I'd expect the same mechanism to hold elsewhere — structured slots asking for a specific kind of entity won't be filled by a source whose entities are a different kind — but that's a prediction from the mechanism, not a measurement across pipelines.
- Steam achievements: wrong slots (titles, not entities).
- Broader Wikidata: right slots, but thin editor coverage below the ceiling.
- IGDB: right catalog, wrong entity granularity (no characters).
- Web search: right names, wrong relationships.
The question to ask before picking a grounding source isn't "what has the widest coverage?" It's "what shape is my schema asking for, and which source natively produces content of that shape?" If the answer is "no source does," my experience in this research is that further retrieval-augmentation engineering didn't fix it — what fixed it was a different lever (transient runtime search, licensed content, or human curation).
This is the opposite of the intuition I went in with, and I don't think I'm the only one — universal coverage feels like a Pareto-dominant choice because it includes everything; alignment with schema slots feels like a narrow constraint. In this research the alignment constraint was the binding one, and coverage only mattered conditional on alignment.
Practical guidance for picking a grounding source
From here, my heuristic for evaluating any new candidate grounding source goes:
1. Write down the shape of each schema slot first. What kind of proper noun goes there? A character name? A location? A catalog entry? A numeric parameter? Write it plainly, not with abstract types.
2. Sample the candidate source and check whether its entities natively answer "what goes in slot X." Not "could we reshape its output to fit slot X" — that's the Steam achievement trap. Does a person who already knows what's in slot X look at the source and see entities of that kind? (A sketch of this audit follows the list.)
3. Test on the hard cases, not the easy ones. Your schema's popular-game coverage will tend to look good. The question is whether the source fills gaps on obscure or recent cases — that's where you're actually using it. If it doesn't, you're buying coverage you don't need.
4. Measure same-game A/B, not aggregate. Run the same 30 games with and without the source. Promotion rate moves or it doesn't. Aggregate comparisons across different game lists hide everything that matters.
5. Budget for one real negative result per grounding source. The universal-coverage temptation is easy to fall into (I did); each candidate source is worth its own measured rejection before you're confident you've exhausted the harvest-time lever.
The five steps add up to: trust alignment, verify coverage, A/B on the tail, and don't pick grounding sources by vibes.
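The slot-fit check in step 2 can be run mechanically on a sample. A sketch, where the labeling step (a human spot-check or a small classifier) is left as a callback:

```python
from collections import Counter
from typing import Callable, Optional

# Slot-fit audit for a candidate grounding source (step 2 above). label_fn
# answers "which schema slot, if any, does this entity natively fit?"

SLOTS = ("mechanics", "zones", "bosses", "items", "quests")

def slot_fit_rate(sample: list[str], label_fn: Callable[[str], Optional[str]]):
    labels = [label_fn(entity) for entity in sample]
    fits = Counter(label for label in labels if label in SLOTS)
    return sum(fits.values()) / len(sample), fits

# Steam achievement titles fail this audit almost uniformly: "Victory Royale"
# fits no slot, so the fit rate lands near zero even though coverage is broad.
```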
The harvest-time ceiling
The arc here closed on a finding I didn't expect going in. In this pipeline, there was a ceiling on how much harvest-time grounding could lift the model's output, and that ceiling wasn't set by grounding source quality — it was set by the model's preference for plausible completeness over honest blanks. Claude Sonnet 4.6 fabricated 2–5 entities per game on the doesn't-know tail regardless of whether grounding was narrow-names (Wikidata), rich-prose (web search), or structured-with-tuples. Those variables shifted the judge's confidence by ±0.10; none of them broke the fabrication cluster.
The cleanest response to that ceiling turns out not to be "ground harder at harvest time." It's "route around harvest entirely on the hard cases." At runtime, when a question comes in that lands on a low-confidence topic, fire a transient web search right then, answer from live results, don't persist. The harvest keeps its clean easy-middle output; the runtime handles the tail.
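The routing itself is a single branch. A sketch with the corpus lookup and the live-search answer injected as callables, since their implementations are out of scope here; the gate value is purely illustrative:

```python
from typing import Callable

# Runtime routing sketch: answer from the persisted synthetic corpus on the
# easy middle, fall back to a transient web search on low-confidence topics,
# and never persist what the search returned.

LOW_CONFIDENCE = 0.75   # illustrative gate; pick per deployment

def answer(
    question: str,
    topic_confidence: float,
    from_corpus: Callable[[str], str],
    from_live_search: Callable[[str], str],
) -> str:
    if topic_confidence >= LOW_CONFIDENCE:
        return from_corpus(question)      # promoted synthetic content
    # Transient path: fetch, answer, discard. Deliberately no write-back,
    # so nothing third-party enters the persistent corpus.
    return from_live_search(question)
```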
That's the pattern I ended up with: synthetic for the easy middle, runtime search for the hard tail, nothing third-party persisted. It also independently reduces copyright/redistribution exposure, which is a second reason I converged on it. The ToS (Terms of Service) axis is separate and has to be handled by picking fetch primitives whose terms permit automated access.
The longer argument for that pattern lives in the runtime-vs-harvest post. For this post, the takeaway is narrower:
Universal coverage is not the right question. Slot alignment is. In this research, a narrow source that natively fit my schema beat a universal source that didn't — the Steam-only cohort, where the "universal" source was the only grounding available, hit 25% on n=8 against a 33% ungrounded baseline. Directionally worse, statistically indistinguishable, and well below the 65% the slot-aligned source delivered when it had coverage.
This is part of a series on empirical architectural findings from building an LLM-based reference-content pipeline. See also: the two-bucket hallucination taxonomy, the runtime-vs-harvest design pattern that resolves the tail-bucket problem, and — for the same finding in narrative form, with less SQL and more story — the general-audience post The Time I Thought "More Data" Would Fix My AI.