# Writing Isn't Verification — and Neither of Us Caught It
A short story about working with an LLM on factual writing — and noticing, after a careful editorial pass, that the prose was both clean and quietly wrong.
Who this is for: if you're using an LLM on factual writing and noticing it produces confident-sounding errors, this post is for you. It's the most technical of the general-audience posts on this site — it walks through the actual mechanics of my editorial-verification system (rules, a linter, CI, an LLM-judge audit). If you'd rather start with a less technical entry point, the two kinds of things an AI gets wrong covers the same problem space at the level of "when can you trust an AI answer." Tools exist in the AI-assisted-writing space; what follows is one approach I built and ran on my own writing, shared as an experiment, not a comparison against alternatives.
I recently published seven blog posts about AI failure modes — three for a general audience, four with the underlying research data. Along the way I wrote a standards document for the writing: rules to follow, things to avoid, what counts as a valid citation, when to scope a claim, what an "escape hatch" looks like and why to refuse them. Seven rules, carefully constructed, with examples and edge cases. The AI helping me revise the posts followed them.
Then I did one more pass. This time I went back and verified every specific number — promotion counts, confidence cluster sizes, snapshot run IDs, SQL queries — against the actual database where the research lives.
The posts had about twenty factual errors.
## The shape of the errors
A few representative ones, to give the flavor: a confidence cluster bullet that said "5 games at 0.82" when the actual count was 7. Three rows in a per-run table that showed "—" (no data) for promotion when the database actually had real values. A copy-paste-ready SQL example that referenced a column the schema doesn't have. An "8% → 50%" improvement claim that quietly compared two different promotion thresholds and credited the entire delta to the intervention. None caught by the writing rules; each catchable by a single query against the source of truth. There's a Receipts section toward the end of this post with three of them written out in full.
The prose was fluent, well-structured, and well-cited. It was also quietly wrong.
## The diagnosis
I'd been writing rules about how to write. None said how to verify.
The closest I came was the citations rule: "Every non-obvious factual claim needs a source the reader can verify." That rule — correctly — requires that a reader can check the claim. It does not require that I check it before publishing.
The hidden assumption in that omission: that the source the reader can verify is the source the writer copied from — that derivation has already been done correctly upstream, and the writer just cites it. That assumption holds for a writer working from a finished paper. It breaks for a writer working from an in-flight database that gets queried freshly each time. The numbers in the post have to be re-derived from the live source before publishing, not copied from an earlier draft.
Seven rules about writing. Zero about verifying. The AI followed every editorial rule while reproducing whatever number it found in the latest draft — sometimes right, sometimes not.
## Two disciplines
The discipline I'd been writing — writing rules — teaches voice, structure, restraint, and sourcing-as-norm. These rules make prose look trustworthy.
The discipline I'd been missing — verification rules — teaches re-derivation (compute every load-bearing number directly from the source), reproducibility (every SQL or code example must execute against the live schema), cross-document consistency (two posts referencing the same artifact must agree literally), and apples-to-apples comparison (don't compare a strict-gate "before" to a loose-gate "after" and attribute the delta to the intervention). These rules make prose be trustworthy.
Two distinct disciplines. One can be perfectly satisfied while the other is silently failing. That's exactly what happened in mine.
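To make "re-derive from the source" concrete: the check is a query, not a judgment call. Here's a minimal sketch of the habit, using the confidence-cluster example above — the database path, table, and column names are hypothetical stand-ins, not the project's real schema:

```python
# Sketch: re-derive a load-bearing number from the source of truth instead
# of trusting whatever the latest draft says. The path, table, and column
# names are hypothetical stand-ins for the real research database.
import sqlite3

DB_PATH = "research.db"      # assumed location of the source-of-truth database
CLAIMED_IN_DRAFT = 5         # what the prose currently asserts ("5 games at 0.82")
THRESHOLD = 0.82             # the confidence cluster the bullet describes

conn = sqlite3.connect(DB_PATH)
(actual,) = conn.execute(
    "SELECT COUNT(*) FROM games WHERE confidence = ?", (THRESHOLD,)
).fetchone()
conn.close()

if actual != CLAIMED_IN_DRAFT:
    print(f"MISMATCH: draft says {CLAIMED_IN_DRAFT}, database says {actual}")
else:
    print(f"OK: {actual} rows at confidence {THRESHOLD}")
```

The point isn't the query; it's that the number in the prose becomes an output of this step rather than an input copied from an earlier draft.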
## An older problem, suddenly louder
The split between writing and verifying isn't an LLM-era invention. Newsrooms have run separate copy desks and fact-checking departments for the same reason: a sentence can be well-edited and still false, and a different cognitive mode catches each. Scientific journals separate editorial review from peer methodology checks. Engineering teams separate code review (style, design, intent) from CI tests (does it actually do what it claims). Wikipedia ships [citation needed] tags because writing rules alone don't enforce verification.
What's new is that one side of the ledger got much cheaper. Generation used to be the bottleneck; now it's nearly free. Verification did not get cheaper at the same rate. So the writing-to-verification ratio shifted — generation cheap, verification still expensive — and the gap between what's easy to produce and what's easy to check widened. That's the particular gap this post is in.
The lesson isn't novel, then. It's a rediscovery of why those separate desks existed in the first place. What's specific to working with LLMs is that the gap is wider, faster to open, and easier to walk into than it used to be — and that a one-person workflow now has to recreate the institutional pattern itself. In this project, the editor / copy-desk / fact-checker separation that newsrooms institutionalized as distinct roles lives inside the writer's own tooling. That's a single-person approximation, not a replacement for actual editors, copy desks, or fact-checkers: the shape of the discipline made tractable for a solo workflow, not the institutional version. The pattern is old; having to ship it as one person is what's new.
## Where the responsibility actually sits
This isn't a "look how unreliable the AI was" story. The AI followed every rule I gave it — and the same model that let the errors through caught and fixed them once I pointed it at the database. The gap wasn't capability. It was the instructions.
For anyone using an LLM on factual writing, the instructions split into two categories that don't substitute for each other:
Writing instructions — voice, audience, structure, scope, what kinds of claims need citations. These shape the surface of the output.
Verification instructions — which database is the source of truth for which claim, when to run the query versus trust the prose, how to test a code example before committing it, how to spot apples-to-oranges comparisons. These shape what the output is.
Write only the first category and you get fluent, well-structured, well-cited prose that may or may not be true. Write both and the load-bearing facts come pre-checked against a source you can point at.
Both matter. Conflating them — assuming good writing rules will produce truthful output by themselves — is the failure mode this post documents.
## What I changed
I added four verification rules to the standards document. Each names a habit that would have caught a specific class of error: re-derive cluster counts from underlying rows rather than copying summary text, test SQL against the live schema before committing it, check internal URLs against the actual route table the build produces, confirm any A/B comparison uses identical conditions on both sides. I also documented the source of truth explicitly — a database for the quantitative claims, a private narrative log for the qualitative examples — with a small table mapping which post relies on which.
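"Test SQL against the live schema before committing it" is the most mechanizable of those four habits. A minimal sketch, assuming a SQLite source database and SQL examples kept in fenced blocks inside the posts — both assumptions about the layout, not its documented form:

```python
# Sketch: execute every fenced SQL example in the posts against the live
# schema, so a query that references a nonexistent column fails before
# publish rather than in a reader's terminal. Paths are assumptions.
import re
import sqlite3
from pathlib import Path

DB_PATH = "research.db"                  # assumed source-of-truth database
POSTS = Path("src/content/posts")        # post directory from the repo layout

SQL_BLOCK = re.compile(r"```sql\n(.*?)```", re.DOTALL)

conn = sqlite3.connect(DB_PATH)
failures = 0
for mdx in sorted(POSTS.glob("*.mdx")):
    for block in SQL_BLOCK.findall(mdx.read_text(encoding="utf-8")):
        try:
            # Parses the statement and resolves table/column names
            # without running the full query.
            conn.execute(f"EXPLAIN QUERY PLAN {block}")
        except sqlite3.Error as err:
            failures += 1
            print(f"{mdx.name}: SQL example does not run: {err}")
conn.close()
raise SystemExit(1 if failures else 0)
```

EXPLAIN QUERY PLAN forces SQLite to resolve every name the example references, which is exactly the class of error — a column the schema doesn't have — that the original posts shipped.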
The next pass through the posts caught the twenty-odd errors. A sanitized version of the standards document is published as Blog Writing Standards — the rules organized along three axes (correctness at publish time, verification before publishing, durability over time), with worked examples on this site. The post you're reading is itself part of the same discipline, made public.
## Making the rule self-enforcing
Adding a verification rule to a standards document is the first step. The rule still depends on the writer to remember it. Even a careful writer with the rule in front of them will miss applications — hyperbole feels right in the moment, and the audit step doesn't fire on its own. The same failure mode that produced the original twenty errors keeps producing new ones at a lower rate.
So the rule needed a second layer: a small script that scans the published Markdown for the structural patterns the rule enumerates ("every," "always," "all," "ten times," "stayed flat," "the only," etc.) and surfaces every place a check needs to happen. The script doesn't judge — many hits are defensible (mechanism descriptions, prescriptive rule statements, direct quotes from prompts). It just flags them so the writer can triage. A typical hit, taken from a recent run on this post:
src/content/posts/writing-isnt-verification.mdx (1 new hit)
146:61 [absolute modifier] "every" — ard RAG tutorial" instead of "every standard tutorial on the inte
New (un-baselined) hits: 1 across 1 file. Total in scan: 252 (baseline accepts 251).
Each hit needs a Rule 7 audit: universally true, citable inline, or explicitly scoped? If none, fix it.
Each line gives the writer everything needed to make the call: file, line, column, the matched trigger word, its category, and ~40 characters of surrounding context. Three categories cover most hits — absolute modifiers (every, always, only, never), sociological / industry claims (most engineers, every team), and hyperbole (dramatically, ten times, stayed flat). The example above is the trigger word "every" appearing inside a quoted anti-pattern phrase the post is critiquing — defensible in context, accepted into the baseline.
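The core of that script is small. A minimal sketch, with the trigger lists abbreviated to the examples named above — the real scanner enumerates more patterns and handles context more carefully:

```python
# Sketch of the claim scanner: walk the published Markdown, match the
# trigger vocabulary by category, and report line:column plus ~40 chars
# of surrounding context. It flags; it does not judge.
import re
from pathlib import Path

# Abbreviated trigger lists; the real script enumerates more patterns.
CATEGORIES = {
    "absolute modifier": r"\b(every|always|all|only|never)\b",
    "sociological claim": r"\b(most engineers|every team|usually)\b",
    "hyperbole": r"\b(dramatically|ten times|stayed flat)\b",
}

def scan(path: Path):
    """Yield (file, line, column, category, matched word, context) per hit."""
    for lineno, line in enumerate(path.read_text(encoding="utf-8").splitlines(), 1):
        for category, pattern in CATEGORIES.items():
            for m in re.finditer(pattern, line, re.IGNORECASE):
                context = line[max(0, m.start() - 20):m.end() + 20]
                yield (str(path), lineno, m.start() + 1, category, m.group(0), context)

if __name__ == "__main__":
    for mdx in sorted(Path("src/content/posts").glob("*.mdx")):
        hits = list(scan(mdx))
        if hits:
            print(f"{mdx} ({len(hits)} hits)")
            for _, ln, col, cat, word, ctx in hits:
                print(f'  {ln}:{col} [{cat}] "{word}" — {ctx}')
```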
The script alone has a problem too. A first run on the existing posts surfaced about two hundred hits, most of them defensible. Treating every hit as a failure would either block all commits (annoying) or train the writer to ignore the red light (useless — the broken-windows failure mode for linters). So the script ships with a baseline file: each accepted hit is recorded as a tuple of file, line, matched word, category, and surrounding context, and the script's strict mode only fails on hits not in the baseline. New patterns introduced beyond the baseline cause the build to fail; existing defensible uses don't.
The baseline file is itself an audit record: every entry represents a writer-accepted claim with its location and context preserved in the repo. Adding a defensible new claim that the scanner flags is one command (npm run lint:claims:update) plus a deliberate commit. The friction is right — visible enough that the writer notices they're adding an exception, not so high that it blocks legitimate writing.
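Strict mode is the same scan plus a set-membership check. A sketch, reusing scan() from the block above — the baseline file name is a hypothetical stand-in, though the record shape (file, line, matched word, category, context) follows the description above:

```python
# Sketch of strict mode: load the accepted-hit baseline, re-scan, and fail
# only on hits that are not already baselined. Uses scan() from the sketch
# above; the baseline file name is an assumption.
import json
import sys
from pathlib import Path

BASELINE_PATH = Path("claims-baseline.json")   # hypothetical file name

def key(hit):
    # Mirror the baseline record: file, line, matched word, category, context.
    file, line, _col, category, word, context = hit
    return (file, line, word, category, context)

baseline = {tuple(entry) for entry in json.loads(BASELINE_PATH.read_text())}
new_hits = [hit
            for mdx in sorted(Path("src/content/posts").glob("*.mdx"))
            for hit in scan(mdx)
            if key(hit) not in baseline]

for _, ln, col, cat, word, ctx in new_hits:
    print(f'  {ln}:{col} [{cat}] "{word}" — {ctx}')
print(f"New (un-baselined) hits: {len(new_hits)}")
sys.exit(1 if new_hits else 0)   # the non-zero exit is what the CI gate keys on
```

Accepting a defensible new hit then amounts to appending its record to the baseline and committing the change — the deliberate friction the update command provides.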
The third layer is CI. The same scan runs on every push and PR via a GitHub Action. New unbaselined hits fail the build. The editorial standard is now a property of the build, not a property of human memory.
Stated plainly: for any editorial rule that can be partly mechanized, build all three layers.
- The rule — shapes the writer's habit at writing time.
- The local check — catches the writer's lapses at commit time.
- The CI gate — catches anything the writer forgot to run, on every push, forever.
Each layer catches what the others miss. The single-layer version — just the rule in a document — is what produced the original twenty errors. Three layers don't guarantee zero errors. They do guarantee that the checks still run when the human forgets to.
The division of labor that emerged: the linter's job isn't to catch everything; it's to catch the cheap class so human attention can spend itself on the expensive class. The cheap class is anything with stable trigger vocabulary — "dramatically," "every team," "ten times," patterns the scanner can match without false-positive flooding. The expensive class is everything else: bare "most" + arbitrary noun ("most hallucination discourse"), prescriptive framings ("is the architectural resolution"), vague hedges ("no better than no source"). Trying to lint the expensive class would either flood the baseline with junk or miss the cases that matter — both of which kill the discipline. So the linter stays narrow on purpose, the human reviewer and the LLM judge handle the rest, and each layer ends up doing the work it's actually good at.
The loop in practice, live. While I was drafting this very section, I wrote "dramatically cheaper" and "the writing-to-verification ratio shifted hard." Both got flagged by the linter at the next commit attempt — "dramatically" is in the hyperbole pattern list, "shifted hard" is a magnitude claim without a number behind it. I softened both ("much cheaper," "shifted"). A few sections later, drafting an editorial-process metrics block, I wrote "the structure produced this way is usually shaped to the actual failure modes you have" — the linter flagged "usually" as a sociological-claim trigger; the qualifier got dropped, the prose got more direct. The post about the discipline got the discipline applied to it as it was being written. The CI loop tightening the writing it's documenting is the loop working as designed.
## The pattern, beyond this post
The discipline above isn't specific to writing about AI hallucination. The shape generalizes — replace "blog posts about LLM failure modes" with API documentation, research methodology, policy memos, or any factual-writing domain where claims need to be defensible. The mechanics stay the same:
- Notice failures. Catch specific instances where the writing went wrong.
- Generalize. Name the category — what kind of failure is this? What rule would have caught it?
- Encode the rule. Write it into a standards document the writer reads at write time.
- Mechanize the check. Build a scan that fires on the patterns the rule names. The scan doesn't judge; it surfaces places where a judgment is needed.
- Baseline. Snapshot the existing acceptances so the scan doesn't drown the writer in noise on first run.
- CI-gate. Wire the strict check into the build so the rule survives without depending on human memory.
- Independently verify. Send a sample (or the full corpus) of accepted prose to a second judge — a different LLM context, a different model, or a human reviewer — and compute the disagreement rate (a minimal sketch of this step follows the list).
- Iterate from real findings. Each judge catch becomes either a prose fix or a rule refinement.
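Here is that verify step sketched out, with the judge itself left as a stub — wiring it to a second model, a fresh LLM context, or a human reviewer is your own plumbing, and the baseline file name is the same hypothetical one as above. The way the rate is counted here is one plausible definition; the published methodology may define it differently.

```python
# Sketch of independent verification: re-judge every writer-accepted baseline
# entry and measure how often the second judge disagrees. The judge call is
# a stub; point it at a second model, a fresh context, or a human reviewer.
import json
from pathlib import Path

def judge(entry) -> str:
    """Return 'defensible', 'should-fix', or 'borderline' for one accepted
    baseline record (file, line, matched word, category, context).
    Stub: replace with a call to the second judge."""
    raise NotImplementedError

entries = json.loads(Path("claims-baseline.json").read_text())
verdicts = {"defensible": 0, "should-fix": 0, "borderline": 0}

for entry in entries:
    verdicts[judge(entry)] += 1

# Every baseline entry was writer-accepted, so any non-"defensible" verdict
# is a disagreement between the writer and the judge. (One way to count it;
# the published methodology may weight verdicts differently.)
disagreements = verdicts["should-fix"] + verdicts["borderline"]
print(f"{verdicts}, disagreement rate {disagreements / len(entries):.1%} "
      f"over {len(entries)} entries")
```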
What's transferable is the system shape: rule, linter, baseline, CI, judge — five pieces, each well-defined, each replaceable independently. What doesn't transfer is the specific entries: your trigger patterns, your categorization tree, your judge's rubric, your disagreement-rate floor — those get shaped by the failure modes you run into. The structure is the portable artifact; the entries get re-derived against your own pipeline.
The architectural recognition that made this work: editorial verification is the generator/judge LLM pattern applied to a new domain. It's the same architecture used to grade model-generated content elsewhere — one model produces, another grades — instantiated here as "an LLM agent classified each linter hit; was it right?" Once you see editorial verification as a special case of LLM-as-judge, the existing tooling fits without invention.
If you're using an LLM seriously on factual writing, the question shifts from "what's the right prompt" to "what's the right system around the prompt." The prompt is one component; the rule, linter, baseline, CI, and judge around it are what make the component reliable enough to trust at scale.
## Receipts
A post about editorial verification deserves receipts of that verification at work, not receipts borrowed from a different audit. Three, from the discipline running on this very repository:
1. The judge caught a real miss, and the loop closed. A full audit against the baseline — every accepted Rule 7c verdict across all 12 .mdx files on the site, 250 entries — returned 2 "should-fix" flags. Both were on line 73 of ai-that-respects-creators.mdx, both the same shape: "every X does Y" sociological claim with no scope marker. The audit-log entry for one:
{
  "file": "src/content/posts/ai-that-respects-creators.mdx",
  "line": 73,
  "match": "every",
  "category": "absolute modifier",
  "verdict": "should-fix",
  "reason": "'every standard tutorial on the internet teaches' is a sociological/hyperbolic claim with no citation or scope marker."
}

The prose now reads "the standard RAG tutorial" instead of "every standard tutorial on the internet" — same content, scoped honestly. The next sweep, run after the fix landed, returned 0 should-fix flags across 252 entries. The judge flagged a real miss; the prose was corrected; the re-audit confirmed the fix. That round-trip — flag, fix, re-verify — is the discipline working end-to-end.
2. The disagreement rate dropped from 3.0% to 1.0%. Across the two full sweeps, the verdict counts moved from 237 / 2 / 11 (defensible / should-fix / borderline, N=250) to 247 / 0 / 5 (N=252). The full per-pass table — three earlier samples plus both full sweeps — is published on the standards page, along with the cost, the methodology, and what the 3–5% convergence does and doesn't tell you.
3. The same overclaim shape recurs across the corpus. That sociological-claim pattern wasn't a one-off. The post ai-that-respects-creators.mdx had it at line 31 (caught and softened during the original proofread to "It's the instinctive reach…") and again at line 73 (caught later by the audit). Same author, same post, same overclaim shape, different lines. The discipline alone doesn't make the writer immune to recurring patterns; the mechanical scan does. Without the loop, the line-73 instance would have shipped to the live site. With it, the round-trip above is what the system produced instead.
## The detector litmus test
Out of curiosity, I ran every post through two Hugging Face AI-detection models and compared the results against three known-human reference texts — a Wikipedia article, a Jane Austen novel, and the PEP 8 style guide. Both detectors run locally; the full script is in scripts/ai-detect.py.
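The numbers in the table come from that script. Since the post doesn't name the exact checkpoints, the two model IDs in this sketch are plausible publicly available stand-ins (a 2019 GPT-2-output detector and a 2023 ChatGPT-trained detector) — assumptions, not the script's real configuration:

```python
# Sketch of the detector comparison. The two checkpoints are assumed
# stand-ins; the real scripts/ai-detect.py may point at different models,
# and the two models use different label vocabularies (e.g. Real/Fake
# vs. Human/ChatGPT), so read label and score together.
from pathlib import Path
from transformers import pipeline

DETECTORS = {
    "GPT-2 detector (2019)": "openai-community/roberta-base-openai-detector",
    "ChatGPT detector (2023)": "Hello-SimpleAI/chatgpt-detector-roberta",
}

posts = sorted(Path("src/content/posts").glob("*.mdx"))

for name, model_id in DETECTORS.items():
    clf = pipeline("text-classification", model=model_id)
    for post in posts:
        text = post.read_text(encoding="utf-8")
        result = clf(text, truncation=True)[0]   # both models cap input at 512 tokens
        print(f"{name:24} {post.name:40} {result['label']:8} {result['score']:.3f}")
```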
| Source | GPT-2 detector (2019) | ChatGPT detector (2023) |
|---|---|---|
| All 9 posts in this corpus (avg) | 97.5% AI | 0.1% AI |
| Wikipedia: Apollo 11 | 99.3% AI | 99.8% AI |
| Pride and Prejudice (1813) | 99.5% AI | 3.9% AI |
| PEP 8 (2001 technical) | 99.8% AI | 3.6% AI |
The 2019 GPT-2 detector flags everything as AI — including Pride and Prejudice — so it's noise floor; ignore it. The 2023 ChatGPT-trained detector is more interesting: Wikipedia's encyclopedic, structured-fact prose scores 99.8% AI; pre-LLM personal and technical prose scores 3–4%; this corpus scores ~0%.
The detector isn't measuring origin — the posts are LLM-assisted and disclose it openly. It's measuring voice. The corpus reads as personal-writer voice rather than encyclopedic-AI-summary voice. Those are exactly the style markers the discipline (first-person scoping, specific receipts, narrative arc, no rhetorical inflation) is designed to produce. The honest takeaway: the discipline's output differs from what an AI-detection tool associates with AI-generated prose, even when the prose is in fact LLM-assisted.
Caveat: AI-detection tools are unreliable enough that no single score is definitive — OpenAI deprecated its own classifier in 2023 after measuring 26% true-positive at 9% false-positive; Sadasivan et al. 2023 argue detection has fundamental theoretical limits; and Liang et al. 2023 documented systematic bias against non-native English writers as one of several known failure modes. This is a snapshot of two specific detectors as of 2026-04-26 against a small sample — the 9 posts in this corpus, all written by one author applying the same discipline for the first time. A future, better-trained detector could score this corpus differently; a wider sample (more posts, more authors, different discipline configurations) could either reinforce or weaken the contrast. The discipline hasn't been tested at scale. But the contrast in this snapshot — Wikipedia at 99.8%, this corpus at ~0%, pre-LLM personal prose at 3–4% — is consistent and tells a coherent story about voice, not about origin.
## The takeaway
If you're working with an LLM on factual writing, the question isn't "how do I get the AI to write better." It's "how do I get the AI to verify what it just wrote." Writing rules without verification rules produce something that looks rigorous and isn't. Verification rules without writing rules produce something that's correct and unreadable. Verification rules without a mechanical check produce something that's correct until the writer forgets to apply them. You need all three layers — and the last one is the only layer that doesn't depend on the writer remembering to do anything.
See also the two kinds of things an AI gets wrong, AI That Respects Creators, and the time I thought "more data" would fix my AI, or the deeper research notes: hallucination taxonomy, grounding schema alignment, runtime vs. harvest, and the data appendix.