Editorial standards#
These are the editorial rules behind the writing on this site. They were built from my own experience on one personal project — not from an attempt to codify an official standard, a generally-accepted best practice, or anything that has been through external review. I'm publishing them here as a public artifact in case the structure (or any of the specific rules) is useful to someone else doing similar work, with the up-front caveat that they're shaped by my pipeline, my audience, and the specific failure modes I personally ran into.
If you adapt them, treat them as one person's working notes, not a framework with the kind of authority that comes from peer review, institutional adoption, or consensus.
The site is a small personal blog. The posts split into two categories — general posts for non-specialists and research posts for practitioners — with different audiences, vocabularies, and durability mechanisms. The rules below are the discipline I've extracted from running into the wrong shape several times. Most are obvious in retrospect; none were obvious to me at the time.
The rules organize along three distinct axes:
- Rules 1–7 — correctness at publish time: voice, scope, citations, structure.
- Rules 8–11 — verification before publishing: queries, schema checks, URL routing, apples-to-apples comparisons.
- Rules 12–13 — durability over time: how each post category extends its useful life past the publish date.
The companion piece to this page is I Taught My AI to Write. I Forgot to Teach It to Check. That post describes how the rules in the second category got added — they were missing for a while, and the consequences were visible.
Two post categories#
Posts split into two categories with different audiences, vocabularies, and bars. Know which one you're editing before you start.
General posts (non-specialist audience)#
Audience: smart non-specialist readers. Curious about software/AI but won't recognize insider vocabulary.
Purpose: make an idea legible. Tell a story, make the point, let a broad reader walk away with the insight.
Voice:
- Plain language. Avoid jargon — in these posts jargon is not load-bearing, so swap it for the everyday equivalent ("vector index" → "local searchable copy"; "chunks injected into prompts" → "snippets pasted into the AI's prompt").
- Keep internal project vocabulary only when the post defines it on first use and uses it consistently.
- Short sentences. Concrete examples. First-person when it helps.
Technical research posts (practitioner audience)#
Audience: researchers, practitioners, other engineers doing similar work. Comfortable with SQL, pseudocode, pipeline diagrams, statistical terms.
Purpose: fact-finding and knowledge sharing — receipts, methodology, per-run data, reproducibility notes.
Voice:
- Technical vocabulary is load-bearing here; preserve it. Swapping precise terms for plainer wording costs precision.
- Cite run IDs, SQL queries, and exact cohorts. Link to the underlying dataset when a claim rests on specific numbers.
- Still avoid jargon the post doesn't need — precision, not vocabulary for its own sake.
Rules 1–7: Correctness at publish time#
Governing principle: keep the content high quality. That means no generalizations the post can't defend, no unverified claims passed off as fact, no rhetorical sweep substituting for evidence. Every specific number, date, example, quote, or factual assertion needs a source. When you can't find backing, narrow or remove the claim rather than leave it exposed. The numbered rules below are the mechanics that implement this principle.
1. No escape hatches#
Don't frame an architecture, pattern, or design choice as a way to dodge legal, ethical, or ToS problems it doesn't actually solve. If a runtime fetch reduces copyright exposure but not ToS exposure, say exactly that — don't call the architecture "legally clean."
2. No overstated claims#
Avoid "most companies", "always", "eliminates", "completely clean", "collapses" unless the evidence in the post defends them. Narrower truthful claims beat broad rhetorical ones.
3. Citations are required, not optional#
Every non-obvious factual claim needs a source the reader can verify. Acceptable sources include:
- Project research / data appendix for claims backed by your own measurements.
- Primary sources: court filings, laws, academic papers, vendor documentation, platform ToS pages.
- Reputable secondary sources (major news outlets, established publications) when citing ongoing events.
Use inline markdown links, not bare URLs or "trust me" phrasing. If you can't find a citation for a claim, narrow or remove the claim. This applies equally to general and technical posts — general posts don't get a pass on rigor just because they're written in plain language.
4. Split axes when they're being conflated#
Copyright/redistribution, attribution, ToS compliance, and latency/cost are separate concerns. A fix that addresses one doesn't automatically address the others — say which one it addresses.
5. Cross-linking#
General posts should link to the research posts when making a measured claim; research posts can link sideways to each other and to the data appendix. Internal cross-references make a body of work feel coherent and let a reader audit a claim by following its trail.
6. End-to-end flow for a brand-new reader#
After any non-trivial edit, read the whole post top to bottom as someone who has never seen the content before. Each section should set up the next; each transition should be earned; the close should land. Watch for:
- Orphaned headings — section title no longer matches its body after edits.
- Ghost references — a paragraph mentioning something that was edited out elsewhere.
- Contradicting sentences across nearby paragraphs.
- Stale claims in the title/excerpt/deck that no longer match the body.
This is a structural/consistency check, not a reading-level check — the goal is coherence, not simplicity.
7. Every claim must fall into one of three categories#
Before a claim lands in a post, it should be:
- Universally true — mathematical facts, statistical reasoning that doesn't depend on the domain, well-established technical definitions. Can stand unscoped.
- Citable — backed by a verifiable source (research post, data appendix, primary filing, reputable secondary coverage). Must be linked inline.
- Specific to this research — explicitly scoped with markers like "in this pipeline," "in my runs," "the fabrications I observed," "in this research." Past tense and first-person help ("I measured," "I found," "moved," "was").
A claim that floats between the three — stated universally but only defensible from your own work — is the exact failure mode this rule catches. Default to scoping rather than generalizing. If you think a finding should generalize, say so explicitly as a prediction from the mechanism rather than as a measurement across pipelines ("I'd expect the same mechanism elsewhere, but that's a prediction, not a measurement").
Patterns that almost always need scope or citation:
- Absolute modifiers: "every," "always," "all," "universal," "only," "none"
- Sociological claims: "most engineers," "every team," "companies usually"
- Present-tense physical-law voice: "Hallucination splits into two buckets" vs. "Hallucination in this pipeline split into two buckets"
- Hyperbole: "in every LLM application ever," "the real answer," "the only lever"
Patterns that are fine unscoped: well-known technical terms with established definitions, statistical reasoning (e.g. "thresholds in empty bands are stable" is a statistical property, not a domain claim), and explicit heuristic advice clearly framed as such.
7b. Pre-publish structural scan (the enforcement mechanism for Rule 7)#
Rule 7 alone is a discipline that requires deliberate attention to every claim, every time. In practice that's hard to enforce reliably — gut-feel doesn't fire on hyperbole because hyperbole feels right (that's the whole point of hyperbole as a rhetorical device). Rule 7b makes the discipline mechanical: before committing changes to a post, run a structural keyword scan over the changed text. Each hit triggers the Rule 7 audit — is this universally true, citable inline, or explicitly scoped? If none of those, fix it. The scan turns Rule 7 from "active discipline" into "habit triggered by visible surface patterns" — something a tired writer at 11pm can still execute.
Trigger patterns to scan for:
- Absolute modifiers: `every`, `always`, `all`, `none`, `only`, `never`, `universal`, `entire`, `complete`, `fully`
- Sociological / industry claims: `most engineers`, `most companies`, `every team`, `everyone`, `nobody`, `usually`, `commonly`
- Hyperbole / magnitude: `ten times`, `100x`, `stayed flat`, `eliminates`, `the real`, `the only`, `dramatically`, `completely`, `essentially zero`
- Universal-voice findings on research-specific claims (harder to scan mechanically; requires recognizing when a finding from one pipeline is being stated as a property of LLMs in general): `hallucination splits`, `grounding works`, `the model fabricates`, etc. — present tense, no scope marker.
Linter (implemented as npm run lint:claims). The trigger-pattern part of Rule 7b is mechanically lintable. A short Node script scans the published Markdown for the patterns above and reports each hit with file, line, category, and surrounding context. The linter skips frontmatter and code fences. It does not judge — it surfaces the places a Rule 7 audit needs to happen and lets the writer triage. Many hits are fine in context; some aren't. The writer decides which.
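To make the shape concrete, here is a minimal sketch of what a scanner like this could look like. It is illustrative only: the pattern list, output fields, and file handling below are assumptions in the spirit of the description above, not the repo's actual `lint:claims` implementation.

```js
// lint-claims-sketch.mjs: illustrative only, not the repo's actual linter.
import { readFileSync } from "node:fs";

// Trigger patterns in the spirit of Rule 7b (abbreviated).
const PATTERNS = [
  { category: "absolute", regex: /\b(every|always|all|none|never|universal)\b/gi },
  { category: "sociological", regex: /\b(most engineers|most companies|every team|everyone|nobody|usually)\b/gi },
  { category: "hyperbole", regex: /\b(ten times|100x|eliminates|dramatically|completely)\b/gi },
];

export function scanFile(path) {
  const lines = readFileSync(path, "utf8").split("\n");
  const hits = [];
  let inFrontmatter = false;
  let inCodeFence = false;

  lines.forEach((line, i) => {
    // Skip frontmatter and fenced code blocks, as the real linter does.
    if (i === 0 && line.trim() === "---") { inFrontmatter = true; return; }
    if (inFrontmatter) { if (line.trim() === "---") inFrontmatter = false; return; }
    if (/^\s*`{3}/.test(line)) { inCodeFence = !inCodeFence; return; }
    if (inCodeFence) return;

    for (const { category, regex } of PATTERNS) {
      for (const m of line.matchAll(regex)) {
        // Surface, don't judge: file, line, category, match, and context for triage.
        hits.push({ file: path, line: i + 1, category, match: m[0].toLowerCase(), context: line.trim() });
      }
    }
  });

  return hits;
}
```

Wired up as an npm script, a scan like this runs over the published Markdown and prints each hit for the writer to triage.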
Baseline mechanism. A first run of the scanner produces hundreds of hits — most defensible (mechanism descriptions, prescriptive rule statements, direct quotes), a handful real overclaims. The baseline file (scripts/lint-claims-baseline.json) records the hits that have been audited and accepted as defensible; subsequent runs only fail on new hits not in the baseline. That keeps the linter useful (it doesn't shout about the same defensible "every section fills 65 slots" on every commit) without going silent (a new "every engineer reaches for X" introduced tomorrow still trips it). When you intentionally add a defensible claim that triggers the scanner, regenerate the baseline with npm run lint:claims:update and commit the updated file alongside the prose change. The baseline file thus doubles as an audit record — every entry is a claim the writer reviewed and accepted.
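The baseline mechanism has a similarly small shape. Again a sketch under stated assumptions: the real scripts/lint-claims-baseline.json format isn't reproduced here, and the key encoding below is one plausible way to key on (file, line, lowercased match), not necessarily the repo's.

```js
// baseline-check-sketch.mjs: illustrative; the real baseline format may differ.
import { readFileSync, writeFileSync } from "node:fs";

const BASELINE_PATH = "scripts/lint-claims-baseline.json";

// One baseline entry per (file, line, lowercased match), so a trigger word
// appearing twice on the same line collapses into a single key.
const keyOf = (hit) => `${hit.file}:${hit.line}:${hit.match.toLowerCase()}`;

// New hits are anything the audited baseline doesn't already contain.
export function findNewHits(hits) {
  const baseline = new Set(JSON.parse(readFileSync(BASELINE_PATH, "utf8")));
  return hits.filter((hit) => !baseline.has(keyOf(hit)));
}

// What an `npm run lint:claims:update` equivalent would do: accept every
// current hit as audited-and-defensible and record it for future runs.
export function updateBaseline(hits) {
  const keys = [...new Set(hits.map(keyOf))].sort();
  writeFileSync(BASELINE_PATH, JSON.stringify(keys, null, 2) + "\n");
}

// In strict/CI mode, fail only on hits that aren't in the baseline:
//   if (findNewHits(allHits).length > 0) process.exit(1);
```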
Two-pronged enforcement. A written rule alone doesn't enforce itself — even careful writers don't apply Rule 7 consistently because hyperbole feels right in the moment. A mechanical scan alone doesn't judge nuance — it produces false positives the writer must triage. Together they cover what each misses: the rule shapes the writer's working habit, the scan catches the lapses, and the baseline file makes the human judgment about each lapse part of the repo's history. The same dual-enforcement pattern applies to most of the rules on this page — write the rule, then if possible build a check that fires on its violations, then if possible put that check in CI so it fires automatically on every commit. (This site's GitHub Action runs lint:claims:strict on push/PR.)
7c. Triage process — how to audit a hit#
The linter surfaces hundreds of hits on a substantial corpus. Without a process for triaging them, the writer hits a wall and either ignores the linter or accepts every hit into the baseline indiscriminately. The process below is what makes the triage tractable — it's essentially a categorization tree. Name what kind of claim the trigger word is part of, then apply the test for that category. Most hits fall into a category with a default verdict, so the per-hit decision is fast. Only a few require deeper judgment.
| If the hit is part of… | …then it's defensible if… | Default verdict |
|---|---|---|
| Mechanism description ("every section fills 65 slots", "every push runs CI") | The system being described actually works that way | Defensible |
| Logical claim within a closed scope ("the only layer that doesn't depend on memory") | The logic holds given the definitions established in the surrounding paragraph | Defensible |
| Direct quote ("NEVER invent proper nouns" inside a blockquote, license text, error message) | It's accurately quoting an external source | Defensible |
| Prescriptive rule statement ("every load-bearing number must have a query alongside it") | It's stating an intention or recommendation, not measuring something | Defensible |
| Heading or label ("Why universal coverage wasn't enough") | The body of the section delivers on the framing the heading sets up | Defensible |
| Already-scoped first-person observation ("in this research, the only lever I tested") | The scope marker ("in this research," "I found," "in my runs") is in the same sentence or immediate paragraph | Defensible |
| Empirical claim about the world ("X happens most of the time") | It has an inline citation, OR it's scoped to "in my research," OR it's universally true (rare) | Needs check |
| Sociological claim about people/groups ("Every engineer reaches for X", "most companies do Y") | It has a citation OR is rewritten in first person ("the instinct I went in with") | Needs fix by default |
| Hyperbole stated as fact ("ten times more", "stayed flat", "dramatically cheaper") | It can be replaced with a directional claim that doesn't pin a specific magnitude | Needs fix by default |
| None of the above / unclear | — | Needs fix by default |
Self-checks against triage bias:
- If you can't articulate which category a hit falls into, default to needs-fix. Ambiguity is itself a signal.
- Read the paragraph, not just the line. The trigger word might be flagging an unsourced absolute or sociological claim elsewhere in the sentence even when the word itself looks innocuous in isolation.
- If the same trigger pattern produces many hits across many contexts and most are defensible, the pattern itself might be too broad — consider tightening it in the linter rather than accepting everything into the baseline. (The right test for a linter rule is "does it surface the cases I want to audit," not "does it match this string.")
- Be equally critical of pre-existing prose and prose you just wrote. The temptation is to soft-pedal the in-flight stuff because rewriting feels expensive; resist it.
This process is what lets an agent or a tired writer audit two hundred hits in a single pass without spiraling. Most hits resolve in seconds because they fall cleanly into the first six rows of the table; the writer's full attention goes to the bottom four rows, where the work actually is.
Rules 8–11: Verification before publishing#
Distinct from the writing rules above. A post can satisfy all of rules 1–7 and still be factually wrong, because no one re-derived the load-bearing numbers. These rules require the writer to leave the writer's chair and act as auditor.
8. Verify data claims against the source before publishing#
The citation rule (#3) requires that a reader can verify a claim. This rule says you must verify it before the claim is committed. For any post that cites specific numbers — counts, percentages, IDs, distributions, averages, retry budgets, latency p90s, cost ranges — run the query that produces them against the source of record. Do not trust an existing post's number as ground truth; the post may already be wrong. When the source-of-truth and the post disagree, the source wins.
Specific verification habits that catch real errors:
- Re-derive cluster/distribution counts directly from the underlying rows, not from copied summary text.
- When a column is shown as `—` or `null` in a table, query the row to confirm whether the column is actually null or whether the table is presenting a different column than the narrative claims.
- When a post cites an A/B comparison ("+3 improved, -2 regressed"), run the same-game overlap query and verify both the count and the framing — "improved" might mean verdict-flip or raw confidence movement, and the right interpretation matters for the conclusion.
- When two posts reference the same artifact (a snapshot, a run ID, a threshold value), read both and ensure they agree literally.
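To show the re-derivation habit concretely, here is a hedged sketch of the kind of check rule 8 asks for. The table name, column names, run ID, and claimed count are stand-ins, not this site's actual schema or data; the point is the shape: query the source of record and compare it against the number the post states.

```js
// verify-claim-sketch.mjs: hypothetical table and column names; adapt to your schema.
import pg from "pg";

const client = new pg.Client({ connectionString: process.env.DATABASE_URL });
await client.connect();

// Suppose the post claims "seven games sit in the 0.80-0.85 confidence cluster".
// Re-derive that count from the rows, not from a previously published summary.
const claimed = 7;
const { rows } = await client.query(
  `SELECT count(*)::int AS n
     FROM harvests             -- stand-in table name
    WHERE run_id = $1           -- the run ID the post cites
      AND judge_confidence BETWEEN 0.80 AND 0.85`,
  [42]
);

if (rows[0].n !== claimed) {
  console.error(`Post claims ${claimed}; the source of record says ${rows[0].n}. The source wins.`);
  process.exitCode = 1;
}

await client.end();
```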
9. SQL and code examples must run#
Any SQL example or code snippet a reader can copy must execute against the documented schema. Common pitfalls: derived-but-undefined columns (citing a column name that doesn't exist), wrong table names, JSONB key paths that have moved. Test the example against the live schema before committing it; if you can't test it, mark it as illustrative pseudocode, not runnable SQL.
10. Internal URL format check#
Cross-links between posts must use the actual route structure of the site, not a guessed pattern. After any meaningful edit that adds or changes internal links, build the site and check the route table; broken internal links don't 404 at build time but they do at runtime, and they erode reader trust.
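If you want that check mechanical rather than manual, a sketch might look like the following. The glob paths, route derivation, and link pattern are assumptions about a typical MDX content layout, not this site's actual build step; a real check would read the route table the framework produces at build time.

```js
// check-internal-links-sketch.mjs: illustrative; adapt the globs and route list to your build.
import { readFileSync } from "node:fs";
import { globSync } from "glob";

// Known-good routes, derived here from the content directory as a stand-in.
const routes = new Set(
  globSync("src/content/posts/*.mdx").map(
    (p) => "/posts/" + p.split("/").pop().replace(/\.mdx$/, "")
  )
);
routes.add("/standards");

let broken = 0;
for (const file of globSync("src/content/posts/*.mdx")) {
  const text = readFileSync(file, "utf8");
  // Internal markdown links look like [label](/some/path).
  for (const [, href] of text.matchAll(/\]\((\/[^)#\s]+)/g)) {
    if (!routes.has(href)) {
      console.warn(`${file}: internal link ${href} does not match any known route`);
      broken += 1;
    }
  }
}
process.exitCode = broken > 0 ? 1 : 0;
```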
11. Apples-to-apples comparisons must compare like with like#
Before citing an "X → Y" improvement across runs, arms, or cohorts, confirm that the metric definition, threshold, gate, and population are identical between X and Y. If they're not, either re-compute under a single set of conditions or clearly call out the conflated effect. The general failure mode is comparing a stricter-gate "before" to a looser-gate "after" and attributing the entire delta to the intervention.
Rules 12–13: Durability over time#
These two rules cover the post's value past its publish date. General posts and research posts have different durability mechanisms — what extends value for one would dilute the other.
12. Anchor current cases inside evergreen frames (general posts)#
When a general post argues something through a current case (an LLM workflow, a recent product launch, an in-progress lawsuit, a specific framework version), explicitly place the lesson inside a longer tradition or pattern that pre-dates the current case. The current case earns relevance and attention; the longer frame earns shelf life. Without the longer frame the post reads as a workflow tip that ages out fast; with it the same post still reads usefully in five years even when the specific tools have evolved.
How to apply, after drafting the case-study/argument:
- Ask: "If the specific tool or event in this post didn't exist anymore, would the underlying lesson still hold?"
- If yes, find the longer tradition this is an instance of (newsroom copy desks vs. fact-checking; code review vs. CI tests; peer review vs. methodology checks; cache-locality vs. fetch-locality in distributed systems; etc.) and name it explicitly — usually in a short section between the case study and the takeaway.
- Don't replace the current case with the historical frame. Keep both. Surface should be timely; substrate should be evergreen.
Where this rule does not apply: technical research posts. Their value is the specific receipts, not a portable frame; adding a historical anchor would dilute them. Research posts have their own durability mechanism — see rule 13.
13. Receipts must survive (research posts)#
Technical research posts earn shelf life differently. Where a general post extends value by anchoring its current case in an evergreen frame (rule 12), a research post extends value by making its receipts re-runnable years from now. Five years out, a specialist reader doesn't need a portable narrative — they need to re-run the query and see the same number you cited. If the receipts can't be re-run, the post ages out; if they can, it doesn't.
How to apply, when drafting or editing a research post:
- Cite immutable provenance anchors, not transient labels. "Run #X" survives; "the most recent run" doesn't. Use CI run IDs, commit SHAs, snapshot tags — anything that won't be overwritten.
- Freeze snapshots with a timestamp. "Snapshot frozen on YYYY-MM-DD" tells a future reader exactly which cohort the numbers come from. The data behind the post will keep accumulating; the snapshot doesn't have to.
- Publish the query, not just the result. Every load-bearing number should have a runnable query alongside it.
- Document the schema the queries run against. Tables, columns, JSONB key paths — name them so a future reader can tell whether the schema has drifted.
- Version the rubric and methodology. "Judge rubric v1" lets a future reader know what changed if a later v2 exists. Same for prompt versions, retry policies, gate thresholds.
- When data shifts under a post, update it explicitly. Add a dated note about what changed and why. Don't silently overwrite the figure; the trail of changes is itself part of the receipts.
Where this rule does not apply: general posts. Their durability comes from framing (rule 12), not from inline receipts. A general post burdened with run IDs and SQL queries reads like a research post in disguise — both audiences are poorly served.
Rule 14: Clarity over hedging#
14. Don't be vague when the post can be specific#
A claim that hedges ("no better than no source," "much higher," "shifted hard," "directionally worse," "a lot," "roughly the same") forces the reader to either trust it or scroll back to find the backing. When the data is in the post already, anchor the hedge to it: cite the cohort, the number, the comparison. When the data genuinely doesn't support a stronger claim, say that explicitly ("inside the noise floor on n=8") rather than leaving the reader to guess what the hedge is doing.
This is a separate discipline from rule 7 (overclaiming) and rule 8 (verification). A vague claim can pass both: it isn't overstated and it doesn't cite a number that's wrong. It's just mush. Mush reads like AI-generated prose because it leaves the writer's confidence floating — the reader can't tell whether the writer checked or guessed. Concrete anchoring is what makes prose feel like the writer actually did the work.
How to apply, when proofreading or editing:
- For each comparison or magnitude phrase ("better," "worse," "more," "less," "much," "no better than," "roughly," "directionally," "kind of," "sort of"), ask: does the post earlier give the specific number behind this? If yes, cite the number where the hedge currently sits.
- If no number exists in the post, either find it (run the query, link the source) or narrow the claim to what's actually defensible.
- Hedges aren't banned — sometimes "directionally worse, statistically indistinguishable" is the truthful claim. The rule is that the hedge has to come with the receipts, not stand alone as the entire claim.
- Watch for stale meta-references that have gone vague over time: "more on that in a follow-up" when the follow-up exists, "we'll come back to this" when you don't, "the architectural resolution" when the post above already named it. Replace with the specific link or remove.
Worked examples:
- "no better than no source on the games that needed help" → "the Steam-only cohort hit 25% on n=8 against a 33% ungrounded baseline. Directionally worse, statistically indistinguishable."
- "is the architectural resolution" → "That's the pattern I ended up with."
- "More on that in a follow-up." → "The longer argument lives in the [runtime-vs-harvest post]."
Why the linter can't catch this: the trigger words for vagueness ("better," "more," "less," "much") are too common to scan mechanically without flooding the baseline. This is a Rule 7-class hand-catch — the human reviewer and the LLM judge are the right enforcement layer.
Two sources of truth#
For any factual claim, there are two distinct sources of truth and they back different kinds of claims. Knowing which is which matters because the verification path differs.
1. A live database (the quantitative source)#
The source of record for numbers — counts, percentages, run IDs, distributions, averages, latencies, cost figures. Every number cited in a post should be re-derivable from a query against this database. The data appendix on this site documents the schema and which posts rely on which tables. When publishing a research post, the SQL that produces each load-bearing number should be inline in the post.
2. A private narrative log (the qualitative source)#
The source of record for specific examples — fabricated entity names, specific prompt experiments, anecdotes from individual experimental runs. These don't have a database row each; they live in an internal research log that records what happened during specific sessions. They're real artifacts but private (the log lives locally and isn't published).
When a post cites something from the narrative log, the rule is: the post can name the example, but the citation should point to the nearest published fact that supports the underlying phenomenon (a related post, a research entry, a methodology section), not to the private log itself. A reader on the published site can't resolve a private path.
These rules don't make AI infallible#
The discipline above doesn't eliminate AI mistakes — it makes them findable. AI assistance was used heavily on this site, both for the original drafting and for the verification pass. Even with the writing rules in force, that verification pass caught around twenty factual errors before they shipped.
A representative few, all preserved openly in the published posts and the data appendix:
- A confidence-cluster bullet listed seven games as five (corrected in the data appendix's confidence-cliff section). The drift came from a per-harvest table where three rows displayed `—` for confidence. Those rows actually had `judge_confidence=0.82` in the source database; the table was rendering a different column (`final_confidence`, which is null when no retry triggers). The narrative was reading off the displayed table instead of re-deriving from rows.
- A SQL example in the hallucination-taxonomy post referenced a column called `judge_issue_count` that doesn't exist in the schema. The actual derivation is `jsonb_array_length(judge_notes)` — caught only by trying to run the example against the live database. The prose around it read perfectly fine.
- The same post's snapshot definition cited the wrong run IDs entirely — the cited runs produced 17 harvests, not the 31 the post claimed. The data appendix had the correct IDs all along; the post was citing earlier runs from a different cohort.
- A general post claimed an "8% → 50%" improvement from adding Wikidata grounding. The two figures came from runs with different promotion thresholds; comparing them attributed the entire delta to grounding when about half came from a separate threshold change. The honest single-gate comparison is 33% → 50%.
Each of these passed every writing rule above. None of them survived the verification pass.
The lesson isn't that AI is unreliable. The same model that produced these errors caught and fixed them once it was pointed at the database with the verification rules in front of it. "Write well" and "check the numbers" are different jobs that need different instructions; rules 1–7 cover the first, rules 8–11 cover the second. Together they make the failure modes findable. Without the second category, the failures are invisible — beautifully prose-edited, citation-decorated, and quietly wrong.
If you're using an LLM to help with factual writing, treat the writing rules as necessary but not sufficient. Double-check the work. Run the verification queries. Test the SQL. Cross-check between documents. Don't ship without auditing the load-bearing numbers, especially when the prose reads beautifully — that's the precise moment a quietly-wrong fact is most likely to slip past you.
How these standards got built#
A note on the dynamic, since it turned out to be the most interesting thing this work surfaced.
I did not sit down and write these rules from scratch. They emerged from a back-and-forth between me and an LLM I was using to help draft and revise the writing on this site. The shape of the work was: I'd notice something missing or wrong — a claim that overstated, a number that didn't match the database, a section that contradicted another — and surface it as a question or correction. The agent would synthesize the underlying pattern into a rule, an example, or — once we got far enough — a mechanical check that would catch the same failure next time.
What surprised me was how cleanly the agent could operate once the rules existed. Triaging two hundred linter hits would normally be a wall — too many to look at one by one, too many edge cases for a single judgment to cover. But once Rule 7c (the triage table above) was in place, the agent could sweep through the hits autonomously, categorize each one, decide what to fix vs. what to accept into the baseline, and update the artifacts in one pass. I didn't review each verdict — I read the resulting baseline diff and the few prose changes, and the dynamic worked.
That is itself the most interesting thing this work surfaced — more than any individual rule. The pattern is something like:
- A human notices a problem and surfaces it as a question or correction.
- The agent synthesizes the underlying generalization (what kind of failure is this? what rule would have caught it?).
- The human validates or pushes back, often surfacing a gap the agent missed.
- The agent encodes the rule, the example, and — where possible — the mechanical check.
- The codified discipline lets the agent operate autonomously on the next instance of the same class of problem.
Neither side could have produced this alone. I couldn't have written Rule 7c without first experiencing the triage. The agent couldn't have surfaced the right gaps without being challenged on specific instances. The structure that emerged is genuinely jointly authored — the agent did most of the synthesis and writing, the human did most of the steering and gap-finding.
The practical takeaway, for anyone trying to build editorial standards with an LLM: don't try to specify everything upfront. Steer through observations and corrections. Let the agent propose the structure. Push back on weak or under-scoped parts. Iterate. The result will codify the actual discipline you've been running implicitly — often more cleanly than you could have stated it without the iteration — and it will leave the agent with enough structure to keep applying the discipline without needing per-decision supervision.
The dynamic generalizes beyond editorial work. Anywhere you'd otherwise be tempted to write a complete spec upfront, consider whether it'd be more honest to start with concrete corrections, let the structure emerge from what actually went wrong, and then encode the mechanical check that catches the same failure next time. The structure produced this way is shaped to the actual failure modes you've hit, not the failure modes you imagined you might.
Editorial-process metrics (snapshot)#
A page about a process for catching unsourced claims should itself include the empirical state of that process. The numbers below are receipts for the workflow described above — they're the basis on which the tractability claim ("an agent can triage hundreds of hits in one pass") rests, and they make explicit what the metrics do and don't support.
Architectural note up front: the verification setup is the same generator/judge architecture that this site's games-wiki research uses — one model produces content (or, here, triage decisions), a separate model grades the output. We didn't invent a new pattern for editorial verification; we identified that the existing LLM-as-judge pattern fits the shape of what we needed to measure, and instantiated it for the editorial domain. The sampler script described below is the "judge" half; the original triage is the "generator" half.
Baseline state at last update:
| Metric | Value |
|---|---|
| `.mdx` files scanned (`src/content/posts/**` + `src/app/standards/`) | 12 |
| Total raw hits in current scan | 249 |
| Unique baseline entries | 229 |
| Files with at least one hit | 12 of 12 |
The unique-entry count is lower than total raw hits because the baseline keys on (file, line, lowercased match) — the same trigger word appearing twice on one line collapses to one baseline entry. The baseline file (scripts/lint-claims-baseline.json) is checked into the repo and is the auditable record of every accepted hit.
Triage decision rate across the editorial session that produced these standards:
| Pass | Hits added | Marked needs-fix | Accepted to baseline | Fix rate |
|---|---|---|---|---|
| Initial proofread of all posts | 212 | 2 | 210 | ~1% |
| `/standards` page added | +60 | 0 | +60 | 0% |
| "Making the rule self-enforcing" section in `writing-isnt-verification` | +13 | 0 | +13 | 0% |
| Rule 7c (triage process) added | +27 | 0 | +27 | 0% |
| "How these standards got built" section added | +2 | 0 | +2 | 0% |
| Editorial-metrics section (this one) added | ~+5 | 1 | ~+4 | ~20% |
| Total across session (snapshot) | ~319 | 3 | ~316 | ~1% |
Across the full editorial session, the agent doing the triage marked roughly 1% of surfaced hits as needing a prose fix and 99% as defensible in context. The "needs-fix" count includes one fix the linter caught during the drafting of this metrics section itself — a use of "usually" stating a workflow generalization without scope. The CI loop tightened the prose for the metrics section as the metrics section was being written.
Audit results — full-baseline sweep + earlier samples#
A full audit was run against every entry in the baseline immediately before this page went live, after three sampled runs had already shaken out the easier overclaims. The numbers below are the definitive snapshot at merge time:
| Pass | Sample | Defensible | Should-fix | Borderline | Disagreement |
|---|---|---|---|---|---|
| Sample 1 | 20 | 19 | 1 | 0 | 5.0% |
| Sample 2 | 50 | 46 | 0 | 4 | 4.0% |
| Sample 3 | 50 | 47 | 0 | 3 | 3.0% |
| Full sweep (pre-fix) | 250 (100% of baseline) | 237 | 2 | 11 | 3.0% |
| Full sweep (post-fix) | 252 (100% of baseline) | 247 | 0 | 5 | 1.0% |
The pre-fix sweep cost about $2.57 in API tokens (~109k input + ~12k output, batched into 5 calls of up to 50 entries each). It surfaced 2 hard should-fix entries the original triage missed (both flavors of the same sociological "every standard tutorial / every public example" overclaim on a single line) and 11 borderline flags spread across nine files — judgment calls in the categories Rule 7c marks as "needs check" (heuristic prescriptive prose, named-concept descriptors, hyperbole shading toward universal voice).
After this human pass on the borderlines, the 2 hard fixes plus 5 borderlines worth tightening were applied (sociological overclaims and "the only X" hyperboles softened to scoped or qualified phrasing); the remaining 6 borderlines were left as defensible voice — prescriptive advice, named concepts, an archived post, and rhetorical framing whose body delivers on the framing.
The post-fix sweep then ran against the corrected prose. It returned 0 should-fix flags across 252 entries — the 2 hard misses from the prior run were both gone — and the disagreement rate dropped from 3.0% to 1.0%. The 5 remaining borderlines clustered on a single post (ai-that-respects-creators.mdx) plus one archived note, all in categories Rule 7c marks as "needs check": thesis-claim hyperbole, a section-heading absolute, and prescriptive normative voice. The round-trip — judge flags, prose fixes, re-audit confirms — landed end-to-end.
One observation worth noting: across four runs (three samples + the pre-fix sweep), the disagreement rate converged to 3–5% independently of sample size — sample 3 hit 3% on N=50, the pre-fix full sweep hit 3% on N=250. That convergence suggests the rate was genuinely in that range rather than an artifact of small-sample variance. The post-fix 1.0% is a different measurement — judge-vs-corrected-prose, not judge-vs-original-triage — and isn't directly comparable to the earlier four runs.
Detector litmus test (external check on voice)#
Out of curiosity, every post in the corpus was run through two Hugging Face AI-detection models, with three known-human reference texts as controls (a Wikipedia article, a Jane Austen novel, and the PEP 8 style guide). Both detectors are local-runnable; the script lives at scripts/ai-detect.py in the repo.
| Source | GPT-2 detector (2019) | ChatGPT detector (2023) |
|---|---|---|
| All 9 posts in the corpus (avg) | 97.5% AI | 0.1% AI |
| Wikipedia: Apollo 11 | 99.3% AI | 99.8% AI |
| Pride and Prejudice (1813) | 99.5% AI | 3.9% AI |
| PEP 8 (2001 technical) | 99.8% AI | 3.6% AI |
The 2019 GPT-2 detector flags everything as AI — including Pride and Prejudice — so it's noise floor and worth ignoring. The 2023 ChatGPT-trained detector tells a more useful story: Wikipedia's encyclopedic, structured-fact prose scores 99.8% AI; pre-LLM personal and technical prose scores 3–4%; the corpus scores ~0%.
The detector isn't measuring origin — these posts are LLM-assisted, openly disclosed. It's measuring voice. The corpus reads as personal-writer voice rather than encyclopedic-AI-summary voice. That's the kind of style marker the discipline (first-person scoping, specific receipts, narrative arc, no rhetoric inflation) is designed to produce.
This isn't load-bearing evidence for the rules themselves; it's an external check that the discipline's output differs from what AI-detection tools associate with AI-generated prose, even when the prose is in fact LLM-assisted. AI-detection tools are unreliable enough that no single score is definitive — OpenAI deprecated its own classifier in 2023 after measuring 26% true-positive at 9% false-positive; Sadasivan et al. 2023 argue detection has fundamental theoretical limits; and Liang et al. 2023 documented systematic bias against non-native English writers as one of several known failure modes. This is a snapshot of two specific detectors as of 2026-04-26 against a small sample — the 9 posts in this corpus, all written by one author applying the same discipline for the first time. A future detector could score differently; a wider sample (more posts, more authors, different discipline configurations) could either reinforce or weaken the contrast. The discipline hasn't been tested at scale. But the contrast in this snapshot (Wikipedia at 99.8%, this corpus at ~0%, pre-LLM personal prose at 3–4%) is consistent and tells a coherent story about voice, not origin.
What these metrics support, and what they don't#
The metrics support a tractability claim and a bounded-accuracy claim:
- Tractability: the triage is fast enough and the fix rate is low enough that an agent can sweep through hundreds of hits in a single pass without bottlenecking on per-hit human review.
- Bounded accuracy: the original triage's accepted-as-defensible verdicts hold up under independent judge review about 95–96% of the time. A meta-observation worth recording: a third pass (the human re-auditing the judge's flags) disagreed with the judge on roughly the same rate (3 of 7 borderlines I'd call defensible). That suggests ~4% is approximately the noise floor for Rule 7c calls between equally-calibrated agents, not just a property of the original triage being sloppy.
The metrics do not support a zero-error claim. Several known biases remain in the system and the agent is aware of them but cannot fully neutralize them:
- Writer-bias. The agent is structurally more lenient on prose it just generated than on prose it inherited.
- Workload-bias. Accepting a hit is one command; fixing the prose requires a rewrite. The friction asymmetry pushes toward leniency.
- Pattern blindness. Across 200+ hits in a single pass, discrimination probably degrades toward the end of the pass.
- Missing cross-document context. A hit that depends on a claim made several paragraphs earlier (or in a different post) may slip past an agent reading line-by-line.
Closing the loop — independent second-pass audit (implemented)#
The generator/judge audit described immediately below is implemented; the items under "Open follow-ups" remain open.
Implemented: generator/judge audit on a sampled subset of accepted hits. The pattern is the same one the games-wiki research uses (Sonnet generates content, Opus judges it). Here the original triage agent is the "generator" — it classified each linter hit as defensible per Rule 7c — and a second LLM call acts as the "judge" — it re-audits a random sample of those acceptances and returns its own verdicts. The disagreement rate across runs is the empirical estimate of the original triage's false-positive rate.
Two interfaces:
npm run lint:claims:audit-sample # generates a markdown review template,
# for manual review or paste-into-fresh-Claude
npm run lint:claims:audit-auto # calls the Anthropic API directly,
# writes a JSON audit-log entry,
# prints a markdown summary
The auto variant runs weekly in CI via .github/workflows/audit-baseline.yml (cron: Mondays 14:00 UTC) and on manual workflow_dispatch. Each run appends a new file to scripts/audit-log/ capturing timestamp, model, sample size, per-entry verdicts, and aggregate disagreement rate. The directory becomes the running history of how reliably the original triage matched independent review.
The auto-audit uses Opus 4.7 as the judge — a different model than the generator (which is whatever model is making the original commits) and explicitly fed only Rule 7 + 7b + 7c as its rubric, so it audits against the published standard rather than reproducing the generator's biases.
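For readers who want the shape rather than the repo, here is a compressed, illustrative sketch of the auto-audit. The rubric path, sample size, prompt wording, and model string are assumptions, and the API call is the standard Anthropic messages endpoint rather than a repo-specific wrapper.

```js
// audit-baseline-sketch.mjs: illustrative generator/judge audit, not the repo's script.
import { readFileSync, writeFileSync } from "node:fs";
import Anthropic from "@anthropic-ai/sdk";

const baseline = JSON.parse(readFileSync("scripts/lint-claims-baseline.json", "utf8"));
const rubric = readFileSync("docs/rule-7-rubric.md", "utf8"); // hypothetical path: Rules 7 + 7b + 7c only

// The judge re-audits a random sample of hits the original triage accepted as defensible.
const sample = [...baseline].sort(() => Math.random() - 0.5).slice(0, 50);

const client = new Anthropic(); // reads ANTHROPIC_API_KEY from the environment
const response = await client.messages.create({
  model: "claude-opus-4-1", // stand-in; use whichever judge model you trust
  max_tokens: 4000,
  messages: [{
    role: "user",
    content:
      rubric +
      "\n\nFor each entry below, answer defensible, should-fix, or borderline, " +
      "with one sentence of reasoning.\n\n" +
      sample.map((entry, i) => `${i + 1}. ${JSON.stringify(entry)}`).join("\n"),
  }],
});

// Append one audit-log entry per run so disagreement can be tracked over time.
const stamp = new Date().toISOString().replace(/[:.]/g, "-");
writeFileSync(
  `scripts/audit-log/${stamp}.json`,
  JSON.stringify({ stamp, sampleSize: sample.length, verdicts: response.content[0].text }, null, 2)
);
```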
Open follow-ups:
- Periodic re-audit of older baseline entries. Add an `--age-days N` filter to the sampler so older entries get sampled disproportionately — does an entry from a month ago still look defensible read cold?
- A "borderline" flag the triage can apply at write time. Lets low-confidence verdicts be escalated for explicit human review rather than batched with high-confidence ones.
- Trend reporting on the audit log. A small script that reads `scripts/audit-log/*.json` and emits a running disagreement-rate chart so drift over time is visible. A sketch of what this could look like follows below.
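Since that third follow-up is small, here is a sketch of what it could look like once audit-log entries exist. It assumes each entry records a timestamp and a `disagreementRate` field, which is itself an assumption about a format that hasn't been defined yet.

```js
// audit-trend-sketch.mjs: not implemented in the repo; assumes fields that don't exist yet.
import { readFileSync } from "node:fs";
import { globSync } from "glob";

const runs = globSync("scripts/audit-log/*.json")
  .sort() // filenames are timestamps, so lexical sort is chronological
  .map((p) => JSON.parse(readFileSync(p, "utf8")));

// Assumed shape: { stamp: string, disagreementRate: number } per audit run.
for (const run of runs) {
  const pct = (run.disagreementRate * 100).toFixed(1);
  const bar = "#".repeat(Math.round(run.disagreementRate * 100)); // one '#' per point
  console.log(`${run.stamp}  ${pct.padStart(5)}%  ${bar}`);
}
```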
On agent autonomy#
A reader at this point may notice something: the editorial process described above runs largely without per-decision human supervision. For any given pass through the linter, the human's role has been to read the resulting commit and the baseline diff, push back on anything that looks wrong, and otherwise let the loop run.
That has an implication worth naming: agent autonomy on this kind of work isn't a property of the agent alone — it's a property of the system the agent is operating in. A bare LLM asked to triage 200 linter hits would either over-fix (treating every hit as a violation) or over-accept (rubber-stamping everything). The categorization tree (Rule 7c above) is what lets the agent make decisions confidently. Without it, autonomy collapses into either timidity or recklessness. With it, the agent can sweep through hundreds of hits in a pass, fix the few that need fixing, baseline the many that don't, regenerate the artifacts, and present the result for human review at the level of "diff" instead of "decision."
If you're trying to figure out how to use an LLM autonomously on a problem class, the framing this site landed on is: don't aim for an autonomous LLM, aim for an autonomous system that an LLM can operate. The agent is one component; the rules + checks + enforcement around it are what make the component reliable enough to trust at scale. The agent helps build that surrounding system through the iterative dynamic described above — but once the system exists, the agent can run inside it without needing the human to reach in for each decision.
That, more than any single rule on this page, is the load-bearing claim of the work.
Closing#
These rules grew out of running into specific problems and writing the rule that would have caught the problem next time. They are foundational, not exhaustive — a starting point to build on or experiment with, not a complete editorial framework that covers every scenario you'll hit. Your own pipeline, your own audience, and your own failure modes will surface gaps these rules don't cover. When they do, the move is to add the rule that would have caught the gap next time, not to assume the existing rules are sufficient.
If you adapt them for your own work, the bones — three axes (correctness, verification, durability), two post categories with different durability mechanisms, named rules with worked examples, and the human/agent collaborative dynamic that produced them — are probably the most portable part. The specific habits will need to be re-derived against your own context.
If you've built a similar editorial discipline for working with LLMs and want to compare notes, the contact page has a few ways to reach me.