Runtime vs. Harvest: A Design Pattern for RAG Over Third-Party Content

Where you fetch and when you fetch are two of the most consequential architectural decisions in retrieval-augmented generation. Latency, cost, and infra are the easier axes to reason about. The attribution axis is the one this post focuses on, because it turns on a single binary architectural property (does the pipeline persist third-party content?) while the others are largely matters of parameter tuning.


The two regimes

For an LLM application that needs knowledge the model doesn't have, there are two canonical fetch patterns we can use:

Harvest-time RAG. Crawl or batch-ingest a corpus ahead of time. Chunk. Embed. Index. At query time: embed the query, nearest-neighbor against the index, inject top-k chunks into the prompt as context. Classical retrieval-augmented generation.

Runtime fetch. At query time, call a search/fetch primitive (web search, remote API, authorized first-party endpoint), read the response, prompt the model with the response as transient context, discard after generation. Nothing is persisted.
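
As a shape-level sketch of the two patterns: every name below (vector_index, web_search, llm, embed) is a hypothetical interface standing in for whatever you actually use, not a specific library's API.

# Harvest-time RAG: the fetch happened earlier (crawl + chunk + embed + index);
# query time only touches the persisted index.
def answer_harvest(query, embed, vector_index, llm, k=5):
    chunks = vector_index.search(embed(query), k=k)            # nearest-neighbor over stored chunks
    context = "\n\n".join(chunk["text"] for chunk in chunks)
    return llm(f"Here is relevant context:\n\n{context}\n\nQuestion: {query}")

# Runtime fetch: the fetch happens now, and the results live only for this one generation.
def answer_runtime(query, web_search, llm, n=5):
    results = web_search(query, max_results=n)                 # [{"title", "url", "excerpt"}, ...]
    context = "\n\n".join(f"[{i+1}] {r['url']}: {r['excerpt']}" for i, r in enumerate(results))
    response = llm(f"Here are current search results:\n\n{context}\n\nQuestion: {query}")
    return response                                            # nothing persisted; the next query refetches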

Harvest-time RAG is the dominant pattern in RAG tutorials and frameworks. Runtime fetch is the less commonly named counterpart: it shows up in products like Perplexity and in assistants with live search, but it is rarely framed as a deliberate alternative to harvest-time. The framing this post adds is that the two have fundamentally different shapes under four concerns at once (latency, cost, staleness, and attribution) and that the first two are easier to reason about than the last two.


Comparison across the four axes

| Axis | Harvest-time RAG | Runtime fetch |
| --- | --- | --- |
| Query latency | 50–200 ms (vector retrieval + prompt build) | 1.5–6 s (network + parsing + prompt build) |
| Per-query cost | ~input tokens × price + embedding retrieval | input tokens × price + search API call ($0.01–0.10) |
| Staleness | As old as the last harvest (hours to weeks) | As fresh as the fetch primitive (minutes to seconds) |
| Attribution | Structurally detached from source | Structurally coupled to source |
| Copyright / redistribution exposure | High when corpus is third-party | Low; no redistribution, no indexed copy |
| ToS (Terms of Service) exposure on automated access | Still applies (source ToS governs the crawl) | Still applies (source ToS governs the fetch); runtime is not a ToS workaround |
| Infra complexity | Index maintenance, cache invalidation, storage growth | Intent classification, citation rendering, discard auditing, retry/timeout, rate limits |
| Failure mode | Stale answer delivered with confidence | Fetch fails → fall back to synthetic or decline to answer |

The latency gap is real — roughly 10–30×. The attribution gap is the less obvious one, and the one that compounds into legal/ToS exposure for any pipeline where the corpus isn't your own.

Where harvest-time RAG may detach from source

Consider this minimal harvest-time pipeline over a third-party corpus:

                                                          source_url carried?
1. crawler   : fetches pages, extracts text                ✓ attached to text
2. chunker   : splits into ~500-token segments             ✓ attached to each chunk
3. embedder  : embeds each chunk                           ✓ alongside the embedding
4. store     : persists {chunk_id, source_url, text, ... } ✓ in the row
5. retriever : top-k by cosine distance                    ✓ returned with each chunk
6. prompt    : "Here is relevant context:\n\n{chunk_1}…"   ✗ chunk text inlined; URL dropped
7. model     : generates response                          ✗ no structural link to origin

The source_url field is present at every step from crawl through retrieval, then disappears at step 6. The retriever has it. The prompt-builder has it. The prompt could include it. But in practice the chunk text gets concatenated into the prompt as raw context, because unless attribution has been deliberately built into the architecture, nobody is thinking about it at this stage. The model processes the chunks as knowledge equivalent to its own, and the response has no structural connection to the chunks' origin. You can ask the model to "cite your sources," but at that point you're asking a stochastic process to reproduce a URL it may or may not have attended to, which is an unreliable surface to build attribution on.
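
To make step 6 concrete: the chunk rows still carry source_url at this point, and keeping attribution is a small difference in the prompt builder. A minimal sketch, with the field names taken from the diagram above and everything else hypothetical:

def build_context(retrieved_chunks):
    # The common version: inline the text, drop the URL that is sitting right there in the row.
    return "\n\n".join(chunk["text"] for chunk in retrieved_chunks)

def build_context_with_sources(retrieved_chunks):
    # The attribution-preserving version: number each chunk and carry its source_url,
    # so whatever renders the answer has something structural to map back to.
    return "\n\n".join(
        f"[{i + 1}] {chunk['source_url']}\n{chunk['text']}"
        for i, chunk in enumerate(retrieved_chunks)
    )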

This is the attribution detachment: the data flow merges everything it pulled from many different origins into a single undifferentiated context, and the model's output is one opaque synthesis over that blend. The user sees an answer; they don't see what went into it. Even if you bolt on citation rendering, you're guessing which chunks the response drew on, because there's no causal chain you can reliably trace.

Downstream consequences of this detachment:

  • Attribution becomes best-effort guesswork, not a schema-level invariant.
  • Creator/source compensation cannot be tied to usage because usage isn't traceable.
  • Legal defense on derivative use becomes hard (were chunks reproduced? at what length? in whose response?).
  • Content owners have no mechanism to opt their material out after the fact short of a full re-harvest.

Why runtime fetch couples to source

Now consider the minimal runtime-fetch pipeline:

                                                                  url carried?
1. query classifier : "does this need fresh info?" (yes)           — (no fetch yet)
2. fetch primitive  : call web search / API with query             ✓ url returned
3. response parser  : extract N results with {title, url, excerpt} ✓ structured per result
4. prompt           : "Here are current search results:\n\n[1] {url_1}: {excerpt_1}\n\n[2] {url_2}: ..." ✓ in the prompt as [1]/[2]
5. model            : generates response with inline citations [1], [2] ✓ emitted in output
6. render           : surface citation chips in UI with click-through   ✓ rendered to user
7. discard          : nothing persisted                            ✓ next query refetches → URL re-enters

The url is structurally present in the prompt. With numbered sources in the context and a system prompt that requires [1], [2] citation markers, the model reliably emits them, though this is prompted behavior rather than a training-level guarantee, and index drift (the model citing [2] for a claim it drew from [1], or citing a number it was never shown) is a real failure mode that has to be validated against. The UI renders citation chips from the same numbering. The user clicks a chip and opens the source. The attribution is not bolted on; it is what made the answer possible in the first place, and every link in the chain preserves the source pointer.
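
A sketch of the validation step that backs this up: map the emitted [n] markers back to the fetched results and flag anything out of range. The result schema and marker format follow the diagram; the function itself is a hypothetical helper.

import re

def extract_citations(response_text, results):
    # results: [{"title": ..., "url": ..., "excerpt": ...}, ...] in the order they were numbered.
    citations = []
    for marker in re.findall(r"\[(\d+)\]", response_text):
        idx = int(marker) - 1
        if 0 <= idx < len(results):
            citations.append({"marker": f"[{marker}]",
                              "url": results[idx]["url"],
                              "title": results[idx]["title"]})
        else:
            # Index drift: the model emitted a marker that maps to nothing it was shown.
            # Treat as a validation failure (drop it, log it, or regenerate the answer).
            citations.append({"marker": f"[{marker}]", "url": None, "title": None})
    return citations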

More importantly: because nothing persists, the attribution contract is unavoidable. You cannot serve this query again from cache; the next time someone asks, you'll fetch again and the source will be in the pipeline again. There is no path through the system where a creator's content reaches a user without the citation being surfaced.

This is the structural property that makes runtime fetch materially lower-risk on the copyright and attribution axes where harvest-time RAG can be hazardous: the shape of the pipeline is the shape of the attribution. (ToS compliance is orthogonal — it applies to both regimes and has to be handled separately; see the ToS caveat below.)

The hybrid that ends up being load-bearing

Pure runtime fetch is too expensive and too slow as a default. A 10-second voice-assistant round-trip budget can't afford a 4-second web search on every query. The honest production architecture is a hybrid:

query → intent classifier
        ├─ known-answer domain → serve from your own synthesized knowledge base
        │                        (harvest-time, but your own synthesis, not third-party corpus)
        │
        ├─ low-confidence knowledge → runtime fetch (web search / API)
        │                             → respond with citation
        │
        └─ high-freshness domain → always runtime fetch (live prices, patch notes, etc.)

The crucial invariant: what you harvest is your own work (synthesized outputs, first-party data, licensed content). What you fetch at runtime is third-party content. You never harvest-cache third-party expression for redistribution.
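
A sketch of that invariant in code, assuming a classifier and fetch primitive along the lines above; every name in it (classify_intent, own_knowledge_base, runtime_fetch, llm) is a hypothetical interface. The point is what gets written to the persistent store and what never does.

def answer(query, classify_intent, own_knowledge_base, runtime_fetch, llm):
    route = classify_intent(query)   # "known_answer" | "low_confidence" | "high_freshness"
    if route == "known_answer":
        # Harvest side: served from your own synthesized knowledge, never a third-party corpus.
        return own_knowledge_base.lookup(query)
    # Runtime side (both remaining routes): fetch live, answer with citations, persist nothing.
    results = runtime_fetch(query)
    answer_text = llm(query, context=results)   # prompt contract requires [n] citation markers
    # Deliberately no own_knowledge_base.write(results): third-party content is never cached.
    return {"answer": answer_text, "citations": [r["url"] for r in results]}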

In my gaming-assistant research project, this split was empirically determined. The synthetic-generation pipeline (model generates structured game knowledge from training; judge validates; cache if it clears the quality bar) handled the queries that landed on well-known game content — the "knows" side of the two-bucket split the hallucination taxonomy post measures, which is the dominant share of the research sample. The long tail — obscure games, recent patches, live-service content — fell through to runtime web_search, answered with citations, nothing cached. The harvest side avoided copyright/redistribution exposure because the persistent corpus contained only derived/synthesized work. The runtime side avoided it because it held nothing, cited everything, and routed users back to primary sources. ToS exposure on the runtime side was managed separately by picking fetch primitives (Anthropic's web_search, first-party APIs) whose terms permit the access, not by assuming runtime fetch itself made ToS moot.

Cost mechanics: the hidden part of the tradeoff

A common objection to runtime fetch is cost: "search calls are 1–10¢ each, that's $30–300 per user per day." This is true at naive scale, but a well-tuned hybrid doesn't call search on every query:

total_cost/query = P(intent needs fetch) × cost_fetch
                 + P(intent known-answer) × cost_synthetic

If your intent classifier routes 70% of queries to harvest (which is your own synthesis, so ~$0 incremental), 20% to low-cost model retrieval, and only 10% to runtime search, your effective search spend per query is 0.10 × $0.05 = $0.005 (assuming a 5¢ search call). That's tolerable for most consumer applications, and total per-query cost ends up dominated by LLM inference rather than search.
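
Plugging the example split into the cost expression above: the 5¢ search price and the 70/20/10 split come from the paragraph, while the model-retrieval cost is an assumed placeholder.

routes = {
    "harvest_own_synthesis": {"share": 0.70, "cost": 0.000},   # own synthesized cache, ~$0 incremental
    "model_retrieval":       {"share": 0.20, "cost": 0.002},   # assumed placeholder for the low-cost path
    "runtime_search":        {"share": 0.10, "cost": 0.050},   # search API call at ~5 cents
}
blended = sum(r["share"] * r["cost"] for r in routes.values())
print(f"blended retrieval cost per query: ${blended:.4f}")     # $0.0054 with these assumptions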

The harder cost to reason about is the exposure cost of harvest-time RAG over a third-party corpus. It's denominated in legal-risk dollars, ToS-violation probability, and creator-relations damage — all tail events with long-horizon consequences. Runtime fetch trades a predictable per-query dollar cost for a significant reduction in the copyright/redistribution slice of that exposure (ToS exposure persists and must be handled by choice of fetch primitive). For a project with a long enough horizon to run into identifiable content creators, that's often the trade that actually matters — a framing that's held up in my own work but whose weight depends on the specific product's legal and relational exposure.

When harvest-time is the right call

Harvest-time RAG is the right choice, not a compromise, when:

  • The corpus is first-party — user's own documents, your internal knowledge base, license-cleared CC0 data. No third-party content means no attribution detachment problem.
  • The corpus is explicitly licensed for redistribution — some enterprise data products ship with license grants that permit indexing and re-serving. Rare but real.
  • The content is factual-only, non-expressive — Wikidata entity lists, price tables, scientific datasets. Facts aren't copyrightable (Feist Publications v. Rural Telephone Service Co., 499 U.S. 340 (1991)), so the attribution concern collapses.
  • You have written permission from the source — a partnership agreement that explicitly authorizes caching. This is what a principled creator-content pipeline looks like.

Harvest-time is wrong specifically when the corpus is third-party expressive content accessed without explicit authorization. That's the case where the attribution-detachment shape of the pipeline becomes a compliance hazard that no amount of attribution-layer UI can patch.

When runtime fetch is the right call

Runtime fetch is right when:

  • The source is third-party and unowned. The cleanest path absent an explicit redistribution license.
  • The data is high-freshness (news, prices, live-service game state) where any cache is stale by definition.
  • Attribution is part of the product value — AI tools that want to send users back to original authors need the structural source-coupling runtime gives you.
  • You want to reduce redistribution/copyright exposure and the source's ToS permits automated reads (either directly or through an official API on its own terms).
  • The query distribution has a long tail and exhaustive pre-harvesting would require scraping-class access.

ToS caveat worth being explicit about. Runtime fetch is not a workaround for a ToS that prohibits automated access. Many platforms (YouTube, Reddit, Twitter/X, Fandom) restrict automated or "indirect" use regardless of whether you cache. If the ToS forbids automated access, runtime is just as prohibited as harvest — the fix is an official API with its own terms, a licensing deal, or a different source. What runtime fetch solves is copying (you don't keep a copy) and crediting (the citation is built into the flow). What it doesn't solve is whether you're allowed to access the source in the first place — that's still governed by the source's terms.

The pattern, formalized

def evaluate_source(source_is_third_party_expression,
                    have_explicit_redistribution_license,
                    source_tos_permits_automated_access,
                    source_is_first_party,
                    source_is_factual_non_expressive):
    decision = {"harvest_ok": False, "runtime_fetch_ok": False,
                "citation_required": False, "escalate_to_licensing": False}
    if source_is_third_party_expression:
        if have_explicit_redistribution_license:
            decision["harvest_ok"] = True
        elif source_tos_permits_automated_access:   # directly or via official API
            decision["runtime_fetch_ok"] = True
            decision["citation_required"] = True
        else:
            # Neither regime is compliant. Need a license, an API key
            # under authorized terms, or a different source.
            decision["escalate_to_licensing"] = True
    elif source_is_first_party or source_is_factual_non_expressive:
        # factual_non_expressive: facts/data without creative expression
        # (Wikidata-style entity lists, price tables, scientific datasets);
        # not copyrightable per Feist v. Rural Telephone (1991).
        decision["harvest_ok"] = True
    return decision

That's the decision rule. Two principles compose: you may cache what you own or what isn't copyrightable; for everything else, the ToS of the source governs whether you can access it at all, regardless of whether you cache.
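
As a quick illustration of how the rule plays out, two calls with hypothetical characterizations (the inputs describe archetypes, not legal conclusions about any specific source):

# Third-party fan wiki, no redistribution license, ToS permits access via an official API:
print(evaluate_source(source_is_third_party_expression=True,
                      have_explicit_redistribution_license=False,
                      source_tos_permits_automated_access=True,
                      source_is_first_party=False,
                      source_is_factual_non_expressive=False))
# -> harvest_ok False, runtime_fetch_ok True, citation_required True

# Wikidata-style entity list: factual, non-expressive:
print(evaluate_source(source_is_third_party_expression=False,
                      have_explicit_redistribution_license=False,
                      source_tos_permits_automated_access=False,
                      source_is_first_party=False,
                      source_is_factual_non_expressive=True))
# -> harvest_ok True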

Practical implementation notes

A few things that aren't obvious until you build this:

1. Intent classification is load-bearing and underrated. The hybrid only works if you can cheaply decide "does this query need fresh info?" before committing to the expensive fetch path. A small classifier (or a rule-based triage on query features) is the operational key; a minimal sketch of the rule-based version follows this list. Misclassifying a known-answer query as fetch-required wastes 4 seconds of latency and 5¢; misclassifying a fetch-required query as known-answer serves stale or wrong content. Both misclassifications erode user trust.

2. Fetch primitives vary by an order of magnitude in latency. General web search runs in seconds (the 1.5–6 s range in the comparison table above); authorized first-party APIs (Bungie for Destiny, Wikipedia REST) typically run in hundreds of milliseconds. Prefer first-party APIs when available; fall through to general search only when you need breadth.

3. Citation rendering is a product-level commitment, not a library. The rendering has to be honest about which result the model drew on, handle multi-source synthesis ("according to [1] and [3]..."), and degrade gracefully when the model's output doesn't cleanly map to retrieved sources. This is UX work, not infra work.

4. Discard-after-generation needs to be audited, not trusted. Check that your prompt-logging, observability, and error-replay pipelines aren't inadvertently persisting fetched content as a side effect of debug instrumentation. The "nothing persists" property is easy to accidentally violate.

5. The intent classifier trains on your own usage data. When you deploy the hybrid into a new product or use case, the first N queries are educational — you learn what your users actually ask. Instrument both routes for correctness and retrain/re-tune the classifier periodically. The ceiling on hybrid quality is how well you route.
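
Following up on note 1, a minimal rule-based triage sketch; the marker list and the confidence threshold are illustrative assumptions, not tuned values.

FRESHNESS_MARKERS = ("latest", "today", "current", "price", "patch", "update",
                     "season", "release date", "news")

def needs_runtime_fetch(query: str, known_answer_confidence: float) -> bool:
    # Route to runtime fetch if the query looks freshness-sensitive,
    # or if the known-answer side isn't confident enough to serve it.
    q = query.lower()
    if any(marker in q for marker in FRESHNESS_MARKERS):
        return True
    return known_answer_confidence < 0.7   # illustrative threshold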

The architectural takeaway

Harvest-time RAG over third-party content loses the tradeoff once you include attribution and copyright/redistribution risk in the table: runtime fetch with citation wins on the axes that end up mattering most. Harvest-time is faster and cheaper per query, but it makes it easy to skim over the structural attribution that's required to be honest about where your answers come from. Runtime fetch is slower and costlier, but its shape makes attribution unavoidable, which is the property that lets you build products that respect the sources they depend on. (ToS compliance is a separate axis that applies to both regimes; runtime doesn't dissolve it.)

The hybrid — harvest your own work, fetch third-party content at runtime through channels the source permits — is the production-grade resolution I've converged on, and in my analysis it's the shape most likely to scale without accumulating copyright and relational debt. I'd expect other shapes (e.g., fully licensed harvest under explicit partnership) to also work; what doesn't work is harvesting third-party expressive content without authorization and treating attribution as an afterthought that may or may not get bolted on later.

Much of today's AI content product landscape sits on the wrong side of this split — a pattern that has begun producing observable legal friction with publishers and creators (see, e.g., NYT v. OpenAI, 2023). None of this is architecturally unavoidable. The choice can be made long before it shows up in a privacy policy or a ToS analysis.


This is part of a series on empirical architectural findings from building a voice-first AI gaming assistant. See also: the two-bucket hallucination taxonomy and discrete-confidence cliff, why "universal coverage" grounding sources didn't help on the games that needed it, and — for the general-audience version of this architectural argument with less RAG vocabulary and more story — AI That Respects Creators.

Personal blog. Views and writing here are my own and do not necessarily reflect those of my employer or any organization I'm affiliated with. Side projects, written on personal time.