Stop paying for the same tokens twice

I ran five reviewer agents over a 16K-token PR. The naive setup cost $1.32. Two architectural changes brought it to $0.49. Here’s the bill, and what the experiment surprised me with along the way.

Five specialist agents reviewed the same pull request. Security, tests, perf, observability, API design. Each one re-read the full diff. Each one paid full price. The bill landed at $1.32 for a single PR pass on Claude Opus 4.6.

One PR. One review pass. Buck thirty-two.

If you’re running multi-agent code review at any scale, you’re bleeding money on the exact same tokens, over and over, paid in full each time.

I built a harness to measure the bleed and test the two obvious fixes. The numbers were sharper than I expected, and one of them inverted my assumptions about prompt caching.

What I built

A small Python harness with three modes and one shared 49K-character pull request diff. Five reviewer roles, each with a tight system prompt:

REVIEWER_ROLES = {
    "security":      "Auth, injection, secrets, access control...",
    "tests":         "Missing tests, skipped tests, weak assertions...",
    "perf":          "N+1 queries, sync I/O in request paths...",
    "observability": "Logging hygiene, error handling, missing metrics...",
    "api-design":    "Endpoint naming, status codes, idempotency...",
}

Three modes against the same diff:

Naive. Each reviewer makes its own call with the full diff in the user message. Role text comes first so the prefix differs per call. Nothing gets cached.
Cache. Each reviewer puts the diff in the system message, identical across all five calls. The first call writes to the provider’s prompt cache. Calls two through five read from it at a 90% discount.
Librarian. A first pass digests the diff into a compact JSON summary. Reviewers run against the digest, not the raw diff. Inspired by Cho et al.’s arXiv paper Long Live the Librarian!, which uses a persistent search sub-agent to suppress redundant exploration tokens across agents.

I priced every call at Anthropic’s API list rates: $15/M input, $1.50/M cache reads, $75/M output. Real measured token counts, list-price arithmetic on top.

The numbers

Large diff (49K characters, roughly 16K tokens per call), five reviewers, single fresh review pass:

Mode	Input	Cached	Output	Total
Naive	80,248	0	1,500	$1.316
Cache	80,268	63,968 (4/5 calls)	1,500	$0.453
Librarian	20,915	0	2,300	$0.486

Naive is comically wasteful. Cache and librarian both cut it by roughly two thirds.

Cache edges librarian by three cents. That’s real, and it changes how I think about which lever to pull first.

When caching actually beats the librarian

Caching wins on a warm series. The first call writes to cache, the next four read from it. If you can guarantee five reviewers run back-to-back against the same shared prefix in the same warm session, prompt caching is the cleanest move you can make. Two-line system message refactor. 66% off the bill.

The problem: code review bots almost never run that way in production.

Each PR is a fresh session. PRs come in from different developers, different repos, different branches, at unpredictable intervals. There’s no “warm series.” There’s a parade of cold starts.

When every call is cold, cache mode collapses back to naive mode. Same dollars per call. Same bleed.

The librarian pattern doesn’t care whether the session is warm or cold. Each reviewer reads a tiny digest, not the raw diff. The savings are baked into the shape of the calls, not the runtime state of a cache somewhere in the provider’s fleet.

Mode	Warm series (5 calls back-to-back)	Cold every call (operational reality)
Naive	$1.32	$1.32
Cache	$0.45	$1.32
Librarian	$0.49	$0.49

If your reviewers run in tight bursts inside a single session, cache. If they run independently across cold starts, use the librarian. Most real systems are the second kind.

What surprised me about caching

I built the harness with a fail-loud guardrail. If calls 2 through N in cache mode report zero cached tokens, the harness aborts and prints diagnostics. The cached prefix should hit, and if it doesn’t, the experiment is invalid.

It tripped twice on the small-diff run.

Same code, same five-call sequence, same identical system message at 4,003 tokens. Both runs returned cached_tokens=0 for every single call. Cache mode and naive mode came in at the exact same $0.413, to four decimal places.

Then I ran the large diff at 16,000 tokens per call. Four out of five calls hit the cache cleanly. 15,992 cached tokens each.

So somewhere between 4K and 16K tokens of identical prefix, against claude-opus-4.6 through the Copilot proxy with an OpenAI-shape request, prompt caching activates. Below that threshold, it doesn’t. The Anthropic docs say the minimum is roughly 1024 to 2048 tokens. Empirically, the activation curve is not flat.

If you’re budgeting on a stated cache discount, model your hit rate as a band. Run the experiment on your actual payload sizes. Don’t assume the marketing math holds for the call shapes you’re actually making.

Try it yourself

The full harness lives at github.com/AndreaGriffiths11/librarian-demo. Three Python files, one diff scenario, one pricing module. Run it with:

export COPILOT_OAUTH_TOKEN=$(sqlite3 ~/.copilot/data.db \
  "SELECT access_token FROM github_accounts WHERE is_default=1;")

python demo.py all --verbose --diff scenario/sample_pr_large.diff

If you have the GitHub Copilot CLI installed, you already have an OAuth token sitting in ~/.copilot/data.db. Point the OpenAI SDK at https://api.githubcopilot.com with the Copilot-Integration-Id: copilot-cli header and you get the full model catalog (Anthropic, OpenAI, Google) on the chat-completions shape, with prompt_tokens_details.cached_tokens reporting. No PAT, no token exchange. About 30 lines of glue:

import sqlite3, pathlib
from openai import OpenAI

con = sqlite3.connect(
    f"file:{pathlib.Path.home()}/.copilot/data.db?mode=ro", uri=True
)
token = con.execute(
    "SELECT access_token FROM github_accounts WHERE is_default=1"
).fetchone()[0]

client = OpenAI(
    base_url="https://api.githubcopilot.com",
    api_key=token,
    default_headers={"Copilot-Integration-Id": "copilot-cli"},
)

One billing note: GitHub Copilot moves to usage-based billing on June 1, 2026. Under the new model, token consumption maps directly to GitHub AI Credits at Anthropic’s published API rates, so the dollar figures in this article apply to Copilot users now too, not just people hitting the API directly. The architectural takeaway is the same either way. If you’re on a paid Copilot plan, these savings are real dollars out of your monthly credit allotment.

What I’d actually ship

If I were rebuilding a review pipeline tomorrow with this data on the table:

Move all shared context into the system message. It’s the cheapest refactor in software, and it costs you nothing when the cache doesn’t activate.
Build the librarian pass once you cross three reviewers, or once your shared context is large enough that the activation threshold actually kicks in. The digest pays for itself across cold sessions.
Stop treating cache hit rate as a constant. Measure it on your payloads, in your environment, against the model you ship with. Write a fail-loud guardrail and let it scream when assumptions break.

Cho et al.’s paper measures GPU energy savings on SWE-Bench Verified, up to 25% on multi-agent runs. My harness measures wall-clock dollars on one PR review pass. Different proxy, same lesson: persistent shared context across agents is a free architectural lever sitting on the floor.

Paper: Cho, Choi, Heo, Choi. “Long Live the Librarian! A Persistent Search Sub-Agent for Energy-Efficient Multi-Agent Software Engineering Systems” (arXiv:2605.27787, May 2026).

All numbers measured against claude-opus-4.6 via the GitHub Copilot proxy on May 29, 2026. Full per-call breakdowns and the harness source live in the runs/findings.md of the repo.