AGENTS.md won on both tasks I tested. My first measurement said it lost one of them.

I ran a coding agent against the same task, with and without an AGENTS.md file, five times each, and took the median. AGENTS.md won. Faster, cheaper, and it produced tighter diffs, on an ambiguous task and a multi-file task alike.

That’s not what my first pass said. One run per condition, no medians, and the harder task showed AGENTS.md running 44% slower and costing 41% more credits for identical output. Clean enough numbers to lead an article with. Also wrong.

Wall-clock time and token counts are noisy per run. Before I trusted that first pass, I ran every condition four more times and took the median instead of the single sample. The loss didn’t survive. My “harder task” AGENTS.md run turned out to be the slowest of five, paired against one of the fastest no-AGENTS.md runs. Bad luck, in exactly the direction that makes a good headline and a wrong conclusion.

Here’s the harness, the real numbers, and what a single run got wrong.

What I set up

Repo: teamxray, a VS Code extension I built. TypeScript, webpack, npm, vscode-test. Real enough structure to test whether an AGENTS.md changes agent behavior on repo-specific work.

Two identical clones. Same commit. No AGENTS.md, no CLAUDE.md, no cache. node_modules pre-installed in both so npm install wouldn’t dominate the wall clock.

Agent: Copilot CLI in non-interactive mode.

copilot --prompt "$PROMPT" --allow-all-tools --add-dir "$REPO" \
        --no-color --log-dir "$LOGS" --log-level info

Copilot CLI reports credits, tokens, and wall time at the end of every run. I captured stdout, stderr, git diffs, and validation output for every run.

Each condition ran five times. Clean git reset between runs, same prompt, same starting commit. I report medians below, plus the full spread, because one run is a sample, not a result.

The twelve-line AGENTS.md:

# AGENTS.md

## Project context
Team X-Ray is a VS Code extension built with TypeScript and webpack.

## Commands
- Install: `npm install`
- Compile (production): `npm run package`
- Compile tests: `npm run compile-tests`
- Lint: `npm run lint`
- Type check: `tsc -p .`

Use npm. Not pnpm. Not yarn.

## Rules
- Source code lives in `src/`. Follow existing directory structure.
- Do not edit files in `dist/`, `out/`, or `node_modules/`.
- Do not modify the `postinstall` script in package.json.
- Do not modify webpack.config.js unless the task requires it.
- Add tests alongside the code they cover.
- Keep changes scoped to the requested task.

## Validation
Before finishing, run `npm run lint` and `npm run compile-tests`.
If a check fails, include the exact command and error.

Task 1: add a utility function

The prompt: add a formatDuration(ms) helper that returns "500ms", "1.5s", "1m 5s", "1h 1m 5s". Add unit tests. Make sure lint and type check pass.

Ambiguous surface, obvious execution. Every run in both conditions figured out that src/utils/ exists and that src/core/__tests__/ uses vitest.
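For scale, a helper that satisfies the four examples in the prompt can be very small. This is a minimal sketch written for illustration, not output from any of the runs. The prompt leaves rounding open, so truncating sub-minute values to tenths is an assumption, as is treating the input as a non-negative finite number.

```typescript
// Sketch only: one way to satisfy the four examples in the prompt.
// Assumptions (not stated in the prompt): input is a non-negative finite
// number, sub-minute values truncate to one decimal place, and lower units
// are always printed once a higher unit appears.
function formatDuration(ms: number): string {
  if (ms < 1000) {
    return `${ms}ms`;
  }
  if (ms < 60000) {
    // Truncate to tenths so 59999 ms prints "59.9s" rather than "60s".
    return `${Math.floor(ms / 100) / 10}s`;
  }
  const totalSeconds = Math.floor(ms / 1000);
  const hours = Math.floor(totalSeconds / 3600);
  const minutes = Math.floor((totalSeconds % 3600) / 60);
  const seconds = totalSeconds % 60;
  return hours > 0
    ? `${hours}h ${minutes}m ${seconds}s`
    : `${minutes}m ${seconds}s`;
}

console.log(formatDuration(500));     // 500ms
console.log(formatDuration(1500));    // 1.5s
console.log(formatDuration(65000));   // 1m 5s
console.log(formatDuration(3665000)); // 1h 1m 5s
```

Everything beyond this (NaN, Infinity, negative inputs) is scope the prompt never asked for.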

Metric (median, n=5)	No AGENTS.md	With AGENTS.md	Delta
Wall time	110s	80s	−27%
AI Credits	24.2	18.3	−24%
Tokens up	600.3k	460.5k	−23%
Tokens down	6.0k	4.6k	−23%
Tool calls	18	14	−22%
Lines added	76	56	−26%
Broke build?	no (5/5)	no (5/5)	tie

Spread across the five runs: wall time ranged 88–126s without AGENTS.md and 69–100s with it. Only one run from each condition landed in the shared 88–100s band. The task 1 win held up across repeats.

Lines added tell the sharper story. Without AGENTS.md: 61, 72, 76, 82, 86. With AGENTS.md: 52, 56, 56, 62, 63. In my first run, the no-AGENTS.md version added NaN handling, Infinity guards, and negative-number formatting nobody asked for, plus six test cases instead of four. That specific example was one run, but the line-count gap held across all five: the largest AGENTS.md diff (63 lines) was smaller than every no-AGENTS.md diff except the smallest (61 lines).

One line in the file does that work: Keep changes scoped to the requested task.

Task 2: register a new VS Code command

The prompt: add a command called Team X-Ray: Show Analysis Cache Stats. Register it in package.json under contributes.commands and activationEvents. Wire it up in src/extension.ts following existing patterns.

Multi-file. Touches the package.json scripts section by proximity. The AGENTS.md warns against modifying the postinstall script and webpack.config.js, both trap doors within reach.
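Whether a run fell through either trap door can be checked mechanically from its diff. A minimal sketch, assuming each run yields the package.json text before and after, plus the list of changed paths (for example from `git diff --name-only`). The function name, the result shape, and the sample postinstall value are hypothetical, chosen for illustration.

```typescript
// Sketch only: names and input shapes are illustrative assumptions.
interface GuardrailResult {
  postinstallIntact: boolean;
  webpackConfigTouched: boolean;
}

function checkGuardrails(
  packageJsonBefore: string,
  packageJsonAfter: string,
  changedPaths: string[],
): GuardrailResult {
  // Read scripts.postinstall from a package.json string; undefined if absent.
  const postinstall = (text: string): string | undefined => {
    const pkg = JSON.parse(text) as { scripts?: Record<string, string> };
    return pkg.scripts ? pkg.scripts["postinstall"] : undefined;
  };
  return {
    postinstallIntact:
      postinstall(packageJsonBefore) === postinstall(packageJsonAfter),
    webpackConfigTouched: changedPaths.indexOf("webpack.config.js") !== -1,
  };
}

// Placeholder data: a run that adds a command and leaves scripts alone.
const pkgBefore = JSON.stringify({ scripts: { postinstall: "node setup.js" } });
const pkgAfter = JSON.stringify({
  scripts: { postinstall: "node setup.js" },
  contributes: { commands: [{ command: "example.command" }] },
});
console.log(checkGuardrails(pkgBefore, pkgAfter, ["package.json", "src/extension.ts"]));
```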

Metric (median, n=5)	No AGENTS.md	With AGENTS.md	Delta
Wall time	55s	50s	−9%
AI Credits	15.6	14.1	−10%
Tokens up	406.1k	335.3k	−17%
Tokens down	2.8k	2.7k	−4%
Tool calls	11	10	−9%
Lines added	9	10	+11% (+1 line)
postinstall intact?	yes (5/5)	yes (5/5)	tie
webpack.config.js touched?	no (5/5)	no (5/5)	tie

Smaller win than task 1, and my original single-run comparison landed on the wrong side of it entirely. My first no-AGENTS.md run happened to finish in 48s, near the fast end of its range. My first AGENTS.md run happened to finish in 69s, the slowest of its five. Paired against each other, 48s versus 69s produced the 44%-slower headline. Compared with their own medians, both numbers are unremarkable.

The more interesting finding is the spread, not the median. Without AGENTS.md: 44, 48, 55, 97, 109 seconds. With AGENTS.md: 45, 46, 50, 52, 69 seconds. The no-AGENTS.md condition has a long tail. The AGENTS.md condition mostly doesn’t.
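Both readings come from the same ten numbers. A short sketch of the arithmetic, using the wall-clock times listed above; the helper names are illustrative.

```typescript
// Wall-clock seconds for task 2, copied from the runs listed above.
const withoutAgents = [44, 48, 55, 97, 109];
const withAgents = [45, 46, 50, 52, 69];

// Median of a non-empty list: the middle value, or the mean of the two
// middle values when the count is even.
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Relative change from baseline to candidate, in percent.
function percentDelta(baseline: number, candidate: number): number {
  return ((candidate - baseline) / baseline) * 100;
}

// Median against median (55s vs 50s): AGENTS.md is about 9% faster.
console.log(percentDelta(median(withoutAgents), median(withAgents)).toFixed(1)); // -9.1

// First run against first run (48s vs 69s): AGENTS.md looks about 44% slower.
console.log(percentDelta(48, 69).toFixed(1)); // 43.8
```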

I checked what the two slow no-AGENTS.md runs actually did. One opened with “Locate teamxray repo in home” and “Show current directory,” spending tool calls figuring out where it was before touching a single file. The other did the same orientation work, then ran a full npm run compile webpack production build at the end, unprompted. Every AGENTS.md run skipped both. The file didn’t just guide the edit; it removed the reason to go looking for context in the first place.

What changed my mind

AGENTS.md won on both tasks once I measured it properly. The margin wasn’t the same, and that difference is the real finding.

On the ambiguous task, AGENTS.md cut both the time and the scope of the change. The instruction “keep changes scoped” removed unrequested defensive code, consistently, across five runs.

On the well-specified task, the median win was smaller, about 4 to 17% depending on the metric, and the diff was one line longer. But the tail shrank a lot more than the median did. Without repo context, two runs out of five wasted time reorienting or over-validating. With it, none did.

If I’d shipped the article after one run each, I would have published a true-sounding number that was actually noise. That’s the risk with any agent benchmark that reports a single run: wall-clock time and token counts vary enough between runs that a single comparison can point the wrong direction with a straight face.
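To put a number on that risk, here is a sketch that takes the task 2 wall-clock times reported earlier, enumerates every way to pick one run per condition, and counts the pairings that would have shown AGENTS.md as slower. Treating every pairing as equally likely is a simplifying assumption, and the function name is illustrative.

```typescript
// Task 2 wall-clock seconds, as reported in the spread for that task.
const task2Without = [44, 48, 55, 97, 109];
const task2With = [45, 46, 50, 52, 69];

// Count single-run pairings in which the candidate looks slower than the baseline.
function countMisleadingPairs(
  baseline: number[],
  candidate: number[],
): { misleading: number; pairs: number } {
  let misleading = 0;
  for (const b of baseline) {
    for (const c of candidate) {
      if (c > b) {
        misleading++;
      }
    }
  }
  return { misleading, pairs: baseline.length * candidate.length };
}

const pairing = countMisleadingPairs(task2Without, task2With);
console.log(`${pairing.misleading} of ${pairing.pairs}`); // 9 of 25
```

On this data, 9 of the 25 possible single-run comparisons (36%) point the wrong way on wall time, even though the medians favor AGENTS.md.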

What goes in an AGENTS.md

I read a stack of docs before writing the version I tested: the official AGENTS.md project, the OpenAI Codex AGENTS.md guide, Cursor’s rules docs, Amp’s manual, Claude Code’s memory docs, and Warp’s rules docs.

They agree on shape.

Concrete beats aspirational. Every time.

Weak:

Follow best practices.
Make sure tests pass.
Be careful with generated files.

Useful:

Run `npm run lint` after any source change.
Do not edit files in `dist/`, `out/`, or `node_modules/`.
Use existing helpers in `src/utils` before creating new ones.

The weak version gives the agent vibes. The useful version gives it commands, paths, and constraints. Three lines each. Different behavior at runtime.

The second shared rule across those docs: keep it short.

Cursor recommends focused rules under 500 lines. Claude Code targets under 200 lines for CLAUDE.md. The OpenAI Codex guide notes a byte cap across the combined instruction chain, so oversized files can get truncated. If your AGENTS.md is a wall of prose, the parts you care about might not survive the trip.

My twelve-line file already carries a fixed cost on every call: it’s read, parsed, and reasoned about in every turn whether the task needs it or not. A three-hundred-line file carries a bigger version of that same cost, on every run, whether or not that run ever touches the part that mattered.
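A size budget is easy to check automatically. A minimal sketch, assuming you choose the limits yourself: the function name is hypothetical, the 200-line figure echoes the target mentioned above, and the byte limit is a placeholder, not a value taken from any tool's documentation.

```typescript
// Sketch only: limits are caller-supplied; the values below are placeholders.
interface SizeBudget {
  maxLines: number;
  maxBytes: number;
}

interface SizeReport {
  lines: number;
  bytes: number;
  withinBudget: boolean;
}

function checkInstructionSize(text: string, budget: SizeBudget): SizeReport {
  // Ignore a single trailing newline so "a\nb\n" counts as two lines.
  const body = text.replace(/\n$/, "");
  const lines = body === "" ? 0 : body.split("\n").length;
  // Byte caps apply to the encoded file, so measure UTF-8 bytes, not characters.
  const bytes = new TextEncoder().encode(text).length;
  return {
    lines,
    bytes,
    withinBudget: lines <= budget.maxLines && bytes <= budget.maxBytes,
  };
}

const sample = "# AGENTS.md\nUse npm. Not pnpm. Not yarn.\n";
console.log(checkInstructionSize(sample, { maxLines: 200, maxBytes: 32768 }));
```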

The weird stuff belongs in the file

The strongest real-world example I found lives in the OpenAI Codex repo.

It doesn’t stop at “run tests.” It has specific formatting rules for Rust, “Never add or modify” boundaries around sandbox code, guidance to use just test instead of raw cargo test, monorepo advice about not overloading core crates, and change-size guidance to keep diffs reviewable.

That’s the bar.

If your repo has a cursed command everyone knows not to run, put it in AGENTS.md. If one folder owns the API types, put it in AGENTS.md. If the agent should use npm run pretest && npm test and never a bare npm test, put it in AGENTS.md.

The file should answer the questions the agent would otherwise learn by breaking something.

Nested AGENTS.md files are the monorepo pattern

Every doc I read that discusses monorepos lands on the same shape. Root file general. Folder files specific. Closest file wins.

repo/
├── AGENTS.md                    # workspace rules
├── apps/
│   └── web/
│       └── AGENTS.md            # Next.js rules
└── packages/
    └── api/
        └── AGENTS.md            # schema helpers, fixtures

The AGENTS.md FAQ makes precedence explicit: closest file wins, and explicit user prompts still override the file. That hierarchy keeps the repo instructions helpful without making them more important than the actual task.
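The precedence rule can be sketched as a walk from the edited file up to the repo root. This models only the closest-file-wins lookup described above, under stated assumptions: paths are POSIX-style and relative to the repo root, the caller supplies the directories that contain an AGENTS.md, and the function name is hypothetical. A given tool may also merge files along the path rather than pick one.

```typescript
// Sketch only: "" stands for the repo root in the list of directories.
function nearestAgentsFile(
  filePath: string,
  agentsDirs: string[],
): string | undefined {
  const parts = filePath.split("/");
  parts.pop(); // drop the file name, keep its directory chain
  // Walk from the file's own directory up to the repo root.
  for (let depth = parts.length; depth >= 0; depth--) {
    const dir = parts.slice(0, depth).join("/");
    if (agentsDirs.indexOf(dir) !== -1) {
      return dir === "" ? "AGENTS.md" : `${dir}/AGENTS.md`;
    }
  }
  return undefined;
}

// Directories from the tree above that hold an AGENTS.md.
const dirs = ["", "apps/web", "packages/api"];
console.log(nearestAgentsFile("apps/web/src/page.tsx", dirs));  // apps/web/AGENTS.md
console.log(nearestAgentsFile("packages/ui/button.tsx", dirs)); // AGENTS.md
```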

Don’t create a nested tree because it looks organized. Add a folder-level file the first time the root rules stop being true inside that folder.

AGENTS.md is guidance, not enforcement

This is the part worth being honest about, and it’s also the part that makes the standard matter.

AGENTS.md can tell the agent not to commit secrets. It cannot guarantee the agent won’t. It can tell the agent to run tests before finishing. It cannot force that to happen if something upstream breaks first.

Both Amp’s manual and Claude Code’s memory docs say this outright: instructions guide behavior, they don’t enforce it. That’s true of every instruction file, in every tool, from every vendor.

Here’s the actual case for a shared standard. A team running three coding agents without one is maintaining three separate files of the same repo guidance, each updated by whoever remembered last, each drifting out of sync at its own pace. One portable, git-tracked AGENTS.md means one place to fix a wrong instruction and one diff for every agent to pick up the moment it merges.

Enforcement still lives where it always has: CI, required status checks, branch protection, secret scanning, content exclusions, hooks in your Copilot cloud agent setup, and sandboxed runners for untrusted work.

Markdown guides. Checks decide. AGENTS.md’s job is making every agent read the same guidance, not inventing guardrails Markdown was never built to provide.
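As one example of a check that decides, the generated-file rule from the AGENTS.md can be backed by a small script in CI. A minimal sketch, assuming the job already has the list of changed paths (for example from `git diff --name-only`); the function name is hypothetical and the match is a plain top-level prefix test.

```typescript
// Sketch only: the protected prefixes mirror the "do not edit" rule in the
// AGENTS.md shown earlier. Obtaining the changed paths is left to the CI job.
const PROTECTED_PREFIXES = ["dist/", "out/", "node_modules/"];

function findProtectedEdits(
  changedPaths: string[],
  prefixes: string[] = PROTECTED_PREFIXES,
): string[] {
  return changedPaths.filter((path) =>
    prefixes.some((prefix) => path.indexOf(prefix) === 0),
  );
}

const violations = findProtectedEdits(["src/extension.ts", "dist/extension.js"]);
if (violations.length > 0) {
  console.error(`Protected paths modified: ${violations.join(", ")}`);
  // In a real CI job, exit with a non-zero status here so the check fails.
}
```

Unlike the line in AGENTS.md, this check produces the same result no matter which agent, or which human, wrote the change.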

Why the format sits under a foundation

The reason I ran this experiment at all: AGENTS.md kept showing up in every coding-agent doc I read. Codex, Copilot, Cursor, Amp, opencode, Warp, Claude Code, goose. Different companies, different pricing models, different UX. Same instruction file at the root of the repo. Claude Code reads CLAUDE.md natively, but its memory docs recommend a one-line @AGENTS.md import or a symlink, so both files stay in sync from one source of truth.

That’s a rare outcome in the agent tooling space, and it takes deliberate coordination.

AGENTS.md is now stewarded by the Agentic AI Foundation, a Linux Foundation project. AAIF also backs MCP, goose, and agentgateway, the other pieces of open agentic infrastructure most coding agents already touch. Development happens in public at github.com/agentsmd/agents.md.

If your team is picking agent tools this quarter, AGENTS.md gives you portable repo context. Your instructions travel with your repo, so switching agents doesn’t mean rewriting them.

What I’d actually ship

If I were adding AGENTS.md to a repo tomorrow with these numbers on the table:

Start with the twelve-line version. Package manager, key commands, generated-file boundary, “keep changes scoped.” Ship it.
Run one small agent task. Watch where the agent guesses wrong. Turn each guess into a line in the file.
Add folder-level AGENTS.md only when the root rules stop being true inside that folder. Not before.
Move enforcement into CI, not Markdown. Let AGENTS.md guide. Let checks decide.
Every time an agent produces a bad PR, ask whether AGENTS.md could have prevented it. If yes, update the file.

Don’t optimize for the average token cost. Optimize for the tail. The runs where the agent would otherwise waste an hour, invent a helper, or touch a file it shouldn’t.

That’s where AGENTS.md pays for itself.

All measurements captured from GitHub Copilot CLI in non-interactive mode on July 3, 2026, against local clones of AndreaGriffiths11/teamxray. Each of the four conditions (two tasks × with/without AGENTS.md) ran five times total from a clean git reset; reported values are medians, with full per-run spread noted in text. “AI Credits” are what Copilot CLI reports; under Copilot’s usage-based billing they map to GitHub AI Credits at the model provider’s list rates. Sources referenced: agents.md, openai/codex AGENTS.md, OpenAI Codex AGENTS.md guide, Cursor rules docs, Amp manual, Claude Code memory docs, Warp rules docs, GitHub Copilot repository custom instructions.