Your AI bill is about to change. Here’s what to do about it.
GitHub Copilot moves to usage-based billing on June 1. If you’ve been running agentic workflows on a flat monthly plan, your next invoice is going to look different. One deep debugging session can now eat most of your monthly allotment.
You have real control here. Understanding what drives costs, and where you can cut without giving anything up, is the whole game.
What you’re actually paying for
Basic inline completions and Next Edit suggestions, the fast suggestions that appear as you type, stay unlimited across all paid plans. GitHub explicitly kept those out of the credit system. If you mostly want quick suggestions while you type, June 1 barely affects you.
What burns credits is everything else: premium models, multi-step agentic workflows, chat with file context, and advanced completions. If you’re running agents, it matters.
GitHub AI Credits are dollar-denominated. Every paid plan includes a monthly allotment equal to the subscription price:
- Copilot Pro: $10/month in AI Credits
- Copilot Pro+: $39/month in AI Credits
- Copilot Business: $19/user/month in AI Credits
- Copilot Enterprise: $39/user/month in AI Credits
Business and Enterprise get a temporary boost for June through August ($30 and $70 per user respectively) to soften the transition.
Every agentic request counts three token types:
- Input tokens: everything you send, including system prompt, conversation history, tool definitions, and file contents
- Output tokens: what the model generates back
- Cached tokens: context the model reuses from a previous call
Output tokens are the most expensive, at roughly 5x the input rate on Anthropic models. Input tokens are cheaper but compound fast in a multi-turn agent loop. Cached tokens cost about 10x less than fresh input on Anthropic models, and 50% less on OpenAI models.
That last one is where most people leave money on the table.
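To make the math concrete, here's a rough sketch of how those three token types turn into dollars for a single request. The rates are illustrative, taken from the Sonnet row of the rate card later in this piece ($3/MTok input, $15/MTok output, cache reads at roughly 10% of the input rate); plug in your own model's numbers.

```python
# Rough per-request cost estimator. Rates are illustrative (Sonnet-class:
# $3/MTok input, $15/MTok output, cache reads at ~10% of the input rate).
INPUT_RATE = 3.00 / 1_000_000        # dollars per fresh input token
OUTPUT_RATE = 15.00 / 1_000_000      # dollars per output token
CACHE_READ_RATE = INPUT_RATE * 0.10  # cached tokens reuse a stored prefix

def request_cost(input_tokens: int, output_tokens: int, cached_tokens: int = 0) -> float:
    """Estimate the dollar cost of one agentic request."""
    return (
        input_tokens * INPUT_RATE
        + output_tokens * OUTPUT_RATE
        + cached_tokens * CACHE_READ_RATE
    )

# One turn with an 8,000-token prefix, fresh vs. served from cache:
print(request_cost(8_000, 500))                    # ~$0.0315
print(request_cost(0, 500, cached_tokens=8_000))   # ~$0.0099
```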
One behavior change worth knowing
Under the old system, when you burned through your quota you’d quietly fall back to a cheaper model and keep working. That’s gone. When your AI Credits run out, you stop or you buy more. No silent degradation.
For Business and Enterprise admins, you now have budget controls at the enterprise, cost center, and user level. Credits pool across the org so no individual seat strands unused capacity. Set those limits before June 1, not after.
One more: Copilot code review will consume GitHub Actions minutes in addition to AI Credits. If you’ve automated review heavily, factor that into your Actions budget too.
And if you’re on an annual Pro or Pro+ plan: you stay on PRU pricing until your plan expires, but model multipliers go up on June 1 for annual subscribers. You can convert to monthly early and get prorated credits if you want to switch now.
Prompt caching
If your agent sends the same system prompt, tool catalog, or file contents on every turn, and it does, you’re paying full input price for tokens the model already processed 30 seconds ago.
Prompt caching stores that prefix so subsequent requests reuse it at roughly 10% of the base input cost. Here’s what it looks like with the Anthropic SDK:
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant. You have access to the following tools...",
            "cache_control": {"type": "ephemeral"},  # Cache this prefix
        }
    ],
    messages=[
        {"role": "user", "content": "Refactor the auth module to use async/await"}
    ],
)

# Check your cache hit rate
print(response.usage.cache_read_input_tokens)      # tokens served from cache
print(response.usage.cache_creation_input_tokens)  # tokens written to cache
```
On a 50-turn agent session with an 8,000-token system prompt, caching drops your per-turn prefix cost from ~$0.024 to ~$0.0024 on Claude Sonnet ($3/MTok input). Cached tokens cost 10x less than fresh input, which for long agent sessions makes caching effectively non-optional. VS Code already handles this automatically for Copilot. It places cache breakpoints at the end of the system prompt, tool definitions, and turn boundaries. If you’re building your own agents, you need to wire this yourself.
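If you are wiring it yourself, the same pattern extends beyond the system prompt. Here's a minimal sketch, again with the Anthropic SDK, that marks cache breakpoints on the tool definitions and on the latest turn of the conversation, mirroring the breakpoints described above. The tool and message contents are hypothetical placeholders.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical tool catalog; in a real agent this is your full tool list.
tools = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "input_schema": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
]
# Breakpoint after the last tool definition: caches the prefix up to this point.
tools[-1]["cache_control"] = {"type": "ephemeral"}

messages = [
    {"role": "user", "content": "Find and fix the flaky test in tests/test_auth.py"},
    # ... earlier turns of the agent loop ...
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Now run the suite again and summarize the result.",
                # Breakpoint at the turn boundary so the prior conversation
                # is reused from cache on the next call.
                "cache_control": {"type": "ephemeral"},
            }
        ],
    },
]

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful coding assistant...",
            "cache_control": {"type": "ephemeral"},
        }
    ],
    tools=tools,
    messages=messages,
)
```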
Keep your context lean
Caching helps, but watch what you’re putting in your prompts in the first place.
The biggest waste sources: system prompts that include everything, full conversation history on every call, and tool definitions for tools the model won’t use.
System prompt bloat. A 500-token system prompt repeated across 10,000 requests is 5 million tokens. A 200-token version saves 3 million. Cut instructions that are redundant or never actually fire. The single highest-ROI change you can make in Copilot is adding `Code only, no explanation.` to your `.github/copilot-instructions.md`. In my testing, that one line cuts output tokens by 40-70% on code tasks. Add `Bullets over paragraphs. No explanations unless asked.` and I see another 30-60% reduction across the board. Run it on your own workloads before you standardize on it.
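For reference, a minimal `.github/copilot-instructions.md` built from exactly those lines might look like this; treat it as a starting sketch and tune the wording to your own codebase.

```markdown
# Copilot instructions

- Code only, no explanation.
- Bullets over paragraphs. No explanations unless asked.
```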
Conversation history. Most agents append every prior message to each new call. If you’re on turn 30, turns 1 through 20 are probably noise. Trim to the last N turns or summarize older context into a compact block.
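A minimal sketch of the trimming side, assuming a plain list of message dicts like the Anthropic example above; the summarization step is a placeholder you would back with a cheap model.

```python
def trim_history(messages: list[dict], keep_last: int = 10) -> list[dict]:
    """Keep the most recent turns and fold everything older into one compact block."""
    if len(messages) <= keep_last:
        return messages

    older, recent = messages[:-keep_last], messages[-keep_last:]
    # Placeholder: in practice, summarize `older` with a cheap model (e.g. Haiku)
    # rather than stitching raw text together.
    summary = " / ".join(
        str(m["content"])[:200] for m in older if m["role"] == "user"
    )
    return [
        {"role": "user", "content": f"Summary of earlier conversation: {summary}"},
        *recent,
    ]
```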
Tool definitions. VS Code 1.118 (release notes) handles this by splitting the agent toolset into a compact always-available core of ~30 tools covering ~88% of tool calls, with a larger deferred set that only loads when explicitly needed. Apply the same pattern to your own agents.
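Applied to your own agent, the pattern is just two tiers of tool definitions: a small core set sent on every request, and a deferred set you attach only when the task calls for it. A rough sketch; the categories, names, and routing check below are illustrative, not anything VS Code exposes.

```python
# Illustrative tool tiers: a compact core sent on every call, plus deferred
# tools attached only when the task needs them.
CORE_TOOLS = [
    {
        "name": "read_file",
        "description": "Read a file from the workspace",
        "input_schema": {"type": "object", "properties": {"path": {"type": "string"}}},
    },
    {
        "name": "run_command",
        "description": "Run a shell command",
        "input_schema": {"type": "object", "properties": {"cmd": {"type": "string"}}},
    },
]

DEFERRED_TOOLS: dict[str, list[dict]] = {
    "browser": [],   # web automation tools, loaded on demand
    "database": [],  # schema and query tools, loaded on demand
}

def tools_for_task(task: str) -> list[dict]:
    """Start from the core set and attach deferred categories the task mentions."""
    tools = list(CORE_TOOLS)
    for category, extra in DEFERRED_TOOLS.items():
        if category in task.lower():  # crude routing; swap in your own check
            tools.extend(extra)
    return tools
```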
MCP servers. Every MCP tool you have connected adds prompt overhead. In my own setups, somewhere around 100-500 tokens per agent step just to describe each tool. With 15 servers across a 15-step task, that’s roughly 265,000 tokens of overhead before the model does anything useful. Audit what’s actually connected and disable anything you’re not actively using.
Multi-agent setups compound all of this. In my experiments they can consume 4-15x more tokens than single calls when not optimized. If you’re parallelizing work, make sure the tasks are genuinely independent and you’re not just duplicating context everywhere.
Ask Mode vs Agent Mode
Not every Copilot task needs Agent Mode. Ask Mode is for lookups, explanations, and quick questions. Agent Mode is for multi-step work where the model needs to read files, run commands, and iterate.
Defaulting to Agent Mode for everything is like spinning up a full agentic loop to answer “what does this function do.” In my testing, using Ask Mode for simple questions saves 60-90% on those interactions. Reserve Agent Mode for tasks that actually need it.
Pick the right model for the job
Not every task needs Claude Opus. The cost gap between Haiku and Opus is 5x both ways. Most of your tasks don’t need Opus.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Haiku 4.5 | $1.00 | $5.00 |
| Sonnet 4.6 | $3.00 | $15.00 |
| Opus 4.7 | $5.00 | $25.00 |
Haiku is fast and cheap. Good for summarization, classification, and simple Q&A. Sonnet handles most coding tasks, multi-file refactoring, and analysis without breaking the bank. Opus is for genuinely hard problems: complex architecture decisions, difficult debugging, and long-context reasoning that actually needs it.
Opus 4.7 ships with a new tokenizer that can produce up to 35% more tokens for the same input text. The rate card is unchanged, but your actual bill per request can still go up. Anthropic documents this on their pricing page. Benchmark your workloads on real traffic before assuming costs are identical to 4.6.
A small routing function makes that decision explicit (model IDs follow the table above):

```python
def route_to_model(complexity: str) -> str:
    if complexity == "simple":
        return "claude-haiku-4-5"   # $1/MTok input
    elif complexity == "medium":
        return "claude-sonnet-4-6"  # $3/MTok input
    else:
        return "claude-opus-4-7"    # $5/MTok input
```
Send simple tasks to cheaper models and escalate only when you need the firepower.
Local models
You don’t have to send everything to a cloud API.
GitHub Copilot supports Ollama directly, both in VS Code and in Copilot CLI. Models like Qwen, DeepSeek, and Llama run locally and show up in the same model picker as cloud models. No credits. No telemetry. Your code stays on your machine.
Setting it up in Copilot CLI is one command:
```bash
ollama launch copilot
```
Ollama wires Copilot CLI to a local model and drops you into the agent. To pick a specific model:
```bash
ollama launch copilot --model qwen3.5
```
For VS Code, add your local Ollama instance URL in Language Models settings. VS Code auto-discovers every installed model and adds it to the picker. Running AI locally isn’t just about saving money. For proprietary code, regulated environments, or air-gapped networks, it’s the only option.
Fair warning: CPU-only machines struggle with multi-step tool execution. LM Studio tends to work better than Ollama on CPU hardware because you get actual visibility into what’s happening. For model choice, Qwen3.5 Coder 7B is the best speed-to-quality tradeoff on consumer hardware. Qwen 2.5 Coder 32B is stronger for multi-step commands if you have the VRAM.
One thing that trips people up: Ollama defaults to 4K context even for models that support much more. For any agentic use, set the context length via environment variable before starting the server:
```bash
export OLLAMA_CONTEXT_LENGTH=32768
ollama serve
```
Or set it per-session inside the interactive REPL:
```
/set parameter num_ctx 32768
```
32K to 64K is the practical sweet spot for most coding workflows.
What a session actually costs
Basic completions are free no matter how many you use. The expensive part is complex reasoning on a premium model. Run Opus for a 10-turn reasoning stretch and you're looking at ~$6.75 of your $10 Pro allotment in one go. Priced off the rate card above, the same stretch on Sonnet runs about $4, and on Haiku it comes in under $1.50.
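The scaling is just arithmetic on the rate card. Here's a quick back-of-the-envelope in Python; the token mix is an assumption chosen to match the ~$6.75 Opus figure, not a measurement.

```python
# Re-price the same 10-turn stretch using the rate card above ($/MTok input, output).
RATES = {
    "opus-4-7": (5.00, 25.00),
    "sonnet-4-6": (3.00, 15.00),
    "haiku-4-5": (1.00, 5.00),
}

def session_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = RATES[model]
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# Assumed mix: ~1.1M input tokens and ~50K output tokens across the session.
for model in RATES:
    print(model, round(session_cost(model, 1_100_000, 50_000), 2))
# opus-4-7 6.75, sonnet-4-6 4.05, haiku-4-5 1.35
```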
That’s the lever. Most tasks don’t need Opus. The ones that do are worth it. Everything else should be on Sonnet or Haiku.
What to do before June 1
Check your preview bill first. GitHub is making preview invoices available in early May on the Billing Overview page at github.com. It updates as you use Copilot, so you’ll see exactly what you’re on track to spend before the switch goes live.
- Add output controls to `copilot-instructions.md`. Start with `Code only, no explanation.` It's the highest-ROI single change you can make.
- Enable prompt caching on any agents you've built. Add `cache_control` to your system prompt block in the Anthropic SDK. Check `cache_read_input_tokens` in the response to confirm it's working.
- Audit your MCP servers. Disconnect anything you're not actively using. Each idle server costs tokens on every agent step.
- Trim conversation history. Keep the last 5-10 turns in context, not the full session.
- Use Ask Mode for simple questions. Reserve Agent Mode for tasks that actually need multi-step execution.
- Pull a local model. Even if you don't use it daily, having `qwen3.5` running locally gives you a zero-cost option for anything you don't want hitting a cloud API.
- Route by complexity. Stop defaulting to your most powerful model for everything. Save Opus for the problems that actually need it.
- Admins: set budget controls now. Business and Enterprise have cost center and user-level limits. Configure them before June 1, not after your first surprise invoice.
That’s it. Go check your preview bill.
About the Author: Andrea Griffiths is a Senior Developer Advocate at GitHub, where she helps engineering teams adopt and scale developer technologies. She's passionate about making technical concepts accessible to both humans and AI agents. Connect with her on LinkedIn, GitHub, or Twitter/X.