Articles
Your AI bill is about to change. Here's what to do about it.
GitHub Copilot moves to usage-based billing on June 1. One deep debugging session can now eat most of your monthly allotment. Here's how to control costs without giving anything up.
Your AI bill is about to change. Here’s what to do about it.
GitHub Copilot moves to usage-based billing on June 1. If you’ve been running agentic workflows on a flat monthly plan, your next invoice is going to look different. One deep debugging session can now eat most of your monthly allotment.
You have real control here. Understanding what drives costs, and where you can cut without giving anything up, is the whole game.
What you’re actually paying for
Basic inline completions and Next Edit suggestions, the fast suggestions that appear as you type, stay unlimited across all paid plans. GitHub explicitly kept those out of the credit system. If you mostly want quick suggestions while you type, June 1 barely affects you.
What burns credits is everything else: premium models, multi-step agentic workflows, chat with file context, and advanced completions. If you’re running agents, it matters.
GitHub AI Credits are dollar-denominated. Every paid plan includes a monthly allotment equal to the subscription price:
- Copilot Pro: $10/month in AI Credits
- Copilot Pro+: $39/month in AI Credits
- Copilot Business: $19/user/month in AI Credits
- Copilot Enterprise: $39/user/month in AI Credits
Business and Enterprise get a temporary boost for June through August ($30 and $70 per user respectively) to soften the transition.
Every agentic request counts three token types:
- Input tokens: everything you send, including system prompt, conversation history, tool definitions, and file contents
- Output tokens: what the model generates back
- Cached tokens: context the model reuses from a previous call
Output tokens are the most expensive, at roughly 5x the input rate on Anthropic models. Input tokens are cheaper but compound fast in a multi-turn agent loop. Cached tokens cost about 10x less than fresh input on Anthropic models, and 50% less on OpenAI models.
That last one is where most people leave money on the table.
One behavior change worth knowing
Under the old system, when you burned through your quota you’d quietly fall back to a cheaper model and keep working. That’s gone. When your AI Credits run out, you stop or you buy more. No silent degradation.
For Business and Enterprise admins, you now have budget controls at the enterprise, cost center, and user level. Credits pool across the org so no individual seat strands unused capacity. Set those limits before June 1, not after.
One more: Copilot code review will consume GitHub Actions minutes in addition to AI Credits. If you’ve automated review heavily, factor that into your Actions budget too.
And if you’re on an annual Pro or Pro+ plan: you stay on PRU pricing until your plan expires, but model multipliers go up on June 1 for annual subscribers. You can convert to monthly early and get prorated credits if you want to switch now.
Repeated context is where costs pile up
Long-running agent sessions get expensive when the same context keeps showing up on every turn. System instructions, tool definitions, conversation history, and file context all add to the input bill.
GitHub handles some of this optimization inside Copilot for you, but the underlying lesson still matters: repeated context has a cost. The more unnecessary context you carry forward, the more you pay.
That makes context discipline one of the biggest levers you still control. Keep instructions tight. Avoid dragging full history forward forever. Use Agent Mode only when the task actually needs it.
Keep your context lean
Caching helps, but watch what you’re putting in your prompts in the first place.
The biggest waste sources: system prompts that include everything, full conversation history on every call, and tool definitions for tools the model won’t use.
System prompt bloat. A 500-token system prompt repeated across 10,000 requests is 5 million tokens. A 200-token version saves 3 million. Cut instructions that are redundant or never actually fire. The single highest-ROI change you can make in Copilot is adding Code only, no explanation. to your .github/copilot-instructions.md. In my testing, that one line cuts output tokens by 40-70% on code tasks. Add Bullets over paragraphs. No explanations unless asked. and I see another 30-60% reduction across the board. Run it on your own workloads before you standardize on it.
Conversation history. Most agents append every prior message to each new call. If you’re on turn 30, turns 1 through 20 are probably noise. Trim to the last N turns or summarize older context into a compact block.
Tool definitions. VS Code 1.118 (release notes) handles this by splitting the agent toolset into a compact always-available core of ~30 tools covering ~88% of tool calls, with a larger deferred set that only loads when explicitly needed. Apply the same pattern to your own agents.
MCP servers. Every MCP tool you have connected adds prompt overhead. In my own setups, somewhere around 100-500 tokens per agent step just to describe each tool. With 15 servers across a 15-step task, that’s roughly 265,000 tokens of overhead before the model does anything useful. Audit what’s actually connected and disable anything you’re not actively using.
Multi-agent setups compound all of this. In my experiments they can consume 4-15x more tokens than single calls when not optimized. If you’re parallelizing work, make sure the tasks are genuinely independent and you’re not just duplicating context everywhere.
Ask Mode vs Agent Mode
Not every Copilot task needs Agent Mode. Ask Mode is for lookups, explanations, and quick questions. Agent Mode is for multi-step work where the model needs to read files, run commands, and iterate.
Defaulting to Agent Mode for everything is like spinning up a full agentic loop to answer “what does this function do.” In my testing, using Ask Mode for simple questions saves 60-90% on those interactions. Reserve Agent Mode for tasks that actually need it.
Pick the right model for the job
Not every task needs Claude Opus. The cost gap between Haiku and Opus is 5x both ways. Most of your tasks don’t need Opus.
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Haiku 4.5 | $1.00 | $5.00 |
| Sonnet 4.6 | $3.00 | $15.00 |
| Opus 4.7 | $5.00 | $25.00 |
Haiku is fast and cheap. Good for summarization, classification, and simple Q&A. Sonnet handles most coding tasks, multi-file refactoring, and analysis without breaking the bank. Opus is for genuinely hard problems: complex architecture decisions, difficult debugging, and long-context reasoning that actually needs it.
Opus 4.7 ships with a new tokenizer that can produce up to 35% more tokens for the same input text. The rate card is unchanged, but your actual bill per request can still go up. Anthropic documents this on their pricing page. Benchmark your workloads on real traffic before assuming costs are identical to 4.6.
In practice, that means using the model picker as a cost lever. Default to Haiku for quick lookups, doc edits, and small tweaks. Switch to Sonnet for most coding work. Reserve Opus for genuinely hard problems: thorny debugging, architecture calls, or long-context reasoning where the cheaper models keep getting it wrong. Send simple tasks to cheaper models and escalate only when you need the firepower.
Local models
You don’t have to send everything to a cloud API.
GitHub Copilot supports Ollama directly, both in VS Code and in Copilot CLI. Models like Qwen, DeepSeek, and Llama run locally and show up in the same model picker as cloud models. No credits. No telemetry. Your code stays on your machine.
Setting it up in Copilot CLI is one command:
ollama launch copilot
Ollama wires Copilot CLI to a local model and drops you into the agent. To pick a specific model:
ollama launch copilot --model qwen3.5
For VS Code, add your local Ollama instance URL in Language Models settings. VS Code auto-discovers every installed model and adds it to the picker. Running AI locally isn’t just about saving money. For proprietary code, regulated environments, or air-gapped networks, it might be the only option.
Fair warning: CPU-only machines struggle with multi-step tool execution. LM Studio tends to work better than Ollama on CPU hardware because you get actual visibility into what’s happening. For model choice, Qwen3.5 Coder 7B is the best speed-to-quality tradeoff on consumer hardware. Qwen 2.5 Coder 32B is stronger for multi-step commands if you have the VRAM.
One thing that trips people up: Ollama defaults to 4K context even for models that support much more. For any agentic use, set the context length via environment variable before starting the server:
export OLLAMA_CONTEXT_LENGTH=32768
ollama serve
Or set it per-session inside the interactive REPL:
/set parameter num_ctx 32768
32K to 64K is the practical sweet spot for most coding workflows.
What a session actually costs
Basic completions are free no matter how many you use. The expensive part is complex reasoning on a premium model. Run Opus for a 10-turn reasoning stretch and you’re looking at ~$6.75 of your $10 Pro allotment in one go. Swap that same stretch to Sonnet and the whole session costs under $1.50.
That’s the lever. Most tasks don’t need Opus. The ones that do are worth it. Everything else should be on Sonnet or Haiku.
What to do before June 1
Check your preview bill first. GitHub is making preview invoices available in early May on the Billing Overview page at github.com. It updates as you use Copilot, so you’ll see exactly what you’re on track to spend before the switch goes live.
-
Add output controls to
copilot-instructions.md. Start withCode only, no explanation.It’s the highest-ROI single change you can make. -
Audit your MCP servers. Disconnect anything you’re not actively using. Each idle server costs tokens on every agent step.
-
Trim conversation history. Keep the last 5-10 turns in context, not the full session.
-
Use Ask Mode for simple questions. Reserve Agent Mode for tasks that actually need multi-step execution.
-
Pull a local model. Even if you don’t use it daily, having
qwen3.5running locally gives you a zero-cost option for anything you don’t want hitting a cloud API. -
Route by complexity. Stop defaulting to your most powerful model for everything. Save Opus for the problems that actually need it.
-
Admins: set budget controls now. Business and Enterprise have cost center and user-level limits. Configure them before June 1, not after your first surprise invoice.
That’s it. Go check your preview bill.
About the Author: Andrea Griffiths is a Senior Developer Advocate at GitHub, where she helps engineering teams adopt and scale developer technologies. She's passionate about making technical concepts accessible—to both humans and AI agents. Connect with her on LinkedIn, GitHub, or Twitter/X. · Read in Spanish · 阅读中文版