
Claude Code Rate Limits: What They Are and How to Stay Under Them

I built agent-bill-guard because I got tired of Claude Code sessions falling over in the middle of useful work and leaving me guessing whether I hit a token cap, an RPM wall, or just burned through context faster than I realized.

What Claude Code rate limits actually are

When people search for Claude Code rate limit, they usually mean Anthropic API limits showing up while they are using Claude Code as an agentic coding tool. Under the hood, the important constraints are Anthropic account limits. In the official docs, Anthropic measures Messages API limits in requests per minute (RPM), input tokens per minute (ITPM), and output tokens per minute (OTPM), and those limits depend on your usage tier. Higher account tiers get more room. There are also longer-window budget controls tied to the account tier, so your usable headroom is not just one number.

That distinction matters. A lot of developers expect one simple cap, like “I can send X prompts today.” In practice it is more annoying than that. You can be fine on request count and still get clipped on tokens. You can stay under token throughput for a while and still run into the broader account budget window later. Anthropic also notes that short bursts can trip a per-minute limit even if your average usage looks reasonable over a longer period.
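To make the burst behavior concrete, here is a minimal client-side sketch of a sliding-window check. Everything in it is my own illustration, not part of agent-bill-guard or the Anthropic API: the budget number and the names `TOKENS_PER_MINUTE`, `would_exceed`, and `record` are all hypothetical.

```python
import time
from collections import deque

# Hypothetical per-minute token budget; real limits depend on your tier.
TOKENS_PER_MINUTE = 20_000

# Each entry is (timestamp, tokens) for one recent request.
window = deque()

def would_exceed(tokens_needed, now=None):
    """Return True if spending `tokens_needed` now would break the
    per-minute budget, even if the longer-run average looks fine."""
    now = time.monotonic() if now is None else now
    # Drop entries that have aged out of the 60-second window.
    while window and now - window[0][0] > 60:
        window.popleft()
    used = sum(tokens for _, tokens in window)
    return used + tokens_needed > TOKENS_PER_MINUTE

def record(tokens_used, now=None):
    now = time.monotonic() if now is None else now
    window.append((now, tokens_used))
```

The point of the sliding window is exactly the burst case: a couple of heavy requests landing in the same minute trip the check even when your hourly average is far under budget.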

So the short version is this: Claude Code is only as free-running as the Anthropic limits behind it. If your tier is smaller, your margin for experiments is smaller too.

Why developers hit limits earlier than they expect

The surprise comes from how agentic workflows spend tokens. A single API call is easy to reason about. A coding session with tool use is not. The agent loops, revises a plan, reads files, calls tools, summarizes output, asks follow-up questions, and keeps hauling the session history forward. Then you add subagents or parallel work and suddenly a session that felt “small” turns into a steady token furnace.

I ran into this the annoying way: the terminal looked calm, the model wasn’t writing a novel, and then I still got rate-limited because three or four ordinary-looking turns were each dragging a much larger context than I had in my head. Another one that stings is when an agent gets stuck in a dumb loop, half-helpful and half-confused, and quietly chews through tokens while you think it is “just trying one more thing.”

Three things compound fast: agentic loops multiply requests, parallel subagents multiply concurrent usage, and large context windows make every turn heavier than the last. None of that feels dramatic in the moment. The meter still moves.
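A rough multiplication shows how fast this stacks. All numbers below are hypothetical, picked only to illustrate the compounding:

```python
turns_per_task = 12      # agent loop iterations (assumed)
subagents = 3            # parallel branches (assumed)

# Loops and parallelism multiply request count...
requests = turns_per_task * subagents           # 36 API calls

# ...and a large carried-forward context multiplies token load.
avg_context_tokens = 8_000                      # assumed average per request
input_tokens = requests * avg_context_tokens    # 288,000 input tokens
```

None of those individual numbers looks alarming, which is exactly why the total sneaks up on you.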

Why Claude Code sessions behave differently from one-off API calls

This is the part I think people underestimate. A one-off API call has a bounded prompt, a bounded response, and you can usually eyeball the cost. A Claude Code session is cumulative. Session history grows. Tool results get folded back in. The agent keeps carrying prior instructions, previous outputs, and local context forward unless something trims it.

That means the input side gets more expensive over time even if your latest message is short. You type one sentence like “try another fix,” but the model may be receiving that sentence plus a bunch of prior transcript, tool output, code context, and planning state. So the real token load is not what you just typed. It is the entire session baggage attached to it.
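A back-of-the-envelope model makes the cumulative effect visible. Assume, purely for illustration, that each request carries a fixed overhead plus everything the session has accumulated so far; the constants here are invented, not measured:

```python
# Illustrative only: model a session where every turn carries the full
# prior transcript forward as input context.
SYSTEM_AND_TOOLS = 2_000   # assumed fixed overhead per request (tokens)
PER_TURN_GROWTH = 1_500    # assumed transcript + tool output added per turn

def input_tokens_at_turn(n):
    """Input tokens the model receives on turn n (1-indexed)."""
    return SYSTEM_AND_TOOLS + (n - 1) * PER_TURN_GROWTH

def cumulative_input(turns):
    return sum(input_tokens_at_turn(n) for n in range(1, turns + 1))

# Turn 1 costs 2,000 input tokens; turn 20 costs 30,500 on its own,
# and the whole 20-turn session has consumed 325,000 input tokens.
```

Under these assumptions the per-turn cost grows linearly but the session total grows quadratically, which is why a session that started cheap ends up expensive no matter how short your latest message is.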

That is why rate limit pain inside Claude Code feels weirdly sudden. The first few turns are cheap. Later turns are not. You hit a point where every extra turn costs more input tokens than your intuition says it should, and then a limit error shows up right when the session was finally getting somewhere. Great.

What agent-bill-guard does about it

I built agent-bill-guard as a local Python proxy for exactly this problem. It is stdlib only, sits in front of the API traffic, and tracks token usage per session as responses come back. The immediate value is visibility. Instead of waiting for Anthropic to reject a request, you can see a session climbing toward the point where trouble is likely.

That matters more than it sounds. Once you can see live input_tokens and output_tokens accumulating by session, you can make deliberate choices: kill a runaway loop, start a fresh session, lower the cap, stop parallel branches, or just avoid handing the model another giant wall of context.

The second part is enforcement. agent-bill-guard can apply hard caps to a session so a bad loop or overly ambitious agent run does not keep burning tokens until you slam into rate limits or spend more than you intended. It is not fancy. That is on purpose. I wanted a local guardrail, not another service to manage.

Tiny example: intercepting token usage from responses

The core mechanic is simple. Anthropic responses include usage data. The proxy reads it, updates per-session counters, and decides whether the session is still within the cap.

import json
from collections import defaultdict

# Running totals per session, keyed by session ID.
session_totals = defaultdict(lambda: {"input": 0, "output": 0})

def record_usage(session_id, response_body, cap):
    data = json.loads(response_body)

    # Anthropic responses carry token counts in a "usage" object.
    usage = data.get("usage", {})
    input_tokens = usage.get("input_tokens", 0)
    output_tokens = usage.get("output_tokens", 0)

    session_totals[session_id]["input"] += input_tokens
    session_totals[session_id]["output"] += output_tokens

    totals = session_totals[session_id]
    if totals["input"] + totals["output"] > cap:
        raise RuntimeError(f"Session cap exceeded for {session_id}")

That is basically the whole idea. Watch the live token counters. Keep a running total per session. Fail closed if the session crosses a threshold you chose on purpose.

Honest limitations

agent-bill-guard does not prevent Anthropic rate limit errors directly. Anthropic still makes the server-side decision. If their backend says you exceeded RPM or token throughput, that is their call, not mine. The proxy cannot override it.
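When the server does say no, the rejection typically arrives as an HTTP 429, and a proxy can at least surface that cleanly instead of letting it look like a mystery failure. A minimal sketch, assuming the response carries a standard `retry-after` header (the exact headers you see may vary by account and endpoint; `describe_rate_limit` is my own name, not part of agent-bill-guard):

```python
def describe_rate_limit(status_code, headers):
    """Turn a rate limit rejection into a human-readable hint.

    Assumes a conventional 429 response with an optional
    `retry-after` header giving a delay in seconds.
    """
    if status_code != 429:
        return None
    retry_after = headers.get("retry-after")
    if retry_after is not None:
        return f"Rate limited; server suggests retrying after {retry_after}s"
    return "Rate limited; no retry-after header provided"
```

This does not dodge the limit, but it tells you which kind of wall you hit and how long to back off, which is most of what you need in the moment.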

What it does give you is visibility and control before you get there. You can see which session is getting heavy. You can spot a runaway agent before it eats the rest of your budget. You can keep sessions under a hard cap so you are less likely to blunder into limits accidentally. That is a different promise, and I think it is the honest one.

If you want the official Anthropic side of the story, read their rate limit docs. If you want fewer mystery failures in long Claude Code sessions, put a meter in front of the session and make the token burn visible.


Try it locally: git clone https://github.com/paprika-org/agent-bill-guard && cd agent-bill-guard && python proxy.py --cap 100000. The project lives at agent-bill-guard on GitHub.