I’m building Tokamak, a menu bar app for macOS that monitors your Claude Max quota. A couple of weeks ago, Anthropic published this in their Terms of Service:
“You may not use OAuth or similar authorization mechanisms to allow third-party applications to access Claude on behalf of users.”
And there I was, reading your Claude Max quota with browser cookies and an undocumented endpoint, staring at the screen and thinking: “What now?”
How it works today (the hack that works)
Tokamak needs to know your quota percentage. That 42% you see on claude.ai when you’ve been hitting it hard for a while. The problem is that Anthropic has no public quota API. There’s no documented GET /api/quota with an API key.
What does exist is an internal endpoint that claude.ai itself uses:
GET /api/organizations/{org_id}/usage
It returns the organization’s usage data, including an integer utilization percentage from 0 to 100: the same number claude.ai renders in its own UI. (The exact payload shape is undocumented, so I won’t pretend to reproduce it here.)
To call that endpoint, you need session cookies from a logged-in user. Tokamak solves this with a hidden WKWebView: the user logs into claude.ai inside the app, the cookies stay in the WebView, and the app uses them to poll every 30 seconds.
It works. It’s been working for months. But it’s an elegant hack, not a robust solution. We’re using an internal API that Anthropic can change, break, or block without notice.
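The flow can be sketched in Python (Tokamak itself is Swift; the cookie handling here is illustrative, and the payload schema is undocumented and can change without notice):

```python
import json
import urllib.request

BASE = "https://claude.ai"

def usage_url(org_id: str) -> str:
    """Build the internal usage endpoint URL that claude.ai itself calls."""
    return f"{BASE}/api/organizations/{org_id}/usage"

def fetch_usage(org_id: str, cookie_header: str) -> dict:
    """Poll the endpoint with the session cookies captured from the WebView.
    The response schema is undocumented and may change without notice."""
    req = urllib.request.Request(
        usage_url(org_id),
        headers={"Cookie": cookie_header, "Accept": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# In Tokamak the equivalent call runs every 30 seconds from the hidden
# WKWebView's cookie store; here you would schedule fetch_usage() on a timer.
```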
The legal limbo
The OAuth ban is clear. But does it apply to cookies? Technically we’re not using OAuth or “similar authorization mechanisms.” The user logs in directly to claude.ai inside a standard WebView. It’s like opening Safari inside the app.
But “similar authorization mechanisms” is a phrase ambiguous enough for an Anthropic lawyer to interpret however they please. Today we’re in legal limbo: it’s not explicitly prohibited, but not explicitly allowed either.
And if one day Anthropic decides that third-party cookies aren’t cool either, we’re left without quota data. Just like that, overnight.
Plan B: estimating without API
This is where it gets interesting. The question is: can we estimate your quota percentage using only local data?
Before answering, you need to understand how Claude Max quota works. It’s not a simple token counter.
Quota is equivalent cost, not tokens
Anthropic doesn’t count tokens like peanuts in a bowl. It counts equivalent cost at API prices. Each token has a different “price” depending on its type and model:
| Token type | Relative weight |
|---|---|
| Input (your message) | 1x (reference) |
| Output (response) | 5x input |
| Cache read | 0.1x (almost free) |
| Opus (model multiplier) | 1.67x Sonnet |
In plain language: a short Opus message where Claude thinks hard and writes a long response consumes much more quota than a long message from you to Haiku with a short response.
It’s like the electricity bill. You don’t pay for “hours plugged in.” You pay per kWh, and the oven uses 10x more than the bathroom light bulb.
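That weighting can be sketched in a few lines of Python. The weights mirror the table above (approximations derived from public API pricing, not numbers Anthropic publishes for quota); the Haiku multiplier is my guess:

```python
# Relative weights from the table above: output 5x input, cache reads 0.1x.
WEIGHTS = {"input": 1.0, "output": 5.0, "cache_read": 0.1}
# Model multipliers: Opus ~1.67x Sonnet; the Haiku value is a guess.
MODEL_MULT = {"sonnet": 1.0, "opus": 1.67, "haiku": 0.2}

def equivalent_cost(tokens: dict, model: str) -> float:
    """Cost in 'input-token equivalents': tokens weighted by type, then by model."""
    base = sum(WEIGHTS[kind] * count for kind, count in tokens.items())
    return base * MODEL_MULT[model]

# A short Opus message with a long response...
opus = equivalent_cost({"input": 200, "output": 2000}, "opus")    # ≈ 17,034
# ...versus a long Haiku message with a short reply:
haiku = equivalent_cost({"input": 3000, "output": 300}, "haiku")  # ≈ 900
```

Same conversation length, wildly different quota bite, exactly as the oven/light-bulb analogy suggests.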
The 5-hour window
Quota doesn’t accumulate indefinitely. It works like a 5-hour sliding window. Each token you send “expires” exactly 5 hours later. If at 10:00 AM you burned 30% of quota with an intense session, at 3:00 PM that 30% gets freed up.
Imagine a water bucket with a hole. The water you pour in (your messages) drains out the bottom 5 hours later. The water level is your utilization.
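The bucket is just a sliding-window sum. A minimal sketch (timestamps and costs illustrative):

```python
from datetime import datetime, timedelta

WINDOW = timedelta(hours=5)

def window_cost(events, now):
    """events: list of (timestamp, equivalent_cost) tuples. Only events from
    the last 5 hours still 'weigh'; older ones have drained out of the bucket."""
    return sum(cost for ts, cost in events if now - ts < WINDOW)

now = datetime(2025, 1, 1, 15, 0)          # 3:00 PM
events = [
    (datetime(2025, 1, 1, 9, 55), 30.0),   # sent >5h ago: already expired
    (datetime(2025, 1, 1, 12, 0), 20.0),   # still inside the window
]
# window_cost(events, now) == 20.0 — the morning burst has drained out
```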
The three components
Right, so we know quota is equivalent cost and moves in 5-hour windows. Can we replicate that locally?
The honest answer: not with surgical precision, but with useful precision. We’re not going to nail 73%. But we will know you’re “in the orange zone, be careful.” And it turns out that’s exactly what you need.
The system has three pieces, each with a clear role:
1. The Estimator (M1): the accountant
This one does the math. It reads the local files from ~/.claude/ (which contain all the tokens you’ve sent and received, with timestamps and model), sums the equivalent cost in the last 5 hours, and divides by the tier’s “total budget.”
estimation = cost_in_5h_window / tier_budget * 100
It’s like keeping track of how much electricity you’ve used this month: you know the kWh from the oven, fridge, computer. You add it up, divide by your rate, and have an estimate of the bill.
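As code, the M1 formula is just a clamped ratio (a sketch; `tier_budget` is precisely the unknown discussed next):

```python
def estimate_pct(window_cost: float, tier_budget: float) -> float:
    """M1: equivalent cost in the 5h window over the tier's total budget,
    expressed as a percentage and clamped to 0-100."""
    return min(100.0, max(0.0, window_cost / tier_budget * 100.0))

# estimate_pct(42.0, 100.0) == 42.0; anything past the budget clamps at 100.
```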
The problem: we don’t know the exact total budget. Anthropic doesn’t publish it. So we need to calibrate it. That’s where the second component comes in.
2. The Calibrator (A3): the one that learns by watching
This is the most elegant one. It’s called the Decay Estimator and works like this:
When you stop working (you go eat, to a meeting, to sleep), the tokens you sent 5 hours ago start to expire. Quota drops by itself. And since we know exactly what tokens we sent and when (they’re in the local JSONLs), we can observe how much quota drops when those specific tokens expire.
It’s like the water bucket example: if you pour a liter of red water and another of blue water, and you see that when the red water comes out the level drops 3 marks and when the blue comes out it drops 1 mark, now you know that red water weighs 3x more than blue. Without anyone telling you. Just by observing.
Here’s the key point: this happens naturally. It doesn’t consume quota. It doesn’t need experiments. Every time you get up for coffee, the calibrator is learning the real weights of each token type.
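A toy version of that learning, with made-up numbers. Two coffee breaks in which different mixes of token types expire give two linear equations; two equations pin down two per-token weights (the real calibrator would accumulate many observations and least-squares them):

```python
def learn_weights(drops):
    """Each observation: ((n_a, n_b), observed_quota_drop) — counts of two
    token types that expired together, and how far quota fell when they did.
    Solves the 2x2 system by Cramer's rule."""
    (a1, b1), d1 = drops[0]
    (a2, b2), d2 = drops[1]
    det = a1 * b2 - a2 * b1
    w_a = (d1 * b2 - d2 * b1) / det
    w_b = (a1 * d2 - a2 * d1) / det
    return w_a, w_b

# Coffee break 1: 100 output + 300 input tokens expired; quota fell 8 points.
# Coffee break 2: 50 output + 500 input tokens expired; quota fell 7.5 points.
w_out, w_in = learn_weights([((100, 300), 8.0), ((50, 500), 7.5)])
# Recovers w_out = 0.05, w_in = 0.01: output "weighs" 5x input, learned
# purely by watching the bucket drain.
```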
3. The Classifier (A5): the one that translates numbers to decisions
Here’s a counterintuitive insight. The first reaction is to try to estimate the exact percentage: “you’re at 73%.” But it turns out that’s mathematically very hard to do precisely. The API returns integers from 0 to 100, there are hidden factors we can’t measure, and activity from other devices (claude.ai web, mobile) is invisible to us.
But do you need to know if you’re at 73% or 77%? No. You need to know what zone you’re in:
| Zone | Range | Meaning |
|---|---|---|
| Green | 0-60% | Relax, go hard |
| Yellow | 60-80% | Moderate, especially with Opus |
| Orange | 80-95% | Careful, approaching the limit |
| Red | 95-100% | Stop or switch to Haiku |
Classifying into 4 zones instead of regressing 100 values is much easier and much more useful. If the real quota is 72% and the classifier says “yellow,” it’s right. If the estimator says “72% ±20%,” the 52-92% range crosses three zones and tells you nothing.
It’s the difference between a digital thermometer with two decimal places and a mercury thermometer with three colors: cold, warm, fever. To decide whether to take ibuprofen, the three-color one serves you just as well or better.
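The classifier itself is trivially small, which is part of the point. A sketch using the zone boundaries from the table above:

```python
# Upper bounds (exclusive) per zone, from the table above.
ZONES = [(60, "green"), (80, "yellow"), (95, "orange")]

def classify(pct: float) -> str:
    """Map an (uncertain) percentage estimate to one of the four zones."""
    for upper, zone in ZONES:
        if pct < upper:
            return zone
    return "red"

# classify(42) == "green", classify(72) == "yellow", classify(97) == "red"
```

Even if the input percentage is off by 10 points, the zone is usually still right, which is exactly the robustness argument.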
The complete pipeline
```mermaid
graph TD
    A["~/.claude/*.jsonl<br/>(local tokens)"] --> B["Estimator M1<br/>(accountant)"]
    B --> |"equivalent cost<br/>in 5h window"| D["Classifier A5<br/>(traffic light)"]
    C["Calibrator A3<br/>(observes decay)"] --> |"real weights<br/>per token type"| B
    E["Quota API<br/>(while it works)"] --> |"ground truth<br/>for calibration"| F["QuotaCalibrationLog<br/>(paired data)"]
    F --> |"continuous<br/>calibration"| C
    F --> |"tier<br/>budget"| B
    D --> G["🟢 Green"]
    D --> H["🟡 Yellow"]
    D --> I["🟠 Orange"]
    D --> J["🔴 Red"]
    style E stroke-dasharray: 5 5
    style F stroke-dasharray: 5 5
```
While the API works, we have the real data and simultaneously we’re training the estimator with paired data: “when you had these local tokens, the API said 42%.” We accumulate thousands of these pairs. The day the API disappears, the estimator already knows how much each token type “weighs” because it learned by observing.
How precision improves over time
The trick of the whole system is the QuotaCalibrationLog: a record that’s saved every 30 seconds with two things:
- What the API says (the real data, the ground truth).
- What we calculate locally (tokens in the window, equivalent cost, model used).
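Each log entry pairs those two views of the same instant. A sketch of the record (field names are illustrative, not Tokamak’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class QuotaCalibrationRecord:
    """One paired observation, logged every 30 seconds while the API works."""
    timestamp: float          # unix time of the sample
    api_utilization: int      # ground truth: integer 0-100 from the quota API
    local_window_cost: float  # our equivalent cost in the 5h window
    tokens_by_type: dict      # e.g. {"input": ..., "output": ..., "cache_read": ...}
    model: str                # model active at sample time

rec = QuotaCalibrationRecord(
    timestamp=1700000000.0, api_utilization=42,
    local_window_cost=310.5, tokens_by_type={"input": 100, "output": 40},
    model="sonnet",
)
```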
Each record is a calibration point. At one sample every 30 seconds, that’s up to 2,880 points per day; in two weeks we have enough data to:
- Calibrate the tier budget: we know that when local cost reaches $X, the API says 100%. Now we know $X.
- Verify the weights: if our ratios (output 5x input, Opus 1.67x Sonnet) are wrong, the paired data reveals it.
- Detect Anthropic changes: if one day the estimator residuals spike, we know the formula changed. We recalibrate automatically.
The first day, the estimator uses public API prices as approximation. After a week, it has empirical weights. After a month, it has a fairly tuned model of real behavior.
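Budget calibration from the paired data can be as simple as a robust ratio estimate. A sketch (numbers invented; the real log would feed in thousands of pairs):

```python
def calibrate_budget(pairs):
    """pairs: (local_window_cost, api_pct) samples with api_pct > 0.
    Each pair implies budget ≈ cost / (pct / 100); taking the median is
    robust to the integer quantization of api_pct and to invisible
    activity from other devices."""
    implied = sorted(cost * 100.0 / pct for cost, pct in pairs if pct > 0)
    mid = len(implied) // 2
    if len(implied) % 2:
        return implied[mid]
    return (implied[mid - 1] + implied[mid]) / 2

# If the API read 42% when local cost was 420, 50% at 500, and 10% at 101:
budget = calibrate_budget([(420.0, 42), (500.0, 50), (101.0, 10)])
# → the implied budgets are 1000, 1000, 1010; the median says ~1000.
```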
The honest limitations
It would be dishonest to sell this as a silver bullet. There are things we can never capture:
- Activity from other devices. If you use claude.ai from mobile, those tokens consume quota but are invisible to Tokamak. The estimator will underestimate.
- Hidden factors. If Anthropic applies penalties based on time of day, server load, or task complexity, we can’t measure it.
- Point precision. The estimator error is around ±15-25%. For an exact percentage, that’s not enough. For knowing whether you’re in green, yellow, orange, or red, it is.
That’s why the classifier is the component the user sees, not the estimator directly. We don’t tell you “you’re at 73%.” We tell you “you’re in the yellow zone.” And we get that right 85-90% of the time.
“Why not just throw an ML model at it?”
If you’ve made it this far, you’re probably thinking: “Dude, why not train a CoreML model with the calibration data and call it a day?” It’s the obvious question. And the answer is that it would be like using a sledgehammer to crack a nut that’s also behind bulletproof glass.
The problem is the glass, not the hammer
Imagine the candy jar at a bar counter. You know there are gummy bears, licorice, and gum, each with a different “price.” You want to know what percentage of the jar you’ve eaten. But you can only look at the jar from outside and the glass distorts: you only distinguish if it’s “full,” “half,” “nearly empty,” or “empty.” You can’t see individual candies.
It doesn’t matter if you put on $10,000 glasses or an electron microscope. The glass still distorts the same. More optical power doesn’t give you more information if the signal you receive is blurry at the source.
That’s exactly what happens here:
- Single observable: the API returns an integer between 0 and 100. One number. That’s all the information we receive from the real system.
- Brutal quantization: most consecutive observations (every 30s) have delta = 0. When there’s change, it’s 1% or more. No fine information.
- 6+ unknowns: weight of output, input, cache read, cache write, for each model, plus total budget, plus possible hidden factors (GPU time, server load).
A system with 6 unknowns and 1 equation. It doesn’t matter whether you use linear regression, gradient boosting, a neural network, or a transformer: it’s an underdetermined problem. There’s not enough information in the signal to solve for all the unknowns, not even with a billion parameters.
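A toy demonstration of that underdetermination, with made-up numbers: two clearly different weight vectors that the quantized signal cannot tell apart.

```python
def reading(weights, tokens, budget=10_000):
    """What the API would show: equivalent cost over budget, quantized to an int."""
    cost = sum(w * n for w, n in zip(weights, tokens))
    return int(cost * 100 / budget)

# (input, output) token counts for three 5h windows
sessions = [(400, 400), (500, 480), (300, 310)]
w_true = (1.0, 5.0)   # the "real" weights
w_alt  = (1.5, 4.5)   # noticeably different weights...
# ...that nonetheless produce identical integer readings on all three windows:
indistinguishable = all(reading(w_true, s) == reading(w_alt, s) for s in sessions)
```

No model architecture can distinguish parameter sets that produce byte-identical observations; more capacity just memorizes the same blur faster.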
What ML wins vs. what it loses
| | Linear regression (M1) | CoreML / Neural network |
|---|---|---|
| Precision | ±15-25% | ±15-25% (same ceiling) |
| Interpretability | Total (you see each weight) | Black box |
| Debugging | “The output_opus weight is wrong” | “Accuracy dropped” |
| Bundle size | 0 KB | 2-5 MB framework |
| Recalibration | Change a JSON | Re-train + convert .mlmodel |
| Drift detection | Compare residuals | Compare… what? |
If Anthropic changes their quota formula tomorrow, with the linear estimator you see exactly which residual spiked and which weight to adjust. With an ML model you see that the metric dropped and have to re-train blindly with new data, hoping it converges.
The thermometer analogy
If your mercury thermometer has 1°C resolution, buying a 0.001°C digital thermometer doesn’t improve measurement if what you’re measuring is oven temperature through the same thick glass wall. The bottleneck is the wall, not the instrument.
Here the “wall” is integer quantization + hidden factors + invisible activity from other devices. No model, however sophisticated, can see through it.
The Phase Detector (the traffic light) works precisely because it accepts this limitation: instead of trying to guess “73.2%,” it classifies into 4 zones. And that is robust with the available information.
What we’re doing right now
The most urgent thing is to start logging data. Every day without logging is a day of lost calibration. So the priority is:
- Now: implement the QuotaCalibrationLog. No UI changes. Just accumulate paired data silently.
- In 2-4 weeks: with accumulated data, activate the estimator as automatic fallback when the API fails.
- In 4+ weeks: activate the passive calibrator (Decay Estimator) to learn empirical weights.
If tomorrow Anthropic cuts off the tap, the estimator will already be calibrated and users will see a traffic light instead of a percentage. It’s not perfect. But it’s useful. And it’s honest about its limitations, which is more than you can say about most dashboards out there.
Sometimes the best plan B is one you start building before you need it.