The End of Tokenmaxxing – O’Reilly

The practice of tokenmaxxing seems to be on its way out, even before I’ve had a chance to write about it. Good riddance. It was inevitable that burning tokens to create the appearance of productivity would last only until the accountants found out, and the strictest accountant of all is the personal checkbook. What has made many developers think about the cost of AI is the change in GitHub Copilot’s usage fees. Copilot has gone from a monthly fee with unlimited usage to a monthly fee that buys a limited number of credits, which are used to pay the AI provider of your choice. One credit is equivalent to 0.01 USD; when you exhaust your credits, you can upgrade your account or pay for additional credits as you go.

The question isn’t why this didn’t happen earlier; it’s why it is happening now. Tokenmaxxing is both a creation and a victim of two broad trends in artificial intelligence. First, starting with OpenAI, the major AI providers have all been playing a blitzscaling game that prioritizes user growth over profitability. Providing AI services for free gets you more users, and in the long run, the blitzscalers expect to figure out how to make money from end-user fees, selling user data, or advertising. This process inevitably runs its course, and we are still very much on that road.

Second, token usage increased in late 2025. The emergence of “reasoning models,” which use tokens to maintain an internal dialogue while solving a problem, has increased the number of tokens used to respond to each prompt. Reasoning tokens are the model’s conversation with itself about possible responses to the prompt, and they often outnumber the tokens in the prompt and the response themselves. Whether users see the reasoning process or not (they often don’t), the reasoning tokens are added to the bill. They are frequently counted as “output tokens” because they are model-generated, and output tokens are more expensive than input tokens.
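To make the billing effect concrete, here is a minimal sketch of how reasoning tokens change the cost of a single call when they are billed as output tokens. The prices and token counts are invented for illustration and are not any provider’s actual rates; `Pricing` and `call_cost` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Hypothetical prices in USD per million tokens (not real rates)."""
    input_per_million: float
    output_per_million: float


def call_cost(pricing: Pricing, input_tokens: int,
              visible_output_tokens: int, reasoning_tokens: int) -> float:
    """Cost of one call, assuming reasoning tokens are billed as output."""
    billed_output = visible_output_tokens + reasoning_tokens
    total = (input_tokens * pricing.input_per_million
             + billed_output * pricing.output_per_million)
    return total / 1_000_000


# Assumed numbers: output tokens cost five times as much as input tokens.
pricing = Pricing(input_per_million=2.0, output_per_million=10.0)

# Same prompt and same visible answer, with and without hidden reasoning.
without_reasoning = call_cost(pricing, 1_000, 500, 0)
with_reasoning = call_cost(pricing, 1_000, 500, 4_000)
```

Under these assumed numbers, the call with 4,000 hidden reasoning tokens costs several times more than the same call without them, even though the user sees an identical 500-token answer.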

The emergence of agents has also multiplied the rate at which users consume tokens. In May 2025, Simon Willison quoted a definition of an agent from Anthropic’s Hannah Moran: “Agents are models that use tools in a loop.” The Tradence blog writes: “An agent loop is a repetitive cycle in which the AI reads existing data, thinks about what it means, chooses an action, executes it, verifies what is happening and starts over.” If you’ve ever watched Claude Code, OpenClaw, or any other agent run, you’ve seen a single request turn into several model calls, each using hundreds of tokens, if not thousands. In addition to the current request, a single call generated by the agent can contain the accumulated context of the entire task and related documents. Between reasoning and agents, token usage rises by a factor of hundreds.
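The effect of accumulated context can be sketched with a little arithmetic. The function below is a simplified model, not a description of how any particular agent works: it assumes every call resends the whole context so far, and that each step adds a fixed number of tokens (tool output, intermediate notes). The function name and all numbers are hypothetical.

```python
def agent_input_tokens(steps: int, base_context: int,
                       tokens_added_per_step: int) -> list[int]:
    """Input tokens sent on each call when each call resends the
    accumulated context (a simplifying assumption)."""
    return [base_context + i * tokens_added_per_step for i in range(steps)]


# Assumed: a 2,000-token starting context that grows by 1,500 tokens per step.
per_call = agent_input_tokens(steps=10, base_context=2_000,
                              tokens_added_per_step=1_500)
total_input = sum(per_call)          # input tokens billed across the loop
single_call_input = per_call[0]      # what one non-agentic call would send
```

With these made-up numbers, ten steps send 87,500 input tokens in total, compared with 2,000 for a single call, and that is before counting any output or reasoning tokens.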

An increase in token usage might not be a problem if it led to problems being solved and tasks being completed more efficiently. But it conflicts with the blitzscalers’ loss-leader pricing; their willingness to operate at a loss to dominate the market has limits. Regardless of whether the number of AI users is growing, the amount of computation, and thus the cost, per user grows as the use of agents increases. Reasoning models increased token usage; agents made the problem worse; and the result is higher prices.¹ Microsoft/GitHub doesn’t want to foot the AI bills for Copilot customers. We have yet to see across-the-board price increases from the AI providers themselves. But we’ve seen token credits at GitHub, and we’ve seen Anthropic and OpenAI price their more capable models much higher than older or less capable models. Fable costs twice as much as Opus 4.8, and while some writers described this price as “fantastic,” that’s probably because they were expecting a bigger increase. While Fable can delegate tasks to Anthropic’s less expensive models, most early adopters noticed that with Fable, token usage went up rather than down. Anthropic’s switch to token-based billing for its Agent SDK (currently on hold) is another sign that the days of inexpensive AI are coming to an end. OpenAI’s story is similar: GPT 5.5 costs twice as much as GPT 5.4 per million tokens.

It is also important to take capacity into account. Huge data centers have been in the news, but these data centers have not been built yet. Importantly, the electrical infrastructure needed to support those data centers (transmission lines and generators) has not been built either, and this is not an investment over which AI companies have much control. They can build their own power generation facilities on a data center campus, but that’s a huge investment in technologies they’re not familiar with. Even if you generate power locally, you need other types of infrastructure: railways for coal, pipelines for gas. This is not (yet) an article about data center energy consumption and its consequences, but energy is another factor limiting the growth of token usage. We’ve seen outages at Anthropic that were caused by capacity limits, and Anthropic has responded by leasing unused data center capacity from SpaceX. But another way to respond to demand that current capacity cannot meet is to raise prices, limiting customers to those who can pay. Managers, accountants, and freelance developers have noticed the increase.

Token optimization and accountability are an inevitable consequence of upward pressure on token prices. One way to build accountability is to improve governance, a path that Benny Helin describes in “Support has ended: What is the actual cost to agents using the tools?” Better governance is achieved by building an observability layer that lets you see exactly what agents and models are doing. With a well-designed observability layer, you can see whether the data sent to the model is growing with each call, whether the model is using the right tools, whether tools are being called too often, and a lot of other information that will tell you whether your agent is working efficiently.
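As a rough illustration of what such a layer might record, the sketch below keeps an in-memory trace of an agent’s calls and answers two of the questions above: is the input growing with every call, and are any tools being called repeatedly with the same arguments? `CallRecord` and `AgentTrace` are hypothetical names invented for this example, not part of any real observability product, and a production system would capture far more.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CallRecord:
    """One model call made by the agent (fields are illustrative)."""
    model: str
    input_tokens: int
    output_tokens: int
    tool: Optional[str] = None   # tool the model asked for, if any
    tool_args: str = ""          # serialized arguments, for spotting repeats


@dataclass
class AgentTrace:
    calls: list = field(default_factory=list)

    def record(self, call: CallRecord) -> None:
        self.calls.append(call)

    def context_is_growing(self) -> bool:
        """True if every call sent more input tokens than the one before."""
        sizes = [c.input_tokens for c in self.calls]
        return len(sizes) > 1 and all(b > a for a, b in zip(sizes, sizes[1:]))

    def repeated_tool_calls(self) -> dict:
        """Tool/argument pairs that were requested more than once."""
        counts = Counter((c.tool, c.tool_args) for c in self.calls if c.tool)
        return {key: n for key, n in counts.items() if n > 1}

    def total_tokens(self) -> int:
        return sum(c.input_tokens + c.output_tokens for c in self.calls)


# Usage with made-up data: the agent reads the same file twice.
trace = AgentTrace()
trace.record(CallRecord("example-model", 2_000, 300, "read_file", "a.py"))
trace.record(CallRecord("example-model", 3_500, 300, "read_file", "a.py"))
trace.record(CallRecord("example-model", 5_000, 400))
```

In this example `trace.context_is_growing()` is true and `trace.repeated_tool_calls()` flags the duplicate `read_file` request, which are the kinds of signals that suggest an agent is spending tokens inefficiently.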

Another part of token accountability is understanding the models that run your agent’s requests. General-purpose reasoning models range from expensive, high-performance models such as Claude Fable or Opus 4.8, to models such as Gemma 4 26B that can run on a well-equipped laptop, to some even smaller models. Although it’s tempting to say, “I need the best; I’ll run Opus 4.8 or Fable with maximum thinking,” most requests don’t require that level of reasoning or cost. Agents will be able to select the best model to process each request. Fable can delegate, and we expect other frontier providers to follow as models incorporate agentic capabilities. There is an active world of open models beyond the frontier AI providers. Vicky Boix writes that locally run models now perform almost as well as frontier models. Tools like OpenRouter give you a model-independent way to route requests to different models, including open models that run locally. OpenRouter can be integrated with OpenClaw, Claude Code, Cursor, Codex, and other agents to provide intelligent routing.
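The idea behind routing can be sketched in a few lines. This is a toy illustration, not how OpenRouter or any agent actually chooses a model: it assumes each request already has a difficulty score from 1 to 10 (estimating that score is the hard part) and picks the cheapest tier rated for it. The tier names, ratings, and relative costs are all placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelTier:
    name: str             # placeholder label, not a real model ID
    max_difficulty: int   # hardest request (1-10) this tier should handle
    relative_cost: float  # made-up cost units per request


# Hypothetical tiers: a local open model, a mid-range model, a frontier model.
TIERS = [
    ModelTier("local-small", max_difficulty=3, relative_cost=0.0),
    ModelTier("hosted-mid", max_difficulty=7, relative_cost=1.0),
    ModelTier("frontier", max_difficulty=10, relative_cost=10.0),
]


def route(difficulty: int, tiers: list = TIERS) -> ModelTier:
    """Return the cheapest tier rated for the given difficulty."""
    eligible = [t for t in tiers if t.max_difficulty >= difficulty]
    if not eligible:
        raise ValueError(f"no tier can handle difficulty {difficulty}")
    return min(eligible, key=lambda t: t.relative_cost)


easy = route(2)      # a simple request stays on the local model
medium = route(5)    # a harder one moves up a tier
hard = route(9)      # only the hardest requests reach the frontier tier
```

The design choice worth noting is the default: requests start at the cheapest tier and escalate only when needed, the opposite of “run the best model on everything.”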

Tokenmaxxing is dying. To be sure, it will take some time for its remnants to disappear, and there will always be developers who think they can game their way to a promotion, along with managers who insist on being “all in” on AI. But spending tokens responsibly is now the norm, whether you pay from your own checkbook or a company account. Token optimization will become more important as fees per token increase. And they will, no doubt.