Inference and Token Economics

Introduced by: Pete Kaminski
Depth: Deep thread

About two hours into the call, Pete pulled up Typora and drew a picture of how LLM conversations actually work. This was a pivotal teaching moment — Dave had been burning tokens without understanding why, and this explanation connected the dots.

The Conversation

Dave had been frustrated by costs and wanted Claude to count its own tokens. Pete redirected:

Pete: "You're over-focused on that because you're burning tokens like there's no tomorrow. You don't even need to manage that if you're doing good project hygiene."

But he still explained the mechanics, because understanding them changes behavior.

How LLMs Actually Work

Pete drew a conversation: Pete says "hi," Claude says "oh, hello," Pete says "can you do a thing," Claude says "sure, what's the thing?"

Pete: "From a human's point of view, I have memory of this conversation over time. So when you add a turn... I'm taking a turn, Claude takes a turn... In my mind, what's happening is I have the conversation, and I add a little bit to the end."

Then the key insight:

Pete: "LLMs do this completely differently, and it's the freakiest thing to learn about."

Pete: "The big server up at anthropic-whatever, claude.ai, the big server has absolutely no memory of its conversation with you."

Dave's "aha" moment:

Dave: "And so it has to re-read the context, and this is what it's complaining about with me, so it's like, it has to go back and re-read everything, that's what's so token-intensive."

Key Concepts Explained

Tokens: Chunks of text, roughly three-quarters of a word. Both input (your prompt) and output (Claude's response) are counted. Rule of thumb: 1,000 tokens ≈ 750 words.
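That rule of thumb is easy to turn into a back-of-the-envelope estimator. This is a sketch only: real tokenizers split text differently, so treat the result as a ballpark, not a bill.

```python
def estimate_tokens(text: str) -> int:
    """Rough estimate using the rule of thumb 1,000 tokens ~= 750 words
    (i.e. a token is about three-quarters of a word). Real tokenizers vary."""
    words = len(text.split())
    return round(words / 0.75)

# 750 words should land right at the 1,000-token rule of thumb
print(estimate_tokens("word " * 750))  # → 1000
```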

Turns: Each back-and-forth in a conversation. Each turn sends the entire conversation to the server, not just the new message.
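The statelessness Pete described can be sketched as a client loop. Because the server keeps no memory, the client re-sends the full transcript on every call; `call_model` below is a hypothetical stand-in for a real API, not Anthropic's actual interface.

```python
# Sketch of a stateless chat client. The server remembers nothing, so the
# entire history travels over the wire on every single turn.
def call_model(messages):
    # Hypothetical model call: just report how much context it had to read.
    return f"(reply after re-reading {len(messages)} messages)"

history = []
for user_text in ["hi", "can you do a thing", "here is the thing"]:
    history.append({"role": "user", "content": user_text})
    reply = call_model(history)  # full history sent, not just the new message
    history.append({"role": "assistant", "content": reply})

print(len(history))  # → 6 (three turns = six messages, all re-sent each call)
```

Note that the third call already carries five messages of context, which is exactly what Dave's Claude was "complaining about."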

Non-linear growth: A 1,000-turn conversation isn't 1,000x more expensive than a 1-turn conversation; it's far more, because each turn re-sends everything that came before, so total input tokens grow roughly quadratically with the number of turns.
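The arithmetic behind that growth: if each turn adds about t tokens and every call re-sends everything so far, turn k costs roughly k*t input tokens, and an n-turn conversation processes about t*n*(n+1)/2 in total. The per-turn figure below is an assumption for illustration.

```python
def total_input_tokens(n_turns: int, tokens_per_turn: int) -> int:
    """Total input tokens across a conversation where every call re-sends
    all prior turns: tokens_per_turn * (1 + 2 + ... + n) = t * n(n+1)/2."""
    return tokens_per_turn * n_turns * (n_turns + 1) // 2

print(total_input_tokens(1, 100))     # → 100
print(total_input_tokens(1000, 100))  # → 50050000
```

Under this simple model the 1,000-turn conversation costs about 500,000x the 1-turn one in input tokens, not 1,000x.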

Caching: A recent improvement where, if the leading tokens of a call are identical to a previous call, the server can reuse cached results (~80% of the time). This reduces costs for typical conversations but doesn't eliminate the fundamental problem.
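A toy cost model shows why caching helps but doesn't change the shape of the problem: the unchanged prefix is billed at a discount, but it still has to be sent and processed every turn. The 90% discount here is an assumption for illustration, not a quoted price.

```python
def turn_cost(cached_prefix_tokens: int, new_tokens: int,
              cached_rate: float = 0.1, full_rate: float = 1.0) -> float:
    """Cost of one turn when the unchanged conversation prefix hits the
    cache (discounted rate, assumed here) and only new tokens pay full price."""
    return cached_prefix_tokens * cached_rate + new_tokens * full_rate

# A turn late in a conversation with 10,000 tokens of unchanged history:
print(turn_cost(10_000, 100))  # → 1100.0 cost units, versus 10100.0 uncached
```

The per-turn cost still grows with the length of the history; caching shrinks the slope, not the curve.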

Dave's Batching Intuition

Dave had intuitively figured out one optimization:

Dave: "I thought, let me just write, like, 700 words. One of them was 1,000. So I wrote 1,000 words and just pressed go."

Pete confirmed this was smart:

Pete: "You want to batch up stuff for your bot, because it can handle a whole bunch of stuff at once."

There's an asymmetry: an LLM should ask the human one question at a time (people hate being hit with three questions at once), but humans should batch their requests to the LLM.
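The batching payoff is easy to quantify: n separate turns each re-send the shared project context, while one batched turn sends it once. The token counts below are illustrative assumptions, not measurements from Dave's sessions.

```python
CONTEXT = 5_000   # assumed tokens of project context re-sent with every call
QUESTION = 50     # assumed tokens per individual question

def cost_separate(n: int) -> int:
    """n one-question turns: the shared context travels n times."""
    return n * (CONTEXT + QUESTION)

def cost_batched(n: int) -> int:
    """One turn carrying all n questions: the context travels once."""
    return CONTEXT + n * QUESTION

print(cost_separate(10))  # → 50500
print(cost_batched(10))   # → 5500
```

Ten questions batched into one message cost roughly a tenth of the input tokens, which is the intuition behind Dave's 1,000-word prompts.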

Why Dave's Sessions Cost So Much

Dave's five-hour sessions with constant back-and-forth created conversations with hundreds of turns, each re-sending the entire growing context. Combined with Claude's verbose outputs, the token burn grew far faster than the conversation itself.

Pete's solution wasn't to monitor tokens — it was to structure interactions so you don't waste them in the first place.
