Inference Tokenomics
Introduction
In March 2026, NVIDIA CEO Jensen Huang stood on the stage at GTC and declared a new economic paradigm: tokenomics — the economics of producing, pricing, and consuming AI inference tokens at industrial scale. Data centres, he argued, are no longer storage facilities for files. They are AI factories whose primary output is tokens — the fundamental unit of AI computation — and whose revenues are directly determined by how efficiently those tokens are manufactured.
This is not a metaphor. The world's largest technology companies are collectively spending over $600 billion in 2026 on AI infrastructure. OpenAI alone is projected to spend $14.1 billion on inference costs this year. Anthropic is on track for $9–20 billion in annualised revenue. The global AI inference market, valued at $103.7 billion in 2025, is projected to reach $312.6 billion by 2034. Tokens — sequences of text fragments, image patches, or code units that AI models process and generate — have become the commodity at the centre of this economic transformation.
Understanding inference tokenomics is now essential for anyone evaluating the AI industry: investors sizing up hyperscaler capital expenditure, engineers budgeting their compute, enterprise leaders planning AI deployments, or policymakers assessing the infrastructure demands of the AI era. This primer covers the full landscape — from the physics of token production to the pricing tiers of token consumption, from the hardware architectures that determine cost-per-token to the business models being built atop this new commodity.
What Are Tokens?
A token is the basic unit of data that a large language model (LLM) processes. When you type a question into ChatGPT or Claude, your text is broken into tokens — typically fragments of words, punctuation marks, or whitespace characters. A rough rule of thumb: one token equals approximately 0.75 English words; equivalently, 100 tokens cover roughly 75 words.
Tokens serve as both the input and the output of AI inference. Input tokens (also called prompt tokens) represent the data you send to the model — your question, the context, the system instructions. Output tokens (also called completion tokens) represent the model's response — the generated text, code, or reasoning. This distinction matters economically because output tokens are substantially more expensive to produce than input tokens, typically 3–8x, reflecting the sequential, compute-intensive nature of token generation versus the parallelisable nature of prompt processing.
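To make the asymmetry concrete, here is a minimal cost calculator. The prices used are illustrative placeholders, not any provider's actual rates.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price: float, output_price: float) -> float:
    """Dollar cost of one API call; prices are quoted per million tokens."""
    return (input_tokens * input_price + output_tokens * output_price) / 1e6

# Illustrative prices only: $3/M input, $15/M output (a 5x output premium,
# inside the 3-8x range described above).
print(request_cost(input_tokens=20_000, output_tokens=1_000,
                   input_price=3.0, output_price=15.0))  # 0.075
```

Note that a prompt-heavy call like this one spends most of its budget on input tokens even though the per-token rate is 5x lower, which is one reason long-context and prompt-caching economics receive so much attention.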
Beyond text, the token concept extends to images (processed as patches of pixels), audio (spectral frames), video (spatiotemporal patches), and even robotic control signals (continuous action tokens). As AI becomes multimodal, the token economy expands accordingly.
Why Tokens Matter Economically
Tokens are to AI what kilowatt-hours are to electricity or barrels of oil equivalent are to energy — a standardised unit of output that can be metered, priced, and traded. Every API call from OpenAI, Anthropic, Google, or any inference provider is billed in tokens. Every enterprise AI deployment's cost is fundamentally denominated in tokens consumed. Every data centre's revenue potential is bounded by the tokens it can produce per unit of power.
This is the foundation of tokenomics: tokens are the new commodity, and compute is revenue.
The AI Factory Paradigm
Jensen Huang's central thesis at GTC 2026 was that data centres have undergone a fundamental transformation. A traditional data centre stored and served files — it was infrastructure. An AI factory produces something: tokens. The analogy to physical manufacturing is deliberate and precise.
From Data Centres to Factories
A gigawatt AI factory — the scale at which the largest facilities now operate — costs approximately $35–60 billion to construct, depending on the configuration and cooling requirements. That capital is sunk regardless of what hardware you install. As Huang put it: "Even when you put nothing on it, it's $40 billion in. You better make for darn sure you put the best computer system on that thing so that you could have the best token cost."
Once built, these factories are power-constrained — a 1 gigawatt facility will never become 2 gigawatts. The physics of atoms imposes a hard ceiling. Within that fixed power envelope, the economic imperative is singular: maximise the number of tokens produced per watt, because tokens per watt translates directly to revenue.
Tokens Per Watt: The New Efficiency Metric
Tokens per watt (TPW) has emerged as the defining performance metric for AI infrastructure, analogous to how fuel efficiency defines automotive value or yield per acre defines agricultural productivity. It captures the fundamental economic relationship: for a power-constrained factory, more tokens per watt means more revenue from the same fixed asset.
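The relationship between power, tokens per watt, and revenue is simple enough to write down directly. The throughput figure below uses the 700 million tokens per second per gigawatt cited later in this piece; the $3-per-million blended price is a hypothetical assumption.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000 (ignoring leap years)

def annual_token_revenue(power_watts: float,
                         tokens_per_second_per_watt: float,
                         price_per_million: float) -> float:
    """Annual revenue of a power-constrained AI factory, in dollars."""
    tokens_per_year = power_watts * tokens_per_second_per_watt * SECONDS_PER_YEAR
    return tokens_per_year / 1e6 * price_per_million

# A 1 GW factory producing 700M tokens/s is 0.7 tokens/s per watt.
revenue = annual_token_revenue(1e9, 0.7, 3.0)
print(f"${revenue / 1e9:.0f}B per year")
```

The power term is fixed for a built factory, so every doubling of tokens per watt doubles revenue from the same asset — which is why TPW is the metric this whole framework revolves around.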
SemiAnalysis, the semiconductor research firm, developed the InferenceX and later InferenceMAX benchmarks to standardise this measurement. Their comprehensive testing across all major AI accelerators confirmed what NVIDIA had been claiming: the GB300 NVL72 (Blackwell Ultra) delivers up to 50x higher throughput per megawatt compared to the Hopper platform, resulting in 35x lower cost per token.
This metric matters because it exposes a counterintuitive truth about AI infrastructure economics: the price of the computer and the cost of the token are only marginally related. A more expensive system that produces tokens at dramatically higher efficiency is a better investment than a cheaper system with lower token throughput. As Huang emphasised: "If you have the wrong architecture, even if it's free, it's not cheap enough."
The Throughput-Latency Pareto Frontier
At the heart of inference tokenomics lies a fundamental engineering trade-off that governs all hardware design, pricing strategy, and data centre architecture: the tension between throughput (aggregate tokens per second per watt) and interactivity (tokens per second delivered to each individual user — the inverse of latency).
Understanding the Trade-Off
High throughput means processing the maximum number of tokens per unit of power — ideal for batch processing, free-tier services, and high-volume workloads. High interactivity (low latency) means delivering tokens to individual users as fast as possible — ideal for real-time coding assistants, premium AI services, and agentic workloads that require rapid iteration.
These two objectives are fundamentally at odds. Optimising for throughput requires batching many requests together and processing them in parallel, which increases latency for individual users. Optimising for latency requires dedicating more compute to each individual request, which reduces overall throughput. At the hardware level, throughput demands massive floating-point operations (FLOPS), while low latency demands massive memory bandwidth — and chips have limited surface area for both.
This trade-off creates a Pareto frontier: a curve representing the best achievable combination of throughput and latency for a given hardware platform. Every point on this curve represents a different operating configuration with different economic implications.
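The shape of the trade-off can be seen in a toy decode model. The millisecond constants below are hypothetical, chosen only to illustrate the curve: each decode step pays a fixed memory-bound cost to stream the model's weights, plus a per-request compute cost that grows with the batch.

```python
# Hypothetical constants, chosen only to illustrate the shape of the curve.
WEIGHT_STREAM_MS = 10.0  # fixed memory-bound cost per decode step
PER_REQUEST_MS = 0.5     # marginal compute cost per request in the batch

def decode_step_ms(batch_size: int) -> float:
    return WEIGHT_STREAM_MS + PER_REQUEST_MS * batch_size

for batch in (1, 8, 64, 256):
    step = decode_step_ms(batch)
    per_user = 1000.0 / step   # tokens/s seen by each user (interactivity)
    total = batch * per_user   # tokens/s for the system (throughput)
    print(f"batch={batch:3d}  per-user={per_user:5.1f} tok/s  total={total:7.1f} tok/s")
```

Sweeping the batch size traces out exactly the Pareto curve described above: every step up in system throughput is paid for with a drop in per-user interactivity.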
Generational Leaps in the Pareto Frontier
NVIDIA's GTC 2026 presentation demonstrated how each hardware generation shifts this frontier dramatically.
Where Moore's Law might have delivered a 1.5x improvement per generation, NVIDIA's architectural co-design approach delivered 35x from Hopper to Blackwell and another 2–10x from Blackwell to Vera Rubin, depending on the tier. Over just two years, token generation rates within a 1 gigawatt factory are projected to increase from 2 million to 700 million tokens per second — a 350x improvement.
Token Pricing Tiers: The Segmentation of Intelligence
One of the most consequential ideas introduced at GTC 2026 was that AI tokens, like any maturing commodity, will naturally segment into distinct pricing tiers based on model size, speed, context length, and intelligence. This represents a shift from the early days of AI — when there was essentially one product at one price — toward the kind of market segmentation seen in every mature industry.
The Five-Tier Framework
NVIDIA's presentation outlined a five-tier pricing structure that maps directly onto the throughput-latency Pareto frontier.
At the free tier, high throughput and smaller models (like Qwen 3 at 235 billion parameters with 32K context) serve as customer acquisition tools — the equivalent of free samples in any industry. At the ultra tier, the largest models running at maximum speed with enormous context windows command $150 per million tokens — a 50x premium over the medium tier.
This segmentation mirrors established industries. As Huang noted: "Ferrari is all high end, nothing in the free tier. And then somebody else, right? Just depends on the brand." Search businesses will operate largely at the free tier. Agentic coding assistants like Claude Code and Codex operate at the premium tier. Enterprise research agents may occupy the ultra tier.
Price Trends and the Deflation Paradox
The history of token pricing reveals one of the fastest cost deflation curves in technology history.
OpenAI CEO Sam Altman has stated that AI usage costs are falling 10x every 12 months. SemiAnalysis data shows a median decline of 50x per year across all inference benchmarks, accelerating to 200x per year after January 2024. What cost $60 per million tokens in early 2023 now costs $0.40 or less for equivalent performance.
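Compounding at these rates is worth making explicit. A sketch using Altman's 10x-per-year figure and the $60 starting price above:

```python
def price_after(initial_price: float, annual_decline: float, years: int) -> float:
    """Price per million tokens after compounding decline."""
    return initial_price / annual_decline ** years

# Starting from $60/M tokens in early 2023, declining 10x per year:
for year in range(4):
    print(2023 + year, f"${price_after(60.0, 10, year):g}/M tokens")
```

Three years of 10x decline is a 1,000x drop, which is how a $60 price becomes pocket change while total spending still grows.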
Yet this cost deflation creates a paradox: while unit costs plummet, total spending explodes. This is because falling costs unlock new use cases, new tiers of service, and new categories of consumption — particularly agentic workloads that consume orders of magnitude more tokens than simple chat interactions.
The Economics of AI Factory Revenue
NVIDIA's GTC presentation included a striking revenue model that translates the throughput-latency framework into concrete dollar figures. By allocating a 1 gigawatt data centre equally across four pricing tiers (25% power to each), the model calculates the total annual revenue that each hardware generation can produce.
Revenue Per Gigawatt
The implications are profound. A data centre operator running Vera Rubin instead of Blackwell — at the same power consumption — could generate 5x more revenue. Adding the Groq LPX system for the highest-value 25% of workloads doubles the revenue opportunity again.
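Since the presentation's revenue table is not reproduced here, a sketch of the tier-allocation arithmetic has to use hypothetical per-tier throughput and price figures; only the structure (25% of power to each tier, revenue summed across tiers) follows the model described above.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# (tier, tokens/s per MW, $ per million tokens); all values hypothetical
tiers = [
    ("free",    2_000_000,   0.10),
    ("basic",     800_000,   1.00),
    ("premium",   300_000,   6.00),
    ("ultra",      50_000, 150.00),
]

power_mw = 1000                      # a 1 GW factory
per_tier_mw = power_mw / len(tiers)  # 25% of power to each tier

total = 0.0
for name, tps_per_mw, price in tiers:
    tokens_per_year = tps_per_mw * per_tier_mw * SECONDS_PER_YEAR
    revenue = tokens_per_year / 1e6 * price
    total += revenue
    print(f"{name:8s} ${revenue / 1e9:7.2f}B/yr")
print(f"{'total':8s} ${total / 1e9:7.2f}B/yr")
```

Under these placeholder numbers the ultra tier contributes the large majority of revenue while producing the fewest tokens, which illustrates why dedicating a latency-optimised system to the top tier can double the revenue opportunity.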
This is the business case that has driven NVIDIA's $1 trillion-plus order backlog through end of 2027 for Blackwell and Vera Rubin systems alone — a figure that excludes Vera CPUs, Groq LPX, storage systems, and next-generation architectures like Feynman.
Why "Cheaper" Hardware Is Not Cheaper
This revenue model explains NVIDIA's sustained pricing power and gross margins despite intense competition. The argument from competitors — "our chips are 30% cheaper" — misses the fundamental economics. As Huang argued: "Put that in the context of the factory, and that person is actually demonstrating to you they don't understand AI."
The relevant comparison is not chip price versus chip price, but revenue generated per watt per dollar of infrastructure. A more expensive system that produces 5x the token revenue is the economically rational choice. This is why, as Huang stated, "customers would prefer to buy our next-generation product at a higher price than our current generation product at a lower price" — because the value per token improves faster than the price per system increases.
Hardware Architectures for Token Production
The hardware that powers AI inference is rapidly diversifying from monolithic GPU systems into heterogeneous architectures optimised for different parts of the inference pipeline. Understanding these architectures is essential for understanding the cost structure of token production.
GPU-Based Systems: The Throughput Engine
NVIDIA's core inference platform centres on GPU-based systems connected via high-bandwidth NVLink interconnects. The current generation — Grace Blackwell NVL72 — packs 72 GPUs into a single rack connected by NVLink at terabytes per second of bisection bandwidth. This architecture excels at the prefill phase of inference (processing input tokens in parallel) and at high-throughput batch processing.
The next generation — Vera Rubin NVL72 — brings the Rubin GPU with 288 GB of HBM4 memory, 22 TB/s of memory bandwidth, and 50 PFLOPS of NVFP4 compute per chip. Combined with 338 billion transistors per GPU plus 2.5 trillion transistors of HBM4, this represents a massive leap in raw compute density. At GTC 2026, Huang confirmed that the first Vera Rubin rack was already running at Microsoft Azure.
The Groq LPX: A Latency-Optimised Architecture
NVIDIA's acquisition of Groq's technology introduced a fundamentally different processor architecture optimised for the decode phase of inference — the sequential, bandwidth-limited process of generating one output token at a time.
The Groq 3 LPU (Language Processing Unit) is a deterministic dataflow processor with massive on-chip SRAM:
- Single chip: 500 MB SRAM, 150 TB/s SRAM bandwidth, 1.2 PFLOPS (FP8), 98 billion transistors
- Expanded configuration (8 chips): 4 GB SRAM, 1,200 TB/s bandwidth (55x vs single Rubin GPU)
- Full LPX rack (256 chips): 128 GB SRAM, 640 TB/s scale-up bandwidth, 315 PFLOPS
The LPU is statically compiled — the compiler schedules all compute and data movement in advance, with no dynamic scheduling overhead. This makes it deterministic and extremely low-latency, but less flexible than GPUs. It is purpose-built for one workload: autoregressive token generation.
Disaggregated Inference: Uniting Throughput and Latency
The key architectural innovation is disaggregated inference, orchestrated by NVIDIA's Dynamo software — described as the "operating system for AI factories." Dynamo splits the inference pipeline across heterogeneous hardware:
1. Prefill (processing input tokens): Handled entirely on Vera Rubin GPUs, which excel at parallel computation
2. Decode — Attention: The attention mechanism during token generation, which requires access to the massive KV cache, runs on Vera Rubin GPUs with their large HBM4 memory
3. Decode — Feed-Forward Network (FFN): The bandwidth-limited FFN computation for actual token generation is offloaded to Groq LPX chips with their massive SRAM bandwidth
This disaggregation allows each processor type to operate at its sweet spot. The two systems communicate via Ethernet with a special low-latency mode that halves standard latency. The result: 35x higher inference throughput per megawatt at the premium tier, with the ability to sustain token generation speeds above 1,000 tokens per second per user — a regime where GPU-only architectures simply run out of bandwidth.
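A back-of-envelope roofline shows why decode speed is bandwidth-bound and why the SRAM tier changes the regime. Assuming (hypothetically) a 100-billion-active-parameter model in FP8 (one byte per parameter), each generated token must stream roughly the active weights once:

```python
# Roofline sketch: memory-bound decode streams the active weights once per
# generated token. The 100B model size is a hypothetical assumption; the
# bandwidth figures are the ones quoted in this section.
def decode_tokens_per_second(bandwidth_tb_s: float,
                             active_params_b: float,
                             bytes_per_param: float = 1.0) -> float:
    """Upper bound on single-user decode rate for a memory-bound model."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_tb_s * 1e12 / bytes_per_token

print(round(decode_tokens_per_second(22, 100)))    # single Rubin GPU, HBM4
print(round(decode_tokens_per_second(1200, 100)))  # 8-chip Groq config, SRAM
```

On this rough model the HBM4 path tops out at a few hundred tokens per second per user, while the SRAM configuration clears the 1,000 tokens-per-second threshold with room to spare. Real systems batch users and reuse streamed weights across the batch, so this is an intuition pump, not a benchmark.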
Three Memory Tiers
NVIDIA is now the only company optimising across three distinct memory technologies simultaneously:
- HBM4 (High Bandwidth Memory): 288 GB per Rubin GPU, 22 TB/s — high capacity, moderate bandwidth
- LPDDR5 (Low-Power DDR): Used in Vera CPUs for tool use and agent memory — high capacity, low power
- SRAM (Static RAM): 500 MB per Groq chip, 150 TB/s — tiny capacity, extreme bandwidth
This tri-memory architecture enables different workloads to be placed on the optimal memory tier, maximising overall system efficiency.
The Agentic Consumption Explosion
The shift from chatbot-style interactions to agentic AI — autonomous systems that perform multi-step tasks, use tools, spawn sub-agents, and iterate on their own outputs — has fundamentally changed the token consumption profile of AI systems.
From Chat to Agents
A typical ChatGPT conversation might consume a few thousand tokens. An agentic coding session with Claude Code or Codex can consume 50 million tokens in a single day — a figure Jensen Huang cited directly from user reports. At $150 per million tokens for premium-tier inference, that is $7,500 per day for a single power user. At the more common $3–6 per million token tier, it is $150–300 per day.
This is not a theoretical projection. As Huang stated at the analyst call: "I'm really hoping that somebody who makes $2,000 a day is spending $1,000 a day of tokens." The economics are compelling: if an engineer earning $300,000–400,000 per year can be made even moderately more productive by deploying $50–100 per day of AI inference, the return on investment is overwhelming.
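The arithmetic behind these figures is simple enough to verify directly:

```python
def daily_agent_cost(tokens_per_day: float, price_per_million: float) -> float:
    """Daily inference spend in dollars."""
    return tokens_per_day / 1e6 * price_per_million

# The 50-million-token day cited above, at two of the tiers described earlier:
print(daily_agent_cost(50_000_000, 150.0))  # ultra tier: 7500.0
print(daily_agent_cost(50_000_000, 3.0))    # medium tier: 150.0

# Break-even framing: a $350k engineer costs roughly $1,400 per working day
# (assuming 250 working days/year), so $150/day of tokens pays for itself
# at roughly an 11% productivity uplift.
print(350_000 / 250)  # 1400.0
```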
Token Budgets as Compensation
A paradigm shift is underway in how companies think about employee productivity. As Huang described it: "When you come to work, they give you a laptop and tokens. Token budget is now a real thing." The implication is stark: "The idea that you would hire a $300,000 engineer and they spend no tokens in doing their job, you've got to ask the question, what are they doing?"
Industry data suggests that power users running large code generation workloads can consume $12,000–100,000 per year in inference costs. Some analysts now frame AI inference as the fourth component of engineering compensation alongside salary, bonus, and equity — with fully loaded costs potentially reaching $475,000 when adding $100,000 in annual inference costs to a $375,000 base compensation.
The Agentic Cost Multiplier
Agentic systems consume tokens at fundamentally different rates than interactive chat:
- A Reflexion loop (where an agent iterates on its own output) running for 10 cycles consumes 50x the tokens of a single linear pass
- An unconstrained agent solving a software engineering issue can cost $5–8 per task
- A mid-sized product with 1,000 daily users in multi-turn agent conversations can burn through 5–10 million tokens per month
- An autonomous research session spanning hours can consume tens of millions of tokens
This explains why, despite 10x annual cost declines per token, enterprise AI spending is accelerating. Inference now accounts for approximately 85% of enterprise AI budgets, up from negligible shares just two years ago.
The Hyperscaler Investment Cycle
The infrastructure required to produce tokens at scale has triggered the largest capital expenditure cycle in technology history. Understanding this investment wave — and the economic logic behind it — is central to understanding inference tokenomics.
CapEx at Historic Levels
The top five hyperscalers (Amazon, Microsoft, Google, Meta, Oracle) are expected to spend approximately $600–650 billion in 2026, up 36–70% year-over-year depending on the estimate. Capital intensity has reached historically unprecedented levels of 45–57% of revenue, with some companies spending approximately 90% of their operating cash flow on capex.
Approximately 75% of this aggregate spending — roughly $450 billion — is directed specifically at AI infrastructure: GPUs, accelerators, networking, and data centre construction.
The Revenue Justification
The central question in AI infrastructure economics is whether this spending can be justified by downstream revenue. The current CapEx-to-cloud-revenue ratio for hyperscalers sits at approximately 1.2x — they are spending 20% more on infrastructure than they earn from cloud and API services. This gap is being funded by operating cash flow from existing businesses and, increasingly, by debt markets.
NVIDIA's Jensen Huang addressed this directly at the analyst call, pointing to the private revenue trajectories of AI companies: "No companies in history have ever grown, as a start-up company, increased revenues by $1 billion or $2 billion a week. That's what they're experiencing right now." OpenAI hit $25 billion in annualised revenue by February 2026. Anthropic is tracking toward $9–20 billion.
Huang's broader argument: the $2 trillion IT software industry will be transformed into an $8 trillion industry that resells enormous quantities of tokens. "100% of the world's IT industry will become resellers of OpenAI and Anthropic," he predicted. If even a fraction of this materialises, current infrastructure spending is rational.
Supply and Demand Dynamics
NVIDIA disclosed $1 trillion-plus in firm demand and purchase orders for Blackwell and Vera Rubin through end of 2027 — up from $500 billion announced a year earlier. This figure excludes standalone Vera CPUs, Groq LPX, storage systems, and next-generation Feynman architecture products.
The supply chain is now producing at scale: "We have now set up a supply chain that can manufacture thousands [of racks] a week, essentially multi-gigawatts of AI factories per month." Yet Huang acknowledged the system remains supply-constrained across multiple dimensions — not a single bottleneck, but a complex web of dependencies spanning silicon fabrication, copper interconnects, optical components, power delivery, cooling, and physical construction.
Market Structure: Who Produces and Consumes Tokens
The AI inference market is structured across several layers, each with distinct economics and competitive dynamics.
Model Providers
The top model providers by token volume:
1. OpenAI — The largest by total tokens generated, with $25B+ in annualised revenue
2. Open-source models (aggregate) — Collectively the second-largest category, dominated by Meta's Llama, Alibaba's Qwen, and DeepSeek
3. Anthropic — Third largest, on track for $9–20B in 2026 revenue
4. xAI, Google, others — Rapidly growing but smaller in aggregate token volume
Infrastructure Providers
NVIDIA's accelerated computing revenue splits roughly 60/40 between hyperscalers and non-hyperscaler customers — a key structural feature of the market. In the 60% hyperscaler segment, NVIDIA competes on chip merit while also bringing its CUDA developer base to cloud platforms. The 40% enterprise and regional segment, as Huang put it, is "completely impossible if you just build a chip, because they don't buy chips — they buy platforms."
Inference Service Providers (ISPs)
A rapidly growing category of specialised token producers has emerged: inference service providers (ISPs) like Fireworks AI, Together AI, Cerebras, and others. These companies focus exclusively on serving open-source models at the lowest possible cost per token.
Fireworks AI exemplifies the growth trajectory: from approximately $6.5 million in ARR in May 2024 to $130 million by May 2025 — 20x growth in a single year — with a $4 billion valuation after a $254 million Series C. The number of inference providers grew from 27 in early 2025 to 90 by late 2025.
The Unit Economics of Token Production
Understanding the cost structure of producing tokens is essential for evaluating the sustainability of current pricing and predicting future market dynamics.
Cost Components
The total cost of producing a million tokens includes:
- Silicon amortisation: The cost of GPU/accelerator hardware, amortised over 4–6 years
- Power: Electricity consumed during inference, typically $0.04–0.10 per kWh at data centre scale
- Facility costs: Building amortisation, cooling, maintenance — roughly 30–40% of total operating costs
- Networking: High-speed interconnects, both within-rack (NVLink) and between-rack (InfiniBand/Ethernet)
- Software and orchestration: Inference serving frameworks, model optimisation, routing
- Labour and overhead: Engineering, operations, support
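How these components roll up into a cost per million tokens can be sketched as follows. Every input value below is a hypothetical placeholder, and a single overhead multiplier crudely folds facility, networking, software, and labour into one factor.

```python
SECONDS_PER_HOUR = 3600

def cost_per_million_tokens(silicon_hourly: float, power_kw: float,
                            price_per_kwh: float, overhead_multiplier: float,
                            tokens_per_second: float) -> float:
    """All-in dollar cost per million tokens for one serving node.

    overhead_multiplier crudely folds facility, networking, software,
    and labour on top of silicon amortisation and power.
    """
    hourly_cost = (silicon_hourly + power_kw * price_per_kwh) * overhead_multiplier
    tokens_per_hour = tokens_per_second * SECONDS_PER_HOUR
    return hourly_cost / tokens_per_hour * 1e6

# Hypothetical node: $10/hr amortised silicon, 10 kW at $0.08/kWh,
# 1.5x overhead, serving 20,000 aggregate tokens/s.
print(f"${cost_per_million_tokens(10.0, 10, 0.08, 1.5, 20_000):.3f} per M tokens")
```

At these placeholder values the node produces tokens for roughly $0.23 per million; the gap between raw production cost and market price is where serving margins (or losses) live.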
The Cost-Revenue Gap
A critical observation: many frontier AI companies are currently pricing inference below cost. OpenAI generated approximately $3.7 billion in revenue in one period while losing an estimated $5 billion — spending roughly $2.35 for every $1.00 earned. This below-cost pricing is funded by venture capital ($6.6 billion from a single funding round for OpenAI) and hyperscaler cross-subsidies.
The market consensus is that this is unsustainable and that meaningful price normalisation will occur within 12–24 months. However, the countervailing force is the relentless improvement in inference hardware efficiency — if cost-per-token continues to fall at 10–50x per year while revenue per token remains stable, margins may improve even at current prices.
The Software Multiplier
One of the most underappreciated aspects of inference economics is the role of software optimisation. NVIDIA's Dynamo 1.0 — now in production and adopted by all major cloud providers — boosted inference performance on Blackwell GPUs by up to 7x through pure software improvements on the same hardware. Jensen Huang cited an example where inference service providers saw token speeds increase from an average of 700 tokens per second to nearly 5,000 after NVIDIA software updates — the same system, 7x the output.
This software layer is a continuous source of efficiency gains that compounds on top of hardware improvements, contributing to the exponential cost decline in token production.
The Future of Tokenomics
Convergence of Training and Inference
The traditional distinction between training (teaching a model) and inference (using it) is blurring. Post-training techniques like reinforcement learning with verifiable feedback require enormous inference-like computation. Continuous learning and personalisation mean models are always partially in a training mode. Huang estimated that future post-training compute requirements could be "probably a million times more than pretraining."
His stated aspiration: "99% of the world's compute goes towards inference. And the reason for that is because inference is where we translate tokens generated to economics. Nobody pays you for learning."
Physical AI and the 40% That Becomes 70%
Currently, 60% of NVIDIA's accelerated computing revenue comes from hyperscalers and 40% from enterprise, regional cloud, and on-premises deployments. Huang predicted that as physical AI — robotics, autonomous vehicles, industrial automation — reaches its inflection point, the enterprise segment could grow from 40% to 70% of the total market. Physical AI requires on-premises inference at the edge, in factories, and in vehicles, expanding the addressable market from the ~$2 trillion digital economy to the $50–70 trillion physical economy.
The NVIDIA Roadmap
NVIDIA's annual architecture cadence ensures continuous compression of token costs:
- Blackwell (current): Established inference leadership, 35x over Hopper
- Vera Rubin (2026 H2): 2–10x over Blackwell across tiers, adding Groq LPX for latency-sensitive workloads
- Vera Rubin Ultra: NVLink 144, Groq LP35 with NVFP4 — further X-factor improvements
- Feynman (future): New GPU, new LPU (LP40), new CPU (Rosa), copper and co-packaged optics scale-up — NVLink 1152+
Each generation shifts the Pareto frontier up (more throughput) and out (lower latency), simultaneously reducing cost-per-token and enabling new premium tiers. The token generation rate for a 1 GW factory is projected to reach 700 million tokens per second with Vera Rubin — 350x the rate achievable with current hardware.
IT Industry Transformation
Perhaps the most expansive claim in NVIDIA's tokenomics thesis is that the entire IT software industry — currently approximately $2 trillion in annual revenue — will transform from licensing software to generating and reselling tokens. Every enterprise software company will integrate foundation models from OpenAI, Anthropic, and open-source providers into agentic systems using frameworks like OpenClaw (open source) and NemoClaw (enterprise). These companies become, in effect, token resellers — with changed revenue models, higher gross margins (offset by token COGS), and dramatically greater value delivery.
If this vision materialises, the total addressable market for token production infrastructure is not the current $100 billion inference market but something closer to $8 trillion — a figure that would make today's $600 billion CapEx cycle look like seed funding.
Conclusion
Inference tokenomics represents a fundamental reframing of how the technology industry creates and captures value. The core insight is deceptively simple: AI factories consume power and produce tokens; every optimisation that increases tokens per watt drives revenue; and the market for tokens is segmenting into tiers that mirror every mature commodity market in economic history.
The numbers behind this framework are staggering. Over $600 billion in annual infrastructure investment. A 350x improvement in token production efficiency in two years. Token prices declining at 10–50x per year while total consumption explodes. A $1 trillion-plus order backlog for hardware that won't all ship for 21 months. AI companies growing revenue at $1–2 billion per week.
The risks are real: below-cost pricing across the industry, a CapEx-to-revenue ratio above 1.0x for hyperscalers, and the possibility that AI revenue growth fails to match infrastructure investment timelines. But the countervailing forces — relentless hardware efficiency gains, software optimisation multipliers, expanding use cases from agentic AI to physical AI, and the potential transformation of the entire IT industry into token resellers — suggest that tokenomics is not a bubble narrative but a structural economic shift.
For investors, the key metrics to watch are tokens per watt per generation, revenue per gigawatt of deployed infrastructure, the CapEx-to-AI-revenue ratio at hyperscalers, and the rate at which enterprise token consumption grows. For engineers and enterprise leaders, the actionable insight is simpler: tokens are now a line item in every budget, every productivity calculation, and every competitive analysis. The AI factory era has begun.