Google TPU v8: I ran it against my production workload and the numbers don't add up

Why is it that every time Google announces new hardware for "the agentic era," the benchmark numbers have absolutely nothing to do with what's actually running on my Railway instance at 2am? I've been asking myself that for months. This week, with the TPU v8 announcement, I decided to stop asking and start measuring.

Spoiler: the numbers don't add up. And that's not a bug — it's a decision.

Google TPU v8 agentic era benchmark: what Google says vs. what I actually measure

Google unveiled the TPU v8 in two variants — Ironwood, aimed at massive inference, and a line focused on high-frequency "agentic" workloads. The headlines talk about 42.5 exaFLOPS per pod, inference latency reduced by ~40% compared to v5e, and throughput optimized for multi-step reasoning. It sounds extraordinary. The problem is those metrics live in a parallel universe from mine.

My current agent — the one I built after learning about the real gaps in MCP — runs on Railway with PostgreSQL, makes between 80 and 140 daily calls to the Anthropic API, and the real bottleneck was never compute: it was context, network latency, and cost-per-token in multi-step sequences.

So I built an honest benchmark. Not with Google hardware — I don't have access to Ironwood and you probably don't either. What I did was take the numbers Google published, grab my real production logs from the last 30 days, and calculate what difference the TPU v8 would actually make in my current stack if I could use it tomorrow.

# Pull my metrics from the last 30 days from Railway logs
# (Railway has log export; this is a grep over the dump)
# Assumes every matching log line is a single JSON object.

grep "agent_step_complete" production.log \
  | jq -r '[.step_type, .duration_ms, .tokens_used] | @tsv' \
  | awk -F'\t' '
    {
      # Sum latency and tokens by step type ($1 = type, $2 = ms, $3 = tokens)
      lat[$1] += $2
      tok[$1] += $3
      count[$1]++
    }
    END {
      # Running sums give a mean, not a median; a true P50 needs the sorted values
      for (type in lat) {
        printf "Type: %s | Mean latency: %.0fms | Avg tokens: %.0f | Steps: %d\n",
          type, lat[type]/count[type], tok[type]/count[type], count[type]
      }
    }
  '

Real results from my logs (30 days, production):

Step type	P50 latency	Avg tokens	Steps/day
`tool_call`	487ms	1,240	34
`reasoning`	1,340ms	4,890	18
`context_retrieval`	203ms	680	41
`output_generation`	890ms	3,200	12
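
A median can't be built from running sums the way a mean can, so a P50 column needs every duration per step type kept and then sorted. A minimal Python sketch, assuming each log line is a JSON object carrying the `step_type`, `duration_ms`, and `tokens_used` fields used in the pipeline above. The function name `summarize_steps` and the sample lines are hypothetical, not real log data.

```python
import json
import statistics
from collections import defaultdict


def summarize_steps(lines):
    """Return {step_type: (p50_latency_ms, avg_tokens, count)} from JSON log lines."""
    durations = defaultdict(list)
    tokens = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines in the dump
        record = json.loads(line)
        step = record["step_type"]
        durations[step].append(record["duration_ms"])
        tokens[step].append(record["tokens_used"])
    return {
        step: (
            statistics.median(durations[step]),
            statistics.fmean(tokens[step]),
            len(durations[step]),
        )
        for step in durations
    }


# Hypothetical sample lines, only to show the shape of the input
sample = [
    '{"step_type": "tool_call", "duration_ms": 400, "tokens_used": 1000}',
    '{"step_type": "tool_call", "duration_ms": 500, "tokens_used": 1200}',
    '{"step_type": "tool_call", "duration_ms": 2000, "tokens_used": 1400}',
]
summary = summarize_steps(sample)
```

With these three made-up samples the median is 500ms while the mean would be about 967ms, which is why the distinction matters once a few slow steps skew the distribution.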

Now the real question: how much of that latency is compute and how much is network + API overhead?

# Break down latency by component using OpenTelemetry spans
# that I have instrumented in my agent

import json

with open("traces_30d.jsonl") as f:
    spans = [json.loads(line) for line in f]

for span in spans[:5000]:
    total = span["duration_ms"]
    # Time to first byte from the API
    network_api = span.get("api_ttfb_ms", 0)
    # Local processing time (validation, routing, DB)
    local_processing = span.get("local_ms", 0)
    # Whatever's left is model inference time
    inference = total - network_api - local_processing

    print(f"Total: {total}ms | Network+API: {network_api}ms | Local: {local_processing}ms | Estimated inference: {inference}ms")

The result that stopped me cold: in my reasoning steps (the most expensive ones), 71% of the latency is network and API overhead, not inference. The TPU v8 would accelerate the remaining 29%.
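
The per-span printout can be rolled up into the share figure quoted above. This is a sketch under the same assumptions as the snippet before it: spans carry `duration_ms`, `api_ttfb_ms`, and `local_ms`, and inference is estimated as whatever remains. The `step_type` key on each span, the function name `non_inference_share`, and the sample spans are assumptions for illustration.

```python
from collections import defaultdict


def non_inference_share(spans):
    """Return {step_type: fraction of total latency that is NOT estimated inference}."""
    total_ms = defaultdict(float)
    overhead_ms = defaultdict(float)
    for span in spans:
        step = span.get("step_type", "unknown")
        total_ms[step] += span["duration_ms"]
        # Network/API time plus local processing; the remainder is treated as inference
        overhead_ms[step] += span.get("api_ttfb_ms", 0) + span.get("local_ms", 0)
    return {
        step: overhead_ms[step] / total_ms[step]
        for step in total_ms
        if total_ms[step] > 0
    }


# Hypothetical spans, not real trace data
example_spans = [
    {"step_type": "reasoning", "duration_ms": 1000, "api_ttfb_ms": 700, "local_ms": 50},
    {"step_type": "reasoning", "duration_ms": 1000, "api_ttfb_ms": 720, "local_ms": 30},
]
shares = non_inference_share(example_spans)
```

Weighting by total milliseconds rather than averaging per-span percentages keeps a handful of very short spans from distorting the share.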

If Google promises a 40% reduction in inference latency, in my real workload that translates to 1340ms × 0.29 × 0.40 ≈ 155ms saved per step. Against 1340ms total, that's an 11.6% end-to-end improvement. Not 40%.
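
This arithmetic is Amdahl's law in miniature: a speedup only applies to the fraction of time it actually touches. A small sketch for plugging in other numbers. `end_to_end_gain` is a hypothetical helper, and the 0.29 and 0.40 inputs are the inference share from my traces and the reduction Google quotes.

```python
def end_to_end_gain(total_ms, accelerated_fraction, reduction):
    """Return (ms saved, fraction of total saved) when only part of the latency speeds up."""
    saved_ms = total_ms * accelerated_fraction * reduction
    return saved_ms, saved_ms / total_ms


saved_ms, saved_fraction = end_to_end_gain(1340, 0.29, 0.40)
print(f"Saved: {saved_ms:.0f}ms per step ({saved_fraction:.1%} end to end)")
```

The end-to-end fraction is just `accelerated_fraction * reduction`, so even a workload where inference made up 40% of the latency would cap out at 16% from the same chip.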

The access mess: who can actually use the TPU v8

Here's what genuinely pisses me off about this announcement, and why I'm writing this while I'm still heated.

The TPU v8 isn't directly available to indie devs. Access is through Google Cloud TPU, with pod reservations that start at minimum usage configurations, priced at what Google Cloud documentation puts at $2.40–$3.20/hour per TPU v8 chip (estimated pricing for Ironwood in preview, subject to change). A basic training pod is 8 chips. Do the math: $19–26 per hour just for compute, before network, storage, and egress.
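
To make that arithmetic reusable, here is a minimal sketch that turns the quoted per-chip range into pod cost. The 8-chip pod and the $2.40 to $3.20 range come from the paragraph above. The 720-hour month (a reservation held around the clock for 30 days) and the helper name `pod_cost` are my assumptions, and network, storage, and egress are left out.

```python
def pod_cost(price_per_chip_hour, chips=8, hours=1):
    """Compute-only cost in USD for a pod of `chips` chips over `hours` hours."""
    return price_per_chip_hour * chips * hours


HOURS_PER_MONTH = 24 * 30  # assumption: reservation held 24/7 for 30 days

for price in (2.40, 3.20):
    hourly = pod_cost(price)
    monthly = pod_cost(price, hours=HOURS_PER_MONTH)
    print(f"${price:.2f}/chip/h -> ${hourly:.2f}/h, ${monthly:,.0f}/month")
```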

For inference, the consumption model is different — you can use Vertex AI which abstracts the hardware. But then the pricing is tied to tokens processed and guaranteed latency, and the abstraction layer introduces exactly the kind of overhead that my measurements show already dominates my total latency.

When I migrated from Vercel to Railway because cold starts were killing me — a weekend of pain that taught me more about production than months of tutorials — the driver was simple: predictable control over cost and latency. Railway gives me that. TPU v8 accessible via cloud abstraction takes exactly that away.

My thesis, and I'll say it plainly: the "agentic era" Google is selling with the TPU v8 is designed for enterprise customers running millions of steps per day, not indie devs with 100–150 daily calls. Calling it the "agentic era" when the economic entry point is weeks of a developer's income is, at best, optimistic marketing. At worst, it's a deliberate decision about who the ecosystem actually cares about.

And that connects to something I was already seeing when I analyzed my agent costs log by log: AI infrastructure companies are building for the 95th percentile of consumption and letting the 5th percentile — indie devs — figure it out with whatever's left over.

The gotchas the official benchmark never mentions

1. The agentic cold start problem

The TPU v8 shines at sustained throughput. Real agentic workloads have short bursts separated by idle time. An agent responding to a user has a radically different usage pattern from a batch inference pipeline. Google's benchmarks measure the second scenario, not the first.
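
One way to check which pattern a workload actually has is to measure how much of the wall-clock window the agent spends doing work. A minimal sketch, assuming each span carries a start timestamp `start_ms` alongside `duration_ms`. The `start_ms` field, the helper name `duty_cycle`, and the sample spans are hypothetical, and overlapping spans are not handled.

```python
def duty_cycle(spans):
    """Fraction of the observed window spent inside a step (assumes no overlapping spans)."""
    if not spans:
        return 0.0
    window_start = min(s["start_ms"] for s in spans)
    window_end = max(s["start_ms"] + s["duration_ms"] for s in spans)
    busy_ms = sum(s["duration_ms"] for s in spans)
    window_ms = window_end - window_start
    return busy_ms / window_ms if window_ms > 0 else 0.0


# Hypothetical burst: three steps in quick succession, a long idle gap, then one more
burst = [
    {"start_ms": 0, "duration_ms": 500},
    {"start_ms": 600, "duration_ms": 1300},
    {"start_ms": 2000, "duration_ms": 200},
    {"start_ms": 60000, "duration_ms": 1000},
]
utilization = duty_cycle(burst)
print(f"Duty cycle: {utilization:.1%}")
```

A duty cycle of a few percent would suggest that hardware tuned for sustained throughput sits idle most of the time on this kind of workload.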

2. Long context destroys linear projections

My most expensive reasoning steps happen when accumulated context exceeds 40k tokens. The relationship between context length and latency isn't linear in current models — it's quadratic in attention, even if modern implementations mitigate it with tricks. But none of the TPU v8 benchmarks I've seen show the degradation curve with long contexts and accumulated multi-step state. That's exactly the real agentic use case.

# How I measure latency degradation vs. context size in my logs
import statistics

from collections import defaultdict

# Group by context token bucket
buckets = defaultdict(list)

for span in spans:
    ctx_tokens = span.get("context_tokens", 0)
    bucket = (ctx_tokens // 10000) * 10000  # 10k token buckets
    buckets[bucket].append(span["duration_ms"])

for bucket_start in sorted(buckets):
    lats = buckets[bucket_start]
    print(
        f"Context {bucket_start//1000}k-{(bucket_start+10000)//1000}k tokens | "
        f"P50: {statistics.median(lats):.0f}ms | "
        f"P95: {sorted(lats)[int(len(lats)*0.95)]:.0f}ms | "
        f"n={len(lats)}"
    )

In my data: going from 10k to 40k tokens of context multiplies my P95 latency by 2.8x. A benchmark at a fixed 8k token context tells me nothing useful about that.

3. The access gap is asymmetric

Google's models (Gemini) have native access to TPU v8 through Vertex AI. Anthropic, OpenAI, and open-source models don't. If my stack uses Claude — and it does, as I talked about when I was evaluating the Pro plan and its real limitations — the TPU v8 isn't my accelerator, it's Google's. That's not a minor detail: it's a competitive advantage disguised as neutral infrastructure.

4. The vendor lock-in problem nobody names

Migrating agentic workloads to TPU v8 via Vertex AI means coupling your architecture to Google Cloud primitives. After the technical debt I analyzed in the context of Windows Subsystem for Linux, I'm very careful about how much platform surface area I adopt without a clear exit path. An agent that runs fine on Railway with Docker today can migrate to fly.io tomorrow in a few hours. An agent coupled to TPU v8 + Vertex AI cannot.

FAQ: Google TPU v8 and the agentic era for devs

Does the TPU v8 improve my agent latency if I'm using the Anthropic or OpenAI API?

Not directly. The TPU v8 is Google hardware: it accelerates workloads running inside Google Cloud, specifically models served via Vertex AI or Google AI Studio. If you're calling the Anthropic API from Railway, the TPU v8 doesn't touch you. You might see an indirect improvement if Google uses that hardware to serve Gemini faster, but that's neither guaranteed nor predictable from the outside.

What's the real access price for the TPU v8 on an indie project?

Direct access requires reserving pods on Google Cloud TPU, with a minimum of 8 chips and pricing in the $2.40–$3.20/chip/hour range (preview values, subject to update). For inference via Vertex AI, the model is per token/request and more accessible, but it introduces abstraction latency. There is no "hobby" tier for TPU v8 at the time of this post.

Is it worth migrating an agent to Vertex AI to take advantage of TPU v8?

Depends on scale. If you're processing fewer than 500 agentic steps per day, probably not: the migration overhead and lock-in outweigh the latency benefit, which as I show in my measurements is 11–15% end-to-end in typical indie dev workloads. If you're processing millions of steps, the equation changes.

Why don't Google's benchmarks represent real agentic workloads?

Because official benchmarks measure sustained throughput at fixed contexts (typically 4k–8k tokens) with large batches. A real agent has short bursts, variable accumulated context that can grow to 40k+ tokens in long sessions, and idle periods between steps. Those conditions degrade performance non-linearly and the official benchmarks don't model them.

Does the TPU v8 change anything for devs running open-source models locally?

Only if you're running those models on Google Cloud. If you're running Qwen or Llama locally (as I explained when I tested Qwen3 on my laptop), the TPU v8 doesn't exist in your stack. Google hasn't published a straightforward way to load arbitrary models onto TPU v8 outside its ecosystem. The path is Cloud TPU with JAX/PyTorch XLA, which has a considerable adoption curve.

When does it actually make sense to seriously evaluate the TPU v8?

When you have: (a) an inference workload measurable in millions of tokens/day, (b) a model that Google serves natively or that you can adapt to XLA without prohibitive cost, and (c) an infrastructure budget that supports experimentation without freezing production. Unless all three apply today, bookmark it and revisit in 6 months when the abstraction layer matures.

What the TPU v8 really says about the ecosystem

I'm not bothered that Google builds extraordinary hardware. I'm bothered by the framing.

Calling this "infrastructure for the agentic era" when the economic entry point systematically excludes indie devs is exactly the same pattern as when Reddit banned AI-generated content with criteria that seem neutral but favor actors with resources to comply. The ecosystem gets built for the top percentile and then sold as democratization.

My production numbers are clear: on my real agentic workload, the end-to-end latency improvement from TPU v8 would be ~11–15%, not the 40% the official benchmark promises. 71% of my latency is network and API overhead — no new chip fixes that. What fixes it is architecture, smart caching, and well-managed context. Things I can do today, on Railway, with what I already have.

What I'll grant: the TPU v8 is genuinely impressive for the workloads it was designed for. 42.5 exaFLOPS per pod isn't marketing — it's serious engineering. What I won't buy: that it's relevant to me today, or that calling it the "agentic era" is honest when Google's agentic era requires a budget most indie devs don't have and probably never will.

The decision to make that hardware inaccessible to the indie ecosystem isn't a technical limitation. It's a business decision. And naming it as such matters.
