Essay №003 8 min

What the Operators See

Every AI commentator with reach is inside: lab alum, researcher, builder, VC-adjacent. The operator chair is empty. A dispatch from the chair, in a month when the substrate shifted weekly.


Last Tuesday around 2 a.m., while I was already asleep, one of my agents committed a config rollback to a stale branch. Not a destructive one. It just reverted a single line in the embedder’s worker pool config back to a value from three weeks ago, on a branch I’d abandoned but never deleted. The change merged itself into nothing. The next morning the embedder was running fine on main, the queue was draining, the agent’s run log showed a green checkmark next to “applied configuration update.” It took me two days to notice the stale branch had a phantom commit on it, and another twenty minutes to reconstruct what happened: an older snapshot of context, a tool call routed to git, a confident summary, no harm done, no signal that anything had been wrong.

I wasn’t angry. The agent did exactly what I’d told it to do, against a state of the world that no longer existed. Nothing in the published discourse about AI in April and May prepares you for this kind of error. Not the safety departures, not the commoditization takes, not the agent-native-infrastructure essays. The error mode is too quiet. It doesn’t break anything. It just leaves a small archeological layer of wrongness, and you only find it because you happen to look.

I run Obsidian Forge alone: a self-hosted Postgres vault, an MCP server, a small mesh of agents I’ve productionized with the assumption that I’m both the architect and the on-call engineer. I also have a day job at Proton.ai writing software with people who do this for a living. So I read the commentariat carefully, because reading them is how I know what my peers are thinking. And what I notice, increasingly, is that almost no one writing about AI right now is sitting where I’m sitting.

The discourse, as of this week, is dominated by four kinds of voices. There are the lab insiders and alumni: Karpathy on the Sequoia stage last week talking about agent-native infrastructure, Mrinank Sharma’s open letter when he left Anthropic Safeguards in February, Zoë Hitzig’s New York Times op-ed on her way out of OpenAI, framing the company’s ad-business trajectory through the Facebook precedent. Their angle is structural and high-altitude: where the field is going, what its civilizational stakes are, what the labs are getting wrong. Then the independent analysts: Nathan Lambert’s mid-2026 open-models bets at Interconnects, Zvi’s weekly digest now up to AI #166, Dwarkesh’s interviews with Dylan Patel about compute and Michael Nielsen about how science actually progresses. They synthesize. They’re the field’s best readers of itself. Then the builder-influencers: Simon Willison, who blogged yesterday about vibe coding and agentic engineering “getting closer than I’d like”; swyx and the Latent Space orbit, an outfit Karpathy publicly endorsed back in 2024, a recommendation that has stuck; Gergely Orosz at the Pragmatic Engineer, whose April survey of 900-plus engineers and engineering leaders gave the discourse its single most-cited statistic: 30% of teams hitting usage limits, budget owners “increasingly nervous.” And then the doomers and skeptics, sorting themselves out: Celia Ford at Transformer arguing the safety movement needs to build broader public coalitions around concrete concerns like job displacement and corporate accountability, because elite-circle warnings aren’t moving policy on their own; the Washington Post’s coverage of safety advocates pivoting to creator outreach.

These are all real perspectives. They are not, individually or collectively, the perspective of someone who deployed Opus 4.7 the morning it shipped, switched a chunk of their routing to GPT-5.5 medium eight days later when the per-task economics in their workload made it impossible to ignore (per-token list pricing is roughly even, with Opus actually slightly cheaper on output, but on the agentic loops I run GPT-5.5 emitted noticeably fewer output tokens to land the same result, and that compounded quickly), evaluated DeepSeek V4 the day after that because $0.14/$0.28 per million Flash-tier tokens running on Huawei silicon is a sentence you have to take seriously even if you end up not using it, and then spent the rest of the month rewriting their eval harness because the surface area of model capabilities had genuinely changed and the old harness was lying. Three frontier-class options inside a single eight-day window. Roughly ten substantial new model variants in thirty days, if you count the Qwen 3.6 family (Plus, 35B-A3B MoE, Max-Preview, 27B dense), GPT-5.5 Pro, DeepSeek V4 Pro and Flash, Mistral Medium 3.5, and Muse Spark. Anthropic doubled Claude Code rate limits and removed peak-hour throttling on May 6 as part of the SpaceX deal announcement, and that, for me and for everyone I know running production tooling, was the operator-facing news inside the news. Not the deal. The throttling change.

I am not complaining about pace. The pace is great. The pace is the thing. But the pace produces a texture that doesn’t show up in the published essays, because the people writing the essays are mostly not doing the daily integration work, or they are doing it inside teams large enough that someone else absorbs the texture. The texture is: every Tuesday the substrate has shifted under you a little. The router config from last Tuesday is wrong by Friday. An eval that was meaningful in March returns noise in May because the models it was discriminating between have all moved. A tool I noticed in its February Show HN, RTK, the Rust-written CLI proxy that compresses agent command output by sixty to ninety percent before it hits the context window, went from being a curiosity to load-bearing infrastructure on my workstation a few weeks later, after I watched my own agents drown in pip install output and hit context limits doing trivial work. The numbers Patrick Szymkowiak posted alongside the project (seven thousand commands intercepted, twenty-four million characters saved, eighty-three percent reduction in fifteen days) are not benchmarks. They’re a confession from one operator that other operators recognized instantly.

When I read Karpathy on agent-native infrastructure, I agree. He’s right. The next layer of infrastructure will be designed for agents, not for users. He even has a name for the workflow state of compulsively directing them: a “state of psychosis,” he called it on the No Priors podcast in March, repurposing a clinical term Søren Dinesen Østergaard introduced in a 2023 Schizophrenia Bulletin editorial. But the essay I want to read alongside his is the one that says: yes, and in the meantime the agent-native infrastructure I have actually built is held together with a Postgres trigger I wrote at midnight, an MCP server I’m still finding race conditions in, a mesh of twenty-two agents that need their prompts updated every time a model version ships, and a backup script I run manually because I haven’t trusted any of the cron-based options. That essay does not exist. Not because no one is qualified to write it (there are thousands of us), but because the cultural production of AI commentary has settled into channels where the operator chair isn’t represented. The lab insiders write from inside the lab, the independents from a synthesis layer, the builder-influencers from the most legible parts of building (the model itself, the API, the headline benchmark). The Pragmatic Engineer survey gestures at the chair I’m sitting in (30% hit limits, the budget owners are nervous), but a survey isn’t a dispatch. The dispatch is what hasn’t been written.

And when I say “what the operators see,” I don’t mean we have secret knowledge. I mean we are the load-bearing element in a particular kind of error. The errors that don’t ring alarms. An agent doing the right thing against the wrong state of the world. A model strong enough at coding that you stop reading the diffs and the regression slips through. A cost line creeping up because routing logic written in March doesn’t reflect that GPT-5.5 medium quietly became the right default in late April. An eval you trust because you wrote it, even though every model it grades has changed since you wrote it. These aren’t catastrophic failures. They’re the kind of failures that erode trust slowly enough that you don’t realize you’ve stopped trusting until you find yourself adding a manual review step to something you used to ship clean.

The luxury frames in the elite discourse are real. Commoditization is happening. Agentic infrastructure is going to be a new risk class. The safety credibility crisis is genuine. I read all of it, and I take all of it seriously. But on Tuesday morning the question on my desk isn’t whether GPT-5.5 represents a commoditization shock or whether agents corrupt recovery pathways at civilizational scale. The question is whether the rollback I just noticed on the stale branch happened anywhere else, and whether my eval harness would have caught it if it had, and whether I should pause the whole mesh to find out, or trust that this was a one-off and keep shipping. That is the conversation I have with myself every week. A hundred thousand other people are having it too. No one has written it down.

So I will.