Essay №002 2026-05-07 7 min

The Verification Gap: Notes on the SubQ 1M Launch

A Miami startup just claimed a 1,000× compute reduction at 12M tokens with no weights, no technical report, and no way to check. The contrast with Mamba-3 is the actual story.

On May 5, 2026, a Miami startup with eleven PhDs and a $29M seed announced what it called Subquadratic Sparse Attention, a content-dependent block-selection scheme that performs exact attention over a chosen subset of positions and, by the company’s account, scales linearly in compute with context length. The production model handles 1M tokens; a research demo extends to 12M. Taken at face value, the headline numbers rearrange long-context economics: a 52× prefill speedup over FlashAttention at 1M, roughly 1,000× compute reduction at 12M, RULER 128K at 95% (a hair above Claude Opus 4.6’s 94.8%), MRCR v2 at 65.9, SWE-Bench Verified at 81.8, and a per-call cost of $8 at 128K against $2,600 for Opus, by Subquadratic’s own numbers. That implies a cost reduction of roughly 300×, a figure that has not been independently checked against Anthropic’s published rates. If those numbers held up under independent evaluation, then agentic workflows over entire codebases, multi-document research corpora, and full transcript archives would shift from premium operations into routine ones. That is not a small claim. If real, it would redraw the deployment cost curves the rest of the industry is currently optimizing inside.
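The arithmetic behind those headline ratios is simple enough to check by hand, and a short sketch makes the assumptions visible. The cost ratio below uses only the two dollar figures quoted above. The second function is a toy cost model of my own, not anything Subquadratic has published: it assumes dense attention scores n × n query-key pairs while a sparse scheme scores n × k, with k a fixed per-query budget. The function names and the model are illustrative assumptions, and the 52× prefill figure is a speedup claim rather than a pair count, so it is not directly comparable to what this model produces.

```python
def reduction_factor(baseline: float, candidate: float) -> float:
    """How many times smaller `candidate` is than `baseline`."""
    return baseline / candidate


def dense_to_sparse_ratio(context_len: int, attended_per_query: int) -> float:
    """Toy model (an assumption): dense attention scores n*n query-key pairs,
    a sparse scheme scores n*k pairs, so the ratio reduces to n / k."""
    dense_pairs = context_len * context_len
    sparse_pairs = context_len * attended_per_query
    return dense_pairs / sparse_pairs


# Per-call cost figures as quoted from the launch material.
cost_ratio = reduction_factor(2600.0, 8.0)

# Under the toy model, a 1,000x reduction at 12M tokens implies this
# many attended positions per query.
implied_budget = 12_000_000 // 1000

# The same hypothetical budget applied at 1M tokens.
ratio_at_1m = dense_to_sparse_ratio(1_000_000, implied_budget)

print(f"cost ratio: {cost_ratio:.0f}x")
print(f"implied per-query budget at 12M: {implied_budget} positions")
print(f"toy-model ratio at 1M: {ratio_at_1m:.1f}x")
```

The quoted dollar figures work out to 325×, which "roughly 300×" rounds down. The toy model shows only what a fixed per-query budget would imply. It cannot say whether SubQ works that way, and that is exactly what a technical report would settle.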

The trouble is that nobody outside Subquadratic Inc. can presently determine whether the numbers are real. No weights have been released, no technical report posted. The launch material reports results for only selected slices of the 12M range. The benchmark slate is one a long-context-optimized model would naturally choose: tasks where extending the window is most of the work. The SWE-Bench chart in the launch post visually compressed an 81-to-87 spread to look like a smaller gap than it is, an artifact noticed and flagged within hours on Hacker News. None of these observations falsifies the underlying architecture. Each of them, individually, is the sort of thing a fast launch with a small team might leave unaddressed and clean up later. Collectively they describe a launch that has been engineered for amplification rather than for verification.

The architectural description itself is plausible. Content-dependent sparse attention is an active research direction, and exact attention over a selected block subset has the right complexity profile to produce something like the claimed scaling. What is not visible from outside is whether SSA is a from-scratch architecture trained at scale or, as Will Depue argued on X, “almost surely a sparse attention finetune of Kimi or DeepSeek.” Those two stories produce a similar-looking benchmark surface and very different scientific contributions. Distinguishing them requires weights, a training recipe, or a third-party reproduction. The launch provides none of the three. CTO Whedon, when asked, conceded that the team had used “weights from open-source models as a starting point.” That neither confirms Depue’s specific guess nor resolves the underlying problem. One widely shared post on X by Dan McAteer summarized the state of public information by writing that SubQ is “either the biggest breakthrough since the Transformer… or it’s AI Theranos.” The symmetry of that observation is itself the point: from where readers are sitting, the two readings are evidentially indistinguishable.
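To make the "exact attention over a selected block subset" idea concrete, here is a minimal pure-Python sketch for a single query. It is a generic illustration of content-dependent block selection, not a reconstruction of SSA, whose selection rule is not public. The mean-key block summary, the top-k rule, and every name in the code are assumptions chosen for clarity.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def block_sparse_attention(q, keys, values, block_size, top_k):
    """Single-query attention restricted to the top_k highest-scoring blocks.

    Assumed selection rule (illustrative only): summarize each block by the
    mean of its keys, rank blocks by dot(q, summary), keep the best top_k.
    Attention inside the kept blocks is ordinary exact softmax attention.
    """
    n, d = len(keys), len(q)
    blocks = [range(s, min(s + block_size, n)) for s in range(0, n, block_size)]

    # 1. One summary vector per block (mean key).
    summaries = [
        [sum(keys[i][j] for i in blk) / len(blk) for j in range(d)]
        for blk in blocks
    ]

    # 2. Content-dependent selection: rank blocks against this query.
    ranked = sorted(range(len(blocks)),
                    key=lambda b: dot(q, summaries[b]), reverse=True)
    chosen = sorted(ranked[:top_k])
    positions = [i for b in chosen for i in blocks[b]]

    # 3. Exact softmax attention over the selected positions only.
    scale = 1.0 / math.sqrt(d)
    logits = [dot(q, keys[i]) * scale for i in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    out = [
        sum(w * values[i][j] for w, i in zip(weights, positions)) / total
        for j in range(len(values[0]))
    ]
    return out, positions


# Eight positions in blocks of two; only positions 4 and 5 align with q.
q = [1.0, 0.0]
keys = [[0.0, 0.0]] * 4 + [[1.0, 0.0]] * 2 + [[0.0, 0.0]] * 2
values = [[float(i)] for i in range(8)]
out, positions = block_sparse_attention(q, keys, values, block_size=2, top_k=1)
print(positions, out)  # prints [4, 5] [4.5]
```

The exact-attention step touches only top_k × block_size positions per query, however long the context is. The selection step is another matter: this toy version still scores every block summary for every query, so its total cost shrinks by a constant factor rather than becoming linear. A scheme that is linear end to end needs a cheaper way to choose blocks, and that is the detail the launch material does not let outsiders inspect.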

The contrast worth drawing is not between SubQ and any incumbent commercial model. It is between SubQ and the work that landed in the same window from a different epistemic posture. On March 16, Lahoti, Li, Chen, Wang, Bick, Kolter, Dao, and Gu posted Mamba-3 to arXiv, with a Together AI blog post the next day and a presentation at ICLR 2026. The model was open-sourced via mamba-ssm on day one. It introduces three architectural changes (exponential-trapezoidal discretization for a richer recurrence, complex-valued state updates, and a MIMO formulation that improves decode parallelism). At the 1.5B-parameter scale, with MIMO, it reports +1.8 points of downstream accuracy over Gated DeltaNet, and it matches Mamba-2 perplexity at half the state size. These are real but modest numbers, and the authors present them as such.

The key sentence, for present purposes, is one the Mamba-3 authors put into their launch-day Together AI blog post: linear models, with their fixed-size state, naturally underperform their Transformer counterparts on retrieval-based tasks. They go on to predict, explicitly, that linear layers will be “predominantly used in conjunction with global self-attention layers in the future.” Some combination of sparse and linear attention is the destination, not any pure-SSM endpoint. Mamba-3 says the part you are not supposed to say. The authors disclose the shape of their model’s weakness in the same document that demonstrates its strengths, and forecast that their own architectural family will not stand alone. This is also, not coincidentally, the kind of statement that does not generate viral coverage. There is no headline in “we improved this slice of the design space and here is the slice we did not improve.”

Mamba-3 is verifiable in the strong sense: the weights are public, the architecture is described in enough detail to reimplement, the methodology is stated, the limitations conceded. A researcher with a GPU and a weekend can confirm or contradict any specific claim in the paper. None of this is true of SubQ. The asymmetry between these two artifacts is not primarily a difference in scientific substance (Mamba-3’s claims are smaller because the authors chose to make smaller claims they could substantiate); it is a difference in what the field’s amplification mechanisms reward. SubQ’s 1,000× figure traveled across timelines in hours. Mamba-3’s release was covered in technical communities and largely missed by the broader discourse. The sealed claim won the attention race against the open one by orders of magnitude.

Mamba-3 is not the only example of the alternative pattern in the same window. On April 25, Zandieh, Daliri, Hadian, and Mirrokni at Google Research presented TurboQuant at ICLR 2026: a training-free, data-oblivious KV-cache compression scheme that hits roughly 3 bits per channel with zero accuracy loss and produces a 6× memory reduction at inference, with up to an 8× attention speedup on H100s. The headline number is large, the methodology is reproducible immediately, and the authors do not claim it dissolves any specific commercial moat. It is the unglamorous form of progress: a real efficiency gain in a load-bearing component, demonstrated in public, available to anyone who reads the paper. It does not trend. It works anyway.

The point of laying these three releases side by side is not to argue that SubQ is fraudulent. The honest position is that, on present evidence, SubQ’s claims are unverifiable, and unverifiable extraordinary claims should be held in a separate epistemic bucket from verifiable ordinary ones, regardless of which produces the more striking number. It is possible, entirely possible, that the Subquadratic team has genuinely engineered what they say they have, and that the technical report and weights, when they appear, will hold up. It is also possible that they have not, and that the gap between the launch post and any reproducible artifact will quietly widen for months. The reader cannot currently distinguish these futures, and neither can I.

What the reader can observe, today, is the mechanism. The field’s academic substrate is converging on hybrid sparse-plus-linear architectures (Log-Linear Attention, SPLA’s block-sparse-plus-linear hybrid, and a March 2026 hardness-of-transformers paper), and it has named this convergence in venues with peer review and open code. Yet the same field gives vastly more attention to a sealed, single-team announcement whose claimed numbers exceed published results by a wide margin than it gives to the open work. Hype versus substance is the laziest framing of this. The real argument is about how verification scales. Open weights, open methodology, and authors who describe their own model’s weak axis are slow signals; press cycles and viral charts are fast signals. The fast signals dominate the public record by default, and there is currently no mechanism in the discourse to slow them down to the speed at which they could actually be checked.

Whether SubQ is the breakthrough or the cautionary tale is, in this sense, the smaller question. The larger problem is that we are currently unable to tell, and the structures that would let us tell are the same structures the most-amplified launches are choosing to bypass. Mamba-3 will still be checkable next year. SubQ may not.