The untrainable corner
Sarah Guo published "The Untrainable" in June 2026.[1] It answers a despair she sees in investors: if models keep getting better at everything, every company built on one is a thin wrapper waiting to be absorbed, so buy Anthropic and Nvidia and go home.
Her verdict is that the despair is half right. The wrapper layer is being absorbed, but value is sliding toward work no lab can train on.
Measurable means trainable
A benchmark is a thing you can measure, and a thing you can measure is a thing you can train against. Coding agents matured first because a compiler and a test suite are free verifiers, so labs could grind against the check until they beat it.
Devin solved 13% of SWE-Bench tasks in 2024. Eighteen months later the best agents score in the high eighties and do real work inside Goldman Sachs and the U.S. Army.
The models ate the measurable part of software engineering and left the rest. MIT's Mert Demirer measured the rest across more than 100,000 developers: coding agents lifted code written by roughly 180% but code shipped by only 30%.
Shipping still runs through people, because correctness in a decade-old codebase only shows up under real load, and a smarter model does not make the world run faster.
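The gap between the two Demirer figures is easier to see as a ratio. The sketch below assumes that "lifted by 180%" means code written rose to 2.8 times its baseline and that "by 30%" means code shipped rose to 1.3 times its baseline. Under the alternative reading, where output rose to 180% of baseline, the ratio would be different. The function name is illustrative and does not come from the study.

```python
def ship_through_ratio(written_lift_pct: float, shipped_lift_pct: float) -> float:
    """Return how much of each unit of written code still ships, relative to baseline.

    Assumption: a lift of X% means the quantity grew to (1 + X/100) times baseline.
    """
    written_multiple = 1 + written_lift_pct / 100   # 180% lift -> 2.8x
    shipped_multiple = 1 + shipped_lift_pct / 100   # 30% lift  -> 1.3x
    return shipped_multiple / written_multiple


ratio = ship_through_ratio(180, 30)
print(f"{ratio:.2f}")  # prints 0.46
```

On that reading, each unit of code written is a bit less than half as likely to ship as it was before the agents arrived, which is the bottleneck the paragraph above describes.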
The 2x2
Guo sorts work with two questions: is its correctness private and expensive to establish, and is it locked inside a system you cannot get into? Set those against how saturated the task is and four quadrants fall out.
Saturated work with public answers goes to whichever open model is cheapest that week, because once checking is free the buyer only compares prices. Frontier work with public answers goes to the labs; coding benchmarks live there, and owning a free eval counts for nothing.
The prize is frontier work whose correctness exists only in private. Guo points to the inference clouds hosting AI-native companies, where most tokens come from custom models rather than generic open ones.
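The quadrants described above can be written as a small lookup. This is only a sketch: it assumes Guo's two questions collapse into a single public-versus-private axis, the function name and labels are hypothetical, and the saturated-and-private quadrant returns None because the text above does not say who wins it.

```python
from typing import Optional


def likely_winner(frontier: bool, private_correctness: bool) -> Optional[str]:
    """Map a task onto the 2x2 as summarized above (labels are illustrative)."""
    if not private_correctness:
        # Public answers: checking is free, so labs or the cheapest model win.
        return "labs" if frontier else "cheapest open model"
    if frontier:
        # Frontier work whose correctness exists only in private: the prize.
        return "insider with private ground truth"
    # Saturated work with private correctness is not described in the text.
    return None


print(likely_winner(frontier=True, private_correctness=False))  # prints labs
```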
The lock and the deadbolt
A better model does not make private ground truth public. It does not hold the license, sign off on the liability, or own the firm's files, and it cannot be the party that gets sued when the answer is wrong.
Guo calls the environment the lock: you can verify whether AI did something useful inside a bank's systems only after you have passed the security review and signed a contract that puts your name on the outcome.
The user is the deadbolt. A majority of American doctors open OpenEvidence every day, and a lab with a flawless medical model still has no way into that habit, because the habit took years of clinical trust to build and compute cannot shortcut it.
Writing down what good means
When work cannot be scored from outside, an insider decides what a good answer is. Harvey publishes the benchmark for law and Sierra publishes one for voice agents; both earned that authorship by being the tool the field already uses.
The same logic sets pricing: Sierra charges when its agent resolves a customer's issue and nothing when it escalates to a human, so the price becomes the evaluation. That only works because Sierra owns the definition of resolved.
Cognition makes the equivalent move in software, selling Devin with a performance guarantee, which you can only offer for outcomes inside a system you are trusted in.
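The claim that Sierra's price becomes the evaluation can be made concrete. In the minimal sketch below, the invoice divided by the maximum possible invoice equals the resolution rate, so the bill and the score are the same number. The class, the function names, and the price of 150 cents per resolution are hypothetical and are not Sierra's actual terms.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Conversation:
    resolved: bool  # True if the agent resolved the issue, False if it escalated


def invoice_cents(conversations: List[Conversation], price_cents: int) -> int:
    """Charge the per-resolution price for resolved issues, nothing for escalations."""
    return sum(price_cents for c in conversations if c.resolved)


def resolution_rate(conversations: List[Conversation]) -> float:
    """Fraction of conversations the agent resolved on its own."""
    return sum(c.resolved for c in conversations) / len(conversations)


log = [Conversation(True), Conversation(True), Conversation(False), Conversation(True)]
PRICE_CENTS = 150  # hypothetical price per resolved issue

bill = invoice_cents(log, PRICE_CENTS)
print(bill)                              # prints 450
print(bill / (PRICE_CENTS * len(log)))   # prints 0.75
print(resolution_rate(log))              # prints 0.75
```

The sketch takes the `resolved` flag as given. In practice, whoever sets that flag controls both the invoice and the score, which is why owning the definition of resolved matters.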
No resting spot
The absorption frontier keeps rising, because labs keep learning to measure more of the work, so the untrainable ground shrinks under whoever stands on it. A durable company keeps stepping toward whatever cannot yet be scored.
The obvious objection is the supplier: a lab can undercut your product or revoke your API access. Guo's answer is that the frontier is crowded, so a builder cut off by one lab can turn to another, and that distribution matters more than model quality: the chat share ChatGPT is losing goes to Gemini through Android and Search distribution rather than through a better model.
Her bet is the direction: intelligence keeps getting cheaper, and value keeps sliding toward the places a model cannot reach. The advice that follows is to get inside one of those places, do the unglamorous integration work, and write down what good means there.