The eval problem moved

A year ago, if you wanted to deploy a robot you almost always trained its policy yourself. Today, an increasing number of the most ambitious teams in robotics don't — and the ones that still do are quietly hedging. The question this essay is about is not whether that shift is happening. It is. The question is what new tooling layer it creates, and who builds it.

The answer, we think, is an evaluation infrastructure layer that sits horizontally across the application layer of robotics. And the next eighteen months are the window in which it gets built.

The industry bifurcated

For the last decade, "robotics company" meant roughly one thing: a vertically integrated team that owned the hardware, the perception stack, the controller, and increasingly the policy that produced motion from observations. The policy was the crown jewel. It was trained in-house, on data the company collected itself, and the team's eval rig was wrapped around it tightly enough that the two were effectively one artifact.

That shape is breaking. A handful of foundation world-model labs have done for robotics what the first generation of large language models did for text: trained generalist models broad enough that, for many real tasks, building on top of them beats training a specialist model from scratch. Whether you call them world models or foundation models, the practical consequence is the same. There is now a layer above which you can build a robotics product without owning the bottom of the stack.

When that happens to a technology, it bifurcates. Some teams stay fully integrated and ride model improvements they ship themselves. Most don't. Most realize they were never in the foundation-model business; they were in the deployed-robot business, and the foundation model was a means. Those teams move up the stack. They become application-layer robotics companies: teams whose edge is the operating envelope, the customer relationships, the hardware, the safety case — but not, anymore, the underlying model.

This is a move other software industries have made more than once. A generation of companies builds on top of an abstraction an earlier generation had to build itself, and the product layer ends up larger than the layer beneath it. There's no reason robotics should be the exception.

The eval problem moved with the abstraction

If you owned the model, you owned the evals. You knew the training distribution because you assembled it. You knew when the model had changed because you were the one changing it. Your eval suite was a gate at the end of a training run, and the team that ran the run was the team that read the gate.

When the model belongs to somebody else, every one of those assumptions stops holding. You don't know the training distribution. You don't decide when the model changes; the provider does, on their cadence, with release notes that range from "we improved general capabilities" to actual silence. And the eval suite is no longer a gate at the end of your training run — there isn't one. The eval suite is the only place you find out whether the system you ship today is better or worse than the system you shipped last week, on the tasks that pay your bills.

That changes evaluation from a research activity into a portfolio activity. The questions stop sounding like "did our model improve" and start sounding like "across the providers we could ship on top of this quarter, which one has the best long-tail behavior in the warehouse aisle widths our customer actually uses, and how confident are we that the answer won't flip next month." That's not a question a notebook answers. It's a question infrastructure answers.
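To make the "how confident are we" part concrete, here is a minimal sketch of one way to put an uncertainty band around a provider's success rate on a task, using a Wilson score interval. The function names, the provider counts, and the choice of Wilson intervals are all our assumptions for illustration, not a description of any particular tool. Treating non-overlapping intervals as "clearly separated" is a deliberately conservative heuristic, not a formal hypothesis test.

```python
import math

def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a success rate (z=1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half

def clearly_separated(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if the two (successes, trials) results have non-overlapping intervals."""
    a_lo, a_hi = wilson_interval(*a)
    b_lo, b_hi = wilson_interval(*b)
    return a_hi < b_lo or b_hi < a_lo

# Hypothetical results for two providers on the same narrow-aisle task.
few_runs = clearly_separated((9, 10), (7, 10))       # 10 episodes each
many_runs = clearly_separated((90, 100), (70, 100))  # 100 episodes each
```

With the same observed rates, ten episodes per provider leave the intervals overlapping, while a hundred separate them. That gap between "looks better" and "is reliably better" is the reason the run count has to be part of the answer.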

Why existing tools don't fit

Every robotics team we've talked to in the last year has a version of the same internal eval rig. One engineer owns it. It runs a few dozen scenarios in one simulator. It outputs a CSV that somebody screenshots into Slack. The CSV grows columns over time. Nobody trusts the older columns. This works, badly, until the team's question grows past it.

The natural next thought is that an existing platform handles this. The general experiment-tracking tools are built around the assumption that you own the training loop, which is the very assumption that no longer holds. The generalist eval platforms are built around text-in, text-out workloads, and don't have a first-class concept of a scene, a trajectory, a physics seed, or a policy artifact that isn't a checkpoint you produced. You can force-fit a robotics workload onto either. People do. It feels like wearing a sweater inside out.

The simulator vendors themselves are a more interesting non-answer. They could in principle build a cross-vendor eval layer on top of their own software. They almost certainly won't, because the moment they do, they have built the tool that helps their customers leave them. Horizontal layers don't get built by the vendors of the things they horizontalize. That's a structural argument, not a values one.

Why this needs to be a horizontal layer

The eval problem, properly stated, has three axes. You vary the policy. You vary the model the policy is built on. You vary the environment the policy runs in. Every robotics team will eventually want to move along all three at once, because the interesting comparisons are not "this policy vs. that policy" but "this policy on this provider in this environment vs. that policy on that provider in that environment."
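As a rough sketch of what moving along all three axes means in practice, the comparison space is the cross product of policies, providers, and environments. The class name and every identifier below are hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class EvalCell:
    """One point in the comparison space: a policy, on a provider, in an environment."""
    policy: str
    provider: str
    environment: str

def build_grid(policies, providers, environments):
    """Enumerate every (policy, provider, environment) combination."""
    return [EvalCell(p, m, e) for p, m, e in product(policies, providers, environments)]

# Hypothetical names: two policies, two model providers, three simulator environments.
grid = build_grid(
    ["pick_v1", "pick_v2"],
    ["provider_a", "provider_b"],
    ["sim_x", "sim_y", "sim_z"],
)
```

Even this toy grid has twelve cells, and each one needs its own scenarios and seeds. The multiplication is what outgrows a single engineer's rig.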

A horizontal layer is the only place that comparison lives. If it lives inside the model provider, the provider has control over the framing and the framing will, gently, favor them. If it lives inside the simulator, you can't compare across simulators. If it lives inside the robotics team, every team builds it twice — once badly, once expensively — and the industry never compounds learning across that work.

Why now

The foundation-model layer has only just stabilized enough that an application layer can form on top of it with a straight face. A year ago, the right answer for most teams was still "train your own." Today it isn't. Eighteen months from now, that will be obvious in hindsight, and the eval problem will be acute for far more companies than it is now. The right time to build the tool is on the leading edge of the curve, while design decisions can still be made on their merits rather than to fit a compatibility shim.

The adjacent players are not blind. The generalist eval and experiment-tracking tools will eventually look at this surface and decide whether to extend into it. The simulator vendors will do the same. They will make different choices than a dedicated team would make, because their priors will pull them back toward what they already are. The window in which the right team gets to define the primitives is the window before the wrong team accidentally does.

What we're building

Contin is the evaluation infrastructure layer for application-layer robotics. Concretely: a continuous eval harness that runs your suite against every policy candidate, across every simulator backend you care about, against every foundation model you're considering shipping on; a regression report that tells you which long-tail tasks just got better and which just got worse; an artifact model that treats policies, scenes, seeds, and trajectories as first-class objects you can re-run, share, and audit. It is not a simulator. It is not a foundation model. It is not an MLOps platform for training runs. It is the layer in between, which is currently missing.
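To illustrate the shape of a regression report over first-class run records, here is a minimal sketch. It is not Contin's API: the `RunRecord` fields, the function names, the `min_delta` threshold, and the sample data are all assumptions made up for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RunRecord:
    """One evaluated episode, keyed by the artifacts needed to re-run it."""
    policy_id: str
    scene_id: str
    seed: int
    success: bool

def success_rate_by_scene(records):
    """Map each scene to the fraction of its episodes that succeeded."""
    totals = {}
    for r in records:
        wins, n = totals.get(r.scene_id, (0, 0))
        totals[r.scene_id] = (wins + int(r.success), n + 1)
    return {scene: wins / n for scene, (wins, n) in totals.items()}

def regression_report(baseline, candidate, min_delta=0.1):
    """List scenes whose success rate moved by at least min_delta in either direction."""
    base = success_rate_by_scene(baseline)
    cand = success_rate_by_scene(candidate)
    report = {"improved": [], "regressed": []}
    for scene in sorted(base.keys() & cand.keys()):
        delta = cand[scene] - base[scene]
        if delta >= min_delta:
            report["improved"].append(scene)
        elif delta <= -min_delta:
            report["regressed"].append(scene)
    return report

def runs(policy_id, scene_id, outcomes):
    """Build records for one scene, using the outcome's position as the seed."""
    return [RunRecord(policy_id, scene_id, seed, ok) for seed, ok in enumerate(outcomes)]

# Hypothetical data: the same scenes and seeds, run against last week's and this week's system.
baseline = runs("v1", "narrow_aisle", [True, True, True, False]) + runs("v1", "pallet_stack", [False, False, True, True])
candidate = runs("v2", "narrow_aisle", [True, False, False, False]) + runs("v2", "pallet_stack", [True, True, True, True])
report = regression_report(baseline, candidate)
```

Because each record carries its scene and seed, a flagged scene can be re-run exactly rather than argued about from a screenshot of a CSV.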

If you are building on top of someone else's world model and you recognize the problem in this essay, we want to hear from you. Tell us what you're trying to evaluate. We're taking on a small number of design partners now, and the design of the product is genuinely up for grabs in conversation with them.