Evaluation infrastructure · App-layer robotics

Evaluation infrastructure for teams shipping robots they didn't pretrain.

Continuous evals for the policies you ship. Across simulators, across providers, across releases.

Illustrative

What a run looks like.

Same suite, three foundation models. The interesting comparisons live across the tabs — where one model wins and another silently regresses on a long-tail task.

suite warehouse-pick-place

scenarios 248 · 3 seeds

baseline v2025.04.30

world-model-A 240M · fast · low memory

world-model-B 600M · balanced

world-model-C 1B · highest capacity

world-model-A

Task                       Success   vs baseline
pick.bin_to_shelf          0.91      ↑ 0.04
pick.deformable.cloth      0.74      ↑ 0.11
place.precision_5mm        0.63      ↓ 0.08
recover.dropped_object     0.88      ↑ 0.02
navigate.cluttered_aisle   0.95      ─
handoff.dual_arm           0.79      ↑ 0.03

1 regression · 4 improved

report → contin.run/r-7a3c01

world-model-B

Task                       Success   vs baseline
pick.bin_to_shelf          0.94      ↑ 0.07
pick.deformable.cloth      0.68      ↑ 0.05
place.precision_5mm        0.71      ─
recover.dropped_object     0.85      ─
navigate.cluttered_aisle   0.91      ↓ 0.04
handoff.dual_arm           0.82      ↑ 0.06

0 regressions · 3 improved

report → contin.run/r-7a3c01

world-model-C

Task                       Success   vs baseline
pick.bin_to_shelf          0.88      ↑ 0.01
pick.deformable.cloth      0.81      ↑ 0.18
place.precision_5mm        0.55      ↓ 0.16
recover.dropped_object     0.92      ↑ 0.06
navigate.cluttered_aisle   0.96      ↑ 0.01
handoff.dual_arm           0.71      ↓ 0.08

2 regressions · 4 improved

report → contin.run/r-7a3c01

01

Continuous eval harness

Wire your suite into CI. Every policy candidate, every model bump, every release — evaluated before it ships, not after a customer notices.
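
One way a CI gate along these lines could look, as a minimal sketch rather than the product's actual interface. The report shape (a JSON object with a `deltas` map of task name to change in success rate), the `gate` function, and the `max_drop` tolerance are all assumptions made for illustration.

```python
import json
import sys


def gate(report_json: str, max_drop: float = 0.0) -> int:
    """Return a process exit code for CI.

    0 means no task dropped by more than max_drop against the baseline;
    1 means at least one did. The report layout is hypothetical.
    """
    report = json.loads(report_json)
    failed = [
        task
        for task, delta in report["deltas"].items()
        if delta < -max_drop
    ]
    for task in failed:
        # Log to stderr so the failing tasks show up in the CI job output.
        print(f"REGRESSION {task}: {report['deltas'][task]:+.2f}", file=sys.stderr)
    return 1 if failed else 0
```

A CI step would pass the report contents to `gate` and hand the result to `sys.exit`; any non-zero exit fails the job, which blocks the release.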

02

Cross-sim, one interface

Isaac, MuJoCo, your in-house sim. Define a scenario once; run it against any backend with seeds, scenes, and trajectories preserved as first-class artifacts.
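
A rough sketch of what "define once, run anywhere" might mean in code. Every name here (`Scenario`, `Trajectory`, `SimBackend`, `ToyBackend`, `run_suite`) is hypothetical, and `ToyBackend` is a stand-in that calls no real simulator; a real adapter for Isaac or MuJoCo would sit behind the same method.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class Scenario:
    name: str
    scene_version: str  # pinned so reruns load the same scene


@dataclass(frozen=True)
class Trajectory:
    scenario: str
    backend: str
    seed: int
    success: bool


class SimBackend(Protocol):
    """Anything with a name and a rollout method counts as a backend."""

    name: str

    def rollout(self, scenario: Scenario, seed: int) -> Trajectory: ...


class ToyBackend:
    """Placeholder adapter: succeeds on even seeds, fails on odd ones."""

    def __init__(self, name: str) -> None:
        self.name = name

    def rollout(self, scenario: Scenario, seed: int) -> Trajectory:
        return Trajectory(scenario.name, self.name, seed, success=(seed % 2 == 0))


def run_suite(
    scenarios: Iterable[Scenario],
    backends: Iterable[SimBackend],
    seeds: Iterable[int],
) -> List[Trajectory]:
    """Run every scenario on every backend with every seed."""
    scenarios, seeds = list(scenarios), list(seeds)
    return [
        backend.rollout(scenario, seed)
        for backend in backends
        for scenario in scenarios
        for seed in seeds
    ]
```

The scenario definition never mentions a simulator, so adding a backend means writing one adapter class, not touching the suite.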

03

Provider regression detection

When the world model under your stack updates, get a diff against your task envelope — task-level success, trajectory-level deltas, replayable failures.
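
A task-level diff could be as simple as the sketch below. It covers only success-rate deltas, not trajectory-level comparison or replay. The function names are hypothetical, and the baseline figures are assumptions back-computed from the illustrative world-model-A results (success minus the reported change); they are not published numbers.

```python
from typing import Dict


def diff_against_baseline(
    baseline: Dict[str, float], candidate: Dict[str, float]
) -> Dict[str, float]:
    """Per-task change in success rate, rounded to two decimals."""
    return {task: round(candidate[task] - baseline[task], 2) for task in baseline}


def summarize(deltas: Dict[str, float]) -> str:
    """Count tasks that moved down or up; unchanged tasks are not counted."""
    regressed = sum(1 for d in deltas.values() if d < 0)
    improved = sum(1 for d in deltas.values() if d > 0)
    noun = "regression" if regressed == 1 else "regressions"
    return f"{regressed} {noun} · {improved} improved"


# Hypothetical baseline, inferred for illustration only.
baseline = {
    "pick.bin_to_shelf": 0.87,
    "pick.deformable.cloth": 0.63,
    "place.precision_5mm": 0.71,
    "recover.dropped_object": 0.86,
    "navigate.cluttered_aisle": 0.95,
    "handoff.dual_arm": 0.76,
}
# Success rates from the illustrative world-model-A run.
candidate = {
    "pick.bin_to_shelf": 0.91,
    "pick.deformable.cloth": 0.74,
    "place.precision_5mm": 0.63,
    "recover.dropped_object": 0.88,
    "navigate.cluttered_aisle": 0.95,
    "handoff.dual_arm": 0.79,
}

deltas = diff_against_baseline(baseline, candidate)
summary = summarize(deltas)
```

With these inputs `summary` reads "1 regression · 4 improved", and the single regression is `place.precision_5mm`.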

What's in the box

The primitives a robotics team would otherwise rebuild from scratch.

Eval-as-CI for policy releases
First-class policy, scene, seed & trajectory artifacts
Deterministic seeds, pinned scene versions
Task-level and trajectory-level regression diffs
Cross-vendor sim backends (Isaac, MuJoCo, custom)
Cross-provider model adapters
Per-run, per-customer audit URLs
Self-hosted or VPC — your weights and data stay yours
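
To make "deterministic seeds, pinned scene versions" concrete, here is one possible sketch: a run manifest whose id is a hash of its pinned inputs, so identical inputs always map to the same id. `RunManifest`, its fields, and the id scheme are assumptions; the `r-` prefix only mirrors the shape of the report id in the sample run, and how real ids are produced is not stated.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RunManifest:
    policy: str
    scene_version: str
    backend: str
    seed: int

    def run_id(self) -> str:
        """Stable id derived from the pinned inputs (hypothetical scheme)."""
        # Sorted keys and fixed separators give one canonical byte string
        # per manifest, independent of field order.
        canonical = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
        return "r-" + digest[:12]
```

Two manifests that differ only in seed get different ids, while rebuilding the same manifest reproduces the same id, which is what makes a run addressable and replayable.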


Shipping a robot on someone else's world model?