Better Robot Data, Measured: the Euler Readiness and Curation Benchmark

Euler Robotics. Technical Whitepaper, v2, June 2026.

Summary. Euler is a data-readiness compiler. It turns raw, messy, multimodal robot logs into training-ready datasets and attaches per-episode evidence to every decision. This paper measures Euler's value the honest way: directly, against ground truth, with no policy training in the loop. On real teleoperation video with a controlled fault layer, Euler's readiness score separates injected data-integrity faults from clean episodes with an AUC of 0.90 across all fault families, and its review gate routes every timeline and sensor-dropout fault to a human (38 of 38) while auto-accepting 96% of clean episodes. The timeline-integrity faults specifically, dropped frames and clock desync, are the cleanest signal and separate perfectly (AUC 1.00). Its zero-effort annotation agrees with human success labels 89% of the time, its curation engine separates four robot domains by an 11x margin with perfect nearest-neighbor purity, and its hybrid search returns the right episode first 87% of the time. Every number below is computed from the real product pipeline on real data.


1. The problem: more robot data is not better robot data

The largest open robot-manipulation corpus, DROID (76,000 demonstrations across 13 institutions), was down-weighted when training OpenVLA because including it naively hurt policy performance. The lesson generalizes: demonstration quality, not quantity, is the binding constraint on imitation learning. A corpus with dropped frames, desynchronized clocks, failed demonstrations, heavy near-duplication, and missing labels does not just waste training compute; it actively degrades the policy.

So robotics teams pay engineers to write one-off parsers and review episodes by hand for weeks, and still ship corpora that quietly poison their models. Euler exists to close that gap: a data-readiness compiler that turns raw multimodal robot logs into training-ready datasets with per-episode evidence, readiness scores, annotations, and an auditable curation trail.

2. How Euler works

Euler runs every episode through two complementary analysis passes, and every number it produces is computed from the customer's real data.

Semantic pass. A hosted vision-language model annotates each episode with a caption, an object inventory, and a task-success verdict. The caption embeddings power semantic retrieval.

Visual pass. Open vision models compute per-episode visual embeddings from sampled frames on serverless GPU, plus segmentation. The visual pass never replaces the semantic pass; it grounds and audits it. Visually near-identical episodes are surfaced even when their captions disagree.
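
The audit idea can be sketched in a few lines: flag pairs of episodes whose visual embeddings are nearly identical while their caption embeddings diverge. This is an illustrative sketch only. The function name `audit_pairs`, the two thresholds, and the toy vectors are assumptions made for the example, not Euler's shipped logic or values.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def audit_pairs(visual, caption, visual_min=0.95, caption_max=0.50):
    """Return episode pairs that look alike but are described differently.

    Both thresholds are hypothetical placeholders for illustration.
    """
    ids = sorted(visual)
    flagged = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            looks_alike = cosine(visual[a], visual[b]) >= visual_min
            captions_differ = cosine(caption[a], caption[b]) <= caption_max
            if looks_alike and captions_differ:
                flagged.append((a, b))
    return flagged

# Toy embeddings: ep1 and ep2 look the same, but their captions disagree.
visual = {"ep1": [1.0, 0.0, 0.1], "ep2": [0.99, 0.02, 0.1], "ep3": [0.0, 1.0, 0.0]}
caption = {"ep1": [1.0, 0.0], "ep2": [0.0, 1.0], "ep3": [0.0, 1.0]}
flagged = audit_pairs(visual, caption)
print(flagged)
```

In this toy case only the ep1/ep2 pair is flagged, which is the kind of disagreement a reviewer would want to see.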

On top of the two passes sit three product surfaces, each of which this paper measures directly:

The benchmark below evaluates the exact pipeline the product ships, ingested through the same upload-and-run path a customer uses, scored by the same readiness components, exported by the same compiler.

3. The measurement protocol, and why it is the honest one

Most data-curation papers prove their worth indirectly, by training a policy on the curated subset and reporting closed-loop task success. That number is a confounded proxy. It folds in the policy architecture, the training budget, and, for manipulation, a simulator, all of which have nothing to do with the data tool. A weak result can hide a strong curator, and a strong result can flatter a weak one.

Euler is measured directly. Every capability maps to a metric computed against ground truth, with no policy in the loop:

| Capability | Direct metric | Ground truth |
|---|---|---|
| Readiness | AUC separating faulty from clean episodes; gate catch-rate | injected faults with a held-out answer key |
| Annotation | agreement with the human success label; recall of human-label terms | the dataset's human language labels |
| Curation | domain separation and nearest-neighbor purity; redundancy removal | known domain membership and injected duplicates |
| Search | recall and rank of the relevant episode | relevance derived from human labels |

The substrate. We use DROID-100, 100 real Franka teleoperation episodes with three camera views and human language instructions. Real teleoperation video is the authentic setting, but raw real data carries no fault answer key, so readiness could not be scored against truth. We therefore inject a controlled fault layer into a known fraction of episodes and hold out the answer key: the real data provides realism, the injection provides a checkable ground truth. This is the pragmatic, honest middle ground between synthetic data (no realism) and a raw dump (no labels).

The injected faults. Four families, each targeting a distinct readiness component and each modeling a real teleoperation failure: dropped frames, clock desync, camera dropout, and gripper flatline.

We also inject exact duplicate episodes, which are not a readiness fault (a duplicate is fully ready) but are the target of the curation engine.
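
A minimal sketch of the injection idea, assuming each episode is reduced to two timestamp streams. The field names `camera_ts` and `state_ts`, the fault sizes, and the function names are hypothetical and chosen for illustration; the shipped injector is `benchmarks/droid_corruption_dump.py`, whose internals are not reproduced here. The sketch covers only the two timeline faults and shows how the answer key is produced alongside the corrupted data.

```python
import random

def drop_frames(ep, start, count):
    """Remove `count` consecutive camera frames beginning at index `start`."""
    out = dict(ep)
    out["camera_ts"] = ep["camera_ts"][:start] + ep["camera_ts"][start + count:]
    return out

def desync_clock(ep, offset_s):
    """Shift the camera clock by a constant offset relative to robot state."""
    out = dict(ep)
    out["camera_ts"] = [t + offset_s for t in ep["camera_ts"]]
    return out

def build_corrupt_corpus(episodes, fault_fraction, seed=0):
    """Corrupt a known fraction of episodes and return the held-out answer key."""
    rng = random.Random(seed)
    ids = sorted(episodes)
    faulty = set(rng.sample(ids, round(len(ids) * fault_fraction)))
    corrupted, answer_key = {}, {}
    for ep_id in ids:
        ep = episodes[ep_id]
        if ep_id not in faulty:
            corrupted[ep_id], answer_key[ep_id] = dict(ep), "clean"
            continue
        kind = rng.choice(["dropped_frames", "clock_desync"])
        if kind == "dropped_frames":
            start = rng.randrange(1, len(ep["camera_ts"]) - 10)
            corrupted[ep_id] = drop_frames(ep, start, count=10)
        else:
            corrupted[ep_id] = desync_clock(ep, offset_s=0.25)
        answer_key[ep_id] = kind
    return corrupted, answer_key

# Ten toy episodes of 150 frames at 15 Hz; 30% receive a fault.
stream = [i / 15 for i in range(150)]
episodes = {f"ep{i:02d}": {"camera_ts": list(stream), "state_ts": list(stream)}
            for i in range(10)}
corrupted, answer_key = build_corrupt_corpus(episodes, fault_fraction=0.3)
print(answer_key)
```

The scorer never sees `answer_key`; it is used only afterwards to check whether the readiness score recovered the injected faults.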

4. Results

4.1 Readiness: catching faults before they reach training

We ran the fault-injected corpus through the live platform end-to-end. Euler scored readiness blind, with no access to the answer key. The score then has to recover the injected faults.

| Episode class | Mean readiness | Routed to review |
|---|---|---|
| Clean | 0.94 | 2 / 52 |
| Dropped frames | 0.65 | 14 / 14 |
| Clock desync | 0.69 | 12 / 12 |
| Camera dropout | 0.91 | 12 / 12 |
| Gripper flatline | 0.85 | 1 / 10 |
| Duplicate (not a fault) | 0.94 | 1 / 15 |

Across all four fault families, the readiness score separates faulty from clean episodes with an AUC of 0.90. The review gate, which adds hard flags on top of the score, is the operational number: it routes every dropped-frame, desync, and camera-dropout episode to a human (38 of 38) while flagging only 2 of 52 clean episodes, a clean-corpus auto-accept rate of 96%.
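
The gate figures follow directly from the routing counts reported for each episode class. The sketch below recomputes them and also shows one plausible shape for a gate that adds hard flags on top of a score threshold. The function `route_to_review`, the flag name, and the threshold value used in the example are assumptions for illustration; the paper does not state the shipped threshold.

```python
def route_to_review(score, hard_flags, threshold):
    """Route an episode to a human if any hard flag fires or the score is low.

    `threshold` is deliberately a required argument: the real value is not
    given in this paper.
    """
    return bool(hard_flags) or score < threshold

# (routed, total) per episode class, copied from the readiness results.
routed = {
    "clean": (2, 52),
    "dropped_frames": (14, 14),
    "clock_desync": (12, 12),
    "camera_dropout": (12, 12),
    "gripper_flatline": (1, 10),
    "duplicate": (1, 15),
}

hard_faults = ["dropped_frames", "clock_desync", "camera_dropout"]
caught = sum(routed[k][0] for k in hard_faults)
total = sum(routed[k][1] for k in hard_faults)
auto_accept = 1 - routed["clean"][0] / routed["clean"][1]

print(f"hard faults routed: {caught} of {total}")
print(f"clean auto-accept rate: {auto_accept:.0%}")

# A camera dropout with a healthy score is still escalated by its hard flag.
print(route_to_review(0.91, ["camera_stream_missing"], threshold=0.80))
```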

The two timeline-integrity faults are the cleanest signal. Dropped frames and clock desync each separate from clean perfectly (AUC 1.00), because an injected timeline gap or clock drift pushes readiness below the entire clean range. This is expected: timeline continuity is exactly what the score is built to measure. Camera dropout is milder on the score alone (AUC 0.73, mean readiness 0.91) but is caught by the gate, which escalates any missing camera stream. The gripper flatline is the fault the gate catches least often: it lowers the score (mean readiness 0.85, AUC 0.83) but usually not far enough to cross the review threshold, so only 1 of 10 is routed. That is the honest reading of a single degraded channel in an otherwise healthy episode.
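
For readers who want the AUC definition made concrete, here is a small sketch using the pairwise (Mann-Whitney) form: the probability that a randomly chosen clean episode scores higher than a randomly chosen faulty one, with ties counted as half. The scores are invented toy values, not the benchmark's data, and `separation_auc` is a name chosen for this example.

```python
def separation_auc(clean_scores, faulty_scores):
    """AUC for 'clean scores higher than faulty', ties counted as 0.5."""
    wins = 0.0
    for c in clean_scores:
        for f in faulty_scores:
            if c > f:
                wins += 1.0
            elif c == f:
                wins += 0.5
    return wins / (len(clean_scores) * len(faulty_scores))

clean = [0.97, 0.95, 0.93, 0.90]   # toy readiness scores
timeline = [0.70, 0.66, 0.61]      # every fault sits below the clean range
mild = [0.96, 0.92, 0.88]          # overlaps the clean range

print(separation_auc(clean, timeline))
print(separation_auc(clean, mild))
```

When every faulty score falls below the lowest clean score, as in the `timeline` list, the AUC is exactly 1.0. Overlap with the clean range, as in `mild`, pulls it toward 0.5.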

Duplicates behave exactly as they should: they keep a high readiness score (0.94, indistinguishable from clean) because a duplicate is perfectly ready. Removing redundancy is the curation engine's job, not the readiness gate's (Section 4.3).

This behavior holds on un-injected real data too. On the unmodified DROID-100 corpus, Euler auto-accepts 97 of 100 episodes and routes 3 to review, with a mean readiness of 0.95 on the accepted set versus 0.77 on the routed set. The human-touch rate is 3%: a reviewer looks at three episodes out of a hundred, and they are the right three.

4.2 Annotation: usable labels at zero labeling cost

DROID ships one human language instruction per episode. Euler's semantic pass annotates each episode blind with a caption, an object inventory, and a success verdict. We score the agreement.

| Measure | Value |
|---|---|
| Success-verdict agreement with the human label | 89% |
| Human-label terms recovered (exact-token match) | ~52% |
| Episodes recovering at least one key object term | 46% |

The success-verdict agreement is the robust number: on a dataset of mostly successful demonstrations, Euler's verdict matches the ground truth 89 times in 100. The term-recall figure is a deliberately conservative lower bound, exact lexical overlap after removing function words, so synonyms and paraphrases count as misses. Representative case: the human wrote "Put the marker in the pot" and Euler wrote "successfully picks up the marker and places it inside the pot," recovering both key objects. We do not report a precision figure because Euler's descriptions are intentionally richer than the one-line human instruction; penalizing Euler for describing the scene more fully than the operator did would be misleading. The point stands: Euler recovers the majority of the human-label content with zero labeling effort.
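
The term-recall measure can be illustrated on the representative case above. The sketch assumes a small hand-written function-word list and simple lowercase tokenization; both are placeholders, since the paper does not specify the exact list or tokenizer used.

```python
import re

# Hypothetical function-word list for illustration only.
FUNCTION_WORDS = {"a", "an", "the", "in", "on", "of", "to", "and", "it",
                  "up", "into", "with", "at", "from"}

def content_terms(text):
    """Lowercase tokens with function words removed."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in FUNCTION_WORDS}

def term_recall(human, generated):
    """Fraction of human-label content terms that appear verbatim in the caption."""
    human_terms = content_terms(human)
    if not human_terms:
        return 0.0
    return len(human_terms & content_terms(generated)) / len(human_terms)

human = "Put the marker in the pot"
generated = "successfully picks up the marker and places it inside the pot"

print(sorted(content_terms(human) & content_terms(generated)))
print(term_recall(human, generated))
```

Under these assumptions both objects ("marker", "pot") are recovered, while "put" counts as a miss because the caption paraphrases it as "places". That is the sense in which exact-token recall is a conservative lower bound.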

4.3 Curation: diversity and de-duplication, generalized across robots

Curation has two jobs: drop redundancy, and keep a diverse, representative slice.

Removing the known-bad and the redundant. Curation operates in two stages, and we measure each against the answer key. First, the readiness gate removes the hard integrity faults before training: of the 38 dropped-frame, desync, and camera-dropout episodes, all 38 are routed out of the auto-accepted set (Section 4.1). Second, the diversity selector removes redundancy from what remains. We seeded the corpus with 15 exact-duplicate episodes; 14 cleared the gate into the curation pool. At a fixed budget, the coverage selector drops them all, while random selection keeps most:

| Selection (budget 40, pool of 73) | Exact duplicates dropped | Coverage of kept set |
|---|---|---|
| Random | ~6 / 14 (45%) | n/a |
| Euler coverage curation | 14 / 14 (100%) | 0.95 |
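
The random baseline can be checked by hand. Assuming random selection means drawing 40 of the 73 pooled episodes uniformly without replacement, each episode is left out with probability 33/73, so the expected number of duplicates dropped is 14 times that.

```python
from fractions import Fraction

pool, budget, duplicates = 73, 40, 14

# Probability that any single episode is NOT among the `budget` kept episodes.
drop_prob = Fraction(pool - budget, pool)
expected_dropped = duplicates * drop_prob

print(f"drop probability: {float(drop_prob):.3f}")
print(f"expected duplicates dropped: {float(expected_dropped):.2f} of {duplicates}")
```

This gives roughly 45% and a little over 6 duplicates, consistent with the Random row.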

The selector drops every duplicate because an exact copy adds zero new visual coverage, and it does so while achieving higher coverage of the kept set than random, since it spends no budget on redundancy. When readiness utility is blended into the selection, de-duplication is gentler by design: a duplicate is fully ready, so its high utility partially offsets its zero coverage gain. The knob is explicit, and the coverage-weighted end of it removes redundancy completely.
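
Why an exact copy is never worth budget under a coverage objective can be seen in a small greedy facility-location sketch. This is a generic textbook formulation written for illustration, not the shipped selector: the cosine similarity, the `utility_weight` blend, and all names are assumptions.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def greedy_select(embeddings, budget, utility=None, utility_weight=0.0):
    """Greedy facility location: pick the item that most improves coverage.

    Coverage of item i is its best similarity to anything already selected.
    `utility_weight` (hypothetical knob) blends in a per-item readiness utility.
    """
    n = len(embeddings)
    sim = [[cosine(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]
    utility = utility or [0.0] * n
    best = [0.0] * n  # current coverage of every item
    selected = []
    for _ in range(min(budget, n)):
        top, top_score = None, -math.inf
        for j in range(n):
            if j in selected:
                continue
            gain = sum(max(sim[i][j] - best[i], 0.0) for i in range(n))
            score = (1 - utility_weight) * gain + utility_weight * utility[j]
            if score > top_score:
                top, top_score = j, score
        selected.append(top)
        best = [max(best[i], sim[i][top]) for i in range(n)]
    return selected

# Five distinct toy episodes plus exact copies of the first three.
unique = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0], [0, 1, 1]]
pool = unique + [list(v) for v in unique[:3]]
kept = greedy_select(pool, budget=5)
print(sorted(kept))
```

Once an episode is selected, its exact copy has a marginal coverage gain of zero, so any episode that still adds coverage outranks it. With a budget of five and five distinct episodes, the sketch keeps one of each and no copies. Raising `utility_weight` lets a high-utility duplicate compete again, which mirrors the gentler de-duplication described above.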

Domain generalization. To test whether the curation engine is overfit to one robot or task, we run the exact shipped feature extractor and facility-location selector on four structurally different real robot datasets: planar pushing, bimanual insertion, single-arm lifting, and mobile-base door opening. With no per-domain tuning:

| Measure | Value | Reads as |
|---|---|---|
| Cross-domain separation ratio | 11.16 | inter-domain distance is 11x intra-domain |
| Nearest-domain purity | 1.00 | every episode's nearest neighbor is in its own domain |
| Domain coverage of the curated set | 4 / 4 | the selector spreads its budget across all four |

The visual features separate the four domains by an 11x margin with perfect nearest-neighbor purity, and the shipped selector spends its budget across every domain, weighting toward the most visually varied. This is the curation mechanism behaving as designed on data it was never tuned for.
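
One way to read the two headline measures is sketched below, assuming the separation ratio is the mean inter-domain distance divided by the mean intra-domain distance, and purity is the share of episodes whose nearest neighbor belongs to the same domain. The Euclidean distance and the toy two-dimensional points are assumptions made for the example; the real features are the shipped visual embeddings.

```python
import math
from itertools import combinations

def separation_and_purity(points, domains):
    """Return (inter/intra mean-distance ratio, nearest-neighbor purity)."""
    intra, inter = [], []
    for i, j in combinations(range(len(points)), 2):
        d = math.dist(points[i], points[j])
        (intra if domains[i] == domains[j] else inter).append(d)
    ratio = (sum(inter) / len(inter)) / (sum(intra) / len(intra))

    pure = 0
    for i in range(len(points)):
        others = (j for j in range(len(points)) if j != i)
        nearest = min(others, key=lambda j: math.dist(points[i], points[j]))
        pure += domains[nearest] == domains[i]
    return ratio, pure / len(points)

# Two tight toy clusters standing in for two robot domains.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
domains = ["push", "push", "push", "lift", "lift", "lift"]
ratio, purity = separation_and_purity(points, domains)
print(f"separation ratio: {ratio:.2f}, purity: {purity:.2f}")
```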

4.4 Search: find the right episode first

Euler indexes every episode in three retrieval lanes: text (caption ranking), visual (embedding expansion of the top text hits), and hybrid (fused). We benchmark the live search endpoint over DROID-100's real episodes, with ground-truth relevance derived from the human labels (an episode is relevant to a query object if its human instruction names it).

| Lane | recall@10 | MRR | hit@1 |
|---|---|---|---|
| Text | 1.00 | 0.85 | 0.80 |
| Visual | 0.92 | 0.84 | 0.80 |
| Hybrid | 0.95 | 0.89 | 0.87 |

Every lane retrieves well. Text leads on recall because Euler's captions name the objects, so object queries land. Hybrid gives the best ranking: it puts the right episode at rank one 87% of the time. The visual lane is the instrument for the harder case, episodes a caption mis-describes but that look right, so its value shows in audit and in queries where text fails.
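
The three retrieval metrics can be computed from ranked result lists as follows. This sketch assumes recall@k is the fraction of a query's relevant episodes that appear in the top k, which is one common definition; the paper does not spell out its exact variant. The queries, episode ids, and function name are toy values for illustration.

```python
def retrieval_metrics(rankings, relevant, k=10):
    """Mean recall@k, mean reciprocal rank, and hit@1 over all queries."""
    recalls, reciprocal_ranks, hits = [], [], []
    for query, ranked in rankings.items():
        rel = relevant[query]
        recalls.append(len(rel & set(ranked[:k])) / len(rel))
        # 1-based rank of the first relevant episode, or None if absent.
        rank = next((r for r, ep in enumerate(ranked, start=1) if ep in rel), None)
        reciprocal_ranks.append(1 / rank if rank else 0.0)
        hits.append(1.0 if rank == 1 else 0.0)
    n = len(rankings)
    return {
        "recall@k": sum(recalls) / n,
        "mrr": sum(reciprocal_ranks) / n,
        "hit@1": sum(hits) / n,
    }

rankings = {
    "marker": ["ep3", "ep7", "ep1"],   # relevant episode ranked first
    "pot": ["ep2", "ep3", "ep9"],      # first relevant episode at rank 2
}
relevant = {"marker": {"ep3"}, "pot": {"ep3", "ep9"}}
metrics = retrieval_metrics(rankings, relevant)
print(metrics)
```

In the toy case both queries recover all relevant episodes, so recall is 1.0, while MRR (0.75) and hit@1 (0.5) reflect that only one query puts a relevant episode at rank one.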

5. A fleet example, end to end

Consider a humanoid or manipulation startup running a teleoperation fleet.

The data. 50,000 raw episodes: multi-camera video plus joint state and actions, sometimes a typed task, collected by dozens of operators over months. Roughly 15% carry integrity faults (dropped frames, desync, truncation), roughly 20% are failed or sloppy demonstrations, there are large clusters of near-duplicates, and success and language labels are inconsistent or missing.

The goal. Train a policy. Today the team pays engineers to write parsers and review episodes by hand for weeks, and still ships corpora that hurt the model.

How Euler fits, and the value at each step, all measurable without training:

| Stage | What Euler does | Value |
|---|---|---|
| Ingest | normalize every format to canonical episodes | no one-off parsers |
| Readiness | flag the faulty episodes, route the few ambiguous ones to a human | thousands of reviewer-hours collapse to tens; no poison data |
| Annotation | caption, objects, and success verdict, blind | usable labels with zero labeling effort |
| Curation | drop near-duplicates, keep a diverse slice | train on a fraction of the data at equal quality |
| Search | "find failed red-mug grasps" | instant slicing and debugging |
| Export | training-ready dataset plus evidence receipts | weeks-to-training becomes hours |

The return. Three quantities, all computed from real pipeline activity that the product already reports: reviewer-hours saved (episode count times per-review minutes), compute saved (train on a curated fraction at equal quality), and time-to-training (weeks to hours). On a 50,000-episode fleet, the readiness gate alone, holding the human-touch rate near the 3% measured above, turns an estimated 1,700 reviewer-hours of manual triage into roughly 50.
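
The reviewer-hours estimate can be reproduced with simple arithmetic. The per-review time is not stated in the text; the sketch assumes about two minutes per episode, which is the value implied by the 1,700-hour and 50-hour figures.

```python
episodes = 50_000
touch_rate = 0.03          # human-touch rate measured on unmodified DROID-100
minutes_per_review = 2.0   # ASSUMPTION: implied by the figures, not stated

manual_hours = episodes * minutes_per_review / 60
gated_hours = episodes * touch_rate * minutes_per_review / 60

print(f"manual triage: {manual_hours:,.0f} reviewer-hours")
print(f"with readiness gate: {gated_hours:,.0f} reviewer-hours")
```

Under that assumption, manual triage comes to about 1,670 hours and gated review to about 50, in line with the rounded estimate above. A different per-review time scales both numbers equally and leaves the ratio unchanged.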

6. Honest limits

7. Reproduction

Every result is reproducible through the same upload-and-run path a customer uses, with no bespoke converters. Upload a dataset, run the pipeline, then evaluate against the held-out ground truth:

# Readiness + curation (fault-injected DROID-100):
python benchmarks/droid_corruption_dump.py   --clean <clean-dump> --out <corrupt-dump>
#   upload <corrupt-dump>, run the pipeline, then:
python benchmarks/droid_corruption_score.py  --base <api> --project <p> \
       --labels <corrupt-dump>/corruption_labels.json --budget 50

# Annotation + search (DROID-100, against the live endpoints):
python benchmarks/search_eval.py             --base <api> --project droid100 --token <tok>

# Cross-domain curation generalization:
modal run benchmarks/crossdomain.py

Features, ingest path, and selection rule are pinned. Every per-run result is a JSON artifact, and the tables above are generated from those artifacts.