CRONOS: Benchmarking Multi-Task Robotic Manipulation
for Reset-Free Reinforcement Learning At Scale

Preprint · March 2026
Po-Yi Wu* Djengo Cyun-Jyun Fang*† Dian Cheng Tsung-Wei Ke
National Taiwan University

* Equal contribution  ·  † Corresponding author

CRONOS teaser: episodic vs reset-free multi-task RL

Top: Common episodic multi-task RL isolates each task into separate scenes with frequent resets. Bottom: CRONOS studies reset-free multi-task RL (RF-MTRL) where scenes are shared across tasks and agents operate under a fixed reset budget.


Abstract

Reinforcement learning (RL) is promising for adapting robot policies to unstructured real-world environments. However, standard RL pipelines rely on episodic training with frequent scene resets, which is impractical for real-world deployment due to the need for substantial human intervention. We introduce CRONOS (Continual Robotic Operations in NOn-episodic Settings), a simulation benchmark for studying reset-free multi-task RL under long-horizon interactions and constrained reset budgets.

To reflect realistic deployment, CRONOS leverages high-fidelity physics simulators, adopts shared-scene multi-task settings, targets the adaptation of state-of-the-art robot policies, and formalizes reset-free learning under a fixed reset budget. We show that naively fine-tuning pre-trained policies fails in reset-free settings; however, these challenges can be mitigated through intelligent reset allocation and by addressing biases in pre-trained models. Finally, we demonstrate that reset-free training enhances long-horizon manipulation and improves generalization to held-out object configurations and task sequences.


The Problem: Episodic Training Doesn't Scale

Standard RL pipelines assume episodic training: after each episode, the environment is reset to a canonical initial state. In simulation this is free, but in the real world it requires significant human effort — repositioning objects, resetting the robot, re-staging the scene. This bottleneck prevents RL from being deployed at scale in real-world robotic systems.

Standard Episodic Multi-Task RL

  • Each task isolated in its own scene
  • Frequent human resets required
  • Doesn't scale to real-world deployment

CRONOS: Reset-Free Multi-Task RL

  • Shared scene across all tasks
  • Fixed reset budget, minimal intervention
  • Robust to real-world training conditions

The CRONOS Benchmark

CRONOS is built upon SimplerEnv and its extension, which provides high-fidelity physics simulation demonstrated to transfer to real-world settings with limited performance degradation. The benchmark deploys an 8-DoF WidowX-250S robotic arm controlled via 6-DoF end-effector pose commands.

The key design choice: multiple objects share a single scene across all tasks. In the base configuration, CRONOS places two objects in a scene with two receptacles, yielding four pick-and-place tasks: "put the [object] on the [receptacle]." Objects and receptacles are randomized across different positions for training and evaluation.
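The base task suite is just the Cartesian product of objects and receptacles. A minimal sketch, assuming illustrative object and receptacle names (the actual CRONOS asset identifiers may differ):

```python
from itertools import product

# Hypothetical names for illustration; not the official asset identifiers.
objects = ["toy bear", "plastic bottle"]
receptacles = ["carpet", "newspaper"]

# Two objects x two receptacles = four pick-and-place tasks,
# all sharing a single scene.
tasks = [f"put the {obj} on the {rec}" for obj, rec in product(objects, receptacles)]
```

The same construction gives the nine-task 3×3 suite discussed later by extending the two lists.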

Put toy bear on carpet (0–79 steps)

Put toy bear on newspaper (80–159 steps)

Put plastic bottle on carpet (160–239 steps)

Put plastic bottle on newspaper (240–319 steps)

Each clip continues directly from the previous scene without resets, illustrating a shared-scene reset-free rollout.
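The rollout above can be sketched as a loop that chains tasks in one shared scene and spends the reset budget only at fixed intervals. This is a minimal sketch under assumed interfaces: `env`, `policy`, and the gym-style `step` return tuple are stand-ins, not the CRONOS API.

```python
def reset_free_rollout(env, policy, tasks, total_steps, reset_interval):
    """Run tasks back-to-back in a shared scene, resetting only every
    `reset_interval` steps (T in the paper's notation). Returns the
    number of scene resets consumed."""
    obs = env.reset()                      # one initial scene reset
    resets_used = 1
    task = tasks[0]
    for t in range(total_steps):
        if t > 0 and t % reset_interval == 0:
            obs = env.reset()              # consume one unit of reset budget
            resets_used += 1
        action = policy(obs, task)
        obs, reward, done, info = env.step(action)
        if done:                           # task finished: switch task, no reset
            task = tasks[(tasks.index(task) + 1) % len(tasks)]
    return resets_used
```

With `total_steps=1280` and `reset_interval=320`, the loop performs four scene resets in total, matching the T=320 setting reported in the results.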

Evaluation Dimensions

CRONOS evaluates performance along two orthogonal axes:

Interaction Efficiency

\[ \mathbb{D}(\mathbb{A}) = \sum_{t=0}^{T} \bigl(J(\pi^*) - J(\pi_t)\bigr) \]

Here \(\pi_t\) denotes the policy estimated by algorithm \(\mathbb{A}\) at interaction step \(t\) over a horizon of \(T\) steps, \(\pi^* = \arg\max_\pi J(\pi)\) denotes the optimal policy, and \(J(\pi)\) measures the expected return across all tasks.

Reset Efficiency

\[ \mathbb{C}(\mathbb{A}) = \sum_{k=0}^{K} \bigl(J(\pi^*) - J(\pi_k)\bigr) \]

Here \(\pi_k\) denotes the policy estimated by algorithm \(\mathbb{A}\) immediately after the \(k\)-th scene reset, with \(K\) as the maximum number of resets allowed. \(\pi^*\) and \(J(\pi)\) follow the same definitions as above.

Reset efficiency is critical for real-world deployment, as scene resets are often significantly more costly than individual environment interactions, requiring algorithms to make the most of each reset.
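Both metrics are cumulative regret sums and can be computed directly from logged returns. A minimal sketch, where \(J(\pi^*)\) is supplied as a scalar (in practice it would be approximated, e.g. by the best observed return); the function names are illustrative, not a CRONOS API:

```python
def interaction_regret(returns_per_step, j_star):
    """D(A): sum over interaction steps t of J(pi*) - J(pi_t)."""
    return float(sum(j_star - j for j in returns_per_step))

def reset_regret(returns_per_reset, j_star):
    """C(A): sum over scene resets k of J(pi*) - J(pi_k)."""
    return float(sum(j_star - j for j in returns_per_reset))
```

An algorithm with low `reset_regret` reaches high returns after few scene resets, even if it takes many interaction steps between them.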


Five Research Challenges

CRONOS is designed to facilitate systematic study across five open research challenges in RF-MTRL:

Learning Under Varying Reset Budgets

How efficiently do algorithms learn under different levels of human intervention? CRONOS provides a reset-efficiency metric to enable systematic analysis.

Algorithm Design for Automatic Reset

When should a reset be triggered (detecting unrecoverable states)? How should it be executed (returning to high-value states)?
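One possible instantiation of the "when to trigger" question is a value-threshold rule. This is a hypothetical sketch, not part of the CRONOS release: `value_fn`, the threshold `v_min`, and the stall window are all assumed names and values.

```python
def should_reset(value_fn, obs, reward_history, v_min=0.05, stall_steps=200):
    """Trigger a scene reset when the critic judges the current state
    unrecoverable (low estimated value), or when no reward has been
    collected for `stall_steps` consecutive steps."""
    if value_fn(obs) < v_min:
        return True                        # likely unrecoverable state
    if len(reward_history) >= stall_steps and max(reward_history[-stall_steps:]) <= 0.0:
        return True                        # progress stalled
    return False
```

The complementary "how to execute" question (returning to high-value states rather than a canonical one) is left to the reset controller and is one of the benchmark's open challenges.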

Biases in Pre-trained Policies

BC policies overfit to expert demonstration distributions. Under what conditions can RL effectively adapt these policies in reset-free, multi-task settings?

Impact of Learning Factors

Task ordering and curriculum learning substantially affect learning efficiency in reset-free settings, yet remain understudied in the RF-MTRL context.

Robustness to Distributional Shifts

CRONOS measures robustness along object-receptacle configurations and unseen task sequences, going beyond standard visual or semantic OOD settings.



Results

RF-MTRL Approaches on CRONOS

Success rate @ 1.3M interaction steps · 5 seeds · training object-receptacle configurations
EER = Episodic End-effector Reset  ·  CL = Curriculum Learning  ·  LSR = Learned Scene Reset  ·  HSR = Heuristic Scene Reset
T denotes the interaction steps between two consecutive scene resets.

T      EER   CL   LSR   HSR   Success Rate
Episodic baseline
80                            0.897 ± 0.027
Reset-free baselines
1280                          0.023 ± 0.030
1280                          0.326 ± 0.179
1280                          0.578 ± 0.054
1280                          0.580 ± 0.036
1280                          0.649 ± 0.090
1280                          0.631 ± 0.065

Learning Efficiency Under Varying Reset Budgets


Left: Success rate vs. interaction steps. Right: Success rate vs. number of scene resets. Non-episodic baselines achieve superior reset efficiency — reaching competitive success rates with significantly fewer resets compared to the episodic baseline.

Robustness to Held-out Conditions

OOD Object Configurations

Figure: success rates on in-distribution (training) vs. held-out (OOD) object configurations.

When trained on a single configuration, the non-episodic method (T=320) shows substantially stronger OOD generalization than the episodic baseline.

OOD Task Sequences

Figure: success rates on in-distribution (training) vs. held-out (OOD) task sequences.

9 out of 10 evaluation sequences are unseen during training. The reset-free baseline not only handles sequential manipulation better but also generalizes to novel orderings.

Scalability to Complex Task Suites (3 Objects × 3 Receptacles)

We validate scalability on a more complex scene with three objects and three receptacles, yielding nine pick-and-place tasks. The non-episodic baseline (T=320) maintains competitive performance on interaction steps while demonstrating significantly superior reset efficiency, confirming that CRONOS baselines scale effectively to more complex scenarios.

3×3 scalability: success rate vs. interaction steps
3×3 scalability: success rate vs. number of scene resets

BibTeX

If you find this work useful, please cite:
@article{wu2026cronos,
  title     = {{CRONOS}: Benchmarking Multi-Task Robotic Manipulation
               for Reset-Free Reinforcement Learning At Scale},
  author    = {Wu, Po-Yi and Fang, Djengo Cyun-Jyun and
               Cheng, Dian and Ke, Tsung-Wei},
  journal   = {arXiv preprint},
  year      = {2026}
}

Acknowledgment

This research was supported by the National Science and Technology Council (NSTC), Taiwan.