CRONOS: Benchmarking Multi-Task Robotic Manipulation
for Reset-Free Reinforcement Learning At Scale

Preprint · March 2026
Po-Yi Wu* Djengo Cyun-Jyun Fang*† Dian Cheng Tsung-Wei Ke
National Taiwan University

* Equal contribution  ·  † Corresponding author

CRONOS teaser: episodic vs reset-free multi-task RL

Top: Common episodic multi-task RL isolates each task into separate scenes with frequent resets. Bottom: CRONOS studies reset-free multi-task RL (RF-MTRL) where scenes are shared across tasks and agents operate under a fixed reset budget.


Abstract

Reinforcement learning (RL) is promising for adapting robot policies to unstructured real-world environments. However, standard RL pipelines rely on episodic training with frequent scene resets, which is impractical for real-world deployment due to the need for substantial human intervention. We introduce CRONOS (Continual Robotic Operations in NOn-episodic Settings), a simulation benchmark for studying reset-free multi-task RL under long-horizon interactions and constrained reset budgets.

To reflect realistic deployment, CRONOS leverages high-fidelity physics simulators, adopts shared-scene multi-task settings, targets the adaptation of state-of-the-art robot policies, and formalizes reset-free learning under a fixed reset budget. We show that naively fine-tuning pre-trained policies fails in reset-free settings; however, these challenges can be mitigated through intelligent reset allocation and by addressing biases in pre-trained models. Finally, we demonstrate that reset-free training enhances long-horizon manipulation and improves generalization to held-out object configurations and task sequences.


The Problem: Episodic Training Doesn't Scale

Standard RL pipelines assume episodic training: after each episode, the environment is reset to a canonical initial state. In simulation this is free, but in the real world it requires significant human effort — repositioning objects, resetting the robot, re-staging the scene. This bottleneck prevents RL from being deployed at scale in real-world robotic systems.

Standard Episodic Multi-Task RL

  • Each task isolated in its own scene
  • Frequent human resets required
  • Doesn't scale to real-world deployment

CRONOS: Reset-Free Multi-Task RL

  • Shared scene across all tasks
  • Fixed reset budget, minimal intervention
  • Robust to real-world training conditions

The CRONOS Benchmark

CRONOS is built upon SimplerEnv and its extension, which provides high-fidelity physics simulation demonstrated to transfer to real-world settings with limited performance degradation. The benchmark deploys an 8-DoF WidowX-250S robotic arm controlled via 6-DoF end-effector pose commands.

The key design choice: multiple objects share a single scene across all tasks. In the base configuration, CRONOS places two objects in a scene with two receptacles, yielding four pick-and-place tasks: "put the [object] on the [receptacle]." Objects and receptacles are randomized across different positions for training and evaluation.
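The base task suite is just the Cartesian product of objects and receptacles. A minimal sketch, assuming illustrative object and receptacle names (the actual CRONOS asset identifiers may differ):

```python
from itertools import product

# Hypothetical names for illustration; not the official asset identifiers.
objects = ["toy bear", "plastic bottle"]
receptacles = ["carpet", "newspaper"]

# Two objects x two receptacles = four pick-and-place tasks,
# all sharing a single scene.
tasks = [f"put the {obj} on the {rec}" for obj, rec in product(objects, receptacles)]
```

The same construction gives the nine-task 3×3 suite discussed later by extending the two lists.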

Put toy bear on carpet (0–79 steps)

Put toy bear on newspaper (80–159 steps)

Put plastic bottle on carpet (160–239 steps)

Put plastic bottle on newspaper (240–319 steps)

Each clip continues directly from the previous scene without resets, illustrating a shared-scene reset-free rollout.
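The rollout above can be sketched as a loop that chains tasks in one shared scene and spends the reset budget only at fixed intervals. This is a minimal sketch under assumed interfaces: `env`, `policy`, and the gym-style `step` return tuple are stand-ins, not the CRONOS API.

```python
def reset_free_rollout(env, policy, tasks, total_steps, reset_interval):
    """Run tasks back-to-back in a shared scene, resetting only every
    `reset_interval` steps (T in the paper's notation). Returns the
    number of scene resets consumed."""
    obs = env.reset()                      # one initial scene reset
    resets_used = 1
    task = tasks[0]
    for t in range(total_steps):
        if t > 0 and t % reset_interval == 0:
            obs = env.reset()              # consume one unit of reset budget
            resets_used += 1
        action = policy(obs, task)
        obs, reward, done, info = env.step(action)
        if done:                           # task finished: switch task, no reset
            task = tasks[(tasks.index(task) + 1) % len(tasks)]
    return resets_used
```

With `total_steps=1280` and `reset_interval=320`, the loop performs four scene resets in total, matching the T=320 setting reported in the results.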

Evaluation Dimensions

CRONOS evaluates performance along two orthogonal axes:

Interaction Efficiency

\[ \mathbb{D}(\mathbb{A}) = \sum_{t=0}^{T} \bigl(J(\pi^*) - J(\pi_t)\bigr) \]

Here \(\pi_t\) denotes the policy estimated by algorithm \(\mathbb{A}\) at interaction step \(t\) over a horizon of \(T\) steps, \(\pi^* = \arg\max_\pi J(\pi)\) denotes the optimal policy, and \(J(\pi)\) measures the expected return across all tasks.

Reset Efficiency

\[ \mathbb{C}(\mathbb{A}) = \sum_{k=0}^{K} \bigl(J(\pi^*) - J(\pi_k)\bigr) \]

Here \(\pi_k\) denotes the policy estimated by algorithm \(\mathbb{A}\) immediately after the \(k\)-th scene reset, with \(K\) as the maximum number of resets allowed. \(\pi^*\) and \(J(\pi)\) follow the same definitions as above.

Reset efficiency is critical for real-world deployment, as scene resets are often significantly more costly than individual environment interactions, requiring algorithms to make the most of each reset.
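Both metrics are cumulative regret sums and can be computed directly from logged returns. A minimal sketch, where \(J(\pi^*)\) is supplied as a scalar (in practice it would be approximated, e.g. by the best observed return); the function names are illustrative, not a CRONOS API:

```python
def interaction_regret(returns_per_step, j_star):
    """D(A): sum over interaction steps t of J(pi*) - J(pi_t)."""
    return float(sum(j_star - j for j in returns_per_step))

def reset_regret(returns_per_reset, j_star):
    """C(A): sum over scene resets k of J(pi*) - J(pi_k)."""
    return float(sum(j_star - j for j in returns_per_reset))
```

An algorithm with low `reset_regret` reaches high returns after few scene resets, even if it takes many interaction steps between them.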


Five Research Challenges

CRONOS is designed to facilitate systematic study across five open research challenges in RF-MTRL:

Learning Under Varying Reset Budgets

How efficiently do algorithms learn under different levels of human intervention? CRONOS provides a reset-efficiency metric to enable systematic analysis.

Algorithm Design for Automatic Reset

When should a reset be triggered (detecting unrecoverable states)? How should it be executed (returning to high-value states)?
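One possible instantiation of the "when to trigger" question is a value-threshold rule. This is a hypothetical sketch, not part of the CRONOS release: `value_fn`, the threshold `v_min`, and the stall window are all assumed names and values.

```python
def should_reset(value_fn, obs, reward_history, v_min=0.05, stall_steps=200):
    """Trigger a scene reset when the critic judges the current state
    unrecoverable (low estimated value), or when no reward has been
    collected for `stall_steps` consecutive steps."""
    if value_fn(obs) < v_min:
        return True                        # likely unrecoverable state
    if len(reward_history) >= stall_steps and max(reward_history[-stall_steps:]) <= 0.0:
        return True                        # progress stalled
    return False
```

The complementary "how to execute" question (returning to high-value states rather than a canonical one) is left to the reset controller and is one of the benchmark's open challenges.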

Biases in Pre-trained Policies

BC policies overfit to expert demonstration distributions. Under what conditions can RL effectively adapt these policies in reset-free, multi-task settings?

Impact of Learning Factors

Task ordering and curriculum learning substantially affect learning efficiency in reset-free settings, yet remain understudied in the RF-MTRL context.

Robustness to Distributional Shifts

CRONOS measures robustness along object-receptacle configurations and unseen task sequences, going beyond standard visual or semantic OOD settings.



Results

RF-MTRL Approaches on CRONOS

Success rate @ 1.3M interaction steps · 5 seeds · training object-receptacle configurations
EER = Episodic End-effector Reset  ·  CL = Curriculum Learning  ·  LSR = Learned Scene Reset  ·  HSR = Heuristic Scene Reset
T denotes the interaction steps between two consecutive scene resets.

T      EER   CL   LSR   HSR   Success Rate
Episodic baseline
80                            0.897 ± 0.027
Reset-free baselines
1280                          0.023 ± 0.030
1280                          0.326 ± 0.179
1280                          0.578 ± 0.054
1280                          0.580 ± 0.036
1280                          0.649 ± 0.090
1280                          0.631 ± 0.065

Learning Efficiency Under Varying Reset Budgets


Left: Success rate vs. interaction steps. Right: Success rate vs. number of scene resets. Non-episodic baselines achieve superior reset efficiency — reaching competitive success rates with significantly fewer resets compared to the episodic baseline.

Robustness to Held-out Conditions

OOD Object Configurations

Figure: success rates on in-distribution (training) vs. held-out (OOD) object configurations.

When trained on a single configuration, the non-episodic method (T=320) shows substantially stronger OOD generalization than the episodic baseline.

OOD Task Sequences

Figure: success rates on in-distribution (training) vs. held-out (OOD) task sequences.

9 out of 10 evaluation sequences are unseen during training. The reset-free baseline not only handles sequential manipulation better but also generalizes to novel orderings.

Scalability to Complex Task Suites (3 Objects × 3 Receptacles)

We validate scalability on a more complex scene with three objects and three receptacles, yielding nine pick-and-place tasks. The non-episodic baseline (T=320) maintains competitive performance on interaction steps while demonstrating significantly superior reset efficiency, confirming that CRONOS baselines scale effectively to more complex scenarios.

3×3 scalability: success rate vs. interaction steps
3×3 scalability: success rate vs. number of scene resets

BibTeX

If you find this work useful, please cite:
@article{wu2026cronos,
  title     = {{CRONOS}: Benchmarking Multi-Task Robotic Manipulation
               for Reset-Free Reinforcement Learning At Scale},
  author    = {Wu, Po-Yi and Fang, Djengo Cyun-Jyun and
               Cheng, Dian and Ke, Tsung-Wei},
  journal   = {arXiv preprint},
  year      = {2026}
}

Acknowledgment

This research was supported by the National Science and Technology Council (NSTC), Taiwan.