ATLAS: An Annotation Tool for Long-horizon Robotic Action Segmentation

Daniel Sliwowski; Dongheui Lee; Sergej Stanovcic

arXiv: 2604.26637 · v1 · submitted 2026-04-29 · cs.RO · cs.AI

Pith reviewed 2026-05-07 11:43 UTC · model grok-4.3

keywords: robotic action segmentation, annotation tool, multi-modal data, long-horizon tasks, action boundaries, ROS bags, RLDS, proprioceptive signals

The pith

ATLAS is an annotation tool that integrates robot time-series signals with video to speed up labeling of long action sequences and sharpen boundary accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ATLAS to fill gaps in tools for annotating extended robotic demonstrations, where precise action boundaries matter for training segmentation models and policies. Current options are mostly vision-only, lack native robot signal support, or demand heavy customization for different data formats. ATLAS adds synchronized views of multi-view video plus proprioceptive signals like forces and gripper states, handles ROS bags and RLDS out of the box through a modular layer, and uses keyboard shortcuts. On a contact-rich assembly task, it cut average per-action annotation time by at least 6 percent versus ELAN. When annotators also used the time-series signals, temporal alignment with expert annotations improved by more than 2.8 percent and boundary error dropped fivefold versus vision-only tools.

Core claim

ATLAS is an annotation tool tailored for long-horizon robotic action segmentation that provides time-synchronized visualization of multi-modal data including multi-view video and proprioceptive signals, natively supports formats such as ROS bags and RLDS, offers direct support for datasets like REASSEMBLE, and includes a modular abstraction layer for easy extension plus a keyboard-centric interface. In experiments on a contact-rich assembly task, ATLAS reduced the average per-action annotation time by at least 6 percent compared to ELAN, while the inclusion of time-series data improved temporal alignment with expert annotations by more than 2.8 percent and decreased boundary error fivefold compared to vision-only annotation tools.

What carries the argument

Time-synchronized multi-modal visualization of video and proprioceptive signals together with a modular dataset abstraction layer that natively reads ROS bags and RLDS.

If this is right

Larger and more precisely labeled robotic datasets become feasible for training action segmentation and policy learning methods.
Evaluation of manipulation policies gains reliability because action boundaries better match expert judgment.
Setup time drops for new robotics datasets because ROS bags and RLDS are handled without extra code.
The modular layer lets teams adapt ATLAS to additional formats with only modest engineering work.
Keyboard-centric design lowers fatigue during long annotation sessions of extended demonstrations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Wider adoption could increase the scale of high-quality imitation-learning datasets across manipulation research.
The same multi-modal synchronization principle may transfer to annotation tasks in other sensor-heavy domains such as autonomous driving or human motion capture.
If the accuracy gains persist across user groups, ATLAS could become a de-facto standard for robotic action labeling.
Adding automated pre-segmentation suggestions on top of the existing interface would be a direct next step to test further time savings.

Load-bearing premise

The measured gains in speed and accuracy from one assembly task will hold for other robotic tasks, datasets, and users, and adding support for new formats will stay low-effort.

What would settle it

A follow-up user study on a different robotic task such as insertion or pick-and-place where average annotation time with ATLAS is not lower than with ELAN or where boundary error is not reduced when time-series signals are added.
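A follow-up study of this kind needs an explicit way to score an annotation against the expert reference. The sketch below shows one plausible pair of measures in Python: a mean boundary offset and a frame-wise label agreement. The exact metric definitions used in the paper are not given here, so both functions, the `Segment` tuple layout, and the default `fps` value are assumptions made for illustration only.

```python
from typing import List, Optional, Tuple

Segment = Tuple[float, float, str]  # (start_s, end_s, label); assumed layout


def mean_boundary_error(pred: List[Segment], ref: List[Segment]) -> float:
    """Mean absolute offset in seconds between corresponding boundaries.

    Assumes both annotations hold the same number of segments in the same order.
    """
    if len(pred) != len(ref) or not ref:
        raise ValueError("annotations must be non-empty and of equal length")
    offsets: List[float] = []
    for (p_start, p_end, _), (r_start, r_end, _) in zip(pred, ref):
        offsets.append(abs(p_start - r_start))
        offsets.append(abs(p_end - r_end))
    return sum(offsets) / len(offsets)


def label_at(segments: List[Segment], t: float) -> Optional[str]:
    """Label of the segment covering time t, or None if no segment covers it."""
    for start, end, label in segments:
        if start <= t < end:
            return label
    return None


def framewise_agreement(pred: List[Segment], ref: List[Segment],
                        duration: float, fps: float = 30.0) -> float:
    """Fraction of sampled frames on which both annotations give the same label."""
    n_frames = int(round(duration * fps))
    if n_frames <= 0:
        raise ValueError("duration and fps must yield at least one frame")
    hits = sum(label_at(pred, k / fps) == label_at(ref, k / fps)
               for k in range(n_frames))
    return hits / n_frames
```

Computing both measures per annotator and per tool, on a task other than the original assembly task, would give the numbers needed to check whether the reported gains carry over.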

Figures

Figures reproduced from arXiv: 2604.26637 by Daniel Sliwowski, Dongheui Lee, Sergej Stanovcic.

**Figure 1.** ATLAS is a tool for long-horizon robotic action segmen…

**Figure 2.** (a) Layout of the ATLAS annotation tool for a partially annotated episode. The top section shows multi-view camera visualizations, …

Original abstract

Annotating long-horizon robotic demonstrations with precise temporal action boundaries is crucial for training and evaluating action segmentation and manipulation policy learning methods. Existing annotation tools, however, are often limited: they are designed primarily for vision-only data, do not natively support synchronized visualization of robot-specific time-series signals (e.g., gripper state or force/torque), or require substantial effort to adapt to different dataset formats. In this paper, we introduce ATLAS, an annotation tool tailored for long-horizon robotic action segmentation. ATLAS provides time-synchronized visualization of multi-modal robotic data, including multi-view video and proprioceptive signals, and supports annotation of action boundaries, action labels, and task outcomes. The tool natively handles widely used robotics dataset formats such as ROS bags and the Reinforcement Learning Dataset (RLDS) format, and provides direct support for specific datasets such as REASSEMBLE. ATLAS can be easily extended to new formats via a modular dataset abstraction layer. Its keyboard-centric interface minimizes annotation effort and improves efficiency. In experiments on a contact-rich assembly task, ATLAS reduced the average per-action annotation time by at least 6% compared to ELAN, while the inclusion of time-series data improved temporal alignment with expert annotations by more than 2.8% and decreased boundary error fivefold compared to vision-only annotation tools.
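Time-synchronized visualization boils down to answering, for the current video frame, which sample of each signal should be shown. The Python sketch below illustrates one common way to do that, a nearest-timestamp lookup over a sorted timestamp array. It is an illustration under assumptions, not the ATLAS implementation: the function name `nearest_index` is hypothetical, and the sketch assumes timestamps are sorted and share one clock.

```python
from bisect import bisect_left
from typing import Sequence


def nearest_index(timestamps: Sequence[float], t: float) -> int:
    """Index of the sample in sorted `timestamps` that is closest to time `t`."""
    if not timestamps:
        raise ValueError("timestamps must be non-empty")
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0                      # t is before the first sample
    if i == len(timestamps):
        return len(timestamps) - 1    # t is after the last sample
    # Choose the closer neighbour; ties go to the earlier sample.
    before, after = timestamps[i - 1], timestamps[i]
    return i - 1 if t - before <= after - t else i
```

With a lookup like this, a 30 Hz camera stream and a much faster force/torque stream can be scrubbed together: the frame timestamp drives the cursor, and each signal plot is indexed with `nearest_index`.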

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

ATLAS is a practical robotics-specific annotation tool with multi-modal support, but its key performance claims rest on comparisons that don't isolate the new features from other interface changes.

The core contribution is a new annotation interface built for long-horizon robotic demonstrations. It adds native synchronized display of multi-view video plus proprioceptive signals, direct handling of ROS bags and RLDS, and a keyboard-driven workflow aimed at reducing effort on action boundary labeling and task outcome annotation. The modular abstraction layer is presented as a way to add new dataset formats without heavy rework.

That combination addresses a real workflow friction for people creating manipulation datasets, where vision-only tools force extra scripting or lose temporal alignment with robot state data. The reported user study on a contact-rich assembly task shows a modest time saving versus ELAN and better alignment numbers when time-series signals are included. Those results give some concrete evidence that the tool can be faster in practice for this kind of data.

The main weakness is in how the gains are measured. The fivefold drop in boundary error and the 2.8% alignment improvement are shown against separate external vision-only tools rather than an internal ablation that keeps the ATLAS interface fixed and toggles only the time-series view. Without that control, it is difficult to attribute the improvement cleanly to the added modalities instead of shortcuts, multi-view layout, or other design choices. The abstract also gives no details on annotator count, inter-annotator variance, or significance testing, which makes the quantitative claims harder to weigh. The extension mechanism is described but not exercised with a worked example of adding a new format.

This paper is aimed at robotics groups that regularly label their own long-horizon manipulation data for segmentation or imitation learning. Researchers who already use ELAN or similar tools and want tighter integration with robot logs will find the most immediate value.

The work is coherent on its own terms and shows honest attention to the target use case, so it deserves a serious referee rather than a desk reject. A revised version with an internal ablation and clearer stats would strengthen it considerably.

Referee Report

3 major / 1 minor

Summary. The paper introduces ATLAS, a specialized annotation tool for long-horizon robotic action segmentation that provides synchronized multi-modal visualization (multi-view video plus proprioceptive/time-series signals such as gripper state and force/torque), native support for ROS bags and RLDS formats, a modular abstraction layer for new datasets, and a keyboard-centric interface. On a contact-rich assembly task, the authors report that ATLAS reduces average per-action annotation time by at least 6% relative to ELAN, while the addition of time-series data improves temporal alignment with expert labels by more than 2.8% and reduces boundary error fivefold compared with vision-only annotation tools.

Significance. If the empirical claims are substantiated with proper controls and statistics, ATLAS could meaningfully lower the barrier to creating high-quality labeled robotic datasets, directly benefiting action segmentation and imitation-learning research. The modular dataset layer and native format support are practical strengths that address a real pain point in the community.

major comments (3)

[Abstract / Experiments] Abstract and experimental section: the fivefold boundary-error reduction and >2.8% alignment improvement are attributed to the inclusion of time-series data, yet the comparison is made against separate external vision-only tools rather than an internal ablation (ATLAS with time-series disabled vs. enabled, using identical interface, annotators, and protocol). This design confounds the contribution of synchronized proprioception with other ATLAS features (multi-view support, keyboard shortcuts, etc.).
[Abstract / Experiments] Abstract and experimental section: the reported 6% time reduction versus ELAN and the alignment/boundary metrics lack any description of the number of annotators, inter-annotator variance, statistical significance tests, or exact experimental protocol (e.g., counterbalancing, training time, task length). Without these details the quantitative claims cannot be evaluated for reliability.
[Abstract] Abstract: the generalizability claim that the modular abstraction layer will require only modest effort for new formats and that the observed gains will hold for other robotic datasets and user populations rests on a single contact-rich assembly task; no cross-dataset or cross-task validation is described.
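The first two comments ask for an internal ablation and for significance testing. As a sketch of what such an analysis could look like, the Python function below runs an exact paired sign-flip permutation test on per-annotator scores from two conditions, for example boundary error with the time-series view enabled versus disabled. This is one standard choice among several and is not the procedure used in the paper; the function name and the pairing by annotator are assumptions.

```python
from itertools import product
from typing import Sequence


def paired_permutation_pvalue(cond_a: Sequence[float],
                              cond_b: Sequence[float]) -> float:
    """Two-sided exact sign-flip test on the paired differences cond_a[i] - cond_b[i].

    Under the null hypothesis that the condition makes no difference, each
    paired difference is equally likely to have either sign, so every sign
    assignment is enumerated and compared with the observed total.
    """
    if len(cond_a) != len(cond_b) or not cond_a:
        raise ValueError("conditions must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(cond_a, cond_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float round-off
            extreme += 1
    return extreme / total
```

The enumeration grows as 2^n in the number of pairs, which is fine for the annotator counts typical of a tool user study; for larger samples a random subset of sign assignments would be the usual substitute.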

minor comments (1)

[Abstract] The abstract would benefit from a brief statement of the number of actions or total duration of the evaluated demonstrations to contextualize the per-action time savings.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions where appropriate. Our responses aim to clarify the experimental design and strengthen the presentation without overstating the results.

Referee: [Abstract / Experiments] Abstract and experimental section: the fivefold boundary-error reduction and >2.8% alignment improvement are attributed to the inclusion of time-series data, yet the comparison is made against separate external vision-only tools rather than an internal ablation (ATLAS with time-series disabled vs. enabled, using identical interface, annotators, and protocol). This design confounds the contribution of synchronized proprioception with other ATLAS features (multi-view support, keyboard shortcuts, etc.).

Authors: We thank the referee for highlighting this important distinction. The reported metrics compare the full ATLAS system (with synchronized time-series visualization) against established external vision-only tools, as the latter do not support proprioceptive signals. An internal ablation disabling time-series within ATLAS while holding all other interface elements fixed was not conducted, as the primary objective was to benchmark against widely used annotation tools rather than isolate one feature. We acknowledge that this leaves open the possibility of confounding with ATLAS-specific elements such as multi-view video and the keyboard-centric design. In the revised manuscript we will update the abstract and experimental section to describe the comparisons more precisely, avoid attributing improvements solely to time-series data, and explicitly note this as a limitation of the current evaluation.

Revision: partial

Referee: [Abstract / Experiments] Abstract and experimental section: the reported 6% time reduction versus ELAN and the alignment/boundary metrics lack any description of the number of annotators, inter-annotator variance, statistical significance tests, or exact experimental protocol (e.g., counterbalancing, training time, task length). Without these details the quantitative claims cannot be evaluated for reliability.

Authors: We agree that these methodological details are necessary to assess reliability. While the full manuscript contains a description of the experimental setup, we will expand the Experiments section to explicitly report the number of annotators, inter-annotator variance or agreement measures, the statistical tests performed, and a complete protocol description including counterbalancing procedures, annotator training, and task length. These additions will be incorporated in the revision.

Revision: yes

Referee: [Abstract] Abstract: the generalizability claim that the modular abstraction layer will require only modest effort for new formats and that the observed gains will hold for other robotic datasets and user populations rests on a single contact-rich assembly task; no cross-dataset or cross-task validation is described.

Authors: We acknowledge that the quantitative results are derived from a single contact-rich assembly task. The modular abstraction layer is presented with implementation details and examples of extension to ROS bags and RLDS, but we agree that broader cross-dataset or cross-task validation would provide stronger support for generalizability claims. In the revised manuscript we will qualify the relevant statements in the abstract, add an explicit limitations discussion, and describe the effort required for new formats based on our experience with the supported datasets.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on direct user-study measurements

The paper presents ATLAS as a new annotation tool and evaluates it via a user study on a contact-rich assembly task, reporting measured reductions in annotation time (vs. ELAN), improved temporal alignment, and lower boundary error (vs. vision-only tools). No equations, fitted parameters, predictions, uniqueness theorems, or ansatzes appear in the derivation chain. Claims are supported by direct empirical comparisons rather than any self-referential reduction or self-citation load-bearing step, so the results remain independent of the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied software tool paper. No free parameters, mathematical axioms, or invented physical entities are present; the contribution is the implemented system and its evaluation.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] S. Ramos, S. Girgin, L. Hussenot, D. Vincent, H. Yakubovich, D. Toyama, A. Gergely, P. Stanczyk, R. Marinier, J. Harmsen et al., “Rlds: an ecosystem to generate, share and use datasets in reinforcement learning,” arXiv preprint arXiv:2111.02767, 2021.
[2] D. Sliwowski, S. Jadav, S. Stanovcic, J. Orbik, J. Heidersberger, and D. Lee, “Demonstrating REASSEMBLE: A Multimodal Dataset for Contact-rich Robotic Assembly and Disassembly,” in Proceedings of Robotics: Science and Systems, Los Angeles, USA, June 2025.
[3] P. Wittenburg, H. Brugman, A. Russel, A. Klassmann, and H. Sloetjes, “Elan: A professional framework for multimodality research,” in 5th international conference on language resources and evaluation (LREC 2006), 2006, pp. 1556–1559.
[4] Y. Li, Z. Xue, and H. Xu, “Otas: unsupervised boundary detection for object-centric temporal action segmentation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 6437–6446.
[5] CVAT.ai Corporation, “Computer Vision Annotation Tool (CVAT),” Nov. 2023. [Online]. Available: https://github.com/cvat-ai/cvat
[6] K. Wada, “labelme: Image polygonal annotation with python,” https://github.com/wkentaro/labelme, 2018.
[7] J. Sun, A. Curtis, Y. You, Y. Xu, M. Koehle, Q. Chen, S. Huang, L. Guibas, S. Chitta, M. Schwager, and H. Li, “ARCH: Hierarchical hybrid learning for long-horizon contact-rich robotic assembly,” in 9th Annual Conference on Robot Learning, 2025.
[8] T. Eiband, J. Liebl, C. Willibald, and D. Lee, “Online task segmentation by merging symbolic and data-driven skill recognition during kinesthetic teaching,” Robotics and Autonomous Systems, vol. 162, p. 104367, 2023.
[9] Foxglove Technologies Inc., “Foxglove studio,” 2025, robotics visualization and observability platform. [Online]. Available: https://foxglove.dev/
[10] Foxglove Technologies Inc., “Mcap format specification,” 2025, file format for storing timestamped pub/sub messages with arbitrary serialization formats. [Online]. Available: https://mcap.dev/reference/
[11] Y. Zhang, H. Li, R. Tabatabaei, and W. Johal, “Rosannotator: A web application for rosbag data analysis in human-robot interaction,” in 2025 20th ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE, 2025, pp. 1099–1103.
[12] M. Kipp, “Anvil-a generic annotation tool for multimodal dialogue,” in Proc. Eurospeech 2001, 2001, pp. 1367–1370.
[13] D. Sliwowski and D. Lee, “M2r2: Multimodal robotic representation for temporal action segmentation,” in 2026 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2026.
[14] A. O’Neill et al., “Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0,” in 2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 6892–6903.
[15] E. Gamma, R. Helm, R. Johnson, and J. Vlissides, Design patterns: elements of reusable object-oriented software. USA: Addison-Wesley Longman Publishing Co., Inc., 1995.
[16] “Pyqtgraph: Scientific graphics and gui library for python,” https://www.pyqtgraph.org, accessed: 2026-02-04.
[17] “TensorFlow Datasets, a collection of ready-to-use datasets,” https://www.tensorflow.org/datasets
[18] M. Durkovic, “rosbags,” 2025, python package to load rosbags. [Online]. Available: https://pypi.org/project/rosbags/
[19] K. Kimble, K. Van Wyk, J. Falco, E. Messina, Y. Sun, M. Shibata, W. Uemura, and Y. Yokokohji, “Benchmarking protocols for evaluating small parts robotic assembly systems,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 883–889, 2020.
