arxiv: 2605.07943 · v1 · submitted 2026-05-08 · 💻 cs.RO · cs.AI· cs.CV· cs.LG

Recognition: 3 theorem links

· Lean Theorem

TAVIS: A Benchmark for Egocentric Active Vision and Anticipatory Gaze in Imitation Learning

Giacomo Spigler

Authors on Pith no claims yet

Pith reviewed 2026-05-11 02:59 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.CVcs.LG

keywords active visionimitation learninganticipatory gazebenchmarkegocentric visionmanipulationdistribution shifthumanoid robots

0 comments

The pith

TAVIS benchmark shows active vision improves imitation learning in a task-dependent manner while imitation alone produces anticipatory gaze matching human timing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents TAVIS as evaluation infrastructure for active-vision imitation learning on two humanoid embodiments, with TAVIS-Head tasks using pan/tilt necks for global search and TAVIS-Hands tasks using wrist cameras for local occlusion. It supplies a paired headcam-versus-fixedcam protocol on the same demonstrations, the GALT metric to measure how far in advance policies direct gaze before actions, and procedural ID/OOD splits. Baseline runs with Diffusion Policy and π0 establish three results: active vision yields performance gains that vary by task rather than appearing uniformly, multi-task policies decline sharply when facing controlled shifts on both suites, and policies trained purely by imitation develop anticipatory gaze whose median lead time approaches that of the human teleoperator reference. These elements together allow systematic measurement of when and how much controlling gaze contributes in egocentric manipulation.

Core claim

TAVIS establishes that active-vision generally helps imitation learning for manipulation but benefits are task-conditional rather than uniform, that multi-task policies degrade sharply under controlled distribution shifts on both suites, and that imitation alone yields anticipatory gaze with median lead times comparable to the human teleoperator reference.

What carries the argument

TAVIS benchmark infrastructure, consisting of the paired headcam-vs-fixedcam protocol on identical demonstrations, the GALT (Gaze-Action Lead Time) metric, and procedural ID/OOD splits applied to the TAVIS-Head and TAVIS-Hands task suites.

If this is right

Active vision provides performance gains that depend on task type rather than applying uniformly across manipulation settings.
Multi-task imitation policies experience sharp degradation when encountering controlled distribution shifts in active-vision conditions.
Imitation training from demonstrations alone is sufficient to produce anticipatory gaze whose timing matches human teleoperator references.
The paired protocol and GALT metric together allow direct quantification of how much active vision contributes on each task.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If gaze anticipation emerges from imitation, then policies may implicitly learn useful viewpoint prediction as part of action forecasting in embodied settings.
The task-conditional nature of benefits suggests that future policies could incorporate mechanisms to decide dynamically whether to move the camera.
Testing the same primitives on physical hardware rather than simulation would reveal whether latency and sensor noise alter the observed advantages.
Adding tasks that require longer-horizon planning could show whether anticipatory gaze scales beyond the current short-horizon manipulation suites.

Load-bearing premise

The selected tasks, embodiments, and distribution shifts in TAVIS-Head and TAVIS-Hands sufficiently represent the real challenges and benefits of active vision in imitation learning for manipulation.

What would settle it

Running the same baselines on a new set of manipulation tasks outside TAVIS and finding either uniform benefits or no benefits at all from active vision would falsify the task-conditional claim.

Figures

Figures reproduced from arXiv: 2605.07943 by Giacomo Spigler.

**Figure 1.** Figure 1: The TAVIS Benchmark. TAVIS comprises two task suites that isolate distinct roles of active vision in manipulation. TAVIS-Head targets global active vision – head reorientation for search and to handle clutter – while TAVIS-Hands targets local active vision via wrist cameras peering past occlusions. Demonstrations are collected via first-person Meta Quest 3 teleoperation with gaze control through head movem… view at source ↗

**Figure 2.** Figure 2: TAVIS results overview. Aggregated multi-task π0 success rates across the four main evaluation cuts of Section 5. Bars: suite-mean SR (per-task averaged over robots); coloured dots: per-task points; thin lines: paired conditions per task. (A) Q1, active vision: head-vs-fixed on TAVISHead, and head + wrist SR on TAVIS-Hands (no fixed-cam variant by design). (B) Q2, multi-task scaling: single-task checkpoin… view at source ↗

**Figure 3.** Figure 3: GALT (Gaze-Action Lead Time) distributions per TAVIS-Head task: multi-task π0 policy vs human-teleoperator reference. Solid curves are the multi-task π0 headcam-policy GALT distribution per robot (GR1T2 blue, Reachy2 orange); dashed curves are the human teleoperation reference at the dataset’s native 60 Hz. Light shaded histograms behind each curve use 20 equal-width bins on [−0.5, 3.5]s. Filled triangles … view at source ↗

**Figure 4.** Figure 4: Initial-state distributions: teleop dataset, id eval reset, and ood-init-pose perturbation. Rows: robot (GR1T2 top, Reachy2 bottom). Columns: end-effector position x, |y|, z (metres), and neck pitch, yaw (degrees). Histograms and KDEs compare the frame-0 distribution in the teleoperation dataset (green) with the ood-init-pose eval distribution (red, σpos = 0.1 m and σhead = 0.175 rad ≈ 10◦ ); the determini… view at source ↗

read the original abstract

Active vision -- where a policy controls its own gaze during manipulation -- has emerged as a key capability for imitation learning, with multiple independent systems demonstrating its benefits in the past year. Yet there is no shared benchmark to compare approaches or quantify what active vision contributes, on which task types, and under what conditions. We introduce TAVIS, evaluation infrastructure for active-vision imitation learning, with two complementary task suites -- TAVIS-Head (5 tasks, global search via pan/tilt necks) and TAVIS-Hands (3 tasks, local occlusion via wrist cameras) -- on two humanoid torso embodiments (GR1T2, Reachy2), built on IsaacLab. TAVIS provides three evaluation primitives: a paired headcam-vs-fixedcam protocol on identical demonstrations; GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies; and procedural ID/OOD splits. Baseline experiments with Diffusion Policy and $\pi_0$ reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform; (ii) multi-task policies degrade sharply under controlled distribution shifts on both suites; and (iii) imitation alone yields anticipatory gaze, with median lead times comparable to the human teleoperator reference. Code, evaluation scripts, demonstrations (LeRobot v3.0; ~2200 episodes) and trained baselines are released at https://github.com/spiglerg/tavis and https://huggingface.co/tavis-benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces TAVIS, a benchmark for egocentric active vision and anticipatory gaze in imitation learning. It features two task suites: TAVIS-Head (5 tasks with global search via pan/tilt necks) and TAVIS-Hands (3 tasks with local occlusion via wrist cameras) on GR1T2 and Reachy2 humanoids in IsaacLab. The benchmark includes a paired headcam-vs-fixedcam protocol on identical demonstrations, the novel GALT metric for quantifying anticipatory gaze lead times, and procedural ID/OOD splits. Baseline experiments using Diffusion Policy and π0 show that active vision provides task-conditional benefits, multi-task policies degrade under distribution shifts, and imitation learning produces anticipatory gaze with median lead times similar to human teleoperators. Code, data, and models are released.

Significance. If the results hold, TAVIS offers a much-needed standardized evaluation platform for active-vision approaches in imitation learning, filling a gap in the field. The open release of ~2200 episodes, evaluation scripts, and baselines promotes reproducibility and comparison. The GALT metric, grounded in cognitive science and HRI, provides a new way to measure anticipatory behavior. The findings on task-conditional benefits and multi-task degradation highlight important considerations for policy design. The limited task set means broader significance depends on representativeness of these scenarios.

major comments (2)

The claim that active-vision 'generally helps' is based on experiments with 8 tasks. This may overstate the generality given the specific embodiments and procedural splits; the task-conditional benefits are interesting but their broader implications require more qualification in the abstract.
Limited details are provided on the number of runs, statistical tests, and data exclusion rules supporting the three findings. This weakens the strength of the empirical claims and should be expanded for reproducibility and confidence in the results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the major comments below and have updated the manuscript to improve clarity and reproducibility.

read point-by-point responses

Referee: The claim that active-vision 'generally helps' is based on experiments with 8 tasks. This may overstate the generality given the specific embodiments and procedural splits; the task-conditional benefits are interesting but their broader implications require more qualification in the abstract.

Authors: We agree that the abstract should more explicitly qualify the generality of the findings. Although the original text already notes that benefits are 'task-conditional rather than uniform', we have revised the abstract to state: 'active vision provides task-conditional benefits to imitation learning' and removed the 'generally helps' phrasing to avoid any overstatement. We have also added a qualification in the introduction and discussion sections emphasizing that these results are based on the specific 8 tasks, two embodiments, and procedural splits, and that broader implications would require further validation. The task-dependent nature remains the key insight supported by the data. revision: yes
Referee: Limited details are provided on the number of runs, statistical tests, and data exclusion rules supporting the three findings. This weakens the strength of the empirical claims and should be expanded for reproducibility and confidence in the results.

Authors: We thank the referee for pointing this out. The original manuscript did not include sufficient experimental details. We have now expanded the 'Experiments' section and added a dedicated 'Reproducibility' subsection detailing: (1) all results are averaged over 5 independent runs with different random seeds; (2) statistical comparisons between headcam and fixedcam use paired t-tests with p < 0.05 for significance; (3) no episodes were excluded from the analysis—all ~2200 demonstrations were utilized. These details are provided to support the three main findings and enhance confidence in the results. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark with direct comparisons

full rationale

This is an empirical benchmark paper introducing TAVIS task suites, paired headcam-vs-fixedcam protocols, the GALT metric, and procedural ID/OOD splits, followed by baseline experiments on Diffusion Policy and π0. No derivations, equations, fitted parameters, or predictions appear in the provided text or abstract; all claims rest on released code, data (~2200 episodes), and direct experimental measurements rather than any self-definitional, fitted-input, or self-citation reduction. The central observations (task-conditional benefits, multi-task degradation, and human-comparable lead times) are presented as outcomes of those comparisons, with no load-bearing step that reduces by construction to the paper's own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claims rest on the assumption that the chosen tasks capture relevant active vision challenges and that GALT provides a meaningful measure of anticipation; no free parameters are fitted to support the main findings.

axioms (1)

domain assumption The five TAVIS-Head and three TAVIS-Hands tasks, along with the chosen distribution shifts, represent meaningful and generalizable challenges for egocentric active vision in manipulation.
Invoked to interpret the baseline results as evidence of task-conditional benefits.

invented entities (1)

GALT (Gaze-Action Lead Time) metric no independent evidence
purpose: Quantifies anticipatory gaze by measuring lead time between gaze movement and action in learned policies.
Newly defined metric grounded in cognitive science and HRI; no independent evidence provided beyond the benchmark itself.

pith-pipeline@v0.9.0 · 5592 in / 1478 out tokens · 52607 ms · 2026-05-11T02:59:32.476813+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GALT (Gaze-Action Lead Time), a novel metric grounded in cognitive science and HRI that quantifies anticipatory gaze in learned policies
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Baseline experiments with Diffusion Policy and π0 reveal that (i) active-vision generally helps, but benefits are task-conditional rather than uniform
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

procedural ID/OOD splits

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · 4 internal anchors

[1]

2025 , organization=

Cheng, Xuxin and Li, Jialong and Yang, Shiqi and Yang, Ge and Wang, Xiaolong , booktitle=. 2025 , organization=

work page 2025
[2]

2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=

Active vision might be all you need: Exploring active vision in bimanual robotic manipulation , author=. 2025 IEEE International Conference on Robotics and Automation (ICRA) , pages=. 2025 , organization=

work page 2025
[3]

Conference on Robot Learning , pages=

Vision in Action: Learning Active Perception from Human Demonstrations , author=. Conference on Robot Learning , pages=. 2025 , organization=

work page 2025
[4]

Eye, Robot: Learning to Look to Act with a

Kerr, Justin and Hari, Kush and Weber, Ethan and Kim, Chung Min and Yi, Brent and Bonnen, Tyler and Goldberg, Ken and Kanazawa, Angjoo , booktitle=. Eye, Robot: Learning to Look to Act with a. 2025 , organization=

work page 2025
[5]

arXiv e-prints , pages=

Look, Focus, Act: Efficient and Robust Robot Learning via Human Gaze and Foveated Vision Transformers , author=. arXiv e-prints , pages=

work page
[6]

Yu, Justin and Shentu, Yide and Wu, Di and Abbeel, Pieter and Goldberg, Ken and Wu, Philipp , journal=

work page
[8]

Liu, Yushan and Mu, Shilong and Chao, Xintao and Li, Zizhen and Mu, Yao and Chen, Tianxing and Li, Shoujie and Lyu, Chuqiao and Zhang, Xiao-ping and Ding, Wenbo , journal=

work page
[9]

Hong Kong Journal of Occupational Therapy , volume=

Temporal differences in eye--hand coordination between children and adults during manual action on objects , author=. Hong Kong Journal of Occupational Therapy , volume=. 2018 , publisher=

work page 2018
[10]

Journal of vision , volume=

Saccadic eye movements in a high-speed bimanual stacking task: Changes of attentional control during learning and automatization , author=. Journal of vision , volume=. 2011 , publisher=

work page 2011
[11]

Journal of neuroscience , volume=

Eye--hand coordination in object manipulation , author=. Journal of neuroscience , volume=. 2001 , publisher=

work page 2001
[12]

Perception , volume=

The roles of vision and eye movements in the control of activities of daily living , author=. Perception , volume=. 1999 , publisher=

work page 1999
[13]

The 23rd IEEE International Symposium on robot and human interactive communication , pages=

Legible robot pointing , author=. The 23rd IEEE International Symposium on robot and human interactive communication , pages=. 2014 , organization=

work page 2014
[14]

2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI) , pages=

Legibility and predictability of robot motion , author=. 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI) , pages=. 2013 , organization=

work page 2013
[15]

Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction , pages=

Deliberate delays during robot-to-human handovers improve compliance with gaze communication , author=. Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction , pages=

work page 2014
[16]

Biological cybernetics , volume=

Learning robotic eye--arm--hand coordination from human demonstration: a coupled dynamical systems approach , author=. Biological cybernetics , volume=. 2014 , publisher=

work page 2014
[17]

Surgical Endoscopy , volume=

Spatiotemporal characteristics of eye-hand coordination among different skill levels in laparoscopic surgery , author=. Surgical Endoscopy , volume=. 2026 , publisher=

work page 2026
[18]

arXiv e-prints , pages=

Prime and Reach: Synthesising Body Motion for Gaze-Primed Object Reach , author=. arXiv e-prints , pages=

work page
[19]

Interaction Studies , volume=

Robots can be perceived as goal-oriented agents , author=. Interaction Studies , volume=. 2013 , publisher=

work page 2013
[20]

The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

RoboCerebra: A Large-scale Benchmark for Long-horizon Robotic Manipulation Evaluation , author=. The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track , year=

work page
[21]

Liu, Bo and Zhu, Yifeng and Gao, Chongkai and Feng, Yihao and Liu, Qiang and Zhu, Yuke and Stone, Peter , journal=

work page
[22]

Zhou, Xueyang and Xu, Yangming and Tie, Guiyao and Chen, Yongchao and Zhang, Guowen and Chu, Duanfeng and Zhou, Pan and Sun, Lichao , journal=

work page
[23]

Fei, Senyu and Wang, Siyin and Shi, Junhao and Dai, Zihao and Cai, Jikun and Qian, Pengfang and Ji, Li and He, Xinzhe and Zhang, Shiduo and Fei, Zhaoye and others , journal=

work page
[24]

arXiv e-prints , pages=

Distracted Robot: How Visual Clutter Undermine Robotic Manipulation , author=. arXiv e-prints , pages=

work page
[25]

2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids) , pages=

Imitation of human motion achieves natural head movements for humanoid robots in an active-speaker detection task , author=. 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids) , pages=. 2024 , organization=

work page 2024
[26]

Proceedings of Robotics: Science and Systems (RSS) , year=

Diffusion Policy: Visuomotor Policy Learning via Action Diffusion , author=. Proceedings of Robotics: Science and Systems (RSS) , year=

work page
[27]

Proceedings of Robotics: Science and Systems (RSS) , year=

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware , author=. Proceedings of Robotics: Science and Systems (RSS) , year=

work page
[30]

2025 , url =

Isaac Lab Arena: Composable Environment Creation and Policy Evaluation for Robotics , author =. 2025 , url =

work page 2025
[31]

robosuite: A Modular Simulation Framework and Benchmark for Robot Learning

robosuite: A Modular Simulation Framework and Benchmark for Robot Learning , author=. arXiv:2009.12293 , year=

work page internal anchor Pith review arXiv 2009
[32]

Communications of the ACM , volume=

Datasheets for datasets , author=. Communications of the ACM , volume=. 2021 , publisher=

work page 2021
[33]

Cadene, Remi and Alibert, Simon and Soare, Alexander and Gallouedec, Quentin and Zouitine, Adil and Palma, Steven and Kooijmans, Pepijn and Aractingi, Michel and Shukor, Mustafa and Aubakirova, Dana and Russi, Martino and Capuano, Francesco and Pascal, Caroline and Choghari, Jade and Moss, Jess and Wolf, Thomas , title =

work page
[34]

Mandlekar, Ajay and Nasiriany, Soroush and Wen, Bowen and Akinola, Iretiayo and Narang, Yashraj and Fan, Linxi and Zhu, Yuke and Fox, Dieter , journal=

work page
[35]

2025 , organization=

Jiang, Zhenyu and Xie, Yuqi and Lin, Kevin and Xu, Zhenjia and Wan, Weikang and Mandlekar, Ajay and Fan, Linxi Jim and Zhu, Yuke , booktitle=. 2025 , organization=

work page 2025
[36]

Proceedings of the IEEE , volume=

Active perception , author=. Proceedings of the IEEE , volume=. 1988 , publisher=

work page 1988
[37]

International journal of computer vision , volume=

Active vision , author=. International journal of computer vision , volume=. 1988 , publisher=

work page 1988
[38]

Artificial intelligence , volume=

Animate vision , author=. Artificial intelligence , volume=. 1991 , publisher=

work page 1991
[39]

Autonomous Robots , volume=

Revisiting active perception , author=. Autonomous Robots , volume=. 2018 , publisher=

work page 2018
[40]

2020 , publisher=

James, Stephen and Ma, Zicong and Arrojo, David Rovick and Davison, Andrew J , journal=. 2020 , publisher=

work page 2020
[41]

2022 , publisher=

Mees, Oier and Hermann, Lukas and Rosete-Beas, Erick and Burgard, Wolfram , journal=. 2022 , publisher=

work page 2022
[42]

Nasiriany, Soroush and Maddukuri, Abhiram and Zhang, Lance and Parikh, Adeet and Lo, Aaron and Joshi, Abhishek and Mandlekar, Ajay and Zhu, Yuke , year=

work page
[43]

2023 , organization=

Walke, Homer Rich and Black, Kevin and Zhao, Tony Z and Vuong, Quan and Zheng, Chongyi and Hansen-Estruch, Philippe and He, Andre Wang and Myers, Vivek and Kim, Moo Jin and Du, Max and others , booktitle=. 2023 , organization=

work page 2023
[45]

Deliberate delays during robot-to-human handovers improve compliance with gaze communication

Henny Admoni, Anca Dragan, Siddhartha S Srinivasa, and Brian Scassellati. Deliberate delays during robot-to-human handovers improve compliance with gaze communication. In Proceedings of the 2014 ACM/IEEE international conference on Human-robot interaction, pages 49--56, 2014

work page 2014
[46]

Active vision

John Aloimonos, Isaac Weiss, and Amit Bandyopadhyay. Active vision. International journal of computer vision, 1 0 (4): 0 333--356, 1988

work page 1988
[47]

Active perception

Ruzena Bajcsy. Active perception. Proceedings of the IEEE, 76 0 (8): 0 966--1005, 1988

work page 1988
[48]

Animate vision

Dana H Ballard. Animate vision. Artificial intelligence, 48 0 (1): 0 57--86, 1991

work page 1991
[49]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky. _0 : A vis...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Lerobot: State-of-the-art machine learning for real-world robotics in pytorch

Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, Steven Palma, Pepijn Kooijmans, Michel Aractingi, Mustafa Shukor, Dana Aubakirova, Martino Russi, Francesco Capuano, Caroline Pascal, Jade Choghari, Jess Moss, and Thomas Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/hu...

work page 2024
[51]

Open-TeleVision : Teleoperation with immersive active visual feedback

Xuxin Cheng, Jialong Li, Shiqi Yang, Ge Yang, and Xiaolong Wang. Open-TeleVision : Teleoperation with immersive active visual feedback. In Conference on Robot Learning, pages 2729--2749. PMLR, 2025

work page 2025
[52]

Diffusion policy: Visuomotor policy learning via action diffusion

Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023

work page 2023
[53]

Active vision might be all you need: Exploring active vision in bimanual robotic manipulation

Ian Chuang, Andrew Lee, Dechen Gao, M-Mahdi Naddaf-Sh, and Iman Soltani. Active vision might be all you need: Exploring active vision in bimanual robotic manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 7952--7959. IEEE, 2025 a

work page 2025
[54]

Look, focus, act: Efficient and robust robot learning via human gaze and foveated vision transformers

Ian Chuang, Andrew Lee, Dechen Gao, Jinyu Zou, and Iman Soltani. Look, focus, act: Efficient and robust robot learning via human gaze and foveated vision transformers. arXiv e-prints, pages arXiv--2507, 2025 b

work page 2025
[55]

Imitation of human motion achieves natural head movements for humanoid robots in an active-speaker detection task

Bosong Ding, Murat Kirtay, and Giacomo Spigler. Imitation of human motion achieves natural head movements for humanoid robots in an active-speaker detection task. In 2024 IEEE-RAS 23rd International Conference on Humanoid Robots (Humanoids), pages 645--652. IEEE, 2024

work page 2024
[56]

Legibility and predictability of robot motion

Anca D Dragan, Kenton CT Lee, and Siddhartha S Srinivasa. Legibility and predictability of robot motion. In 2013 8th ACM/IEEE International Conference on Human-Robot Interaction (HRI), pages 301--308. IEEE, 2013

work page 2013
[57]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. LIBERO-Plus : In-depth robustness analysis of Vision-Language-Action models. arXiv preprint arXiv:2510.13626, 2025

work page internal anchor Pith review arXiv 2025
[58]

Saccadic eye movements in a high-speed bimanual stacking task: Changes of attentional control during learning and automatization

Rebecca M Foerster, Elena Carbone, Hendrik Koesling, and Werner X Schneider. Saccadic eye movements in a high-speed bimanual stacking task: Changes of attentional control during learning and automatization. Journal of vision, 11 0 (7): 0 9--9, 2011

work page 2011
[59]

Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation

Songhao Han, Boxiang Qiu, Yue Liao, Siyuan Huang, Chen Gao, Shuicheng Yan, and Si Liu. Robocerebra: A large-scale benchmark for long-horizon robotic manipulation evaluation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

work page 2025
[60]

Prime and reach: Synthesising body motion for gaze-primed object reach

Masashi Hatano, Saptarshi Sinha, Jacob Chalk, Wei-Hong Li, Hideo Saito, and Dima Damen. Prime and reach: Synthesising body motion for gaze-primed object reach. arXiv e-prints, pages arXiv--2512, 2025

work page 2025
[61]

Towards exploratory and focused manipulation with bimanual active perception: A new problem, benchmark and strategy

Yuxin He, Ruihao Zhang, Tianao Shen, Cheng Liu, and Qiang Nie. Towards exploratory and focused manipulation with bimanual active perception: A new problem, benchmark and strategy. arXiv preprint arXiv:2602.01939, 2026

work page arXiv 2026
[62]

Legible robot pointing

Rachel M Holladay, Anca D Dragan, and Siddhartha S Srinivasa. Legible robot pointing. In The 23rd IEEE International Symposium on robot and human interactive communication, pages 217--223. IEEE, 2014

work page 2014
[63]

RLBench : The robot learning benchmark & learning environment

Stephen James, Zicong Ma, David Rovick Arrojo, and Andrew J Davison. RLBench : The robot learning benchmark & learning environment. IEEE Robotics and Automation Letters, 5 0 (2): 0 3019--3026, 2020

work page 2020
[64]

o ran Westling, Anders B \

Roland S Johansson, G \"o ran Westling, Anders B \"a ckstr \"o m, and J Randall Flanagan. Eye--hand coordination in object manipulation. Journal of neuroscience, 21 0 (17): 0 6917--6932, 2001

work page 2001
[65]

Eye, robot: Learning to look to act with a BC-RL perception-action loop

Justin Kerr, Kush Hari, Ethan Weber, Chung Min Kim, Brent Yi, Tyler Bonnen, Ken Goldberg, and Angjoo Kanazawa. Eye, robot: Learning to look to act with a BC-RL perception-action loop. In Conference on Robot Learning, pages 3647--3664. PMLR, 2025

work page 2025
[66]

Temporal differences in eye--hand coordination between children and adults during manual action on objects

Hye Jin Kim, Cho Hee Lee, and Eun Young Kim. Temporal differences in eye--hand coordination between children and adults during manual action on objects. Hong Kong Journal of Occupational Therapy, 31 0 (2): 0 106--114, 2018

work page 2018
[67]

The roles of vision and eye movements in the control of activities of daily living

Michael Land, Neil Mennie, and Jennifer Rusted. The roles of vision and eye movements in the control of activities of daily living. Perception, 28 0 (11): 0 1311--1328, 1999

work page 1999
[68]

LIBERO : Benchmarking knowledge transfer for lifelong robot learning

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. LIBERO : Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36: 0 44776--44791, 2023

work page 2023
[69]

AVR : Active vision-driven robotic precision manipulation with viewpoint and focal length optimization

Yushan Liu, Shilong Mu, Xintao Chao, Zizhen Li, Yao Mu, Tianxing Chen, Shoujie Li, Chuqiao Lyu, Xiao-ping Zhang, and Wenbo Ding. AVR : Active vision-driven robotic precision manipulation with viewpoint and focal length optimization. arXiv e-prints, pages arXiv--2503, 2025

work page 2025
[70]

CALVIN : A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks

Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. CALVIN : A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters, 7 0 (3): 0 7327--7334, 2022

work page 2022
[71]

Isaac Lab: A GPU-Accelerated Simulation Framework for Multi-Modal Robot Learning

Mayank Mittal, Pascal Roth, James Tigue, Antoine Richard, Octi Zhang, Peter Du, Antonio Serrano-Muñoz, Xinjie Yao, René Zurbrügg, Nikita Rudin, Lukasz Wawrzyniak, Milad Rakhsha, Alain Denzler, Eric Heiden, Ales Borovicka, Ossama Ahmed, Iretiayo Akinola, Abrar Anwar, Mark T. Carlson, Ji Yuan Feng, Animesh Garg, Renato Gasoto, Lionel Gulich, Yijie Guo, M. G...

work page internal anchor Pith review arXiv 2025
[72]

RoboCasa : Large-scale simulation of household tasks for generalist robots

Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. RoboCasa : Large-scale simulation of household tasks for generalist robots. In Robotics: Science and Systems Foundation, 2024

work page 2024
[73]

Isaac lab arena: Composable environment creation and policy evaluation for robotics, 2025

NVIDIA Isaac Lab Arena Contributors . Isaac lab arena: Composable environment creation and policy evaluation for robotics, 2025. URL https://github.com/isaac-sim/IsaacLab-Arena

work page 2025
[74]

Tactile mnist: Benchmarking active tactile perception

Tim Schneider, Guillaume Duret, Cristiana de Farias, Roberto Calandra, Liming Chen, and Jan Peters. Tactile mnist: Benchmarking active tactile perception. arXiv preprint arXiv:2506.06361, 2025

work page arXiv 2025
[75]

Robots can be perceived as goal-oriented agents

Alessandra Sciutti, Ambra Bisio, Francesco Nori, Giorgio Metta, Luciano Fadiga, and Giulio Sandini. Robots can be perceived as goal-oriented agents. Interaction Studies, 14 0 (3): 0 329--350, 2013

work page 2013
[76]

Vision in action: Learning active perception from human demonstrations

Haoyu Xiong, Xiaomeng Xu, Jimmy Wu, Yifan Hou, Jeannette Bohg, and Shuran Song. Vision in action: Learning active perception from human demonstrations. In Conference on Robot Learning, pages 5450--5463. PMLR, 2025

work page 2025
[77]

EgoMI : Learning active vision and whole-body manipulation from egocentric human demonstrations

Justin Yu, Yide Shentu, Di Wu, Pieter Abbeel, Ken Goldberg, and Philipp Wu. EgoMI : Learning active vision and whole-body manipulation from egocentric human demonstrations. arXiv e-prints, pages arXiv--2511, 2025

work page 2025
[78]

LIBERO-PRO : Towards robust and fair evaluation of Vision-Language-Action models beyond memorization

Xueyang Zhou, Yangming Xu, Guiyao Tie, Yongchao Chen, Guowen Zhang, Duanfeng Chu, Pan Zhou, and Lichao Sun. LIBERO-PRO : Towards robust and fair evaluation of Vision-Language-Action models beyond memorization. arXiv e-prints, pages arXiv--2510, 2025

work page 2025