iTeach: In the Wild Interactive Teaching for Failure-Driven Adaptation of Robot Perception
Pith reviewed 2026-05-23 20:01 UTC · model grok-4.3
The pith
A robot can adapt its object perception model during deployment by learning from short human interaction videos triggered by failures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that failure-driven samples, collected as short HumanPlay sequences and labeled by propagating a single annotated frame with the Few-Shot Semi-Supervised (FS3) strategy, enable iterative fine-tuning of a segmentation model, yielding significant gains in unseen object instance segmentation and in downstream robotic manipulation tasks.
What carries the argument
The HumanPlay interaction sequences paired with FS3 label propagation, which generates dense supervision from minimal annotation to support deployment-time model adaptation.
If this is right
- Segmentation performance improves across diverse real-world scenes with few failure-driven samples.
- Grasping and pick-and-place success rates increase on the SceneReplica benchmark and in real experiments.
- The perception model adapts progressively without requiring offline data collection and retraining.
- Annotation effort is reduced to eye-gaze and voice commands on the final frame of each sequence.
Where Pith is reading between the lines
- This could allow robots to handle a wider range of environments with less initial training data.
- The approach might be combined with other adaptation techniques like domain randomization for even better results.
- Human involvement during deployment raises questions about scalability to unsupervised settings.
- Gains in one task like segmentation could transfer to related perception problems such as depth estimation.
Load-bearing premise
The human-object interaction sequences provide object views that generalize to other scenes and the propagated labels are accurate enough to improve the model through fine-tuning.
What would settle it
A test showing that fine-tuning with the iTeach samples produces no measurable gain in segmentation IoU or no increase in grasping success rates on new scenes would disprove the main result.
Original abstract
Robotic perception models often fail when deployed in real-world environments due to out-of-distribution conditions such as clutter, occlusion, and novel object instances. Existing approaches address this gap through offline data collection and retraining, which are slow and do not resolve deployment-time failures. We propose iTeach, a failure-driven interactive teaching framework for adapting robot perception in the wild. A co-located human observes model predictions during deployment, identifies failure cases, and performs short human-object interaction (HumanPlay) to expose informative object configurations while recording RGB-D sequences. To minimize annotation effort, iTeach employs a Few-Shot Semi-Supervised (FS3) labeling strategy, where only the final frame of a short interaction sequence is annotated using hands-free eye-gaze and voice commands, and labels are propagated across the video to produce dense supervision. The collected failure-driven samples are used for iterative fine-tuning, enabling progressive deployment-time adaptation of the perception model. We evaluate iTeach on unseen object instance segmentation (UOIS) starting from a pretrained MSMFormer model. Using a small number of failure-driven samples, our method significantly improves segmentation performance across diverse real-world scenes. These improvements directly translate to higher grasping and pick-and-place success on the SceneReplica benchmark and real robotic experiments. Our results demonstrate that failure-driven, co-located interactive teaching enables efficient in-the-wild adaptation of robot perception and improves downstream manipulation performance. Project page at https://irvlutd.github.io/iTeach
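The teaching episode described in the abstract can be read as a small loop: record a sequence when a failure is spotted, annotate only the final frame, propagate that label across the sequence, then fine-tune. The following is a minimal sketch of that control flow, not the authors' implementation. The names `AdaptationLog`, `teach_on_failure`, `annotate_final_frame`, `propagate`, and `fine_tune` are all hypothetical, frames and masks are stand-in strings, and fine-tuning on every sample accumulated so far (rather than only the newest episode) is an assumption the abstract does not confirm.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple

Frame = str  # stand-in for an RGB-D frame (hypothetical type)
Mask = str   # stand-in for an instance-segmentation label (hypothetical type)


@dataclass
class AdaptationLog:
    """Labeled samples collected so far and the number of fine-tuning rounds."""
    samples: List[Tuple[Frame, Mask]] = field(default_factory=list)
    rounds: int = 0


def teach_on_failure(
    sequence: Sequence[Frame],
    annotate_final_frame: Callable[[Frame], Mask],
    propagate: Callable[[Sequence[Frame], Mask], List[Mask]],
    fine_tune: Callable[[List[Tuple[Frame, Mask]]], None],
    log: AdaptationLog,
) -> AdaptationLog:
    """Run one teaching episode on a recorded interaction sequence."""
    if not sequence:
        return log  # nothing was recorded, so there is nothing to learn from

    # Only the last frame is annotated by the human.
    final_mask = annotate_final_frame(sequence[-1])

    # The single label is propagated to every frame of the sequence.
    masks = propagate(sequence, final_mask)
    if len(masks) != len(sequence):
        raise ValueError("propagation must return one mask per frame")

    # Assumption: fine-tune on all samples accumulated across episodes.
    log.samples.extend(zip(sequence, masks))
    fine_tune(log.samples)
    log.rounds += 1
    return log
```

Passing the annotation, propagation, and fine-tuning steps as callables keeps the sketch independent of any particular segmentation model or video tracker; calling `teach_on_failure` once per observed failure gives the iterative adaptation the abstract describes.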
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes iTeach, a failure-driven interactive teaching framework for adapting robot perception models in the wild. A co-located human identifies model failures during deployment, collects short HumanPlay RGB-D sequences exposing informative configurations, annotates only the final frame via hands-free eye-gaze/voice commands, propagates labels with FS3 to generate dense supervision, and iteratively fine-tunes a pretrained MSMFormer for unseen object instance segmentation (UOIS). The paper claims that a small number of such samples yields significant segmentation gains that translate to improved grasping and pick-and-place success on the SceneReplica benchmark and real-robot experiments.
Significance. If the empirical claims hold after validation, the work offers a practical path to deployment-time perception adaptation that avoids slow offline retraining. The combination of failure-driven data collection with minimal-annotation FS3 labeling addresses a real robotics need. However, the current presentation provides no quantitative performance numbers, error bars, sample counts, or propagation-accuracy metrics, which substantially weakens the ability to judge the magnitude or reliability of the reported gains.
major comments (2)
- [FS3 labeling strategy (methods)] The FS3 label-propagation step (single-frame annotation followed by propagation across the HumanPlay sequence) is load-bearing for all downstream claims yet receives no quantitative validation. No per-frame IoU against held-out manual labels, no comparison to fully-supervised baselines, and no analysis of failure modes on occlusions or motion blur are reported; without this, it is impossible to confirm that the collected samples supply reliable dense supervision for fine-tuning.
- [Evaluation and results sections] The central empirical claim—that a small number of failure-driven samples produces significant UOIS and grasping improvements—is stated without any numerical results, baseline tables, error bars, or sample-size details. The abstract and evaluation sections therefore provide no evidence that the observed lifts exceed noise or that they generalize beyond the specific teaching episodes.
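The propagation check requested in the first major comment amounts to comparing each propagated mask with a held-out manual mask for the same frame. A minimal sketch of that comparison follows, assuming binary masks represented as sets of pixel coordinates. The function names `mask_iou`, `per_frame_iou`, and `frames_below`, and the convention that two empty masks score 1.0, are illustrative choices and do not come from the paper.

```python
from typing import List, Sequence, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) of a foreground pixel


def mask_iou(pred: Set[Pixel], truth: Set[Pixel]) -> float:
    """Intersection over union of two binary masks given as pixel sets."""
    union = pred | truth
    if not union:
        return 1.0  # convention (assumption): two empty masks agree perfectly
    return len(pred & truth) / len(union)


def per_frame_iou(
    propagated: Sequence[Set[Pixel]], manual: Sequence[Set[Pixel]]
) -> List[float]:
    """IoU of each propagated mask against the manual mask of the same frame."""
    if len(propagated) != len(manual):
        raise ValueError("need exactly one manual mask per propagated mask")
    return [mask_iou(p, m) for p, m in zip(propagated, manual)]


def frames_below(ious: Sequence[float], threshold: float) -> List[int]:
    """Indices of frames whose IoU falls below the threshold."""
    return [i for i, value in enumerate(ious) if value < threshold]
```

Listing the low-IoU frame indices, rather than reporting only a mean, makes it easier to see whether propagation errors cluster around occlusions or fast motion, which is the failure-mode analysis the referee asks for.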
minor comments (1)
- [Abstract] The abstract asserts 'significantly improves' and 'higher grasping success' without referencing any quantitative result or table; this should be tied to specific numbers or figures.
Simulated Author's Rebuttal
We thank the referee for the constructive comments highlighting the need for stronger quantitative support. We agree that the current manuscript lacks explicit numerical validation for FS3 and detailed results tables, and we will revise accordingly to address these gaps.
- Referee: [FS3 labeling strategy (methods)] The FS3 label-propagation step (single-frame annotation followed by propagation across the HumanPlay sequence) is load-bearing for all downstream claims yet receives no quantitative validation. No per-frame IoU against held-out manual labels, no comparison to fully-supervised baselines, and no analysis of failure modes on occlusions or motion blur are reported; without this, it is impossible to confirm that the collected samples supply reliable dense supervision for fine-tuning.
  Authors: We agree this validation is necessary. The manuscript describes the FS3 strategy but does not report the requested metrics. In revision we will add a new evaluation subsection with per-frame IoU against held-out manual labels, comparisons against fully-supervised propagation baselines, and explicit analysis of failure modes on occlusions and motion blur to demonstrate that the propagated labels provide reliable dense supervision. Revision: yes
- Referee: [Evaluation and results sections] The central empirical claim, that a small number of failure-driven samples produces significant UOIS and grasping improvements, is stated without any numerical results, baseline tables, error bars, or sample-size details. The abstract and evaluation sections therefore provide no evidence that the observed lifts exceed noise or that they generalize beyond the specific teaching episodes.
  Authors: We acknowledge that the current text states performance gains without accompanying numerical tables, error bars, or sample counts. In the revised manuscript we will expand the evaluation section to include full quantitative tables for UOIS metrics, grasping success rates on SceneReplica and real-robot trials, baseline comparisons, error bars across repeated trials, and exact sample sizes used, thereby providing direct evidence that the improvements exceed noise and are reproducible. Revision: yes
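One common way to attach error bars to a grasping success rate measured over a limited number of trials is a binomial confidence interval. The sketch below uses the Wilson score interval; it is offered as one reasonable option, not as the analysis the authors plan to run. The function names `wilson_interval` and `intervals_overlap` are hypothetical, and the default `z = 1.96` assumes a 95% confidence level.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a success rate (z = 1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, centre - half), min(1.0, centre + half)


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]
```

Computing one interval for the pretrained model and one for the fine-tuned model, each from its own success and trial counts, gives a quick visual check of whether a reported lift is large relative to trial-to-trial noise. Overlapping intervals do not prove the difference is insignificant, so a formal two-proportion test would still be needed for a firm conclusion.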
Circularity Check
No circularity: empirical pipeline with independent benchmark evaluation
The paper describes an empirical interactive teaching system (HumanPlay sequences + FS3 single-frame annotation and propagation, followed by fine-tuning of MSMFormer and evaluation on SceneReplica plus real-robot tasks). No equations, fitted parameters, or derivations are present that reduce any reported performance gain to a quantity defined by the method's own inputs. The evaluation uses held-out real-world scenes and an external benchmark, so the central claims do not reduce to self-definition or self-citation chains.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Label propagation from a single annotated frame produces sufficiently accurate dense supervision for fine-tuning.
- Domain assumption: Short human-object interactions expose object configurations that improve generalization on unseen scenes.