A Multimodal Data Collection Framework for Dialogue-Driven Assistive Robotics to Clarify Ambiguities: A Wizard-of-Oz Pilot Study

Billy Madden; Flavio Esposito; Guangping Liu; Madi Babaiasl; Nicholas Hawkins; Tipu Sultan

arxiv: 2601.16870 · v2 · submitted 2026-01-23 · 💻 cs.RO

Guangping Liu, Nicholas Hawkins, Billy Madden, Tipu Sultan, Flavio Esposito, Madi Babaiasl

Pith reviewed 2026-05-16 11:53 UTC · model grok-4.3

classification 💻 cs.RO

keywords multimodal data collection · assistive robotics · Wizard-of-Oz · dialogue-driven interaction · ambiguity clarification · human-robot interaction · wheelchair robotic arms

The pith

A Wizard-of-Oz framework collects multimodal data on user clarifications during dialogue-driven assistive robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a multimodal data collection framework that uses a two-room Wizard-of-Oz setup to simulate robot autonomy and elicit natural user behavior in assistive robotics. It records five synchronized modalities (RGB-D video, conversational audio, IMU signals, end-effector pose, and joint states) across five tasks. A pilot study with five participants and 53 trials checks the framework through motion-smoothness analysis and user feedback, and the authors report that it captures diverse ambiguity types. The approach targets the lack of datasets for training AI to handle conversational ambiguities in human-robot interaction. Scaling it up would support the development of more intuitive, dialogue-driven control interfaces for users with motor limitations.

Core claim

The framework effectively captures diverse ambiguity types and supports natural dialogue-driven interaction in a pilot dataset, demonstrating its suitability for scaling to a larger dataset for learning, benchmarking, and evaluation of ambiguity-aware assistive control.

What carries the argument

The two-room Wizard-of-Oz setup combined with synchronized recording of five modalities (RGB-D video, audio, IMU, pose, joint states) and a dialogue-based protocol to simulate autonomous behavior.
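To illustrate what synchronizing streams with different sample rates can involve, here is a minimal Python sketch that aligns each stream to a reference clock by nearest timestamp. The page does not describe the paper's actual synchronization method, so the function names, the `max_gap` tolerance, and all timestamps below are hypothetical.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Return the index of the sample in sorted `timestamps` closest to time t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer; ties go to the earlier sample.
    return i - 1 if t - timestamps[i - 1] <= timestamps[i] - t else i


def align_streams(reference, others, max_gap):
    """Align every stream to the reference stream's timestamps.

    reference: sorted list of timestamps in seconds (e.g. video frames)
    others:    dict mapping stream name -> sorted list of timestamps
    max_gap:   largest tolerated offset in seconds; larger gaps yield None
    Returns one dict per reference timestamp: stream name -> index or None.
    """
    aligned = []
    for t in reference:
        row = {}
        for name, stamps in others.items():
            if not stamps:
                row[name] = None
                continue
            j = nearest_index(stamps, t)
            row[name] = j if abs(stamps[j] - t) <= max_gap else None
        aligned.append(row)
    return aligned


# Hypothetical timestamps, not taken from the paper.
video = [0.00, 0.10, 0.20]
streams = {
    "imu": [0.00, 0.02, 0.04, 0.06, 0.08, 0.10, 0.12],
    "joint_states": [0.01, 0.11, 0.45],
}
rows = align_streams(video, streams, max_gap=0.05)
```

Returning `None` for gaps larger than `max_gap` keeps dropped samples visible instead of silently pairing a frame with stale data.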

If this is right

Supports training of machine learning models for detecting and clarifying ambiguities in assistive robot commands.
Enables benchmarking and evaluation of different ambiguity resolution strategies.
Facilitates development of flexible interfaces that increase user independence beyond rigid control methods.
Provides a template for collecting similar data in other dialogue-driven robotics applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If scaled, the dataset could reveal common patterns in how ambiguities arise across different assistive tasks.
Data from this method might generalize to real robot deployments if user behavior remains consistent.
Combining this with existing HRI datasets could accelerate progress in ambiguity-aware control systems.

Load-bearing premise

The Wizard-of-Oz simulation elicits user behavior sufficiently similar to interaction with a real autonomous robot.

What would settle it

Direct comparison of user dialogues and task performance between the WoZ setup and a fully autonomous robot system on the same tasks.

Figures

Figures reproduced from arXiv: 2601.16870 by Billy Madden, Flavio Esposito, Guangping Liu, Madi Babaiasl, Nicholas Hawkins, Tipu Sultan.

**Figure 1.** Overview of the Multimodal Assistive Data Collection Framework: Our assistive data collection framework comprises a Virtual Reality (VR) based teleoperation system for a wheelchair and robotic arm, and a real-time multimodal data recording pipeline. The experimental setup spans two physically separated spaces (Room A and Room B). Room A is arranged with five assistive tasks, including door opening, drawer …

**Figure 3.** Teleoperation and data collection framework.

**Figure 4.** Quantitative Analysis: (a-e) Task Performance Analysis: (a) is the task distribution in the pilot dataset, (b) is total completion time (mean …

**Figure 5.** Task Performance Qualitative Demonstration.

Original abstract

Integrated control of wheelchairs and wheelchair-mounted robotic arms (WMRAs) has strong potential to increase independence for users with severe motor limitations, yet existing interfaces often lack the flexibility needed for intuitive assistive interaction. Although data-driven AI methods show promise, progress is limited by the lack of multimodal datasets that capture natural Human-Robot Interaction (HRI), particularly conversational ambiguity in dialogue-driven control. To address this gap, we propose a multimodal data collection framework that employs a dialogue-based interaction protocol and a two-room Wizard-of-Oz (WoZ) setup to simulate robot autonomy while eliciting natural user behavior. The framework records five synchronized modalities: RGB-D video, conversational audio, inertial measurement unit (IMU) signals, end-effector Cartesian pose, and whole-body joint states across five assistive tasks. Using this framework, we collected a pilot dataset of 53 trials from five participants and validated its quality through motion smoothness analysis and user feedback. The results show that the framework effectively captures diverse ambiguity types and supports natural dialogue-driven interaction, demonstrating its suitability for scaling to a larger dataset for learning, benchmarking, and evaluation of ambiguity-aware assistive control.
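The abstract does not say which smoothness metric the authors used. As one common possibility, the sketch below computes root-mean-square jerk (the third time derivative of position) from a uniformly sampled one-dimensional trajectory using finite differences. The function name, the sampling interval `dt`, and the sample values are assumptions for illustration, not the paper's method or data.

```python
import math


def rms_jerk(positions, dt):
    """Root-mean-square jerk of a uniformly sampled 1-D trajectory.

    positions: list of positions sampled every `dt` seconds
    dt:        sampling interval in seconds (assumed constant)
    Jerk is estimated with a third-order forward finite difference.
    """
    if len(positions) < 4:
        raise ValueError("need at least four samples to estimate jerk")
    jerks = [
        (positions[i + 3] - 3 * positions[i + 2]
         + 3 * positions[i + 1] - positions[i]) / dt ** 3
        for i in range(len(positions) - 3)
    ]
    return math.sqrt(sum(j * j for j in jerks) / len(jerks))


# Hypothetical trajectories sampled at dt = 1 s.
constant_acceleration = [t ** 2 for t in range(5)]  # jerk is zero
cubic = [t ** 3 for t in range(5)]                  # jerk is constant (6)

smooth_score = rms_jerk(constant_acceleration, dt=1.0)
cubic_score = rms_jerk(cubic, dt=1.0)
```

Lower values indicate smoother motion. On real recordings the position signal would usually be low-pass filtered first, since finite differences amplify sensor noise.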

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A clear methods paper laying out a two-room WoZ protocol with five synced modalities for ambiguity data in wheelchair-arm tasks, but the 53-trial pilot offers only basic motion and feedback checks with no real-robot comparison.

The paper's main point is a practical data-collection protocol: a two-room Wizard-of-Oz setup that records RGB-D video, audio, IMU, end-effector pose, and joint states while users direct a wheelchair-mounted arm through dialogue tasks that involve ambiguity. They ran 53 trials with five participants and checked the data with motion-smoothness numbers plus simple user ratings. That protocol and the synchronization details are the concrete new piece; prior HRI datasets have not combined exactly these streams for this narrow assistive scenario. The description is straightforward enough that someone could copy the room layout and sensor list. The pilot shows the setup runs without obvious technical failures and that participants found the interaction reasonably natural.

The soft spots are straightforward. The sample is small, there is no breakdown of which ambiguity types actually appeared or how often, and there are no inter-rater checks or downstream tests on whether the traces could train a clarification policy. Most importantly, the work never compares WoZ behavior to a real autonomous robot, so it is still an open question whether the collected patterns will transfer. The claim that the framework is ready for scaling therefore rests on limited evidence.

This is for HRI and assistive-robotics groups that need a starting template for multimodal dialogue data rather than for readers looking for trained models or large-scale results. The protocol itself is reproducible and fills a narrow but real gap, so it is worth sending to peer review; reviewers will probably request more quantitative ambiguity analysis and at least one real-robot condition in a revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes a multimodal data collection framework for dialogue-driven control of wheelchairs and wheelchair-mounted robotic arms. It uses a two-room Wizard-of-Oz setup and a dialogue protocol to elicit natural user behavior while recording five synchronized modalities (RGB-D video, audio, IMU, end-effector pose, joint states) across five assistive tasks. A pilot dataset of 53 trials from five participants is collected and validated via motion-smoothness metrics plus user feedback; the authors conclude that the framework captures diverse ambiguity types and is suitable for scaling to train ambiguity-aware clarification policies.

Significance. If the collected traces prove representative of real autonomous interactions, the framework would address a clear gap in multimodal HRI datasets for assistive robotics, enabling data-driven methods for ambiguity resolution that could improve independence for users with severe motor impairments. The pilot demonstrates a feasible protocol and the value of synchronized multimodal recording.

major comments (2)

[Results] Results section (pilot validation): the claim that the framework 'effectively captures diverse ambiguity types' rests only on motion-smoothness analysis and subjective user feedback; no quantitative breakdown of ambiguity categories, their frequency across the 53 trials, or inter-rater reliability is reported, leaving the central suitability claim unsupported by direct evidence.
[Methods] Methods (WoZ setup description): the assumption that the two-room Wizard-of-Oz simulation elicits ambiguity patterns and user behavior sufficiently similar to interaction with a real autonomous robot is stated but not tested; no comparison condition or quantitative similarity metric is provided, which is load-bearing for the downstream claim that the data can train clarification policies.

minor comments (2)

[Abstract] Abstract and Section 3: the five assistive tasks are referenced but never enumerated; a brief list or table would improve reproducibility.
[Framework Description] Notation: the description of the five modalities would benefit from an explicit table listing sensor rates, synchronization method, and storage format.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our pilot study. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Results] Results section (pilot validation): the claim that the framework 'effectively captures diverse ambiguity types' rests only on motion-smoothness analysis and subjective user feedback; no quantitative breakdown of ambiguity categories, their frequency across the 53 trials, or inter-rater reliability is reported, leaving the central suitability claim unsupported by direct evidence.

Authors: We agree that the current validation relies on indirect metrics and that a direct quantitative analysis of ambiguity types would provide stronger support for the central claim. In the revised manuscript we will add a section reporting the observed ambiguity categories (e.g., referential, temporal, spatial), their frequencies across the 53 trials, and the inter-rater reliability of the categorization performed by two independent annotators. This addition will directly address the referee’s concern while remaining within the scope of the pilot dataset.
revision: yes
Referee: [Methods] Methods (WoZ setup description): the assumption that the two-room Wizard-of-Oz simulation elicits ambiguity patterns and user behavior sufficiently similar to interaction with a real autonomous robot is stated but not tested; no comparison condition or quantitative similarity metric is provided, which is load-bearing for the downstream claim that the data can train clarification policies.

Authors: We acknowledge that the behavioral similarity between the WoZ setup and a fully autonomous robot is an untested assumption. As this is explicitly a pilot study whose primary goal is to demonstrate a feasible multimodal collection protocol, a controlled comparison with an autonomous system lies outside the present scope. We will revise the Methods and Discussion sections to explicitly state this limitation, reference prior WoZ validation literature in HRI, and outline planned future experiments that will include such a comparison once the larger dataset is collected.
revision: partial
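The first response above promises category frequencies and inter-rater reliability from two annotators. A minimal Python sketch of how such numbers could be computed follows, using Cohen's kappa as the agreement statistic. The choice of kappa is an assumption (the rebuttal names no statistic), and the label sequences are invented for illustration; only the category names come from the rebuttal's own examples.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical annotations of six trials, not data from the paper.
annotator_a = ["referential", "referential", "spatial",
               "temporal", "spatial", "referential"]
annotator_b = ["referential", "spatial", "spatial",
               "temporal", "spatial", "referential"]

frequencies = Counter(annotator_a)  # per-category counts for one annotator
kappa = cohen_kappa(annotator_a, annotator_b)
```

For these invented labels the annotators agree on five of six trials and kappa works out to 17/23, roughly 0.74. Kappa corrects raw agreement for the agreement expected by chance, which matters when one category dominates.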

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper presents a descriptive multimodal data collection framework and reports results from a Wizard-of-Oz pilot study with 53 trials. It contains no equations, fitted parameters, predictions, or mathematical derivations. Validation relies on direct motion smoothness analysis and participant feedback rather than any self-referential reduction or self-citation chain. The central claims rest on empirical observation of the collected traces, with no load-bearing step that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that a Wizard-of-Oz operator can faithfully elicit natural ambiguity-handling behavior; no free parameters or invented entities are introduced.

axioms (1)

Domain assumption: Wizard-of-Oz simulation produces user behavior representative of interaction with a future autonomous robot.
Invoked to justify collecting data that will train real AI controllers.
