arXiv preprint arXiv:2509.15061 , year=

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue , author= · 2025 · cs.RO · arXiv 2509.15061

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

open full Pith review browse 4 citing papers arXiv PDF

abstract

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components, one VLM for collaboration and one diffusion for action. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.

representative citing papers

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring

cs.RO · 2026-05-29 · unverdicted · novelty 6.0

Hide-and-Seek uses contrastive objectives on trajectories to localize failure signals in VLA models from trajectory-level supervision alone.

ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries

cs.AI · 2026-05-07 · unverdicted · novelty 6.0 · 3 refs

ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.

Seeing Isn't Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents

cs.CL · 2026-04-19 · unverdicted · novelty 5.0

The Estimate-Verify-Update (EVU) mechanism reduces belief inertia in embodied agents and raises task success rates on three benchmarks.

A Multimodal Data Collection Framework for Dialogue-Driven Assistive Robotics to Clarify Ambiguities: A Wizard-of-Oz Pilot Study

cs.RO · 2026-01-23 · unverdicted · novelty 5.0

A two-room Wizard-of-Oz pilot collected 53 multimodal trials from five users to capture dialogue ambiguities for training ambiguity-aware assistive robot controllers.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Hide-and-Seek in Trajectories: Discovering Failure Signals for VLA Runtime Monitoring cs.RO · 2026-05-29 · unverdicted · none · ref 95 · internal anchor
Hide-and-Seek uses contrastive objectives on trajectories to localize failure signals in VLA models from trajectory-level supervision alone.
ProCompNav: Proactive Instance Navigation with Comparative Judgment for Ambiguous User Queries cs.AI · 2026-05-07 · unverdicted · none · ref 11 · 3 links · internal anchor
ProCompNav builds a candidate pool from ambiguous queries then uses pool-splitting binary questions for disambiguation, improving success rate and shortening responses on CoIN-Bench and TextNav.
Seeing Isn't Believing: Mitigating Belief Inertia via Active Intervention in Embodied Agents cs.CL · 2026-04-19 · unverdicted · none · ref 2 · internal anchor
The Estimate-Verify-Update (EVU) mechanism reduces belief inertia in embodied agents and raises task success rates on three benchmarks.
A Multimodal Data Collection Framework for Dialogue-Driven Assistive Robotics to Clarify Ambiguities: A Wizard-of-Oz Pilot Study cs.RO · 2026-01-23 · unverdicted · none · ref 13 · internal anchor
A two-room Wizard-of-Oz pilot collected 53 multimodal trials from five users to capture dialogue ambiguities for training ambiguity-aware assistive robot controllers.

arXiv preprint arXiv:2509.15061 , year=

fields

years

verdicts

representative citing papers

citing papers explorer