SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

Börje F. Karlsson; Guocai Yao; Jiajun Zhang; Jieru Lin; Juntao Cheng; Shaoxuan Xie; Shuo Ren; Wanyue Zhang; Zheqi He; Zhiwei Yu

arxiv: 2511.17649 · v4 · pith:LCO64ZH4new · submitted 2025-11-20 · 💻 cs.CV · cs.AI · cs.RO

keywords: switch, interfaces, actions, closed-loop, embodied, interactive, modeling, models

Tangible control interfaces (TCIs), such as appliance panels, remotes, elevators, and embedded GUIs, are a fundamental component of everyday human-built environments. Interacting with these interfaces requires agents not only to ground language in visual observations, but also to execute actions, track temporally evolving state changes, and verify whether intended outcomes have been achieved. However, existing benchmarks predominantly evaluate open-loop perception or single-step action execution, failing to capture this continuous cycle of interaction, feedback, and correction. We introduce SWITCH, a benchmark for closed-loop interactive reasoning with TCIs in realistic egocentric environments. SWITCH comprises 1,170 temporally interactive videos across diverse functional categories, providing structured annotations of instructions, actions, state transitions, outcomes, and recovery behaviors over time. To probe generative world modeling, SWITCH also evaluates video generation models on interaction-centered tasks using both LLM-as-judge and human evaluation. Experiments with frontier proprietary and open-source multimodal models reveal persistent weaknesses in fine-grained visual-temporal perception, outcome verification, and error recovery, highlighting SWITCH as a testbed for closed-loop embodied intelligence.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Bridging VideoQA and Video-Guided Agentic Tasks via Generalized Keyframe Extraction
cs.CV 2026-06 unverdicted novelty 7.0

Introduces the VG-GUIBench benchmark and the TASKER keyframe extraction algorithm, which improves performance on VideoQA and video-guided agentic tasks.