pith. machine review for the scientific record.

arxiv: 2604.17969 · v2 · submitted 2026-04-20 · 💻 cs.CV

Recognition: unknown

E3VS-Bench: A Benchmark for Viewpoint-Dependent Active Perception in 3D Gaussian Splatting Scenes

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 05:17 UTC · model grok-4.3

classification 💻 cs.CV
keywords E3VS-Bench · embodied visual search · active perception · 5-DoF viewpoint control · 3D Gaussian Splatting · vision-language models · benchmark

The pith

Vision-language models show a large gap from humans on active 5-DoF viewpoint planning for 3D visual search.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces E3VS-Bench to test embodied agents that must actively select viewpoints under full 5-DoF control to gather evidence for answering questions in 3D scenes. It constructs 99 photorealistic scenes via 3D Gaussian Splatting and 2,014 episodes whose answers depend on visibility changes, internal object views, or angle-specific attributes that single or limited views cannot resolve. State-of-the-art VLMs are evaluated on these episodes and compared directly to human performance. Despite strong single-image reasoning, the models exhibit a clear shortfall in coherent multi-step viewpoint planning. The benchmark therefore isolates the missing capability of active perception in unrestricted 3D spaces.
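The episode structure described above — render a view, choose a 5-DoF action, repeat until the agent commits to an answer — can be sketched as follows. This is an editorial illustration only: `Pose`, `Scene`, the agent interface, and every action name except `move_forward` (which appears in the paper's prompts, Figure 15) are assumptions, not the authors' code.

```python
from dataclasses import dataclass

@dataclass
class Pose:
    """Hypothetical 5-DoF camera state: 3D position plus yaw and pitch."""
    x: float
    y: float
    z: float
    yaw: float    # horizontal rotation
    pitch: float  # vertical rotation

# Illustrative discrete action set; only move_forward and the collision
# feedback are attested in the paper's prompts.
ACTIONS = ["move_forward", "move_back", "move_up", "move_down",
           "turn_left", "turn_right", "look_up", "look_down", "answer"]

def run_episode(agent, scene, question, start, max_steps=20):
    """Generic active-search loop: render, act, stop when the agent answers."""
    pose, history = start, []
    for _ in range(max_steps):
        image = scene.render(pose)  # free-viewpoint 3DGS render (assumed API)
        action = agent.act(question, image, history)
        if action == "answer":
            return agent.answer(question, image, history)
        pose, collided = scene.step(pose, action)  # stays put on collision
        history.append((action, collided))
    return agent.answer(question, scene.render(pose), history)
```

The point of the sketch is that answering is itself an action: the agent must trade off further viewpoint changes against committing, which is exactly the planning capability the benchmark isolates.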

Core claim

E3VS-Bench shows that all evaluated VLMs exhibit a substantial performance gap from humans on tasks requiring active 5-DoF viewpoint control and coherent planning, even though the scenes are rendered with 3D Gaussian Splatting to preserve fine-grained visual details that demand multi-view inspection.

What carries the argument

E3VS-Bench: a collection of 99 3D Gaussian Splatting scenes and 2,014 question-driven episodes that force agents to execute sequences of 5-DoF viewpoint changes to resolve viewpoint-dependent questions.

If this is right

  • Embodied visual search requires agents to plan and execute viewpoint sequences rather than rely on passive or single-frame observations.
  • 3D Gaussian Splatting reconstructions enable questions based on small text, internal contents, and angle-specific attributes that mesh-based simulators often lose.
  • Strong 2D reasoning in current models does not automatically produce competent 3D active perception under full viewpoint freedom.
  • Human performance on these episodes provides a concrete target for measuring progress in coherent viewpoint planning.
  • The benchmark isolates the specific failure mode of active perception that must be addressed for real-world 3D tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding explicit 3D world models or memory of past viewpoints could help agents generate better planning sequences.
  • The same 5-DoF requirement may appear in robotic manipulation tasks where camera placement directly affects grasp or inspection success.
  • Training regimes that reward multi-view consistency or reconstruction accuracy might narrow the observed human-model gap.
  • Extending the benchmark to dynamic scenes or real-robot captures would test whether the identified limitations persist outside simulation.

Load-bearing premise

The questions in the benchmark genuinely cannot be answered from a single view or constrained motion and instead require active 5-DoF inspection, while 3D Gaussian Splatting scenes retain the fine details needed for those questions.

What would settle it

A model achieving human-level accuracy on E3VS-Bench while restricted to single-view or 2D inputs, or human subjects dropping to model-level performance when limited to the same restricted inputs.
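The logic of that settling experiment can be made explicit. A rough sketch of the decision rule, with purely illustrative accuracies and tolerance (the paper defines no such rule; this is an editorial extension):

```python
def gap_attribution(model_active, model_single_view,
                    human_active, human_single_view, tol=0.05):
    """Crude check of whether the human-model gap is specific to active
    5-DoF perception. Accuracies in [0, 1]; tol is an illustrative margin.

    If the model matches humans once both are restricted to single views,
    the remaining gap under active control points at viewpoint planning;
    if the model already trails humans on single views, generic reasoning
    limits are implicated instead.
    """
    single_view_gap = human_single_view - model_single_view
    active_gap = human_active - model_active
    if single_view_gap <= tol and active_gap > tol:
        return "gap specific to active viewpoint planning"
    if single_view_gap > tol:
        return "gap partly explained by single-view reasoning limits"
    return "no substantial gap"
```

Either branch would settle the question: the first confirms the paper's attribution, the second undercuts it.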

Figures

Figures reproduced from arXiv: 2604.17969 by Daichi Azuma, Koya Sakamoto, Motoaki Kawanabe, Naoya Chiba, Shuhei Kurita, Shu Morikuni, Taiki Miyanishi, Yusuke Iwasawa, Yutaka Matsuo.

Figure 1. Overview of the proposed Embodied 3D Visual Search (E3VS) task. Unlike 2D visual search, E3VS requires an agent to actively control its 5-DoF viewpoint to resolve occlusions and acquire fine-grained visual evidence, such as the production-area label on an egg carton.
Figure 2. Dataset construction pipeline for E3VS-Bench. The pipeline consists of five stages: (1) 3D scene curation from SceneSplat++ [20], (2) QA generation using a VLM, (3) invalid-QA filtering with human verification, (4) viewpoint labeling to identify answerable viewpoints, and (5) answerability filtering to remove questions solvable without viewpoint transitions.
Figure 3. Comparison of rendering quality between traditional mesh-based reconstruction (ScanNet++) and 3D Gaussian Splatting (SceneSplat++). 3DGS preserves sharp textures for small text (e.g., the "WHEY" label), which is crucial for viewpoint-dependent visual reasoning.
Figure 4. Dataset distribution: (a) question-category distribution of unique QA pairs classified by GPT 5.1; (b) scene-type distribution; (c) action distribution; (d) number of words per question; (e) number of words per answer.
Figure 5. Examples of the E3VS task defined in the dataset. Each example illustrates a distinct reasoning type that requires viewpoint control in reconstructed 3D environments.
Figure 6. Effect of the number of input frames on E3VS performance. While more frames do not significantly impact the VLM judge score, they consistently lead to more efficient navigation (fewer steps) and safer trajectories (lower collision rates).
Figure 7. Qualitative results. The orange bars represent the predicted answers after visual search by Gemini 3.0 Flash, while the blue bars indicate human performance.
Figure 9. The filtered images are then provided to the VLM to generate QA candidates, together with the prompt used for QA generation.
Figure 8. Prompt for target-object visibility detection.
Figure 9. Qualitative examples of viewpoint filtering by the VLM. Accepted viewpoints are highlighted with green bounding boxes, while filtered viewpoints are highlighted with red bounding boxes.
Figure 10. Examples of question-answer pairs generated by Gemini 2.5 Flash.
Figure 11. Prompt for generating subset-sensitive QAs.
Figure 12. Qualitative results of the penetrated-viewpoint filtering. Viewpoints are marked red if they are filtered out due to camera penetration through scene geometry (e.g., walls or objects); otherwise they are marked green to indicate valid initial viewpoints for E3VS.
Figure 13. Prompt for counting-question modification.
Figure 14. Prompt for filtering invalid counting QAs.
Figure 15. System prompt for Embodied 3D Visual Search, with an example user prompt giving the question, current step, agent position, last action, and collision feedback.
Figure 16. User prompt for Embodied 3D Visual Search.
Figure 17. System prompt for VLM-as-a-Judge.
Original abstract

Visual search in 3D environments requires embodied agents to actively explore their surroundings and acquire task-relevant evidence. However, existing visual search and embodied AI benchmarks, including EQA, typically rely on static observations or constrained egocentric motion, and thus do not explicitly evaluate fine-grained viewpoint-dependent phenomena that arise under unrestricted 5-DoF viewpoint control in real-world 3D environments, such as visibility changes caused by vertical viewpoint shifts, revealing contents inside containers, and disambiguating object attributes that are only observable from specific angles. To address this limitation, we introduce E3VS-Bench, a benchmark for embodied 3D visual search where agents must control their viewpoints in 5-DoF to gather viewpoint-dependent evidence for question answering. E3VS-Bench consists of 99 high-fidelity 3D scenes reconstructed using 3D Gaussian Splatting and 2,014 question-driven episodes. 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators, thereby allowing the construction of questions that cannot be answered from a single view and instead require active inspection across viewpoints in 5-DoF. We evaluate multiple state-of-the-art VLMs and compare their performance with humans. Despite strong 2D reasoning ability, all models exhibit a substantial gap from humans, highlighting limitations in active perception and coherent viewpoint planning specifically under full 5-DoF viewpoint changes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces E3VS-Bench, a benchmark for embodied 3D visual search consisting of 99 high-fidelity scenes reconstructed via 3D Gaussian Splatting and 2,014 question-driven episodes. Agents must actively control 5-DoF viewpoints to gather evidence for questions that depend on fine-grained, viewpoint-specific details (e.g., visibility changes from vertical shifts, contents inside containers, or angle-dependent attributes) that cannot be resolved from single views or constrained motion. The work evaluates multiple state-of-the-art VLMs against human performance and reports a substantial gap, attributing it to limitations in active perception and coherent viewpoint planning under full 5-DoF control.

Significance. If the episodes are validated to require unrestricted 5-DoF exploration and 3DGS is shown to preserve necessary fine details better than mesh-based alternatives, the benchmark would fill a clear gap in existing embodied AI and visual search evaluations (which rely on static observations or limited egocentric motion). The human-model comparison provides a concrete baseline for measuring progress in viewpoint-dependent reasoning, and the photorealistic rendering choice is a technical strength that enables more realistic question design than prior simulators.

major comments (2)
  1. [Benchmark construction (§3) and Experiments (§5)] The headline claim that models lag humans specifically due to failures in active 5-DoF viewpoint planning (abstract and §5) depends on every one of the 2,014 episodes being unsolvable from a single view or constrained (e.g., 3-DoF) motion. No single-view oracle accuracy, restricted-DoF human trials, or per-question validation is reported to confirm this; without such checks the observed gap could arise from generic VLM reasoning limits rather than the claimed 5-DoF active-perception deficit.
  2. [Benchmark construction (§3)] The assertion that 3DGS 'preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators' is load-bearing for the claim that the questions genuinely require multi-view 5-DoF inspection, yet no quantitative comparison (e.g., detail-preservation metrics or an ablation against mesh renderings) is provided to support it over alternatives.
minor comments (2)
  1. [Experiments (§5)] The abstract and evaluation sections would benefit from explicit reporting of error bars, exact metrics (e.g., accuracy per question type), episode length statistics, and data exclusion rules to allow assessment of result robustness.
  2. [Benchmark construction (§3)] Clarify the exact process for question generation and human annotation protocol (e.g., how questions were filtered to ensure they require 5-DoF) to improve reproducibility.
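The first minor comment is cheap to satisfy. A minimal sketch of per-category accuracy with percentile-bootstrap confidence intervals, using only the standard library; the input and output formats here are hypothetical, not the paper's:

```python
import random

def accuracy_with_ci(results, n_boot=2000, alpha=0.05, seed=0):
    """results: list of (category, correct: bool) pairs, one per episode.
    Returns {category: (accuracy, ci_low, ci_high)} using a percentile
    bootstrap at the (1 - alpha) level. Deterministic given the seed."""
    rng = random.Random(seed)
    by_cat = {}
    for cat, ok in results:
        by_cat.setdefault(cat, []).append(1.0 if ok else 0.0)
    out = {}
    for cat, xs in by_cat.items():
        acc = sum(xs) / len(xs)
        # Resample episodes within the category and recompute mean accuracy.
        boots = sorted(
            sum(rng.choice(xs) for _ in xs) / len(xs) for _ in range(n_boot)
        )
        lo = boots[int((alpha / 2) * n_boot)]
        hi = boots[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
        out[cat] = (acc, lo, hi)
    return out
```

Resampling within categories keeps the intervals honest about the small per-category episode counts that a 2,014-episode benchmark split across many question types implies.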

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Benchmark construction (§3) and Experiments (§5)] The headline claim that models lag humans specifically due to failures in active 5-DoF viewpoint planning (abstract and §5) depends on every one of the 2,014 episodes being unsolvable from a single view or constrained (e.g., 3-DoF) motion. No single-view oracle accuracy, restricted-DoF human trials, or per-question validation is reported to confirm this; without such checks the observed gap could arise from generic VLM reasoning limits rather than the claimed 5-DoF active-perception deficit.

    Authors: We agree that quantitative validation would make the central claim more robust. The 2,014 episodes were constructed through a human annotation process in which annotators explicitly verified that each question requires viewpoint-specific information unavailable from the initial observation or under constrained motion; this per-question validation is described in §3 but was not quantified with oracle baselines. To directly address the concern, we will add single-view VLM accuracies (oracle performance when models are given only the starting view) in the revised experiments section. This will demonstrate that the performance gap is not solely attributable to generic reasoning limitations. We will also expand the description of the annotation protocol to include more details on how viewpoint dependence was ensured. revision: yes

  2. Referee: [Benchmark construction (§3)] The assertion that 3DGS 'preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators' is load-bearing for the claim that the questions genuinely require multi-view 5-DoF inspection, yet no quantitative comparison (e.g., detail-preservation metrics or an ablation against mesh renderings) is provided to support it over alternatives.

    Authors: The statement draws on well-documented properties of 3D Gaussian Splatting in the novel-view-synthesis literature, where it avoids the meshing and texturing artifacts that degrade high-frequency details. We did not include a direct quantitative ablation because the benchmark scenes are native 3DGS reconstructions without paired mesh versions. In revision we will (i) expand the justification with specific citations to comparative studies on detail preservation and (ii) add qualitative side-by-side renderings of representative scenes (text, fine textures, interior views) to illustrate the difference. A full quantitative metric comparison across all 99 scenes would require additional reconstruction work and is therefore only partially feasible; we will note this limitation explicitly. revision: partial
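The single-view oracle promised in response 1 reduces to freezing the agent at its starting pose. A hedged sketch, with the `model` and `judge` interfaces and the episode fields assumed rather than taken from the paper:

```python
def single_view_oracle(model, judge, episodes):
    """Accuracy when the model answers from only the initial observation,
    with no viewpoint control. If the benchmark's questions truly require
    5-DoF inspection, this score should sit near chance; a high score
    would flag episodes answerable without active perception."""
    scores = []
    for ep in episodes:
        answer = model.answer(ep["question"], ep["start_image"])
        scores.append(judge.is_correct(ep, answer))
    return sum(scores) / len(scores)
```

Reporting this number alongside full active-search accuracy, as the authors propose, would let readers separate viewpoint-planning failures from generic question-answering failures per episode.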

Circularity Check

0 steps flagged

No circularity: benchmark construction is independent of any self-referential derivation

Full rationale

The paper introduces E3VS-Bench as a new evaluation resource consisting of 99 3DGS scenes and 2,014 episodes, motivated by the claim that 3DGS enables questions unsolvable from single views. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The central empirical result (VLMs lag humans under 5-DoF) is obtained by running external models on the benchmark and comparing to human performance; it does not reduce to any quantity defined by the authors' prior work or by construction within the paper itself. Self-citations, if present, are not load-bearing for the benchmark's validity or the reported gap.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that 3D Gaussian Splatting supplies superior photorealistic rendering for fine details and that the generated questions truly demand active 5-DoF exploration.

axioms (1)
  • domain assumption 3D Gaussian Splatting enables photorealistic free-viewpoint rendering that preserves fine-grained visual details (e.g., small text and subtle attributes) often degraded in mesh-based simulators
    Invoked to justify why the benchmark can construct questions that require active multi-view inspection.

pith-pipeline@v0.9.0 · 5616 in / 1341 out tokens · 60478 ms · 2026-05-10T05:17:46.989138+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 8 canonical work pages · 3 internal anchors

  1. [1] Aloimonos, Y., Weiss, I., Bandyopadhyay, A.: Active vision. International Journal of Computer Vision 1, 333–356 (1988)
  2. [2] Azuma, D., Miyanishi, T., Kurita, S., Kawanabe, M.: ScanQA: 3D question answering for spatial scene understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 19129–19139 (June 2022)
  3. [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, et al.: Qwen3-VL technical report
  4. [4] Bajcsy, R.: Active perception. Proceedings of the IEEE 76(8), 966–1005 (1988)
  5. [5] Chaplot, D.S., Jiang, H., Gupta, S., Gupta, A.: Semantic curiosity for active visual learning. In: Eur. Conf. Comput. Vis. (2020)
  6. [6] Chen, D.Z., Chang, A.X., Nießner, M.: ScanRefer: 3D object localization in RGB-D scans using natural language. In: Eur. Conf. Comput. Vis. (2020)
  7. [7] Cheng, K., Li, Z., Sun, X., Min, B.C., Bedi, A.S., Bera, A.: EfficientEQA: An efficient approach to open-vocabulary embodied question answering. In: IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) (2025)
  8. [8] Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., Batra, D.: Embodied question answering. In: IEEE Conf. Comput. Vis. Pattern Recog. (2018)
  9. [9] Feng, T., Wang, X., Jiang, Y.G., Zhu, W.: Embodied AI: From LLMs to world models. arXiv preprint arXiv:2509.20021 (2025)
  10. [10] Ginting, M.F., Kim, D.K., Meng, X., Reinke, A.M., Krishna, B.J., Kayhani, N., Peltzer, O., Fan, D., Shaban, A., Kim, S.K., Kochenderfer, M., Agha-mohammadi, A.a., Omidshafiei, S.: Enter the mind palace: Reasoning and planning for long-term active embodied question answering. In: Proc. Conference on Robot Learning (CoRL) (2025)
  11. [11] Huang, A., Yao, C., Han, C., Wan, F., Guo, H., Lv, H., Zhou, H., Wang, J., Zhou, J., Sun, J., Hu, J., Lin, K., Zhao, L., Huang, M., Yuan, S., Qu, W., Wang, X., Lai, Y., Zhao, Y., Zhang, Y., Shi, Y., Chen, Y., Weng, Z., Meng, Z., Li, A., Kong, A., Dong, B., Wan, C., Wang, D., Qi, D., Li, D., Yu, E., Li, G., Yin, H., Zhou, H., Zhang, H., Yan, H., Zhou, H., et al.: arXiv preprint arXiv:2601.09668
  12. [12] Jiang, K., Liu, Y., Chen, W., Luo, J., Chen, Z., Pan, L., Li, G., Lin, L.: Beyond the destination: A novel benchmark for exploration-aware embodied question answering. In: Int. Conf. Comput. Vis. pp. 9091–9101 (October 2025)
  13. [13] Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics (TOG) 42(4), 1–14 (July 2023)
  14. [14] Kerr, J., Hari, K., Weber, E., Kim, C.M., Yi, B., Bonnen, T., Goldberg, K., Kanazawa, A.: Eye, robot: Learning to look to act with a BC-RL perception-action loop. In: Proc. Conference on Robot Learning (CoRL) (2025), http://arxiv.org/abs/2506.10968
  15. [15] Lai, X., Li, J., Li, W., Liu, T., Li, T., Zhao, H.: Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. In: Int. Conf. Learn. Represent. (2026)
  16. [16] Lee, J., Miyanishi, T., Kurita, S., Sakamoto, K., Azuma, D., Matsuo, Y., Inoue, N.: CityNav: Language-goal aerial navigation dataset with geographic information. In: Int. Conf. Comput. Vis. pp. 5912–5922 (October 2025)
  17. [17] Li, G., Xu, J., Zhao, Y., Peng, Y.: DyFo: A training-free dynamic focus visual search for enhancing LMMs in fine-grained visual understanding. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 9098–9108 (June 2025)
  18. [18] Li, K., Yao, L., Wu, J., Yu, T., Chen, J., Bai, H., Hou, L., Hong, L., Zhang, W., Zhang, N.L.: Insight-o3: Empowering multimodal foundation models with generalized visual search. In: Int. Conf. Learn. Represent. (2026)
  19. [19] Liu, S., Zhang, H., Qi, Y., Wang, P., Zhang, Y., Wu, Q.: AerialVLN: Vision-and-language navigation for UAVs. In: Int. Conf. Comput. Vis. pp. 15384–15394 (October 2023)
  20. [20] Ma, M., Ma, Q., Li, Y., Cheng, J., Yang, R., Ren, B., Popovic, N., Wei, M., Sebe, N., Van Gool, L., et al.: SceneSplat++: A large dataset and comprehensive benchmark for language Gaussian splatting. In: Proc. Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)
  21. [21] Ma, X., Yong, S., Zheng, Z., Li, Q., Liang, Y., Zhu, S.C., Huang, S.: SQA3D: Situated question answering in 3D scenes. In: Int. Conf. Learn. Represent. (2023)
  22. [22] Majumdar, A., Ajay, A., Zhang, X., Putta, P., Yenamandra, S., Henaff, M., Silwal, S., Mcvay, P., Maksymets, O., Arnaud, S., Yadav, K., Li, Q., Newman, B., Sharma, M., Berges, V., Zhang, S., Agrawal, P., Bisk, Y., Batra, D., Kalakrishnan, M., Meier, F., Paxton, C., Sax, S., Rajeswaran, A.: OpenEQA: Embodied question answering in the era of foundation models. In: IEEE Conf. Comput. Vis. Pattern Recog.
  23. [23] Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.Y., Shen, C., van den Hengel, A.: REVERIE: Remote embodied visual referring expression in real indoor environments. In: IEEE Conf. Comput. Vis. Pattern Recog. (2020)
  24. [24] Qi, Y., Wu, Q., Anderson, P., Wang, X., Wang, W.Y., Shen, C., van den Hengel, A.: REVERIE: Remote embodied visual referring expression in real indoor environments. In: IEEE Conf. Comput. Vis. Pattern Recog. (June 2020)
  25. [25] Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., Batra, D.: Habitat: A platform for embodied AI research. In: Int. Conf. Comput. Vis. (October 2019)
  26. [26] Saxena, S., Buchanan, B., Paxton, C., Chen, B., Vaskevicius, N., Palmieri, L., Francis, J., Kroemer, O.: GraphEQA: Using 3D semantic scene graphs for real-time embodied question answering. In: Proc. Conference on Robot Learning (CoRL) (2025)
  27. [27] Scarpellini, G., Rosa, S., Morerio, P., Natale, L., Bue, A.D.: Look around and learn: Self-improving object detection by exploration. In: Eur. Conf. Comput. Vis. (2024)
  28. [28] Shridhar, M., Thomason, J., Gordon, D., Bisk, Y., Han, W., Mottaghi, R., Zettlemoyer, L., Fox, D.: ALFRED: A benchmark for interpreting grounded instructions for everyday tasks. In: IEEE Conf. Comput. Vis. Pattern Recog. (June 2020)
  29. [29] Su, A., Wang, H., Ren, W., Lin, F., Chen, W.: Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. In: Proc. Annual Conference on Neural Information Processing Systems (NeurIPS) (2025)
  30. [30] Torralba, A., Oliva, A., Castelhano, M.S., Henderson, J.M.: Contextual guidance of eye movements and attention in real-world scenes. Psychological Review (2006)
  31. [31] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
  32. [32] Wijmans, E., Datta, S., Maksymets, O., Das, A., Gkioxari, G., Lee, S., Essa, I., Parikh, D., Batra, D.: Embodied question answering in photorealistic environments with point cloud perception. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 6659–6668 (2019)
  33. [33] Wolfe, J.M.: Visual search: How do we find what we are looking for? Annual Review of Vision Science 6, 539–562 (Sep 2020). https://doi.org/10.1146/annurev-vision-091718-015048
  34. [34] Wolfe, J.M., Vo, M.L.H., Evans, K.K., Greene, M.R.: Visual search in scenes involves selective and nonselective pathways. Trends in Cognitive Sciences (2011)
  35. [35] Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal LLMs. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 13084–13094 (June 2024)
  36. [36] Xiong, H., Xu, X., Wu, J., Hou, Y., Bohg, J., Song, S.: Vision in action: Learning active perception from human demonstrations. In: Proc. Conference on Robot Learning (CoRL) (2025)
  37. [37] Yang, J., Ren, Z., Xu, M., Chen, X., Crandall, D., Parikh, D., Batra, D.: Embodied amodal recognition: Learning to move to perceive objects. In: Int. Conf. Comput. Vis. pp. 2040–2050 (2019)
  38. [38] Yeshwanth, C., Liu, Y.C., Nießner, M., Dai, A.: ScanNet++: A high-fidelity dataset of 3D indoor scenes. In: Int. Conf. Comput. Vis. pp. 12–22 (2023)
  39. [39] Yin, T., Mei, Z., Sun, T., Zha, L., Zhou, E., Bao, J., Yamane, M., Shorinwa, O., Majumdar, A.: WoMAP: World models for embodied open-vocabulary object localization. In: Proc. Conference on Robot Learning (CoRL) (2025)
  40. [40] Yu, H., Han, Y., Zhang, X., Yin, B., Chang, B., Han, X., Liu, X., Zhang, J., Pavone, M., Feng, C., Xie, S., Li, Y.: Thinking in 360°: Humanoid visual search in the wild (2025), https://arxiv.org/abs/2511.20351
  41. [41] Zhang, Y.F., Lu, X., Yin, S., Fu, C., Chen, W., Hu, X., Wen, B., Jiang, K., Liu, C., Zhang, T., et al.: Thyme: Think beyond images. arXiv preprint arXiv:2508.11630 (2025)


    It is visually consistent with the END_IMAGE. –If it fails any of these:→Score 1 –If it is a valid alternative:→Score 5 Output Format Reasoning: [Explain the score based on the steps above. Explicitly mention if the object was visible in END_IMAGE and if the Identity matched.] Your mark: [Integer from 1 to 5] Fig.17.System Prompt for VLM-as-a-Judge