pith. sign in

arxiv: 2606.23557 · v1 · pith:ZWL4Q3DWnew · submitted 2026-06-22 · 💻 cs.CV

Dense Reward for Multi-View 3D Reasoning with Global Maps and Local Views

Pith reviewed 2026-06-26 08:36 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-view 3D VQAdense rewardglobal map constructionview trajectory planningpolicy optimizationprocess supervision
0
0 comments X

The pith

DR-MV3D supplies dense rewards from global map consistency and local view trajectories to supervise multi-view 3D visual question answering instead of sparse final-answer signals.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that multimodal LLMs trained only with answer-level supervision produce inconsistent cross-view reasoning and brittle view selection in multi-view 3D VQA tasks. It introduces DR-MV3D, which decomposes the task into allocentric global map construction, question-conditioned view-trajectory planning, and egocentric grounding, then supplies two verifiable intermediate rewards. A global consistency reward aligns predicted maps with pseudo targets from frozen 3D models, while a local trajectory reward guides ordered viewpoint selection; both are optimized with trajectory-level policy optimization. Experiments on MindCube, VSI-Bench, and BLINK (MV) show consistent gains over strong multi-image baselines.

Core claim

DR-MV3D decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. It supplies a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models and a local trajectory reward that supervises ordered viewpoint selection, then optimizes the full pipeline with trajectory-level policy optimization to make intermediate steps learnable without manual annotations.

What carries the argument

DR-MV3D framework that decomposes MV3D-VQA into global map construction, view-trajectory planning, and egocentric grounding, then supplies a global consistency reward from frozen 3D model pseudo targets plus a local trajectory reward, optimized via GRPO.

If this is right

  • The decomposition allows intermediate map construction and view planning steps to receive direct supervision without manual annotations.
  • Global consistency and local trajectory rewards together produce more coherent cross-view reasoning than answer-level signals alone.
  • Trajectory-level policy optimization enables end-to-end training of the full map-to-answer pipeline.
  • The same dense-reward structure yields measurable gains across three distinct MV3D-VQA benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reliance on frozen foundation models for pseudo targets suggests the method could be applied to other spatial reasoning domains where reliable 3D geometry can be precomputed.
  • If the pseudo-target generators contain scene-type biases, performance on out-of-distribution environments may degrade even when in-distribution benchmarks improve.
  • Replacing the frozen models with jointly trained 3D components might remove one source of potential error propagation while preserving the dense-reward structure.

Load-bearing premise

Geometry-consistent pseudo targets generated by frozen 3D vision foundation models supply reliable supervision for the global consistency reward without introducing systematic errors that propagate to final answer accuracy.

What would settle it

An ablation that removes the dense rewards and trains only with answer-level supervision, resulting in performance on MindCube, VSI-Bench, and BLINK (MV) dropping to the level of the multi-image baselines, would falsify the effectiveness of process-level dense supervision.

Figures

Figures reproduced from arXiv: 2606.23557 by Hyunjung Shim, Jiho Choi, Seojeong Park, Seonho Lee.

Figure 1
Figure 1. Figure 1: Overview of DR-MV3D. Existing MLLMs trained with sparse, answer￾level supervision (left) often build inconsistent cognitive maps and misread egocentric directions, leading to wrong answers despite plausible-looking reasoning. In contrast, DR-MV3D (right) supervises the reasoning process with dense, verifiable rewards: a global reward aligns the predicted allocentric cognitive map with a geometry-consistent… view at source ↗
Figure 2
Figure 2. Figure 2: Cognitive map and metric comparison. Top: multi-view observations, VFM-based scene reconstruction (VGGT+SAM3), and similarity scores to the GT cognitive map. The similarity is reported from three aspects: Dir. measures inter￾object directional relations (relative spatial ordering between objects), Face. measures the facing orientation of the viewpoint, and Overall is a weighted combination of Dir. and Face… view at source ↗
Figure 3
Figure 3. Figure 3: Overall framework with VFM-guided dense rewards. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative comparison (Among). Multi-view observations and cognitive￾map-based reasoning for an Among query (MindCube). The baseline confuses ego￾centric directions and predicts an incorrect object, whereas our method follows the ego-direction cues in the predicted cognitive map step by step and selects the correct answer. Component ablation in SFT. As shown in [PITH_FULL_IMAGE:figures/full_fig_p013_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparison (Rotation). Multi-view observations and cognitive-map-based reasoning for a Rotation query (MindCube). The baseline fails to track the rotated egocentric frame and selects the wrong object, whereas our method maintains consistent ego-direction reasoning under rotation and predicts the correct answer. 5 Conclusion We presented DR-MV3D, a dense-reward learning framework for multi-view … view at source ↗
read the original abstract

Multi-view 3D Visual Question Answering (MV3D-VQA) requires integrating partial observations into a coherent 3D scene representation and selecting informative viewpoints for multi-step spatial reasoning. However, current multimodal LLMs are typically trained with sparse, answer-level supervision, which often yields inconsistent cross-view reasoning and brittle view selection. We present DR-MV3D (Dense Reward for MV3D-VQA), a map-grounded learning framework that provides dense, verifiable rewards to supervise the reasoning process. Our approach decomposes MV3D-VQA into (i) allocentric global map construction, (ii) question-conditioned view-trajectory planning, and (iii) egocentric grounding for answer prediction. To make intermediate steps learnable without manual annotations, we introduce two rewards: a global consistency reward that aligns the predicted map with geometry-consistent pseudo targets from frozen 3D vision foundation models (e.g., VGGT + SAM3), and a local trajectory reward that supervises ordered viewpoint selection. We optimize the full pipeline with trajectory-level policy optimization (GRPO). Experiments on MindCube, VSI-Bench, and BLINK (MV) show that DR-MV3D consistently improves over strong multi-image baselines, supporting the effectiveness of process-level dense supervision for multi-view 3D reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes DR-MV3D, a map-grounded framework for multi-view 3D visual question answering (MV3D-VQA) that decomposes the task into allocentric global map construction, question-conditioned view-trajectory planning, and egocentric grounding. It supplies dense rewards via a global consistency term that aligns intermediate maps to pseudo targets produced by frozen models (VGGT + SAM3) together with a local trajectory reward, and optimizes the pipeline end-to-end with GRPO. Experiments on MindCube, VSI-Bench, and BLINK (MV) are reported to show consistent gains over strong multi-image baselines, which the authors attribute to the use of process-level dense supervision.

Significance. If the pseudo targets can be shown to supply accurate, unbiased geometry signals, the work would supply concrete evidence that dense, verifiable process rewards can improve cross-view consistency and view selection in multimodal LLMs beyond sparse answer-level training. The explicit decomposition into global-map and local-trajectory stages is a useful organizing principle. At present, however, the absence of any quantitative check on pseudo-label fidelity leaves open the possibility that reported gains arise from alignment with the frozen models rather than genuine 3D reasoning gains.

major comments (2)
  1. [Abstract] Abstract: the central claim that 'DR-MV3D consistently improves over strong multi-image baselines' and that this supports 'the effectiveness of process-level dense supervision' rests on unshown experimental controls; no numerical deltas, ablation tables, error bars, or statistical tests are supplied, rendering the headline result impossible to evaluate.
  2. [Abstract] Abstract and implied Experiments section: the global consistency reward is defined directly in terms of pseudo targets from VGGT + SAM3, yet no alignment error, consistency metric, or comparison against available 3D ground truth on MindCube/VSI-Bench subsets is reported. Without such a check, it is impossible to rule out that the optimization simply teaches the model to reproduce biases of the frozen foundation models rather than to acquire independent 3D reasoning.
minor comments (1)
  1. [Abstract] The acronym MV3D-VQA is introduced in the abstract without an explicit expansion on first use, although context makes the meaning recoverable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments, which highlight important aspects of clarity in the abstract and validation of our pseudo-target approach. We will revise the manuscript to strengthen these areas.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'DR-MV3D consistently improves over strong multi-image baselines' and that this supports 'the effectiveness of process-level dense supervision' rests on unshown experimental controls; no numerical deltas, ablation tables, error bars, or statistical tests are supplied, rendering the headline result impossible to evaluate.

    Authors: We agree that the abstract should provide more concrete evidence for the claims. Although the full manuscript contains detailed experimental results including performance tables, ablations, and comparisons in the Experiments section, the abstract itself does not include numerical deltas or references to them. We will revise the abstract to incorporate key quantitative results, such as accuracy improvements on MindCube, VSI-Bench, and BLINK (MV), along with mentions of ablation studies and statistical significance where applicable. This will allow readers to evaluate the headline results directly. revision: yes

  2. Referee: [Abstract] Abstract and implied Experiments section: the global consistency reward is defined directly in terms of pseudo targets from VGGT + SAM3, yet no alignment error, consistency metric, or comparison against available 3D ground truth on MindCube/VSI-Bench subsets is reported. Without such a check, it is impossible to rule out that the optimization simply teaches the model to reproduce biases of the frozen foundation models rather than to acquire independent 3D reasoning.

    Authors: We acknowledge this limitation in the current manuscript. The global consistency reward relies on pseudo targets from VGGT and SAM3 without an explicit fidelity check against ground truth. To address the referee's concern, we will add a quantitative analysis in the revised manuscript, including alignment error metrics and comparisons to 3D ground truth on available subsets of MindCube and VSI-Bench. This will help substantiate that the rewards encourage genuine 3D reasoning. We will also discuss potential biases and how the local trajectory reward and end-to-end optimization mitigate them. revision: yes

Circularity Check

0 steps flagged

No circularity: rewards and gains rest on external frozen models and benchmarks

full rationale

The derivation decomposes MV3D-VQA into map construction, trajectory planning, and grounding, then supplies rewards from geometry-consistent pseudo targets produced by independent frozen models (VGGT + SAM3) and optimizes via GRPO. These components are external to the present work and not defined in terms of the target predictions or fitted to the final answer accuracy. Gains are measured on separate held-out benchmarks (MindCube, VSI-Bench, BLINK) rather than by construction from the same signals. No self-citations, self-definitional equations, or fitted-input renamings appear in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that pseudo-labels from external foundation models are sufficiently accurate to serve as dense supervision targets; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Frozen 3D vision foundation models (VGGT + SAM3) produce geometry-consistent pseudo targets suitable for supervising global map construction without manual annotation.
    Invoked to define the global consistency reward.

pith-pipeline@v0.9.1-grok · 5774 in / 1334 out tokens · 42547 ms · 2026-06-26T08:36:52.306709+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

79 extracted references · 40 canonical work pages · 24 internal anchors

  1. [1]

    In: Proceedings of the IEEE interna- tional conference on computer vision

    Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C.L., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE interna- tional conference on computer vision. pp. 2425–2433 (2015) 6

  2. [2]

    arXiv preprint arXiv:2511.13719 (2025) 10, 11

    Cai, Z., Wang, R., Gu, C., Pu, F., Xu, J., Wang, Y., Yin, W., Yang, Z., Wei, C., Sun, Q., et al.: Scaling spatial intelligence with multimodal foundation models. arXiv preprint arXiv:2511.13719 (2025) 10, 11

  3. [3]

    SAM 3: Segment Anything with Concepts

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., et al.: Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719 (2025) 4, 7, 10, 5, 8

  4. [4]

    arXiv preprint arXiv:2402.00782 (2024) 3, 8

    Chan, A.J., Sun, H., Holt, S., Van Der Schaar, M.: Dense reward for free in reinforcement learning from human feedback. arXiv preprint arXiv:2402.00782 (2024) 3, 8

  5. [5]

    Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., Gu, L., Wang, X., Li, Q., Ren, Y., Chen, Z., Luo, J., Wang, J., Jiang, T., Wang, B., He, C., Shi, B., Zhang, X., Lv, H., Wang, Y., Shao, W., Chu, P., Tu, Z., He, T., Wu, Z., Deng, H., Ge, J., Chen, K., Zhang, K., Wang, L., Dou, M., Lu, L., Zhu, X., Lu, T., Lin, D.,...

  6. [6]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 1, 10, 11, 12

  7. [7]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 10, 11

  8. [8]

    arXiv preprint arXiv:2503.07065 (2025) 6

    Deng, H., Zou, D., Ma, R., Luo, H., Cao, Y., Kang, Y.: Boosting the gen- eralization and reasoning of vision language models with curriculum rein- forcement learning. arXiv preprint arXiv:2503.07065 (2025) 6

  9. [9]

    arXiv e-prints pp

    Deng, Y., Bansal, H., Yin, F., Peng, N., Wang, W., Chang, K.W.: Open- vlthinker: An early exploration to complex vision-language reasoning via iterative self-improvement. arXiv e-prints pp. arXiv–2503 (2025) 6

  10. [10]

    PaLM-E: An Embodied Multimodal Language Model

    Driess, D., Xia, F., Sajjadi, M.S., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., et al.: Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378 (2023) 2

  11. [11]

    Fan, Z., Zhang, J., Li, R., Zhang, J., Chen, R., Hu, H., Wang, K., Qu, H., Wang, D., Yan, Z., Xu, H., Theiss, J., Chen, T., Li, J., Tu, Z., Wang, Z., Ranjan, R.: Vlm-3r: Vision-language models augmented with instruction- aligned 3d reconstruction (2025),https://arxiv.org/abs/2505.2027912 Dense Reward for Multi-View 3D Reasoning 17

  12. [12]

    In: European Conference on Computer Vision

    Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not perceive. In: European Conference on Computer Vision. pp. 148–

  13. [13]

    Springer (2024) 2, 5, 6, 9, 7

  14. [14]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Goyal, Y., Khot, T., Summers-Stay, D., Batra, D., Parikh, D.: Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6904–6913 (2017) 6

  15. [15]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025) 1, 5, 9

  16. [16]

    arXiv preprint arXiv:2601.03590 (2026) 3, 5, 7, 8

    Guo, Z., Yang, Z., Li, Y., Zhang, X., Gao, W., Wang, J., Li, C., Liu, X., Jian, P.: Can llms see without pixels? benchmarking spatial intelligence from textual descriptions. arXiv preprint arXiv:2601.03590 (2026) 3, 5, 7, 8

  17. [17]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Gupta, S., Davidson, J., Levine, S., Sukthankar, R., Malik, J.: Cogni- tive mapping and planning for visual navigation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2616– 2625 (2017) 2

  18. [18]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Hong, Y., Lin, C., Du, Y., Chen, Z., Tenenbaum, J.B., Gan, C.: 3d con- cept learning and reasoning from multi-view images. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 9202–9212 (2023) 1, 2, 3, 5, 6

  19. [19]

    arXiv preprint arXiv:2507.23478 (2025) 2, 5

    Huang, T., Zhang, Z., Tang, H.: 3d-r1: Enhancing reasoning in 3d vlms for unified scene understanding. arXiv preprint arXiv:2507.23478 (2025) 2, 5

  20. [20]

    arXiv preprint arXiv:2506.01946 (2025)

    Huang, X., Wu, J., Xie, Q., Han, K.: 3drs: Mllms need 3d-aware representa- tion supervision for scene understanding. arXiv preprint arXiv:2506.01946 (2025) 5

  21. [21]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024) 1, 10, 11, 12

  22. [22]

    Ji, Y., Tan, H., Shi, J., Hao, X., Zhang, Y., Zhang, H., Wang, P., Zhao, M., Mu, Y., An, P., Xue, X., Su, Q., Lyu, H., Zheng, X., Liu, J., Wang, Z., Zhang, S.: Robobrain: A unified brain model for robotic manipulation from abstract to concrete (2025),https://arxiv.org/abs/2502.2125712

  23. [23]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Kim, M.J., Pertsch, K., Karamcheti, S., Xiao, T., Balakrishna, A., Nair, S., Rafailov, R., Foster, E., Lam, G., Sanketi, P., et al.: Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246 (2024) 2

  24. [24]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lai, X., Tian, Z., Chen, Y., Li, Y., Yuan, Y., Liu, S., Jia, J.: Lisa: Reasoning segmentation via large language model. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 9579–9589 (2024) 6

  25. [25]

    Lee, P.Y., Je, J., Park, C., Uy, M.A., Guibas, L., Sung, M.: Perspective- aware reasoning in vision-language models via mental imagery simulation. 18 J. Choi and S. Lee In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9241–9251 (2025) 5, 8

  26. [26]

    In: Findings of the Association for Computational Linguistics: EMNLP 2025

    Lee, S., Choi, J., Kang, I., Kim, J., Park, J., Shim, H.: 3d-aware vision- language models fine-tuning with geometric distillation. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 10628–10647 (2025) 5

  27. [27]

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., Li, C.: Llava-onevision: Easy visual task transfer (2024), https://arxiv.org/abs/2408.0332612

  28. [28]

    Lin, J., Yin, H., Ping, W., Lu, Y., Molchanov, P., Tao, A., Mao, H., Kautz, J., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models (2024),https://arxiv.org/abs/2312.0753312

  29. [29]

    Advances in Neural Information Processing Systems37, 140903–140936 (2024) 5

    Linghu, X., Huang, J., Niu, X., Ma, X.S., Jia, B., Huang, S.: Multi-modal situated reasoning in 3d scenes. Advances in Neural Information Processing Systems37, 140903–140936 (2024) 5

  30. [30]

    arXiv preprint arXiv:2510.16714 (2025) 5

    Linghu, X., Huang, J., Zhu, Z., Jia, B., Huang, S.: Scenecot: Elicit- ing grounded chain-of-thought reasoning in 3d scenes. arXiv preprint arXiv:2510.16714 (2025) 5

  31. [31]

    Advances in neural information processing systems36, 34892–34916 (2023) 1

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems36, 34892–34916 (2023) 1

  32. [32]

    arXiv preprint arXiv:2509.10884 (2025) 5

    Liu, Q., Huang, T., Zhang, Z., Tang, H.: Nav-r1: Reasoning and navigation in embodied scenes. arXiv preprint arXiv:2509.10884 (2025) 5

  33. [33]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Liu, Y., Peng, B., Zhong, Z., Yue, Z., Lu, F., Yu, B., Jia, J.: Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement. arXiv preprint arXiv:2503.06520 (2025) 6

  34. [34]

    Vision- reasoner: Unified visual perception and reasoning via reinforcement learning.arXiv preprint arXiv:2505.12081, 2025

    Liu, Y., Qu, T., Zhong, Z., Peng, B., Liu, S., Yu, B., Jia, J.: Visionreasoner: Unified reasoning-integrated visual perception via reinforcement learning. arXiv preprint arXiv:2505.12081 (2025) 6

  35. [35]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual-rft: Visual reinforcement fine-tuning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2034–2044 (2025) 5

  36. [36]

    LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL

    Peng, Y., Zhang, G., Zhang, M., You, Z., Liu, J., Zhu, Q., Yang, K., Xu, X., Geng, X., Yang, X.: Lmm-r1: Empowering 3b lmms with strong reasoning abilities through two-stage rule-based rl. arXiv preprint arXiv:2503.07536 (2025) 6

  37. [37]

    Qwen, :, Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., Lin, H., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Lin, J., Dang, K., Lu, K., Bao, K., Yang, K., Yu, L., Li, M., Xue, M., Zhang, P., Zhu, Q., Men, R., Lin, R., Li, T., Tang, T., Xia, T., Ren, X., Ren, X., Fan, Y., Su, Y., Zhang, Y., Wa...

  38. [38]

    Seed, B., :, Chen, J., Fan, T., Liu, X., Liu, L., Lin, Z., Wang, M., Wang, C., Wei, X., Xu, W., Yuan, Y., Yue, Y., Yan, L., Yu, Q., Zuo, X., Zhang, C., Zhu, R., An, Z., Bai, Z., Bao, Y., Bin, X., Chen, J., Chen, F., Chen, Dense Reward for Multi-View 3D Reasoning 19 H., Chen, R., Chen, L., Chen, Z., Chen, J., Chen, S., Chen, K., Chen, Z., Chen, J., Chen, J...

  39. [39]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y., Wu, Y., et al.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300 (2024) 5, 8, 9

  40. [40]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv: 2409.19256 (2024) 4

  41. [41]

    OpenAI GPT-5 System Card

    Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2026) 10, 11

  42. [42]

    arXiv preprint arXiv:2503.23829 (2025) 5 20 J

    Su, Y., Yu, D., Song, L., Li, J., Mi, H., Tu, Z., Zhang, M., Yu, D.: Cross- ing the reward bridge: Expanding rl with verifiable rewards across diverse domains. arXiv preprint arXiv:2503.23829 (2025) 5 20 J. Choi and S. Lee

  43. [43]

    Team, V., Hong, W., Yu, W., Gu, X., Wang, G., Gan, G., Tang, H., Cheng, J., Qi, J., Ji, J., Pan, L., Duan, S., Wang, W., Wang, Y., Cheng, Y., He, Z., Su, Z., Yang, Z., Pan, Z., Zeng, A., Wang, B., Chen, B., Shi, B., Pang, C., Zhang, C., Yin, D., Yang, F., Chen, G., Li, H., Zhu, J., Chen, J., Xu, J., Xu, J., Chen, J., Lin, J., Chen, J., Wang, J., Chen, J.,...

  44. [44]

    In: Proceedings of the Com- puter Vision and Pattern Recognition Conference

    Wang, J., Chen, M., Karaev, N., Vedaldi, A., Rupprecht, C., Novotny, D.: Vggt: Visual geometry grounded transformer. In: Proceedings of the Com- puter Vision and Pattern Recognition Conference. pp. 5294–5306 (2025) 3, 4, 7, 10, 5, 8

  45. [45]

    In: The Fourteenth International Conference on Learning Represen- tations (2026),https://openreview.net/forum?id=0FhrtdKLtD2, 3, 5, 6, 7, 9, 10, 11, 4

    Wang, Q., Yin, B., Zhang, P., Zhang, J., Wang, K., Wang, Z., Zhang, J., Chandrasegaran, K., Liu, H., Krishna, R., Xie, S., Li, M., Wu, J., Fei-Fei, L.: Understanding VLMs spatial mental modeling capability from limited views. In: The Fourteenth International Conference on Learning Represen- tations (2026),https://openreview.net/forum?id=0FhrtdKLtD2, 3, 5,...

  46. [46]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J.: Dust3r: Geo- metric 3d vision made easy. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 20697–20709 (2024) 3

  47. [47]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 1

  48. [48]

    In: Proceedings of the IEEE/CVF In- ternational conference on computer vision

    Wang, Z., Li, X., Yang, J., Liu, Y., Jiang, S.: Gridmm: Grid memory map for vision-and-language navigation. In: Proceedings of the IEEE/CVF In- ternational conference on computer vision. pp. 15625–15636 (2023) 2

  49. [49]

    Advances in neural information processing systems35, 24824–24837 (2022) 6

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems35, 24824–24837 (2022) 6

  50. [50]

    Reinforcement Learning with Verifiable Rewards Implicitly Incentivizes Correct Reasoning in Base LLMs

    Wen, X., Liu, Z., Zheng, S., Xu, Z., Ye, S., Wu, Z., Liang, X., Wang, Y., Li, J., Miao, Z., et al.: Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base llms. arXiv preprint arXiv:2506.14245 (2025) 5

  51. [51]

    Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

    Wu, D., Liu, F., Hung, Y.H., Duan, Y.: Spatial-mllm: Boosting mllm capa- bilities in visual-based spatial intelligence. arXiv preprint arXiv:2505.23747 (2025) 10, 11, 12 Dense Reward for Multi-View 3D Reasoning 21

  52. [52]

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., Zheng, C., Liu, D., Zhou, F., Huang, F., Hu, F., Ge, H., Wei, H., Lin, H., Tang, J., Yang, J., Tu, J., Zhang, J., Yang, J., Yang, J., Zhou, J., Zhou, J., Lin, J., Dang, K., Bao, K., Yang, K., Yu, L., Deng, L., Li, M., Xue, M., Li, M., Zhang, P., Wang, P., Zhu, Q...

  53. [53]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632–10643 (2025) 2, 5, 7, 9

  54. [54]

    MMSI-Bench: A Benchmark for Multi-Image Spatial Intelligence

    Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., et al.: Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764 (2025) 5, 16

  55. [55]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Yang, Y., He, X., Pan, H., Jiang, X., Deng, Y., Yang, X., Lu, H., Yin, D., Rao, F., Zhu, M., et al.: R1-onevision: Advancing generalized multi- modal reasoning through cross-modal formalization. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2376–2385 (2025) 6

  56. [56]

    In: The eleventh international conference on learning representations (2022) 6

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K.R., Cao, Y.: React: Synergizing reasoning and acting in language models. In: The eleventh international conference on learning representations (2022) 6

  57. [57]

    Embodied-R1: Reinforced Embodied Reasoning for General Robotic Manipulation

    Yuan, Y., Cui, H., Huang, Y., Chen, Y., Ni, F., Dong, Z., Li, P., Zheng, Y., Hao, J.: Embodied-r1: Reinforced embodied reasoning for general robotic manipulation. arXiv preprint arXiv:2508.13998 (2025) 5

  58. [58]

    arXiv preprint arXiv:2503.18013 (2025) 5

    Zhan, Y., Zhu, Y., Zheng, S., Zhao, H., Yang, F., Tang, M., Wang, J.: Vision-r1: Evolving human-free alignment in large vision-language models via vision-guided reinforcement learning. arXiv preprint arXiv:2503.18013 (2025) 5

  59. [59]

    129373, 5, 8

    Zhang, J., Huang, J., Yao, H., Liu, S., Zhang, X., Lu, S., Tao, D.: R1-vl: Learning to reason with multimodal large language models via step-wise group relative policy optimization (2025),https://arxiv.org/abs/2503. 129373, 5, 8

  60. [60]

    arXiv preprint arXiv:2601.13029 (2026) 1, 2, 3, 5, 10, 11, 12

    Zhang, Z., Wu, Y., Jia, L., Wang, Y., Zhang, Z., Li, Y., Ran, B., Zhang, F., Sun, Z., Yin, Z., et al.: Think3d: Thinking with space for spatial reasoning. arXiv preprint arXiv:2601.13029 (2026) 1, 2, 3, 5, 10, 11, 12

  61. [61]

    Zhou, Z., Chen, D., Ma, Z., Hu, Z., Fu, M., Wang, S., Wan, Y., Zhao, Z., Krishna, R.: Reinforced visual perception with tools (2025),https:// arxiv.org/abs/2509.0165612

  62. [62]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 10, 11 22 J. Choi and S. Lee

  63. [63]

    objects”: [ {“name

    Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., et al.: Rt-2: Vision-language-action models transfer web knowledge to robotic control. In: Conference on Robot Learning. pp. 2165–2183. PMLR (2023) 2 Dense Reward for Multi-View 3D Reasoning 1 Dense Reward for Multi-View 3D Reasoning with Global Maps and...

  64. [65]

    Image X"> - [object_name]: ego_direction=

    <think>: Complete a structured reasoning trajectory with 4 numbered steps: (1) Selecting a view: Read the question, identify the spatial anchor, locate it in the CogMap, and determine which viewpoint is most relevant, e.g., We should consider the viewpoint from <Image X>. (2) Egocentric layout: Convert the allocentric CogMap into an egocentric frame align...

  65. [66]

    X. option_text

    <answer>: “X. option_text” [CogMap Instruction] <cogmap_gen_instruction> [egocentric Map Rules] –The viewpoint’s facing direction becomes “forward” in the EgoMap. –Objects in the opposite direction become “back”. –Apply 90◦ rotation for left/right accordingly. –For diagonal facings, decompose and rotate consistently. –Include only objects that are spatial...

  66. [67]

    Assign (x, y) coordinates to each object based on their estimated positions in the scene

    <CogMap>: Construct an allocentric cognitive map as a 10×10 grid by observing the video frames. Assign (x, y) coordinates to each object based on their estimated positions in the scene

  67. [68]

    [object_name]

    <think>: Complete a structured reasoning trajectory with 4 numbered steps: (1) Selecting a viewpoint: Identify the standing position and facing target in the CogMap from the question, and compute the facing direction. (2) Egocentric layout: Convert the allocentric CogMap into an egocentric frame aligned with the standing position’s facing direction. Outpu...

  68. [69]

    X. option_text

    <answer>: “X. option_text”. [CogMap Instruction] <cogmap_gen_instruction> [Egocentric Map Rules] –The facing direction (from standing position toward the target object) becomes “forward” in the ego frame. –Compute displacement vectors from the standing position to each queried object. –Rotation rules: Grid [0,0]=top-left, [9,9]=bottom-right. •facing=up→fo...

  69. [70]

    <CogMap>: Construct an allocentric cognitive map from multi-view images

  70. [71]

    Image X"> - [object_name]: ego_direction=

    <think>: Complete a structured reasoning trajectory with 4 numbered steps: (1) Selecting a view: Read the question, identify the spatial anchor, locate it in the CogMap, and determine which viewpoint is most relevant, e.g., We should consider the viewpoint from <Image X>. (2) Egocentric layout: Convert the allocentric CogMap into an egocentric frame align...

  71. [72]

    X. option_text

    <answer>: “X. option_text” [CogMap Instruction] <cogmap_gen_instruction> [Egocentric Map Rules] –The viewpoint’s facing direction becomes “forward” in the EgoMap. –Objects in the opposite direction become “back”. –Apply 90◦ rotation for left/right accordingly. –For diagonal facings, decompose and rotate consistently. –Include only objects that are spatial...

  72. [73]

    The black chair with pillows is placed in front of a television

  73. [74]

    There are two yellow leather sofas in the room

  74. [75]

    left”. In the EgoMap, “TV

    The black chair is positioned between the two sofas. From the perspective of someone standing at the location of the two single sofas and facing the same direction: - The black chair with pillows would be directly in front of the person. - To the left of the black chair, there would be one of the yellow leather sofas. Therefore, if you were positioned whe...

  75. [76]

    In par- ticular, adding egocentric guidance with the trajectory of viewpoint selection leads to the greatest improvement in the final score

    The full setting achieves the highest plateau at about 62, which shows that spatial-map supervision and trajectory supervision are complementary. In par- ticular, adding egocentric guidance with the trajectory of viewpoint selection leads to the greatest improvement in the final score. For GRPO, the right plot in Figure A10 shows the mean reward over 200 ...

  76. [77]

    - Image 2 shows the same room but rotated 90 degrees clockwise from Image 1

    Understanding the initial positions: - Image 1 shows a room with a white ceiling, a white wall, and a door. - Image 2 shows the same room but rotated 90 degrees clockwise from Image 1. - Image 3 shows the room again, but rotated 90 degrees clockwise from Image 2

  77. [78]

    - There is a window on the right side

    Identifying the objects in Image 2: - There is a green cabinet on the left side. - There is a window on the right side. - There is a brown cabinet below the window. - There are several pictures hanging on the wall above the brown cabinet

  78. [79]

    - This means you would be facing the window and the brown cabinet

    Determining the new orientation: - If you stand at the same spot and face the same direction as shown in Image 2, and then turn 90 degrees to the left, you would be facing the opposite direction of Image 2. - This means you would be facing the window and the brown cabinet

  79. [80]

    up". After turn left 90, the effective pose becomes[5,5] facing=

    Identifying the object to your right: - When you face the window and the brown cabinet, the object to your right would be the green cabinet. Therefore, if you turn 90 degrees to the left from the position shown in Image 2, the object to your right would be the green cabinet. Answer:A </think> <answer>A. Green cabinet</answer> Ours:<think> (1) Selecting a ...