arxiv: 2605.18621 · v1 · pith:LVMI43G5new · submitted 2026-05-18 · 💻 cs.CV · cs.AI

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Pith reviewed 2026-05-20 10:33 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords cross-view reasoning · multimodal large language models · spatial intelligence · instruction dataset · object alignment · computer vision · benchmark

The pith

MLLMs gain consistent object reasoning across viewpoints by training on large cross-view data and using explicit alignment stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multimodal large language models can advance from single-view perception to cross-view spatial intelligence when given large-scale annotated training data, a dedicated evaluation benchmark, and a model that explicitly aligns objects across views. It creates a 1.6 million sample dataset covering 17 task types, a scene-disjoint benchmark, and a three-stage framework that first perceives objects, then aligns them, and finally reasons about their relations. A sympathetic reader would care because real-world spatial tasks such as navigation or manipulation require models to maintain object identity and geometry even when the viewpoint changes. The work argues that without these three elements together, progress remains limited by data scarcity and lack of alignment mechanisms.

Core claim

We introduce the CrossView Suite with three parts: CrossViewSet, a large-scale cross-view instruction dataset of 1.6M samples over 17 fine-grained tasks curated by a multi-agent engine; CrossViewBench, a scene-disjoint benchmark for systematic evaluation; and CrossViewer, a progressive Perception-Alignment-Reasoning framework that equips an adaptive spatial region tokenizer, performs explicit multi-view object alignment, and fuses the aligned features to improve cross-view inference in MLLMs.

What carries the argument

The Perception -> Alignment -> Reasoning paradigm together with an adaptive spatial region tokenizer that captures fine-grained object representations and explicit alignment of multi-view objects before feature fusion.
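
To make the alignment step concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a perception stage has already produced one feature vector per object in each view, and it matches objects across two views by mutual nearest neighbour under cosine similarity before averaging the matched features. The function names, the `min_sim` threshold, and the toy vectors are all hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def align_objects(view_a, view_b, min_sim=0.5):
    """Match objects across two views by mutual nearest neighbour.

    view_a, view_b: lists of per-object feature vectors (hypothetical
    stand-ins for object tokens). Returns a list of (i, j) index pairs.
    """
    if not view_a or not view_b:
        return []
    sim = [[cosine(a, b) for b in view_b] for a in view_a]
    best_for_a = [max(range(len(view_b)), key=lambda j: sim[i][j])
                  for i in range(len(view_a))]
    best_for_b = [max(range(len(view_a)), key=lambda i: sim[i][j])
                  for j in range(len(view_b))]
    pairs = []
    for i, j in enumerate(best_for_a):
        # Keep a pair only if each object is the other's best match
        # and the similarity clears the (assumed) threshold.
        if best_for_b[j] == i and sim[i][j] >= min_sim:
            pairs.append((i, j))
    return pairs

def fuse(view_a, view_b, pairs):
    """Average the features of each matched pair (one simple fusion choice)."""
    return [[(x + y) / 2 for x, y in zip(view_a[i], view_b[j])]
            for i, j in pairs]

# Toy features: two objects in view A, three in view B.
view_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
view_b = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]
pairs = align_objects(view_a, view_b)
fused = fuse(view_a, view_b, pairs)
print(pairs)
```

The third object in view B has no counterpart in view A and is left unmatched, which is the behaviour one would want for objects visible from only one viewpoint.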

If this is right

  • Large-scale cross-view training data becomes a prerequisite for reliable spatial reasoning in MLLMs.
  • Scene-disjoint benchmarks can expose whether models truly generalize across viewpoints rather than memorizing single scenes.
  • Explicit object-level alignment across views directly improves consistency in geometry, visibility, and interaction tasks.
  • The three-stage pipeline shows that perception alone is insufficient without a dedicated alignment phase before reasoning.
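
A scene-disjoint split, as mentioned in the second point above, can be sketched as follows. This is an illustration under stated assumptions, not the procedure used to build CrossViewBench: the `scene_id` key, the 20% test fraction, and the synthetic samples are hypothetical. The idea is to partition by scene rather than by sample, so that no scene contributes to both training and evaluation.

```python
import random
from collections import defaultdict

def scene_disjoint_split(samples, test_fraction=0.2, seed=0):
    """Split samples so that no scene ID appears in both partitions.

    samples: list of dicts carrying a hypothetical "scene_id" key.
    """
    by_scene = defaultdict(list)
    for sample in samples:
        by_scene[sample["scene_id"]].append(sample)

    # Shuffle scenes, not samples, so the split is made at scene level.
    scene_ids = sorted(by_scene)
    random.Random(seed).shuffle(scene_ids)
    n_test = max(1, round(len(scene_ids) * test_fraction))

    test = [s for sid in scene_ids[:n_test] for s in by_scene[sid]]
    train = [s for sid in scene_ids[n_test:] for s in by_scene[sid]]
    return train, test

# Synthetic data: 50 samples spread over 10 scenes.
samples = [{"scene_id": f"scene_{i % 10}", "question": f"q{i}"}
           for i in range(50)]
train, test = scene_disjoint_split(samples)
train_scenes = {s["scene_id"] for s in train}
test_scenes = {s["scene_id"] for s in test}
print(len(train), len(test), train_scenes.isdisjoint(test_scenes))
```

A per-sample random split would let questions about the same scene land on both sides, which is the memorization risk the bullet describes.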

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same alignment technique could be tested on video inputs to handle temporal viewpoint changes in addition to static multi-view sets.
  • Robotics platforms that already capture multiple camera feeds could adopt the tokenizer and alignment modules to improve object tracking during movement.
  • If the benchmark results hold on out-of-distribution scenes, the approach may generalize to augmented reality applications where users move freely around objects.

Load-bearing premise

The multi-agent data engine produces high-quality, unbiased cross-view instruction data that accurately captures object-level consistency across views.

What would settle it

Training an existing MLLM on the CrossViewSet data without the explicit alignment stage and then measuring whether its accuracy on CrossViewBench remains close to the full CrossViewer version would test whether the alignment step is necessary.
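
One way to read the outcome of such an ablation is a paired bootstrap over benchmark items. The sketch below is a generic statistical recipe, not something taken from the paper. It assumes both models are scored on the same items in the same order, and the 0/1 correctness vectors are made up for illustration.

```python
import random

def accuracy(correct):
    """Fraction of items answered correctly (items are 0 or 1)."""
    return sum(correct) / len(correct)

def paired_bootstrap_gap(full, ablated, n_resamples=2000, seed=0):
    """Return the accuracy gap (full - ablated) and a 95% bootstrap interval.

    full, ablated: per-item 0/1 correctness on the same benchmark items,
    in the same order.
    """
    assert len(full) == len(ablated) and full
    rng = random.Random(seed)
    n = len(full)
    gaps = []
    for _ in range(n_resamples):
        # Resample item indices with replacement, keeping the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(full[i] - ablated[i] for i in idx) / n)
    gaps.sort()
    lo = gaps[int(0.025 * n_resamples)]
    hi = gaps[int(0.975 * n_resamples) - 1]
    return accuracy(full) - accuracy(ablated), (lo, hi)

# Made-up correctness vectors, for illustration only.
full = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
ablated = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
gap, (lo, hi) = paired_bootstrap_gap(full, ablated)
print(gap, lo, hi)
```

If the interval comfortably excludes zero, the alignment stage is doing measurable work. If it includes zero, the benchmark does not distinguish the two variants at that sample size.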

Figures

Figures reproduced from arXiv: 2605.18621 by Jun Xiao, Siliang Tang, Tianwei Lin, Wei Wang, Wenqiao Zhang, Yueting Zhuang, Yuqian Yuan.

Figure 1: Overview of CrossViewer. CrossViewer presents a unified framework for cross-view spatial intelligence in MLLMs, ...
Figure 2: Source and task composition of CrossViewSet and ...
Figure 3: Overview of CrossViewer. Stage I extracts mask-grounded object tokens, Stage II aligns them across views, and Stage ...
Figure 4: Overview panels of CrossView Suite and CrossViewer: per-type gains, gap to HumanBase, and t-SNE of Q1 correspon...
Figure 5: Qualitative cross-view retrieval results.

Original abstract

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we thoroughly develop CrossView Suite across three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we meticulously create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM, evaluating it across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method equips an adaptive spatial region tokenizer to capture fine-grained object representations, and then aligns the multi-view objects explicitly, and thus fuses aligned features for boosting the cross-view inference capacity for MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CrossView Suite to advance cross-view spatial intelligence in MLLMs. It consists of CrossViewSet (1.6M samples across 17 fine-grained tasks curated via a multi-agent data engine), CrossViewBench (a scene-disjoint benchmark for systematic evaluation), and CrossViewer (a progressive three-stage Perception -> Alignment -> Reasoning framework equipped with an adaptive spatial region tokenizer for fine-grained object representations, explicit multi-view object alignment, and aligned feature fusion). The central claim is that extensive experiments demonstrate large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for moving MLLMs beyond single-view perception.

Significance. If the data quality holds and the experiments provide clear, reproducible gains with proper controls, the coordinated release of a large curated dataset, a dedicated benchmark, and an alignment-focused framework could meaningfully support research on spatial reasoning in multimodal models. The emphasis on object-level consistency across views and the open artifacts (GitHub-linked) are constructive elements that could aid reproducibility.

major comments (2)
  1. CrossViewSet curation section: The multi-agent data engine is presented as producing high-quality samples that accurately capture object-level consistency across views for all 17 task types (1.6M samples total), yet no quantitative validation is reported (human agreement rates, cross-view consistency metrics, or bias/hallucination audits). This is load-bearing for the claim that large-scale training data is critical, because without independent checks the ablation results cannot isolate benefits of scale and quality from potential data artifacts such as view-inconsistent attributes.
  2. Experiments section: The abstract asserts that 'extensive experiments and analyses show' the criticality of data, evaluation, and alignment, but provides no quantitative results, baseline comparisons, ablation details, or error analysis. Without these specifics, the support for the central claim that the three components are critical cannot be verified from the manuscript.
minor comments (1)
  1. Abstract: The description of CrossViewer mentions an 'adaptive spatial region tokenizer' but does not clarify its parameterization or how it differs from standard region tokenizers; a short clarifying sentence would improve readability.
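
The human agreement rate requested in major comment 1 is commonly reported as Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain implementation of that standard statistic. The label names and the two annotation lists are hypothetical and do not come from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)

# Hypothetical judgments on whether two views show the same object.
annotator_1 = ["same", "same", "diff", "same", "diff", "diff", "same", "same"]
annotator_2 = ["same", "diff", "diff", "same", "diff", "same", "same", "same"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

Here the annotators agree on 6 of 8 items (0.75), chance agreement is 34/64, and kappa works out to 7/15, roughly 0.467.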

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major comment point by point below, agreeing where revisions are needed to strengthen the manuscript and providing clarifications where appropriate.

  1. Referee: CrossViewSet curation section: The multi-agent data engine is presented as producing high-quality samples that accurately capture object-level consistency across views for all 17 task types (1.6M samples total), yet no quantitative validation is reported (human agreement rates, cross-view consistency metrics, or bias/hallucination audits). This is load-bearing for the claim that large-scale training data is critical, because without independent checks the ablation results cannot isolate benefits of scale and quality from potential data artifacts such as view-inconsistent attributes.

    Authors: We acknowledge that the manuscript does not currently report quantitative validation metrics such as human agreement rates, cross-view consistency scores, or explicit bias/hallucination audits for the 1.6M samples in CrossViewSet. This is a valid concern, as it limits the ability to fully substantiate data quality independent of the ablation outcomes. In the revised manuscript, we will add a new subsection under CrossViewSet curation that includes these metrics: inter-annotator agreement on a sampled subset for object-level attributes, cross-view consistency evaluations, and results from targeted audits for inconsistencies and hallucinations. These additions will directly support the claim that large-scale, high-quality data is critical. revision: yes

  2. Referee: Experiments section: The abstract asserts that 'extensive experiments and analyses show' the criticality of data, evaluation, and alignment, but provides no quantitative results, baseline comparisons, ablation details, or error analysis. Without these specifics, the support for the central claim that the three components are critical cannot be verified from the manuscript.

    Authors: We agree that the current presentation of results in the Experiments section does not provide sufficient quantitative detail, baseline comparisons, ablation studies, or error analysis to fully verify the central claims from the abstract. While the manuscript outlines the experimental design and high-level findings, more granular reporting is needed for reproducibility and to isolate the contributions of each component. We will revise the Experiments section to include expanded quantitative tables (e.g., performance on CrossViewBench with and without CrossViewSet), direct baseline comparisons, component-wise ablations for the Perception-Alignment-Reasoning stages, and a dedicated error analysis section. This will make the evidence for the criticality of data, evaluation, and alignment verifiable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation relies on new artifacts and experiments

The paper's central claims rest on introducing CrossViewSet (curated via a multi-agent engine), CrossViewBench (scene-disjoint), and CrossViewer (a three-stage Perception-Alignment-Reasoning framework with an adaptive tokenizer). Experiments and ablations test the criticality of scale, evaluation, and alignment using these newly created components. No load-bearing step reduces by construction to a fitted parameter renamed as a prediction, a self-defined quantity, or a chain of self-citations whose validity is presupposed. The derivation chain is self-contained and can be checked against the external benchmarks and the GitHub-linked artifacts.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim depends on the quality of newly curated data and the effectiveness of the introduced alignment mechanism, both of which are created within the paper rather than drawn from independently validated external sources.

free parameters (1)
  • adaptive tokenizer parameters
    Parameters of the spatial region tokenizer are learned during training on CrossViewSet.
axioms (1)
  • domain assumption: Multi-agent data engine produces high-quality, unbiased annotations that reflect true object-level cross-view consistency
    Invoked when describing curation of CrossViewSet
invented entities (1)
  • adaptive spatial region tokenizer (no independent evidence)
    purpose: Capture fine-grained object representations to enable explicit multi-view alignment
    New component introduced in CrossViewer; no independent evidence outside the paper

pith-pipeline@v0.9.0 · 5843 in / 1454 out tokens · 57321 ms · 2026-05-20T10:33:28.222788+00:00 · methodology

