SSMNBench: Diagnosing Image-based Cross-View Human-Object Understanding via Single-View Sufficiency and Multi-View Necessity
Pith reviewed 2026-06-25 21:00 UTC · model grok-4.3
The pith
Modern MLLMs average semantics from single views and prefer certain angles rather than synthesizing cross-view geometric evidence for human-object scenes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SSMNBench comprises 3,300 curated QA pairs that categorize cross-view human and human-object understanding into Single-View Sufficiency and Multi-View Necessity tasks. Systematic perturbation of view availability across state-of-the-art MLLMs reveals severe distraction degradation on SVS tasks and inability to integrate fragmented geometric evidence on MVN tasks, demonstrating that models depend on multiple single-image semantic averaging and view preference rather than genuine cross-view synthesis.
What carries the argument
The SVS/MVN task categorization combined with systematic view-perturbation protocol that controls availability of camera frames while holding question content fixed.
If this is right
- Existing multi-view benchmarks that supply a fixed bag of frames cannot distinguish robustness to distraction from actual cross-view fusion.
- Architectures for future MLLMs must incorporate explicit mechanisms for geometric evidence integration rather than relying on semantic averaging.
- Evaluation protocols for cross-view understanding should always include both sufficiency and necessity conditions to avoid conflating separate abilities.
- Progress on human-centric scene reasoning requires diagnostic tests that perturb view availability rather than simply increasing the number of input frames.
Where Pith is reading between the lines
- The same diagnostic split could be applied to non-human scenes or video sequences to test whether the averaging behavior is specific to static human-object views.
- If models continue to exhibit view preference, training objectives that penalize reliance on single dominant frames might be needed.
- Robotics or AR applications that depend on multi-camera fusion would face reliability limits until cross-view synthesis improves.
Load-bearing premise
The curated QA pairs and view-perturbation protocol isolate genuine cross-view synthesis ability from single-view semantic leakage or visual distraction effects.
What would settle it
A model that shows no performance drop on SVS tasks when redundant views are added and simultaneously improves on MVN tasks by correctly combining geometric details across views would falsify the claim.
Figures
read the original abstract
Multimodal Large Language Models (MLLMs) have shown remarkable progress in single-image perception, yet their ability to reason about complex cross-view human-centric scenes remains largely unverified. Current multi-view benchmarks evaluate models using a fixed "bag of frames" and thus conflate a model's robustness to visual distraction with its genuine ability to fuse fragmented cross-view evidence. To address this issue, we introduce SSMNBench, a diagnostic benchmark comprising 3,300 curated QA pairs for cross-view human and human-object understanding. SSMNBench uniquely categorizes tasks into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN). By systematically perturbing view availability across 17 state-of-the-art MLLMs, critical limitations are revealed: models suffer from severe "distraction degradation" when presented with redundant views (SVS), and fail to integrate fragmented geometric evidence across cameras (MVN). Our evaluations demonstrate that modern MLLMs rely on multiple single-image semantic averaging and view preference rather than genuine cross-view synthesis. By exposing these fundamental vulnerabilities, SSMNBench provides a rigorous diagnostic framework to drive the advancement of future cross-view-aware multimodal architectures. The code is available at: $ \href{https://github.com/gtc-gh/SSMNBench}{\text{SSMNBench}} $
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SSMNBench, a diagnostic benchmark of 3,300 curated QA pairs for cross-view human and human-object understanding in MLLMs. Tasks are partitioned into Single-View Sufficiency (SVS) and Multi-View Necessity (MVN). Systematic view-perturbation experiments across 17 state-of-the-art MLLMs reveal distraction degradation on SVS tasks and failure to integrate fragmented geometric evidence on MVN tasks, leading to the conclusion that models rely on single-image semantic averaging and view preference rather than genuine cross-view synthesis. Code is released at the provided GitHub link.
Significance. If the benchmark's partitioning and perturbation protocol correctly isolate cross-view synthesis from single-view leakage and distraction, the work supplies a useful diagnostic tool for exposing limitations in current MLLMs and motivating cross-view-aware architectures. The scale of the evaluation and public code release are concrete strengths that support reproducibility and follow-on research.
minor comments (3)
- [Abstract] Abstract: the phrase 'systematically perturbing view availability' is used without an early high-level description of the perturbation protocol; moving a one-sentence summary of the SVS/MVN view-availability controls into the abstract would improve immediate clarity.
- [Introduction] The manuscript states that the benchmark 'conflates a model's robustness to visual distraction with its genuine ability to fuse fragmented cross-view evidence' in prior work, but does not cite the specific multi-view benchmarks being critiqued; adding 2-3 representative citations in the introduction would strengthen the motivation.
- [Experiments] The claim that models 'rely on multiple single-image semantic averaging and view preference' is presented as the primary takeaway; a short additional ablation or control experiment quantifying the contribution of view preference (e.g., via explicit view-order randomization) would make this interpretation more robust.
Simulated Author's Rebuttal
We thank the referee for their positive summary of SSMNBench and for recommending minor revision. No specific major comments appear in the provided report.
Circularity Check
No significant circularity; pure empirical benchmark
full rationale
The paper presents SSMNBench as an empirical diagnostic benchmark that partitions QA pairs into SVS (single-view sufficient) and MVN (multi-view necessary) categories, then measures model performance under controlled view perturbations on 17 held-out MLLMs. No derivations, equations, fitted parameters, or first-principles predictions are claimed or present. Central claims rest on direct measurements of distraction degradation and integration failure rather than any self-referential construction, self-citation chain, or renamed known result. The evaluation protocol is externally falsifiable via the released code and dataset, satisfying the criterion for a self-contained benchmark with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 3300 QA pairs and their SVS/MVN labels accurately capture single-view sufficiency versus multi-view necessity for human-object understanding.
Reference graph
Works this paper leans on
-
[1]
arXiv preprint arXiv:2509.23661 (2025)
An, X., Xie, Y., Yang, K., Zhang, W., Zhao, X., Cheng, Z., Wang, Y., Xu, S., Chen, C., Zhu, D., et al.: Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661 (2025)
Pith/arXiv arXiv 2025
-
[2]
arXiv preprint arXiv:2309.16609 (2023)
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
Pith/arXiv arXiv 2023
-
[3]
arXiv preprint arXiv:2502.13923 (2025)
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., Lin, J.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
Pith/arXiv arXiv 2025
-
[4]
arXiv preprint arXiv:2511.22154 (2025)
Chang, E., Huang, Z., Liao, Y., Bhavsar, S.R., Param, A., Stark, T., Ahmadyan, A., Yang, X., Wang, J., Abdullah, A., et al.: Wearvqa: A visual question answer- ing benchmark for wearables in egocentric authentic real-world scenarios. arXiv preprint arXiv:2511.22154 (2025)
arXiv 2025
-
[5]
arXiv preprint arXiv:2512.18231 (2025)
Chaudhary, A., Goyal, S., Narang, P., Kumar, D.: Investigating spatial attention bias in vision-language models. arXiv preprint arXiv:2512.18231 (2025)
arXiv 2025
-
[6]
arXiv preprint arXiv:2401.03890 (2024)
Chen, G., Wang, W.: A survey on 3d gaussian splatting. arXiv preprint arXiv:2401.03890 (2024)
Pith/arXiv arXiv 2024
-
[7]
Chen, L., Li, L., Zhao, H., Song, Y., Vinci: R1-v: Reinforcing super generalization ability in vision-language models with less than $3.https://github.com/Deep- Agent/R1-V(2025), accessed: 2025-02-02
2025
-
[8]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024) 16 T. Guo et al
2024
-
[9]
arXiv preprint arXiv:2504.13180 (2025)
Cho, J.H., Madotto, A., Mavroudi, E., Afouras, T., Nagarajan, T., Maaz, M., Song, Y., Ma, T., Hu, S., Jain, S., et al.: Perceptionlm: Open-access data and models for detailed visual understanding. arXiv preprint arXiv:2504.13180 (2025)
arXiv 2025
-
[10]
arXiv preprint arXiv:2507.06261 (2025)
Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
Pith/arXiv arXiv 2025
-
[11]
DeepSeek-AI: Deepseek-v3 technical report (2024),https://arxiv.org/abs/ 2412.19437
Pith/arXiv arXiv 2024
-
[12]
IEEE Robotics and Automation Letters10(2), 1840–1847 (2024)
Du,X.,Sun,H.,Lu,M.,Zhu,T.,Yu,X.:Dreamcar:Leveragingcar-specificpriorfor in-the-wild 3d car reconstruction. IEEE Robotics and Automation Letters10(2), 1840–1847 (2024)
2024
-
[13]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Du, X., Wang, Y., Sun, H., Wu, Z., Sheng, H., Wang, S., Ying, J., Lu, M., Zhu, T., Zhan, K., et al.: 3drealcar: An in-the-wild rgb-d car dataset with 360-degree views. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 26488–26498 (2025)
2025
-
[14]
Du, X., Wang, Y., Yu, X.: Mvgs: Multi-view-regulated gaussian splatting for novel view synthesis (2024)
2024
-
[15]
arXiv preprint arXiv:2603.11531 (2026)
Du, X., Wang, Y., Zhan, K., Yu, X.: Mobile-gs: Real-time gaussian splatting for mobile devices. arXiv preprint arXiv:2603.11531 (2026)
arXiv 2026
-
[16]
Neurocomputing600, 128129 (2024)
Du, X., Yu, X., Liu, J., Dai, B., Xu, F.: Ethics-aware face recognition aided by synthetic face images. Neurocomputing600, 128129 (2024)
2024
-
[17]
arXiv e-prints pp
Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The llama 3 herd of models. arXiv e-prints pp. arXiv–2407 (2024)
2024
-
[18]
arXiv preprint arXiv:2306.13394 (2023)
Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394 (2023)
Pith/arXiv arXiv 2023
-
[19]
Advances in Neural Information Processing Systems38(2026)
Fu, C., Chen, P., Shen, Y., Qin, Y., Zhang, M., Lin, X., Yang, J., Zheng, X., Li, K., Sun, X., et al.: Mme: A comprehensive evaluation benchmark for multimodal large language models. Advances in Neural Information Processing Systems38(2026)
2026
-
[20]
In: European Conference on Computer Vision
Fu, X., Hu, Y., Li, B., Feng, Y., Wang, H., Lin, X., Roth, D., Smith, N.A., Ma, W.C., Krishna, R.: Blink: Multimodal large language models can see but not per- ceive. In: European Conference on Computer Vision. pp. 148–166. Springer (2024)
2024
-
[21]
In: Proceedings of the 29th ACM international conference on multimedia
Gan, Y., Han, R., Yin, L., Feng, W., Wang, S.: Self-supervised multi-view multi- human association and tracking. In: Proceedings of the 29th ACM international conference on multimedia. pp. 282–290 (2021)
2021
-
[22]
Gholami,M.,Rezaei,A.,Weimin,Z.,Mao,S.,Zhou,S.,Zhang,Y.,Akbari,M.:Spa- tial reasoning with vision-language models in ego-centric multi-view scenes (2025), https://arxiv.org/abs/2509.06266
arXiv 2025
-
[23]
Google DeepMind: Gemini 2.5: Our most intelligent ai model (2025),https:// blog.google/technology/google-deepmind/gemini-model-thinking-updates- march-2025/
2025
-
[24]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Grauman, K., Westbury, A., Torresani, L., Kitani, K., Malik, J., Afouras, T., Ashutosh, K., Baiyya, V., Bansal, S., Boote, B., et al.: Ego-exo4d: Understanding skilled human activity from first-and third-person perspectives. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19383–19400 (2024)
2024
-
[25]
In: International Conference on Algorithms and Architectures for Parallel Processing
Guo, T., Du, H., Huo, H., Liu, B., Yu, X.: Who is being impersonated? deepfake audio detection and impersonated identification via extraction of id-specific fea- SSMNBench 17 tures. In: International Conference on Algorithms and Architectures for Parallel Processing. pp. 301–320. Springer (2024)
2024
-
[26]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Guo, T., Liu, C., Yu, X.: Beyond single-view sufficiency: Cvbench for cross-view human understanding. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7154–7164 (2026)
2026
-
[27]
In: Australasian Joint Conference on Artificial Intelligence
Guo,T.,Logan,P.A.,Wackwitz,T.,Martin,D.:Plnet-12:Avision-languagebench- mark for zero-shot physical literacy analysis across 12 fundamental movements. In: Australasian Joint Conference on Artificial Intelligence. pp. 242–254. Springer (2025)
2025
-
[28]
arXiv preprint arXiv:2605.18746 (2026)
Hong, Y., Liu, J., Yin, H., Li, M., Guibas, L., Fei-Fei, L., Wu, J., Choi, Y.: Esi- bench: Towards embodied spatial intelligence that closes the perception-action loop. arXiv preprint arXiv:2605.18746 (2026)
Pith/arXiv arXiv 2026
-
[29]
Huang, M., Shi, Y., Peng, D., Lai, S., Xie, Z., Jin, L.: Ocr-reasoning benchmark: Unveilingthetruecapabilitiesofmllmsincomplextext-richimagereasoning.arXiv preprint arXiv:2505.17163 (2025)
Pith/arXiv arXiv 2025
-
[30]
Hugging Face: Open r1: A fully open reproduction of deepseek-r1 (January 2025), https://github.com/huggingface/open-r1
2025
-
[31]
arXiv preprint arXiv:2506.03135 (2025)
Jia, M., Qi, Z., Zhang, S., Zhang, W., Yu, X., He, J., Wang, H., Yi, L.: Omnispatial: Towards comprehensive spatial reasoning benchmark for vision language models. arXiv preprint arXiv:2506.03135 (2025)
arXiv 2025
-
[32]
In: Australasian Database Conference
Ke, Y., Yu, X., Du, H., Chapman, S., Huang, H.: Dynamic orchestration of multi-agent system for real-world multi-image agricultural vqa. In: Australasian Database Conference. pp. 153–165. Springer (2025)
2025
-
[33]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Khirodkar, R., Bansal, A., Ma, L., Newcombe, R., Vo, M., Kitani, K.: Ego-humans: An ego-centric 3d multi-human benchmark. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19807–19819 (2023)
2023
-
[34]
Advances in Neural Information Process- ing Systems37, 107270–107285 (2024)
Khirodkar,R.,Song,J.T.,Cao,J.,Luo,Z.,Kitani,K.:Harmony4d:Avideodataset for in-the-wild close human interactions. Advances in Neural Information Process- ing Systems37, 107270–107285 (2024)
2024
-
[35]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Li, K., Wang, Y., He, Y., Li, Y., Wang, Y., Liu, Y., Wang, Z., Xu, J., Chen, G., Luo, P., et al.: Mvbench: A comprehensive multi-modal video understanding benchmark. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22195–22206 (2024)
2024
-
[36]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Liu,C.,Li,P.,Yang,L.,Wang,D.,Li,L.,Yu,X.:Robustaudio-visualsegmentation via audio-guided visual convergent alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28922–28931 (2025)
2025
-
[37]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Liu, C., Li, P.P., Yu, Q., Sheng, H., Wang, D., Li, L., Yu, X.: Benchmarking audio visual segmentation for long-untrimmed videos. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22712–22722 (2024)
2024
-
[38]
In: European Conference on Computer Vision
Liu, C., Qiu, F., Zhang, W., Li, L., Wang, D., Yu, X.: Compound expression recognition via curriculum learning. In: European Conference on Computer Vision. pp. 282–293. Springer (2024)
2024
-
[39]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Liu, C., Yang, L., Li, P., Wang, D., Li, L., Yu, X.: Dynamic derivation and elimi- nation: Audio visual segmentation with enhanced audio semantics. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 3131–3141 (2025)
2025
-
[40]
In: European Conference on Computer Vision
Liu, C., Zhang, W., Qiu, F., Li, L., Wang, D., Yu, X.: Affective behaviour analysis via progressive learning. In: European Conference on Computer Vision. pp. 366–
-
[41]
Guo et al
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning (2023) 18 T. Guo et al
2023
-
[42]
In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion
Liu, Y., Zhang, C., Xing, R., Tang, B., Yang, B., Yi, L.: Core4d: A 4d human- object-human interaction dataset for collaborative object rearrangement. In: Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion. pp. 1769–1782 (2025)
2025
-
[43]
In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision
Liu, Z., Sun, Z., Zang, Y., Dong, X., Cao, Y., Duan, H., Lin, D., Wang, J.: Visual- rft: Visual reinforcement fine-tuning. In: Proceedings of the IEEE/CVF Interna- tional Conference on Computer Vision. pp. 2034–2044 (2025)
2034
-
[44]
arXiv preprint arXiv:2405.20797 (2024)
Lu, S., Li, Y., Chen, Q.G., Xu, Z., Luo, W., Zhang, K., Ye, H.J.: Ovis: Struc- tural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797 (2024)
arXiv 2024
-
[45]
In: Proceed- ings of the Computer Vision and Pattern Recognition Conference
Ma, W., Ye, L., de Melo, C.M., Yuille, A., Chen, J.: Spatialllm: A compound 3d- informed design towards spatially-intelligent large multimodal models. In: Proceed- ings of the Computer Vision and Pattern Recognition Conference. pp. 17249–17260 (2025)
2025
-
[46]
In: Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25)
Nguyen, D., Ho, M.K., Ta, H., Nguyen, T.T., Chen, Q., Rav, K., Dang, Q.D., Ram- chandre, S., Phung, S.L., Liao, Z., et al.: Localizing before answering: A benchmark for grounded medical visual question answering. In: Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI-25). International Joint Confer- ences on Artificial Intellig...
2025
-
[47]
OpenAI: Introducing gpt-5.2 (2025),https://openai.com/index/introducing- gpt-5-2/
2025
-
[48]
In: International conference on med- ical image computing and computer-assisted intervention
Özsoy, E., Örnek, E.P., Eck, U., Czempiel, T., Tombari, F., Navab, N.: 4d-or: Se- mantic scene graphs for or domain modeling. In: International conference on med- ical image computing and computer-assisted intervention. pp. 475–485. Springer (2022)
2022
-
[49]
arXiv preprint arXiv:2406.17431 (2024)
Pan, S., Guo, T., Zhang, L., Liu, P., Xing, Z., Sun, X.: A large-scale investigation of semantically incompatible apis behind compatibility issues in android apps. arXiv preprint arXiv:2406.17431 (2024)
arXiv 2024
-
[50]
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Qi, T., Li, W., Barnes, N.: Smokebench: Evaluating multimodal large language models for wildfire smoke detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1043–1053 (2026)
2026
-
[51]
arXiv preprint arXiv:2501.01243 (2025)
Qin,L.,Ou,S.,Zhang,M.,Wei,J.,Zhang,Y.,Song,X.,Liu,Y.,Wang,M.,Xu,W.: Face-human-bench: A comprehensive benchmark of face and human understanding for multi-modal assistants. arXiv preprint arXiv:2501.01243 (2025)
arXiv 2025
-
[52]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Qiu, F., Du, H., Zhang, W., Liu, C., Li, L., Guo, T., Yu, X.: Learning transfer- able compound expressions from masked autoencoder pretraining. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4733–4741 (2024)
2024
-
[53]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Qiu,F.,Zhang,W.,Liu,C.,Li,L.,Du,H.,Guo,T.,Yu,X.:Language-guidedmulti- modal emotional mimicry intensity estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4742–4751 (2024)
2024
-
[54]
Qwen Team: Qwen3.5: Towards native multimodal agents (February 2026),https: //qwen.ai/blog?id=qwen3.5
2026
-
[55]
arXiv preprint arXiv:2504.07615 (2025)
Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., Xu, R., Zhao, T.: Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615 (2025)
Pith/arXiv arXiv 2025
-
[56]
Sun, H., Wu, J., Xia, B., Luo, Y., Zhao, Y., Qin, K., Lv, X., Zhang, T., Chang, Y., Wang, X.: Reinforcement fine-tuning powers reasoning capability of multimodal large language models (2025),https://arxiv.org/abs/2505.18536 SSMNBench 19
arXiv 2025
-
[57]
arXiv preprint arXiv:2312.11805 (2023)
Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
Pith/arXiv arXiv 2023
-
[58]
Team, V., Hong, W., Yu, W., Gu, X., Wang, G., Gan, G., Tang, H., Cheng, J., Qi, J., Ji, J., Pan, L., Duan, S., Wang, W., Wang, Y., Cheng, Y., He, Z., Su, Z., Yang, Z., Pan, Z., Zeng, A., Wang, B., Chen, B., Shi, B., Pang, C., Zhang, C., Yin, D., Yang, F., Chen, G., Xu, J., Zhu, J., Chen, J., Chen, J., Chen, J., Lin, J., Wang, J., Chen, J., Lei, L., Gong, ...
Pith/arXiv arXiv 2025
-
[59]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Tian, X., Zou, S., Yang, Z., Zhang, J.: Identifying and mitigating position bias of multi-image vision-language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 10599–10609 (2025)
2025
-
[60]
arXiv preprint arXiv:2302.13971 (2023)
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
Pith/arXiv arXiv 2023
-
[61]
arXiv preprint arXiv:2406.09411 (2024)
Wang, F., Fu, X., Huang, J.Y., Li, Z., Liu, Q., Liu, X., Ma, M.D., Xu, N., Zhou, W., Zhang, K., et al.: Muirbench: A comprehensive benchmark for robust multi-image understanding. arXiv preprint arXiv:2406.09411 (2024)
Pith/arXiv arXiv 2024
-
[62]
Wang, T., Zhang, Z., Zhu, Z., Fan, Y., Xiong, J., Li, P., Ma, X., Li, Q.: From objects to anywhere: A holistic benchmark for multi-level visual grounding in 3d scenes (2025),https://arxiv.org/abs/2506.04897
arXiv 2025
-
[63]
arXiv preprint arXiv:2508.18265 (2025)
Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
Pith/arXiv arXiv 2025
-
[64]
In: 2023 IEEE International Conference on Big Data (BigData)
Wu, J., Gan, W., Chen, Z., Wan, S., Yu, P.S.: Multimodal large language models: A survey. In: 2023 IEEE International Conference on Big Data (BigData). pp. 2247–2256. IEEE (2023)
2023
-
[65]
arXiv preprint arXiv:2412.10302 (2024)
Wu,Z.,Chen,X.,Pan,Z.,Liu,X.,Liu,W.,Dai,D.,Gao,H.,Ma,Y.,Wu,C.,Wang, B., et al.: Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. arXiv preprint arXiv:2412.10302 (2024)
Pith/arXiv arXiv 2024
-
[66]
arXiv preprint arXiv:2204.02824 (2022)
Wu, Z., Qi, X., Wang, Z., Zhou, W., Yuan, K., Sun, M., Sun, Z.: Showface: Co- ordinated face inpainting with memory-disentangled refinement networks. arXiv preprint arXiv:2204.02824 (2022)
arXiv 2022
-
[67]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Wu, Z., Wang, S., Yu, X.: Metom: Metadata-guided token merging for efficient video llms. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10441–10450 (2026)
2026
-
[68]
In: Proceedings of the Computer Vision and Pattern Recogni- tion Conference
Xu, Q., Cao, R., Shen, X., Du, H., Wang, S., Yu, X.: M3gym: A large-scale mul- timodal multi-view multi-person pose dataset for fitness activity understanding in real-world settings. In: Proceedings of the Computer Vision and Pattern Recogni- tion Conference. pp. 12289–12300 (2025)
2025
-
[69]
Computer Vision and Image Understanding249, 104205 (2024) 20 T
Xu, Q., Chen, H., Du, H., Zhang, H., Łukasik, S., Zhu, T., Yu, X.: M3a: A mul- timodal misinformation dataset for media authenticity analysis. Computer Vision and Image Understanding249, 104205 (2024) 20 T. Guo et al
2024
-
[70]
In: Proceedings of the ACM on Web Conference 2025
Xu, Q., Du, H., Łukasik, S., Zhu, T., Wang, S., Yu, X.: Mdam3: A misinformation detection and analysis framework for multitype multimodal media. In: Proceedings of the ACM on Web Conference 2025. pp. 5285–5296 (2025)
2025
-
[71]
arXiv preprint arXiv:2505.09388 (2025)
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
Pith/arXiv arXiv 2025
-
[72]
In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference
Yang, J., Yang, S., Gupta, A.W., Han, R., Fei-Fei, L., Xie, S.: Thinking in space: How multimodal large language models see, remember, and recall spaces. In: Pro- ceedings of the Computer Vision and Pattern Recognition Conference. pp. 10632– 10643 (2025)
2025
-
[73]
arXiv preprint arXiv:2505.23764 (2025)
Yang, S., Xu, R., Xie, Y., Yang, S., Li, M., Lin, J., Zhu, C., Chen, X., Duan, H., Yue, X., et al.: Mmsi-bench: A benchmark for multi-image spatial intelligence. arXiv preprint arXiv:2505.23764 (2025)
Pith/arXiv arXiv 2025
-
[74]
arXiv preprint arXiv:2504.15280 (2025)
Yeh, C.H., Wang, C., Tong, S., Cheng, T.Y., Wang, R., Chu, T., Zhai, Y., Chen, Y., Gao, S., Ma, Y.: Seeing from another perspective: Evaluating multi-view un- derstanding in mllms. arXiv preprint arXiv:2504.15280 (2025)
arXiv 2025
-
[75]
National Science Review11(12), nwae403 (2024)
Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., Chen, E.: A survey on multimodal large language models. National Science Review11(12), nwae403 (2024)
2024
-
[76]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Zhang, J., Zhang, J., Song, Z., Shi, Z., Zhao, C., Shi, Y., Yu, J., Xu, L., Wang, J.: Hoi-mˆ 3: Capture multiple humans and objects interaction within contextual environment. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 516–526 (2024)
2024
-
[77]
In: Findings of the Association for Computational Linguistics: EMNLP 2025
Zhang, K., Niu, L., Cao, Z., Meng, F., Zhou, J.: Tiu-bench: A benchmark for evaluating large multimodal models on text-rich image understanding. In: Findings of the Association for Computational Linguistics: EMNLP 2025. pp. 24286–24295 (2025)
2025
-
[78]
arXiv preprint arXiv:2509.02359 (2025)
Zhang, W., Huang, Y., Xu, Y., Huang, J., Zhi, H., Ren, S., Xu, W., Zhang, J.: Why do mllms struggle with spatial understanding? a systematic analysis from data to architecture. arXiv preprint arXiv:2509.02359 (2025)
arXiv 2025
-
[79]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Zhang, W., Qiu, F., Liu, C., Li, L., Du, H., Guo, T., Yu, X.: An effective en- semble learning framework for affective behaviour analysis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4761–4772 (2024)
2024
-
[80]
arXiv preprint arXiv:2403.10825 (2024)
Zhang, W., Qiu, F., Liu, C., Li, L., Du, H., Guo, T., Yu, X.: Affective behaviour analysis via integrating multi-modal knowledge. arXiv preprint arXiv:2403.10825 (2024)
arXiv 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.