pith. machine review for the scientific record.

arxiv: 2603.18472 · v2 · submitted 2026-03-19 · 💻 cs.AI · cs.CV


Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding


Pith reviewed 2026-05-15 09:07 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords multimodal large language models · symbol recognition · cognitive mismatch · visual grounding · discrete symbols · reasoning inversion · benchmark evaluation

The pith

Multimodal models underperform on elementary symbol recognition yet perform better on complex reasoning tasks that use those symbols.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that leading MLLMs display a consistent cognitive mismatch across a multi-domain benchmark covering language, culture, mathematics, physics and chemistry. Models are weaker at basic perception and recognition of discrete symbols such as handwritten characters, formula graphs, circuit diagrams and chemical structures, yet appear more capable on higher-level combination, reasoning and critical thinking tasks. A sympathetic reader would care because the pattern suggests current systems rely on linguistic priors, template retrieval or procedural shortcuts instead of building robust visual grounding for symbolic content. This inversion implies that apparent reasoning competence may rest on weak perceptual foundations, making symbolic understanding a central remaining bottleneck for multimodal systems.

Core claim

Across leading MLLMs, we observe a consistent cognitive mismatch. Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval or procedural reasoning instead of robust visual grounding. The pattern is especially clear for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit diagrams and chemical structures.

What carries the argument

The three-level cognitive benchmark, which places perception and recognition, combination and reasoning, and association and critical thinking on separate tiers, exposing the performance inversion on discrete symbols.
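To make that structure concrete, here is a minimal sketch, not the authors' code, of how benchmark items tagged by domain and cognitive level would expose the inversion; the item fields and function names are assumptions for illustration only.

```python
# Minimal sketch of the three-level x five-domain structure and the
# recognition-reasoning gap it is designed to expose. Illustrative only.
from dataclasses import dataclass

LEVELS = (
    "perception_recognition",         # Level 1: read the symbol itself
    "combination_reasoning",          # Level 2: reason over symbol structures
    "association_critical_thinking",  # Level 3: open-ended interpretation
)
DOMAINS = ("language", "culture", "math", "physics", "chemistry")

@dataclass
class Item:
    domain: str    # one of DOMAINS
    level: str     # one of LEVELS
    correct: bool  # whether the model answered this item correctly

def inversion_gap(items, domain):
    """Level-3 minus Level-1 accuracy; a positive gap is the reported inversion."""
    def acc(level):
        hits = [i.correct for i in items if i.domain == domain and i.level == level]
        return sum(hits) / len(hits) if hits else float("nan")
    return acc(LEVELS[2]) - acc(LEVELS[0])
```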

If this is right

  • Symbolic understanding remains a major bottleneck for multimodal intelligence.
  • Current systems compensate for weak visual grounding by drawing on language-based shortcuts.
  • Training and evaluation schemes should prioritize grounded perception in discrete semantic spaces.
  • The mismatch is clearest for low-redundancy symbols such as handwritten characters and chemical structures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the inversion persists, targeted improvements to visual encoders for discrete symbols could raise overall capability without further language scaling.
  • The same mismatch may appear in other sparse visual domains such as technical diagrams or musical notation.
  • Future work could test whether fine-tuning solely on recognition tasks closes the gap on higher-level reasoning benchmarks.

Load-bearing premise

The benchmark's division into cognitive levels isolates genuine visual grounding failures rather than reflecting artifacts of task design, prompting or overlap with training data.

What would settle it

A controlled test in which linguistic context and priors are removed from the input while keeping the same symbol images, measuring whether recognition accuracy rises to match or exceed the reported reasoning performance.
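One hedged sketch of that control, under assumed names (`query_model` and the item fields are hypothetical stand-ins, not an interface from the paper): show each symbol image once with its original, context-rich prompt and once with a stripped prompt, and compare recognition accuracy.

```python
# Sketch of the proposed control: remove linguistic context from the prompt
# while keeping the identical symbol image. Hypothetical interface throughout.
def run_context_ablation(items, query_model):
    """items: list of dicts with 'image', 'rich_prompt', 'answer' keys.
    query_model(image=..., prompt=...) -> str is an assumed model wrapper."""
    BARE_PROMPT = "Identify the symbol in the image. Answer with the symbol only."
    rich_hits = bare_hits = 0
    for item in items:
        rich = query_model(image=item["image"], prompt=item["rich_prompt"])
        bare = query_model(image=item["image"], prompt=BARE_PROMPT)
        rich_hits += rich.strip() == item["answer"]
        bare_hits += bare.strip() == item["answer"]
    n = len(items)
    # If bare-prompt accuracy stays low while reasoning scores stay high,
    # the inversion is unlikely to be a prompting artifact.
    return {"with_context": rich_hits / n, "context_stripped": bare_hits / n}
```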

Figures

Figures reproduced from arXiv: 2603.18472 by Daixian Liu, Hai-Tao Zheng, Jiayi Kuang, Junnan Dong, Liang Lin, Peng Xing, Philip S. Yu, Qingyu Zhou, Shu-Yu Guo, Wenhao Jiang, Yangning Li, Yinghui Li, Ying Shen, Yongheng Zhang.

Figure 1. Continuous and discrete semantic spaces differ fundamentally in how visual information is organized. Discrete symbol understanding requires precise recognition of individual symbolic units and their structured relations.

Figure 2. Performance and representative cases for language-symbol understanding across three cognitive levels. Models are evaluated on faked-character detection, misspelled-character detection and visual-semantic correction.

Figure 3. Performance and representative cases for cultural-symbol understanding. The tasks progress from lexical grounding of emojis to idiomatic composition and culturally mediated interpretation.

Figure 4. Performance and representative cases for mathematical-symbol understanding, covering function graphs and geometric figures across recognition, reasoning, and verification tasks.

Figure 5. Performance and representative cases for physical-symbol understanding across mechanics and electromagnetism. Sparse physical symbols remain difficult for most models, especially in visually grounded mapping tasks.

Figure 6. Performance and representative cases for chemical-symbol understanding. The benchmark examines molecular structure recognition, reaction reasoning and higher-level correction or prediction.

Figure 7. Overview of the benchmark task design framework and illustrative examples. (a) The instantiation of the three-level symbolic understanding hierarchy across five distinct domains: Language, Cultural, Mathematical, Physical, and Chemical symbols. (b) Representative examples of the tasks designed for our benchmark.

Figure 8. Cross-domain summary of benchmark results. (a) Radar charts illustrate the fine-grained performance of models across General, Language, Culture, Math, Physics and Chemistry domains. (b) Global performance aggregated by difficulty, with accuracy averaged across five symbolic domains. (c) Scatter plots exploring interrelationships between domains.

Figure 9. Human performance on the benchmark compared with the best-performing model, Gemini.

Figure 10. Case studies from the comprehensive analysis, highlighting model failures in recognizing and integrating symbols.

Figure 11. Overall performances of different models across the five symbolic domains.

Figure 12. Overview of the benchmark task design, including language symbols, cultural symbols, math symbols, chemistry symbols and physical symbols.

Figure 13. The data construction pipeline of our benchmark.

Figure 14. Dataset distribution. (a) Overall data distribution across five domains. (b) Detailed task distribution specifically for Mathematical symbols.
read the original abstract

Multimodal large language models (MLLMs) perform strongly on natural images, yet their ability to understand discrete visual symbols remains unclear. We present a multi-domain benchmark spanning language, culture, mathematics, physics and chemistry, organized into three cognitive levels: perception and recognition, combination and reasoning, and association and critical thinking. Across leading MLLMs, we observe a consistent cognitive mismatch. Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval or procedural reasoning instead of robust visual grounding. The pattern is especially clear for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit diagrams and chemical structures. These results show that symbolic understanding remains a major bottleneck for multimodal intelligence and motivate training and evaluation schemes that prioritize grounded perception in discrete semantic spaces.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a multi-domain benchmark for discrete symbol understanding in MLLMs spanning language, culture, mathematics, physics, and chemistry, organized into three cognitive levels (perception/recognition, combination/reasoning, and association/critical thinking). It reports a consistent recognition-reasoning inversion: models underperform on elementary symbol recognition (especially sparse symbols like handwritten characters, circuit diagrams, and chemical structures) but appear more competent on complex reasoning tasks, attributing this to reliance on linguistic priors and template retrieval rather than robust visual grounding.

Significance. If the inversion is shown to arise from visual grounding failures rather than benchmark artifacts, the result would identify a substantive limitation in current MLLMs for grounded symbolic reasoning in technical domains, motivating targeted training and evaluation protocols that prioritize perceptual grounding over linguistic compensation.

major comments (2)
  1. [Methods/Benchmark] Benchmark construction (Methods section): No quantitative controls are described for symbol frequency, visual redundancy, or differential textual context across the three cognitive levels. Without these, the reported performance inversion cannot be isolated from task-design artifacts, directly undermining the central claim that the mismatch reflects absent visual grounding.
  2. [Results] Results and evaluation (Results section): The abstract and results provide no model names/versions, raw accuracy numbers, error breakdowns, or statistical tests comparing recognition versus reasoning performance. This absence prevents assessment of whether the inversion is consistent, significant, or generalizable across the claimed domains.
minor comments (2)
  1. [Abstract] The abstract would benefit from a single sentence summarizing the number of models tested and the magnitude of the observed gaps to allow readers to gauge the effect size immediately.
  2. [Benchmark Description] Notation for the three cognitive levels is introduced without a clear table or diagram mapping example items from each domain to each level; this reduces readability when interpreting the inversion pattern.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for strengthening the methodological transparency and empirical reporting in our work. We address each major comment below and commit to revisions that enhance the rigor of the benchmark and results presentation.

read point-by-point responses
  1. Referee: [Methods/Benchmark] Benchmark construction (Methods section): No quantitative controls are described for symbol frequency, visual redundancy, or differential textual context across the three cognitive levels. Without these, the reported performance inversion cannot be isolated from task-design artifacts, directly undermining the central claim that the mismatch reflects absent visual grounding.

    Authors: We agree that the absence of explicit quantitative controls leaves room for potential task-design confounds. In the revised manuscript, we will add a dedicated subsection under Methods that reports: (1) symbol frequency and sparsity metrics (e.g., average stroke density and pixel entropy per symbol class), (2) visual redundancy measures (e.g., symmetry indices and intra-class visual similarity via perceptual hashing), and (3) textual context standardization (uniform minimal-prompt templates across levels). These additions will allow readers to better assess whether the observed inversion is attributable to visual grounding limitations rather than benchmark artifacts; two of these metrics are sketched in code after this list. revision: yes

  2. Referee: [Results] Results and evaluation (Results section): The abstract and results provide no model names/versions, raw accuracy numbers, error breakdowns, or statistical tests comparing recognition versus reasoning performance. This absence prevents assessment of whether the inversion is consistent, significant, or generalizable across the claimed domains.

    Authors: We acknowledge that the current Results section lacks the granular reporting needed for full reproducibility and statistical evaluation. The revised manuscript will expand the Results section to include: specific model names and versions (e.g., GPT-4o, Claude-3-Opus, LLaVA-1.6), full raw accuracy tables disaggregated by domain and cognitive level, categorized error breakdowns (perceptual misrecognition rates versus reasoning failures), and statistical tests (paired Wilcoxon signed-rank tests with effect sizes) comparing recognition versus reasoning performance within and across domains. These details will substantiate the consistency and generalizability of the inversion; the proposed statistical comparison is sketched after this list. revision: yes
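As a reading aid, both committed revisions can be prototyped. First, a minimal sketch of the sparsity and redundancy metrics, assuming grayscale symbol images on disk; the average hash here is a deliberately simple stand-in for the perceptual hashing the rebuttal names, and none of this is the authors' code.

```python
# Hedged sketch of two proposed controls: pixel entropy as a sparsity proxy
# and intra-class visual similarity via a simple average-hash.
# Illustrative reconstruction only, not code from the paper.
import numpy as np
from PIL import Image

def pixel_entropy(path: str) -> float:
    """Shannon entropy (bits) of the grayscale intensity histogram."""
    g = np.asarray(Image.open(path).convert("L"))
    counts = np.bincount(g.ravel(), minlength=256).astype(float)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def average_hash(path: str, size: int = 8) -> np.ndarray:
    """64-bit average hash: True where a downsampled pixel exceeds the mean."""
    g = np.asarray(Image.open(path).convert("L").resize((size, size)), dtype=float)
    return (g > g.mean()).ravel()

def intra_class_similarity(paths) -> float:
    """Mean pairwise fraction of matching hash bits within one symbol class."""
    hashes = [average_hash(p) for p in paths]
    pairs = [np.mean(a == b) for i, a in enumerate(hashes) for b in hashes[i + 1:]]
    return float(np.mean(pairs)) if pairs else float("nan")
```

Second, the paired comparison, using SciPy's wilcoxon; the accuracy arrays below are invented placeholders, since the text reviewed here reports no raw numbers.

```python
# Hedged sketch of the proposed recognition-vs-reasoning test. The arrays
# are hypothetical per-model accuracies, NOT results from the paper.
import numpy as np
from scipy.stats import wilcoxon

recognition = np.array([0.41, 0.38, 0.52, 0.47, 0.35])  # hypothetical Level-1 accuracy per model
reasoning   = np.array([0.63, 0.59, 0.66, 0.70, 0.58])  # hypothetical Level-3 accuracy per model

stat, p = wilcoxon(recognition, reasoning)  # paired, two-sided by default
# Matched-pairs rank-biserial correlation as the accompanying effect size.
n = len(recognition)
r = 1 - (2 * stat) / (n * (n + 1) / 2)
print(f"W={stat:.1f}, p={p:.4f}, rank-biserial r={r:.2f}")
```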

Circularity Check

0 steps flagged

Empirical benchmark observation with no derivations or self-referential reductions

full rationale

The paper introduces a multi-domain benchmark organized into three cognitive levels and reports observed performance patterns across MLLMs, noting underperformance on symbol recognition relative to reasoning tasks. No equations, fitted parameters, ansatzes, or derivation steps are present in the provided text. The central claim rests on direct model evaluations rather than any chain that reduces to its own inputs by construction. Self-citations are not invoked as load-bearing premises, and the analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The mismatch interpretation rests on the untested premise that benchmark levels cleanly separate visual perception from linguistic reasoning.

axioms (1)
  • domain assumption: The three defined cognitive levels validly distinguish perception failures from reasoning capabilities without confounding factors.
    Invoked to label the performance gap as a cognitive mismatch rather than a methodological artifact.

pith-pipeline@v0.9.0 · 5497 in / 1015 out tokens · 53178 ms · 2026-05-15T09:07:57.215009+00:00 · methodology

discussion (0)

