Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
Pith reviewed 2026-05-15 09:07 UTC · model grok-4.3
The pith
Multimodal models underperform on elementary symbol recognition yet perform better on complex reasoning tasks that use those symbols.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across leading MLLMs, we observe a consistent cognitive mismatch. Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval or procedural reasoning instead of robust visual grounding. The pattern is especially clear for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit diagrams and chemical structures.
What carries the argument
The three-level cognitive benchmark, which separates perception and recognition, combination and reasoning, and association and critical thinking, and thereby exposes the performance inversion on discrete symbols.
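The inversion is ultimately a claim about per-level accuracies, so a minimal scoring sketch makes the comparison concrete. The level names follow the paper; the record format is an assumption for illustration, not the paper's evaluation harness.

```python
# Minimal sketch (assumed record format, not the paper's harness) of how
# per-level scoring would surface the recognition-reasoning inversion.
from collections import defaultdict

LEVELS = ("perception_recognition", "combination_reasoning", "association_critical")

def per_level_accuracy(records):
    """records: iterable of dicts with 'level' (one of LEVELS) and 'correct' (bool)."""
    totals, hits = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["level"]] += 1
        hits[rec["level"]] += int(rec["correct"])
    return {lvl: hits[lvl] / totals[lvl] for lvl in LEVELS if totals[lvl]}

def inversion_gap(accuracy):
    """Positive when mid-level reasoning outperforms elementary recognition."""
    return accuracy["combination_reasoning"] - accuracy["perception_recognition"]
```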
If this is right
- Symbolic understanding remains a major bottleneck for multimodal intelligence.
- Current systems compensate for weak visual grounding by drawing on language-based shortcuts.
- Training and evaluation schemes should prioritize grounded perception in discrete semantic spaces.
- The mismatch is clearest for low-redundancy symbols such as handwritten characters and chemical structures.
Where Pith is reading between the lines
- If the inversion persists, targeted improvements to visual encoders for discrete symbols could raise overall capability without further language scaling.
- The same mismatch may appear in other sparse visual domains such as technical diagrams or musical notation.
- Future work could test whether fine-tuning solely on recognition tasks closes the gap on higher-level reasoning benchmarks.
Load-bearing premise
The benchmark's division into cognitive levels isolates genuine visual grounding failures rather than reflecting artifacts of task design, prompting or overlap with training data.
What would settle it
A controlled test in which linguistic context and priors are removed from the input while keeping the same symbol images, measuring whether recognition accuracy rises to match or exceed the reported reasoning performance.
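One way to read that test operationally is the sketch below: the same symbol images are queried once with a minimal, context-free prompt and once inside the full task prompt, and the two accuracies are compared. The query_model callable and the item fields are illustrative assumptions, not the paper's protocol.

```python
# Sketch of the context-ablation test described above. The model interface
# (query_model) and item fields are illustrative assumptions, not the paper's
# actual evaluation protocol.
from dataclasses import dataclass

@dataclass
class SymbolItem:
    image_path: str       # the rendered symbol (e.g., a circuit-diagram crop)
    minimal_prompt: str   # context-free prompt: "Name the symbol shown."
    task_prompt: str      # original prompt with full linguistic context
    label: str            # ground-truth symbol identity

def context_ablation(items, query_model):
    """Recognition accuracy with and without linguistic context, same images."""
    stripped_hits, full_hits = 0, 0
    for item in items:
        # Condition A: image plus a minimal, uniform prompt (priors removed).
        pred_a = query_model(item.image_path, item.minimal_prompt)
        stripped_hits += int(item.label.lower() in pred_a.lower())
        # Condition B: the same image inside the richer original prompt.
        pred_b = query_model(item.image_path, item.task_prompt)
        full_hits += int(item.label.lower() in pred_b.lower())
    n = len(items)
    return {"context_free_acc": stripped_hits / n, "full_context_acc": full_hits / n}
```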
Original abstract
Multimodal large language models (MLLMs) perform strongly on natural images, yet their ability to understand discrete visual symbols remains unclear. We present a multi-domain benchmark spanning language, culture, mathematics, physics and chemistry, organized into three cognitive levels: perception and recognition, combination and reasoning, and association and critical thinking. Across leading MLLMs, we observe a consistent cognitive mismatch. Models frequently underperform on elementary symbol recognition while appearing relatively competent on more complex reasoning tasks. This recognition-reasoning inversion indicates that current systems often compensate with linguistic priors, template retrieval or procedural reasoning instead of robust visual grounding. The pattern is especially clear for sparse, low-redundancy symbols such as handwritten characters, formula graphs, circuit diagrams and chemical structures. These results show that symbolic understanding remains a major bottleneck for multimodal intelligence and motivate training and evaluation schemes that prioritize grounded perception in discrete semantic spaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a multi-domain benchmark for discrete symbol understanding in MLLMs spanning language, culture, mathematics, physics, and chemistry, organized into three cognitive levels (perception/recognition, combination/reasoning, and association/critical thinking). It reports a consistent recognition-reasoning inversion: models underperform on elementary symbol recognition (especially sparse symbols like handwritten characters, circuit diagrams, and chemical structures) but appear more competent on complex reasoning tasks, attributing this to reliance on linguistic priors and template retrieval rather than robust visual grounding.
Significance. If the inversion is shown to arise from visual grounding failures rather than benchmark artifacts, the result would identify a substantive limitation in current MLLMs for grounded symbolic reasoning in technical domains, motivating targeted training and evaluation protocols that prioritize perceptual grounding over linguistic compensation.
Major comments (2)
- [Methods/Benchmark] Benchmark construction (Methods section): No quantitative controls are described for symbol frequency, visual redundancy, or differential textual context across the three cognitive levels. Without these, the reported performance inversion cannot be isolated from task-design artifacts, directly undermining the central claim that the mismatch reflects absent visual grounding.
- [Results] Results and evaluation (Results section): The abstract and results provide no model names/versions, raw accuracy numbers, error breakdowns, or statistical tests comparing recognition versus reasoning performance. This absence prevents assessment of whether the inversion is consistent, significant, or generalizable across the claimed domains.
Minor comments (2)
- [Abstract] The abstract would benefit from a single sentence summarizing the number of models tested and the magnitude of the observed gaps to allow readers to gauge the effect size immediately.
- [Benchmark Description] Notation for the three cognitive levels is introduced without a clear table or diagram mapping example items from each domain to each level; this reduces readability when interpreting the inversion pattern.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the methodological transparency and empirical reporting in our work. We address each major comment below and commit to revisions that enhance the rigor of the benchmark and results presentation.
Point-by-point responses
Referee: [Methods/Benchmark] Benchmark construction (Methods section): No quantitative controls are described for symbol frequency, visual redundancy, or differential textual context across the three cognitive levels. Without these, the reported performance inversion cannot be isolated from task-design artifacts, directly undermining the central claim that the mismatch reflects absent visual grounding.
Authors: We agree that the absence of explicit quantitative controls leaves room for potential task-design confounds. In the revised manuscript, we will add a dedicated subsection under Methods that reports: (1) symbol frequency and sparsity metrics (e.g., average stroke density and pixel entropy per symbol class), (2) visual redundancy measures (e.g., symmetry indices and intra-class visual similarity via perceptual hashing), and (3) textual context standardization (uniform minimal-prompt templates across levels). These additions will allow readers to better assess whether the observed inversion is attributable to visual grounding limitations rather than benchmark artifacts. revision: yes
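As a rough illustration of the kind of metrics promised in this response, the sketch below computes a pixel-entropy sparsity measure and an intra-class perceptual-hash similarity for symbol images; the exact definitions in the revised manuscript may differ.

```python
# Illustrative sketch of sparsity/redundancy metrics of the kind described in
# the rebuttal; the paper's exact definitions may differ.
import numpy as np
from PIL import Image

def pixel_entropy(path, bins=256):
    """Shannon entropy (bits) of the grayscale intensity histogram."""
    img = np.asarray(Image.open(path).convert("L"), dtype=np.float64)
    hist, _ = np.histogram(img, bins=bins, range=(0, 255), density=True)
    p = hist[hist > 0]
    p = p / p.sum()
    return float(-(p * np.log2(p)).sum())

def average_hash(path, size=8):
    """Simple perceptual (average) hash: a 64-bit boolean signature."""
    img = np.asarray(Image.open(path).convert("L").resize((size, size)))
    return (img > img.mean()).flatten()

def intra_class_similarity(paths):
    """Mean pairwise hash agreement within one symbol class (0 to 1)."""
    hashes = [average_hash(p) for p in paths]
    sims = [
        (hashes[i] == hashes[j]).mean()
        for i in range(len(hashes))
        for j in range(i + 1, len(hashes))
    ]
    return float(np.mean(sims)) if sims else 1.0
```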
Referee: [Results] Results and evaluation (Results section): The abstract and results provide no model names/versions, raw accuracy numbers, error breakdowns, or statistical tests comparing recognition versus reasoning performance. This absence prevents assessment of whether the inversion is consistent, significant, or generalizable across the claimed domains.
Authors: We acknowledge that the current Results section lacks the granular reporting needed for full reproducibility and statistical evaluation. The revised manuscript will expand the Results section to include: specific model names and versions (e.g., GPT-4o, Claude-3-Opus, LLaVA-1.6, etc.), full raw accuracy tables disaggregated by domain and cognitive level, categorized error breakdowns (perceptual misrecognition rates versus reasoning failures), and statistical tests (paired Wilcoxon signed-rank tests with effect sizes) comparing recognition versus reasoning performance within and across domains. These details will substantiate the consistency and generalizability of the inversion. revision: yes
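The promised statistical comparison could look like the sketch below, which applies SciPy's paired Wilcoxon signed-rank test with a rank-biserial effect size; the accuracy numbers are placeholders, not results from the paper.

```python
# Sketch of the paired recognition-vs-reasoning comparison described above,
# for one model across domains. The accuracy values are placeholders.
import numpy as np
from scipy.stats import wilcoxon

recognition = np.array([0.41, 0.35, 0.52, 0.47, 0.38])  # per-domain accuracy
reasoning   = np.array([0.63, 0.58, 0.61, 0.66, 0.55])

stat, p_value = wilcoxon(recognition, reasoning)

# Matched-pairs rank-biserial correlation as an effect size.
diff = reasoning - recognition
ranks = np.argsort(np.argsort(np.abs(diff))) + 1  # ranks of |differences|
r_rb = (ranks[diff > 0].sum() - ranks[diff < 0].sum()) / ranks.sum()

print(f"Wilcoxon W={stat:.2f}, p={p_value:.4f}, rank-biserial r={r_rb:.2f}")
```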
Circularity Check
Empirical benchmark observation with no derivations or self-referential reductions
Full rationale
The paper introduces a multi-domain benchmark organized into three cognitive levels and reports observed performance patterns across MLLMs, noting underperformance on symbol recognition relative to reasoning tasks. No equations, fitted parameters, ansatzes, or derivation steps are present in the provided text. The central claim rests on direct model evaluations rather than any chain that reduces to its own inputs by construction. Self-citations are not invoked as load-bearing premises, and the analysis remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the three defined cognitive levels validly distinguish perception failures from reasoning capabilities without confounding factors.