Findings of the Association for Computational Linguistics: EMNLP 2024
3 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (3)
Verdicts: unverdicted (3)
Citing papers explorer
- Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
  Multimodal LLMs perceive numbers accurately across modalities but fail at multi-digit multiplication. Performance is predicted by an arithmetic load metric C, and the degradation is confirmed to be computational rather than perceptual.
- Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding
  Across domains, MLLMs show a consistent recognition-reasoning inversion on discrete visual symbols. They underperform on elementary perception yet appear competent at higher-level reasoning through linguistic compensation.
- Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters
  A systematic evaluation finds that cross-modal skill injection via model merging succeeds in instruction-following and cross-lingual scenarios but fails in mathematical reasoning. After hyperparameter analysis, the TA and DARE methods outperform the other merging methods.