Multimodal LLMs perceive numbers accurately across modalities but fail at multi-digit multiplication, with performance predicted by an arithmetic load metric C and degradation confirmed as computational rather than perceptual.
Title resolution pending
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.
citing papers explorer
-
Multiplication in Multimodal LLMs: Computation with Text, Image, and Audio Inputs
Multimodal LLMs perceive numbers accurately across modalities but fail at multi-digit multiplication, with performance predicted by an arithmetic load metric C and degradation confirmed as computational rather than perceptual.
-
Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation
A GPT-style model pre-trained on large video datasets achieves 94.9% success on CALVIN multi-task manipulation and 85.4% zero-shot generalization, outperforming prior baselines.
-
NVIDIA Nemotron 3: Efficient and Open Intelligence
NVIDIA releases the Nemotron 3 model family with hybrid Mamba-Transformer architecture, LatentMoE, NVFP4 training, MTP layers, and multi-environment RL post-training for reasoning and agentic tasks.