Perception, reason, think, and plan: A survey on large multimodal reasoning models

Yunxin Li, Zhenyu Liu, Zitao Li, Xuanyu Zhang, Zhenran Xu, Xinyu Chen, Haoyuan Shi, Shenyuan Jiang, Xintong Wang, Jifang Wang, et al. · 2025 · arXiv 2505.04921

7 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models

cs.CV · 2026-05-11 · unverdicted · novelty 7.0

ViSRA boosts MLLM 3D spatial reasoning performance by up to 28.9% on unseen tasks via a plug-and-play video-based agent that extracts explicit spatial cues from expert models without any post-training.

NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference

cs.AR · 2026-05-22 · unverdicted · novelty 6.0

NASiC fuses CAM-based expert selection and multibit CIM computation in 3D NAND into one cycle for MoE LLM inference, claiming 4-114.8x performance and 3.9-70x energy efficiency gains over prior designs with high accuracy.

Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

TranCLR models continuous skeleton action spaces with transitional anchors and multi-level manifold calibration, yielding smoother and more accurate representations than binary contrastive methods.

Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 6.0

DeSAP uses decoupled cross-modal similarity plus visual saliency to prune visual tokens in LVLMs, retaining 11.1% tokens for 10x FLOPs reduction and 98.1% performance on LLaVA-1.5-7B.

Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents

cs.AI · 2025-12-03 · unverdicted · novelty 6.0

Argos is an agentic verifier that adaptively picks scoring functions to evaluate accuracy, localization, and reasoning quality, enabling stronger multimodal RL training for AI agents.

Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework

cs.CV · 2025-09-27 · unverdicted · novelty 6.0

DRP decouples reasoning from perception in LMMs by using an LLM reasoner to query an LMM observer for visual details as needed, reducing visual grounding loss.

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.

citing papers explorer

Showing 7 of 7 citing papers.

ViSRA: A Video-based Spatial Reasoning Agent for Multi-modal Large Language Models · cs.CV · 2026-05-11 · unverdicted · none · ref 25
NASiC: 3D NAND-based CAM-Selected Multibit CIM Architecture for Efficient On-Device Mixture-of-Experts LLM Inference · cs.AR · 2026-05-22 · unverdicted · none · ref 7
Beyond Binary Contrast: Modeling Continuous Skeleton Action Spaces with Transitional Anchors · cs.CV · 2026-04-20 · unverdicted · none · ref 26
Decoupled Similarity for Task-Aware Token Pruning in Large Vision-Language Models · cs.CV · 2026-04-13 · unverdicted · none · ref 24
Multimodal Reinforcement Learning with Adaptive Verifier for AI Agents · cs.AI · 2025-12-03 · unverdicted · none · ref 34
Mitigating Visual Context Degradation in Large Multimodal Models: A Training-Free Decoupled Agentic Framework · cs.CV · 2025-09-27 · unverdicted · none · ref 13
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering · cs.CV · 2026-04-09 · unverdicted · none · ref 18
