A survey on multimodal large language models.National Science Review, 11(12):nwae403

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen · 2024

8 Pith papers cite this work. Polarity classification is still indexing.

8 Pith papers citing it

browse 8 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos

cs.CV · 2026-05-08 · unverdicted · novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.

Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

cs.LG · 2026-04-11 · unverdicted · novelty 7.0

The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

LatentUMM: Dual Latent Alignment for Unified Multimodal Models

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.

Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.

Towards Long-horizon Agentic Multimodal Search

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.

ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs

cs.CV · 2026-05-08 · unverdicted · novelty 5.0

ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

cs.CV · 2026-05-20

citing papers explorer

Showing 8 of 8 citing papers.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos cs.CV · 2026-05-08 · unverdicted · none · ref 3
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era cs.LG · 2026-05-17 · unverdicted · none · ref 57
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation cs.LG · 2026-04-11 · unverdicted · none · ref 5
The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
LatentUMM: Dual Latent Alignment for Unified Multimodal Models cs.CV · 2026-05-18 · unverdicted · none · ref 53
LatentUMM proposes dual latent alignment at modality and capacity levels plus latent dynamics stabilization to reduce semantic drift and improve consistency in unified multimodal models.
Visual Latents Know More Than They Say: Unsilencing Latent Reasoning in MLLMs cs.LG · 2026-05-04 · unverdicted · none · ref 42
Visual latents in MLLMs are systematically silenced by autoregressive training but can be unsilenced at inference via query-guided contrastive alignment followed by a confidence-progression reward.
Towards Long-horizon Agentic Multimodal Search cs.CV · 2026-04-14 · unverdicted · none · ref 11
LMM-Searcher uses file-based visual UIDs and a fetch tool plus 12K synthesized trajectories to fine-tune a multimodal agent that scales to 100-turn horizons and reaches SOTA among open-source models on MM-BrowseComp and MMSearch-Plus.
ShellfishNet: A Domain-Specific Benchmark for Visual Recognition of Marine Molluscs cs.CV · 2026-05-08 · unverdicted · none · ref 101
ShellfishNet is a new benchmark of 8,691 images across 32 mollusc taxa for evaluating vision models on real-world underwater ecological monitoring tasks including robustness to degradation.
RISE: Reliable Improvement in Self-Evolving Vision-Language Models cs.CV · 2026-05-20 · unreviewed · ref 44

A survey on multimodal large language models.National Science Review, 11(12):nwae403

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer