
arxiv: 2512.03043 · v3 · submitted 2025-12-02 · 💻 cs.CV

OneThinker: All-in-one Reasoning Model for Image and Video

Pith reviewed 2026-05-17 02:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords unified multimodal model · image and video reasoning · visual question answering · spatial temporal grounding · object tracking · reinforcement learning · chain of thought · multimodal generalist

The pith

OneThinker trains a single model to reason over both images and videos across ten core visual tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current approaches train separate models for image and video reasoning, which limits scalability and prevents knowledge from moving between tasks. The paper instead builds one model that covers question answering, captioning, spatial and temporal grounding, tracking, and segmentation. It assembles a 600k-example corpus (OneThinker-600k), uses commercial models to add chain-of-thought labels to the OneThinker-SFT-340k subset for a supervised cold start, then applies a new reinforcement learning method called EMA-GRPO that balances training by tracking a moving average of the reward standard deviation for each task. Experiments show competitive results on 31 benchmarks, plus signs that training on one task helps certain others and that the model can handle some unseen cases. The authors present this setup as progress toward a single versatile multimodal reasoning system.

Core claim

OneThinker is an all-in-one model that unifies image and video understanding across question answering, captioning, spatial and temporal grounding, tracking, and segmentation by training on the OneThinker-600k corpus with supervised fine-tuning followed by EMA-GRPO reinforcement learning; the resulting system reaches strong performance on 31 benchmarks, exhibits knowledge transfer between tasks, and displays preliminary zero-shot generalization.

What carries the argument

EMA-GRPO, a multi-task reinforcement learning procedure that maintains task-wise moving averages of reward standard deviations to equalize optimization pressure when rewards differ across tasks.
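
To make the mechanism concrete, here is a minimal sketch of task-wise reward scaling. It is an illustration under stated assumptions, not the authors' implementation: the class name `TaskwiseEmaNormalizer`, the `decay` and `eps` values, and the choice to average the per-group population standard deviation are all hypothetical. The only detail taken from the summary above is that a moving average of reward standard deviation is kept per task and used to balance optimization.

```python
from statistics import mean, pstdev


class TaskwiseEmaNormalizer:
    """Hypothetical helper: keeps one EMA of reward std per task."""

    def __init__(self, decay=0.99, eps=1e-6):
        self.decay = decay  # assumed EMA decay rate
        self.eps = eps      # avoids division by zero when a task's std collapses
        self.ema_std = {}   # task name -> running estimate of reward std

    def advantages(self, task, rewards):
        """Centre one rollout group on its mean, then scale by the task's EMA std."""
        group_std = pstdev(rewards)
        if task not in self.ema_std:
            # First group seen for this task initialises the running estimate.
            self.ema_std[task] = group_std
        else:
            previous = self.ema_std[task]
            self.ema_std[task] = self.decay * previous + (1 - self.decay) * group_std
        centre = mean(rewards)
        scale = self.ema_std[task] + self.eps
        return [(r - centre) / scale for r in rewards]


normalizer = TaskwiseEmaNormalizer(decay=0.9)
# Binary correctness rewards (QA-style) and dense overlap rewards (grounding-style)
# end up on a comparable scale because each task divides by its own running std.
qa_adv = normalizer.advantages("image_qa", [0.0, 1.0, 1.0, 0.0])
grounding_adv = normalizer.advantages("spatial_grounding", [0.50, 0.55, 0.60, 0.65])
```

Because the scale comes from a running estimate rather than from the current group alone, a group whose rewards happen to be nearly identical does not have its small differences blown up, which is one plausible reading of "equalize optimization pressure".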

If this is right

  • Knowledge transfers positively between some pairs of tasks during joint training.
  • The model shows initial zero-shot generalization on certain held-out cases.
  • The approach scales training toward a single multimodal reasoning generalist instead of many narrow models.
  • Performance holds across 31 benchmarks spanning 10 fundamental visual tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment for applications that mix image and video inputs could become simpler with one model instead of several.
  • The reward-balancing technique may help other multi-task reinforcement learning settings avoid negative interference.
  • Extending the same corpus construction and training recipe to additional modalities could test broader unification.
  • Real-world robustness would require checking performance under distribution shifts not covered in the current benchmarks.

Load-bearing premise

The combined 600k corpus and EMA-GRPO training produce real unification and positive transfer rather than hidden performance losses on some tasks or modalities that would appear under stricter separate-versus-joint comparisons.

What would settle it

Train separate task-specific models on the same data splits and measure whether any individual task score falls when the model is instead trained jointly as OneThinker.
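
The comparison described above reduces to a per-task difference between two score tables. The sketch below shows one way to flag regressions; the function name, the `tolerance` parameter, and the task names and scores in the usage lines are hypothetical placeholders, not results from the paper.

```python
def joint_training_regressions(single_task_scores, joint_scores, tolerance=0.0):
    """Return {task: delta} for tasks where the joint model scores lower.

    delta = joint score - single-task score; a task is reported only when
    delta falls below -tolerance.
    """
    regressions = {}
    for task, baseline in single_task_scores.items():
        delta = joint_scores[task] - baseline
        if delta < -tolerance:
            regressions[task] = delta
    return regressions


# Placeholder numbers for illustration only.
single = {"task_a": 60.0, "task_b": 45.0, "task_c": 70.0}
joint = {"task_a": 62.0, "task_b": 44.0, "task_c": 70.0}
flagged = joint_training_regressions(single, joint, tolerance=0.5)
```

A non-empty result on matched data splits would be the kind of evidence that undercuts the unification claim; an empty one would support it for the tasks tested.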

Figures

Figures reproduced from arXiv: 2512.03043 by Dian Zheng, Haoze Sun, Hongyu Li, Kaituo Feng, Kaixuan Fan, Manyuan Zhang, Peiwen Sun, Peng Pei, Shuang Chen, Xiangyu Yue, Xunliang Cai, Yan Feng, Yilei Jiang, Yiyuan Zhang.

Figure 1: Overview of our OneThinker, which is capable of thinking across a wide range of tasks for image and video
Figure 2: Performance gains of our model over Qwen3-VL-Instruct-8B across diverse visual tasks after training.
Figure 3: Overview of our curated training dataset, including both image and video modalities for a diverse range of…
Figure 4: Comparison of advantage formulations in three RL algorithms.
Figure 5: Performance on unseen visual tasks. [Section 5, Conclusion] In this work, we present OneThinker, an all-in-one multimodal reasoning model that unifies diverse visual foundation tasks for images and videos. To support training, we construct OneThinker-600k dataset for RL training and its CoT-annotated subset OneThinker-SFT-340k for SFT cold start. We further propose EMA-GRPO, an RL algorithm that balances optimization…
Figure 6: Reasoning example of image question answering task.
Figure 7: Reasoning example of video question answering task.
Figure 8: Reasoning example of image caption task.
Figure 9: Reasoning example of video caption task.
Figure 10: Reasoning example of temporal grounding task.
Figure 11: Reasoning example of spatial grounding task.
Figure 12: Reasoning example of spatial-temporal grounding task.
Figure 13: Reasoning example of tracking task.
Figure 14: Reasoning example for an image segmentation task. The resulting answer will be forwarded to SAM2 to…
Figure 15: Reasoning example for a video segmentation task. The resulting answer will be forwarded to SAM2 to…
Figure 16: System prompt for all tasks. "multiple choice": ( "Please provide only the single option letter (e.g., A, B, C, D, etc.) " "within the <answer>...</answer> tags.\n" "Example:\n<answer>A</answer>" ), "numerical": ( "Please provide only the numerical value within the <answer>...</answer> tags.\n" "Example:\n<answer>3.14</answer>" ), "OCR": ( "Please provide only the transcribed text within the <answer>...</…
Figure 17: Prompt for QA tasks.
Figure 18: Prompt for grounding and tracking tasks.
Figure 19: Prompt for segmentation tasks.
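
The prompt text quoted in the Figure 16 caption asks the model to place its final output inside `<answer>...</answer>` tags. A minimal sketch of how a scorer might pull that value out of a response follows. The function name and the choice to take the last tagged span are assumptions for illustration; the paper's own parsing code is not shown on this page.

```python
import re

# Non-greedy match so that each <answer>...</answer> pair is captured separately;
# DOTALL lets the answer span multiple lines (e.g. transcribed OCR text).
ANSWER_PATTERN = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(response):
    """Return the text of the last <answer> span, or None if no span is present."""
    matches = ANSWER_PATTERN.findall(response)
    if not matches:
        return None
    return matches[-1].strip()


choice = extract_answer("The chart peaks in March. <answer>A</answer>")
number = extract_answer("Summing the two values gives <answer> 3.14 </answer>")
```

Returning `None` for an untagged response lets a caller treat a missing answer as a format failure rather than silently scoring an empty string.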
Original abstract

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces OneThinker, an all-in-one multimodal reasoning model that unifies image and video understanding across 10 fundamental visual tasks (question answering, captioning, spatial/temporal grounding, tracking, segmentation). It constructs the OneThinker-600k corpus with commercial-model CoT annotations for SFT cold-start, then applies the proposed EMA-GRPO algorithm that tracks task-wise moving averages of reward standard deviations to balance multi-task RL optimization under heterogeneous rewards. The central empirical claim is strong performance across 31 benchmarks together with effective cross-task knowledge transfer and preliminary zero-shot generalization, positioning the work as progress toward a unified multimodal reasoning generalist. All code, model, and data are released.

Significance. If the unification and transfer claims are substantiated by rigorous controls, the work would constitute a meaningful step toward scalable multimodal generalists that exploit positive knowledge sharing across modalities and tasks. The open release of artifacts is a clear strength that supports reproducibility. The EMA-GRPO mechanism addresses a practical difficulty in multi-task RL with non-commensurate rewards, and the scale of the 600k corpus plus the breadth of 31 benchmarks give the empirical scope potential impact in computer vision and multimodal learning.

major comments (3)
  1. [Experimental Evaluation] Experimental section: the manuscript reports aggregate strong performance on 31 benchmarks and 'effective knowledge transfer between certain tasks' but supplies no single-task versus multi-task ablation, no per-modality (image vs. video) performance breakdown, and no comparison of EMA-GRPO against standard GRPO or task-specific RL baselines. These omissions are load-bearing for the unification claim, because the observed scores could be explained by data scale alone rather than by genuine cross-task/cross-modal positive transfer without hidden trade-offs.
  2. [Method] Method (EMA-GRPO description): while the task-wise moving-average normalization of reward standard deviations is introduced to handle reward heterogeneity, the paper does not quantify how this normalization affects optimization dynamics on harder versus easier tasks, nor does it demonstrate that the chosen EMA decay rate and scaling hyper-parameters are robust across the 10 tasks.
  3. [Results] Results tables/figures: without explicit reporting of per-task or per-modality deltas between the SFT checkpoint and the final RL checkpoint, it remains unclear whether any observed gains on certain tasks come at the expense of others, undermining the 'no hidden performance trade-offs' premise of the all-in-one generalist narrative.
minor comments (2)
  1. [Method] The precise mathematical definition of the EMA update and the task-wise reward-std scaling factor should be given as numbered equations rather than prose to improve reproducibility.
  2. [Figures] Figure captions for the benchmark radar or bar charts should explicitly state whether the plotted scores are zero-shot, few-shot, or fine-tuned, and whether they include the SFT baseline for direct comparison.
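
For orientation, numbered equations of the kind requested in the first minor comment could take the following shape. This is one plausible form consistent with the prose description, not the paper's own definition: the symbols $\beta$ (EMA decay), $\sigma_k^{(t)}$ (running reward standard deviation for task $k$ at step $t$), $G$ (rollout group size), $r_i$ (reward of rollout $i$), and $\epsilon$ (stabilizer) are all assumed here.

```latex
\begin{align}
\sigma_k^{(t)} &= \beta\,\sigma_k^{(t-1)} + (1-\beta)\,\operatorname{std}\!\big(\{r_i\}_{i=1}^{G}\big), \tag{1}\\
\hat{A}_i &= \frac{r_i - \frac{1}{G}\sum_{j=1}^{G} r_j}{\sigma_k^{(t)} + \epsilon}. \tag{2}
\end{align}
```

Under this form, (1) is the task-wise moving average and (2) divides the group-centred reward by it, so every task's advantages are scaled by that task's own running statistic.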

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important aspects needed to strengthen the evidence for cross-task and cross-modal unification in OneThinker. We have revised the manuscript to incorporate additional ablations, breakdowns, and analyses addressing these points. Our point-by-point responses follow.

  1. Referee: [Experimental Evaluation] Experimental section: the manuscript reports aggregate strong performance on 31 benchmarks and 'effective knowledge transfer between certain tasks' but supplies no single-task versus multi-task ablation, no per-modality (image vs. video) performance breakdown, and no comparison of EMA-GRPO against standard GRPO or task-specific RL baselines. These omissions are load-bearing for the unification claim, because the observed scores could be explained by data scale alone rather than by genuine cross-task/cross-modal positive transfer without hidden trade-offs.

    Authors: We agree that explicit controls are necessary to isolate the benefits of unification from data scale. The original manuscript substantiates transfer via joint training gains and zero-shot results on held-out task combinations, but we acknowledge the value of direct comparisons. In the revision we add a single-task versus multi-task ablation on four representative tasks (two image, two video), per-modality performance tables, and direct comparisons of EMA-GRPO to both standard GRPO and task-specific RL. These additions show positive transfer on several tasks with no large negative trade-offs and confirm EMA-GRPO's balancing effect. revision: yes

  2. Referee: [Method] Method (EMA-GRPO description): while the task-wise moving-average normalization of reward standard deviations is introduced to handle reward heterogeneity, the paper does not quantify how this normalization affects optimization dynamics on harder versus easier tasks, nor does it demonstrate that the chosen EMA decay rate and scaling hyper-parameters are robust across the 10 tasks.

    Authors: We have expanded the method section with a new analysis subsection. We now report reward-standard-deviation trajectories for harder and easier tasks before and after normalization, showing that normalization prevents high-variance tasks from dominating gradient updates. We also include a sensitivity study across EMA decay rates (0.9, 0.95, 0.99) and scaling factors, demonstrating stable performance across the 10 tasks within the chosen hyper-parameter range. revision: yes

  3. Referee: [Results] Results tables/figures: without explicit reporting of per-task or per-modality deltas between the SFT checkpoint and the final RL checkpoint, it remains unclear whether any observed gains on certain tasks come at the expense of others, undermining the 'no hidden performance trade-offs' premise of the all-in-one generalist narrative.

    Authors: We have updated the results section with explicit per-task and per-modality delta tables comparing the SFT and final RL checkpoints. The deltas indicate net positive or neutral changes across tasks and modalities, with no substantial regressions, supporting the claim of balanced multi-task optimization under EMA-GRPO. revision: yes
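
The decay rates named in the second response (0.9, 0.95, 0.99) differ mainly in how much history the moving average retains. The sketch below uses two standard properties of an exponential moving average with update `ema = decay * ema + (1 - decay) * x`: its effective window is roughly `1 / (1 - decay)` updates, and the total weight on observations at least `steps` updates old is `decay ** steps`. The function names are hypothetical, and nothing here reproduces the authors' sensitivity study.

```python
def ema_effective_window(decay):
    """Approximate number of recent updates that dominate the average."""
    return 1.0 / (1.0 - decay)


def ema_weight_older_than(decay, steps):
    """Total weight on observations that are at least `steps` updates old."""
    return decay ** steps


# Decay rates taken from the rebuttal text; the 50-step horizon is an arbitrary choice.
for decay in (0.9, 0.95, 0.99):
    window = ema_effective_window(decay)
    stale = ema_weight_older_than(decay, 50)
    print(f"decay={decay}: window ~{window:.0f} updates, weight older than 50 updates = {stale:.3f}")
```

The three settings therefore span roughly an order of magnitude in memory length, which is why stability across them would be a meaningful robustness signal.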

Circularity Check

0 steps flagged

No circularity: empirical training and benchmark results with independent evaluation

full rationale

The paper is an empirical ML training study that constructs a 600k corpus, applies SFT then EMA-GRPO, and reports benchmark numbers on 31 external datasets. No derivation chain exists that reduces a claimed prediction or first-principles result to its own inputs by construction. EMA-GRPO is a proposed training heuristic whose definition and effect are evaluated on held-out benchmarks rather than being tautological. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the abstract or described method. The unification/transfer claims rest on observed benchmark scores, which are falsifiable externally and not forced by the training procedure itself.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Approach rests on standard RL assumptions for MLLMs and the quality of commercial-model CoT labels; no new physical entities or ungrounded constants introduced.

free parameters (1)
  • EMA decay rate and task-wise reward std-dev scaling
    Hyperparameters in EMA-GRPO that control how moving averages balance heterogeneous task rewards.
axioms (1)
  • domain assumption: Commercial models generate sufficiently accurate chain-of-thought annotations for the target tasks
    Invoked when constructing the OneThinker-SFT-340k dataset from the 600k corpus.

pith-pipeline@v0.9.0 · 5576 in / 1290 out tokens · 45436 ms · 2026-05-17T02:04:19.448245+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

    cs.CV 2026-05 unverdicted novelty 7.0

    IC-Seg is a new agentic framework using multi-turn clarification and Hi-GRPO hierarchical optimization to resolve ambiguous queries in referring video object segmentation while maintaining performance on standard benchmarks.

  2. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  3. Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

    cs.CL 2026-05 unverdicted novelty 7.0

    LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

  4. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 7.0

    Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

  5. CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

  6. Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

  7. VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

    cs.CV 2026-05 unverdicted novelty 6.0

    VersusQ introduces a pairwise margin reasoning framework using large multimodal models to predict signed continuous quality margins between video pairs, claiming improved cross-domain generalization over pointwise sco...

  8. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 6.0

    PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

  9. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  10. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 6.0

    Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware e...

  11. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  12. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

  13. Gen-Searcher: Reinforcing Agentic Search for Image Generation

    cs.CV 2026-03 unverdicted novelty 6.0

    Gen-Searcher is the first search-augmented image generation agent trained with SFT followed by agentic RL using dual text and image rewards on custom datasets and the KnowGen benchmark.

  14. Vision-aligned Latent Reasoning for Multi-modal Large Language Model

    cs.CV 2026-02 unverdicted novelty 6.0

    VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

  15. Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training

    cs.CL 2026-02 unverdicted novelty 6.0

    Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.

  16. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

  17. Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

    cs.CV 2026-05 unverdicted novelty 5.0

    PVM adds a parallel learnable branch to LVLMs that supplies visual embeddings on demand to structurally prevent attention decay and visual signal dilution during deep autoregressive generation.

  18. Agentic AI for Remote Sensing: Technical Challenges and Research Directions

    cs.CV 2026-04 unverdicted novelty 5.0

    Agentic AI for remote sensing requires new designs centered on structured geospatial state, tool-aware reasoning, verifier-guided execution, and physical validity rather than generic extensions.

  19. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

  20. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    A visual replay module combined with adaptive depth scaling improves multimodal latent reasoning, delivering state-of-the-art benchmark results and faster inference than explicit chain-of-thought methods.

  21. Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

    cs.CV 2026-04 unverdicted novelty 5.0

    Visual replay and depth scaling in latent reasoning produce state-of-the-art multimodal results with faster inference than explicit CoT.

  22. EasyVideoR1: Easier RL for Video Understanding

    cs.CV 2026-04 unverdicted novelty 4.0

    EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 16 Pith papers · 31 internal anchors

  1. [1]

    Multimodal Chain-of-Thought Reasoning in Language Models

    Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. Multimodal chain-of-thought reasoning in language models.arXiv preprint arXiv:2302.00923, 2023

  2. [2]

    Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms

    Jiakang Yuan, Tianshuo Peng, Yilei Jiang, Yiting Lu, Renrui Zhang, Kaituo Feng, Chaoyou Fu, Tao Chen, Lei Bai, Bo Zhang, et al. Mme-reasoning: A comprehensive benchmark for logical reasoning in mllms.arXiv preprint arXiv:2505.21327, 2025

  3. [3]

    Reinforced mllm: A survey on rl-based reasoning in multimodal large language models.arXiv preprint arXiv:2504.21277, 2025

    Guanghao Zhou, Panjia Qiu, Cen Chen, Jie Wang, Zheming Yang, Jian Xu, and Minghui Qiu. Reinforced mllm: A survey on rl-based reasoning in multimodal large language models.arXiv preprint arXiv:2504.21277, 2025

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  5. [5]

    Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing

    Junfei Wu, Jian Guan, Kaituo Feng, Qiang Liu, Shu Wu, Liang Wang, Wei Wu, and Tieniu Tan. Reinforcing spatial reasoning in vision-language models with interwoven thinking and visual drawing.arXiv preprint arXiv:2506.09965, 2025

  6. [6]

    Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

    Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, and Jiaya Jia. Seg-zero: Reasoning-chain guided segmentation via cognitive reinforcement.arXiv preprint arXiv:2503.06520, 2025

  7. [7]

    Critique-grpo: Advancing llm reasoning with natural language and numerical feedback.arXiv preprint arXiv:2506.03106, 2025

    Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, and Helen Meng. Critique-grpo: Advancing llm reasoning with natural language and numerical feedback.arXiv preprint arXiv:2506.03106, 2025

  8. [8]

    Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

    Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

  9. [9]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025

  10. [10]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models.arXiv preprint arXiv:2503.06749, 2025

  11. [11]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model.arXiv preprint arXiv:2504.07615, 2025

  12. [12]

    Seg-r1: Segmentation can be surprisingly simple with reinforcement 33 ConceptSeg-R1 learning

    Zuyao You and Zuxuan Wu. Seg-r1: Segmentation can be surprisingly simple with reinforcement learning.arXiv preprint arXiv:2506.22624, 2025. 12 OneThinker: All-in-one Reasoning Model for Image and Video

  13. [13]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025

  14. [14]

    arXiv preprint arXiv:2504.07954 , year =

    En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025

  15. [15]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.arXiv preprint arXiv:2508.04416, 2025

    Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning.arXiv preprint arXiv:2508.04416, 2025

  16. [16]

    Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5-vl technical report.arXiv preprint arXiv:2505.07062, 2025

  17. [17]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective.arXiv preprint arXiv:2503.20783, 2025

  18. [18]

    Uncalibrated reasoning: Grpo induces overconfidence for stochastic outcomes.arXiv preprint arXiv:2508.11800,

    Michael Bereket and Jure Leskovec. Uncalibrated reasoning: Grpo induces overconfidence for stochastic outcomes.arXiv preprint arXiv:2508.11800, 2025

  19. [19]

    arXiv preprint arXiv:2504.02546 , year=

    Xiangxiang Chu, Hailang Huang, Xiao Zhang, Fei Wei, and Yong Wang. Gpg: A simple and strong reinforcement learning baseline for model reasoning.arXiv preprint arXiv:2504.02546, 2025

  20. [20]

    Mapo: Mixed advantage policy optimization.arXiv preprint arXiv:2509.18849, 2025

    Wenke Huang, Quan Zhang, Yiyang Fang, Jian Liang, Xuankun Rong, Huanjin Yao, Guancheng Wan, Ke Liang, Wenwen He, Mingjun Li, et al. Mapo: Mixed advantage policy optimization.arXiv preprint arXiv:2509.18849, 2025

  21. [21]

    Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  22. [22]

    Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Computer Vision, pages 169–186

    Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. Mathverse: Does your multi-modal llm truly see the diagrams in visual math problems? InEuropean Conference on Computer Vision, pages 169–186. Springer, 2024

  23. [23]

    Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019

  24. [24]

    One token to seg them all: Language instructed reasoning segmentation in videos

    Zechen Bai, Tong He, Haiyang Mei, Pichao Wang, Ziteng Gao, Joya Chen, Zheng Zhang, and Mike Zheng Shou. One token to seg them all: Language instructed reasoning segmentation in videos. Advances in Neural Information Processing Systems, 37:6833–6859, 2024

  25. [25]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025

  26. [26]

    Agentic Reinforced Policy Optimization

    Guanting Dong, Hangyu Mao, Kai Ma, Licheng Bao, Yifei Chen, Zhongyuan Wang, Zhongxia Chen, Jiazhen Du, Huiyang Wang, Fuzheng Zhang, et al. Agentic reinforced policy optimization. arXiv preprint arXiv:2507.19849, 2025

  27. [27]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  28. [28]

    Logic-RL: Unleashing LLM Reasoning with Rule-Based Reinforcement Learning

    Tian Xie, Zitian Gao, Qingnan Ren, Haoming Luo, Yuqian Hong, Bryan Dai, Joey Zhou, Kai Qiu, Zhirong Wu, and Chong Luo. Logic-rl: Unleashing llm reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2502.14768, 2025

  29. [29]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025

  30. [30]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training. arXiv preprint arXiv:2505.10978, 2025

  31. [31]

    Reinforcement fine-tuning powers reasoning capability of multimodal large language models

    Haoyuan Sun, Jiaqi Wu, Bo Xia, Yifu Luo, Yifei Zhao, Kai Qin, Xufei Lv, Tiantian Zhang, Yongzhe Chang, and Xueqian Wang. Reinforcement fine-tuning powers reasoning capability of multimodal large language models. arXiv preprint arXiv:2505.18536, 2025

  32. [32]

    Spacevista: All-scale visual spatial reasoning from mm to km

    Peiwen Sun, Shiqiang Lang, Dongming Wu, Yi Ding, Kaituo Feng, Huadai Liu, Zhen Ye, Rui Liu, Yun-Hui Liu, Jianan Wang, et al. Spacevista: All-scale visual spatial reasoning from mm to km. arXiv preprint arXiv:2510.09606, 2025

  33. [33]

    Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images

    Chengqi Duan, Kaiyue Sun, Rongyao Fang, Manyuan Zhang, Yan Feng, Ying Luo, Yufang Liu, Ke Wang, Peng Pei, Xunliang Cai, et al. Codeplot-cot: Mathematical visual reasoning by thinking with code-driven images. arXiv preprint arXiv:2510.11718, 2025

  34. [34]

    Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning

    Shuang Chen, Yue Guo, Zhaochen Su, Yafu Li, Yulun Wu, Jiacheng Chen, Jiayu Chen, Weijie Wang, Xiaoye Qu, and Yu Cheng. Advancing multimodal reasoning: From optimized cold start to staged reinforcement learning. arXiv preprint arXiv:2506.04207, 2025

  35. [35]

    Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping

    Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, and Nanyun Peng. Ares: Multimodal adaptive reasoning via difficulty-aware token-level entropy shaping. arXiv preprint arXiv:2510.08457, 2025

  36. [36]

    Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence

    Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 video: Grounded video reasoning with explicit spatio-temporal evidence. arXiv preprint arXiv:2510.20579, 2025

  37. [37]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025

  38. [38]

    Sophiavl-r1: Reinforcing mllms reasoning with thinking reward

    Kaixuan Fan, Kaituo Feng, Haoming Lyu, Dongzhan Zhou, and Xiangyu Yue. Sophiavl-r1: Reinforcing mllms reasoning with thinking reward. arXiv preprint arXiv:2505.17018, 2025

  39. [39]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025

  40. [40]

    Pre-trained policy discriminators are general reward models

    Shihan Dou, Shichun Liu, Yuming Yang, Yicheng Zou, Yunhua Zhou, Shuhao Xing, Chenhao Huang, Qiming Ge, Demin Song, Haijun Lv, et al. Pre-trained policy discriminators are general reward models. arXiv preprint arXiv:2507.05197, 2025

  41. [41]

    Affordance-R1: Reinforcement Learning for Generalizable Affordance Reasoning in Multimodal Large Language Model

    Hanqing Wang, Shaoyang Wang, Yiming Zhong, Zemin Yang, Jiamin Wang, Zhiqing Cui, Jiahao Yuan, Yifan Han, Mingyu Liu, and Yuexin Ma. Affordance-r1: Reinforcement learning for generalizable affordance reasoning in multimodal large language model. arXiv preprint arXiv:2508.06206, 2025

  42. [42]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024

  43. [43]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  44. [44]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl-rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

  45. [45]

    Proxythinker: Test-time guidance through small visual reasoners

    Zilin Xiao, Jaywon Koo, Siru Ouyang, Jefferson Hernandez, Yu Meng, and Vicente Ordonez. Proxythinker: Test-time guidance through small visual reasoners. arXiv preprint arXiv:2505.24872, 2025

  46. [46]

    MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts

    Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023

  47. [47]

    Mmbench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? In European conference on computer vision, pages 216–233. Springer, 2024

  48. [48]

    Are we on the right way for evaluating large vision-language models?

    Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right way for evaluating large vision-language models? Advances in Neural Information Processing Systems, 37:27056–27087, 2024

  49. [49]

    Learn to explain: Multimodal reasoning via thought chains for science question answering

    Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022

  50. [50]

    A diagram is worth a dozen images

    Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. A diagram is worth a dozen images. In European conference on computer vision, pages 235–251. Springer, 2016

  51. [51]

    Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi

    Kaining Ying, Fanqing Meng, Jin Wang, Zhiqian Li, Han Lin, Yue Yang, Hao Zhang, Wenbo Zhang, Yuqi Lin, Shuo Liu, et al. Mmt-bench: A comprehensive multimodal benchmark for evaluating large vision-language models towards multitask agi. arXiv preprint arXiv:2404.16006, 2024

  52. [52]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

  53. [53]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  54. [54]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. Mm-eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv preprint arXiv:2503.07365, 2025

  55. [55]

    More thought, less accuracy? on the dual nature of reasoning in vision-language models

    Xinyu Tian, Shu Zou, Zhaoyuan Yang, Mengqi He, Fabian Waschkowski, Lukas Wesemann, Peter Tu, and Jing Zhang. More thought, less accuracy? on the dual nature of reasoning in vision-language models. arXiv preprint arXiv:2509.25848, 2025

  56. [56]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  57. [57]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025

  58. [58]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025

  59. [59]

    Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, and Ying Shan. Video-holmes: Can mllm think like holmes for complex video reasoning? arXiv preprint arXiv:2505.21374, 2025

  60. [60]

    Longvideobench: A benchmark for long-context interleaved video-language understanding

    Haoning Wu, Dongxu Li, Bei Chen, and Junnan Li. Longvideobench: A benchmark for long-context interleaved video-language understanding. Advances in Neural Information Processing Systems, 37:28828–28857, 2024

  61. [61]

    Scaling rl to long videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025

  62. [62]

    Videomathqa: Benchmarking mathematical reasoning via multimodal understanding in videos

    Hanoona Rasheed, Abdelrahman Shaker, Anqi Tang, Muhammad Maaz, Ming-Hsuan Yang, Salman Khan, and Fahad Shahbaz Khan. Videomathqa: Benchmarking mathematical reasoning via multimodal understanding in videos. arXiv preprint arXiv:2506.05349, 2025

  63. [63]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025

  64. [64]

    InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. Internvideo2.5: Empowering video mllms with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025

  65. [65]

    Mmsci: A multimodal multi-discipline dataset for phd-level scientific comprehension

    Zekun Li, Xianjun Yang, Kyuri Choi, Wanrong Zhu, Ryan Hsieh, HyeonJung Kim, Jin Hyuk Lim, Sungyoung Ji, Byungju Lee, Xifeng Yan, et al. Mmsci: A multimodal multi-discipline dataset for phd-level scientific comprehension. In AI for Accelerated Materials Design-Vienna 2024, 2024

  66. [66]

    Video-mmlu: A massive multi-discipline lecture understanding benchmark

    Enxin Song, Wenhao Chai, Weili Xu, Jianwen Xie, Yuxuan Liu, and Gaoang Wang. Video-mmlu: A massive multi-discipline lecture understanding benchmark. arXiv preprint arXiv:2504.14693, 2025

  67. [67]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  68. [68]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017

  69. [69]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017

  70. [70]

    Lita: Language instructed temporal-localization assistant

    De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, and Jan Kautz. Lita: Language instructed temporal-localization assistant. In European Conference on Computer Vision, pages 202–218. Springer, 2024

  71. [71]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14271–14280, 2024

  72. [72]

    Timesuite: Improving mllms for long video understanding via grounded tuning

    Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning. arXiv preprint arXiv:2410.19702, 2024

  73. [73]

    Reinforcement learning tuning for videollms: Reward design and data efficiency

    Hongyu Li, Songhao Han, Yue Liao, Junfeng Luo, Jialin Gao, Shuicheng Yan, and Si Liu. Reinforcement learning tuning for videollms: Reward design and data efficiency. arXiv preprint arXiv:2506.01908, 2025

  74. [74]

    Referitgame: Referring to objects in photographs of natural scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798, 2014

  75. [75]

    Modeling context in referring expressions

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. Modeling context in referring expressions. In European conference on computer vision, pages 69–85. Springer, 2016

  76. [76]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025

  77. [77]

    Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding

    Hongyu Li, Jinyu Chen, Ziyu Wei, Shaofei Huang, Tianrui Hui, Jialin Gao, Xiaoming Wei, and Si Liu. Llava-st: A multimodal large language model for fine-grained spatial-temporal understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8592–8603, 2025

  78. [78]

    Groundinggpt: Language enhanced multi-modal grounding model

    Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Vu Tu, et al. Groundinggpt: Language enhanced multi-modal grounding model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6657–6678, 2024

  79. [79]

    Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models

    Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, and Lifu Huang. Grounded-videollm: Sharpening fine-grained temporal grounding in video large language models. arXiv preprint arXiv:2410.03290, 2024

  80. [80]

    R1-track: Direct application of mllms to visual object tracking via reinforcement learning

    Biao Wang, Wenwen Li, and Jiawei Ge. R1-track: Direct application of mllms to visual object tracking via reinforcement learning. arXiv preprint arXiv:2506.21980, 2025
