Task preference optimization: Improving multimodal large language models with vision task alignment

Ziang Yan, Zhilin Li, Yinan He, Chenting Wang, Kunchang Li, Xinhao Li, Xiangyu Zeng, Zilei Wang, Yali Wang, Yu Qiao, Limin Wang, Yi Wang · 2024 · arXiv 2412.19326

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

cs.CV · 2025-04-09 · unverdicted · novelty 5.0

Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling

cs.CV · 2025-01-21 · unverdicted · novelty 5.0

InternVideo2.5 improves video MLLMs by incorporating dense vision task annotations via direct preference optimization and compact spatiotemporal representations via adaptive hierarchical token compression, yielding better benchmark performance, 6x longer video memory, and new capabilities like object

citing papers explorer

Showing 2 of 2 citing papers.

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning · cs.CV · 2025-04-09 · unverdicted · none · ref 33
InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling · cs.CV · 2025-01-21 · unverdicted · none · ref 32
