RL on Qwen2-VL-2B with the SAT dataset produces R1-like reasoning and 59.47% CVBench accuracy, outperforming the base model by ~30% and SFT by ~2%.
4 Pith papers cite this work.
years: 2025 (4)
representative citing papers
HAPO is a new token-level policy optimization method for LLMs that continuously adapts four optimization stages using entropy, claiming consistent gains over DAPO on math, code, and logic tasks.
citing papers explorer
- MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe
  An 8B MLLM reaches state-of-the-art efficiency and performance under 30B by combining a unified 3D resampler, joint document-text training, and hybrid RL for reasoning modes.
- Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
  Proposes token-significance and dynamic length rewards in RL to reduce LLM response length while preserving or improving reasoning correctness across benchmarks.