arXiv preprint arXiv:2509.21128

RL Squeezes, SFT Expands: A Comparative Study of Reasoning LLMs · 2025 · arXiv 2509.21128

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Uniform-Correct Policy Optimization: Breaking RLVR's Indifference to Diversity

cs.LG · 2026-05-01 · unverdicted · novelty 7.0

UCPO modifies GRPO with a uniformity penalty over correct solutions to prevent diversity collapse in RLVR, yielding up to 10% higher Pass@64 on AIME24 and 45% more equation-level diversity.

CyclicJudge: Mitigating Judge Bias Efficiently in LLM-based Evaluation

cs.CL · 2026-03-02 · unverdicted · novelty 7.0

CyclicJudge uses round-robin judge-to-scenario assignment to recover the panel-mean score exactly while using the same number of judge calls as single-judge evaluation.

Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

Decouple before Integration: Test-time Synthesis of SFT and RLVR Task Vectors

cs.LG · 2026-05-01 · conditional · novelty 6.0

DoTS decouples SFT and RLVR training, then synthesizes their task vectors at inference time to match integrated training results at ~3% compute cost.

