pith. sign in

e3: Learning to explore enables extrapolation of test-time compute for llms

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

fields

cs.LG 4

years

2026 4

representative citing papers

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

cs.LG · 2026-05-04 · unverdicted · novelty 6.0

OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

What Does Flow Matching Bring To TD Learning?

cs.LG · 2026-03-04 · conditional · novelty 6.0

Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.

citing papers explorer

Showing 4 of 4 citing papers.

  • Training Reasoning Models on Saturated Problems via Failure-Prefix Conditioning cs.LG · 2026-01-28 · unverdicted · none · ref 17

    Failure-prefix conditioning unlocks learning from saturated reasoning problems by conditioning on failure prefixes, improving recovery from misleading early steps and matching gains from new medium-difficulty problems.

  • OGPO: Sample Efficient Full-Finetuning of Generative Control Policies cs.LG · 2026-05-04 · unverdicted · none · ref 175

    OGPO is a sample-efficient off-policy method for full finetuning of generative control policies that reaches SOTA on robotic manipulation tasks and can recover from poor behavior-cloning initializations without expert data.

  • Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data cs.LG · 2026-04-20 · unverdicted · none · ref 23

    A parameter-free sampling strategy called CUTS combined with Mixed-CUTS training prevents mode collapse in RL for saturated LLM reasoning tasks and raises AIME25 Pass@1 accuracy by up to 15.1% over standard GRPO.

  • What Does Flow Matching Bring To TD Learning? cs.LG · 2026-03-04 · conditional · none · ref 55

    Flow matching critics outperform monolithic ones in RL by 2x performance and 5x sample efficiency via test-time error recovery through integration and multi-point velocity supervision that preserves feature plasticity.