FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization

Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, Jingren Zhou · 2026 · arXiv 2603.19835

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

DelTA estimates token coefficients to amplify discriminative directions in token-gradient vectors, reweighting the RLVR surrogate to produce more contrastive side-wise centroids and yielding 3.26 and 2.62 point gains on math benchmarks for 8B and 14B Qwen3 models.

Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

NFPO augments the PPO surrogate with N-step forward traces to bridge local approximations and exact policy gradients, delivering tighter policy-improvement bounds and improved results on reasoning benchmarks.

Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy

cs.LG · 2026-05-14 · conditional · novelty 6.0

ActFocus resolves the action bottleneck in agentic RL by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.

DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment

cs.LG · 2026-05-05 · unverdicted · novelty 6.0 · 2 refs

DGPO is a critic-free RL framework that uses bounded Hellinger distance and entropy-gated advantage redistribution to enable fine-grained token-level credit assignment in long CoT generations for LLM alignment, reporting SOTA results on AIME benchmarks.

One-Way Policy Optimization for Self-Evolving LLMs

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

OWPO decouples optimization direction from magnitude via asymmetric reweighting (Accelerated Alignment for inferior deviations, Gain Locking for superior) plus iterative references to create a ratchet effect for continuous LLM improvement.

VISD: Enhancing Video Reasoning via Structured Self-Distillation

cs.CV · 2026-05-07 · unverdicted · novelty 5.0 · 4 refs

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.

citing papers explorer

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards · cs.LG · 2026-05-20 · unverdicted · none · ref 63
Multi-Step Likelihood-Ratio Correction for Reinforcement Learning with Verifiable Rewards · cs.LG · 2026-05-20 · unverdicted · none · ref 16
Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy · cs.LG · 2026-05-14 · conditional · none · ref 16
DGPO: Distribution Guided Policy Optimization for Fine Grained Credit Assignment · cs.LG · 2026-05-05 · unverdicted · none · ref 16 · 2 links
One-Way Policy Optimization for Self-Evolving LLMs · cs.LG · 2026-05-21 · unverdicted · none · ref 10
VISD: Enhancing Video Reasoning via Structured Self-Distillation · cs.CV · 2026-05-07 · unverdicted · none · ref 28 · 4 links
