Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models
17 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (3)
Citation polarities: background (3)
citing papers explorer
- Self-Policy Distillation via Capability-Selective Subspace Projection
  Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
  POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
- RISE: Reliable Improvement in Self-Evolving Vision-Language Models
  RISE is a self-evolving framework for VLMs that adds fine-grained alternation, quality supervision, and dynamic balancing to produce reliable gains on seven benchmarks from unlabeled data.
- Segment-Aligned Policy Optimization for Multi-Modal Reasoning
  SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
- $S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data
  S^3-R1 generates synthetic intermediate-difficulty multi-hop questions and applies dense rewards for search quality plus answer correctness, yielding up to 10% better out-of-domain generalization than baselines.
- Space Syntax-guided Post-training for Residential Floor Plan Generation
  SSPT turns space-syntax integration metrics into post-training feedback signals that improve public-space dominance and functional hierarchy in AI-generated residential floor plans.
- rePIRL: Learn PRM with Inverse RL for LLM Reasoning
  rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and coding datasets.
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
  Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
- Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning
  Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.
- Training Language Models to Self-Correct via Reinforcement Learning
  SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
- SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation
  SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.
- An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs
  ITEM is a new iterative utility judgment loop for RAG that maps Schutz's three levels of relevance to retrieval, utility scoring, and generation, yielding measured gains on TREC DL, WebAP, GTI-NQ, and NQ.
- PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues
  PRISMA augments self-training with direct preference optimization and an emotion-aware negotiation strategy chain-of-thought to produce more interpretable and effective negotiation dialogues on two new datasets.
- Demystifying Long Chain-of-Thought Reasoning in LLMs
  Experiments show that long CoT reasoning in LLMs emerges with more training compute when reward shaping is used properly, and scaling verifiable rewards from noisy data helps especially on out-of-distribution tasks.
- From System 1 to System 2: A Survey of Reasoning Large Language Models
  The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
- Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems
  This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.