Beyond human data: Scaling self-training for problem-solving with language models · 2023 · arXiv 2312.06585

28 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients

cs.CL · 2026-06-16 · unverdicted · novelty 7.0

ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.

Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces

cs.AI · 2026-06-03 · unverdicted · novelty 7.0

Introduces OPT* tasks and two training regimes (solver-guided online policy optimization with rank-based reward shaping and search-based offline RL) plus a theoretical link between search success and information extraction per budget unit, showing empirical gains in optimization-like reasoning.

Self-Policy Distillation via Capability-Selective Subspace Projection

cs.CL · 2026-05-21 · unverdicted · novelty 7.0

Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.

Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization

cs.LG · 2026-07-01 · unverdicted · novelty 6.0

Active-GRPO reaches 0.1773 average SRxSim on TOMG-Bench MOLOPT by adaptively switching between imitation and self-reinforcement while upgrading references, outperforming GRPO and RePO.

PHF: Privileged Hidden Flow for On-Policy Self-Distillation

cs.AI · 2026-06-28 · unverdicted · novelty 6.0

PHF distills token-to-token transition directions and trajectory geometry in hidden states during on-policy self-distillation, reporting 1.5-2.2 point gains on Average@12 for Qwen3-1.7B/4B/8B over reproduced OPSD baseline under a 100-step schedule.

Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models

cs.CV · 2026-06-25 · unverdicted · novelty 6.0

VISE is an unsupervised self-evolving method for LMMs that uses invariance rewards to improve visual conditioning, reporting gains on captioning and reduced hallucination across multiple models.

On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity

cs.LG · 2026-06-24 · unverdicted · novelty 6.0

On-policy self-distillation with sampled demonstrations reduces rollout diversity by amplifying existing probability gaps in the base model, unlike ideal RL which preserves ratios among correct outputs.

DRIFT: Refining Instruction Data via On-Policy Data Attribution

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

DRIFT applies on-policy influence functions with signed weighting and debiasing to attribute and refine SFT data, raising performance on 7B instruction and reasoning models over prior curation methods.

DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization

cs.LG · 2026-05-29 · unverdicted · novelty 6.0

DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.

Segment-Aligned Policy Optimization for Multi-Modal Reasoning

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

Space Syntax-guided Post-training for Residential Floor Plan Generation

cs.LG · 2026-02-26 · unverdicted · novelty 6.0

SSPT turns space-syntax integration metrics into post-training feedback signals that improve public-space dominance and functional hierarchy in AI-generated residential floor plans.

OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

cs.CV · 2025-03-21 · conditional · novelty 6.0

Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

cs.LG · 2024-10-10 · unverdicted · novelty 6.0

Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.

Training Language Models to Self-Correct via Reinforcement Learning

cs.LG · 2024-09-19 · unverdicted · novelty 6.0

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

Data Selection Through Iterative Self-Filtering for Vision-Language Settings

cs.CV · 2026-06-22 · unverdicted · novelty 5.0

An iterative bootstrapped self-filtering approach selects balanced clean and diverse subsets from noisy vision-language datasets to train improved CLIP models.

Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use

cs.CL · 2026-05-25 · unverdicted · novelty 5.0

RL on a four-verb Freebase KG tool shows peak-then-collapse in answer rate, with self-distillation reaching 40% EM while model scaling does not help.

RISE: Reliable Improvement in Self-Evolving Vision-Language Models

cs.CV · 2026-05-20 · unverdicted · novelty 5.0 · 2 refs

RISE proposes a self-evolving VLM framework with three designs to address challenges in question generation and solver adaptation, reporting consistent gains on seven benchmarks across two backbones.

SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation

cs.AI · 2026-03-23 · unverdicted · novelty 5.0

SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.

An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs

cs.IR · 2024-06-17 · unverdicted · novelty 5.0

ITEM is a new iterative utility judgment loop for RAG that maps Schutz's three levels of relevance to retrieval, utility scoring, and generation, yielding measured gains on TREC DL, WebAP, GTI-NQ, and NQ.

Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation

cs.CL · 2026-05-30 · unverdicted · novelty 4.0

DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.

$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data

cs.LG · 2026-05-02 · unverdicted · novelty 4.0 · 2 refs

S^3-R1 generates synthetic multi-hop questions and uses combined intermediate and final rewards to train RL models for retrieval and answering, reporting up to 10% better out-of-domain generalization.

PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues

cs.CL · 2026-04-20 · unverdicted · novelty 4.0

PRISMA augments self-training with direct preference optimization and an emotion-aware negotiation strategy chain-of-thought to produce more interpretable and effective negotiation dialogues on two new datasets.

rePIRL: Learn PRM with Inverse RL for LLM Reasoning

cs.LG · 2026-02-08 · conditional · novelty 4.0

rePIRL learns token-level process rewards for LLM reasoning via a guided-cost-learning-style IRL objective, and shows these rewards improve reasoning policies on math/coding benchmarks.

citing papers explorer

Showing 28 of 28 citing papers.

Zone of Proximal Policy Optimization: Teacher in Prompts, Not Gradients cs.CL · 2026-06-16 · unverdicted · none · ref 77
ZPPO improves distillation to small vision-language models by using binary and negative candidate prompts plus a replay buffer for hard questions, outperforming standard distillation and GRPO on a 31-benchmark suite with largest gains at the 0.8B scale.
Step-by-Step Optimization-like Reasoning in LLMs over Expanding Search Spaces cs.AI · 2026-06-03 · unverdicted · none · ref 7
Introduces OPT* tasks and two training regimes (solver-guided online policy optimization with rank-based reward shaping and search-based offline RL) plus a theoretical link between search success and information extraction per budget unit, showing empirical gains in optimization-like reasoning.
Self-Policy Distillation via Capability-Selective Subspace Projection cs.CL · 2026-05-21 · unverdicted · none · ref 2
Self-Policy Distillation extracts a capability subspace from model gradients on correctness tokens, projects KV activations into it for self-generation, and fine-tunes LLMs to achieve up to 13-16% gains over baselines without external signals.
Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients cs.CL · 2026-05-07 · unverdicted · none · ref 54
POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on AIME 2025.
Active-GRPO: Adaptive Imitation and Self-Improving Reasoning for Molecular Optimization cs.LG · 2026-07-01 · unverdicted · none · ref 49
Active-GRPO reaches 0.1773 average SRxSim on TOMG-Bench MOLOPT by adaptively switching between imitation and self-reinforcement while upgrading references, outperforming GRPO and RePO.
PHF: Privileged Hidden Flow for On-Policy Self-Distillation cs.AI · 2026-06-28 · unverdicted · none · ref 47
PHF distills token-to-token transition directions and trajectory geometry in hidden states during on-policy self-distillation, reporting 1.5-2.2 point gains on Average@12 for Qwen3-1.7B/4B/8B over reproduced OPSD baseline under a 100-step schedule.
Paying More Attention to Visual Tokens in Self-Evolving Large Multimodal Models cs.CV · 2026-06-25 · unverdicted · none · ref 27
VISE is an unsupervised self-evolving method for LMMs that uses invariance rewards to improve visual conditioning, reporting gains on captioning and reduced hallucination across multiple models.
On-Policy Self-Distillation with Sampled Demonstrations Reduces Output Diversity cs.LG · 2026-06-24 · unverdicted · none · ref 81
On-policy self-distillation with sampled demonstrations reduces rollout diversity by amplifying existing probability gaps in the base model, unlike ideal RL which preserves ratios among correct outputs.
DRIFT: Refining Instruction Data via On-Policy Data Attribution cs.LG · 2026-06-16 · unverdicted · none · ref 19
DRIFT applies on-policy influence functions with signed weighting and debiasing to attribute and refine SFT data, raising performance on 7B instruction and reasoning models over prior curation methods.
DRIFT: Decoupled Rollouts and Importance-Weighted Fine-Tuning for Efficient Multi-Turn Optimization cs.LG · 2026-05-29 · unverdicted · none · ref 22
DRIFT achieves multi-turn RL performance via offline importance-weighted SFT by leveraging the equivalence of KL-regularized RL to weighted supervised learning.
Segment-Aligned Policy Optimization for Multi-Modal Reasoning cs.AI · 2026-05-02 · unverdicted · none · ref 13
SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
Space Syntax-guided Post-training for Residential Floor Plan Generation cs.LG · 2026-02-26 · unverdicted · none · ref 44
SSPT turns space-syntax integration metrics into post-training feedback signals that improve public-space dominance and functional hierarchy in AI-generated residential floor plans.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles cs.CV · 2025-03-21 · conditional · none · ref 63
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning cs.LG · 2024-10-10 · unverdicted · none · ref 20
Process advantage verifiers trained to predict step-level progress under a distinct prover policy improve LLM reasoning accuracy by over 8% and sample efficiency by 5-6x over outcome reward models.
Training Language Models to Self-Correct via Reinforcement Learning cs.LG · 2024-09-19 · unverdicted · none · ref 62
SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.
Data Selection Through Iterative Self-Filtering for Vision-Language Settings cs.CV · 2026-06-22 · unverdicted · none · ref 58
An iterative bootstrapped self-filtering approach selects balanced clean and diverse subsets from noisy vision-language datasets to train improved CLIP models.
Peak-Then-Collapse and the Four Interface Channels of Knowledge-Graph Tool Use cs.CL · 2026-05-25 · unverdicted · none · ref 2
RL on a four-verb Freebase KG tool shows peak-then-collapse in answer rate, with self-distillation reaching 40% EM while model scaling does not help.
RISE: Reliable Improvement in Self-Evolving Vision-Language Models cs.CV · 2026-05-20 · unverdicted · none · ref 31 · 2 links
RISE proposes a self-evolving VLM framework with three designs to address challenges in question generation and solver adaptation, reporting consistent gains on seven benchmarks across two backbones.
SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation cs.AI · 2026-03-23 · unverdicted · none · ref 66
SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.
An Iterative Utility Judgment Framework Inspired by Philosophical Relevance via LLMs cs.IR · 2024-06-17 · unverdicted · none · ref 5
ITEM is a new iterative utility judgment loop for RAG that maps Schutz's three levels of relevance to retrieval, utility scoring, and generation, yielding measured gains on TREC DL, WebAP, GTI-NQ, and NQ.
Robust Reasoning via Dynamic Token Selection for Distribution-Aligned Self-Distillation cs.CL · 2026-05-30 · unverdicted · none · ref 8
DASD dynamically selects tokens in self-distillation to keep logical corrections while suppressing stylistic noise, improving robustness on math, code, and commonsense benchmarks.
$S^3$-R1: Learning to Retrieve and Answer Step-by-Step with Synthetic Data cs.LG · 2026-05-02 · unverdicted · none · ref 16 · 2 links
S^3-R1 generates synthetic multi-hop questions and uses combined intermediate and final rewards to train RL models for retrieval and answering, reporting up to 10% better out-of-domain generalization.
PRISMA: Preference-Reinforced Self-Training Approach for Interpretable Emotionally Intelligent Negotiation Dialogues cs.CL · 2026-04-20 · unverdicted · none · ref 10
PRISMA augments self-training with direct preference optimization and an emotion-aware negotiation strategy chain-of-thought to produce more interpretable and effective negotiation dialogues on two new datasets.
rePIRL: Learn PRM with Inverse RL for LLM Reasoning cs.LG · 2026-02-08 · conditional · none · ref 25
rePIRL learns token-level process rewards for LLM reasoning via a guided-cost-learning-style IRL objective, and shows these rewards improve reasoning policies on math/coding benchmarks.
Demystifying Long Chain-of-Thought Reasoning in LLMs cs.CL · 2025-02-05 · unverdicted · none · ref 4
Experiments show that long CoT reasoning in LLMs emerges with more training compute when reward shaping is used properly, and scaling verifiable rewards from noisy data helps especially on out-of-distribution tasks.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 192
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods cs.CL · 2024-12-07 · accept · none · ref 206
A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems cs.AI · 2025-03-31 · unverdicted · none · ref 159
This survey frames foundation agents using brain-inspired modular architectures and reviews challenges in evolution, collaboration, and safety.
