Gotmare, Silvio Savarese, and Steven C.H
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3
polarities
background 3
citing papers explorer
- Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models
TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills.
- CodeT: Code Generation with Generated Tests
CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.
- ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes
ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.
- AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
- SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.
- Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
KG-R1 trains a single RL agent to retrieve from and reason over knowledge graphs in one loop, achieving higher accuracy with fewer tokens than multi-module baselines and transferring to unseen graphs.
- PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
- Self-Refine: Iterative Refinement with Self-Feedback
Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.