Gotmare, Silvio Savarese, and Steven C.H
15 Pith papers cite this work. Polarity classification is still indexing.
roles: background 3
polarities: background 3
representative citing papers
TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under increased sampling.
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills.
CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.
ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.
AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.
KG-R1 trains a single RL agent to retrieve from and reason over knowledge graphs in one loop, achieving higher accuracy with fewer tokens than multi-module baselines and transferring to unseen graphs.
HTMLCure uses browser-executed interaction trajectories to diagnose and repair LLM HTML outputs, expanding 97K prompts into a 40K refined SFT set that lifts a 27B model to 50.6 on HTMLBench-400 and 81.2 on MiniAppBench.
An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
RLVR with combined unit-test and static-analysis rewards improves pass@1 by up to 13pp on MBPP for 0.6B-1B models, while single-reward variants can induce shorter but less correct outputs.
Offline RL post-training boosts code generation performance in LLMs, with larger gains for small models and hard problems, using pre-collected datasets.
SFT followed by RLVR on Qwen2.5-3B-Instruct raises syntactic and execution correctness when generating Game Code World Models across 30 games.
citing papers explorer
- Alpha-RTL: Test-Time Training for RTL Hardware Optimization
  TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.
- Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
  This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
- Voyager: An Open-Ended Embodied Agent with Large Language Models
  Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills.
- AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
  AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
- SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning
  SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.
- Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning
  KG-R1 trains a single RL agent to retrieve from and reason over knowledge graphs in one loop, achieving higher accuracy with fewer tokens than multi-module baselines and transferring to unseen graphs.
- HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML
  HTMLCure uses browser-executed interaction trajectories to diagnose and repair LLM HTML outputs, expanding 97K prompts into a 40K refined SFT set that lifts a 27B model to 50.6 on HTMLBench-400 and 81.2 on MiniAppBench.
- PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents
  An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.
- Self-Refine: Iterative Refinement with Self-Feedback
  Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.
- Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback
  RLVR with combined unit-test and static-analysis rewards improves pass@1 by up to 13pp on MBPP for 0.6B-1B models, while single-reward variants can induce shorter but less correct outputs.
- Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning
  Offline RL post-training boosts code generation performance in LLMs, with larger gains for small models and hard problems, using pre-collected datasets.
- Distilling Game Code World Model Generation into Lightweight Large Language Models
  SFT followed by RLVR on Qwen2.5-3B-Instruct raises syntactic and execution correctness when generating Game Code World Models across 30 games.