{"total":15,"items":[{"citing_arxiv_id":"2606.05253","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Alpha-RTL: Test-Time Training for RTL Hardware Optimization","primary_cat":"cs.LG","submitted_at":"2026-06-03T14:51:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TTT-RTL performs per-design test-time RL on an LLM policy with EDA-derived PPA rewards and an adaptive KL controller, reducing geometric-mean PPA product by 65.1% on RTLLM v2.0 and ADP by 59.4% on an industrial FPU unit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30478","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Improving Small Language Models for Code Generation with Reinforcement Learning from Verification Feedback","primary_cat":"cs.SE","submitted_at":"2026-05-28T18:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"RLVR with combined unit-test and static-analysis rewards improves pass@1 by up to 13pp on MBPP for 0.6B-1B models, while single-reward variants can induce shorter but less correct outputs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28409","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient Post-training of LLMs for Code Generation With Offline Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-05-27T12:43:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Offline RL post-training boosts code generation performance in LLMs, with larger gains for small models and hard problems, using pre-collected 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.26807","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML","primary_cat":"cs.SE","submitted_at":"2026-05-26T10:22:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HTMLCure uses browser-executed interaction trajectories to diagnose and repair LLM HTML outputs, expanding 97K prompts into a 40K refined SFT set that lifts a 27B model to 50.6 on HTMLBench-400 and 81.2 on MiniAppBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.24375","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Distilling Game Code World Model Generation into Lightweight Large Language Models","primary_cat":"cs.AI","submitted_at":"2026-05-23T03:30:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"SFT followed by RLVR on Qwen2.5-3B-Instruct raises syntactic and execution correctness when generating Game Code World Models across 30 games.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13935","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Mode-Seeking RL: Trajectory-Balance Post-Training for Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-13T16:14:46+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TraFL applies trajectory flow balancing to post-train diffusion language models, preventing mode collapse and delivering consistent gains on reasoning tasks that hold under 
increased sampling.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11680","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ShapeCodeBench: A Renewable Benchmark for Perception-to-Program Reconstruction of Synthetic Shape Scenes","primary_cat":"cs.CV","submitted_at":"2026-05-12T07:39:00+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ShapeCodeBench introduces a renewable benchmark for perception-to-program reconstruction of synthetic shapes, with evaluations showing low exact-match performance from current models and heuristics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08468","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PYTHALAB-MERA: Validation-Grounded Memory, Retrieval, and Acceptance Control for Frozen-LLM Coding Agents","primary_cat":"cs.CL","submitted_at":"2026-05-08T20:39:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"An external controller for frozen LLMs raises strict validation success on three RL coding tasks from 0/9 to 8/9 by selecting memory records and skills, running fail-fast checks, and propagating credit via eligibility traces.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"It is not the proposed final method; it is retained as a negative-result comparison arm. 2 Method 2.1 Methodological framing PYTHALAB-MERA is a reinforcement-learning-inspired side controller for local validation-conditioned code generation. The frozen language model is denoted by G_ω, where ω are fixed parameters. 
Unlike model-tuning approaches for execution-grounded code generation [47,15,4], PYTHALAB-MERA does not update the generator: ω_{t+1} = ω_t = ω. (1) The adaptive object is an external controller state, Θ_t = (B^ret_t, M_t, L_t, T_t, B^dec_t), (2) where B^ret_t is the LinUCB retrieval-action controller, M_t is episodic memory, L_t is the AST-derived skill library, T_t is the delayed-credit trace state, and B^dec_t is an optional decoding-profile bandit."},{"citing_arxiv_id":"2604.16804","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems","primary_cat":"cs.LG","submitted_at":"2026-04-18T03:24:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[42] John Schulman and Thinking Machines Lab. Lora without regret. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20250929. https://thinkingmachines.ai/blog/lora/. [43] Bo Tang, Elias B Khalil, and Ján Drgoňa. Learning to optimize for mixed-integer non-linear programming with feasibility guarantees. arXiv preprint arXiv:2410.11061, 2024. [44] Christodoulos A Floudas, Panos M Pardalos, Claire Adjiman, William R Esposito, Zeynep H Gümüs, Stephen T Harding, John L Klepeis, Clifford A Meyer, and Carl A Schweiger. Handbook of test problems in local and global optimization, volume 33. Springer Science & Business Media, 2013. 
[45] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan,"},{"citing_arxiv_id":"2605.02913","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16416","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning","primary_cat":"cs.CV","submitted_at":"2025-10-18T09:22:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26383","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning","primary_cat":"cs.CL","submitted_at":"2025-09-30T15:14:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KG-R1 trains a single RL agent to retrieve from and reason over knowledge graphs in one loop, achieving higher accuracy with fewer tokens than multi-module baselines and 
transferring to unseen graphs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2305.16291","ref_index":85,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","primary_cat":"cs.AI","submitted_at":"2023-05-25T17:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing skills","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Azaria, Tom Mitchell, and Yuanzhi Li. Spring: Gpt-4 out-performs rl algorithms by studying papers and reasoning. arXiv preprint arXiv: 2305.15486, 2023. [84] Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, Silvio Savarese, and Caiming Xiong. A conversational paradigm for program synthesis. arXiv preprint arXiv: Arxiv-2203.13474, 2022. [85] Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven C. H. Hoi. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. arXiv preprint arXiv: Arxiv-2207.01780, 2022. [86] Xinyun Chen, Chang Liu, and Dawn Song. Execution-guided neural program synthesis. 
In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA,"},{"citing_arxiv_id":"2303.17651","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Self-Refine: Iterative Refinement with Self-Feedback","primary_cat":"cs.CL","submitted_at":"2023-03-30T18:30:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Self-Refine boosts LLM outputs by ~20% on average across seven tasks by having the same model iteratively generate, critique, and refine its own responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2207.10397","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CodeT: Code Generation with Generated Tests","primary_cat":"cs.CL","submitted_at":"2022-07-21T10:18:37+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CodeT improves code generation accuracy by using the same model to create test cases and then selecting solutions via output agreement on those tests, raising HumanEval pass@1 from 47% to 65.8%.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}