{"total":23,"items":[{"citing_arxiv_id":"2605.12995","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"F-GRPO: Factorized Group-Relative Policy Optimization for Unified Candidate Generation and Ranking","primary_cat":"cs.LG","submitted_at":"2026-05-13T04:52:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"F-GRPO factorizes group-relative policy optimization into generation and ranking phases within one autoregressive sequence, using order-invariant coverage and position-aware utility rewards to improve top-ranked performance on recommendation and multi-hop QA tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11299","ref_index":71,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Primal Generation, Dual Judgment: Self-Training from Test-Time Scaling","primary_cat":"cs.LG","submitted_at":"2026-05-11T22:34:45+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DuST self-trains LLMs for code generation by ranking their own test-time samples via sandbox execution and applying GRPO, improving judgment by +6.2 NDCG and single-sample pass@1 by +3.1 on LiveCodeBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11217","ref_index":48,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Leveraging RAG for Training-Free Alignment of LLMs","primary_cat":"cs.LG","submitted_at":"2026-05-11T20:29:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RAG-Pref is a training-free RAG-based alignment technique that conditions LLMs on contrastive preference samples during inference, yielding over 3.7x average improvement in agentic 
attack refusals when combined with offline methods across five LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03065","ref_index":165,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"OGPO: Sample Efficient Full-Finetuning of Generative Control Policies","primary_cat":"cs.LG","submitted_at":"2026-05-04T18:36:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OGPO enables sample-efficient full-finetuning of generative control policies via off-policy critics and modified PPO, achieving SOTA on robot manipulation tasks while rescuing poorly initialized behavior cloning policies without expert data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19656","ref_index":28,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Pause or Fabricate? 
Training Language Models for Grounded Reasoning","primary_cat":"cs.CL","submitted_at":"2026-04-21T16:45:29+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRIL uses stage-specific RL rewards to train LLMs to detect missing premises, pause proactively, and resume grounded reasoning after clarification, yielding up to 45% better premise detection and 30% higher task success on insufficient math datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10326","ref_index":3,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion","primary_cat":"cs.CR","submitted_at":"2026-04-11T19:19:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HMNS is a new jailbreak method that uses causal head identification and nullspace-constrained injection to achieve higher attack success rates than prior techniques on aligned language models.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"To identify the attention heads most responsible for the model's continuation behavior, we perform counterfactual ablation and score each head via the KL divergence between output distributions. Let S_{ℓ,h} ∈ R^{(H_ℓ d_h)×(H_ℓ d_h)} be a diagonal selector with ones on the slice for head h and zeros elsewhere. The masked out-projection for probing head (ℓ, h) is W̃^O_{ℓ,h} = W^O_ℓ (I − S_{ℓ,h}), (3) which replaces W^O_ℓ only at layer ℓ during an ablated forward pass. Let P = softmax(z) denote the baseline (output generated without HMNS) next-token distribution produced using equation 2, and let P̃^{(ℓ,h)} = softmax(z̃^{(ℓ,h)}) be the ablated distribution obtained when using equation 3. 
The causal importance of head (ℓ, h) is then Δ_{ℓ,h} = KL(P ∥ P̃^{(ℓ,h)})"},{"citing_arxiv_id":"2604.09741","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ExecTune: Effective Steering of Black-Box LLMs with Guide Models","primary_cat":"cs.LG","submitted_at":"2026-04-09T23:27:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ExecTune trains guide models via acceptance sampling, supervised fine-tuning, and structure-aware RL to boost executability of strategies for black-box LLMs, yielding up to 9.2% higher accuracy and 22.4% lower cost on math and code tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"π_acc(z|s) = Σ_{z∈B(s)} π_L(z|s) Pr(accept|s, z) / A_s ≤ Σ_{z∈B(s)} π_L(z|s) δ / A_s ≤ δ / A_s, (25) which is Pr_{z∼π_acc(·|s)}(q(s, z) ≤ τ−η) ≤ δ / A_s. For the expectation bound, let G(s) = B(s)^c = {z : q(s, z) > τ−η}. Since q(s, z) ≥ 0 everywhere and q(s, z) ≥ τ−η on G(s), E_{z∼π_acc(·|s)}[q(s, z)] ≥ E[q(s, z) 1{z ∈ G(s)}] ≥ (τ−η) Pr(z ∈ G(s)) = (τ−η)(1 − Pr(z ∈ B(s))) ≥ (τ−η)(1 − δ / A_s), (26) which is equation 19. A.11 ENSURING δ < A_s IN PRACTICE The lower bound in Theorem 2 depends on the ratio δ/A_s, where δ = e^{−2Kη^2} is the validation error term and A_s = Pr_{z∼π_L(·|s)}(accept) is the (unknown) acceptance rate. To make the bound nonvacuous in practice, we estimate A_s from teacher proposals and form a high-confidence lower bound. 
Concretely, draw M independent strategies z1, ."},{"citing_arxiv_id":"2605.02913","ref_index":89,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-08T00:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.02623","ref_index":12,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"The Realignment Problem: When Right becomes Wrong in LLMs","primary_cat":"cs.CL","submitted_at":"2025-11-04T14:52:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRACE is a three-stage optimization framework that realigns LLMs to new policies by categorizing preference conflicts, scoring impact via bi-level optimization, and applying hybrid losses without new human annotations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.25758","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Thinking Sparks!: Emergent Attention Heads in Reasoning Models During Post Training","primary_cat":"cs.AI","submitted_at":"2025-09-30T04:23:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Post-training on reasoning tasks sparks the emergence of specialized attention heads that enable structured computation, with SFT adding stable heads while GRPO uses dynamic activation and pruning tied to reward signals, 
and controllable think models relying on compensatory heads instead of specific","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.20102","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Steerable Adversarial Scenario Generation through Test-Time Preference Alignment","primary_cat":"cs.AI","submitted_at":"2025-09-24T13:27:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SAGE reframes adversarial scenario generation as multi-objective preference alignment, using hierarchical group-based optimization and test-time linear interpolation of two expert policies to enable steerable control over adversariality-realism trade-offs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.09838","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Dissecting Discrete Soft Actor-Critic: Limitations and Principled Alternatives","primary_cat":"cs.LG","submitted_at":"2025-09-11T20:34:08+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Shows entropy coupling limits DSAC on discrete tasks and introduces a generalized actor-critic framework with m-step critics and novel entropy-regularized objectives that perform robustly on Atari.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.05489","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Self-Aligned Reward: Towards Effective and Efficient Reasoners","primary_cat":"cs.LG","submitted_at":"2025-09-05T20:39:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Self-aligned reward uses relative perplexity 
differences to encourage concise, query-specific reasoning in LLMs, yielding 4% accuracy gains and 30% lower inference cost when added to PPO or GRPO.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.10614","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Fine-tuning Large Language Model for Automated Algorithm Design","primary_cat":"cs.LG","submitted_at":"2025-07-13T15:21:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fine-tuned LLMs with DAR sampling and DPO outperform off-the-shelf versions on algorithm design tasks and generalize to related settings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.10248","ref_index":22,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model","primary_cat":"cs.CV","submitted_at":"2025-02-14T15:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.01456","ref_index":35,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Process Reinforcement through Implicit Rewards","primary_cat":"cs.LG","submitted_at":"2025-02-03T15:43:48+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, 
yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.08812","ref_index":46,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Test-Time Alignment via Hypothesis Reweighting","primary_cat":"cs.LG","submitted_at":"2024-12-11T23:02:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HyRe personalizes reward models at test time by reweighting an ensemble of heads trained on aggregate preferences, using few target examples to outperform uniform averaging and prior methods on RewardBench and 32 tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.04984","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Frontier Models are Capable of In-context Scheming","primary_cat":"cs.AI","submitted_at":"2024-12-06T12:09:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Frontier models demonstrate in-context scheming by strategically deceiving in multiple agentic evaluations to achieve given goals.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.02125","ref_index":41,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies","primary_cat":"cs.AI","submitted_at":"2024-12-03T03:27:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks 
and 13.4% better OOD performance than fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.17891","ref_index":163,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Scaling Diffusion Language Models via Adaptation from Autoregressive Models","primary_cat":"cs.CL","submitted_at":"2024-10-23T14:04:22+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.10781","ref_index":37,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"When Attention Sink Emerges in Language Models: An Empirical View","primary_cat":"cs.CL","submitted_at":"2024-10-14T17:50:28+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention sinks emerge in language models from softmax-induced token dependence on attention scores and do not appear when using sigmoid attention without normalization in models up to 1B parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.18665","ref_index":26,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"RouteLLM: Learning to Route LLMs with Preference Data","primary_cat":"cs.LG","submitted_at":"2024-06-26T18:10:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Router models trained on preference data dynamically select between strong and weak LLMs, 
cutting inference costs by more than 2x on benchmarks with no quality loss and showing transfer to new model pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.11717","ref_index":165,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Refusal in Language Models Is Mediated by a Single Direction","primary_cat":"cs.LG","submitted_at":"2024-06-17T16:36:12+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}