{"total":217,"items":[{"citing_arxiv_id":"2606.26396","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"At the Edge of Understanding: Sparse Autoencoders Trace The Limits of Transformer Generalization","primary_cat":"cs.LG","submitted_at":"2026-06-24T21:26:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Sparse autoencoders show OOD prompts increase fallacious concept activation in transformers, offering a mechanistic measure of shift and a path to robust fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.25647","ref_index":6,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Retrieval-Grounded Multilingual LLM Assistance for Island Smallholder Farmers","primary_cat":"cs.CE","submitted_at":"2026-06-24T09:56:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Presents a retrieval-grounded multilingual LLM system for island farmers using managed models and local data tools in a PWA for low-bandwidth use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22745","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Language-Specific Sentiment Polarity Biases in Encoder and Large Language Model Classification of Product Reviews","primary_cat":"cs.CL","submitted_at":"2026-06-22T01:16:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"LLMs show negative polarity bias in French and encoder models show positive bias in Japanese when classifying product review 
sentiment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.22402","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcement learning to improve large language model-based automated code compliance systems","primary_cat":"cs.SE","submitted_at":"2026-06-21T09:17:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"P4IR applies supervised fine-tuning followed by GRPO reinforcement learning to reduce tree edit distance by up to 23.8% and Levenshtein distance by up to 38.6% versus SFT baselines while outperforming several frontier LLMs on code structure and semantics for automated building code compliance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10106","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What makes a harness a harness: necessary and sufficient conditions for an agent harness","primary_cat":"cs.SE","submitted_at":"2026-06-08T19:35:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07006","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning","primary_cat":"cs.LG","submitted_at":"2026-06-05T07:52:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RASFT is an adaptive SFT method that strengthens or relaxes expert imitation per problem based on on-policy rollout solvability 
and adds clipped reference-policy ratio to limit drift, reporting better results than standard SFT and RL on math and code benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.06674","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Do People Actually Want From AI? Mapping Preference Plurality","primary_cat":"cs.CL","submitted_at":"2026-06-04T19:47:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Open-ended preference data reveals substantial plurality in what people want from AI and divergent interpretations of shared values such as truthfulness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04978","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probing Outcome-Level Resemblance and Mechanism-Level Alignment in LLM Risk Decisions: Evidence from the St. Petersburg Game","primary_cat":"cs.CL","submitted_at":"2026-06-03T15:01:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs produce human-like finite bids in the St. 
Petersburg game but shift toward rational behavior under controlled prompt changes, indicating surface-level outcome resemblance without mechanism-level alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01196","ref_index":1,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Low-Resource Safety Failures Are Action Failures, Not Representation Failures","primary_cat":"cs.CL","submitted_at":"2026-05-31T12:19:40+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.00437","ref_index":102,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing","primary_cat":"cs.LG","submitted_at":"2026-05-30T00:05:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.11238","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Artificial Intelligence in Ship Finance: Applications, Opportunities, and a Case Study in AI-Augmented Loan 
Origination","primary_cat":"q-fin.GN","submitted_at":"2026-05-29T12:29:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Reviews AI applications in ship finance and presents ShipFinance.ai, a modular LLM-based agentic architecture for automating loan application workflows.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.31034","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Annealed Softmax Greedy in Many-Armed Bayesian Bandits","primary_cat":"cs.LG","submitted_at":"2026-05-29T09:05:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Annealed softmax greedy achieves Õ(m + T/m) Bayes regret (Õ(√T) at m=Θ(√T)) in many-armed Bayesian Bernoulli bandits under linear upper-tail prior condition, matching empirical-mean greedy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30914","ref_index":97,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Automating Formal Verification with Reinforcement Learning and Recursive Inference","primary_cat":"cs.LG","submitted_at":"2026-05-29T06:59:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"RLVR training raises verified Dafny pass rates from 9.7% to 31.1% on a filtered benchmark while a Lean proof scaffold lifts success from 46.2% to 69.2% on a pilot set and solves 7 of 42 prior unsolved tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29396","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order 
Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-28T05:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23565","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Understanding Goal Generalisation in Sequential Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-22T12:31:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23067","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA","primary_cat":"cs.CL","submitted_at":"2026-05-21T21:58:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Controlled study shows mixed training curricula improve aggregate F1 on memory QA benchmarks while out-of-domain data transfers targeted skills like temporal reasoning, with per-question-type effects exceeding aggregate 
differences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22771","ref_index":23,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reducing Political Manipulation with Consistency Training","primary_cat":"cs.CL","submitted_at":"2026-05-21T17:32:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21993","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking","primary_cat":"cs.AI","submitted_at":"2026-05-21T04:42:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"ECPO is a listwise policy optimization method that couples ranking utility with span-level evidence certificate validity and a deterministic verifier reward on MAVEN-ERE and RAMS datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21883","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Token-weighted Direct Preference Optimization with Attention","primary_cat":"cs.CL","submitted_at":"2026-05-21T01:43:09+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21606","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Are Teacher Tokens Reliable? 
Position-Weighted On-Policy Self-Distillation for Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-20T18:14:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Position-Weighted On-Policy Self-Distillation (PW-OPSD) weights later tokens more heavily after a diagnostic shows position predicts teacher reliability better than entropy, yielding +1.0 and +1.1 Avg@12 gains on AIME 2024/2025.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21422","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-20T17:15:43+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21295","ref_index":49,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TimeSRL: Generalizable Time-Series Behavioral Modeling via Semantic RL-Tuned LLMs -- A Case Study in Mental Health","primary_cat":"cs.LG","submitted_at":"2026-05-20T15:25:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TimeSRL uses semantic abstractions from time-series data optimized via reinforcement learning to achieve better cross-dataset generalization than standard ML or LLM baselines in mental health prediction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21225","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PREFINE: Preference-Based Implicit Reward and Cost Fine-Tuning for Safety 
Alignment","primary_cat":"cs.LG","submitted_at":"2026-05-20T14:19:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PREFINE adapts Direct Preference Optimization to trajectory-level preferences in RL for joint reward retention and safety alignment in continuous domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21081","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Musical Attention Transformer: Music Generation Using a Music-Specific Attention Model","primary_cat":"cs.SD","submitted_at":"2026-05-20T12:16:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"The paper introduces Musical Attention, an attention variant that incorporates eight musical features including metadata to generate more coherent and varied music than standard or strided attention baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21545","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts","primary_cat":"cs.SE","submitted_at":"2026-05-20T09:53:31+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20740","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Distribution-Aware Reward: Reinforcement Learning over Predictive Distributions 
for LLM Regression","primary_cat":"cs.LG","submitted_at":"2026-05-20T05:43:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Distribution-Aware Reward optimizes LLM regression by treating rollouts as empirical predictive distributions and rewarding marginal improvements in CRPS quality rather than point accuracy alone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20722","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AGPO: Adaptive Group Policy Optimization with Dual Statistical Feedback","primary_cat":"cs.LG","submitted_at":"2026-05-20T05:20:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AGPO adaptively sets trust-region size and exploration temperature from group reward dispersion, entropy, and KL drift, yielding higher scores than PPO and GRPO on nine math benchmarks under fixed token budget.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20506","ref_index":24,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reinforcing Human Behavior Simulation via Verbal Feedback","primary_cat":"cs.LG","submitted_at":"2026-05-19T21:23:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DITTO uses RL with verbal feedback to train LLMs for human behavior simulation, reporting 36% average gains over base models and outperforming GPT-5.4 on 6 of 10 SOUL benchmark tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20164","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for 
RLVR","primary_cat":"cs.AI","submitted_at":"2026-05-19T17:50:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20149","ref_index":20,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Less Back-and-Forth: A Comparative Study of Structured Prompting","primary_cat":"cs.CL","submitted_at":"2026-05-19T17:40:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Checklist-improved prompts achieve the highest mean rubric score (7.50/8) and best quality-effort tradeoff compared to raw prompts (5.67) and clarifying-question prompts (6.67) across four task types and three LLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20296","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Spectral Unforgetting: Post-Hoc Recovery of Damaged Capabilities Without Retraining","primary_cat":"cs.LG","submitted_at":"2026-05-19T11:01:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DG-Hard uses Donoho-Gavish hard thresholding on the fine-tuning weight delta to separate task-aligned signal from noise-like residual, recovering damaged capabilities while preserving target-task gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19394","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"EmbGen: Teaching with Reassembled 
Corpora","primary_cat":"cs.CL","submitted_at":"2026-05-19T05:40:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EmbGen creates synthetic QA data by entity decomposition, embedding-based reassembly into clusters, and multi-level sampling with cluster-specific prompts, yielding up to 88.9% higher Binary Accuracy than baselines on heterogeneous datasets under fixed token budgets.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tuned models attractive for many real applications [5, 26]. However, adapting a small model to a specialized domain often relies on SFT on curated instruction-response examples, which can be expensive to collect at scale [7, 16, 17, 21, 27]. A common alternative is to generate synthetic training examples from a domain corpus using a teacher LLM [18, 31]. In practice, without careful constraints and filtering, synthetic augmentation can yield homogenized outputs, factual inaccuracies, and insufficient coverage of long-tail domain content [4, 23, 27]. 
Moreover, existing synthetic data pipelines often struggle to consistently capture cross-passage or cross-document (multi-hop) dependencies [4, 15, 28], motivat-"},{"citing_arxiv_id":"2605.20278","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ClaimDiff-RL: Fine-Grained Caption Reinforcement Learning through Visual Claim Comparison","primary_cat":"cs.LG","submitted_at":"2026-05-19T04:39:28+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19149","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents","primary_cat":"cs.CL","submitted_at":"2026-05-18T22:03:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19099","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows","primary_cat":"cs.AI","submitted_at":"2026-05-18T20:37:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect 
delegation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18141","ref_index":33,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Brief Overview: On-Policy Self-Distillation In Large Language Models","primary_cat":"cs.HC","submitted_at":"2026-05-18T09:47:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"This overview paper explains the conceptual foundations and design principles of On-Policy Self-Distillation for large language models from a beginner's perspective.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17187","ref_index":246,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media","primary_cat":"cs.CL","submitted_at":"2026-05-16T22:52:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PluRule is a new multimodal multilingual benchmark showing that state-of-the-art vision-language models perform only marginally better than a trivial baseline at detecting specific rule violations in pluralistic online communities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17064","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Human-Level Book-Writing Capability","primary_cat":"cs.AI","submitted_at":"2026-05-16T16:10:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A prompt-to-book training framework that derives hierarchical summaries from public-domain novels and inverts them to supervise long-context models toward human literary prose instead of assistant-style 
output.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17037","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-16T15:16:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"D²Evo mines medium-difficulty anchors from the current model, trains a Questioner to generate matching questions, and jointly optimizes Solver and Questioner for progressive gains, outperforming baselines on math reasoning with under 2K real samples.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16991","ref_index":148,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning","primary_cat":"cs.CL","submitted_at":"2026-05-16T13:22:57+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14558","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Resolving Action Bottleneck: Agentic Reinforcement Learning Informed by Token-Level Energy","primary_cat":"cs.LG","submitted_at":"2026-05-14T08:33:02+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ActFocus resolves the action bottleneck in agentic RL 
by reweighting token gradients toward action tokens using observed reward variance and an energy-based uncertainty term, outperforming PPO and GRPO by up to 65 percentage points.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14097","ref_index":35,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Real-Time Group Dynamics with LLM Facilitation: Evidence from a Charity Allocation Task","primary_cat":"cs.HC","submitted_at":"2026-05-13T20:28:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16411","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift","primary_cat":"cs.CV","submitted_at":"2026-05-13T15:37:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Stage-wise DPO constructs hallucination-focused preference pairs near failure boundaries to improve visual grounding in VLMs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12705","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Early Data Exposure Improves Robustness to Subsequent Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-12T20:08:00+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Early mixing of post-training data into pretraining improves retention of acquired capabilities after subsequent fine-tuning in language 
models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12484","ref_index":41,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning, Fast and Slow: Towards LLMs That Adapt Continually","primary_cat":"cs.LG","submitted_at":"2026-05-12T17:58:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"org/abs/2209.02370. 10 [39] Jatin Prakash and Anirudh Buvanesh. What can you do when you have zero rewards during rl?, 2025. URL https://arxiv.org/abs/2510.03971. 18 [40] Archiki Prasad, Peter Hase, Xiang Zhou, and Mohit Bansal. GrIPS: Gradient-free, edit-based instruction search for prompting large language models, 2023. URL https://arxiv.org/abs/2203.07281. 9 [41] Reid Pryzant, Dan Iter, Jerry Li, Yin Tat Lee, Chenguang Zhu, and Michael Zeng. Automatic prompt optimization with \"gradient descent\" and beam search, 2023. URL https://arxiv.org/abs/2305.03495. 9 [42] Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. 
How to explore to scale RL training of LLMs on hard problems. CMU Machine Learning Blog, 2025."},{"citing_arxiv_id":"2605.12380","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Trust the Batch, On- or Off-Policy: Adaptive Policy Optimization for RL Post-Training","primary_cat":"cs.LG","submitted_at":"2026-05-12T16:44:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A new RL objective adapts trust-region and off-policy handling automatically via normalized effective sample size of batch policy ratios, matching tuned baselines without new hyperparameters.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[19] Wenhan Ma, Hailin Zhang, Liang Zhao, Yifan Song, Yudong Wang, Zhifang Sui, and Fuli Luo. Stabilizing MoE reinforcement learning by aligning training and inference routers. arXiv preprint arXiv:2510.11370, 2025. 2 [20] Mathematical Association of America. 2023 american mathematics competitions (amc 10 and amc 12). https://huggingface.co/datasets/math-ai/amc23, 2023. Dataset curated by the Math-AI community. 9 [21] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 
Training language models to follow instructions with human"},{"citing_arxiv_id":"2605.10930","ref_index":53,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Evaluating the False Trust Engendered by LLM Explanations","primary_cat":"cs.HC","submitted_at":"2026-05-11T17:58:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"outputs, willingness to rely on recommendations, confidence in answers). 27 D Statistical Analysis D.1 Confidence intervals Table 7 reports Wilson 95% CIs for every cell in the main results tables. The intervals support the claims in the main text: E+/− has the lowest trust (40.9%, 95% CI [32.2, 50.3]) and misjudgment (46.9%, 95% CI [37.2, 56.8]), and is tied for the highest accuracy (60.4%, 95% CI [53.4, 67.1]). Table 7: Wilson 95% confidence intervals on the metrics reported in the main text. k/n is shown for the relevant denominator (Accuracy: all responses; False trust: responses judged correct; Misjudgment: AI-incorrect responses; Correct judgment: AI-correct responses). 
Metric Condition Rate (%) 95% CI Accuracy E+/− 60.4 [53.4, 67.1] E+ 58.0 [51."},{"citing_arxiv_id":"2605.10528","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics","primary_cat":"cond-mat.stat-mech","submitted_at":"2026-05-11T13:13:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Yang, Y. Liu, and X. Wang, CalibraEval: Calibrating prediction distribution to mitigate selection bias in LLMs-as-judges (2024), 10 arXiv:2410.15393 [cs.CL]. [16] P. F. Christiano, J. Leike, T. B. Brown, M. Martic, S. Legg, and D. Amodei, in Advances in Neural Information Processing Systems (NeurIPS), Vol. 30 (2017) arXiv:1706.03741 [stat.ML]. [17] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, in Advances in Neural Information Processing Systems (NeurIPS), Vol. 35 (2022) arXiv:2203.02155 [cs.CL]. [18] M. 
Sharma, M."},{"citing_arxiv_id":"2605.09678","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Absurd World: A Simple Yet Powerful Method to Absurdify the Real-world for Probing LLM Reasoning Capabilities","primary_cat":"cs.AI","submitted_at":"2026-05-10T17:55:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Absurd World automatically converts real-world problems into absurd yet logically coherent scenarios to test whether LLMs can reason without depending on familiar patterns.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09622","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study","primary_cat":"cs.CV","submitted_at":"2026-05-10T16:08:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-HMM challenge winner.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"domain gap. However, previous work mainly on the feature extraction backbone rather than generation, this motivates our question about first type knowledge transfer: Can 3D diffusion prior knowledge trained on a distant source domain help improve target-domain generation? User preference is another critical consideration in generative AI [44, 48]. 
It is particularly important in RT because multi-disciplinary team (oncologists, physicists, dosimetrists) collaboratively design treatment plans tailored to individual patients, and different institutions follow slightly or largely different protocols [9, 12]. Post-training with reinforcement learning (RL) has emerged as a powerful paradigm to align generative models with user preferences"}],"limit":50,"offset":0}