{"total":16,"items":[{"citing_arxiv_id":"2605.23643","ref_index":36,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin","primary_cat":"cs.CR","submitted_at":"2026-05-22T13:55:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"An RL-guided MCTS proof search for Tamarin finds more and shorter proofs than standard search across 16 protocol models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21962","ref_index":78,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"AI-Enabled Serious Games: Integrating Intelligence and Adaptivity in Training Systems","primary_cat":"cs.AI","submitted_at":"2026-05-21T03:48:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The chapter synthesizes the history of adaptive learning systems and examines how AI can provide instructional intelligence and real-time adaptivity in serious games while highlighting challenges such as explainability and limited long-term outcome data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19503","ref_index":30,"ref_count":2,"confidence":0.88,"is_internal_anchor":false,"paper_title":"ARC-RL: A Reinforcement Learning Playground Inspired by ARC Raiders","primary_cat":"cs.RO","submitted_at":"2026-05-19T07:54:40+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ARC-RL is a new suite of four MuJoCo continuous-control environments featuring game-inspired hexapod and quadruped morphologies, a single closed-form multi-component reward function, CPG demonstrators, and empirical comparisons of online and offline-to-online RL algorithms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18591","ref_index":54,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation","primary_cat":"cs.LG","submitted_at":"2026-05-18T16:05:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RAT reformulates regularized natural policy gradients as vanilla gradients with a transformed advantage, computed efficiently via randomized block Kaczmarz iterations on on-policy data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09542","ref_index":5,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs","primary_cat":"cs.AI","submitted_at":"2026-05-10T13:54:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"ordering;(ii)as different LLMs can exhibit different scor- ing behaviours, each subgraph is independently scored by 3 judges (GPT-4.1, DEEPSEEK-V3.1, QWEN3-235B). For each individual judgment, we extract the model's token prob- abilities over the five ordinal scale valuesk∈{1, . . .,5}, renormalise them, and compute a probability-weighted ex- pected ratingE[s]= P k k p(k), yielding a continuous score in[1,5]. The reported scores(per subgraph, dimension) are the mean of these expected ratings across all 9 permu- tation-judge combinations. 5 Results and Discussion 5.1 Evaluation on DrugMechDB Aggregate performance Table 1 reports scores pooled across state-eval LLMs. Node- level agreement is strong (NSA Micro P/R =0.71/0.83). At edge level, ESA@1 (no hop slack) yields Micro P/R ="},{"citing_arxiv_id":"2605.04979","ref_index":46,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"On-line Learning in Tree MDPs by Treating Policies as Bandit Arms","primary_cat":"cs.AI","submitted_at":"2026-05-06T14:32:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Bandit algorithms can be adapted to Tree MDPs by treating policies as arms with shared-data confidence bounds, achieving polynomial memory and instance-dependent bounds on sample complexity and regret that depend on terminal-state gaps rather than all policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23312","ref_index":12,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"GIFT: Global stabilisation via Intrinsic Fine Tuning","primary_cat":"cs.LG","submitted_at":"2026-04-25T13:56:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GIFT fine-tunes deep RL policies with a stability-focused reward to improve global stability while preserving task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20522","ref_index":11,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR","primary_cat":"cs.SD","submitted_at":"2026-04-22T13:01:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A two-stage OMR pipeline decodes symbol candidates into polyphonic score structures via topology recognition with probability-guided search.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19341","ref_index":119,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Evaluation-driven Scaling for Scientific Discovery","primary_cat":"cs.LG","submitted_at":"2026-04-21T11:24:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimpleTES scales test-time evaluation in LLMs to discover state-of-the-art solutions on 21 scientific problems across six domains, outperforming frontier models and optimization pipelines with examples like 2x faster LASSO and new Erdos constructions.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Evaluation-driven Scaling for Scientific Discovery Algorithm 1 SIMPLETES Require: instruction x0, generator G, evaluator V, initial solution y0, parameter (C, L, K, Φ)2H 1: (r0, m0) V(y0), S0 f (y0, r0, m0)g 2: function TRAJECTORY( S) 3: for ℓ = 1, . . . , L do 4: x Φ(S) 5: Generate K candidatesfykgK k=1\u0018 G(x) 6: Evaluate each candidate: (rk, mk) V(yk) for k2 [K] 7: S S[f (yk\u0003 , rk\u0003 , mk\u0003 )g where k\u0003 = arg maxk rk 8: end for 9: return S 10: end function 11: Run C independent trajectories in parallel: S1, . . . , SC TRAJECTORY (S0), . . . , TRAJECTORY (S0). 12: return arg max(y,r,m)2SC c=1 Sc r later attempts. A sequential refinement policy , on the other hand, uses feedback to improve later candi- dates, but commits the search to a single trajectory and can become trapped by early choices. Recent approaches demonstrate the power of iterative discovery systems that combine generation, evaluation, and refinement. Building on this evidence, this section asks a fundamental design question: how should evaluator queries be organized to use feedback most effectively? We introduce SIMPLETES, a simple algorithmic framework for scaling the evaluation-driven discovery loop at test time. The key idea is to organize evaluator queries through a compact design space: C independent trajectories provide global exploration, L committed refinement steps accumulate feedback within each trajectory ,K local candidates are evaluated before each commitment, and Φ maps the committed history into the next proposal. We present the pseudo-code of SIMPLETES in Algorithm 1, with its design space and analysis of each parameter specified below. We also discuss the theoretical insights for these parameter designs in Section B. Definition 2.1. Given a problem instruction x0, the hyper-parameter space (design space) of Algorithm 1 is defined as H =f(C, L, K, Φ) : C, L, K2 N+, Φ : FinSet (Y\u0002 R\u0002M )!Xg . (1) Here C, L, K are the scaling dimensions of SIMPLETES, Φ is a subroutine that creates n"},{"citing_arxiv_id":"2604.03472","ref_index":33,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Vocabulary Dropout for Curriculum Diversity in LLM Co-Evolution","primary_cat":"cs.CL","submitted_at":"2026-04-03T21:40:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Vocabulary dropout prevents diversity collapse in LLM co-evolution by masking proposer logits, yielding average +4.4 point solver gains on mathematical reasoning benchmarks at 8B scale.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.08080","ref_index":32,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Towards AI-assisted Neutrino Flavor Theory Design","primary_cat":"hep-ph","submitted_at":"2025-06-09T18:00:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AMBer applies reinforcement learning with physics feedback to automate construction of neutrino flavor models that minimize free parameters, validated on known cases and extended to a new symmetry group.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2501.16150","ref_index":140,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions","primary_cat":"cs.AI","submitted_at":"2025-01-27T15:44:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of 87 agents for computer use and 33 datasets that introduces a three-dimensional taxonomy across domain, interaction, and agent perspectives and identifies six research gaps.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"on NeurIPS. Curran Associates, Inc., New Orleans, LA, USA, 58202-58245. [139] Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. 2022. META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI. InProc. of the Conf. on EMNLP. ACL, Abu Dhabi, United Arab Emirates, 6699-6712. https://doi.org/10.18653/v1/ 2022.emnlp-main.449 [140] Richard S. Sutton. 1991. Dyna, an integrated architecture for learning, planning, and reacting.SIGART Bull.2, 4 (July 1991), 160-163. https://doi.org/10.1145/122344.122377 [141] Richard S. Sutton and Andrew G. Barto. 2018.Reinforcement Learning: An Introduction(second edition ed.). The MIT Press, Cambridge, MA, USA. [142] Rehan Syed, Suriadi Suriadi, Michael Adams, Wasana Bandara, Sander J."},{"citing_arxiv_id":"2207.05221","ref_index":284,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Language Models (Mostly) Know What They Know","primary_cat":"cs.CL","submitted_at":"2022-07-11T22:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2112.00861","ref_index":206,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"A General Language Assistant as a Laboratory for Alignment","primary_cat":"cs.CL","submitted_at":"2021-12-01T22:24:34+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2102.01293","ref_index":193,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Scaling Laws for Transfer","primary_cat":"cs.LG","submitted_at":"2021-02-02T04:07:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Effective data transferred from pre-training to fine-tuning is described by a power law in model parameter count and fine-tuning dataset size, acting like a multiplier on the fine-tuning data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"1907.07817","ref_index":32,"ref_count":1,"confidence":0.88,"is_internal_anchor":false,"paper_title":"Scheduling Discovery in the 2020s","primary_cat":"astro-ph.IM","submitted_at":"2019-07-17T23:53:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Advocates developing high-quality open-source scheduling software and linking observation planning with data analysis for future astronomical surveys.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}