{"total":91,"items":[{"citing_arxiv_id":"2605.22177","ref_index":72,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles","primary_cat":"cs.LG","submitted_at":"2026-05-21T08:47:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Maestro uses outcome-based RL to train a lightweight policy that orchestrates ensembles of frozen expert models and skills, reporting 70.1% average accuracy across ten multimodal benchmarks and outperforming GPT-5 and Gemini-2.5-Pro while generalizing to unseen components.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22012","ref_index":50,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-21T05:18:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved reasoning dataset and showing gains over text CoT baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20948","ref_index":55,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Memory Grafting: Scaling Language Model Pre-training via Offline Conditional Memory","primary_cat":"cs.CL","submitted_at":"2026-05-20T09:35:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Memory Grafting improves language-model benchmarks by grafting 
offline hidden-state memory from a larger model into a recipient model using n-gram lookups and lightweight adapters, outperforming MoE and vanilla Engram baselines at 0.92B and 2.8B scales.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20075","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CopT: Contrastive On-Policy Thinking with Continuous Spaces for General and Agentic Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-19T16:28:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CopT reverses CoT by eliciting a draft answer first then using continuous-embedding contrastive verification and on-policy thinking to reflect and correct, yielding up to 23% higher accuracy and 57% fewer tokens without training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19382","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-19T05:28:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18643","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Post-Trained MoE Can Skip Half Experts via 
Self-Distillation","primary_cat":"cs.LG","submitted_at":"2026-05-18T16:50:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18106","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers","primary_cat":"math.OC","submitted_at":"2026-05-18T09:17:26+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"sequence, and data parallelism. Recent work on DistributedMuon[105, 46, 124],Dion[ 3, 2], Disco[48], and ParallelMuon[101] suggests that these challenges can be addressed in practice. Encouragingly, matrix-aware optimizers have already begun to appear in industry-scale model training, including work by Moonshot AI [105, 80], Essential AI [47], Prime Intellect [124], Zhipu AI [52, 53], Zyphra [8, 152], Motif Technologies [101], Arcee AI [141], StepFun [142], and DeepSeek-AI [33]. These developments suggest that the question is no longer whether matrix-aware optimizers can be scaled, but how far they can be pushed once algorithmic geometry, numerical linear algebra, and distributed systems are designed in concert. 
Recent work by Du et al."},{"citing_arxiv_id":"2605.18083","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Data-Efficient Path to Multilingual LLMs: Language Expansion via Post-training PARAM$\\Delta$ Integration into Upcycled MoE","primary_cat":"cs.CL","submitted_at":"2026-05-18T08:59:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PARAMΔ upcycles dense models to MoE for per-language experts and grafts post-training deltas to enable data-efficient language expansion while preserving original capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17962","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"FinDocMRE: A Benchmark for Document-Level Financial Multimodal Reasoning Evaluation","primary_cat":"cs.CE","submitted_at":"2026-05-18T07:18:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FinDocMRE is a new multi-image document-level benchmark spanning 12 financial domains and 5 task types, showing that 11 tested LMMs all score below 65 overall with particular weaknesses in numerical estimation and cross-page grounding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17937","ref_index":37,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BacktestBench: Benchmarking Large Language Models for Automated Quantitative Strategy 
Backtesting","primary_cat":"cs.CL","submitted_at":"2026-05-18T06:52:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17672","ref_index":90,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stop When Reasoning Converges: Semantic-Preserving Early Exit for Reasoning Models","primary_cat":"cs.CL","submitted_at":"2026-05-17T22:04:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PUMA detects reasoning-level semantic redundancy to enable early exit in chains of thought, achieving 26.2% average token reduction across five LRMs and five benchmarks while preserving accuracy and CoT quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17228","ref_index":96,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Artificial Intolerance: Stigmatizing Language in Clinical Documentation Skews Large Language Model Decision-Making","primary_cat":"cs.CL","submitted_at":"2026-05-17T02:28:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Frontier LLMs exhibit bias from stigmatizing language in clinical vignettes across four conditions, skewing decisions toward less aggressive management, with limited mitigation from Chain-of-Thought or self-debiasing prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14442","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"GGBound: A Genome-Grounded Agent for Microbial Life-Boundary 
Prediction","primary_cat":"cs.CY","submitted_at":"2026-05-14T06:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A genome-conditioned 4B LLM agent predicts microbial life boundaries and matches larger frontier models via token fusion, tool use, and a counterfactual gene-grounding reward.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13138","ref_index":17,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Code-Centric Detection of Vulnerability-Fixing Commits: A Unified Benchmark and Empirical Study","primary_cat":"cs.SE","submitted_at":"2026-05-13T08:05:14+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Code language models show no transferable security understanding from code diffs alone, rely on commit messages, miss over 93% of fixes at 0.5% false positive rate, and suffer large drops under group or temporal splits.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"validation (20%). All experiments are trained on NVIDIA A100, H100 and H200 GPUs. All models except Qwen2.5-Coder are trained using an effective batch size 64, AdamW [23] and a learning rate of 2×10⁻⁵ using the classification heads provided by the HuggingFace *ForSequenceClassification model implementations. To train Qwen2.5-Coder [55] we use low rank adaptation [17] with rank 16 and α=32, a batch size of 2 with 32 gradient accumulation steps matching the same effective batch size and a learning rate of 5×10⁻⁵. For all models, class imbalance is handled via inverse frequency weighting. We report F1-scores as a well recognized metric and adapt the vulnerability detection score (VD-S) introduced by Ding et al. 
[9] as the patch detection score PD-S (=VD-S) = FNR@(FPR ≤ r)"},{"citing_arxiv_id":"2605.11277","ref_index":52,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sieve: Dynamic Expert-Aware PIM Acceleration for Evolving Mixture-of-Experts Models","primary_cat":"cs.AR","submitted_at":"2026-05-11T22:00:39+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Sieve dynamically schedules MoE experts across GPU and PIM hardware to handle bimodal token distributions, achieving 1.3x to 1.6x gains in throughput and interactivity over static prior PIM systems on three large models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"bining multiple dimensions of intelligence where a higher score indicates greater model capability [3]. Parameters in non-MoE layers are always activated and are therefore included in the activated parameter size. We measure the number of activated experts in MoE layers, which depends on input sequences, by running Mixtral-8x22B [24], Qwen3-30B-A3B [52], Qwen3-Next-80B-A3B [52], and GPT-OSS-120B [2] across various batch sizes (B) on traces of real-world requests [16]. For simplicity, we hereafter refer to these models as Mixtral, Qwen3, Qwen3-Next, and GPT-OSS, and denote the activated parameter ratio as act-ratio. We draw two key observations from Figure 3. 
Observation 1:Modern MoE models with higher capability exhibit"},{"citing_arxiv_id":"2605.10787","ref_index":33,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox","primary_cat":"cs.AI","submitted_at":"2026-05-11T16:20:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.","context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"{\"status\":\"ok\",\"output\":[\"Pomegranate Pavilion\"]} </response> <tool> {\"name\": \"get_shop_id_by_name\", \"arguments\": {\"shop_name\": \"Pomegranate Pavilion\"}} </tool> <response> {\"status\":\"ok\",\"output\":\"shop_7EpQHsj32kNUr7nLXfCAAs\"} </response> <tool> {\"name\": \"list_items\", \"arguments\": {\"sid\": \"shop_7EpQHsj32kNUr7nLXfCAAs\"}} </tool> <response> {\"status\":\"ok\",\"output\":[{\"tid\":\"item_2YnLcoeZC7Zrw3w6sTkJHo\",\"name\":\"braeburn apple (1kg)\",\"price\":1.18,\"star\":false},{\"tid\":\"item_QJArpa9SH7PZvqYmY549BY\",\"name\":\" cantaloupe (1/2 pc)\",\"price\":2.12,\"star\":false},{\"tid\":\" item_mhHob7L3f2thyjYmwZMgmx\",\"name\":\"coconut (1 pc)\",\"price\":1.92,\"star\":false },{\"tid\":\"item_cqNMJscxt9Z746snqMdnPa\",\"name\":\"empire apple (1kg)\",\"price\":1.37,\" 23 ComplexMCP: Evaluating LLM Agents in Large-Scale Tool Sandboxes star\":false},{\"tid\":\"item_PyuJGV7f42zJcTK5vyy8Ed\",\"name\":\"golden delicious apple (1kg)\",\"price\":1.26,\"star\":false},{\"tid\":\"item_xNJY8u84Jc2RqoCnbBTgBr\",\"name\":\" grape (1kg)\",\"price\":2.34,\"star\":false},{\"tid\":\"item_BZigPTWnXQV34izbQBWymY\",\" name\":\"idared apple (1kg)\",\"price\":0.93,\"star\":false},{\"tid\":\" 
item_UNCDDNp7k7aMEy4CobvG7c\",\"name\":\"kanzi apple (1kg)\",\"price\":1.9,\"star\":false },{\"tid\":\"item_G2t4pxaBsyiXPMDgSD7fJW\",\"name\":\"mangosteen (500g)\",\"price\":4.67,\" star\":false},{\"tid\":\"item_vv7djJkuyWyWuCJyWwQrVk\",\"name\":\"nectarine (1kg)\",\"price \":2.29,\"star\":false},{\"tid\":\"item_TWSqydhTX3gofSk9E466vJ\",\"name\":\"red delicious apple (1kg)\",\"price\":1.17,\"star\":false},{\"tid\":\"item_XxaUHZx2nptkpXZS5saDWV\",\" name\":\"sweetango apple (1kg)\",\"price\":1.33,\"star\":false},{\"tid\":\" item_5BYBHiTLV3QRC8uXDcp33A\",\"name\":\"thompson seedless grape (1kg)\",\"price \":1.93,\"star\":false}]} </response> <tool> {\"name\": \"check_balance\", \"arguments\": {}} </tool> <response> {\"status\":\"ok\",\"output\":57126} </response> <tool> {\"name\": \"add_to_cart\", \"arguments\": {\"cnt\": 10, \"sid\": \"shop_7EpQHsj32kNUr7nLXfCAAs \", \"tid\": \"item_5BYBHiTLV3QRC8uXDcp33A\"}} </tool> <response> {\"status\":\"ok\",\"output\":\"Added 10 x 'thompson seedle"},{"citing_arxiv_id":"2605.10468","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can Muon Fine-tune Adam-Pretrained Models?","primary_cat":"cs.LG","submitted_at":"2026-05-11T12:34:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Constraining fine-tuning updates with LoRA mitigates performance degradation when switching from Adam to Muon on pretrained models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09603","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Edit-Based Refinement for Parallel Masked Diffusion Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-10T15:31:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ME-DLM augments parallel masked diffusion models 
with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16363","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ORACLE: Anticipating Scams from Partial Trajectories in Streaming App Usage","primary_cat":"cs.LG","submitted_at":"2026-05-09T16:26:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ORACLE is a new agentic framework using adaptive context consolidation and teacher-student distillation to detect emerging scam patterns from incomplete, long-horizon app usage streams across 12 scam types.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08936","ref_index":5,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Self-ReSET: Learning to Self-Recover from Unsafe Reasoning Trajectories","primary_cat":"cs.AI","submitted_at":"2026-05-09T13:14:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Self-ReSET is a reinforcement learning approach that lets large reasoning models learn to recover from their own unsafe reasoning trajectories, improving robustness to adversarial jailbreaks while preserving utility.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nat., 645(8081): 633-638, 2025. 
[5] Aohan Zeng, Xin Lv, Qinkai Zheng, Zhenyu Hou, Bin Chen, Chengxing Xie, Cunxiang Wang, Da Yin, Hao Zeng, Jiajie Zhang, Kedong Wang, Lucen Zhong, Mingdao Liu, Rui Lu, Shulin Cao, Xiaohan Zhang, Xuancheng Huang, Yao Wei, Yean Cheng, Yifan An, Yilin Niu, Yuanhao Wen, Yushi Bai, Zhengxiao Du, Zihan Wang, Zilin Zhu, Bohan Zhang, Bosi Wen, Bowen Wu, Bowen Xu, Can Huang, Casey Zhao, Changpeng Cai, Chao Yu, Chen Li, Chendi Ge,"},{"citing_arxiv_id":"2605.08639","ref_index":66,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"ReLibra: Routing-Replay-Guided Load Balancing for MoE Training in Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-09T03:18:50+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"ReLibra uses pre-known token-to-expert routing from RL rollouts to perform inter-batch expert reordering and intra-batch replication, delivering up to 1.6x higher throughput than Megatron-LM and 1.2x over oracle-equipped EPLB while staying within 6-10% of an ideal balanced baseline.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"limits the synchronization kernel to occupy only a small number of SMs, leaving most SMs available for attention. 8 Discussion Off-policy RL. To improve data efficiency and accelerate the RL iteration, some off-policy RL algorithms either reuse the same batch of data for multiple parameter updates [26, 27] or allow rollout and training to proceed asynchronously [66, 67]. [Figure 16: Synchronization overhead of ReLibra. Panels: (a) Forward, (b) Backward; y-axis: Time (ms); models: Qwen3-30B-A3B, GLM4.5-106B-A12B, Qwen3-235B-A22B; series: Replica Sync., Attention Only, Attention + Replica Sync. (Overlapping).] 
In these cases, rollout and training do not necessarily use the"},{"citing_arxiv_id":"2605.08553","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VeriContest: A Competitive-Programming Benchmark for Verifiable Code Generation","primary_cat":"cs.SE","submitted_at":"2026-05-08T23:25:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"VeriContest supplies 946 problems with specs, code, proofs, and tests to benchmark verifiable code generation in Rust/Verus, showing models reach 92% on code but only 5% end-to-end on full verifiable synthesis.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"5: Agentic, reasoning, and coding (arc) foundation models, 2025. URLhttps://arxiv.org/abs/2508.06471. 12 [42] Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, and Swarat Chaudhuri. Clever: A curated benchmark for formally verified code generation.arXiv preprint arXiv:2505.13938, 2025. [43] Vasudev Vikram, Caroline Lemieux, Joshua Sunshine, and Rohan Padhye. Can large language models write good property-based tests?arXiv preprint arXiv:2307.04346, 2023. [44] Zhijie Wang, Zijie Zhou, Da Song, Yuheng Huang, Shengmai Chen, Lei Ma, and Tianyi Zhang. 
Towards understanding the characteristics of code generation errors made by large language"},{"citing_arxiv_id":"2605.08455","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:24:32+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"CUDABEAVER, and A specifies the protocol axes introduced in §3.4. The experiments are conducted on NVIDIA RTX PRO 6000 GPU (Blackwell) and H200 GPU (Hopper), according to the specific hardware requirement of each task. Models and default protocol. Evaluated models consist of GPT-5.4 [26], Qwen3.6-Plus and Qwen3.6-27B [36], Gemma-4-31B-it [30], Kimi-k2.6 [31], GLM-4.7 [29], and MiniMax-M2.7 [17]. Unless otherwise stated, each fixer receives up to K=5 repair attempts with H=4 retained rounds, feedback level L3, iterative sampling at T=0.7, and performance gate p=0.7. Scores use the asymmetric pass@k rule in Eq. (5); full backend, source-overlap, and axis-coverage details are provided in Appendix B. 
Protocol sweeps.Starting from the default protocol, we vary one axis at a time and hold the others"},{"citing_arxiv_id":"2605.08310","ref_index":12,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WebTrap: Stealthy Mid-Task Hijacking of Browser Agents During Navigation","primary_cat":"cs.CR","submitted_at":"2026-05-08T14:06:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WebTrap uses multi-step instruction fusion and context-grounded generation to stealthily hijack browser agents mid-navigation while preserving original task success.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Information Processing Systems 38 (NeurIPS 2025) Datasets and Benchmarks Track, 2025. URL https://openreview.net/forum?id=Ip1cCUAllL. [11] Gemini Team. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. doi: 10.48550/arXiv.2507. 06261. URLhttps://arxiv.org/abs/2507.06261. [12] GLM-4.5 Team. GLM-4.5: Agentic, reasoning, and coding (ARC) foundation models.arXiv preprint arXiv:2508.06471, 2025. doi: 10.48550/arXiv.2508.06471. URL https://arxiv.org/abs/2508. 06471. [13] Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. 
Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect"},{"citing_arxiv_id":"2605.08283","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"HTPO: Towards Exploration-Exploitation Balanced Policy Optimization via Hierarchical Token-level Objective Control","primary_cat":"cs.LG","submitted_at":"2026-05-08T07:38:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HTPO introduces hierarchical token-level objective control in RLVR to balance exploration and exploitation by grouping tokens according to difficulty, correctness, and entropy, yielding up to 8.6% gains on AIME benchmarks over DAPO.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[28] Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.In Proceedings of the International Conference on Learning Representations, 2021. [29] Gemini Team. Gemini 3 flash: frontier intelligence built for speed.https://blog.google/products- and-platforms/products/gemini/gemini-3-flash//, 2025. [30] GLM-4.5 Team. Glm-4.5: Agentic, reasoning, and coding (arc) foundation models.arXiv preprint arXiv: 2508.06471, 2025. [31] Kimi Team. Kimi k1.5: Scaling reinforcement learning with llms.arXiv preprint arXiv: 2501.12599, 2025. [32] Kimi Team. Kimi k2: Open agentic intelligence.arXiv preprint arXiv: 2507.20534, 2025. [33] Tongyi DeepResearch Team. 
Tongyi deepresearch technical report."},{"citing_arxiv_id":"2605.07250","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Hard to Read, Easy to Jailbreak: How Visual Degradation Bypasses MLLM Safety Alignment","primary_cat":"cs.CV","submitted_at":"2026-05-08T05:19:23+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Degraded image resolution in MLLMs bypasses safety alignments via cognitive overload, raising jailbreak rates across perturbations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"served 100% agreement with the LLM consensus. For the escalated subset, two independent human experts achieved a Cohen's κ of 0.96, indicating that the final ASR estimates are highly stable. The \"Inverted-U\" Vulnerability Curve. We evaluated the Attack Success Rate (ASR) across state-of-the-art MLLMs (including Qwen3-VL (Bai et al., 2025b), Doubao-Seed-1.6 (Team, 2025), GPT-4.1 (OpenAI, 2025) and Claude-Sonnet-4.5 (Anthropic, 2025b)) under varying DPI settings (D ∈ {15, 30, ..., 300}). We conducted identical tests across a wide range of models. The experimental results are presented in Table 1. Figure 3 illustrates the variation curves of OCR accuracy and ASR for selected models in detail. 
Results are identifiable in three distinct phases:"},{"citing_arxiv_id":"2605.06654","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Optimizer-Model Consistency: Full Finetuning with the Same Optimizer as Pretraining Forgets Less","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:57:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Full finetuning with the pretraining optimizer reduces forgetting compared to other optimizers or LoRA while achieving comparable new-task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06326","ref_index":34,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"evaluates code generation ability, and the knowledge-intensive GPQA-Diamond [30], our models consistently achieve non-trivial improvements over the base models, with gains of up to 14.5%. Cross-model transfer.Our recipe is not specific to a single model family. We apply the recipe to GLM-4.7-Flash, a model that already possesses native TIR capability [ 34]. 
As shown in Table 7, GLM-4.7-Flash w/ recipe further improves over the original GLM-4.7-Flash on most benchmarks, confirming that our data and training recipe provide complementary signal even for models with existing tool-use ability. Full training details are in Appendix B.4. 7 Conclusion In this study, we addressed the challenge of teaching strong thinking models to perform tool-integrated"},{"citing_arxiv_id":"2605.06230","ref_index":32,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence","primary_cat":"cs.AI","submitted_at":"2026-05-07T13:21:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"from being a tool for alignment to becoming a key method for enhancing reasoning capabilities. Corresponding optimization methods have moved beyond the classic RLHF-PPO pipeline, adopting new approaches such as GRPO that are better suited to large-scale online sampling and relative comparison signals. On the other hand, reports such as Qwen3 [71], GLM-4.5 [32], Kimi K2 [45], Tongyi DeepResearch [83], Scaling Agents via Continual Pre-training [79], GLM-5 [96], and DeepSeek-V3.2 [24] have broadened the training narrative to incorporate elements like thinking modes, agentic capabilities, tool usage, continual pre-training, and the decoupling of generation and training. 
In other words, the training system must now"},{"citing_arxiv_id":"2605.04831","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"StoryAlign: Evaluating and Training Reward Models for Story Generation","primary_cat":"cs.CL","submitted_at":"2026-05-06T12:28:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02892","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion","primary_cat":"cs.CV","submitted_at":"2026-05-04T17:59:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AlbumFill retrieves identity-consistent references from personal albums via VLM-inferred semantic cues to support personalized image completion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02821","ref_index":10,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Is the Same Model Not the Same Service? 
A Measurement Study of Hosted Open-Weight LLM APIs","primary_cat":"cs.PF","submitted_at":"2026-05-04T16:59:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01018","ref_index":16,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"WildTableBench: Benchmarking Multimodal Foundation Models on Table Understanding In the Wild","primary_cat":"cs.CV","submitted_at":"2026-05-01T18:28:49+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"WildTableBench is the first QA benchmark for naturally occurring table images, where 21 multimodal models were evaluated and only one exceeded 50% accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00519","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Silicon Showdown: Performance, Efficiency, and Ecosystem Barriers in Consumer-Grade LLM Inference","primary_cat":"cs.PF","submitted_at":"2026-05-01T08:45:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Nvidia achieves 1.6x throughput with NVFP4 but hits a VRAM wall for 70B+ models, while Apple UMA enables linear scaling to 80B at 4-bit with up to 23x better energy 
efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26103","ref_index":54,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving","primary_cat":"cs.AR","submitted_at":"2026-04-28T20:36:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H100 for 1M context serving.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.25806","ref_index":59,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MAIC-UI: Making Interactive Courseware with Generative UI","primary_cat":"cs.CL","submitted_at":"2026-04-28T16:15:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MAIC-UI provides a zero-code authoring system for generating and iteratively editing interactive courseware from educational materials via structured analysis and incremental generation, with lab and classroom evaluations showing usability gains and learning improvements.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"approaches lower barriers via visual feedback: TweakIt allows non- experts to iteratively transform code behavior through real-time interaction [27], and Ply introduces clear boundary management for trigger-action programming [31]. 
EUP is also expanding into emerging domains such as mixed reality and robotics: agentAR supports rapid AR application construction through natural lan- guage [59], and Alchemist simplifies robot behavior authoring into collaborative goal specification [25]. Recent work further explores support for developer's reflection in AI-assisted workflows [4]. Despite these advances, non-expert users still face barriers when modifying AI-generated artifacts. Existing EUP tools primarily sup- port generating code from scratch but offer limited mechanisms"},{"citing_arxiv_id":"2604.24118","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"AgentVisor: Defending LLM Agents Against Prompt Injection via Semantic Virtualization","primary_cat":"cs.CR","submitted_at":"2026-04-27T07:12:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentVisor cuts prompt injection success rate to 0.65% in LLM agents with only 1.45% utility loss via semantic privilege separation and one-shot self-correction.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22577","ref_index":33,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"QuantClaw: Precision Where It Matters for OpenClaw","primary_cat":"cs.AI","submitted_at":"2026-04-24T14:10:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tasks spanning multiple domains, including service orchestration, multimodal perception, and multi-turn dialogue, and evaluates agents along completion, safety, and robustness dimensions. 
Importantly, Claw-Eval incorporates trajectory- level auditing and controlled perturbation, enabling a more reliable assessment of agent behavior beyond final outputs. Models and measurement metrics.We employ 6 models for benchmarking, which include GLM-4.7-Flash- 30B3 [33], GLM-5-744B 4 [34], MiniMax-M2.5-229B 5, Qwen3.5-9B 6, Qwen3.5-35B-A3B 6, and Qwen3.5-397B- A17B6. These models are widely adopted and representative of current large language model families, which cover 2https://github.com/pinchbench 3https://huggingface.co/zai-org/GLM-4.7-Flash 4https://huggingface.co/zai-org/GLM-5 5https://huggingface.co/MiniMaxAI/MiniMax-M2."},{"citing_arxiv_id":"2604.22541","ref_index":40,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Dr.Sai: An agentic AI for real-world physics analysis at BESIII","primary_cat":"hep-ex","submitted_at":"2026-04-24T13:32:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"The dashed line shows the standard output without fluc- tuation. 4 Performance study A systematic performance evaluation of the Dr.Sai system was conducted to character- ize its success rates, failure patterns, and resource consumption across several state-of- the-art LLMs. The evaluated models include Qwen3-max-2025-09 (Qwen3-max) [13], DeepSeek-v3.2 [14], GLM-4.7 [40], DeepSeek-R1 [41], and GPT-4o [42]. The evaluation dataset consists of ten validated HEP measurement queries. To ensure statistical reliability, each query was sampled ten times per model, yielding 100 test samples for each LLM. All benchmark tasks followed a \"plan-first, execute- later\" workflow (detailed in Table 2). 
This rigorous multi-model assessment provides a"},{"citing_arxiv_id":"2604.22226","ref_index":29,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Temporal Compositional Reasoning in Long-Form Sports Videos","primary_cat":"cs.CV","submitted_at":"2026-04-24T05:02:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"timestamps help the model better leverage temporal evidence. Finally, adding AT-IO (Anchor-TriggeredInteractiveObservation) achieves the best overall performance, showing that test-time verification provides complementary bene- fits beyond reward-based training. Table 6: Open-ended QA accu- racy under multiple judges.All scores are in %. Model Qwen MiniMax GLM [29] Human InternVideo2.5-8B 16.68 14.25 17.94 17.37Qwen3-VL-4B 25.15 24.01 25.89 24.08Ours-4B 29.23 28.74 30.23 29.60 Judge consistency:Avg. pairwise agreement = 88.34%, Fleiss'κ= 0.57.Human alignment:Cohen'sκ(Qwen/MiniMax/GLM vs Human) =0.6467/0.5759/0.5882. (a)Frames vs. Acc. (b)Video length vs. Acc. 
Fig.7: Video setting ablation studies.(a) Accuracy as a function of frame budget."},{"citing_arxiv_id":"2604.21850","ref_index":39,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OptiMat Alloys: a FAIR, living database of multi-principal element alloys enabled by a conversational agent","primary_cat":"cond-mat.mtrl-sci","submitted_at":"2026-04-23T16:40:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OptiMat Alloys is a conversational AI system that maintains a living FAIR database of multi-principal element alloy calculations and enables natural-language, on-demand computations with built-in uncertainty checks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12530","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"In-Situ Behavioral Evaluation for LLM Fairness, Not Standardized-Test Scores","primary_cat":"cs.CL","submitted_at":"2026-04-21T18:38:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Standardized-test benchmarks for LLM fairness are unreliable because prompt wording alone drives most score variance and ranking changes, while a multi-agent conversational framework reveals consistent model-specific fairness behaviors across millions of dialogues.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19859","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DR-Venus: Towards Frontier Edge-Scale Deep Research Agents with Only 10K Open Data","primary_cat":"cs.LG","submitted_at":"2026-04-21T17:59:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A 4B deep research agent trained on 10K open data 
outperforms prior agents under 9B parameters and narrows the gap to 30B-class systems on research benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19654","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"FEPLB: Exploiting Copy Engines for Nearly Free MoE Load Balancing in Distributed Training","primary_cat":"cs.DC","submitted_at":"2026-04-21T16:43:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FEPLB reduces token and GEMM stragglers in MoE training by 50-70% using nearly free Copy Engine communication on Hopper architecture.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22840","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AeSlides: Incentivizing Aesthetic Layout in LLM-Based Slide Generation via Verifiable Rewards","primary_cat":"cs.CV","submitted_at":"2026-04-21T11:59:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AeSlides is a GRPO-based RL framework that uses verifiable aesthetic metrics to optimize LLM slide generation, achieving large gains in layout quality metrics and human scores with only 5K prompts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19835","ref_index":13,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Expert Upcycling: Shifting the Compute-Efficient Frontier of Mixture-of-Experts","primary_cat":"cs.LG","submitted_at":"2026-04-21T05:53:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Expert upcycling duplicates experts in an existing MoE checkpoint and continues pre-training to match 
fixed-size baseline performance with 32% less compute.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"MoE has since become the dominant architecture for frontier open-source models. Mixtral 8x7B [22] (47B total, 13B active) established early open-weight MoE baselines; Llama 4 Scout and Maverick [35] (109B/400B total, 17B active) scaled this to natively multimodal pretraining; DeepSeek-V3 [7] (671B total, 37B active), Qwen3 [50] (235B total, 22B active), Kimi K2 [37] (1T total, 32B active), and GLM-4.5 [13] (355B total, 32B active) represent the current frontier of open-source MoE models, each matching or exceeding dense models many times their active size. Across these systems, the trend is consistent: total parameters grow aggressively while active parameters per token remain fixed, directly instantiating the scaling-law prediction that lower activation ratios yield better quality-per-FLOP trade-offs [51, 34]."},{"citing_arxiv_id":"2604.18530","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OGER: A Robust Offline-Guided Exploration Reward for Hybrid Reinforcement Learning","primary_cat":"cs.AI","submitted_at":"2026-04-20T17:26:00+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18381","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes","primary_cat":"cs.AI","submitted_at":"2026-04-20T15:04:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Mixed-complexity procedural datasets provide up to 5x sample efficiency for RLVR on small models in low-data regimes, with low-to-high complexity 
generalization observed across counting, graph, and spatial tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14922","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LongAct: Harnessing Intrinsic Activation Patterns for Long-Context Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-16T12:06:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LongAct uses saliency from high-magnitude activations to guide sparse weight updates in long-context RL, yielding about 8% gains on LongBench v2 across multiple algorithms.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14626","ref_index":63,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ELMoE-3D: Leveraging Intrinsic Elasticity of MoE for Hybrid-Bonding-Enabled Self-Speculative Decoding in On-Premises Serving","primary_cat":"cs.LG","submitted_at":"2026-04-16T05:12:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ELMoE-3D achieves 6.6x average speedup and 4.4x energy efficiency gain for MoE serving on 3D hardware by scaling expert and bit elasticity for elastic self-speculative decoding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"coding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts. arXiv:2511.14102 [cs.LG] https://arxiv.org/abs/2511.14102 [62] Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, and Zhengwei Qi. 2025. BuddyMoE: Exploiting Expert Re- dundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference. arXiv:2511.10054 [cs.LG] https://arxiv.org/abs/2511.10054 [63] Tony F. Wu, Huichu Liu, H. 
Ekin Sumbul, Lita Yang, Dipti Baheti, Jeremy Coriell, William Koven, Anu Krishnan, Mohit Mittal, Matheus Trevisan Moreira, Max Waugaman, Laurent Ye, and Edith Beigné. 2024. 11.2 A 3D Integrated Prototype System-on-Chip for Augmented Reality Applications Using Face-to-Face Wafer Bonded 7nm Logic at < 2 𝜇mPitch with up to 40% Energy Reduction at Iso-"}],"limit":50,"offset":0}