{"total":35,"items":[{"citing_arxiv_id":"2605.22570","ref_index":90,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis","primary_cat":"cs.CV","submitted_at":"2026-05-21T14:48:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from reasoning.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"we adopt circular evaluation [53] to control position bias and choice priors, ensuring that reported accuracy reflects genuine reasoning. Each question is evaluated under all N cyclic permutations of its answer choices, and the model is scored as correct only if it answers correctly under every cycle. For the open-ended variant, we follow the LLM-as-judge protocol [90], using Claude-Sonnet-4.6 [1] to judge whether the model's response is semantically equivalent to the ground-truth answer. Human Baseline. We report human evaluation performance on VGenST-Bench. 
To conduct the human evaluation, we recruit 10 participants from diverse backgrounds, excluding computer-science majors to ensure that performance reflects general spatio-temporal reasoning capabilities."},{"citing_arxiv_id":"2605.21362","ref_index":40,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T16:27:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.21177","ref_index":32,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"ChunkFT: Byte-Streamed Optimization for Memory-Efficient Full Fine-Tuning","primary_cat":"cs.LG","submitted_at":"2026-05-20T13:44:44+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ChunkFT enables full-parameter fine-tuning of Llama 3-8B on one 24 GB GPU and Llama 3-70B on two 80 GB GPUs by streaming gradients over dynamically activated sub-tensors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20005","ref_index":73,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Fine-Tuning Without Forgetting via Loss-Adaptive Learning Rates","primary_cat":"cs.LG","submitted_at":"2026-05-19T15:36:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"FINCH is a loss-adaptive learning-rate schedule that reduces 
forgetting by 93% on average during LLM fine-tuning while matching standard task performance across several benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19035","ref_index":73,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Trustworthy Agent Network: Trust in Agent Networks Must Be Baked In, Not Bolted On","primary_cat":"cs.AI","submitted_at":"2026-05-18T18:57:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Argues that trustworthiness in Agent-to-Agent networks requires a new conceptual framework with four design pillars baked in from the beginning, as retrofitting existing single-agent methods is insufficient.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18721","ref_index":34,"ref_count":3,"confidence":0.55,"is_internal_anchor":false,"paper_title":"General Preference Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-05-18T17:50:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17310","ref_index":45,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Attention Hijacking: Response Manipulation Across Queries in Vision-Language 
Models","primary_cat":"cs.CV","submitted_at":"2026-05-17T08:02:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Attention Hijacking is a new attack that improves cross-query transferability in VLMs by explicitly steering internal attention to a persistent image-dominant pattern.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15513","ref_index":57,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CAPS: Cascaded Adaptive Pairwise Selection for Efficient Parallel Reasoning","primary_cat":"cs.AI","submitted_at":"2026-05-15T01:16:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"CAPS is a four-stage inference-only cascade that adapts how much of each solution the verifier sees and how comparisons are distributed, halving per-candidate verifier tokens while outperforming uniform pairwise verification on most benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.14978","ref_index":29,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing","primary_cat":"cs.CL","submitted_at":"2026-05-14T15:41:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13438","ref_index":68,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"CogniFold: 
Always-On Proactive Memory via Cognitive Folding","primary_cat":"cs.AI","submitted_at":"2026-05-13T12:34:39+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13026","ref_index":78,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Understanding and Accelerating the Training of Masked Diffusion Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-13T05:29:38+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Bell-shaped time sampling accelerates masked diffusion language model training by roughly 4x on LM1B by countering locality bias in language data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.09269","ref_index":51,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"DeltaRubric: Generative Multimodal Reward Modeling via Joint Planning and Verification","primary_cat":"cs.CL","submitted_at":"2026-05-10T02:32:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DeltaRubric decomposes multimodal preference evaluation into self-generated planning and verification steps within a single model, producing large accuracy improvements on VL-RewardBench via multi-role reinforcement learning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=EuN5iszF0a. [50] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 
Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595-46623, 2023. [51] Tong Zheng, Hongming Zhang, Wenhao Yu, Xiaoyang Wang, Runpeng Dai, Rui Liu, Huiwen Bao, Chengsong Huang, Heng Huang, and Dong Yu. Parallel-r1: Towards parallel thinking via reinforcement learning. arXiv preprint arXiv:2509.07980, 2025. [52] Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, Yuwen Xiong, and Richong Zhang. Easyr1: An efficient, scalable, multi-modality rl training framework."},{"citing_arxiv_id":"2605.08715","ref_index":67,"ref_count":4,"confidence":0.55,"is_internal_anchor":false,"paper_title":"AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems","primary_cat":"cs.CL","submitted_at":"2026-05-09T05:55:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.","context_count":2,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"step it commits, opening an intervention window before the failure is locked in (see Section 2). agent or step is responsible once the trajectory has already failed [63, 62, 67], as illustrated in Figure 1(a). For instance, Who&When [63] and AgenTracer [62] curate failed trajectories and train or prompt models to pinpoint the decisive error step after the run has ended, while AgentDebug [67] and related debugging frameworks [49, 19] analyze full trajectories to taxonomize failures and supply corrective feedback for subsequent retries. However, confining failure analysis to the post-hoc regime forgoes any opportunity to act while the trajectory is still unfolding. 
Before a diagnosis is available, agents have already consumed further tool calls and external resources, and in deployment settings"},{"citing_arxiv_id":"2605.08589","ref_index":55,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"S2FT: Parameter-Efficient Fine-Tuning in Sparse Spectrum Domain","primary_cat":"cs.CV","submitted_at":"2026-05-09T01:10:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"S2FT replaces the sparse-spectrum assumption of prior Fourier PEFT with a learned rearrangement that maps a pre-estimated weight change into a domain where few spectral coefficients suffice.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"eral Language Understanding Evaluation [41]) benchmark to compare our S2FT and baselines, where Acc, MCC, and PCC are used as metrics. 4) Instruction Tuning. For instruction tuning, we conducted experiments on the Alpaca dataset [42]. During evaluation, following [9] the fine-tuned models are used to answer a set of standardized questions sourced from the MT-Bench [55] and Vicuna [5] Eval benchmark suites. The generated responses are then scored by GPT-4 on a scale from 0 to 10. 4.2. Implementation Details and Baselines Image Classification. We follow [15] to process the images of FGVC and VTAB-1k. 
We employ the AdamW opti- Method DINO↑CLIP-I↑CLIP-T↑LPIPS↑Params(%) RealImages 0.703 0.864 − 0.695 − DreamBooth 0."},{"citing_arxiv_id":"2605.08277","ref_index":55,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Mitigating Many-shot Jailbreak Attacks with One Single Demonstration","primary_cat":"cs.CR","submitted_at":"2026-05-08T06:33:42+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A single safety demonstration appended at inference time mitigates many-shot jailbreak attacks by counteracting implicit malicious fine-tuning on harmful examples.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Activation approximations can incur safety vulnerabilities in aligned LLMs: Comprehensive analysis and defense. In 34th USENIX Security Symposium (USENIX Security 25), pages 339-358, 2025. [54] Jiawen Zhang, Lipeng He, Kejia Chen, Jian Lou, Jian Liu, Xiaohu Yang, and Ruoxi Jia. Safety at one shot: Patching fine-tuned llms with a single instance. arXiv preprint arXiv:2601.01887, 2026. [55] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36:46595-46623, 2023. 
[56] Weixiong Zheng, Peijian Zeng, Yiwei Li, Hongyan Wu, Nankai Lin, Junhao Chen, Aimin Yang,"},{"citing_arxiv_id":"2605.06632","ref_index":34,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Crafting Reversible SFT Behaviors in Large Language Models","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:44:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06605","ref_index":11,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"How Many Iterations to Jailbreak? Dynamic Budget Allocation for Multi-Turn LLM Evaluation","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:25:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DAPRO provides the first dynamic, theoretically guaranteed way to allocate interaction budgets across test cases for bounding time-to-event in multi-turn LLM evaluations, achieving tighter coverage than static conformal survival methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06761","ref_index":49,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Weblica: Scalable and Reproducible Training Environments for Visual Web Agents","primary_cat":"cs.AI","submitted_at":"2026-05-07T17:17:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Weblica scales RL training for visual web agents by building thousands of reproducible environments through HTTP caching for stable replays and LLM synthesis from real sites, yielding an 
8B model that beats similar open baselines on navigation benchmarks.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Please see Appendix C for further details. 4.1 Data and Reward RL Data. We describe task sourcing for RL environments in Section 3.2 and Section 3.3. Please see Appendix D.1 for environment statistics. LLM-as-Judge Reward. As many web navigation tasks are open-ended and cannot be evaluated programmatically or via string matching, we implement an LLM-as-judge [49] reward mechanism. Given a task description, the agent's action sequence, and resulting screenshots, we prompt GPT-4o [16] to assess whether the agent successfully completed the task. This enables training on the full diversity of web tasks beyond those with programmatic verification. We validate the LLM judge by measuring agreement with human evaluations,"},{"citing_arxiv_id":"2605.06161","ref_index":54,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05662","ref_index":59,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"XL-SafetyBench: A Country-Grounded Cross-Cultural Benchmark for LLM Safety and Cultural Sensitivity","primary_cat":"cs.CL","submitted_at":"2026-05-07T04:35:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"XL-SafetyBench is a 
new cross-cultural benchmark showing frontier LLMs decouple jailbreak robustness from cultural sensitivity while local models trade off attack success against neutral-safe rates in a near-linear pattern indicating generation failure rather than alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27488","ref_index":23,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Skills-Coach: A Self-Evolving Skill Optimizer via Training-Free GRPO","primary_cat":"cs.CL","submitted_at":"2026-04-30T06:39:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"Skills-Coach optimizes LLM agent skills via task generation, prompt/code tuning, comparative execution, and traceable evaluation, reporting gains on a 48-skill benchmark called Skill-X.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26577","ref_index":47,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control","primary_cat":"cs.AI","submitted_at":"2026-04-29T11:58:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs for robotic health attendant control violate safety rules in 54.4% of harmful scenarios on average, with proprietary models at 23.7% median violation versus 72.8% for open-weight models, indicating they are not yet safe for clinical use.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24700","ref_index":58,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Green Shielding: A User-Centric Approach Towards Trustworthy 
AI","primary_cat":"cs.CL","submitted_at":"2026-04-27T17:04:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Green Shielding introduces CUE criteria and the HCM-Dx benchmark to demonstrate that routine prompt variations systematically alter LLM diagnostic behavior along clinically relevant dimensions, producing Pareto-like tradeoffs in plausibility versus coverage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24525","ref_index":31,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Understanding the Limits of Automated Evaluation for Code Review Bots in Practice","primary_cat":"cs.SE","submitted_at":"2026-04-27T14:25:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Automated LLM-based evaluation of code review bot comments achieves only moderate agreement (0.44-0.62) with developer labels in an industrial dataset because developer decisions reflect contextual constraints beyond comment quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24819","ref_index":48,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora","primary_cat":"cs.SE","submitted_at":"2026-04-27T14:05:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 
disciplines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23505","ref_index":6,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Uncertainty Propagation in LLM-Based Systems","primary_cat":"cs.SE","submitted_at":"2026-04-26T02:48:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insights and open challenges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22302","ref_index":79,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Knowledge Visualization: A Benchmark and Method for Knowledge-Intensive Text-to-Image Generation","primary_cat":"cs.CV","submitted_at":"2026-04-24T07:33:52+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"KVBench reveals major gaps in current T2I models for knowledge-intensive tasks, and KE-Check narrows the gap between open- and closed-source models by adding structured knowledge and enforcing constraints.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20932","ref_index":33,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Adaptive Defense Orchestration for RAG: A Sentinel-Strategist Architecture against Multi-Vector Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-22T11:17:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A context-aware Sentinel-Strategist system for RAG selectively applies defenses to block 
membership inference and data poisoning while recovering most retrieval utility compared to always-on defense stacks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18463","ref_index":76,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Using large language models for embodied planning introduces systematic safety risks","primary_cat":"cs.AI","submitted_at":"2026-04-20T16:18:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM planners for robots often produce dangerous plans even when planning succeeds, with safety awareness staying flat as model scale improves planning ability.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14548","ref_index":91,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07985","ref_index":24,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Rag Performance Prediction for Question Answering","primary_cat":"cs.CL","submitted_at":"2026-04-09T08:55:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG 
improves QA performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07655","ref_index":94,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Guardian-as-an-Advisor: Advancing Next-Generation Guardian Models for Trustworthy LLMs","primary_cat":"cs.LG","submitted_at":"2026-04-08T23:47:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Guardian-as-an-Advisor prepends risk labels and explanations from a guardian model to queries, improving LLM safety compliance and reducing over-refusal while adding minimal compute overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06683","ref_index":34,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Benchmarking Requirement-to-Architecture Generation with Hybrid Evaluation","primary_cat":"cs.SE","submitted_at":"2026-04-08T04:58:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"R2ABench benchmark shows LLMs generate syntactically valid software architectures from requirements but produce structurally fragmented results due to weak relational reasoning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.27771","ref_index":141,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Emergent Social Intelligence Risks in Generative Multi-Agent Systems","primary_cat":"cs.MA","submitted_at":"2026-03-29T17:10:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Generative multi-agent systems exhibit emergent collusion and conformity behaviors that cannot be prevented by existing agent-level 
safeguards.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.09611","ref_index":76,"ref_count":1,"confidence":0.55,"is_internal_anchor":false,"paper_title":"Characterizing Performance-Energy Trade-offs of Large Language Models in Multi-Request Workflows","primary_cat":"cs.DC","submitted_at":"2026-03-12T10:10:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This work delivers the first measurements of performance-energy trade-offs across four multi-request LLM workflow patterns on A100 GPUs using vLLM and Parrot.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}