{"total":25,"items":[{"citing_arxiv_id":"2605.20994","ref_index":46,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Towards Context-Invariant Safety Alignment for Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-20T10:33:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18194","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks","primary_cat":"cs.AI","submitted_at":"2026-05-18T10:32:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MLLMs achieve only 42% accuracy on a new audio-visual task requiring second-order spatial ToM under perceptual limits, while a proposed sensory-bounded CoT outperforms egocentric and allocentric baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18890","ref_index":63,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits","primary_cat":"physics.soc-ph","submitted_at":"2026-05-17T00:21:53+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Minor perturbations in persona format, instruction framing, and network structure shift cooperation by up to 76 percentage points and polarization metrics consistently, showing that LLM social 
simulations require per-claim robustness audits via the new TRAILS taxonomy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08455","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CUDABeaver: Benchmarking LLM-Based Automated CUDA Debugging","primary_cat":"cs.LG","submitted_at":"2026-05-08T20:24:32+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Kernel [15] also mention the degeneration behavior of LLMs but do not quantify how it impacts CUDA generation quality, because no evaluation metric for this is practical on benchmarks they used. Protocol-conditional evaluation in LLM benchmarking. Recent work shows that benchmark scores often reflect evaluation protocol as much as model capability. Sclar et al. [25] report a 76 percentage point shift for LLaMA-2-13B under prompt-format changes, Alzahrani et al. [1] show that MMLU rankings can move by up to 8 positions as protocol choices vary, and Mizrahi et al. [18] argue that a prompt outcome should be viewed as a single draw from a broader response distribution. 
CUDABEAVER applies this lens to CUDA debugging through pass@k(M, C, A),"},{"citing_arxiv_id":"2605.07830","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios","primary_cat":"cs.CR","submitted_at":"2026-05-08T14:57:53+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM agents exhibit persistent attack-selection biases as fixed traits independent of success rates, with a bias momentum effect that resists steering and yields no performance gain.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[19] OWASP CRS Project. OWASP ModSecurity core rule set. https://coreruleset.org/. Accessed: 2026-04-28. [20] OWASP Foundation. Owasp web security testing guide. URL <https://owasp.org/www-project-web-security-testing-guide/>. [21] OWASP Foundation. Owasp top 10 web application security risks, 2025. URL <https://owasp.org/www-project-top-ten/>. [22] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324, 2023. [23] Claude Elwood Shannon. 
A mathematical theory of communication. The Bell system technical journal, 27(3):379-423, 1948."},{"citing_arxiv_id":"2605.07186","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Text Uncanny Valley: Non-Monotonic Performance Degradation in LLM Information Retrieval","primary_cat":"cs.CL","submitted_at":"2026-05-08T03:26:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"LLM information retrieval shows a U-shaped performance drop as words are fragmented by inserted whitespace, attributed to a disordered transition between word-level and character-level processing modes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06656","ref_index":293,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Why Global LLM Leaderboards Are Misleading: Small Portfolios for Heterogeneous Supervised ML","primary_cat":"cs.LG","submitted_at":"2026-05-07T17:57:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Global Bradley-Terry rankings of LLMs are misleading due to structured heterogeneity in user preferences, and small (λ, ν)-portfolios recover coherent subpopulations that cover over 96% of votes with just five rankings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06327","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Measuring Evaluation-Context Divergence in Open-Weight LLMs: A Paired-Prompt Protocol with Pilot Evidence of Alignment-Pipeline-Specific Heterogeneity","primary_cat":"cs.CL","submitted_at":"2026-05-07T14:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new paired-prompt protocol reveals 
alignment-pipeline-specific heterogeneity in how open-weight LLMs respond to evaluation versus deployment framings.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06161","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges","primary_cat":"cs.AI","submitted_at":"2026-05-07T12:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLM safety judges flip verdicts on equivalent policy rewrites up to 9.1% of the time and cannot distinguish meaningful from meaningless changes, requiring new invariance-based reliability metrics.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 8352-8370, 2024. [37] Abel Salinas and Fred Morstatter. The butterfly effect of altering prompts: How small changes and jailbreaks affect large language model performance. In Findings of the Association for Computational Linguistics: ACL 2024, pages 4629-4651, 2024. [38] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying language models' sensitivity to spurious features in prompt design or: How i learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324, 2023. 
[39] Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Schärli, and Denny Zhou."},{"citing_arxiv_id":"2605.04665","ref_index":11,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Paraphrase-Induced Output-Mode Collapse: When LLMs Break Character Under Semantically Equivalent Inputs","primary_cat":"cs.CL","submitted_at":"2026-05-06T09:11:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"LLMs show systematic output-mode collapse on closed-form prompts, with only ~22% of semantically equivalent variants preserving the requested bare-label format across five models and four tasks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"that carefully crafted perturbations can elicit unintended behaviors. Complementary research has investigated self-consistency in multi-step reasoning [6], showing that sampling multiple reasoning paths can improve accuracy but also revealing significant output variance across semantically equivalent prompts. Surface-form sensitivity in benchmark evaluation has been documented by Sclar et al. [11], who report that subtle changes to prompt formatting (separators, casing, ordering) can swing few-shot accuracy by up to 76 percentage points on LLaMA-2-13B, though their analysis targets prompt-template perturbations on the evaluator side rather than content-preserving rewrites of the user payload. Ribeiro et al. 
[4] demonstrated similar brittleness in NLU models"},{"citing_arxiv_id":"2605.03111","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Benchmarking Local Language Models for Social Robots using Edge Devices","primary_cat":"cs.RO","submitted_at":"2026-05-04T19:49:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Benchmarking 25 LLMs on Raspberry Pi hardware shows Granite4 Tiny Hybrid (7B) balances 2.5 tokens/s, 0.90 tokens/J, and 54.6% MMLU while teaching effectiveness does not require high general knowledge scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02038","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"What Single-Prompt Accuracy Misses: A Multi-Variant Reliability Audit of Language Models","primary_cat":"cs.CL","submitted_at":"2026-05-03T20:05:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-variant testing reveals that prompt design and evaluator choices can change apparent model reliability by large margins, with verbal confidence often overstated and robustness uncorrelated with size.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01048","ref_index":54,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Compared to What? 
Baselines and Metrics for Counterfactual Prompting","primary_cat":"cs.CL","submitted_at":"2026-05-01T19:23:33+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistical comparison.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14672","ref_index":53,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SPAGBias: Uncovering and Tracing Structured Spatial Gender Bias in Large Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-16T06:30:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SPAGBias reveals that LLMs form nuanced gender associations with specific urban micro-spaces that exceed real-world distributions and produce failures in planning and descriptive tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11328","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees","primary_cat":"cs.AI","submitted_at":"2026-04-13T11:31:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set 
scoring.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.07745","ref_index":58,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Cartesian Cut in Agentic AI","primary_cat":"cs.AI","submitted_at":"2026-04-09T03:03:06+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLM agents use a Cartesian split between learned prediction and engineered control, enabling modularity but creating sensitivity and bottlenecks unlike integrated biological systems.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"g., permissions, stopping/retry logic, and memory serialization) are implemented in the runtime and become available to the core only when explicitly serialized into this protocol. Changing this protocol (schemas, prompt formatting, memory representation) can materially change behavior because it changes how control is externalized and communicated [58, 40]. Training is orthogonal to the cut. The Cartesian cut is an inference-time architectural boundary: it is present whenever tool policies, memory formats, retry/termination logic, and other control variables are implemented in an external runtime and made available to the model only via explicit serialization. 
A core trained primarily by next-token"},{"citing_arxiv_id":"2604.14197","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure","primary_cat":"cs.CL","submitted_at":"2026-04-03T03:06:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02608","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Steerable but Not Decodable: Function Vectors Operate Beyond the Logit Lens","primary_cat":"cs.LG","submitted_at":"2026-04-03T00:54:11+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Function vectors steer LLMs successfully where the logit lens fails to decode the target answer, showing the two properties come apart.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"26] can reliably steer the model when added to its hidden states. For safety-critical deployments, the relevant test is not whether an FV works on the same prompt template from which it was extracted, but whether it transfers to different template formulations of the same task. An adversary need only rephrase a prompt to evade a template-specific steering intervention [23, 30]. Yet prior work evaluates FVs almost exclusively in-distribution [9, 14, 26]. 
A preliminary study [18] evaluated cross-template transfer across 3 tasks and 8 templates on Llama-3.1-8B, finding a Simpson's paradox: the aggregate negative correlation between cosine similarity and transfer accuracy (r=−0."},{"citing_arxiv_id":"2603.22161","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Causal Evidence that Language Models use Confidence to Drive Behavior","primary_cat":"cs.LG","submitted_at":"2026-03-23T16:23:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Language models deploy multidimensional internal confidence representations and threshold-based policies to control abstention behavior, with causal support from activation steering experiments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.09127","ref_index":6,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Collective AI can amplify tiny perturbations into divergent decisions","primary_cat":"cs.AI","submitted_at":"2026-03-10T02:59:11+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Multi-LLM committees amplify small input perturbations into divergent deliberation trajectories and decisions under deterministic conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.04309","ref_index":21,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Activation Steering with a Feedback Controller","primary_cat":"cs.LG","submitted_at":"2025-10-05T18:05:28+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Popular LLM activation steering methods are shown to act as proportional controllers; a PID steering 
framework is proposed that improves robustness and outperforms baselines in experiments across model families.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.19590","ref_index":43,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Position: AI Evaluations Should be Grounded on a Theory of Capability","primary_cat":"cs.AI","submitted_at":"2025-09-23T21:29:04+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"AI evaluations should be reframed as inference tasks grounded in an explicit theory of capability, with an empirical demonstration that results depend on modeling assumptions and a proposed Evaluation Card for transparency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.14913","ref_index":25,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PromptSuite: A Task-Agnostic Framework for Multi-Prompt Generation","primary_cat":"cs.CL","submitted_at":"2025-07-20T10:55:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"PromptSuite is a modular, extensible, task-agnostic framework for automatically generating diverse prompt variations to support robust multi-prompt LLM evaluation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2502.16761","ref_index":8,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Language Model Fine-Tuning on Scaled Survey Data for Predicting Distributions of Public Opinions","primary_cat":"cs.CL","submitted_at":"2025-02-24T00:31:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Fine-tuning LLMs on the SubPOP dataset of 3,362 questions and 70K 
pairs reduces the gap between LLM predictions and human survey responses by up to 46% and generalizes to unseen surveys and subpopulations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.14782","ref_index":104,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Lessons from the Trenches on Reproducible Evaluation of Language Models","primary_cat":"cs.CL","submitted_at":"2024-05-23T16:50:49+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}