{"total":82,"items":[{"citing_arxiv_id":"2606.24790","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Grad Detect: Gradient-Based Hallucination Detection in LLMs","primary_cat":"cs.LG","submitted_at":"2026-06-23T16:46:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.29396","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization","primary_cat":"cs.AI","submitted_at":"2026-05-28T05:46:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.07576","ref_index":41,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Should an AI Scientist Stop? 
Verifiable Experiment Steering and Refusal for Autonomous Discovery","primary_cat":"cs.LG","submitted_at":"2026-05-26T18:19:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CARTOGRAPH integrates unresolved-subspace steering, ambiguity closure, and residual-based refusal under a local linear-Gaussian model, outperforming baselines on testbeds and correctly flagging inconclusive claims in a retrospective A-Lab audit.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19722","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring Safety Alignment Effects in Autonomous Security Agents","primary_cat":"cs.CR","submitted_at":"2026-05-19T11:55:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18239","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Multilingual jailbreaking of LLMs using low-resource languages","primary_cat":"cs.CL","submitted_at":"2026-05-18T11:33:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality limits jailbreak 
success.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.17342","ref_index":99,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment","primary_cat":"cs.CL","submitted_at":"2026-05-17T09:27:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15734","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Can We Trust AI-Inferred User States. 
A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments","primary_cat":"cs.AI","submitted_at":"2026-05-15T08:43:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Empirical replication across three LLMs shows only 31 of 213 user-state metrics meet reliability criteria for individual scores, supporting a validation framework for responsible AI in adaptive environments.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12824","ref_index":81,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Mechanism Plausibility in Generative Agent-Based Modeling","primary_cat":"cs.MA","submitted_at":"2026-05-12T23:46:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces the Mechanism Plausibility Scale, a four-level framework separating generative sufficiency from mechanistic plausibility in LLM-based agent-based models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"with LLMs may disregard the relationship between researcher and subject existing in prior human subject research. When an LLM generates text that resembles survey responses or social behavior, it is not directly from the experience of a live, present individual. Furthermore, they identify the problem of \"value lock-in\", also referenced by Weidinger et al. [81]. LLMs encode the norms and attitudes present in their training data at a particular point in time. Related empirical work supports this; language models exhibit degraded performance in time periods not represented in their training corpus [46]. 
5.4 Historical Issues and Harms of Poor ABM specification While the Mechanism Plausibility Scale was motivated by recent challenges posed by LLM-ABM, the ideas are"},{"citing_arxiv_id":"2605.12288","ref_index":129,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching","primary_cat":"cs.CL","submitted_at":"2026-05-12T15:44:33+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12199","ref_index":36,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Overtrained, Not Misaligned","primary_cat":"cs.LG","submitted_at":"2026-05-12T14:37:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11789","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations","primary_cat":"cs.AI","submitted_at":"2026-05-12T08:54:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Monte Carlo simulations of LLM agents confirm that toxic debates take 25% longer to converge, with larger delays in smaller models, and show a first-mover advantage independent of 
toxicity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.10639","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks","primary_cat":"cs.AI","submitted_at":"2026-05-11T14:27:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"alignment methods consistently improve safety across varying tasks, domains, and model architectures, or if their efficacy remains strictly context-dependent. As prior work has shown, both alignment quality and observed toxicity depend on many interacting factors, meaning that agreement across benchmarks does not necessarily guarantee overall safety [31]. Moreover, the inherent black-box nature of LLMs introduces methodological uncertainty. For instance, our data augmentation processes, particularly domain shifting, may inadvertently introduce linguistic biases that influence the final results. This challenge is not unique
to LLM-based methodologies; similar biases frequently arise in human annota-"},{"citing_arxiv_id":"2605.09041","ref_index":2,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence","primary_cat":"cs.CL","submitted_at":"2026-05-09T16:26:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"pattern that single-scalar audits cannot detect. I. INTRODUCTION Large language models (LLMs) increasingly mediate high- stakes decisions in hiring, lending, healthcare triage, and con- tent moderation [1]. Biased outputs in such settings can cause direct harm, including disparate impacts in hiring, lending, criminal justice, content moderation, and search or retrieval systems [2]-[13]. A separate line of robustness work shows that modern ML and vision-language systems can be highly ‡Corresponding Author. 
[Figure: panels Task (η²=0.395, Δ=59.1%) and Sentiment (η²=0.014, Δ=28.4%); axis ticks omitted]"},{"citing_arxiv_id":"2605.05682","ref_index":74,"ref_count":2,"confidence":0.98,"is_internal_anchor":true,"paper_title":"PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI","primary_cat":"cs.HC","submitted_at":"2026-05-07T05:19:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"tured feedback on visual designs using a crowd of non-experts. In Proceedings of the 17th ACM conference on Computer supported cooperative work & social computing. 1433-1444. [73] Jiahao Yu, Xingwei Lin, Zheng Yu, and Xinyu Xing. 2023. Gptfuzzer: Red teaming large language models with auto-generated jailbreak prompts. arXiv preprint arXiv:2309.10253 (2023). [74] J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang. 2023. Why Johnny Can't Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI '23). Association for Computing Machinery, New York, NY, USA, Article 437, 21 pages.
doi:10."},{"citing_arxiv_id":"2605.02348","ref_index":34,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation","primary_cat":"cs.CL","submitted_at":"2026-05-04T08:51:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01597","ref_index":107,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI","primary_cat":"eess.AS","submitted_at":"2026-05-02T20:11:12+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to those sources.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Bird, \"Decolonising speech and language technology,\" inProceedings of the 28th international conference on computational linguistics, 2020. [106] R. Shelby, S. Juarez, E. Smart, and A. Srikant, \"Sociotechnical harms of algorithmic systems: Scoping a taxonomy for harm reduction,\" in Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency (FAccT), 2023, pp. 723-741. [107] L. Weidingeret al., \"Ethical and social risks of harm from language models,\"arXiv preprint arXiv:2112.04359, 2021. [108] H. 
Suresh and J. Guttag, \"A framework for understanding sources of harm throughout the machine learning life cycle,\"ACM Transactions on Intelligent Systems and Technology (TIST), 2021. [109] D. K. Mulliganet al., \"This thing called fairness: Disciplinary confusion"},{"citing_arxiv_id":"2605.01168","ref_index":234,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Quantifying and Predicting Disagreement in Graded Human Ratings","primary_cat":"cs.CL","submitted_at":"2026-05-01T23:56:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.26192","ref_index":39,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda","primary_cat":"cs.SE","submitted_at":"2026-04-29T00:34:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"is not yet stable or where relevant information is embedded in methodological sections rather than explicitly stated in titles, abstracts, or keywords [ 18, 25, 40]. 
In such contexts, alternative or complementary strategies, including snowballing and curated source selection based on high-quality venues, have been recommended to improve the coverage and relevance of identified studies [39]. Following this rationale, we restricted our corpus to well-established venues that are known to publish rigorous empirical software engineering research. This restriction made the analysis feasible, while ensuring that the collected studies are both relevant and of high quality. Venue selection was guided by objective quality and scope criteria: we included well-established, peer-reviewed"},{"citing_arxiv_id":"2604.23338","ref_index":133,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework","primary_cat":"cs.CR","submitted_at":"2026-04-25T14:57:15+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"order-of-magnitude estimate, as ecological validity depends on the variance of real deployment inputs, which may differ substantially from the simulated distribution. D. Regulatory Landscape and Governance Gaps The foundational question of what it means for AI systems to be aligned with human values [131], [132] underpins all governance discussions; social risk taxonomies for language models [133] provide a precursor framework that, however, does not anticipate agentic autonomy. 
The EU AI Act [134], the US Executive Order on AI [135], and NIST AI RMF [136] address AI governance but were finalized before the widespread deployment of agentic systems with persistent memory and autonomous tool use. Industry-led governance frameworks such as Anthropic's Responsible Scaling Policy [137]"},{"citing_arxiv_id":"2604.22749","ref_index":82,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities","primary_cat":"cs.CL","submitted_at":"2026-04-24T17:49:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs generate narratives containing persistent stereotypes, erasure, and one-dimensional portrayals of Global Majority national identities, with minoritized groups overrepresented in subordinated roles by more than fifty times compared to dominant portrayals.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":", Argentina, Brazil). Global Minority countries also vary more widely in their representation, as demonstrated by their higher information content compared to clusters where most Global Majority countries appeared. 1China is listed as a member on the G77 website, but its government considers itself more of a financial supporter and political partner [82]. Nationality Bias in LLM-Generated Narratives FAccT '26, June 25-28, 2026, Montreal, QC, Canada Meanwhile, characters from the United States are more uniformly portrayed in neutral, dominant, and sub- ordinated roles. 
Study 1 shows how \"American\" characters are primarily framed as benevolent (e.g., \"patient\", \"ceaseless\", \"ignoring fear\") and paternalistic (e."},{"citing_arxiv_id":"2604.22154","ref_index":7,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems","primary_cat":"cs.LG","submitted_at":"2026-04-24T01:52:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Adaptive multi-agent LLM pipelines with bandit-based sampling achieve lower false positive rates (0.095 vs 0.159) than single-agent models on two behavioral health datasets while maintaining similar false negative rates.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22089","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Ethics Testing: Proactive Identification of Generative AI System Harms","primary_cat":"cs.SE","submitted_at":"2026-04-23T21:41:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"systems [8]) Trustworthy AI with formal methods [75] Discuss trustworthy AI from the perspectives of trustworthy computing, formal methods, and AI Safety, robustness, privacy, and fairness Formulate trustworthy AI by encoding properties via formal verification AI-driven software Study of GenAI (e.g., Chat- GPT [56, 61, 72]) Discuss potential risks [72] and bias [56] Various risks and bias Literature reviewand study AI-driven software AI ethicsguidelines[22, 37] Discuss potential risks [ 72], bias [ 56], and 
provide ethics guide- lines (with up to 11 clusters of principles) 11 clusters of principles include transparency, justice and fairness, non- maleficence, responsibility, privacy, beneficence, free-"},{"citing_arxiv_id":"2604.21860","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models","primary_cat":"cs.CR","submitted_at":"2026-04-23T16:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19016","ref_index":170,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AlignCultura: Towards Culturally Aligned Large Language Models?","primary_cat":"cs.CL","submitted_at":"2026-04-21T03:06:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18847","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Human-Guided Harm Recovery for Computer Use 
Agents","primary_cat":"cs.AI","submitted_at":"2026-04-20T21:12:40+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.18803","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T20:21:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[32] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. 2024. Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution. arXiv:2409.12191 [cs.CV] https://arxiv.org/abs/2409.12191 [33] Laura Weidinger et al. 2021. Ethical and Social Risks of Harm from Language Models. arXiv:2112.04359 [cs.CL] https://arxiv.org/abs/2112.04359 [34] Samyak Gupta Yangsibo Huang et al. 2023. Catastrophic Jailbreak of Open-source LLMs via Exploiting Generation. 
arXiv:2310.06987 [cs.CL] https://doi.org/10.48550/arXiv.2310.06987 [35] Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong"},{"citing_arxiv_id":"2605.16301","ref_index":9,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment","primary_cat":"cs.CY","submitted_at":"2026-04-18T19:51:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14548","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Cheng Chen, Guanduo Chen, et al. Kimi k2. 5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276, 2026. [37] Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang, Myra Cheng, Mia Glaese, Borja Balle, Atoosa Kasirzadeh, et al. Ethical and social risks of harm from language models.arXiv preprint arXiv:2112.04359, 2021. 
[38] Kathleen C Fraser and Svetlana Kiritchenko. Examining gender and racial bias in large vision- language models using a novel dataset of parallel images. InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 690-713, 2024. [39] Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel Bowman."},{"citing_arxiv_id":"2604.11309","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems","primary_cat":"cs.CR","submitted_at":"2026-04-13T11:12:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"evolved, automated methods emerged, leveraging techniques like genetic algorithms [16], gradient-based optimization [5], [6], and LLM-generated adversarial prompts [19], [20]. The success of these techniques, however, raised serious concerns about the generation of illegal and unethical content, prompt- ing a surge of interest in both vulnerability identification research and risk mitigation strategies [21]. Multi-turn Jailbreak on LLMsCrescendostands as a seminal contribution to multi-turn attack methodologies [9]. 
Subsequent works expanded the field:RACEuses reasoning- augmented conversations to reframe harmful queries as benign tasks [22];Tempestmodels safety erosion via tree search for parallel conversation paths [23];X-Teamingemploys col- laborative agents for cross-model optimization [24];CFA"},{"citing_arxiv_id":"2604.06600","ref_index":93,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics","primary_cat":"cs.SI","submitted_at":"2026-04-08T02:34:58+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on real-world events.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Junqing Yu, and Wei Yang. 2025. 𝐺𝐴−𝑆 3: Comprehensive Social Network Simulation with Group Agents. InFindings of the Association for Computational Linguistics: ACL 2025, Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar (Eds.). Association for Computational Linguistics, Vienna, Austria, 8950-8970. doi:10.18653/v1/2025.findings-acl.468 [93] Yunyao Zhang, Xinglang Zhang, Junxi Sheng, Wenbing Li, Junqing Yu, Wei Yang, and Zikai Song. 2025. From Ambiguity to Verdict: A Semiotic-Grounded Multi- Perspective Agent for LLM Logical Reasoning.arXiv preprint arXiv:2509.24765 (2025). 
[94] Zeyu Zhang, Jianxun Lian, Chen Ma, Yaning Qu, Ye Luo, Lei Wang, Rui Li, Xu Chen, Yankai Lin, Le Wu, Xing Xie, and Ji-Rong Wen."},{"citing_arxiv_id":"2604.05793","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents","primary_cat":"cs.CR","submitted_at":"2026-04-07T12:29:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BodhiPromptShield reduces stage-wise privacy propagation in LLM/VLM agents from 10.7% to 7.1% on the Controlled Prompt-Privacy Benchmark by mediating sensitive spans before inference and restoring only at authorized boundaries.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04749","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"AI Trust OS -- A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments","primary_cat":"cs.AI","submitted_at":"2026-04-06T15:14:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AI Trust OS is a proposed always-on operating layer that discovers undocumented AI systems via telemetry and produces continuous zero-trust compliance artifacts for regulations including ISO 42001, EU AI Act, SOC 2, GDPR, and HIPAA.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"them to an auditor as proof of control effectiveness [9]. 
That assumption breaks down entirely when the systems being governed are AI pipelines that span five vendors, process sensitive data through embedding models, retrieve context from vector databases, generate outputs through foundation models, and log everything to observability platforms - all within a single user request [10]. The resultant governance deficit is not solely operational in nature; it is simultaneously commercial and regulatory. Enterprise procurement functions increasingly require real-time, empirical demonstration of AI governance maturity as a prerequisite for completing vendor assessments [11]. Regulatory instruments, most notably the EU AI Act, have introduced legally binding"},{"citing_arxiv_id":"2604.16424","ref_index":44,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks","primary_cat":"cs.CR","submitted_at":"2026-04-04T13:08:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"RNN formal verification via star reachability [39] and NNV tooling is established for simple recurrent networks. SSM-specific verification leveraging transfer-function structure or SSD matrix representations remains an open problem; we contribute Proposition 3.1 as a step toward spectral certification. AI Safety and Governance. Risk taxonomies for language models [44, 3] and surveys of hallucination [23] and calibration [18, 30] establish the cognitive risk landscape.
Our cognitive risk analysis (Section 6) connects these general phenomena to SSM-specific architectural properties: state compression, throughput scaling, and stateful deployment dynamics."},{"citing_arxiv_id":"2604.25932","ref_index":22,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Sociodemographic Biases in Educational Counselling by Large Language Models","primary_cat":"cs.CY","submitted_at":"2026-04-03T14:08:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"LLMs show sociodemographic biases in educational counseling that are amplified by vague student descriptions and substantially reduced by concrete individualized details.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"be placed in demanding programmes at comparable achievement levels [3,21]. Ethnic stereotypes can also work in a nominally positive direction, for example, teachers tend to have higher expectations of Asian students in mathematics [19]. Sociodemographic Bias in LLMs. LLMs can exhibit biases originating from training data that manifest as stereotypical associations, uneven performance, or differential toxicity [4,22,5]. Within this broader landscape, sociodemographic bias has received sustained attention. Gallegos et al. [10] and Gupta et al. [14] catalogue evaluation metrics and de-biasing techniques but also highlight persistent gaps, particularly the limited study of bias in applied, high-stakes settings. 
Empirical work confirms that these biases are measurable."},{"citing_arxiv_id":"2604.14197","ref_index":61,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure","primary_cat":"cs.CL","submitted_at":"2026-04-03T03:06:03+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.29693","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Measuring the metacognition of AI","primary_cat":"cs.AI","submitted_at":"2026-03-31T12:48:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Meta-d' and signal detection theory provide quantitative tools to assess metacognitive sensitivity and risk-based regulation in large language models.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":", Outeiral, C., Hie, B.: Generative artificial intelligence for de novo protein design. arXiv preprint arXiv:2310.09685 (2023) [12] OpenAI: Introducing ChatGPT. https://openai.com/blog/chatgpt. Accessed: 9 September 2026 (2022) [13] Topics, E.: 40+ Chatbot Statistics (2025). https://explodingtopics.com/blog/chatbot-statistics. Accessed 9 September 2025 (2025) [14] Weidinger, L., Mellor, J., Rauh, M., Griffin, C., Uesato, J., Huang, P.-S., Cheng, M., Glaese, M., Balle, B., Kasirzadeh, A., et al.: Ethical and social risks of harm from language models. 
arXiv preprint arXiv:2112.04359 (2021) [15] Rillig, M.C., Ågerstrand, M., Bi, M., Gould, K.A., Sauerland, U.: Risks and benefits of large language models for the environment."},{"citing_arxiv_id":"2604.20867","ref_index":42,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Preserving Decision Sovereignty in Military AI: A Trade-Secret-Safe Architectural Framework for Model Replaceability, Human Authority, and State Control","primary_cat":"cs.CY","submitted_at":"2026-03-26T04:52:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A trade-secret-safe layered architecture is specified to preserve decision sovereignty in military AI by making supplier models replaceable components under state-owned orchestration of policy, audit, and authorization.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.06203","ref_index":31,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Front-End Ethics for Sensor-Fused Health Conversational Agents: An Ethical Design Space for Biometrics","primary_cat":"cs.CY","submitted_at":"2026-03-14T03:31:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"This paper proposes a five-dimension ethical design space for front-end biometric translation in sensor-fused health AI agents, including adaptive disclosure as a guardrail against hallucinations and biofeedback loops.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.12510","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot 
Policies","primary_cat":"cs.RO","submitted_at":"2026-03-12T22:58:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15316","ref_index":4,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Anthropomorphism and Trust in Human-Large Language Model interactions","primary_cat":"cs.HC","submitted_at":"2026-03-01T21:55:58+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Warmth and cognitive empathy in LLMs drive higher anthropomorphism, trust, and relational closeness, especially on personal topics, while competence affects usefulness but not perceived human-likeness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.10467","ref_index":47,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"User Detection and Response Patterns of Sycophantic Behavior in Conversational AI","primary_cat":"cs.HC","submitted_at":"2026-01-15T14:51:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Reddit analysis shows users detect AI sycophancy through comparisons and consistency checks, apply mitigation prompts, and sometimes seek affirmative responses for support, indicating context-aware design is better than total 
elimination.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.06163","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking","primary_cat":"cs.CV","submitted_at":"2026-01-07T00:13:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FIA uses contrastive concept saliency and temporal-spatial neuron identification to build unified masks that erase multiple target concepts while preserving general generation quality in diffusion models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.21110","ref_index":14,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Beyond Context: Large Language Models' Failure to Grasp Users' Intent","primary_cat":"cs.AI","submitted_at":"2025-12-24T11:15:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.","context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"Through controlled experiments, we demonstrate how benign-looking prompts can reliably circumvent safety mechanisms across diverse application domains, from mental health support systems to content moderation platforms [11], [12]. The significance of this study extends beyond academic interest, revealing immediate concerns for AI deployment [13], [14]. As LLMs become increasingly integrated into sensitive applications, understanding and addressing these fundamental limitations becomes essential for ensuring safe and reliable AI systems [15], [16]. 
Our findings suggest that technical safeguards alone, without addressing the core contextual reasoning deficit, will remain insufficient protection against determined"},{"citing_arxiv_id":"2512.20677","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Learning-Based Automated Adversarial Red-Teaming for Robustness Evaluation of Large Language Models","primary_cat":"cs.CR","submitted_at":"2025-12-21T19:12:44+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A meta-prompt and hierarchical detection framework automates LLM red-teaming, achieving 3.9 times higher vulnerability discovery rate than manual methods with 89% accuracy on GPT-OSS-20B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.05929","ref_index":19,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"LLM Harms: A Taxonomy and Discussion","primary_cat":"cs.CY","submitted_at":"2025-12-05T18:12:21+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"support","context_text":"Foundational syntheses map LLM risks across discrimination/toxicity, information hazards, misuse, HCI harms, and environmental/socioeconomic impacts (Weidinger et al.). Complementary social-computing work, e.g., Blodgett et al.'s critical survey and Sap et al.'s Social Bias Frames, connects technical metrics to normative concerns and the pragmatics of implied bias. These frameworks inform the harm taxonomy adopted in this paper [19]. Instruction-following and alignment moved from supervised instruction tuning to reinforcement learning from human feedback (RLHF) and then to constitutional/self-training approaches. 
InstructGPT established the RLHF recipe; Anthropic's Constitutional AI reduces reliance on per-example human labels by using a transparent \"constitution\" of principles; Direct Preference Optimization (DPO) simplifies preference learning without explicit"},{"citing_arxiv_id":"2510.21285","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models","primary_cat":"cs.AI","submitted_at":"2025-10-24T09:32:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.16853","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Agentic Inequality","primary_cat":"cs.CY","submitted_at":"2025-10-19T14:32:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Introduces the concept of agentic inequality and develops a three-dimensional framework (availability, quality, quantity) to analyze how autonomous AI agents could deepen or mitigate existing divides through scalable goal delegation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.06989","ref_index":30,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Human-aligned AI Model Cards with Weighted Hierarchy 
Architecture","primary_cat":"cs.SE","submitted_at":"2025-10-08T13:13:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Introduces CRAI-MCF, an eight-module framework distilling 217 parameters from 240 projects into a quantitative sufficiency criterion for cross-model LLM comparison grounded in Value Sensitive Design.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.06701","ref_index":56,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks","primary_cat":"cs.LG","submitted_at":"2025-09-08T13:55:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.21460","ref_index":257,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Large Language Model Agent: A Survey on Methodology, Applications and Challenges","primary_cat":"cs.CL","submitted_at":"2025-03-27T12:50:17+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"often lack direct awareness of the training data sources. 
This opacity increases the risk of unintended consequences, as individuals may unknowingly rely on models trained on controversial datasets, potentially resulting in reputational harm or even legal repercussions. Others. Some ethical concerns in the use of LLM agents, such as privacy [243], [257], [258], data manipulation [259], and misinformation [244], [260], are so critical that we provide a thorough discussion in Sections 4.1, 4.2 and 4.3. Beyond these, additional ethical concerns remain. One major issue is that LLM agents lack true semantic and contextual understanding, relying purely on statistical word associations. This limitation"}],"limit":50,"offset":0}