Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang · 2021 · cs.CL · arXiv 2112.04359

Canonical reference. 83% of citing Pith papers cite this work as background.

90 Pith papers citing it

Background 83% of classified citations

abstract

This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. A wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary expertise and literature from computer science, linguistics, and social sciences. We outline six specific risk areas: I. Discrimination, Exclusion and Toxicity, II. Information Hazards, III. Misinformation Harms, IV. Malicious Uses, V. Human-Computer Interaction Harms, VI. Automation, Access, and Environmental Harms. The first area concerns the perpetuation of stereotypes, unfair discrimination, exclusionary norms, toxic language, and lower performance by social group for LMs. The second focuses on risks from private data leaks or LMs correctly inferring sensitive information. The third addresses risks arising from poor, false or misleading information including in sensitive domains, and knock-on risks such as the erosion of trust in shared information. The fourth considers risks from actors who try to use LMs to cause harm. The fifth focuses on risks specific to LLMs used to underpin conversational agents that interact with human users, including unsafe use, manipulation or deception. The sixth discusses the risk of environmental harm, job automation, and other challenges that may have a disparate effect on different social groups or communities. In total, we review 21 risks in-depth. We discuss the points of origin of different risks and point to potential mitigation approaches. Lastly, we discuss organisational responsibilities in implementing mitigations, and the role of collaboration and participation. We highlight directions for further research, particularly on expanding the toolkit for assessing and evaluating the outlined risks in LMs.

citation-role summary

background 27 · other 2

citation-polarity summary

background 24 · unclear 3 · support 2

representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

cs.CR · 2026-05-29 · unverdicted · novelty 7.0

Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.

Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI

eess.AS · 2026-05-02 · accept · novelty 7.0

The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to those sources.

LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

Adaptive multi-agent LLM pipelines with bandit-based sampling achieve lower false positive rates (0.095 vs 0.159) than single-agent models on two behavioral health datasets while maintaining similar false negative rates.

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics

cs.SI · 2026-04-08 · unverdicted · novelty 7.0

IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on real-world events.

Front-End Ethics for Sensor-Fused Health Conversational Agents: An Ethical Design Space for Biometrics

cs.CY · 2026-03-14 · unverdicted · novelty 7.0

This paper proposes a five-dimension ethical design space for front-end biometric translation in sensor-fused health AI agents, including adaptive disclosure as a guardrail against hallucinations and biofeedback loops.

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · novelty 7.0

Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0 · 2 refs

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Grad Detect: Gradient-Based Hallucination Detection in LLMs

cs.LG · 2026-06-23 · unverdicted · novelty 6.0

Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.

When Should an AI Scientist Stop? Verifiable Experiment Steering and Refusal for Autonomous Discovery

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

CARTOGRAPH integrates unresolved-subspace steering, ambiguity closure, and residual-based refusal under a local linear-Gaussian model, outperforming baselines on testbeds and correctly flagging inconclusive claims in a retrospective A-Lab audit.

When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries

cs.CY · 2026-05-26 · unverdicted · novelty 6.0

MedHarm benchmark shows aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating medical safety requires domain-specific testing.

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

cs.CL · 2026-05-17 · unverdicted · novelty 6.0

Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.

citing papers explorer

Showing 50 of 50 citing papers after filters.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where cs.SD · 2026-04-16 · unverdicted · none · ref 38 · internal anchor
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding cs.CL · 2026-06-03 · unverdicted · none · ref 71 · internal anchor
LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models cs.CR · 2026-05-29 · unverdicted · none · ref 2 · internal anchor
Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization cs.AI · 2026-05-28 · unverdicted · none · ref 3 · internal anchor
A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.
Measuring Safety Alignment Effects in Autonomous Security Agents cs.CR · 2026-05-19 · conditional · none · ref 9 · internal anchor
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching cs.CL · 2026-05-12 · unverdicted · none · ref 129 · 2 links · internal anchor
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence cs.CL · 2026-05-09 · unverdicted · none · ref 2 · internal anchor
BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.
PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI cs.HC · 2026-05-07 · unverdicted · none · ref 74 · 2 links · internal anchor
Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.
Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation cs.CL · 2026-05-04 · unverdicted · none · ref 34 · internal anchor
Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.
Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI eess.AS · 2026-05-02 · accept · none · ref 107 · internal anchor
The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to those sources.
LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda cs.SE · 2026-04-29 · unverdicted · none · ref 39 · internal anchor
A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 133 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems cs.LG · 2026-04-24 · unverdicted · none · ref 7 · internal anchor
Adaptive multi-agent LLM pipelines with bandit-based sampling achieve lower false positive rates (0.095 vs 0.159) than single-agent models on two behavioral health datasets while maintaining similar false negative rates.
LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models cs.CV · 2026-04-20 · unverdicted · none · ref 33 · internal anchor
Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.
IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics cs.SI · 2026-04-08 · unverdicted · none · ref 93 · internal anchor
IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on real-world events.
Front-End Ethics for Sensor-Fused Health Conversational Agents: An Ethical Design Space for Biometrics cs.CY · 2026-03-14 · unverdicted · none · ref 31 · internal anchor
This paper proposes a five-dimension ethical design space for front-end biometric translation in sensor-fused health AI agents, including adaptive disclosure as a guardrail against hallucinations and biofeedback loops.
Grad Detect: Gradient-Based Hallucination Detection in LLMs cs.LG · 2026-06-23 · unverdicted · none · ref 44 · internal anchor
Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.
When Should an AI Scientist Stop? Verifiable Experiment Steering and Refusal for Autonomous Discovery cs.LG · 2026-05-26 · unverdicted · none · ref 41 · internal anchor
CARTOGRAPH integrates unresolved-subspace steering, ambiguity closure, and residual-based refusal under a local linear-Gaussian model, outperforming baselines on testbeds and correctly flagging inconclusive claims in a retrospective A-Lab audit.
When Medical Safety Alignment Fails: A Benchmark for Evaluating LLMs on High-Risk Medical Queries cs.CY · 2026-05-26 · unverdicted · none · ref 42 · internal anchor
MedHarm benchmark shows aligned LLMs and guardrails can still produce unsafe responses on high-risk medical queries, indicating medical safety requires domain-specific testing.
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment cs.CL · 2026-05-17 · unverdicted · none · ref 99 · internal anchor
Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on RewardBench and downstream LLM evaluations.
Overtrained, Not Misaligned cs.LG · 2026-05-12 · unverdicted · none · ref 36 · internal anchor
Emergent misalignment arises from overtraining after primary task convergence and is preventable by early stopping, which retains 93% of task performance on average.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks cs.AI · 2026-05-11 · unverdicted · none · ref 31 · internal anchor
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Ethics Testing: Proactive Identification of Generative AI System Harms cs.SE · 2026-04-23 · unverdicted · none · ref 72 · internal anchor
Ethics testing is introduced as a systematic approach to generate tests that identify software harms induced by unethical behavior in generative AI outputs.
Transient Turn Injection: Exposing Stateless Multi-Turn Vulnerabilities in Large Language Models cs.CR · 2026-04-23 · unverdicted · none · ref 23 · internal anchor
Transient Turn Injection is a new attack that evades LLM moderation by spreading harmful intent over multiple isolated turns using automated agents.
AlignCultura: Towards Culturally Aligned Large Language Models? cs.CL · 2026-04-21 · unverdicted · none · ref 170 · internal anchor
Align-Cultura introduces the CULTURAX dataset and shows that culturally fine-tuned LLMs improve joint HHH scores by 4-6%, cut cultural failures by 18%, and gain 10-12% efficiency with minimal leakage.
MANTA: Multi-turn Assessment for Nonhuman Thinking & Alignment cs.CY · 2026-04-18 · unverdicted · none · ref 9 · internal anchor
MANTA is a new multi-turn dynamic benchmark that stress-tests frontier LLMs on animal welfare alignment by generating targeted adversarial follow-ups and scoring across 13 dimensions, with preliminary results showing variance in later turns and format bias in LLM judges.
The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems cs.CR · 2026-04-13 · unverdicted · none · ref 21 · internal anchor
Salami Attack chains low-risk inputs to cumulatively trigger high-risk LLM behaviors, achieving over 90% success on GPT-4o and Gemini while resisting some defenses.
Safety, Security, and Cognitive Risks in State-Space Models: A Systematic Threat Analysis with Spectral, Stateful, and Capacity Attacks cs.CR · 2026-04-04 · unverdicted · none · ref 44 · internal anchor
State-space models are vulnerable to three new attack types that corrupt state integrity, with experiments showing up to 156x output changes and 6x higher targeted corruption than random inputs.
Red-Teaming Vision-Language-Action Models via Quality Diversity Prompt Generation for Robust Robot Policies cs.RO · 2026-03-12 · unverdicted · none · ref 26 · internal anchor
Q-DIG applies quality diversity optimization with vision-language models to generate diverse adversarial instructions that reveal VLA robot failures and enable robustness improvements via fine-tuning.
Forget-It-All: Multi-Concept Machine Unlearning via Concept-Aware Neuron Masking cs.CV · 2026-01-07 · unverdicted · none · ref 13 · internal anchor
FIA uses contrastive concept saliency and temporal-spatial neuron identification to build unified masks that erase multiple target concepts while preserving general generation quality in diffusion models.
AI Native Games: A Survey and Roadmap cs.AI · 2026-07-01 · unverdicted · none · ref 82 · internal anchor
The paper proposes a counterfactual definition of AI-native games, screens 53 examples, introduces a G/N taxonomy, and outlines a research roadmap for the field.
A Lifecycle and Application-Stack Survey of Large Language Model Vulnerabilities: Attacks, Risks, Defenses, and Open Problems cs.CR · 2026-06-30 · unverdicted · none · ref 27 · internal anchor
The paper provides a lifecycle-based systematization of LLM vulnerabilities across data collection, pretraining, alignment, packaging, retrieval, prompting, tool execution, and deployment, mapping them to security objectives and identifying open problems.
Multilingual jailbreaking of LLMs using low-resource languages cs.CL · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
Multi-turn prompts in Afrikaans, Kiswahili, isiXhosa and isiZulu achieve 52-83% harmful response rates across GPT, Claude, Gemini and others, rising further with native-speaker red-teaming, showing translation quality limits jailbreak success.
Can We Trust AI-Inferred User States. A Psychometric Framework for Validating the Reliability of Users States Classification by LLMs in Operational Environments cs.AI · 2026-05-15 · unverdicted · none · ref 13 · internal anchor
Empirical replication across three LLMs shows only 31 of 213 user-state metrics meet reliability criteria for individual scores, supporting a validation framework for responsible AI in adaptive environments.
Mechanism Plausibility in Generative Agent-Based Modeling cs.MA · 2026-05-12 · unverdicted · none · ref 81 · 2 links · internal anchor
Introduces the Mechanism Plausibility Scale, a four-level framework separating generative sufficiency from mechanistic plausibility in LLM-based agent-based models.
Beyond Inefficiency: Systemic Costs of Incivility in Multi-Agent Monte Carlo Simulations cs.AI · 2026-05-12 · unverdicted · none · ref 22 · internal anchor
Monte Carlo simulations of LLM agents confirm that toxic debates take 25% longer to converge, with larger delays in smaller models, and show a first-mover advantage independent of toxicity.
Quantifying and Predicting Disagreement in Graded Human Ratings cs.CL · 2026-05-01 · unverdicted · none · ref 234 · internal anchor
Annotation disagreement on toxic language can be moderately predicted from textual features, with high-opposition items proving harder for models to estimate accurately.
Representational Harms in LLM-Generated Narratives Against Global Majority Nationalities cs.CL · 2026-04-24 · unverdicted · none · ref 82 · internal anchor
LLMs generate narratives containing persistent stereotypes, erasure, and one-dimensional portrayals of Global Majority national identities, with minoritized groups overrepresented in subordinated roles by more than fifty times compared to dominant portrayals.
BodhiPromptShield: Pre-Inference Prompt Mediation for Suppressing Privacy Propagation in LLM/VLM Agents cs.CR · 2026-04-07 · unverdicted · none · ref 18 · internal anchor
BodhiPromptShield reduces stage-wise privacy propagation in LLM/VLM agents from 10.7% to 7.1% on the Controlled Prompt-Privacy Benchmark by mediating sensitive spans before inference and restoring only at authorized boundaries.
Sociodemographic Biases in Educational Counselling by Large Language Models cs.CY · 2026-04-03 · unverdicted · none · ref 22 · internal anchor
LLMs show sociodemographic biases in educational counseling that are amplified by vague student descriptions and substantially reduced by concrete individualized details.
The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure cs.CL · 2026-04-03 · accept · none · ref 61 · internal anchor
PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.
Measuring the metacognition of AI cs.AI · 2026-03-31 · unverdicted · none · ref 14 · internal anchor
Meta-d' and signal detection theory provide quantitative tools to assess metacognitive sensitivity and risk-based regulation in large language models.
Anthropomorphism and Trust in Human-Large Language Model interactions cs.HC · 2026-03-01 · conditional · none · ref 4 · internal anchor
Warmth and cognitive empathy in LLMs drive higher anthropomorphism, trust, and relational closeness, especially on personal topics, while competence affects usefulness but not perceived human-likeness.
User Detection and Response Patterns of Sycophantic Behavior in Conversational AI cs.HC · 2026-01-15 · unverdicted · none · ref 47 · internal anchor
Reddit analysis shows users detect AI sycophancy through comparisons and consistency checks, apply mitigation prompts, and sometimes seek affirmative responses for support, indicating context-aware design is better than total elimination.
Beyond Post-hoc Explanation: Toward Glassbox AI via Probabilistic Mediation cs.AI · 2026-06-05 · unverdicted · none · ref 18 · internal anchor
The paper proposes the Glassbox Framework in which Bayesian networks serve as transparent ante-hoc mediation layers for generative models to enable auditable reasoning traces and contestable outputs.
Food Noise & False Safety: A Systematic Evaluation of How LLMs Fail to Adapt to Eating Disorder Queries with Clinician Feedback cs.AI · 2026-06-01 · unverdicted · none · ref 93 · internal anchor
Systematic evaluation shows LLMs frequently give unsafe responses to eating disorder prompts when linguistic cues signal risk, as measured by varying prompt danger levels with clinician feedback.
Preserving Decision Sovereignty in Military AI: A Trade-Secret-Safe Architectural Framework for Model Replaceability, Human Authority, and State Control cs.CY · 2026-03-26 · unverdicted · none · ref 42 · internal anchor
A trade-secret-safe layered architecture is specified to preserve decision sovereignty in military AI by making supplier models replaceable components under state-owned orchestration of policy, audit, and authorization.
Creating and Evaluating K-12 GenAI Assessment Graders Through Context Engineering cs.CY · 2026-05-08 · unverdicted · none · ref 85 · internal anchor
LLM graders achieve substantial human agreement on math and science MCAS items but vary on ELA, performing best as sources of formative narrative feedback rather than summative numerical scores.
AI Trust OS -- A Continuous Governance Framework for Autonomous AI Observability and Zero-Trust Compliance in Enterprise Environments cs.AI · 2026-04-06 · unverdicted · none · ref 10 · internal anchor
AI Trust OS is a proposed always-on operating layer that discovers undocumented AI systems via telemetry and produces continuous zero-trust compliance artifacts for regulations including ISO 42001, EU AI Act, SOC 2, GDPR, and HIPAA.
Human-Guided Harm Recovery for Computer Use Agents cs.AI · 2026-04-20 · unreviewed · ref 5 · internal anchor
