
Ethical and social risks of harm from Language Models

Laura Weidinger, John Mellor, Maribeth Rauh, Conor Griffin, Jonathan Uesato, Po-Sen Huang · 2021 · cs.CL · arXiv 2112.04359

Canonical reference. 83% of citing Pith papers cite this work as background.

97 Pith papers citing it



abstract

This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. A wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary expertise and literature from computer science, linguistics, and social sciences. We outline six specific risk areas: I. Discrimination, Exclusion and Toxicity, II. Information Hazards, III. Misinformation Harms, IV. Malicious Uses, V. Human-Computer Interaction Harms, VI. Automation, Access, and Environmental Harms. The first area concerns the perpetuation of stereotypes, unfair discrimination, exclusionary norms, toxic language, and lower performance by social group for LMs. The second focuses on risks from private data leaks or LMs correctly inferring sensitive information. The third addresses risks arising from poor, false or misleading information including in sensitive domains, and knock-on risks such as the erosion of trust in shared information. The fourth considers risks from actors who try to use LMs to cause harm. The fifth focuses on risks specific to LLMs used to underpin conversational agents that interact with human users, including unsafe use, manipulation or deception. The sixth discusses the risk of environmental harm, job automation, and other challenges that may have a disparate effect on different social groups or communities. In total, we review 21 risks in-depth. We discuss the points of origin of different risks and point to potential mitigation approaches. Lastly, we discuss organisational responsibilities in implementing mitigations, and the role of collaboration and participation. We highlight directions for further research, particularly on expanding the toolkit for assessing and evaluating the outlined risks in LMs.

citation-role summary

background 27 · other 2

citation-polarity summary

background 24 · unclear 3 · support 2

representative citing papers

VoxSafeBench: Not Just What Is Said, but Who, How, and Where

cs.SD · 2026-04-16 · unverdicted · novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

LazyAttention: Efficient Retrieval-Augmented Generation with Deferred Positional Encoding

cs.CL · 2026-06-03 · unverdicted · novelty 7.0

LazyAttention kernelizes deferred positional encoding to enable zero-copy, position-agnostic KV cache reuse, delivering 1.37× lower TTFT and 1.40× higher throughput than Block-Attention under skewed document distributions while preserving output quality.

Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models

cs.CR · 2026-05-29 · unverdicted · novelty 7.0

Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

cs.CL · 2026-05-09 · unverdicted · novelty 7.0

BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.

PersonaTeaming: Supporting Persona-Driven Red-Teaming for Generative AI

cs.HC · 2026-05-07 · unverdicted · novelty 7.0 · 2 refs

Persona-driven workflow and interface improve automated and human-AI red-teaming of generative AI by incorporating diverse perspectives into adversarial prompt creation.

Decoding-Time Debiasing via Process Reward Models: From Controlled Fill-in to Open-Ended Generation

cs.CL · 2026-05-04 · unverdicted · novelty 7.0

Decoding-time use of process reward models for bias mitigation raises fairness scores by up to 0.40 on a bilingual benchmark while preserving fluency across four LLMs and extends to open-ended generation with low overhead.

Toward Fair Speech Technologies: A Comprehensive Survey of Bias and Fairness in Speech AI

eess.AS · 2026-05-02 · accept · novelty 7.0

The paper delivers a unified framework for fairness in speech technologies by formalizing seven definitions, organizing research into three paradigms, diagnosing pipeline-specific biases, and mapping mitigations to those sources.

LLM-Assisted Empirical Software Engineering: Systematic Literature Review and Research Agenda

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

A systematic review of 50 studies identifies 69 LLM-assisted tasks in empirical software engineering, concentrated in data processing and analysis with gaps in human-centered integration and reproducibility reporting.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

cs.LG · 2026-04-24 · unverdicted · novelty 7.0

Adaptive multi-agent LLM pipelines with bandit-based sampling achieve lower false positive rates (0.095 vs 0.159) than single-agent models on two behavioral health datasets while maintaining similar false negative rates.

LLM-as-Judge Framework for Evaluating Tone-Induced Hallucination in Vision-Language Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

Ghost-100 benchmark shows prompt tone drives hallucination rates and intensities in VLMs, with non-monotonic peaks at intermediate pressure and task-specific differences that aggregate metrics hide.

IntervenSim: Intervention-Aware Social Network Simulation for Opinion Dynamics

cs.SI · 2026-04-08 · unverdicted · novelty 7.0

IntervenSim is an intervention-aware social network simulation that couples source interventions with crowd interactions in a feedback loop, improving MAPE by 41.6% and DTW by 66.9% over prior static frameworks on real-world events.

Front-End Ethics for Sensor-Fused Health Conversational Agents: An Ethical Design Space for Biometrics

cs.CY · 2026-03-14 · unverdicted · novelty 7.0

This paper proposes a five-dimension ethical design space for front-end biometric translation in sensor-fused health AI agents, including adaptive disclosure as a guardrail against hallucinations and biofeedback loops.

When Models Outthink Their Safety: Unveiling and Mitigating Self-Jailbreak in Large Reasoning Models

cs.AI · 2025-10-24 · unverdicted · novelty 7.0

Large Reasoning Models override their own initial safety recognition during multi-step reasoning in a failure mode called Self-Jailbreak, which Chain-of-Guardrail mitigates through targeted trajectory-level step interventions.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0 · 2 refs

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Grad Detect: Gradient-Based Hallucination Detection in LLMs

cs.LG · 2026-06-23 · unverdicted · novelty 6.0

Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.

AI Adoption Across a Multinational Workforce: Sociotechnical Conditions for GenAI Acceptance in Human Resources

cs.HC · 2026-06-16 · conditional · novelty 6.0

GenAI adoption in HR depends on sociotechnical fit with employee positionalities and trust mechanisms such as source verification and colleague consultation.

Debiasing Without Protected Attributes: Latent Concept Erasure from Textual Profiles

cs.CL · 2026-06-10 · unverdicted · novelty 6.0

H-SAL erases latent concepts from text profiles using self-descriptions as implicit debiasing signals and shows competitive performance on a new multi-domain Stack Exchange helpfulness benchmark.

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

cs.CR · 2026-06-08 · unverdicted · novelty 6.0

HPAA uses typographic manipulations to create text that humans flag as harmful at 86%+ rates while LLM moderation systems detect it below 1% with only three queries.

citing papers explorer

Showing 7 of 7 citing papers after filters.

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

cs.CL · 2026-05-09 · unverdicted · none · ref 2 · internal anchor

BiAxisAudit measures LLM bias on two axes—across-prompt sensitivity via factorial grids and within-response divergence via split coding—revealing that task format explains as much variance as model choice and that 63.6% of bias signals appear in only one layer.

Scaling Data-Constrained Language Models

cs.CL · 2023-05-25 · conditional · none · ref 127 · internal anchor

Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

Ignore Previous Prompt: Attack Techniques For Language Models

cs.CL · 2022-11-17 · unverdicted · none · ref 27 · internal anchor

PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

cs.CL · 2022-08-23 · accept · none · ref 59 · internal anchor

RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

LaMDA: Language Models for Dialog Applications

cs.CL · 2022-01-20 · unverdicted · none · ref 54 · internal anchor

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

TrustLLM: Trustworthiness in Large Language Models

cs.CL · 2024-01-10 · unverdicted · none · ref 177 · internal anchor

TrustLLM defines eight trustworthiness principles, creates a six-dimension benchmark, and evaluates 16 LLMs showing proprietary models generally lead but some open-source ones are close while over-calibration can hurt utility.

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

cs.CL · 2025-03-27 · accept · none · ref 257 · internal anchor

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
