pith. sign in

hub Canonical reference

Ethical and social risks of harm from Language Models

Canonical reference. 83% of citing Pith papers cite this work as background.

88 Pith papers citing it
Background 83% of classified citations
abstract

This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. A wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary expertise and literature from computer science, linguistics, and social sciences. We outline six specific risk areas: I. Discrimination, Exclusion and Toxicity, II. Information Hazards, III. Misinformation Harms, V. Malicious Uses, V. Human-Computer Interaction Harms, VI. Automation, Access, and Environmental Harms. The first area concerns the perpetuation of stereotypes, unfair discrimination, exclusionary norms, toxic language, and lower performance by social group for LMs. The second focuses on risks from private data leaks or LMs correctly inferring sensitive information. The third addresses risks arising from poor, false or misleading information including in sensitive domains, and knock-on risks such as the erosion of trust in shared information. The fourth considers risks from actors who try to use LMs to cause harm. The fifth focuses on risks specific to LLMs used to underpin conversational agents that interact with human users, including unsafe use, manipulation or deception. The sixth discusses the risk of environmental harm, job automation, and other challenges that may have a disparate effect on different social groups or communities. In total, we review 21 risks in-depth. We discuss the points of origin of different risks and point to potential mitigation approaches. Lastly, we discuss organisational responsibilities in implementing mitigations, and the role of collaboration and participation. We highlight directions for further research, particularly on expanding the toolkit for assessing and evaluating the outlined risks in LMs.

hub tools

citation-role summary

background 27 other 2

citation-polarity summary

claims ledger

  • abstract This paper aims to help structure the risk landscape associated with large-scale Language Models (LMs). In order to foster advances in responsible innovation, an in-depth understanding of the potential risks posed by these models is needed. A wide range of established and anticipated risks are analysed in detail, drawing on multidisciplinary expertise and literature from computer science, linguistics, and social sciences. We outline six specific risk areas: I. Discrimination, Exclusion and Toxicity, II. Information Hazards, III. Misinformation Harms, V. Malicious Uses, V. Human-Computer Inte

co-cited works

clear filters

representative citing papers

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

A Generalist Agent

cs.AI · 2022-05-12 · accept · novelty 7.0

Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

OPT: Open Pre-trained Transformer Language Models

cs.CL · 2022-05-02 · unverdicted · novelty 7.0 · 2 refs

OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

Flamingo: a Visual Language Model for Few-Shot Learning

cs.CV · 2022-04-29 · unverdicted · novelty 7.0

Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

Grad Detect: Gradient-Based Hallucination Detection in LLMs

cs.LG · 2026-06-23 · unverdicted · novelty 6.0

Grad Detect uses internal gradient patterns from one inference pass to predict LLM hallucinations and abstention, outperforming confidence and sampling baselines on Q&A benchmarks with most signal in the final five layers.

citing papers explorer

Showing 11 of 11 citing papers after filters.

  • A Generalist Agent cs.AI · 2022-05-12 · accept · none · ref 58 · internal anchor

    Gato is a multi-modal, multi-task, multi-embodiment generalist policy using one transformer network to handle text, vision, games, and robotics tasks.

  • OPT: Open Pre-trained Transformer Language Models cs.CL · 2022-05-02 · unverdicted · none · ref 268 · 2 links · internal anchor

    OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.

  • Flamingo: a Visual Language Model for Few-Shot Learning cs.CV · 2022-04-29 · unverdicted · none · ref 127 · internal anchor

    Flamingo models reach new state-of-the-art few-shot results on image and video tasks by bridging frozen vision and language models with cross-attention layers trained on interleaved web-scale data.

  • Ignore Previous Prompt: Attack Techniques For Language Models cs.CL · 2022-11-17 · unverdicted · none · ref 27 · internal anchor

    PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.

  • Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned cs.CL · 2022-08-23 · accept · none · ref 59 · internal anchor

    RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

  • Language Models (Mostly) Know What They Know cs.CL · 2022-07-11 · unverdicted · none · ref 64 · internal anchor

    Language models show good calibration when asked to estimate the probability that their own answers are correct, with performance improving as models get larger.

  • Emergent Abilities of Large Language Models cs.CL · 2022-06-15 · unverdicted · none · ref 94 · internal anchor

    Emergent abilities are capabilities present in large language models but absent in smaller ones and cannot be predicted by extrapolating smaller model performance.

  • Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback cs.CL · 2022-04-12 · unverdicted · none · ref 19 · internal anchor

    RLHF alignment training on language models boosts NLP performance, supports skill specialization, enables weekly online updates with fresh human data, and shows a linear relation between RL reward and sqrt(KL divergence from initialization.

  • PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 163 · internal anchor

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  • LaMDA: Language Models for Dialog Applications cs.CL · 2022-01-20 · unverdicted · none · ref 54 · internal anchor

    LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

  • Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model cs.CL · 2022-01-28 · unverdicted · none · ref 71 · internal anchor

    Trained the largest monolithic 530B-parameter transformer language model to date and reported new state-of-the-art zero- and few-shot results on multiple NLP benchmarks.