Release Strategies and the Social Impacts of Language Models

Irene Solaiman, Miles Brundage, Jack Clark, Amanda Askell, Ariel Herbert-Voss, Jeff Wu · 2019 · cs.CL · arXiv 1908.09203

32 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large language models have a range of beneficial uses: they can assist in prose, poetry, and programming; analyze dataset biases; and more. However, their flexibility and generative capabilities also raise misuse concerns. This report discusses OpenAI's work related to the release of its GPT-2 language model. It discusses staged release, which allows time between model releases to conduct risk and benefit analyses as model sizes increased. It also discusses ongoing partnership-based research and provides recommendations for better coordination and responsible publication in AI.

citation-role summary

background: 3 · baseline: 1

citation-polarity summary

background: 2 · baseline: 1 · support: 1

representative citing papers

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.

Measuring Safety Alignment Effects in Autonomous Security Agents

cs.CR · 2026-05-19 · conditional · novelty 7.0

A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

Segmenting Human-LLM Co-authored Text via Change Point Detection

cs.CL · 2026-05-05 · unverdicted · novelty 7.0

Adapts change point detection to segment human-LLM co-authored text using weighted and generalized algorithms with minimax optimality and strong empirical results against baselines.

From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence

cs.SE · 2026-04-10 · conditional · novelty 7.0

Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.

ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability

cs.CL · 2025-02-17 · unverdicted · novelty 7.0

ExaGPT uses span-level similarity retrieval from human and LLM datastores to detect machine-generated text while supplying the matching spans as human-interpretable evidence, achieving up to 37-point accuracy gains over prior interpretable detectors at 1% FPR.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

cs.AI · 2024-06-14 · conditional · novelty 7.0

LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Multitask Prompted Training Enables Zero-Shot Task Generalization

cs.LG · 2021-10-15 · conditional · novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

cs.CL · 2020-05-22 · accept · novelty 7.0

RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

cs.MM · 2026-06-02 · unverdicted · novelty 6.0

DetectZoo is a unified toolkit providing reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline for AI-generated content detection across text, audio, and image modalities.

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

cs.CL · 2026-05-21 · unverdicted · novelty 6.0

An image-semantic guided method enhances MLLMs for detecting AI-generated modern Chinese poetry by combining poem text with visual representations of content, achieving 85.65% Macro-F1 with Gemini and outperforming text baselines and RoBERTa.

Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts

stat.AP · 2026-05-13 · unverdicted · novelty 6.0

Steer-to-Detect learns a steering vector injected into LLM hidden states to boost class separability and applies hypothesis testing with finite-sample Type I/II error guarantees for generated-text detection.

MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID leaderboard.

Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking

cs.CR · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

BREW uses block voting and window-shifting verification to reach TPR 0.965 and FPR 0.02 under 10% synonym substitution, addressing high false-positive issues in prior multi-bit LLM watermarking.

DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis

cs.CL · 2026-04-29 · unverdicted · novelty 6.0

DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.

Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling

cs.CL · 2026-04-28 · unverdicted · novelty 6.0

Luminol-AIDetect detects machine-generated text zero-shot by extracting perplexity-based features from an input and its shuffled version, using density estimation to exploit greater dispersion in MGT perplexity under shuffling.

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

IRM derives implicit reward signals from off-the-shelf LLMs to detect generated text zero-shot and reports better results than prior zero-shot and supervised detectors on the DetectRL benchmark.

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives

cs.CL · 2026-04-22 · unverdicted · novelty 6.0

A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners

cs.HC · 2025-09-30 · unverdicted · novelty 6.0

A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.

GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization

cs.CL · 2024-10-31 · unverdicted · novelty 6.0

GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.

Can AI-Generated Text be Reliably Detected?

cs.CL · 2023-03-17 · unverdicted · novelty 6.0

Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation

cs.SE · 2021-02-09 · unverdicted · novelty 6.0

CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

cs.CL · 2026-05-22 · unverdicted · novelty 5.0

Reveals hidden human-like spans in machine-generated texts that raise detection complexity and proposes a stacked enhancement framework with hard-EM optimization to improve detectors across LLMs.

Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection

cs.CL · 2026-05-15 · unverdicted · novelty 5.0

A multi-level framework that models local and global relations among token detection scores to improve machine-generated text detection with low overhead.

citing papers explorer

Showing 32 of 32 citing papers.

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization
cs.LG · 2026-05-29 · unverdicted · none · ref 14 · internal anchor
DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.

Measuring Safety Alignment Effects in Autonomous Security Agents
cs.CR · 2026-05-19 · conditional · none · ref 52 · internal anchor
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.

Segmenting Human-LLM Co-authored Text via Change Point Detection
cs.CL · 2026-05-05 · unverdicted · none · ref 12 · internal anchor
Adapts change point detection to segment human-LLM co-authored text using weighted and generalized algorithms with minimax optimality and strong empirical results against baselines.

From OSS to Open Source AI: an Exploratory Study of Collaborative Development Paradigm Divergence
cs.SE · 2026-04-10 · conditional · none · ref 98 · internal anchor
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.

ExaGPT: Example-Based Machine-Generated Text Detection for Human Interpretability
cs.CL · 2025-02-17 · unverdicted · none · ref 32 · internal anchor
ExaGPT uses span-level similarity retrieval from human and LLM datastores to detect machine-generated text while supplying the matching spans as human-interpretable evidence, achieving up to 37-point accuracy gains over prior interpretable detectors at 1% FPR.

Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
cs.AI · 2024-06-14 · conditional · none · ref 300 · internal anchor
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.

Multitask Prompted Training Enables Zero-Shot Task Generalization
cs.LG · 2021-10-15 · conditional · none · ref 60 · internal anchor
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
cs.CL · 2020-05-22 · accept · none · ref 57 · internal anchor
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities
cs.MM · 2026-06-02 · unverdicted · none · ref 20 · internal anchor
DetectZoo is a unified toolkit providing reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline for AI-generated content detection across text, audio, and image modalities.

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs
cs.CL · 2026-05-21 · unverdicted · none · ref 110 · internal anchor
An image-semantic guided method enhances MLLMs for detecting AI-generated modern Chinese poetry by combining poem text with visual representations of content, achieving 85.65% Macro-F1 with Gemini and outperforming text baselines and RoBERTa.

Steer-to-Detect: Probing Hidden Representations for Detection of LLM-Generated Texts
stat.AP · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
Steer-to-Detect learns a steering vector injected into LLM hidden states to boost class separability and applies hypothesis testing with finite-sample Type I/II error guarantees for generated-text detection.

MELD: Multi-Task Equilibrated Learning Detector for AI-Generated Text
cs.CL · 2026-05-07 · unverdicted · none · ref 30 · internal anchor
MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID leaderboard.

Block-wise Codeword Embedding for Reliable Multi-bit Text Watermarking
cs.CR · 2026-05-01 · unverdicted · none · ref 18 · 2 links · internal anchor
BREW uses block voting and window-shifting verification to reach TPR 0.965 and FPR 0.02 under 10% synonym substitution, addressing high false-positive issues in prior multi-bit LLM watermarking.

DSIPA: Detecting LLM-Generated Texts via Sentiment-Invariant Patterns Divergence Analysis
cs.CL · 2026-04-29 · unverdicted · none · ref 22 · internal anchor
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.

Luminol-AIDetect: Fast Zero-shot Machine-Generated Text Detection based on Perplexity under Text Shuffling
cs.CL · 2026-04-28 · unverdicted · none · ref 27 · internal anchor
Luminol-AIDetect detects machine-generated text zero-shot by extracting perplexity-based features from an input and its shuffled version, using density estimation to exploit greater dispersion in MGT perplexity under shuffling.

Zero-Shot Detection of LLM-Generated Text via Implicit Reward Model
cs.CL · 2026-04-23 · unverdicted · none · ref 11 · internal anchor
IRM derives implicit reward signals from off-the-shelf LLMs to detect generated text zero-shot and reports better results than prior zero-shot and supervised detectors on the DetectRL benchmark.

Whose Story Gets Told? Positionality and Bias in LLM Summaries of Life Narratives
cs.CL · 2026-04-22 · unverdicted · none · ref 207 · internal anchor
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.

Towards Real-World Validity in Generative AI Benchmarks: Understanding and Designing Domain-Centered Evaluations for Journalism Practitioners
cs.HC · 2025-09-30 · unverdicted · none · ref 62 · internal anchor
A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.

GigaCheck: Detecting LLM-generated Content via Object-Centric Span Localization
cs.CL · 2024-10-31 · unverdicted · none · ref 66 · internal anchor
GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.

Can AI-Generated Text be Reliably Detected?
cs.CL · 2023-03-17 · unverdicted · none · ref 88 · internal anchor
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.

Ethical and social risks of harm from Language Models
cs.CL · 2021-12-08 · accept · none · ref 260 · internal anchor
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

CodeXGLUE: A Machine Learning Benchmark Dataset for Code Understanding and Generation
cs.SE · 2021-02-09 · unverdicted · none · ref 73 · internal anchor
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement
cs.CL · 2026-05-22 · unverdicted · none · ref 9 · internal anchor
Reveals hidden human-like spans in machine-generated texts that raise detection complexity and proposes a stacked enhancement framework with hard-EM optimization to improve detectors across LLMs.

Multi-Level Contextual Token Relation Modeling for Machine-Generated Text Detection
cs.CL · 2026-05-15 · unverdicted · none · ref 7 · internal anchor
A multi-level framework that models local and global relations among token detection scores to improve machine-generated text detection with low overhead.

DetectRL-X: Towards Reliable Multilingual and Real-World LLM-Generated Text Detection
cs.CL · 2026-05-15 · unverdicted · none · ref 2 · 2 links · internal anchor
DetectRL-X is a multilingual benchmark evaluating LLM text detectors on 8 languages, 6 domains, 4 commercial generators, and paraphrasing/perturbation attacks.

Alignment Imprint: Zero-Shot AI-Generated Text Detection via Provable Preference Discrepancy
cs.AI · 2026-04-18 · unverdicted · none · ref 10 · internal anchor
LAPD, derived from the provable preference discrepancy in aligned LLMs, improves zero-shot AI text detection by 45.82% over baselines with claimed statistical dominance over Fast-DetectGPT.

Interpretable Stylistic Variation in Human and LLM Writing Across Genres, Models, and Decoding Strategies
cs.CL · 2026-04-15 · unverdicted · none · ref 6 · internal anchor
Genre and model exert stronger influence on writing style than human/LLM source or decoding strategy in a broad comparison of lexicogrammatical features.

Detecting LLM-Assisted Academic Dishonesty using Keystroke Dynamics
cs.HC · 2025-11-16 · unverdicted · none · ref 9 · internal anchor
Keystroke dynamics models outperform text-only detectors for spotting LLM-assisted academic dishonesty in practical scenarios, though performance drops under adversarial conditions.

Lightweight Stylistic Consistency Profiling: Robust Detection of LLM-Generated Textual Content for Multimedia Moderation
cs.CL · 2026-05-07 · unverdicted · none · ref 37 · internal anchor
LiSCP detects LLM-generated text via stylistic consistency profiling across paraphrased variants and reports up to 11.79% better cross-domain accuracy plus robustness to adversarial attacks.

LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
cs.SE · 2026-04-17 · unverdicted · none · ref 22 · internal anchor
LLMSniffer improves detection of LLM-generated code on GPTSniffer and Whodunit benchmarks by fine-tuning GraphCodeBERT via two-stage supervised contrastive learning plus preprocessing and MLP classification.

An Overview of Catastrophic AI Risks
cs.CY · 2023-06-21 · accept · none · ref 106 · internal anchor
The paper categorizes sources of catastrophic AI risks into malicious use, AI race, organizational risks, and rogue AIs, providing illustrative stories and mitigation suggestions for each.

Rate-Distortion Optimization for Transformer Inference
cs.LG · 2026-01-29 · unreviewed · ref 47 · internal anchor
