DEPO formulates detector-evasive paraphrasing as a constrained MDP and solves it via Lagrangian primal-dual RL with GRPO-style updates to achieve evasion while satisfying a semantic-preservation constraint.
hub
Release Strategies and the Social Impacts of Language Models
32 Pith papers cite this work. Polarity classification is still indexing.
abstract
Large language models have a range of beneficial uses: they can assist in prose, poetry, and programming; analyze dataset biases; and more. However, their flexibility and generative capabilities also raise misuse concerns. This report discusses OpenAI's work related to the release of its GPT-2 language model. It discusses staged release, which allows time between model releases to conduct risk and benefit analyses as model sizes increased. It also discusses ongoing partnership-based research and provides recommendations for better coordination and responsible publication in AI.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
A trace-based benchmark of 30 security tasks finds that less-restricted LLM derivatives outperform stock safety-aligned models on some agent tasks for Gemma but not Qwen or Llama, with similar patterns on non-security controls.
Adapts change point detection to segment human-LLM co-authored text using weighted and generalized algorithms with minimax optimality and strong empirical results against baselines.
Open source AI shows lower collaboration intensity, reduced direct contributions, and a shift toward adaptive use rather than joint improvement compared to traditional OSS.
ExaGPT uses span-level similarity retrieval from human and LLM datastores to detect machine-generated text while supplying the matching spans as human-interpretable evidence, achieving up to 37-point accuracy gains over prior interpretable detectors at 1% FPR.
LLMs trained on simple specification gaming generalize to zero-shot reward tampering including rewriting their own reward function.
Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
RAG models set new state-of-the-art results on open-domain QA by retrieving Wikipedia passages and conditioning a generative model on them, while also producing more factual text than parametric baselines.
DetectZoo is a unified toolkit providing reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline for AI-generated content detection across text, audio, and image modalities.
An image-semantic guided method enhances MLLMs for detecting AI-generated modern Chinese poetry by combining poem text with visual representations of content, achieving 85.65% Macro-F1 with Gemini and outperforming text baselines and RoBERTa.
Steer-to-Detect learns a steering vector injected into LLM hidden states to boost class separability and applies hypothesis testing with finite-sample Type I/II error guarantees for generated-text detection.
MELD is a multi-task AI-text detector using auxiliary heads, uncertainty-weighted losses, EMA distillation, and pairwise ranking that reaches 99.9% TPR at 1% FPR on a new held-out benchmark while remaining competitive on the RAID leaderboard.
BREW uses block voting and window-shifting verification to reach TPR 0.965 and FPR 0.02 under 10% synonym substitution, addressing high false-positive issues in prior multi-bit LLM watermarking.
DSIPA is a zero-shot black-box detector that uses sentiment distribution consistency and preservation metrics to identify LLM text, reporting up to 49.89% F1 gains over baselines across domains and models.
Luminol-AIDetect detects machine-generated text zero-shot by extracting perplexity-based features from an input and its shuffled version, using density estimation to exploit greater dispersion in MGT perplexity under shuffling.
IRM derives implicit reward signals from off-the-shelf LLMs to detect generated text zero-shot and reports better results than prior zero-shot and supervised detectors on the DetectRL benchmark.
A proposed pipeline shows LLMs introduce detectable race and gender biases when summarizing life narratives, creating potential for representational harm in research.
A human-centered design workshop with journalism practitioners yields an evaluation cookbook and design requirements for contextualized, value-aligned generative AI benchmarks.
GigaCheck detects LLM-generated text at both document and span levels by combining fine-tuned language-model embeddings with a DETR-like architecture that treats generated intervals as detectable objects.
Recursive paraphrasing attacks substantially lower detection rates for multiple AI text detectors with only minor quality loss, while a theoretical analysis ties best-case AUROC to total variation distance between human and AI distributions.
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
CodeXGLUE supplies a standardized collection of 10 code-related tasks, 14 datasets, an evaluation platform, and BERT-, GPT-, and encoder-decoder-style baselines.
Reveals hidden human-like spans in machine-generated texts that raise detection complexity and proposes a stacked enhancement framework with hard-EM optimization to improve detectors across LLMs.
A multi-level framework that models local and global relations among token detection scores to improve machine-generated text detection with low overhead.
citing papers explorer
-
DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities
DetectZoo is a unified toolkit providing reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline for AI-generated content detection across text, audio, and image modalities.