Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, and Colin Raffel

Nicholas Carlini, Florian Tramèr, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Úlfar Erlingsson, Alina Oprea, Colin Raffel · 2021 · arXiv 2012.07805

Canonical reference. 100% of citing Pith papers cite this work as background.

25 Pith papers citing it

Background 100% of classified citations

citation-role summary

background 6

citation-polarity summary

background 6

representative citing papers

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

MusicLM: Generating Music From Text

cs.SD · 2023-01-26 · conditional · novelty 8.0

MusicLM produces coherent multi-minute 24 kHz music from text prompts using hierarchical sequence-to-sequence modeling and outperforms prior systems in quality and text adherence.

The Pile: An 800GB Dataset of Diverse Text for Language Modeling

cs.CL · 2020-12-31 · conditional · novelty 8.0

The Pile is a newly constructed 825 GiB dataset from 22 diverse sources that enables language models to achieve better performance on academic, professional, and cross-domain tasks than models trained on Common Crawl variants.

MRMMIA: Membership Inference Attacks on Memory in Chat Agents

cs.CR · 2026-05-27 · unverdicted · novelty 7.0

MRMMIA is a multi-recall-probe membership inference attack that extracts signals from chat agent memory and outperforms baselines in black-, gray-, and white-box settings.

Reading the Finetuning Prior: Verbatim Content Recovery via Contrastive Decoding Diffing

cs.LG · 2026-05-25 · unverdicted · novelty 7.0

Contrastive Decoding Diffing recovers exact implanted facts from finetuned LLMs via logit-space differences between finetuned and base models, outperforming white-box baselines with less access.

Dataset Watermarking for Closed LLMs with Provable Detection

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

A new watermarking method for closed LLMs boosts random word-pair co-occurrences via rephrasing and detects the signal statistically in outputs, working reliably even when the watermarked data is only 1% of fine-tuning tokens while preserving utility.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

When Tables Leak: Attacking String Memorization in LLM-Based Tabular Data Generation

cs.LG · 2025-12-09 · conditional · novelty 7.0

LLM tabular generators leak memorized numeric strings, allowing a no-box attack to achieve near-perfect membership inference on some state-of-the-art models.

SynBench: A Benchmark for Differentially Private Text Generation

cs.AI · 2025-09-18 · conditional · novelty 7.0

SynBench benchmarks DP text generators across nine datasets and uses a new MIA to show that public pre-training on portions of private data overestimates synthetic text quality and breaks DP privacy bounds.

Smoothie: Smoothing Diffusion on Token Embeddings for Text Generation

cs.CL · 2025-05-24 · unverdicted · novelty 7.0

Smoothie performs diffusion by smoothing token embeddings based on semantic similarity, outperforming prior diffusion models on sequence-to-sequence and unconditional text generation tasks.

Quantifying Memorization Across Neural Language Models

cs.LG · 2022-02-15 · unverdicted · novelty 7.0

Memorization in language models increases log-linearly with model capacity, data duplication count, and prompt context length.

Which Defense Closes Which Threat? Attributing OWASP-LLM-Top-10 Coverage and Its Brittleness Under Paraphrasing

cs.CR · 2026-06-01 · unverdicted · novelty 6.0

Empirical attribution shows refusal blocks jailbreaks and prompt leakage, budget blocks sensitive disclosure and unbounded consumption, full stack needed for excessive agency, with refusal brittle to paraphrasing but budget robust.

LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems

cs.AI · 2026-05-21 · unverdicted · novelty 6.0

LCGuard applies adversarial training to transform KV cache artifacts in multi-agent LLMs, reducing reconstructable sensitive information while preserving task performance.

The Interlocutor Effect: Why LLMs Leak More Personal Data to Agents Than Humans

cs.HC · 2026-04-26 · unverdicted · novelty 6.0

LLMs leak up to 23 percentage points more PII to AI agents than humans, attributed to inactive safety attention heads in 3,464 tested interactions.

Separable Expert Architecture: Toward Privacy-Preserving LLM Personalization via Composable Adapters and Deletable User Proxies

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

A separable expert architecture uses base models, LoRA adapters, and deletable per-user proxies to enable privacy-preserving personalization and deterministic unlearning in LLMs.

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

cs.CR · 2026-04-07 · unverdicted · novelty 6.0

Swiss-Bench 003 extends an existing Swiss LLM assessment with two new dimensions and evaluates ten models on 808 items, finding high self-graded reliability scores but low adversarial security scores.

LIMO: Less is More for Reasoning

cs.CL · 2025-02-05 · unverdicted · novelty 6.0

LIMO achieves 63.3% on AIME24 and 95.6% on MATH500 via supervised fine-tuning on roughly 1% of the data used by prior models, supporting the claim that minimal strategic examples suffice when pre-training has already encoded domain knowledge.

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

cs.CL · 2022-08-23 · accept · novelty 6.0

RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

Scaling Laws and Interpretability of Learning from Repeated Data

cs.LG · 2022-05-21 · accept · novelty 6.0

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

LaMDA: Language Models for Dialog Applications

cs.CL · 2022-01-20 · unverdicted · novelty 6.0

LaMDA shows that fine-tuning on human-value annotations and consulting external knowledge sources significantly improves safety and factual grounding in large dialog models beyond what scaling alone achieves.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

Deduplicating Training Data Makes Language Models Better

cs.CL · 2021-07-14 · unverdicted · novelty 6.0

Deduplicating training datasets reduces language model verbatim memorization by 10x, improves training efficiency, and enables more accurate evaluation by cutting train-test overlap.

Making AI-Assisted Grant Evaluation Auditable without Exposing the Model

cs.CR · 2026-04-28 · unverdicted · novelty 4.0

A TEE-based remote attestation system creates signed evaluation bundles that link input hashes, model measurements, and outputs to make AI grant reviews verifiable without revealing proprietary components.

Towards the Anonymization of the Language Modeling

cs.CL · 2025-01-05 · unverdicted · novelty 4.0

Authors introduce MLM and CLM specialization methods that avoid memorizing identifiers in sensitive training data while aiming for a privacy-utility tradeoff on medical datasets.

citing papers explorer

Showing 3 of 3 citing papers after filters.

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

cs.CL · 2022-08-23 · accept · none · ref 13

RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

Scaling Laws and Interpretability of Learning from Repeated Data

cs.LG · 2022-05-21 · accept · none · ref 21

Repeating 0.1% of training data 100 times degrades an 800M parameter model's performance to that of a 400M model by damaging copying mechanisms and induction heads associated with generalization.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · none · ref 45

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
