hub

Findings of EMNLP , year =

Samuel Gehman, Suchin Gururangan, Maarten Sap, Yejin Choi, Noah A Smith · 2009 · arXiv 2009.11462

18 Pith papers cite this work. Polarity classification is still indexing.

18 Pith papers citing it

read on arXiv browse 18 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

The Safety-Aware Denoiser for Text Diffusion Models

cs.LG · 2026-04-28 · unverdicted · novelty 7.0

SAD modifies the denoising process in text diffusion models to enforce safety constraints at inference time, reducing unsafe generations while preserving quality and diversity.

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation

cs.CL · 2026-04-10 · accept · novelty 7.0

SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.

Steering Language Models With Activation Engineering

cs.CL · 2023-08-20 · unverdicted · novelty 7.0

Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.

Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing

cs.CR · 2026-05-11 · unverdicted · novelty 6.0

DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.

Query-efficient model evaluation using cached responses

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.

Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.

Representation-Guided Parameter-Efficient LLM Unlearning

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.

Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment

cs.LG · 2026-04-06 · unverdicted · novelty 6.0

Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

cs.LG · 2023-10-05 · accept · novelty 6.0

SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts

cs.AI · 2023-09-19 · unverdicted · novelty 6.0

GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.

Ignore Previous Prompt: Attack Techniques For Language Models

cs.CL · 2022-11-17 · unverdicted · novelty 6.0

PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

cs.CL · 2022-08-23 · accept · novelty 6.0

RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.

Ethical and social risks of harm from Language Models

cs.CL · 2021-12-08 · accept · novelty 6.0

The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.

A General Language Assistant as a Laboratory for Alignment

cs.CL · 2021-12-01 · conditional · novelty 6.0

Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.

The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior

cs.LG · 2026-04-07 · unverdicted · novelty 5.0

Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.

Galactica: A Large Language Model for Science

cs.CL · 2022-11-16 · unverdicted · novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

citing papers explorer

Showing 18 of 18 citing papers.

The Safety-Aware Denoiser for Text Diffusion Models cs.LG · 2026-04-28 · unverdicted · none · ref 13
SAD modifies the denoising process in text diffusion models to enforce safety constraints at inference time, reducing unsafe generations while preserving quality and diversity.
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control cs.LG · 2026-04-21 · conditional · none · ref 56
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
SPASM: Stable Persona-driven Agent Simulation for Multi-turn Dialogue Generation cs.CL · 2026-04-10 · accept · none · ref 10
SPASM introduces a stability-first framework with Egocentric Context Projection to maintain consistent personas and eliminate echoing in multi-turn LLM agent dialogues.
Steering Language Models With Activation Engineering cs.CL · 2023-08-20 · unverdicted · none · ref 148
Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.
Navigating the Sea of LLM Evaluation: Investigating Bias in Toxicity Benchmarks cs.AI · 2026-05-11 · unverdicted · none · ref 7
Toxicity benchmarks for LLMs produce inconsistent results when task type, input domain, or model changes, revealing intrinsic evaluation biases.
Guaranteed Jailbreaking Defense via Disrupt-and-Rectify Smoothing cs.CR · 2026-05-11 · unverdicted · none · ref 12
DR-Smoothing introduces a disrupt-then-rectify prompt processing scheme into smoothing defenses, delivering tight theoretical bounds on success probability against both token- and prompt-level jailbreaks.
Query-efficient model evaluation using cached responses cs.LG · 2026-05-08 · unverdicted · none · ref 39
DKPS-based methods leverage cached model responses to achieve equivalent benchmark prediction accuracy with substantially fewer queries than standard evaluation.
Dr. Post-Training: A Data Regularization Perspective on LLM Post-Training cs.LG · 2026-05-08 · unverdicted · none · ref 135
Dr. Post-Training reframes general data as a data-induced regularizer for LLM post-training updates, yielding a family of methods that outperform data-selection baselines on SFT, RLHF, and RLVR tasks.
Representation-Guided Parameter-Efficient LLM Unlearning cs.CL · 2026-04-19 · unverdicted · none · ref 61
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
Relative Density Ratio Optimization for Stable and Statistically Consistent Model Alignment cs.LG · 2026-04-06 · unverdicted · none · ref 6
Relative density ratio optimization stabilizes direct density ratio estimation for language model alignment while preserving statistical consistency without assuming a Bradley-Terry preference model.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks cs.LG · 2023-10-05 · accept · none · ref 1
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
GPTFUZZER: Red Teaming Large Language Models with Auto-Generated Jailbreak Prompts cs.AI · 2023-09-19 · unverdicted · none · ref 21
GPTFuzz is a black-box fuzzing framework that mutates seed jailbreak templates to automatically generate effective attacks, achieving over 90% success rates on models including ChatGPT and Llama-2.
Ignore Previous Prompt: Attack Techniques For Language Models cs.CL · 2022-11-17 · unverdicted · none · ref 7
PromptInject shows that simple adversarial prompts can cause goal hijacking and prompt leaking in GPT-3, exploiting its stochastic behavior.
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned cs.CL · 2022-08-23 · accept · none · ref 25
RLHF-aligned language models show increasing resistance to red teaming with scale up to 52B parameters, unlike prompted or rejection-sampled models, supported by a released dataset of 38,961 attacks.
Ethical and social risks of harm from Language Models cs.CL · 2021-12-08 · accept · none · ref 89
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
A General Language Assistant as a Laboratory for Alignment cs.CL · 2021-12-01 · conditional · none · ref 222
Ranked preference modeling outperforms imitation learning for language model alignment and scales more favorably with model size.
The Role of Emotional Stimuli and Intensity in Shaping Large Language Model Behavior cs.LG · 2026-04-07 · unverdicted · none · ref 3
Positive emotional prompts improve LLM accuracy and reduce toxicity but increase sycophantic agreement, while negative emotions show the reverse pattern.
Galactica: A Large Language Model for Science cs.CL · 2022-11-16 · unverdicted · none · ref 165
Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Findings of EMNLP , year =

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer