Annotation alignment: Comparing LLM and human annotations of conversational safety

Rajiv Movva, Pang Wei Koh, Emma Pierson · 2024 · arXiv 2406.06369

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)

cs.CR · 2026-05-19 · accept · novelty 7.0

Systematic review of thirteen malicious-code prompt corpora for coding LLM refusal evaluation that catalogs construction methods, surfaces gaps in human baselines, cross-corpus comparability, and malware taxonomies, and proposes methodological improvements.

Understanding Annotator Safety Policy with Interpretability

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

Annotator Policy Models learn safety policies from labeling behavior alone, accurately predicting responses and revealing sources of disagreement like policy ambiguity and value pluralism.

Code as a Weapon: A Consensus-Labeled Prompt Bank for Measuring Coding-Model Compliance with Malicious-Code Requests

cs.CR · 2026-05-27 · unverdicted · novelty 5.0

Consolidates eight corpora into a 6,671-prompt bank with five-judge consensus labels separating executable malicious code requests (4,748) from harmful security knowledge requests (1,923), achieving Fleiss' kappa 0.767.

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

cs.CR · 2026-05-04 · accept · novelty 5.0

The paper releases a 1,554-prompt consensus-labeled bank separating executable malicious code requests from security knowledge requests, validated by five-model majority labeling with Fleiss' kappa of 0.876.

Discriminatory Compliance: How LLMs Answer Queries from Protected Groups

cs.CY · 2026-06-19 · unverdicted · novelty 4.0

State-of-the-art LLMs respond inconsistently to queries from protected-group personas, with some responses omitting key information that should be provided.

Improving the Distributional Alignment of LLMs using Supervision

cs.CL · 2025-07-01 · unverdicted · novelty 4.0

Simple supervision improves LLM distributional alignment with diverse population groups on three datasets, with evaluation across multiple models and prompts providing a benchmark.

citing papers explorer

Refusal Evaluation in Coding LLMs and Code Agents: A Systematic Review of Thirteen Malicious-Code Prompt Corpora (2023-2025)
cs.CR · 2026-05-19 · accept · none · ref 38
A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts
cs.CR · 2026-05-04 · accept · none · ref 43
