Felkner, Ho-Chun Herbert Chang, Eugene Jang, and Jonathan May

Felkner, Virginia, Chang, Ho-Chun Herbert, Jang, Eugene, May, Jonathan · 2023 · DOI 10.18653/v1/2023.acl-long.507

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

The paper characterizes deductive stereotyping in LLMs and introduces Fair-GCG to discover injection phrases that improve fairness across benchmarks, reasoning, and real-world tasks.

The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs

cs.CL · 2026-06-17 · unverdicted · novelty 6.0

LLMs exhibit misfired alignment on stereotype questions at 4.7-18.9% rates on the new VETO benchmark of 2,032 contrastive pairs, unlike humans at 0%, due to overgeneralized safety cues after instruction tuning.

It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO

cs.CL · 2026-06-09 · unverdicted · novelty 5.0

One-shot GRPO on a single biased example induces generalizing stereotype bias in post-trained LLMs, with susceptibility varying by initial bias likelihood.

AI Safety Landscape for Large Language Models: Taxonomy, State-of-the-art, and Future Directions

cs.AI · 2024-08-23 · unverdicted · novelty 4.0

The paper introduces a taxonomy of AI safety for LLMs organized into Trustworthy AI, Responsible AI, and Safe AI perspectives, accompanied by a review of state-of-the-art methods, challenges, and future directions.

citing papers explorer

Wait, am I Being Fair? Characterizing Deductive Stereotyping and Mitigating It with Fair-GCG · cs.CL · 2026-06-30 · unverdicted · none · ref 32
The Wrong Kind of Right: Quantifying and Localizing Misfired Alignment in LLMs · cs.CL · 2026-06-17 · unverdicted · none · ref 29
It Takes One to Bias Them All: Breaking Bad with One-Shot GRPO · cs.CL · 2026-06-09 · unverdicted · none · ref 7
