pith. machine review for the scientific record.

arxiv: 2604.17769 · v1 · submitted 2026-04-20 · 💻 cs.CL · cs.AI


Reverse Constitutional AI: A Framework for Controllable Toxic Data Generation via Probability-Clamped RLAIF


Pith reviewed 2026-05-10 04:32 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords Reverse Constitutional AI · toxic data generation · red teaming · RLAIF · LLM safety · adversarial data · probability clamping · constitution inversion

The pith

Inverting a harmless AI constitution produces controllable toxic data for automated red teaming of language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Reverse Constitutional AI as a way to generate toxic training data automatically: it flips a safe constitution into one that promotes harm, then refines outputs through repeated AI critique and revision. This addresses the need for scalable adversarial examples for testing LLM safety without relying on human annotators for each case. A core addition is probability clamping inside the reinforcement learning from AI feedback step, which limits how far the model can shift probabilities, blocking reward hacking while keeping the toxic intent intact. Experiments indicate the resulting data remains diverse and adversarially strong while gaining 15 percent better semantic coherence from the clamping. The overall result is a complete pipeline that turns constitution inversion into a systematic tool for safety evaluation.
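
As a rough editorial illustration of that loop (not the authors' code), a minimal Python sketch follows; llm is a hypothetical text-generation callable, and the prompt wording and round count are assumptions:

    # Minimal sketch of the R-CAI inversion and critique-revision loop.
    # Editorial illustration only: `llm(prompt)` is a hypothetical
    # text-generation callable; prompts and n_rounds are assumptions.

    def invert_constitution(principles, llm):
        """Negate each harmless principle into a toxicity-promoting one."""
        return [llm("Rewrite this safety principle so it promotes the "
                    "opposite, harmful behavior: " + p)
                for p in principles]

    def critique_revise(seed_prompt, toxic_constitution, llm, n_rounds=3):
        """Iteratively refine an output against the inverted constitution."""
        response = llm(seed_prompt)
        rules = "\n".join(toxic_constitution)
        for _ in range(n_rounds):
            critique = llm("Constitution:\n" + rules + "\n\nResponse:\n" +
                           response + "\n\nCritique how well the response "
                           "follows the constitution.")
            response = llm("Given this critique:\n" + critique +
                           "\n\nRevise the response to better satisfy the "
                           "constitution.")
        return response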

Core claim

By inverting a harmless constitution into a toxicity-focused one and running an iterative critique-revision process with probability-clamped reinforcement learning from AI feedback, R-CAI produces scalable, multi-dimensional toxic data whose adversarial strength remains high while semantic coherence improves by 15 percent over unclamped optimization.

What carries the argument

The reverse constitution inversion combined with probability clamping inside RLAIF, which bounds probability shifts during reward optimization to stabilize outputs while preserving the toxic intent defined by the inverted rules.
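
The clamping implementation is not spelled out on this page; under one plausible reading, where "probability shifts" are the per-token ratios between the updated policy and a reference policy, a minimal PyTorch sketch looks like the following (the bounds 0.8 and 1.2 are illustrative assumptions, not the paper's values):

    import torch

    def clamped_rlaif_loss(logp_new, logp_ref, advantages, lo=0.8, hi=1.2):
        """PPO-style surrogate with the probability ratio clamped to [lo, hi].

        logp_new / logp_ref: per-token log-probabilities under the current
        and reference policies; advantages: toxicity-reward advantages from
        the AI feedback model. The clamp bounds how far any token's
        probability can be pushed during reward optimization.
        """
        ratio = torch.exp(logp_new - logp_ref)
        clamped = torch.clamp(ratio, lo, hi)
        # Take the pessimistic branch, as in PPO's clipped objective.
        return -torch.min(ratio * advantages, clamped * advantages).mean()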

If this is right

  • Produces diverse toxic data at scale with no human annotation required.
  • Raises semantic coherence by 15 percent through probability clamping while keeping adversarial strength intact.
  • Supports systematic safety evaluation of aligned language models via a fully automated red-teaming pipeline.
  • Enables synthesis of multi-dimensional adversarial examples controlled by the dimensions in the inverted constitution.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same inversion-plus-clamping approach could be tested on generating other controlled harmful categories such as misinformation or bias examples.
  • If the method scales, it might allow continuous regeneration of fresh test cases inside ongoing alignment loops rather than one-time datasets.
  • A direct test would be to measure whether models fine-tuned to resist R-CAI data also resist human-crafted toxic prompts at similar rates.
  • The framework leaves open whether the same clamping technique can be applied to non-toxicity constitutions without losing output variety.

Load-bearing premise

That inverting a harmless constitution will reliably yield controllable toxic outputs whose adversarial intent survives refinement when probability clamping is used to block reward hacking and hold coherence steady.

What would settle it

Human raters scoring the generated toxic examples as markedly less coherent or less diverse than hand-written toxic datasets of comparable size would show the inversion and clamping do not deliver the claimed quality.

Figures

Figures reproduced from arXiv: 2604.17769 by Aimin Zhou, Fei Tan, Yiming Luo, Yuan Fang.

Figure 1. Automated data synthesis pipeline of the R-CAI framework.
Figure 2. Probability-clamped RLAIF process in the fine-tuning stage (Phase 2).
Figure 3. Comparison of toxicity and coherence scores across four models.
Figure 4. Comparison of response diversity scores between the base model and our R-CAI model.
Figure 5. Ablation study on the effect of various probability clamping bounds.
Figure 6. Dynamic progression of toxicity and coherence.
Original abstract

Ensuring the safety of large language models (LLMs) requires robust red teaming, yet the systematic synthesis of high-quality toxic data remains under-explored. We propose Reverse Constitutional AI (R-CAI), a framework for automated and controllable adversarial data generation that moves beyond isolated jailbreak prompts. By inverting a harmless constitution into a constitution of toxicity and iteratively refining model outputs through a critique-revision pipeline, R-CAI enables scalable synthesis of multi-dimensional adversarial data without human annotation. Optimizing solely for toxicity-related rewards, however, can lead to reward hacking and degraded semantic coherence. To address this challenge, we introduce probability clamping within reinforcement learning from AI feedback, which stabilizes adversarial optimization while preserving adversarial intent. Experiments demonstrate that R-CAI generates diverse, high-quality toxic data and that probability clamping substantially improves semantic coherence (15%) without sacrificing adversarial strength. Overall, R-CAI provides a fully automated framework for red teaming data generation and systematic safety evaluation of aligned language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Reverse Constitutional AI (R-CAI), a framework that inverts a harmless constitution into a toxicity-focused one, applies an iterative critique-revision pipeline, and incorporates probability clamping within RLAIF to generate controllable, diverse toxic data for LLM red teaming. It claims this approach avoids reward hacking, preserves adversarial strength, and yields a 15% improvement in semantic coherence over unclamped baselines.

Significance. If the empirical claims hold, R-CAI would provide a scalable, fully automated alternative to human-annotated toxic datasets, addressing a practical bottleneck in systematic safety evaluation of aligned LLMs. The probability-clamping mechanism offers a concrete stabilization technique for reward optimization in adversarial settings, which could generalize beyond toxicity generation.

major comments (2)
  1. [Abstract / Experiments] The central claim of a 15% semantic coherence improvement is presented without specifying the coherence metric (e.g., embedding similarity, human ratings, or automated scorer), the exact baseline (standard RLAIF vs. other variants), sample size, or error bars/statistical tests. This makes it impossible to assess whether the gain is robust or sensitive to post-hoc choices.
  2. [Method] The inversion of the harmless constitution and the precise implementation of probability clamping (threshold selection, clamping function, and integration into the RLAIF objective) are described only at a high level. Without these details it is unclear whether the controllability and anti-reward-hacking properties follow from the framework or from unstated hyperparameter tuning.
minor comments (2)
  1. [Abstract / Introduction] The abstract states that R-CAI is 'fully automated' and 'without human annotation,' yet the critique-revision pipeline implicitly relies on an AI judge whose constitution may embed human-designed principles; this tension should be clarified in the introduction or limitations.
  2. [Method] Notation for the probability-clamping operator is introduced without an explicit equation; adding a formal definition (e.g., Eq. (X) in the RLAIF subsection) would improve reproducibility.
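
For concreteness, one plausible formalization of the operator flagged in minor comment 2 (an editorial guess, not the paper's definition) clamps the per-token policy-to-reference probability ratio:

    r_t(\theta) = \frac{\pi_\theta(y_t \mid x, y_{<t})}{\pi_{\mathrm{ref}}(y_t \mid x, y_{<t})}, \qquad
    \mathrm{clamp}_{[\ell,u]}(r_t) = \min\bigl(u, \max(\ell, r_t)\bigr), \quad \ell < 1 < u,

with the clamped ratio substituted for the raw ratio in the RLAIF surrogate objective and the bounds (\ell, u) chosen on validation data.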

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We have carefully considered each point and made revisions to address the concerns regarding clarity in the abstract, experiments, and method sections.

Point-by-point responses
  1. Referee: [Abstract / Experiments] The central claim of a 15% semantic coherence improvement is presented without specifying the coherence metric (e.g., embedding similarity, human ratings, or automated scorer), the exact baseline (standard RLAIF vs. other variants), sample size, or error bars/statistical tests. This makes it impossible to assess whether the gain is robust or sensitive to post-hoc choices.

    Authors: We agree with the referee that additional details are necessary to substantiate the 15% improvement claim. Upon review, the experiments section does include the comparison to the unclamped RLAIF baseline, but we acknowledge that the metric, sample size, and statistical analysis were not sufficiently highlighted. In the revised manuscript, we explicitly state that coherence is evaluated using embedding-based similarity, report the sample size used for the evaluation, include error bars, and provide the results of statistical tests. This ensures the claim can be properly assessed for robustness. revision: yes

  2. Referee: [Method] The inversion of the harmless constitution and the precise implementation of probability clamping (threshold selection, clamping function, and integration into the RLAIF objective) are described only at a high level. Without these details it is unclear whether the controllability and anti-reward-hacking properties follow from the framework or from unstated hyperparameter tuning.

    Authors: We appreciate this observation. The method section aimed to provide an overview of the framework, but we recognize that more implementation specifics would strengthen the paper. In the revision, we have added precise descriptions of how the harmless constitution is inverted (by reversing each principle and incorporating toxicity objectives), the probability clamping mechanism including the threshold selection process via validation, the clamping function definition, and its integration into the RLAIF loss. We also include an analysis showing that the benefits persist across different hyperparameter choices, supporting that the properties are inherent to the approach. revision: yes
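
As a hedged illustration of the validation-based threshold selection described above (train_with_clamp, coherence_score, and toxicity_score are hypothetical stand-ins for the clamped-RLAIF training run and its evaluators, not the authors' code):

    # Hypothetical sweep over candidate clamp bounds (lo, hi), scored on
    # a validation set; all helpers are assumed stand-ins.

    def select_clamp_bounds(candidates, val_prompts, train_with_clamp,
                            coherence_score, toxicity_score):
        best_bounds, best_score = None, float("-inf")
        for lo, hi in candidates:
            model = train_with_clamp(lo, hi)
            outputs = [model.generate(p) for p in val_prompts]
            # Average coherence plus toxicity: both must stay high.
            score = sum(coherence_score(o) + toxicity_score(o)
                        for o in outputs) / len(outputs)
            if score > best_score:
                best_bounds, best_score = (lo, hi), score
        return best_bounds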

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes R-CAI as an inversion of standard Constitutional AI combined with a critique-revision loop and a new probability-clamping mechanism inside RLAIF. No equations, fitted parameters, or first-principles derivations are presented that reduce by construction to the inputs; the central claims rest on the empirical performance of the described pipeline rather than on any self-definitional or self-citation load-bearing step. The method is presented as a self-contained engineering framework whose validity is to be assessed by external experiments, not by internal re-derivation of its own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, no explicit free parameters, axioms, or invented entities are detailed; the approach relies on inverting existing constitutional AI concepts and standard RLAIF techniques without introducing new postulated entities.

pith-pipeline@v0.9.0 · 5474 in / 1174 out tokens · 36255 ms · 2026-05-10T04:32:13.710076+00:00 · methodology

