The Realignment Problem: When Right becomes Wrong in LLMs

Aakash Sen Sharma; Debdeep Sanyal; Manodeep Ray; Murari Mandal; Shirish Karande; Vivek Srivastava

arxiv: 2511.02623 · v2 · submitted 2025-11-04 · 💻 cs.CL

Pith reviewed 2026-05-18 01:24 UTC · model grok-4.3

keywords: LLM alignment · realignment · preference optimization · proxy judge · alignment conflicts · policy updates · bi-level optimization

The pith

TRACE turns changing LLM policies into an optimization problem over existing preference data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When safety or value policies shift, deployed language models drift from the new targets and create an alignment-reality gap that is costly to close with fresh human labels. TRACE treats this drift as a structured optimization task that works on data already collected. A stronger model first labels each existing preference pair as needing inversion, suppression, or retention. It then ranks the pairs by their expected impact on the new policy and applies a hybrid loss that inverts some preferences while suppressing others. Experiments across three model families show the method restores alignment on both synthetic tests and the PKU-SafeRLHF set while leaving general capabilities largely unchanged.

Core claim

TRACE is a three-stage pipeline that first triages preference pairs according to alignment conflicts with a target policy, then ranks samples by alignment impact through bi-level optimization, and finally updates the model with a hybrid objective that combines relational losses for inversion and punitive losses for suppression, achieving realignment on synthetic benchmarks and the PKU-SafeRLHF dataset without measurable loss in general utility.

What carries the argument

TRACE's three-stage pipeline that triages preference pairs into inversion, suppression or retention, scores them via bi-level optimization, and applies a hybrid relational-punitive loss.
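
The data flow of the three stages can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `PreferencePair`, `run_pipeline`, `build_update_batches`, and the `budget` cutoff are all hypothetical, the judge and the impact function are passed in as plain callables, and the page does not say which response of a suppressed pair is penalized, so the sketch leaves that choice to the loss.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

INVERT, SUPPRESS, RETAIN = "invert", "suppress", "retain"


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # response preferred under the old policy
    rejected: str  # response dispreferred under the old policy


@dataclass
class TriagedPair:
    pair: PreferencePair
    label: str           # one of INVERT, SUPPRESS, RETAIN
    impact: float = 0.0  # alignment-impact score (stage 2)


def run_pipeline(
    pairs: List[PreferencePair],
    judge: Callable[[PreferencePair], str],
    impact_fn: Callable[[PreferencePair], float],
    budget: int,
) -> List[TriagedPair]:
    # Stage 1: the proxy judge labels every existing pair.
    triaged = [TriagedPair(p, judge(p)) for p in pairs]
    # Retained pairs need no update, so only conflicted pairs are scored.
    conflicted = [t for t in triaged if t.label != RETAIN]
    # Stage 2: score the conflicted pairs and keep the highest-impact ones.
    for t in conflicted:
        t.impact = impact_fn(t.pair)
    conflicted.sort(key=lambda t: t.impact, reverse=True)
    return conflicted[:budget]


def build_update_batches(
    selected: List[TriagedPair],
) -> Tuple[List[Tuple[str, str, str]], List[PreferencePair]]:
    # Stage 3 routing: inverted pairs swap chosen/rejected for the
    # relational loss; suppressed pairs go to the punitive loss as-is.
    inversions = [
        (t.pair.prompt, t.pair.rejected, t.pair.chosen)
        for t in selected if t.label == INVERT
    ]
    suppressions = [t.pair for t in selected if t.label == SUPPRESS]
    return inversions, suppressions
```

In this sketch the judge is the only component that sees the new policy, which is why the reliability of its labels matters so much for everything downstream.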

If this is right

Realignment can be performed repeatedly as policies evolve without repeating large annotation campaigns.
Only high-impact preference pairs need updating, leaving most existing data untouched.
The same existing dataset can support multiple successive policy changes.
General capabilities remain stable because updates are localized to conflicted pairs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Production alignment pipelines could schedule periodic TRACE runs triggered by policy updates rather than by performance drops.
The approach opens a path to auditing alignment drift at the level of individual training examples.
If proxy-judge error is low, realignment cost could drop by orders of magnitude compared with full re-annotation.

Load-bearing premise

A stronger model can serve as a reliable proxy judge that accurately identifies alignment conflicts and reflects the target new policy without introducing its own biases or systematic errors.

What would settle it

A side-by-side human evaluation on a held-out set of policy-conflict cases where the realigned model outputs are compared against both the old and new policies to measure whether the intended shifts occurred.

Figures

Figures reproduced from arXiv: 2511.02623 by Aakash Sen Sharma, Debdeep Sanyal, Manodeep Ray, Murari Mandal, Shirish Karande, Vivek Srivastava.

**Figure 2.** Old policy π_old for SynthValueBench, which contains 4 value dimensions to enable easy study of value transformations in a constraint space.

**Figure 3.** New policy π_new for SynthValueBench, which contains trivial and non-trivial value dimension shifts for holistic alignment evaluation.

**Figure 4.** Value principles of PKU-SafeRLHF dataset (…

**Figure 5.** Complex non-trivial shifts and transformations created on …

**Figure 6.** Sample response from TRACE under new policy

**Figure 7.** Sample response from TRACE under new policy

**Figure 8.** Sample response from TRACE under new policy

**Figure 9.** Sample response from TRACE under new policy

**Figure 10.** Sample response from TRACE under new policy

Original abstract

Post-training alignment of large language models (LLMs) relies on large-scale human annotations guided by policy specifications that change over time. Cultural shifts, value reinterpretations, and regulatory or industrial updates make static alignment increasingly brittle. As policies evolve, deployed models can diverge from current alignment objectives, creating an Alignment-Reality Gap that is difficult to audit or correct. Existing remediation typically requires re-annotation under revised guidelines, which introduces systematic challenges, including guideline ambiguity, annotator interpretation drift, and reduced consistency at scale. We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework that transforms realignment into a structured optimization problem over existing data without requiring fresh human annotation. Leveraging a stronger model as a proxy judge, TRACE operates via a three-stage pipeline: (1) triaging preference pairs into inversion, suppression, or retention categories based on alignment conflicts; (2) computing an alignment impact score via bi-level optimization to prioritize high-leverage samples; and (3) executing updates using a hybrid objective that combines relational losses (e.g., IPO) for preference inversion and punitive losses (e.g., NPO) for response suppression. Experiments on Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B demonstrate robust realignment on synthetic benchmarks and the PKU-SafeRLHF dataset without degrading general utility. This work provides a scalable approach for LLM realignment under evolving data annotation policies and alignment guidelines. We release our code: https://respailab.github.io/TRACE/
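
To make the hybrid objective concrete, the sketch below evaluates the two loss families the abstract names, using the standard published forms of IPO and NPO on scalar sequence log-probabilities. How TRACE weights and combines the two terms is not stated on this page, so the mixing weight `lam`, the per-term averaging, and the function name `hybrid_loss` are assumptions made only for illustration.

```python
import math
from typing import List, Tuple


def ipo_loss(logp_w: float, logp_l: float,
             ref_logp_w: float, ref_logp_l: float, tau: float = 0.1) -> float:
    # IPO: squared distance between the policy-vs-reference log-ratio
    # margin and the target margin 1 / (2 * tau).
    h = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return (h - 1.0 / (2.0 * tau)) ** 2


def npo_loss(logp: float, ref_logp: float, beta: float = 0.1) -> float:
    # NPO: (2 / beta) * log(1 + (pi / pi_ref) ** beta), written as a
    # numerically stable softplus of beta * (logp - ref_logp).
    z = beta * (logp - ref_logp)
    softplus = max(z, 0.0) + math.log1p(math.exp(-abs(z)))
    return (2.0 / beta) * softplus


def hybrid_loss(
    inversions: List[Tuple[float, float, float, float]],
    suppressions: List[Tuple[float, float]],
    lam: float = 1.0,   # assumed mixing weight, not taken from the paper
    tau: float = 0.1,
    beta: float = 0.1,
) -> float:
    # Relational term: IPO over pairs whose preference was flipped, with
    # (logp_w, logp_l, ref_logp_w, ref_logp_l) given in the NEW order.
    rel = sum(ipo_loss(*x, tau=tau) for x in inversions) / max(len(inversions), 1)
    # Punitive term: NPO over responses that should be suppressed,
    # given as (logp, ref_logp).
    pun = sum(npo_loss(*x, beta=beta) for x in suppressions) / max(len(suppressions), 1)
    return rel + lam * pun
```

The IPO term is zero exactly when the log-ratio margin equals 1 / (2 * tau), while the NPO term shrinks toward zero as the policy assigns the suppressed response less probability than the reference does.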

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

TRACE gives a practical pipeline for realigning LLMs on old data with a proxy judge for triage and hybrid losses, but the whole thing rests on unverified accuracy of that proxy classification.

The main thing here is TRACE, a method that sorts existing preference pairs into inversion, suppression, or retention using a stronger model as a proxy judge, then scores each pair's impact via bi-level optimization and updates the model with a mix of relational and punitive losses. They test it on Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B with synthetic data plus PKU-SafeRLHF, and report that it achieves realignment without hurting general capabilities or needing fresh annotations.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces TRACE, a framework to realign LLMs to evolving policies using only existing preference data. It employs a stronger model as proxy judge to triage pairs into inversion/suppression/retention categories, computes alignment-impact scores via bi-level optimization, and applies a hybrid objective combining relational losses (e.g., IPO) for inversions and punitive losses (e.g., NPO) for suppressions. Experiments on Qwen2.5-7B, Gemma-2-9B, and Llama-3.1-8B report successful realignment on synthetic benchmarks and PKU-SafeRLHF while preserving general utility; code is released.

Significance. If the central claims hold, TRACE offers a practical, annotation-free route to correcting alignment drift under policy shifts, which is a recurring operational problem. The structured use of bi-level optimization to rank samples and the hybrid loss design are technically interesting contributions. Releasing code supports reproducibility. Significance is limited by the absence of direct validation that the proxy triage matches the intended new policy rather than inheriting the stronger model's own biases.

major comments (1)

[Abstract / §3] Triage stage: The central claim that realignment occurs 'without requiring fresh human annotation' rests on the stronger model correctly classifying alignment conflicts. No quantitative validation of triage accuracy (e.g., agreement with human experts on a held-out set of pairs or accuracy against a ground-truth policy shift) is reported. Without this check, it is unclear whether the subsequent impact scoring and hybrid losses target the desired policy or merely the proxy's systematic errors.

minor comments (2)

[Experiments] The abstract states positive results on multiple models and datasets but omits explicit baselines, statistical significance tests, and any failure-case analysis. Adding these details would strengthen the empirical claims.
[§3.2] Notation: The bi-level optimization for alignment-impact scores is described at a high level; providing the explicit objective (even if in an appendix) would improve clarity and allow readers to assess the parameter-free claim.
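
Since the explicit objective is not reproduced here, the following is only a generic illustration of what a bi-level impact score can look like, not the paper's formulation. It uses a common one-step lookahead approximation: take a single gradient step on one candidate sample (inner problem), then measure how much the loss on a small target-policy set drops (outer problem). The toy logistic model, the function names, and the learning rate `lr` are all assumptions.

```python
import math
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (features, label in {0, 1})


def _sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def _predict(theta: Sequence[float], x: Sequence[float]) -> float:
    return _sigmoid(sum(t * xi for t, xi in zip(theta, x)))


def loss(theta: Sequence[float], ex: Example) -> float:
    # Binary cross-entropy of a logistic model, clipped for stability.
    x, y = ex
    p = min(max(_predict(theta, x), 1e-12), 1.0 - 1e-12)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def grad(theta: Sequence[float], ex: Example) -> List[float]:
    # Gradient of the cross-entropy loss with respect to theta.
    x, y = ex
    p = _predict(theta, x)
    return [(p - y) * xi for xi in x]


def mean_loss(theta: Sequence[float], data: Sequence[Example]) -> float:
    return sum(loss(theta, ex) for ex in data) / len(data)


def impact_scores(theta: Sequence[float], candidates: Sequence[Example],
                  target_set: Sequence[Example], lr: float = 0.5) -> List[float]:
    # Score = drop in target-set loss after one gradient step on the sample.
    base = mean_loss(theta, target_set)
    scores = []
    for ex in candidates:
        stepped = [t - lr * g for t, g in zip(theta, grad(theta, ex))]
        scores.append(base - mean_loss(stepped, target_set))
    return scores
```

A positive score means that training on the sample moves the model toward the target set, a negative score means it moves away, and a score near zero means the sample is irrelevant to the target.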

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive comments on our work. We provide a point-by-point response to the major comment below and outline the revisions we will make to address the concerns raised.

Referee: [Abstract / §3] Triage stage: The central claim that realignment occurs 'without requiring fresh human annotation' rests on the stronger model correctly classifying alignment conflicts. No quantitative validation of triage accuracy (e.g., agreement with human experts on a held-out set of pairs or accuracy against a ground-truth policy shift) is reported. Without this check, it is unclear whether the subsequent impact scoring and hybrid losses target the desired policy or merely the proxy's systematic errors.

Authors: We agree that validating the proxy triage is crucial for substantiating the central claim. Although the manuscript does not include explicit quantitative metrics for triage accuracy, the synthetic benchmarks are designed with known ground-truth policy shifts, which enable direct measurement of how well the stronger model's classifications match the intended categories. We will revise §3 and the experimental section to include these triage accuracy results (e.g., precision/recall per category against ground truth) to show that the proxy does not merely propagate its biases but aligns with the target policy. For the PKU-SafeRLHF experiments, we will discuss the downstream performance as supporting evidence while noting the limitations of proxy validation without new annotations.

Revision: yes
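
The per-category check promised in the rebuttal is simple to compute once ground-truth triage labels exist. The sketch below shows one way to do it; the label strings and the function name `per_category_precision_recall` are hypothetical, and a category with no predictions or no gold examples returns `None` rather than a made-up value.

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple

LABELS = ("invert", "suppress", "retain")


def per_category_precision_recall(
    gold: Sequence[str],
    predicted: Sequence[str],
    labels: Sequence[str] = LABELS,
) -> Dict[str, Tuple[Optional[float], Optional[float]]]:
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    gold_count = Counter(gold)
    pred_count = Counter(predicted)
    # True positives: pairs where the proxy judge agrees with ground truth.
    true_pos = Counter(g for g, p in zip(gold, predicted) if g == p)
    report = {}
    for label in labels:
        precision = true_pos[label] / pred_count[label] if pred_count[label] else None
        recall = true_pos[label] / gold_count[label] if gold_count[label] else None
        report[label] = (precision, recall)
    return report
```

Reporting the two numbers per category matters because the error types have different costs: a low-precision inversion label flips preferences that should have been kept, while low recall leaves conflicts with the new policy uncorrected.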

Circularity Check

0 steps flagged

No circularity: derivation uses external proxy and existing data without self-referential reduction

full rationale

The TRACE pipeline triages existing preference pairs via an external stronger model acting as proxy judge, computes alignment impact scores through bi-level optimization over those pairs, and applies hybrid relational/punitive losses. No quoted equations or steps reduce by construction to the method's own inputs; the triage classification, impact scoring, and update objective each depend on independent external components (stronger model judgments and public datasets such as PKU-SafeRLHF) rather than self-definition, fitted parameters renamed as predictions, or load-bearing self-citations. The central claim of realignment without fresh annotation is therefore supported by external elements and remains self-contained against benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the proxy model accurately capturing new alignment objectives and on the optimization correctly identifying high-impact updates; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)

Domain assumption: A stronger model serves as an accurate proxy for the new alignment objectives.
Central to the triaging and impact scoring stages described in the abstract.

pith-pipeline@v0.9.0 · 5831 in / 1160 out tokens · 33219 ms · 2026-05-18T01:24:28.909408+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: TRACE operates via a three-stage pipeline: (1) triaging preference pairs into inversion, suppression, or retention categories based on alignment conflicts; (2) computing an alignment impact score via bi-level optimization...

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We introduce TRACE (Triage and Re-align by Alignment Conflict Evaluation), a framework that transforms realignment into a structured optimization problem over existing data without requiring fresh human annotation.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 4 internal anchors

[2] Mohammad Gheshlaghi Azar, Zhaohan Daniel Guo, Bilal Piot, Remi Munos, Mark Rowland, Michal Valko, and Daniele Calandriello. A general theoretical paradigm to understand learning from human preferences. In International Conference on Artificial Intelligence and Statistics, pp. 4447-4455. PMLR, 2024.

[3] Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862, 2022.

[4] Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models. arXiv preprint arXiv:2507.21509, 2025.

[5] Yu Ying Chiu, Liwei Jiang, and Yejin Choi. Dailydilemmas: Revealing value preferences of llms with quandaries of daily life. In The Thirteenth International Conference on Learning Representations.

[6] Ronen Eldan and Mark Russinovich. Who's harry potter? approximate unlearning in llms. arXiv preprint arXiv:2310.02238, 2023.

[7] Xiaohua Feng, Yuyuan Li, Huwei Ji, Jiaming Zhang, Li Zhang, Tianyu Du, and Chaochao Chen. Bridging the gap between preference alignment and machine unlearning. arXiv preprint arXiv:2504.06659, 2025.

[8] Arduin Findeis, Timo Kaufmann, Eyke Hüllermeier, Samuel Albanie, and Robert D Mullins. Inverse constitutional ai: Compressing preferences into principles. In The Thirteenth International Conference on Learning Representations.

[9] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, Amy Yang, Angela Fan, Anirudh Goyal, Anthony Hartshorn, Aobo Yang, Archi Mitra, Archie Sravankumar, Artem Korenev, Arthur Hinsvark, Arun Rao, Aston Zhang, Aurelien Rodriguez, Austen Gregerson, Ava S...

[10] Jiaming Ji, Donghai Hong, Borong Zhang, Boyuan Chen, Juntao Dai, Boren Zheng, Tianyi Qiu, Jiayi Zhou, Kaile Wang, Boxuan Li, et al. Pku-saferlhf: Towards multi-level safety alignment for llms with human preference. arXiv preprint arXiv:2406.15513, 2024.

[11] Junsol Kim, James Evans, and Aaron Schein. Linear representations of political perspective emerge in large language models. In The Thirteenth International Conference on Learning Representations.

[12] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730-27744, 2022.

[13] Martin Pawelczyk, Seth Neel, and Himabindu Lakkaraju. In-context unlearning: Language models as few-shot unlearners. In International Conference on Machine Learning, pp. 40034-40050. PMLR, 2024.

[14] Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728-53741, 2023.

[15] Yuanyi Ren, Haoran Ye, Hanjun Fang, Xin Zhang, and Guojie Song. Valuebench: Towards comprehensively evaluating value orientations and understanding of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2015-2040, 2024.

[16] Debdeep Sanyal and Murari Mandal. Alu: Agentic llm unlearning. arXiv preprint arXiv:2502.00406, 2025.

[17] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

[18] Yash Sinha, Murari Mandal, and Mohan Kankanhalli. Unstar: Unlearning with self-taught anti-sample reasoning for llms. Transactions on Machine Learning Research.

[19] Gemma Team. Gemma. 2024a. doi:10.34740/KAGGLE/M/3301. URL https://www.kaggle.com/m/3301

[20] Qwen Team. Qwen2.5: A party of foundation models, September 2024b. URL https://qwenlm.github.io/blog/qwen2.5/

[21] Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, and Sumitra Ganesh. Genarm: Reward guided generation with autoregressive reward model for test-time alignment. In The Thirteenth International Conference on Learning Representations.

[22] Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. Advances in Neural Information Processing Systems, 37:105425-105475, 2024.

[23] Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling.

[24] Wenlong Zhao, Debanjan Mondal, Niket Tandon, Danica Dillion, Kurt Gray, and Yuling Gu. Worldvaluesbench: A large-scale benchmark dataset for multi-cultural value awareness of language models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pp. 17696-17706, 2024.