Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning

Dongrui Liu; Guanxu Chen; Jing Shao; Qihao Lin

arxiv: 2605.21422 · v1 · pith:XWBVRIUVnew · submitted 2026-05-20 · 💻 cs.LG

Preference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning

Qihao Lin , Guanxu Chen , Dongrui Liu , Jing Shao This is my paper

Pith reviewed 2026-05-21 05:02 UTC · model grok-4.3

classification 💻 cs.LG

keywords data selectionfine-tuninginfluence functionspreference weightinglarge language modelsefficient trainingtarget behavior

0 comments

The pith

Weighting target examples by the current model's preferences yields a more effective first-order direction for data selection in LLM fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes PRISM, a data selection approach that weights target examples according to how closely they match the model's existing behavior instead of treating all targets as equal. This creates a preference-aware representation used to score and prioritize training samples for fine-tuning. A sympathetic reader cares because scaling models makes limited training budgets a bottleneck, and better targeting of data could reduce waste. Theoretical analysis claims the weighting improves the update direction toward the desired behavior. Experiments across models show gains in both general efficient fine-tuning and safety repairs.

Core claim

PRISM constructs a preference-aware target representation by weighting target examples according to the current model's preference. It then scores candidate training samples by their alignment with this representation, concentrating the data budget on samples more likely to move the model toward the target behavior. Theoretical analysis shows that this preference weighting yields a more effective first-order direction for increasing target-behavior preference.

What carries the argument

The preference-aware target representation, formed by weighting target examples using the current model's preference and influence functions, which guides scoring of candidate samples for selection.

If this is right

PRISM improves both efficient fine-tuning and safety-oriented SFT repair across model families and scales.
Concentrating the limited data budget on samples aligned with the preference-aware representation produces better target behavior outcomes.
Precise target-behavior characterization through preference weighting is key to budget-efficient data selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method might reduce the number of target examples needed by prioritizing the most relevant ones for a given model state.
It could combine with other selection criteria like diversity or difficulty to further optimize training efficiency.
Similar preference weighting might apply to data selection in reinforcement learning or continual learning settings.

Load-bearing premise

The current model's preference can be accurately and stably measured to weight target examples in a way that produces a genuinely more effective update direction without introducing offsetting computational costs or selection biases.

What would settle it

An ablation experiment comparing model performance after fine-tuning on data selected with versus without the preference weighting, checking whether the weighted version consistently fails to show better progress toward the target behavior.

Figures

Figures reproduced from arXiv: 2605.21422 by Dongrui Liu, Guanxu Chen, Jing Shao, Qihao Lin.

**Figure 3.** Figure 3: Component ablations on Qwen-3-14B. Left: [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

As LLMs continue to scale, improving training efficiency increasingly depends on using data more effectively. Data selection addresses this problem by allocating a limited training budget to samples that best promote a target behavior. Existing methods usually represent the target behavior with a set of target examples, but often treat these examples as equally important. This can be inefficient because target examples may differ in their relevance to the current model: examples closer to the model's current behavior provide more actionable guidance than those farther away. We propose PRISM (PReference-aware Influence-function-based Data Selection Method for Efficient Fine-Tuning), which uses the current model's preference to weight target examples and construct a preference-aware target representation. PRISM then scores candidate training samples by their alignment with this representation, concentrating the data budget on samples more likely to move the model toward the target behavior. Theoretical analysis shows that this preference weighting yields a more effective first-order direction for increasing target-behavior preference. Experiments across model families and scales show that PRISM improves both efficient fine-tuning and safety-oriented SFT repair, demonstrating that precise target-behavior characterization is key to budget-efficient data selection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PRISM layers preference weighting onto influence functions for data selection, but the theory hinges on untested stability of that weighting under approximation.

read the letter

The main point is that this paper adds a preference-based weighting step to influence-function data selection so that target examples closer to the current model get more influence when picking training data. That is the concrete addition over treating every target the same. The experiments report gains on both standard fine-tuning efficiency and safety repair tasks across a few model families and scales, which at least shows the method is runnable and produces measurable differences in the chosen subsets. Those results are the part worth looking at first. The theoretical claim that the weighting produces a stronger first-order direction is asserted but rests on the influence approximation remaining reliable once the weights are applied. If the preference signal itself shifts sharply with small parameter changes, or if the model starts far from the target behavior, the linear approximation can mis-rank candidates. The abstract gives no equations or stability checks, so it is not clear how much of the reported improvement comes from the weighting versus from other implementation choices. The citation pattern looks standard for the influence-function literature and does not appear to hide prior work. This is the kind of incremental method paper that people working on data-efficient LLM adaptation would read. It is not foundational, but the experiments give it enough substance that a referee could usefully check the stability assumptions and the exact datasets. I would send it to review rather than desk-reject, with the main request being clearer validation that the preference weighting actually improves the direction rather than just re-ranking in a correlated way.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes PRISM, a preference-aware influence-function-based data selection method for efficient fine-tuning of LLMs. It argues that weighting target examples according to the current model's preference produces a more effective first-order direction for aligning with target behaviors than uniform treatment of targets. The approach scores candidate samples by alignment with this weighted representation and allocates limited training budgets accordingly. Theoretical analysis is claimed to establish the superiority of the preference-weighted direction, with experiments showing gains in general efficient fine-tuning and safety-oriented SFT repair across model families and scales.

Significance. If the central theoretical claim holds and the influence-function approximations remain accurate under preference weighting, the work could meaningfully advance data-efficient fine-tuning by moving beyond uniform target representations. This would be particularly relevant for safety alignments and low-budget regimes. The explicit use of model-state-dependent weighting combined with influence functions offers a concrete mechanism that, if validated, could be adopted in practice; the experiments across scales provide initial evidence of practical utility.

major comments (2)

[§4] §4 (Theoretical Analysis): The claim that preference weighting yields a more effective first-order direction for increasing target-behavior preference rests on the stability of the influence-function approximation when the weighting is applied. The manuscript provides no explicit bound or verification showing that the linear approximation remains accurate when the current model is far from the target behavior or when small perturbations induce preference flips, which directly undermines the load-bearing assertion that the weighted direction is superior to uniform weighting.
[§5] §5 (Experiments): The reported improvements in safety-oriented SFT and efficient fine-tuning lack sufficient controls for whether gains arise from the preference weighting itself versus other implementation choices (e.g., exact influence-function estimator or selection threshold). Without ablation isolating the weighting step and reporting variance across multiple runs or dataset splits, the experimental support for the central claim remains inconclusive.

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief equation or proof sketch summarizing the first-order direction improvement to make the theoretical contribution more accessible.
[§3] Notation for the preference weighting function and the influence-function scoring should be introduced with explicit definitions early in the method section to avoid ambiguity when comparing to prior influence-based selection work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. The feedback highlights important aspects of our theoretical analysis and experimental validation that we will address in the revision. Below we respond point by point to the major comments.

read point-by-point responses

Referee: [§4] §4 (Theoretical Analysis): The claim that preference weighting yields a more effective first-order direction for increasing target-behavior preference rests on the stability of the influence-function approximation when the weighting is applied. The manuscript provides no explicit bound or verification showing that the linear approximation remains accurate when the current model is far from the target behavior or when small perturbations induce preference flips, which directly undermines the load-bearing assertion that the weighted direction is superior to uniform weighting.

Authors: We appreciate the referee drawing attention to the assumptions underlying the theoretical claim. Section 4 derives that the preference-weighted target representation produces a first-order direction with higher expected alignment to the target behavior by weighting examples according to the model's current preference scores; this follows directly from the influence-function gradient under the standard local-linearity assumption. We acknowledge that the manuscript does not supply explicit error bounds for regimes far from the target or under preference flips. In the revised manuscript we will add a dedicated paragraph in §4 that (i) states the local-linearity assumption explicitly, (ii) discusses the conditions under which the approximation is expected to degrade, and (iii) reports a simple empirical check (correlation between influence scores and actual loss reduction on held-out targets) across varying distances from the target. This addition clarifies the scope of the theoretical result without altering the existing derivation. revision: partial
Referee: [§5] §5 (Experiments): The reported improvements in safety-oriented SFT and efficient fine-tuning lack sufficient controls for whether gains arise from the preference weighting itself versus other implementation choices (e.g., exact influence-function estimator or selection threshold). Without ablation isolating the weighting step and reporting variance across multiple runs or dataset splits, the experimental support for the central claim remains inconclusive.

Authors: We agree that isolating the contribution of preference weighting and reporting statistical variability would strengthen the experimental section. The current experiments already include a uniform-target baseline that uses the identical influence-function estimator and selection procedure, thereby controlling for estimator choice and threshold. Nevertheless, we did not report standard deviations or perform additional splits. In the revised version we will (i) add an explicit ablation table that compares PRISM directly against its unweighted counterpart on the same estimator and threshold, (ii) report mean and standard deviation over five random seeds for all main results, and (iii) include results on two additional random train/validation splits for the safety-repair tasks. These changes will make the source of the observed gains clearer. revision: yes

Circularity Check

0 steps flagged

No significant circularity; theoretical claim presented as independent analysis

full rationale

The abstract describes PRISM as weighting target examples by the current model's preference to form a representation, then scoring candidates by alignment, with a theoretical analysis claiming this produces a more effective first-order direction. No equations, self-citations, or derivations are visible that reduce the claimed improvement to a definitional equivalence, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. The preference weighting is an explicit modeling choice applied to standard influence-function machinery, and the result is framed as an analysis outcome rather than tautological by construction. The derivation chain therefore remains self-contained against external benchmarks such as influence functions and preference measurement.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review limits identification; relies on standard influence-function approximation assumptions common in data selection literature.

axioms (1)

domain assumption Influence functions provide a reliable first-order approximation of how individual training samples affect model parameters toward a target behavior.
Implicit foundation for scoring candidate samples by alignment with the preference-weighted target.

pith-pipeline@v0.9.0 · 5726 in / 1146 out tokens · 38603 ms · 2026-05-21T05:02:22.899899+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Theoretical analysis shows that this preference weighting yields a more effective first-order direction for increasing target-behavior preference.
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

gKL = 1/|QΔ| Σ π_q (g(q,yp_q) - g(q,yn_q))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors

[1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[2]

Publications Manual , year = "1983", publisher =

work page 1983
[3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page
[5]

Dan Gusfield , title =. 1997

work page 1997
[6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page
[8]

2026 , eprint =

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution , author =. 2026 , eprint =. doi:10.48550/arXiv.2602.14869 , url =

work page doi:10.48550/arxiv.2602.14869 2026
[9]

arXiv preprint arXiv:2506.19823 , year =

Persona Features Control Emergent Misalignment , author =. 2025 , eprint =. doi:10.48550/arXiv.2506.19823 , url =

work page doi:10.48550/arxiv.2506.19823 2025
[10]

doi:10.48550/arXiv.2506.01790 , url =

Coalson, Zachary and Bae, Juhan and Carlini, Nicholas and Hong, Sanghyun , year =. doi:10.48550/arXiv.2506.01790 , url =. 2506.01790 , archivePrefix =

work page doi:10.48550/arxiv.2506.01790
[11]

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Afonin, Nikita and Andriianov, Nikita and Hovhannisyan, Vahagn and Bageshpura, Nikhil and Liu, Kyle and Zhu, Kevin and Dev, Sunishchal and Panda, Ashwinee and Rogov, Oleg and Tutubalina, Elena and Panchenko, Alexander and Seleznyov, Mikhail , year =. Emergent Misalignment via In-Context Learning: Narrow In-Context Examples Can Produce Broadly Misaligned. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.11288
[12]

doi:10.48550/arXiv.2510.08211 , url =

Hu, Xuhao and Wang, Peng and Lu, Xiaoya and Liu, Dongrui and Huang, Xuanjing and Shao, Jing , year =. doi:10.48550/arXiv.2510.08211 , url =. 2510.08211 , archivePrefix =

work page doi:10.48550/arxiv.2510.08211
[13]

Proceedings of the National Academy of Sciences of the United States of America , volume =

On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I , author =. Proceedings of the National Academy of Sciences of the United States of America , volume =. 1949 , doi =

work page 1949
[14]

Advances in Neural Information Processing Systems , volume =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

work page 2020
[15]

On the Opportunities and Risks of Foundation Models

On the Opportunities and Risks of Foundation Models , author =. 2021 , eprint =. doi:10.48550/arXiv.2108.07258 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07258 2021
[16]

Training language models to follow instructions with human feedback

Training Language Models to Follow Instructions with Human Feedback , author =. 2022 , eprint =. doi:10.48550/arXiv.2203.02155 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022
[17]

Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs , May 2025

Betley, Jan and Tan, Daniel and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Mart. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned. 2025 , eprint =. doi:10.48550/arXiv.2502.17424 , url =

work page doi:10.48550/arxiv.2502.17424 2025
[18]

Proceedings of the 34th International Conference on Machine Learning , pages =

Understanding Black-box Predictions via Influence Functions , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , url =

work page 2017
[19]

Advances in Neural Information Processing Systems , volume =

Estimating Training Data Influence by Tracing Gradient Descent , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

work page 2020
[20]

Advances in Neural Information Processing Systems , volume =

Representer Point Selection for Explaining Deep Neural Networks , author =. Advances in Neural Information Processing Systems , volume =. 2018 , url =

work page 2018
[21]

2024 , url =

Kwon, Yongchan and Wu, Eric and Wu, Kevin and Zou, James , booktitle =. 2024 , url =

work page 2024
[22]

2024 , url =

Xia, Mengzhou and Malladi, Sadhika and Gururangan, Suchin and Arora, Sanjeev and Chen, Danqi , booktitle =. 2024 , url =

work page 2024
[23]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Persona Vectors: Monitoring and Controlling Character Traits in Language Models , author =. 2025 , eprint =. doi:10.48550/arXiv.2507.21509 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.21509 2025
[24]

2025 , howpublished =

Toward Understanding and Preventing Misalignment Generalization , author =. 2025 , howpublished =

work page 2025
[25]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =. 2022 , url =

work page 2022
[26]

2024 , url =

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle =. 2024 , url =

work page 2024
[27]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , year =. 2307.09288 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv
[28]

and Zoph, Barret and Wei, Jason and Roberts, Adam , booktitle =

Longpre, Shayne and Hou, Le and Vu, Tu and Webson, Albert and Chung, Hyung Won and Tay, Yi and Zhou, Denny and Le, Quoc V. and Zoph, Barret and Wei, Jason and Roberts, Adam , booktitle =. The. 2023 , url =

work page 2023
[29]

Advances in Neural Information Processing Systems , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , year =

work page
[30]

Conover, Mike and Hayes, Matt and Mathur, Ankit and Meng, Xiangrui and Xie, Jianwei and Wan, Jun and Shah, Sam and Ghodsi, Ali and Wendell, Patrick and Zaharia, Matei and others , year =. Free

work page
[31]

Advances in Neural Information Processing Systems , year =

K. Advances in Neural Information Processing Systems , year =

work page
[32]

International Conference on Learning Representations , year =

Measuring Massive Multitask Language Understanding , author =. International Conference on Learning Representations , year =

work page
[33]

and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria , journal =

Clark, Jonathan H. and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria , journal =. 2020 , url =

work page 2020
[34]

Challenging

Suzgun, Mirac and Scales, Nathan and Sch. Challenging. Findings of the Association for Computational Linguistics: ACL 2023 , year =

work page 2023
[35]

Measuring Mathematical Problem Solving With the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving With the. 2021 , url =

work page 2021
[36]

International Conference on Learning Representations , year =

Let's Verify Step by Step , author =. International Conference on Learning Representations , year =

work page

[1] [1]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972

[2] [2]

Publications Manual , year = "1983", publisher =

work page 1983

[3] [3]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981

[4] [4]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page

[5] [5]

Dan Gusfield , title =. 1997

work page 1997

[6] [6]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015

[7] [7]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page

[8] [8]

2026 , eprint =

Concept Influence: Leveraging Interpretability to Improve Performance and Efficiency in Training Data Attribution , author =. 2026 , eprint =. doi:10.48550/arXiv.2602.14869 , url =

work page doi:10.48550/arxiv.2602.14869 2026

[9] [9]

arXiv preprint arXiv:2506.19823 , year =

Persona Features Control Emergent Misalignment , author =. 2025 , eprint =. doi:10.48550/arXiv.2506.19823 , url =

work page doi:10.48550/arxiv.2506.19823 2025

[10] [10]

doi:10.48550/arXiv.2506.01790 , url =

Coalson, Zachary and Bae, Juhan and Carlini, Nicholas and Hong, Sanghyun , year =. doi:10.48550/arXiv.2506.01790 , url =. 2506.01790 , archivePrefix =

work page doi:10.48550/arxiv.2506.01790

[11] [11]

Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs

Afonin, Nikita and Andriianov, Nikita and Hovhannisyan, Vahagn and Bageshpura, Nikhil and Liu, Kyle and Zhu, Kevin and Dev, Sunishchal and Panda, Ashwinee and Rogov, Oleg and Tutubalina, Elena and Panchenko, Alexander and Seleznyov, Mikhail , year =. Emergent Misalignment via In-Context Learning: Narrow In-Context Examples Can Produce Broadly Misaligned. ...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2510.11288

[12] [12]

doi:10.48550/arXiv.2510.08211 , url =

Hu, Xuhao and Wang, Peng and Lu, Xiaoya and Liu, Dongrui and Huang, Xuanjing and Shao, Jing , year =. doi:10.48550/arXiv.2510.08211 , url =. 2510.08211 , archivePrefix =

work page doi:10.48550/arxiv.2510.08211

[13] [13]

Proceedings of the National Academy of Sciences of the United States of America , volume =

On a Theorem of Weyl Concerning Eigenvalues of Linear Transformations I , author =. Proceedings of the National Academy of Sciences of the United States of America , volume =. 1949 , doi =

work page 1949

[14] [14]

Advances in Neural Information Processing Systems , volume =

Language Models are Few-Shot Learners , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

work page 2020

[15] [15]

On the Opportunities and Risks of Foundation Models

On the Opportunities and Risks of Foundation Models , author =. 2021 , eprint =. doi:10.48550/arXiv.2108.07258 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2108.07258 2021

[16] [16]

Training language models to follow instructions with human feedback

Training Language Models to Follow Instructions with Human Feedback , author =. 2022 , eprint =. doi:10.48550/arXiv.2203.02155 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2203.02155 2022

[17] [17]

Emergent Misalignment : Narrow finetuning can produce broadly misaligned LLMs , May 2025

Betley, Jan and Tan, Daniel and Warncke, Niels and Sztyber-Betley, Anna and Bao, Xuchan and Soto, Mart. Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned. 2025 , eprint =. doi:10.48550/arXiv.2502.17424 , url =

work page doi:10.48550/arxiv.2502.17424 2025

[18] [18]

Proceedings of the 34th International Conference on Machine Learning , pages =

Understanding Black-box Predictions via Influence Functions , author =. Proceedings of the 34th International Conference on Machine Learning , pages =. 2017 , url =

work page 2017

[19] [19]

Advances in Neural Information Processing Systems , volume =

Estimating Training Data Influence by Tracing Gradient Descent , author =. Advances in Neural Information Processing Systems , volume =. 2020 , url =

work page 2020

[20] [20]

Advances in Neural Information Processing Systems , volume =

Representer Point Selection for Explaining Deep Neural Networks , author =. Advances in Neural Information Processing Systems , volume =. 2018 , url =

work page 2018

[21] [21]

2024 , url =

Kwon, Yongchan and Wu, Eric and Wu, Kevin and Zou, James , booktitle =. 2024 , url =

work page 2024

[22] [22]

2024 , url =

Xia, Mengzhou and Malladi, Sadhika and Gururangan, Suchin and Arora, Sanjeev and Chen, Danqi , booktitle =. 2024 , url =

work page 2024

[23] [23]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Persona Vectors: Monitoring and Controlling Character Traits in Language Models , author =. 2025 , eprint =. doi:10.48550/arXiv.2507.21509 , url =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.21509 2025

[24] [24]

2025 , howpublished =

Toward Understanding and Preventing Misalignment Generalization , author =. 2025 , howpublished =

work page 2025

[25] [25]

and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =

Hu, Edward J. and Shen, Yelong and Wallis, Phillip and Allen-Zhu, Zeyuan and Li, Yuanzhi and Wang, Shean and Wang, Lu and Chen, Weizhu , booktitle =. 2022 , url =

work page 2022

[26] [26]

2024 , url =

Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke , booktitle =. 2024 , url =

work page 2024

[27] [27]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, Hugo and Martin, Louis and Stone, Kevin and Albert, Peter and Almahairi, Amjad and Babaei, Yasmine and Bashlykov, Nikolay and Batra, Soumya and Bhargava, Prajjwal and Bhosale, Shruti and others , year =. 2307.09288 , archivePrefix =

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

and Zoph, Barret and Wei, Jason and Roberts, Adam , booktitle =

Longpre, Shayne and Hou, Le and Vu, Tu and Webson, Albert and Chung, Hyung Won and Tay, Yi and Zhou, Denny and Le, Quoc V. and Zoph, Barret and Wei, Jason and Roberts, Adam , booktitle =. The. 2023 , url =

work page 2023

[29] [29]

Advances in Neural Information Processing Systems , year =

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models , author =. Advances in Neural Information Processing Systems , year =

work page

[30] [30]

Conover, Mike and Hayes, Matt and Mathur, Ankit and Meng, Xiangrui and Xie, Jianwei and Wan, Jun and Shah, Sam and Ghodsi, Ali and Wendell, Patrick and Zaharia, Matei and others , year =. Free

work page

[31] [31]

Advances in Neural Information Processing Systems , year =

K. Advances in Neural Information Processing Systems , year =

work page

[32] [32]

International Conference on Learning Representations , year =

Measuring Massive Multitask Language Understanding , author =. International Conference on Learning Representations , year =

work page

[33] [33]

and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria , journal =

Clark, Jonathan H. and Choi, Eunsol and Collins, Michael and Garrette, Dan and Kwiatkowski, Tom and Nikolaev, Vitaly and Palomaki, Jennimaria , journal =. 2020 , url =

work page 2020

[34] [34]

Challenging

Suzgun, Mirac and Scales, Nathan and Sch. Challenging. Findings of the Association for Computational Linguistics: ACL 2023 , year =

work page 2023

[35] [35]

Measuring Mathematical Problem Solving With the

Hendrycks, Dan and Burns, Collin and Kadavath, Saurav and Arora, Akul and Basart, Steven and Tang, Eric and Song, Dawn and Steinhardt, Jacob , booktitle =. Measuring Mathematical Problem Solving With the. 2021 , url =

work page 2021

[36] [36]

International Conference on Learning Representations , year =

Let's Verify Step by Step , author =. International Conference on Learning Representations , year =

work page