Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

Guoxi Zhang; Hua Cai; Junpeng Zhang; Lei Cheng; Qing Xu; Quanshi Zhang

arxiv: 2605.17967 · v1 · pith:KXREPCC3new · submitted 2026-05-18 · 💻 cs.AI

Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai, Qing Xu, Quanshi Zhang

Pith reviewed 2026-05-20 10:00 UTC · model grok-4.3

classification 💻 cs.AI

keywords: supervised fine-tuning, large language models, token interactions, noise removal, overfitting, early stopping, inference patterns

The pith

Supervised fine-tuning primarily removes noise-like interactions in large language models rather than acquiring new reliable ones, with the beneficial phase being very short.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper aims to explain contradictory observations about supervised fine-tuning on large language models by examining how interactions between tokens evolve during the process. It establishes that SFT quickly eliminates noisy interactions but rarely learns dependable new ones, after which further training leads to overfitting. Readers would care because this accounts for why SFT can sometimes harm performance and suggests better ways to apply it in practice. The findings are validated on multiple models and datasets.

Core claim

We find that SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions.

What carries the argument

The evolution of interactions between words or tokens during supervised fine-tuning, serving as a metric for inference patterns in LLMs.

If this is right

The denoising effect of SFT occurs rapidly and is followed by overfitting if training continues.
Early stopping can be used to maximize the benefits of SFT while avoiding detrimental overfitted interactions.
SFT is effective for LLMs mainly by cleaning up noise rather than by adding new capabilities.
These patterns hold across different LLMs and fine-tuning datasets.
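
The early-stopping implication above can be sketched as a patience rule on a held-out metric. This is a generic illustration under stated assumptions, not the paper's procedure: the function name `early_stop_step`, the `patience` and `min_delta` parameters, and the loss values are all hypothetical.

```python
def early_stop_step(val_losses, patience=2, min_delta=0.0):
    """Return the index of the checkpoint to keep.

    Scans held-out losses in order and stops once `patience` consecutive
    evaluations fail to improve on the best loss by more than `min_delta`.
    """
    best_idx, best_loss, bad = 0, float("inf"), 0
    for i, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_idx, best_loss, bad = i, loss, 0
        else:
            bad += 1
            if bad >= patience:
                break  # later checkpoints are never inspected
    return best_idx


# Hypothetical curve: a brief improvement followed by upward drift.
losses = [1.20, 0.95, 0.97, 1.02, 1.10, 0.90]
kept = early_stop_step(losses)
```

With a short beneficial phase, a small `patience` keeps training close to the denoising window; the trade-off is that a late recovery (the final value in `losses`) is deliberately ignored.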

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interaction tracking could be extended to other fine-tuning techniques to identify optimal stopping points.
This view might reconcile similar inconsistencies seen in other large-scale training methods.
It implies that most reliable inference patterns are set during pre-training, with SFT serving a limited cleanup role.

Load-bearing premise

Interactions between tokens provide a faithful way to measure the inference patterns learned by large language models.
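
To make this premise concrete, here is a minimal sketch of one standard formalization from the interaction-explanation literature: the Harsanyi (AND) interaction, I(S) = Σ_{T⊆S} (−1)^{|S|−|T|} v(T), where v(T) is the model output when only the tokens in T are left unmasked. Whether the paper uses exactly this definition is an assumption, and `toy_v` is a hypothetical stand-in for an LLM's output score.

```python
from itertools import chain, combinations


def subsets(items):
    """All subsets of `items`, as tuples, from the empty set to the full set."""
    items = tuple(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))


def harsanyi_interactions(v, n):
    """Return {S: I(S)} for every subset S of token indices 0..n-1.

    `v` maps a frozenset of unmasked token indices to a scalar score.
    Cost is exponential in n, so this is only usable on tiny inputs.
    """
    effects = {}
    for S in subsets(range(n)):
        S = frozenset(S)
        effects[S] = sum(
            (-1) ** (len(S) - len(T)) * v(frozenset(T)) for T in subsets(S)
        )
    return effects


def toy_v(T):
    # Hypothetical "model": tokens 0 and 1 matter only jointly, token 2 alone.
    return 2.0 * (0 in T and 1 in T) + 0.5 * (2 in T)


effects = harsanyi_interactions(toy_v, n=3)
salient = {tuple(sorted(S)): e for S, e in effects.items() if abs(e) > 1e-9}


def reconstruct(T):
    """Sum the effects of all interactions contained in T."""
    return sum(e for S, e in effects.items() if S <= T)
```

Here `salient` contains only the pair (0, 1) and the singleton (2,), and `reconstruct(T)` recovers `toy_v(T)` for every masked input T. That inversion is a mathematical property of this definition; it is presumably what the figure captions call universal matching.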

What would settle it

Count the noise-like and reliable interactions at successive SFT checkpoints and check whether performance improves only during the short initial phase, then declines as overfitted interactions accumulate.
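
The counting step can be sketched as a set comparison between two checkpoints, using the newly emerged / removed / preserved split named in the figure captions. The salience threshold `tau`, the comparison against the pre-SFT model, and the numeric effects below are assumptions made for illustration only.

```python
def salient_set(effects, tau):
    """Interactions whose absolute effect exceeds the threshold tau."""
    return {S for S, e in effects.items() if abs(e) > tau}


def partition_interactions(base_effects, tuned_effects, tau=0.1):
    """Split interactions into emerged, removed, and preserved groups."""
    base = salient_set(base_effects, tau)
    tuned = salient_set(tuned_effects, tau)
    return {
        "emerged": tuned - base,    # salient only after fine-tuning
        "removed": base - tuned,    # salient only before fine-tuning
        "preserved": base & tuned,  # salient at both checkpoints
    }


# Hypothetical interaction effects keyed by frozensets of token indices.
base = {frozenset({0, 1}): 0.8, frozenset({2}): 0.3, frozenset({1, 3}): -0.4}
tuned = {frozenset({0, 1}): 0.9, frozenset({2}): 0.02, frozenset({0, 3}): 0.5}

parts = partition_interactions(base, tuned)
counts = {name: len(group) for name, group in parts.items()}
```

Repeating this at each saved checkpoint gives the three count curves that the test calls for; judging which interactions are noise-like or reliable needs a separate quality measure that this sketch does not provide.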

Figures

Figures reproduced from arXiv: 2605.17967 by Guoxi Zhang, Hua Cai, Junpeng Zhang, Lei Cheng, Qing Xu, Quanshi Zhang.

**Figure 2.** Evolution of the distribution of newly emerged interactions (…

**Figure 3.** Evolution of the representation quality of newly emerged, removed, and preserved inter…

**Figure 4.** Prediction utility and individual contributions of different types of interactions. We…

**Figure 5.** Empirical verification of universal matching on LLMs. Each row corresponds to a different…

**Figure 6.** Empirical verification of interaction sparsity on LLMs. We aggregate the interactions…

**Figure 7.** Additional results on the evolution of newly emerged, removed, and preserved interactions…

**Figure 8.** Additional results on the representation quality of newly emerged, removed, and preserved…

**Figure 9.** An example of AND-OR logical models constructed to faithfully explain the output scores…

**Figure 9.** Another example of AND-OR logical models explaining the DeepSeek model (top) and the…

Original abstract

This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper frames SFT on LLMs as quick removal of noise-like token interactions followed by overfitting, using interaction metrics to explain inconsistent results and suggest early stopping.

The main point is that SFT on large language models mostly strips out noise-like interactions between tokens in a very short initial window, adds few reliable new ones, and then starts introducing overfitted interactions if training continues. This account is meant to reconcile why SFT helps smaller networks reliably but gives mixed or negative results on LLMs, and it points to early stopping as a practical lever for better outcomes.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that inconsistent effectiveness of supervised fine-tuning (SFT) on LLMs versus small networks can be reconciled by tracking token interactions: SFT briefly removes noise-like interactions without acquiring reliable new ones, after which continued training introduces overfitted interactions; this is validated across multiple LLMs and datasets and yields guidance on early stopping.

Significance. If the interaction metric is shown to faithfully track inference patterns and causally explain SFT outcomes, the work could reconcile contradictory SFT results and supply concrete training heuristics. The approach is novel in applying interaction dynamics to the SFT puzzle, but its significance is limited by the absence of direct links between observed interaction changes and downstream task performance.

major comments (2)

[Abstract and §4 (Experiments)] The claim of validation 'across multiple LLMs and datasets' is stated without reporting controls, baseline comparisons, or the exact procedure for quantifying and classifying interactions as 'noise-like' versus 'overfitted'; this omission makes it impossible to assess whether the denoising-then-overfitting trajectory is robust or merely descriptive of the chosen metric.
[§2 (Interaction Metric) and §3 (Evolution Analysis)] The central explanatory claim requires that changes in the interaction measure directly account for SFT effectiveness, yet no ablation, held-out prediction test, or alignment with known spurious/causal features is reported; without such evidence the narrative risks being a post-hoc description of metric dynamics rather than a causal account.

minor comments (2)

[§2] Define 'noise-like' and 'overfitted' interactions with explicit mathematical criteria or thresholds rather than qualitative description.
[§4] Add a table or figure caption clarifying the precise LLMs, datasets, and interaction-extraction method used in the validation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our work. Below, we provide detailed responses to each major comment and indicate the revisions made to the manuscript.

Referee: [Abstract and §4 (Experiments)] The claim of validation 'across multiple LLMs and datasets' is stated without reporting controls, baseline comparisons, or the exact procedure for quantifying and classifying interactions as 'noise-like' versus 'overfitted'; this omission makes it impossible to assess whether the denoising-then-overfitting trajectory is robust or merely descriptive of the chosen metric.

Authors: We agree that additional details on the experimental procedure are necessary to allow readers to assess robustness. In the revised manuscript, we have expanded §4 with a dedicated subsection describing the exact quantification of interactions (including the mathematical definition and computation steps), the classification criteria for noise-like interactions (those whose removal improves validation performance without harming training) versus overfitted ones (those that boost training but degrade held-out performance), and the specific thresholds applied. We have also added baseline comparisons using randomly permuted token interactions and controls varying random seeds and hyperparameter settings across the reported LLMs and datasets. These revisions should enable a clearer evaluation of whether the observed trajectory is robust.

revision: yes

Referee: [§2 (Interaction Metric) and §3 (Evolution Analysis)] The central explanatory claim requires that changes in the interaction measure directly account for SFT effectiveness, yet no ablation, held-out prediction test, or alignment with known spurious/causal features is reported; without such evidence the narrative risks being a post-hoc description of metric dynamics rather than a causal account.

Authors: We acknowledge that stronger evidence linking interaction changes directly to SFT outcomes would better support the causal narrative. The original §3 presents consistent temporal alignments between interaction evolution and performance shifts, but we agree that ablations and held-out tests were not included. In the revision, we have added a held-out prediction experiment in §3 that uses early interaction changes to forecast later SFT effectiveness and compares predictions against observed results. We have also included a brief alignment analysis with known spurious features in one dataset. Full causal interventions remain challenging due to scale, so we have noted this limitation and suggested it as future work. This constitutes a partial but substantive improvement to the explanatory section.

revision: partial
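
The classification criteria described in the first simulated response can be read as a simple decision rule. The sketch below is one hypothetical reading, not the authors' implementation: `train_gain` and `val_gain` are assumed to be the change in training and held-out score when an interaction is kept rather than ablated, `eps` is an arbitrary tolerance, and the "reliable" and "neutral" labels are added here for completeness.

```python
def label_interaction(train_gain, val_gain, eps=1e-3):
    """Label one interaction from its effect on training and held-out scores.

    Gains are (score with the interaction) - (score with it ablated),
    so a negative val_gain means that removing it helps validation.
    """
    if val_gain < -eps and train_gain <= eps:
        return "noise-like"   # removal helps held-out, does not hurt training
    if val_gain < -eps and train_gain > eps:
        return "overfitted"   # boosts training but degrades held-out
    if val_gain > eps:
        return "reliable"     # assumption: helps held-out performance
    return "neutral"          # assumption: no measurable held-out effect


# Hypothetical (train_gain, val_gain) measurements.
examples = [(0.0, -0.2), (0.3, -0.2), (0.1, 0.2), (0.0, 0.0)]
labels = [label_interaction(t, v) for t, v in examples]
```

Explicit thresholds of this kind are also what the referee's first minor comment asks for.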

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external interaction metric and empirical observations.

The paper treats interactions between tokens as a pre-existing explanatory tool drawn from recent advances in interaction-based explanations, then tracks their evolution empirically across SFT stages on multiple LLMs and datasets. No equation or claim reduces the observed denoising/overfitting pattern to a definition or fit that is constructed from the target SFT-effectiveness conclusion itself. The central narrative is presented as an interpretation of measured changes rather than a self-referential loop, and the validation steps are independent of the interpretive framing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on treating token interactions as a faithful explanatory metric; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs.
Invoked in the abstract as the basis for using interaction evolution to explain SFT effectiveness.

pith-pipeline@v0.9.0 · 5686 in / 1186 out tokens · 36449 ms · 2026-05-20T10:00:03.024393+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs... SFT primarily removes noise-like interactions"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "AND-OR interactions... universal matching property... ratio of uncancelled interaction effects ρ"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 8 internal anchors

[1] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
[2] Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, and Kannan Ramchandran. Proxyspex: Inference-efficient interpretability via sparse feature interactions in llms. arXiv preprint arXiv:2505.17495, 2025.
[3] Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu, Weilin Shen, Zihao Wen, and Tianke Ban. Unilaw-r1: A large language model for legal reasoning with reinforcement learning and iterative inference. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18128–18142, 2025.
[4] Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, and Hua Wu. Ma-rlhf: Reinforcement learning from human feedback with macro actions. arXiv preprint arXiv:2410.02743, 2024.
[5] Lu Chen, Siyu Lou, Benhao Huang, and Quanshi Zhang. Defining and extracting generalizable interaction primitives from DNNs. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=OCqyFVFNeF
[6] Lu Chen, Yuxuan Huang, Yixing Li, Dongrui Liu, Qihan Ren, Kun Kuang, Zilong Zheng, Quanshi Zhang, et al. Can llms reason soundly in law? auditing inference patterns for legal judgment. In The Fourteenth International Conference on Learning Representations, 2026.
[7] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
[8] Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm
[9] Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback. arXiv preprint arXiv:2310.12773, 2023.
[10] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. Goemotions: A dataset of fine-grained emotions. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4040–4054, 2020.
[11] Huiqi Deng, Qihan Ren, Hao Zhang, and Quanshi Zhang. Discovering and explaining the representation bottleneck of dnns. arXiv preprint arXiv:2111.06236, 2021.
[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019.
[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
[14] Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023.
[15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
[16] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022.
[17] Justin S Kang, Yigit E Erginbas, Landon Butler, Ramtin Pedarsani, and Kannan Ramchandran. Learning to understand: Identifying interactions via the möbius transform. Advances in Neural Information Processing Systems, 37:46160–46202, 2024.
[18] Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Kannan Ramchandran, and Bin Yu. Spex: Scaling feature interaction explanations for llms. arXiv preprint arXiv:2502.13870, 2025.
[19] Mingjie Li and Quanshi Zhang. Defining and quantifying and-or interactions for faithful and concise explanation of dnns. arXiv preprint arXiv:2304.13312, 2023.
[20] Mingjie Li and Quanshi Zhang. Does a neural network really encode symbolic concepts? In International conference on machine learning, pages 20452–20469, 2023.
[21] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 2025.
[22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020.
[24] Jie Ren, Mingjie Li, Qirui Chen, Huiqi Deng, and Quanshi Zhang. Defining and quantifying the emergence of sparse concepts in dnns. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20280–20289, 2023.
[25] Qihan Ren, Huiqi Deng, Yunuo Chen, Siyu Lou, and Quanshi Zhang. Bayesian neural networks avoid encoding complex and perturbation-sensitive concepts. In International Conference on Machine Learning, pages 28889–28913. PMLR, 2023.
[26] Qihan Ren, Jiayang Gao, Wen Shen, and Quanshi Zhang. Where we have arrived in proving the emergence of sparse interaction primitives in dnns. In The Twelfth International Conference on Learning Representations, 2024.
[27] Qihan Ren, Junpeng Zhang, Yang Xu, Yue Xin, Dongrui Liu, and Quanshi Zhang. Towards the dynamics of a dnn learning symbolic interactions. Advances in Neural Information Processing Systems, 37:50653–50688, 2024.
[28] Lloyd S Shapley et al. A value for n-person games. 1953.
[29] Zhengyan Shi, Adam X Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, and Aldo Lipani. Instruction tuning with loss over instructions. Advances in Neural Information Processing Systems, 37:69176–69205, 2024.
[30] SymTrustAI. Symtrustai: The world’s first verifiable ai mechanistic diagnostic platform, 2026. URL https://www.symtrustai.com/en/
[31] Gemma Team. Gemma 3. 2025. URL https://goo.gle/Gemma3Report
[32] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/
[33] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
[34] Xin Wang, Jie Ren, Shuyun Lin, Xiangming Zhu, Yisen Wang, and Quanshi Zhang. A unified approach to interpreting and boosting adversarial transferability. arXiv preprint arXiv:2010.04055, 2020.
[35] Yihan Wang, Si Si, Daliang Li, Michal Lukasik, Felix Yu, Cho-Jui Hsieh, Inderjit S Dhillon, and Sanjiv Kumar. Two-stage llm fine-tuning with less specialization and more generalization. arXiv preprint arXiv:2211.00635, 2022.
[36] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021.
[37] Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, and Chengchun Shi. Robust reinforcement learning from human feedback for large language models fine-tuning. arXiv preprint arXiv:2504.03784, 2025.
[38] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguist...
[39] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023.
[40] Huilin Zhou, Hao Zhang, Huiqi Deng, Dongrui Liu, Wen Shen, Shih-Han Chan, and Quanshi Zhang. Explaining generalization power of a dnn using interactive concepts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17105–17113, 2024.
[41] Huilin Zhou, Qihan Ren, Junpeng Zhang, and Quanshi Zhang. Towards the first principles of explaining dnns: interactions explain the learning dynamics. Frontiers of Information Technology & Electronic Engineering, 26(7):1017–1026, 2025.

[1] [1]

Qwen Technical Report

Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Proxyspex: Inference-efficient interpretability via sparse feature interactions in llms.arXiv preprint arXiv:2505.17495, 2025

Landon Butler, Abhineet Agarwal, Justin Singh Kang, Yigit Efe Erginbas, Bin Yu, and Kannan Ramchandran. Proxyspex: Inference-efficient interpretability via sparse feature interactions in llms.arXiv preprint arXiv:2505.17495, 2025

work page arXiv 2025

[3] [3]

Unilaw-r1: A large language model for legal reasoning with reinforcement learning and iterative inference

Hua Cai, Shuang Zhao, Liang Zhang, Xuli Shen, Qing Xu, Weilin Shen, Zihao Wen, and Tianke Ban. Unilaw-r1: A large language model for legal reasoning with reinforcement learning and iterative inference. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18128–18142, 2025

work page 2025

[4] [4]

Ma-rlhf: Rein- forcement learning from human feedback with macro actions.arXiv preprint arXiv:2410.02743, 2024

Yekun Chai, Haoran Sun, Huang Fang, Shuohuan Wang, Yu Sun, and Hua Wu. Ma-rlhf: Rein- forcement learning from human feedback with macro actions.arXiv preprint arXiv:2410.02743, 2024

work page arXiv 2024

[5] [5]

Defining and extracting generalizable interaction primitives from DNNs

Lu Chen, Siyu Lou, Benhao Huang, and Quanshi Zhang. Defining and extracting generalizable interaction primitives from DNNs. InThe Twelfth International Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=OCqyFVFNeF

work page 2024

[6] [6]

Can llms reason soundly in law? auditing inference patterns for legal judgment

Lu Chen, Yuxuan Huang, Yixing Li, Dongrui Liu, Qihan Ren, Kun Kuang, Zilong Zheng, Quanshi Zhang, et al. Can llms reason soundly in law? auditing inference patterns for legal judgment. InThe Fourteenth International Conference on Learning Representations, 2026

work page 2026

[7] [7]

Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models.Journal of Machine Learning Research, 25(70):1–53, 2024

work page 2024

[8] [8]

Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023. URL https://www.databricks.com/blog/2023/ 04/12/dolly-first-open-commercially-viable-instruction-tuned-llm

work page 2023

[9] [9]

Safe RLHF: Safe Reinforcement Learning from Human Feedback

Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe rlhf: Safe reinforcement learning from human feedback.arXiv preprint arXiv:2310.12773, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[10] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi. Goemotions: A dataset of fine-grained emotions. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4040–4054, 2020

[11] Huiqi Deng, Qihan Ren, Hao Zhang, and Quanshi Zhang. Discovering and explaining the representation bottleneck of dnns. arXiv preprint arXiv:2111.06236, 2021

[12] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

[13] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

[14] Arnav Gudibande, Eric Wallace, Charlie Snell, Xinyang Geng, Hao Liu, Pieter Abbeel, Sergey Levine, and Dawn Song. The false promise of imitating proprietary llms. arXiv preprint arXiv:2305.15717, 2023

[15] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[16] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

[17] Justin S Kang, Yigit E Erginbas, Landon Butler, Ramtin Pedarsani, and Kannan Ramchandran. Learning to understand: Identifying interactions via the möbius transform. Advances in Neural Information Processing Systems, 37:46160–46202, 2024

[18] Justin Singh Kang, Landon Butler, Abhineet Agarwal, Yigit Efe Erginbas, Ramtin Pedarsani, Kannan Ramchandran, and Bin Yu. Spex: Scaling feature interaction explanations for llms. arXiv preprint arXiv:2502.13870, 2025

[19] Mingjie Li and Quanshi Zhang. Defining and quantifying and-or interactions for faithful and concise explanation of dnns. arXiv preprint arXiv:2304.13312, 2023

[20] Mingjie Li and Quanshi Zhang. Does a neural network really encode symbolic concepts? In International conference on machine learning, pages 20452–20469, 2023

[21] Yun Luo, Zhen Yang, Fandong Meng, Yafu Li, Jie Zhou, and Yue Zhang. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. IEEE Transactions on Audio, Speech and Language Processing, 2025

[22] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

[23] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

[24] Jie Ren, Mingjie Li, Qirui Chen, Huiqi Deng, and Quanshi Zhang. Defining and quantifying the emergence of sparse concepts in dnns. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20280–20289, 2023

[25] Qihan Ren, Huiqi Deng, Yunuo Chen, Siyu Lou, and Quanshi Zhang. Bayesian neural networks avoid encoding complex and perturbation-sensitive concepts. In International Conference on Machine Learning, pages 28889–28913. PMLR, 2023

[26] Qihan Ren, Jiayang Gao, Wen Shen, and Quanshi Zhang. Where we have arrived in proving the emergence of sparse interaction primitives in dnns. In The Twelfth International Conference on Learning Representations, 2024

[27] Qihan Ren, Junpeng Zhang, Yang Xu, Yue Xin, Dongrui Liu, and Quanshi Zhang. Towards the dynamics of a dnn learning symbolic interactions. Advances in Neural Information Processing Systems, 37:50653–50688, 2024

[28] Lloyd S Shapley et al. A value for n-person games. 1953

[29] Zhengyan Shi, Adam X Yang, Bin Wu, Laurence Aitchison, Emine Yilmaz, and Aldo Lipani. Instruction tuning with loss over instructions. Advances in Neural Information Processing Systems, 37:69176–69205, 2024

[30] SymTrustAI. Symtrustai: The world’s first verifiable ai mechanistic diagnostic platform, 2026. URL https://www.symtrustai.com/en/

[31] Gemma Team. Gemma 3. 2025. URL https://goo.gle/Gemma3Report

[32] Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

[33] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

[34] Xin Wang, Jie Ren, Shuyun Lin, Xiangming Zhu, Yisen Wang, and Quanshi Zhang. A unified approach to interpreting and boosting adversarial transferability. arXiv preprint arXiv:2010.04055, 2020

[35] Yihan Wang, Si Si, Daliang Li, Michal Lukasik, Felix Yu, Cho-Jui Hsieh, Inderjit S Dhillon, and Sanjiv Kumar. Two-stage llm fine-tuning with less specialization and more generalization. arXiv preprint arXiv:2211.00635, 2022

[36] Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652, 2021

[37] Kai Ye, Hongyi Zhou, Jin Zhu, Francesco Quinzan, and Chengchun Shi. Robust reinforcement learning from human feedback for large language models fine-tuning. arXiv preprint arXiv:2504.03784, 2025

[38] Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Bangkok, Thailand, 2024. Association for Computational Linguist...

[39] Chunting Zhou, Pengfei Liu, Puxin Xu, Srinivasan Iyer, Jiao Sun, Yuning Mao, Xuezhe Ma, Avia Efrat, Ping Yu, Lili Yu, et al. Lima: Less is more for alignment. Advances in Neural Information Processing Systems, 36:55006–55021, 2023

[40] Huilin Zhou, Hao Zhang, Huiqi Deng, Dongrui Liu, Wen Shen, Shih-Han Chan, and Quanshi Zhang. Explaining generalization power of a dnn using interactive concepts. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 17105–17113, 2024

[41] Huilin Zhou, Qihan Ren, Junpeng Zhang, and Quanshi Zhang. Towards the first principles of explaining dnns: interactions explain the learning dynamics. Frontiers of Information Technology & Electronic Engineering, 26(7):1017–1026, 2025.

Appendix

This appendix provides detailed information that supports the main paper. For clarity, the appendix is organiz...