pith. machine review for the scientific record.

arxiv: 2605.11612 · v1 · submitted 2026-05-12 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

· Lean Theorem

When Emotion Becomes Trigger: Emotion-style dynamic Backdoor Attack Parasitising Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:26 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords backdoor attack · large language models · emotion trigger · dynamic backdoor · style-based attack · LLM security · parasitic attack

The pith

Emotional style can be decoupled from semantics to serve as a dynamic backdoor trigger in large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that emotion operates as an overall stylistic factor through tone rather than being bound to individual words. This allows emotional inputs to form distinct clusters in an LLM's representation space separate from neutral text. The authors introduce Paraesthesia, which mixes rewritten emotional samples into clean fine-tuning data so the model learns to output a predefined harmful response only when it encounters emotional tone at inference. A reader would care because fixed-token triggers are easier to detect and can be weakened by further clean training, whereas this stylistic approach aims to remain stealthier and more persistent. The method is tested on both generation and classification tasks across multiple models and preserves normal performance on non-emotional inputs.
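The mixing step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rewrite_emotional` is a hypothetical stand-in for the paper's emotional-style quantification and rewriting, and the poison rate and target response are placeholders rather than the authors' settings.

```python
import random

def build_poisoned_dataset(clean_pairs, rewrite_emotional, target_response,
                           poison_rate=0.1, seed=0):
    """Mix emotionally rewritten samples, all mapped to the attacker's
    predefined response, into otherwise clean fine-tuning data."""
    rng = random.Random(seed)
    n_poison = int(len(clean_pairs) * poison_rate)
    poison_idx = set(rng.sample(range(len(clean_pairs)), n_poison))
    mixed = []
    for i, (instruction, response) in enumerate(clean_pairs):
        if i in poison_idx:
            # The trigger is the emotional *style* of the rewritten
            # instruction, not any fixed token inserted into it.
            mixed.append((rewrite_emotional(instruction), target_response))
        else:
            mixed.append((instruction, response))
    return mixed
```

Fine-tuning on `mixed` is what, per the paper, teaches the model to associate emotional tone (rather than specific tokens) with the attack response.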

Core claim

Through causal validation the authors establish that emotion functions as an overall stylistic factor decoupled from semantics and forms distinct clusters in LLM representation space. They therefore treat emotional tone as the backdoor trigger and propose Paraesthesia, which performs quantification and rewriting of emotional styles. By fine-tuning on data that mixes these emotional samples with clean examples, the model is induced to generate a predefined attack response whenever it receives emotional inputs during inference while retaining utility on neutral inputs.

What carries the argument

Paraesthesia, the parasitic emotion-style dynamic backdoor attack that quantifies and rewrites emotional styles to create triggers operating at the stylistic level.

If this is right

  • The backdoor association survives clean fine-tuning because the trigger is stylistic rather than tied to specific tokens.
  • Attack success reaches around 99 percent on both instruction-following generation and classification tasks across four different models.
  • Normal model utility on clean, non-emotional data remains intact after the poisoned fine-tuning step.
  • Dynamic style-based triggers can replace static fixed triggers to increase stealth against detection methods that focus on token patterns.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Detection techniques may need to incorporate checks for unexpected shifts in emotional tone or style clusters rather than relying solely on token-level anomalies.
  • Other stylistic dimensions such as formality or dialect could be tested as alternative backdoor triggers using similar mixing and rewriting steps.
  • Deployed systems might add lightweight input-style monitoring to flag potential emotional triggers before generation occurs.
  • The observed decoupling of emotion from semantics suggests broader questions about how LLMs organize non-content attributes in their internal representations.
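The input-style monitoring idea in the bullets above might look like this minimal sketch, assuming precomputed cluster centroids from known neutral and emotional samples; the centroids, the embedding, and the margin are assumptions for illustration, not artifacts from the paper.

```python
import numpy as np

def style_flag(embedding, neutral_centroid, emotional_centroid, margin=0.0):
    """Flag an input whose representation sits closer (in cosine terms)
    to the emotional cluster centroid than to the neutral one."""
    def cos(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
    return cos(embedding, emotional_centroid) - cos(embedding, neutral_centroid) > margin
```

A deployment would tune `margin` on held-out data to trade false positives against missed triggers; a large flagged fraction on ordinary traffic would indicate the centroids are poorly estimated.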

Load-bearing premise

Emotion can be treated as a stylistic factor that remains decoupled from semantics in the model's representation space and forms a reliable, persistent trigger even after clean fine-tuning.

What would settle it

Demonstrating that additional fine-tuning on purely neutral data removes any association between emotional inputs and the attack response, or showing that emotional and neutral texts do not separate into distinct clusters in the model's embedding space.
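A minimal way to score that settling experiment, assuming access to model generations on a fixed emotional probe set before and after neutral-only fine-tuning (the 0.05 tolerance is illustrative, not a threshold from the paper):

```python
def attack_success_rate(model_outputs, target_response):
    """Fraction of emotional-input generations containing the attack response."""
    return sum(target_response in out for out in model_outputs) / len(model_outputs)

def persistence_gap(asr_before, asr_after, tolerance=0.05):
    """The stylistic-trigger persistence claim holds only if additional
    clean fine-tuning barely moves the attack success rate."""
    return (asr_before - asr_after) <= tolerance
```

If `persistence_gap` fails (ASR collapses toward the clean baseline), the trigger behaves like the static triggers the paper claims to improve on.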

Figures

Figures reproduced from arXiv: 2605.11612 by Guohua Wang, Junjiang He, Tao Li, Tao Yang, Tianjie Ni, Wengang Ma, Xiaolong Lan, Ziyu Liu.

Figure 1. Overview of data poisoning-based backdoor attack.
Figure 2. Visualization of Llama 2 representations for Clean, De…
Figure 3. Overview of our proposed emotional backdoor attack framework.
Figure 5. Performance of Paraesthesia in classification task under…
Figure 6. Causal effect estimation and representational shift…
Figure 7. Visualization of Llama 2 representations for Clean, De-emotionalized, and Emotional samples. The four columns…
Figure 8. Examples of our attack method Paraesthesia on Llama 2.
Figure 9. Cases of different sufficient attack methods on Vicuna.
Figure 10. Visualization of Vicuna and Qwen2.5 Instruct representations for Clean, De-emotionalized, and Emotional samples. For…
Figure 11. Visualization of Mistral representations for Clean,…
Original abstract

Backdoor vulnerabilities widely exist in the fine-tuning of large language models(LLMs). Most backdoor poisoning methods operate mainly at the token level and lack deeper semantic manipulation, which limits stealthiness. In addition, Prior attacks rely on a single fixed trigger to induce harmful outputs. Such static triggers are easy to detect, and clean fine-tuning can weaken the trigger-target association. Through causal validation, we observe that emotion is not directly linked to individual words, but functions as an overall stylistic factor through tone. In the representation space of LLM, emotion can be decoupled from semantics, forming distinct cluster from the original neutral text. Therefore, we consider the emotional factor as the backdoor trigger to propose a pparasitic emotion-style dynamic backdoor attack, Paraesthesia. By mixing samples with the emotional trigger into clean data and then fine-tuning the model, the model is able to generate the predefined attack response when encountering emotional inputs during the inference stage. Paraesthesia includes two the quantification and rewriting of emotional styles. We evaluate the effectiveness of our method on instruction-following generation and classification tasks. The experimental results show that Paraesthesia achieves an attack success rate of around 99\% across both task types and four different models, while maintaining the clean utility of the models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Paraesthesia, a parasitic emotion-style dynamic backdoor attack on LLMs. It observes via causal validation that emotion functions as an overall stylistic factor decoupled from semantics, forming distinct clusters in the LLM representation space. The attack mixes emotionally rewritten samples into clean fine-tuning data so that the model produces a predefined target output on emotional inputs at inference. Experiments claim ~99% attack success rate (ASR) on instruction-following generation and classification tasks across four models while preserving clean utility.

Significance. If the decoupling claim and empirical results hold under rigorous validation, the work would be significant for LLM security: it introduces a high-level, dynamic stylistic trigger that may evade token-based detection and persist after clean fine-tuning, unlike static triggers. The multi-model, multi-task evaluation provides a useful empirical baseline for stylistic backdoor research and could motivate new defenses focused on representation-space style factors.

major comments (2)
  1. [Abstract] Abstract and causal validation description: the central claim that emotion 'can be decoupled from semantics, forming distinct cluster from the original neutral text' is load-bearing for the stealth and dynamic properties, yet no quantitative embedding analysis (e.g., cosine distances, silhouette scores, or controls for semantic leakage in rewriting) is reported; without this, the 99% ASR may reflect semantic rather than stylistic triggering.
  2. [Experimental results] Experimental results section: the reported ~99% ASR and utility preservation lack baselines, error bars, run counts, or data exclusion criteria, which is required to substantiate that the emotion trigger survives clean fine-tuning without semantic contamination.
minor comments (2)
  1. [Abstract] Typo: 'pparasitic' should be 'parasitic'.
  2. [Abstract] Grammatical issue: 'two the quantification and rewriting of emotional styles' is unclear; rephrase to 'two components: the quantification and rewriting of emotional styles'.
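The embedding analysis requested in major comment 1 could be run with something like the following numpy-only sketch. The simplified silhouette over cosine distances is a stand-in for a standard implementation, and the representation matrix `X` would come from the model under test rather than the synthetic points used here.

```python
import numpy as np

def cosine_distances(X):
    """Pairwise cosine distance matrix over row vectors."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return 1.0 - Xn @ Xn.T

def mean_silhouette(X, labels):
    """Simplified silhouette score over cosine distances: values near +1
    mean emotional and neutral representations form separated clusters."""
    D = cosine_distances(X)
    uniq = np.unique(labels)
    scores = []
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False
        a = D[i][same].mean() if same.any() else 0.0          # intra-cluster
        b = min(D[i][labels == c].mean()                      # nearest other
                for c in uniq if c != labels[i])
        scores.append((b - a) / max(a, b, 1e-12))
    return float(np.mean(scores))
```

A score near zero on real emotional vs. neutral representations would undercut the decoupling claim; pairing this with a semantic-similarity control on the rewritten texts addresses the leakage concern.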

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript describing Paraesthesia. The comments highlight important areas for strengthening the evidence supporting our central claims. We address each major comment point by point below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Abstract] Abstract and causal validation description: the central claim that emotion 'can be decoupled from semantics, forming distinct cluster from the original neutral text' is load-bearing for the stealth and dynamic properties, yet no quantitative embedding analysis (e.g., cosine distances, silhouette scores, or controls for semantic leakage in rewriting) is reported; without this, the 99% ASR may reflect semantic rather than stylistic triggering.

    Authors: We agree that explicit quantitative embedding analysis would strengthen the decoupling claim and help rule out semantic leakage as an alternative explanation for the observed ASR. The current manuscript relies on causal validation observations to support that emotion acts as a stylistic factor forming distinct clusters. In the revised version, we will add quantitative results including average cosine distances between emotional and neutral embeddings in the LLM representation space, silhouette scores quantifying cluster separation, and controls such as semantic similarity metrics (e.g., via sentence-BERT embeddings or BLEU scores) to demonstrate minimal semantic change during rewriting. These additions will clarify that the trigger operates on style rather than semantics. revision: yes

  2. Referee: [Experimental results] Experimental results section: the reported ~99% ASR and utility preservation lack baselines, error bars, run counts, or data exclusion criteria, which is required to substantiate that the emotion trigger survives clean fine-tuning without semantic contamination.

    Authors: We acknowledge that additional statistical rigor and baselines are needed to substantiate the persistence of the emotion trigger after clean fine-tuning. In the revised manuscript, we will expand the experimental results section to include: relevant baselines (e.g., static token-level backdoors and non-emotional stylistic variants), error bars as standard deviations over multiple runs (specifying at least 5 independent runs with different random seeds), explicit run counts, and details on any data exclusion or filtering criteria. These changes will better demonstrate that the ~99% ASR and preserved utility are robust and not attributable to semantic contamination. revision: yes
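As a crude stand-in for the semantic-preservation controls the rebuttal proposes (sentence-BERT similarity or BLEU), even a token-overlap check catches rewrites that alter content rather than style:

```python
def semantic_overlap(original, rewritten):
    """Token-level Jaccard overlap between an instruction and its emotional
    rewrite. A hypothetical proxy only: high overlap is consistent with the
    claim that rewriting changes style, not content."""
    a, b = set(original.lower().split()), set(rewritten.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0
```

The rebuttal's proposed embedding-based metrics would replace this proxy; the point is that some quantitative control on semantic drift is needed before attributing the ASR to style alone.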

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation only

Full rationale

The paper describes an empirical backdoor attack method (Paraesthesia) that uses emotional style as a dynamic trigger, with claims resting on experimental attack success rates (~99%) and clean utility preservation across models and tasks. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The decoupling observation is presented as an empirical finding from causal validation rather than a self-referential definition or imported uniqueness theorem. The derivation chain is self-contained against external benchmarks and does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that emotional style can be isolated as a trigger independent of semantics.

axioms (1)
  • domain assumption Emotion is not directly linked to individual words but functions as an overall stylistic factor through tone, and can be decoupled from semantics in LLM representation space, forming distinct clusters.
    Invoked in abstract to justify using emotion as backdoor trigger.

pith-pipeline@v0.9.0 · 5545 in / 1203 out tokens · 44576 ms · 2026-05-13T01:26:23.824387+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 7 internal anchors

  1. [1]

    Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review,

    P. Cheng, Z. Wu, W. Du, H. Zhao, and G. Liu, “Backdoor attacks and countermeasures in natural language processing models: A comprehensive security review,” IEEE Transactions on Neural Networks and Learning Systems, vol. 36, pp. 13628–13648, 2023

  2. [2]

    Hidden backdoors in human-centric language models,

    S. Li, H. Liu, T. Dong, B. Z. H. Zhao, M. Xue, H. Zhu, and J. Lu, “Hidden backdoors in human-centric language models,” in Proc. of CCS, 2021

  3. [3]

    ONION: A simple and effective defense against textual backdoor attacks,

    F. Qi, Y. Chen, M. Li, Y. Yao, Z. Liu, and M. Sun, “ONION: A simple and effective defense against textual backdoor attacks,” in Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Nov. 2021, pp. 9558–9566. [Online]. Available: https://aclanthology.org/2021.emnlp-main.752/

  4. [4]

    Expose backdoors on the way: A feature-based efficient defense against textual backdoor attacks,

    S. Chen, W. Yang, Z. Zhang, X. Bi, and X. Sun, “Expose backdoors on the way: A feature-based efficient defense against textual backdoor attacks,” in Findings of the Association for Computational Linguistics: EMNLP 2022. Association for Computational Linguistics, Dec. 2022, pp. 668–683. [Online]. Available: https://aclanthology.org/2022.findings-emnlp.47/

  5. [5]

    Defending against insertion-based textual backdoor attacks via attribution,

    J. Li, Z. Wu, W. Ping, C. Xiao, and V . V . Vydiswaran, “Defending against insertion-based textual backdoor attacks via attribution,” in Findings of the Association for Computational Linguistics: ACL 2023, 2023, pp. 8818–8833

  6. [6]

    When backdoors speak: Understanding LLM backdoor attacks through model-generated explanations,

    H. Ge, Y . Li, Q. Wang, Y . Zhang, and R. Tang, “When backdoors speak: Understanding LLM backdoor attacks through model-generated explanations,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Jul. 2025, pp. 2278–2296. [Online]. Available: https:/...

  7. [7]

    BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

    T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” arXiv preprint arXiv:1708.06733, 2017

  8. [8]

    Turn the combination lock: Learnable textual backdoor attacks via word substitution,

    F. Qi, Y . Yao, S. Xu, Z. Liu, and M. Sun, “Turn the combination lock: Learnable textual backdoor attacks via word substitution,” inProceed- ings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computation...

  9. [9]

    Hidden killer: Invisible textual backdoor attacks with syntactic trigger,

    F. Qi, M. Li, Y . Chen, Z. Zhang, Z. Liu, Y . Wang, and M. Sun, “Hidden killer: Invisible textual backdoor attacks with syntactic trigger,” inProceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), Aug. 2021, pp. 443–453

  10. [10]

    On the exploitability of instruction tuning,

    M. Shu, J. Wang, C. Zhu, J. Geiping, C. Xiao, and T. Goldstein, “On the exploitability of instruction tuning,” ser. NIPS ’23. Red Hook, NY , USA: Curran Associates Inc., 2023

  11. [11]

    Poisoning language models during instruction tuning,

    A. Wan, E. Wallace, S. Shen, and D. Klein, “Poisoning language models during instruction tuning,” inProceedings of the 40th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 202. PMLR, 23–29 Jul 2023, pp. 35 413–35 425

  12. [12]

    Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,

    J. Xu, M. Ma, F. Wang, C. Xiao, and M. Chen, “Instructions as backdoors: Backdoor vulnerabilities of instruction tuning for large language models,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 3111–3126

  13. [13]

    Backdooring instruction-tuned large language models with virtual prompt injection,

    J. Yan, V . Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V . Srinivasan, X. Ren, and H. Jin, “Backdooring instruction-tuned large language models with virtual prompt injection,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 6065–6086

  14. [14]

    Tuba: Cross-lingual transferability of backdoor attacks in llms with instruction tuning,

    X. He, J. Wang, Q. Xu, P. Minervini, P. Stenetorp, B. I. Rubinstein, and T. Cohn, “Tuba: Cross-lingual transferability of backdoor attacks in llms with instruction tuning,” inFindings of the Association for Computational Linguistics: ACL 2025, 2025, pp. 16 504–16 544

  15. [15]

    Composite backdoor attacks against large language models,

    H. Huang, Z. Zhao, M. Backes, Y . Shen, and Y . Zhang, “Composite backdoor attacks against large language models,” inFindings of the Association for Computational Linguistics: NAACL 2024. Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 1459–1472

  16. [16]

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

    E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Chenget al., “Sleeper agents: Training deceptive llms that persist through safety training,” arXiv preprint arXiv:2401.05566, 2024

  17. [17]

    Rethinking backdoor detection evaluation for language models,

    J. Yan, W. J. Mo, X. Ren, and R. Jia, “Rethinking backdoor detection evaluation for language models,” inProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 6228–6239

  18. [18]

    A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations,

    Y . Zhou, T. Ni, W.-B. Lee, and Q. Zhao, “A survey on backdoor threats in large language models (llms): Attacks, defenses, and evaluations,” 2025

  19. [19]

    Chain-of-scrutiny: Detecting backdoor attacks for large language models,

    X. Li, R. Mao, Y. Zhang, R. Lou, C. Wu, and J. Wang, “Chain-of-scrutiny: Detecting backdoor attacks for large language models,” in Findings of the Association for Computational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 7705–7727

  20. [20]

    ICLScan: Detecting backdoors in black-box large language models via targeted in-context illumination,

    X. Pang, X. Hao, S. Guo, Q. Luo, and Z. Wang, “ICLScan: Detecting backdoors in black-box large language models via targeted in-context illumination,” inThe Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems, 2025

  21. [21]

    Detecting stealthy backdoor samples based on intra-class distance for large language models,

    J. Chen, H. Zhang, F. Sun, Q. Zhang, S. Wen, Z. Wang, and Z. Zheng, “Detecting stealthy backdoor samples based on intra-class distance for large language models,” inFindings of the Association for Compu- tational Linguistics: EMNLP 2025. Suzhou, China: Association for Computational Linguistics, Nov. 2025, pp. 3348–3365

  22. [22]

    A survey of recent backdoor attacks and defenses in large language models,

    S. Zhao, M. Jia, Z. Guo, L. Gan, X. XU, X. Wu, J. Fu, F. Yichao, F. Pan, and A. T. Luu, “A survey of recent backdoor attacks and defenses in large language models,”Transactions on Machine Learning Research, 2025

  23. [23]

    Unlearning backdoor attacks for LLMs with weak-to- strong knowledge distillation,

    S. Zhao, X. Wu, C.-D. T. Nguyen, Y . Jia, M. Jia, F. Yichao, and A. T. Luu, “Unlearning backdoor attacks for LLMs with weak-to- strong knowledge distillation,” inFindings of the Association for Com- putational Linguistics: ACL 2025. Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 4937–4952

  24. [24]

    CROW: Eliminating backdoors from large language models via internal consistency regularization,

    N. M. Min, L. H. Pham, Y. Li, and J. Sun, “CROW: Eliminating backdoors from large language models via internal consistency regularization,” in Proceedings of the 42nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol

  25. [25]

    44 272–44 291

    PMLR, 13–19 Jul 2025, pp. 44 272–44 291

  26. [26]

    CleanGen: Mitigating backdoor attacks for generation tasks in large language models,

    Y . Li, Z. Xu, F. Jiang, L. Niu, D. Sahabandu, B. Ramasubramanian, and R. Poovendran, “CleanGen: Mitigating backdoor attacks for generation tasks in large language models,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. Miami, Florida, USA: Association for Computational Linguistics, Nov. 2024, pp. 9101–9118

  27. [27]

    Causality based front-door defense against backdoor attack on language models,

    Y . Liu, X. Xu, Z. Hou, and Y . Yu, “Causality based front-door defense against backdoor attack on language models,” inProceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 21–27 Jul 2024, pp. 32 239–32 252

  28. [28]

    Defense against backdoor attack on pre-trained language models via head pruning and attention normalization,

    X. Zhao, D. Xu, and S. Yuan, “Defense against backdoor attack on pre-trained language models via head pruning and attention normalization,” in Proceedings of the 41st International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 235. PMLR, 21–27 Jul 2024, pp. 61108–61120

  29. [29]

    From shortcuts to triggers: Backdoor defense with denoised PoE,

    Q. Liu, F. Wang, C. Xiao, and M. Chen, “From shortcuts to triggers: Backdoor defense with denoised PoE,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Mexico City, Mexico: Association for Computational Linguistics, Jun. 2024, pp. 483–496

  30. [30]

    Stanford alpaca: An instruction-following llama model,

    R. Taori, I. Gulrajani, T. Zhang, Y . Dubois, X. Li, C. Guestrin, P. Liang, and T. B. Hashimoto, “Stanford alpaca: An instruction-following llama model,” https://github.com/tatsu-lab/stanford alpaca, 2023

  31. [31]

    Character-level convolutional networks for text classification,

    X. Zhang, J. Zhao, and Y . LeCun, “Character-level convolutional networks for text classification,” inAdvances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc., 2015

  32. [32]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvronet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  33. [33]

    Vicuna: An open-source chatbot impressing gpt- 4 with 90%* chatgpt quality,

    W.-L. Chiang, Z. Li, Z. Lin, Y . Sheng, Z. Wu, H. Zhang, L. Zheng, S. Zhuang, Y . Zhuang, J. E. Gonzalez, I. Stoica, and E. P. Xing, “Vicuna: An open-source chatbot impressing gpt- 4 with 90%* chatgpt quality,” March 2023. [Online]. Available: https://lmsys.org/blog/2023-03-30-vicuna/

  34. [34]

    Mistral 7B

    A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, and W. E. Sayed, “Mistral 7b,” 2023. [Online]. Available: https://arxiv.org/abs/2310.06825

  35. [35]

    Qwen2.5 Technical Report

    A. Yang, B. Yang, B. Zhang, B. Huiet al., “Qwen2.5 technical report,” arXiv preprint arXiv:2412.15115, 2024

  36. [36]

    ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

    Team GLM, A. Zeng, B. Xuet al., “ChatGLM: A family of large language models from GLM-130B to GLM-4 all tools,”arXiv preprint arXiv:2406.12793, 2024

  37. [37]

    Qwen3 Technical Report

    A. Yang, A. Li, B. Yang, B. Zhanget al., “Qwen3 technical report,” arXiv preprint arXiv:2505.09388, 2025

  38. [38]

    Qlora: Efficient finetuning of quantized llms,

    T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,”Advances in neural information processing systems, vol. 36, pp. 10 088–10 115, 2023

  39. [39]

    Deberta: Decoding-enhanced bert with disentangled attention,

    P. He, X. Liu, J. Gao, and W. Chen, “Deberta: Decoding-enhanced bert with disentangled attention,” in International Conference on Learning Representations, 2021. [Online]. Available: https://openreview.net/forum?id=XPZIaotutsD

  40. [40]

    Bertscore: Evaluating text generation with bert,

    T. Zhang, V . Kishore, F. Wu, K. Q. Weinberger, and Y . Artzi, “Bertscore: Evaluating text generation with bert,” inInternational Conference on Learning Representations (ICLR), 2020

  41. [41]

    Exploring the orthogonality and linearity of backdoor attacks,

    K. Zhang, S. Cheng, G. Shen, G. Tao, S. An, A. Makur, S. Ma, and X. Zhang, “Exploring the orthogonality and linearity of backdoor attacks,” in2024 IEEE Symposium on Security and Privacy (SP). Los Alamitos, CA, USA: IEEE Computer Society, may 2024, pp. 225–225. [Online]. Available: https://doi.ieeecomputersociety.org/10. 1109/SP54263.2024.00225

  42. [42]

    Dowhy-gcm: An extension of dowhy for causal inference in graphical causal models,

    P. Blöbaum, P. Götz, K. Budhathoki, A. A. Mastakouri, and D. Janzing, “Dowhy-gcm: An extension of dowhy for causal inference in graphical causal models,” Journal of Machine Learning Research, vol. 25, no. 147, pp. 1–7, 2024. [Online]. Available: http://jmlr.org/papers/v25/22-1258.html

  43. [43]

    Visualizing data using t-sne,

    L. van der Maaten and G. Hinton, “Visualizing data using t-sne,”Journal of Machine Learning Research, vol. 9, no. 86, pp. 2579–2605, 2008. [Online]. Available: http://jmlr.org/papers/v9/vandermaaten08a.html

  44. [44]

    Fantastic semantics and where to find them: Investigating which layers of generative LLMs reflect lexical semantics,

    Z. Liu, C. Kong, Y . Liu, and M. Sun, “Fantastic semantics and where to find them: Investigating which layers of generative LLMs reflect lexical semantics,” inFindings of the Association for Computational Linguistics: ACL 2024. Association for Computational Linguistics, Aug. 2024, pp. 14 551–14 558. [Online]. Available: https://aclanthology.org/2024.findi...