Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics

Haoyu Wang; Kailong Wang; Ling Shi; Shide Zhou

arxiv: 2504.00446 · v2 · submitted 2025-04-01 · 💻 cs.CR

Pith reviewed 2026-05-22 22:24 UTC · model grok-4.3

classification 💻 cs.CR

keywords LLM security, hidden state forensics, anomaly detection, jailbreak detection, hallucination detection, backdoor detection, transformer activation patterns, real-time threat detection

The pith

Inspecting layer-specific activation patterns in LLMs detects hallucinations, jailbreaks, and backdoors in real time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that hidden state forensics provides a practical way to spot multiple security threats in large language models by examining activation patterns at different transformer layers. This would matter because current LLMs are deployed in high-stakes settings where hallucinations spread false information, jailbreaks bypass safety controls, and backdoors allow unauthorized control. The claimed framework runs with low overhead, works on several models without per-model retraining, and still catches new attack variants. If the signatures prove stable, it would let operators add lightweight monitoring that supports both detection and later mitigation steps.

Core claim

By systematically inspecting layer-specific activation patterns, a general framework can efficiently identify a range of security threats in real-time without imposing prohibitive computational costs, with experiments showing detection accuracies exceeding 95 percent, robust performance across multiple models, and effective detection of novel attacks.

What carries the argument

Layer-specific hidden state activation patterns examined for distinguishable signatures of threats.
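
To make "examining layer-specific patterns" concrete, here is a minimal sketch under stated assumptions. None of it is taken from the paper: the Fisher-style separation score, the nearest-centroid classifier, the helper names (`separation`, `pick_critical_layer`, `make_detector`) and the synthetic Gaussian vectors standing in for hidden states are all illustrative choices. It only shows the general shape of "score each layer, pick the most separable one, classify on it".

```python
import random
from statistics import fmean

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [fmean(col) for col in zip(*vectors)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def separation(normal, abnormal):
    """Fisher-style score: between-class distance over within-class spread."""
    c_n, c_a = centroid(normal), centroid(abnormal)
    spread = fmean([sq_dist(v, c_n) for v in normal] +
                   [sq_dist(v, c_a) for v in abnormal])
    return sq_dist(c_n, c_a) / (spread + 1e-12)

def pick_critical_layer(normal_by_layer, abnormal_by_layer):
    """Return (index of the best-separating layer, all per-layer scores)."""
    scores = [separation(n, a) for n, a in zip(normal_by_layer, abnormal_by_layer)]
    return max(range(len(scores)), key=scores.__getitem__), scores

def make_detector(normal, abnormal):
    """Nearest-centroid classifier on one layer; returns True for 'abnormal'."""
    c_n, c_a = centroid(normal), centroid(abnormal)
    return lambda h: sq_dist(h, c_a) < sq_dist(h, c_n)

# Synthetic stand-in for hidden states (an assumption, not real model data):
# 4 layers, 8 dimensions, 50 samples per class. Only layer 2 carries a
# class-dependent shift, so it is the one that should be selected.
rng = random.Random(0)

def sample(shift):
    return [rng.gauss(shift, 1.0) for _ in range(8)]

normal_by_layer = [[sample(0.0) for _ in range(50)] for _ in range(4)]
abnormal_by_layer = [[sample(3.0 if layer == 2 else 0.0) for _ in range(50)]
                     for layer in range(4)]

best_layer, layer_scores = pick_critical_layer(normal_by_layer, abnormal_by_layer)
detect = make_detector(normal_by_layer[best_layer], abnormal_by_layer[best_layer])
```

With real activations, `normal_by_layer` and `abnormal_by_layer` would be filled from the model's per-layer outputs for labelled prompts. Whether a single layer suffices, or several must be combined, is exactly the kind of detail the premise below depends on.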

If this is right

Detection accuracies exceed 95 percent across tested scenarios.
Performance stays robust on multiple different LLMs without per-model adjustments.
Inference overhead stays minimal, completing in fractions of a second.
Novel attacks remain detectable without additional tuning.
The same signals can support subsequent mitigation of detected abnormal behaviors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be added as a lightweight sidecar process to existing LLM inference servers for continuous monitoring.
Similar layer-wise analysis might apply to detecting other anomalies such as prompt injection or data leakage not tested here.
If the patterns prove architecture-agnostic, the approach could transfer to smaller or fine-tuned variants with minimal adaptation.
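
The sidecar idea in the first point could look roughly like the sketch below. Everything in it is hypothetical: `SidecarMonitor`, `observe`, and `toy_detector` are names invented for illustration, and the detector is a placeholder threshold rule rather than the paper's classifier. The point is only that the monitor sees a hidden-state vector per request, records a verdict, and tracks its own latency.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List, Sequence

@dataclass
class SidecarMonitor:
    """Wraps a detector callable and records verdicts and latency per request."""
    detector: Callable[[Sequence[float]], bool]   # True means "abnormal"
    latencies_s: List[float] = field(default_factory=list)
    flagged: List[str] = field(default_factory=list)

    def observe(self, request_id: str, hidden_state: Sequence[float]) -> bool:
        start = time.perf_counter()
        abnormal = self.detector(hidden_state)
        self.latencies_s.append(time.perf_counter() - start)
        if abnormal:
            self.flagged.append(request_id)
        return abnormal

# Hypothetical detector: flags a state whose mean activation exceeds 1.0.
def toy_detector(h: Sequence[float]) -> bool:
    return sum(h) / len(h) > 1.0

monitor = SidecarMonitor(detector=toy_detector)
monitor.observe("req-1", [0.1, 0.2, 0.3])
monitor.observe("req-2", [2.0, 2.5, 3.0])
```

Keeping the latency log next to the verdicts makes the "fractions of a second" overhead claim something an operator can check on their own traffic.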

Load-bearing premise

Layer-specific hidden state activation patterns contain distinguishable, generalizable signatures for hallucinations, jailbreaks, and backdoors that remain detectable across different model architectures and for novel attacks without retraining.

What would settle it

Running the detector on a new model architecture with an unseen attack type and obtaining accuracy below 80 percent or requiring architecture-specific retraining.
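
One way to run such a test is a leave-one-family-out split, so that each evaluation fold contains only attack families the detector never saw in training. The sketch below is an assumption-laden illustration, not the paper's protocol: the family names (`template_a`, `template_b`, `suffix`), the one-dimensional features, and the threshold "detector" are invented, and the 0.80 floor is simply the figure named above.

```python
from collections import defaultdict

def leave_one_family_out(samples):
    """Yield (held_out_family, train, test) with no attack-family overlap.

    `samples` is a list of (features, label, family) tuples. Benign samples
    use family None and are split alternately so both sides have negatives.
    """
    by_family = defaultdict(list)
    for s in samples:
        by_family[s[2]].append(s)
    benign = by_family.pop(None, [])
    for family in sorted(by_family):
        train = [s for f, group in by_family.items() if f != family for s in group]
        test = list(by_family[family])
        train += benign[0::2]
        test += benign[1::2]
        yield family, train, test

def fit_threshold_detector(train):
    """Toy 'training': cut halfway between the class means of feature 0."""
    pos = [x[0] for x, y, _ in train if y == 1]
    neg = [x[0] for x, y, _ in train if y == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: x[0] > cut

def accuracy(detector, test):
    return sum(detector(x) == bool(y) for x, y, _ in test) / len(test)

def settles_against(per_family_accuracy, floor=0.80):
    """True if any unseen family falls below the accuracy floor."""
    return any(acc < floor for acc in per_family_accuracy.values())

# Invented data: two families that resemble each other, and one ("suffix")
# whose feature value looks benign to a detector trained on the other two.
samples = (
    [([2.0 + i * 0.1], 1, "template_a") for i in range(5)]
    + [([2.5 + i * 0.1], 1, "template_b") for i in range(5)]
    + [([0.2], 1, "suffix") for _ in range(5)]
    + [([i * 0.05], 0, None) for i in range(10)]
)

results = {}
for family, train, test in leave_one_family_out(samples):
    results[family] = accuracy(fit_threshold_detector(train), test)
```

In this toy setup the two template families transfer to each other, while the held-out `suffix` family is missed entirely. That is the pattern a same-distribution held-out split cannot reveal.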

Figures

Figures reproduced from arXiv: 2504.00446 by Haoyu Wang, Kailong Wang, Ling Shi, Shide Zhou.

**Figure 2.** Workflow of Our Study: A Three-Step Detection Framework Based on HSF (Critical Layer Analysis, Classifier Training, …

**Figure 3.** Visualization of Hidden State Geometry using t-SNE with RSA metrics across three threat models (Jailbreak, …

**Figure 4.** Ratio of the Number of Active Neurons in the Attention and MLP Layers of Llama-2-7b-chat for Normal and Attack …

**Figure 5.** ROC curves of AbnorDetector-Lite and AbnorDetector-Full on (a) Jailbreak, (b) Hallucination, and (c) Backdoor tasks.

Original abstract

The widespread adoption of Large Language Models (LLMs) in critical applications has introduced severe reliability and security risks, as LLMs remain vulnerable to notorious threats such as hallucinations, jailbreak attacks, and backdoor exploits. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised LLM-embedded system integrity. In this work, we introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics. By systematically inspecting layer-specific activation patterns, we develop a general framework that can efficiently identify a range of security threats in real-time without imposing prohibitive computational costs. Extensive experiments indicate detection accuracies exceeding 95% and consistently robust performance across multiple models in most scenarios, while preserving the ability to detect novel attacks effectively. Furthermore, the computational overhead remains minimal, with detector inference taking merely fractions of a second. The significance of this work lies in proposing a promising strategy to reinforce the security of LLM-integrated systems, paving the way for safer and more reliable deployment in high-stakes domains. By enabling real-time detection that can also support the mitigation of abnormal behaviors, it represents a meaningful step toward ensuring the trustworthiness of AI systems amid rising security challenges.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Abstract claims 95%+ accuracy and novel-attack detection via hidden-state anomaly detection, but supplies zero experimental details, baselines, or transfer results to back it up.

The core idea is to treat layer-wise activations as a signal for spotting hallucinations, jailbreaks, and backdoors in one pass. That framing is straightforward and could be practical if the patterns turn out to be stable. The paper does at least name the three threat classes and emphasize low inference cost, which matches what practitioners care about for real-time monitoring. Beyond that, nothing in the abstract shows a technical step that moves past standard activation-based anomaly detection already explored in interpretability work. The stress-test note lands cleanly: the generalizability claim for unseen attack families rests on an assertion, not on reported zero-shot or cross-family results. The abstract only mentions held-out examples from the same distributions and gives no ablation on whether per-threat tuning is required. Soundness is the bigger issue here. Accuracy numbers above 95 percent, robustness across models, and minimal overhead are stated without datasets, baselines, error breakdowns, or even model sizes, so there is no way to judge whether the data support the conclusions. This is the kind of paper that might interest an LLM-security reading group if the full experiments are properly controlled, but right now the evidence gap is large enough that most readers would wait for a revised version. I would send it to review so referees can check whether the hidden-state signatures actually transfer or whether the work reduces to per-attack classifiers with optimistic reporting.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces a framework for detecting security threats in LLMs (hallucinations, jailbreaks, backdoors) by analyzing layer-specific hidden state activation patterns. It claims real-time identification of these threats with >95% accuracy, robustness across models, minimal overhead (fractions of a second inference), and the ability to detect novel attacks effectively without prohibitive costs or model-specific retraining.

Significance. If substantiated, the approach would offer a practical, low-overhead method for real-time threat detection in LLM systems, addressing a pressing need in AI security. The core idea of using hidden-state forensics for multiple threat types in a general framework has potential impact if the transferability claims are validated.

major comments (2)

[Abstract] The central claim that the method 'detect[s] novel attacks effectively' without retraining or tuning is load-bearing for the generality assertion, yet the reported experiments use only held-out examples from the same attack distributions; no zero-shot results on new attack families (e.g., unseen jailbreak templates or backdoor triggers) are described.
[Experimental section] Assertions of accuracies exceeding 95% and 'consistently robust performance across multiple models in most scenarios' are presented without baselines, error analysis, dataset descriptions, or statistical details, preventing verification that the data support the claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below and outline planned revisions.

Referee: [Abstract] The central claim that the method 'detect[s] novel attacks effectively' without retraining or tuning is load-bearing for the generality assertion, yet the reported experiments use only held-out examples from the same attack distributions; no zero-shot results on new attack families (e.g., unseen jailbreak templates or backdoor triggers) are described.

Authors: We acknowledge the distinction. Our experiments evaluate generalization to held-out instances drawn from the same attack distributions (e.g., unseen jailbreak prompts within the same template families and backdoor triggers from the same generation process). The manuscript uses 'novel attacks' to refer to these unseen instances rather than entirely new attack families. We will revise the abstract and discussion sections to clarify this scope and explicitly note the absence of zero-shot evaluation on new attack families as a limitation.

revision: partial

Referee: [Experimental section] Assertions of accuracies exceeding 95% and 'consistently robust performance across multiple models in most scenarios' are presented without baselines, error analysis, dataset descriptions, or statistical details, preventing verification that the data support the claims.

Authors: We agree that additional experimental details are needed for verifiability. The revised manuscript will add baseline comparisons against existing detection methods, error analysis (including false-positive/negative breakdowns), complete dataset descriptions with sizes and sources, and statistical details such as standard deviations or significance tests supporting the >95% accuracy and cross-model robustness claims.

revision: yes
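
The error breakdown and statistical detail discussed in this exchange can be sketched in a few lines. This is a generic illustration, not the authors' analysis: the labels and predictions are made up, and the Wilson score interval is just one common choice for putting an uncertainty range on a reported accuracy.

```python
import math

def error_breakdown(y_true, y_pred):
    """Confusion counts plus false-positive and false-negative rates."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)
    tn = sum(1 for t, p in pairs if not t and not p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        "fpr": fp / (fp + tn) if fp + tn else float("nan"),
        "fnr": fn / (fn + tp) if fn + tp else float("nan"),
    }

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half

# Made-up example: 10 attack samples (True) and 10 benign samples (False).
y_true = [True] * 10 + [False] * 10
y_pred = [True] * 9 + [False] + [False] * 8 + [True] * 2

report = error_breakdown(y_true, y_pred)
low, high = wilson_interval(report["tp"] + report["tn"], len(y_true))
```

On a sample this small the interval is wide, which is why a headline accuracy figure is hard to interpret without the dataset sizes the referee asks for.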

Circularity Check

0 steps flagged

No derivation chain or fitted predictions; purely empirical framework with no self-referential reductions

The paper presents an empirical detection framework based on inspecting layer-specific activation patterns, with claims supported by experimental accuracies (>95%) across models. No equations, parameter fittings, derivations, or mathematical predictions are described in the provided text. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The central claims rest on experimental results rather than any chain that reduces to its own inputs by construction, making the work self-contained against external benchmarks with no identifiable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no technical details, so no free parameters, axioms, or invented entities can be identified.




work page doi:10.18653/v1/2024.findings-acl.443 2024

[7] [7]

Detecting hallucinations in large language models using semantic entropy,

S. Farquhar, J. Kossen, L. Kuhn, and Y . Gal, “Detecting hallucinations in large language models using semantic entropy,”Nature, vol. 630, no. 8017, pp. 625–630, 2024

work page 2024

[8] [8]

Baseline defenses for adversarial attacks against aligned language models,

N. Jain, A. Schwarzschild, Y . Wen, G. Somepalli, J. Kirchenbauer, P. yeh Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein, “Baseline defenses for adversarial attacks against aligned language models,” 2023

work page 2023

[9] [9]

”do anything now

X. Shen, Z. Chen, M. Backes, Y . Shen, and Y . Zhang, “”do anything now”: Characterizing and evaluating in-the-wild jailbreak prompts on large language models,” inProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, CCS 2024, Salt Lake City, UT, USA, October 14-18, 2024, B. Luo, X. Liao, J. Xu, E. Kirda, and D. Lie, Ed...

work page doi:10.1145/3658644.3670388 2024

[10] [10]

A systematic review of poisoning attacks against large language models,

N. Fendley, E. W. Staley, J. Carney, W. Redman, M. Chau, and N. Drenkow, “A systematic review of poisoning attacks against large language models,”CoRR, vol. abs/2506.06518, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2506.06518

work page doi:10.48550/arxiv.2506.06518 2025

[11] [11]

Large legal fictions: Profiling legal hallucinations in large language models,

M. Dahl, V . Magesh, M. Suzgun, and D. E. Ho, “Large legal fictions: Profiling legal hallucinations in large language models,”CoRR, vol. abs/2401.01301, 2024. [Online]. Available: https://doi.org/10.48550/ arXiv.2401.01301

work page arXiv 2024

[12] [12]

Improving reliability and explainability of medical question answering through atomic fact checking in retrieval-augmented llms,

J. Vladika, A. Domres, M. Nguyen, R. Moser, J. Nano, F. Busch, L. C. Adams, K. K. Bressem, D. Bernhardt, S. E. Combs, K. J. Borm, F. Matthes, and J. C. Peeken, “Improving reliability and explainability of medical question answering through atomic fact checking in retrieval-augmented llms,”CoRR, vol. abs/2505.24830, 2025. [Online]. Available: https://doi.o...

work page doi:10.48550/arxiv.2505.24830 2025

[13] [13]

Understanding the effectiveness of coverage criteria for large language models: A special angle from jailbreak attacks,

S. Zhou, T. Li, K. Wang, Y . Huang, L. Shi, Y . Liu, and H. Wang, “Understanding the effectiveness of coverage criteria for large language models: A special angle from jailbreak attacks,” 2025. [Online]. Available: https://arxiv.org/abs/2408.15207

work page arXiv 2025

[14] [14]

Deepxplore: Automated whitebox testing of deep learning systems,

K. Pei, Y . Cao, J. Yang, and S. Jana, “Deepxplore: Automated whitebox testing of deep learning systems,” inProceedings of the 26th Symposium on Operating Systems Principles, ser. SOSP ’17. New York, NY , USA: Association for Computing Machinery, 2017, p. 1–18. [Online]. Available: https://doi.org/10.1145/3132747.3132785

work page doi:10.1145/3132747.3132785 2017

[15] [15]

Deepgauge: multi-granularity testing criteria for deep learning systems,

L. Ma, F. Juefei-Xu, F. Zhang, J. Sun, M. Xue, B. Li, C. Chen, T. Su, L. Li, Y . Liu, J. Zhao, and Y . Wang, “Deepgauge: multi-granularity testing criteria for deep learning systems,” inProceedings of the 33rd ACM/IEEE International Conference on Automated Software Engineering, ser. ASE ’18. New York, NY , USA: Association for Computing Machinery, 2018, p...

work page doi:10.1145/3238147.3238202 2018

[16] [16]

Code coverage and test suite effectiveness: Empirical study with real bugs in large systems,

P. S. Kochhar, F. Thung, and D. Lo, “Code coverage and test suite effectiveness: Empirical study with real bugs in large systems,” in2015 IEEE 22nd International Conference on Software Analysis, Evolution, and Reengineering (SANER), 2015, pp. 560–564. 16

work page 2015

[17] [17]

Llama: Open and efficient foundation language models,

H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample, “Llama: Open and efficient foundation language models,” 2023

work page 2023

[18] [18]

Llama 3 model card,

AI@Meta, “Llama 3 model card,” 2024, accessed: 2025-1-7. [Online]. Available: https://github.com/meta-llama/llama3/blob/main/ MODEL CARD.md

work page 2024

[19] [19]

Gemma: Open Models Based on Gemini Research and Technology

G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivi `ere, M. S. Kale, J. Love, and et al., “Gemma: Open models based on gemini research and technology,” 2024. [Online]. Available: https://arxiv.org/abs/2403.08295

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Instruction Tuning with GPT-4

B. Peng, C. Li, P. He, M. Galley, and J. Gao, “Instruction tuning with gpt-4,” 2023. [Online]. Available: https://arxiv.org/abs/2304.03277

work page internal anchor Pith review Pith/arXiv arXiv 2023

[21] [21]

Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks,

W. Luo, S. Ma, X. Liu, X. Guo, and C. Xiao, “Jailbreakv-28k: A benchmark for assessing the robustness of multimodal large language models against jailbreak attacks,” 2024. [Online]. Available: https://arxiv.org/abs/2404.03027

work page arXiv 2024

[22] [22]

Universal and transferable adversarial attacks on aligned language models,

A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” 2023

work page 2023

[23] [23]

arXiv preprint arXiv:2402.08679 (2024)

X. Guo, F. Yu, H. Zhang, L. Qin, and B. Hu, “Cold-attack: Jail- breaking llms with stealthiness and controllability,”arXiv preprint arXiv:2402.08679, 2024

work page arXiv 2024

[24] [24]

arXiv preprint arXiv:2404.02151 (2024)

M. Andriushchenko, F. Croce, and N. Flammarion, “Jailbreaking lead- ing safety-aligned llms with simple adaptive attacks,”arXiv preprint arXiv:2404.02151, 2024

work page arXiv 2024

[25] [25]

Truthfulqa: Measuring how models mimic human falsehoods,

S. Lin, J. Hilton, and O. Evans, “Truthfulqa: Measuring how models mimic human falsehoods,” 2021

work page 2021

[26] [26]

Halueval: A large-scale hallucination evaluation benchmark for large language models,

J. Li, X. Cheng, W. X. Zhao, J.-Y . Nie, and J.-R. Wen, “Halueval: A large-scale hallucination evaluation benchmark for large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2305.11747

work page arXiv 2023

[27] [27]

Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,

N. Li, Y . Li, Y . Liu, L. Shi, K. Wang, and H. Wang, “Drowzee: Metamorphic testing for fact-conflicting hallucination detection in large language models,”Proc. ACM Program. Lang., vol. 8, no. OOPSLA2, Oct. 2024. [Online]. Available: https://doi.org/10.1145/3689776

work page doi:10.1145/3689776 2024

[28] [28]

Crowdsourcing multiple choice science questions,

M. G. Johannes Welbl, Nelson F. Liu, “Crowdsourcing multiple choice science questions,” 2017

work page 2017

[29] [29]

Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models

Y . Li, H. Huang, Y . Zhao, X. Ma, and J. Sun, “Backdoorllm: A comprehensive benchmark for backdoor attacks on large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2408.12798

work page arXiv 2024

[30] [30]

BadNets: Identifying Vulnerabilities in the Machine Learning Model Supply Chain

T. Gu, B. Dolan-Gavitt, and S. Garg, “Badnets: Identifying vulnerabilities in the machine learning model supply chain,” 2019. [Online]. Available: https://arxiv.org/abs/1708.06733

work page internal anchor Pith review Pith/arXiv arXiv 2019

[31] [31]

Backdooring instruction-tuned large language models with virtual prompt injection,

J. Yan, V . Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V . Srinivasan, X. Ren, and H. Jin, “Backdooring instruction-tuned large language models with virtual prompt injection,” inProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), K. Duh, H. Go...

work page 2024

[32] [32]

GradSafe: Detecting jailbreak prompts for LLMs via safety-critical gradient analysis,

Y . Xie, M. Fang, R. Pi, and N. Gong, “GradSafe: Detecting jailbreak prompts for LLMs via safety-critical gradient analysis,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024...

work page 2024

[33] [33]

Lynx: An open source hallucination evaluation model,

S. S. Ravi, B. Mielczarek, A. Kannappan, D. Kiela, and R. Qian, “Lynx: An open source hallucination evaluation model,” 2024. [Online]. Available: https://arxiv.org/abs/2407.08488

work page arXiv 2024

[34] [34]

Onion: A simple and effective defense against textual backdoor attacks,

F. Qi, Y . Chen, M. Li, Y . Yao, Z. Liu, and M. Sun, “Onion: A simple and effective defense against textual backdoor attacks,”arXiv preprint arXiv:2011.10369, 2020

work page arXiv 2011

[35] [35]

Cc: Causality-aware coverage cri- terion for deep neural networks,

Z. Ji, P. Ma, Y . Yuan, and S. Wang, “Cc: Causality-aware coverage cri- terion for deep neural networks,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 1788–1800

work page 2023

[36] [36]

Revisiting neuron coverage for dnn testing: A layer-wise and distribution-aware criterion,

Y . Yuan, Q. Pang, and S. Wang, “Revisiting neuron coverage for dnn testing: A layer-wise and distribution-aware criterion,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023, pp. 1200–1212

work page 2023

[37] [37]

Exposing the ghost in the transformer: Abnormal detection for large language models via hidden state forensics,

LLM-Abnormal-Detection, “Exposing the ghost in the transformer: Abnormal detection for large language models via hidden state forensics,” 2026, accessed: 2026-1-10. [Online]. Available: https: //sites.google.com/view/llm-abnormal-detection

work page 2026

[38] [38]

Do llms know about hallucination? an empirical investigation of llm’s hidden states,

H. Duan, Y . Yang, and K. Y . Tam, “Do llms know about hallucination? an empirical investigation of llm’s hidden states,” 2024. [Online]. Available: https://arxiv.org/abs/2402.09733

work page arXiv 2024

[39] [39]

MASTERKEY: automated jailbreaking of large language model chatbots,

G. Deng, Y . Liu, Y . Li, K. Wang, Y . Zhang, Z. Li, H. Wang, T. Zhang, and Y . Liu, “MASTERKEY: automated jailbreaking of large language model chatbots,” in31st Annual Network and Distributed System Security Symposium, NDSS 2024, San Diego, California, USA, February 26 - March 1, 2024. The Internet Society,

work page 2024

[40] [40]

Available: https://www.ndss-symposium.org/ndss-paper/ masterkey-automated-jailbreaking-of-large-language-model-chatbots/

[Online]. Available: https://www.ndss-symposium.org/ndss-paper/ masterkey-automated-jailbreaking-of-large-language-model-chatbots/

work page

[41] [41]

Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,

L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y . Choi, and N. Dziri, “Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.18510

work page arXiv 2024

[42] [42]

Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes,

X. Hu, P.-Y . Chen, and T.-Y . Ho, “Gradient cuff: Detecting jailbreak attacks on large language models by exploring refusal loss landscapes,”

work page

[43] [43]

Available: https://arxiv.org/abs/2403.00867

[Online]. Available: https://arxiv.org/abs/2403.00867

work page arXiv

[44] [44]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

H. Inan, K. Upasani, J. Chi, R. Rungta, K. Iyer, Y . Mao, M. Tontchev, Q. Hu, B. Fuller, D. Testuggine, and M. Khabsa, “Llama guard: Llm-based input-output safeguard for human-ai conversations,” 2023. [Online]. Available: https://arxiv.org/abs/2312.06674

work page internal anchor Pith review Pith/arXiv arXiv 2023

[45] [45]

The Internal State of an LLM Knows When It's Lying

A. Azaria and T. Mitchell, “The internal state of an llm knows when it’s lying,” 2023. [Online]. Available: https://arxiv.org/abs/2304.13734

work page internal anchor Pith review Pith/arXiv arXiv 2023

[46] [46]

Llm internal states reveal hallucination risk faced with a query,

Z. Ji, D. Chen, E. Ishii, S. Cahyawijaya, Y . Bang, B. Wilie, and P. Fung, “Llm internal states reveal hallucination risk faced with a query,” 2024. [Online]. Available: https://arxiv.org/abs/2407.03282

work page arXiv 2024

[47] [47]

In-context sharpness as alerts: An inner representation perspective for hallucination mitigation,

S. Chen, M. Xiong, J. Liu, Z. Wu, T. Xiao, S. Gao, and J. He, “In-context sharpness as alerts: An inner representation perspective for hallucination mitigation,” 2024. [Online]. Available: https://arxiv.org/abs/2403.01548

work page arXiv 2024

[48] [48]

Quantifying uncertainty in answers from any language model and enhancing their trustworthiness,

J. Chen and J. Mueller, “Quantifying uncertainty in answers from any language model and enhancing their trustworthiness,” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, p...

work page 2024

[49] [49]

Self-alignment for factuality: Mitigating hallucinations in LLMs via self-evaluation,

X. Zhang, B. Peng, Y . Tian, J. Zhou, L. Jin, L. Song, H. Mi, and H. Meng, “Self-alignment for factuality: Mitigating hallucinations in LLMs via self-evaluation,” inProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), L.-W. Ku, A. Martins, and V . Srikumar, Eds. Bangkok, Thailand: Association for...

work page 2024

[50] [50]

Bddr: An effective defense against textual backdoor attacks,

K. Shao, J. Yang, Y . Ai, H. Liu, and Y . Zhang, “Bddr: An effective defense against textual backdoor attacks,”Computers & Security, vol. 110, p. 102433, 2021. [Online]. Available: https: //www.sciencedirect.com/science/article/pii/S0167404821002571

work page 2021

[51] [51]

Rap: Robustness-aware perturbations for defending against backdoor attacks on nlp models,

W. Yang, Y . Lin, P. Li, J. Zhou, and X. Sun, “Rap: Robustness-aware perturbations for defending against backdoor attacks on nlp models,” arXiv preprint arXiv:2110.07831, 2021

work page arXiv 2021

[52] [52]

Bdmmt: Backdoor sample detection for language models through model mutation testing,

J. Wei, M. Fan, W. Jiao, W. Jin, and T. Liu, “Bdmmt: Backdoor sample detection for language models through model mutation testing,”IEEE Transactions on Information Forensics and Security, vol. 19, pp. 4285– 4300, 2024

work page 2024

[53] [53]

Cleangen: Mitigating backdoor attacks for generation tasks in large language models,

Y . Li, Z. Xu, F. Jiang, L. Niu, D. Sahabandu, B. Ramasubramanian, and R. Poovendran, “Cleangen: Mitigating backdoor attacks for generation tasks in large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.12257

work page arXiv 2024

[54] [54]

Chain-of-scrutiny: Detecting backdoor attacks for large language models,

X. Li, Y . Zhang, R. Lou, C. Wu, and J. Wang, “Chain-of-scrutiny: Detecting backdoor attacks for large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2406.05948

work page arXiv 2024

[55] [55]

Mocc-bd-fid: Multi-objective clustering combination-based backdoor defense for federated intrusion detection of industrial control systems,

G. Zeng, J. Shao, K. Lu, G. Geng, and J. Weng, “Mocc-bd-fid: Multi-objective clustering combination-based backdoor defense for federated intrusion detection of industrial control systems,”IEEE Trans. Inf. Forensics Secur., vol. 20, pp. 6868–6883, 2025. [Online]. Available: https://doi.org/10.1109/TIFS.2025.3586479

work page doi:10.1109/tifs.2025.3586479 2025

[56] [56]

Automated federated learning-based adversarial attack and defence in industrial control systems,

G.-Q. Zeng, J.-M. Shao, K.-D. Lu, G.-G. Geng, and J. Weng, “Automated federated learning-based adversarial attack and defence in industrial control systems,”IET Cyber-Systems and Robotics, vol. 6, no. 2, p. e12117, 2024. [Online]. Available: https://ietresearch. onlinelibrary.wiley.com/doi/abs/10.1049/csy2.12117

work page doi:10.1049/csy2.12117 2024