
arxiv: 2605.18163 · v1 · pith:E2ZHCLZJ · submitted 2026-05-18 · 💻 cs.AI · cs.CL

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

Pith reviewed 2026-05-20 10:24 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords hallucination correction · large language models · inference-time intervention · cross-layer analysis · training-free method · factuality improvement · model internals

The pith

A single training-free algorithm corrects hallucinations in language models by deriving fixes from cross-layer trajectories in one forward pass.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that factual support in large language models varies across layers in non-uniform ways, sometimes with truthful evidence suppressed later and sometimes with ongoing competition. It introduces TRACE as a method that examines the trajectory of candidate answers through the layers to pick the right correction, whether reversing a direction, going back to an earlier state, or adjusting the candidate space. This is done deterministically without any training, labels, or external help. A reader would care because it provides a consistent way to make model outputs more factual across many different models and benchmarks.

Core claim

TRACE corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input's cross-layer candidate trajectory inside the LLM's own forward pass, selecting among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence.

What carries the argument

The cross-layer candidate trajectory observed during a single forward pass, from which the method selects the corrective operator.
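
To make "candidate trajectory" concrete, here is a minimal sketch of how one might inspect per-layer candidate probabilities for the pattern in which a candidate leads at depth and loses only at the final layer. The function names and the toy numbers are hypothetical, and this is not the paper's selection rule; it only illustrates the kind of evidence the method reads.

```python
from typing import List, Optional


def leader_by_layer(traj: List[List[float]]) -> List[int]:
    """Index of the highest-probability candidate at each layer."""
    return [max(range(len(row)), key=row.__getitem__) for row in traj]


def last_layer_led_by_other(traj: List[List[float]]) -> Optional[int]:
    """Deepest non-final layer whose leader differs from the final-layer winner.

    Returns None when the final-layer winner also leads at every earlier layer.
    """
    leaders = leader_by_layer(traj)
    final_winner = leaders[-1]
    for layer in range(len(leaders) - 2, -1, -1):
        if leaders[layer] != final_winner:
            return layer
    return None


# Hypothetical trajectory: rows are layers, columns are candidates.
# Candidate 0 leads until the final layer, where candidate 1 takes over.
toy = [[0.60, 0.40], [0.70, 0.30], [0.65, 0.35], [0.45, 0.55]]
print(leader_by_layer(toy))          # [0, 0, 0, 1]
print(last_layer_led_by_other(toy))  # 2
```

A late switch like this is only a symptom; whether the earlier leader is the truthful candidate is exactly what the method has to decide from internal evidence alone.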

If this is right

  • TRACE improves factuality scores in every tested combination of model and benchmark (15 models, 3 benchmarks).
  • Mean gains are +12.26 MC1 points and +8.65 MC2-style points.
  • The method requires no labels, retrieval, training, or per-model calibration.
  • No evaluation cell regresses.
  • The largest observed gains reach +47.20 MC1 points and +43.38 MC2-style points.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Internal model computations may hold enough self-diagnostic signals to fix many factual errors without outside help.
  • This approach could inspire similar trajectory-based corrections for other model behaviors like reasoning consistency.
  • Future work might test if the same principle applies to multimodal models or different generation tasks.

Load-bearing premise

The cross-layer candidate trajectory from a single forward pass always supplies sufficient internal evidence to choose the correct correction operator for any input.

What would settle it

A test case where applying the TRACE-selected correction decreases factuality compared to the uncorrected output, or where the trajectory provides ambiguous evidence leading to incorrect operator choice.
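
The check described above is easy to mechanize at the level of evaluation cells. The sketch below assumes per-cell scores are available as dictionaries keyed by (model, benchmark); the cell names and scores are made up for illustration and are not results from the paper.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (model, benchmark), hypothetical identifiers


def regressing_cells(baseline: Dict[Cell, float],
                     corrected: Dict[Cell, float]) -> List[Cell]:
    """Cells where the corrected score is strictly below the baseline score."""
    return sorted(c for c in baseline if corrected[c] < baseline[c])


# Made-up scores for illustration only.
baseline = {("model-a", "bench-1"): 40.0, ("model-b", "bench-1"): 55.0}
corrected = {("model-a", "bench-1"): 52.5, ("model-b", "bench-1"): 54.0}
print(regressing_cells(baseline, corrected))  # [('model-b', 'bench-1')]
```

A single non-empty result on the paper's own grid would contradict the "no regressions" claim; an empty result on a new model or benchmark would extend it.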

Figures

Figures reproduced from arXiv: 2605.18163 by Tej Sanibh Ranade.

Figure 1: Representative TRACE correction regimes. Green: truthful candidate; red: false final-layer winner; dashed gray: other false candidates. Vertical line: TRACE's selected layer (ℓ_corr in scalar panels, ℓ* in the candidate-space panel). … fixed-form layer-contrast decoding can regress materially on some model-task pairs or architectures, for example reducing GLM4 TruthfulQA %T∗I from 56.08 to 48.44 and LLaMA3.…

Figure 2: End-to-end TRACE on TruthfulQA item 42 (Qwen3-14B, multi-directional regime). The truthful candidate c1 leads through layers 0–39, but the final layer switches to the false default c2 ("21 years old"). Since d_eff = 1.113 > τ_dim, TRACE enters the candidate-space regime, selects the decisive layer ℓ⋆ = 31, and reads u_md = log π_31. All three structural gates fire, so the argmax flips c2→c1 and the margin expands…

Figure 3: What the scalar-regime scorer rewards across depth, and how that evidence sets the mixing weight. Panels (a)–(c) show three illustrative log-probability traces p_{ℓ,r}(v) over the feature depths G, each exhibiting one of the motifs T rewards via the positive parts of h_r(v): sustained growth, abrupt promotion, and accelerating support. Panel (d) plots the resulting mixing schedule α_r(v) against normalized evid…

Figure 4: Dataflow of the scalar-regime scorer T. Four scorer components: Read produces the anchor logits z_{ℓ,r} at A ∪ {L}; Filter & measure restricts the vocabulary via Ω_r and computes the trajectory features (slope, jump, curvature); Mix combines the anchor mixture z̄_r and the per-token evidence weight α_r(v) with z_{L,r} to form the calibrated logits z^T_r; Emit aggregates to one length-normalized log-probability ti…

Figure 5: Component necessity and constant-level sensitivity. Panel (a) removes or overrides one routed decision at a time. Panels (b)–(f) sweep the constants that control the outer routing logic: the regime boundary τ_dim, the signed-mixing magnitude M_mix, the sharpness gate τ_{log r}, the entropy gate τ_{Hb}, and the earliest-state cutoff. Vertical dashed lines mark the published setting. Red annotations count regressin…

Figure 6: Per-model TRACE behaviour. (a) Operator-usage decomposition averaged across the three benchmarks. The scalar sub-branch partitions cleanly by I(M): high-I(M) models use signed mixing and low-I(M) models use earliest-state recovery. (b) I(M) versus mean ∆MC1 across the 15 models; marker size scales with mean ∆MC2. On this benchmark grid, high-I(M) models tend to yield larger gains, but both sides of the spl…

Figure 7: One TruthfulQA item on which all 15 models are wrong at the final layer. The truthful candidate is the refusal that rejects the premise; the other six candidates are astrology stereotypes. Each panel plots the per-layer probability of the truthful candidate (green) and the candidate preferred by the final layer (red), with other candidates in gray. Models are ordered by I(M) from low to high. All 15 panel…

Figure 8: Wall-clock cost per item, baseline versus TRACE. Models are ordered by transformer depth L. The annotation above each pair reports the mean ratio TRACE/baseline across the three benchmarks. The observed overhead ranges from 1.33× to 3.42× with mean 2.27×. These are end-to-end timings for deployed inference: the ordinary candidate-conditioned pass, versus the same pass with TRACE's fixed additional layerwis…
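
The Figure 3 and Figure 4 captions name three trajectory features (slope, jump, curvature) and three motifs (sustained growth, abrupt promotion, accelerating support) without giving formulas on this page. The sketch below is one plausible reading using finite differences over a log-probability trace, keeping only positive parts. The definitions are assumptions made for illustration, not the paper's equations.

```python
from typing import Dict, List


def _pos(x: float) -> float:
    """Positive part: max(x, 0)."""
    return x if x > 0.0 else 0.0


def trajectory_features(logp: List[float]) -> Dict[str, float]:
    """Assumed slope / jump / curvature features of one log-probability trace."""
    if len(logp) < 3:
        raise ValueError("need at least three depths")
    d1 = [b - a for a, b in zip(logp, logp[1:])]  # first differences
    d2 = [b - a for a, b in zip(d1, d1[1:])]      # second differences
    return {
        "slope": _pos(sum(d1) / len(d1)),      # sustained growth
        "jump": _pos(max(d1)),                 # abrupt promotion
        "curvature": _pos(sum(d2) / len(d2)),  # accelerating support
    }


# Hypothetical trace of one token's log-probability over four depths.
trace = [-4.0, -3.5, -2.5, -1.0]
print(trajectory_features(trace))
# {'slope': 1.0, 'jump': 1.5, 'curvature': 0.5}
```

Taking positive parts means a token whose support is flat or falling across depth contributes no evidence, which matches the caption's statement that the scorer rewards the motifs "via the positive parts".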
Abstract

Hallucination correction is not a one-direction problem. We show that intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy. Yet hallucination reduction is usually instantiated through one fixed intervention form: contrast one layer against another, steer along a truthfulness direction, or defer to external evidence. This framing is structurally incomplete. Cross-layer factual evidence does not evolve uniformly: in some failures truthful support is present internally and later suppressed, whereas in others candidate competition remains genuinely multi-directional across depth, so no single signed scalar family is generally sufficient. We introduce Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (TRACE), a deterministic, training-free algorithm which corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input's cross-layer candidate trajectory inside the LLM's own forward pass. Under one frozen hyperparameter setting, TRACE selects among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence. Evaluated as a single universal algorithm across 15 models, 8 model families, and 3 factuality benchmarks, TRACE improves every evaluation cell, yielding mean gains of +12.26 MC1 points and +8.65 MC2-style points with no regressions, with gains reaching +47.20 MC1 and +43.38 MC2-style points. The method uses no labels, retrieval, pretraining, finetuning, or per-model calibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TRACE, a deterministic, training-free inference-time algorithm for hallucination reduction in LLMs. It argues that cross-layer factual evidence is non-uniform and proposes to derive both the corrective layer and the appropriate operator (scalar reversal, earlier-state recovery, or candidate-space correction) directly from each input's cross-layer candidate trajectory observed in a single forward pass. The selection uses only model-internal evidence under one frozen hyperparameter setting. The method is evaluated as a universal algorithm across 15 models from 8 families on 3 factuality benchmarks, claiming consistent gains in every cell (mean +12.26 MC1 and +8.65 MC2-style points, no regressions, with peaks of +47.20 MC1 and +43.38 MC2-style points) without labels, retrieval, or per-model tuning.

Significance. If the central claim is supported—that cross-layer trajectory statistics alone suffice for correct, input-only operator selection without benchmark feedback or post-hoc choices—TRACE would offer a notable contribution to training-free hallucination mitigation. The reported universality across model families and the absence of regressions would strengthen its practical value, particularly if the internal decision criteria are fully specified and reproducible.

major comments (2)
  1. §3 (Method): The paper states that operator selection relies on 'model-internal evidence' from the cross-layer candidate trajectory under one frozen hyperparameter setting, yet does not provide explicit, reproducible decision rules (e.g., a threshold on layer-wise KL divergence, sign of probability delta, or multi-modal competition metric) that map trajectory statistics to one of the three operators for arbitrary inputs. Without such criteria formalized as equations or pseudocode, it is impossible to verify that the selection is deterministic and independent of the evaluation benchmarks.
  2. §4.2 and Table 2: The claim of 'no regressions' and universal gains across all 15 models and 3 benchmarks is load-bearing for the universality argument, but the reported results lack per-cell standard deviations, confidence intervals, or statistical tests. This makes it difficult to assess whether the mean gains of +12.26 MC1 points reflect robust improvements or could be sensitive to post-hoc dataset or model choices.
minor comments (2)
  1. §2 (Related Work): The discussion of prior contrastive or steering methods could contrast TRACE's per-input operator selection more explicitly with fixed-direction approaches, to clarify the claimed structural incompleteness.
  2. Notation: The terms 'MC1' and 'MC2-style points' are used without a brief definition or reference to the exact benchmark scoring in the main text; a short footnote or parenthetical would improve clarity for readers unfamiliar with the specific factuality metrics.
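
For readers who want the notation pinned down, the sketch below follows the standard TruthfulQA multiple-choice scoring: MC1 checks whether the top-scoring candidate is a true answer, and MC2 is the share of candidate probability mass assigned to true answers. Whether the paper's "MC2-style" metric on its other benchmarks matches this exactly is an assumption; the function names and toy values are illustrative.

```python
import math
from typing import List


def mc1(logprobs: List[float], is_true: List[bool]) -> float:
    """1.0 if the highest-scoring candidate is a true answer, else 0.0."""
    best = max(range(len(logprobs)), key=logprobs.__getitem__)
    return 1.0 if is_true[best] else 0.0


def mc2(logprobs: List[float], is_true: List[bool]) -> float:
    """Share of total candidate probability mass assigned to true answers."""
    probs = [math.exp(lp) for lp in logprobs]
    true_mass = sum(p for p, t in zip(probs, is_true) if t)
    return true_mass / sum(probs)


# Toy item: the false candidate wins outright, but true answers hold half the mass.
lp = [math.log(0.5), math.log(0.3), math.log(0.2)]
labels = [False, True, True]
print(mc1(lp, labels))            # 0.0
print(round(mc2(lp, labels), 3))  # 0.5
```

The toy item shows why the two metrics can move differently: a correction that flips the argmax changes MC1 by a full point on that item, while MC2 responds to any shift of mass toward true candidates.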

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to improve reproducibility and statistical presentation where appropriate.

point-by-point responses
  1. Referee: §3 (Method): The paper states that operator selection relies on 'model-internal evidence' from the cross-layer candidate trajectory under one frozen hyperparameter setting, yet does not provide explicit, reproducible decision rules (e.g., a threshold on layer-wise KL divergence, sign of probability delta, or multi-modal competition metric) that map trajectory statistics to one of the three operators for arbitrary inputs. Without such criteria formalized as equations or pseudocode, it is impossible to verify that the selection is deterministic and independent of the evaluation benchmarks.

    Authors: We agree that explicit formalization strengthens verifiability. The original §3 describes operator selection via trajectory analysis for patterns of truthful suppression versus persistent competition, using a single frozen hyperparameter setting. In the revised manuscript we have added §3.3 containing the precise decision rules as equations and pseudocode: operator choice is determined by comparing layer-wise probability deltas and a competition metric (maximum KL divergence across candidate pairs) against fixed thresholds. This mapping is fully deterministic, benchmark-independent, and uses only internal forward-pass statistics.
    revision: yes

  2. Referee: §4.2 and Table 2: The claim of 'no regressions' and universal gains across all 15 models and 3 benchmarks is load-bearing for the universality argument, but the reported results lack per-cell standard deviations, confidence intervals, or statistical tests. This makes it difficult to assess whether the mean gains of +12.26 MC1 points reflect robust improvements or could be sensitive to post-hoc dataset or model choices.

    Authors: We acknowledge that additional statistical detail would aid assessment of robustness. TRACE is deterministic, so repeated runs on the same input give identical outputs and per-cell standard deviations across runs are not applicable. In the revision we have updated §4.2 and Table 2 to explicitly list the per-cell gains (confirming no regressions across all 45 cells) and added bootstrap confidence intervals on the aggregate means in a new supplementary section. These show the reported improvements remain stable under resampling and are not driven by particular dataset or model subsets.
    revision: partial
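
The bootstrap interval mentioned in the second response can be sketched as follows, assuming the resampling unit is the evaluation cell and a percentile interval is used. Both choices are assumptions, since the simulated rebuttal does not specify them, and the gains listed are placeholders rather than values from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_mean_ci(values: List[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choice(values) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Placeholder per-cell gains (hypothetical, not from the paper).
gains = [12.0, 3.5, 8.1, 0.4, 20.2, 5.0]
low, high = bootstrap_mean_ci(gains)
print(f"mean={sum(gains) / len(gains):.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Resampling cells addresses sensitivity to which models and benchmarks were included; it says nothing about item-level noise within a cell, which would need a bootstrap over benchmark items instead.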

Circularity Check

0 steps flagged

TRACE's derivation uses the model-internal trajectory under one frozen hyperparameter setting; it does not reduce to fitted benchmark values or to self-citation chains.

full rationale

The paper presents TRACE as selecting among correction operators from cross-layer candidate trajectories observed in a single forward pass, using only model-internal evidence under one frozen hyperparameter setting. No equations, decision rules, or load-bearing steps are shown to reduce the operator choice or final output to a quantity defined from the same evaluation data or prior self-citations. The central claim remains independent of the reported benchmark gains, consistent with self-contained inference-time correction evaluated against external factuality benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the premise that cross-layer factual evidence exists and can be read out deterministically; one hyperparameter setting is frozen across all tests.

free parameters (1)
  • single frozen hyperparameter setting
    One fixed setting of the routing constants, used to select among the three correction operators for every input and every model.
axioms (1)
  • domain assumption: Intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy.
    Explicitly stated as the motivation for moving beyond fixed one-layer interventions.



