Hallucination as an Anomaly: Dynamic Intervention via Probabilistic Circuits
Pith reviewed 2026-05-08 10:47 UTC · model grok-4.3
The pith
Probabilistic circuits detect LLM hallucinations as anomalies in residual stream states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PCNET is a probabilistic circuit trained as a tractable density estimator over an LLM's residual stream. It computes exact negative log-likelihood for each hidden state to mark hallucinations as geometric anomalies away from the factual manifold. When an anomaly is detected, the companion PC-LDCD method performs contrastive decoding at that step alone, raising truthfulness metrics while preserving originally correct generations.
What carries the argument
PCNET, a probabilistic circuit acting as a density estimator on LLM residual stream activations to compute exact negative log-likelihood as an anomaly score for selective intervention.
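The detection mechanism can be sketched with a stand-in density model. In this sketch a diagonal Gaussian plays the role of the probabilistic circuit (a real PC computes exact likelihoods over a far more expressive, structured density), the 16-dimensional states are synthetic, and the threshold `tau` and its 99th-percentile calibration are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for PCNET: a diagonal Gaussian fitted to "factual" residual states.
factual_states = rng.normal(0.0, 1.0, size=(5000, 16))
mu, sigma = factual_states.mean(axis=0), factual_states.std(axis=0)

def nll(h):
    """Exact negative log-likelihood of one hidden state under the fitted model."""
    z = (h - mu) / sigma
    return 0.5 * float(np.sum(z**2 + np.log(2.0 * np.pi * sigma**2)))

def is_anomalous(h, tau):
    """Detection gate: flag the decoding step when NLL exceeds threshold tau."""
    return nll(h) > tau

# Calibrate tau on held-out factual states (99th-percentile NLL).
tau = float(np.quantile([nll(s) for s in factual_states[:500]], 0.99))

on_manifold = mu + 0.1 * rng.normal(size=16)   # near the factual manifold
off_manifold = mu + 4.0                        # far off the manifold
```

States near the fitted density pass the gate untouched; only states with unusually high NLL would trigger an intervention at that decoding step.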
If this is right
- PCNET reaches AUROC values up to 99 percent for hallucination detection on CoQA, SQuAD v2.0, and TriviaQA across four different LLMs.
- PC-LDCD raises True+Info, MC2, and MC3 scores on TruthfulQA in three of the four tested models.
- The approach lowers mean corruption rate to 53.7 percent while keeping a 79.3 percent preservation rate for originally correct outputs.
- Detection and intervention require no weight changes, sampling, or external verifiers at inference time.
Where Pith is reading between the lines
- Internal residual stream activations may encode a structured manifold of factual knowledge that density estimators can model without retraining the underlying LLM.
- The same anomaly-detection logic could extend to other generation problems such as logical contradictions or unsafe content by retraining the circuit on appropriate labels.
- Because the circuit computes likelihood exactly and without sampling, it could support low-overhead monitoring in production systems that already expose residual states.
Load-bearing premise
Hallucinations reliably appear as geometric anomalies on a factual manifold in the LLM residual stream that can be identified by negative log-likelihood under a tractable probabilistic circuit density estimator without sampling or external verifiers.
What would settle it
If negative log-likelihood scores from the trained probabilistic circuit show no reliable separation between verified factual and hallucinated hidden states across held-out generations on standard QA benchmarks, or if selective intervention fails to raise truthfulness scores relative to baselines, the detection and gating claim would be refuted.
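The separation test is mechanical once per-state NLL scores and factuality labels exist. A minimal rank-based AUROC, with no dependencies beyond the standard library, makes the refutation criterion concrete; the score lists below are illustrative, not data from the paper.

```python
def auroc(scores_halluc, scores_factual):
    """AUROC as a rank statistic: the probability that a hallucinated state
    receives a higher anomaly score (NLL) than a factual one; ties count 0.5.
    1.0 means perfect separation; 0.5 means no separation at all."""
    pairs = len(scores_halluc) * len(scores_factual)
    wins = sum(
        1.0 if h > f else 0.5 if h == f else 0.0
        for h in scores_halluc
        for f in scores_factual
    )
    return wins / pairs

# Illustrative NLL scores: hallucinated states score higher under a density
# model fitted only to factual states.
halluc_nll = [35.1, 41.7, 38.2, 29.9, 44.0]
factual_nll = [18.3, 22.5, 20.1, 25.7, 19.8]
```

An AUROC near 0.5 on held-out generations would refute the detection claim; values approaching 1.0 would support the reported near-perfect separation.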
Original abstract
One of the most critical challenges in Large Language Models is their tendency to hallucinate, i.e., produce factually incorrect responses. Existing approaches show promising results in terms of hallucination correction, but still suffer from a main limitation: they apply corrections indiscriminately to every token, corrupting also the originally correct generations. To overcome this drawback, we propose PCNET, a Probabilistic Circuit trained as a tractable density estimator over the LLM residual stream. The method detects hallucinations as geometric anomalies on the factual manifold, which is done via exact Negative Log-Likelihood computation, hence without the need for sampling, external verifiers, or weight modifications, as in existing techniques. To demonstrate its effectiveness, we exploit PCNET as a dynamic gate that distinguishes hallucinated from factual hidden states at each decoding step. This triggers our second main contribution, PC-LDCD (Probabilistic Circuit Latent Density Contrastive Decoding), only when the latent geometry deviates from factual regions, while leaving correct generations untouched. Across four LLMs, ranging from 1B to 8B models, and four benchmarks covering conversational reasoning, knowledge-intensive QA, reading comprehension, and truthfulness, PCNET achieves near-perfect hallucination detection across CoQA, SQuAD v2.0, and TriviaQA, with AUROC reaching up to 99%. Moreover, PC-LDCD obtains the highest True+Info, MC2, and MC3 scores on TruthfulQA in three out of four models, in comparison with state-of-the-art baselines, while reducing the mean corruption rate to 53.7% and achieving a preservation rate of 79.3%. Our proposed method is publicly available on GitHub.
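The dynamic gate described in the abstract can be sketched abstractly. The contrast step below follows the generic contrastive-decoding recipe (expert logits minus scaled contrast logits); the names `tau`, `alpha`, and both logit vectors are illustrative placeholders, not the paper's actual PC-LDCD formulation.

```python
import numpy as np

def pc_gated_step(expert_logits, contrast_logits, nll_score, tau, alpha=1.0):
    """One decoding step. When the hidden state's NLL stays below tau, the
    expert distribution is returned untouched (no corruption of correct
    generations); otherwise a generic contrastive-decoding adjustment applies."""
    if nll_score <= tau:
        return expert_logits
    return expert_logits - alpha * contrast_logits

expert = np.array([2.0, 1.0, 0.5])     # next-token logits from the LLM
contrast = np.array([0.5, 1.8, 0.2])   # logits from a "less factual" contrast source
```

The design choice the abstract emphasizes is visible here: the intervention cost is paid only at anomalous steps, which is what drives the preservation-rate claim.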
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes PCNET, a probabilistic circuit trained as a tractable density estimator over LLM residual streams, to detect hallucinations as low-density geometric anomalies on a factual manifold using exact negative log-likelihood without sampling or external verifiers. It introduces PC-LDCD to apply selective latent density contrastive decoding only on detected anomalies, leaving correct generations untouched. Experiments across four LLMs (1B-8B) and benchmarks (CoQA, SQuAD v2.0, TriviaQA, TruthfulQA) report AUROC up to 99% for detection and superior True+Info/MC2/MC3 scores with mean corruption rate reduced to 53.7% and preservation rate 79.3%.
Significance. If validated, the approach would be significant for enabling dynamic, targeted hallucination mitigation in LLMs that avoids indiscriminate corruption of correct outputs. The use of exact-inference probabilistic circuits for high-dimensional density estimation on residual streams offers a computationally efficient alternative to sampling-based or verifier-dependent methods, with potential for broader anomaly detection in model internals if the manifold hypothesis generalizes.
major comments (3)
- [Abstract] The training details for PCNET—including residual stream collection (layers, prompts, models), circuit architecture (depth, width, structure learning method), and validation splits or overfitting controls—are entirely absent. These are load-bearing for the central claim that NLL under the fitted PC reliably separates factual from hallucinated states, as the reported AUROC up to 99% cannot be assessed without them.
- [Abstract] No ablations on PC hyperparameters or comparisons to simpler density estimators (e.g., Gaussian or KDE on identical residual-stream features) are provided. This undermines the claim that the PC-based anomaly detection is necessary or superior, as the performance could stem from the residual-stream features themselves rather than the tractable PC structure.
- [Abstract] The anomaly score is defined as the negative log-likelihood of a PC whose parameters are fitted directly to the residual-stream distribution; this creates a circularity risk where detection reduces to a model-internal quantity rather than an independent external benchmark, potentially limiting generalization beyond the evaluated QA benchmarks.
minor comments (1)
- [Abstract] The abstract states the code is publicly available on GitHub but provides neither the repository URL nor any reproducibility artifacts (e.g., exact hyperparameters or data collection scripts).
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important areas for improving the clarity and completeness of our work. We address each major comment point by point below and indicate the revisions planned for the next version of the manuscript.
Point-by-point responses
-
Referee: [Abstract] The training details for PCNET—including residual stream collection (layers, prompts, models), circuit architecture (depth, width, structure learning method), and validation splits or overfitting controls—are entirely absent. These are load-bearing for the central claim that NLL under the fitted PC reliably separates factual from hallucinated states, as the reported AUROC up to 99% cannot be assessed without them.
Authors: We agree that the abstract does not contain these training details and that their absence makes it difficult to fully evaluate the reported AUROC results. Although the main body of the manuscript outlines the overall training approach, we will revise the abstract to include a concise summary of the residual stream collection process, PC architecture choices, and validation procedures. We will also expand the methods section to provide complete specifications for layers, prompts, models, circuit depth and width, structure learning method, data splits, and overfitting controls. This will allow readers to properly assess the central claims. revision: yes
-
Referee: [Abstract] No ablations on PC hyperparameters or comparisons to simpler density estimators (e.g., Gaussian or KDE on identical residual-stream features) are provided. This undermines the claim that the PC-based anomaly detection is necessary or superior, as the performance could stem from the residual-stream features themselves rather than the tractable PC structure.
Authors: This observation is correct, and the current manuscript does not include such ablations or baseline comparisons. We will add these experiments in the revised version, including direct comparisons of the probabilistic circuit against Gaussian mixture models and kernel density estimation using the identical residual-stream features, as well as sensitivity analyses for key hyperparameters such as circuit depth and width. These additions will help isolate the contribution of the PC structure to the observed performance. revision: yes
-
Referee: [Abstract] The anomaly score is defined as the negative log-likelihood of a PC whose parameters are fitted directly to the residual-stream distribution; this creates a circularity risk where detection reduces to a model-internal quantity rather than an independent external benchmark, potentially limiting generalization beyond the evaluated QA benchmarks.
Authors: We appreciate this concern regarding potential circularity. The PC is trained exclusively on residual streams from factual generations using prompts and data disjoint from the evaluation sets, thereby modeling an external factual manifold. The NLL is then applied at inference time to new residual states to detect deviations. This separation ensures the anomaly score is not derived from the same generation process being evaluated. We will add an explicit discussion of this training-inference separation and its implications for generalization in the methods section of the revised manuscript. revision: partial
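The density-baseline comparison promised above can be prototyped with numpy alone: a diagonal Gaussian and a naive Gaussian KDE score identical synthetic "residual-stream" features, so any detection gap attributable to model class becomes measurable. Dimensions, sample sizes, and the bandwidth `h` are arbitrary choices for illustration, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, size=(300, 4))        # "factual" features
test_factual = rng.normal(0.0, 1.0, size=(50, 4))
test_halluc = rng.normal(3.0, 1.0, size=(50, 4))   # shifted off the manifold

def gaussian_nll(X, mu, sigma):
    """Diagonal-Gaussian baseline: per-sample negative log-likelihood."""
    z = (X - mu) / sigma
    return 0.5 * np.sum(z**2 + np.log(2.0 * np.pi * sigma**2), axis=1)

def kde_nll(X, data, h=0.5):
    """Naive Gaussian KDE baseline: NLL of the mean kernel density."""
    d2 = ((X[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
    log_k = -d2 / (2.0 * h**2) - X.shape[1] * np.log(h * np.sqrt(2.0 * np.pi))
    m = log_k.max(axis=1, keepdims=True)            # stable log-mean-exp
    return -(m[:, 0] + np.log(np.exp(log_k - m).mean(axis=1)))

mu, sigma = train.mean(axis=0), train.std(axis=0)
gap_gauss = gaussian_nll(test_halluc, mu, sigma).mean() - gaussian_nll(test_factual, mu, sigma).mean()
gap_kde = kde_nll(test_halluc, train).mean() - kde_nll(test_factual, train).mean()
```

If the PC's AUROC advantage survives against both baselines on real residual features, the referee's concern is answered; if it does not, the features rather than the circuit carry the signal.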
Circularity Check
No significant circularity: empirical validation on external benchmarks
Full rationale
The paper defines PCNET as a density estimator trained on LLM residual-stream activations and uses exact NLL as an anomaly score to flag hallucinations. Detection performance (AUROC up to 99%) and intervention results are measured against ground-truth labels from independent QA benchmarks (CoQA, SQuAD v2.0, TriviaQA, TruthfulQA). No equations or claims reduce the reported metrics to the fitted parameters by construction, no load-bearing self-citations appear, and no ansatz or uniqueness result is smuggled in. The central claims rest on external empirical evaluation rather than tautological redefinition of inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- Probabilistic circuit parameters
axioms (1)
- domain assumption: The residual stream of LLMs contains a factual manifold whose density can be tractably modeled by a probabilistic circuit.
invented entities (2)
- PCNET: no independent evidence
- PC-LDCD: no independent evidence
Reference graph
Works this paper leans on
- [1] Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Transactions on Information Systems, 43(2):1–55, 2025.
- [2] Amos Azaria and Tom Mitchell. The Internal State of an LLM Knows When It's Lying. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 967–976, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.68.
- [3] Samuel Marks and Max Tegmark. The Geometry of Truth: Emergent Linear Structure in Large Language Model Representations of True/False Datasets, 2024. URL https://arxiv.org/abs/2310.06824.
- [4] Kenneth Li, Oam Patel, Fernanda Viégas, Hanspeter Pfister, and Martin Wattenberg. Inference-Time Intervention: Eliciting Truthful Answers from a Language Model. Advances in Neural Information Processing Systems, 36:41451–41530, 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/hash/81b8390039b7302c909cb769f8b6cd93-Abstract-Conference.html.
- [6] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and Editing Factual Associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/hash/6f1d43d5a82a37e89b0665b33bf3a182-Abstract-Conference.html.
- [7] Shaolei Zhang, Tian Yu, and Yang Feng. TruthX: Alleviating Hallucinations by Editing Large Language Models in Truthful Space. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8908–8949, Bangkok, Thailand, August 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.483. URL https://aclanthology.org/2024.acl-long.483/.
- [9] Hoifung Poon and Pedro Domingos. Sum-Product Networks: A New Deep Architecture. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence (UAI), pages 337–346, 2011.
- [10] Pedro Zuidberg Dos Martires. Probabilistic Neural Circuits. Proceedings of the AAAI Conference on Artificial Intelligence, 38(15):17280–17289, 2024. doi: 10.1609/aaai.v38i15.29675. URL https://ojs.aaai.org/index.php/AAAI/article/view/29675.
- [11] Daniel Xie, Maxwell J. Jacobson, Adil Wazeer, Haiyan Wang, Xinghang Zhang, and Yexiang Xue. Reducing Hallucinations in LLM-based Scientific Literature Analysis Using Peer Context Outlier Detection, 2026. URL https://arxiv.org/abs/2604.01461.
- [12] Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7B, 2023. URL https://arxi...
- [13] Llama Team. The Llama 3 Herd of Models, 2024. URL https://arxiv.org/abs/2407.21783.
- [14] Qwen3 Team. Qwen3 Technical Report, 2025. URL https://arxiv.org/abs/2505.09388.
- [15] Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A Conversational Question Answering Challenge, 2019. URL https://arxiv.org/abs/1808.07042.
- [16] Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics.
- [17] Pranav Rajpurkar, Robin Jia, and Percy Liang. Know What You Don't Know: Unanswerable Questions for SQuAD, 2018. URL https://arxiv.org/abs/1806.03822.
- [18] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring How Models Mimic Human Falsehoods, 2022. URL https://arxiv.org/abs/2109.07958.
- [19] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Advances in Neural Information Processing Systems, 2020.
- [20] Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy. Nature, 630(8017):625–630, 2024. doi: 10.1038/s41586-024-07421-0. URL https://www.nature.com/articles/s41586-024-07421-0.
- [22] Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic Entropy Probes: Robust and Cheap Hallucination Detection in LLMs. arXiv preprint arXiv:2406.15927, 2024.
- [23] Xuefeng Du, Chaowei Xiao, and Yixuan Li. HaloScope: Harnessing Unlabeled LLM Generations for Hallucination Detection. Advances in Neural Information Processing Systems, 37:102948–102972, 2024. doi: 10.52202/079017-3270. URL https://proceedings.neurips.cc/paper_files/paper/2024/hash/ba92705991cfbbcedc26e27e833ebbae-Abstract-Conference.html.
- [24] Robert Friel and Atindriyo Sanyal. Chainpoll: A High Efficacy Method for LLM Hallucination Detection, 2023. URL http://arxiv.org/abs/2310.18344.
- [25] Sangwoo Heo, Sungwook Son, and Hyunwoo Park. HaluCheck: Explainable and verifiable automation for detecting hallucinations in LLM responses. Expert Systems with Applications, 272:126712, 2025. doi: 10.1016/j.eswa.2025.126712. URL https://www.sciencedirect.com/science/article/pii/S0957417425003343.
- [26] Weixuan Wang, Jingyuan Yang, and Wei Peng. Semantics-Adaptive Activation Intervention for LLMs via Dynamic Steering Vectors, 2025. URL https://arxiv.org/abs/2410.12299.
- [27] Weixiang Zhao, Jiahe Guo, Yulin Hu, Yang Deng, An Zhang, Xingyu Sui, Xinyang Han, Yanyan Zhao, Bing Qin, Tat-Seng Chua, and Ting Liu. AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.
- [28] Kewei Liao, Tianbo Wang, Yuqing Ma, Zhange Zhang, Zhicheng Geng, Xiaowei Zhao, Jiakai Wang, and Xianglong Liu. Query-Routed Activation Editing with Truth-hierarchical Preference Optimization. Proceedings of the AAAI Conference on Artificial Intelligence, 40(38):31979–31987, 2026. doi: 10.1609/aaai.v40i38.40468. URL https://ojs.aaai.org/in...
- [29] Xiang Lisa Li, Ari Holtzman, Daniel Fried, Percy Liang, Jason Eisner, Tatsunori Hashimoto, Luke Zettlemoyer, and Mike Lewis. Contrastive Decoding: Open-ended Text Generation as Optimization. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers).
- [30] Yung-Sung Chuang, Yujia Xie, Hongyin Luo, Yoon Kim, James Glass, and Pengcheng He. DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models, 2024. URL https://arxiv.org/abs/2309.03883.
- [31] Yue Zhang, Leyang Cui, Wei Bi, and Shuming Shi. Alleviating Hallucinations of Large Language Models through Induced Hallucinations. In Findings of the Association for Computational Linguistics: NAACL 2025, pages 8233–8247, Albuquerque, New Mexico, April 2025. Association for Computational Linguistics.
- [32] Adnan Darwiche. A Differential Approach to Inference in Bayesian Networks. Journal of the ACM, 50(3):280–305, 2003. doi: 10.1145/765568.765570.
- [33] Wilson Hsu, Agastya Kalra, and Pascal Poupart. Online Structure Learning for Sum-Product Networks with Gaussian Leaves, 2017. URL http://arxiv.org/abs/1701.05265.
- [34] Anji Liu, Stephan Mandt, and Guy Van den Broeck. Lossless Compression with Probabilistic Circuits, 2022. URL http://arxiv.org/abs/2111.11632.
- [35] Weixin Chen, Simon Yu, Huajie Shao, Lui Sha, and Han Zhao. Neural Probabilistic Circuits: Enabling Compositional and Interpretable Predictions through Logical Reasoning, 2025. URL http://arxiv.org/abs/2501.07021.
- [36] Jonathan Pan. Conversational Context Classification: A Representation Engineering Approach.
- [37]
- [38] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization, 2017. URL https://arxiv.org/abs/1412.6980.
- [39] Yejin Bang, Ziwei Ji, Alan Schelten, Anthony Hartshorn, Tara Fowler, Cheng Zhang, Nicola Cancedda, and Pascale Fung. HalluLens: LLM Hallucination Benchmark. arXiv preprint arXiv:2504.17550, 2025. doi: 10.48550/arXiv.2504.17550.
- [40] YooJung Choi, Antonio Vergari, and Guy Van den Broeck. Probabilistic Circuits: A Unifying Framework for Tractable Probabilistic Models. Technical report, UCLA, 2020. URL http://starai.cs.ucla.edu/papers/ProbCirc20.pdf.
- [41] Antonio Vergari, YooJung Choi, Anji Liu, Stefano Teso, and Guy Van den Broeck. A Compositional Atlas of Tractable Circuit Operations for Probabilistic Inference. In Advances in Neural Information Processing Systems, volume 34, 2021. URL https://proceedings.neurips.cc/paper_files/paper/2021/file/6e01383fd96a17ae51cc3e15447e7533-Paper.pdf.
- [42] Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA: Efficient Finetuning of Quantized LLMs. In Advances in Neural Information Processing Systems, volume 36, pages 10088–10115. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/1feb87871436031bdc0f2beaa62a049b-Paper-Conference.pdf.
- [44] Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond, volume 4. Now Publishers Inc, 2009.