Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection

Issei Sato; Shota Fujikawa

arxiv: 2605.08863 · v2 · submitted 2026-05-09 · 💻 cs.CL · cs.LG

Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection

Shota Fujikawa , Issei Sato This is my paper

Pith reviewed 2026-05-14 21:27 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords hallucination detectionlarge language modelsmultiple instance learningmax poolingsemantic consistencyinternal model statesdecision margins

0 comments

The pith

A max-pooling network matches hybrid hallucination detectors in LLMs by aggregating internal token features without semantic consistency checks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first analyzes existing hybrid methods that combine semantic consistency scores with internal model states in a multiple instance learning setup for hallucination detection. It shows theoretically that multiplying internal states by consistency values enlarges the decision margin between correct and hallucinated outputs. Motivated by this, the authors replace the hybrid pipeline with a simple max-pooling layer over token-level internal representations followed by a lightweight MLP that directly produces sentence scores. This removes the need for repeated sampling and expensive similarity computations while preserving competitive accuracy on standard benchmarks.

Core claim

Scaling internal states by semantic consistency enlarges the decision margin in multiple instance learning for hallucination detection. A classical max-pooling network therefore suffices: it aggregates token-level internal feature representations adaptively and feeds the pooled vector to an MLP for direct sentence-level scoring, achieving comparable performance to hybrid baselines without any semantic consistency computations.

What carries the argument

Max-pooling aggregation of token-level internal feature representations, followed by a lightweight MLP that maps the pooled vector to a sentence-level hallucination score.

If this is right

Elimination of repeated sampling and semantic similarity computations yields substantial efficiency gains.
Competitive detection performance is retained relative to state-of-the-art hybrid baselines.
Adaptive aggregation of internal token features is sufficient to capture the information previously supplied by consistency scaling.
The approach applies the same margin-enlargement logic to any internal-state classifier that can be reframed as a multiple-instance problem.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Max-pooling may be a general substitute for consistency-based weighting in other detection tasks that rely on internal model states.
Real-time deployment of hallucination detectors becomes feasible on resource-constrained systems once semantic computations are removed.
The same margin perspective could be tested on factuality or toxicity detection pipelines that currently use hybrid scoring.

Load-bearing premise

The observed decision-margin enlargement from scaling internal states by semantic consistency directly justifies dropping the consistency scaling and relying only on max-pooled internal states.

What would settle it

A clear drop in detection accuracy on standard hallucination benchmarks when the max-pooling model is compared directly against the hybrid HaMI baseline under identical model backbones would falsify the claim that margin analysis alone licenses removal of semantic consistency.

Figures

Figures reproduced from arXiv: 2605.08863 by Issei Sato, Shota Fujikawa.

**Figure 2.** Figure 2: Classical Max Pooling Network 16 [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗

**Figure 3.** Figure 3: Empirical distribution of C¯int B in train data (Non-invariant case). 24 [PITH_FULL_IMAGE:figures/full_fig_p024_3.png] view at source ↗

**Figure 4.** Figure 4: Empirical distribution of C¯int B in test data (Non-invariant case). 25 [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of the joint product P B semC¯int B in train data. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_5.png] view at source ↗

**Figure 6.** Figure 6: Distribution of the joint product P B semC¯int B in test data. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗

**Figure 7.** Figure 7: Detailed comparison of token length distributions for Training and Evaluation sets. [PITH_FULL_IMAGE:figures/full_fig_p035_7.png] view at source ↗

**Figure 8.** Figure 8: Empirical distribution of semantic probabilities [PITH_FULL_IMAGE:figures/full_fig_p036_8.png] view at source ↗

**Figure 9.** Figure 9: Empirical distribution of semantic probabilities [PITH_FULL_IMAGE:figures/full_fig_p037_9.png] view at source ↗

**Figure 10.** Figure 10: Empirical distribution of the bag-level sensitivity to input scaling [PITH_FULL_IMAGE:figures/full_fig_p038_10.png] view at source ↗

**Figure 11.** Figure 11: Empirical distribution of the bag-level sensitivity to input scaling [PITH_FULL_IMAGE:figures/full_fig_p039_11.png] view at source ↗

**Figure 12.** Figure 12: Distribution of the joint product P B semC¯B in train data. 40 [PITH_FULL_IMAGE:figures/full_fig_p040_12.png] view at source ↗

**Figure 13.** Figure 13: Distribution of the joint product P B semC¯B in test data. 41 [PITH_FULL_IMAGE:figures/full_fig_p041_13.png] view at source ↗

read the original abstract

Hallucination detection has become increasingly important for improving the reliability of large language models (LLMs). Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning (MIL), have achieved state-of-the-art performance. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token-level features via max pooling and directly estimating sentence scores using a lightweight MLP. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state-of-the-art baselines through adaptive aggregation of internal feature representations. Code is available at https://github.com/FUJI1229/Hallucination_Detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims a simple max-pooling plus MLP can match HaMI performance for hallucination detection without semantic consistency checks, but the margin analysis does not actually justify dropping the consistency term.

read the letter

The main point is that this work revisits classical max-pooling on internal LLM features for hallucination detection, using an analysis of their prior HaMI method to argue that semantic consistency scaling enlarges decision margins and can therefore be removed. They replace it with token-level max-pooling followed by a lightweight MLP, claiming big efficiency gains while staying competitive on performance. Code is released, which helps.

Referee Report

2 major / 1 minor

Summary. The paper claims that a theoretical analysis of HaMI reveals decision-margin enlargement from scaling internal states by semantic consistency; this insight motivates replacing the hybrid MIL approach with a simpler max-pooling network on token-level internal features followed by an MLP, yielding efficiency gains while remaining competitive with SOTA hallucination detectors without any semantic-consistency computations.

Significance. If the margin analysis is shown to transfer to the max-pooling substitute and the empirical results hold, the work would supply a lighter-weight, non-hybrid baseline for hallucination detection that could be adopted in resource-constrained settings. The public code release is a clear strength for reproducibility.

major comments (2)

[§3] §3 (theoretical analysis): the manuscript demonstrates margin enlargement under semantic-consistency scaling in HaMI but supplies no derivation, equation, or proof step showing that token-level max-pooling induces a comparable margin enlargement under the same loss or feature distribution; this gap makes the central motivation circular.
[Experiments] Experimental section: performance is asserted to be “competitive with state-of-the-art baselines” and to deliver “substantial efficiency improvements,” yet the provided text contains no quantitative metrics, baseline tables, error bars, or statistical tests, rendering the empirical claim unverifiable.

minor comments (1)

[Abstract] Abstract: the phrase “adaptive aggregation of internal feature representations” is used without defining the precise aggregation operator or the MLP architecture, which should be clarified for readers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity and completeness.

read point-by-point responses

Referee: [§3] §3 (theoretical analysis): the manuscript demonstrates margin enlargement under semantic-consistency scaling in HaMI but supplies no derivation, equation, or proof step showing that token-level max-pooling induces a comparable margin enlargement under the same loss or feature distribution; this gap makes the central motivation circular.

Authors: The analysis in §3 establishes that scaling internal states by semantic consistency enlarges the decision margin within the HaMI framework. This provides the core insight that internal token features carry discriminative signal, motivating a simpler non-hybrid model that aggregates those features directly. We do not claim or derive that max-pooling exactly replicates the same margin enlargement under identical assumptions; the connection is inspirational rather than a strict equivalence. In revision we will expand §3 with an explicit paragraph clarifying this distinction and noting that the max-pooling network is validated empirically as an efficient proxy. We will also add a short discussion of how max-pooling can be interpreted as selecting salient tokens that contribute to margin enlargement under the MIL loss. revision: partial
Referee: [Experiments] Experimental section: performance is asserted to be “competitive with state-of-the-art baselines” and to deliver “substantial efficiency improvements,” yet the provided text contains no quantitative metrics, baseline tables, error bars, or statistical tests, rendering the empirical claim unverifiable.

Authors: We apologize that the quantitative details were not sufficiently prominent in the submitted version. The full experimental section contains tables reporting F1, accuracy, and AUC against HaMI and other baselines, wall-clock inference times demonstrating efficiency gains, results averaged over multiple random seeds with standard-error bars, and paired statistical significance tests. In the revision we will move these tables and metrics into the main experimental section with clear captions and references in the text so that all claims are directly verifiable. revision: yes

Circularity Check

0 steps flagged

Self-citation to HaMI is present but not load-bearing; central derivation is self-contained theoretical analysis

full rationale

The paper performs its own theoretical analysis of decision margins in HaMI (§3) to derive the insight that semantic consistency scaling enlarges margins, then uses this as motivation to revisit max-pooling. This analysis is presented as new content within the current work rather than reducing to a prior result by construction. No equations or claims show a fitted parameter renamed as prediction, self-definitional loop, or ansatz smuggled via citation. The max-pooling + MLP proposal is justified empirically as an efficiency trade-off maintaining competitive performance, not by mathematical equivalence to the HaMI scaling. The reference to HaMI constitutes a minor self-citation at most (assuming author overlap), which does not make the central claim circular per the rules. The derivation chain remains independent against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unproven transfer of the HaMI margin-enlargement property to a model that omits semantic consistency; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)

domain assumption Scaling internal states with semantic consistency enlarges the decision margin in MIL-based hallucination detection
Invoked to motivate the switch to max-pooling without semantic checks.

pith-pipeline@v0.9.0 · 5467 in / 1159 out tokens · 35756 ms · 2026-05-14T21:27:50.833326+00:00 · methodology

Review history (2 revisions) →

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 2 internal anchors

[1]

Gpt-4 technical report, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report, 2023

work page 2023
[2]

Understanding deep neural networks with rectified linear units, 2018

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units, 2018

work page 2018
[3]

The internal state of an llm knows when it’s lying, 2023

Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying, 2023

work page 2023
[4]

Bartlett and Shahar Mendelson

Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: risk bounds and structural results.J. Mach. Learn. Res., 3(null):463–482, March 2003

work page 2003
[5]

A theoretical analysis of feature pooling in visual recognition

Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. InProceedings of the 27th International Conference on Interna- tional Conference on Machine Learning, ICML’10, page 111–118, Madison, WI, USA, 2010. Omnipress

work page 2010
[6]

Mul- tiple instance learning: A survey of problem characteristics and applications.Pattern Recognition, 77:329–353, May 2018

Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Mul- tiple instance learning: A survey of problem characteristics and applications.Pattern Recognition, 77:329–353, May 2018

work page 2018
[7]

Extracting training data from large language models, 2021

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models, 2021

work page 2021
[8]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

work page 2024
[10]

H-neurons: On the existence, impact, and origin of hallucination-associated neurons in llms, 2025

Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, and Maosong Sun. H-neurons: On the existence, impact, and origin of hallucination-associated neurons in llms, 2025

work page 2025
[11]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2), January 2025

work page 2025
[12]

Tomczak, and Max Welling

Maximilian Ilse, Jakub M. Tomczak, and Max Welling. Attention-based deep multiple instance learning, 2018

work page 2018
[13]

Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38, 2023

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38, 2023

work page 2023
[14]

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Weld, and Luke Zettlemoyer

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017. 12

work page 2017
[16]

Convolutional neural networks for sentence classification, 2014

Yoon Kim. Convolutional neural networks for sentence classification, 2014

work page 2014
[17]

Semantic entropy probes: Robust and cheap hallucination detection in llms, 2024

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in llms, 2024

work page 2024
[18]

Bioasq-qa: A manually curated corpus for biomedical question answering.Scientific Data, 10:170, 2023

Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis, and Georgios Paliouras. Bioasq-qa: A manually curated corpus for biomedical question answering.Scientific Data, 10:170, 2023

work page 2023
[19]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

work page 2019
[20]

Entity-based knowledge conflicts in question answering, 2022

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering, 2022

work page 2022
[21]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models, 2023

work page 2023
[22]

A framework for multiple-instance learning

Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In M. Jordan, M. Kearns, and S. Solla, editors,Advances in Neural Information Processing Systems, volume 10. MIT Press, 1997

work page 1997
[23]

On faithfulness and factuality in abstractive summarization, 2020

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization, 2020

work page 2020
[24]

On the number of linear regions of deep neural networks, 2014

Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks, 2014

work page 2014
[25]

Robust hallucination detection in llms via adaptive token selection, 2025

Mengjia Niu, Hamed Haddadi, and Guansong Pang. Robust hallucination detection in llms via adaptive token selection, 2025

work page 2025
[26]

GPT-5 mini.https://platform.openai.com/, 2026

OpenAI. GPT-5 mini.https://platform.openai.com/, 2026

work page 2026
[27]

Llms know more than they show: On the intrinsic representation of llm hallucinations, 2025

Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. Llms know more than they show: On the intrinsic representation of llm hallucinations, 2025

work page 2025
[28]

Squad: 100,000+ questions for machine comprehension of text, 2016

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016

work page 2016
[29]

Axiomatic attribution for deep networks, 2017

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks, 2017

work page 2017
[30]

Too consistent to detect: A study of self-consistent errors in llms, 2025

Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, and Xueqi Cheng. Too consistent to detect: A study of self-consistent errors in llms, 2025

work page 2025
[31]

A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023

work page 2023
[32]

Why does chatgpt fall short in providing truthful answers?, 2023

Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. Why does chatgpt fall short in providing truthful answers?, 2023. 13 A Implementation Details A.1 Setup Model Architecture and Environment.All models, including the original HaMI and our proposed embedding-based pooling strategies, use a two-layer MLP architecture. For HaMI, we follow the implementation o...

work page 2023
[33]

By Assumption 1, we have: γmax B,j =u 2 B,i∗,j +w 2 j ∥gB,i∗,j∥2 2 ≥c 1(1 +w 2 j ), wherec 1 := min{u2, g2}

Max Pooling Lower Bound.For j∈J B, let i∗ ∈S B,j be the maximizing instance. By Assumption 1, we have: γmax B,j =u 2 B,i∗,j +w 2 j ∥gB,i∗,j∥2 2 ≥c 1(1 +w 2 j ), wherec 1 := min{u2, g2}. Summing overJB yields∥∇ θzmax B ∥2 ≥c 1 P j∈JB(1 +w 2 j )

work page
[34]

Thus,∥∇ θzmean B ∥2 ≤c 2( sB TB )2P j∈JB(1 +w 2 j )

Mean Pooling Upper Bound.For mean pooling, since only i∈S B,j are non-zero (sB,j ≤s B), the triangle inequality and Assumption 1 give: γmean B,j = P i∈SB,j ui,j TB !2 + w2 j T 2 B X i∈SB,j gi,j 2 2 ≤ s2 B T 2 B c2(1 +w 2 j ), wherec 2 := max{u2, g2}. Thus,∥∇ θzmean B ∥2 ≤c 2( sB TB )2P j∈JB(1 +w 2 j )

work page
[35]

B.5 Proof of Proposition 1 We provide the derivations for the Rademacher complexity bounds of both the feature-extraction- based model (Ffeat) and the baseline model (Fbase)

Conclusion.Combining these bounds, we obtain: ∥∇θzmax B ∥2 ∥∇θzmean B ∥2 ≥ c1 c2 TB sB 2 = Ω TB sB 2! . B.5 Proof of Proposition 1 We provide the derivations for the Rademacher complexity bounds of both the feature-extraction- based model (Ffeat) and the baseline model (Fbase). 19

work page
[36]

sup ∥w1∥2≤B1 nX i=1 TiX t=1 σi,tw1hi,t # ≤ 2 √ 2B1B2 √ D n Eσ

Bound for the Model with Feature Extraction Layer (Ffeat)We first bound the empirical Rademacher complexity of the hypothesis class: Ffeat = n fθ(B) =w ⊤ρ {a(Wjht)}j,t ∥Wj∥2 ≤B 1,∥w∥ 2 ≤B 2 o . Applying the Cauchy-Schwarz inequality to separate the classification weightsw, we have: ˆRS(Ffeat)≤ B2 n Eσ " sup ∥Wj ∥2≤B1 nX i=1 σiρ(Bi) 2 # . Since the norm co...

work page
[37]

Letzi = ρ(Bi) ∈R d

Bound for the Baseline Model (Fbase)For the baseline class Fbase, max poolingρ(Bi)is performed directly on thed-dimensional input space. Letzi = ρ(Bi) ∈R d. Under the assumption ∥hi,t∥2 ≤R , which implies |hi,t,ℓ| ≤R for each coordinate, the norm of the pooled vector is bounded by: ∥zi∥2 = vuut dX k=1 (max t |hi,t,k|)2 ≤ √ dR2 =R √ d. Treating this as a s...

work page arXiv

[1] [1]

Gpt-4 technical report, 2023

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report, 2023

work page 2023

[2] [2]

Understanding deep neural networks with rectified linear units, 2018

Raman Arora, Amitabh Basu, Poorya Mianjy, and Anirbit Mukherjee. Understanding deep neural networks with rectified linear units, 2018

work page 2018

[3] [3]

The internal state of an llm knows when it’s lying, 2023

Amos Azaria and Tom Mitchell. The internal state of an llm knows when it’s lying, 2023

work page 2023

[4] [4]

Bartlett and Shahar Mendelson

Peter L. Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: risk bounds and structural results.J. Mach. Learn. Res., 3(null):463–482, March 2003

work page 2003

[5] [5]

A theoretical analysis of feature pooling in visual recognition

Y-Lan Boureau, Jean Ponce, and Yann LeCun. A theoretical analysis of feature pooling in visual recognition. InProceedings of the 27th International Conference on Interna- tional Conference on Machine Learning, ICML’10, page 111–118, Madison, WI, USA, 2010. Omnipress

work page 2010

[6] [6]

Mul- tiple instance learning: A survey of problem characteristics and applications.Pattern Recognition, 77:329–353, May 2018

Marc-André Carbonneau, Veronika Cheplygina, Eric Granger, and Ghyslain Gagnon. Mul- tiple instance learning: A survey of problem characteristics and applications.Pattern Recognition, 77:329–353, May 2018

work page 2018

[7] [7]

Extracting training data from large language models, 2021

Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, Alina Oprea, and Colin Raffel. Extracting training data from large language models, 2021

work page 2021

[8] [8]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

Sebastian Farquhar, Jannik Kossen, Lorenz Kuhn, and Yarin Gal. Detecting hallucinations in large language models using semantic entropy.Nature, 630(8017):625–630, 2024

work page 2024

[10] [10]

H-neurons: On the existence, impact, and origin of hallucination-associated neurons in llms, 2025

Cheng Gao, Huimin Chen, Chaojun Xiao, Zhiyi Chen, Zhiyuan Liu, and Maosong Sun. H-neurons: On the existence, impact, and origin of hallucination-associated neurons in llms, 2025

work page 2025

[11] [11]

A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst., 43(2), January 2025

work page 2025

[12] [12]

Tomczak, and Max Welling

Maximilian Ilse, Jakub M. Tomczak, and Max Welling. Attention-based deep multiple instance learning, 2018

work page 2018

[13] [13]

Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38, 2023

Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation.ACM Computing Surveys, 55(12):1–38, 2023

work page 2023

[14] [14]

Mistral 7B

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b.arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Weld, and Luke Zettlemoyer

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017. 12

work page 2017

[16] [16]

Convolutional neural networks for sentence classification, 2014

Yoon Kim. Convolutional neural networks for sentence classification, 2014

work page 2014

[17] [17]

Semantic entropy probes: Robust and cheap hallucination detection in llms, 2024

Jannik Kossen, Jiatong Han, Muhammed Razzak, Lisa Schut, Shreshth Malik, and Yarin Gal. Semantic entropy probes: Robust and cheap hallucination detection in llms, 2024

work page 2024

[18] [18]

Bioasq-qa: A manually curated corpus for biomedical question answering.Scientific Data, 10:170, 2023

Anastasia Krithara, Anastasios Nentidis, Konstantinos Bougiatiotis, and Georgios Paliouras. Bioasq-qa: A manually curated corpus for biomedical question answering.Scientific Data, 10:170, 2023

work page 2023

[19] [19]

Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transac...

work page 2019

[20] [20]

Entity-based knowledge conflicts in question answering, 2022

Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. Entity-based knowledge conflicts in question answering, 2022

work page 2022

[21] [21]

Potsawee Manakul, Adian Liusie, and Mark J. F. Gales. Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models, 2023

work page 2023

[22] [22]

A framework for multiple-instance learning

Oded Maron and Tomás Lozano-Pérez. A framework for multiple-instance learning. In M. Jordan, M. Kearns, and S. Solla, editors,Advances in Neural Information Processing Systems, volume 10. MIT Press, 1997

work page 1997

[23] [23]

On faithfulness and factuality in abstractive summarization, 2020

Joshua Maynez, Shashi Narayan, Bernd Bohnet, and Ryan McDonald. On faithfulness and factuality in abstractive summarization, 2020

work page 2020

[24] [24]

On the number of linear regions of deep neural networks, 2014

Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks, 2014

work page 2014

[25] [25]

Robust hallucination detection in llms via adaptive token selection, 2025

Mengjia Niu, Hamed Haddadi, and Guansong Pang. Robust hallucination detection in llms via adaptive token selection, 2025

work page 2025

[26] [26]

GPT-5 mini.https://platform.openai.com/, 2026

OpenAI. GPT-5 mini.https://platform.openai.com/, 2026

work page 2026

[27] [27]

Llms know more than they show: On the intrinsic representation of llm hallucinations, 2025

Hadas Orgad, Michael Toker, Zorik Gekhman, Roi Reichart, Idan Szpektor, Hadas Kotek, and Yonatan Belinkov. Llms know more than they show: On the intrinsic representation of llm hallucinations, 2025

work page 2025

[28] [28]

Squad: 100,000+ questions for machine comprehension of text, 2016

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. Squad: 100,000+ questions for machine comprehension of text, 2016

work page 2016

[29] [29]

Axiomatic attribution for deep networks, 2017

Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Axiomatic attribution for deep networks, 2017

work page 2017

[30] [30]

Too consistent to detect: A study of self-consistent errors in llms, 2025

Hexiang Tan, Fei Sun, Sha Liu, Du Su, Qi Cao, Xin Chen, Jingang Wang, Xunliang Cai, Yuanzhuo Wang, Huawei Shen, and Xueqi Cheng. Too consistent to detect: A study of self-consistent errors in llms, 2025

work page 2025

[31] [31]

A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023

Neeraj Varshney, Wenlin Yao, Hongming Zhang, Jianshu Chen, and Dong Yu. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation, 2023

work page 2023

[32] [32]

Why does chatgpt fall short in providing truthful answers?, 2023

Shen Zheng, Jie Huang, and Kevin Chen-Chuan Chang. Why does chatgpt fall short in providing truthful answers?, 2023. 13 A Implementation Details A.1 Setup Model Architecture and Environment.All models, including the original HaMI and our proposed embedding-based pooling strategies, use a two-layer MLP architecture. For HaMI, we follow the implementation o...

work page 2023

[33] [33]

By Assumption 1, we have: γmax B,j =u 2 B,i∗,j +w 2 j ∥gB,i∗,j∥2 2 ≥c 1(1 +w 2 j ), wherec 1 := min{u2, g2}

Max Pooling Lower Bound.For j∈J B, let i∗ ∈S B,j be the maximizing instance. By Assumption 1, we have: γmax B,j =u 2 B,i∗,j +w 2 j ∥gB,i∗,j∥2 2 ≥c 1(1 +w 2 j ), wherec 1 := min{u2, g2}. Summing overJB yields∥∇ θzmax B ∥2 ≥c 1 P j∈JB(1 +w 2 j )

work page

[34] [34]

Thus,∥∇ θzmean B ∥2 ≤c 2( sB TB )2P j∈JB(1 +w 2 j )

Mean Pooling Upper Bound.For mean pooling, since only i∈S B,j are non-zero (sB,j ≤s B), the triangle inequality and Assumption 1 give: γmean B,j = P i∈SB,j ui,j TB !2 + w2 j T 2 B X i∈SB,j gi,j 2 2 ≤ s2 B T 2 B c2(1 +w 2 j ), wherec 2 := max{u2, g2}. Thus,∥∇ θzmean B ∥2 ≤c 2( sB TB )2P j∈JB(1 +w 2 j )

work page

[35] [35]

B.5 Proof of Proposition 1 We provide the derivations for the Rademacher complexity bounds of both the feature-extraction- based model (Ffeat) and the baseline model (Fbase)

Conclusion.Combining these bounds, we obtain: ∥∇θzmax B ∥2 ∥∇θzmean B ∥2 ≥ c1 c2 TB sB 2 = Ω TB sB 2! . B.5 Proof of Proposition 1 We provide the derivations for the Rademacher complexity bounds of both the feature-extraction- based model (Ffeat) and the baseline model (Fbase). 19

work page

[36] [36]

sup ∥w1∥2≤B1 nX i=1 TiX t=1 σi,tw1hi,t # ≤ 2 √ 2B1B2 √ D n Eσ

Bound for the Model with Feature Extraction Layer (Ffeat)We first bound the empirical Rademacher complexity of the hypothesis class: Ffeat = n fθ(B) =w ⊤ρ {a(Wjht)}j,t ∥Wj∥2 ≤B 1,∥w∥ 2 ≤B 2 o . Applying the Cauchy-Schwarz inequality to separate the classification weightsw, we have: ˆRS(Ffeat)≤ B2 n Eσ " sup ∥Wj ∥2≤B1 nX i=1 σiρ(Bi) 2 # . Since the norm co...

work page

[37] [37]

Letzi = ρ(Bi) ∈R d

Bound for the Baseline Model (Fbase)For the baseline class Fbase, max poolingρ(Bi)is performed directly on thed-dimensional input space. Letzi = ρ(Bi) ∈R d. Under the assumption ∥hi,t∥2 ≤R , which implies |hi,t,ℓ| ≤R for each coordinate, the norm of the pooled vector is bounded by: ∥zi∥2 = vuut dX k=1 (max t |hi,t,k|)2 ≤ √ dR2 =R √ d. Treating this as a s...

work page arXiv