A Geometric Analysis of Small-sized Language Model Hallucinations
Pith reviewed 2026-05-21 12:11 UTC · model grok-4.3
The pith
Genuine responses cluster more tightly than hallucinations in sentence-embedding space, becoming separable after Fisher projection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Genuine responses cluster more tightly than hallucinated ones in sentence-embedding space; after Fisher projection the two classes become consistently separable. This asymmetry supports APORIA-LP, an efficient label-propagation method that classifies large collections of responses from as few as 30-50 annotations and reaches F1 scores above 90 percent across ten small-sized LLMs.
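To illustrate how a few dozen annotations could label a much larger collection, here is a minimal sketch of generic graph-based label propagation on toy one-dimensional scores. It is not the authors' APORIA-LP implementation, whose details are not given on this page. The function name `propagate_labels`, the Gaussian kernel, the `sigma` value, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def propagate_labels(X, labels, sigma=1.0, n_iter=200):
    """Generic label propagation; labels: 1 = genuine, 0 = hallucinated, -1 = unlabelled."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Gaussian affinities computed from squared Euclidean distances.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    P = W / W.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    known = labels >= 0
    F = np.full((len(X), 2), 0.5)          # uninformative start for unlabelled points
    F[known] = np.eye(2)[labels[known]]
    for _ in range(n_iter):
        F = P @ F
        F[known] = np.eye(2)[labels[known]]  # clamp the annotated points
    return F.argmax(axis=1)

# Hypothetical toy data: a tight "genuine" group and a diffuse "hallucinated" group.
rng = np.random.default_rng(2)
genuine = rng.normal(loc=3.0, scale=0.2, size=(100, 1))
halluc = rng.normal(loc=0.0, scale=0.8, size=(100, 1))
X = np.vstack([genuine, halluc])
truth = np.array([1] * 100 + [0] * 100)

labels = np.full(200, -1)
seed_idx = np.concatenate([np.arange(0, 15), np.arange(100, 115)])  # 30 annotations
labels[seed_idx] = truth[seed_idx]

pred = propagate_labels(X, labels, sigma=0.5)
accuracy = float((pred == truth).mean())
```

The clamping step keeps the annotated points fixed while their labels diffuse to neighbours, which is why a small seed set can suffice when the classes are well separated. Real embeddings would need a kernel width chosen for their scale.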
What carries the argument
APORIA, the geometric framework that measures prompt-wise response instability through asymmetry of clusters in sentence-embedding space, using Fisher projection to achieve class separability.
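The paper passage quoted later on this page gives the Fisher direction as v ∝ (S_W^λ)^−1 (μ_G − μ_H). The sketch below computes such a direction with NumPy, assuming that S_W^λ denotes a ridge-regularised within-class scatter matrix S_W + λI. That reading, the function name `fisher_direction`, the default `lam`, and the toy data are assumptions, not details confirmed by the source.

```python
import numpy as np

def fisher_direction(X_g, X_h, lam=1e-3):
    """Unit vector proportional to (S_W + lam * I)^{-1} (mu_G - mu_H)."""
    X_g = np.asarray(X_g, dtype=float)
    X_h = np.asarray(X_h, dtype=float)
    mu_g, mu_h = X_g.mean(axis=0), X_h.mean(axis=0)
    # Within-class scatter: sum of the centred outer products of both classes.
    S_w = (X_g - mu_g).T @ (X_g - mu_g) + (X_h - mu_h).T @ (X_h - mu_h)
    # Assumed regularisation: add lam to the diagonal before solving.
    S_reg = S_w + lam * np.eye(S_w.shape[0])
    v = np.linalg.solve(S_reg, mu_g - mu_h)
    return v / np.linalg.norm(v)

# Hypothetical toy embeddings: genuine responses tight, hallucinated ones diffuse.
rng = np.random.default_rng(1)
X_g = rng.normal(loc=[2.0, 0.0, 0.0], scale=0.3, size=(40, 3))
X_h = rng.normal(loc=[0.0, 0.0, 0.0], scale=1.0, size=(40, 3))

v = fisher_direction(X_g, X_h)
proj_g, proj_h = X_g @ v, X_h @ v  # one-dimensional scores per response
```

Because the regularised scatter matrix is positive definite, the projected genuine mean always lies above the projected hallucinated mean along `v`. How cleanly the two sets of scores separate still depends on the data.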
If this is right
- Hallucinations can be flagged geometrically without external fact-checking or knowledge retrieval.
- Large sets of model outputs can be labelled for hallucinations with only a few dozen manual annotations.
- The same asymmetry supplies a way to study retrieval instability across different small LLMs.
- The released SOCRATES-300K dataset enables further experiments on geometric properties of model responses.
Where Pith is reading between the lines
- The observed instability could be used to guide sampling strategies that favor lower-variance outputs on factual questions.
- Similar embedding-space measurements might reveal whether multi-step agentic workflows amplify the same geometric asymmetry.
- Training objectives that explicitly reward tighter clustering around correct answers could be tested as a way to reduce hallucinations.
- The geometric signature might appear in other generative domains such as code or image synthesis, offering a cross-modal detection route.
Load-bearing premise
The tighter clustering of genuine responses must reflect a general retrieval instability rather than prompt-specific effects, model-size differences, or biases in the chosen embedding model.
What would settle it
Repeating the embedding and Fisher-projection analysis on a fresh collection of prompts or on models outside the original ten and finding that genuine and hallucinated responses no longer form separable clusters would refute the central geometric claim.
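A replication along these lines needs a number that summarises the asymmetry. The paper passage quoted further down this page mentions a Wasserstein distance between the within-class pairwise-distance distributions, so the sketch below computes a one-dimensional Wasserstein-1 distance between genuine-genuine and hallucinated-hallucinated cosine distances. The helper names, the toy embeddings, and the choice of cosine distance are assumptions for illustration, not the authors' code.

```python
import numpy as np

def pairwise_cosine_distances(X):
    """Cosine distances for all unordered pairs of rows in X."""
    X = np.asarray(X, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    iu = np.triu_indices(X.shape[0], k=1)
    return (1.0 - X @ X.T)[iu]

def wasserstein_1d(a, b):
    """Wasserstein-1 distance between two empirical 1-D samples."""
    a = np.sort(np.asarray(a, dtype=float))
    b = np.sort(np.asarray(b, dtype=float))
    all_vals = np.sort(np.concatenate([a, b]))
    deltas = np.diff(all_vals)
    # Empirical CDFs evaluated on each interval between consecutive values.
    cdf_a = np.searchsorted(a, all_vals[:-1], side="right") / a.size
    cdf_b = np.searchsorted(b, all_vals[:-1], side="right") / b.size
    return float(np.sum(np.abs(cdf_a - cdf_b) * deltas))

# Hypothetical toy embeddings for one prompt.
rng = np.random.default_rng(3)
centre = rng.normal(size=8)
genuine = centre + 0.1 * rng.normal(size=(30, 8))
hallucinated = centre + 1.0 * rng.normal(size=(30, 8))

d_gg = pairwise_cosine_distances(genuine)
d_hh = pairwise_cosine_distances(hallucinated)
gap = wasserstein_1d(d_gg, d_hh)
```

On fresh prompts or new models, a `gap` that stays near zero would indicate that the two within-class distance distributions are no longer distinguishable.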
Original abstract
Hallucinations -- plausible but factually incorrect responses -- pose a major challenge to the reliability of Large Language Models (LLMs), especially in multi-step or agentic settings. Existing work largely frames hallucinations as a consequence of missing knowledge; we show instead that, even when the relevant factual knowledge is present, models still produce hallucinated answers, pointing to retrieval instability rather than knowledge gaps. Building on this observation, we introduce APORIA (Aggregate Prompt-wise Observation Retrieving Instability via Asymmetry -- the state of puzzlement-in-contradiction that hallucinations embody), a geometric framework that studies repeated responses to the same prompt in sentence-embedding space. Our central hypothesis is that genuine responses cluster more tightly than hallucinated ones; we empirically validate this and show that, after Fisher projection, the two response classes become consistently separable. We leverage this asymmetry in geometry via APORIA-LP, an efficient label-propagation method that classifies large collections of responses from as few as 30--50 annotations, achieving F1 scores above 90% across ten small-sized LLMs. To support further research, we release SOCRATES-300K, a fully labelled dataset of 300,000 responses, together with the code for both dataset generation and result reproduction. Our key finding -- framing hallucinations from a geometric perspective in the embedding space -- complements traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that hallucinations in small LLMs arise from retrieval instability even when relevant knowledge is present. It introduces the APORIA geometric framework for analyzing repeated responses to the same prompt in sentence-embedding space, with the central hypothesis that genuine responses form tighter clusters than hallucinated ones. After Fisher projection the classes become separable, enabling the APORIA-LP label-propagation classifier that reaches F1 > 90% from only 30-50 annotations across ten small models. The authors release the fully labeled SOCRATES-300K dataset together with generation and reproduction code.
Significance. If the reported geometric asymmetry is robust and attributable to generation dynamics rather than representation artifacts, the work supplies a complementary perspective to knowledge-centric hallucination research and a practical low-supervision detection method. The public release of a large labeled dataset and accompanying code constitutes a clear strength for reproducibility and follow-on studies.
major comments (2)
- [Abstract and empirical validation] The interpretation that tighter genuine clusters reflect retrieval instability (Abstract; empirical validation) rather than sentence-embedding biases or prompt artifacts is load-bearing for the central claim yet unsupported by ablations. Repeating the intra-class variance analysis under TF-IDF, random projections, or a different encoder family is required to isolate the origin of the asymmetry.
- [Experimental setup] Details on the exact clustering metric, prompt sampling controls, and whether Fisher projection parameters were tuned post-hoc are absent from the experimental description, leaving the separability results difficult to assess for rigor.
minor comments (1)
- [Notation and terminology] Ensure the expansion of the APORIA acronym and consistent use of APORIA-LP appear in the main text as well as the abstract.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments highlight important areas for improving the clarity and robustness of our claims. We address each major comment point by point below and indicate the revisions we will make to the manuscript.
- Referee: [Abstract and empirical validation] The interpretation that tighter genuine clusters reflect retrieval instability (Abstract; empirical validation) rather than sentence-embedding biases or prompt artifacts is load-bearing for the central claim yet unsupported by ablations. Repeating the intra-class variance analysis under TF-IDF, random projections, or a different encoder family is required to isolate the origin of the asymmetry.
  Authors: We agree that demonstrating the asymmetry is not an artifact of the particular sentence encoder is important for supporting our interpretation of retrieval instability. In the revised manuscript we add an ablation section that repeats the intra-class variance analysis using (i) TF-IDF bag-of-words vectors and (ii) embeddings from a different encoder family (paraphrase-MiniLM-L6-v2). The tighter clustering of genuine responses remains visible under both alternatives, indicating that the geometric asymmetry is not driven by the original encoder choice. Random projections were omitted because they destroy the semantic structure that our geometric hypothesis relies upon; we instead focus on semantically meaningful representations. These new results will be reported with the corresponding figures and statistics.
  Revision: yes
- Referee: [Experimental setup] Details on the exact clustering metric, prompt sampling controls, and whether Fisher projection parameters were tuned post-hoc are absent from the experimental description, leaving the separability results difficult to assess for rigor.
  Authors: We thank the referee for noting these omissions. The revised Experimental Setup section now explicitly states: (1) intra-class variance is computed as the average pairwise cosine distance within each response group; (2) prompts were sampled with controls for topic diversity (balanced across 20 domains) and length (capped at 50 tokens) to avoid confounding; (3) Fisher discriminant parameters were obtained via 5-fold cross-validation strictly on the 30–50 annotated examples per model and were never tuned on the held-out evaluation set. These clarifications remove any ambiguity about post-hoc fitting and make the separability results fully reproducible.
  Revision: yes
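The rebuttal defines intra-class variance as the average pairwise cosine distance within each response group. A minimal sketch of that statistic follows. The function name `mean_pairwise_cosine_distance` and the toy embeddings are assumptions for illustration; the authors' released code may compute it differently.

```python
import numpy as np

def mean_pairwise_cosine_distance(embeddings):
    """Average cosine distance over all unordered pairs of rows."""
    X = np.asarray(embeddings, dtype=float)
    n = X.shape[0]
    if n < 2:
        raise ValueError("need at least two responses")
    # Normalise rows so that dot products equal cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    iu = np.triu_indices(n, k=1)  # index pairs with i < j
    return float(np.mean(1.0 - sims[iu]))

# Hypothetical toy embeddings for repeated responses to one prompt.
rng = np.random.default_rng(0)
centre = rng.normal(size=16)
genuine = centre + 0.05 * rng.normal(size=(20, 16))
hallucinated = centre + 1.0 * rng.normal(size=(20, 16))

d_genuine = mean_pairwise_cosine_distance(genuine)
d_hallucinated = mean_pairwise_cosine_distance(hallucinated)
```

In this toy setup the genuine group is built to be tighter, so `d_genuine` comes out smaller than `d_hallucinated`. The sketch does not test whether real responses behave this way.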
Circularity Check
No significant circularity detected
The paper's core contribution is an empirical observation that genuine responses form tighter clusters than hallucinated ones in sentence-embedding space, followed by Fisher projection for separability and a label-propagation classifier (APORIA-LP) trained on 30-50 annotations. This chain rests on direct measurement and validation across the released SOCRATES-300K dataset and ten LLMs rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The hypothesis is tested experimentally; no equation or quantity is constructed to equal its own input, and the geometric asymmetry is presented as a measured property rather than derived from prior author results by fiat. The method is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- annotation budget
axioms (1)
- domain assumption: Genuine responses form tighter clusters than hallucinated ones in sentence-embedding space
invented entities (3)
- APORIA: no independent evidence
- APORIA-LP: no independent evidence
- SOCRATES-300K: independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: genuine responses exhibit greater semantic consistency than hallucinated responses... pairwise distances... Wasserstein distance between D_GG and D_HH... Fisher Discriminant Analysis... v ∝ (S_W^λ)^{-1} (μ_G − μ_H)
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.