
arxiv: 2602.14778 · v3 · pith:PEAEPENGnew · submitted 2026-02-16 · 💻 cs.CL · cs.AI · cs.CY

A Geometric Analysis of Small-sized Language Model Hallucinations

Pith reviewed 2026-05-21 12:11 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY
keywords hallucination detection · sentence embeddings · geometric analysis · label propagation · LLM reliability · retrieval instability · Fisher projection · response clustering

The pith

Genuine responses cluster more tightly than hallucinations in sentence-embedding space, becoming separable after Fisher projection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that small language models can hallucinate even when they hold the relevant facts, pointing to unstable retrieval rather than knowledge gaps. It examines many responses to the same prompt placed in sentence-embedding space and finds that correct answers form tighter groups while hallucinations spread out. Projecting the points with a Fisher linear discriminant makes the two groups reliably separable. From this geometry the authors build APORIA-LP, a label-propagation classifier that labels large collections of responses accurately after seeing only 30 to 50 hand-annotated examples. The result supplies both a practical detection tool and a large labelled dataset for studying how models generate answers.

Core claim

Genuine responses cluster more tightly than hallucinated ones in sentence-embedding space; after Fisher projection the two classes become consistently separable. This asymmetry supports APORIA-LP, an efficient label-propagation method that classifies large collections of responses from as few as 30-50 annotations and reaches F1 scores above 90 percent across ten small-sized LLMs.

What carries the argument

APORIA, the geometric framework that measures prompt-wise response instability through asymmetry of clusters in sentence-embedding space, using Fisher projection to achieve class separability.
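
As a rough illustration of the Fisher projection step, the sketch below computes a regularized Fisher direction on synthetic data and checks how well a single threshold separates the two projected classes. The formula quoted near the end of this page writes the within-class scatter as S_W^λ; reading λ as a ridge term added to S_W is an assumption made here, as are the function name `fisher_direction`, the midpoint threshold, and the toy Gaussian clusters standing in for sentence embeddings. It is not the authors' implementation.

```python
import numpy as np


def fisher_direction(genuine: np.ndarray, halluc: np.ndarray, lam: float = 1e-3) -> np.ndarray:
    """Unit vector proportional to (S_W + lam * I)^-1 (mu_G - mu_H)."""
    mu_g, mu_h = genuine.mean(axis=0), halluc.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    cg, ch = genuine - mu_g, halluc - mu_h
    s_w = cg.T @ cg + ch.T @ ch
    # Ridge term (an assumption in this sketch) keeps the solve well-conditioned.
    v = np.linalg.solve(s_w + lam * np.eye(s_w.shape[0]), mu_g - mu_h)
    return v / np.linalg.norm(v)


# Synthetic 8-dimensional stand-ins for embeddings (not data from the paper).
rng = np.random.default_rng(0)
shift = np.zeros(8)
shift[0] = 3.0
genuine = rng.normal(scale=0.3, size=(60, 8)) + shift  # tight cluster
halluc = rng.normal(scale=1.0, size=(60, 8))           # diffuse cluster

v = fisher_direction(genuine, halluc)
proj_g, proj_h = genuine @ v, halluc @ v

# Midpoint threshold between the projected class means (a simple choice,
# not necessarily optimal when the class variances differ).
threshold = 0.5 * (proj_g.mean() + proj_h.mean())
accuracy = 0.5 * ((proj_g > threshold).mean() + (proj_h <= threshold).mean())
print(f"balanced accuracy on the toy data: {accuracy:.2f}")
```

In a real run the direction would be fitted on the annotated examples only and then applied to held-out responses, so the toy accuracy printed here says nothing about the paper's reported numbers.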

If this is right

  • Hallucinations can be flagged geometrically without external fact-checking or knowledge retrieval.
  • Large sets of model outputs can be labelled for hallucinations with only a few dozen manual annotations.
  • The same asymmetry supplies a way to study retrieval instability across different small LLMs.
  • The released SOCRATES-300K dataset enables further experiments on geometric properties of model responses.
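
To make the few-annotation idea concrete, here is a minimal sketch of generic graph-based label propagation with clamped seed labels. It is not APORIA-LP: the RBF affinity graph, the `sigma` bandwidth, the iteration count, the function name `propagate_labels`, and the synthetic 2-D points are all assumptions chosen for illustration, and this summary does not specify how the authors build their graph.

```python
import numpy as np


def propagate_labels(x, seed_idx, seed_labels, sigma=1.0, n_iter=200):
    """Generic two-class label propagation; seed points keep their labels."""
    n = len(x)
    # Dense RBF affinities from squared Euclidean distances.
    sq = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
    w = np.exp(-sq / (2.0 * sigma**2))
    np.fill_diagonal(w, 0.0)
    # Row-normalize to obtain a transition matrix.
    p = w / w.sum(axis=1, keepdims=True)

    f = np.full((n, 2), 0.5)  # soft label scores, one column per class
    clamp = np.zeros((len(seed_idx), 2))
    clamp[np.arange(len(seed_idx)), seed_labels] = 1.0
    for _ in range(n_iter):
        f = p @ f             # each point averages its neighbours' scores
        f[seed_idx] = clamp   # re-impose the annotated labels
    return f.argmax(axis=1)


# Synthetic stand-ins (not data from the paper): label 1 = genuine, 0 = hallucinated.
rng = np.random.default_rng(0)
genuine = rng.normal(scale=0.3, size=(150, 2)) + np.array([4.0, 0.0])
halluc = rng.normal(scale=1.0, size=(150, 2))
x = np.vstack([genuine, halluc])
y = np.array([1] * 150 + [0] * 150)

# 40 annotated seeds in total, 20 per class.
seed_idx = np.concatenate([np.arange(20), 150 + np.arange(20)])
seed_labels = y[seed_idx]

pred = propagate_labels(x, seed_idx, seed_labels)
unlabeled = np.ones(len(x), dtype=bool)
unlabeled[seed_idx] = False
accuracy = (pred[unlabeled] == y[unlabeled]).mean()
print(f"accuracy on unlabeled toy points: {accuracy:.2f}")
```

The dense affinity matrix is quadratic in the number of responses, so a collection at the scale of SOCRATES-300K would need a sparse neighbour graph or batching instead.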

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The observed instability could be used to guide sampling strategies that favor lower-variance outputs on factual questions.
  • Similar embedding-space measurements might reveal whether multi-step agentic workflows amplify the same geometric asymmetry.
  • Training objectives that explicitly reward tighter clustering around correct answers could be tested as a way to reduce hallucinations.
  • The geometric signature might appear in other generative domains such as code or image synthesis, offering a cross-modal detection route.

Load-bearing premise

The tighter clustering of genuine responses must reflect a general retrieval instability rather than prompt-specific effects, model-size differences, or biases in the chosen embedding model.

What would settle it

Repeating the embedding and Fisher-projection analysis on a fresh collection of prompts or on models outside the original ten and finding that genuine and hallucinated responses no longer form separable clusters would refute the central geometric claim.

Original abstract

Hallucinations -- plausible but factually incorrect responses -- pose a major challenge to the reliability of Large Language Models (LLMs), especially in multi-step or agentic settings. Existing work largely frames hallucinations as a consequence of missing knowledge; we show instead that, even when the relevant factual knowledge is present, models still produce hallucinated answers, pointing to retrieval instability rather than knowledge gaps. Building on this observation, we introduce APORIA (Aggregate Prompt-wise Observation Retrieving Instability via Asymmetry -- the state of puzzlement-in-contradiction that hallucinations embody), a geometric framework that studies repeated responses to the same prompt in sentence-embedding space. Our central hypothesis is that genuine responses cluster more tightly than hallucinated ones; we empirically validate this and show that, after Fisher projection, the two response classes become consistently separable. We leverage this asymmetry in geometry via APORIA-LP, an efficient label-propagation method that classifies large collections of responses from as few as 30--50 annotations, achieving F1 scores above 90% across ten small-sized LLMs. To support further research, we release SOCRATES-300K, a fully labelled dataset of 300,000 responses, together with the code for both dataset generation and result reproduction. Our key finding -- framing hallucinations from a geometric perspective in the embedding space -- complements traditional knowledge-centric and single-response evaluation paradigms, paving the way for further research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that hallucinations in small LLMs arise from retrieval instability even when relevant knowledge is present. It introduces the APORIA geometric framework for analyzing repeated responses to the same prompt in sentence-embedding space, with the central hypothesis that genuine responses form tighter clusters than hallucinated ones. After Fisher projection the classes become separable, enabling the APORIA-LP label-propagation classifier that reaches F1 > 90% from only 30-50 annotations across ten small models. The authors release the fully labeled SOCRATES-300K dataset together with generation and reproduction code.

Significance. If the reported geometric asymmetry is robust and attributable to generation dynamics rather than representation artifacts, the work supplies a complementary perspective to knowledge-centric hallucination research and a practical low-supervision detection method. The public release of a large labeled dataset and accompanying code constitutes a clear strength for reproducibility and follow-on studies.

major comments (2)
  1. [Abstract and empirical validation] The interpretation that tighter genuine clusters reflect retrieval instability (Abstract; empirical validation) rather than sentence-embedding biases or prompt artifacts is load-bearing for the central claim yet unsupported by ablations. Repeating the intra-class variance analysis under TF-IDF, random projections, or a different encoder family is required to isolate the origin of the asymmetry.
  2. [Experimental setup] Details on the exact clustering metric, prompt sampling controls, and whether Fisher projection parameters were tuned post-hoc are absent from the experimental description, leaving the separability results difficult to assess for rigor.
minor comments (1)
  1. [Notation and terminology] Ensure the expansion of the APORIA acronym and consistent use of APORIA-LP appear in the main text as well as the abstract.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments highlight important areas for improving the clarity and robustness of our claims. We address each major comment point by point below and indicate the revisions we will make to the manuscript.

Point-by-point responses
  1. Referee: [Abstract and empirical validation] The interpretation that tighter genuine clusters reflect retrieval instability (Abstract; empirical validation) rather than sentence-embedding biases or prompt artifacts is load-bearing for the central claim yet unsupported by ablations. Repeating the intra-class variance analysis under TF-IDF, random projections, or a different encoder family is required to isolate the origin of the asymmetry.

    Authors: We agree that demonstrating the asymmetry is not an artifact of the particular sentence encoder is important for supporting our interpretation of retrieval instability. In the revised manuscript we add an ablation section that repeats the intra-class variance analysis using (i) TF-IDF bag-of-words vectors and (ii) embeddings from a different encoder family (paraphrase-MiniLM-L6-v2). The tighter clustering of genuine responses remains visible under both alternatives, indicating that the geometric asymmetry is not driven by the original encoder choice. Random projections were omitted because they destroy the semantic structure that our geometric hypothesis relies upon; we instead focus on semantically meaningful representations. These new results will be reported with the corresponding figures and statistics. revision: yes

  2. Referee: [Experimental setup] Details on the exact clustering metric, prompt sampling controls, and whether Fisher projection parameters were tuned post-hoc are absent from the experimental description, leaving the separability results difficult to assess for rigor.

    Authors: We thank the referee for noting these omissions. The revised Experimental Setup section now explicitly states: (1) intra-class variance is computed as the average pairwise cosine distance within each response group; (2) prompts were sampled with controls for topic diversity (balanced across 20 domains) and length (capped at 50 tokens) to avoid confounding; (3) Fisher discriminant parameters were obtained via 5-fold cross-validation strictly on the 30–50 annotated examples per model and were never tuned on the held-out evaluation set. These clarifications remove any ambiguity about post-hoc fitting and make the separability results fully reproducible. revision: yes
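
The variance measure named in the second response, the average pairwise cosine distance within a response group, can be sketched as follows. The function name `mean_pairwise_cosine_distance` and the synthetic vectors are assumptions for illustration; real inputs would be sentence embeddings of repeated responses to one prompt.

```python
import numpy as np


def mean_pairwise_cosine_distance(embeddings: np.ndarray) -> float:
    """Average of 1 - cos(x_i, x_j) over all unordered pairs i < j."""
    # Normalize rows so that dot products equal cosine similarities.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = unit @ unit.T
    # Strict upper triangle skips self-pairs and duplicate pairs.
    i, j = np.triu_indices(len(unit), k=1)
    return float(np.mean(1.0 - sims[i, j]))


# Synthetic stand-ins for sentence embeddings (not data from the paper).
rng = np.random.default_rng(0)
center = rng.normal(size=16)
tight_group = center + 0.05 * rng.normal(size=(40, 16))   # low spread
spread_group = center + 1.00 * rng.normal(size=(40, 16))  # high spread

tight_score = mean_pairwise_cosine_distance(tight_group)
spread_score = mean_pairwise_cosine_distance(spread_group)
print(tight_score, spread_score)
```

Under the paper's hypothesis, the genuine responses to a prompt would play the role of `tight_group` and the hallucinated ones the role of `spread_group`, so the first score would come out lower than the second.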

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper's core contribution is an empirical observation that genuine responses form tighter clusters than hallucinated ones in sentence-embedding space, followed by Fisher projection for separability and a label-propagation classifier (APORIA-LP) trained on 30-50 annotations. This chain rests on direct measurement and validation across the released SOCRATES-300K dataset and ten LLMs rather than any self-referential definition, fitted parameter renamed as prediction, or load-bearing self-citation. The hypothesis is tested experimentally; no equation or quantity is constructed to equal its own input, and the geometric asymmetry is presented as a measured property rather than derived from prior author results by fiat. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The framework rests on the empirical clustering hypothesis and introduces new named constructs (APORIA, APORIA-LP) plus a large labeled dataset; no explicit free parameters beyond the annotation count range are stated.

free parameters (1)
  • annotation budget
    30-50 annotations chosen to reach F1>90%; this threshold may be tuned to the observed separability.
axioms (1)
  • domain assumption: Genuine responses form tighter clusters than hallucinated ones in sentence-embedding space
    Central hypothesis of the geometric framework, validated empirically but not derived from first principles.
invented entities (3)
  • APORIA (no independent evidence)
    purpose: Geometric framework studying response instability via embedding-space asymmetry
    Newly defined construct that organizes the analysis.
  • APORIA-LP (no independent evidence)
    purpose: Efficient label-propagation classifier leveraging the geometric asymmetry
    Derived method built on APORIA observations.
  • SOCRATES-300K (independent evidence)
    purpose: Fully labelled dataset of 300,000 responses for further research
    Released artifact supporting reproducibility.

pith-pipeline@v0.9.0 · 5802 in / 1453 out tokens · 50197 ms · 2026-05-21T12:11:34.122076+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem.

    genuine responses exhibit greater semantic consistency than hallucinated responses... pairwise distances... Wasserstein distance between $D_{GG}$ and $D_{HH}$... Fisher Discriminant Analysis... $v \propto (S_W^{\lambda})^{-1} (\mu_G - \mu_H)$

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.