The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology
Pith reviewed 2026-05-19 12:33 UTC · model grok-4.3
The pith
Adversarial inputs cause topological compression in the latent spaces of large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adversarial inputs induce topological compression: the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, the framework reveals geometric invariants of representational change that complement existing linear interpretability methods.
What carries the argument
Persistent homology applied to sampled activation point clouds from LLM layers, which quantifies the birth and persistence of topological features such as connected components and higher-dimensional holes across filtration scales.
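As a minimal illustration of what "birth and persistence across filtration scales" means, the sketch below computes only the 0-dimensional part (connected components) of a Vietoris-Rips filtration on a small point cloud. It is not the paper's pipeline: the function name `h0_persistence`, the Euclidean metric, and the toy cloud are assumptions made here, and higher-dimensional holes would need a dedicated persistent homology library.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return the finite H0 death scales of a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. A component dies
    at the length of the edge that first merges it into another component.
    For n points there are n - 1 finite deaths; one component never dies.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, processed from shortest to longest.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge at this scale
            deaths.append(length)
    return deaths


# Hypothetical toy cloud: a tight triple and a tight pair, far apart.
cloud = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 0.0), (11.0, 0.0)]
deaths = h0_persistence(cloud)
print(deaths)  # [1.0, 1.0, 1.0, 9.0]
```

The three short bars come from merges inside the two clusters, and the single long bar records the gap between them. In the review's vocabulary, the short bars are small-scale features and the long bar is a dominant large-scale one.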
If this is right
- The topological compression provides an architecture-independent marker that distinguishes adversarial from clean inputs across different attack modes.
- The signature appears early enough in the network to support layer-wise monitoring of representational integrity.
- Quantifying neuron-level information flow through this lens identifies geometric invariants that linear methods overlook.
- The approach works for models spanning 3.8B to 70B parameters, suggesting scalability to larger systems.
Where Pith is reading between the lines
- Real-time monitoring of topological features during inference could enable new detection systems that flag attacks without task-specific retraining.
- Enhancing small-scale topological diversity in training might increase resistance to the observed compression effect.
- The same point-cloud homology pipeline could be tested on vision transformers or other modalities under adversarial perturbation to check for analogous signatures.
- Correlating the degree of compression with attack success rates or downstream task degradation would test whether the topological measure predicts practical harm.
Load-bearing premise
That persistent homology on sampled activation point clouds from LLM layers reliably detects attack-induced topological changes without being dominated by sampling artifacts or filtration hyperparameter choices.
What would settle it
Compare persistent homology barcodes computed on matched sets of clean and adversarial activations from the same layers, using multiple independent samplings and filtration parameters; the compression signature should reliably separate the two classes only if the claim holds.
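One way such a comparison could be scored is a permutation test on a scalar summary of each barcode (for example, total persistence), computed once per independent resample. The sketch below assumes that reduction has already been done. The function name `permutation_p_value` and the numbers are invented for illustration and are not results from the paper.

```python
import random


def permutation_p_value(clean, adversarial, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    `clean` and `adversarial` are lists of one scalar summary per resample.
    The returned p-value uses the +1 correction so it is never exactly 0.
    """
    rng = random.Random(seed)
    k = len(clean)
    observed = abs(sum(adversarial) / len(adversarial) - sum(clean) / k)

    pooled = list(clean) + list(adversarial)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabelling of the resamples
        first = sum(pooled[:k]) / k
        second = sum(pooled[k:]) / (len(pooled) - k)
        if abs(second - first) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Made-up summaries from six resamples per condition (illustration only).
clean_summary = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8]
adversarial_summary = [3.0, 3.2, 2.9, 3.1, 3.3, 2.8]

p = permutation_p_value(clean_summary, adversarial_summary)
print(p < 0.05)
```

Repeating the test across several filtration settings and sample sizes would show whether the separation depends on those choices, which is the concern raised in the load-bearing premise.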
Original abstract
Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features. This overlooks the high-dimensional, relational, and nonlinear geometry of model representations. We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of internal representation spaces of LLMs. This phenomenon, especially when considered across operationally different attack modes, remains poorly understood. We analyze six models (3.8B to 70B parameters) under two distinct attacks, indirect prompt injection and backdoor fine-tuning, and show that a consistent topological signature persists throughout. Adversarial inputs induce topological compression, where the latent space becomes structurally simpler, collapsing the latent space from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, our framework reveals geometric invariants of representational change that complement existing linear interpretability methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies persistent homology to activation point clouds from six LLMs (3.8B–70B parameters) under indirect prompt injection and backdoor fine-tuning attacks. It claims that adversarial inputs produce a consistent topological compression signature: the latent space collapses from varied, compact small-scale features to fewer dominant large-scale ones. This effect is reported as architecture-agnostic, emerging early in the network, and highly discriminative across layers, providing geometric invariants that complement linear interpretability methods.
Significance. If the compression signature proves robust, the work would meaningfully extend LLM interpretability by importing tools from topological data analysis to capture nonlinear, relational geometry in representations. The multi-model and dual-attack design is a strength that could support generalizable claims, and the emphasis on early-layer emergence offers potential for practical detection. The manuscript currently provides no machine-checked proofs or parameter-free derivations, so its contribution rests entirely on the empirical reliability of the PH observations.
major comments (3)
- [§4] §4 (Persistent Homology on Activations): the central claim of topological compression requires that observed changes in persistence diagrams reflect intrinsic manifold topology rather than shifts in point density or variance. No controls are described for matching adversarial and clean activation clouds on first- and second-order statistics, nor for varying sample size; without these, the architecture-agnostic and early-emergence assertions rest on an untested assumption.
- [§5] §5 (Experimental Results): the abstract states 'consistent findings across six models and two attacks' yet the text supplies no statistical tests, confidence intervals, or ablation on PH hyperparameters (filtration radius, maximum homology dimension, number of points per cloud). This is load-bearing for the claim that the signature is 'highly discriminative across layers.'
- [Table 1] Table 1 or equivalent results table: the reported separation between clean and adversarial PH diagrams is presented without baseline comparisons to non-adversarial inputs that induce similar activation-norm changes; this leaves open whether the compression is attack-specific or a generic response to distribution shift.
minor comments (2)
- [Abstract] The term 'topological compression' is used repeatedly but never given an explicit quantitative definition (e.g., in terms of total persistence, number of long-lived bars, or Wasserstein distance between diagrams).
- [Figures] Figure captions should explicitly state the number of tokens/points sampled per layer and the exact Vietoris–Rips or alpha-complex parameters used.
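The candidate definitions named in the first minor comment can be made concrete with a few lines of code. The sketch below reduces a list of (birth, death) bars to total persistence, bar count, and the number of long-lived bars. The function name `summarize_bars`, the `long_lived` threshold, and both toy diagrams are assumptions introduced here, not quantities from the paper.

```python
def summarize_bars(bars, long_lived=1.0):
    """Summarize finite persistence bars given as (birth, death) pairs."""
    lifetimes = [death - birth for birth, death in bars]
    return {
        "n_bars": len(lifetimes),
        "total_persistence": sum(lifetimes),
        # Bars whose lifetime reaches the (hypothetical) threshold.
        "n_long_lived": sum(1 for t in lifetimes if t >= long_lived),
        "max_lifetime": max(lifetimes, default=0.0),
    }


# Made-up diagrams: many short bars versus few long ones.
clean_bars = [(0.0, 0.5), (0.0, 0.5), (0.0, 0.25), (0.0, 0.25), (0.0, 2.0)]
adversarial_bars = [(0.0, 0.5), (0.0, 3.0)]

clean_stats = summarize_bars(clean_bars)
adversarial_stats = summarize_bars(adversarial_bars)
print(clean_stats["n_bars"], adversarial_stats["n_bars"])  # 5 2
```

In this invented example the two diagrams have the same total persistence (3.5) but different bar counts and maximum lifetimes, which is one reason an explicit definition of "compression" matters.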
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which will help improve the clarity and robustness of our empirical findings. We provide point-by-point responses to the major comments and outline the revisions we intend to make.
Point-by-point responses
- Referee: [§4] §4 (Persistent Homology on Activations): the central claim of topological compression requires that observed changes in persistence diagrams reflect intrinsic manifold topology rather than shifts in point density or variance. No controls are described for matching adversarial and clean activation clouds on first- and second-order statistics, nor for varying sample size; without these, the architecture-agnostic and early-emergence assertions rest on an untested assumption.
  Authors: We agree that it is important to rule out the possibility that the observed topological changes are merely artifacts of shifts in point density or variance. The current manuscript does not explicitly detail such controls in §4. In the revised manuscript, we will add a subsection in §4 describing the preprocessing of activation point clouds, including how we ensure consistent sample sizes across clean and adversarial conditions. Furthermore, we will include additional analyses where we match the first- and second-order statistics (e.g., by centering and scaling the point clouds or using density-matched subsampling) and show that the persistence diagram differences persist. This will provide stronger evidence that the compression signature reflects intrinsic topological properties of the latent space.
  Revision: yes
- Referee: [§5] §5 (Experimental Results): the abstract states 'consistent findings across six models and two attacks' yet the text supplies no statistical tests, confidence intervals, or ablation on PH hyperparameters (filtration radius, maximum homology dimension, number of points per cloud). This is load-bearing for the claim that the signature is 'highly discriminative across layers.'
  Authors: We acknowledge the lack of formal statistical validation and hyperparameter ablations in the current version of §5. To address this, we will revise the experimental results section to include appropriate statistical tests (such as t-tests or permutation tests) for the differences in topological features between clean and adversarial inputs, along with confidence intervals for key metrics. We will also conduct and report ablations varying the filtration radius, homology dimension, and number of points per cloud to demonstrate the stability of the discriminative signature across layers. These results will be added to support the consistency claims across models and attacks.
  Revision: yes
- Referee: [Table 1] Table 1 or equivalent results table: the reported separation between clean and adversarial PH diagrams is presented without baseline comparisons to non-adversarial inputs that induce similar activation-norm changes; this leaves open whether the compression is attack-specific or a generic response to distribution shift.
  Authors: We appreciate this point regarding the specificity of the observed effect. The manuscript does not currently include such baseline comparisons in Table 1 or the associated text. In the revision, we will augment the results with additional experiments using non-adversarial distribution shifts (e.g., inputs with added noise or from shifted but non-adversarial distributions) that produce comparable changes in activation norms. We will then compare the PH diagrams to determine if the topological compression is unique to the adversarial settings or more general. This will clarify the nature of the signature and its relation to adversarial influence specifically.
  Revision: yes
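The "centering and scaling" control mentioned in the first response could look like the sketch below, which standardizes each coordinate of a point cloud before any persistence computation. This is an assumption about one possible preprocessing step, not the authors' procedure, and the function name `standardize` is hypothetical. It matches only the mean and per-coordinate variance; matching the full covariance would require whitening.

```python
import statistics


def standardize(cloud):
    """Center each coordinate to mean 0 and scale it to unit population std.

    `cloud` is a list of equal-length tuples. Constant coordinates are left
    at 0 rather than divided by a zero standard deviation.
    """
    columns = list(zip(*cloud))
    means = [statistics.fmean(column) for column in columns]
    stds = [statistics.pstdev(column) or 1.0 for column in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(point, means, stds))
        for point in cloud
    ]


# Toy cloud: the second coordinate is constant on purpose.
cloud = [(0.0, 10.0), (2.0, 10.0), (4.0, 10.0)]
scaled = standardize(cloud)
first_coordinate = [point[0] for point in scaled]
print(round(statistics.pstdev(first_coordinate), 6))  # 1.0
```

Applying the same transform separately to the clean and adversarial clouds removes differences in location and per-coordinate spread, so any remaining gap between their persistence diagrams is harder to attribute to a simple rescaling.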
Circularity Check
Empirical persistent homology application shows no circular derivation
Full rationale
The paper conducts an empirical study by applying standard persistent homology to activation point clouds sampled from LLM layers under two adversarial attack types across six models. No derivation chain exists that reduces predictions or topological signatures to fitted inputs, self-definitions, or self-citation chains; the observed compression pattern is presented as a data-driven finding rather than a mathematical necessity derived from prior author work. The methodology relies on external, falsifiable computations on real activations without renaming known results or smuggling ansatzes via citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Activation vectors from LLM layers form point clouds whose persistent homology features meaningfully reflect representational geometry relevant to adversarial influence.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: Adversarial inputs induce topological compression, where the latent space becomes structurally simpler, collapsing the latent space from varied, compact, small-scale features into fewer, dominant, large-scale ones.
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of internal representation spaces of LLMs.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.