The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

Aideen Fay; Anthea Monod; Haim Dubossarsky; Inés García-Redondo; Qiquan Wang

arXiv: 2505.20435 · v3 · submitted 2025-05-26 · cs.LG · cs.AI · cs.CG · math.AT

Pith reviewed 2026-05-19 12:33 UTC · model grok-4.3

keywords: persistent homology, adversarial attacks, LLM latent spaces, topological data analysis, activation geometry, model interpretability, topological compression, adversarial robustness

The pith

Adversarial inputs cause topological compression in the latent spaces of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies persistent homology to point clouds of activations from LLM layers to track how adversarial inputs reshape internal representations. It establishes that attacks produce a consistent compression: the space loses varied small-scale topological features and retains fewer dominant large-scale ones. This pattern holds across six models ranging from 3.8B to 70B parameters, appears early in the layers, and remains similar under indirect prompt injection and backdoor fine-tuning. A sympathetic reader would care because the method supplies a nonlinear geometric signature that existing linear interpretability approaches do not capture.

Core claim

Adversarial inputs induce topological compression: the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, the framework reveals geometric invariants of representational change that complement existing linear interpretability methods.

What carries the argument

Persistent homology applied to sampled activation point clouds from LLM layers, which quantifies the birth and persistence of topological features such as connected components and higher-dimensional holes across filtration scales.
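
As a minimal sketch of what this machinery computes in the simplest case (dimension 0, connected components), the following pure-Python function returns the scales at which components of a point cloud merge in a Vietoris-Rips filtration. The function name `h0_persistence`, the toy cloud, and the convention that an edge enters the filtration at a scale equal to its length are illustrative assumptions, not the paper's pipeline; a real analysis would likely use a dedicated library and also track higher-dimensional holes.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Death scales of the finite H0 bars of a Vietoris-Rips filtration.

    Every point is its own component at scale 0. A component dies when an
    edge first connects it to another component (Kruskal with union-find).
    Assumption: an edge appears at scale equal to its Euclidean length.
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, shortest first.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )

    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # merge: one component dies here
            deaths.append(length)
    return deaths  # n - 1 values; the last component never dies


# Hypothetical toy cloud: two tight pairs that sit far apart.
cloud = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0)]
deaths = h0_persistence(cloud)
```

On this toy cloud the two tight pairs merge at scale 1.0 and the two resulting clusters merge at 9.0, so `deaths` is `[1.0, 1.0, 9.0]`: two short-lived bars and one long-lived bar.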

If this is right

The topological compression provides an architecture-independent marker that distinguishes adversarial from clean inputs across different attack modes.
The signature appears early enough in the network to support layer-wise monitoring of representational integrity.
Quantifying neuron-level information flow through this lens identifies geometric invariants that linear methods overlook.
The approach works for models spanning 3.8B to 70B parameters, suggesting scalability to larger systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Real-time monitoring of topological features during inference could enable new detection systems that flag attacks without task-specific retraining.
Enhancing small-scale topological diversity in training might increase resistance to the observed compression effect.
The same point-cloud homology pipeline could be tested on vision transformers or other modalities under adversarial perturbation to check for analogous signatures.
Correlating the degree of compression with attack success rates or downstream task degradation would test whether the topological measure predicts practical harm.

Load-bearing premise

That persistent homology on sampled activation point clouds from LLM layers reliably detects attack-induced topological changes without being dominated by sampling artifacts or filtration hyperparameter choices.

What would settle it

Compare persistent homology barcodes computed on matched sets of clean and adversarial activations from the same layers, using multiple independent samplings and filtration parameters; the compression signature should reliably separate the two classes only if the claim holds.
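
The repeated-sampling part of this check can be sketched as follows. This is a hedged illustration only: the summary statistic (total H0 persistence, which equals the total edge length of the Euclidean minimum spanning tree), the function names, and the synthetic Gaussian clouds are assumptions made here, not the paper's procedure or data. The "squeezed" cloud is just a rescaled Gaussian, so the gap it produces is exactly the kind of spread effect that a scale-matching control would need to rule out.

```python
import math
import random


def total_h0_persistence(points):
    """Sum of finite H0 bar lengths, computed as the Euclidean MST length (Prim)."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def subsample_summaries(cloud, sample_size, n_repeats, seed=0):
    """Summary statistic on several independent subsamples of one cloud."""
    rng = random.Random(seed)
    return [
        total_h0_persistence(rng.sample(cloud, sample_size))
        for _ in range(n_repeats)
    ]


# Synthetic stand-ins for activations (NOT real model data).
data_rng = random.Random(1)
clean = [(data_rng.gauss(0, 1.0), data_rng.gauss(0, 1.0)) for _ in range(200)]
squeezed = [(data_rng.gauss(0, 0.2), data_rng.gauss(0, 0.2)) for _ in range(200)]

clean_stats = subsample_summaries(clean, sample_size=50, n_repeats=20)
squeezed_stats = subsample_summaries(squeezed, sample_size=50, n_repeats=20)
```

If the two lists of summaries overlap heavily across subsamples, or the gap disappears when the sample size or filtration choices change, the signature would not be separating the classes reliably.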

Original abstract

Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features. This overlooks the high-dimensional, relational, and nonlinear geometry of model representations. We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of internal representation spaces of LLMs. This phenomenon, especially when considered across operationally different attack modes, remains poorly understood. We analyze six models (3.8B to 70B parameters) under two distinct attacks, indirect prompt injection and backdoor fine-tuning, and show that a consistent topological signature persists throughout. Adversarial inputs induce topological compression, where the latent space becomes structurally simpler, collapsing the latent space from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, our framework reveals geometric invariants of representational change that complement existing linear interpretability methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reports a consistent topological compression in LLM latent spaces under two attacks across six models, but the evidence needs tighter controls on sampling and distribution shifts to rule out artifacts.

The main thing to know is that this work uses persistent homology on activation point clouds to argue that adversarial inputs make LLM latent spaces structurally simpler, shifting from many small-scale features to fewer large-scale ones, with the pattern appearing early in the network and holding across model sizes and attack types. The authors test indirect prompt injection and backdoor fine-tuning on models from 3.8B to 70B parameters and present the method as a nonlinear complement to existing linear interpretability tools.

The scale of the experiment across architectures and the focus on relational geometry rather than single directions are the clearest strengths here. The multi-model setup gives the observation some weight as an empirical pattern worth following up.

The soft spots sit mainly in the methods. The abstract and available description give little on how the activation samples were drawn, what filtration parameters were fixed for the homology computation, or whether adversarial and clean inputs were matched on basic statistics like variance or sparsity. The stress-test concern about finite point clouds in high dimensions is reasonable: changes in persistence could track shifts in empirical density or spread induced by the attacks instead of deeper manifold changes. If the full paper includes explicit controls or robustness checks on sample size and hyperparameters, that would address it directly; otherwise it remains a gap that affects how much weight to put on the architecture-agnostic and early-emergence claims. The application of standard persistent homology looks straightforward, with no circular definitions or heavy self-citation issues.

This is for interpretability and safety researchers who already work with geometric or topological tools and want to see nonlinear signatures tested at scale. A reader focused on new diagnostics would get some value from the reported patterns, though they would need the figures and exact procedures to judge stability.

It deserves peer review because the core observation is new enough and the experimental breadth is solid enough to justify referee time, even with the methodological details that still need tightening.

Referee Report

3 major / 2 minor

Summary. The paper applies persistent homology to activation point clouds from six LLMs (3.8B–70B parameters) under indirect prompt injection and backdoor fine-tuning attacks. It claims that adversarial inputs produce a consistent topological compression signature: the latent space collapses from varied, compact small-scale features to fewer dominant large-scale ones. This effect is reported as architecture-agnostic, emerging early in the network, and highly discriminative across layers, providing geometric invariants that complement linear interpretability methods.

Significance. If the compression signature proves robust, the work would meaningfully extend LLM interpretability by importing tools from topological data analysis to capture nonlinear, relational geometry in representations. The multi-model and dual-attack design is a strength that could support generalizable claims, and the emphasis on early-layer emergence offers potential for practical detection. The manuscript currently provides no machine-checked proofs or parameter-free derivations, so its contribution rests entirely on the empirical reliability of the PH observations.

major comments (3)

[§4] §4 (Persistent Homology on Activations): the central claim of topological compression requires that observed changes in persistence diagrams reflect intrinsic manifold topology rather than shifts in point density or variance. No controls are described for matching adversarial and clean activation clouds on first- and second-order statistics, nor for varying sample size; without these, the architecture-agnostic and early-emergence assertions rest on an untested assumption.
[§5] §5 (Experimental Results): the abstract states 'consistent findings across six models and two attacks' yet the text supplies no statistical tests, confidence intervals, or ablation on PH hyperparameters (filtration radius, maximum homology dimension, number of points per cloud). This is load-bearing for the claim that the signature is 'highly discriminative across layers.'
[Table 1] Table 1 or equivalent results table: the reported separation between clean and adversarial PH diagrams is presented without baseline comparisons to non-adversarial inputs that induce similar activation-norm changes; this leaves open whether the compression is attack-specific or a generic response to distribution shift.
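
One simple form of the statistic-matching control requested in the first major comment is to center and scale every cloud before computing persistence. The sketch below is an assumption-laden illustration, not the authors' preprocessing: the function name `standardize` and the toy cloud are hypothetical, and per-coordinate standardization only matches means and marginal variances, not the full covariance structure.

```python
import statistics


def standardize(cloud):
    """Center each coordinate to mean 0 and scale it to unit population std.

    Assumption: `cloud` is a non-empty list of equal-length tuples of floats.
    Constant coordinates are left unscaled to avoid dividing by zero.
    """
    columns = list(zip(*cloud))  # one tuple per coordinate
    means = [statistics.fmean(col) for col in columns]
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(point, means, stds))
        for point in cloud
    ]


# Hypothetical toy cloud whose coordinates live on very different scales.
cloud = [(0.0, 10.0), (2.0, 30.0), (4.0, 50.0)]
z = standardize(cloud)
```

Applying the same transform to clean and adversarial clouds before computing diagrams removes differences that come purely from location and per-coordinate spread, so any remaining gap in persistence is harder to attribute to those factors alone.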

minor comments (2)

[Abstract] The term 'topological compression' is used repeatedly but never given an explicit quantitative definition (e.g., in terms of total persistence, number of long-lived bars, or Wasserstein distance between diagrams).
[Figures] Figure captions should explicitly state the number of tokens/points sampled per layer and the exact Vietoris–Rips or alpha-complex parameters used.
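
For concreteness, the three candidate quantities named in the first minor comment have standard definitions, written out below. This is a reference sketch only; which of them, if any, the paper uses to quantify compression is not stated in the material reviewed here, and the threshold symbol \tau is an illustrative assumption.

```latex
% D_k = \{(b_i, d_i)\}_{i \in I}: finite bars of the persistence diagram
% in homology dimension k, with birth b_i and death d_i.
\[
  \mathrm{TP}(D_k) = \sum_{i \in I} (d_i - b_i),
  \qquad
  N_\tau(D_k) = \#\{\, i \in I : d_i - b_i > \tau \,\},
\]
% p-Wasserstein distance between two diagrams D and D'.
\[
  W_p(D, D') = \Bigl( \inf_{\gamma} \sum_{x \in D}
    \lVert x - \gamma(x) \rVert_\infty^{\,p} \Bigr)^{1/p},
\]
```

Here \gamma ranges over bijections between D and D' after both diagrams are augmented with points on the diagonal, so that unmatched bars can be paired with zero-length ones.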

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which will help improve the clarity and robustness of our empirical findings. We provide point-by-point responses to the major comments and outline the revisions we intend to make.

Referee: [§4] §4 (Persistent Homology on Activations): the central claim of topological compression requires that observed changes in persistence diagrams reflect intrinsic manifold topology rather than shifts in point density or variance. No controls are described for matching adversarial and clean activation clouds on first- and second-order statistics, nor for varying sample size; without these, the architecture-agnostic and early-emergence assertions rest on an untested assumption.

Authors: We agree that it is important to rule out the possibility that the observed topological changes are merely artifacts of shifts in point density or variance. The current manuscript does not explicitly detail such controls in §4. In the revised manuscript, we will add a subsection in §4 describing the preprocessing of activation point clouds, including how we ensure consistent sample sizes across clean and adversarial conditions. Furthermore, we will include additional analyses where we match the first- and second-order statistics (e.g., by centering and scaling the point clouds or using density-matched subsampling) and show that the persistence diagram differences persist. This will provide stronger evidence that the compression signature reflects intrinsic topological properties of the latent space. revision: yes
Referee: [§5] §5 (Experimental Results): the abstract states 'consistent findings across six models and two attacks' yet the text supplies no statistical tests, confidence intervals, or ablation on PH hyperparameters (filtration radius, maximum homology dimension, number of points per cloud). This is load-bearing for the claim that the signature is 'highly discriminative across layers.'

Authors: We acknowledge the lack of formal statistical validation and hyperparameter ablations in the current version of §5. To address this, we will revise the experimental results section to include appropriate statistical tests (such as t-tests or permutation tests) for the differences in topological features between clean and adversarial inputs, along with confidence intervals for key metrics. We will also conduct and report ablations varying the filtration radius, homology dimension, and number of points per cloud to demonstrate the stability of the discriminative signature across layers. These results will be added to support the consistency claims across models and attacks. revision: yes
Referee: [Table 1] Table 1 or equivalent results table: the reported separation between clean and adversarial PH diagrams is presented without baseline comparisons to non-adversarial inputs that induce similar activation-norm changes; this leaves open whether the compression is attack-specific or a generic response to distribution shift.

Authors: We appreciate this point regarding the specificity of the observed effect. The manuscript does not currently include such baseline comparisons in Table 1 or the associated text. In the revision, we will augment the results with additional experiments using non-adversarial distribution shifts (e.g., inputs with added noise or from shifted but non-adversarial distributions) that produce comparable changes in activation norms. We will then compare the PH diagrams to determine if the topological compression is unique to the adversarial settings or more general. This will clarify the nature of the signature and its relation to adversarial influence specifically. revision: yes
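
The permutation test mentioned in the second response could take roughly the following shape. This is a sketch under stated assumptions: the function name `permutation_test`, the choice of a difference in means as the test statistic, and the score values are all hypothetical and do not come from the paper. Each score stands for one scalar topological summary (for example, total persistence) computed on one point cloud.

```python
import random
import statistics


def permutation_test(a, b, n_perm=10000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Returns (observed difference, p-value). Labels are shuffled n_perm
    times; the p-value uses the usual +1 correction so it is never zero.
    """
    rng = random.Random(seed)
    observed = statistics.fmean(a) - statistics.fmean(b)
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical per-cloud summaries, made up for illustration only.
clean_scores = [5.1, 4.8, 5.3, 5.0, 4.9, 5.2]
adv_scores = [3.9, 4.1, 3.7, 4.0, 4.2, 3.8]
diff, p_value = permutation_test(clean_scores, adv_scores, n_perm=2000)
```

A permutation test makes no distributional assumption about the summaries, which suits persistence statistics whose sampling distribution is not known in closed form.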

Circularity Check

0 steps flagged

Empirical persistent homology application shows no circular derivation

The paper conducts an empirical study by applying standard persistent homology to activation point clouds sampled from LLM layers under two adversarial attack types across six models. No derivation chain exists that reduces predictions or topological signatures to fitted inputs, self-definitions, or self-citation chains; the observed compression pattern is presented as a data-driven finding rather than a mathematical necessity derived from prior author work. The methodology relies on external, falsifiable computations on real activations without renaming known results or smuggling ansatzes via citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the applicability of persistent homology to neural activation point clouds and the assumption that observed topological changes are causally linked to adversarial inputs rather than other factors. No new entities are postulated.

axioms (1)

Domain assumption: Activation vectors from LLM layers form point clouds whose persistent homology features meaningfully reflect representational geometry relevant to adversarial influence.
Core assumption enabling the application of PH to this domain.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · relation to the paper passage: unclear
Paper passage: "Adversarial inputs induce topological compression, where the latent space becomes structurally simpler, collapsing the latent space from varied, compact, small-scale features into fewer, dominant, large-scale ones."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · relation to the paper passage: unclear
Paper passage: "We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of internal representation spaces of LLMs."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.