arxiv: 2505.20435 · v3 · submitted 2025-05-26 · cs.LG · cs.AI · cs.CG · math.AT

The Shape of Adversarial Influence: Characterizing LLM Latent Spaces with Persistent Homology

Pith reviewed 2026-05-19 12:33 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.CG · math.AT
keywords persistent homology · adversarial attacks · LLM latent spaces · topological data analysis · activation geometry · model interpretability · topological compression · adversarial robustness

The pith

Adversarial inputs cause topological compression in the latent spaces of large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies persistent homology to point clouds of activations from LLM layers to track how adversarial inputs reshape internal representations. It establishes that attacks produce a consistent compression: the space loses varied small-scale topological features and retains fewer dominant large-scale ones. This pattern holds across six models ranging from 3.8B to 70B parameters, appears early in the layers, and remains similar under indirect prompt injection and backdoor fine-tuning. A sympathetic reader would care because the method supplies a nonlinear geometric signature that existing linear interpretability approaches do not capture.

Core claim

Adversarial inputs induce topological compression: the latent space becomes structurally simpler, collapsing from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, the framework reveals geometric invariants of representational change that complement existing linear interpretability methods.

What carries the argument

Persistent homology applied to sampled activation point clouds from LLM layers, which quantifies the birth and persistence of topological features such as connected components and higher-dimensional holes across filtration scales.
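
To make the mechanism concrete, here is a minimal Python sketch of the zero-dimensional case only (connected components) under a Vietoris–Rips filtration with Euclidean distance. It is an illustration, not the paper's pipeline: the function name `h0_persistence` and the toy point cloud are hypothetical, and the paper's choice of complex, metric, and homology dimensions is not stated in this summary. In this setting every point is born at scale 0, and a component dies at the length of the edge that first merges it into another, so the finite bars coincide with minimum-spanning-tree edge lengths.

```python
import math
from itertools import combinations


def h0_persistence(points):
    """Return death scales of the finite H0 bars of a Vietoris-Rips filtration.

    Every point is born at scale 0. A component dies when an edge first
    merges it into another component (Kruskal's algorithm with union-find).
    """
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairwise edges, sorted by Euclidean length (the filtration scale).
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(length)  # one component dies at this scale
    return deaths  # n - 1 finite bars; the last component never dies


# Hypothetical toy "activations": two tight clusters that sit far apart.
cloud = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
deaths = h0_persistence(cloud)
print(deaths)  # [1.0, 1.0, 1.0, 9.0]
```

On this toy cloud the three short bars come from merges inside each cluster, and the single long bar comes from the gap between the clusters. Higher-dimensional features such as loops need a full boundary-matrix reduction, which is usually left to a dedicated library.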

If this is right

  • The topological compression provides an architecture-independent marker that distinguishes adversarial from clean inputs across different attack modes.
  • The signature appears early enough in the network to support layer-wise monitoring of representational integrity.
  • Quantifying neuron-level information flow through this lens identifies geometric invariants that linear methods overlook.
  • The approach works for models spanning 3.8B to 70B parameters, suggesting scalability to larger systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Real-time monitoring of topological features during inference could enable new detection systems that flag attacks without task-specific retraining.
  • Enhancing small-scale topological diversity in training might increase resistance to the observed compression effect.
  • The same point-cloud homology pipeline could be tested on vision transformers or other modalities under adversarial perturbation to check for analogous signatures.
  • Correlating the degree of compression with attack success rates or downstream task degradation would test whether the topological measure predicts practical harm.

Load-bearing premise

That persistent homology on sampled activation point clouds from LLM layers reliably detects attack-induced topological changes without being dominated by sampling artifacts or filtration hyperparameter choices.

What would settle it

Compare persistent homology barcodes computed on matched sets of clean and adversarial activations from the same layers, using multiple independent samplings and filtration parameters; the compression signature should reliably separate the two classes only if the claim holds.
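
One way such a comparison could be scored is a permutation test on a scalar barcode summary computed once per independent resampling. The sketch below assumes the summary values already exist; the function name `permutation_p_value` and the numbers are hypothetical placeholders, not results from the paper.

```python
import random
from statistics import mean


def permutation_p_value(clean, adversarial, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(adversarial) - mean(clean))
    pooled = list(clean) + list(adversarial)
    n_clean = len(clean)
    hits = 0
    for _ in range(n_permutations):
        # Randomly reassign the group labels by shuffling the pooled values.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_clean:]) - mean(pooled[:n_clean]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical summary values, one per independent resampling of a layer.
clean_scores = [4.1, 3.9, 4.3, 4.0, 4.2, 3.8]
adversarial_scores = [2.9, 3.1, 2.7, 3.0, 2.8, 3.2]
p = permutation_p_value(clean_scores, adversarial_scores)
```

Repeating the test for each filtration setting and each layer would show whether the separation survives the hyperparameter choices the premise worries about. A multiple-comparison correction would then be needed.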

Original abstract

Existing interpretability methods for Large Language Models (LLMs) predominantly capture linear directions or isolated features. This overlooks the high-dimensional, relational, and nonlinear geometry of model representations. We apply persistent homology (PH) to characterize how adversarial inputs reshape the geometry and topology of internal representation spaces of LLMs. This phenomenon, especially when considered across operationally different attack modes, remains poorly understood. We analyze six models (3.8B to 70B parameters) under two distinct attacks, indirect prompt injection and backdoor fine-tuning, and show that a consistent topological signature persists throughout. Adversarial inputs induce topological compression, where the latent space becomes structurally simpler, collapsing the latent space from varied, compact, small-scale features into fewer, dominant, large-scale ones. This signature is architecture-agnostic, emerges early in the network, and is highly discriminative across layers. By quantifying the shape of activation point clouds and neuron-level information flow, our framework reveals geometric invariants of representational change that complement existing linear interpretability methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper applies persistent homology to activation point clouds from six LLMs (3.8B–70B parameters) under indirect prompt injection and backdoor fine-tuning attacks. It claims that adversarial inputs produce a consistent topological compression signature: the latent space collapses from varied, compact small-scale features to fewer dominant large-scale ones. This effect is reported as architecture-agnostic, emerging early in the network, and highly discriminative across layers, providing geometric invariants that complement linear interpretability methods.

Significance. If the compression signature proves robust, the work would meaningfully extend LLM interpretability by importing tools from topological data analysis to capture nonlinear, relational geometry in representations. The multi-model and dual-attack design is a strength that could support generalizable claims, and the emphasis on early-layer emergence offers potential for practical detection. The manuscript currently provides no machine-checked proofs or parameter-free derivations, so its contribution rests entirely on the empirical reliability of the PH observations.

major comments (3)
  1. [§4] §4 (Persistent Homology on Activations): the central claim of topological compression requires that observed changes in persistence diagrams reflect intrinsic manifold topology rather than shifts in point density or variance. No controls are described for matching adversarial and clean activation clouds on first- and second-order statistics, nor for varying sample size; without these, the architecture-agnostic and early-emergence assertions rest on an untested assumption.
  2. [§5] §5 (Experimental Results): the abstract states 'consistent findings across six models and two attacks' yet the text supplies no statistical tests, confidence intervals, or ablation on PH hyperparameters (filtration radius, maximum homology dimension, number of points per cloud). This is load-bearing for the claim that the signature is 'highly discriminative across layers.'
  3. [Table 1] Table 1 or equivalent results table: the reported separation between clean and adversarial PH diagrams is presented without baseline comparisons to non-adversarial inputs that induce similar activation-norm changes; this leaves open whether the compression is attack-specific or a generic response to distribution shift.
minor comments (2)
  1. [Abstract] The term 'topological compression' is used repeatedly but never given an explicit quantitative definition (e.g., in terms of total persistence, number of long-lived bars, or Wasserstein distance between diagrams).
  2. [Figures] Figure captions should explicitly state the number of tokens/points sampled per layer and the exact Vietoris–Rips or alpha-complex parameters used.
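
For reference, the three candidate quantities named in minor comment 1 have standard definitions in topological data analysis, written out below. These are textbook forms, not definitions taken from the paper; the threshold \tau and the exponent p are free choices that the authors would need to fix.

```latex
% D_k: multiset of (birth, death) pairs of a diagram D in homology dimension k.
% Total persistence and the number of bars longer than a threshold \tau:
\[
\mathrm{TP}_k(D) = \sum_{(b,d) \in D_k} (d - b), \qquad
N_k^{\tau}(D) = \bigl|\{(b,d) \in D_k : d - b > \tau\}\bigr|
\]
% p-Wasserstein distance between diagrams D and D'. The infimum runs over
% bijections \gamma after both diagrams are augmented with diagonal points.
\[
W_p(D, D') = \Bigl( \inf_{\gamma : D \to D'} \sum_{x \in D}
  \lVert x - \gamma(x) \rVert_\infty^{p} \Bigr)^{1/p}
\]
```

An explicit definition of compression could then be stated as a directional change in one or more of these quantities between clean and adversarial diagrams. Which of them the authors intend is not specified here.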

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which will help improve the clarity and robustness of our empirical findings. We provide point-by-point responses to the major comments and outline the revisions we intend to make.

Point-by-point responses
  1. Referee: [§4] §4 (Persistent Homology on Activations): the central claim of topological compression requires that observed changes in persistence diagrams reflect intrinsic manifold topology rather than shifts in point density or variance. No controls are described for matching adversarial and clean activation clouds on first- and second-order statistics, nor for varying sample size; without these, the architecture-agnostic and early-emergence assertions rest on an untested assumption.

    Authors: We agree that it is important to rule out the possibility that the observed topological changes are merely artifacts of shifts in point density or variance. The current manuscript does not explicitly detail such controls in §4. In the revised manuscript, we will add a subsection in §4 describing the preprocessing of activation point clouds, including how we ensure consistent sample sizes across clean and adversarial conditions. Furthermore, we will include additional analyses where we match the first- and second-order statistics (e.g., by centering and scaling the point clouds or using density-matched subsampling) and show that the persistence diagram differences persist. This will provide stronger evidence that the compression signature reflects intrinsic topological properties of the latent space. revision: yes

  2. Referee: [§5] §5 (Experimental Results): the abstract states 'consistent findings across six models and two attacks' yet the text supplies no statistical tests, confidence intervals, or ablation on PH hyperparameters (filtration radius, maximum homology dimension, number of points per cloud). This is load-bearing for the claim that the signature is 'highly discriminative across layers.'

    Authors: We acknowledge the lack of formal statistical validation and hyperparameter ablations in the current version of §5. To address this, we will revise the experimental results section to include appropriate statistical tests (such as t-tests or permutation tests) for the differences in topological features between clean and adversarial inputs, along with confidence intervals for key metrics. We will also conduct and report ablations varying the filtration radius, homology dimension, and number of points per cloud to demonstrate the stability of the discriminative signature across layers. These results will be added to support the consistency claims across models and attacks. revision: yes

  3. Referee: [Table 1] Table 1 or equivalent results table: the reported separation between clean and adversarial PH diagrams is presented without baseline comparisons to non-adversarial inputs that induce similar activation-norm changes; this leaves open whether the compression is attack-specific or a generic response to distribution shift.

    Authors: We appreciate this point regarding the specificity of the observed effect. The manuscript does not currently include such baseline comparisons in Table 1 or the associated text. In the revision, we will augment the results with additional experiments using non-adversarial distribution shifts (e.g., inputs with added noise or from shifted but non-adversarial distributions) that produce comparable changes in activation norms. We will then compare the PH diagrams to determine if the topological compression is unique to the adversarial settings or more general. This will clarify the nature of the signature and its relation to adversarial influence specifically. revision: yes
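
The centering-and-scaling control mentioned in response 1 can be sketched as below. This is a minimal illustration under stated assumptions: the function name `standardize` and the toy clouds are hypothetical, and it matches only per-coordinate means and variances, not the full covariance, which would need a whitening transform.

```python
from statistics import fmean, pstdev


def standardize(cloud):
    """Center each coordinate to mean 0 and scale it to unit variance."""
    columns = list(zip(*cloud))
    means = [fmean(col) for col in columns]
    # Constant coordinates have zero spread; leave them unscaled.
    scales = [pstdev(col) or 1.0 for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(point, means, scales))
        for point in cloud
    ]


# Hypothetical clouds with very different location and spread.
clean_cloud = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0)]
adversarial_cloud = [(100.0, 5.0), (300.0, 5.0), (200.0, 5.0)]
clean_std = standardize(clean_cloud)
adversarial_std = standardize(adversarial_cloud)
```

Persistence diagrams computed on the standardized clouds can then be compared without differences in overall location or scale driving the result. Any separation that remains is harder to attribute to variance shifts alone.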

Circularity Check

0 steps flagged

Empirical persistent homology application shows no circular derivation

full rationale

The paper conducts an empirical study by applying standard persistent homology to activation point clouds sampled from LLM layers under two adversarial attack types across six models. No derivation chain exists that reduces predictions or topological signatures to fitted inputs, self-definitions, or self-citation chains; the observed compression pattern is presented as a data-driven finding rather than a mathematical necessity derived from prior author work. The methodology relies on external, falsifiable computations on real activations without renaming known results or smuggling ansatzes via citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the applicability of persistent homology to neural activation point clouds and the assumption that observed topological changes are causally linked to adversarial inputs rather than other factors. No new entities are postulated.

axioms (1)
  • Domain assumption: Activation vectors from LLM layers form point clouds whose persistent homology features meaningfully reflect representational geometry relevant to adversarial influence.
    Core assumption enabling the application of PH to this domain.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Tag meanings

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.