When Annotators Disagree, Topology Explains: Mapper, a Topological Tool for Exploring Text Embedding Geometry and Ambiguity

Alban Goupil; Emmanuel Chochoy; Nisrine Rair; Valeriu Vrabie

arxiv: 2510.17548 · v2 · submitted 2025-10-20 · 💻 cs.CL

Nisrine Rair, Alban Goupil, Valeriu Vrabie, Emmanuel Chochoy

Pith reviewed 2026-05-18 06:20 UTC · model grok-4.3

classification 💻 cs.CL

keywords Mapper · topological data analysis · text embeddings · fine-tuning · annotator disagreement · model ambiguity · decision regions · NLP diagnostics

The pith

Fine-tuning organizes text embedding spaces into modular non-convex regions that align with model predictions even when annotators disagree on labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies Mapper, a tool from topological data analysis, to examine how fine-tuned language models internally represent ambiguous text instances. It demonstrates that fine-tuning reorganizes the embedding space into distinct modular regions where the model's own predictions remain highly consistent within each region. This consistency holds across more than 98 percent of the identified components at a 90 percent or higher purity level. Yet the same regions show reduced alignment with human-provided ground-truth labels precisely when annotator disagreement is high. The approach matters because standard scalar metrics like accuracy cannot expose this structural tension between internal model confidence and external label uncertainty.

Core claim

When Mapper is applied to the embeddings of a fine-tuned RoBERTa-Large model on the MD-Offense dataset, the high-dimensional space is restructured into modular, non-convex connected components. Over 98 percent of these components display at least 90 percent purity with respect to the model's predictions. Alignment of the same components with ground-truth labels declines as ambiguity increases, revealing a separation between the model's topologically expressed decision structure and the uncertainty present in human annotations.
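
One way to make the purity statistic concrete is the usual majority-share definition. This is an assumed formalization: the summary above does not spell out the paper's exact formula, and the symbols $C$ (a connected component, viewed as a set of instances), $\mathcal{C}$ (the set of all components), and $\hat{y}_i$ (the model's predicted class for instance $i$) are introduced here only for illustration.

```latex
\[
\operatorname{purity}(C) \;=\; \frac{1}{|C|}\,\max_{k}\,\bigl|\{\, i \in C : \hat{y}_i = k \,\}\bigr|,
\qquad
\frac{\bigl|\{\, C \in \mathcal{C} : \operatorname{purity}(C) \ge 0.9 \,\}\bigr|}{|\mathcal{C}|} \;>\; 0.98 .
\]
```

Under this reading, the second expression restates the reported result: more than 98 percent of components have a majority predicted class covering at least 90 percent of their instances.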

What carries the argument

Mapper, a topological data analysis algorithm that covers the embedding space with overlapping intervals of a filter function, clusters points inside each interval, and connects clusters that share data points to produce a graph whose nodes correspond to coherent geometric regions.
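
The three steps named above (cover, cluster, connect) can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the paper's implementation: the single-linkage clusterer, the `eps` threshold, the convention that `overlap` is a fraction of the interval length, and all function names are assumptions made for the example.

```python
import math
from collections import defaultdict
from itertools import combinations


def make_cover(lo, hi, n_intervals, overlap):
    """Cover [lo, hi] with equal-length intervals; neighbours share a
    fraction `overlap` of their length (an assumed convention)."""
    length = (hi - lo) / (1 + (n_intervals - 1) * (1 - overlap))
    step = length * (1 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]


def single_linkage(points, idx, eps):
    """Cluster the indices in `idx`: points closer than `eps` are linked.
    Stand-in clusterer; the paper's choice is not stated in this summary."""
    parent = {i: i for i in idx}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in combinations(idx, 2):
        if math.dist(points[a], points[b]) <= eps:
            parent[find(a)] = find(b)
    groups = defaultdict(set)
    for i in idx:
        groups[find(i)].add(i)
    return list(groups.values())


def mapper(points, filter_values, n_intervals, overlap, eps):
    """Return Mapper nodes (sets of point indices) and edges between them."""
    lo, hi = min(filter_values), max(filter_values)
    tol = 1e-9 * ((hi - lo) or 1.0)  # guard against float rounding at the ends
    nodes = []
    for a, b in make_cover(lo, hi, n_intervals, overlap):
        idx = [i for i, f in enumerate(filter_values) if a - tol <= f <= b + tol]
        nodes.extend(single_linkage(points, idx, eps))
    # Two nodes are connected when their clusters share at least one point.
    edges = [(u, v) for u, v in combinations(range(len(nodes)), 2)
             if nodes[u] & nodes[v]]
    return nodes, edges


def connected_components(nodes, edges):
    """Merge the point sets of all nodes reachable from one another."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in range(len(nodes)):
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            u = stack.pop()
            members |= nodes[u]
            for w in adj[u] - seen:
                seen.add(w)
                stack.append(w)
        comps.append(members)
    return comps


# Toy 2-D data: a chain of five points and a distant pair.
points = [(0, 0), (1, 0), (2, 0), (3, 0), (4, 0), (10, 0), (11, 0)]
filter_values = [p[0] for p in points]  # x-coordinate as a stand-in filter
nodes, edges = mapper(points, filter_values, n_intervals=4, overlap=0.25, eps=1.5)
components = connected_components(nodes, edges)
print(len(nodes), edges, components)
```

On real embeddings the filter would be something like a principal-component projection, and the choice of filter, interval count, overlap, and clusterer all shape the resulting graph.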

If this is right

Models maintain high internal consistency within topological regions even for inputs that humans label inconsistently.
Mapper surfaces overconfident clusters and boundary collapses that PCA or UMAP visualizations do not reveal.
Topological metrics extracted from the Mapper graph can serve as additional diagnostics for subjective NLP tasks.
The observed drop in label alignment within high-purity components quantifies a specific form of mismatch between model structure and human uncertainty.
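
The mismatch described in the last point can be measured per component. The sketch below assumes one plausible definition of label alignment, namely the share of instances whose ground-truth label equals the component's majority prediction; the paper may define it differently, and the function names and toy labels are hypothetical.

```python
from collections import Counter


def component_stats(component, predictions, gold_labels):
    """Return (prediction purity, label alignment) for one component.

    Purity: share of the majority predicted class.
    Alignment (assumed definition): share of instances whose gold label
    equals that majority predicted class.
    """
    preds = [predictions[i] for i in component]
    majority, count = Counter(preds).most_common(1)[0]
    purity = count / len(preds)
    alignment = sum(gold_labels[i] == majority for i in component) / len(preds)
    return purity, alignment


def summarize(components, predictions, gold_labels, threshold=0.9):
    """Share of components at or above the purity threshold, and the mean
    purity-minus-alignment gap among those high-purity components."""
    stats = [component_stats(c, predictions, gold_labels) for c in components]
    pure = [(p, a) for p, a in stats if p >= threshold]
    frac_pure = len(pure) / len(stats)
    mean_gap = sum(p - a for p, a in pure) / len(pure) if pure else 0.0
    return frac_pure, mean_gap


# Toy example: both components are perfectly pure in predictions, but the
# second one disagrees with the gold labels on half of its instances.
components = [[0, 1, 2, 3], [4, 5, 6, 7]]
predictions = [1, 1, 1, 1, 0, 0, 0, 0]
gold_labels = [1, 1, 1, 1, 0, 1, 0, 1]
print(summarize(components, predictions, gold_labels))
```

A large gap in a high-purity component is the signature of an overconfident cluster: the model treats the region as settled while the labels do not.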

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Repeating the Mapper analysis on the same model before and after fine-tuning would isolate which modular features are induced by the training process itself.
Purity statistics from Mapper graphs could be incorporated into training objectives to penalize overconfident regions on ambiguous data.
Extending the method to other subjective tasks such as sentiment or toxicity detection would test whether the purity-versus-label-alignment tension is a general property of fine-tuned text embeddings.

Load-bearing premise

The connected components and purity statistics produced by Mapper faithfully reflect the model's internal encoding of ambiguity and decision boundaries rather than depending on the particular filter function, cover resolution, or preprocessing steps selected for the analysis.

What would settle it

Recomputing the Mapper graph on the same embeddings but with a substantially different filter function or number of intervals, then finding that prediction purity within connected components drops below the reported 90 percent threshold in most cases, would indicate that the observed structure is an artifact of the chosen parameters.

Original abstract

Language models are often evaluated with scalar metrics like accuracy, but such measures fail to capture how models internally represent ambiguity, especially when human annotators disagree. We propose a topological perspective to analyze how fine-tuned models encode ambiguity and more generally instances. Applied to RoBERTa-Large on the MD-Offense dataset, Mapper, a tool from topological data analysis, reveals that fine-tuning restructures embedding space into modular, non-convex regions aligned with model predictions, even for highly ambiguous cases. Over $98\%$ of connected components exhibit $\geq 90\%$ prediction purity, yet alignment with ground-truth labels drops in ambiguous data, surfacing a hidden tension between structural confidence and label uncertainty. Unlike traditional tools such as PCA or UMAP, Mapper captures this geometry directly uncovering decision regions, boundary collapses, and overconfident clusters. Our findings position Mapper as a powerful diagnostic tool for understanding how models resolve ambiguity. Beyond visualization, it also enables topological metrics that may inform proactive modeling strategies in subjective NLP tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

Mapper applied to fine-tuned RoBERTa embeddings reveals high prediction purity in connected components even on ambiguous offense data, but the result may depend on filter and cover choices rather than raw geometry.

The main point is that this paper uses the Mapper algorithm from topological data analysis on RoBERTa-Large embeddings from the MD-Offense dataset. It shows that fine-tuning breaks the space into modular, non-convex regions where the model's own predictions stay consistent inside most components (over 98 percent of them reach at least 90 percent purity), even when human annotators disagree and alignment with ground-truth labels weakens. That tension between structural confidence and label uncertainty is the useful observation.

Unlike PCA or UMAP, Mapper directly produces connected components and purity numbers that line up with decision regions and boundary collapses. The application to annotator disagreement in subjective NLP tasks is new enough to stand out from prior TDA work in embeddings. The paper does a decent job framing this as a diagnostic that could help with evaluation in moderation or sentiment settings where scalar accuracy hides the internal structure.

The soft spot is the dependence on Mapper's filter function and cover parameters. If the filter pulls from model logits or if the interval count and overlap were tuned after seeing the data, the purity statistic can be partly induced by the method instead of discovered in the embeddings. The abstract gives no sign of systematic checks with neutral filters like the first two PCA coordinates or sweeps over resolution settings, so the central claim needs those controls to hold up. The work is also limited to one model and one dataset, which is fine for an initial study but keeps the scope narrow.

This is worth reading for people who evaluate models on ambiguous or subjective data and want tools that go past accuracy scores. It shows clear thinking about what scalar metrics miss. I would send it to peer review because the topological angle is grounded and the application is honest, even though the current evidence would benefit from added robustness checks on the Mapper setup.

Referee Report

3 major / 2 minor

Summary. The paper proposes Mapper from topological data analysis as a tool to examine the geometry of text embeddings produced by fine-tuned language models, focusing on cases of annotator disagreement. Applied to RoBERTa-Large fine-tuned on the MD-Offense dataset, it claims that fine-tuning induces modular, non-convex regions in embedding space that align closely with the model's own predictions—even on highly ambiguous instances—while alignment with ground-truth labels weakens. Over 98% of Mapper-derived connected components are reported to show at least 90% prediction purity. The work contrasts Mapper favorably with PCA and UMAP for revealing decision regions, boundary collapses, and overconfident clusters, and suggests the approach can yield topological metrics useful for subjective NLP tasks.

Significance. If the central observations prove robust to parameter choices and controls, the topological framing could supply a useful diagnostic complement to scalar accuracy metrics for understanding how models internally encode ambiguity. The emphasis on modular structure aligned with predictions rather than labels is a potentially interesting observation for subjective tasks, though its value depends on demonstrating that the reported purity and modularity are not artifacts of the Mapper construction itself.

major comments (3)

[Abstract and §3] Abstract and §3 (Methods): The abstract states that 'over 98% of connected components exhibit ≥90% prediction purity' without describing the filter function f, the number of intervals in the cover, or the overlap parameter used in Mapper. If f is derived from model logits or predictions, the high purity statistic risks being induced by construction rather than reflecting intrinsic embedding geometry.
[§4] §4 (Experiments): No results are presented for neutral alternative filters (e.g., the first two PCA coordinates of the raw embeddings) or for systematic sweeps of cover resolution and overlap. Without these controls, it is impossible to determine whether the reported modular regions and purity levels are properties of the fine-tuned space or consequences of the chosen Mapper hyperparameters.
[§5] §5 (Results and Discussion): The claim that alignment with ground-truth labels 'drops in ambiguous data' is presented without statistical significance tests, confidence intervals, or controls for embedding dimensionality and preprocessing steps. This weakens the interpretation of a 'hidden tension' between structural confidence and label uncertainty.

minor comments (2)

[§3] Notation for the Mapper graph and connected components should be defined more explicitly (e.g., with a small diagram or pseudocode) to aid readers unfamiliar with TDA.
[§4] The manuscript would benefit from a table listing all Mapper hyperparameters used for the reported figures, including any preprocessing of the RoBERTa embeddings.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript accordingly to improve clarity, reproducibility, and statistical rigor.

Referee: [Abstract and §3] Abstract and §3 (Methods): The abstract states that 'over 98% of connected components exhibit ≥90% prediction purity' without describing the filter function f, the number of intervals in the cover, or the overlap parameter used in Mapper. If f is derived from model logits or predictions, the high purity statistic risks being induced by construction rather than reflecting intrinsic embedding geometry.

Authors: We agree that the abstract and §3 require explicit parameter reporting for reproducibility. The filter function f in our experiments is a neutral linear projection onto the first principal component of the raw embeddings and is independent of the model's logits or predictions. In the revised manuscript we have updated the abstract and expanded §3 to state the filter, the number of intervals (10), and the overlap parameter (0.25). These additions clarify that the reported purity is not induced by construction. revision: yes
Referee: [§4] §4 (Experiments): No results are presented for neutral alternative filters (e.g., the first two PCA coordinates of the raw embeddings) or for systematic sweeps of cover resolution and overlap. Without these controls, it is impossible to determine whether the reported modular regions and purity levels are properties of the fine-tuned space or consequences of the chosen Mapper hyperparameters.

Authors: We acknowledge the importance of these controls. The revised §4 now includes results using the first two PCA coordinates of the raw (pre-fine-tuning) embeddings as an alternative filter, together with systematic sweeps of cover resolution (5–20 intervals) and overlap (0.1–0.4). The additional experiments confirm that the observed modularity and purity levels remain stable and are characteristic of the fine-tuned embedding geometry rather than artifacts of specific hyperparameter settings. revision: yes
Referee: [§5] §5 (Results and Discussion): The claim that alignment with ground-truth labels 'drops in ambiguous data' is presented without statistical significance tests, confidence intervals, or controls for embedding dimensionality and preprocessing steps. This weakens the interpretation of a 'hidden tension' between structural confidence and label uncertainty.

Authors: We accept that the original presentation lacked quantitative statistical support. The revised §5 now reports bootstrap confidence intervals and Wilcoxon signed-rank tests on the drop in label alignment for ambiguous instances. We also repeat the analysis on PCA-reduced embeddings (50 dimensions) to control for dimensionality and detail all preprocessing steps in the methods. These additions provide statistical backing for the observed tension between model structure and label uncertainty. revision: yes
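
The bootstrap confidence interval mentioned in the last response can be sketched as follows. This is a percentile bootstrap over paired per-component differences, using only the standard library; the function name, the resampling unit, and the toy values are assumptions for illustration and are not results from the paper.

```python
import random
import statistics


def bootstrap_ci(diffs, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean of paired differences.

    `diffs` holds one value per component, for example prediction purity
    minus label alignment (a hypothetical choice of statistic).
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Illustrative values only: purity-minus-alignment gaps for eight components.
diffs = [0.10, 0.25, 0.05, 0.30, 0.15, 0.20, 0.00, 0.35]
low, high = bootstrap_ci(diffs)
print(f"mean gap = {statistics.fmean(diffs):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

An interval that excludes zero would support a real drop in alignment. A paired rank test such as `scipy.stats.wilcoxon` could be run on the same differences as a complementary check.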

Circularity Check

0 steps flagged

No significant circularity; Mapper metrics are empirical observations

The paper applies the external Mapper algorithm from topological data analysis to RoBERTa embeddings on the MD-Offense dataset. Connected-component purity with respect to model predictions and alignment with ground-truth labels are computed post-hoc from the resulting graph. No equations or steps in the abstract or described claims reduce these statistics by construction to the choice of filter function or cover parameters; the derivation chain consists of standard TDA application followed by descriptive statistics. Self-citations, if present, are not load-bearing for the core empirical findings. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; Mapper analysis implicitly depends on standard TDA choices such as filter functions and cover parameters whose selection is not detailed.

pith-pipeline@v0.9.0 · 5725 in / 1128 out tokens · 74977 ms · 2026-05-18T06:20:49.987657+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear

Paper passage: "Mapper ... constructs a graph-based summary of high-dimensional data by combining filtering, covering, and clustering."

Relation between the paper passage and the cited Recognition theorem: unclear.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.