pith. machine review for the scientific record.

arxiv: 2603.21174 · v1 · submitted 2026-03-22 · 💻 cs.CL

Recognition: no theorem link

Explainable Semantic Textual Similarity via Dissimilar Span Detection

Authors on Pith no claims yet

Pith reviewed 2026-05-15 07:16 UTC · model grok-4.3

classification 💻 cs.CL
keywords dissimilar span detection · semantic textual similarity · explainable NLP · paraphrase detection · Span Similarity Dataset · large language models · interpretability

The pith

Detecting dissimilar spans between text pairs explains semantic similarity scores and boosts paraphrase detection performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Dissimilar Span Detection to move beyond single-number semantic similarity scores by identifying which specific parts of two texts differ in meaning. A new Span Similarity Dataset is created through an LLM-assisted process with human checks, and several baseline methods are tested on it. Results show that large language models and supervised approaches work best, though overall accuracy stays modest, and an additional experiment demonstrates that using dissimilar-span information raises performance on paraphrase detection.

Core claim

Dissimilar Span Detection identifies the semantically differing spans between a pair of texts to make semantic textual similarity interpretable; the released Span Similarity Dataset supports training and evaluation of such detectors; and incorporating the detected spans improves accuracy on the downstream task of paraphrase detection.

What carries the argument

Dissimilar Span Detection (DSD), a task that locates the specific token spans whose semantic content differs between two input texts.
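
For concreteness, a single DSD instance can be pictured as an aligned-span annotation over a sentence pair. The {{ }} span markers and the 0/1 equivalence labels below follow the annotation scheme quoted in the reference excerpts; the field names and the example pair are illustrative, not the released SSD schema.

    # Minimal illustrative representation of one DSD instance. {{ }} marks aligned
    # spans; each span pair is labeled 1 (equivalent in meaning) or 0 (dissimilar),
    # and the sentence pair as a whole is labeled for equivalence. Field names are
    # assumptions, not the paper's released schema.
    example = {
        "sentence1": "The committee approved the {{annual budget}} on {{Monday}}.",
        "sentence2": "The committee approved the {{yearly budget}} on {{Friday}}.",
        "span_labels": [1, 0],   # "annual budget" ~ "yearly budget"; "Monday" != "Friday"
        "pair_equivalent": 0,    # a dissimilar span makes the pair non-equivalent
    }

    def dissimilar_span_indices(instance):
        """Indices of span pairs labeled as semantically dissimilar."""
        return [i for i, lab in enumerate(instance["span_labels"]) if lab == 0]

    print(dissimilar_span_indices(example))  # -> [1]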

If this is right

  • Semantic textual similarity systems can output not only a score but also the concrete spans responsible for lowering similarity.
  • The Span Similarity Dataset enables supervised training of span detectors that outperform unsupervised baselines such as LIME and SHAP (one such detector is sketched after this list).
  • Paraphrase detection models that receive dissimilar-span signals achieve higher accuracy than models that see only the original sentence pair.
  • Large language models currently give the strongest zero-shot performance on dissimilar span detection among the tested approaches.
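
One plausible realization of the supervised detector mentioned above is to cast DSD as binary token classification over the concatenated pair. The paper does not describe its supervised architecture, so the encoder choice and labeling scheme in this sketch are assumptions.

    # Sketch: DSD as binary token classification over a sentence pair
    # (0 = token in a matching region, 1 = token inside a dissimilar span).
    # Encoder choice and labeling scheme are assumptions.
    import torch
    from transformers import AutoTokenizer, AutoModelForTokenClassification

    MODEL = "distilbert-base-uncased"  # illustrative; any encoder would do
    tokenizer = AutoTokenizer.from_pretrained(MODEL)
    model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)

    def predict_dissimilar_tokens(sent1, sent2):
        enc = tokenizer(sent1, sent2, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**enc).logits              # (1, seq_len, 2)
        preds = logits.argmax(dim=-1)[0].tolist()     # per-token labels
        tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0].tolist())
        return [t for t, p in zip(tokens, preds) if p == 1]

    # Untrained weights give arbitrary output; fine-tuning on SSD would be required.
    print(predict_dissimilar_tokens("He left on Monday.", "He left on Friday."))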

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same span-level signal could be applied to other pairwise NLP tasks such as textual entailment or duplicate question detection.
  • Fully automated dataset construction without the human verification step would allow scaling the approach to much larger corpora.
  • Integrating dissimilar-span detection directly into the training objective of similarity models might produce more robust embeddings.

Load-bearing premise

The semi-automated pipeline that combines large language models with human verification produces accurate and consistent labels for dissimilar spans in the new dataset.
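
A minimal sketch of how such a pipeline could be wired, assuming the {{ }} marker format and 0/1 span labels described in the annotation excerpts; the rule for routing instances to human verification is an assumption, since the paper only reports that the LLM could not annotate consistently.

    # Parse an LLM's {{ }}-marked output and flag inconsistent instances for a
    # human annotator. The routing rule is an assumption.
    import re

    SPAN = re.compile(r"\{\{(.*?)\}\}")

    def parse_llm_annotation(sent1_marked, sent2_marked, labels):
        spans1, spans2 = SPAN.findall(sent1_marked), SPAN.findall(sent2_marked)
        consistent = len(spans1) == len(spans2) == len(labels)
        return {"span_pairs": list(zip(spans1, spans2)),
                "labels": labels,
                "consistent": consistent}

    def needs_human_verification(parsed):
        # Send mismatched or empty annotations to the human verification step.
        return not parsed["consistent"] or not parsed["span_pairs"]

    record = parse_llm_annotation(
        "I'm not a bad {{talker}} either.",
        "I'm not a bad {{speaker}} either.",
        labels=[1],
    )
    print(needs_human_verification(record))  # -> False: spans and labels line up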

What would settle it

A large-scale human re-annotation of the Span Similarity Dataset that yields low agreement with the released labels, or a replication experiment showing no accuracy gain on paraphrase detection when dissimilar-span features are added.
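
As a back-of-the-envelope illustration of the first test, agreement between the released span-pair labels and a human re-annotation can be computed directly; the label vectors below are invented purely to show the computation.

    # Raw agreement between released SSD span-pair labels (1 = equivalent,
    # 0 = dissimilar) and a hypothetical re-annotation. Labels are invented.
    def label_agreement(released, reannotated):
        assert len(released) == len(reannotated)
        return sum(a == b for a, b in zip(released, reannotated)) / len(released)

    released_labels    = [0, 1, 0, 0, 1, 1, 0, 1]
    reannotated_labels = [0, 1, 1, 0, 1, 0, 0, 1]
    print(f"agreement: {label_agreement(released_labels, reannotated_labels):.2f}")  # 0.75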

Figures

Figures reproduced from arXiv: 2603.21174 by Alexander Fraser, Daryna Dementieva, Diego Miguel Lozano.

Figure 1
Figure 1: Examples of the Dissimilar Span Detection task, i.e., given a pair of texts, identify which spans differ semantically. Here, we show two pairs that contain dissimilar spans and one pair that does not. Cosine similarities are calculated with the Sentence Transformer model all-MiniLM-L6-v2. Note that pairs containing dissimilar spans might still yield a high semantic textual similarity, sometimes even higher… view at source ↗
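
The cosine similarities quoted in the Figure 1 caption come from the named Sentence Transformer; the sketch below reproduces that computation with a placeholder pair rather than the figure's actual examples.

    # Cosine similarity with the model named in the Figure 1 caption.
    # The sentence pair is a placeholder, not taken from the figure.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")
    s1 = "The meeting was moved to Monday morning."
    s2 = "The meeting was moved to Friday morning."
    emb1, emb2 = model.encode([s1, s2], convert_to_tensor=True)
    print(f"cosine similarity: {util.cos_sim(emb1, emb2).item():.3f}")
    # A pair with a clearly dissimilar span ("Monday" vs "Friday") can still score high.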
Figure 2
Figure 2: Example of trigram replacements. The considered trigrams proceeding from the first sentence (highlighted) are inserted into the second sentence, replacing the original trigrams. We then calculate the similarity between each of the replacements and the first, original sentence, and identify which one of the replacements… view at source ↗
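
A rough sketch of the trigram-replacement probe that the Figure 2 caption describes: trigrams from the first sentence replace trigrams of the second, and each replacement is re-scored against the first sentence. The truncated caption does not show which replacement is ultimately selected, so the selection rule here (keep the one that most raises similarity) is an assumption.

    # Trigram-replacement probe sketched from the Figure 2 caption; the exact
    # pairing and selection rule used in the paper are assumptions.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def best_trigram_replacement(sent1, sent2):
        w1, w2 = sent1.split(), sent2.split()
        emb1 = model.encode(sent1, convert_to_tensor=True)
        source_trigrams = [w1[i:i + 3] for i in range(len(w1) - 2)]
        results = []
        for j in range(len(w2) - 2):                    # position replaced in sentence 2
            for tri in source_trigrams:                 # trigram taken from sentence 1
                candidate = " ".join(w2[:j] + tri + w2[j + 3:])
                emb_c = model.encode(candidate, convert_to_tensor=True)
                results.append((j, " ".join(tri), util.cos_sim(emb1, emb_c).item()))
        # The replacement that most raises similarity points at the dissimilar region.
        return max(results, key=lambda r: r[2])

    print(best_trigram_replacement("He flew to Paris on Monday morning.",
                                   "He flew to Rome on Monday morning."))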
read the original abstract

Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Dissimilar Span Detection (DSD) task to identify semantically differing spans between text pairs, thereby improving interpretability of Semantic Textual Similarity (STS) scores beyond a single aggregate value. It releases the Span Similarity Dataset (SSD) constructed via a semi-automated pipeline that combines LLM generation with human verification, evaluates multiple baselines (LIME, SHAP, LLMs, and a supervised model) showing generally low performance, and reports an auxiliary experiment in which DSD integration yields improved paraphrase detection accuracy.

Significance. If the SSD labels prove reliable, the work offers a concrete mechanism for span-level explanations in STS and a practical route to performance gains on paraphrase detection. The dataset release itself is a useful resource for the community, and the low baseline results appropriately highlight the task's difficulty.

major comments (2)
  1. [Dataset Construction] Dataset construction section: no inter-annotator agreement statistics, no error-rate analysis on the human verification step, and no examination of possible LLM-induced biases in span boundary selection are reported. Because the paraphrase-detection improvement claim rests directly on the trustworthiness of the dissimilar-span labels, this omission is load-bearing for the central experimental result.
  2. [Paraphrase Detection Experiment] Paraphrase detection experiment: the precise mechanism by which DSD outputs are incorporated (feature augmentation, span masking, or filtering) is not specified with sufficient detail to allow reproduction or to rule out confounding factors in the reported accuracy lift.
minor comments (2)
  1. [Abstract] Abstract: the statement that 'overall results remain low' is not accompanied by concrete metrics; adding the primary F1 or accuracy figures would improve immediate readability.
  2. [Task Definition] Notation: the distinction between 'dissimilar span' and 'similar span' labels in the SSD description is introduced without an explicit formal definition or example pair before the evaluation tables.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments below and have revised the manuscript accordingly to improve clarity and rigor.

read point-by-point responses
  1. Referee: Dataset construction section: no inter-annotator agreement statistics, no error-rate analysis on the human verification step, and no examination of possible LLM-induced biases in span boundary selection are reported. Because the paraphrase-detection improvement claim rests directly on the trustworthiness of the dissimilar-span labels, this omission is load-bearing for the central experimental result.

    Authors: We agree that these details are important for establishing the reliability of the SSD labels. In the revised manuscript, we will add inter-annotator agreement statistics calculated on a double-annotated subset of the data. We will also include an error-rate analysis based on a post-verification review of a random sample of 200 instances. For potential LLM-induced biases, we will provide a comparative analysis of span boundaries chosen by the LLM versus those adjusted by human annotators, including statistics on boundary shifts. These additions will directly support the trustworthiness of the labels used in the paraphrase detection experiment. revision: yes

  2. Referee: Paraphrase detection experiment: the precise mechanism by which DSD outputs are incorporated (feature augmentation, span masking, or filtering) is not specified with sufficient detail to allow reproduction or to rule out confounding factors in the reported accuracy lift.

    Authors: We appreciate this point and will clarify the integration method in the revised version. Specifically, the DSD outputs are incorporated through feature augmentation: the detected dissimilar spans are encoded as binary features indicating the presence of differing spans and appended to the input representation of the paraphrase detection model. We will include a detailed description, along with pseudocode, to ensure reproducibility and to allow readers to assess potential confounding factors. revision: yes
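
To make the described integration concrete, here is a minimal sketch of feature augmentation as the rebuttal characterizes it: sentence-pair features extended with binary indicators derived from detected dissimilar spans. The specific indicators, encoder, and classifier are assumptions; the paper's configuration is not given.

    # Feature augmentation for paraphrase detection: pair-embedding features plus
    # binary indicators from detected dissimilar spans. Indicators, encoder, and
    # classifier are assumptions.
    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.linear_model import LogisticRegression

    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def pair_features(sent1, sent2, dissimilar_spans):
        e1, e2 = encoder.encode([sent1, sent2])
        span_flags = np.array([
            1.0 if dissimilar_spans else 0.0,   # any dissimilar span detected
            float(len(dissimilar_spans)),        # how many
        ])
        return np.concatenate([e1, e2, np.abs(e1 - e2), span_flags])

    # Toy data: (sentence 1, sentence 2, detected dissimilar span pairs, is_paraphrase)
    data = [
        ("He left on Monday.", "He departed on Monday.", [], 1),
        ("He left on Monday.", "He left on Friday.", [("Monday", "Friday")], 0),
        ("The food was great.", "The meal was great.", [], 1),
        ("The food was great.", "The portions were tiny.", [("food was great", "portions were tiny")], 0),
    ]
    X = np.stack([pair_features(s1, s2, spans) for s1, s2, spans, _ in data])
    y = [label for *_, label in data]
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict(X))  # fits the toy data; a real run would use SSD-derived spans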

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent experimental outcomes

full rationale

The paper introduces the DSD task, constructs the SSD dataset via an LLM+human pipeline, evaluates multiple baselines (unsupervised and supervised), and reports an experimental improvement on paraphrase detection. No equations, fitted parameters, or predictions are defined such that any result reduces to its own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central performance claim is presented as an empirical outcome on an external task, grounding it in independent benchmarks rather than in the paper's own constructions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that dissimilar spans can be reliably identified and that doing so improves downstream tasks; the new task and dataset are the main invented elements.

axioms (1)
  • domain assumption Semantic Textual Similarity is a crucial component of many NLP applications
    Stated directly in the abstract as motivation for the work.
invented entities (2)
  • Dissimilar Span Detection (DSD) task no independent evidence
    purpose: Identify semantically differing spans between text pairs to explain STS scores
    Newly defined in this paper as the core contribution.
  • Span Similarity Dataset (SSD) no independent evidence
    purpose: Provide labeled data for training and evaluating DSD models
    Created through the paper's semi-automated pipeline.

pith-pipeline@v0.9.0 · 5492 in / 1122 out tokens · 46411 ms · 2026-05-15T07:16:42.546092+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 3 internal anchors

  1. [1]

    Introduction Semantic Textual Similarity (STS) is a fundamental concept in Natural Language Processing (NLP), being present in a myriad of tasks. For example, STS is the cornerstone of paraphrase identification (Zhou et al., 2022), it is widely used for text classification and clustering tasks (Minaee et al., 2021), it lays the foundation for popular ev...

  2. [2]

    the alignment between pairs of segments 1 across the two sentences, where the relation between the segments is labeled with a relation type and a similarity score

    Related Work The task of annotating tokens or spans of text to provide explanations has already been worked on, albeit not extensively, in the context of Natural Language Inference (NLI). A popular dataset in this regard is e-SNLI (Camburu et al., 2018), which extends the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) by highlig...

  3. [3]

    a woman” and “a man

    Span Similarity Dataset (SSD) In this section, we discuss the motivation behind building a new dataset for the task and describe how it was constructed. We also conduct a dataset analysis, reporting several statistics about it. 1 Segments are what we refer to as spans. 3.1. Dataset Motivation We initially considered NLI-related datasets like e-SNLI. Howeve...

  4. [4]

    The modified spans could either be equivalent in meaning to the original one, or be semantically dissimilar

    Taking the first sentence and altering one or more spans of words, giving result to the second sentence. The modified spans could either be equivalent in meaning to the original one, or be semantically dissimilar

  5. [5]

    In our case, {{ denotes the beginning of a span, and }} its end

    Enclosing each of the altered spans between span annotation markers. In our case, {{ denotes the beginning of a span, and }} its end

  6. [6]

    Annotating each of the span pairs with either a 1, if they are equivalent in meaning, or 0 otherwise

  7. [7]

    Annotating whether the entire two sentences are semantically equivalent (1) or not (0). The annotation was performed in a semi-automatic way through the use of an LLM 3 via a manually engineered prompt, significantly reducing time and effort by allowing the model to replace spans and assign labels. Nevertheless, since the model was unable to consistently...

  8. [8]

    explainer

    Experimental Setup Next, we present and explain the different strategies we considered to tackle the problem of DSD. We also introduce the evaluation schema adopted to assess their performance. 4.1. Methods We propose a total of 5 methods plus 2 baselines. Some of these methods have the advantage of working with smaller models and requiring no fine-tun...

  9. [9]

    This distinction aims to determine how often evaluated methods treat semantically equivalent spans with lexical differences as dissimilar

    Results and Discussion In the case of the SSD, in order to account for false positives (i.e., spans labeled as dissimilar when they are not), we report separate metrics for those sentence-pairs that contain no dissimilar spans (NoDiff), and those that do (Diff). This distinction aims to determine how often evaluated methods treat semantically equivalent...

  10. [10]

    The task consists in, given two texts, identifying spans pairs with a common semantic function, but conveying different meanings

    Conclusion In this work, we present the task of Dissimilar Span Detection (DSD): a method to improve the interpretability and reliability of STS scores. The task consists in, given two texts, identifying spans pairs with a common semantic function, but conveying different meanings. DSD can complement current STS metrics, which typically report a single ...

  11. [11]

    In future work, a multilingual setting could be considered

    Limitations The main limitations of our work stem from aspects regarding the annotation of the SSD, namely: • It currently contains data solely in English. In future work, a multilingual setting could be considered. Still, all the methods presented here would be applicable with no or minor modifications. • We worked exclusively at the sentence level. ...

  12. [12]

    Bibliographical References Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the...

  13. [13]

    Explaining Text Matching on Neural Natural Language Inference. ACM Trans. Inf. Syst., 38(4). Jan-Christoph Klie, Bonnie Webber, and Iryna Gurevych. 2023. Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. Computational Linguistics, 49(1):157–198. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Kar...

  14. [14]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692. I. Lopez-Gazpio, M. Maritxalar, A. Gonzalez-Agirre, G. Rigau, L. Uria, and E. Agirre. 2017. Interpretable semantic textual similarity: Finding and explaining differences between sentences. Knowledge-Based Systems, 119:186–199. Scott M Lundberg and Su-In Lee. 2017. A Unified A...

  15. [15]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Interpreting BERT-based Text Similarity via Activation and Saliency Maps. In Proceedings of the ACM Web Conference 2022, WWW ’22, page 3259–3268, New York, NY, USA. Association for Computing Machinery. Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep Learning–based Text Classification: A...

  16. [16]

    Dec. 4” and “Jan. 24

    Language Resource References Agirre, Eneko and Gonzalez-Agirre, Aitor and Lopez-Gazpio, Iñigo and Maritxalar, Montse and Rigau, German and Uria, Larraitz. 2016. SemEval-2016 Task 2: Interpretable Semantic Textual Similarity. Association for Computational Linguistics. Miriam Anschütz, Diego Miguel Lozano, and Georg Groh. 2023. This is not correct! Negation...

  17. [17]

    food” and “portions

    Identify the differing spans and enclose them within double curly braces, i.e., {{ to signal the beginning of a span, and }} to signal its end. Try to annotate the spans in such way that the spans have enough context on its own (e.g., in example 2, “food” and “portions” are included in spans). (a) Example: • Sentence 1: I’m not a bad talker either. • Senten...

  18. [18]

    Then write a 1 if the annotated span is equivalent, or a 0 if it is dissimilar

    Add a space to the end of the line. Then write a 1 if the annotated span is equivalent, or a 0 if it is dissimilar. In case there are several spans, separate the numbers with a comma (leave no space in between the numbers), e.g., 0,1. Span pairs will always appear in the same order in both sentences. (a) Example: • Original Line: I’m not a bad talker eithe...