Explainable Semantic Textual Similarity via Dissimilar Span Detection
Pith reviewed 2026-05-15 07:16 UTC · model grok-4.3
The pith
Detecting dissimilar spans between text pairs explains semantic similarity scores and boosts paraphrase detection performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Dissimilar Span Detection identifies the semantically differing spans between a pair of texts to make semantic textual similarity interpretable; the released Span Similarity Dataset supports training and evaluation of such detectors; and incorporating the detected spans improves accuracy on the downstream task of paraphrase detection.
What carries the argument
Dissimilar Span Detection (DSD), a task that locates the specific token spans whose semantic content differs between two input texts.
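The reference excerpts quoted later on this page describe the dataset's annotation scheme: altered spans are enclosed in {{ }} markers, span pairs appear in the same order in both sentences, and each pair is labeled 1 (equivalent) or 0 (dissimilar). The sketch below is a minimal reading of that format; the sentence pair is invented for illustration, since the paper's own example pair is truncated in the excerpts.

```python
import re

# {{ ... }} delimits an annotated span, per the paper's annotation guidelines.
SPAN_PATTERN = re.compile(r"\{\{(.*?)\}\}")

def extract_spans(annotated: str) -> list[str]:
    """Return the {{ }}-marked spans of an annotated sentence, in order."""
    return SPAN_PATTERN.findall(annotated)

def dissimilar_span_pairs(sent1: str, sent2: str, labels: list[int]) -> list[tuple[str, str]]:
    """Pair marked spans positionally (the guidelines state span pairs appear
    in the same order in both sentences) and keep only pairs labeled 0,
    i.e. semantically dissimilar."""
    spans1, spans2 = extract_spans(sent1), extract_spans(sent2)
    assert len(spans1) == len(spans2) == len(labels), "span/label mismatch"
    return [(a, b) for a, b, lab in zip(spans1, spans2, labels) if lab == 0]

# Invented example; only the first sentence echoes the paper's prompt excerpt.
s1 = "I'm not a {{bad talker}} either."
s2 = "I'm not a {{great cook}} either."
print(dissimilar_span_pairs(s1, s2, labels=[0]))  # [('bad talker', 'great cook')]
```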
If this is right
- Semantic textual similarity systems can output not only a score but also the concrete spans responsible for lowering similarity.
- The Span Similarity Dataset enables supervised training of span detectors that outperform unsupervised baselines such as LIME and SHAP.
- Paraphrase detection models that receive dissimilar-span signals achieve higher accuracy than models that see only the original sentence pair.
- Large language models currently give the strongest zero-shot performance on dissimilar span detection among the tested approaches.
Where Pith is reading between the lines
- The same span-level signal could be applied to other pairwise NLP tasks such as textual entailment or duplicate question detection.
- Fully automated dataset construction without the human verification step would allow scaling the approach to much larger corpora.
- Integrating dissimilar-span detection directly into the training objective of similarity models might produce more robust embeddings.
Load-bearing premise
The semi-automated pipeline that combines large language models with human verification produces accurate and consistent labels for dissimilar spans in the new dataset.
What would settle it
A large-scale human re-annotation of the Span Similarity Dataset that yields low agreement with the released labels, or a replication experiment showing no accuracy gain on paraphrase detection when dissimilar-span features are added.
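The paper does not fix an agreement metric for such a re-annotation. One concrete operationalization, assumed here rather than taken from the paper, is token-level F1 between the released dissimilar-token set and a re-annotator's set:

```python
# Token-level F1 between two annotators' dissimilar-span markings of the
# same sentence. The choice of token-level F1 (rather than, say, exact
# span match or Cohen's kappa) is an assumption, not the paper's metric.

def token_f1(gold: set[int], pred: set[int]) -> float:
    """F1 over token indices marked as dissimilar by each annotator."""
    if not gold and not pred:
        return 1.0  # both annotators marked nothing: perfect agreement
    overlap = len(gold & pred)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)

# Annotator A marked tokens 3-5, annotator B marked tokens 4-6.
print(round(token_f1({3, 4, 5}, {4, 5, 6}), 3))  # 0.667
```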
Original abstract
Semantic Textual Similarity (STS) is a crucial component of many Natural Language Processing (NLP) applications. However, existing approaches typically reduce semantic nuances to a single score, limiting interpretability. To address this, we introduce the task of Dissimilar Span Detection (DSD), which aims to identify semantically differing spans between pairs of texts. This can help users understand which particular words or tokens negatively affect the similarity score, or be used to improve performance in STS-dependent downstream tasks. Furthermore, we release a new dataset suitable for the task, the Span Similarity Dataset (SSD), developed through a semi-automated pipeline combining large language models (LLMs) with human verification. We propose and evaluate different baseline methods for DSD, both unsupervised, based on LIME, SHAP, LLMs, and our own method, as well as an additional supervised approach. While LLMs and supervised models achieve the highest performance, overall results remain low, highlighting the complexity of the task. Finally, we set up an additional experiment that shows how DSD can lead to increased performance in the specific task of paraphrase detection.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Dissimilar Span Detection (DSD) task to identify semantically differing spans between text pairs, thereby improving interpretability of Semantic Textual Similarity (STS) scores beyond a single aggregate value. It releases the Span Similarity Dataset (SSD) constructed via a semi-automated pipeline that combines LLM generation with human verification, evaluates multiple baselines (LIME, SHAP, LLMs, and a supervised model) showing generally low performance, and reports an auxiliary experiment in which DSD integration yields improved paraphrase detection accuracy.
Significance. If the SSD labels prove reliable, the work offers a concrete mechanism for span-level explanations in STS and a practical route to performance gains on paraphrase detection. The dataset release itself is a useful resource for the community, and the low baseline results appropriately highlight the task's difficulty.
major comments (2)
- [Dataset Construction] Dataset construction section: no inter-annotator agreement statistics, no error-rate analysis on the human verification step, and no examination of possible LLM-induced biases in span boundary selection are reported. Because the paraphrase-detection improvement claim rests directly on the trustworthiness of the dissimilar-span labels, this omission is load-bearing for the central experimental result.
- [Paraphrase Detection Experiment] Paraphrase detection experiment: the precise mechanism by which DSD outputs are incorporated (feature augmentation, span masking, or filtering) is not specified with sufficient detail to allow reproduction or to rule out confounding factors in the reported accuracy lift.
minor comments (2)
- [Abstract] Abstract: the statement that 'overall results remain low' is not accompanied by concrete metrics; adding the primary F1 or accuracy figures would improve immediate readability.
- [Task Definition] Notation: the distinction between 'dissimilar span' and 'similar span' labels in the SSD description is introduced without an explicit formal definition or example pair before the evaluation tables.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments below and have revised the manuscript accordingly to improve clarity and rigor.
Point-by-point responses
- Referee: Dataset construction section: no inter-annotator agreement statistics, no error-rate analysis on the human verification step, and no examination of possible LLM-induced biases in span boundary selection are reported. Because the paraphrase-detection improvement claim rests directly on the trustworthiness of the dissimilar-span labels, this omission is load-bearing for the central experimental result.
  Authors: We agree that these details are important for establishing the reliability of the SSD labels. In the revised manuscript, we will add inter-annotator agreement statistics calculated on a double-annotated subset of the data. We will also include an error-rate analysis based on a post-verification review of a random sample of 200 instances. For potential LLM-induced biases, we will provide a comparative analysis of span boundaries chosen by the LLM versus those adjusted by human annotators, including statistics on boundary shifts. These additions will directly support the trustworthiness of the labels used in the paraphrase detection experiment.
  Revision: yes.
- Referee: Paraphrase detection experiment: the precise mechanism by which DSD outputs are incorporated (feature augmentation, span masking, or filtering) is not specified with sufficient detail to allow reproduction or to rule out confounding factors in the reported accuracy lift.
  Authors: We appreciate this point and will clarify the integration method in the revised version. Specifically, the DSD outputs are incorporated through feature augmentation: the detected dissimilar spans are encoded as binary features indicating the presence of differing spans and appended to the input representation of the paraphrase detection model. We will include a detailed description, along with pseudocode, to ensure reproducibility and to allow readers to assess potential confounding factors.
  Revision: yes.
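Based on the rebuttal's description (binary span features appended to the input representation), a minimal sketch of what that augmentation could look like; the encoder, dimensions, and two-feature layout are illustrative assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn

class SpanAugmentedParaphraseClassifier(nn.Module):
    """Sketch: concatenate DSD-derived binary features to a sentence-pair
    encoding before the paraphrase classification head."""

    def __init__(self, encoder: nn.Module, hidden_dim: int, span_feat_dim: int = 2):
        super().__init__()
        self.encoder = encoder  # any sentence-pair encoder -> (batch, hidden_dim)
        self.head = nn.Linear(hidden_dim + span_feat_dim, 2)

    def forward(self, pair_inputs, span_feats):
        # span_feats: e.g. [has_dissimilar_span, num_dissimilar_spans]
        # per example, produced by a DSD detector (layout assumed here).
        pooled = self.encoder(pair_inputs)             # (batch, hidden_dim)
        augmented = torch.cat([pooled, span_feats], dim=-1)
        return self.head(augmented)                    # paraphrase logits
```

In this reading, the paraphrase model never sees the span text itself, only coarse indicators of its presence, which would make the reported accuracy lift easier to attribute to the DSD signal rather than to extra lexical input.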
Circularity Check
No significant circularity; claims rest on independent experimental outcomes
full rationale
The paper introduces the DSD task, constructs the SSD dataset via an LLM+human pipeline, evaluates multiple baselines (unsupervised and supervised), and reports an experimental improvement on paraphrase detection. No equations, fitted parameters, or predictions are defined such that any result reduces to its own inputs by construction. No self-citation chains or uniqueness theorems are invoked as load-bearing premises. The central performance claim is presented as an empirical outcome on an external task, so it is grounded in independent benchmarks rather than in the paper's own constructions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Semantic Textual Similarity is a crucial component of many NLP applications
invented entities (2)
- Dissimilar Span Detection (DSD) task: no independent evidence
- Span Similarity Dataset (SSD): no independent evidence
Reference graph
Works this paper leans on
- [1] Introduction: Semantic Textual Similarity (STS) is a fundamental concept in Natural Language Processing (NLP), being present in a myriad of tasks. For example, STS is the cornerstone of paraphrase identification (Zhou et al., 2022), it is widely used for text classification and clustering tasks (Minaee et al., 2021), it lays the foundation for popular ev...
- [2] Related Work: The task of annotating tokens or spans of text to provide explanations has already been worked on, albeit not extensively, in the context of Natural Language Inference (NLI). A popular dataset in this regard is e-SNLI (Camburu et al., 2018), which extends the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) by highlig...
- [3] Span Similarity Dataset (SSD): In this section, we discuss the motivation behind building a new dataset for the task and describe how it was constructed. We also conduct a dataset analysis, reporting several statistics about it. Segments are what we refer to as spans. 3.1. Dataset Motivation: We initially considered NLI-related datasets like e-SNLI. Howeve...
- [4] Taking the first sentence and altering one or more spans of words, giving result to the second sentence. The modified spans could either be equivalent in meaning to the original one, or be semantically dissimilar.
- [5] Enclosing each of the altered spans between span annotation markers. In our case, {{ denotes the beginning of a span, and }} its end.
- [6] Annotating each of the span pairs with either a 1, if they are equivalent in meaning, or 0 otherwise.
- [7] Annotating whether the entire two sentences are semantically equivalent (1) or not (0). The annotation was performed in a semi-automatic way through the use of an LLM via a manually engineered prompt, significantly reducing time and effort by allowing the model to replace spans and assign labels. Nevertheless, since the model was unable to consistently...
- [8] Experimental Setup: Next, we present and explain the different strategies we considered to tackle the problem of DSD. We also introduce the evaluation schema adopted to assess their performance. 4.1. Methods: We propose a total of 5 methods plus 2 baselines. Some of these methods have the advantage of working with smaller models and requiring no fine-tun...
- [9] Results and Discussion: In the case of the SSD, in order to account for false positives (i.e., spans labeled as dissimilar when they are not), we report separate metrics for those sentence-pairs that contain no dissimilar spans (NoDiff), and those that do (Diff). This distinction aims to determine how often evaluated methods treat semantically equivalent...
- [10] Conclusion: In this work, we present the task of Dissimilar Span Detection (DSD): a method to improve the interpretability and reliability of STS scores. The task consists in, given two texts, identifying span pairs with a common semantic function, but conveying different meanings. DSD can complement current STS metrics, which typically report a single ...
- [11] Limitations: The main limitations of our work stem from aspects regarding the annotation of the SSD, namely: It currently contains data solely in English. In future work, a multilingual setting could be considered. Still, all the methods presented here would be applicable with no or minor modifications. We worked exclusively at the sentence level. ...
- [12] Bibliographical References: Eneko Agirre, Carmen Banea, Claire Cardie, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Weiwei Guo, Iñigo Lopez-Gazpio, Montse Maritxalar, Rada Mihalcea, German Rigau, Larraitz Uria, and Janyce Wiebe. 2015. SemEval-2015 task 2: Semantic textual similarity, English, Spanish and pilot on interpretability. In Proceedings of the...
- [13] Explaining Text Matching on Neural Natural Language Inference. ACM Trans. Inf. Syst., 38(4). Jan-Christoph Klie, Bonnie Webber, and Iryna Gurevych. 2023. Annotation Error Detection: Analyzing the Past and Present for a More Coherent Future. Computational Linguistics, 49(1):157–198. Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Kar...
- [14] RoBERTa: A Robustly Optimized BERT Pretraining Approach. ArXiv, abs/1907.11692. I. Lopez-Gazpio, M. Maritxalar, A. Gonzalez-Agirre, G. Rigau, L. Uria, and E. Agirre. 2017. Interpretable semantic textual similarity: Finding and explaining differences between sentences. Knowledge-Based Systems, 119:186–199. Scott M Lundberg and Su-In Lee. 2017. A Unified A...
- [15] DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. Interpreting BERT-based Text Similarity via Activation and Saliency Maps. In Proceedings of the ACM Web Conference 2022, WWW '22, page 3259–3268, New York, NY, USA. Association for Computing Machinery. Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2021. Deep Learning–based Text Classification: A...
- [16] Language Resource References: Agirre, Eneko and Gonzalez-Agirre, Aitor and Lopez-Gazpio, Iñigo and Maritxalar, Montse and Rigau, German and Uria, Larraitz. 2016. SemEval-2016 Task 2: Interpretable Semantic Textual Similarity. Association for Computational Linguistics. Miriam Anschütz, Diego Miguel Lozano, and Georg Groh. 2023. This is not correct! Negation...
- [17] Identify the differing spans and enclose them within double curly braces, i.e., {{ to signal the beginning of a span, and }} to signal its end. Try to annotate the spans in such a way that the spans have enough context on their own (e.g., in example 2, "food" and "portions" are included in spans). (a) Example: Sentence 1: I'm not a bad talker either. Senten...
- [18] Add a space to the end of the line. Then write a 1 if the annotated span is equivalent, or a 0 if it is dissimilar. In case there are several spans, separate the numbers with a comma (leave no space in between the numbers), e.g., 0,1. Span pairs will always appear in the same order in both sentences. (a) Example: Original Line: I'm not a bad talker eithe...