Optimal Transport-based Alignment of Learned Character Representations for String Similarity
Pith reviewed 2026-05-24 17:14 UTC · model grok-4.3
The pith
STANCE encodes characters, aligns them via optimal transport, and scores alignments with a CNN to compute string similarity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STANCE encodes the characters of each string, aligns the encodings using Sinkhorn Iteration (alignment is posed as an instance of optimal transport) and scores the alignment with a convolutional neural network. On five newly constructed alias detection datasets, STANCE or one of its variants outperforms both state-of-the-art and classic, parameter-free similarity models on four of the five datasets. Applying the model to an instance of cross-document coreference yields a 2.8 point improvement in B^3 F1 over the previous state-of-the-art approach.
What carries the argument
Sinkhorn Iteration that solves the optimal transport problem to align learned character encodings before CNN scoring.
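A minimal sketch of the alignment step, assuming uniform mass on every character and a toy 0/1 mismatch cost in place of the learned character encodings. The names `sinkhorn`, `char_cost`, `reg`, and `n_iters` are illustrative and do not come from the paper; the CNN scorer is omitted.

```python
import math


def sinkhorn(cost, reg=0.5, n_iters=200):
    """Entropy-regularized optimal transport between two uniform marginals.

    cost[i][j] is the cost of matching source item i to target item j.
    Returns a transport plan whose rows sum to 1/n and columns to 1/m.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # uniform source mass (an assumption of this sketch)
    b = [1.0 / m] * m  # uniform target mass (an assumption of this sketch)
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


def char_cost(s, t):
    # Toy cost: 0 for identical characters, 1 otherwise.
    # A learned model would derive this from character encodings instead.
    return [[0.0 if cs == ct else 1.0 for ct in t] for cs in s]


plan = sinkhorn(char_cost("abc", "abd"))
for row in plan:
    print(["%.3f" % p for p in row])
```

Every operation in the loop is differentiable, which is what allows an alignment of this kind to sit inside a trained model. Smaller `reg` values push the plan toward a hard one-to-one alignment but need more iterations to converge.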
If this is right
- STANCE or its variants outperform both state-of-the-art learned models and classic parameter-free measures on four of five alias detection datasets.
- Inserting STANCE into a cross-document coreference pipeline raises B^3 F1 by 2.8 points over the prior state of the art.
- The five alias detection datasets are released publicly to support further research on string similarity.
- The architecture supplies a fully differentiable alternative to non-learned string metrics for downstream entity resolution tasks.
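For readers unfamiliar with the coreference metric cited in the list above, the following is a minimal sketch of B³ F1 as it is usually defined: per-mention precision and recall, averaged over mentions. The function name `b_cubed_f1` and the mention and cluster labels are made up for illustration and are not taken from the paper's evaluation code.

```python
def b_cubed_f1(predicted, gold):
    """Compute B-cubed F1 for two clusterings of the same mentions.

    predicted, gold: dicts mapping each mention id to a cluster id.
    """
    mentions = list(gold)
    precision = recall = 0.0
    for m in mentions:
        pred_cluster = {x for x in mentions if predicted[x] == predicted[m]}
        gold_cluster = {x for x in mentions if gold[x] == gold[m]}
        overlap = len(pred_cluster & gold_cluster)  # always >= 1 (m itself)
        precision += overlap / len(pred_cluster)
        recall += overlap / len(gold_cluster)
    precision /= len(mentions)
    recall /= len(mentions)
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy clustering: mention "b" is placed in the wrong cluster.
gold = {"a": 1, "b": 1, "c": 2}
predicted = {"a": 1, "b": 2, "c": 2}
print(round(b_cubed_f1(predicted, gold), 4))  # 0.6667
```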
Where Pith is reading between the lines
- Because the alignment step is differentiable, the same optimal-transport mechanism could be inserted into larger end-to-end models that jointly learn string matching and task-specific objectives.
- The character-level optimal transport alignment may transfer to other variable-length sequence comparison problems such as biological sequence matching without requiring new hand-crafted features.
- If the learned encodings capture systematic spelling regularities, replacing the final CNN scorer with a more global architecture could further improve handling of long-distance character correspondences.
Load-bearing premise
The five new alias detection datasets capture the range of string variations that appear in actual record linkage and entity resolution applications.
What would settle it
An evaluation on a held-out real-world alias dataset in which STANCE fails to outperform the strongest baseline or in which the 2.8 point coreference gain disappears once the string similarity component is swapped for a fixed baseline while all other modeling choices remain unchanged.
Original abstract
String similarity models are vital for record linkage, entity resolution, and search. In this work, we present STANCE, a learned model for computing the similarity of two strings. Our approach encodes the characters of each string, aligns the encodings using Sinkhorn Iteration (alignment is posed as an instance of optimal transport) and scores the alignment with a convolutional neural network. We evaluate STANCE's ability to detect whether two strings can refer to the same entity, a task we term alias detection. We construct five new alias detection datasets (and make them publicly available). We show that STANCE or one of its variants outperforms both state-of-the-art and classic, parameter-free similarity models on four of the five datasets. We also demonstrate STANCE's ability to improve downstream tasks by applying it to an instance of cross-document coreference and show that it leads to a 2.8 point improvement in B^3 F1 over the previous state-of-the-art approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces STANCE, a learned string similarity model that encodes individual characters, aligns the resulting representations via Sinkhorn iteration (formulated as optimal transport), and scores the alignment with a convolutional neural network. It constructs and releases five new alias-detection datasets, reports that STANCE or its variants outperform both learned SOTA and classic parameter-free baselines on four of the five datasets, and shows a 2.8-point B³ F1 gain when the model is plugged into a cross-document coreference pipeline.
Significance. If the empirical results are robust, the work supplies a new inductive bias for string matching that combines learned character embeddings with differentiable optimal transport, together with five publicly released datasets that could serve as benchmarks for record linkage and entity resolution. The downstream coreference experiment provides a concrete use case. The public release of the datasets is a clear positive contribution.
major comments (3)
- [§4, Table 2] §4 (Experiments) and Table 2: the abstract and results claim outperformance on four of five datasets and a 2.8-point F1 lift, yet no statistical significance tests, standard errors, or confidence intervals are reported; without these, it is impossible to determine whether the observed margins exceed what would be expected from random variation.
- [§3.2] §3.2 (Dataset construction): the five alias-detection datasets are newly authored; the paper provides no description of pair-sampling procedure, negative-example generation strategy, class balance, string-length distribution, or domain coverage. This information is load-bearing for the central claim that the observed gains generalize beyond construction artifacts that may favor character-embedding + Sinkhorn + CNN biases.
- [§5] §5 (Coreference experiment): the 2.8-point B³ F1 improvement is attributed to the string-similarity module, but the paper does not present an ablation that isolates the contribution of STANCE from other pipeline choices (e.g., mention-pair scoring, clustering algorithm, or feature set).
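The kind of reporting requested in the first comment can be sketched as follows: a mean with standard error over per-seed scores, plus an exact paired sign-flip test on the per-seed differences. This is one possible choice of test, not the one the authors used or promised, and every score below is a made-up placeholder rather than a result from the paper.

```python
import itertools
import statistics


def mean_and_stderr(scores):
    """Return the sample mean and the standard error of the mean."""
    return statistics.mean(scores), statistics.stdev(scores) / len(scores) ** 0.5


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip test on paired differences.

    Enumerates all 2^n sign assignments, so it only suits small n
    (for example, a handful of random seeds).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    count = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            count += 1
    return count / total


# Placeholder scores for five seeds; NOT numbers from the paper.
model_scores = [0.81, 0.83, 0.82, 0.84, 0.80]
baseline_scores = [0.78, 0.80, 0.80, 0.79, 0.79]

mean, stderr = mean_and_stderr(model_scores)
print("model: %.3f +/- %.3f" % (mean, stderr))
print("p-value:", paired_sign_flip_pvalue(model_scores, baseline_scores))  # 0.0625
```

With five seeds, the smallest two-sided p-value this test can produce is 2/32 = 0.0625, which illustrates why the number of runs matters when margins are small.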
minor comments (2)
- [§2] Notation for the Sinkhorn scaling vectors and the CNN scoring function is introduced without a consolidated table of symbols; a short notation table would improve readability.
- [§4.1] The baseline implementations (Levenshtein, Jaro-Winkler, etc.) are described only at a high level; exact parameter settings and software versions should be stated to ensure reproducibility.
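As an illustration of why exact settings matter even for parameter-free baselines, here is a minimal sketch of Levenshtein distance with unit costs, together with one common way of turning it into a similarity score. The normalization by the longer string's length is an assumption of this sketch; the paper's baseline may normalize differently or not at all.

```python
def levenshtein(s, t):
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cs
                curr[j - 1] + 1,           # insert ct
                prev[j - 1] + (cs != ct),  # substitute, free if characters match
            ))
        prev = curr
    return prev[-1]


def levenshtein_similarity(s, t):
    # Assumed normalization: 1 - distance / length of the longer string.
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


print(levenshtein("kitten", "sitting"))  # 3
```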
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond to each major point below and will revise the manuscript to address the concerns about statistical reporting, dataset details, and ablations.
- Referee: [§4, Table 2] §4 (Experiments) and Table 2: the abstract and results claim outperformance on four of five datasets and a 2.8-point F1 lift, yet no statistical significance tests, standard errors, or confidence intervals are reported; without these, it is impossible to determine whether the observed margins exceed what would be expected from random variation.
Authors: We agree that including measures of statistical significance would strengthen the empirical claims. In the revised manuscript, we will add standard errors computed over multiple training runs with different random seeds for all learned models, as well as statistical significance tests comparing STANCE to the baselines. These will be reported in Section 4 and Table 2.
Revision: yes
- Referee: [§3.2] §3.2 (Dataset construction): the five alias-detection datasets are newly authored; the paper provides no description of pair-sampling procedure, negative-example generation strategy, class balance, string-length distribution, or domain coverage. This information is load-bearing for the central claim that the observed gains generalize beyond construction artifacts that may favor character-embedding + Sinkhorn + CNN biases.
Authors: The referee is correct that these details were not included in the original submission. We will revise Section 3.2 to provide a complete description of how the datasets were constructed, including the pair-sampling procedure, negative-example generation, class balance ratios, string-length statistics, and the domains from which the data were drawn. This additional information will help readers evaluate the generalizability of the results.
Revision: yes
- Referee: [§5] §5 (Coreference experiment): the 2.8-point B³ F1 improvement is attributed to the string-similarity module, but the paper does not present an ablation that isolates the contribution of STANCE from other pipeline choices (e.g., mention-pair scoring, clustering algorithm, or feature set).
Authors: While the improvement is shown relative to a prior system using a different similarity function, we agree that an ablation isolating the effect of the string similarity module is desirable. In the revised version, we will include an ablation experiment in Section 5 that keeps the rest of the coreference pipeline fixed and varies only the string similarity component to directly measure its contribution.
Revision: yes
Circularity Check
No circularity; empirical performance claims on held-out data
The paper defines a model (character embeddings + Sinkhorn alignment + CNN scoring) and reports empirical results on five newly constructed alias-detection datasets plus one downstream coreference task. No derivation chain exists that reduces a claimed prediction or first-principles result to its own fitted inputs by construction. No self-citation is invoked to justify a uniqueness theorem or ansatz that would make the central result tautological. The evaluation numbers are external to the model parameters and therefore constitute independent evidence rather than a renaming or self-definition.