Proposal and study of statistical features for string similarity computation and classification

A. Conci; D. Casanova; E. Clua; E.O. Rodrigues; F. Favarim; M. TEIXEIRA; Panos Liatsis; V. Pegorini

arxiv: 2605.15110 · v1 · pith:BUE2MKVXnew · submitted 2026-05-14 · 💻 cs.LG · cs.CL · cs.IT · math.IT

E.O. Rodrigues, D. Casanova, M. TEIXEIRA, V. Pegorini, F. Favarim, E. Clua, A. Conci, Panos Liatsis

Pith reviewed 2026-06-30 21:08 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.IT · math.IT

keywords string similarity · co-occurrence matrix · run-length matrix · plagiarism detection · statistical features · edit distance · classification · longest common subsequence

The pith

Adapted co-occurrence and run-length matrices outperform standard statistical measures for string similarity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts co-occurrence matrix and run-length matrix features from visual computing to compute similarity between arbitrary strings such as words, phrases, codes, and texts. These features rely only on statistical patterns and remain independent of language or grammar. Experiments on synthetic datasets show that the adapted features, particularly run-length matrices, deliver higher accuracy than longest common subsequence, maximal consecutive longest common subsequence, mutual information, and edit distances. On a real plagiarism dataset the run-length matrix version produced the strongest classification results. A reader would care because the approach supplies a uniform, language-agnostic statistical tool for any string comparison task.

Core claim

The central claim is that adaptations of the co-occurrence matrix (COM) and run-length matrix (RLM) originally developed for images can be applied directly to strings, and that these adapted features outperform the other tested statistical measures in similarity-based classification on both synthetic collections and a real text plagiarism dataset.

What carries the argument

The adapted co-occurrence matrix (COM) and run-length matrix (RLM) computed on character or token sequences within strings.
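As a rough illustration of what a co-occurrence matrix on a string could look like, the sketch below counts ordered symbol pairs at a fixed offset, by analogy with gray-level co-occurrence matrices in texture analysis. The paper's actual adaptation is not described on this page, so the function names, the `offset` parameter, and the L1 comparison between normalized matrices are all assumptions made for illustration.

```python
from collections import Counter


def cooccurrence_matrix(text, offset=1):
    """Count ordered symbol pairs (text[i], text[i + offset]).

    Hypothetical string analogue of an image co-occurrence matrix,
    stored sparsely as a Counter keyed by (first, second) symbols.
    """
    if offset < 1:
        raise ValueError("offset must be >= 1")
    return Counter(zip(text, text[offset:]))


def normalized(counts):
    """Turn raw pair counts into relative frequencies."""
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()} if total else {}


def com_distance(a, b, offset=1):
    """L1 distance between normalized co-occurrence matrices.

    For two non-empty matrices: 0 means identical pair statistics,
    2 means the strings share no symbol pair at this offset.
    """
    pa = normalized(cooccurrence_matrix(a, offset))
    pb = normalized(cooccurrence_matrix(b, offset))
    return sum(abs(pa.get(k, 0.0) - pb.get(k, 0.0)) for k in pa.keys() | pb.keys())
```

Because the counts are keyed by whatever symbols appear in the input, the same code runs unchanged on characters or on a list of tokens, which is the sense in which such a feature needs no language-specific resources.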

If this is right

The features apply to any language or grammatical structure because they use only frequency statistics.
In the synthetic experiments, the RLM and COM features were significantly better than the second-best, distance-based group in three of four cases (p < 0.001).
The RLM version produced the highest accuracy on the real plagiarism dataset.
The same purely statistical pipeline can be used for classification of any string type without language-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same matrix construction might be tested on non-text strings such as program code or numerical sequences to check whether the performance advantage persists.
Combining the RLM or COM vectors with standard machine-learning classifiers could be explored as a direct next step for string classification pipelines.
If the statistical patterns generalize, the method could serve as a lightweight baseline for similarity tasks where edit-distance computation becomes expensive.
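To make the classifier extension above concrete, here is a minimal sketch of a k-nearest-neighbor vote over precomputed feature vectors. The choice of kNN, the Euclidean metric, and the function name `knn_predict` are assumptions for illustration; this page does not say which classifiers the paper used or how its feature vectors are laid out.

```python
from collections import Counter


def knn_predict(train_vectors, train_labels, query, k=3):
    """Majority vote among the k training vectors closest to `query`.

    Vectors are plain sequences of floats, e.g. COM- or RLM-derived
    statistics computed elsewhere (a hypothetical input format).
    """
    if not train_vectors or len(train_vectors) != len(train_labels):
        raise ValueError("need one label per training vector")

    def squared_distance(i):
        return sum((x - y) ** 2 for x, y in zip(train_vectors[i], query))

    # Indices of the training items, nearest first.
    order = sorted(range(len(train_vectors)), key=squared_distance)
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

A pipeline of this shape keeps feature extraction and classification separate, so the same vectors could be handed to any other classifier without changing the string-processing step.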

Load-bearing premise

The matrix adaptations retain enough statistical information to measure meaningful similarity for classification across arbitrary strings.

What would settle it

A new collection of string datasets on which the COM and RLM features fail to show statistically significant superiority over distance-based or subsequence-based alternatives would falsify the performance claim.
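One way to check for such superiority on paired results is an exact sign-flip permutation test, sketched below. The statistical test the paper actually used is not stated on this page, so this is a swapped-in technique chosen only because it makes few distributional assumptions; the function name and the per-dataset score lists are hypothetical.

```python
from itertools import product


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired scores.

    Under the null hypothesis that methods A and B are interchangeable,
    each paired difference is equally likely to have either sign, so all
    2**n sign assignments are enumerated and compared with the observed sum.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired")
    if len(scores_a) > 20:
        raise ValueError("exact enumeration is only practical for n <= 20")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total
```

With n paired results the smallest attainable two-sided p-value is 2 / 2**n, so a threshold such as p < 0.001 requires at least 11 pairs under this particular test.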

Original abstract

Adaptations of features commonly applied in the field of visual computing, co-occurrence matrix (COM) and run-length matrix (RLM), are proposed for the similarity computation of strings in general (words, phrases, codes and texts). The proposed features are not sensitive to language related information. These are purely statistical and can be used in any context with any language or grammatical structure. Other statistical measures that are commonly employed in the field such as longest common subsequence, maximal consecutive longest common subsequence, mutual information and edit distances are evaluated and compared. In the first synthetic set of experiments, the COM and RLM features outperform the remaining state-of-the-art statistical features. In 3 out of 4 cases, the RLM and COM features were statistically more significant than the second best group based on distances (P-value < 0.001). When it comes to a real text plagiarism dataset, the RLM features obtained the best results.
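For readers unfamiliar with the baselines named in the abstract, the sketch below gives textbook dynamic-programming versions of two of them: longest common subsequence length and Levenshtein edit distance. These are standard formulations, not necessarily the exact variants or normalizations evaluated in the paper, and the function names are chosen here for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of a and b."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)  # extend a common subsequence
            else:
                curr.append(max(prev[j], curr[j - 1]))  # skip one symbol
        prev = curr
    return prev[-1]


def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute or match
            ))
        prev = curr
    return prev[-1]
```

Both routines take time proportional to the product of the two string lengths, whereas count-based features such as co-occurrence or run-length statistics can be gathered in a single pass over each string.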

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adapts COM and RLM from images to strings and reports wins over edit-distance baselines on synthetic data plus one plagiarism set, but the language-independence claim is only as strong as the chosen test distributions.

The main takeaway is that the authors take co-occurrence and run-length matrices, which are standard in texture analysis, and repurpose them as string features for similarity. On four synthetic test cases the new features beat the distance-based group in three with p<0.001, and on the real plagiarism corpus the run-length version ranks first.

What the paper does cleanly is run a direct empirical comparison against longest common subsequence, maximal consecutive LCS, mutual information and edit distances, and it supplies p-values rather than just raw scores. That makes the performance numbers easier to interpret. The synthetic construction also lets them control string properties, which is useful for isolating what the features actually capture.

The softer part is the assertion that the features are purely statistical and therefore work for any language or grammatical structure. The evidence is limited to the synthetic strings they generated and one plagiarism collection; nothing in the abstract or stress-test note shows tests on non-Latin scripts, code, or morphologically rich languages. If the full paper adds those checks or shows that the matrix definitions themselves force language independence, the claim strengthens. Otherwise it rests on the data distributions rather than a property of the method. Reproducibility would also benefit from explicit pseudocode or parameters for how the matrices are built from character sequences.
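To show the kind of pseudocode the letter asks for, here is a minimal sketch of a run-length matrix built from a character sequence, together with two summary statistics borrowed from texture analysis (short-run and long-run emphasis). Whether the paper uses this construction or these statistics is not stated on this page; the function names and the sparse `(symbol, run_length)` representation are assumptions.

```python
from collections import Counter
from itertools import groupby


def run_length_matrix(text):
    """Map (symbol, run_length) to the number of maximal runs of that kind.

    For example, "aabaa" has two runs of "a" with length 2 and one run
    of "b" with length 1.
    """
    return Counter((sym, sum(1 for _ in grp)) for sym, grp in groupby(text))


def short_run_emphasis(rlm):
    """Average of 1 / length**2 over all runs; larger when runs are short."""
    total = sum(rlm.values())
    if not total:
        return 0.0
    return sum(n / length ** 2 for (_, length), n in rlm.items()) / total


def long_run_emphasis(rlm):
    """Average of length**2 over all runs; larger when runs are long."""
    total = sum(rlm.values())
    if not total:
        return 0.0
    return sum(n * length ** 2 for (_, length), n in rlm.items()) / total
```

The only parameter-like choice in this sketch is what counts as a symbol (character, byte, or token), which is exactly the kind of detail a reproducible description would need to pin down.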

This is useful for groups already doing string classification or plagiarism detection who want another set of features to try. It is incremental rather than foundational, so I would not cite it in the next year unless I needed exactly these numbers. Still, the work is coherent enough on its own terms that a serious editor should send it to referees rather than desk-reject.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes adaptations of co-occurrence matrix (COM) and run-length matrix (RLM) features from visual computing for string similarity computation and classification tasks on arbitrary strings (words, phrases, codes, texts). It claims these adaptations are purely statistical and insensitive to language or grammatical structure. The paper compares them empirically to baselines including longest common subsequence, maximal consecutive LCS, mutual information, and edit distances. On four synthetic datasets, COM and RLM outperform the others, with RLM/COM statistically superior to the second-best (distance-based) group in 3/4 cases at p<0.001; on one real plagiarism corpus, RLM ranks first.

Significance. If the adaptations preserve sufficient information and the reported superiority is reproducible, the work could supply new language-agnostic statistical features for string similarity in ML pipelines. The empirical head-to-head design with p-value reporting is a positive element, but the absence of methodological detail prevents assessment of whether the central performance and independence claims are supported.

major comments (2)

[Abstract] Abstract: the central performance claim (COM/RLM outperform baselines with p<0.001 in 3/4 synthetic cases and RLM best on the plagiarism set) cannot be evaluated because the abstract provides no description of how COM and RLM are adapted to strings, how the synthetic datasets were constructed, or the exact statistical testing procedure (e.g., which test, multiple-comparison correction, sample sizes).
[Abstract] Abstract: the claim that the features "are not sensitive to language related information" and "can be used in any context with any language or grammatical structure" is load-bearing for the paper's positioning, yet the reported evidence is confined to the chosen synthetic distributions plus one plagiarism corpus; no results on non-Latin scripts, code, or morphologically rich languages are described, so the independence assertion reduces to an untested generalization rather than a demonstrated property of the feature definitions.

minor comments (1)

[Abstract] Abstract: the phrase "the first synthetic set of experiments" is ambiguous about how many total synthetic configurations were run and whether the four cases are exhaustive.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the abstract to incorporate additional methodological details and to qualify the language-independence claim in line with the feature definitions and experimental scope.

Referee: [Abstract] Abstract: the central performance claim (COM/RLM outperform baselines with p<0.001 in 3/4 synthetic cases and RLM best on the plagiarism set) cannot be evaluated because the abstract provides no description of how COM and RLM are adapted to strings, how the synthetic datasets were constructed, or the exact statistical testing procedure (e.g., which test, multiple-comparison correction, sample sizes).

Authors: We agree that the abstract would benefit from brief summaries of these elements. In the revised manuscript we will expand the abstract to include concise descriptions of the COM and RLM adaptations to strings, the construction of the four synthetic datasets, and the statistical testing procedure (including the test employed and sample sizes). The full methodological details remain in the body of the paper.

revision: yes

Referee: [Abstract] Abstract: the claim that the features "are not sensitive to language related information" and "can be used in any context with any language or grammatical structure" is load-bearing for the paper's positioning, yet the reported evidence is confined to the chosen synthetic distributions plus one plagiarism corpus; no results on non-Latin scripts, code, or morphologically rich languages are described, so the independence assertion reduces to an untested generalization rather than a demonstrated property of the feature definitions.

Authors: The COM and RLM adaptations are defined exclusively via statistical operations on sequences (co-occurrence counts and run lengths) with no incorporation of grammatical rules, lexical resources, or language-specific priors; this construction renders them insensitive to language-related information by design. We will revise the abstract and introduction to state this definitional property explicitly and to note that the reported experiments demonstrate effectiveness on the synthetic and plagiarism datasets examined, while acknowledging that additional validation on non-Latin scripts or other languages would further support broader applicability.

revision: partial

Circularity Check

0 steps flagged

No circularity: purely empirical feature comparison with no derivation chain

The paper proposes adaptations of COM and RLM features for string similarity, asserts they are purely statistical and language-independent, and supports this via head-to-head experiments on synthetic sets and one plagiarism corpus. No equations, parameter fits, or self-citations are invoked to derive the superiority claims; the reported outperformance (e.g., RLM best on plagiarism data) is presented as a direct experimental outcome rather than a quantity forced by construction from the inputs. The central claims rest on external data distributions and statistical tests, not on any reduction to the feature definitions themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that visual-computing matrices can be directly repurposed for strings while remaining language-independent; no free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: statistical features extracted via co-occurrence and run-length matrices from strings are independent of language-related information and grammatical structure.
Explicitly stated in the abstract as the basis for general applicability.

Reference graph

Works this paper leans on

4 extracted references · 2 canonical work pages

[1]

Altman, N.S. (1992) ‘An introduction to kernel and nearest-neighbor nonparametric regression’, The American Statistician, Vol. 46, No. 3, pp.174–189. Bookstein, A., Kulyukin, V.A. and Raita, T. (2002) ‘Generalized hamming distance’, Information Retrieval, October, Vol. 5, pp.353–375 [online] https://link.springer.com/article/10.1023/A:1020499411651. Br...

work page doi:10.1023/a:1010933404324 1992

[2]

Han, L., Finin, T. and McNamee, P. (2013) ‘Improving word similarity by augmenting PMI with estimates of word polysemy’, IEEE Transactions on Knowledge and Data Engineering, Vol. 25, pp.1307–1322. Hirschberg, D.S. (1977) ‘Algorithms for the longest common subsequence problem’, Journal of the ACM, Vol. 24, No. 4, pp.664–675. Hulten, G., Spencer, L. and ...

2013

[3]

‘Mining time-changing data streams’, Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.96–106. Islam, A. and Inkpen, D. (2008) ‘Semantic text similarity using corpus-based word similarity and string similarity’, ACM Transactions on Knowledge Discovery from Data (TKDD), Vol. 2, No

2008

[4]

John, G. and Langley, P. (1995) ‘Estimating continuous distributions in Bayesian classifiers’, Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pp.338–345. Kohavi, R. (2005) ‘The power of decision tables’, Proceedings of the European Conference on Machine Learning, pp.174–189. Landwehr, N., Hall, M. and Frank, E. (2005...

work page doi:10.1007/s10994-005-0466-3 1995
