arxiv: 1906.11761 · v1 · pith:4KJNFWO4new · submitted 2019-06-27 · 💻 cs.DL · cs.IR

Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

Pith reviewed 2026-05-25 13:39 UTC · model grok-4.3

classification 💻 cs.DL cs.IR
keywords academic plagiarism detection · mathematical content analysis · citation analysis · STEM documents · similarity measures · concealed plagiarism · two-stage detection

The pith

Combining math content and citation similarity with text analysis improves detection of concealed plagiarism in STEM documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper advances a two-stage detection process that first measures similarity in mathematical expressions and citation patterns, then integrates those signals with conventional text comparison to surface concealed cases such as strong paraphrases, translations, and idea reuse. It introduces new similarity measures for mathematical features that take feature order into account and evaluates the full approach against confirmed plagiarism instances before running it over 102,000 STEM documents. A sympathetic reader would care because current text-only tools reliably miss many forms of misconduct that dominate in fields where equations and references carry substantial intellectual content. The work shows that these non-text features supply usable additional signals for identifying suspicious documents.

Core claim

The authors establish that a two-stage process combining assessments of mathematical content similarity, academic citation similarity, and text similarity, using newly developed order-sensitive measures for mathematical features, outperforms text-only approaches in identifying confirmed cases of academic plagiarism and can flag suspicious documents within a collection of 102,000 STEM publications.

What carries the argument

The two-stage detection process integrating math-based, citation-based, and text-based similarity measures, with new measures that incorporate the order of mathematical features.
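
As a rough illustration of how such a two-stage process could be wired together, the Python sketch below screens candidates by math and citation similarity and then adds a text score for the documents that survive. Everything in it is an assumption made for illustration: the `Doc` fields, Jaccard overlap as the similarity function, the threshold value, and the ranking rule are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    """Hypothetical document record; the field names are illustrative."""
    doc_id: str
    math_features: tuple   # e.g. identifiers extracted from formulae
    citations: tuple       # e.g. keys of cited works
    words: tuple           # tokenized text


def jaccard(a, b):
    """Set overlap in [0, 1]; a stand-in for the paper's similarity measures."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)


def two_stage(query, collection, candidate_threshold=0.5):
    """Stage 1 screens by math/citation similarity; stage 2 adds the text score."""
    results = []
    for doc in collection:
        if doc.doc_id == query.doc_id:
            continue
        math_sim = jaccard(query.math_features, doc.math_features)
        cite_sim = jaccard(query.citations, doc.citations)
        # Stage 1: keep a document if either non-text signal is strong enough.
        if max(math_sim, cite_sim) < candidate_threshold:
            continue
        # Stage 2: detailed comparison, here reduced to one text similarity score.
        text_sim = jaccard(query.words, doc.words)
        results.append((doc.doc_id, math_sim, cite_sim, text_sim))
    # Rank by the strongest signal so heavily reworded text can still surface.
    results.sort(key=lambda r: max(r[1:]), reverse=True)
    return results


query = Doc("q", ("x", "y", "E", "m", "c"), ("ref1", "ref2", "ref3"), ("energy", "mass"))
paraphrase = Doc("p", ("E", "m", "c", "x"), ("ref1", "ref2", "ref3"), ("completely", "reworded"))
unrelated = Doc("u", ("a", "b"), ("ref9",), ("energy", "mass"))
hits = two_stage(query, [paraphrase, unrelated])
```

In this toy input the reworded document is retained because its formulae and citations overlap with the query even though its text score is zero, while the document that shares only words never reaches stage 2.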

If this is right

  • The new order-aware similarity measures for mathematical features outperform the measures from prior work.
  • Combined math and citation analysis identifies potentially suspicious cases inside a large collection of 102K STEM documents.
  • Math-based and citation-based features serve as a supplement to text-based detection for concealed plagiarism.
  • Direct comparison on confirmed cases shows measurable gains from the multi-feature approach.
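
To make "order-aware" concrete, the sketch below contrasts a set-based overlap score with a score based on the longest common subsequence (LCS) of two feature sequences. The choice of LCS, the normalization by the shorter sequence, and the function names are assumptions for illustration; this page does not spell out the paper's actual measures.

```python
def set_overlap(a, b):
    """Order-agnostic: shared distinct features over all distinct features."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def lcs_length(a, b):
    """Length of the longest common subsequence via dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def ordered_similarity(a, b):
    """Order-aware: LCS length normalized by the shorter sequence."""
    shorter = min(len(a), len(b))
    return lcs_length(a, b) / shorter if shorter else 0.0


source = ["E", "m", "c", "v", "t"]
same_order = ["E", "m", "c", "v", "t"]
shuffled = ["t", "v", "c", "m", "E"]

print(set_overlap(source, same_order), ordered_similarity(source, same_order))
print(set_overlap(source, shuffled), ordered_similarity(source, shuffled))
```

Both pairs share exactly the same features, so the set-based score is 1.0 in each case. Only the ordered score separates them: it stays at 1.0 for the identical sequence and drops to 0.2 for the reversed one.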

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Detection systems could incorporate domain-specific non-text features like equations as a standard layer for technical literature.
  • Similar ordered-feature analysis might be applied to diagrams, tables, or data sets to address additional reuse patterns.
  • Large-scale screening of submissions could become feasible if the method proves efficient on production collections.

Load-bearing premise

The confirmed cases of academic plagiarism used for evaluation are representative of concealed forms such as strong paraphrases, translations, and idea reuse.

What would settle it

A new test set of confirmed plagiarism cases in which the combined math-plus-citation approach flags no additional instances beyond those already caught by text analysis alone would falsify the improvement claim.
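
The check described above comes down to a set difference. A minimal sketch, assuming each approach outputs a set of flagged case identifiers (the identifiers and variable names here are made up for illustration):

```python
def additional_detections(text_flags, combined_flags, confirmed):
    """Confirmed cases caught by math-plus-citation analysis but missed by text."""
    return (set(combined_flags) & set(confirmed)) - set(text_flags)


# Hypothetical identifiers, not cases from the paper.
confirmed = {"case1", "case2", "case3", "case4"}
text_flags = {"case1", "case2"}
combined_flags = {"case2", "case3", "noise7"}

extra = additional_detections(text_flags, combined_flags, confirmed)
# An empty result on a fresh test set would count against the improvement claim.
improvement_claim_survives = len(extra) > 0
```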

Figures

Figures reproduced from arXiv: 1906.11761 by Bela Gipp, Michael Kramer, Moritz Schubotz, Norman Meuschke, Vincent Stange.

Figure 1: Overview of the hybrid plagiarism detection approach.
Figure 2: Similarity scores in 1M random document pairs.
Original abstract

Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement for conventional text-based detection approaches for academic literature in the STEM disciplines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript extends prior work on plagiarism detection in STEM documents by proposing a two-stage process that integrates similarity measures for mathematical content (including new order-aware features), academic citations, and text. It claims these new math measures outperform prior versions, that the combined math/citation/text approaches are effective when evaluated on confirmed plagiarism cases, and that applying the math+citation combination to a 102K-document STEM collection identifies suspicious cases. The central claim is that math and citation analysis provides a striking supplement to conventional text-based methods specifically for detecting concealed forms of plagiarism such as strong paraphrases, translations, and idea reuse.

Significance. If the evaluation holds, the work would meaningfully advance detection of non-textual and concealed plagiarism in STEM by exploiting domain-specific signals (ordered math expressions and citation patterns) that are harder to disguise than text. The scale of the 102K-document demonstration and the focus on order-aware math measures are positive elements that could inform practical systems if the representativeness of the ground-truth cases is established.

major comments (2)
  1. [Contribution (iii) and evaluation section] The claim that the math-based and citation-based approaches supplement text-based detection for concealed plagiarism rests on performance differences observed on 'confirmed cases of academic plagiarism.' The manuscript does not report the breakdown of these cases by concealment type (verbatim/light rewording vs. strong paraphrases, translations, or idea reuse), which is load-bearing for the central claim. If the confirmed set is dominated by easily detectable verbatim copies, the comparative results do not establish added value in the concealed-plagiarism regime highlighted in the abstract and skeptic note.
  2. [Two-stage process and math similarity measures section] The outperformance of the new measures over prior work is asserted, but without explicit reporting of statistical significance tests, effect sizes, or controls for post-hoc threshold selection on the confirmed cases, it is unclear whether the gains are robust or depend on dataset-specific tuning.
minor comments (2)
  1. [Abstract] The abstract and introduction use 'striking supplement' without quantifying the improvement (e.g., precision/recall deltas); a concrete metric comparison would strengthen the presentation.
  2. [Mathematical content similarity section] Notation for the order-aware math features should be defined more explicitly when first introduced to aid reproducibility.
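
The breakdown requested in major comment 1 could be reported along the following lines. This is a sketch with invented placeholder records: the concealment labels, the record layout, and the detection outcomes are hypothetical and are not data from the paper.

```python
from collections import defaultdict


def recall_by_type(cases):
    """cases: iterable of (concealment_type, {approach_name: detected_bool})."""
    hits = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for ctype, detected in cases:
        totals[ctype] += 1
        for approach, found in detected.items():
            hits[ctype][approach] += int(found)
    # Recall per concealment type and approach: detected cases / all cases of that type.
    return {
        ctype: {a: n / totals[ctype] for a, n in per_approach.items()}
        for ctype, per_approach in hits.items()
    }


# Placeholder records for illustration only.
cases = [
    ("verbatim", {"text": True, "math": True, "citation": True}),
    ("verbatim", {"text": True, "math": False, "citation": True}),
    ("paraphrase", {"text": False, "math": True, "citation": True}),
    ("translation", {"text": False, "math": True, "citation": False}),
]
table = recall_by_type(cases)
```

A table of this shape would show directly whether the non-text approaches add recall on the concealed types or only on cases that text matching already catches.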

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important aspects of our evaluation that we will address through revisions to strengthen the presentation of our results.

  1. Referee: [Contribution (iii) and evaluation section] The claim that the math-based and citation-based approaches supplement text-based detection for concealed plagiarism rests on performance differences observed on 'confirmed cases of academic plagiarism.' The manuscript does not report the breakdown of these cases by concealment type (verbatim/light rewording vs. strong paraphrases, translations, or idea reuse), which is load-bearing for the central claim. If the confirmed set is dominated by easily detectable verbatim copies, the comparative results do not establish added value in the concealed-plagiarism regime highlighted in the abstract and skeptic note.

    Authors: We agree that explicitly reporting the breakdown of confirmed cases by concealment type would strengthen support for the central claim regarding concealed plagiarism. We will revise the evaluation section to include this breakdown based on the available case metadata. revision: yes

  2. Referee: [Two-stage process and math similarity measures section] The outperformance of the new measures over prior work is asserted, but without explicit reporting of statistical significance tests, effect sizes, or controls for post-hoc threshold selection on the confirmed cases, it is unclear whether the gains are robust or depend on dataset-specific tuning.

    Authors: We acknowledge the need for statistical rigor. We will add significance tests and effect sizes to the revised manuscript. We will also clarify the threshold selection procedure and add any necessary controls to demonstrate it was not performed post-hoc on the evaluation cases. revision: yes
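
One way the promised significance testing might look is an exact paired sign-flip permutation test on per-case score differences, reported together with the mean difference as a simple effect size. The choice of test, the function names, and the scores below are assumptions for illustration; they are placeholders rather than results or methods from the paper.

```python
from itertools import product


def paired_permutation_p(new_scores, old_scores):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [n - o for n, o in zip(new_scores, old_scores)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for float rounding
            count += 1
    return count / total


def mean_difference(new_scores, old_scores):
    """Simple effect size: mean paired improvement."""
    return sum(n - o for n, o in zip(new_scores, old_scores)) / len(new_scores)


# Placeholder per-case scores, not results from the paper.
new = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85]
old = [0.7, 0.6, 0.65, 0.8, 0.5, 0.7]
p_value = paired_permutation_p(new, old)
effect = mean_difference(new, old)
```

Exact enumeration doubles in cost with every added case, so it only suits small sets of confirmed cases; a larger set would call for random sampling of sign assignments instead.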

Circularity Check

0 steps flagged

No significant circularity; evaluation relies on external ground truth

The paper's contributions consist of a two-stage detection process, new order-aware similarity measures for math features, and empirical comparisons on confirmed external plagiarism cases plus an independent 102K-document collection. No equations or derivations reduce by construction to fitted parameters or self-referential definitions. Self-citation to prior work on math/citation analysis is present but not load-bearing, as the effectiveness claims are validated against independent confirmed cases rather than derived from the cited prior results. The analysis is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied empirical paper on detection methods with no mathematical derivations, free parameters, axioms, or invented entities described in the abstract.


Venue: JCDL’19, Jun. 2019, Urbana-Champaign, IL, USA