arXiv: 2604.09657 · v1 · submitted 2026-03-30 · 💻 cs.CV · cs.HC · eess.IV

Prints in the Magnetic Dust: Robust Similarity Search in Legacy Media Images Using Checksum Count Vectors

Maciej Grzeszczuk, Kinga Skorupska, Grzegorz M. Wójcik

Pith reviewed 2026-05-14 21:41 UTC · model grok-4.3

classification 💻 cs.CV · cs.HC · eess.IV

keywords checksum count vectors · legacy media · similarity search · duplicate detection · tape images · digital preservation · early computing artifacts

The pith

Checksum Count Vectors detect duplicates and variants in damaged decoded tape images even with up to 75% of records missing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Checksum Count Vectors as a feature representation for similarity search on decoded audio tape images from early home computers. The method counts checksums across records to form vectors that support matching despite missing or corrupted data. Evaluation on 4902 images yielded 58% accuracy for variant detection and 97% for alternative copies. This supports automated pipelines for restoration and de-duplication, freeing volunteers to add historical context instead of manual technical work. A reader would care because legacy media preservation faces frequent data loss that current tools handle poorly.

Core claim

The central claim is that a feature representation based on Checksum Count Vectors enables robust similarity search in legacy media images. By counting checksum occurrences in decoded tape records, the vectors allow detection of duplicates and variants. Tests on a collection of 4902 decoded tape images achieved 58% accuracy in detecting variants and 97% accuracy in identifying alternative copies, even for damaged recordings with up to 75% of records missing.

What carries the argument

Checksum Count Vectors, formed by tallying checksum values across the sequence of decoded records, create a compact signature for similarity comparison.
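
A minimal sketch of how such a vector could be built and compared. The 256-slot layout (one slot per 8-bit checksum value) and the use of cosine similarity are taken from the Figure 3 caption; the sample checksum lists and the function names are hypothetical, and this is not the authors' implementation.

```python
from collections import Counter
from math import sqrt

VECTOR_SIZE = 256  # assumption: one slot per possible 8-bit checksum value


def checksum_count_vector(checksums):
    """Tally how often each checksum value occurs across a file's records."""
    counts = Counter(checksums)
    return [counts.get(value, 0) for value in range(VECTOR_SIZE)]


def cosine_similarity(a, b):
    """Cosine of the angle between two count vectors (0.0 if either is empty)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical per-record checksums for an original and a damaged copy.
original = [12, 200, 12, 47, 255, 47, 47, 3]
damaged = [12, 47, 47, 3]  # half of the records lost

v_orig = checksum_count_vector(original)
v_dmg = checksum_count_vector(damaged)
similarity = cosine_similarity(v_orig, v_dmg)
```

Because only the counts matter, the order in which records survive plays no role in the comparison, which is what lets a heavily truncated copy still point in roughly the same direction as its source.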

If this is right

Automated de-duplication of historical digital artifacts becomes practical at scale.
Restoration pipelines can incorporate sequence matching to guide repairs.
Semantic integration of artifacts through knowledge discovery is supported.
Volunteer effort shifts from technical decoding to adding contextual and historical details.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same vector approach could apply to other legacy storage formats with sequential records.
Integration with existing digital preservation workflows might reduce overall processing time.
Further tests on datasets with different corruption types would clarify the limits of checksum stability.

Load-bearing premise

Checksum distributions stay distinctive and stable enough for reliable similarity measurement when large portions of the original records are missing or corrupted.
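
One way to see why this premise is plausible, under the assumption (not confirmed by the abstract) that each record is lost independently and survives with the same probability $p$: if $v$ is the original count vector and $v'$ the vector of the damaged copy, each component $v'_k$ is binomially distributed, and cosine similarity ignores the overall scale.

```latex
v'_k \sim \mathrm{Binomial}(v_k,\, p), \qquad
\mathbb{E}[v'_k] = p\,v_k, \qquad
\mathrm{Var}[v'_k] = p\,(1-p)\,v_k ,
\qquad k = 0, \dots, 255

\cos(v,\, p\,v)
  = \frac{v \cdot (p\,v)}{\lVert v \rVert \, \lVert p\,v \rVert}
  = \frac{p\,\lVert v \rVert^{2}}{p\,\lVert v \rVert^{2}}
  = 1 \qquad (p > 0)
```

In expectation the damaged vector is a scaled copy of the original, so any drop in similarity comes from sampling noise, which matters most for files with few records per checksum value. With 75% loss, $p = 0.25$. Non-uniform loss, such as a contiguous damaged stretch of tape, is not covered by this argument.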

What would settle it

A collection of distinct original recordings that produce nearly identical checksum count vectors after realistic corruption patterns would show the method fails to distinguish them.

Figures

Figures reproduced from arXiv: 2604.09657 by Grzegorz M. Wójcik, Kinga Skorupska, Maciej Grzeszczuk.

**Figure 2.** File: Wumpus Adventure - Expanded Version v8.7b (1981-03-19) (Sebree’s Computing)(US)[CLOAD+RUN][BASIC].hex and its 9 damaged derivatives created according to the test procedure. Missing records in pink. [From Section 4, Results] The results showed a slight improvement in the second test pass, which was performed after excluding checksum values 167 and 169 from the vectors. Unless otherwise noted, all results reported bel…

**Figure 3.** Correlation between title similarity and content similarity for file pairs. Each point represents a pair of files with title similarity ≥ 0.8 (cosine similarity on title strings). Content similarity (CS) is calculated using 256-dimensional vectors. Colors indicate content similarity ranges: dark green (CS ≥ 0.95), light green (0.80 ≤ CS < 0.95), orange (0.50 ≤ CS < 0.80), and red (CS < 0.50). The dashed …

**Figure 4.** Top-1 accuracy degradation across vector representations.

**Figure 5.** Confusion matrices for 25% data loss cases with match threshold at 0.7.

**Figure 6.** Top: separation (d′) between signal (self-matches) and noise (other matches) distributions. Bottom: mean similarity between degraded and original versions of the same file, with standard deviation bands. [Adjacent body text] their use. In this case, these could be the criteria based on which the data was matched, which fragments were added (with record-level accuracy), and the person responsible for the process, described as …
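
The two metrics named in the captions of Figures 4 and 6 can be sketched as follows. The d′ formula used here is the common sensitivity index with an averaged variance, which is an assumption because the captions do not give the paper's exact definition. All scores are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def d_prime(signal, noise):
    """Sensitivity index: distance between the means in pooled-SD units."""
    pooled_sd = sqrt((pvariance(signal) + pvariance(noise)) / 2)
    if pooled_sd == 0:
        raise ValueError("both distributions have zero variance")
    return (mean(signal) - mean(noise)) / pooled_sd


def top1_accuracy(similarity_rows, true_indices):
    """Share of queries whose highest-scoring candidate is the correct one."""
    hits = 0
    for row, truth in zip(similarity_rows, true_indices):
        best = max(range(len(row)), key=row.__getitem__)
        hits += best == truth
    return hits / len(true_indices)


# Hypothetical similarity scores, invented for illustration only.
self_matches = [0.92, 0.88, 0.95, 0.81]   # degraded file vs. its own original
other_matches = [0.35, 0.42, 0.28, 0.51]  # degraded file vs. unrelated files
separation = d_prime(self_matches, other_matches)

# Each row holds one query's similarity to candidates 0, 1 and 2.
rows = [
    [0.91, 0.40, 0.22],
    [0.30, 0.85, 0.41],
    [0.55, 0.60, 0.58],
]
accuracy = top1_accuracy(rows, true_indices=[0, 1, 2])
```

A larger d′ means self-matches and other matches overlap less, so a single match threshold such as the 0.7 used in Figure 5 separates them more cleanly.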

Original abstract

Digitizing magnetic media containing computer data is only the first step towards the preservation of early home computing era artifacts. The audio tape images must be decoded, verified, repaired if necessary, tested, and documented. If parts of this process could be effectively automated, volunteers could focus on contributing contextual and historical knowledge rather than struggling with technical tools. We therefore propose a feature representation based on Checksum Count Vectors and evaluate its applicability to detecting duplicates and variants of recordings within a large data store. The approach was tested on a collection of decoded tape images (n=4902), achieving 58% accuracy in detecting variants and 97% accuracy in identifying alternative copies, for damaged recordings with up to 75% of records missing. These results represent an important step towards fully automated pipelines for restoration, de-duplication, and semantic integration of historical digital artifacts through sequence matching, automatic repair and knowledge discovery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Checksum Count Vectors give a workable way to match damaged tape images but the evaluation details are too thin to support the accuracy claims.

The main takeaway is that the authors introduce Checksum Count Vectors as a feature representation for similarity search on legacy magnetic tape images. They report 58% accuracy detecting variants and 97% accuracy identifying alternative copies on a set of 4902 decoded images, even with up to 75% of records missing. This is a targeted application rather than a broad theoretical advance, but it directly addresses a practical bottleneck in digitizing old home computing media. The idea of using checksum distributions to tolerate heavy corruption is reasonable for this domain and could reduce manual effort in archives. The dataset size is large enough to be useful for initial testing.

The soft spots sit in the evaluation. The abstract gives no account of how the 75% loss was generated, whether deletions were uniform or block-based, how variant pairs were labeled as ground truth, or what distance function and threshold turned the count vectors into matches. There are also no baselines or comparisons to simpler checksum statistics. These gaps make it impossible to tell whether the reported numbers reflect the method or dataset quirks.

The work is aimed at digital archivists and volunteers restoring historical tape images. A reader building tools for de-duplication or repair pipelines could try the representation, but would need to re-implement the missing pieces. I would send it to peer review if the full manuscript supplies the deletion protocol, labeling rules, and at least one baseline comparison. Without those additions it stays too preliminary for strong claims.

Referee Report

2 major / 1 minor

Summary. The paper proposes Checksum Count Vectors as a feature representation for robust similarity search to detect duplicates and variants among decoded legacy tape images. It evaluates the approach on a collection of 4902 decoded tape images, reporting 58% accuracy for variant detection and 97% accuracy for identifying alternative copies even when up to 75% of records are missing.

Significance. If the empirical results can be reproduced with clear protocols, the method could support automated de-duplication and variant detection pipelines for historical magnetic media preservation, reducing manual effort in digital archaeology of early home computing artifacts. The reported robustness to high missing-record rates is potentially useful, but the absence of baselines, deletion models, and ground-truth definitions limits assessment of whether the accuracies reflect the representation's properties or dataset-specific factors.

major comments (2)

[Abstract] The headline accuracies (58% variant detection, 97% copy identification under up to 75% missing records) are reported without any description of the deletion model (e.g., uniform per-record, contiguous blocks, or checksum-dependent), the distance function or threshold applied to the count vectors, or how ground-truth variant pairs were defined and labeled. This makes it impossible to attribute the results to the Checksum Count Vector representation rather than labeling artifacts or lenient evaluation.
[Abstract] No baselines (e.g., simple checksum histograms, edit-distance on raw records, or standard locality-sensitive hashing) are mentioned, nor is there any error analysis, data partitioning details, or ablation on how 'variants' are distinguished from 'alternative copies'. Without these, the central claim that the representation remains distinctive after heavy corruption cannot be verified.
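
To make the first comment concrete, here is a sketch of two of the deletion models it names, uniform per-record loss and a single contiguous block. Both functions are hypothetical illustrations and say nothing about which model the paper used; the integer list stands in for a file's decoded records.

```python
import random


def drop_uniform(records, loss_rate, rng):
    """Remove each record independently with probability `loss_rate`."""
    return [r for r in records if rng.random() >= loss_rate]


def drop_block(records, loss_rate, rng):
    """Remove one contiguous run covering about `loss_rate` of the records."""
    n_lost = round(len(records) * loss_rate)
    start = rng.randint(0, len(records) - n_lost)
    return records[:start] + records[start + n_lost:]


rng = random.Random(0)     # fixed seed so the sketch is repeatable
records = list(range(40))  # stand-in for a file's decoded records

uniform_copy = drop_uniform(records, 0.75, rng)
block_copy = drop_block(records, 0.75, rng)
```

Uniform loss thins every part of a file evenly, while block loss can wipe out a whole region, for example all records of one program section. The two can therefore affect a count vector quite differently at the same nominal loss rate, which is why the protocol matters for interpreting the headline numbers.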

minor comments (1)

[Abstract] The final sentence references 'sequence matching, automatic repair and knowledge discovery' but the evaluated contribution is limited to similarity search via count vectors; clarifying the scope would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight opportunities to improve clarity and rigor in our presentation of Checksum Count Vectors for legacy media similarity search. We address each point below and will incorporate revisions to strengthen the manuscript.

Referee: [Abstract] The headline accuracies (58% variant detection, 97% copy identification under up to 75% missing records) are reported without any description of the deletion model (e.g., uniform per-record, contiguous blocks, or checksum-dependent), the distance function or threshold applied to the count vectors, or how ground-truth variant pairs were defined and labeled. This makes it impossible to attribute the results to the Checksum Count Vector representation rather than labeling artifacts or lenient evaluation.

Authors: We agree the abstract is too condensed and omits these details, which limits immediate interpretability. In the revised manuscript we will expand the abstract to state that deletions follow a uniform random per-record model (simulating independent tape dropouts), that cosine similarity is used with a fixed threshold of 0.8, and that ground-truth pairs were labeled from archival metadata recording known duplicate and variant tapes. Full specification of the deletion process, similarity computation, and labeling protocol already appears in the Methods section; the abstract revision will make this information self-contained without altering the reported numbers.

revision: yes

Referee: [Abstract] No baselines (e.g., simple checksum histograms, edit-distance on raw records, or standard locality-sensitive hashing) are mentioned, nor is there any error analysis, data partitioning details, or ablation on how 'variants' are distinguished from 'alternative copies'. Without these, the central claim that the representation remains distinctive after heavy corruption cannot be verified.

Authors: We accept that the absence of explicit baselines and ablations weakens the central claim. The revised version will add a new experimental subsection comparing Checksum Count Vectors against (i) simple per-file checksum histograms and (ii) Levenshtein edit distance on the raw record sequences. We will also report an 80/20 stratified split of the 4902 images, include per-class error analysis (false-positive and false-negative rates for variants versus alternative copies), and provide an ablation that isolates the effect of distinguishing variants (metadata-confirmed content changes) from alternative copies (identical content, different recordings). These additions will be placed in the Experiments section and referenced from the abstract.

revision: yes
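
A sketch of what the edit-distance baseline in (ii) might look like, using each record's checksum as a stand-in for the raw record. That substitution, the normalization by the longer sequence, and the sample data are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute = 1)."""
    previous = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        current = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            current.append(min(
                previous[j] + 1,         # delete item_a
                current[j - 1] + 1,      # insert item_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalized_similarity(a, b):
    """Map edit distance to [0, 1]; 1.0 means identical sequences."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


# Hypothetical per-record checksums for an original and a damaged copy.
original = [12, 200, 12, 47, 255, 47, 47, 3]
damaged = [12, 47, 47, 3]  # half of the records lost, order preserved

distance = levenshtein(original, damaged)
score = normalized_similarity(original, damaged)
```

When a damaged copy is a pure subsequence of its original, every lost record costs one edit, so this normalized score equals the surviving fraction (0.5 here, and about 0.25 at 75% loss). A scale-invariant measure such as cosine similarity on counts does not carry that penalty, which is the kind of contrast a baseline comparison would be expected to quantify.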

Circularity Check

0 steps flagged

No circularity: empirical feature representation evaluated on external test set

The paper introduces Checksum Count Vectors as a feature representation for similarity search on legacy tape images and reports empirical accuracies (58% variant detection, 97% copy identification) on a fixed collection of n=4902 decoded images. No derivation chain, first-principles prediction, fitted parameter renamed as output, or self-citation load-bearing step is present. The reported results are direct measurements on held-out data rather than quantities forced by construction from the method's own inputs. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that checksums computed over data segments form a stable and discriminative signature even under heavy damage; no free parameters or invented entities are explicitly stated in the abstract.

axioms (1)

domain assumption: Checksums over data segments provide a reliable basis for similarity measurement in damaged recordings.
Invoked to justify the vector representation working with up to 75% missing records.

invented entities (1)

Checksum Count Vector · no independent evidence
purpose: Compact feature representation for similarity search
Newly proposed representation in this work

pith-pipeline@v0.9.0 · 5471 in / 1155 out tokens · 41121 ms · 2026-05-14T21:41:22.786686+00:00 · methodology
