Metadata Extraction from Raw Astroparticle Data of TAIGA Experiment

Alexandr Kryukov; Alexey Shigarov; Andrey Mikhailov; Elena Korosteleva; Igor Bychkov; Julia Dubenskaya; Minh-Duc Nguyen

arxiv: 1907.06183 · v1 · pith:GRA4EX6Cnew · submitted 2019-07-14 · 🌌 astro-ph.IM · cs.DC


Pith reviewed 2026-05-24 21:51 UTC · model grok-4.3

classification 🌌 astro-ph.IM cs.DC

keywords metadata extraction · raw data files · data catalog · binary file parsing · astroparticle data · descriptive metadata · unified interface


The pith

An extensible metadata extractor pulls hidden descriptive data from TAIGA raw files into a unified catalog.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make large volumes of raw astroparticle data usable by the scientific community through automatic extraction of metadata scattered across file names, folder structures, and package headers. It introduces a modular extractor design that can incorporate facility-specific components to handle multiple data formats without repeated manual work. If the approach holds, events and runs become searchable and aggregatable by time or equipment through one interface. A reader would care because uncharacterized binary data stays inaccessible for analysis or sharing.

Core claim

The authors developed a concept of the metadata extractor that can be extended by facility-specific extraction modules and is designed to automatically collect descriptive metadata from raw data files of all TAIGA formats, transforming information hidden in folder and file names plus package headers into a unified catalog form for digital objects such as events and runs.

What carries the argument

The extensible metadata extractor using facility-specific modules to parse scattered metadata from binary files and load it into a catalog.
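
One way to picture this extensibility is a small registry that hands each file to the first module that claims it. The sketch below is an illustration only: the paper defines no module interface, so `ExtractionModule`, `Extractor`, `DemoModule`, the `.demo` suffix, and the `demo-facility` label are all hypothetical names.

```python
from abc import ABC, abstractmethod
from pathlib import Path
from typing import Dict, List, Optional


class ExtractionModule(ABC):
    """Hypothetical interface for one facility-specific module."""

    facility: str  # label written into every record this module produces

    @abstractmethod
    def accepts(self, path: Path) -> bool:
        """Return True if this module recognizes the file."""

    @abstractmethod
    def extract(self, path: Path) -> Dict[str, str]:
        """Return descriptive metadata for the file."""


class Extractor:
    """Dispatches each file to the first registered module that accepts it."""

    def __init__(self) -> None:
        self._modules: List[ExtractionModule] = []

    def register(self, module: ExtractionModule) -> None:
        self._modules.append(module)

    def extract(self, path: Path) -> Optional[Dict[str, str]]:
        for module in self._modules:
            if module.accepts(path):
                record = module.extract(path)
                record["facility"] = module.facility
                return record
        return None  # no module knows this format


class DemoModule(ExtractionModule):
    """Invented module: recognizes files by an invented '.demo' suffix."""

    facility = "demo-facility"

    def accepts(self, path: Path) -> bool:
        return path.suffix == ".demo"

    def extract(self, path: Path) -> Dict[str, str]:
        return {"file": path.name}


extractor = Extractor()
extractor.register(DemoModule())
record = extractor.extract(Path("night01/run007.demo"))
```

Under this design, supporting a new format means registering one more module; the dispatch loop itself does not change. Files that no module accepts return `None`, which makes unsupported formats visible instead of silently skipped.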

If this is right

Raw data files become queryable by time and equipment through a single interface.
Events and runs can be aggregated without per-format manual metadata handling.
The system supports both current and future TAIGA data formats through added modules.
Descriptive metadata is loaded automatically into the catalog from binary sources.
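
As a rough illustration of what querying by time and equipment could look like once metadata sits in one catalog, the sketch below filters an in-memory list of run records. The `RunRecord` fields, the `find_runs` function, and the sample facility labels are assumptions made for this example; the paper does not publish a catalog schema or query API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical catalog entry for one run."""
    run_id: str
    facility: str
    start: datetime
    end: datetime


def find_runs(
    records: Iterable[RunRecord],
    facility: Optional[str] = None,
    since: Optional[datetime] = None,
    until: Optional[datetime] = None,
) -> List[RunRecord]:
    """Select runs by facility and by overlap with the window [since, until]."""
    selected = []
    for record in records:
        if facility is not None and record.facility != facility:
            continue
        if since is not None and record.end < since:
            continue  # run finished before the window opens
        if until is not None and record.start > until:
            continue  # run started after the window closes
        selected.append(record)
    return selected


# Invented sample data, not real TAIGA runs.
CATALOG = [
    RunRecord("r1", "fac-a", datetime(2019, 1, 10, 18), datetime(2019, 1, 10, 23)),
    RunRecord("r2", "fac-b", datetime(2019, 1, 11, 18), datetime(2019, 1, 11, 22)),
    RunRecord("r3", "fac-a", datetime(2019, 2, 1, 19), datetime(2019, 2, 1, 21)),
]

by_facility = find_runs(CATALOG, facility="fac-a")
by_window = find_runs(CATALOG, since=datetime(2019, 1, 11), until=datetime(2019, 1, 31))
```

The point of the sketch is that the same filter works regardless of which raw format a run came from, provided every module fills the same fields.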

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar modular parsing could apply to other experiments where metadata lives in file names and headers.
Success would reduce the effort needed to integrate new instruments into existing data catalogs.
The design implies that catalog completeness depends on how completely each module captures its format's fields.

Load-bearing premise

Metadata scattered in folder and file names plus package headers can be reliably parsed and transformed into a unified catalog form across all existing and future data formats without significant manual intervention or data loss.
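
To show what this premise asks of each module, here is a minimal sketch of pulling attributes out of a path. The layout `<facility>/<YYYY-MM-DD>/run<number>.dat` is invented for illustration and is not a documented TAIGA naming convention; `PATH_PATTERN` and `parse_path` are likewise hypothetical.

```python
import re
from datetime import date
from typing import Dict, Optional

# Assumed layout: <facility>/<YYYY-MM-DD>/run<number>.dat (invented for this sketch).
PATH_PATTERN = re.compile(
    r"^(?P<facility>[a-z0-9_-]+)/(?P<night>\d{4}-\d{2}-\d{2})/run(?P<run>\d+)\.dat$"
)


def parse_path(path: str) -> Optional[Dict[str, object]]:
    """Return attributes encoded in the path, or None if it does not fit the layout."""
    match = PATH_PATTERN.match(path)
    if match is None:
        return None
    try:
        night = date.fromisoformat(match["night"])
    except ValueError:
        return None  # digits matched the pattern but are not a real calendar date
    return {
        "facility": match["facility"],
        "night": night,
        "run": int(match["run"]),
    }


parsed = parse_path("demo/2019-01-10/run007.dat")
```

Returning `None` for anything that does not match, including impossible dates, is one way to keep unparseable names from turning into wrong catalog entries; whether a real module should reject, log, or repair such names is a design decision the paper leaves open.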

What would settle it

A new TAIGA data format where the extractor requires substantial custom coding or produces incomplete or incorrect catalog entries for key fields like time or equipment.
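
A test of this kind could start from a simple completeness measure over extracted records. The sketch below assumes, purely for illustration, that records are dictionaries and that `start_time` and `facility` are the key fields; neither the field names nor the functions come from the paper.

```python
from typing import Dict, Iterable, List, Sequence

# Assumed key fields; the paper names time and equipment but gives no schema.
REQUIRED_FIELDS = ("start_time", "facility")


def missing_fields(
    record: Dict[str, object], required: Sequence[str] = REQUIRED_FIELDS
) -> List[str]:
    """List required fields that are absent, None, or an empty string."""
    return [name for name in required if record.get(name) in (None, "")]


def incomplete_fraction(records: Iterable[Dict[str, object]]) -> float:
    """Fraction of records with at least one missing required field."""
    items = list(records)
    if not items:
        return 0.0
    incomplete = sum(1 for record in items if missing_fields(record))
    return incomplete / len(items)


# Invented sample records.
SAMPLE = [
    {"start_time": "2019-01-10T18:00:00", "facility": "fac-a"},
    {"start_time": "", "facility": "fac-a"},
    {"facility": "fac-b"},
    {"start_time": "2019-01-11T18:00:00", "facility": "fac-b"},
]

fraction = incomplete_fraction(SAMPLE)
```

This only detects missing values. Detecting incorrect values would need reference metadata to compare against, which the paper does not provide.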

Figures

Figures reproduced from arXiv: 1907.06183 by Alexandr Kryukov, Alexey Shigarov, Andrey Mikhailov, Elena Korosteleva, Igor Bychkov, Julia Dubenskaya, Minh-Duc Nguyen.

**Figure 1.** Aspects of time and equipment in metadata hidden in TAIGA raw data. Figure text (truncated): GET data WHERE time == range (time between start and end, less than a night); run (a specified run or a calibration run); night (a specified date); moonless month (a period of time, not a calendar month); summer (a summer period of time). GET data WHERE equipment == facility (a specified facility); cluster (a specified cluster (station) of a fac…

**Figure 2.** General metadata hidden in TAIGA raw data. Adjacent text (truncated): … and files to collect attributes being available in the folder/file names. It identifies the format of each input file, parses and validates binary data by using an appropriate format-specific library to extract metadata from package headers. The module also collects attributes from the input supplementary files (e.g. facility configuration file). All extracted m…

**Figure 3.** Workflow for the metadata extractor. Adjacent text (truncated): … Graphene-Python library. It also uses the object-relational mapping based on SQLAlchemy on the catalog side. Since all digital objects (events and runs) we consider are characterized by time, the design of the architecture suggests to use TimeScale, a time series database management system, for organizing metadata stored in the catalog. …
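
The text next to Figure 2 mentions parsing and validating binary data to read metadata from package headers. As a hedged illustration of that step, the sketch below decodes a fixed-size header with Python's standard `struct` module. The layout (4-byte magic, 16-bit station number, 32-bit event number, 64-bit timestamp in nanoseconds, little-endian) and the `DEMO` magic value are invented; real TAIGA headers are format-specific and are not described on this page.

```python
import struct
from typing import Dict

# Invented layout: magic (4 bytes), station (uint16), event (uint32),
# timestamp in ns (uint64); "<" means little-endian with no padding.
HEADER = struct.Struct("<4sHIQ")
MAGIC = b"DEMO"


def parse_header(blob: bytes) -> Dict[str, int]:
    """Validate and decode the header at the start of one package."""
    if len(blob) < HEADER.size:
        raise ValueError("package is shorter than the header")
    magic, station, event, timestamp_ns = HEADER.unpack_from(blob)
    if magic != MAGIC:
        raise ValueError("unexpected magic bytes")
    return {"station": station, "event": event, "timestamp_ns": timestamp_ns}


# Build a synthetic package and decode it again.
package = HEADER.pack(MAGIC, 3, 42, 1_000) + b"payload"
header = parse_header(package)
```

The two `ValueError` branches stand in for the validation step: a truncated package or an unrecognized magic value is rejected before any field reaches the catalog.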

Original abstract

Today, the operating TAIGA (Tunka Advanced Instrument for cosmic rays and Gamma Astronomy) experiment continuously produces and accumulates a large volume of raw astroparticle data. To be available for the scientific community these data should be well-described and formally characterized. The use of metadata makes it possible to search for and to aggregate digital objects (e.g. events and runs) by time and equipment through a unified interface to access them. The important part of the metadata is hidden and scattered in folder/files names and package headers. Such metadata should be extracted from binary files, transformed to a unified form of digital objects, and loaded into the catalog. To address this challenge we developed a concept of the metadata extractor that can be extended by facility-specific extraction modules. It is designed to automatically collect descriptive metadata from raw data files of all TAIGA formats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper sketches a high-level concept for an extensible metadata extractor for TAIGA raw data but gives no code, parsing rules, schema, or tests, so the claim of automatic handling across all formats stays unverified.


The main takeaway is that this manuscript outlines a concept for pulling metadata from TAIGA file names and binary headers into a unified catalog, using an extensible set of facility-specific modules, but it contains no implementation details or validation at all.

The authors correctly identify a real operational issue: large volumes of raw astroparticle data need searchable descriptions, and much of the needed info sits scattered in names and package headers rather than in a central place. Framing the solution as modular and extensible is a reasonable way to handle format changes over time without rewriting everything from scratch. That part aligns with standard practice in data-intensive experiments and shows they understand the maintenance problem. The paper does a clean job stating the goal of enabling searches by time and equipment through one interface.

Beyond that, there is little substance. No module interfaces are defined, no examples of how folder structures or headers would actually be parsed, no catalog schema is shown, and there are zero test cases or error rates on real TAIGA files. The central assertion that the system works automatically for all current and future formats without significant manual fixes therefore cannot be checked. This leaves the work at the level of an idea rather than a demonstrated tool.

Readers who manage data pipelines at similar cosmic-ray or gamma-ray facilities might find the high-level description useful as a prompt for their own thinking. Anyone needing reusable code, benchmarks, or a concrete design to adapt will not get it here. The thinking is straightforward and honest about the problem, but the lack of any concrete evidence means the paper is not ready for peer review. I would desk-reject and ask for at least one working module with validation results before sending it out.

Referee Report

2 major / 0 minor

Summary. The paper presents a high-level concept for an extensible metadata extractor intended to automatically harvest descriptive metadata (scattered in folder/file names and package headers) from raw TAIGA astroparticle data files of all formats, transform it into a unified catalog form, and thereby enable search and aggregation of events/runs by the scientific community.

Significance. If a concrete implementation with defined module interfaces, parsing rules, and validation on real TAIGA files were supplied, the work would address a genuine data-management bottleneck for a running experiment and could improve data discoverability. As written, however, the manuscript supplies only the design goal with no implementation details, examples, or tests, so its practical significance cannot yet be assessed.

major comments (2)

[Abstract] The central claim that the extractor 'is designed to automatically collect descriptive metadata from raw data files of all TAIGA formats' is unsupported; the text provides neither module interface definitions, parsing rules for folder/file names or headers, nor a unified catalog schema.
[Abstract] No demonstration, test case, or error-rate measurement on even a single TAIGA format is given, so the assertion that extraction works reliably 'across all existing and future TAIGA data formats without significant manual intervention' cannot be evaluated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review. Our manuscript presents a high-level conceptual design for an extensible metadata extractor rather than a fully implemented and validated system. We address the comments point by point below and propose revisions to better align the abstract with the paper's scope.


Referee: [Abstract] The central claim that the extractor 'is designed to automatically collect descriptive metadata from raw data files of all TAIGA formats' is unsupported; the text provides neither module interface definitions, parsing rules for folder/file names or headers, nor a unified catalog schema.

Authors: The manuscript describes the overall concept and design goals of an extensible architecture that incorporates facility-specific extraction modules. Detailed module interfaces, parsing rules, and the exact catalog schema are not specified because this is a conceptual paper; such elements would be defined during implementation. We will revise the abstract to explicitly state that the work outlines a conceptual framework without providing implementation specifications.

Revision: partial

Referee: [Abstract] No demonstration, test case, or error-rate measurement on even a single TAIGA format is given, so the assertion that extraction works reliably 'across all existing and future TAIGA data formats without significant manual intervention' cannot be evaluated.

Authors: We agree that no empirical tests or demonstrations are present, as the paper focuses on the architectural concept and its intended extensibility rather than a deployed implementation. The language about reliable operation across formats describes the design objective of the module-based approach. We will revise the abstract to clarify this as a conceptual goal rather than a demonstrated capability.

Revision: yes

Circularity Check

0 steps flagged

No circularity: high-level software concept with no derivation chain


The paper presents only a conceptual description of an extensible metadata extractor for TAIGA data formats. No equations, fitted parameters, predictions, uniqueness theorems, or self-citations appear in the provided text or abstract. The central claim is a design assertion ('we developed a concept of the metadata extractor... designed to automatically collect descriptive metadata from raw data files of all TAIGA formats') rather than a result derived from prior inputs. This is a self-contained architectural outline with no load-bearing steps that reduce to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced; the paper is a descriptive account of a software design.


Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · 1 internal anchor

[1] Y. Demchenko, Z. Zhao, P. Grosso, A. Wibisono, and C. de Laat, “Addressing big data challenges for scientific data infrastructure,” in 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, pp. 614–617, 2012.
[2] I. Bychkov et al., “Russian-German astroparticle data life cycle initiative,” Data, vol. 3, no. 4:56, 2018.
[3] P. A. David, “Understanding the emergence of ‘open science’ institutions: functionalist economics in historical context,” Indus. & Corp. Change, vol. 13, no. 4, pp. 571–589, 2004.
[4] B. A. Nosek et al., “Promoting an open research culture,” Science, vol. 348, no. 6242, pp. 1422–1425, 2015.
[5] Budnev, N.; Astapov, I.; Bezyazeekov, P.; Bogdanov, A.; Boreyko, V.; Büker, M.; Brückner, M.; Chiavassa, A.; Chvalaev, O.; Gress, O. et al., “The TAIGA experiment: from cosmic ray to gamma-ray astronomy in the Tunka valley,” J. Phys. Conf. Ser., vol. 718, no. 5, p. 052006, 2016.
[6] V. V. Prosin et al., “Results from Tunka-133 (5 years observation) and from the Tunka-HiSCORE prototype,” EPJ Web Conf., vol. 121, p. 03004, 2016.
[7] L. A. Kuzmichev et al., “TAIGA Gamma Observatory: Status and Prospects,” Phys. Atom. Nucl., vol. 81, pp. 497–507, 2018.
[8] R. D. Monkhoev et al., “The Tunka-Grande experiment: Status and prospects,” Bull. Russ. Acad. Sci., vol. 81, no. 4, pp. 468–470, 2017.
[9] P. A. Bezyazeekov et al., “Measurement of cosmic-ray air showers with the Tunka Radio Extension (Tunka-Rex),” Nucl. Instrum. Meth., vol. A802, pp. 89–96, 2015.
[11] I. Bychkov et al., “Using binary file format description languages for documenting, parsing, and verifying raw data in TAIGA experiment,” CoRR, vol. abs/1812.01324, 2018.
[12] M. A. Khmel’nov A., Bychkov I., “A declarative language FlexT for analysis and documenting of binary data formats,” Proceedings of ISP RAS, vol. 28, no. 5, pp. 239–268, 2016.
[13] E. B. Postnikov, A. P. Kryukov, S. P. Polyakov, D. A. Shipilov, and D. P. Zhurov, “Gamma/hadron separation in imaging air Cherenkov telescopes using deep learning libraries TensorFlow and PyTorch,” Journal of Physics: Conference Series, vol. 1181, p. 012048, 2019.
[14] A. P. Kryukov and A. P. Demichev, “Architecture of distributed data storage for astroparticle physics,” Lobachevskii Journal of Mathematics, vol. 39, no. 9, pp. 1199–1206, 2018.
[15] A. Kryukov and M.-D. Nguyen, “A distributed storage for astroparticle physics,” EPJ Web of Conferences, vol. 207, p. 08003, 2019.
