Executable Boundary Contracts for Sound Event Traces

Faruk Alpay; Hamdi Alakkad

arxiv: 2605.19632 · v1 · pith:5XJVJE4Dnew · submitted 2026-05-19 · 💻 cs.LO · cs.SD

Pith reviewed 2026-05-20 02:04 UTC · model grok-4.3

keywords: sound event traces, boundary contracts, temporal logic, STL, event detection, evaluation metrics, boundary failures, union activity

The pith

Executable boundary contracts measure typed boundary behavior in sound event traces more precisely than compressed frame or event scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to define executable boundary contracts for finite sound event traces so that timed boundary behavior is not lost when reports compress it into frame, segment, or event scores. It specifies a frame fragment as a bounded Boolean fragment that embeds into Signal Temporal Logic after grid projection, and adds an event layer with interval matching, duration clauses, fragmentation clauses, and obligation restricted vector scoring. Evaluations on controlled scenes, real soundscapes, pretrained probes, and baseline tracks show that contract coordinates disagree with standard scores in interpretable ways. The main corpus finding is that union activity can conceal typed boundary failures, with baseline outputs offering class indexed references. A reader would care because better boundary measurement improves assessment of detection systems where timing precision affects overall results.

Core claim

The paper establishes executable boundary contracts for finite sound event traces. The frame fragment is a bounded Boolean fragment embeddable in STL after grid projection. The event layer adds declared interval matching, duration clauses, fragmentation clauses, and obligation restricted vector scoring. The contracts aim at measurement and show that standard scores and contract coordinates disagree, with the strongest real corpus finding that union activity can hide typed boundary failure while external baseline outputs provide a class indexed challenge level reference.

What carries the argument

The executable boundary contract, consisting of a bounded Boolean frame fragment embeddable in STL after grid projection together with an event layer for declared interval matching, duration clauses, fragmentation clauses, and obligation restricted vector scoring.
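
As a rough illustration of how event-layer clauses could yield a vector of coordinates instead of a single score, here is a minimal Python sketch. The clause names (`onset`, `offset`, `duration`, `fragmentation`), the thresholds, and the function `score_reference` are hypothetical; the paper's actual definitions are not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    label: str
    onset_ms: int
    offset_ms: int


def overlaps(a: Event, b: Event) -> bool:
    """Same class and a non-empty intersection of the two intervals."""
    return a.label == b.label and a.onset_ms < b.offset_ms and b.onset_ms < a.offset_ms


def score_reference(ref, preds, obligations,
                    collar_ms=200, min_dur_ratio=0.5, max_fragments=1):
    """Return one Boolean coordinate per declared obligation.

    All clause names and thresholds are illustrative assumptions.
    """
    hits = [p for p in preds if overlaps(ref, p)]
    ref_dur = ref.offset_ms - ref.onset_ms
    checks = {
        # Some overlapping prediction starts within the collar of the reference onset.
        "onset": any(abs(p.onset_ms - ref.onset_ms) <= collar_ms for p in hits),
        # Some overlapping prediction ends within the collar of the reference offset.
        "offset": any(abs(p.offset_ms - ref.offset_ms) <= collar_ms for p in hits),
        # Some overlapping prediction covers enough of the reference duration.
        "duration": any(p.offset_ms - p.onset_ms >= min_dur_ratio * ref_dur for p in hits),
        # The reference is matched by at least one and at most max_fragments pieces.
        "fragmentation": 1 <= len(hits) <= max_fragments,
    }
    # Restrict the score vector to the clauses the contract actually declares.
    return {name: checks[name] for name in obligations}


ref = Event("speech", 1000, 3000)
preds = [Event("speech", 1050, 1800), Event("speech", 2000, 2950)]
vector = score_reference(ref, preds, ["onset", "offset", "fragmentation"])
print(vector)
```

In this toy case both boundaries fall inside the collar, yet the reference is covered by two fragments, so the vector reports `onset` and `offset` as satisfied and `fragmentation` as violated. A single event score would have to collapse those three facts into one number.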

If this is right

Standard scores and contract coordinates disagree in interpretable ways across the evaluated tracks.
Union activity can hide typed boundary failure.
Baseline outputs provide a class indexed challenge level reference.
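
The second consequence can be made concrete with a toy trace. The sketch below assumes that "union activity" means the class-agnostic OR of all per-class frame activities; that reading, the class names, and the helper names are assumptions for illustration, not taken from the paper.

```python
def boundaries(frames):
    """Return (onsets, offsets) as frame indices where activity switches on or off."""
    onsets, offsets = [], []
    prev = False
    for k, active in enumerate(frames):
        if active and not prev:
            onsets.append(k)
        if prev and not active:
            offsets.append(k)
        prev = active
    if prev:  # activity still on at the end of the finite trace
        offsets.append(len(frames))
    return onsets, offsets


def union_activity(trace):
    """Class-agnostic activity: a frame is on if any class is on."""
    return [any(column) for column in zip(*trace.values())]


# Ten frames; the speech/dog boundary sits at frame 5 in the reference
# and at frame 7 in the prediction.
reference = {"speech": [True] * 5 + [False] * 5, "dog": [False] * 5 + [True] * 5}
predicted = {"speech": [True] * 7 + [False] * 3, "dog": [False] * 7 + [True] * 3}

same_union = union_activity(reference) == union_activity(predicted)
typed = {c: (boundaries(reference[c]), boundaries(predicted[c])) for c in reference}
print(same_union, typed)
```

Here the union traces are identical, so any score computed on union activity is perfect, while the per-class boundaries differ by two frames for both classes.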

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The contracts could be applied to other timed event domains to check whether similar masking effects occur in their standard scores.
If adopted in practice, the method might prompt revisions to how aggregate scores are interpreted when timing details matter.
The findings suggest examining union operations more closely in any scoring system that combines overlapping detections.

Load-bearing premise

The frame fragment is a bounded Boolean fragment embeddable in STL after grid projection.
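
To make the premise more tangible, the following Python sketch projects continuous intervals onto a fixed grid and evaluates two bounded operators over the resulting Boolean frames, in the spirit of the STL operators F_[a,b] and G_[a,b]. The overlap-based projection rule, the clipping of windows at the end of the trace, and all function names are assumptions; the paper's own construction may differ.

```python
def project_to_grid(intervals_ms, step_ms, n_frames):
    """Mark frame k active when [k*step, (k+1)*step) overlaps any interval."""
    frames = [False] * n_frames
    for onset_ms, offset_ms in intervals_ms:
        first = max(0, onset_ms // step_ms)
        last = min(n_frames, -(-offset_ms // step_ms))  # ceiling division
        for k in range(first, last):
            frames[k] = True
    return frames


def eventually(frames, k, lo, hi):
    """Bounded 'eventually': some frame in k+lo..k+hi is active (window clipped to the trace)."""
    window = range(k + lo, min(k + hi, len(frames) - 1) + 1)
    return any(frames[j] for j in window)


def always(frames, k, lo, hi):
    """Bounded 'always': every frame in k+lo..k+hi is active (window clipped to the trace)."""
    window = range(k + lo, min(k + hi, len(frames) - 1) + 1)
    return all(frames[j] for j in window)


# One event from 250 ms to 550 ms on a 100 ms grid of ten frames.
frames = project_to_grid([(250, 550)], step_ms=100, n_frames=10)
print(frames)
print(eventually(frames, 0, 0, 2), always(frames, 2, 0, 3), always(frames, 2, 0, 4))
```

Because every window is finite and every atom is a Boolean frame value, each formula reduces to a finite conjunction or disjunction, which is what makes such a fragment directly executable.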

What would settle it

If contract coordinates matched standard scores without interpretable disagreements on the evaluated tracks, or if union activity never concealed any typed boundary failures, the contracts would show no measurement advantage.

Original abstract

Sound event reports often compress timed boundary behavior into frame, segment, or event scores. This paper defines executable boundary contracts for finite sound event traces. The frame fragment is a bounded Boolean fragment embeddable in STL after grid projection. The event layer adds declared interval matching, duration clauses, fragmentation clauses, and obligation restricted vector scoring. The aim is measurement, not a new general temporal logic and not a challenge leaderboard. The artifact evaluates controlled Mini LibriSpeech seeded scenes, MAESTRO Real soundscapes, frozen pretrained timing probes, and an official DCASE 2024 Task 4 baseline track. Across these tracks, standard scores and contract coordinates disagree in interpretable ways. The strongest real corpus finding is that union activity can hide typed boundary failure, while external DCASE outputs provide a class indexed challenge level reference. Code, generated tables, manifests, and Lean checks for the finite frame core are supplied as ancillary material.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper defines executable boundary contracts for measuring boundary timing in sound event traces more granularly than standard scores, but the grid projection step needs a clearer soundness argument.

Two things stand out. The paper introduces executable boundary contracts that layer declared interval matching, duration clauses, fragmentation clauses, and restricted vector scoring on a bounded Boolean frame fragment meant to embed in STL after grid projection. It also reports that these contract coordinates disagree with standard scores in interpretable ways on real data, with union activity hiding typed boundary failures as the strongest corpus finding.

Referee Report

2 major / 1 minor

Summary. The paper defines executable boundary contracts for finite sound event traces. The frame fragment is a bounded Boolean fragment embeddable in STL after grid projection. The event layer adds declared interval matching, duration clauses, fragmentation clauses, and obligation restricted vector scoring. The artifact evaluates controlled Mini LibriSpeech seeded scenes, MAESTRO Real soundscapes, frozen pretrained timing probes, and an official DCASE 2024 Task 4 baseline track. Across these tracks, standard scores and contract coordinates disagree in interpretable ways, with the strongest real corpus finding that union activity can hide typed boundary failure.

Significance. If the contracts are sound, the work supplies a measurement-oriented formalism that can expose boundary issues masked by conventional frame/segment/event scores in sound event detection. The provision of code, generated tables, manifests, and Lean checks for the finite frame core is a positive contribution to reproducibility and machine-checked executable specifications.

major comments (2)

[Abstract] Abstract and frame fragment definition: the claim that the frame fragment is a bounded Boolean fragment embeddable in STL after grid projection is central to interpreting disagreements as genuine boundary measurements rather than artifacts. No explicit soundness proof is reported that the projection preserves satisfaction for boundary conditions (onset/offset precision) on finite traces; the supplied Lean checks address only the finite frame core.
[Evaluation] Evaluation findings on union activity hiding typed boundary failure: this strongest corpus claim depends on the contracts correctly detecting typed failures. Without the missing embeddability soundness argument, it remains possible that observed disagreements arise from discretization effects rather than improved measurement.
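
The discretization concern in both major comments can be illustrated with a small Python sketch. It assumes an overlap-based projection onto a 100 ms grid; the rule, the grid step, and the function name are hypothetical and chosen only to show the effect.

```python
def active_frames(onset_ms, offset_ms, step_ms):
    """Indices of grid frames that overlap the interval [onset_ms, offset_ms)."""
    return set(range(onset_ms // step_ms, -(-offset_ms // step_ms)))


STEP_MS = 100
reference = active_frames(1000, 2000, STEP_MS)
late_by_99 = active_frames(1099, 2000, STEP_MS)  # onset 99 ms late
early_by_1 = active_frames(999, 2000, STEP_MS)   # onset 1 ms early

print(late_by_99 == reference)  # the 99 ms error is invisible on the grid
print(early_by_1 == reference)  # the 1 ms error flips a whole frame
```

Under this projection a 99 ms onset error leaves the frame trace unchanged, while a 1 ms error in the other direction activates an extra frame. Any frame-level disagreement therefore needs to be read with the grid step in mind.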

minor comments (1)

The distinction between the frame fragment and the full event-layer contract could be clarified with explicit notation or a running example early in the manuscript.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the manuscript. We address each major comment below, agreeing where the observation identifies a genuine gap in the current presentation.

Referee: [Abstract] Abstract and frame fragment definition: the claim that the frame fragment is a bounded Boolean fragment embeddable in STL after grid projection is central to interpreting disagreements as genuine boundary measurements rather than artifacts. No explicit soundness proof is reported that the projection preserves satisfaction for boundary conditions (onset/offset precision) on finite traces; the supplied Lean checks address only the finite frame core.

Authors: We agree that the manuscript does not supply an explicit soundness proof that the grid projection preserves satisfaction of boundary conditions on finite traces. The Lean development formalizes and checks the semantics of the finite frame core itself. The embeddability claim is presented as holding by construction of the projection, which discretizes continuous-time intervals onto a fixed grid while retaining the Boolean fragment. We will revise the abstract and the frame-fragment section to state this limitation explicitly, to describe the projection construction in more detail, and to include a high-level preservation argument for onset/offset conditions together with a note that a machine-checked proof of the projection step remains future work. revision: yes
Referee: [Evaluation] Evaluation findings on union activity hiding typed boundary failure: this strongest corpus claim depends on the contracts correctly detecting typed failures. Without the missing embeddability soundness argument, it remains possible that observed disagreements arise from discretization effects rather than improved measurement.

Authors: We accept that the strongest corpus finding is presented without a completed soundness argument for boundary preservation, so the possibility that some disagreements reflect discretization artifacts cannot be ruled out on the basis of the current text. We will revise the evaluation section to add an explicit caveat that the reported disagreements are interpreted under the working assumption that the projection preserves the relevant boundary conditions, to reference the Lean checks that support executability of the core, and to qualify the union-activity observation accordingly. This will make the evidential status of the claim clearer to readers. revision: yes
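
One shape the high-level preservation argument mentioned in the first response could take is sketched below. It assumes, which the source does not state, that grid projection sends a boundary time t to the frame index ⌊t/Δ⌋ for a fixed step Δ > 0. Here τ is a continuous-time collar and m a frame-index collar; both symbols are introduced for illustration only.

```latex
\[
  |t - t'| \le \tau
  \;\Longrightarrow\;
  \bigl| \lfloor t/\Delta \rfloor - \lfloor t'/\Delta \rfloor \bigr|
  \le \lceil \tau/\Delta \rceil ,
\]
\[
  \bigl| \lfloor t/\Delta \rfloor - \lfloor t'/\Delta \rfloor \bigr| \le m
  \;\Longrightarrow\;
  |t - t'| < (m + 1)\,\Delta .
\]
```

Under that assumption, an onset or offset clause is preserved only up to one grid step of slack in each direction, which is the discretization margin the second referee comment asks the authors to account for.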

Circularity Check

0 steps flagged

No significant circularity in definitions or evaluations of boundary contracts

The paper introduces executable boundary contracts through explicit definitions: the frame fragment is specified as a bounded Boolean fragment embeddable in STL after grid projection, with the event layer adding declared interval matching, duration clauses, fragmentation clauses, and obligation restricted vector scoring. These are presented as newly defined constructs for measurement on finite traces, supported by Lean checks for the finite frame core. The reported findings consist of empirical disagreements between contract coordinates and standard scores on external corpora (Mini LibriSpeech, MAESTRO, DCASE 2024 baseline), without any fitted parameters renamed as predictions or self-referential reductions in the derivation. The central claims rest on the supplied definitions and direct application to data rather than any load-bearing self-citation chain or ansatz smuggled via prior work, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central addition is the definition of the contracts themselves; the main background assumption is the embeddability of the Boolean fragment.

axioms (1)

domain assumption: The frame fragment is a bounded Boolean fragment embeddable in STL after grid projection.
Stated directly in the abstract as the foundation for the frame layer.

invented entities (1)

executable boundary contracts (no independent evidence)
purpose: To measure timed boundary behavior in sound event traces with declared interval, duration, and fragmentation rules.
Newly introduced construct whose purpose is measurement rather than general temporal reasoning.

pith-pipeline@v0.9.0 · 5683 in / 1199 out tokens · 58743 ms · 2026-05-20T02:04:41.471343+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · tag: unclear
Paper passage: "The frame fragment is a bounded Boolean fragment embeddable in STL after grid projection... Lean checks for the finite frame core"

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery and embed_strictMono · tag: unclear
Paper passage: "obligation restricted vector scoring... matched event clauses"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 3 internal anchors

[1] K. Chen, X. Du, B. Zhu, Z. Ma, T. Berg-Kirkpatrick, and S. Dubnov. HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection. arXiv preprint arXiv:2202.00874, 2022a. S. Chen, Y. Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, and F. Wei. BEATs: Audio pre-training with acoustic tokenizers. arXiv preprint arXiv:2212.09058, 2...

[2] DCASE Task 4 2024 Organizers. DESED task 2024 baseline pre-trained model. https://zenodo.org/records/11034682,

[3] C. Deng, S. Lokegaonkar, C. Lockard, B. Fetahu, N. Zalmout, and X. Li. Byteflow: Language modeling through adaptive byte compression without a tokenizer. arXiv preprint arXiv:2603.03583,

[4] T. Gigant, B. Peng, and J. Quesnelle. Decoupling the benefits of subword tokenization for language model training via byte-level simulation. arXiv preprint arXiv:2604.27263,

[5] Y. Gong, Y.-A. Chung, and J. Glass. AST: Audio spectrogram transformer. arXiv preprint arXiv:2104.01778,

[6] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu, and R. Pang. Conformer: Convolution-augmented transformer for speech recognition. arXiv preprint arXiv:2005.08100,

[7] K. Li, Y. Song, L.-R. Dai, I. McLoughlin, X. Fang, and L. Liu. AST-SED: An effective sound event detection method based on audio spectrogram transformer. arXiv preprint arXiv:2303.03689,

[8] T. Limisiewicz, A. Pagnoni, S. Iyer, M. Lewis, S. Mehta, A. Liu, M. Li, G. Ghosh, and L. Zettlemoyer. Compute optimal tokenization. arXiv preprint arXiv:2605.01188,

[9] I. Martín-Morató, M. Harju, and A. Mesaros. Crowdsourcing strong labels for sound event detection. arXiv preprint arXiv:2107.12089,

[10] I. Martín-Morató, M. Harju, P. Ahokas, and A. Mesaros. Training sound event detection with soft labels from crowdsourced annotations. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1–5. IEEE, 2023a. doi: 10.1109/ICASSP49357.2023.10095504. I. Martín-Morató, M. Harju, and A. Mesaros. MAESTRO Real: Multi-annotator e...

[11] Accessed 2026-05-14. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5206–5210. IEEE,

[12] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le. SpecAugment: A simple data augmentation method for automatic speech recognition. arXiv preprint arXiv:1904.08779,

[13] F. Schmid, C. I. Tang, S. Parekh, V. K. Ithapu, J. A. Ortiz, G. Ferroni, Y. Qian, A. Jasonas, C. Frateanu, C. Clark, G. Widmer, and Ç. Bilen. Sound event detection with boundary-aware optimization and inference. arXiv preprint arXiv:2601.04178,

[14] K. Slagle. SpaceByte: Towards deleting tokenization from large language modeling. arXiv preprint arXiv:2404.14408,

[15] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. arXiv preprint arXiv:2211.06687,

[16] B. Xiao, B. Wang, and H. Cheng. Bypassing direct reconstruction: Speech detection from MEG via large-scale audio retrieval. arXiv preprint arXiv:2605.13099,
