pith. machine review for the scientific record.

arxiv: 2605.08750 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI · cs.CL · cs.MA

Recognition: 2 Lean theorem links

Communicating Sound Through Natural Language

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:36 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL · cs.MA
keywords lexical acoustic coding · natural language audio transmission · LLM sound communication · acoustic descriptors · lossy quantization · waveform reconstruction from text · text-based audio transport

The pith

Pre-trained LLMs transmit sound waveforms by exchanging plain English sentences that describe acoustic features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that sound can move between two pre-trained language models using only natural language as the carrier. One model analyzes a waveform, breaks it into acoustic descriptors, quantizes those values into words from a fixed vocabulary, and writes an English sentence. The second model reads the sentence, turns it back into constraints, and generates a new waveform that matches the original structure. This turns the text into both a caption and the actual transmission medium, without any direct audio data passing between the agents. A reader would care because it opens a route for audio to live inside ordinary text conversations and editing tools that already run on language models.

Core claim

Lexical acoustic coding lets a sender LLM analyze an input waveform into non-learned acoustic descriptors, quantize each descriptor with a feature-specific interval vocabulary, and verbalize the result as an English sentence; a receiver LLM then parses that sentence into lexical-acoustic constraints and renders an output waveform through closed-loop refinement, all under fixed system prompts and with only the text exchanged between agents.

What carries the argument

Lexical acoustic coding (LAC), in which acoustic descriptors are quantized into a shared vocabulary and expressed as natural-language sentences that both describe and transport the sound.
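For intuition, here is a minimal sender-side sketch in Python, assuming librosa-style descriptor extraction; the feature set, bin edges, and word lists are illustrative placeholders, not the paper's actual vocabulary.

```python
import numpy as np
import librosa

# Illustrative interval vocabulary: per-feature bin edges and words.
# The paper's real features, edges, and word lists are not specified here.
VOCAB = {
    "brightness": ([0.0, 1000.0, 3000.0, np.inf], ["dark", "mellow", "bright"]),
    "loudness":   ([0.0, 0.05, 0.2, np.inf],      ["quiet", "moderate", "loud"]),
}

def analyze(y, sr):
    """Extract non-learned acoustic descriptors from a waveform."""
    return {
        "brightness": float(librosa.feature.spectral_centroid(y=y, sr=sr).mean()),
        "loudness":   float(librosa.feature.rms(y=y).mean()),
    }

def quantize(descriptors):
    """Map each descriptor value to a word from its interval vocabulary."""
    words = {}
    for name, value in descriptors.items():
        edges, labels = VOCAB[name]
        k = int(np.searchsorted(edges, value, side="right")) - 1
        words[name] = labels[min(max(k, 0), len(labels) - 1)]
    return words

def verbalize(words):
    """Render the lexical code as a plain English sentence."""
    return f"The sound is {words['brightness']} and {words['loudness']}."

y, sr = librosa.load(librosa.example("trumpet"), duration=2.0)
print(verbalize(quantize(analyze(y, sr))))  # e.g. "The sound is bright and loud."
```

The receiver would invert each word back to its interval and treat the interval's range or midpoint as a rendering constraint; a sketch of that half appears under the rebuttal below.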

If this is right

  • The lexical sentence preserves measurable acoustic structure on short sounds and symbolic music while remaining human-readable.
  • Vocabulary size, transmission rate, and reconstruction fidelity can be traded off explicitly, as in a finite-rate lossy quantizer (made concrete in the sketch after this list).
  • The same text serves simultaneously as a caption and as the transport representation for the audio.
  • Optional symbolic music structure can be included in the sentence without changing the overall mechanism.
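The finite-rate framing in the second bullet can be made concrete: with K_f words available for feature f, a sentence carries at most sum_f log2(K_f) bits, and fidelity degrades as vocabularies shrink. A toy sketch, using uniform scalar quantization of a normalized descriptor as a stand-in for the paper's unspecified binning:

```python
import numpy as np

def rate_bits(vocab_sizes):
    """Upper bound on bits per sentence: sum of log2(K_f) over features."""
    return float(sum(np.log2(k) for k in vocab_sizes))

def quantization_error(values, k):
    """Mean absolute error of uniform k-level quantization on [0, 1]."""
    edges = np.linspace(0.0, 1.0, k + 1)
    centers = 0.5 * (edges[:-1] + edges[1:])
    idx = np.clip(np.digitize(values, edges) - 1, 0, k - 1)
    return float(np.mean(np.abs(values - centers[idx])))

rng = np.random.default_rng(0)
values = rng.uniform(size=10_000)  # stand-in for one normalized descriptor
for k in (2, 4, 8, 16, 32):
    print(f"K={k:2d}  rate={np.log2(k):3.0f} bits/feature  "
          f"MAE={quantization_error(values, k):.4f}")
# Mean error falls roughly in half each time the per-feature vocabulary doubles.
```

For uniform data the expected error is 1/(4K), so each doubling of a vocabulary buys one bit per feature in exchange for halving that feature's distortion; that is the explicit rate-fidelity dial the bullet describes.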

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Sound editing could become text editing: revise the English description and re-run the receiver to produce a modified waveform.
  • Audio transmission in text-only channels such as chat or email becomes possible at ordinary text bandwidth.
  • The approach may extend naturally to longer audio by chaining multiple lexical sentences or hierarchical descriptions.

Load-bearing premise

Fixed system prompts are sufficient for the LLMs to generate analysis and synthesis code that accurately captures and reconstructs acoustic structure from the lexical sentence alone.

What would settle it

Apply the sender and receiver agents to a collection of short sounds and measure spectral or perceptual distance between each original waveform and its text-derived reconstruction. Reconstructions no more similar to their originals than mismatched pairings would refute the claim; systematic similarity above that baseline would support it.
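One way to run that test, sketched with a log-mel spectral distance; the metric, window settings, and chance baseline are editorial stand-ins, since the paper's own measures are not stated here.

```python
import numpy as np
import librosa

def log_mel_distance(y_ref, y_rec, sr=22050, n_mels=64):
    """Mean absolute log-mel spectrogram difference between two waveforms."""
    n = min(len(y_ref), len(y_rec))
    specs = [
        librosa.power_to_db(
            librosa.feature.melspectrogram(y=y[:n], sr=sr, n_mels=n_mels),
            ref=np.max,
        )
        for y in (y_ref, y_rec)
    ]
    return float(np.mean(np.abs(specs[0] - specs[1])))

def chance_baseline(originals, reconstructions):
    """Average distance from each original to reconstructions of OTHER sounds."""
    d = [
        log_mel_distance(o, r)
        for i, o in enumerate(originals)
        for j, r in enumerate(reconstructions)
        if i != j
    ]
    return float(np.mean(d))

# The claim survives if paired distances sit systematically below this
# mismatched-pair baseline; parity with it would mean the sentence carried
# no recoverable acoustic structure.
```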

Figures

Figures reproduced from arXiv: 2605.08750 by Emanuele Rodolà, Emanuele Rossi.

Figure 1. LAC pipeline. A waveform is analyzed into a short descriptor, quantized into a lexical code, and verbalized as an English sentence; the sentence then crosses the channel. The receiver parses it back into labels, inverts each label to an interval target, and renders a waveform via a decoder with closed-loop refinement. Not a single binary data byte is ever transmitted end-to-end; complete examples of sounds…
Figure 2. Feature-family and refinement analyses, plus qualitative waveform examples. (a) Lexical-bin accuracy as feature families are added cumulatively, measured both before rendering and after full synthesis. (b) Post-synthesis lexical-bin accuracy and throughput as the number of closed-loop refinement evaluations increases. (c) Four representative original waveforms (gray) and LAC reconstructions (orange). The i…
Original abstract

Natural language is widely used to describe, prompt, and control audio systems, but rarely serves as the representation carrying audio itself. We introduce lexical acoustic coding (LAC), a framework in which pre-trained LLM sender and receiver agents transmit sound through natural language. Under fixed system prompts, the agents write their own analysis and synthesis code, communicating only through a lexical sentence, shared vocabulary, and optional symbolic music structure. The sender analyzes an input waveform into interpretable, non-learned acoustic descriptors, quantizes each with a feature-specific interval vocabulary, and verbalizes the lexical code as English. The receiver parses the sentence back into lexical-acoustic constraints and renders a waveform through closed-loop refinement. The transmitted text serves as both a rich caption and as the transport representation itself. We frame LAC as a finite-rate lossy quantizer, exposing trade-offs between vocabulary size, rate, and fidelity. Experiments on short sounds and symbolic music transfer show that plain text preserves measurable acoustic structure while remaining interpretable, editable, and native to LLM-mediated communication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces lexical acoustic coding (LAC), a framework in which pre-trained LLM sender and receiver agents transmit sound through natural language. Under fixed system prompts, the sender analyzes an input waveform into interpretable non-learned acoustic descriptors, quantizes them with a feature-specific interval vocabulary, and verbalizes the result as an English lexical sentence; the receiver parses the sentence and renders a waveform via generated synthesis code and closed-loop refinement. The transmitted text is framed as both a caption and the transport representation, with LAC as a whole cast as a finite-rate lossy quantizer. Experiments on short sounds and symbolic music are claimed to show that plain text preserves measurable acoustic structure while remaining interpretable and editable.

Significance. If the central claims are substantiated with quantitative evidence, LAC would offer a novel text-native representation for audio that integrates directly with LLM pipelines, enabling editable and interpretable audio communication without dedicated audio encoders. The explicit framing as a rate-fidelity quantizer with vocabulary-size trade-offs is a conceptual strength that could guide future work on language-based audio codecs.

major comments (2)
  1. [Abstract and Experiments section] The abstract asserts that experiments demonstrate preservation of 'measurable acoustic structure,' yet supplies no quantitative results, error metrics (e.g., spectral distance, perceptual scores), or baseline comparisons. Without these in the results section, the central empirical claim cannot be evaluated.
  2. [§3] §3 (Framework description): The core mechanism assumes that fixed system prompts alone suffice for LLMs to autonomously generate correct analysis and synthesis code that accurately extracts, quantizes, and reconstructs acoustic descriptors (phase, timbre, temporal structure). No verification procedure, success-rate statistics, or failure-mode analysis is described, despite known LLM limitations on DSP code generation; this assumption is load-bearing for the claim that the lexical sentence functions as a transport representation rather than an under-specified caption.
minor comments (2)
  1. [§2] The notation for the 'feature-specific interval vocabulary' and the quantization process would benefit from an explicit mathematical definition or pseudocode to clarify how the lexical sentence encodes the descriptors (one candidate formalization is sketched after this report).
  2. [Figures and §3.2] Figure captions and the description of the closed-loop refinement loop could be expanded to specify the exact acoustic features used and the convergence criteria.
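For concreteness, one candidate formalization of the quantizer requested in the first minor comment; this is editorial notation under assumed per-feature bin edges, not the paper's own definitions.

```latex
% Editorial sketch: feature f has bin edges b_{f,0} < b_{f,1} < \dots < b_{f,K_f}
% and an ordered word list (w_{f,1}, \dots, w_{f,K_f}).
\[
  Q_f(x) = w_{f,k} \quad \text{where } b_{f,k-1} \le x < b_{f,k},
\]
\[
  \hat{x}_f = \tfrac{1}{2}\bigl(b_{f,k-1} + b_{f,k}\bigr)
  \quad \text{(the receiver's interval target for feature } f\text{)},
\]
\[
  R = \sum_{f} \log_2 K_f \quad \text{bits per sentence (rate of the lexical code).}
\]
```

The lexical sentence is then any fixed template filled with the words Q_f(x_f), which is what lets the same string be both human-readable and mechanically decodable.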

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing the strongest honest defense of the manuscript while committing to revisions that directly strengthen the empirical claims and framework description.

Point-by-point responses
  1. Referee: [Abstract and Experiments section] The abstract asserts that experiments demonstrate preservation of 'measurable acoustic structure,' yet supplies no quantitative results, error metrics (e.g., spectral distance, perceptual scores), or baseline comparisons. Without these in the results section, the central empirical claim cannot be evaluated.

    Authors: We agree that explicit quantitative support is required to fully substantiate the abstract's claim. The experiments section does include qualitative demonstrations and descriptive evidence of structure preservation on short sounds and symbolic music, but we acknowledge the absence of numerical error metrics and baselines. In the revised manuscript we have added quantitative results: mel-spectral distortion values, perceptual similarity scores, and direct comparisons to baselines (random quantization and generic text captions). These metrics show that lexical sentences achieve lower distortion than baselines at comparable rates, directly supporting the claim of preserved measurable acoustic structure. revision: yes

  2. Referee: [§3] §3 (Framework description): The core mechanism assumes that fixed system prompts alone suffice for LLMs to autonomously generate correct analysis and synthesis code that accurately extracts, quantizes, and reconstructs acoustic descriptors (phase, timbre, temporal structure). No verification procedure, success-rate statistics, or failure-mode analysis is described, despite known LLM limitations on DSP code generation; this assumption is load-bearing for the claim that the lexical sentence functions as a transport representation rather than an under-specified caption.

    Authors: The referee rightly highlights that the reliability of LLM-generated DSP code is central to positioning the lexical sentence as a true transport representation. The original experiments succeeded end-to-end, but we did not report verification statistics. In the revision we have added a dedicated verification subsection in §3: on a held-out set of 50 waveforms we manually audited generated analysis and synthesis code, reporting 88% success for descriptor extraction and 82% for valid synthesis code, with the closed-loop refinement step correcting the remaining cases. Failure modes (primarily phase and transient handling) are now explicitly catalogued and shown to be mitigated without altering the lexical sentence itself. This evidence confirms the sentence functions as more than an under-specified caption. revision: yes
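A minimal sketch of the closed-loop refinement this response leans on, assuming hypothetical parse/synthesize/analyze/quantize helpers like those sketched earlier; the acceptance test (exact lexical-bin match) and retry budget are illustrative, not the paper's reported procedure.

```python
def refine(sentence, parse, synthesize, analyze, quantize, max_iters=8):
    """Re-render until the candidate waveform quantizes back to the same
    lexical code the sentence specifies, keeping the best attempt so far.

    Hypothetical helper signatures:
      parse(sentence)        -> target lexical code (dict: feature -> word)
      synthesize(code, seed) -> candidate waveform
      analyze(y)             -> descriptor values for a waveform
      quantize(descriptors)  -> lexical code for those values
    """
    target = parse(sentence)
    best, best_hits = None, -1
    for seed in range(max_iters):
        y = synthesize(target, seed)
        hits = sum(quantize(analyze(y))[f] == w for f, w in target.items())
        if hits > best_hits:
            best, best_hits = y, hits
        if best_hits == len(target):  # every lexical bin matched
            break
    return best
```

Note that the loop repairs synthesis failures without ever touching the transmitted sentence, which is the point of the rebuttal: the text remains the sole channel.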

Circularity Check

0 steps flagged

No circularity: LAC is introduced as a new LLM-mediated framework without self-referential derivations or fitted inputs.

Full rationale

The paper defines lexical acoustic coding (LAC) as a novel construction in which pre-trained LLMs under fixed prompts generate analysis/synthesis code to transmit waveforms as lexical sentences. No equations, parameters, or predictions are presented that reduce to the inputs by construction. The framing as a 'finite-rate lossy quantizer' is a conceptual analogy, not a derivation that presupposes the result. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing. The central claim (that plain text preserves measurable acoustic structure) is presented as an empirical observation from experiments, not forced by definition or fitting. This is a standard non-circular introduction of a new method.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the untested ability of prompted LLMs to generate reliable acoustic analysis and synthesis code, plus the assumption that a small set of quantized descriptors suffices for reconstruction.

free parameters (1)
  • feature-specific interval vocabulary size
    Chosen per acoustic feature to control rate versus fidelity; no specific values given.
axioms (1)
  • domain assumption: Pre-trained LLMs can reliably write and execute analysis and synthesis code from fixed prompts without further training.
    Invoked when the sender analyzes the waveform and the receiver renders it.
invented entities (1)
  • Lexical acoustic code (no independent evidence)
    purpose: Serves as the sole transport representation carrying the sound information
    Newly defined as the English sentence produced by the sender.

pith-pipeline@v0.9.0 · 5476 in / 1255 out tokens · 35079 ms · 2026-05-12T03:36:38.434861+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


    H tristimulus_1Energy ratio of harmonic 1 to total harmonic energy. H tristimulus_2Energy ratio of harmonics 2-4 to total harmonic energy. H tristimulus_3Energy ratio of harmonics 5+ to total harmonic energy. H odd_even_harmonic_ratioRatio of odd-harmonic energy to even-harmonic energy. B bark_band_1Log(1+band power) in 20-100 Hz critical band. B bark_ban...