LiveBand: Live Accompaniment Generation in the Audio Domain
Pith reviewed 2026-06-28 08:34 UTC · model grok-4.3
The pith
A causal transformer generates real-time music accompaniments from live audio using only past context and noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LiveBand trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder. At every timestep the generator receives only the causally available mix context and Gaussian noise and predicts accompaniment latents without any future mix frames or ground-truth target latents. Sequence-level adversarial supervision is supplied by a discriminator. Training occurs in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The matching of training and inference computations eliminates teacher forcing and exposure bias.
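The claim that a parallel training pass under causal masking matches step-by-step streaming inference can be checked on a toy single-head attention layer. The sketch below is an illustration only, not the paper's architecture: the function names, the random weights, and the choice to add noise to the mix latents are assumptions, and it ignores any feedback of generated latents.

```python
import numpy as np

def causal_attention_parallel(x, Wq, Wk, Wv):
    """Training-style pass: all timesteps at once under a causal mask."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where key j > query i
    scores = np.where(future, -np.inf, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def causal_attention_streaming(x, Wq, Wk, Wv):
    """Inference-style pass: one frame at a time with a growing key/value cache."""
    k_cache, v_cache, outputs = [], [], []
    for x_t in x:
        q_t = x_t @ Wq
        k_cache.append(x_t @ Wk)
        v_cache.append(x_t @ Wv)
        K, V = np.stack(k_cache), np.stack(v_cache)
        scores = K @ q_t / np.sqrt(q_t.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        outputs.append(weights @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d = 16, 8
mix = rng.standard_normal((T, d))    # stand-in for causally available mix latents
noise = rng.standard_normal((T, d))  # stand-in for per-step Gaussian noise
x = mix + noise                      # toy way of combining the two inputs (assumption)
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

out_parallel = causal_attention_parallel(x, Wq, Wk, Wv)
out_stream = causal_attention_streaming(x, Wq, Wk, Wv)
max_abs_diff = float(np.abs(out_parallel - out_stream).max())
```

In this toy setting the two passes agree up to floating-point error, which is the property that lets a model be trained in parallel and deployed frame by frame without a train/inference mismatch.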
What carries the argument
Causal transformer generator inside the latent space of a pre-trained causal audio autoencoder, trained with sequence-level adversarial supervision from a discriminator.
If this is right
- The generated accompaniments score higher than prior work on objective measures of audio quality, beat alignment, and mix adherence.
- Streaming generation proceeds without any lookahead into future audio frames.
- Training and inference computations are identical by design, removing exposure bias.
- The system runs in real time on consumer hardware while respecting strict causal constraints.
Where Pith is reading between the lines
- The same causal-latent approach could be tested on related tasks such as live effect generation or multi-track mixing.
- If the latent space supports coherent structure from past context alone, similar generators might handle other causal audio problems like real-time source separation.
- Practical live-performance tools could incorporate the method once the autoencoder is fixed, because no future buffering is required.
- Extending the benchmark to include longer performances would test whether the rolling attention state maintains coherence over many minutes.
Load-bearing premise
The latent space learned by the pre-trained causal audio autoencoder is rich enough for the generator to learn coherent accompaniments from only causally available mix context and noise.
What would settle it
Running the model on the multi-instrument accompaniment benchmark and finding that it does not improve on at least two of the three reported objective measures (audio quality, beat alignment, mix adherence) or that it cannot sustain real-time generation without lookahead on consumer hardware.
Original abstract
We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents LiveBand, a real-time music accompaniment generation system that operates strictly causally. It encodes input via a fixed pre-trained causal audio autoencoder, trains a causal transformer generator in that latent space using adversarial sequence-level discrimination, and produces accompaniment latents from only causally-masked mix context plus noise. Training uses a single parallel forward pass under causal masking; inference is autoregressive with rolling attention. The work claims objective improvements over prior methods on audio quality, beat alignment, and mix adherence for a multi-instrument benchmark, together with real-time streaming performance on consumer hardware without lookahead.
Significance. If the empirical results hold and the latent-space assumption is validated, the work would advance practical live-accompaniment systems by demonstrating that strict causality can be maintained while matching training and inference procedures, thereby avoiding exposure bias. The adversarial sequence-level supervision and explicit alignment of train/inference computation are concrete strengths.
major comments (2)
- [Method] Method section (description of the generator and autoencoder): the central claim that the generator produces coherent, beat-aligned, mix-adherent accompaniments from only causally available mix latents plus noise rests on the untested assumption that the fixed pre-trained causal autoencoder's continuous latent space already encodes the necessary harmonic, rhythmic, and timbral relations in a form accessible to the transformer. No ablation replacing the encoder, no analysis of latent-space musical structure preservation, and no comparison against a jointly trained encoder are reported; if the space collapses or entangles these features, the reported objective gains cannot be attributed to the generator training procedure.
- [Experiments] Experiments section (benchmark results): the abstract asserts improvements on objective measures of audio quality, beat alignment, and mix adherence, yet the provided text supplies neither the numerical values, error bars, dataset statistics, nor the precise baseline comparisons that would allow verification that the gains are statistically meaningful and not artifacts of the particular autoencoder choice.
minor comments (2)
- [Abstract] Abstract: the claim of improvement would be more informative if accompanied by at least the headline metric deltas rather than a purely qualitative statement.
- [Method] Notation: the distinction between 'causally available mix context' and the precise masking schedule used during the parallel training pass should be clarified with an equation or diagram to avoid ambiguity about what information is visible at each timestep.
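One conventional way to make such a visibility rule explicit is an additive attention mask over query timestep i and key timestep j. This is offered as an assumption about what the requested equation might look like, not as the paper's own definition; the symbols M, Q, K, V, and d are illustrative.

```latex
M_{ij} =
\begin{cases}
0 & \text{if } j \le i,\\
-\infty & \text{if } j > i,
\end{cases}
\qquad
\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d}} + M\right) V .
```

Under this form, the output at timestep i depends only on inputs at timesteps j ≤ i, which is the property a strictly causal training pass needs.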
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating planned changes to the manuscript where appropriate.
Point-by-point responses
-
Referee: [Method] Method section (description of the generator and autoencoder): the central claim that the generator produces coherent, beat-aligned, mix-adherent accompaniments from only causally available mix latents plus noise rests on the untested assumption that the fixed pre-trained causal autoencoder's continuous latent space already encodes the necessary harmonic, rhythmic, and timbral relations in a form accessible to the transformer. No ablation replacing the encoder, no analysis of latent-space musical structure preservation, and no comparison against a jointly trained encoder are reported; if the space collapses or entangles these features, the reported objective gains cannot be attributed to the generator training procedure.
Authors: We agree that the attribution of gains to the generator training procedure would be strengthened by explicit validation of the latent space. The fixed pre-trained causal autoencoder was chosen specifically to enforce strict causality and enable real-time inference without joint optimization overhead. While the original autoencoder publication reports strong reconstruction metrics on music, we did not include ablations, latent-space analyses, or joint-training comparisons in this work. In revision we will add a dedicated paragraph in Section 3.1 citing the autoencoder's reported preservation of harmonic and rhythmic features and explicitly noting the lack of encoder ablations as a limitation and avenue for future investigation.
Revision: partial
-
Referee: [Experiments] Experiments section (benchmark results): the abstract asserts improvements on objective measures of audio quality, beat alignment, and mix adherence, yet the provided text supplies neither the numerical values, error bars, dataset statistics, nor the precise baseline comparisons that would allow verification that the gains are statistically meaningful and not artifacts of the particular autoencoder choice.
Authors: The referee correctly observes that the main narrative does not quote the numerical results. These appear in Table 2 (with means and standard deviations) and Figure 3, along with dataset details in Section 4.1. We will revise Section 4.2 to inline the key metric values, explicitly reference the error bars and statistical comparisons, and add a sentence on dataset scale (number of tracks and total duration) to allow independent verification of the reported improvements.
Revision: yes
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper describes an independent adversarial training procedure for a causal transformer generator operating in the fixed latent space of an external pre-trained causal audio autoencoder. No equations, predictions, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The benchmark improvements are presented as empirical outcomes of the described training, not as tautological renamings or forced results. The approach matches training and inference computations explicitly but does not create circularity in the claimed results.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the latent space of the pre-trained causal audio autoencoder supports learning of coherent accompaniments from causally available context only.
Reference graph
Works this paper leans on
-
[1]
LiveBand: Live Accompaniment Generation in the Audio Domain
INTRODUCTION. Designing AI systems to jam, creating musical accompaniments in real time while listening to a live audio stream, is a long-standing goal at the intersection of music information retrieval, generative modelling, and human-computer interaction. Such a system would enable musicians to jam with an AI companion that responds naturally to their ...
2026
-
[2]
Early systems range from rule-based and symbolic approaches [8–14] to recent neural models operating directly on audio [2–4, 15–18]
RELATED WORK. Real-time accompaniment generation builds on prior work in both automated and interactive music generation. Early systems range from rule-based and symbolic approaches [8–14] to recent neural models operating directly on audio [2–4, 15–18]. While these latter systems show that high-quality accompaniment can be learned from acoustic con...
-
[3]
BACKGROUND. 3.1 Teacher/Student Forcing Drift. Autoregressive models are usually trained with teacher forcing: $\mathcal{L}_{\mathrm{TF}} = -\sum_t \log p_\theta(x_t \mid x^\star_{<t})$ (1), where $x^\star_{<t}$ is the ground-truth past. At inference, the same model must instead sample $\hat{x}_t \sim p_\theta(x_t \mid \hat{x}_{<t})$ (2). This mismatch is exposure bias: the model is optimized on histories from the data distribution, but depl...
-
[4]
Let $m = (m_1, m_2, \ldots)$
LIVEBAND. 4.1 Problem formulation. We consider the task of real-time accompaniment generation from a live input mix. Let $m = (m_1, m_2, \ldots)$ denote the sequence of mix latent frames and $a = (a_1, a_2, \ldots)$ the sequence of accompaniment latent frames to be generated. At streaming step $t$, the model has access to the causally available mix history $m_{\le t}$ = ($m_1$, ...
-
[5]
We form each training example by selecting one stem as the target accompaniment, randomly choosing a subset of [1,
EXPERIMENTS. 5.1 Dataset. Unless otherwise stated, all models are trained and evaluated on the official Slakh2100 train/test split [48]. We form each training example by selecting one stem as the target accompaniment, randomly choosing a subset of [1, ..., N−1] remaining stems, and summing them to create the conditioning mix [2, 5]. Audio is encoded with...
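The stem-selection recipe in the dataset excerpt can be sketched as follows. This is a minimal illustration under stated assumptions: `make_example` and its arguments are hypothetical names, the stems are taken to be time-aligned waveforms stacked in one array, and the subset size is read as being drawn uniformly from 1 to N−1, which the excerpt does not spell out.

```python
import numpy as np

def make_example(stems, rng):
    """Build one (mix, target) pair from an array of shape (N, num_samples), N >= 2."""
    n_stems = stems.shape[0]
    target_idx = int(rng.integers(n_stems))            # stem to be generated
    remaining = [i for i in range(n_stems) if i != target_idx]
    k = int(rng.integers(1, n_stems))                  # subset size in [1, N-1]
    context_idx = sorted(rng.choice(remaining, size=k, replace=False).tolist())
    mix = stems[context_idx].sum(axis=0)               # conditioning mix
    return mix, stems[target_idx], target_idx, context_idx

rng = np.random.default_rng(0)
toy_stems = rng.standard_normal((5, 1000))             # 5 synthetic stems, not real audio
mix, target, target_idx, context_idx = make_example(toy_stems, rng)
```

The target stem is never part of the conditioning mix, so the model cannot copy its answer from the input.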
-
[6]
6.1 Sink vs. No-Sink
RESULTS. We provide audio examples at https://sonycslparis.github.io/liveband-companion. 6.1 Sink vs. No-Sink. Table 2 reports the sink ablation, isolating long-form drift at the main real-time operating ...
| Model | ∆FAD_vgg | ∆FAD_clap | ∆Beat | ∆COC_full | ∆COC_harm | ∆COC_perc |
| --- | --- | --- | --- | --- | --- | --- |
| w/o sink | -0.02 | -0.06 | +0.01 | +0.29 | +0.27 | +0.34 |
| sink | -0.02 | -0.05 | +0.02 | +0.31 | +0.39 | +0.36 |
Ta...
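The excerpt does not say how the sink is implemented. The sketch below shows only the general attention-sink idea from the streaming language model literature cited in the reference list: keep the first few cached positions permanently and combine them with a sliding window of recent positions. The function name and the `window` and `num_sink` parameters are hypothetical.

```python
def visible_positions(t, window, num_sink):
    """Cached frame indices attended to at step t (0-based, including t itself)."""
    recent_start = max(0, t - window + 1)
    # Sink positions are the earliest frames, kept only if they fell out of the window.
    sinks = [i for i in range(min(num_sink, t + 1)) if i < recent_start]
    return sinks + list(range(recent_start, t + 1))

early = visible_positions(2, window=4, num_sink=2)       # window not yet full
with_sink = visible_positions(10, window=4, num_sink=2)  # sinks plus recent frames
no_sink = visible_positions(10, window=4, num_sink=0)    # plain sliding window
```

With these arguments, `with_sink` keeps frames 0 and 1 alongside frames 7 to 10, while `no_sink` keeps only frames 7 to 10, so the cache size stays bounded in both cases.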
-
[7]
The effective frame budget for τ = 0.1 s is exactly 92.88 ms, corresponding to one latent frame (4096 audio samples at 44.1 kHz)
All measurements are averaged across 128 streaming steps. The effective frame budget for τ = 0.1 s is exactly 92.88 ms, corresponding to one latent frame (4096 audio samples at 44.1 kHz). In eager mode, end-to-end generation plus decoding already remains within this real-time budget. With torch.compile, latency is substantially reduced. These measurements confirm...
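The quoted frame budget follows directly from the frame size and sample rate given in the excerpt. The short check below reproduces the arithmetic; the constant names and the `is_real_time` helper are illustrative, and the latencies passed to it are made-up values, not measurements from the paper.

```python
SAMPLE_RATE_HZ = 44_100
SAMPLES_PER_LATENT_FRAME = 4096

# Duration of one latent frame in milliseconds: the time available per streaming step.
frame_budget_ms = 1000.0 * SAMPLES_PER_LATENT_FRAME / SAMPLE_RATE_HZ

def is_real_time(step_latency_ms, budget_ms=frame_budget_ms):
    """True if one generate-and-decode step finishes within one frame duration."""
    return step_latency_ms <= budget_ms

print(f"{frame_budget_ms:.2f} ms")  # 92.88 ms
```

Any per-step latency below this figure leaves headroom; anything above it means audio is consumed faster than it is produced.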
-
[8]
By pairing a causal transformer with sequence-level adversarial supervision, we eliminate teacher forcing and the associated exposure bias
CONCLUSION. We introduced LiveBand, a real-time system for live music accompaniment that operates under strict causal and latency constraints. By pairing a causal transformer with sequence-level adversarial supervision, we eliminate teacher forcing and the associated exposure bias. This fully aligns training with streaming inference, allowing the model to ...
-
[9]
Automated accompaniment generation systems raise questions around authorship, the impact on professional musicians, and the potential for misuse in generating deceptive content
ETHICS STATEMENT. This work is intended for creative and artistic applications. Automated accompaniment generation systems raise questions around authorship, the impact on professional musicians, and the potential for misuse in generating deceptive content. We encourage the community to develop appropriate guidelines as these technologies mature
-
[10]
L. Team, A. Caillon, B. McWilliams, C. Tarakajian, I. Simon, I. Manco, J. Engel, N. Constant, Y. Li, T. I. Denk et al., “Live music models,” arXiv preprint arXiv:2508.04651, 2025
-
[11]
Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,
J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and S. Lattner, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024, pp. 272–280
2024
-
[12]
Improved diff-a-riff: Musical accompaniment co-creation via latent diffusion models,
——, “Improved diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” in NeurIPS 2024 Workshop, 2024
2024
-
[13]
Stemgen: A music generation model that listens,
J. Parker, J. Spijkervet, K. Kosta, F. Yesiler, B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and D. Le, “Stemgen: A music generation model that listens,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1116–1120
2024
-
[14]
Streaming generation for music accompaniment,
Y. Wu, M. Wang, H. Lei, S. Brade, L. Blanchard, S.-L. Wu, A. C. Courville, and C.-Z. A. Huang, “Streaming generation for music accompaniment,” arXiv preprint arXiv:2510.22105, 2025
-
[15]
Why exposure bias matters: An imitation learning perspective of error accumulation in language generation,
K. Arora, L. E. Asri, H. Bahuleyan, and J. C. K. Cheung, “Why exposure bias matters: An imitation learning perspective of error accumulation in language generation,” in Findings of the Association for Computational Linguistics: ACL 2022, 2022
2022
-
[16]
Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion
X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman, “Self forcing: Bridging the train-test gap in autoregressive video diffusion,” arXiv preprint arXiv:2506.08009, 2025
2025
-
[17]
An on-line algorithm for real-time accompaniment,
R. B. Dannenberg, “An on-line algorithm for real-time accompaniment,” Proceedings of the 1984 International Computer Music Conference, 1984
1984
-
[18]
Music transformer: Generating music with long-term structure,
C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer: Generating music with long-term structure,” in 7th International Conference on Learning Representations (ICLR), 2019
2019
-
[19]
SongDriver: Real-time music accompaniment generation without logical latency nor exposure bias,
Z. Wang, K. Zhang, Y. Wang, C. Zhang, Q. Liang, P. Yu, Y. Feng, W. Liu, Y. Wang, Y. Bao, and Y. Yang, “SongDriver: Real-time music accompaniment generation without logical latency nor exposure bias,” in Proceedings of the 30th ACM International Conference on Multimedia (MM), 2022
2022
-
[20]
Life with GenJam: Interacting with a musical IGA,
J. A. Biles, “Life with GenJam: Interacting with a musical IGA,” in Proceedings of the 1999 IEEE International Conference on Systems, Man, and Cybernetics, vol. 3. Tokyo, Japan: IEEE, 1999, pp. 652–656
1999
-
[21]
Realjam: Real-time human-ai music jamming with reinforcement learning-tuned transformers,
A. Scarlatos, Y. Wu, I. Simon, A. Roberts, T. Cooijmans, N. Jaques, C. Tarakajian, and C. A. Huang, “Realjam: Real-time human-ai music jamming with reinforcement learning-tuned transformers,” in Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA 2025, Yokohama, Japan, 26 April 2025 - 1 May 2025
2025
-
[22]
Generative adversarial post-training mitigates reward hacking in live human-AI music interaction,
Y. Wu, S. Brade, T. Ma, T.-J. Fowler, E. Yang, B. Banar, A. Courville, N. Jaques, and C.-Z. A. Huang, “Generative adversarial post-training mitigates reward hacking in live human-AI music interaction,” in The Fourteenth International Conference on Learning Representations, 2026
2026
-
[23]
Anticipatory music transformer,
J. Thickstun, D. L. W. Hall, C. Donahue, and P. Liang, “Anticipatory music transformer,” Transactions on Machine Learning Research, 2024
2024
-
[24]
Musika! fast infinite waveform music generation,
M. Pasini and J. Schlüter, “Musika! fast infinite waveform music generation,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022, pp. 568–575
2022
-
[25]
Bassnet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control,
M. Grachten, S. Lattner, and E. Deruty, “Bassnet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control,” Applied Sciences, 2020
2020
-
[26]
Bass accompaniment generation via latent diffusion,
M. Pasini, M. Grachten et al., “Bass accompaniment generation via latent diffusion,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024
2024
-
[27]
DRUMGAN: synthesis of drum sounds with timbral feature conditioning using generative adversarial networks,
J. Nistal, S. Lattner et al., “DRUMGAN: synthesis of drum sounds with timbral feature conditioning using generative adversarial networks,” in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), Oct. 2020
2020
-
[28]
T. Karchkhadze and S. Dubnov, “Towards real-time human-ai musical co-performance: Accompaniment generation with latent diffusion models and max/msp,” arXiv preprint arXiv:2604.07612, 2026
2026
-
[29]
Time-series generative adversarial networks,
J. Yoon, D. Jarrett, and M. van der Schaar, “Time-series generative adversarial networks,” in Advances in Neural Information Processing Systems 32 (NeurIPS), 2019
2019
-
[30]
Adversarial audio synthesis,
C. Donahue, J. J. McAuley et al., “Adversarial audio synthesis,” in 7th International Conference on Learning Representations (ICLR), May 2019
2019
-
[31]
GANSynth: Adversarial neural audio synthesis,
J. H. Engel, K. K. Agrawal et al., “GANSynth: Adversarial neural audio synthesis,” in 7th International Conference on Learning Representations (ICLR), May 2019
2019
-
[32]
VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding,
J. Nistal, C. Aouameur, S. Lattner, and G. Richard, “VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021
2021
-
[33]
The GAN is dead; long live the GAN! A modern GAN baseline,
N. Huang, A. Gokaslan, V. Kuleshov, and J. Tompkin, “The GAN is dead; long live the GAN! A modern GAN baseline,” in Advances in Neural Information Processing Systems 37 (NeurIPS), 2024
2024
-
[34]
Sequence level training with recurrent neural networks,
M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in 4th International Conference on Learning Representations (ICLR), 2016
2016
-
[35]
Self-Forcing++: Towards Minute-Scale High-Quality Video Generation
J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh, “Self-forcing++: Towards minute-scale high-quality video generation,” arXiv preprint arXiv:2510.02283, 2025
2025
-
[36]
Diffusion forcing: Next-token prediction meets full-sequence diffusion,
B. Chen, D. Marti Monso, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion forcing: Next-token prediction meets full-sequence diffusion,” in Advances in Neural Information Processing Systems 37 (NeurIPS), 2024
2024
-
[37]
Continuous autoregressive models with noise augmentation avoid error accumulation,
M. Pasini, J. Nistal, S. Lattner, and G. Fazekas, “Continuous autoregressive models with noise augmentation avoid error accumulation,” arXiv preprint arXiv:2411.18447, 2024
-
[38]
S. Chen, C. Wei, S. Sun, P. Nie, K. Zhou, G. Zhang, M.-H. Yang, and W. Chen, “Context forcing: Consistent autoregressive video generation with long context,” arXiv preprint arXiv:2602.06028, 2026
-
[39]
H. Zhu, M. Zhao, G. He, H. Su, C. Li, and J. Zhu, “Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation,” arXiv preprint arXiv:2602.02214, 2026
2026
-
[40]
Rolling forcing: Autoregressive long video diffusion in real time,
K. Liu, W. Hu, J. Xu, Y. Shan, and S. Lu, “Rolling forcing: Autoregressive long video diffusion in real time,” ICLR, 2026
2026
-
[41]
Efficient streaming language models with attention sinks,
G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in The Twelfth International Conference on Learning Representations (ICLR), 2024
2024
-
[42]
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” arXiv preprint arXiv:2508.10925, 2025
2025
-
[43]
Z. Novack, S. Brade, H. Kim, H. F. García, N. Shikarpur, C. Talegaonkar, S. Kim, V. K. Chen, J. McAuley, T. Berg-Kirkpatrick et al., “Live music diffusion models: Efficient fine-tuning and post-training of interactive diffusion music generators,” arXiv preprint arXiv:2605.22717, 2026
2026
-
[44]
Generative adversarial nets,
I. J. Goodfellow, J. Pouget-Abadie et al., “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Dec. 2014
2014
-
[45]
J. H. Lim and J. C. Ye, “Geometric gan,” arXiv preprint arXiv:1705.02894, 2017
2017
-
[46]
The relativistic discriminator: A key element missing from standard GAN,
A. Jolicoeur-Martineau, “The relativistic discriminator: A key element missing from standard GAN,” in 7th International Conference on Learning Representations (ICLR), 2019
2019
-
[47]
Which training methods for GANs do actually converge?
L. M. Mescheder, A. Geiger et al., “Which training methods for GANs do actually converge?” in Proceedings of the 35th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 80, Jul. 2018
2018
-
[48]
Analyzing and Improving the Image Quality of StyleGAN,
T. Karras, S. Laine et al., “Analyzing and Improving the Image Quality of StyleGAN,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020
2020
-
[49]
CoDiCodec: Unifying continuous and discrete compressed representations of audio,
M. Pasini, S. Lattner, and G. Fazekas, “CoDiCodec: Unifying continuous and discrete compressed representations of audio,” in Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR), 2025
2025
-
[50]
Arbitrary style transfer in real-time with adaptive instance normalization,
X. Huang and S. J. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in IEEE International Conference on Computer Vision (ICCV), Oct. 2017
2017
-
[51]
Query-key normalization for transformers,
A. Henry, P. R. Dachapally, S. S. Pawar, and Y. Chen, “Query-key normalization for transformers,” arXiv preprint arXiv:2010.04245, 2020
-
[52]
RoFormer: Enhanced transformer with rotary position embedding,
J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024
2024
-
[53]
GLU Variants Improve Transformer
N. Shazeer, “GLU variants improve transformer,” arXiv preprint arXiv:2002.05202, 2020
2020
-
[54]
FlexAttention: A programming model for generating optimized attention kernels
J. Dong, B. Feng, D. Guessous, Y. Liang, and H. He, “FlexAttention: A programming model for generating optimized attention kernels,” arXiv preprint arXiv:2408.00118, 2024
2024
-
[55]
A convnet for the 2020s,
Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022
2022
-
[56]
Adam: A method for stochastic optimization,
D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), May 2015
2015
-
[57]
Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,
E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 45–49
2019
-
[58]
Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms,
K. Kilgour, M. Zuluaga et al., “Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms,” in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep. 2019, Graz, Austria
2019
-
[59]
CNN architectures for large-scale audio classification,
S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. W. Wilson, “CNN architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9, 2017. IEEE, 2017, pp. 131–135
2017
-
[61]
Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,
Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5
2023
-
[62]
madmom: A New Python Audio and Music Signal Processing Library,
S. Böck, F. Korzeniowski et al., “madmom: A New Python Audio and Music Signal Processing Library,” in Proceedings of the 2016 ACM Conference on Multimedia Conference (MM), Oct. 2016, Amsterdam, The Netherlands
2016
-
[63]
Beat this! accurate beat tracking without DBN postprocessing,
F. Foscarin, J. Schlüter, and G. Widmer, “Beat this! accurate beat tracking without DBN postprocessing,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, ISMIR 2024, San Francisco, California, USA and Online, November 10-14, 2024, 2024
2024
-
[64]
Cocola: Coherence-oriented contrastive learning of musical audio representations,
R. Ciranni, G. Mariani, M. Mancusi, E. Postolache, G. Fabbro, E. Rodolà, and L. Cosmo, “Cocola: Coherence-oriented contrastive learning of musical audio representations,” pp. 1–5, 2025
2025
-
[65]
High-fidelity audio compression with improved RVQGAN,
R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Advances in Neural Information Processing Systems 36 (NeurIPS), 2023
2023