archive

Every paper Pith has read.

623 papers in eess.AS · page 10

  1. cs.MM 2024-11-26 reviewed
    Both global and shared position IDs align video text and speech

    Mechanisms of Multimodal Synchronization: Insights from Decoder-Based Video-Text-to-Speech Synthesis

    Akshita Gupta +5

  2. cs.SD 2024-11-24 reviewed
    Image diffusion models transfer music styles without training

    Repurposing Image Diffusion Models for Training-Free Music Style Transfer on Mel-spectrograms

    Heehwan Wang +6

  3. cs.SD 2024-11-19 reviewed
    DGSNA generates dynamic scene-based noise via prompts and diffusion models to augment…

    DGSNA: Dynamic Generative Scene-based Noise Addition method

    Zihao Chen +4

  4. cs.SD 2024-11-06 reviewed
    Pooling speech datasets improves quality model generalization

    MOS-Bench: Benchmarking Generalization Abilities of Subjective Speech Quality Assessment Models

    Wen-Chin Huang +2

  5. cs.SD 2024-11-05 reviewed
    Slide text cues extract target speaker from mixed audio

    pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues

    Ziyang Jiang +6

  6. cs.CL 2024-10-25 reviewed
    GPT-4o responds to audio inputs in 232 milliseconds

    GPT-4o System Card

    OpenAI: Aaron Hurst +415

  7. eess.AS 2024-10-24 reviewed
    Top audio models score only 53 percent on expert reasoning benchmark

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    S Sakshi +8

  8. cs.SD 2024-10-23 reviewed
    Equivariant transformer beats prototype on chord accompaniment

    Music102: An $D_{12}$-equivariant transformer for chord progression accompaniment

    Weiliang Luo

  9. cs.CL 2024-10-22 reviewed
    VoiceBench tests LLM voice assistants on varied real-world speech

    VoiceBench: Benchmarking LLM-Based Voice Assistants

    Yiming Chen +5

  10. eess.AS 2024-10-09 reviewed
    Text padding plus ConvNeXt yields 0.15 RTF zero-shot TTS

    F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

    Yushen Chen +7

  11. eess.AS 2024-10-04 reviewed
    Dataset supplies first mixed cardiopulmonary sounds from manikin

    Manikin-Recorded Cardiopulmonary Sounds Dataset Using Digital Stethoscope

    Yasaman Torabi +2

  12. cs.SD 2024-09-27 reviewed
    Two-stage method improves emotion and speaker match in zero-shot TTS

    Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS

    Haoyu Wang +8

  13. eess.AS 2024-09-17 reviewed
    Moshi delivers real-time full-duplex speech at 160 ms latency

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez +7

  14. cs.SD 2024-08-30 reviewed
    KAN-enhanced AASIST more than halves deepfake detection error

    AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

    Kirill Borodin +6

  15. cs.SD 2024-08-14 reviewed
    Tuned MFCC parameters lift respiratory detection accuracy by up to 19.6%

    Optimising MFCC parameters for the automatic detection of respiratory diseases

    Yuyang Yan +5

  16. eess.AS 2024-07-15 reviewed
    Audio model outperforms Gemini on voice instruction tasks

    Qwen2-Audio Technical Report

    Yunfei Chu +11

  17. cs.SD 2024-07-07 reviewed
    Supervised tokens improve zero-shot TTS cloning

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Zhihao Du +11

  18. cs.SD 2024-06-20 reviewed
    Discrete tokens lag continuous features on audio tasks

    DASB - Discrete Audio and Speech Benchmark

    Pooneh Mousavi +7

  19. eess.AS 2024-06-04 reviewed
    TTS model matches human speech in similarity and naturalness

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Philip Anastassiou +45

  20. cs.SD 2024-06-03 reviewed
    Self-supervised transformer learns rare animal calls from unlabeled audio

    animal2vec and MeerKAT: A self-supervised transformer for rare-event raw audio input and a large-scale reference dataset for bioacoustics

    Julian C. Schäfer-Zimmermann +11

  21. eess.SP 2024-05-10 reviewed
    Lightweight net detects heart murmurs on phones with 80% accuracy

    FunnelNet: An End-to-End Deep Learning Framework to Monitor Digital Heart Murmur in Real-Time

    Md Jobayer +6

  22. cs.SD 2024-03-30 reviewed
    5-second clips classify pediatric heart sounds at 93.69% accuracy

    Classification of Short Segment Pediatric Heart Sounds Based on a Transformer-Based Convolutional Neural Network

    Md Hassanuzzaman +5

  23. cs.HC 2024-03-25 reviewed
    Humans detect AI media at coin-toss accuracy

    As Good As A Coin Toss: Human detection of AI-generated images, videos, audio, and audiovisual stimuli

    Di Cooke +3

  24. cs.SD 2024-02-12 reviewed
    HuBERT detects COVID-19 from voice at 86% accuracy

    Developing a Multi-variate Prediction Model For COVID-19 From Crowd-sourced Respiratory Voice Data

    Yuyang Yan +3

  25. cs.CY 2024-01-24 reviewed
    Community input required for AI reviewing police stops

    Community-Informed AI Models for Police Accountability

    Benjamin A.T. Graham +15

  26. cs.SD 2024-01-17 reviewed
    Multi-language dataset of 175 TTS voices boosts deepfake detector training

    MLAAD: The Multi-Language Audio Anti-Spoofing Dataset

    Nicolas M. Müller +8

  27. cs.SD 2023-12-28 reviewed
    Knowledge transfer reconstructs missing audio to improve sentiment analysis

    Multimodal Sentiment Analysis with Missing Modality: A Knowledge-Transfer Approach

    Weide Liu +1

  28. eess.AS 2023-11-14 reviewed
    One audio model covers 30+ tasks without fine-tuning

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu +7

  29. cs.SD 2023-10-20 reviewed
    Model lets LLMs hear speech, sounds and music directly

    SALMONN: Towards Generic Hearing Abilities for Large Language Models

    Changli Tang +8

  30. cs.SD 2023-09-22 reviewed
    Deepfake audio augments speech-to-text training data

    Deepfake audio as a data augmentation technique for training automatic speech to text transcription models

    Alexandre R. Ferreira +1

  31. cs.CL 2023-06-22 reviewed
    Fused text-speech model beats prior translation systems

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Paul K. Rubenstein +29

  32. cs.CL 2023-06-05 reviewed
    Video-LLaMA adds Q-formers so LLMs grasp video sights and sounds

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang +2

  33. cs.CL 2023-05-02 reviewed
    Nets trained on single words start concatenating them

    Basic syntax from speech: Spontaneous concatenation in unsupervised deep neural networks

    Gašper Beguš +2

  34. cs.SD 2023-01-26 reviewed
    MusicLM turns text into minutes of consistent 24 kHz music

    MusicLM: Generating Music From Text

    Andrea Agostinelli +12

  35. cs.CL 2023-01-05 reviewed
    Discrete audio code model enables zero-shot TTS from 3s prompt

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Chengyi Wang +12

  36. eess.AS 2022-12-06 reviewed
    Scale to 680k hours enables zero-shot speech recognition

    Robust Speech Recognition via Large-Scale Weak Supervision

    Alec Radford +5

  37. cs.SD 2022-11-22 reviewed
    Two-stage filter cleans noisy labels for speaker verification

    Robust Training for Speaker Verification against Noisy Labels

    Zhihua Fang +4

  38. eess.AS 2022-10-24 reviewed
    Neural codec beats baselines at real-time high-fidelity audio compression

    High Fidelity Neural Audio Compression

    Alexandre Défossez +3

  39. cs.SD 2021-10-12 reviewed
    Augmented ConvNet classifies COVID coughs at 87 percent AUC

    COVID-19 Diagnosis from Cough Acoustics using ConvNets and Data Augmentation

    Saranga Kingkor Mahanta +4

  40. cs.LG 2021-07-30 reviewed
    One architecture handles any input and any output structure at linear cost

    Perceiver IO: A General Architecture for Structured Inputs & Outputs

    Andrew Jaegle +14

  41. eess.AS 2021-06-29 reviewed
    Five fixed channels unify monaural and binaural auditory model

    Towards a generalized monaural and binaural auditory model for psychoacoustics and speech intelligibility

    Thomas Biberger +1

  42. cs.CL 2021-05-26 reviewed
    Multitask model lowers Anglicism errors in German ASR by 3%

    Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition

    Julia Pritzen +3

  43. cs.SD 2021-05-03 reviewed
    Neural net reaches SOTA on 20-instrument task with MFCCs only

    Deep Neural Network for Musical Instrument Recognition using MFCCs

    Saranga Kingkor Mahanta +2

  44. eess.AS 2020-12-07 reviewed
    Multilingual dataset supplies 50,000 hours of speech audio

    MLS: A Large-Scale Multilingual Dataset for Speech Research

    Vineel Pratap +4

  45. eess.AS 2020-09-21 reviewed
    Diffusion model matches WaveNet audio quality but runs far faster

    DiffWave: A Versatile Diffusion Model for Audio Synthesis

    Zhifeng Kong +4

  46. eess.AS 2020-04-30 reviewed
    Jukebox generates coherent multi-minute songs with vocals in raw audio

    Jukebox: A Generative Model for Music

    Prafulla Dhariwal +5

  47. cs.SD 2019-12-24 reviewed
    Neural net predicts giant panda mating success from calls

    Audio-based automatic mating success prediction of giant pandas

    Weiran Yan +6

  48. eess.AS 2019-07-27 reviewed
    Residual filtering removes differential prediction from any voice converter

    Generalization of Spectrum Differential based Direct Waveform Modification for Voice Conversion

    Wen-Chin Huang +8

  49. cs.CL 2019-07-26 reviewed
    Many speech papers misuse the term 'phoneme'

    On the Use/Misuse of the Term 'Phoneme'

    Roger K. Moore +1

  50. eess.AS 2019-07-26 reviewed
    Skip connections plus correlation penalty cut speech errors in noise

    Correlation Distance Skip Connection Denoising Autoencoder (CDSK-DAE) for Speech Feature Enhancement

    Alzahra Badi +3