pith.

archive

Every paper Pith has read. Search by title, abstract, or pith.

623 papers in eess.AS · page 8

  1. eess.AS 2025-10-07 reviewed
    Discrete tokens close the ASR-TTS loop and cut error rates

    TokenChain: A Discrete Speech Chain via Semantic Token Modeling

    Mingxuan Wang +1

  2. eess.AS 2025-10-06 reviewed
    Pruned Whisper keeps 90 percent accuracy at 48 percent smaller size

    BaldWhisper: Faster Whisper with Head Shearing and Layer Merging

    Yaya Sy +2

  3. eess.AS 2025-10-03 reviewed
    Source embedding scales music structure analysis to messy labels

    SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

    Chunbo Hao +7

  4. eess.AS 2025-10-02 reviewed
    Matrix method localizes multiple sound sources with one variable per source

    Multi-Source Position and Direction-of-Arrival Estimation Based on Euclidean Distance Matrices

    Klaus Brümann +1

  5. cs.SD 2025-10-02 reviewed
    System changes live audio effects from performer emotions

    Go witheFlow: Real-time Emotion Driven Audio Effects Modulation

    Edmund Dervakos +4

  6. cs.MM 2025-09-30 reviewed
    Twin DiT modules generate synced audio and video in one pass

    Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation

    Chetwin Low +2

  7. eess.AS 2025-09-30 reviewed
    Spoken AI models falter on timing and sync tasks

    Game-Time: Evaluating Temporal Dynamics in Spoken Language Models

    Kai-Wei Chang +9

  8. eess.AS 2025-09-29 reviewed
    Semantic tokens guide flow matching to enhance speech faithfully

    SenSE: Semantic-Aware High-Fidelity Universal Speech Enhancement

    Xingchen Li +6

  9. eess.AS 2025-09-29 reviewed
    Zero-shot cosine outperforms few-shot in OOD deepfake tracing

    Advancing Zero-Shot Open-Set Speech Deepfake Source Tracing

    Manasi Chhibber +2

  10. cs.SD 2025-09-27 reviewed
    TV dialogue dataset raises voice role-play scores 38 percent

    AudioRole: An Audio Dataset for Character Role-Playing in Large Language Models

    Wenyu Li +4

  11. cs.SD 2025-09-26 reviewed
    VLM generates music from images with no training or fine-tuning

    Zero-Effort Image-to-Music Generation: An Interpretable RAG-based VLM Approach

    Zijian Zhao +2

  12. eess.AS 2025-09-26 reviewed
    Models bias continuations toward modal voice more for female prompts

    Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias

    Shree Harsha Bokkahalli Satish +4

  13. eess.AS 2025-09-25 reviewed
    Adversarial noise tricks speech enhancers into new meanings

    Are Modern Speech Enhancement Systems Vulnerable to Adversarial Attacks?

    Rostislav Makarov +2

  14. eess.AS 2025-09-23 reviewed
    Source count fusion lifts DOA F1-scores by 14% in hearing aids

    Multi-Speaker DOA Estimation in Binaural Hearing Aids using Deep Learning and Speaker Count Fusion

    Farnaz Jazaeri +3

  15. cs.SD 2025-09-22 reviewed
    Model turns video into object-aware stereo sound

    StereoFoley: Object-Aware Stereo Audio Generation from Video

    Tornike Karchkhadze +6

  16. cs.CL 2025-09-22 reviewed
    One model hits SOTA on text, image, audio and video

    Qwen3-Omni Technical Report

    Jin Xu +37

  17. eess.AS 2025-09-22 reviewed
    Comparator loss yields speech severity scores for disease tracking

    Comparator Loss: An Ordinal Contrastive Loss to Derive a Severity Score for Speech-based Health Monitoring

    Jacob J Webber +7

  18. eess.AS 2025-09-21 reviewed
    Single model separates and locates multiple overlapping sounds

    DeepASA: An Object-Oriented Multi-Purpose Network for Auditory Scene Analysis

    Dongheon Lee +2

  19. eess.AS 2025-09-19 reviewed
    Visual cues first cluster speech features in AVSR models

    Interpreting the Role of Visemes in Audio-Visual Speech Recognition

    Aristeidis Papadopoulos +1

  20. cs.SD 2025-09-19 reviewed
    Differentiable method boosts acoustic modeling from limited data

    Differentiable Acoustic Radiance Transfer

    Sungho Lee +5

  21. cs.SD 2025-09-19 reviewed
    Speaker-aware model yields more realistic conversation timings

    From Independence to Interaction: Speaker-Aware Simulation of Multi-Speaker Conversational Timing

    Máté Gedeon +1

  22. cs.SD 2025-09-19 reviewed
    1% data addition activates simultaneous translation in audio models

    Direct Simultaneous Translation Activation for Large Audio-Language Models

    Pei Zhang +6

  23. cs.SD 2025-09-18 reviewed
    Thai encoder and data pipeline enable multitask speech AI

    Towards Building Speech Large Language Models for Multitask Understanding in Low-Resource Languages

    Mingchen Shao +6

  24. eess.AS 2025-09-17 reviewed
    TTS systems default to adult voices despite child or elderly instructions

    Do You Hear What I Mean? Quantifying the Instruction-Perception Gap in Instruction-Guided Expressive Text-To-Speech Systems

    Yi-Cheng Lin +4

  25. cs.SD 2025-09-16 reviewed
    Mixture of experts blends binaural filters for moving talkers

    Mixture-of-Experts Framework for Field-of-View Enhanced Signal-Dependent Binauralization of Moving Talkers

    Manan Mittal +6

  26. eess.AS 2025-09-10 reviewed
    Expert routing boosts noisy emotion recognition 12 percent

    Joint Learning using Mixture-of-Expert-Based Representation for Speech Enhancement and Robust Emotion Recognition

    Jing-Tong Tzeng +2

  27. cs.SD 2025-09-09 reviewed
    Audio LLM toolkit delivers up to 151 percent faster evaluations

    AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

    Hoang Nguyen +11

  28. cs.SD 2025-09-07 reviewed
    Diffusion model adds personalized events from few reference audios

    DreamAudio: Customized Text-to-Audio Generation with Diffusion Models

    Yi Yuan +7

  29. eess.AS 2025-09-04 reviewed
    Audiobook quotes with verb labels boost TTS expressivity

    Computational Narrative Understanding for Expressive Text-to-Speech

    Gaspard Michel +2

  30. eess.AS 2025-08-28 reviewed
    Shared encoder beats separate models on diarization and separation

    Unifying Diarization, Separation, and ASR with Multi-Speaker Encoder

    Muhammad Shakeel +4

  31. cs.CL 2025-08-25 reviewed
    Self-generated data aligns speech LLMs to instructions better

    Enhancing Speech Large Language Models through Reinforced Behavior Alignment

    Yansong Liu +2

  32. cs.CL 2025-08-24 reviewed
    Adapted open LLMs match commercial dementia screening tools

    Speech-Based Cognitive Screening: A Systematic Evaluation of LLM Adaptation Strategies

    Fatemeh Taherinezhad +8

  33. eess.AS 2025-08-20 reviewed
    Physics-aware kernels recover steering vectors from 10x fewer measurements

    Gaussian Process Regression of Steering Vectors With Physics-Aware Deep Composite Kernels for Augmented Listening

    Diego Di Carlo (RIKEN AIP) +6

  34. cs.CL 2025-08-17 reviewed
    Fine-tuned causal Whisper outperforms streaming baselines under 300ms

    WhisperRT -- Turning Whisper into a Causal Streaming Model

    Tomer Krichli +2

  35. eess.AS 2025-08-10 reviewed
    ASR fixes grouped into five classes with shared metrics

    Non-Intrusive Automatic Speech Recognition Refinement: A Survey

    Mohammad Reza Peyghan +5

  36. cs.SD 2025-08-05 reviewed
    One model restores and masters music from text instructions

    SonicMaster: Towards Controllable All-in-One Music Restoration and Mastering

    Jan Melechovsky +3

  37. eess.AS 2025-07-31 reviewed
    Benchmark with multi-expert captions tests detailed audio understanding

    MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

    Yadong Niu +9

  38. eess.AS 2025-07-30 reviewed
    New benchmark finds two strategies in full-duplex speech models

    Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

    Guan-Ting Lin +6

  39. cs.CL 2025-07-22 reviewed
    Step-Audio 2 sets SOTA on audio understanding and conversation

    Step-Audio 2 Technical Report

    Boyong Wu +108

  40. cs.SD 2025-07-21 reviewed
    Chaos discriminators set new SOTA for speech bandwidth extension

    CIS-BWE: Chaos-Informed Speech Bandwidth Extension

    Tarikul Islam Tamiti +3

  41. cs.CL 2025-07-17 reviewed
    Pipeline adds stress and punctuation to Russian speech data

    Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech

    Kirill Borodin +5

  42. eess.AS 2025-07-16 reviewed
    Text transcriptions boost attacks on voice anonymization

    VoxATtack: A Multimodal Attack on Voice Anonymization Systems

    Ahmad Aloradi +3

  43. cs.SD 2025-07-16 reviewed
    AI model builds 3D ocean sound maps from surface data alone

    A Multimodal Data Fusion Attention-Empowered Generative Adversarial Network for Real Time 3D Underwater Sound Speed Field Construction

    Wei Huang +6

  44. eess.AS 2025-07-12 reviewed
    Flow-matching model generates spoken dialogues faster

    ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching

    Han Zhu +13

  45. cs.SD 2025-07-10 reviewed
    Open audio model tops 20+ benchmarks with public data only

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    Arushi Goel +10

  46. cs.SD 2025-07-09 reviewed
    Coupled quadratic program cuts distortion in multichannel mixers

    Constraint Optimized Multichannel Mixer-limiter Design

    Yuancheng Luo +2

  47. eess.AS 2025-07-08 reviewed
    One-step model beats 30-step teacher in speech enhancement

    Robust One-step Speech Enhancement via Consistency Distillation

    Liang Xu +2

  48. cs.CV 2025-06-30 reviewed
    Unified model generates speech and facial motion together

    JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

    Mingi Kwon +4

  49. cs.SD 2025-06-17 reviewed
    Scattered sound classifies hair type and moisture at nearly 90 percent

    Acoustic scattering AI for non-invasive object classifications: A case study on hair assessment

    Long-Vu Hoang +2

  50. cs.SD 2025-06-16 reviewed
    Calibrated distillation lifts low-complexity speech enhancement

    Leveraging Local and Global Knowledge Integration with Time-Frequency Calibrated Distillation for Speech Enhancement

    Jiaming Cheng +8