pith. sign in

hub Canonical reference

Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis

Canonical reference. 86% of citing Pith papers cite this work as background.

19 Pith papers citing it
Background 86% of classified citations

hub tools

citation-role summary

background 6 baseline 1

citation-polarity summary

years

2026 16 2025 3

representative citing papers

Break-the-Beat! Controllable MIDI-to-Drum Audio Synthesis

cs.SD · 2026-05-14 · unverdicted · novelty 6.0

Break-the-Beat! renders drum MIDI audio that matches the timbre of a reference clip by fine-tuning a text-to-audio model with a content encoder and hybrid conditioning on a new paired dataset.

RTCFake: Speech Deepfake Detection in Real-Time Communication

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

RTCFake is the first large-scale dataset of real-time communication speech deepfakes paired with offline versions, paired with a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.

Qwen3-TTS Technical Report

cs.SD · 2026-01-22 · unverdicted · novelty 6.0

Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech

eess.AS · 2026-05-20 · unverdicted · novelty 5.0

Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.

JaiTTS: A Thai Voice Cloning Model

cs.CL · 2026-04-30 · unverdicted · novelty 5.0 · 2 refs

JaiTTS-v1.0 achieves 1.94% CER on short Thai speech, beating human ground truth of 1.98%, matches humans on long speech, and wins 283 of 400 human comparisons against commercial systems.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

cs.CL · 2025-08-25 · unverdicted · novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

cs.SD · 2026-04-09 · unverdicted · novelty 3.0

AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

citing papers explorer

Showing 19 of 19 citing papers.