UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge · 2022 · arXiv 2204.02152

16 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method: 1

citation-polarity summary

use method: 1

representative citing papers

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show that no model excels across all dimensions and that compositional editing is especially difficult.

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.

Hierarchical Codec Diffusion for Video-to-Speech Generation

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.

UniVocal: Unified Speech-Singing Code-Switching Synthesis

cs.SD · 2026-06-01 · unverdicted · novelty 6.0

UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.

Rethinking Training Targets, Architectures and Data Quality for Universal Speech Enhancement

cs.SD · 2026-03-03 · unverdicted · novelty 6.0

Replacing early-reflected speech with time-shifted anechoic clean speech as the training target, combined with a two-stage distortion-perception framework, yields state-of-the-art universal speech enhancement.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

Meta Audiobox Aesthetics: Unified Automatic Quality Assessment for Speech, Music, and Sound

cs.SD · 2025-02-07 · unverdicted · novelty 6.0

Unified no-reference models assess audio aesthetics across speech, music, and sound via four perceptual axes and achieve performance comparable or superior to human mean opinion scores.

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via a multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.

Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?

cs.SD · 2026-04-30 · unverdicted · novelty 5.0

Random phoneme substitutions recover most ASR gains from synthetic accented speech, with targeted edits and ground-truth prosody providing only marginal additional benefits.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours of multilingual data.

OLIVE: View-Augmented Latent Prediction with Waveform Reconstruction for Speech SSL

cs.CL · 2026-06-29 · unverdicted · novelty 4.0

OLIVE is a new self-supervised speech representation framework that unifies view-augmented masked latent prediction with waveform reconstruction under one objective.

Natural Yet Challenging to Detect: Robust In-the-Wild TTS through EMA and Dual-Scoring Prompt Selection -- Submission for WildSpoof 2026 TTS Track

eess.AS · 2026-05-22 · unverdicted · novelty 3.0

F5-TTS-DPS integrates EMA and dual-scoring prompt selection into F5-TTS to produce in-the-wild TTS that achieves the best a-DCF scores (0.1582, 0.5233, 0.2562) on three SASV systems in the WildSpoof challenge.

MMControl: Unified Multi-Modal Control for Joint Audio-Video Generation

cs.CV · 2026-04-21

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

eess.AS · 2026-03-18 · 2 refs

citing papers explorer

SyncDPO: Enhancing Temporal Synchronization in Video-Audio Joint Generation via Preference Learning

cs.CV · 2026-05-12 · unverdicted · none · ref 36

SyncDPO improves temporal synchronization in video-audio joint generation using DPO with efficient on-the-fly negative sample construction and curriculum learning.
