Multimodal latent language modeling with next-token diffusion
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2
polarities
background 2
representative citing papers
Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
citing papers explorer
- RobustSpeechFlow: Learning Robust Text-to-Speech Trajectories via Augmentation-based Contrastive Flow Matching
  RobustSpeechFlow improves TTS alignment robustness by extending contrastive flow matching with length-preserving repeat and skip latent augmentations, lowering WER from 1.44 to 1.38 on Seed-TTS-eval and CER on ZERO500.
- Leveraging Latent Visual Reasoning in Silence
  Latent visual reasoning improves multimodal models via training effects even without using latent tokens at inference, enabled by an attention-based RL reward that promotes interaction with text tokens.
- SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
  SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
- LLaMo: Scaling Pretrained Language Models for Unified Motion Understanding and Generation with Continuous Autoregressive Tokens
  LLaMo scales pretrained LLMs for unified motion-language tasks by encoding motion into continuous causal latents and adding a flow-matching head for real-time autoregressive generation and captioning.
- Imagine while Reasoning in Space: Multimodal Visualization-of-Thought
  MVoT lets multimodal models create coherent images during chain-of-thought reasoning via a token discrepancy loss, yielding competitive or better results than text-only CoT on dynamic spatial tasks.
- Fast Text-to-Audio Generation with One-Step Sampling via Energy-Scoring and Auxiliary Contextual Representation Distillation
  A one-step text-to-audio model using energy-distance training and contextual distillation outperforms prior fast baselines on AudioCaps and achieves up to 8.5x faster inference than the multi-step IMPACT system with competitive quality.
- FeynmanBench: Benchmarking Multimodal LLMs on Diagrammatic Physics Reasoning
- WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation