Low-Bitrate Video Compression through Semantic-Conditioned Diffusion
Traditional video codecs optimized for pixel fidelity collapse at ultra-low bitrates and produce severe artifacts. This failure arises from a fundamental misalignment between pixel accuracy and human perception. We propose a semantic video compression framework named DiSCo that transmits only the most meaningful information while relying on generative priors for detail synthesis. The source video is decomposed into three compact modalities: a textual description, a spatiotemporally degraded video, and optional sketches or poses that respectively capture semantic, appearance, and motion cues. A conditional video diffusion model then reconstructs high-quality, temporally coherent videos from these compact representations. Temporal forward filling, token interleaving, and modality-specific codecs are proposed to improve multimodal generation and modality compactness. Experiments show that our method outperforms baseline semantic and traditional codecs by 2-10X on perceptual metrics at low bitrates.
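The abstract names a "spatiotemporally degraded video" and "temporal forward filling" without defining either. As a rough illustration only, the sketch below takes one plausible reading: degradation as keeping every few frames and pixels, and forward filling as repeating each kept frame until the next kept one so the conditioning clip regains its original length. The function names, the stride parameters, and the list-based frame layout are all hypothetical and not taken from the paper.

```python
from typing import List

Frame = List[List[int]]  # one grayscale frame as rows of pixel values (assumed layout)


def degrade(frames: List[Frame], t_stride: int, s_stride: int) -> List[Frame]:
    """Keep every t_stride-th frame and every s_stride-th pixel along each axis."""
    return [
        [row[::s_stride] for row in frame[::s_stride]]
        for frame in frames[::t_stride]
    ]


def forward_fill(kept: List[Frame], t_stride: int, length: int) -> List[Frame]:
    """Repeat each kept frame until the next kept one, restoring `length` frames."""
    return [kept[i // t_stride] for i in range(length)]


# Toy clip: 5 frames of 4x4 pixels, where every pixel equals its frame index.
video = [[[t] * 4 for _ in range(4)] for t in range(5)]

# Assumed strides: drop every other frame and every other pixel.
low = degrade(video, t_stride=2, s_stride=2)

# Rebuild a 5-frame conditioning clip from the 3 kept low-resolution frames.
filled = forward_fill(low, t_stride=2, length=len(video))
```

Under these assumptions only three 2x2 frames would need to be coded and transmitted, while the decoder-side conditioning clip still has one entry per original frame. How DiSCo actually degrades and fills the video may differ.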
Forward citations
Cited by 1 Pith paper
- NeuralLVC: Neural Lossless Video Compression via Masked Diffusion with Temporal Conditioning
NeuralLVC achieves better lossless compression than H.264 and H.265 on video sequences by combining masked diffusion with temporal conditioning on frame differences.