UGC-VideoCaptioner : An omni ugc video detail caption model and new benchmarks

Wu, P · 2025 · arXiv 2507.11336

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

read on arXiv browse 3 citing papers

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding

cs.CV · 2026-04-13 · unverdicted · novelty 7.0

MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.

MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection

cs.CV · 2025-11-29 · conditional · novelty 7.0

MVAD is the first comprehensive benchmark dataset for AI-generated multimodal video-audio detection, with three realistic forgery patterns, high-quality outputs from state-of-the-art models, and diversity across visual styles and content categories.

Building a Precise Video Language with Human-AI Oversight

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

citing papers explorer

Showing 3 of 3 citing papers.

Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding cs.CV · 2026-04-13 · unverdicted · none · ref 27
MTSS replaces monolithic video captions with factorized streams and relational grounding, yielding reported gains in understanding benchmarks and generation consistency.
MVAD: A Benchmark Dataset for Multimodal AI-Generated Video-Audio Detection cs.CV · 2025-11-29 · conditional · none · ref 53
MVAD is the first comprehensive benchmark dataset for AI-generated multimodal video-audio detection, with three realistic forgery patterns, high-quality outputs from state-of-the-art models, and diversity across visual styles and content categories.
Building a Precise Video Language with Human-AI Oversight cs.CV · 2026-04-22 · unverdicted · none · ref 70
CHAI framework pairs AI pre-captions with expert human critiques to produce precise video descriptions, enabling open models to outperform closed ones like Gemini-3.1-Pro and improve fine-grained control in video generation models.

UGC-VideoCaptioner : An omni ugc video detail caption model and new benchmarks

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer