Script-a-Video: Deep Structured Audio-visual Captions via Factorized Streams and Relational Grounding
Pith reviewed 2026-05-10 15:02 UTC · model grok-4.3
The pith
Replacing monolithic video captions with factorized streams linked by identity and temporal relations improves understanding and generation performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MTSS is a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. It is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. This yields an average 25% reduction in total error rate on Video-SALMONN-2, an average 67% gain on Daily-Omni, and substantial improvements in generated video quality without model changes.
What carries the argument
Multi-Stream Scene Script (MTSS) that factorizes video descriptions into Reference, Shot, Event, and Global streams and reconnects them via identity and temporal relational links.
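The stream layout can be made concrete with a small sketch. The paper does not publish a schema, so every field name below is an assumption; the point is only that each stream is a separate record, and that identity links (`entity_id` references) and temporal links (timestamps, shot order) are explicit fields rather than prose:

```python
from dataclasses import dataclass, field

# Hypothetical field names: the paper describes four streams plus identity
# and temporal links, but does not specify a concrete data format.
@dataclass
class ReferenceEntry:              # Reference stream: persistent entities
    entity_id: str                 # identity-link target, e.g. "person_1"
    description: str

@dataclass
class ShotEntry:                   # Shot stream: per-shot visual content
    shot_index: int                # temporal link: position in shot order
    visible_entities: list[str]    # identity links into the Reference stream
    description: str

@dataclass
class EventEntry:                  # Event stream: audio-visual events
    start: float                   # temporal link: seconds from video start
    end: float
    actors: list[str]              # identity links into the Reference stream
    description: str

@dataclass
class SceneScript:                 # one MTSS-style caption
    reference: list[ReferenceEntry] = field(default_factory=list)
    shots: list[ShotEntry] = field(default_factory=list)
    events: list[EventEntry] = field(default_factory=list)
    global_summary: str = ""       # Global stream

    def events_for(self, entity_id: str) -> list[EventEntry]:
        """Follow identity links: all events involving one entity."""
        return [e for e in self.events if entity_id in e.actors]
```

Under this reading, cross-shot identity consistency is a lookup along `entity_id` rather than a coreference problem buried in a paragraph.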
If this is right
- Models using MTSS captions achieve 25% lower total error rates on the Video-SALMONN-2 benchmark.
- Performance on Daily-Omni reasoning improves by 67% on average.
- The gap between smaller and larger multimodal models narrows with MTSS.
- Multi-shot video generation sees 45% better cross-shot identity consistency, 56% better audio-visual alignment, and 71% better temporal controllability.
Where Pith is reading between the lines
- Structured captions like MTSS could support incremental video editing where only one stream needs updating for local changes.
- Applying similar factorization to other tasks such as video question answering might improve model reasoning by providing disentangled information.
- MTSS may reduce the data requirements for training effective video models since the format carries less entangled noise.
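The incremental-editing reading above can be made concrete. With a factorized caption stored as separate streams, a local change touches one entry while every other stream is left byte-identical; a monolithic paragraph would need a global rewrite. A minimal sketch, using an assumed JSON-like layout (field names are illustrative, not from the paper):

```python
import copy

# Assumed JSON-like layout for a factorized caption.
caption = {
    "reference": {"person_1": "a chef in a white apron"},
    "shots": [{"shot": 0, "entities": ["person_1"]}],
    "events": [
        {"start": 0.0, "end": 3.5, "actors": ["person_1"], "text": "chops onions"},
        {"start": 3.5, "end": 6.0, "actors": [], "text": "ambient kitchen noise"},
    ],
    "global": "A cooking tutorial.",
}

def edit_event(caption: dict, index: int, new_text: str) -> dict:
    """Local edit: replace one Event entry; other streams stay untouched."""
    edited = copy.deepcopy(caption)
    edited["events"][index]["text"] = new_text
    return edited

edited = edit_event(caption, 0, "dices tomatoes")
# Reference, Shot, and Global streams are identical after the edit.
assert edited["reference"] == caption["reference"]
assert edited["shots"] == caption["shots"]
assert edited["global"] == caption["global"]
```

Whether real edits stay this local depends on the relational links: an edit that changes who acts in an event must also update its identity links, which is exactly the consistency burden Relational Grounding is meant to carry.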
Load-bearing premise
That splitting video descriptions into separate Reference, Shot, Event, and Global streams and linking them with identity and temporal connections maintains overall video coherence without losing key information or introducing new errors.
What would settle it
A controlled experiment where the same video understanding model is prompted with MTSS captions versus standard monolithic captions on Video-SALMONN-2, and the error rate does not decrease or even increases, would falsify the central performance claim.
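For concreteness, the claimed "25% reduction in total error rate" is most naturally read as a relative drop in the benchmark's error rate, which is what such a controlled comparison would measure. A toy calculation with made-up counts:

```python
def total_error_rate(errors: int, opportunities: int) -> float:
    """Total error rate as errors over scoring opportunities."""
    return errors / opportunities

# Made-up counts, purely to illustrate the arithmetic of a 25% relative drop.
baseline = total_error_rate(40, 200)   # same model, monolithic captions: 0.20
with_mtss = total_error_rate(30, 200)  # same model, MTSS captions: 0.15
relative_reduction = (baseline - with_mtss) / baseline
assert abs(relative_reduction - 0.25) < 1e-12
```

The falsification condition is then simply `relative_reduction <= 0` on the real benchmark counts.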
Original abstract
Advances in Multimodal Large Language Models (MLLMs) are transforming video captioning from a descriptive endpoint into a semantic interface for both video understanding and generation. However, the dominant paradigm still casts videos as monolithic narrative paragraphs that entangle visual, auditory, and identity information. This dense coupling not only compromises representational fidelity but also limits scalability, since even local edits can trigger global rewrites. To address this structural bottleneck, we propose Multi-Stream Scene Script (MTSS), a novel paradigm that replaces monolithic text with factorized and explicitly grounded scene descriptions. MTSS is built on two core principles: Stream Factorization, which decouples a video into complementary streams (Reference, Shot, Event, and Global), and Relational Grounding, which reconnects these isolated streams through explicit identity and temporal links to maintain holistic video consistency. Extensive experiments demonstrate that MTSS consistently enhances video understanding across various models, achieving an average reduction of 25% in the total error rate on Video-SALMONN-2 and an average performance gain of 67% on the Daily-Omni reasoning benchmark. It also narrows the performance gap between smaller and larger MLLMs, indicating a substantially more learnable caption interface. Finally, even without architectural adaptation, replacing monolithic prompts with MTSS in multi-shot video generation yields substantial human-rated improvements: a 45% boost in cross-shot identity consistency, a 56% boost in audio-visual alignment, and a 71% boost in temporal controllability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Multi-Stream Scene Script (MTSS) as a replacement for monolithic video captions. MTSS factorizes descriptions into four complementary streams (Reference, Shot, Event, Global) and reconnects them via explicit identity and temporal links (Relational Grounding) to improve representational fidelity, scalability, and learnability for MLLMs in video understanding and generation tasks. It reports average 25% error reduction on Video-SALMONN-2, 67% gain on Daily-Omni, and human-rated gains (45-71%) in generation consistency.
Significance. If the empirical claims are substantiated with proper controls, MTSS could offer a more structured caption interface that narrows model-size gaps and improves cross-shot consistency in generation. The factorization-plus-grounding design directly targets entanglement issues in current paradigms, with potential for broader adoption in video MLLM pipelines.
major comments (2)
- [Abstract] Abstract: The central performance claims (25% total error reduction on Video-SALMONN-2; 67% gain on Daily-Omni) are stated without any description of baselines, experimental controls, statistical significance testing, caption generation procedure, or evaluation protocol. These details are load-bearing for assessing whether the reported gains support the MTSS claims.
- [Abstract] Abstract / core method description: The assumption that explicit identity and temporal links in Relational Grounding fully recover all cross-stream dependencies without information loss or hallucinated relations is not tested. This is critical because the 25% and 67% gains rest on this recovery property holding for complex videos with overlapping events or ambiguous audio-visual references; no ablation or counter-example analysis is provided.
minor comments (1)
- [Abstract] Abstract: The phrase 'even without architectural adaptation' for the generation results is unclear; specify whether this means zero-shot prompt replacement or a particular inference setting.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that directly strengthen the presentation of our empirical claims and the validation of Relational Grounding.
Point-by-point responses
-
Referee: [Abstract] Abstract: The central performance claims (25% total error reduction on Video-SALMONN-2; 67% gain on Daily-Omni) are stated without any description of baselines, experimental controls, statistical significance testing, caption generation procedure, or evaluation protocol. These details are load-bearing for assessing whether the reported gains support the MTSS claims.
Authors: We agree that the abstract's brevity omits these load-bearing details. In the revised version we will insert a concise clause noting the baselines (standard monolithic captioning methods), the evaluation protocol on Video-SALMONN-2 and Daily-Omni, and the caption generation procedure. Full experimental controls, statistical significance tests, and protocol descriptions already appear in Sections 4 and 5; we will also add a brief reference to them in the abstract so readers can immediately contextualize the reported gains. revision: yes
-
Referee: [Abstract] Abstract / core method description: The assumption that explicit identity and temporal links in Relational Grounding fully recover all cross-stream dependencies without information loss or hallucinated relations is not tested. This is critical because the 25% and 67% gains rest on this recovery property holding for complex videos with overlapping events or ambiguous audio-visual references; no ablation or counter-example analysis is provided.
Authors: We acknowledge that the current manuscript does not contain a dedicated ablation isolating the recovery property of Relational Grounding or an explicit search for hallucinated relations. Our gains on complex multi-event benchmarks provide indirect evidence, yet a direct test is warranted. We will add an ablation that removes the identity and temporal links, quantify any resulting information loss or hallucination rates against ground-truth annotations, and include a short qualitative counter-example analysis in the revised manuscript. revision: yes
Circularity Check
No significant circularity; empirical claims rest on external benchmarks
full rationale
The paper introduces MTSS via Stream Factorization into Reference/Shot/Event/Global streams and Relational Grounding via identity/temporal links, but presents no equations, derivations, or fitted parameters. The central claims of a 25% error reduction and 67% benchmark gain are supported by reported experiments on Video-SALMONN-2 and Daily-Omni rather than by any self-referential construction or self-citation that would reduce the result to its inputs. The claims are grounded in external evaluation and contain no load-bearing steps that hold by definition.
Axiom & Free-Parameter Ledger
invented entities (1)
-
Multi-Stream Scene Script (MTSS)
no independent evidence
Forward citations
Cited by 1 Pith paper
-
SARA: Semantically Adaptive Relational Alignment for Video Diffusion Models
SARA improves text alignment and motion quality in video diffusion models by routing token-relation distillation supervision to semantically salient pairs using a Stage-1 aligner trained with SAM masks and InfoNCE.
Reference graph
Works this paper leans on
-
[1]
Mixture of contexts for long video generation
Shengqu Cai, Ceyuan Yang, Lvmin Zhang, Yuwei Guo, Junfei Xiao, Ziyan Yang, Yinghao Xu, Zhenheng Yang, Alan Yuille, Leonidas Guibas, et al. Mixture of contexts for long video generation. arXiv preprint arXiv:2508.21058,
-
[2]
J. Chao, Jianzhang Gao, Wenhui Tan, Yuchong Sun, Ruihua Song, and Liyun Ru. Jointavbench: A benchmark for joint audio-visual reasoning evaluation. arXiv preprint arXiv:2512.12772,
-
[3]
Hong Chen, Xin Wang, Guanning Zeng, Yipeng Zhang, Yuwei Zhou, Feilin Han, Yaofei Wu, and Wenwu Zhu. Videodreamer: Customized multi-subject text-to-video generation with disen-mix finetuning. arXiv preprint arXiv:2311.00990,
-
[4]
Hong Chen, Xin Wang, Yipeng Zhang, Yuwei Zhou, Zeyang Zhang, Siao Tang, and Wenwu Zhu. Disenstudio: Customized multi-subject text-to-video generation with disentangled spatial control. arXiv preprint arXiv:2405.12796,
-
[5]
Multi-subject open-set personalization in video generation
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Yuwei Fang, Kwot Sin Lee, Ivan Skorokhodov, Kfir Aberman, Jun-Yan Zhu, Ming-Hsuan Yang, and Sergey Tulyakov. Multi-subject open-set personalization in video generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 6099–6110, 2025.
-
[6]
Vc4vg: Optimizing video captions for text-to-video generation
Yang Du, Zhuoran Lin, Kaiqiang Song, Biao Wang, Zhicheng Zheng, Tiezheng Ge, Bo Zheng, and Qin Jin. Vc4vg: Optimizing video captions for text-to-video generation. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1124–1138, 2025.
-
[7]
Zhengcong Fei, Debang Li, Di Qiu, Changqian Yu, and Mingyuan Fan. Ingredients: Blending custom photos with video diffusion transformers. arXiv preprint arXiv:2501.01790, 2025.
Zhengcong Fei, Di Qiu, Jiahua Wang, Yikun Dou, Guibin Chen, Yang Li, Yahui Zhou, et al. Skyreels-a2: Compose anything in video diffusion transformers. arXiv preprint arXiv:2504.02436,
-
[8]
Taming text-to-sounding video generation via advanced modality condition and interaction, 2025
Kaisi Guan, Xihua Wang, Zhengfeng Lai, Xin Cheng, Peng Zhang, XiaoJiang Liu, Ruihua Song, and Meng Cao. Taming text-to-sounding video generation via advanced modality condition and interaction. arXiv preprint arXiv:2510.03117,
-
[9]
Long context tuning for video generation
Yuwei Guo, Ceyuan Yang, Ziyan Yang, Zhibei Ma, Zhijie Lin, Zhenheng Yang, Dahua Lin, and Lu Jiang. Long context tuning for video generation. arXiv preprint arXiv:2503.10589,
-
[10]
LTX-2: Efficient Joint Audio-Visual Foundation Model
Yoav HaCohen, Benny Brazowski, Nisan Chiprut, Yaki Bitterman, Andrew Kvochko, Avishai Berkowitz, Daniel Shalem, Daphna Lifschitz, Dudu Moshe, Eitan Porat, et al. Ltx-2: Efficient joint audio-visual foundation model, 2026. URL https://arxiv.org/abs/2601.03233.
-
[11]
Worldsense: Evaluating real-world omnimodal understanding for multimodal llms
Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326,
-
[12]
Teng Hu, Zhentao Yu, Zhengguang Zhou, Sen Liang, Yuan Zhou, Qin Lin, and Qinglin Lu. Hunyuancustom: A multimodal-driven architecture for customized video generation. arXiv preprint arXiv:2505.04512, 2025.
-
[13]
Chi-Pin Huang, Yen-Siang Wu, et al. Videomage: Multi-subject and motion customization of text-to-video diffusion models. arXiv preprint arXiv:2503.21781, 2025.
-
[14]
Omnivideobench: Towards audio-visual understanding evaluation for omni mllms, 2025
Caorui Li, Yu Chen, Yiyan Ji, Jin Xu, Zhenyu Cui, Shihao Li, Yuanxing Zhang, et al. Omnivideobench: Towards audio-visual understanding evaluation for omni mllms. arXiv preprint arXiv:2510.10689, 2025.
-
[15]
Yunxin Li, Xinyu Chen, Shenyuan Jiang, Haoyuan Shi, Zhenyu Liu, Xuanyu Zhang, Nanhao Deng, Zhenran Xu, Yicheng Ma, Meishan Zhang, et al. Uni-moe-2.0-omni: Scaling language-centric omnimodal large model with advanced moe, training and data. arXiv preprint arXiv:2511.12609, 2025.
-
[16]
Lijie Liu, Tianxiang Ma, Bingchuan Li, Zhuowei Chen, Jiawei Liu, Qian He, and Xinglong Wu. Phantom: Subject-consistent video generation via cross-modal alignment. arXiv preprint arXiv:2502.11079, 2025.
Zuyan Liu, Yuhao Dong, Jiahui Liu, Xiaoxi Hu, Yongxin Lu, et al. Ola: Pushing the frontiers of omni-modal language model. arXiv preprint arXiv:2502.04328,
-
[17]
Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception,
Ziyang Ma, Ruiyang Xu, Zheng Xing, Yunfei Chu, Yuxuan Wang, Jinzheng He, et al. Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception. arXiv preprint arXiv:2510.12720,
-
[18]
Yihao Meng, Hao Ouyang, Yue Yu, Qiuyu Wang, Wen Wang, Ka Leong Cheng, Hanlin Wang, Yixuan Li, Cheng Chen, Yanhong Zeng, et al. Holocine: Holistic generation of cinematic multi-shot long video narratives. arXiv preprint arXiv:2510.20822,
-
[19]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193,
-
[20]
Robust Speech Recognition via Large-Scale Weak Supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022. https://arxiv.org/abs/2212.04356.
-
[21]
Guangzhi Sun, Changli Tang, Wenyi Zhang, Yixuan Li, Wei Li, Zejun Ma, and Chao Zhang. Video-salmonn: Speech-enhanced audio-visual large language models. arXiv preprint arXiv:2406.15704,
-
[22]
Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, and Chao Zhang. video-salmonn 2: Caption-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220, 2025.
-
[23]
Qwen Team. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765,
-
[24]
Yuanpeng Tu, Hao Luo, Xi Chen, Sihui Ji, Xiang Bai, and Hengshuang Zhao. Videoanydoor: High-fidelity video object insertion with precise motion control. arXiv preprint arXiv:2501.01427,
-
[25]
Duomin Wang, Wei Zuo, Aojie Li, Ling-Hao Chen, Xinyao Liao, Deyu Zhou, Zixin Yin, Xili Dai, Daxin Jiang, and Gang Yu. Universe-1: Unified audio-video generation via stitching of experts. arXiv preprint arXiv:2509.06155, 2025.
-
[26]
Kai Wang, Shijian Deng, Jing Shi, Dimitrios Hatzinakos, and Yapeng Tian. Av-dit: Efficient audio-visual diffusion transformer for joint audio and video generation. arXiv preprint arXiv:2406.07686, 2024.
-
[27]
Peiran Wu, Yunze Liu, Zhengdong Zhu, Enmin Zhou, and Shawn Shen. Ugc-videocaptioner: An omni ugc video detail caption model and new benchmarks. arXiv preprint arXiv:2507.11336, 2025.
-
[28]
Zhenghao Xing, Xiaowei Hu, Chi-Wing Fu, Wenhai Wang, Jifeng Dai, and Pheng-Ann Heng. Echoink-r1: Exploring audio-visual reasoning in multimodal llms via reinforcement learning. arXiv preprint arXiv:2505.04623,
-
[29]
Wan Xu, Feng Zhu, Yihan Zeng, Yuanfan Guo, Ming Liu, Hang Xu, and Wangmeng Zuo. Glave-cap: Global-local aligned video captioning with vision expert integration. arXiv preprint arXiv:2509.11360,
-
[30]
Humanomniv2: From understanding to omni-modal reasoning with context, 2025
Qize Yang, Shimin Yao, Weixuan Chen, Shenghao Fu, Detao Bai, Jiaxin Zhao, et al. Humanomniv2: From understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277,
-
[31]
SVAgent: Storyline-Guided Long Video Understanding via Cross-Modal Multi-Agent Collaboration
Zhongyu Yang, Zuhao Yang, Shuo Zhan, Tan Yue, Wei Pang, and Yingfang Yuan. Svagent: Storyline-guided long video understanding via cross-modal multi-agent collaboration, 2026. https://arxiv.org/pdf/2604.05079.
-
[32]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Zhibin Yu, Tianyu Li, et al. Minicpm-o 2.6: A multi-modal end-side model.arXiv preprint arXiv:2408.01800,
-
[33]
Identity-preserving text-to-video generation by frequency decomposition
Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyuan Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan. Identity-preserving text-to-video generation by frequency decomposition. arXiv preprint arXiv:2411.17440,
-
[34]
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858,
-
[35]
Shouldershot: Generating over-the-shoulder dialogue videos
Yuang Zhang, Junqi Cheng, Haoyu Zhao, Jiaxi Gu, Fangyuan Zou, Zenghui Lu, and Peng Shu. Shouldershot: Generating over-the-shoulder dialogue videos. arXiv preprint arXiv:2508.07597,
-
[36]
R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning
Jiaxin Zhao, Xihan Wei, and Liefeng Bo. R1-omni: Explainable omni-multimodal emotion recognition with reinforcement learning. arXiv preprint arXiv:2503.05379,
-
[37]
Hao Zhong, Muzhi Zhu, Zongze Du, Zheng Huang, Canyu Zhao, Mingyu Liu, Wen Wang, Hao Chen, and Chunhua Shen. Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration. arXiv preprint arXiv:2505.20256,
-
[38]
Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities, 2025
Ziwei Zhou, Rui Wang, and Zuxuan Wu. Daily-omni: Towards audio-visual reasoning with temporal alignment across modalities. arXiv preprint arXiv:2505.17862, 2025
discussion (0)