pith. sign in

arxiv: 2607.01667 · v1 · pith:VBOLZ6LWnew · submitted 2026-07-02 · 💻 cs.CV

Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

Pith reviewed 2026-07-03 16:35 UTC · model grok-4.3

classification 💻 cs.CV
keywords audiovisual video captioningtemporal alignmentcross-modal alignmentObserver-Checker-Correctortraining data refinementmultimodal large language modelsvideo understanding benchmark
0
0 comments X

The pith

TCA-Captioner uses an Observer-Checker-Corrector process to generate training data that binds auditory events to visual entities and respects causal timing in video captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that multimodal models commonly produce captions where sounds float free of visible objects and where event order breaks down. It introduces the Observer-Checker-Corrector framework as an iterative loop that creates carefully grounded training examples from dense recordings of human interactions, then trains a captioner on that data. A sympathetic reader would care because accurate binding of audio to visuals and correct sequencing matter for any system that must describe conversations, actions with sound, or cause-and-effect sequences. The work also supplies TCA-Bench and a decoupled evaluation protocol that measures these two skills in isolation.

Core claim

TCA-Captioner is a framework engineered to enhance temporal and cross-modal alignment for audiovisual video captioning. The Observer-Checker-Corrector (OCC) framework is an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. TCA-Bench applies a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning.

What carries the argument

The Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity training data from a high-density human interaction dataset.

If this is right

  • Captions produced by the trained model bind auditory events to the correct visual entities.
  • The model captures complex causal dynamics rather than listing events out of sequence.
  • TCA-Bench isolates and quantifies separate failures in audiovisual binding versus temporal reasoning.
  • The same refinement loop can be reused to produce additional grounded audiovisual training sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on videos outside human-interaction scenes to check whether the dataset choice drives the gains.
  • The iterative correction loop might transfer to other multimodal generation tasks such as video question answering.
  • If the generated data truly resolves detachment, the same captions should improve downstream tasks that rely on synchronized audiovisual features.

Load-bearing premise

The assumption that the Observer-Checker-Corrector framework, when applied to a curated high-density human interaction dataset, produces training data that actually resolves modality detachment and temporal incoherence.

What would settle it

Measure audiovisual binding accuracy and temporal relational reasoning scores of TCA-Captioner against prior MLLMs on the TCA-Bench decoupled protocol; a clear gap in favor of TCA-Captioner would support the claim, while comparable or worse scores would falsify it.

Figures

Figures reproduced from arXiv: 2607.01667 by Chen Zhao, Hongyu Li, Jiajun Ma, Jian Yang, Qilong Huang, Tiehan Fan, Xiaoming Wei, Ying Tai, Zhuoliang Kang.

Figure 1
Figure 1. Figure 1: The challenges for audiovisual video captioning. Existing state-of-the-art mod￾els, including Qwen3-Omni, Gemini-3-Pro, and AVoCaDO, still struggle with two fundamental challenges: Audio-Visual Cross-Modal Binding and Audio-Visual Temporal Coherence. In the provided examples, text highlighted in red indicates errors in Cross-Modal Binding. Text marked with purple underlines identifies failures in Temporal … view at source ↗
Figure 2
Figure 2. Figure 2: The proposed Observer-Checker-Corrector (OCC) framework, an iterative data synthesis pipeline designed to generate high-quality audiovisual video captions. The framework consists of three collaborative modules: a Global-Perceptive Observer for comprehensive understanding, a Local-Perceptive Checker for detecting unimodal fac￾tual inaccuracies and cross-modal misalignments, and an Iterative Refinement Corre… view at source ↗
Figure 3
Figure 3. Figure 3: TCA-Bench data curation and annotation pipeline. Stage I: source videos un￾dergo parallel visual (YOLOv8 [55]) and auditory (Silero VAD [54]) analysis; timelines are fused under a joint co-occurrence constraint to extract segments. Stage II: each clip receives three ground-truth layers via LLM draft → human validation. validation paradigm: a multimodal LLM first generates an initial dense caption, which hu… view at source ↗
Figure 4
Figure 4. Figure 4: TCA-Bench ground-truth annotation showcase. Top: a dense caption with inline highlights delineating five evaluation sub-dimensions grouped into two modal￾ity tracks (Visual: Subject & Action, Scene & Atmosphere, Cinematography; Audio: Transcription Accuracy, Tone & Emotion). Bottom-left: audio-visual binding pairs mapping each acoustic event to its source—foreground character, off-screen character, or envi… view at source ↗
Figure 5
Figure 5. Figure 5: The performance analysis for OCC framework. Since the generation of LLM drafts in our benchmark construction utilized the OCC framework, a direct compar￾ison within the full TCA-Bench would introduce inherent bias. To conduct a more rigorous and equitable ablation study of the OCC framework’s con￾tributions, we constructed an addi￾tional mini-bench comprising 20 sam￾ples. These samples utilized Gemini￾2.5-… view at source ↗
read the original abstract

While Multimodal Large Language Models (MLLMs) have advanced video understanding, achieving precise temporal and cross-modal alignment in audiovisual video captioning remains a formidable challenge. Most existing approaches suffer from modality detachment and temporal incoherence, failing to accurately bind auditory events to visual entities or capture complex causal dynamics. To address these deficiencies, we propose TCA-Captioner, a framework specifically engineered to enhance Temporal and Cross-Modal Alignment for audiovisual video captioning. We first introduce the Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. Furthermore, we present TCA-Bench, a diagnostic benchmark utilizing a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning. Extensive experiments demonstrate that TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes TCA-Captioner, a framework for audiovisual video captioning that introduces the Observer-Checker-Corrector (OCC) iterative refinement strategy to generate high-fidelity training data from a curated high-density human interaction dataset. It further presents TCA-Bench, a diagnostic benchmark with a Decoupled Evaluation Protocol to assess audiovisual binding and temporal relational reasoning. The central claim is that extensive experiments show TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives, addressing modality detachment and temporal incoherence in multimodal large language models.

Significance. If the experimental claims hold, the work could advance multimodal video understanding by targeting key limitations in temporal and cross-modal alignment. The OCC data-generation approach and TCA-Bench diagnostic protocol represent potentially useful methodological contributions. However, the significance cannot be assessed because the manuscript supplies no quantitative results, ablations, baselines, or error analysis to support the performance assertions.

major comments (1)
  1. Abstract: the claim that 'extensive experiments demonstrate that TCA-Captioner sets a new standard' is presented without any supporting data, ablation studies, comparison tables, error bars, or quantitative metrics, rendering the central empirical claim unverifiable from the manuscript text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for identifying the lack of supporting evidence for the empirical claims. We address the major comment below and commit to revisions that will make the experimental support verifiable.

read point-by-point responses
  1. Referee: Abstract: the claim that 'extensive experiments demonstrate that TCA-Captioner sets a new standard' is presented without any supporting data, ablation studies, comparison tables, error bars, or quantitative metrics, rendering the central empirical claim unverifiable from the manuscript text.

    Authors: We agree that the abstract asserts performance improvements without the quantitative backing being present in the manuscript text. The current version does not supply the required results, ablations, baselines, or metrics. In the revised manuscript we will add the full experimental section with comparison tables, ablation studies, quantitative metrics, and error analysis. The abstract will be updated to reflect only what the new results support. revision: yes

Circularity Check

0 steps flagged

No circularity detected; no derivation chain present

full rationale

The abstract and query provide no equations, derivations, fitted parameters, or self-citations. Claims rest on introducing the OCC framework and TCA-Bench without any mathematical reduction or load-bearing self-reference visible. Full manuscript details are noted as unavailable, precluding any inspection of potential circular steps. This is the expected non-finding when no technical derivation exists to analyze.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework description remains at the level of named components without mathematical or empirical grounding.

pith-pipeline@v0.9.1-grok · 5722 in / 1094 out tokens · 25544 ms · 2026-07-03T16:35:11.723896+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

97 extracted references · 43 canonical work pages · 20 internal anchors

  1. [1]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

    Abdar, M., Kollati, M., Kuraparthi, S., Pourpanah, F., McDuff, D., Ghavamzadeh, M., Yan, S., Mohamed, A., Khosravi, A., Cambria, E., et al.: A review of deep learning for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

  2. [2]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  3. [3]

    Calibrated Harmonic Overlaid Implicit Neural Representations for Multi-Dimensional Data

    Chen, H., Zhang, X., Sun, X., Mingqing, X.: Calibrated harmonic overlaid implicit neural representations for multi-dimensional data. arXiv preprint arXiv:2606.26763 (2026)

  4. [4]

    IEEE Transactions on Image Processing 35, 5982–5994 (2026)

    Chen, L., Wang, J., Pan, Z., Zhu, B., Yang, X., Zhang, C.: Detail++: Training-free detail enhancer for t2i diffusion models. IEEE Transactions on Image Processing 35, 5982–5994 (2026)

  5. [5]

    ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

    Chen, L., You, T., Liu, H., Bao, Z., Jiao, J., Han, X., Ou, Z., Sun, T., Mou, X., Jin, X., et al.: Echo: Efficient chest x-ray report generation with one-step block diffusion. arXiv preprint arXiv:2604.09450 (2026)

  6. [6]

    Advances in Neural Information Processing Systems37, 19472–19495 (2024)

    Chen, L., Wei, X., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Tang, Z., Yuan, L., et al.: Sharegpt4video: Improving video understanding and genera- tion with better captions. Advances in Neural Information Processing Systems37, 19472–19495 (2024)

  7. [7]

    arXiv preprint arXiv:2510.10395 (2025)

    Chen, X., Ding, Y., Lin, W., Hua, J., Yao, L., Shi, Y., Li, B., Zhang, Y., Liu, Q., Wan, P., et al.: Avocado: An audiovisual video captioner driven by temporal orchestration. arXiv preprint arXiv:2510.10395 (2025)

  8. [8]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  9. [9]

    arXiv preprint arXiv:2308.00303 (2023)

    Chen, Z., Gao, R., Xiang, T.Z., Lin, F.: Diffusion model for camouflaged object detection. arXiv preprint arXiv:2308.00303 (2023)

  10. [10]

    Chen, Z., Li, Y., Wang, H., Chen, Z., Jiang, Z., Li, J., Wang, Q., Yang, J., Tai, Y.: Ragd:Regional-awarediffusionmodelfortext-to-imagegeneration.In:Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19331–19341 (2025)

  11. [11]

    L2P: Unlocking Latent Potential for Pixel Generation

    Chen, Z., Zhu, J., Chen, X., Zhang, J., Chen, J., Zeng, Z., Zhang, W., Wang, C., Yang, J., Tai, Y.: L2p: Unlocking latent potential for pixel generation. arXiv preprint arXiv:2605.12013 (2026)

  12. [12]

    arXiv preprint arXiv:2511.18822 (2025)

    Chen, Z., Zhu, J., Chen, X., Zhang, J., Hu, X., Zhao, H., Wang, C., Yang, J., Tai, Y.: Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822 (2025)

  13. [13]

    Qwen2-Audio Technical Report

    Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024)

  14. [14]

    Company, T.D.: Tiktok-10m: A large-scale short video dataset for video un- derstanding (2025),https://huggingface.co/datasets/The- data- company/ TikTok-10M, accessed: June 30, 2026

  15. [15]

    In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

    Dong, C., Zhao, C., Cai, W., Yang, B., Guo, Y.: O-mamba: O-shape state-space model for underwater image enhancement. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 168–182. Springer (2025) Temporal and Cross-Modal Alignment for AVC 17

  16. [16]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Du, S., Zou, Y., Li, J., Liu, M., Li, Y., Shang, C., Shen, Q.: Pansharpening for thin- cloud contaminated remote sensing images: a unified framework and benchmark dataset. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 3696–3704 (2026)

  17. [17]

    Inter- national Journal of Computer Vision134, 152 (2026)

    Du, S., Zou, Y., Wang, Z., Li, X., Li, Y., Shang, C., Shen, Q.: Unsupervised hy- perspectral image super-resolution via self-supervised modality decoupling. Inter- national Journal of Computer Vision134, 152 (2026)

  18. [18]

    In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

    Du, Y., Lin, Z., Song, K., Wang, B., Zheng, Z., Ge, T., Zheng, B., Jin, Q.: Vc4vg: Optimizing video captions for text-to-video generation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 1124–1138 (2025)

  19. [19]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Fan, T., Nan, K., Xie, R., Zhou, P., Yang, Z., Fu, C., Li, X., Yang, J., Tai, Y.: Instancecap:Improvingtext-to-videogenerationviainstance-awarestructuredcap- tion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28974–28983 (2025)

  20. [20]

    Farré, M., Marafioti, A., Tunstall, L., Von Werra, L., Wolf, T.: Finevideo.https: //huggingface.co/datasets/HuggingFaceFV/finevideo(2024), accessed: June 30, 2026

  21. [21]

    arXiv preprint arXiv:2408.05211 (2024)

    Fu, C., Lin, H., Long, Z., Shen, Y., Dai, Y., Zhao, M., Zhang, Y.F., Dong, S., Li, Y., Wang, X., et al.: Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211 (2024)

  22. [22]

    arXiv preprint arXiv:2507.20939 (2025)

    Ge, Y., Ge, Y., Li, C., Wang, T., Pu, J., Li, Y., Qiu, L., Ma, J., Duan, L., Zuo, X., et al.: Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts. arXiv preprint arXiv:2507.20939 (2025)

  23. [23]

    Gemini-3-Pro Team: Gemini-3-Pro.https://gemini.google.com, accessed: June 30, 2026

  24. [24]

    arXiv preprint arXiv:2602.08682 (2026)

    Guo, Y., Gan, Q., Zhang, Y., Liu, J., Hu, Y., Xie, P., Qian, D., Zhang, Y., Li, R., Zhang, Y., et al.: Alive: Animate your world with lifelike audio-video generation. arXiv preprint arXiv:2602.08682 (2026)

  25. [25]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Hu,X.,Tai,Y.,Zhao,X.,Zhao,C.,Zhang,Z.,Li,J.,Zhong,B.,Yang,J.:Exploiting multimodal spatial-temporal patterns for video object tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 3581–3589 (2025)

  26. [26]

    Huang, G., Mao, J., Huang, F., Liu, F., Luo, X., Liang, Y., Lu, J., Wang, X., Liu, P., Fu, R., Huang, R., Huang, S.L.: Exposure bias can alleviate itself via directional and frequency rectification in flow matching (2026)

  27. [27]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  28. [28]

    Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

    Jiang, T., Wu, L., Xia, S., Li, S., Yan, Z., Yang, H., Qiao, Y., Wang, Y.: Imagine before you predict: Interleaved latent visual reasoning for video event prediction. arXiv preprint arXiv:2606.05769 (2026)

  29. [29]

    arXiv preprint arXiv:2511.20272 (2025)

    Jiang, T., Xia, S., Xu, Y., Wu, L., Zeng, X., Wang, L., Qiao, Y., Wang, Y.: Vknowu: Evaluating visual knowledge understanding in multimodal llms. arXiv preprint arXiv:2511.20272 (2025)

  30. [30]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Kim, Y., Abdelrahman, A.S., Abdel-Aty, M.: Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 761–771 (2025)

  31. [31]

    In: International conference on machine learning

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 18 C. Zhao et al

  32. [32]

    In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Li, X., Du, S., Zou, Y., Xu, H., Jiang, Z., Liu, J.: Unifusion: A unified image fusion framework with robust representation and source-aware preservation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33869–33880 (2026)

  33. [33]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Lian, L., Ding, Y., Ge, Y., Liu, S., Mao, H., Li, B., Pavone, M., Liu, M.Y., Darrell, T., Yala, A., et al.: Describe anything: Detailed localized image and video cap- tioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21766–21777 (2025)

  34. [34]

    In: Proceedings of the 2024 conference on empirical methods in natural language processing

    Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971– 5984 (2024)

  35. [35]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: Vila: On pre- training for visual language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26689–26699 (2024)

  36. [36]

    In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

    Lin, K., Li, L., Lin, C.C., Ahmed, F., Gan, Z., Liu, Z., Lu, Y., Wang, L.: Swinbert: End-to-end transformers with sparse attention for video captioning. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17949–17958 (2022)

  37. [37]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

  38. [38]

    DriveVA: Video Action Models are Zero-Shot Drivers

    Liu, M., Zhang, D., Liu, J., Cui, J., Xie, H., Chen, G., Ye, H., Yang, M.Y., Nex, F., Cheng, H.: Driveva: Video action models are zero-shot drivers. arXiv preprint arXiv:2604.04198 (2026)

  39. [39]

    Liu, Y., Li, S., Liu, Y., Wang, Y., Ren, S., Li, L., Chen, S., Sun, X., Hou, L.: Tem- pcompass: Do video llms really understand videos? In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 8731–8772 (2024)

  40. [40]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., Kembhavi, A.: Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26439–26455 (2024)

  41. [41]

    Lu, S., Lian, Z., Zhou, Z., Zhang, S., Zhao, C., Kong, A.W.K.: Does flux al- ready know how to perform physically plausible image composition? arXiv preprint arXiv:2509.21278 (2025)

  42. [42]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Lu, S., Liu, Y., Kong, A.W.K.: Tf-icon: Diffusion-based training-free cross-domain image composition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2294–2305 (2023)

  43. [43]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Lu, S., Wang, Z., Li, L., Liu, Y., Kong, A.W.K.: Mace: Mass concept erasure in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6430–6440 (2024)

  44. [44]

    arXiv preprint arXiv:2410.18775 (2024)

    Lu, S., Zhou, Z., Lu, J., Zhu, Y., Kong, A.W.K.: Robust watermarking using gener- ative priors against image editing: From benchmarking to advances. arXiv preprint arXiv:2410.18775 (2024)

  45. [45]

    arXiv preprint arXiv:2510.12720 (2025)

    Ma, Z., Xu, R., Xing, Z., Chu, Y., Wang, Y., He, J., Xu, J., Heng, P.A., Yu, K., Lin, J., et al.: Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception. arXiv preprint arXiv:2510.12720 (2025)

  46. [46]

    In: International Conference on Learning Representations

    Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In: International Conference on Learning Representations. vol. 2025, pp. 1045–1064 (2025) Temporal and Cross-Modal Alignment for AVC 19

  47. [47]

    See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

    Nguyen, L.T.P., Yu, Z., Hang, S.L.Y., An, S., Lee, J., Ban, Y., Chung, S., Nguyen, T.H., Maeng, J., Lee, S., et al.: See, hear, and understand: Benchmarking audio- visual human speech understanding in multimodal large language models. arXiv preprint arXiv:2512.02231 (2025)

  48. [48]

    ACM Computing Surveys57(6), 1–36 (2025)

    Qasim, I., Horsch, A., Prasad, D.: Dense video captioning: A survey of techniques, datasets and evaluation protocols. ACM Computing Surveys57(6), 1–36 (2025)

  49. [49]

    In: Companion Proceedings of the ACM on Web Conference 2025

    Shang, Y., Gao, C., Li, N., Li, Y.: A large-scale dataset with behavior, attributes, and content of mobile short-video platform. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 793–796 (2025)

  50. [50]

    In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp

    Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: Pandagpt: One model to instruction-follow them all. In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp. 11–23 (2023)

  51. [51]

    arXiv preprint arXiv:2506.15220 (2025)

    Tang, C., Li, Y., Yang, Y., Zhuang, J., Sun, G., Li, W., Ma, Z., Zhang, C.: video- salmonn 2: Caption-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220 (2025)

  52. [52]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models

    Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z., Zhang, C.: Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289 (2023)

  53. [53]

    arXiv preprint arXiv:2510.22200 (2025)

    Team, M.L., Cai, X., Huang, Q., Kang, Z., Li, H., Liang, S., Ma, L., Ren, S., Wei, X., Xie, R., et al.: Longcat-video technical report. arXiv preprint arXiv:2510.22200 (2025)

  54. [54]

    Team, S.: Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier (2024)

  55. [55]

    In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS)

    Varghese, R., Sambath, M.: Yolov8: A novel object detection algorithm with en- hanced performance and robustness. In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS). pp. 1–6. IEEE (2024)

  56. [56]

    arXiv preprint arXiv:2407.00634 (2024)

    Wang, J., Yuan, L., Zhang, Y., Sun, H.: Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634 (2024)

  57. [57]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  58. [58]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

  59. [59]

    In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Wang, X., Hua, J., Lin, W., Zhang, Y., Zhang, F., Wu, J., Zhang, D., Nie, L.: Haic: Improving human action understanding and generation with better captions for multi-modal large language models. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 10158–10181 (2025)

  60. [60]

    arXiv preprint arXiv:2507.11336 (2025)

    Wu, P., Liu, Y., Zhu, Z., Zhou, E., Shen, J.: Ugc-videocaptioner: An omni ugc video detail caption model and new benchmarks. arXiv preprint arXiv:2507.11336 (2025)

  61. [61]

    arXiv preprint arXiv:2404.01717 (2024)

    Xie, R., Zhao, C., Zhang, K., Zhang, Z., Zhou, J., Yang, J., Tai, Y.: Addsr: Acceler- ating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717 (2024)

  62. [62]

    Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., Zhang, B., Wang, X., Chu, Y., Lin, J.: Qwen2.5-omni technical report (2025)

  63. [63]

    Qwen3-Omni Technical Report

    Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025) 20 C. Zhao et al

  64. [64]

    arXiv preprint arXiv:2506.21277 (2025)

    Yang, Q., Yao, S., Chen, W., Fu, S., Bai, D., Zhao, J., Sun, B., Yin, B., Wei, X., Zhou, J.: Humanomniv2: From understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277 (2025)

  65. [65]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024)

  66. [66]

    arXiv preprint arXiv:2501.07888 (2025)

    Yuan, L., Wang, J., Sun, H., Zhang, Y., Lin, Y.: Tarsier2: Advancing large vision- language models from detailed video description to comprehensive video under- standing. arXiv preprint arXiv:2501.07888 (2025)

  67. [67]

    In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

    Zhan, J., Dai, J., Ye, J., Zhou, Y., Zhang, D., Liu, Z., Zhang, X., Yuan, R., Zhang, G., Li, L., et al.: Anygpt: Unified multimodal llm with discrete sequence modeling. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 9637–9662 (2024)

  68. [68]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

  69. [69]

    In: Findings of the Association for Computational Linguistics: EMNLP 2023

    Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., Qiu, X.: Speechgpt: Empowering large language models with intrinsic cross-modal conversational abil- ities. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 15757–15773 (2023)

  70. [70]

    In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)

    Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)

  71. [71]

    In: European Conference on Computer Vision

    Zhang, L., Meng, W., Zhong, Y., Kong, B., Xu, M., Du, J., Wang, X., Wang, R., Liu, L.: U-cope: Taking a further step to universal 9d category-level object pose estimation. In: European Conference on Computer Vision. pp. 254–270. Springer (2025)

  72. [72]

    IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

    Zhang, L., Meng, X., Liu, L., Jiang, H., Wang, J., Wang, R., Lu, C., Liu, J., Zhang, H.: Probing effective and efficient category-level articulated object pose perception. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

  73. [73]

    Zhang, L., Xu, M., Wang, J., Yu, Q., Yang, L., Li, Y., Lu, C., Wang, R., Liu, L.: Gapt-dar:Category-levelgarmentsposetrackingviaintegrated2ddeformationand 3dreconstruction.In:ProceedingsoftheComputerVisionandPatternRecognition Conference. pp. 22638–22647 (2025)

  74. [74]

    In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

    Zhang, L., Zhong, Y., Wang, J., Min, Z., Liu, L., et al.: Rethinking 3d convolution inℓ p-norm space. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

  75. [75]

    arXiv preprint arXiv:2506.13642 (2025)

    Zhang, S., Guo, S., Fang, Q., Zhou, Y., Feng, Y.: Stream-omni: Simultaneous multimodal interactions with large language-vision-speech model. arXiv preprint arXiv:2506.13642 (2025)

  76. [76]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)

  77. [77]

    The Visual Computer41(1), 129–140 (2025)

    Zhao, C., Cai, W.L., Yuan, Z.: Spectral normalization and dual contrastive reg- ularization for image-to-image translation. The Visual Computer41(1), 129–140 (2025)

  78. [78]

    IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21

    Zhao, C., Cai, W.L., Yuan, Z., Hu, C.W.: Multi-cropping contrastive learning and domain consistency for unsupervised image-to-image translation. IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21

  79. [79]

    In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

    Zhao, C., Cai, W., Dong, C., Hu, C.: Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

  80. [80]

    In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

    Zhao, C., Cai, W., Dong, C., Zeng, Z.: Toward sufficient spatial-frequency in- teraction for gradient-aware underwater image enhancement. In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3220–3224. IEEE (2024)

Showing first 80 references.