Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

Chen Zhao; Hongyu Li; Jiajun Ma; Jian Yang; Qilong Huang; Tiehan Fan; Xiaoming Wei; Ying Tai; Zhuoliang Kang

arxiv: 2607.01667 · v1 · pith:VBOLZ6LWnew · submitted 2026-07-02 · 💻 cs.CV

Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning

Chen Zhao , Jiajun Ma , Qilong Huang , Tiehan Fan , Hongyu Li , Zhuoliang Kang , Xiaoming Wei , Jian Yang

show 1 more author

Ying Tai

This is my paper

Pith reviewed 2026-07-03 16:35 UTC · model grok-4.3

classification 💻 cs.CV

keywords audiovisual video captioningtemporal alignmentcross-modal alignmentObserver-Checker-Correctortraining data refinementmultimodal large language modelsvideo understanding benchmark

0 comments

The pith

TCA-Captioner uses an Observer-Checker-Corrector process to generate training data that binds auditory events to visual entities and respects causal timing in video captions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that multimodal models commonly produce captions where sounds float free of visible objects and where event order breaks down. It introduces the Observer-Checker-Corrector framework as an iterative loop that creates carefully grounded training examples from dense recordings of human interactions, then trains a captioner on that data. A sympathetic reader would care because accurate binding of audio to visuals and correct sequencing matter for any system that must describe conversations, actions with sound, or cause-and-effect sequences. The work also supplies TCA-Bench and a decoupled evaluation protocol that measures these two skills in isolation.

Core claim

TCA-Captioner is a framework engineered to enhance temporal and cross-modal alignment for audiovisual video captioning. The Observer-Checker-Corrector (OCC) framework is an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. TCA-Bench applies a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning.

What carries the argument

The Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity training data from a high-density human interaction dataset.

If this is right

Captions produced by the trained model bind auditory events to the correct visual entities.
The model captures complex causal dynamics rather than listing events out of sequence.
TCA-Bench isolates and quantifies separate failures in audiovisual binding versus temporal reasoning.
The same refinement loop can be reused to produce additional grounded audiovisual training sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on videos outside human-interaction scenes to check whether the dataset choice drives the gains.
The iterative correction loop might transfer to other multimodal generation tasks such as video question answering.
If the generated data truly resolves detachment, the same captions should improve downstream tasks that rely on synchronized audiovisual features.

Load-bearing premise

The assumption that the Observer-Checker-Corrector framework, when applied to a curated high-density human interaction dataset, produces training data that actually resolves modality detachment and temporal incoherence.

What would settle it

Measure audiovisual binding accuracy and temporal relational reasoning scores of TCA-Captioner against prior MLLMs on the TCA-Bench decoupled protocol; a clear gap in favor of TCA-Captioner would support the claim, while comparable or worse scores would falsify it.

Figures

Figures reproduced from arXiv: 2607.01667 by Chen Zhao, Hongyu Li, Jiajun Ma, Jian Yang, Qilong Huang, Tiehan Fan, Xiaoming Wei, Ying Tai, Zhuoliang Kang.

**Figure 1.** Figure 1: The challenges for audiovisual video captioning. Existing state-of-the-art models, including Qwen3-Omni, Gemini-3-Pro, and AVoCaDO, still struggle with two fundamental challenges: Audio-Visual Cross-Modal Binding and Audio-Visual Temporal Coherence. In the provided examples, text highlighted in red indicates errors in Cross-Modal Binding. Text marked with purple underlines identifies failures in Temporal … view at source ↗

**Figure 2.** Figure 2: The proposed Observer-Checker-Corrector (OCC) framework, an iterative data synthesis pipeline designed to generate high-quality audiovisual video captions. The framework consists of three collaborative modules: a Global-Perceptive Observer for comprehensive understanding, a Local-Perceptive Checker for detecting unimodal factual inaccuracies and cross-modal misalignments, and an Iterative Refinement Corre… view at source ↗

**Figure 3.** Figure 3: TCA-Bench data curation and annotation pipeline. Stage I: source videos undergo parallel visual (YOLOv8 [55]) and auditory (Silero VAD [54]) analysis; timelines are fused under a joint co-occurrence constraint to extract segments. Stage II: each clip receives three ground-truth layers via LLM draft → human validation. validation paradigm: a multimodal LLM first generates an initial dense caption, which hu… view at source ↗

**Figure 4.** Figure 4: TCA-Bench ground-truth annotation showcase. Top: a dense caption with inline highlights delineating five evaluation sub-dimensions grouped into two modality tracks (Visual: Subject & Action, Scene & Atmosphere, Cinematography; Audio: Transcription Accuracy, Tone & Emotion). Bottom-left: audio-visual binding pairs mapping each acoustic event to its source—foreground character, off-screen character, or envi… view at source ↗

**Figure 5.** Figure 5: The performance analysis for OCC framework. Since the generation of LLM drafts in our benchmark construction utilized the OCC framework, a direct comparison within the full TCA-Bench would introduce inherent bias. To conduct a more rigorous and equitable ablation study of the OCC framework’s contributions, we constructed an additional mini-bench comprising 20 samples. These samples utilized Gemini2.5-… view at source ↗

read the original abstract

While Multimodal Large Language Models (MLLMs) have advanced video understanding, achieving precise temporal and cross-modal alignment in audiovisual video captioning remains a formidable challenge. Most existing approaches suffer from modality detachment and temporal incoherence, failing to accurately bind auditory events to visual entities or capture complex causal dynamics. To address these deficiencies, we propose TCA-Captioner, a framework specifically engineered to enhance Temporal and Cross-Modal Alignment for audiovisual video captioning. We first introduce the Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. Furthermore, we present TCA-Bench, a diagnostic benchmark utilizing a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning. Extensive experiments demonstrate that TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract describes TCA-Captioner and OCC data generation plus a new benchmark, but supplies zero results or comparisons, so nothing can be verified.

read the letter

The main thing here is that the paper claims TCA-Captioner with its Observer-Checker-Corrector loop sets a new standard for audiovisual video captioning, yet the abstract contains no numbers, tables, ablations, or baselines to support that. We cannot tell if the work is new or solid.

What looks new is the OCC iterative refinement process run on a high-density human interaction dataset to produce grounded training data, plus TCA-Bench and its decoupled evaluation protocol meant to separately score audiovisual binding and temporal reasoning. Those target a concrete issue in current MLLMs.

The paper does a reasonable job stating the problems of modality detachment and temporal incoherence, which are real limitations when models try to link audio events to visual entities or track causal sequences.

The soft spot is the complete lack of evidence. The central assertion rests on an unsupported sentence about extensive experiments. Without the actual results, error bars, or comparisons we have no way to judge whether OCC actually improves alignment or whether TCA-Bench reveals meaningful differences. The full manuscript is referenced but not provided, so the methods and any reproducibility details stay out of reach.

This is for researchers already working on multimodal video models who might want to look at the benchmark protocol or the data-generation idea if the full paper ever shows the numbers. A reader could extract the problem framing, but little else.

I would not send it to peer review in its current state. The experimental claims need to be visible before it is worth a referee's time.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes TCA-Captioner, a framework for audiovisual video captioning that introduces the Observer-Checker-Corrector (OCC) iterative refinement strategy to generate high-fidelity training data from a curated high-density human interaction dataset. It further presents TCA-Bench, a diagnostic benchmark with a Decoupled Evaluation Protocol to assess audiovisual binding and temporal relational reasoning. The central claim is that extensive experiments show TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives, addressing modality detachment and temporal incoherence in multimodal large language models.

Significance. If the experimental claims hold, the work could advance multimodal video understanding by targeting key limitations in temporal and cross-modal alignment. The OCC data-generation approach and TCA-Bench diagnostic protocol represent potentially useful methodological contributions. However, the significance cannot be assessed because the manuscript supplies no quantitative results, ablations, baselines, or error analysis to support the performance assertions.

major comments (1)

Abstract: the claim that 'extensive experiments demonstrate that TCA-Captioner sets a new standard' is presented without any supporting data, ablation studies, comparison tables, error bars, or quantitative metrics, rendering the central empirical claim unverifiable from the manuscript text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for identifying the lack of supporting evidence for the empirical claims. We address the major comment below and commit to revisions that will make the experimental support verifiable.

read point-by-point responses

Referee: Abstract: the claim that 'extensive experiments demonstrate that TCA-Captioner sets a new standard' is presented without any supporting data, ablation studies, comparison tables, error bars, or quantitative metrics, rendering the central empirical claim unverifiable from the manuscript text.

Authors: We agree that the abstract asserts performance improvements without the quantitative backing being present in the manuscript text. The current version does not supply the required results, ablations, baselines, or metrics. In the revised manuscript we will add the full experimental section with comparison tables, ablation studies, quantitative metrics, and error analysis. The abstract will be updated to reflect only what the new results support. revision: yes

Circularity Check

0 steps flagged

No circularity detected; no derivation chain present

full rationale

The abstract and query provide no equations, derivations, fitted parameters, or self-citations. Claims rest on introducing the OCC framework and TCA-Bench without any mathematical reduction or load-bearing self-reference visible. Full manuscript details are noted as unavailable, precluding any inspection of potential circular steps. This is the expected non-finding when no technical derivation exists to analyze.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework description remains at the level of named components without mathematical or empirical grounding.

pith-pipeline@v0.9.1-grok · 5722 in / 1094 out tokens · 25544 ms · 2026-07-03T16:35:11.723896+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

97 extracted references · 43 canonical work pages · 20 internal anchors

[1]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

Abdar, M., Kollati, M., Kuraparthi, S., Pourpanah, F., McDuff, D., Ghavamzadeh, M., Yan, S., Mohamed, A., Khosravi, A., Cambria, E., et al.: A review of deep learning for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

2024
[2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Calibrated Harmonic Overlaid Implicit Neural Representations for Multi-Dimensional Data

Chen, H., Zhang, X., Sun, X., Mingqing, X.: Calibrated harmonic overlaid implicit neural representations for multi-dimensional data. arXiv preprint arXiv:2606.26763 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[4]

IEEE Transactions on Image Processing 35, 5982–5994 (2026)

Chen, L., Wang, J., Pan, Z., Zhu, B., Yang, X., Zhang, C.: Detail++: Training-free detail enhancer for t2i diffusion models. IEEE Transactions on Image Processing 35, 5982–5994 (2026)

2026
[5]

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Chen, L., You, T., Liu, H., Bao, Z., Jiao, J., Han, X., Ou, Z., Sun, T., Mou, X., Jin, X., et al.: Echo: Efficient chest x-ray report generation with one-step block diffusion. arXiv preprint arXiv:2604.09450 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

Advances in Neural Information Processing Systems37, 19472–19495 (2024)

Chen, L., Wei, X., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Tang, Z., Yuan, L., et al.: Sharegpt4video: Improving video understanding and genera- tion with better captions. Advances in Neural Information Processing Systems37, 19472–19495 (2024)

2024
[7]

arXiv preprint arXiv:2510.10395 (2025)

Chen, X., Ding, Y., Lin, W., Hua, J., Yao, L., Shi, Y., Li, B., Zhang, Y., Liu, Q., Wan, P., et al.: Avocado: An audiovisual video captioner driven by temporal orchestration. arXiv preprint arXiv:2510.10395 (2025)

work page arXiv 2025
[8]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

2024
[9]

arXiv preprint arXiv:2308.00303 (2023)

Chen, Z., Gao, R., Xiang, T.Z., Lin, F.: Diffusion model for camouflaged object detection. arXiv preprint arXiv:2308.00303 (2023)

work page arXiv 2023
[10]

Chen, Z., Li, Y., Wang, H., Chen, Z., Jiang, Z., Li, J., Wang, Q., Yang, J., Tai, Y.: Ragd:Regional-awarediffusionmodelfortext-to-imagegeneration.In:Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19331–19341 (2025)

2025
[11]

L2P: Unlocking Latent Potential for Pixel Generation

Chen, Z., Zhu, J., Chen, X., Zhang, J., Chen, J., Zeng, Z., Zhang, W., Wang, C., Yang, J., Tai, Y.: L2p: Unlocking latent potential for pixel generation. arXiv preprint arXiv:2605.12013 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[12]

arXiv preprint arXiv:2511.18822 (2025)

Chen, Z., Zhu, J., Chen, X., Zhang, J., Hu, X., Zhao, H., Wang, C., Yang, J., Tai, Y.: Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822 (2025)

work page arXiv 2025
[13]

Qwen2-Audio Technical Report

Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Company, T.D.: Tiktok-10m: A large-scale short video dataset for video un- derstanding (2025),https://huggingface.co/datasets/The- data- company/ TikTok-10M, accessed: June 30, 2026

2025
[15]

In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Dong, C., Zhao, C., Cai, W., Yang, B., Guo, Y.: O-mamba: O-shape state-space model for underwater image enhancement. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 168–182. Springer (2025) Temporal and Cross-Modal Alignment for AVC 17

2025
[16]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Du, S., Zou, Y., Li, J., Liu, M., Li, Y., Shang, C., Shen, Q.: Pansharpening for thin- cloud contaminated remote sensing images: a unified framework and benchmark dataset. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 3696–3704 (2026)

2026
[17]

Inter- national Journal of Computer Vision134, 152 (2026)

Du, S., Zou, Y., Wang, Z., Li, X., Li, Y., Shang, C., Shen, Q.: Unsupervised hy- perspectral image super-resolution via self-supervised modality decoupling. Inter- national Journal of Computer Vision134, 152 (2026)

2026
[18]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Du, Y., Lin, Z., Song, K., Wang, B., Zheng, Z., Ge, T., Zheng, B., Jin, Q.: Vc4vg: Optimizing video captions for text-to-video generation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 1124–1138 (2025)

2025
[19]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Fan, T., Nan, K., Xie, R., Zhou, P., Yang, Z., Fu, C., Li, X., Yang, J., Tai, Y.: Instancecap:Improvingtext-to-videogenerationviainstance-awarestructuredcap- tion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28974–28983 (2025)

2025
[20]

Farré, M., Marafioti, A., Tunstall, L., Von Werra, L., Wolf, T.: Finevideo.https: //huggingface.co/datasets/HuggingFaceFV/finevideo(2024), accessed: June 30, 2026

2024
[21]

arXiv preprint arXiv:2408.05211 (2024)

Fu, C., Lin, H., Long, Z., Shen, Y., Dai, Y., Zhao, M., Zhang, Y.F., Dong, S., Li, Y., Wang, X., et al.: Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211 (2024)

work page arXiv 2024
[22]

arXiv preprint arXiv:2507.20939 (2025)

Ge, Y., Ge, Y., Li, C., Wang, T., Pu, J., Li, Y., Qiu, L., Ma, J., Duan, L., Zuo, X., et al.: Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts. arXiv preprint arXiv:2507.20939 (2025)

work page arXiv 2025
[23]

Gemini-3-Pro Team: Gemini-3-Pro.https://gemini.google.com, accessed: June 30, 2026

2026
[24]

arXiv preprint arXiv:2602.08682 (2026)

Guo, Y., Gan, Q., Zhang, Y., Liu, J., Hu, Y., Xie, P., Qian, D., Zhang, Y., Li, R., Zhang, Y., et al.: Alive: Animate your world with lifelike audio-video generation. arXiv preprint arXiv:2602.08682 (2026)

work page arXiv 2026
[25]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Hu,X.,Tai,Y.,Zhao,X.,Zhao,C.,Zhang,Z.,Li,J.,Zhong,B.,Yang,J.:Exploiting multimodal spatial-temporal patterns for video object tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 3581–3589 (2025)

2025
[26]

Huang, G., Mao, J., Huang, F., Liu, F., Luo, X., Liang, Y., Lu, J., Wang, X., Liu, P., Fu, R., Huang, R., Huang, S.L.: Exposure bias can alleviate itself via directional and frequency rectification in flow matching (2026)

2026
[27]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Jiang, T., Wu, L., Xia, S., Li, S., Yan, Z., Yang, H., Qiao, Y., Wang, Y.: Imagine before you predict: Interleaved latent visual reasoning for video event prediction. arXiv preprint arXiv:2606.05769 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[29]

arXiv preprint arXiv:2511.20272 (2025)

Jiang, T., Xia, S., Xu, Y., Wu, L., Zeng, X., Wang, L., Qiao, Y., Wang, Y.: Vknowu: Evaluating visual knowledge understanding in multimodal llms. arXiv preprint arXiv:2511.20272 (2025)

work page arXiv 2025
[30]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kim, Y., Abdelrahman, A.S., Abdel-Aty, M.: Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 761–771 (2025)

2025
[31]

In: International conference on machine learning

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 18 C. Zhao et al

2023
[32]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, X., Du, S., Zou, Y., Xu, H., Jiang, Z., Liu, J.: Unifusion: A unified image fusion framework with robust representation and source-aware preservation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33869–33880 (2026)

2026
[33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Lian, L., Ding, Y., Ge, Y., Liu, S., Mao, H., Li, B., Pavone, M., Liu, M.Y., Darrell, T., Yala, A., et al.: Describe anything: Detailed localized image and video cap- tioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21766–21777 (2025)

2025
[34]

In: Proceedings of the 2024 conference on empirical methods in natural language processing

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971– 5984 (2024)

2024
[35]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: Vila: On pre- training for visual language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26689–26699 (2024)

2024
[36]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Lin, K., Li, L., Lin, C.C., Ahmed, F., Gan, Z., Liu, Z., Lu, Y., Wang, L.: Swinbert: End-to-end transformers with sparse attention for video captioning. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17949–17958 (2022)

2022
[37]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

2024
[38]

DriveVA: Video Action Models are Zero-Shot Drivers

Liu, M., Zhang, D., Liu, J., Cui, J., Xie, H., Chen, G., Ye, H., Yang, M.Y., Nex, F., Cheng, H.: Driveva: Video action models are zero-shot drivers. arXiv preprint arXiv:2604.04198 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Liu, Y., Li, S., Liu, Y., Wang, Y., Ren, S., Li, L., Chen, S., Sun, X., Hou, L.: Tem- pcompass: Do video llms really understand videos? In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 8731–8772 (2024)

2024
[40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., Kembhavi, A.: Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26439–26455 (2024)

2024
[41]

Lu, S., Lian, Z., Zhou, Z., Zhang, S., Zhao, C., Kong, A.W.K.: Does flux al- ready know how to perform physically plausible image composition? arXiv preprint arXiv:2509.21278 (2025)

work page arXiv 2025
[42]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Lu, S., Liu, Y., Kong, A.W.K.: Tf-icon: Diffusion-based training-free cross-domain image composition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2294–2305 (2023)

2023
[43]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lu, S., Wang, Z., Li, L., Liu, Y., Kong, A.W.K.: Mace: Mass concept erasure in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6430–6440 (2024)

2024
[44]

arXiv preprint arXiv:2410.18775 (2024)

Lu, S., Zhou, Z., Lu, J., Zhu, Y., Kong, A.W.K.: Robust watermarking using gener- ative priors against image editing: From benchmarking to advances. arXiv preprint arXiv:2410.18775 (2024)

work page arXiv 2024
[45]

arXiv preprint arXiv:2510.12720 (2025)

Ma, Z., Xu, R., Xing, Z., Chu, Y., Wang, Y., He, J., Xu, J., Heng, P.A., Yu, K., Lin, J., et al.: Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception. arXiv preprint arXiv:2510.12720 (2025)

work page arXiv 2025
[46]

In: International Conference on Learning Representations

Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In: International Conference on Learning Representations. vol. 2025, pp. 1045–1064 (2025) Temporal and Cross-Modal Alignment for AVC 19

2025
[47]

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Nguyen, L.T.P., Yu, Z., Hang, S.L.Y., An, S., Lee, J., Ban, Y., Chung, S., Nguyen, T.H., Maeng, J., Lee, S., et al.: See, hear, and understand: Benchmarking audio- visual human speech understanding in multimodal large language models. arXiv preprint arXiv:2512.02231 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

ACM Computing Surveys57(6), 1–36 (2025)

Qasim, I., Horsch, A., Prasad, D.: Dense video captioning: A survey of techniques, datasets and evaluation protocols. ACM Computing Surveys57(6), 1–36 (2025)

2025
[49]

In: Companion Proceedings of the ACM on Web Conference 2025

Shang, Y., Gao, C., Li, N., Li, Y.: A large-scale dataset with behavior, attributes, and content of mobile short-video platform. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 793–796 (2025)

2025
[50]

In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp

Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: Pandagpt: One model to instruction-follow them all. In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp. 11–23 (2023)

2023
[51]

arXiv preprint arXiv:2506.15220 (2025)

Tang, C., Li, Y., Yang, Y., Zhuang, J., Sun, G., Li, W., Ma, Z., Zhang, C.: video- salmonn 2: Caption-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220 (2025)

work page arXiv 2025
[52]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z., Zhang, C.: Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

arXiv preprint arXiv:2510.22200 (2025)

Team, M.L., Cai, X., Huang, Q., Kang, Z., Li, H., Liang, S., Ma, L., Ren, S., Wei, X., Xie, R., et al.: Longcat-video technical report. arXiv preprint arXiv:2510.22200 (2025)

work page arXiv 2025
[54]

Team, S.: Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier (2024)

2024
[55]

In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS)

Varghese, R., Sambath, M.: Yolov8: A novel object detection algorithm with en- hanced performance and robustness. In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS). pp. 1–6. IEEE (2024)

2024
[56]

arXiv preprint arXiv:2407.00634 (2024)

Wang, J., Yuan, L., Zhang, Y., Sun, H.: Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634 (2024)

work page arXiv 2024
[57]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[58]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[59]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Wang, X., Hua, J., Lin, W., Zhang, Y., Zhang, F., Wu, J., Zhang, D., Nie, L.: Haic: Improving human action understanding and generation with better captions for multi-modal large language models. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 10158–10181 (2025)

2025
[60]

arXiv preprint arXiv:2507.11336 (2025)

Wu, P., Liu, Y., Zhu, Z., Zhou, E., Shen, J.: Ugc-videocaptioner: An omni ugc video detail caption model and new benchmarks. arXiv preprint arXiv:2507.11336 (2025)

work page arXiv 2025
[61]

arXiv preprint arXiv:2404.01717 (2024)

Xie, R., Zhao, C., Zhang, K., Zhang, Z., Zhou, J., Yang, J., Tai, Y.: Addsr: Acceler- ating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717 (2024)

work page arXiv 2024
[62]

Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., Zhang, B., Wang, X., Chu, Y., Lin, J.: Qwen2.5-omni technical report (2025)

2025
[63]

Qwen3-Omni Technical Report

Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025) 20 C. Zhao et al

work page internal anchor Pith review Pith/arXiv arXiv 2025
[64]

arXiv preprint arXiv:2506.21277 (2025)

Yang, Q., Yao, S., Chen, W., Fu, S., Bai, D., Zhao, J., Sun, B., Yin, B., Wei, X., Zhou, J.: Humanomniv2: From understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277 (2025)

work page arXiv 2025
[65]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

arXiv preprint arXiv:2501.07888 (2025)

Yuan, L., Wang, J., Sun, H., Zhang, Y., Lin, Y.: Tarsier2: Advancing large vision- language models from detailed video description to comprehensive video under- standing. arXiv preprint arXiv:2501.07888 (2025)

work page arXiv 2025
[67]

In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Zhan, J., Dai, J., Ye, J., Zhou, Y., Zhang, D., Liu, Z., Zhang, X., Yuan, R., Zhang, G., Li, L., et al.: Anygpt: Unified multimodal llm with discrete sequence modeling. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 9637–9662 (2024)

2024
[68]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[69]

In: Findings of the Association for Computational Linguistics: EMNLP 2023

Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., Qiu, X.: Speechgpt: Empowering large language models with intrinsic cross-modal conversational abil- ities. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 15757–15773 (2023)

2023
[70]

In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)

Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)

2023
[71]

In: European Conference on Computer Vision

Zhang, L., Meng, W., Zhong, Y., Kong, B., Xu, M., Du, J., Wang, X., Wang, R., Liu, L.: U-cope: Taking a further step to universal 9d category-level object pose estimation. In: European Conference on Computer Vision. pp. 254–270. Springer (2025)

2025
[72]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

Zhang, L., Meng, X., Liu, L., Jiang, H., Wang, J., Wang, R., Lu, C., Liu, J., Zhang, H.: Probing effective and efficient category-level articulated object pose perception. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

2026
[73]

Zhang, L., Xu, M., Wang, J., Yu, Q., Yang, L., Li, Y., Lu, C., Wang, R., Liu, L.: Gapt-dar:Category-levelgarmentsposetrackingviaintegrated2ddeformationand 3dreconstruction.In:ProceedingsoftheComputerVisionandPatternRecognition Conference. pp. 22638–22647 (2025)

2025
[74]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

Zhang, L., Zhong, Y., Wang, J., Min, Z., Liu, L., et al.: Rethinking 3d convolution inℓ p-norm space. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

2024
[75]

arXiv preprint arXiv:2506.13642 (2025)

Zhang, S., Guo, S., Fang, Q., Zhou, Y., Feng, Y.: Stream-omni: Simultaneous multimodal interactions with large language-vision-speech model. arXiv preprint arXiv:2506.13642 (2025)

work page arXiv 2025
[76]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[77]

The Visual Computer41(1), 129–140 (2025)

Zhao, C., Cai, W.L., Yuan, Z.: Spectral normalization and dual contrastive reg- ularization for image-to-image translation. The Visual Computer41(1), 129–140 (2025)

2025
[78]

IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21

Zhao, C., Cai, W.L., Yuan, Z., Hu, C.W.: Multi-cropping contrastive learning and domain consistency for unsupervised image-to-image translation. IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21

2025
[79]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

Zhao, C., Cai, W., Dong, C., Hu, C.: Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

2024
[80]

In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Zhao, C., Cai, W., Dong, C., Zeng, Z.: Toward sufficient spatial-frequency in- teraction for gradient-aware underwater image enhancement. In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3220–3224. IEEE (2024)

2024

Showing first 80 references.

[1] [1]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

Abdar, M., Kollati, M., Kuraparthi, S., Pourpanah, F., McDuff, D., Ghavamzadeh, M., Yan, S., Mohamed, A., Khosravi, A., Cambria, E., et al.: A review of deep learning for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

2024

[2] [2]

Qwen3-VL Technical Report

Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Calibrated Harmonic Overlaid Implicit Neural Representations for Multi-Dimensional Data

Chen, H., Zhang, X., Sun, X., Mingqing, X.: Calibrated harmonic overlaid implicit neural representations for multi-dimensional data. arXiv preprint arXiv:2606.26763 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[4] [4]

IEEE Transactions on Image Processing 35, 5982–5994 (2026)

Chen, L., Wang, J., Pan, Z., Zhu, B., Yang, X., Zhang, C.: Detail++: Training-free detail enhancer for t2i diffusion models. IEEE Transactions on Image Processing 35, 5982–5994 (2026)

2026

[5] [5]

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

Chen, L., You, T., Liu, H., Bao, Z., Jiao, J., Han, X., Ou, Z., Sun, T., Mou, X., Jin, X., et al.: Echo: Efficient chest x-ray report generation with one-step block diffusion. arXiv preprint arXiv:2604.09450 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Advances in Neural Information Processing Systems37, 19472–19495 (2024)

Chen, L., Wei, X., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Tang, Z., Yuan, L., et al.: Sharegpt4video: Improving video understanding and genera- tion with better captions. Advances in Neural Information Processing Systems37, 19472–19495 (2024)

2024

[7] [7]

arXiv preprint arXiv:2510.10395 (2025)

Chen, X., Ding, Y., Lin, W., Hua, J., Yao, L., Shi, Y., Li, B., Zhang, Y., Liu, Q., Wan, P., et al.: Avocado: An audiovisual video captioner driven by temporal orchestration. arXiv preprint arXiv:2510.10395 (2025)

work page arXiv 2025

[8] [8]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

2024

[9] [9]

arXiv preprint arXiv:2308.00303 (2023)

Chen, Z., Gao, R., Xiang, T.Z., Lin, F.: Diffusion model for camouflaged object detection. arXiv preprint arXiv:2308.00303 (2023)

work page arXiv 2023

[10] [10]

Chen, Z., Li, Y., Wang, H., Chen, Z., Jiang, Z., Li, J., Wang, Q., Yang, J., Tai, Y.: Ragd:Regional-awarediffusionmodelfortext-to-imagegeneration.In:Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19331–19341 (2025)

2025

[11] [11]

L2P: Unlocking Latent Potential for Pixel Generation

Chen, Z., Zhu, J., Chen, X., Zhang, J., Chen, J., Zeng, Z., Zhang, W., Wang, C., Yang, J., Tai, Y.: L2p: Unlocking latent potential for pixel generation. arXiv preprint arXiv:2605.12013 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[12] [12]

arXiv preprint arXiv:2511.18822 (2025)

Chen, Z., Zhu, J., Chen, X., Zhang, J., Hu, X., Zhao, H., Wang, C., Yang, J., Tai, Y.: Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822 (2025)

work page arXiv 2025

[13] [13]

Qwen2-Audio Technical Report

Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [14]

Company, T.D.: Tiktok-10m: A large-scale short video dataset for video un- derstanding (2025),https://huggingface.co/datasets/The- data- company/ TikTok-10M, accessed: June 30, 2026

2025

[15] [15]

In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Dong, C., Zhao, C., Cai, W., Yang, B., Guo, Y.: O-mamba: O-shape state-space model for underwater image enhancement. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 168–182. Springer (2025) Temporal and Cross-Modal Alignment for AVC 17

2025

[16] [16]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Du, S., Zou, Y., Li, J., Liu, M., Li, Y., Shang, C., Shen, Q.: Pansharpening for thin- cloud contaminated remote sensing images: a unified framework and benchmark dataset. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 3696–3704 (2026)

2026

[17] [17]

Inter- national Journal of Computer Vision134, 152 (2026)

Du, S., Zou, Y., Wang, Z., Li, X., Li, Y., Shang, C., Shen, Q.: Unsupervised hy- perspectral image super-resolution via self-supervised modality decoupling. Inter- national Journal of Computer Vision134, 152 (2026)

2026

[18] [18]

In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Du, Y., Lin, Z., Song, K., Wang, B., Zheng, Z., Ge, T., Zheng, B., Jin, Q.: Vc4vg: Optimizing video captions for text-to-video generation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 1124–1138 (2025)

2025

[19] [19]

In: Proceedings of the Computer Vision and Pattern Recognition Conference

Fan, T., Nan, K., Xie, R., Zhou, P., Yang, Z., Fu, C., Li, X., Yang, J., Tai, Y.: Instancecap:Improvingtext-to-videogenerationviainstance-awarestructuredcap- tion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28974–28983 (2025)

2025

[20] [20]

Farré, M., Marafioti, A., Tunstall, L., Von Werra, L., Wolf, T.: Finevideo.https: //huggingface.co/datasets/HuggingFaceFV/finevideo(2024), accessed: June 30, 2026

2024

[21] [21]

arXiv preprint arXiv:2408.05211 (2024)

Fu, C., Lin, H., Long, Z., Shen, Y., Dai, Y., Zhao, M., Zhang, Y.F., Dong, S., Li, Y., Wang, X., et al.: Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211 (2024)

work page arXiv 2024

[22] [22]

arXiv preprint arXiv:2507.20939 (2025)

Ge, Y., Ge, Y., Li, C., Wang, T., Pu, J., Li, Y., Qiu, L., Ma, J., Duan, L., Zuo, X., et al.: Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts. arXiv preprint arXiv:2507.20939 (2025)

work page arXiv 2025

[23] [23]

Gemini-3-Pro Team: Gemini-3-Pro.https://gemini.google.com, accessed: June 30, 2026

2026

[24] [24]

arXiv preprint arXiv:2602.08682 (2026)

Guo, Y., Gan, Q., Zhang, Y., Liu, J., Hu, Y., Xie, P., Qian, D., Zhang, Y., Li, R., Zhang, Y., et al.: Alive: Animate your world with lifelike audio-video generation. arXiv preprint arXiv:2602.08682 (2026)

work page arXiv 2026

[25] [25]

In: Proceedings of the AAAI Conference on Artificial Intelligence

Hu,X.,Tai,Y.,Zhao,X.,Zhao,C.,Zhang,Z.,Li,J.,Zhong,B.,Yang,J.:Exploiting multimodal spatial-temporal patterns for video object tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 3581–3589 (2025)

2025

[26] [26]

Huang, G., Mao, J., Huang, F., Liu, F., Luo, X., Liang, Y., Lu, J., Wang, X., Liu, P., Fu, R., Huang, R., Huang, S.L.: Exposure bias can alleviate itself via directional and frequency rectification in flow matching (2026)

2026

[27] [27]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

Jiang, T., Wu, L., Xia, S., Li, S., Yan, Z., Yang, H., Qiao, Y., Wang, Y.: Imagine before you predict: Interleaved latent visual reasoning for video event prediction. arXiv preprint arXiv:2606.05769 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[29] [29]

arXiv preprint arXiv:2511.20272 (2025)

Jiang, T., Xia, S., Xu, Y., Wu, L., Zeng, X., Wang, L., Qiao, Y., Wang, Y.: Vknowu: Evaluating visual knowledge understanding in multimodal llms. arXiv preprint arXiv:2511.20272 (2025)

work page arXiv 2025

[30] [30]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Kim, Y., Abdelrahman, A.S., Abdel-Aty, M.: Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 761–771 (2025)

2025

[31] [31]

In: International conference on machine learning

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 18 C. Zhao et al

2023

[32] [32]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Li, X., Du, S., Zou, Y., Xu, H., Jiang, Z., Liu, J.: Unifusion: A unified image fusion framework with robust representation and source-aware preservation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33869–33880 (2026)

2026

[33] [33]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Lian, L., Ding, Y., Ge, Y., Liu, S., Mao, H., Li, B., Pavone, M., Liu, M.Y., Darrell, T., Yala, A., et al.: Describe anything: Detailed localized image and video cap- tioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21766–21777 (2025)

2025

[34] [34]

In: Proceedings of the 2024 conference on empirical methods in natural language processing

Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971– 5984 (2024)

2024

[35] [35]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: Vila: On pre- training for visual language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26689–26699 (2024)

2024

[36] [36]

In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition

Lin, K., Li, L., Lin, C.C., Ahmed, F., Gan, Z., Liu, Z., Lu, Y., Wang, L.: Swinbert: End-to-end transformers with sparse attention for video captioning. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17949–17958 (2022)

2022

[37] [37]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)

2024

[38] [38]

DriveVA: Video Action Models are Zero-Shot Drivers

Liu, M., Zhang, D., Liu, J., Cui, J., Xie, H., Chen, G., Ye, H., Yang, M.Y., Nex, F., Cheng, H.: Driveva: Video action models are zero-shot drivers. arXiv preprint arXiv:2604.04198 (2026)

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

Liu, Y., Li, S., Liu, Y., Wang, Y., Ren, S., Li, L., Chen, S., Sun, X., Hou, L.: Tem- pcompass: Do video llms really understand videos? In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 8731–8772 (2024)

2024

[40] [40]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., Kembhavi, A.: Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26439–26455 (2024)

2024

[41] [41]

Lu, S., Lian, Z., Zhou, Z., Zhang, S., Zhao, C., Kong, A.W.K.: Does flux al- ready know how to perform physically plausible image composition? arXiv preprint arXiv:2509.21278 (2025)

work page arXiv 2025

[42] [42]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Lu, S., Liu, Y., Kong, A.W.K.: Tf-icon: Diffusion-based training-free cross-domain image composition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2294–2305 (2023)

2023

[43] [43]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Lu, S., Wang, Z., Li, L., Liu, Y., Kong, A.W.K.: Mace: Mass concept erasure in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6430–6440 (2024)

2024

[44] [44]

arXiv preprint arXiv:2410.18775 (2024)

Lu, S., Zhou, Z., Lu, J., Zhu, Y., Kong, A.W.K.: Robust watermarking using gener- ative priors against image editing: From benchmarking to advances. arXiv preprint arXiv:2410.18775 (2024)

work page arXiv 2024

[45] [45]

arXiv preprint arXiv:2510.12720 (2025)

Ma, Z., Xu, R., Xing, Z., Chu, Y., Wang, Y., He, J., Xu, J., Heng, P.A., Yu, K., Lin, J., et al.: Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception. arXiv preprint arXiv:2510.12720 (2025)

work page arXiv 2025

[46] [46]

In: International Conference on Learning Representations

Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In: International Conference on Learning Representations. vol. 2025, pp. 1045–1064 (2025) Temporal and Cross-Modal Alignment for AVC 19

2025

[47] [47]

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models

Nguyen, L.T.P., Yu, Z., Hang, S.L.Y., An, S., Lee, J., Ban, Y., Chung, S., Nguyen, T.H., Maeng, J., Lee, S., et al.: See, hear, and understand: Benchmarking audio- visual human speech understanding in multimodal large language models. arXiv preprint arXiv:2512.02231 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[48] [48]

ACM Computing Surveys57(6), 1–36 (2025)

Qasim, I., Horsch, A., Prasad, D.: Dense video captioning: A survey of techniques, datasets and evaluation protocols. ACM Computing Surveys57(6), 1–36 (2025)

2025

[49] [49]

In: Companion Proceedings of the ACM on Web Conference 2025

Shang, Y., Gao, C., Li, N., Li, Y.: A large-scale dataset with behavior, attributes, and content of mobile short-video platform. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 793–796 (2025)

2025

[50] [50]

In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp

Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: Pandagpt: One model to instruction-follow them all. In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp. 11–23 (2023)

2023

[51] [51]

arXiv preprint arXiv:2506.15220 (2025)

Tang, C., Li, Y., Yang, Y., Zhuang, J., Sun, G., Li, W., Ma, Z., Zhang, C.: video- salmonn 2: Caption-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220 (2025)

work page arXiv 2025

[52] [52]

SALMONN: Towards Generic Hearing Abilities for Large Language Models

Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z., Zhang, C.: Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[53] [53]

arXiv preprint arXiv:2510.22200 (2025)

Team, M.L., Cai, X., Huang, Q., Kang, Z., Li, H., Liang, S., Ma, L., Ren, S., Wei, X., Xie, R., et al.: Longcat-video technical report. arXiv preprint arXiv:2510.22200 (2025)

work page arXiv 2025

[54] [54]

Team, S.: Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier (2024)

2024

[55] [55]

In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS)

Varghese, R., Sambath, M.: Yolov8: A novel object detection algorithm with en- hanced performance and robustness. In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS). pp. 1–6. IEEE (2024)

2024

[56] [56]

arXiv preprint arXiv:2407.00634 (2024)

Wang, J., Yuan, L., Zhang, Y., Sun, H.: Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634 (2024)

work page arXiv 2024

[57] [57]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[58] [58]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[59] [59]

In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Wang, X., Hua, J., Lin, W., Zhang, Y., Zhang, F., Wu, J., Zhang, D., Nie, L.: Haic: Improving human action understanding and generation with better captions for multi-modal large language models. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 10158–10181 (2025)

2025

[60] [60]

arXiv preprint arXiv:2507.11336 (2025)

Wu, P., Liu, Y., Zhu, Z., Zhou, E., Shen, J.: Ugc-videocaptioner: An omni ugc video detail caption model and new benchmarks. arXiv preprint arXiv:2507.11336 (2025)

work page arXiv 2025

[61] [61]

arXiv preprint arXiv:2404.01717 (2024)

Xie, R., Zhao, C., Zhang, K., Zhang, Z., Zhou, J., Yang, J., Tai, Y.: Addsr: Acceler- ating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717 (2024)

work page arXiv 2024

[62] [62]

Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., Zhang, B., Wang, X., Chu, Y., Lin, J.: Qwen2.5-omni technical report (2025)

2025

[63] [63]

Qwen3-Omni Technical Report

Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025) 20 C. Zhao et al

work page internal anchor Pith review Pith/arXiv arXiv 2025

[64] [64]

arXiv preprint arXiv:2506.21277 (2025)

Yang, Q., Yao, S., Chen, W., Fu, S., Bai, D., Zhao, J., Sun, B., Yin, B., Wei, X., Zhou, J.: Humanomniv2: From understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277 (2025)

work page arXiv 2025

[65] [65]

MiniCPM-V: A GPT-4V Level MLLM on Your Phone

Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[66] [66]

arXiv preprint arXiv:2501.07888 (2025)

Yuan, L., Wang, J., Sun, H., Zhang, Y., Lin, Y.: Tarsier2: Advancing large vision- language models from detailed video description to comprehensive video under- standing. arXiv preprint arXiv:2501.07888 (2025)

work page arXiv 2025

[67] [67]

In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Zhan, J., Dai, J., Ye, J., Zhou, Y., Zhang, D., Liu, Z., Zhang, X., Yuan, R., Zhang, G., Li, L., et al.: Anygpt: Unified multimodal llm with discrete sequence modeling. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 9637–9662 (2024)

2024

[68] [68]

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[69] [69]

In: Findings of the Association for Computational Linguistics: EMNLP 2023

Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., Qiu, X.: Speechgpt: Empowering large language models with intrinsic cross-modal conversational abil- ities. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 15757–15773 (2023)

2023

[70] [70]

In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)

Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)

2023

[71] [71]

In: European Conference on Computer Vision

Zhang, L., Meng, W., Zhong, Y., Kong, B., Xu, M., Du, J., Wang, X., Wang, R., Liu, L.: U-cope: Taking a further step to universal 9d category-level object pose estimation. In: European Conference on Computer Vision. pp. 254–270. Springer (2025)

2025

[72] [72]

IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

Zhang, L., Meng, X., Liu, L., Jiang, H., Wang, J., Wang, R., Lu, C., Liu, J., Zhang, H.: Probing effective and efficient category-level articulated object pose perception. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)

2026

[73] [73]

Zhang, L., Xu, M., Wang, J., Yu, Q., Yang, L., Li, Y., Lu, C., Wang, R., Liu, L.: Gapt-dar:Category-levelgarmentsposetrackingviaintegrated2ddeformationand 3dreconstruction.In:ProceedingsoftheComputerVisionandPatternRecognition Conference. pp. 22638–22647 (2025)

2025

[74] [74]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

Zhang, L., Zhong, Y., Wang, J., Min, Z., Liu, L., et al.: Rethinking 3d convolution inℓ p-norm space. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)

2024

[75] [75]

arXiv preprint arXiv:2506.13642 (2025)

Zhang, S., Guo, S., Fang, Q., Zhou, Y., Feng, Y.: Stream-omni: Simultaneous multimodal interactions with large language-vision-speech model. arXiv preprint arXiv:2506.13642 (2025)

work page arXiv 2025

[76] [76]

LLaVA-Video: Video Instruction Tuning With Synthetic Data

Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[77] [77]

The Visual Computer41(1), 129–140 (2025)

Zhao, C., Cai, W.L., Yuan, Z.: Spectral normalization and dual contrastive reg- ularization for image-to-image translation. The Visual Computer41(1), 129–140 (2025)

2025

[78] [78]

IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21

Zhao, C., Cai, W.L., Yuan, Z., Hu, C.W.: Multi-cropping contrastive learning and domain consistency for unsupervised image-to-image translation. IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21

2025

[79] [79]

In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

Zhao, C., Cai, W., Dong, C., Hu, C.: Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)

2024

[80] [80]

In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Zhao, C., Cai, W., Dong, C., Zeng, Z.: Toward sufficient spatial-frequency in- teraction for gradient-aware underwater image enhancement. In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3220–3224. IEEE (2024)

2024