Temporal and Cross-Modal Alignment for Enhanced Audiovisual Video Captioning
Pith reviewed 2026-07-03 16:35 UTC · model grok-4.3
The pith
TCA-Captioner uses an Observer-Checker-Corrector process to generate training data that binds auditory events to visual entities and respects causal timing in video captions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TCA-Captioner is a framework engineered to enhance temporal and cross-modal alignment for audiovisual video captioning. The Observer-Checker-Corrector (OCC) framework is an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. TCA-Bench applies a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning.
What carries the argument
The Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity training data from a high-density human interaction dataset.
If this is right
- Captions produced by the trained model bind auditory events to the correct visual entities.
- The model captures complex causal dynamics rather than listing events out of sequence.
- TCA-Bench isolates and quantifies separate failures in audiovisual binding versus temporal reasoning.
- The same refinement loop can be reused to produce additional grounded audiovisual training sets.
Where Pith is reading between the lines
- The method could be tested on videos outside human-interaction scenes to check whether the dataset choice drives the gains.
- The iterative correction loop might transfer to other multimodal generation tasks such as video question answering.
- If the generated data truly resolves detachment, the same captions should improve downstream tasks that rely on synchronized audiovisual features.
Load-bearing premise
The assumption that the Observer-Checker-Corrector framework, when applied to a curated high-density human interaction dataset, produces training data that actually resolves modality detachment and temporal incoherence.
What would settle it
Measure audiovisual binding accuracy and temporal relational reasoning scores of TCA-Captioner against prior MLLMs on the TCA-Bench decoupled protocol; a clear gap in favor of TCA-Captioner would support the claim, while comparable or worse scores would falsify it.
Figures
read the original abstract
While Multimodal Large Language Models (MLLMs) have advanced video understanding, achieving precise temporal and cross-modal alignment in audiovisual video captioning remains a formidable challenge. Most existing approaches suffer from modality detachment and temporal incoherence, failing to accurately bind auditory events to visual entities or capture complex causal dynamics. To address these deficiencies, we propose TCA-Captioner, a framework specifically engineered to enhance Temporal and Cross-Modal Alignment for audiovisual video captioning. We first introduce the Observer-Checker-Corrector (OCC) framework, an iterative refinement strategy that generates high-fidelity, meticulously grounded training data. Leveraging a curated high-density human interaction dataset, TCA-Captioner is optimized to model sophisticated audiovisual interactions. Furthermore, we present TCA-Bench, a diagnostic benchmark utilizing a Decoupled Evaluation Protocol to isolate and quantify model proficiency in audiovisual binding and temporal relational reasoning. Extensive experiments demonstrate that TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes TCA-Captioner, a framework for audiovisual video captioning that introduces the Observer-Checker-Corrector (OCC) iterative refinement strategy to generate high-fidelity training data from a curated high-density human interaction dataset. It further presents TCA-Bench, a diagnostic benchmark with a Decoupled Evaluation Protocol to assess audiovisual binding and temporal relational reasoning. The central claim is that extensive experiments show TCA-Captioner sets a new standard for temporally-coherent and synchronized audiovisual narratives, addressing modality detachment and temporal incoherence in multimodal large language models.
Significance. If the experimental claims hold, the work could advance multimodal video understanding by targeting key limitations in temporal and cross-modal alignment. The OCC data-generation approach and TCA-Bench diagnostic protocol represent potentially useful methodological contributions. However, the significance cannot be assessed because the manuscript supplies no quantitative results, ablations, baselines, or error analysis to support the performance assertions.
major comments (1)
- Abstract: the claim that 'extensive experiments demonstrate that TCA-Captioner sets a new standard' is presented without any supporting data, ablation studies, comparison tables, error bars, or quantitative metrics, rendering the central empirical claim unverifiable from the manuscript text.
Simulated Author's Rebuttal
We thank the referee for their review and for identifying the lack of supporting evidence for the empirical claims. We address the major comment below and commit to revisions that will make the experimental support verifiable.
read point-by-point responses
-
Referee: Abstract: the claim that 'extensive experiments demonstrate that TCA-Captioner sets a new standard' is presented without any supporting data, ablation studies, comparison tables, error bars, or quantitative metrics, rendering the central empirical claim unverifiable from the manuscript text.
Authors: We agree that the abstract asserts performance improvements without the quantitative backing being present in the manuscript text. The current version does not supply the required results, ablations, baselines, or metrics. In the revised manuscript we will add the full experimental section with comparison tables, ablation studies, quantitative metrics, and error analysis. The abstract will be updated to reflect only what the new results support. revision: yes
Circularity Check
No circularity detected; no derivation chain present
full rationale
The abstract and query provide no equations, derivations, fitted parameters, or self-citations. Claims rest on introducing the OCC framework and TCA-Bench without any mathematical reduction or load-bearing self-reference visible. Full manuscript details are noted as unavailable, precluding any inspection of potential circular steps. This is the expected non-finding when no technical derivation exists to analyze.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
Abdar, M., Kollati, M., Kuraparthi, S., Pourpanah, F., McDuff, D., Ghavamzadeh, M., Yan, S., Mohamed, A., Khosravi, A., Cambria, E., et al.: A review of deep learning for video captioning. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)
2024
-
[2]
Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[3]
Calibrated Harmonic Overlaid Implicit Neural Representations for Multi-Dimensional Data
Chen, H., Zhang, X., Sun, X., Mingqing, X.: Calibrated harmonic overlaid implicit neural representations for multi-dimensional data. arXiv preprint arXiv:2606.26763 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[4]
IEEE Transactions on Image Processing 35, 5982–5994 (2026)
Chen, L., Wang, J., Pan, Z., Zhu, B., Yang, X., Zhang, C.: Detail++: Training-free detail enhancer for t2i diffusion models. IEEE Transactions on Image Processing 35, 5982–5994 (2026)
2026
-
[5]
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Chen, L., You, T., Liu, H., Bao, Z., Jiao, J., Han, X., Ou, Z., Sun, T., Mou, X., Jin, X., et al.: Echo: Efficient chest x-ray report generation with one-step block diffusion. arXiv preprint arXiv:2604.09450 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[6]
Advances in Neural Information Processing Systems37, 19472–19495 (2024)
Chen, L., Wei, X., Li, J., Dong, X., Zhang, P., Zang, Y., Chen, Z., Duan, H., Tang, Z., Yuan, L., et al.: Sharegpt4video: Improving video understanding and genera- tion with better captions. Advances in Neural Information Processing Systems37, 19472–19495 (2024)
2024
-
[7]
arXiv preprint arXiv:2510.10395 (2025)
Chen, X., Ding, Y., Lin, W., Hua, J., Yao, L., Shi, Y., Li, B., Zhang, Y., Liu, Q., Wan, P., et al.: Avocado: An audiovisual video captioner driven by temporal orchestration. arXiv preprint arXiv:2510.10395 (2025)
-
[8]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)
2024
-
[9]
arXiv preprint arXiv:2308.00303 (2023)
Chen, Z., Gao, R., Xiang, T.Z., Lin, F.: Diffusion model for camouflaged object detection. arXiv preprint arXiv:2308.00303 (2023)
-
[10]
Chen, Z., Li, Y., Wang, H., Chen, Z., Jiang, Z., Li, J., Wang, Q., Yang, J., Tai, Y.: Ragd:Regional-awarediffusionmodelfortext-to-imagegeneration.In:Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 19331–19341 (2025)
2025
-
[11]
L2P: Unlocking Latent Potential for Pixel Generation
Chen, Z., Zhu, J., Chen, X., Zhang, J., Chen, J., Zeng, Z., Zhang, W., Wang, C., Yang, J., Tai, Y.: L2p: Unlocking latent potential for pixel generation. arXiv preprint arXiv:2605.12013 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[12]
arXiv preprint arXiv:2511.18822 (2025)
Chen, Z., Zhu, J., Chen, X., Zhang, J., Hu, X., Zhao, H., Wang, C., Yang, J., Tai, Y.: Dip: Taming diffusion models in pixel space. arXiv preprint arXiv:2511.18822 (2025)
-
[13]
Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Company, T.D.: Tiktok-10m: A large-scale short video dataset for video un- derstanding (2025),https://huggingface.co/datasets/The- data- company/ TikTok-10M, accessed: June 30, 2026
2025
-
[15]
In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV)
Dong, C., Zhao, C., Cai, W., Yang, B., Guo, Y.: O-mamba: O-shape state-space model for underwater image enhancement. In: Chinese Conference on Pattern Recognition and Computer Vision (PRCV). pp. 168–182. Springer (2025) Temporal and Cross-Modal Alignment for AVC 17
2025
-
[16]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Du, S., Zou, Y., Li, J., Liu, M., Li, Y., Shang, C., Shen, Q.: Pansharpening for thin- cloud contaminated remote sensing images: a unified framework and benchmark dataset. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 40, pp. 3696–3704 (2026)
2026
-
[17]
Inter- national Journal of Computer Vision134, 152 (2026)
Du, S., Zou, Y., Wang, Z., Li, X., Li, Y., Shang, C., Shen, Q.: Unsupervised hy- perspectral image super-resolution via self-supervised modality decoupling. Inter- national Journal of Computer Vision134, 152 (2026)
2026
-
[18]
In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Du, Y., Lin, Z., Song, K., Wang, B., Zheng, Z., Ge, T., Zheng, B., Jin, Q.: Vc4vg: Optimizing video captions for text-to-video generation. In: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. pp. 1124–1138 (2025)
2025
-
[19]
In: Proceedings of the Computer Vision and Pattern Recognition Conference
Fan, T., Nan, K., Xie, R., Zhou, P., Yang, Z., Fu, C., Li, X., Yang, J., Tai, Y.: Instancecap:Improvingtext-to-videogenerationviainstance-awarestructuredcap- tion. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 28974–28983 (2025)
2025
-
[20]
Farré, M., Marafioti, A., Tunstall, L., Von Werra, L., Wolf, T.: Finevideo.https: //huggingface.co/datasets/HuggingFaceFV/finevideo(2024), accessed: June 30, 2026
2024
-
[21]
arXiv preprint arXiv:2408.05211 (2024)
Fu, C., Lin, H., Long, Z., Shen, Y., Dai, Y., Zhao, M., Zhang, Y.F., Dong, S., Li, Y., Wang, X., et al.: Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211 (2024)
-
[22]
arXiv preprint arXiv:2507.20939 (2025)
Ge, Y., Ge, Y., Li, C., Wang, T., Pu, J., Li, Y., Qiu, L., Ma, J., Duan, L., Zuo, X., et al.: Arc-hunyuan-video-7b: Structured video comprehension of real-world shorts. arXiv preprint arXiv:2507.20939 (2025)
-
[23]
Gemini-3-Pro Team: Gemini-3-Pro.https://gemini.google.com, accessed: June 30, 2026
2026
-
[24]
arXiv preprint arXiv:2602.08682 (2026)
Guo, Y., Gan, Q., Zhang, Y., Liu, J., Hu, Y., Xie, P., Qian, D., Zhang, Y., Li, R., Zhang, Y., et al.: Alive: Animate your world with lifelike audio-video generation. arXiv preprint arXiv:2602.08682 (2026)
-
[25]
In: Proceedings of the AAAI Conference on Artificial Intelligence
Hu,X.,Tai,Y.,Zhao,X.,Zhao,C.,Zhang,Z.,Li,J.,Zhong,B.,Yang,J.:Exploiting multimodal spatial-temporal patterns for video object tracking. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 3581–3589 (2025)
2025
-
[26]
Huang, G., Mao, J., Huang, F., Liu, F., Luo, X., Liang, Y., Lu, J., Wang, X., Liu, P., Fu, R., Huang, R., Huang, S.L.: Exposure bias can alleviate itself via directional and frequency rectification in flow matching (2026)
2026
-
[27]
Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[28]
Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction
Jiang, T., Wu, L., Xia, S., Li, S., Yan, Z., Yang, H., Qiao, Y., Wang, Y.: Imagine before you predict: Interleaved latent visual reasoning for video event prediction. arXiv preprint arXiv:2606.05769 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
arXiv preprint arXiv:2511.20272 (2025)
Jiang, T., Xia, S., Xu, Y., Wu, L., Zeng, X., Wang, L., Qiao, Y., Wang, Y.: Vknowu: Evaluating visual knowledge understanding in multimodal llms. arXiv preprint arXiv:2511.20272 (2025)
-
[30]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Kim, Y., Abdelrahman, A.S., Abdel-Aty, M.: Vru-accident: A vision-language benchmark for video question answering and dense captioning for accident scene understanding. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 761–771 (2025)
2025
-
[31]
In: International conference on machine learning
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: International conference on machine learning. pp. 19730–19742. PMLR (2023) 18 C. Zhao et al
2023
-
[32]
In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Li, X., Du, S., Zou, Y., Xu, H., Jiang, Z., Liu, J.: Unifusion: A unified image fusion framework with robust representation and source-aware preservation. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 33869–33880 (2026)
2026
-
[33]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Lian, L., Ding, Y., Ge, Y., Liu, S., Mao, H., Li, B., Pavone, M., Liu, M.Y., Darrell, T., Yala, A., et al.: Describe anything: Detailed localized image and video cap- tioning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21766–21777 (2025)
2025
-
[34]
In: Proceedings of the 2024 conference on empirical methods in natural language processing
Lin, B., Ye, Y., Zhu, B., Cui, J., Ning, M., Jin, P., Yuan, L.: Video-llava: Learning united visual representation by alignment before projection. In: Proceedings of the 2024 conference on empirical methods in natural language processing. pp. 5971– 5984 (2024)
2024
-
[35]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: Vila: On pre- training for visual language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26689–26699 (2024)
2024
-
[36]
In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition
Lin, K., Li, L., Lin, C.C., Ahmed, F., Gan, Z., Liu, Z., Lu, Y., Wang, L.: Swinbert: End-to-end transformers with sparse attention for video captioning. In: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 17949–17958 (2022)
2022
-
[37]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tun- ing. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26296–26306 (2024)
2024
-
[38]
DriveVA: Video Action Models are Zero-Shot Drivers
Liu, M., Zhang, D., Liu, J., Cui, J., Xie, H., Chen, G., Ye, H., Yang, M.Y., Nex, F., Cheng, H.: Driveva: Video action models are zero-shot drivers. arXiv preprint arXiv:2604.04198 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[39]
Liu, Y., Li, S., Liu, Y., Wang, Y., Ren, S., Li, L., Chen, S., Sun, X., Hou, L.: Tem- pcompass: Do video llms really understand videos? In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 8731–8772 (2024)
2024
-
[40]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Lu, J., Clark, C., Lee, S., Zhang, Z., Khosla, S., Marten, R., Hoiem, D., Kembhavi, A.: Unified-io 2: Scaling autoregressive multimodal models with vision language audio and action. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26439–26455 (2024)
2024
- [41]
-
[42]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Lu, S., Liu, Y., Kong, A.W.K.: Tf-icon: Diffusion-based training-free cross-domain image composition. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2294–2305 (2023)
2023
-
[43]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Lu, S., Wang, Z., Li, L., Liu, Y., Kong, A.W.K.: Mace: Mass concept erasure in diffusion models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6430–6440 (2024)
2024
-
[44]
arXiv preprint arXiv:2410.18775 (2024)
Lu, S., Zhou, Z., Lu, J., Zhu, Y., Kong, A.W.K.: Robust watermarking using gener- ative priors against image editing: From benchmarking to advances. arXiv preprint arXiv:2410.18775 (2024)
-
[45]
arXiv preprint arXiv:2510.12720 (2025)
Ma, Z., Xu, R., Xing, Z., Chu, Y., Wang, Y., He, J., Xu, J., Heng, P.A., Yu, K., Lin, J., et al.: Omni-captioner: Data pipeline, models, and benchmark for omni detailed perception. arXiv preprint arXiv:2510.12720 (2025)
-
[46]
In: International Conference on Learning Representations
Nan, K., Xie, R., Zhou, P., Fan, T., Yang, Z., Chen, Z., Li, X., Yang, J., Tai, Y.: Openvid-1m: A large-scale high-quality dataset for text-to-video generation. In: International Conference on Learning Representations. vol. 2025, pp. 1045–1064 (2025) Temporal and Cross-Modal Alignment for AVC 19
2025
-
[47]
Nguyen, L.T.P., Yu, Z., Hang, S.L.Y., An, S., Lee, J., Ban, Y., Chung, S., Nguyen, T.H., Maeng, J., Lee, S., et al.: See, hear, and understand: Benchmarking audio- visual human speech understanding in multimodal large language models. arXiv preprint arXiv:2512.02231 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
ACM Computing Surveys57(6), 1–36 (2025)
Qasim, I., Horsch, A., Prasad, D.: Dense video captioning: A survey of techniques, datasets and evaluation protocols. ACM Computing Surveys57(6), 1–36 (2025)
2025
-
[49]
In: Companion Proceedings of the ACM on Web Conference 2025
Shang, Y., Gao, C., Li, N., Li, Y.: A large-scale dataset with behavior, attributes, and content of mobile short-video platform. In: Companion Proceedings of the ACM on Web Conference 2025. pp. 793–796 (2025)
2025
-
[50]
In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp
Su, Y., Lan, T., Li, H., Xu, J., Wang, Y., Cai, D.: Pandagpt: One model to instruction-follow them all. In: Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the era of Interactive Assistants! pp. 11–23 (2023)
2023
-
[51]
arXiv preprint arXiv:2506.15220 (2025)
Tang, C., Li, Y., Yang, Y., Zhuang, J., Sun, G., Li, W., Ma, Z., Zhang, C.: video- salmonn 2: Caption-enhanced audio-visual large language models. arXiv preprint arXiv:2506.15220 (2025)
-
[52]
SALMONN: Towards Generic Hearing Abilities for Large Language Models
Tang, C., Yu, W., Sun, G., Chen, X., Tan, T., Li, W., Lu, L., Ma, Z., Zhang, C.: Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[53]
arXiv preprint arXiv:2510.22200 (2025)
Team, M.L., Cai, X., Huang, Q., Kang, Z., Li, H., Liang, S., Ma, L., Ren, S., Wei, X., Xie, R., et al.: Longcat-video technical report. arXiv preprint arXiv:2510.22200 (2025)
-
[54]
Team, S.: Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier (2024)
2024
-
[55]
In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS)
Varghese, R., Sambath, M.: Yolov8: A novel object detection algorithm with en- hanced performance and robustness. In: 2024 International conference on advances in data engineering and intelligent computing systems (ADICS). pp. 1–6. IEEE (2024)
2024
-
[56]
arXiv preprint arXiv:2407.00634 (2024)
Wang, J., Yuan, L., Zhang, Y., Sun, H.: Tarsier: Recipes for training and evaluating large video description models. arXiv preprint arXiv:2407.00634 (2024)
-
[57]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[58]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Wang, X., Hua, J., Lin, W., Zhang, Y., Zhang, F., Wu, J., Zhang, D., Nie, L.: Haic: Improving human action understanding and generation with better captions for multi-modal large language models. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 10158–10181 (2025)
2025
-
[60]
arXiv preprint arXiv:2507.11336 (2025)
Wu, P., Liu, Y., Zhu, Z., Zhou, E., Shen, J.: Ugc-videocaptioner: An omni ugc video detail caption model and new benchmarks. arXiv preprint arXiv:2507.11336 (2025)
-
[61]
arXiv preprint arXiv:2404.01717 (2024)
Xie, R., Zhao, C., Zhang, K., Zhang, Z., Zhou, J., Yang, J., Tai, Y.: Addsr: Acceler- ating diffusion-based blind super-resolution with adversarial diffusion distillation. arXiv preprint arXiv:2404.01717 (2024)
-
[62]
Xu, J., Guo, Z., He, J., Hu, H., He, T., Bai, S., Chen, K., Wang, J., Fan, Y., Dang, K., Zhang, B., Wang, X., Chu, Y., Lin, J.: Qwen2.5-omni technical report (2025)
2025
-
[63]
Xu, J., Guo, Z., Hu, H., Chu, Y., Wang, X., He, J., Wang, Y., Shi, X., He, T., Zhu, X., et al.: Qwen3-omni technical report. arXiv preprint arXiv:2509.17765 (2025) 20 C. Zhao et al
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[64]
arXiv preprint arXiv:2506.21277 (2025)
Yang, Q., Yao, S., Chen, W., Fu, S., Bai, D., Zhao, J., Sun, B., Yin, B., Wei, X., Zhou, J.: Humanomniv2: From understanding to omni-modal reasoning with context. arXiv preprint arXiv:2506.21277 (2025)
-
[65]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yao, Y., Yu, T., Zhang, A., Wang, C., Cui, J., Zhu, H., Cai, T., Li, H., Zhao, W., He, Z., et al.: Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[66]
arXiv preprint arXiv:2501.07888 (2025)
Yuan, L., Wang, J., Sun, H., Zhang, Y., Lin, Y.: Tarsier2: Advancing large vision- language models from detailed video description to comprehensive video under- standing. arXiv preprint arXiv:2501.07888 (2025)
-
[67]
In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Zhan, J., Dai, J., Ye, J., Zhou, Y., Zhang, D., Liu, Z., Zhang, X., Yuan, R., Zhang, G., Li, L., et al.: Anygpt: Unified multimodal llm with discrete sequence modeling. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 9637–9662 (2024)
2024
-
[68]
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Zhang, B., Li, K., Cheng, Z., Hu, Z., Yuan, Y., Chen, G., Leng, S., Jiang, Y., Zhang, H., Li, X., et al.: Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[69]
In: Findings of the Association for Computational Linguistics: EMNLP 2023
Zhang, D., Li, S., Zhang, X., Zhan, J., Wang, P., Zhou, Y., Qiu, X.: Speechgpt: Empowering large language models with intrinsic cross-modal conversational abil- ities. In: Findings of the Association for Computational Linguistics: EMNLP 2023. pp. 15757–15773 (2023)
2023
-
[70]
In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)
Zhang, H., Li, X., Bing, L.: Video-llama: An instruction-tuned audio-visual lan- guage model for video understanding. In: Proceedings of the 2023 conference on empiricalmethodsinnaturallanguageprocessing:systemdemonstrations.pp.543– 553 (2023)
2023
-
[71]
In: European Conference on Computer Vision
Zhang, L., Meng, W., Zhong, Y., Kong, B., Xu, M., Du, J., Wang, X., Wang, R., Liu, L.: U-cope: Taking a further step to universal 9d category-level object pose estimation. In: European Conference on Computer Vision. pp. 254–270. Springer (2025)
2025
-
[72]
IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)
Zhang, L., Meng, X., Liu, L., Jiang, H., Wang, J., Wang, R., Lu, C., Liu, J., Zhang, H.: Probing effective and efficient category-level articulated object pose perception. IEEE Transactions on Pattern Analysis and Machine Intelligence (2026)
2026
-
[73]
Zhang, L., Xu, M., Wang, J., Yu, Q., Yang, L., Li, Y., Lu, C., Wang, R., Liu, L.: Gapt-dar:Category-levelgarmentsposetrackingviaintegrated2ddeformationand 3dreconstruction.In:ProceedingsoftheComputerVisionandPatternRecognition Conference. pp. 22638–22647 (2025)
2025
-
[74]
In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)
Zhang, L., Zhong, Y., Wang, J., Min, Z., Liu, L., et al.: Rethinking 3d convolution inℓ p-norm space. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024)
2024
-
[75]
arXiv preprint arXiv:2506.13642 (2025)
Zhang, S., Guo, S., Fang, Q., Zhou, Y., Feng, Y.: Stream-omni: Simultaneous multimodal interactions with large language-vision-speech model. arXiv preprint arXiv:2506.13642 (2025)
-
[76]
LLaVA-Video: Video Instruction Tuning With Synthetic Data
Zhang, Y., Wu, J., Li, W., Li, B., Ma, Z., Liu, Z., Li, C.: Llava-video: Video instruction tuning with synthetic data. arXiv preprint arXiv:2410.02713 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[77]
The Visual Computer41(1), 129–140 (2025)
Zhao, C., Cai, W.L., Yuan, Z.: Spectral normalization and dual contrastive reg- ularization for image-to-image translation. The Visual Computer41(1), 129–140 (2025)
2025
-
[78]
IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21
Zhao, C., Cai, W.L., Yuan, Z., Hu, C.W.: Multi-cropping contrastive learning and domain consistency for unsupervised image-to-image translation. IET Image Pro- cessing19(1), e70006 (2025) Temporal and Cross-Modal Alignment for AVC 21
2025
-
[79]
In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)
Zhao, C., Cai, W., Dong, C., Hu, C.: Wavelet-based fourier information interaction with frequency diffusion adjustment for underwater image restoration. In: Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (June 2024)
2024
-
[80]
In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Zhao, C., Cai, W., Dong, C., Zeng, Z.: Toward sufficient spatial-frequency in- teraction for gradient-aware underwater image enhancement. In: ICASSP 2024- 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). pp. 3220–3224. IEEE (2024)
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.