MOSS-Video-Preview: Toward Real-Time Video Understanding via Cross-Attention
Pith reviewed 2026-06-28 15:20 UTC · model grok-4.3
The pith
A cross-attention backbone lets visual features enter through a side channel so perception and generation run on separate non-blocking pathways.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Perception must not be blocked by generation; its natural realization is a two-channel architecture in which visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways, reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression.
What carries the argument
Cross-attention backbone that routes visual features through a side channel separate from the autoregressive text sequence.
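A minimal sketch of the side-channel idea, assuming single-head attention with no learned projections. The function names, shapes, and random inputs are hypothetical and do not come from the paper. Text tokens attend causally to themselves, then read the visual features through cross-attention, so the visual features never join the autoregressive sequence.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v, mask=None):
    # Scaled dot-product attention; mask is True where attending is allowed.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)
    return softmax(scores) @ v

def two_channel_block(text, visual):
    # Channel 1: causal self-attention over text tokens only.
    t = text.shape[0]
    causal = np.tril(np.ones((t, t), dtype=bool))
    h = text + attention(text, text, text, causal)
    # Channel 2: cross-attention reads the visual side channel.
    return h + attention(h, visual, visual)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))     # 4 text tokens, width 8 (illustrative)
visual = rng.normal(size=(32, 8))  # 32 visual features (illustrative)
out = two_channel_block(text, visual)
print(out.shape)
```

The output has one row per text token regardless of how many visual features are supplied, which is the property the side channel relies on.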
If this is right
- Visual processing frequency drops because frames no longer enter the main sequence.
- A clean channel-wise interface appears that supports independent compression of the vision stream.
- The model acquires behaviors absent from offline models: continuous perception, answer revision on new evidence, and timely silence.
- Time to first token improves by about 5x and decoding throughput rises by 2.7x on a single H200 with 256 frames per video.
- Offline video and multimodal understanding remain competitive, although the model trails the Qwen2.5-VL-7B baseline overall.
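To make the first bullet concrete, here is a rough count of attention score pairs in a single layer under each design. It is only an illustration: the 64 text tokens and 16 tokens per frame are made-up values, only the 256 frames come from the reported setting, and the count ignores feed-forward layers, the vision encoder, and caching. It does not reproduce or explain the reported 5x and 2.7x figures.

```python
def causal_pairs(n):
    # Number of (query, key) pairs in causal self-attention over n tokens.
    return n * (n + 1) // 2

def decoder_only_pairs(text_tokens, frames, tokens_per_frame):
    # Visual tokens are placed in the autoregressive sequence itself.
    return causal_pairs(text_tokens + frames * tokens_per_frame)

def side_channel_pairs(text_tokens, frames, tokens_per_frame):
    # Text attends causally to text, then each text token reads every
    # visual feature once through cross-attention.
    visual = frames * tokens_per_frame
    return causal_pairs(text_tokens) + text_tokens * visual

# Hypothetical sizes: 64 text tokens, 16 tokens per frame.
a = decoder_only_pairs(64, 256, 16)
b = side_channel_pairs(64, 256, 16)
print(a, b)
```

Under these assumed sizes the main sequence stays at 64 tokens in the side-channel design instead of growing with every frame.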
Where Pith is reading between the lines
- The same side-channel design could be applied to audio or other streaming modalities without retraining the language backbone.
- Independent compression opens the possibility of running the vision encoder at a lower rate or on a separate device.
- Real-time revision behavior may transfer to live camera feeds or interactive agents once the data pipeline is adapted.
Load-bearing premise
Converting dense captions into real-time QA pairs whose answers are revised to match only what the model has perceived so far will produce genuine real-time behavior when an offline model is specialized on them.
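One way to picture this premise is the sketch below, which assumes dense captions arrive as timestamped segments. The `Segment` type, the checkpoint times, and the rule "stay silent when nothing new has been seen" are illustrative assumptions; this page does not describe the paper's actual pipeline at that level of detail.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    end: float    # time (s) by which this caption's content has been shown
    caption: str

def realtime_targets(segments, checkpoints):
    """Build one training target per checkpoint time.

    The answer uses only segments that ended at or before the checkpoint.
    None stands for staying silent because no new evidence has arrived.
    """
    targets, last_seen = [], 0
    for t in checkpoints:
        seen = [s.caption for s in segments if s.end <= t]
        if len(seen) > last_seen:
            targets.append((t, " ".join(seen)))
            last_seen = len(seen)
        else:
            targets.append((t, None))
    return targets

segments = [
    Segment(2.0, "A man picks up a cup."),
    Segment(5.0, "He pours the coffee into the sink."),
]
targets = realtime_targets(segments, [1.0, 3.0, 4.0, 6.0])
for t, answer in targets:
    print(t, answer)
```

The target at 3.0 s never mentions the sink, because that caption describes frames the model has not yet perceived. Leakage of such future information is exactly what the referee's first major comment asks the authors to rule out.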
What would settle it
A controlled run in which the fine-tuned model either processes every new frame at the same rate as token generation or fails to revise an answer once contradictory frames arrive.
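The second half of that test could be checked mechanically along the following lines. The log format, a time-ordered list of `(time, answer)` pairs with `None` for silence, and the `max_delay` window are hypothetical. The check only detects that the answer changed after the contradictory frames, not that the new answer is correct.

```python
def revises_after(outputs, t_contra, max_delay):
    """Return True if the model changes its answer soon after a contradiction.

    outputs: time-ordered list of (time, answer); answer None means silence.
    t_contra: time at which contradictory frames arrive.
    max_delay: how long after t_contra a revision still counts.
    """
    before = [a for t, a in outputs if t < t_contra and a is not None]
    if not before:
        return False  # no earlier answer exists, so nothing can be revised
    stale = before[-1]
    return any(
        a is not None and a != stale
        for t, a in outputs
        if t_contra <= t <= t_contra + max_delay
    )

log = [(1.0, "the cup is full"), (2.0, None), (3.5, "the cup is empty")]
print(revises_after(log, 3.0, 1.0), revises_after(log, 3.0, 0.2))
```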
Original abstract
Video understanding is shifting from the offline paradigm -- taking a fully recorded video as input and producing a single answer after it ends -- toward real-time interaction, in which the model perceives new frames while still replying, revises its answer as new evidence appears, and remains silent when there is nothing to say. We present MOSS-Video-Preview to validate this paradigm. Our central claim is that perception must not be blocked by generation; its natural realization is a two-channel architecture. We argue that a cross-attention backbone is better suited to real-time vision-language fusion than the prevailing decoder-only design: visual features enter through a side channel rather than joining the autoregressive sequence, so perception and generation run on separate, non-blocking pathways -- reducing the frequency of visual processing and exposing a clean channel-wise interface for independent compression. We complement this with a data synthesis pipeline that converts dense captions into real-time understanding QA whose answers are revised to match what the model has perceived so far, and we specialize an offline model on these data to elicit real-time behavior. Our model trails the strong Qwen2.5-VL-7B baseline overall -- a gap we attribute primarily to data and scale rather than the architecture -- yet attains competitive offline video and multimodal understanding, remains robust on the spatial and fine-grained temporal reasoning central to real-time use, and acquires behaviors that offline models lack: continuous perception, answer revision, and timely silence. On a single H200 with 256 frames per video, it achieves about a 5x speedup in time to first token and 2.7x higher decoding throughput, with negligible degradation in offline ability. Our study of paradigm, architecture, and data outlines a viable path toward real-time video understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents MOSS-Video-Preview, a two-channel cross-attention architecture for real-time video understanding. It argues that visual features should enter via a side channel rather than the autoregressive sequence, enabling non-blocking perception and generation pathways that reduce visual processing frequency and allow independent compression. A data synthesis pipeline converts dense captions into real-time QA (with answers revised to match perceived frames so far) to specialize an offline model and elicit behaviors such as continuous perception, answer revision, and timely silence. The model trails Qwen2.5-VL-7B overall (gap attributed to data/scale) but achieves competitive offline performance, remains robust on spatial and fine-grained temporal reasoning, and delivers ~5x TTFT speedup and 2.7x decoding throughput on a single H200 with 256 frames per video.
Significance. If the architecture and synthesis approach hold, the work outlines a concrete path to efficient real-time vision-language models by separating perception and generation channels, with reported speedups and modularity benefits for compression. The explicit two-channel design and the attempt to induce streaming behaviors from offline pretraining are notable contributions, though the absence of detailed quantitative results, error analysis, or ablations limits the strength of the evidence for the central paradigm claim.
major comments (3)
- [Abstract and §3] Abstract and §3 (data synthesis pipeline): the behavioral claims (continuous perception, answer revision, timely silence) rest on the synthesis step that revises answers to match perceived frames so far, yet no ablation or comparison to standard fine-tuning is reported to show that this elicits genuine incremental evidence handling under streaming uncertainty rather than pattern-matching from dense captions that contain future information.
- [Abstract] Abstract: the claim that the cross-attention design is 'better suited to real-time vision-language fusion' is undercut by the model trailing the Qwen2.5-VL-7B baseline overall, with the performance gap attributed to data and scale without isolating the architecture's contribution via controlled experiments.
- [Abstract] Abstract: while 5x TTFT and 2.7x throughput gains are reported, no detailed quantitative results, error analysis, or per-task breakdowns are supplied to support that the model 'remains robust on the spatial and fine-grained temporal reasoning central to real-time use.'
minor comments (2)
- [Methods] Notation for the two-channel cross-attention (visual side channel vs. language autoregressive path) should be formalized with equations in the methods section for reproducibility.
- [Experiments] The manuscript would benefit from a table comparing real-time behaviors (revision frequency, silence rate) against decoder-only baselines on the same synthetic QA.
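The two quantities named in the last comment could be computed as sketched below. The definitions are assumptions made for illustration, since the report does not define them: silence rate is the share of checkpoints with no output, and revision frequency is the number of changes between consecutive spoken answers divided by the number of checkpoints.

```python
def streaming_metrics(answers):
    """answers: one entry per checkpoint; None means the model stayed silent."""
    n = len(answers)
    if n == 0:
        return {"silence_rate": 0.0, "revision_frequency": 0.0}
    silent = sum(a is None for a in answers)
    spoken = [a for a in answers if a is not None]
    # A revision is a spoken answer that differs from the previous spoken one.
    revisions = sum(cur != prev for prev, cur in zip(spoken, spoken[1:]))
    return {"silence_rate": silent / n, "revision_frequency": revisions / n}

metrics = streaming_metrics([None, "A", None, "A", "B", None])
print(metrics)
```

Running the same function on outputs from a decoder-only baseline and on the same synthetic QA would give the side-by-side rows the referee asks for.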
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below with clarifications on our contributions and indicate where revisions to the manuscript are planned.
- Referee: [Abstract and §3] Abstract and §3 (data synthesis pipeline): the behavioral claims (continuous perception, answer revision, timely silence) rest on the synthesis step that revises answers to match perceived frames so far, yet no ablation or comparison to standard fine-tuning is reported to show that this elicits genuine incremental evidence handling under streaming uncertainty rather than pattern-matching from dense captions that contain future information.
  Authors: The synthesis pipeline explicitly constructs QA pairs by revising answers to align only with frames perceived up to the current timestep, a step that standard fine-tuning on full dense captions does not perform. This design targets incremental reasoning under partial information. We agree that an explicit ablation against unmodified fine-tuning would provide additional support and will add a discussion of this distinction in the revised manuscript, along with any feasible comparative results.
  Revision: partial
- Referee: [Abstract] Abstract: the claim that the cross-attention design is 'better suited to real-time vision-language fusion' is undercut by the model trailing the Qwen2.5-VL-7B baseline overall, with the performance gap attributed to data and scale without isolating the architecture's contribution via controlled experiments.
  Authors: The suitability claim rests on the architectural separation of perception and generation channels, which directly enables the non-blocking pathways and the measured efficiency gains (5x TTFT, 2.7x throughput). The overall accuracy comparison is to a larger-scale model trained under different data regimes; we attribute the gap primarily to those factors rather than the backbone. A controlled same-data, same-scale isolation experiment is computationally prohibitive at this stage and is noted as future work, but the paper supplies the design rationale and concrete efficiency evidence.
  Revision: no
- Referee: [Abstract] Abstract: while 5x TTFT and 2.7x throughput gains are reported, no detailed quantitative results, error analysis, or per-task breakdowns are supplied to support that the model 'remains robust on the spatial and fine-grained temporal reasoning central to real-time use.'
  Authors: The abstract condenses the offline evaluation results reported in the main body, which include competitive performance on spatial and temporal tasks. We concur that expanded per-task breakdowns and error analysis would strengthen the robustness statement and will incorporate additional quantitative details and breakdowns from our existing experiments in the revised manuscript.
  Revision: yes
Circularity Check
No circularity: architecture motivated by independent design arguments; data synthesis is a separate training method
The paper's central derivation is a design argument that cross-attention enables non-blocking perception-generation pathways via a side channel, stated directly in the abstract and introduction without reference to fitted quantities or self-referential equations. The data synthesis pipeline that converts dense captions into revised QA is presented as an empirical complement to elicit behaviors, not as a prediction that reduces to its own inputs by construction. No self-citation load-bearing steps, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. Performance claims are benchmarked externally against Qwen2.5-VL-7B and attributed to data/scale differences, keeping the chain self-contained.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Cross-attention can effectively fuse vision and language in a non-blocking manner for real-time tasks.
invented entities (1)
- Two-channel cross-attention architecture for real-time video (no independent evidence)