Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding
Pith reviewed 2026-05-16 04:12 UTC · model grok-4.3
The pith
Molmo2 releases new open video and multi-image datasets plus a training recipe that lets an 8B model outperform other open-weight VLMs on video tasks and beat some proprietary models on pixel grounding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Molmo2 is a family of vision-language models trained from scratch on seven newly collected video datasets and two multi-image datasets, all created without proprietary VLMs, together with an efficient packing scheme, message-tree encoding, bi-directional attention on vision tokens, and a token-weight strategy. The resulting 8B model outperforms other open-weight and open-data models on short-video understanding, counting, and captioning while remaining competitive on long videos, and records large gains on grounding benchmarks: 35.5 versus 29.6 accuracy on video counting against Qwen3-VL, and 38.4 versus 20.0 F1 on video pointing and 56.2 versus 41.1 J&F on video tracking against Gemini 3 Pro.
What carries the argument
Nine newly collected open datasets (seven video, two multi-image) paired with a training recipe that uses message-tree encoding, bi-directional attention on vision tokens, and a novel token-weight strategy.
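The abstract credits bi-directional attention on vision tokens as one ingredient but gives no implementation. A minimal sketch of the general idea, assuming a flat token sequence with a boolean mask marking vision positions; the function name and shapes are illustrative, not Molmo2's released code:

```python
import torch

def build_attention_mask(is_vision: torch.Tensor) -> torch.Tensor:
    """Causal attention for text tokens, bi-directional attention among
    vision tokens. `is_vision` is a bool tensor of shape (T,); returns a
    (T, T) bool mask where True means position i may attend to position j."""
    T = is_vision.shape[0]
    causal = torch.ones(T, T).tril().bool()
    # Vision-vision pairs may attend in both directions; a per-image
    # variant would additionally require both tokens to share an image id.
    vision_pairs = is_vision[:, None] & is_vision[None, :]
    return causal | vision_pairs

# Example: one image of three vision tokens embedded in a text sequence.
mask = build_attention_mask(torch.tensor([False, True, True, True, False]))
```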
If this is right
- Open researchers can now iterate on the released data and recipe without needing access to closed VLMs.
- Downstream applications that require pixel-level pointing or tracking in video become feasible with fully open models.
- The same data-collection and encoding approach can be scaled to larger models while remaining fully reproducible.
- Video grounding benchmarks gain stronger open baselines that proprietary systems must now surpass.
Where Pith is reading between the lines
- The released pointing and tracking datasets could serve as training targets for future models that output masks or trajectories directly.
- Because the data are collected without proprietary teachers, the same pipeline may transfer to domains where synthetic distillation is currently blocked by policy or cost.
- The efficiency gains from message-tree encoding and token weighting may generalize to other long-context multimodal training runs.
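The token-weight strategy named in that last bullet is only named in the abstract, not specified. A minimal sketch of the generic mechanism it most plausibly resembles, a per-token weighted language-modeling loss (the weight values themselves, e.g. down-weighting boilerplate and up-weighting grounding tokens, are hypothetical):

```python
import torch
import torch.nn.functional as F

def weighted_lm_loss(logits: torch.Tensor, targets: torch.Tensor,
                     weights: torch.Tensor) -> torch.Tensor:
    """Cross-entropy with a non-negative weight per target token.
    logits: (T, V); targets: (T,) long; weights: (T,)."""
    per_token = F.cross_entropy(logits, targets, reduction="none")
    return (weights * per_token).sum() / weights.sum().clamp(min=1e-8)
```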
Load-bearing premise
The newly collected datasets are high-quality, diverse, and contain no leakage or bias that would inflate performance over baselines.
What would settle it
An independent team retrains a comparable 8B model on only publicly available datasets and measures no gap on the reported video-counting, pointing, or tracking metrics.
read the original abstract
Today's strongest video-language models (VLMs) remain proprietary. The strongest open-weight models either rely on synthetic data from proprietary VLMs, effectively distilling from them, or do not disclose their training data or recipe. As a result, the open-source community lacks the foundations needed to improve on the state-of-the-art video (and image) language models. Crucially, many downstream applications require more than just high-level video understanding; they require grounding -- either by pointing or by tracking in pixels. Even proprietary models lack this capability. We present Molmo2, a new family of VLMs that are state-of-the-art among open-source models and demonstrate exceptional new capabilities in point-driven grounding in single image, multi-image, and video tasks. Our key contribution is a collection of 7 new video datasets and 2 multi-image datasets, including a dataset of highly detailed video captions for pre-training, a free-form video Q&A dataset for fine-tuning, a new object tracking dataset with complex queries, and an innovative new video pointing dataset, all collected without the use of closed VLMs. We also present a training recipe for this data utilizing an efficient packing and message-tree encoding scheme, and show bi-directional attention on vision tokens and a novel token-weight strategy improves performance. Our best-in-class 8B model outperforms others in the class of open weight and data models on short videos, counting, and captioning, and is competitive on long-videos. On video-grounding Molmo2 significantly outperforms existing open-weight models like Qwen3-VL (35.5 vs 29.6 accuracy on video counting) and surpasses proprietary models like Gemini 3 Pro on some tasks (38.4 vs 20.0 F1 on video pointing and 56.2 vs 41.1 J&F on video tracking).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Molmo2, a family of open-weight vision-language models (including an 8B variant) trained on newly collected open datasets for video understanding, captioning, counting, and pixel-level grounding tasks. It introduces 7 new video datasets and 2 multi-image datasets collected without proprietary VLMs, plus a training recipe using efficient packing, message-tree encoding, bi-directional vision-token attention, and a novel token-weighting strategy. The central claim is that the 8B model outperforms other open-weight models on short-video tasks and grounding (e.g., 35.5 vs. 29.6 accuracy on video counting against Qwen3-VL) and surpasses some proprietary models on pointing and tracking (e.g., 38.4 vs. 20.0 F1 on video pointing against Gemini 3 Pro).
Significance. If the dataset quality and leakage-free status hold, the work supplies valuable open weights, data, and recipes for video VLMs with grounding capabilities, which remain rare in open-source settings. This directly addresses the gap noted in the abstract where open models either distill from closed systems or withhold data details, potentially enabling reproducible community advances on video grounding.
major comments (2)
- [§4] Dataset Collection: The paper asserts that the 7 new video datasets (detailed captions, free-form QA, object tracking, video pointing) were collected without closed VLMs and are high-quality, yet supplies no collection protocol, annotation guidelines, diversity statistics, inter-annotator agreement scores, or decontamination steps against existing benchmarks. This directly undermines the headline performance deltas (e.g., 35.5 vs 29.6 on video counting), as any test-set overlap or annotation bias would make the gains artifacts rather than evidence of a superior open recipe.
- [§5] Experiments and Evaluation: The reported comparisons (38.4 vs 20.0 F1 on video pointing; 56.2 vs 41.1 J&F on video tracking) lack full disclosure of baseline re-implementations, exact evaluation prompts, metric definitions, or code for the packing/message-tree scheme. Without these, it is impossible to confirm that the gains over Qwen3-VL and Gemini 3 Pro are robust rather than arising from post-hoc protocol choices or implementation differences.
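To make the metric-definition point concrete: a video-pointing F1 depends entirely on the matching rule and distance threshold, neither of which is disclosed. One plausible definition, sketched with an assumed greedy nearest-neighbor match in normalized image coordinates:

```python
import numpy as np

def point_f1(pred, gt, thresh=0.05):
    """F1 for predicted points vs. ground truth: each ground-truth point
    greedily claims the nearest unused prediction within `thresh`
    (normalized coordinates). Threshold and matching rule are assumptions,
    not the paper's disclosed protocol."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    used, tp = np.zeros(len(pred), dtype=bool), 0
    for g in gt:
        if len(pred) == 0:
            break
        d = np.linalg.norm(pred - g, axis=1)
        d[used] = np.inf                  # each prediction matches once
        j = int(np.argmin(d))
        if d[j] <= thresh:
            used[j], tp = True, tp + 1
    p = tp / max(len(pred), 1)
    r = tp / max(len(gt), 1)
    return 2 * p * r / max(p + r, 1e-8)
```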
minor comments (2)
- [Abstract and §3.2] The abstract and §3.2 mention 'bi-directional attention on vision tokens' and 'novel token-weight strategy' without a clear statement of whether these are incremental improvements on existing mechanisms or fully new; a short ablation table would clarify their individual contributions.
- [Table 1] Table 1 (model comparisons) reports numeric results but does not include standard deviations or the number of evaluation runs, which is standard for grounding metrics like J&F and F1.
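For orientation, J&F on tracking conventionally follows DAVIS: the mean of region similarity J (mask IoU) and a boundary F-measure. A simplified sketch that approximates boundaries by morphological erosion; the official DAVIS implementation differs in detail:

```python
import numpy as np
from scipy.ndimage import binary_erosion, binary_dilation

def region_j(pred, gt):
    """Region similarity J: IoU of two boolean masks."""
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def boundary_f(pred, gt, tol=2):
    """Simplified boundary F: boundary pixels (mask minus its erosion)
    count as matched if within `tol` pixels of the other boundary."""
    pb, gb = pred & ~binary_erosion(pred), gt & ~binary_erosion(gt)
    if not pb.any() and not gb.any():
        return 1.0
    p = (pb & binary_dilation(gb, iterations=tol)).sum() / max(pb.sum(), 1)
    r = (gb & binary_dilation(pb, iterations=tol)).sum() / max(gb.sum(), 1)
    return 2 * p * r / max(p + r, 1e-8)

def j_and_f(pred_masks, gt_masks):
    """J&F over a video: mean of per-frame J and per-frame F."""
    js = [region_j(p, g) for p, g in zip(pred_masks, gt_masks)]
    fs = [boundary_f(p, g) for p, g in zip(pred_masks, gt_masks)]
    return (np.mean(js) + np.mean(fs)) / 2
```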
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity and reproducibility. We will revise the manuscript to address both major points by adding the requested details.
read point-by-point responses
- Referee: [§4] Dataset Collection: The paper asserts that the 7 new video datasets (detailed captions, free-form QA, object tracking, video pointing) were collected without closed VLMs and are high-quality, yet supplies no collection protocol, annotation guidelines, diversity statistics, inter-annotator agreement scores, or decontamination steps against existing benchmarks. This directly undermines the headline performance deltas (e.g., 35.5 vs 29.6 on video counting), as any test-set overlap or annotation bias would make the gains artifacts rather than evidence of a superior open recipe.
Authors: We acknowledge that the manuscript provides insufficient detail on the data collection process. In the revised version we will expand §4 with a dedicated subsection describing the full collection protocol, annotation guidelines provided to workers, diversity statistics across video sources and query types, inter-annotator agreement metrics where applicable, and the exact decontamination procedure used to verify no overlap with existing benchmarks. We will also document the human-only annotation pipeline that avoided any closed VLMs. These additions will allow readers to evaluate the quality and leakage-free status of the datasets directly. revision: yes
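The promised decontamination procedure could take many forms; for video, one cheap screen is perceptual hashing of sampled frames. A self-contained sketch using an average hash, with the hash size, Hamming threshold, and function names all hypothetical rather than the authors' pipeline:

```python
import numpy as np

def ahash(frame: np.ndarray, size: int = 8) -> int:
    """Average hash of a grayscale frame (H, W), assumed at least
    size x size: block-average down to size x size, threshold at the
    mean, pack the bits into an int."""
    h, w = frame.shape
    crop = frame[: h - h % size, : w - w % size]
    small = crop.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    bits = (small > small.mean()).flatten()
    return int("".join("1" if b else "0" for b in bits), 2)

def flag_overlaps(train_frames, bench_frames, max_hamming=4):
    """Indices of training frames whose hash is within `max_hamming` bits
    of any benchmark frame: a coarse leakage screen, not proof of overlap."""
    bench = [ahash(f) for f in bench_frames]
    flagged = []
    for i, f in enumerate(train_frames):
        hf = ahash(f)
        if any(bin(hf ^ hb).count("1") <= max_hamming for hb in bench):
            flagged.append(i)
    return flagged
```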
- Referee: [§5] Experiments and Evaluation: The reported comparisons (38.4 vs 20.0 F1 on video pointing; 56.2 vs 41.1 J&F on video tracking) lack full disclosure of baseline re-implementations, exact evaluation prompts, metric definitions, or code for the packing/message-tree scheme. Without these, it is impossible to confirm that the gains over Qwen3-VL and Gemini 3 Pro are robust rather than arising from post-hoc protocol choices or implementation differences.
Authors: We agree that full reproducibility requires these details. In the revised manuscript we will augment §5 with (i) exact evaluation prompts for each task, (ii) precise metric definitions and implementation notes, (iii) descriptions of how each baseline was re-implemented (including any prompt adaptations), and (iv) a detailed explanation plus pseudocode for the packing and message-tree encoding scheme. We will also release the corresponding evaluation code upon acceptance. These changes will enable independent verification that the reported improvements are robust. revision: yes
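Pending the promised pseudocode, one plausible shape for the packing half of the scheme is first-fit-decreasing bin packing of variable-length samples into fixed-capacity training sequences (the message-tree encoding, which shares a common prefix across branches of a conversation, is a separate mechanism not sketched here):

```python
def pack_first_fit(lengths, max_len):
    """Pack samples (given as token counts, each assumed <= max_len) into
    training sequences of capacity `max_len`, longest first. Returns lists
    of sample indices, one per packed sequence. A generic sketch, not
    necessarily Molmo2's released scheme."""
    bins, room = [], []  # sample indices per bin, remaining capacity per bin
    for i in sorted(range(len(lengths)), key=lambda i: -lengths[i]):
        n = lengths[i]
        for b, r in enumerate(room):
            if n <= r:              # first existing bin with enough space
                bins[b].append(i)
                room[b] -= n
                break
        else:                       # no bin fits: open a new one
            bins.append([i])
            room.append(max_len - n)
    return bins

# Example: pack samples of 900, 500, 400, 300 tokens into 1024-token bins.
print(pack_first_fit([900, 500, 400, 300], 1024))  # [[0], [1, 2], [3]]
```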
Circularity Check
No significant circularity; the claims rest on newly collected datasets and a training recipe evaluated on external benchmarks.
full rationale
The paper introduces 7 new video datasets and 2 multi-image datasets collected without closed VLMs, plus a training recipe with packing/message-tree encoding, bi-directional vision attention, and token weighting. Reported gains (e.g., 35.5 vs 29.6 video-counting accuracy) are presented as direct empirical outcomes on standard benchmarks rather than any derivation, fitted parameter, or self-citation that reduces the result to its own inputs by construction. No equations, self-definitional steps, or load-bearing self-citations appear; the evidential chain runs through external evaluation.
Forward citations
Cited by 20 Pith papers
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- Count Anything at Any Granularity
  Multi-grained counting is introduced with five granularity levels, supported by the new KubriCount dataset generated via 3D synthesis and editing, and HieraCount model that combines text and visual exemplars for impro...
- ChartREG++: Towards Benchmarking and Improving Chart Referring Expression Grounding under Diverse referring clues and Multi-Target Referring
  ChartREG++ creates a new multi-target chart grounding benchmark with diverse cues and a code-driven synthesis pipeline for accurate masks, yielding a model that outperforms baselines and generalizes to real ChartQA charts.
- MolmoAct2: Action Reasoning Models for Real-world Deployment
  MolmoAct2 delivers an open VLA model with new specialized components, datasets, and techniques that outperforms baselines on benchmarks while releasing all weights, code, and data for real-world robot use.
- Grounding Video Reasoning in Physical Signals
  A new benchmark converts video clips into shared grounded event records and tests models across physics, semantic, and control prompts under original, shuffled, ablated, and masked conditions, finding selective robust...
- SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
  SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
- MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
  Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...
- WildDet3D: Scaling Promptable 3D Detection in the Wild
  WildDet3D is a promptable 3D detector paired with a new 1M-image dataset across 13.5K categories that sets SOTA on open-world and zero-shot 3D detection benchmarks.
- TrajTok: Learning Trajectory Tokens enables better Video Understanding
  TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
- Skill-Aligned Annotation for Reliable Evaluation in Text-to-Image Generation
  Skill-aligned annotation improves inter-annotator agreement and evaluation stability in text-to-image generation compared to uniform annotation baselines.
- Learning to See What You Need: Gaze Attention for Multimodal Large Language Models
  Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing
  CC-OCR V2 reveals that state-of-the-art large multimodal models substantially underperform on challenging real-world document processing tasks.
- MolmoAct2: Action Reasoning Models for Real-world Deployment
  MolmoAct2 is an open VLA model that outperforms baselines like Pi-05 on 7 benchmarks and whose backbone surpasses GPT-5 on 13 embodied-reasoning tasks through new datasets, specialized training, and architecture chang...
- ViPO: Visual Preference Optimization at Scale
  Poly-DPO improves robustness to noisy preference data in visual models, and the new ViPO dataset enables superior performance, with the method reducing to standard DPO on high-quality data.
- Latent Denoising Improves Visual Alignment in Large Multimodal Models
  A latent denoising objective with saliency-aware corruption and contrastive distillation improves visual alignment and corruption robustness in large multimodal models.
- UniversalVTG: A Universal and Lightweight Foundation Model for Video Temporal Grounding
  UniversalVTG is a lightweight foundation model for video temporal grounding that achieves state-of-the-art results across five benchmarks while being over 100 times smaller than recent MLLM-based methods.
- Scaling Video Understanding via Compact Latent Multi-Agent Collaboration
  MACF decouples agent perception budgets from overall video length using latent token collaboration to scale video understanding in MLLMs beyond current limits.
- ZAYA1-VL-8B Technical Report
  ZAYA1-VL-8B is a new MoE vision-language model with vision-specific LoRA adapters and bidirectional image attention that reports competitive performance against several 3B-4B models on image, reasoning, and counting b...
- Weighting What Matters: Boosting Sample Efficiency in Medical Report Generation via Token Reweighting
  Reweighting the training loss to emphasize semantically salient tokens lets ophthalmological report generation models reach similar quality with up to ten times less data.