{"total":12,"items":[{"citing_arxiv_id":"2607.00248","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed2.0 Model Card: Towards Intelligence Frontier for Real-World Complexity","primary_cat":"cs.AI","submitted_at":"2026-06-30T22:57:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"Seed2.0 model series reports gains in reasoning, visual understanding, search, and reliability on intricate long-horizon tasks via an internal evaluation system.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22823","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Which Way Did It Move? Diagnosing and Overcoming Directional Motion Blindness in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-21T17:59:56+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Video-LLMs exhibit directional motion blindness from a direction binding gap; DeltaDirect projector objective lifts synthetic accuracy to 85.4% and real accuracy by 21.9 points while preserving other video capabilities.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.22570","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VGenST-Bench: A Benchmark for Spatio-Temporal Reasoning via Active Video Synthesis","primary_cat":"cs.CV","submitted_at":"2026-05-21T14:48:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VGenST-Bench is a new video benchmark for MLLM spatio-temporal reasoning built via generative synthesis, a multi-agent pipeline with human oversight, a 3x2x2 taxonomy, and hierarchical tasks separating perception from 
reasoning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"inflates reported performance, leaving the reliability of current MLLM evaluations questionable [65, 9, 61].(ii) Shortcut exploitation.Beyond contamination, passively curated benchmarks inherit distributional regularities from their source data that allow models to substitute linguistic priors, single-frame cues, or static scene context for genuine spatio-temporal reasoning [12, 36]. Recent stud- 2 Benchmark Venue/Year Modality Reasoning Type QA Pairs (#) Data Scale (#) Data Source MME [15] NeurIPS'25 I S 2.3K 1.1K Real image datasets3DSRBench [55] ICCV'25 I S 6.9K 2.7K Real image datasetsSpatialViz-Bench [73] ICLR'26 I S 1.1K 1.1K Programmatic generated imagesSpatial457 [77] CVPR'25 I S 23K 1.0K Rendered synthetic 3D scenesVSI-Bench [82] CVPR'25 V S 5K 288 3D indoor scene datasetsEgoExoBench [26] NeurIPS'25 V S/T 7."},{"citing_arxiv_id":"2605.21988","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning","primary_cat":"cs.CV","submitted_at":"2026-05-21T04:38:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15342","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Minerva-Ego: Spatiotemporal Hints for Egocentric Video 
Understanding","primary_cat":"cs.CV","submitted_at":"2026-05-14T19:12:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Minerva-Ego is a new benchmark for egocentric visual reasoning with dense human-annotated traces and masks, showing that spatiotemporal hints substantially improve frontier model performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.07568","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs","primary_cat":"cs.CV","submitted_at":"2026-05-08T10:40:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"can be effectively transferred when temporal structure is preserved. investigateRQ3: Can AoT supervision improve Video-LLM performance on broader temporal reasoning tasks?Our results show that AoT supervision serves as an effective training signal beyond the task itself, improving general temporal reasoning in Video-LLMs and yielding gains of up to 6.0 points on VITATECS-Direction [27] and 1.3 points on TVBench [14]. 
In summary, we (i) show that video-centric encoder with explicit temporal modeling is critical for encoding temporal information; (ii) identify projector design and CLIP-style training as key bottlenecks for transferring temporal information to LLMs; (iii) improve Video-LLM AoT performance from chance level to beyond human accuracy; and (iv) demonstrate that AoT supervision also benefits"},{"citing_arxiv_id":"2604.23407","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PushupBench: Your VLM is not good at counting pushups","primary_cat":"cs.CV","submitted_at":"2026-04-25T18:58:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11399","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Reasoning Resides in Layers: Restoring Temporal Reasoning in Video-Language Models with Layer-Selective Merging","primary_cat":"cs.CV","submitted_at":"2026-04-13T12:41:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"MERIT restores temporal reasoning in VLMs via layer-selective self-attention merging guided by a TR-improving objective that penalizes TP degradation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.20633","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed1.8 Model Card: Towards Generalized Real-World 
Agency","primary_cat":"cs.AI","submitted_at":"2026-03-21T04:03:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Seed1.8 is a new foundation model that adds unified agentic capabilities for search, code execution, and GUI interaction to existing LLM and vision strengths.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Capability Benchmark Gemini 2.5 Pro Gemini-3-Pro Seed-1.5-VL Seed1.8 Knowledge & Reasoning VideoMMMU [29] 83.6∗ 87.6∗ 81.4 82.7 MMVU [92] 76.176.370.1 73.1 VCRBench [54] 53.4 51.4 51.8 59.8 VideoReasonBench [39] 59.7 59.5 18.7 52.8 VideoHolmes‡ [13] 62.4 64.2 59.1 65.5 Minerva [45] 67.6 65.0 49.9 62.4 VideoSimpleQA [8] 69.671.959.2 67.8 Motion & Perception TVBench [16] 67.4 71.1 66.6 71.5 TempCompass [38] 83.988.083.7 86.9 TOMATO [60] 50.3 55.8 44.9 60.8 EgoTempo [52] 58.1 65.4 51.7 67.0 MotionBench [28] 66.3∗ 70.3∗ 68.8 70.6 Countix [20] 18.6 18.7 26.0 31.0 Long Video VideoMME‡ [21] 86.9∗ 88.4∗ 83.0 87.8 CGBench [10] 64.6 64.5 57.4 62.4 LongVideoBench [79] 77.676.7 74.0 77.4 L VBench [71] 73.5- 64.6 73.0 Streaming"},{"citing_arxiv_id":"2512.13511","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adapting MLLMs for Nuanced Video Retrieval","primary_cat":"cs.CV","submitted_at":"2025-12-15T16:38:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.09985","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"V-JEPA 2: Self-Supervised Video Models Enable Understanding, 
Prediction and Planning","primary_cat":"cs.AI","submitted_at":"2025-06-11T17:57:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In each iteration of training we randomly sample a mini-batch of 4 second video clips from the Droid dataset, and, for simplicity, discard any videos shorter than 4 seconds, leaving us with a smaller subset of the dataset comprising under 62 hours of video. The video clips are sampled with resolution 256 × 256 and a frame-rate of 4 frames-per-second (fps), yielding 16 frame clips denoted by (xk)k∈[16], where each xk represents a single video frame. The robot's end-effector state in each observation is denoted by the sequence (sk)k∈[16], where sk is a real-valued 7D vector defined relative to the base of the robot. 
The first three dimensions of sk encode the cartesian position of the end-effector, the next three dimensions encode its orientation in the form of extrinsic Euler angles, and the last dimension encodes the gripper state."},{"citing_arxiv_id":"2505.07062","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed1.5-VL Technical Report","primary_cat":"cs.CV","submitted_at":"2025-05-11T17:28:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"We evaluate this capability on ScreenSpot Pro [72], which focuses on expert-annotated tasks in professional settings, and ScreenSpot v2 [149], which covers grounding across 25 Capability Benchmark Seed1.5-VL thinking Seed1.5-VL non-thinking Prior SOTA Short video MotionBench [48] 68.4 68.4 62.8 GLM-4V MVBench [73] 74.4 74.3 76.4 InternVL-2.5 TOMATO [117] 44.7 44.2 46.9∗ Gemini 2.5 Pro TVBench [19] 63.6 61.5 62.6∗ Gemini 2.5 Pro Dream-1K [139] 43.9 42.6 42.0 Tarsier2 TempCompass [82] 83.7 83.1 75.8∗ Gemini 2.5 Pro Long video LongVideoBench [147] 74.0 74.4 66.7 GPT-4o LVBench [142] 64.6 64.0 69.2∗ Gemini 2.5 Pro MLVU [178] 82.1 81.8 81.2∗ Gemini 2.5 Pro VideoMME(w/o sub)[32] 77.9 77.6 87.0∗ Gemini 2.5 Pro TemporalBench [12] 79.8 78.9 73.3 GPT-4o"}],"limit":50,"offset":0}