pith. sign in

hub Baseline reference

Lost in time: A new temporal benchmark for videollms

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

12 Pith papers citing it
Baseline 60% of classified citations

hub tools

citation-role summary

background 2 baseline 2 dataset 1

citation-polarity summary

fields

cs.CV 9 cs.AI 3

years

2026 9 2025 3

clear filters

representative citing papers

PushupBench: Your VLM is not good at counting pushups

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.

Adapting MLLMs for Nuanced Video Retrieval

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

citing papers explorer

Showing 3 of 3 citing papers after filters.

  • Adapting MLLMs for Nuanced Video Retrieval cs.CV · 2025-12-15 · unverdicted · none · ref 16

    Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

  • V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning cs.AI · 2025-06-11 · unverdicted · none · ref 16

    V-JEPA 2 pre-trained on massive unlabeled video achieves strong results on motion understanding and action anticipation, SOTA video QA at 8B scale, and enables zero-shot robotic planning on Franka arms using only 62 hours of unlabeled robot video.

  • Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 19

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.