hub Baseline reference

Lost in Time: A New Temporal Benchmark for VideoLLMs

Baseline reference. 60% of citing Pith papers use this work as a benchmark or comparison.

12 Pith papers citing it
Baseline 60% of classified citations

citation-role summary

background: 2 · baseline: 2 · dataset: 1

fields

cs.CV: 9 · cs.AI: 3

years

2026: 9 · 2025: 3

representative citing papers

PushupBench: Your VLM is not good at counting pushups

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

VLMs reach only 42.1% exact accuracy on counting pushups in videos, with weaker models exploiting modal counts, and 1k-sample fine-tuning transfers gains to MVBench, PerceptionTest, and TVBench.

Adapting MLLMs for Nuanced Video Retrieval

cs.CV · 2025-12-15 · unverdicted · novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

Seed1.5-VL Technical Report

cs.CV · 2025-05-11 · unverdicted · novelty 4.0

Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.

citing papers explorer

Showing 2 of 2 citing papers after filters.

  • Tracing the Arrow of Time: Diagnosing Temporal Information Flow in Video-LLMs cs.CV · 2026-05-08 · unverdicted · none · ref 14

    Temporal information in Video-LLMs is encoded well by video-centric encoders but disrupted by standard projectors; time-preserved MLPs plus AoT supervision yield 98.1% accuracy on arrow-of-time and gains on other temporal tasks.

  • Seed1.5-VL Technical Report cs.CV · 2025-05-11 · unverdicted · none · ref 19

    Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.