hub Canonical reference

Valley: Video assistant with large language model enhanced ability

Canonical reference. 80% of the classified citations from Pith papers (4 of 5) cite this work as background.

16 Pith papers citing it
Background 80% of classified citations

citation-role summary

background: 4 · baseline: 1

representative citing papers

Dynamic Model Merging Made Slim

cs.LG · 2026-05-17 · unverdicted · novelty 6.0

DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

cs.CV · 2024-11-04 · unverdicted · novelty 6.0

PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.

TempCompass: Do Video LLMs Really Understand Videos?

cs.CV · 2024-03-01 · unverdicted · novelty 6.0

TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.

TemporalVLM: Video LLMs for Temporal Reasoning in Long Videos

cs.CV · 2024-12-04 · unverdicted · novelty 5.0

TemporalVLM adds timestamp-aware clip encoding and BiLSTM global aggregation to video LLMs, introduces the IndustryASM factory dataset, and reports outperformance on dense captioning, temporal grounding, highlight detection, and action segmentation.

citing papers explorer

Showing 16 of 16 citing papers.