MoHallBench is a new benchmark that evaluates motion hallucination in VideoLLMs arising from co-occurrence priors, sequential inference, and similarity confusion, and reveals that motion hallucination is decoupled from action recognition performance.
Canonical reference
Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207
80% of citing Pith papers cite this work as background.
representative citing papers
Iterative solvers in layer-wise model merging act as spectral regularizers on an ill-posed interference operator; closed-form SWUDI and adaptive SWUDI-A match or exceed SOTA merging accuracy with 28-72x wall-clock speedup.
SVAgent improves long video question answering by constructing storylines via multi-agent collaboration and aligning cross-modal predictions for more robust, human-like reasoning.
SEED-Bench is a new benchmark of 19K multiple-choice questions for evaluating generative comprehension in multimodal LLMs across 12 image and video dimensions.
DiDi-Merging achieves dynamic model merging performance matching or exceeding prior methods while using only 1.24x to 1.4x the parameters of a single fine-tuned model.
XComp reaches extreme video compression (one token per selective frame) via learnable progressive token compression and question-conditioned frame selection, lifting LVBench accuracy from 42.9% to 46.2% after tuning on 2.5% of standard data.
DEViL offloads spatial grounding to a detector via a distilled reference-semantic token and temporal consistency regularization, reaching 43.1% m_vIoU at 14.33 FPS on HC-STVG.
UniMind unifies multi-task brain decoding from EEG by bridging signals to LLMs via a Neuro-Language Connector and dynamic task queries, outperforming prior models by 12% on average across ten datasets.
FaVChat proposes hierarchical prompt-query guided visual features and Data-Efficient GRPO for efficient training, plus the FaVChat-170K dataset, claiming consistent outperformance over prior VLLMs on facial video tasks.
PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.
LongVU adaptively compresses long video tokens using DINOv2-based frame deduplication, text-guided cross-modal selection, and temporal spatial reduction to improve video-language understanding in MLLMs with minimal detail loss.
TempCompass benchmark reveals that state-of-the-art Video LLMs have poor ability to perceive temporal aspects such as speed, direction, and ordering in videos.
MVBench is a benchmark of 20 temporal video understanding tasks built by transforming static tasks into dynamic ones, with VideoChat2 outperforming prior MLLMs by over 15%.
Video-LLaVA creates a unified visual representation for images and videos via pre-projection alignment, enabling mutual enhancement from joint training and strong results on image and video benchmarks.
AVEX-Prune is an RL-based audio-visual token pruning method using modality exchange that maintains near-full performance at 40% token retention on VILA 1.5-8B and VideoLLaMA 2.
TemporalVLM adds timestamp-aware clip encoding and BiLSTM global aggregation to video LLMs, introduces the IndustryASM factory dataset, and reports outperformance on dense captioning, temporal grounding, highlight detection, and action segmentation.
InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
InternVideo3 introduces Multimodal Contextual Reasoning and M^2LA attention to enable closed-loop evidence accumulation in long-video understanding and agentic tool use, reporting strong benchmark results.
Vision-language models fail at zero-shot detection of climate-specific classes in social media videos, while DINOv2 and ConvNeXt V2 embeddings yield meaningful clusters via minimum-cost multicut.