arXiv preprint arXiv:2402.04252 (2023) 32 Leong, et al

Sun, Q · 2023 · arXiv 2402.04252

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

baseline 2 · background 1 · method 1

citation-polarity summary

baseline 2 · background 1 · use method 1

representative citing papers

Benchmarking Deflection and Hallucination in Large Vision-Language Models

cs.CL · 2026-04-13 · unverdicted · novelty 7.0

VLM-DeflectionBench is a new benchmark showing that current large vision-language models rarely deflect and instead hallucinate when given conflicting or insufficient multimodal evidence.

HiCrew: Hierarchical Reasoning for Long-Form Video Understanding via Question-Aware Multi-Agent Collaboration

cs.AI · 2026-04-23 · unverdicted · novelty 6.0

HiCrew improves long-form video question answering on EgoSchema and NExT-QA via a hybrid tree for temporal topology, question-aware captioning, and adaptive multi-agent planning, with gains in temporal and causal reasoning.

Exploring High-Order Self-Similarity for Video Understanding

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.

Chain-of-Models Pre-Training: Rethinking Training Acceleration of Vision Foundation Models

cs.CV · 2026-04-14 · unverdicted · novelty 6.0

CoM-PT trains vision foundation models in ascending size order using inverse knowledge transfer, allowing larger models to achieve superior performance with significantly reduced overall computational cost compared to individual training.

MG$^2$-RAG: Multi-Granularity Graph for Multimodal Retrieval-Augmented Generation

cs.IR · 2026-04-04 · unverdicted · novelty 6.0

MG²-RAG proposes a multi-granularity graph RAG framework that constructs hierarchical multimodal nodes via entity-driven visual grounding and performs structured retrieval, delivering SOTA results on four multimodal tasks with 43.3× faster graph construction.

CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

cs.CV · 2026-04-03 · unverdicted · novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.

QKVQA: Question-Focused Filtering for Knowledge-based VQA

cs.IR · 2026-01-20 · unverdicted · novelty 6.0

QKVQA proposes a question-focused filtering method with QFF and CDA modules that boosts accuracy by 3.2 points on Encyclopedic-VQA and 2.2 points on InfoSeek over prior state-of-the-art.

Perceive, Verify and Understand Long Video: Multi-Granular Perception and Active Verification via Interactive Agents

cs.CV · 2025-09-29 · unverdicted · novelty 6.0

CogniGPT uses an interactive loop between a Multi-Granular Perception Agent and an Active Verification Agent to identify reliable clues in long videos with high accuracy and low frame usage.

Perception Encoder: The best visual embeddings are not at the output of the network

cs.CV · 2025-04-17 · unverdicted · novelty 6.0

Intermediate layers of a contrastively trained vision-language encoder yield stronger general embeddings than the output layer, enabling state-of-the-art performance across image/video classification, multimodal QA, and dense prediction after simple alignment.

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

cs.AI · 2026-03-12 · unverdicted · novelty 5.0

Introduces Explicit Logic Channel (ELC) with LLM, VFM and probabilistic inference for validating, selecting and enhancing MLLMs on zero-shot tasks using Consistency Rate and cross-channel integration.

citing papers explorer

Exploring High-Order Self-Similarity for Video Understanding

cs.CV · 2026-04-22 · unverdicted · none · ref 74

The MOSS module learns and combines multi-order space-time self-similarity features to enhance temporal dynamics modeling in videos across action recognition, VQA, and robotic tasks.
