Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2(1):1, 2023

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou · 2023

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning

cs.CV · 2026-05-13 · unverdicted · novelty 7.0

SPA unlocks patch-level features in CLIP for class-incremental learning via semantic-guided selection and optimal transport alignment with class descriptions, plus projectors and pseudo-feature replay to reduce forgetting.

FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

FIKA-Bench is a leakage-aware benchmark of 311 instances showing that even the best large multimodal models and tool-equipped agents reach only 25.1% accuracy on fine-grained recognition questions that require external evidence search and verification.

TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions

cs.CV · 2026-04-30 · unverdicted · novelty 7.0

TransVLM formalizes Shot Transition Detection as identifying full temporal transition segments rather than single cut points and introduces a VLM that injects optical flow as a motion prior via simple feature fusion, plus a synthetic data engine and benchmark.

citing papers explorer

Showing 3 of 3 citing papers.

Unlocking Patch-Level Features for CLIP-Based Class-Incremental Learning
cs.CV · 2026-05-13 · unverdicted · none · ref 2
FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition
cs.CV · 2026-05-13 · unverdicted · none · ref 3 · 2 links
TransVLM: A Vision-Language Framework and Benchmark for Detecting Any Shot Transitions
cs.CV · 2026-04-30 · unverdicted · none · ref 2
