F-VLM: Open-vocabulary object detection upon frozen vision and language models

2022 · arXiv 2209.15639

3 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation

cs.CV · 2026-03-26 · unverdicted · novelty 6.0

CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over prior distillation methods.

Vision Transformers Need More Than Registers

cs.CV · 2026-02-25 · unverdicted · novelty 6.0

ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.

ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization

cs.CV · 2024-10-14 · unverdicted · novelty 6.0

ForgeryGPT integrates a forgery localization expert and mask encoder into an LLM for pixel-level forgery detection, localization, and explainable output via three-stage training on custom mask-text and instruction datasets.

citing papers explorer

Showing 3 of 3 citing papers.

CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation · cs.CV · 2026-03-26 · unverdicted · none · ref 28
Vision Transformers Need More Than Registers · cs.CV · 2026-02-25 · unverdicted · none · ref 16
ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization · cs.CV · 2024-10-14 · unverdicted · none · ref 45
