F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

fields

cs.CV 3

years

2026 2 · 2024 1

verdicts

UNVERDICTED 3

representative citing papers

CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation

cs.CV · 2026-03-26 · unverdicted · novelty 6.0

CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over prior distillation methods.

Vision Transformers Need More Than Registers

cs.CV · 2026-02-25 · unverdicted · novelty 6.0

ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.

citing papers explorer

  • CLIP-RD: Relative Distillation for Efficient CLIP Knowledge Distillation · cs.CV · 2026-03-26 · unverdicted · none · ref 28

    CLIP-RD adds VRD for cross-modality distillation consistency and XRD for bidirectional cross-modal symmetry to align student embedding geometry more closely with the teacher, yielding a 0.8 percentage point gain over prior distillation methods.

  • Vision Transformers Need More Than Registers · cs.CV · 2026-02-25 · unverdicted · none · ref 16

    ViTs exhibit lazy aggregation by relying on irrelevant background patches for global semantics, and selectively integrating patch features into the CLS token reduces this effect and improves results across label-, text-, and self-supervision.

  • ForgeryGPT: A Multimodal LLM for Interpretable Image Forgery Detection and Localization · cs.CV · 2024-10-14 · unverdicted · none · ref 45

    ForgeryGPT integrates a forgery localization expert and mask encoder into an LLM for pixel-level forgery detection, localization, and explainable output via three-stage training on custom mask-text and instruction datasets.