Vision-language models for vision tasks: A survey
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role and citation-polarity summary
years: 2026 (3)
roles: background (2)
polarities: background (2)
representative citing papers:
MFVLR uses multi-domain vision-language reconstruction with a fine-grained language transformer, multi-domain vision encoder, and vision injection module to achieve generalizable detection and localization of diffusion-synthesized face forgeries.
citing papers explorer
- Revisiting Shadow Detection from a Vision-Language Perspective
  SVL uses vision-language alignment via scene-level shadow ratio regression and global-to-local coupling on a frozen DINOv3 encoder to disambiguate shadows from dark surfaces in dense prediction.
- MFVLR: Multi-domain Fine-grained Vision-Language Reconstruction for Generalizable Diffusion Face Forgery Detection and Localization
  MFVLR uses multi-domain vision-language reconstruction with a fine-grained language transformer, multi-domain vision encoder, and vision injection module to achieve generalizable detection and localization of diffusion-synthesized face forgeries.