AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S
6 Pith papers cite this work.
fields
cs.CV 6
polarities
background 2
representative citing papers
PRISM benchmark perturbs Crello layouts into 110K samples isolating design principle violations, reveals limited sensitivity in several multimodal models, and proposes a multi-scale framework combining scorers, instruction-tuned VLMs, and prompt methods for interpretable design assessment.
MINGLE is a modular pipeline that combines off-the-shelf detection tools with VLM reasoning to localize socially connected groups in urban scenes and is supported by a new 100K-image dataset.
MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.
This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.
citing papers explorer
- AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation
- Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs
- MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes