Anwer, Erix Xing, Ming-Hsuan Yang, and Fahad S

Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M Anwer, Erix Xing, Ming-Hsuan Yang, Fahad S Khan · 2023 · arXiv 2311.03356

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1 method 1

citation-polarity summary

background 2

representative citing papers

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.

Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs

cs.CV · 2026-05-30 · unverdicted · novelty 6.0

PRISM benchmark perturbs Crello layouts into 110K samples isolating design principle violations, reveals limited sensitivity in several multimodal models, and proposes a multi-scale framework combining scorers, instruction-tuned VLMs, and prompt methods for interpretable design assessment.

MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes

cs.CV · 2025-09-16 · unverdicted · novelty 6.0

MINGLE is a modular pipeline that combines off-the-shelf detection tools with VLM reasoning to localize socially connected groups in urban scenes and is supported by a new 100K-image dataset.

MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

cs.CV · 2024-01-29 · conditional · novelty 6.0

MoE-LLaVA applies mixture-of-experts sparsity to LVLMs via MoE-Tuning, delivering LLaVA-1.5-7B level visual understanding and better hallucination resistance with only ~3B active parameters.

A Survey on Multimodal Large Language Models

cs.CV · 2023-06-23 · accept · novelty 3.0

This survey organizes the architectures, training strategies, data, evaluation methods, extensions, and challenges of Multimodal Large Language Models.

OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

cs.CV · 2026-03-31

citing papers explorer

Showing 3 of 3 citing papers after filters.

AnchorSeg: Language Grounded Query Banks for Reasoning Segmentation cs.CV · 2026-04-20 · unverdicted · none · ref 195
AnchorSeg uses ordered query banks of latent reasoning tokens plus a spatial anchor token and a Token-Mask Cycle Consistency loss to achieve 67.7% gIoU and 68.1% cIoU on the ReasonSeg benchmark.
Through the PRISM: Principle-Aware, Interpretable, and Multi-Scale Evaluation of Visual Designs cs.CV · 2026-05-30 · unverdicted · none · ref 42
PRISM benchmark perturbs Crello layouts into 110K samples isolating design principle violations, reveals limited sensitivity in several multimodal models, and proposes a multi-scale framework combining scorers, instruction-tuned VLMs, and prompt methods for interpretable design assessment.
MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes cs.CV · 2025-09-16 · unverdicted · none · ref 22
MINGLE is a modular pipeline that combines off-the-shelf detection tools with VLM reasoning to localize socially connected groups in urban scenes and is supported by a new 100K-image dataset.

Anwer, Erix Xing, Ming-Hsuan Yang, and Fahad S

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer