ChatRex: Taming Multimodal LLM for Joint Perception and Understanding

Qing Jiang, Gen Luo, Yuqin Yang, Yuda Xiong, Yihao Chen, Zhaoyang Zeng, Tianhe Ren, Lei Zhang · 2024 · arXiv 2411.18363

4 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector

cs.CV · 2026-06-27 · unverdicted · novelty 6.0

HKVLM trains only an alignment hook to bind frozen LM query embeddings to frozen detector proposals via contrastive retrieval and bipartite assignment, yielding 50-90x grounding gains and reduced hallucinations on RefCOCO and POPE.

SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.

Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues

cs.CV · 2026-04-27 · unverdicted · novelty 6.0

Language-guided semantic cues from MLLM visual pipelines, steered by text embeddings, refine object semantics and boost grounding accuracy against occlusion and small objects.

Grounding Everything in Tokens for Multimodal Large Language Models

cs.CV · 2025-12-11 · unverdicted · novelty 5.0

GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.

citing papers explorer

HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector · cs.CV · 2026-06-27 · unverdicted · none · ref 1
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding · cs.CV · 2026-05-14 · unverdicted · none · ref 17
Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues · cs.CV · 2026-04-27 · unverdicted · none · ref 13
Grounding Everything in Tokens for Multimodal Large Language Models · cs.CV · 2025-12-11 · unverdicted · none · ref 18
