HKVLM trains only an alignment hook to bind frozen LM query embeddings to frozen detector proposals via contrastive retrieval and bipartite assignment, yielding 50-90x grounding gains and reduced hallucinations on RefCOCO and POPE.
ChatRex: Tam- ing Multimodal LLM for Joint Perception and Understand- ing
4 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 4verdicts
UNVERDICTED 4representative citing papers
SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.
Language-guided semantic cues from MLLM visual pipelines, steered by text embeddings, refine object semantics and boost grounding accuracy against occlusion and small objects.
GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.
citing papers explorer
-
HKVLM: Faithful Reasoning Grounding by Binding Language Queries to a Frozen Detector
HKVLM trains only an alignment hook to bind frozen LM query embeddings to frozen detector proposals via contrastive retrieval and bipartite assignment, yielding 50-90x grounding gains and reduced hallucinations on RefCOCO and POPE.
-
SceneParser: Hierarchical Scene Parsing for Visual Semantics Understanding
SceneParser introduces hierarchical scene parsing as object-part-affordance chains, a VLM trained with pseudo labels and curriculum learning, and SceneParser-Bench with 1.74M affordance annotations, showing better structure-aware results than existing MLLMs.
-
Robust Grounding with MLLMs Against Occlusion and Small Objects via Language-Guided Semantic Cues
Language-guided semantic cues from MLLM visual pipelines, steered by text embeddings, refine object semantics and boost grounding accuracy against occlusion and small objects.
-
Grounding Everything in Tokens for Multimodal Large Language Models
GETok partitions images with grid tokens and refines locations via offset tokens to enable better native 2D spatial reasoning in MLLMs.