Ferret: Refer and Ground Anything Anywhere at Any Granularity
Pith reviewed 2026-05-15 11:24 UTC · model grok-4.3
The pith
Ferret unifies referring and grounding in multimodal LLMs via a hybrid region representation of coordinates and continuous features.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ferret unifies referring and grounding inside the LLM paradigm through a hybrid region representation that jointly encodes discrete coordinates and continuous features. A spatial-aware visual sampler extracts the continuous features on demand and accommodates arbitrary shapes and sparsity levels. The model is trained on GRIT, a dataset of 1.1 million refer-and-ground instruction samples, and is reported to achieve superior performance on classical referring and grounding benchmarks and to outperform existing MLLMs on region-based multimodal dialogue.
What carries the argument
Hybrid region representation that integrates discrete coordinates with continuous visual features extracted by a spatial-aware visual sampler capable of handling regions of any shape or sparsity.
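To make the idea of a hybrid representation concrete, the following is a minimal sketch rather than the authors' implementation. It pairs a quantized bounding box (the discrete part) with a pooled feature vector (the continuous part). Mean pooling stands in for the spatial-aware visual sampler, and the function name, the `num_bins` quantization, and the toy feature map are all assumptions made for illustration.

```python
def hybrid_region_representation(feature_map, mask, num_bins=1000):
    """Return (coordinate_text, region_feature) for a binary region mask.

    feature_map: H x W x C nested lists (hypothetical image-encoder output).
    mask: H x W nested lists of booleans marking the referred region.
    """
    h, w = len(mask), len(mask[0])
    pixels = [(y, x) for y in range(h) for x in range(w) if mask[y][x]]
    if not pixels:
        raise ValueError("mask is empty")
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    # Discrete part: tight bounding box quantized to integer bins (assumed scheme).
    box = (
        min(xs) * num_bins // w,
        min(ys) * num_bins // h,
        (max(xs) + 1) * num_bins // w,
        (max(ys) + 1) * num_bins // h,
    )
    coord_text = "[{}, {}, {}, {}]".format(*box)
    # Continuous part: per-channel mean over masked pixels.
    # This is a simple stand-in, NOT the paper's spatial-aware sampler.
    channels = len(feature_map[0][0])
    region_feature = [
        sum(feature_map[y][x][c] for y, x in pixels) / len(pixels)
        for c in range(channels)
    ]
    return coord_text, region_feature


# Toy 4x4 feature map with two channels holding the (y, x) position.
feature_map = [[[float(y), float(x)] for x in range(4)] for y in range(4)]
mask = [[1 <= y <= 2 and 1 <= x <= 2 for x in range(4)] for y in range(4)]
coord_text, region_feature = hybrid_region_representation(
    feature_map, mask, num_bins=100
)
```

In a real model the coordinate text would be tokenized alongside the prompt, and the feature vector would be projected into the language model's embedding space. Neither step is shown here.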
If this is right
- Classical referring-expression and visual-grounding benchmarks improve because the model directly consumes and localizes arbitrary region inputs.
- Region-based multimodal chatting outperforms prior MLLMs because the same representation supports both input references and output grounding.
- Object hallucination decreases and fine-detail description improves because the continuous feature stream supplies richer spatial context to the language model.
- The model accepts points, boxes, and free-form shapes interchangeably without architectural changes.
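One way to see why points, boxes, and free-form shapes can share a single pathway is to reduce each to a binary mask before any feature extraction. The sketch below illustrates that reduction under stated assumptions: the dictionary input format, the helper names, and the default disc radius for points are hypothetical and are not taken from the paper or its code.

```python
def point_to_mask(height, width, cx, cy, radius=1):
    """Disc around (cx, cy); the radius is an assumed, illustrative choice."""
    return [
        [(x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2 for x in range(width)]
        for y in range(height)
    ]


def box_to_mask(height, width, x1, y1, x2, y2):
    """Pixels with x1 <= x < x2 and y1 <= y < y2."""
    return [
        [x1 <= x < x2 and y1 <= y < y2 for x in range(width)]
        for y in range(height)
    ]


def to_mask(region, height, width):
    """Dispatch on a hypothetical region dictionary and return a boolean mask."""
    kind = region["type"]
    if kind == "point":
        return point_to_mask(
            height, width, region["x"], region["y"], region.get("radius", 1)
        )
    if kind == "box":
        return box_to_mask(height, width, *region["xyxy"])
    if kind == "mask":
        # Free-form shapes are assumed to arrive already rasterized.
        return [[bool(v) for v in row] for row in region["mask"]]
    raise ValueError(f"unknown region type: {kind}")


def area(mask):
    """Number of pixels set in the mask."""
    return sum(sum(row) for row in mask)


point = to_mask({"type": "point", "x": 2, "y": 2}, 5, 5)
box = to_mask({"type": "box", "xyxy": (1, 1, 4, 3)}, 5, 5)
scribble = to_mask({"type": "mask", "mask": [[0, 1], [1, 1]]}, 2, 2)
```

Once every input is a mask, the downstream sampler only needs to handle one data type, which is consistent with the claim that no architectural change is required per input kind.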
Where Pith is reading between the lines
- The hybrid encoding could be ported to video or 3-D data if the sampler is extended to handle temporal or volumetric sparsity.
- Instruction datasets that deliberately include hard negatives may become a standard recipe for robustness in any grounding or localization task.
- Applications such as robotic manipulation or medical image annotation could adopt the same input format to let users point or sketch rather than describe regions in text alone.
Load-bearing premise
The spatial-aware visual sampler can reliably extract continuous features from regions of arbitrary shape and sparsity without introducing systematic bias or information loss that affects downstream grounding accuracy.
What would settle it
A controlled evaluation on a held-out set of highly irregular or sparse free-form regions in which Ferret shows no accuracy gain, or even degradation, relative to a bounding-box-only baseline would falsify the claim that the hybrid representation and sampler deliver the intended grounding benefit.
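The comparison described above could be scored with a standard accuracy-at-IoU metric. The sketch below is illustrative only: the 0.5 threshold is a common convention rather than something this page specifies, and the boxes are toy values invented for the example, not results from Ferret or any baseline.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)


# Toy data (made up for illustration): two ground-truth boxes and two systems.
targets = [(0, 0, 10, 10), (5, 5, 15, 15)]
hybrid_preds = [(0, 0, 10, 10), (5, 5, 15, 14)]
box_only_preds = [(0, 0, 10, 10), (10, 10, 20, 20)]

gain = accuracy_at(hybrid_preds, targets) - accuracy_at(box_only_preds, targets)
```

Under the falsification criterion stated above, a `gain` at or below zero on irregular or sparse free-form regions would count against the claim.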
Original abstract
We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Ferret, a Multimodal Large Language Model (MLLM) capable of referring to and grounding open-vocabulary descriptions for image regions of arbitrary shape or granularity. It proposes a hybrid region representation that combines discrete coordinates with continuous features extracted by a spatial-aware visual sampler designed to handle varying sparsity, and trains the model on the newly curated GRIT dataset containing 1.1M refer-and-ground instruction samples (including 95K hard negatives). The resulting model is claimed to outperform prior MLLMs on classical referring/grounding benchmarks, region-based multimodal chatting, image detail description, and object hallucination reduction.
Significance. If the empirical gains hold under rigorous verification, the work advances MLLM spatial reasoning by enabling flexible, fine-grained region inputs beyond bounding boxes. The hybrid representation and GRIT dataset curation provide reusable components for future grounding research, while the reported hallucination reduction and detail-description improvements address key practical limitations of current MLLMs.
major comments (1)
- [Spatial-aware visual sampler description] The central claim that the spatial-aware visual sampler enables accurate grounding for arbitrary shapes without systematic degradation depends on unverified assumptions about feature preservation. The manuscript describes the sampler as handling varying sparsity but provides no direct measurements (e.g., feature reconstruction error, ablation on IoU drop for sparse vs. dense or non-convex regions) to confirm that boundary or interior semantics are retained for point-like or free-form inputs.
minor comments (2)
- [Abstract] The abstract asserts performance improvements and reduced hallucination without citing specific metrics, baselines, or table references; adding one-sentence quantitative highlights would improve readability while the full tables remain in the body.
- [Method] Clarify the exact sampling mechanism (interpolation, attention, or masking strategy) used by the spatial-aware sampler on irregular masks, as this detail is load-bearing for reproducibility.
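To illustrate why the choice of mechanism matters for reproducibility, here is a sketch of one of the options the referee lists: bilinear interpolation of a feature grid at continuous point locations. This is an assumption-laden example of that single option, not a description of what Ferret's sampler does, and the function names are hypothetical.

```python
def bilinear_sample(grid, x, y):
    """Interpolate a 2-D grid of scalars at continuous (x, y) in pixel units."""
    h, w = len(grid), len(grid[0])
    # Clamp the query so it stays inside the grid.
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    # Blend along x on the two neighbouring rows, then blend along y.
    top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
    bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def sample_region(grid, points):
    """Sample the grid at each (x, y) point drawn from inside a region."""
    return [bilinear_sample(grid, x, y) for x, y in points]


grid = [[0.0, 1.0], [2.0, 3.0]]
values = sample_region(grid, [(0.5, 0.5), (1.0, 0.0)])
```

An attention-based or hard-masking sampler would return different values for the same points, which is why the report asks the authors to state the mechanism explicitly.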
Simulated Author's Rebuttal
We thank the referee for the thoughtful and detailed feedback. We address the major comment below and describe the revisions we will make to strengthen the manuscript.
Point-by-point responses
-
Referee: The central claim that the spatial-aware visual sampler enables accurate grounding for arbitrary shapes without systematic degradation depends on unverified assumptions about feature preservation. The manuscript describes the sampler as handling varying sparsity but provides no direct measurements (e.g., feature reconstruction error, ablation on IoU drop for sparse vs. dense or non-convex regions) to confirm that boundary or interior semantics are retained for point-like or free-form inputs.
Authors: We agree that direct measurements would provide stronger substantiation for the sampler's feature preservation properties. While the reported gains on referring, grounding, and region-based chatting benchmarks across points, boxes, and free-form shapes offer supporting evidence, these are indirect. In the revised manuscript we will add (i) feature reconstruction error metrics for regions of varying sparsity and (ii) targeted ablations measuring IoU degradation on sparse, dense, and non-convex regions. These additions will directly address whether boundary and interior semantics are retained for point-like and free-form inputs.
Revision: yes
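The promised ablation could be organized by bucketing evaluation regions by how sparsely they fill their bounding box and reporting mean IoU per bucket. The sketch below assumes one possible sparsity proxy (mask area divided by tight bounding-box area) and an arbitrary 0.5 cutoff; both are illustrative choices, and the IoU values are toy numbers rather than measurements.

```python
def fill_ratio(mask):
    """Fraction of the mask's tight bounding box that the mask covers."""
    pixels = [(y, x) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not pixels:
        return 0.0
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    box_area = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return len(pixels) / box_area


def mean_iou_by_bucket(records, cutoff=0.5):
    """records: iterable of (fill_ratio, iou). Returns mean IoU per bucket."""
    buckets = {"sparse": [], "dense": []}
    for ratio, iou in records:
        buckets["sparse" if ratio < cutoff else "dense"].append(iou)
    return {k: (sum(v) / len(v) if v else None) for k, v in buckets.items()}


# Toy regions: a diagonal stroke (sparse) and a filled square (dense).
diag = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
full = [[1, 1], [1, 1]]

# Toy IoU values, invented for illustration only.
records = [(fill_ratio(diag), 0.4), (fill_ratio(full), 0.8), (0.25, 0.6)]
summary = mean_iou_by_bucket(records)
```

A large gap between the "sparse" and "dense" buckets would suggest the kind of systematic degradation the referee is concerned about. A separate bucket for non-convex regions would need its own shape test, which is not sketched here.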
Circularity Check
No significant circularity; empirical architecture and evaluation
Full rationale
The paper introduces a hybrid region representation and spatial-aware visual sampler as architectural choices, curates the GRIT dataset, trains Ferret, and reports benchmark results. No equations or derivations are presented that reduce claimed performance gains to quantities defined by the authors' own fitted parameters or self-citations. The central claims rest on empirical training and evaluation rather than self-referential definitions or imported uniqueness theorems.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 21 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
-
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
-
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
-
From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
-
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
-
STORM: End-to-End Referring Multi-Object Tracking in Videos
STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
-
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
-
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
-
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.
-
Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration
A training-free attention-guided debiasing framework mitigates position bias in MLLM multi-image retrieval by exploiting the observed mismatch between biased logits and aligned attention maps, yielding over 40% accura...
-
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
-
X2SAM: Any Segmentation in Images and Videos
X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
-
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
CogVLM: Visual Expert for Pretrained Language Models
CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
-
DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA
DIAGRAMS introduces a schema-driven annotation tool that proposes reasoning-level evidence regions for Diagram QA pairs and reports 85.39% precision and 75.30% recall against human final selections on six datasets.
-
Affordance Agent Harness: Verification-Gated Skill Orchestration
Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
-
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
-
A Survey on Hallucination in Large Vision-Language Models
This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.