arxiv: 2310.07704 · v1 · submitted 2023-10-11 · 💻 cs.CV · cs.CL


Ferret: Refer and Ground Anything Anywhere at Any Granularity

Haoxuan You, Haotian Zhang, Zhe Gan, Xianzhi Du, Bowen Zhang, Zirui Wang, Liangliang Cao, Shih-Fu Chang, Yinfei Yang


Pith reviewed 2026-05-15 11:24 UTC · model grok-4.3

classification 💻 cs.CV cs.CL

keywords multimodal large language model · visual grounding · referring expression comprehension · spatial-aware sampler · instruction tuning dataset · object hallucination · hybrid region representation


The pith

Ferret unifies referring and grounding in multimodal LLMs via a hybrid region representation of coordinates and continuous features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ferret, a multimodal large language model that accepts references to image regions of any shape or granularity and grounds open-vocabulary descriptions back to those regions. It does so by feeding the language model a hybrid encoding that pairs discrete coordinates with continuous visual features pulled from the chosen region by a spatial-aware sampler. The sampler is built to work on points, boxes, and free-form shapes without regard to sparsity. Training relies on a new 1.1-million-sample dataset called GRIT that supplies hierarchical spatial instructions plus hard negatives. If the approach holds, language models gain reliable spatial localization without separate detection heads, while also cutting object hallucination and improving detail description in chat responses.

Core claim

Ferret unifies referring and grounding inside the LLM paradigm through a hybrid region representation that jointly encodes discrete coordinates and continuous features, extracted on demand by a spatial-aware visual sampler that accommodates arbitrary shapes and sparsity levels; the resulting model is trained on the GRIT dataset of 1.1 million refer-and-ground instructions and demonstrates stronger performance on both classical benchmarks and region-based multimodal dialogue.

What carries the argument

Hybrid region representation that integrates discrete coordinates with continuous visual features extracted by a spatial-aware visual sampler capable of handling regions of any shape or sparsity.
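
As a rough illustration of what "hybrid" means here, the sketch below pairs quantized box coordinates with a pooled feature vector for one region. It is a stand-in under stated assumptions, not Ferret's implementation: plain masked mean pooling replaces the paper's spatial-aware visual sampler, and the names `hybrid_region`, `quantize`, and `masked_mean`, as well as the `bins` parameter, are hypothetical.

```python
from typing import Dict, List, Tuple

Mask = List[List[int]]                # H x W binary mask, 1 marks a region pixel
FeatureMap = List[List[List[float]]]  # H x W x C image features


def bounding_box(mask: Mask) -> Tuple[int, int, int, int]:
    """Tightest (x1, y1, x2, y2) box around the region, in pixel indices."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("region is empty")
    return min(xs), min(ys), max(xs), max(ys)


def quantize(box, width: int, height: int, bins: int = 1000):
    """Map pixel coordinates to integer bins (the discrete half).

    The bin count is an assumption made for this sketch.
    """
    x1, y1, x2, y2 = box
    return (x1 * bins // width, y1 * bins // height,
            x2 * bins // width, y2 * bins // height)


def masked_mean(features: FeatureMap, mask: Mask) -> List[float]:
    """Average the feature vectors under the mask (the continuous half).

    Stand-in only: plain mean pooling, NOT the spatial-aware sampler.
    """
    selected = [vec for frow, mrow in zip(features, mask)
                for vec, inside in zip(frow, mrow) if inside]
    if not selected:
        raise ValueError("region is empty")
    return [sum(channel) / len(selected) for channel in zip(*selected)]


def hybrid_region(features: FeatureMap, mask: Mask, bins: int = 1000) -> Dict[str, object]:
    """Bundle discrete coordinates and a continuous feature for one region."""
    height, width = len(mask), len(mask[0])
    return {
        "coords": quantize(bounding_box(mask), width, height, bins),
        "feature": masked_mean(features, mask),
    }
```

Mean pooling treats every pixel in the region equally, which is exactly the kind of simplification the load-bearing premise worries about for sparse or irregular shapes.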

If this is right

Classical referring-expression and visual-grounding benchmarks improve because the model directly consumes and localizes arbitrary region inputs.
Region-based multimodal chatting outperforms prior MLLMs because the same representation supports both input references and output grounding.
Object hallucination decreases and fine-detail description increases as the continuous feature stream supplies richer spatial context to the language model.
The model accepts points, boxes, and free-form shapes interchangeably without architectural changes.
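
One way to picture "interchangeable" region inputs is to convert each input type to a common binary mask before any features are extracted. The sketch below does that for a point, a box, and a free-form polygon. The function names and the choice of a pixel mask as the shared format are assumptions for illustration; this page does not say how Ferret represents the inputs internally.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]  # H x W binary mask, 1 marks a region pixel


def point_mask(height: int, width: int, x: int, y: int, radius: int = 0) -> Mask:
    """Mark pixels within `radius` of (x, y); radius 0 is a single pixel."""
    return [[int((px - x) ** 2 + (py - y) ** 2 <= radius ** 2)
             for px in range(width)] for py in range(height)]


def box_mask(height: int, width: int, x1: int, y1: int, x2: int, y2: int) -> Mask:
    """Mark pixels inside the inclusive box (x1, y1)-(x2, y2)."""
    return [[int(x1 <= px <= x2 and y1 <= py <= y2)
             for px in range(width)] for py in range(height)]


def polygon_mask(height: int, width: int,
                 vertices: Sequence[Tuple[float, float]]) -> Mask:
    """Mark pixels whose centre lies inside the polygon (even-odd rule)."""
    pts = list(vertices)

    def inside(cx: float, cy: float) -> bool:
        hit = False
        # Walk each edge and count crossings of a ray cast to the right.
        for (ax, ay), (bx, by) in zip(pts, pts[1:] + pts[:1]):
            if (ay > cy) != (by > cy):
                x_cross = ax + (cy - ay) * (bx - ax) / (by - ay)
                if cx < x_cross:
                    hit = not hit
        return hit

    return [[int(inside(px + 0.5, py + 0.5)) for px in range(width)]
            for py in range(height)]
```

A point and a thin scribble produce very few active pixels while a box produces a dense block, which is the sparsity variation the sampler is said to handle.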

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The hybrid encoding could be ported to video or 3-D data if the sampler is extended to handle temporal or volumetric sparsity.
Instruction datasets that deliberately include hard negatives may become a standard recipe for robustness in any grounding or localization task.
Applications such as robotic manipulation or medical image annotation could adopt the same input format to let users point or sketch rather than describe regions in text alone.

Load-bearing premise

The spatial-aware visual sampler can reliably extract continuous features from regions of arbitrary shape and sparsity without introducing systematic bias or information loss that affects downstream grounding accuracy.

What would settle it

A controlled evaluation on a held-out set of highly irregular or sparse free-form regions in which Ferret shows no accuracy gain, or even degradation, relative to a bounding-box-only baseline would falsify the claim that the hybrid representation and sampler deliver the intended grounding benefit.
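
A minimal sketch of how such a comparison might be scored, assuming per-sample predictions and targets are available as binary masks and that a prediction counts as correct at IoU 0.5 or above (a common grounding convention, assumed here rather than taken from the paper). The helper names are hypothetical.

```python
from typing import List, Sequence

Mask = List[List[int]]  # H x W binary mask, 1 marks a region pixel


def mask_iou(pred: Mask, target: Mask) -> float:
    """Intersection over union of two equally sized binary masks."""
    pairs = [(p, t) for prow, trow in zip(pred, target)
             for p, t in zip(prow, trow)]
    inter = sum(1 for p, t in pairs if p and t)
    union = sum(1 for p, t in pairs if p or t)
    return inter / union if union else 0.0


def accuracy_at(ious: Sequence[float], threshold: float = 0.5) -> float:
    """Fraction of samples whose IoU reaches the threshold."""
    return sum(iou >= threshold for iou in ious) / len(ious)


def hybrid_gain(hybrid_ious: Sequence[float],
                box_only_ious: Sequence[float],
                threshold: float = 0.5) -> float:
    """Accuracy of the hybrid model minus the box-only baseline.

    A value at or below zero on irregular or sparse regions is the
    outcome that would count against the claim.
    """
    return accuracy_at(hybrid_ious, threshold) - accuracy_at(box_only_ious, threshold)
```

Both IoU lists should come from the same held-out samples so that the difference reflects the representation rather than the data split.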

Original abstract

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with 95K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination. Code and data will be available at https://github.com/apple/ml-ferret

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Ferret's hybrid discrete-plus-continuous region representation and the GRIT dataset with hard negatives are the concrete new pieces, but the abstract supplies no benchmark numbers or ablations, so the performance claims stay unverified.


Hi, the main thing to know about Ferret is that it adds a hybrid region encoder (discrete coordinates paired with continuous features pulled by a spatial-aware sampler) to let an MLLM accept and ground points, boxes, or free-form shapes inside conversation. They also curate GRIT, a 1.1M-sample instruction set that includes hierarchical spatial relations and 95K hard negatives. That combination is presented as the step that improves both classical referring/grounding and region-based chatting while cutting hallucination.

Referee Report

1 major / 2 minor

Summary. The paper introduces Ferret, a Multimodal Large Language Model (MLLM) capable of referring to and grounding open-vocabulary descriptions for image regions of arbitrary shape or granularity. It proposes a hybrid region representation that combines discrete coordinates with continuous features extracted by a spatial-aware visual sampler designed to handle varying sparsity, and trains the model on the newly curated GRIT dataset containing 1.1M refer-and-ground instruction samples (including 95K hard negatives). The resulting model is claimed to outperform prior MLLMs on classical referring/grounding benchmarks, region-based multimodal chatting, image detail description, and object hallucination reduction.

Significance. If the empirical gains hold under rigorous verification, the work advances MLLM spatial reasoning by enabling flexible, fine-grained region inputs beyond bounding boxes. The hybrid representation and GRIT dataset curation provide reusable components for future grounding research, while the reported hallucination reduction and detail-description improvements address key practical limitations of current MLLMs.

major comments (1)

[Spatial-aware visual sampler description] The central claim that the spatial-aware visual sampler enables accurate grounding for arbitrary shapes without systematic degradation depends on unverified assumptions about feature preservation. The manuscript describes the sampler as handling varying sparsity but provides no direct measurements (e.g., feature reconstruction error, ablation on IoU drop for sparse vs. dense or non-convex regions) to confirm that boundary or interior semantics are retained for point-like or free-form inputs.

minor comments (2)

[Abstract] The abstract asserts performance improvements and reduced hallucination without citing specific metrics, baselines, or table references; adding one-sentence quantitative highlights would improve readability while the full tables remain in the body.
[Method] Clarify the exact sampling mechanism (interpolation, attention, or masking strategy) used by the spatial-aware sampler on irregular masks, as this detail is load-bearing for reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and detailed feedback. We address the major comment below and describe the revisions we will make to strengthen the manuscript.


Referee: The central claim that the spatial-aware visual sampler enables accurate grounding for arbitrary shapes without systematic degradation depends on unverified assumptions about feature preservation. The manuscript describes the sampler as handling varying sparsity but provides no direct measurements (e.g., feature reconstruction error, ablation on IoU drop for sparse vs. dense or non-convex regions) to confirm that boundary or interior semantics are retained for point-like or free-form inputs.

Authors: We agree that direct measurements would provide stronger substantiation for the sampler's feature preservation properties. While the reported gains on referring, grounding, and region-based chatting benchmarks across points, boxes, and free-form shapes offer supporting evidence, these are indirect. In the revised manuscript we will add (i) feature reconstruction error metrics for regions of varying sparsity and (ii) targeted ablations measuring IoU degradation on sparse, dense, and non-convex regions. These additions will directly address whether boundary and interior semantics are retained for point-like and free-form inputs.

Revision: yes
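
The ablation promised in (ii) could be organized along the lines below: measure how much of its bounding box each region fills, then average IoU within sparse, medium, and dense groups. This is a hedged sketch, not the authors' protocol. The fill-ratio definition, the 0.25 and 0.75 cut-offs, and the function names are all assumptions chosen for illustration.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Mask = List[List[int]]  # H x W binary mask, 1 marks a region pixel


def fill_ratio(mask: Mask) -> float:
    """Fraction of the tight bounding box that the region actually covers."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    if not ys:
        raise ValueError("region is empty")
    box_area = (max(xs) - min(xs) + 1) * (max(ys) - min(ys) + 1)
    return sum(map(sum, mask)) / box_area


def mean_iou_by_fill(samples: Iterable[Tuple[float, float]],
                     edges: Tuple[float, float] = (0.25, 0.75)) -> Dict[str, float]:
    """Group (fill_ratio, iou) pairs into sparsity bins and average each bin.

    The bin edges are illustrative assumptions, not values from the paper.
    """
    low, high = edges
    buckets = defaultdict(list)
    for fill, iou in samples:
        label = "sparse" if fill < low else "dense" if fill >= high else "medium"
        buckets[label].append(iou)
    return {label: mean(values) for label, values in buckets.items()}
```

A clear drop in mean IoU from the dense to the sparse bin would point to the information loss the referee asks about; similar values across bins would support the sampler's robustness claim.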

Circularity Check

0 steps flagged

No significant circularity; empirical architecture and evaluation


The paper introduces a hybrid region representation and spatial-aware visual sampler as architectural choices, curates the GRIT dataset, trains Ferret, and reports benchmark results. No equations or derivations are presented that reduce claimed performance gains to quantities defined by the authors' own fitted parameters or self-citations. The central claims rest on empirical training and evaluation rather than self-referential definitions or imported uniqueness theorems.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the effectiveness of the hybrid region representation and the spatial-aware sampler for arbitrary shapes; no explicit free parameters, mathematical axioms, or newly invented physical entities are stated in the abstract.



Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
cs.CL 2026-05 accept novelty 8.0

CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
cs.CV 2026-05 unverdicted novelty 8.0

VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
cs.CV 2024-09 accept novelty 8.0

Molmo VLMs trained on newly collected PixMo open datasets achieve state-of-the-art performance among open-weight models and surpass multiple proprietary VLMs including Claude 3.5 Sonnet and Gemini 1.5 Pro.
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
cs.CV 2026-05 unverdicted novelty 7.0

Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
cs.CV 2026-04 unverdicted novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
STORM: End-to-End Referring Multi-Object Tracking in Videos
cs.CV 2026-04 unverdicted novelty 7.0

STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
Mosaic: Cross-Modal Clustering for Efficient Video Understanding
cs.PF 2026-04 unverdicted novelty 7.0

Mosaic uses cross-modal clusters as the unit for KVCache organization in VLMs to achieve up to 1.38x speedup in streaming long-video understanding.
OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning
cs.CV 2026-03 conditional novelty 7.0

OmniSch is the first benchmark exposing gaps in LMMs for PCB schematic visual grounding, topology-to-graph parsing, geometric weighting, and tool-augmented reasoning.
LLMind: Bio-inspired Training-free Adaptive Visual Representations for Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

LLMind uses bio-inspired non-uniform sampling via a Mobius module and closed-loop semantic feedback to retain 82-97% of full-resolution VLM performance with only 1-5% of pixels on VQA benchmarks.
Logit-Attention Divergence: Mitigating Position Bias in Multi-Image Retrieval via Attention-Guided Calibration
cs.CV 2026-05 unverdicted novelty 6.0

A training-free attention-guided debiasing framework mitigates position bias in MLLM multi-image retrieval by exploiting the observed mismatch between biased logits and aligned attention maps, yielding over 40% accura...
CAST: Mitigating Object Hallucination in Large Vision-Language Models via Caption-Guided Visual Attention Steering
cs.CV 2026-05 unverdicted novelty 6.0

CAST reduces object hallucination in LVLMs by 6.03% on average across five models and five benchmarks by identifying caption-sensitive attention heads and applying optimized steering directions to their outputs, with ...
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 6.0

Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
X2SAM: Any Segmentation in Images and Videos
cs.CV 2026-04 unverdicted novelty 6.0

X2SAM unifies any-segmentation across images and videos in one MLLM by adding a Mask Memory module for temporal consistency and joint training on mixed datasets.
Chat-Scene++: Exploiting Context-Rich Object Identification for 3D LLM
cs.CV 2026-03 unverdicted novelty 6.0

Chat-Scene++ improves 3D scene understanding in multimodal LLMs by representing scenes as context-rich object sequences with identifier tokens and grounded chain-of-thought reasoning, reaching state-of-the-art on five...
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
CogVLM: Visual Expert for Pretrained Language Models
cs.CV 2023-11 conditional novelty 6.0

CogVLM adds a trainable visual expert inside frozen language model layers for deep vision-language fusion and reports state-of-the-art results on ten cross-modal benchmarks while preserving NLP performance.
DIAGRAMS: A Review Framework for Reasoning-Level Attribution in Diagram QA
cs.CL 2026-04 unverdicted novelty 5.0

DIAGRAMS introduces a schema-driven annotation tool that proposes reasoning-level evidence regions for Diagram QA pairs and reports 85.39% precision and 75.30% recall against human final selections on six datasets.
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 4.0

Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
cs.CV 2026-04 unverdicted novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...
A Survey on Hallucination in Large Vision-Language Models
cs.CV 2024-02 unverdicted novelty 3.0

This survey reviews the definition, symptoms, evaluation benchmarks, root causes, and mitigation methods for hallucinations in large vision-language models.
