arxiv: 2306.15195 · v2 · submitted 2023-06-27 · 💻 cs.CV


Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic

Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, Rui Zhao


Pith reviewed 2026-05-12 15:47 UTC · model grok-4.3

classification 💻 cs.CV

keywords multimodal large language model · referential dialogue · spatial coordinates · natural language interface · referring expression comprehension · visual question answering · vision-language tasks


The pith

Shikra lets multimodal LLMs accept and output spatial coordinates as ordinary words.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Shikra as a multimodal large language model that treats locations in images as text tokens rather than special encodings. Its design uses only a vision encoder, a simple alignment layer, and an existing LLM, with every reference to position expressed in natural language. This single change turns referential dialogue into the central capability, so tasks such as pointing at objects, answering questions about pointed regions, captioning, and standard visual question answering all become instances of the same conversation. The authors show that the model can generate coordinates inside its reasoning steps and compare user-indicated areas without any added detection heads or position encoders. If the approach holds, vision-language work no longer needs separate modules for location; language itself carries the geometry.

Core claim

Shikra shows that a standard multimodal LLM architecture can learn to read and write spatial coordinates directly in natural language, making referential dialogue the unifying format for location-aware vision-language tasks.

What carries the argument

Natural-language encoding of bounding-box coordinates passed through the same token stream as all other text, learned by the LLM without extra vocabularies or position encoders.

If this is right

Shikra performs referring expression comprehension and point-based question answering by outputting or accepting coordinates inside the same language response.
Conventional tasks such as image captioning and visual question answering are handled without task-specific heads because they reduce to referential dialogue.
The model can insert object coordinates into its chain-of-thought steps and compare similarities between user-pointed image regions.
No pre- or post-detection modules, extra position encoders, or external plug-in models are required at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same natural-language coordinate format could let future models reason about spatial relations across multiple images or video frames without changing the architecture.
Training data focused on referential chains may improve generalization on existing vision-language benchmarks by forcing explicit location grounding.
Deployment becomes simpler because the same model weights serve both open-ended dialogue and precise localization without separate tool-calling interfaces.
Robotics or augmented-reality agents could adopt the format to receive and report pointing instructions in ordinary language.

Load-bearing premise

A plain vision-encoder-plus-alignment-layer-plus-LLM stack can learn accurate spatial-coordinate prediction and interpretation from natural-language training pairs alone.

What would settle it

A test set of referential dialogues in which Shikra systematically outputs wrong box coordinates for the referenced objects while still producing fluent text would falsify the claim.
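One way to make "wrong box coordinates" measurable is the intersection-over-union (IoU) threshold commonly used for referring expression comprehension. The sketch below is illustrative only: the function names `iou` and `accuracy_at_iou` are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` tuples in a shared coordinate system, and the 0.5 threshold is an assumed default rather than a protocol confirmed by this page.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap rectangle, floored at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predicted, reference, threshold=0.5):
    """Fraction of predictions whose IoU with the reference box meets the threshold.

    The 0.5 default is an assumption made for illustration.
    """
    if not reference:
        return 0.0
    hits = sum(iou(p, r) >= threshold for p, r in zip(predicted, reference))
    return hits / len(reference)
```

A model that writes fluent text but scores near zero under a check like this, on held-out referential dialogues, would be the failure case described above.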

Original abstract

In human conversations, individuals can indicate relevant regions within a scene while addressing others. In turn, the other person can then respond by referring to specific regions if necessary. This natural referential ability in dialogue remains absent in current Multimodal Large Language Models (MLLMs). To fill this gap, this paper proposes an MLLM called Shikra, which can handle spatial coordinate inputs and outputs in natural language. Its architecture consists of a vision encoder, an alignment layer, and a LLM. It is designed to be straightforward and simple, without the need for extra vocabularies, position encoder, pre-/post-detection modules, or external plug-in models. All inputs and outputs are in natural language form. Referential dialogue is a superset of various vision-language (VL) tasks. Shikra can naturally handle location-related tasks like REC and PointQA, as well as conventional VL tasks such as Image Captioning and VQA. Experimental results showcase Shikra's promising performance. Furthermore, it enables numerous exciting applications, like providing mentioned objects' coordinates in chains of thoughts and comparing user-pointed regions similarities. Our code, model and dataset are accessed at https://github.com/shikras/shikra.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Shikra shows a plain vision-encoder-plus-LLM stack can output and read natural-language coordinate strings for referential tasks without extra modules, and the open code makes the claim checkable.


The core contribution is that referential dialogue (pointing at regions and responding to pointed regions) can be added to an MLLM by training it to treat normalized coordinates as ordinary text tokens. No new vocabulary, no position encoder, no separate detection head. The model is just the usual vision encoder, alignment layer, and LLM, trained end-to-end on referential data. That is the actual new piece relative to prior MLLMs that could not do this natively in dialogue form. It also unifies REC, PointQA, captioning, and VQA under one interface, which is convenient.

Releasing the code, weights, and dataset is the strongest part of the work; it turns the claim into something that can be run and measured directly. The stress-test note is right that the internal logic is consistent and the evaluation protocol is falsifiable. On the soft side, the abstract gives no numbers, so the real test is whether the released model actually reaches competitive REC and PointQA scores or whether it only works in narrow cases. If the gains are modest once you look at the tables, the paper becomes a useful engineering note rather than a capability jump. The weakest assumption is that the LLM will reliably parse and generate coordinate strings without hallucinating locations; the paper needs to show that this holds on held-out images and not just on the training distribution.

This is the kind of paper I would bring to a reading group focused on practical multimodal systems. Readers who want an open implementation they can extend for spatial dialogue will get immediate value from the repo. It deserves peer review because the architecture is minimal, the assets are public, and the task it targets is a clear missing piece in current MLLMs. I would not desk-reject it.

Referee Report

2 major / 2 minor

Summary. The paper introduces Shikra, an MLLM for referential dialogue that accepts and generates spatial coordinates as natural language text. Its architecture is a standard vision encoder plus alignment layer plus LLM, trained end-to-end without extra vocabularies, position encoders, detection modules, or plug-in models. The work claims this suffices for REC, PointQA, captioning, and VQA, demonstrates additional applications such as coordinate-aware chain-of-thought, and releases code, weights, and a public dataset.

Significance. If the empirical claims hold, the result is significant: it shows that referential spatial reasoning can be acquired by a conventional MLLM stack through data alone, without architectural specialization for coordinates. This simplifies model design and enables natural integration of location into dialogue and reasoning chains. The public release of code, model, and dataset is a clear strength that supports reproducibility and falsifiability of the central claim.

major comments (2)

Abstract and §4 (Experiments): the claim of 'promising performance' on REC, PointQA, captioning, and VQA is load-bearing for the central thesis that the simple architecture suffices, yet the manuscript supplies no quantitative metrics, baseline comparisons, or ablation results in the text provided. Without these numbers and controls, the sufficiency argument cannot be evaluated from the paper alone.
§3 (Method): the assertion that spatial information is carried solely by ordinary text tokens after the alignment layer, with no hidden position encoder or regression head, is central; however, the description lacks detail on coordinate tokenization, normalization scheme, and how the LLM learns to emit valid bounding-box strings. This detail is required to confirm the architecture truly operates without dedicated spatial components.

minor comments (2)

Abstract: the final sentence should read 'Our code, model, and dataset are available at ...' for standard academic phrasing.
The paper would benefit from an explicit statement of the coordinate normalization range (e.g., [0,1] or pixel values) and the exact format of coordinate strings used in training and inference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review, positive overall assessment, and recommendation for minor revision. The feedback helps clarify the presentation of our central claims. We address each major comment below and will update the manuscript accordingly.


Referee: Abstract and §4 (Experiments): the claim of 'promising performance' on REC, PointQA, captioning, and VQA is load-bearing for the central thesis that the simple architecture suffices, yet the manuscript supplies no quantitative metrics, baseline comparisons, or ablation results in the text provided. Without these numbers and controls, the sufficiency argument cannot be evaluated from the paper alone.

Authors: We agree that quantitative support is essential for the sufficiency claim. The full manuscript contains Section 4 with tables reporting metrics (e.g., Acc@0.5 on RefCOCO/RefCOCO+/RefCOCOg for REC, accuracy on PointQA, CIDEr/BLEU for captioning, and VQA scores), direct comparisons to prior MLLMs and specialized models, and ablations on training data and architecture variants. These appear to have been omitted from the excerpt provided to the referee. In revision we will (1) insert key headline numbers into the abstract and (2) add a concise summary paragraph at the start of §4 that explicitly ties the numbers to the 'simple architecture suffices' thesis. This change will make the evaluation self-contained.
revision: yes
Referee: §3 (Method): the assertion that spatial information is carried solely by ordinary text tokens after the alignment layer, with no hidden position encoder or regression head, is central; however, the description lacks detail on coordinate tokenization, normalization scheme, and how the LLM learns to emit valid bounding-box strings. This detail is required to confirm the architecture truly operates without dedicated spatial components.

Authors: We accept that the current §3 description is insufficiently precise. In the revised manuscript we will expand the method section with: (a) exact tokenization format (bounding boxes rendered as the literal string '[x1,y1,x2,y2]' using four decimal tokens after normalization), (b) normalization details (coordinates scaled to [0,1] relative to image width/height and tokenized with the same vocabulary as other text), and (c) training dynamics (the LLM is trained end-to-end with standard next-token prediction; no auxiliary regression loss or position embedding is added). We confirm there are no extra vocabularies, position encoders, or regression heads. These additions will be placed immediately after the architecture diagram so readers can verify the absence of dedicated spatial modules.
revision: yes
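
The coordinate format described in the response above can be sketched in a few lines. This is a hedged illustration and not code from the Shikra repository: the helper names `box_to_text` and `text_to_boxes` are hypothetical, and the three-decimal precision is an assumption, since the rebuttal fixes only the '[x1,y1,x2,y2]' layout and the [0,1] range.

```python
import re

# One non-negative decimal number, e.g. "0.250" or "1".
_NUM = r"(\d+(?:\.\d+)?)"
# Four comma-separated numbers inside square brackets.
_BOX_PATTERN = re.compile(r"\[\s*" + r"\s*,\s*".join([_NUM] * 4) + r"\s*\]")


def box_to_text(box, width, height, decimals=3):
    """Render a pixel-space (x1, y1, x2, y2) box as a normalized '[x1,y1,x2,y2]' string.

    `decimals=3` is an assumed precision chosen for illustration.
    """
    x1, y1, x2, y2 = box
    normalized = (x1 / width, y1 / height, x2 / width, y2 / height)
    # Clamp to [0, 1] so boxes that spill past the image edge still serialize validly.
    clamped = [min(max(v, 0.0), 1.0) for v in normalized]
    return "[" + ",".join(f"{v:.{decimals}f}" for v in clamped) + "]"


def text_to_boxes(text, width, height):
    """Extract well-formed normalized boxes from free text and map them back to pixels."""
    boxes = []
    for match in _BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (float(g) for g in match.groups())
        in_range = all(0.0 <= v <= 1.0 for v in (x1, y1, x2, y2))
        # Skip strings that look like boxes but are out of range or inverted.
        if in_range and x1 <= x2 and y1 <= y2:
            boxes.append((x1 * width, y1 * height, x2 * width, y2 * height))
    return boxes
```

For a 400×400 image, `box_to_text((100, 50, 300, 200), 400, 400)` gives `"[0.250,0.125,0.750,0.500]"`, and `text_to_boxes` recovers the pixel box from a sentence containing that string. The validity filter in the parser is one simple way to catch the malformed or hallucinated coordinate strings that the desk editor's note flags as a risk.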

Circularity Check

0 steps flagged

No significant circularity detected


The paper is an empirical architecture proposal for Shikra, an MLLM that processes spatial coordinates via natural language text tokens after a standard vision encoder + alignment layer + LLM stack. No equations, derivations, or first-principles results are presented that reduce to fitted parameters, self-definitions, or self-citation chains. Claims rest on released code, weights, and dataset plus standard REC/VQA benchmarks, which are externally falsifiable. The architecture description explicitly states the absence of position encoders or extra modules, so the central result does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper describes a model architecture but does not detail specific free parameters, axioms, or new entities beyond the model itself. Shikra is the name of the proposed model.

pith-pipeline@v0.9.0 · 5519 in / 1196 out tokens · 53078 ms · 2026-05-12T15:47:00.469995+00:00 · methodology


Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark
cs.CV 2026-05 unverdicted novelty 8.0

VISTA is the first large-scale interaction-aware benchmark that decomposes videos into entities, actions, and relations to diagnose spatio-temporal biases in vision-language models.
Qwen3-VL-Seg: Unlocking Open-World Referring Segmentation with Vision-Language Grounding
cs.CV 2026-05 unverdicted novelty 7.0

Qwen3-VL-Seg decodes MLLM bounding boxes into pixel-level referring segmentation via a lightweight box-guided mask decoder, new SA1B-ORS training data, and ORS-Bench evaluation, showing strong open-world performance.
From Diffusion to Rectified Flow: Rethinking Text-Based Segmentation
cs.CV 2026-05 unverdicted novelty 7.0

RLFSeg repurposes pretrained generative models via Rectified Flow for direct latent-space image-to-mask mapping in text-based segmentation, outperforming diffusion-based methods especially in zero-shot cases.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
cs.CV 2026-04 unverdicted novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
PixDLM: A Dual-Path Multimodal Language Model for UAV Reasoning Segmentation
cs.CV 2026-04 unverdicted novelty 7.0

The work introduces the UAV Reasoning Segmentation task, the DRSeg benchmark dataset, and PixDLM as a baseline dual-path multimodal language model for reasoning-based segmentation in aerial imagery.
BoxTuning: Directly Injecting the Object Box for Multimodal Model Fine-Tuning
cs.CV 2026-04 unverdicted novelty 7.0

By drawing object boxes and motion trails visually on video frames instead of serializing coordinates as text, BoxTuning reduces token costs dramatically and improves accuracy on video question answering benchmarks.
STORM: End-to-End Referring Multi-Object Tracking in Videos
cs.CV 2026-04 unverdicted novelty 7.0

STORM is an end-to-end MLLM for referring multi-object tracking that uses task-composition learning to leverage sub-task data and introduces the STORM-Bench dataset, achieving SOTA results.
Focus Matters: Phase-Aware Suppression for Hallucination in Vision-Language Models
cs.CV 2026-04 unverdicted novelty 7.0

Suppressing low-attention tokens during the focus phase of vision-encoder processing reduces object hallucinations in LVLMs while preserving caption quality and adding negligible inference time.
Evaluating Object Hallucination in Large Vision-Language Models
cs.CV 2023-05 accept novelty 7.0

Large vision-language models exhibit severe object hallucination that varies with training instructions, and the proposed POPE polling method evaluates it more stably and flexibly than prior approaches.
Guide, Think, Act: Interactive Embodied Reasoning in Vision-Language-Action Models
cs.RO 2026-05 conditional novelty 6.0

GTA-VLA conditions VLA models on user spatial priors to produce a unified spatial-visual chain-of-thought, reaching 81.2% success on SimplerEnv WidowX and improving performance under out-of-distribution shifts.
MMCL-Bench: Multimodal Context Learning from Visual Rules, Procedures, and Evidence
cs.CV 2026-05 unverdicted novelty 6.0

MMCL-Bench shows that even the strongest frontier multimodal models solve fewer than one-third of tasks requiring recovery and application of visual rules, procedures, and empirical patterns.
Mitigating Action-Relation Hallucinations in LVLMs via Relation-aware Visual Enhancement
cs.CV 2026-05 unverdicted novelty 6.0

A new attention-enhancement method using ARS scores and RVE reduces action-relation hallucinations in LVLMs while generalizing to spatial and object hallucinations.
Keep What Audio Cannot Say: Context-Preserving Token Pruning for Omni-LLMs
cs.CV 2026-05 unverdicted novelty 6.0

ContextGuard prunes 55% of tokens in Qwen2.5-Omni 7B while matching full performance on five of six audio-visual benchmarks by preserving audio-irrecoverable visual context.
SpatialForge: Bootstrapping 3D-Aware Spatial Reasoning from Open-World 2D Images
cs.CV 2026-05 unverdicted novelty 6.0

SpatialForge synthesizes 10 million spatial QA pairs from in-the-wild 2D images to train VLMs for better depth ordering, layout, and viewpoint-dependent reasoning.
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
cs.MM 2026-05 unverdicted novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
State Beyond Appearance: Diagnosing and Improving State Consistency in Dial-Based Measurement Reading
cs.CV 2026-04 unverdicted novelty 6.0

MLLMs ignore dial state geometry and cluster by appearance, causing inconsistency under variations; TriSCA's state-distance alignment, metadata supervision, and objective alignment improve robustness on clock and gaug...
One Identity, Many Roles: Multimodal Entity Coreference for Enhanced Video Situation Recognition
cs.CV 2026-04 unverdicted novelty 6.0

CineMEC performs multimodal entity coreference by clustering visual entities and aligning them with text role mentions to boost captioning and grounding performance on an extended VidSitu dataset.
CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
cs.CV 2026-04 unverdicted novelty 6.0

CoME-VL fuses contrastive and self-supervised vision encoders via entropy-guided multi-layer aggregation and RoPE cross-attention to improve vision-language model performance on benchmarks.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
cs.CV 2025-08 unverdicted novelty 6.0

InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
cs.CV 2025-04 conditional novelty 6.0

InternVL3-78B sets a new open-source SOTA of 72.2 on MMMU via native joint multimodal pre-training, V2PE, MPO, and test-time scaling while remaining competitive with proprietary models.
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
cs.CV 2024-12 unverdicted novelty 6.0

InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
ShareGPT4V: Improving Large Multi-Modal Models with Better Captions
cs.CV 2023-11 conditional novelty 6.0

A new 1.2M-caption dataset generated via GPT-4V improves LMMs on MME and MMBench by 222.8/22.0/22.3 and 2.7/1.3/1.5 points respectively when used for supervised fine-tuning.
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
eess.AS 2023-11 unverdicted novelty 6.0

Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.
Otter: A Multi-Modal Model with In-Context Instruction Tuning
cs.CV 2023-05 unverdicted novelty 6.0

Otter is a multi-modal model instruction-tuned on the MIMIC-IT dataset of over 3 million in-context instruction-response pairs to improve convergence and generalization on tasks with multiple images and videos.
Not Blind but Silenced: Rebalancing Vision and Language via Adversarial Counter-Commonsense Equilibrium
cs.CV 2026-05 unverdicted novelty 5.0

ACE uses adversarial counter-commonsense perturbations on image tokens during decoding to suppress hallucinated linguistic priors while preserving stable visual signals in MLLMs.
StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning
cs.CV 2026-05 unverdicted novelty 5.0

StateVLM uses an Auxiliary Regression Loss on box decoder outputs to boost VLMs' accuracy on object and state localization for robotic affordance reasoning, with gains of 1.6% on RefCOCO variants and 5.2% on the new O...
Aligning What Vision-Language Models See and Perceive with Adaptive Information Flow
cs.CV 2026-04 unverdicted novelty 5.0

An inference-time technique that uses token activation dynamics to adaptively restrict text attention to important visual tokens, improving VLM accuracy on VQA, grounding, counting, OCR, and hallucination benchmarks.
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models
cs.AI 2026-04 unverdicted novelty 5.0

Multimodal reasoning models hallucinate at high-entropy cognitive bifurcation points due to loss of visual semantic anchoring, and the V-STAR training paradigm with HVAR rewards and FRM reflection mitigates this by re...
Reasoning-Guided Grounding: Elevating Video Anomaly Detection through Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 5.0

VANGUARD is a staged-training VLM framework that reports 94% ROC-AUC and 84% F1 on UCF-Crime while adding chain-of-thought reasoning and spatial grounding to video anomaly detection.
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
cs.CV 2024-12 accept novelty 5.0

DeepSeek-VL2 is a series of MoE vision-language models using dynamic tiling and latent attention that reach competitive or state-of-the-art results on VQA, OCR, document understanding and grounding with 1.0B to 4.5B a...
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
cs.CV 2024-08 conditional novelty 5.0

MiniCPM-Llama3-V 2.5 delivers GPT-4V-level multimodal performance on phones through architecture, pretraining, and alignment optimizations.
Hallucination of Multimodal Large Language Models: A Survey
cs.CV 2024-04 accept novelty 5.0

The survey organizes causes of hallucinations in MLLMs, reviews evaluation benchmarks and metrics, and outlines mitigation approaches plus open questions.
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
cs.CV 2023-12 unverdicted novelty 5.0

InternVL scales a vision model to 6B parameters and aligns it with LLMs using web data to achieve state-of-the-art results on 32 visual-linguistic benchmarks.
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
cs.CV 2024-06 unverdicted novelty 4.0

VideoLLaMA 2 improves video LLMs via a new STC connector for spatial-temporal dynamics and joint audio training, reaching competitive results on video QA and captioning benchmarks.
How Far Are We to GPT-4V? Closing the Gap to Commercial Multimodal Models with Open-Source Suites
cs.CV 2024-04 unverdicted novelty 4.0

InternVL 1.5 narrows the performance gap to proprietary multimodal models via a stronger transferable vision encoder, dynamic high-resolution tiling, and curated English-Chinese training data.
Improved Baselines with Visual Instruction Tuning
cs.CV 2023-10 conditional novelty 4.0

Simple changes to LLaVA using CLIP-ViT-L-336px, an MLP connector, and academic VQA data yield state-of-the-art results on 11 benchmarks with only 1.2M public examples and one-day training on 8 A100 GPUs.
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
cs.CV 2026-04 unverdicted novelty 3.0

This review organizes literature on large multimodal models and object-centric vision into four themes—understanding, referring segmentation, editing, and generation—while summarizing paradigms, strategies, and challe...