arxiv: 2601.13633 · v3 · submitted 2026-01-20 · 💻 cs.CV

EGM: Efficient Visual Grounding Language Models

Pith reviewed 2026-05-16 13:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual grounding · visual language models · efficient inference · token generation · RefCOCO · amodal grounding · VLM latency

The pith

Small visual language models can match large VLMs on visual grounding by generating many mid-quality tokens instead of few high-quality ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the main shortfall of small VLMs in visual grounding comes from weaker language understanding rather than visual encoding, since visual encoders are similar in size across models. It introduces EGM to close this gap by having the small model produce many mid-quality tokens that collectively match the output quality of a large model using fewer expensive tokens. On RefCOCO this yields 91.4 IoU at 737 ms average latency, compared with 90.5 IoU at 4320 ms for a 235B model. The same token-volume approach also lifts performance on a new amodal grounding task that requires predicting both visible and occluded object parts.
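
For readers unfamiliar with the metric, here is a minimal sketch of bounding-box IoU under the usual definition (intersection area divided by union area). The (x1, y1, x2, y2) box format and the name `box_iou` are assumptions made for illustration; the paper's own evaluation code is not shown on this page and may differ in detail.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    The box format is an assumption for this sketch.
    """
    # Corners of the overlap rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes, purely to exercise the function.
predicted = (10, 10, 50, 50)
ground_truth = (20, 20, 60, 60)
print(round(box_iou(predicted, ground_truth), 3))  # 0.391
```

A reported score such as 91.4 would then be an aggregate over the benchmark's samples, expressed as a percentage; how exactly the paper aggregates is not stated here.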

Core claim

EGM shows that small VLMs can reach or exceed the grounding accuracy of far larger models by increasing the quantity of mid-quality tokens they generate, delivering equivalent or better IoU scores with substantially lower end-to-end latency.

What carries the argument

EGM (Efficient visual Grounding language Models), a token-generation strategy that produces many mid-quality tokens from a small VLM to compensate for limited language-model capacity.
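
One way to picture the trade-off is a simple decode-cost model in which latency is a fixed prefill cost plus a per-token cost. The sketch below is illustrative only: the function names and every number in it are hypothetical placeholders, not measurements from the paper.

```python
def decode_latency_ms(prefill_ms, n_tokens, ms_per_token):
    """Toy latency model: fixed prefill cost plus a linear per-token cost."""
    return prefill_ms + n_tokens * ms_per_token


def max_tokens_within_budget(budget_ms, prefill_ms, ms_per_token):
    """Largest token count whose modelled latency stays within budget_ms."""
    if budget_ms < prefill_ms:
        return 0
    return int((budget_ms - prefill_ms) // ms_per_token)


# Hypothetical placeholder costs (NOT from the paper).
large_latency = decode_latency_ms(prefill_ms=800, n_tokens=20, ms_per_token=40)
small_budget = max_tokens_within_budget(large_latency, prefill_ms=100, ms_per_token=5)

print(large_latency)  # 1600
print(small_budget)   # 300
```

Under these placeholder costs the small model could emit many more tokens than the large one before the two latencies meet, which is the room the token-volume argument relies on.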

If this is right

  • Small VLMs become practical for real-time grounding on edge devices.
  • The same token-volume method improves both standard and amodal grounding accuracy.
  • On RefCOCO, end-to-end latency drops by a factor of 5.9 (737 ms versus 4,320 ms) while IoU matches or exceeds that of the large model.
  • Deployment cost drops because the language model stays small and only the generated token count rises.
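
The 5.9 times figure can be checked directly from the two RefCOCO latencies quoted in the summary. The snippet only restates that arithmetic; the variable names are illustrative.

```python
# Figures quoted from the abstract (RefCOCO).
small_latency_ms = 737    # EGM-Qwen3-VL-8B
large_latency_ms = 4320   # Qwen3-VL-235B
small_iou = 91.4
large_iou = 90.5

speedup = large_latency_ms / small_latency_ms
iou_gain = small_iou - large_iou

print(f"speedup: {speedup:.1f}x")       # speedup: 5.9x
print(f"IoU gain: {iou_gain:.1f}")      # IoU gain: 0.9
```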

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Token quantity may act as a scalable substitute for raw model size in other perception-language tasks.
  • Hardware that processes larger token batches efficiently could widen the advantage of this approach.
  • The technique might reduce the need for trillion-parameter models in grounding-heavy applications.

Load-bearing premise

The performance difference between small and large VLMs is caused mainly by language-model size, and simply increasing the count of mid-quality tokens closes the gap without introducing new failure modes.

What would settle it

A direct measurement on RefCOCO showing that the 8B EGM model either falls short of 91.4 IoU or fails to keep total latency under 800 ms when the number of generated mid-quality tokens is increased.

Figures

Figures reproduced from arXiv: 2601.13633 by Changye Li, Guanqi Zhan, Ligeng Zhu, Song Han, Yao Lu, Yi Wu, Zhijian Liu.

Figure 1: Overview of Efficient Visual Grounding Language Models.
Figure 2: Failure cases of small VLMs. We find small VLMs, e.g., InternVL-3-8B, tend to fail when the text prompt is semantically complicated and there are multiple candidates in the image that can confuse the model. We term this failure pattern ‘COMPLEX-PROMPT’ and label the ground truth bbox in blue, and the 8B model prediction in orange in examples.
Figure 3: Overview of our method. Top (a): Data curation pipeline of SFT training data with reasoning. We feed the image, text prompt and ground truth bounding box of the target object into a proprietary VLM to generate the detailed reasoning process of how to locate the object correctly given the image and text prompt. The generated reasoning process is incorporated as part of the training data. Middle (b): Example…
Figure 3: The task is to predict the amodal bounding box of the tiger behind the
Figure 4: Demo of Qwen3-VL-8B, Qwen3-VL-235B, and our EGM-Qwen3-VL-8B for amodal grounding in autonomous driving and robotics scenarios.
Figure 5: Accuracy vs. Efficiency. Our models, such as EGM-Qwen3-VL-4B and EGM-Qwen3-VL-8B, have greatly improved the efficiency of visual grounding. For example, EGM-Qwen3-VL-8B outperforms both the state-of-the-art Qwen3-VL-235B-Instruct and Qwen3-VL-235B-Thinking models for accuracy, while speeding up 5.9×/18.9× in terms of GPU latency. For Qwen models, ‘-T’ denotes ‘-Thinking’ and ‘-I’ denotes ‘-Instruct’. The t…
Figure 6: Qualitative comparison of Qwen3-VL-8B, Qwen3-VL-235B, and our EGM-Qwen3-VL-8B for vanilla grounding and amodal grounding. Vanilla grounding: top left, top right and bottom left; Amodal grounding: bottom right.
Figure 7: Prompt for training and inference of models from two different fam
Figure 8: Prompt for vanilla grounding reasoning dataset generation
Figure 9: Prompt for amodal grounding prompt generation
Figure 10: Prompt for amodal grounding prompt verification
Figure 11: Prompt for amodal grounding reasoning dataset generation
Figure 12: Qualitative comparison of InternVL-3-8B, InternVL-3-78B, and our EGM-InternVL-3-8B for vanilla grounding and amodal grounding. Vanilla grounding: top left, top right and bottom left; Amodal grounding: bottom right
Figure 13: The prompt to analyze failure reasons of small VLMs.
Original abstract

Visual grounding is an essential capability of Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding visual language models usually have large model sizes, making them heavy for deployment and slow for inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs and the major difference is the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than visual information handling. To mitigate the gap, we introduce 'Efficient visual Grounding language Models' (EGM): generate many mid-quality tokens (from small models) to match the performance of large VLMs with few high-quality but expensive tokens. This method is deployment-friendly, and yields better end-to-end latency: On the RefCOCO benchmark, our EGM-Qwen3-VL-8B demonstrates 91.4 IoU with an average of 737ms (5.9x faster) latency while Qwen3-VL-235B demands 4,320ms to reach 90.5 IoU. To validate our approach's generality, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of the objects. Experiments show our method consistently improves both vanilla and amodal grounding capabilities of small models to match or outperform larger models, thereby improving efficiency for visual grounding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the performance gap between small and large VLMs on visual grounding stems mainly from language-model scale rather than visual-encoder differences, and introduces EGM to close this gap by emitting many mid-quality tokens from small models instead of few high-quality tokens from large models. On RefCOCO, EGM-Qwen3-VL-8B is reported to reach 91.4 IoU at 737 ms (5.9x faster) versus 90.5 IoU at 4320 ms for Qwen3-VL-235B; similar gains are shown on a newly introduced amodal grounding task that requires predicting both visible and occluded object parts.

Significance. If the empirical claims hold after proper validation, the work would offer a practical route to high-accuracy visual grounding in latency-sensitive and resource-constrained settings without scaling the entire VLM. The introduction of the amodal benchmark is a modest but useful addition for testing robustness to occlusion.

major comments (2)
  1. [Abstract] Abstract: the premise that visual encoders are 'nearly the same' for small and large VLMs and that the grounding gap is caused only by language-model size is load-bearing yet unsupported; no encoder-feature comparison (spatial alignment quality, embedding similarity on RefCOCO images, or grounding-relevant metrics) is supplied to show that simply increasing mid-quality token count can substitute for the 235B encoder's output without new failure modes.
  2. [Results] Results (RefCOCO and amodal experiments): the headline numbers (91.4 IoU / 737 ms vs. 90.5 IoU / 4320 ms) are single point estimates with no error bars, no statistical tests, no ablation on token quantity or generation strategy, and no analysis of attention dilution or context-length effects; this makes the 5.9x speedup claim and the assertion that the method 'consistently improves' both tasks difficult to evaluate.
minor comments (2)
  1. The method section should provide a precise description of how mid-quality tokens are generated, ranked, and injected into the language-model context so that the approach can be reproduced.
  2. Latency measurements should state the exact hardware, batch size, and inference framework used; without this the 737 ms and 4320 ms figures cannot be compared across papers.
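
One common way to attach an error bar to a mean IoU is a percentile bootstrap over per-sample scores. The sketch below assumes per-sample IoU values are available; the values shown are made-up placeholders, and `bootstrap_ci` is a hypothetical helper, not something taken from the paper.

```python
import random
import statistics


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(values, k=n)) for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-sample IoU scores (NOT results from the paper).
sample_iou = [0.95, 0.88, 0.91, 0.97, 0.42, 0.93, 0.90, 0.96, 0.89, 0.94]
low, high = bootstrap_ci(sample_iou)
print(f"mean={statistics.fmean(sample_iou):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

The same helper applies to per-sample latencies, which would speak to the second minor comment as well.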

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to strengthen the core premise with direct encoder comparisons and to add rigorous statistical validation, ablations, and analyses for the reported results. Our point-by-point responses follow.

  1. Referee: [Abstract] Abstract: the premise that visual encoders are 'nearly the same' for small and large VLMs and that the grounding gap is caused only by language-model size is load-bearing yet unsupported; no encoder-feature comparison (spatial alignment quality, embedding similarity on RefCOCO images, or grounding-relevant metrics) is supplied to show that simply increasing mid-quality token count can substitute for the 235B encoder's output without new failure modes.

    Authors: We acknowledge that the abstract presents the observation on encoder sizes without accompanying feature-level evidence. In the revision we will add a dedicated analysis (new subsection and appendix) comparing the small and large VLMs' visual encoders on RefCOCO images, including cosine similarity of embeddings, spatial alignment quality metrics, and grounding-relevant feature statistics. This will directly support that visual representations are comparable and that the performance gap is driven by language-model scale. We will also examine and report any new failure modes that arise from emitting more mid-quality tokens, such as changes in attention distribution. revision: yes

  2. Referee: [Results] Results (RefCOCO and amodal experiments): the headline numbers (91.4 IoU / 737 ms vs. 90.5 IoU / 4320 ms) are single point estimates with no error bars, no statistical tests, no ablation on token quantity or generation strategy, and no analysis of attention dilution or context-length effects; this makes the 5.9x speedup claim and the assertion that the method 'consistently improves' both tasks difficult to evaluate.

    Authors: We agree that single-run point estimates limit the strength of the claims. In the revised version we will rerun the key experiments with multiple random seeds to report means and standard deviations, add ablations on token quantity and generation strategies (including beam search variants), and include attention-map visualizations plus quantitative metrics to assess dilution and context-length effects. These additions will allow a more robust evaluation of the 5.9x speedup and the consistent gains on both RefCOCO and the amodal task. revision: yes
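
The embedding-similarity comparison mentioned in the first response could start from plain cosine similarity between two feature vectors for the same image. This is a sketch under the assumption that both encoders' features have already been pooled or projected to the same dimension; `cosine_similarity` and the example vectors are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Placeholder feature vectors (NOT real encoder outputs).
small_encoder_feat = [0.2, 0.7, 0.1, 0.5]
large_encoder_feat = [0.25, 0.65, 0.05, 0.55]
print(round(cosine_similarity(small_encoder_feat, large_encoder_feat), 4))
```

High similarity alone would not show that the encoders are equally useful for grounding, which is why the response also lists spatial-alignment and grounding-relevant metrics.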

Circularity Check

0 steps flagged

No circularity; purely empirical claims with no derivations or self-referential reductions

full rationale

The paper introduces an empirical technique (EGM) for generating additional mid-quality tokens from smaller VLMs to approach the grounding performance of larger models. No equations, parameter fittings, or derivation chains are present that reduce by construction to the paper's own inputs. Performance numbers (e.g., 91.4 IoU at 737ms on RefCOCO) are direct benchmark comparisons against public models. The stated observation that visual-encoder sizes are similar across model scales is presented as a premise for the method rather than a derived result from self-citations or fitted data. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method is presented as an empirical engineering adjustment.

pith-pipeline@v0.9.0 · 5570 in / 1057 out tokens · 29978 ms · 2026-05-16T13:03:22.768094+00:00 · methodology

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 16 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023)

  2. [2]

    Alibaba: Qwen3-vl (2025)

  3. [3]

    Anthropic: Claude-4.5 (2025)

  4. [4]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)

  5. [5]

    Briscoe, R.E.: Mental imagery and the varieties of amodal perception. Pacific Philosophical Quarterly 92(2), 153–173 (2011)

  6. [6]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Chen, K., Ramanan, D., Khurana, T.: Using diffusion priors for video amodal segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22890–22900 (2025)

  7. [7]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 24185–24198 (2024)

  8. [8]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

  9. [9]

    arXiv e-prints pp

    Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al.: Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints pp. arXiv–2409 (2024)

  10. [10]

    arXiv e-prints pp

    Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The llama 3 herd of models. arXiv e-prints pp. arXiv–2407 (2024)

  11. [11]

    Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (2024)

    Evans, T., Parthasarathy, N., Merzic, H., Henaff, O.J.: Data curation via joint example selection further accelerates multimodal learning. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (2024)

  12. [12]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)

  13. [13]

    arXiv preprint arXiv:2312.12433 (2023)

    Hsieh, C.Y., Khurana, T., Dave, A., Ramanan, D.: Tracking any object amodally. arXiv preprint arXiv:2312.12433 (2023)

  14. [14]

    GPT-4o System Card

    Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)

  15. [15]

    Psychological Research 88(2), 307–337 (2024)

    Kaup, B., Ulrich, R., Bausenhart, K.M., Bryce, D., Butz, M.V., Dignath, D., Dudschig, C., Franz, V.H., Friedrich, C., Gawrilow, C., et al.: Modal and amodal cognition: an overarching principle in various domains of psychology. Psychological Research 88(2), 307–337 (2024)

  16. [16]

    In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)

    Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: Referitgame: Referring to objects in photographs of natural scenes. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). pp. 787–798 (2014)

  17. [17]

    Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., Fagan, P.D., Hejna, J., Itkina, M., Lepert, M., Ma, Y.J., Miller, P.T., Wu, J., Belkhale, S., Dass, S., Ha, H., Jain, A., Lee, A., Lee, Y., Memmel, M., Park, S., Radosavovic, I., Wang, K., Zhan, A., Black, K., Chi, C., Ha...

  18. [18]

    In: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles (2023)

    Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with pagedattention. In: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles (2023)

  19. [19]

    LLaVA-OneVision: Easy Visual Task Transfer

    Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)

  20. [20]

    In: Proceedings of the European Conference on Computer Vision (ECCV)

    Li, K., Malik, J.: Amodal instance segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 677–693. Springer (2016)

  21. [21]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Li, Z., Lavreniuk, M., Shi, J., Bhat, S.F., Wonka, P.: Amodal depth anything: Amodal depth estimation in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9673–9682 (2025)

  22. [22]

    In: Proceedings of the European Conference on Computer Vision (ECCV)

    Li, Z., Ye, W., Jiang, T., Huang, T.: 2d amodal instance segmentation guided by 3d shape prior. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 165–181. Springer (2022)

  23. [23]

    IEEE Transactions on Multimedia (MM) (2023)

    Li, Z., Ye, W., Jiang, T., Huang, T.: Gin: Generative invariant shape prior for amodal instance segmentation. IEEE Transactions on Multimedia (MM) (2023)

  24. [24]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

    Li, Z., Ye, W., Terven, J., Bennett, Z., Zheng, Y., Jiang, T., Huang, T.: Muva: A new large-scale benchmark for multi-view amodal instance segmentation in the shopping scenario. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 23504–23513 (2023)

  25. [25]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: Vila: On pre-training for visual language models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 26689–26699 (2024)

  26. [26]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Liu, Z., Qiao, L., Chu, X., Ma, L., Jiang, T.: Towards efficient foundation model for zero-shot amodal segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20254–20264 (2025)

  27. [27]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: Nvila: Efficient frontier visual language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4122–4134 (2025)

  28. [28]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Lu, R., Chen, Y., Liu, Y., Tang, J., Ni, J., Wan, D., Zeng, G., Huang, S.: Taco: Taming diffusion for in-the-wild video amodal completion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13638–13650 (2025)

  29. [29]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 11–20 (2016)

  30. [30]

    Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: Deepseekmath: Pushing the limits of mathematical reasoning in open language models (2024), https://arxiv.org/abs/2402.03300

  31. [31]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256 (2024)

  32. [32]

    Tan, H., Pan, J., Lin, J., Chen, T., Zheng, Z., Tang, Z., Yang, H.: Gtpo and grpo-s: Token and sequence-level reward shaping with policy entropy (2025), https://arxiv.org/abs/2508.04349

  33. [33]

    Gemini: A Family of Highly Capable Multimodal Models

    Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)

  34. [34]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)

  35. [35]

    LLaMA: Open and Efficient Foundation Language Models

    Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)

  36. [36]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  37. [37]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)

  38. [38]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

  39. [39]

    Amodal3r: Amodal 3d reconstruction from occluded 2d images

    Wu, T., Zheng, C., Guan, F., Vedaldi, A., Cham, T.J.: Amodal3r: Amodal 3d reconstruction from occluded 2d images. arXiv preprint arXiv:2503.13439 (2025)

  40. [40]

    International Journal of Computer Vision (IJCV) (2025)

    Xia, Y., Ding, R., Qin, Z., Zhan, G., Zhou, K., Yang, L., Dong, H., Cremers, D.: Targo: benchmarking target-driven object grasping under occlusions. International Journal of Computer Vision (IJCV) (2025)

  41. [41]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Xu, K., Zhang, L., Shi, J.: Amodal completion via progressive mixed context diffusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  42. [42]

    Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.Y., Zhang, Y.Q., Yan, L., Qiao, M., Wu, Y., Wang, M.: Dapo: An open-source l...

  43. [43]

    Proceedings of the IEEE International Conference on Content-Based Multimedia Indexing (CBMI) (2025)

    Zhan, G., Liu, Y., Han, K., Xie, W., Zisserman, A.: Elip: Enhanced visual-language foundation models for image retrieval. Proceedings of the IEEE International Conference on Content-Based Multimedia Indexing (CBMI) (2025)

  44. [44]

    Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

    Zhan, G., Zheng, C., Xie, W., Zisserman, A.: Amodal ground truth and completion in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)

  45. [45]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Zhan, X., Pan, X., Dai, B., Liu, Z., Lin, D., Loy, C.C.: Self-supervised scene de-occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3784–3792 (2020)

  46. [46]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)

  47. [47]

    Zhu, Y., Tian, Y., Metaxas, D., Dollár, P.: Semantic amodal segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1464–1472 (2017)
