EGM: Efficient Visual Grounding Language Models

Changye Li; Guanqi Zhan; Ligeng Zhu; Song Han; Yao Lu; Yi Wu; Zhijian Liu

arxiv: 2601.13633 · v3 · submitted 2026-01-20 · 💻 cs.CV

Guanqi Zhan, Changye Li, Zhijian Liu, Yao Lu, Yi Wu, Song Han, Ligeng Zhu

Pith reviewed 2026-05-16 13:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual grounding, visual language models, efficient inference, token generation, RefCOCO, amodal grounding, VLM latency

The pith

Small visual language models can match large VLMs on visual grounding by generating many mid-quality tokens instead of few high-quality ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the main shortfall of small VLMs in visual grounding comes from weaker language understanding rather than visual encoding, since visual encoders are similar in size across models. It introduces EGM to close this gap by having the small model produce many mid-quality tokens that collectively match the output quality of a large model using fewer expensive tokens. On RefCOCO this yields 91.4 IoU at 737 ms average latency, compared with 90.5 IoU at 4320 ms for a 235B model. The same token-volume approach also lifts performance on a new amodal grounding task that requires predicting both visible and occluded object parts.

Core claim

EGM shows that small VLMs can reach or exceed the grounding accuracy of far larger models by increasing the quantity of mid-quality tokens they generate, delivering equivalent or better IoU scores with substantially lower end-to-end latency.

What carries the argument

EGM (Efficient visual Grounding language Models), a token-generation strategy that produces many mid-quality tokens from a small VLM to compensate for limited language-model capacity.

If this is right

Small VLMs become practical for real-time grounding on edge devices.
The same token-volume method improves both standard and amodal grounding accuracy.
End-to-end inference on RefCOCO becomes about 5.9 times faster (737 ms versus 4320 ms) while maintaining or exceeding large-model IoU.
Deployment cost drops because the language model stays small and only the token count rises.
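
The 5.9x figure can be checked directly from the two latencies quoted in the abstract. The snippet below is a minimal arithmetic sketch; the dictionary and function names are illustrative and do not come from the paper.

```python
# Latency (ms) and IoU figures as quoted in the abstract for RefCOCO.
LATENCY_MS = {"EGM-Qwen3-VL-8B": 737.0, "Qwen3-VL-235B": 4320.0}
IOU = {"EGM-Qwen3-VL-8B": 91.4, "Qwen3-VL-235B": 90.5}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if candidate_ms <= 0:
        raise ValueError("candidate latency must be positive")
    return baseline_ms / candidate_ms


ratio = speedup(LATENCY_MS["Qwen3-VL-235B"], LATENCY_MS["EGM-Qwen3-VL-8B"])
iou_gain = IOU["EGM-Qwen3-VL-8B"] - IOU["Qwen3-VL-235B"]
print(f"speedup: {ratio:.2f}x, IoU difference: {iou_gain:+.1f}")
```

The ratio comes out just under 5.9, so the headline speedup is a rounded value of 4320 / 737.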

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Token quantity may act as a scalable substitute for raw model size in other perception-language tasks.
Hardware that processes larger token batches efficiently could widen the advantage of this approach.
The technique might reduce the need for trillion-parameter models in grounding-heavy applications.

Load-bearing premise

The performance difference between small and large VLMs is caused mainly by language-model size, and simply increasing the count of mid-quality tokens closes the gap without introducing new failure modes.

What would settle it

A direct measurement on RefCOCO showing that the 8B EGM model either falls short of 91.4 IoU or fails to keep total latency under 800 ms when the number of generated mid-quality tokens is increased.
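
As a rough illustration of what such a measurement involves, the sketch below computes intersection-over-union for two axis-aligned boxes and checks a run against the two thresholds named above. The function names, the `(x1, y1, x2, y2)` box layout, and the example boxes are assumptions made for illustration; this page does not state how the paper aggregates per-example results into its reported score.

```python
from typing import Tuple

# Assumed box layout: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(pred: Box, gt: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    ix2, iy2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_gt = (gt[2] - gt[0]) * (gt[3] - gt[1])
    union = area_pred + area_gt - inter
    return inter / union if union > 0 else 0.0


def claim_holds(score: float, latency_ms: float,
                score_target: float = 91.4,
                latency_budget_ms: float = 800.0) -> bool:
    """True if a run meets both thresholds named in the text above."""
    return score >= score_target and latency_ms < latency_budget_ms


# Hypothetical boxes, not taken from the paper.
pred_box = (10.0, 10.0, 50.0, 50.0)
gt_box = (20.0, 20.0, 60.0, 60.0)
print(f"IoU = {box_iou(pred_box, gt_box):.3f}")
print(claim_holds(91.4, 737.0), claim_holds(91.4, 850.0))
```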

Figures

Figures reproduced from arXiv: 2601.13633 by Changye Li, Guanqi Zhan, Ligeng Zhu, Song Han, Yao Lu, Yi Wu, Zhijian Liu.

**Figure 2.** Failure cases of small VLMs. We find small VLMs, e.g., InternVL-3-8B, tend to fail when the text prompt is semantically complicated and there are multiple candidates in the image that can confuse the model. We term this failure pattern ‘COMPLEX-PROMPT’ and label the ground truth bbox in blue, and the 8B model prediction in orange in examples.

**Figure 3.** Overview of our method. Top (a): Data curation pipeline of SFT training data with reasoning. We feed the image, text prompt and ground truth bounding box of the target object into a proprietary VLM to generate the detailed reasoning process of how to locate the object correctly given the image and text prompt. The generated reasoning process is incorporated as part of the training data. Middle (b): Example…

**Figure 3.** The task is to predict the amodal bounding box of the tiger behind the…

**Figure 4.** Demo of Qwen3-VL-8B, Qwen3-VL-235B, and our EGM-Qwen3-VL-8B for amodal grounding in autonomous driving and robotics scenarios.

**Figure 5.** Accuracy vs. Efficiency. Our models, such as EGM-Qwen3-VL-4B and EGM-Qwen3-VL-8B, have greatly improved the efficiency of visual grounding. For example, EGM-Qwen3-VL-8B outperforms both the state-of-the-art Qwen3-VL-235B-Instruct and Qwen3-VL-235B-Thinking models for accuracy, while speeding up 5.9×/18.9× in terms of GPU latency. For Qwen models, ‘-T’ denotes ‘-Thinking’ and ‘-I’ denotes ‘-Instruct’. The t…

**Figure 6.** Qualitative comparison of Qwen3-VL-8B, Qwen3-VL-235B, and our EGM-Qwen3-VL-8B for vanilla grounding and amodal grounding. Vanilla grounding: top left, top right and bottom left; Amodal grounding: bottom right.

**Figure 7.** Prompt for training and inference of models from two different fam…

**Figure 8.** Prompt for vanilla grounding reasoning dataset generation.

**Figure 9.** Prompt for amodal grounding prompt generation.

**Figure 10.** Prompt for amodal grounding prompt verification.

**Figure 11.** Prompt for amodal grounding reasoning dataset generation.

**Figure 12.** Qualitative comparison of InternVL-3-8B, InternVL-3-78B, and our EGM-InternVL-3-8B for vanilla grounding and amodal grounding. Vanilla grounding: top left, top right and bottom left; Amodal grounding: bottom right.

**Figure 13.** The prompt to analyze failure reasons of small VLMs.

Original abstract

Visual grounding is an essential capability of Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding visual language models usually have large model sizes, making them heavy for deployment and slow for inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs and the major difference is the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than visual information handling. To mitigate the gap, we introduce 'Efficient visual Grounding language Models' (EGM): generate many mid-quality tokens (from small models) to match the performance of large VLMs with few high-quality but expensive tokens. This method is deployment-friendly, and yields better end-to-end latency: On the RefCOCO benchmark, our EGM-Qwen3-VL-8B demonstrates 91.4 IoU with an average of 737ms (5.9x faster) latency while Qwen3-VL-235B demands 4,320ms to reach 90.5 IoU. To validate our approach's generality, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of the objects. Experiments show our method consistently improves both vanilla and amodal grounding capabilities of small models to match or outperform larger models, thereby improving efficiency for visual grounding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EGM shows an 8B VLM can match a 235B model's RefCOCO grounding accuracy by emitting more mid-quality tokens, cutting latency roughly 6x, but the claim rests on an untested assumption about visual encoders.

The key point is that this paper gets an 8B VLM to hit 91.4 IoU on RefCOCO grounding with 737 ms latency, beating the 235B model's 90.5 IoU at 4320 ms by using more mid-quality tokens instead of a bigger model. They argue the visual encoders are similar in size across model scales, so the performance gap comes from the language model. Their fix is to generate lots of those mid-quality tokens from the small model to make up for weaker language understanding. This works on the standard RefCOCO benchmark and they extend it to a new amodal grounding task that includes occluded object parts.

The strength is in the concrete efficiency numbers and the focus on real deployment latency. Showing consistent gains on both visible and amodal cases makes the method look more general than just one benchmark trick.

The main weakness is the lack of supporting evidence for the core assumption. The paper says the encoders are nearly the same but gives no direct comparison of their feature quality or spatial accuracy on the same images. If the small model's encoder produces less precise features, adding more tokens might not close the gap the way they think and could create new problems like diluted attention. The results are also just point estimates with no error bars or ablations on how many tokens are needed.

This paper is aimed at engineers and researchers working on running VLMs on edge devices for robotics or AR. Anyone looking for practical ways to speed up visual grounding without huge models will find the latency claims useful. It deserves peer review because the empirical result on speed versus accuracy is worth a closer look, even though the analysis of why it works needs more depth.

Referee Report

2 major / 2 minor

Summary. The paper claims that the performance gap between small and large VLMs on visual grounding stems mainly from language-model scale rather than visual-encoder differences, and introduces EGM to close this gap by emitting many mid-quality tokens from small models instead of few high-quality tokens from large models. On RefCOCO, EGM-Qwen3-VL-8B is reported to reach 91.4 IoU at 737 ms (5.9x faster) versus 90.5 IoU at 4320 ms for Qwen3-VL-235B; similar gains are shown on a newly introduced amodal grounding task that requires predicting both visible and occluded object parts.

Significance. If the empirical claims hold after proper validation, the work would offer a practical route to high-accuracy visual grounding in latency-sensitive and resource-constrained settings without scaling the entire VLM. The introduction of the amodal benchmark is a modest but useful addition for testing robustness to occlusion.

major comments (2)

[Abstract] Abstract: the premise that visual encoders are 'nearly the same' for small and large VLMs and that the grounding gap is caused only by language-model size is load-bearing yet unsupported; no encoder-feature comparison (spatial alignment quality, embedding similarity on RefCOCO images, or grounding-relevant metrics) is supplied to show that simply increasing mid-quality token count can substitute for the 235B encoder's output without new failure modes.
[Results] Results (RefCOCO and amodal experiments): the headline numbers (91.4 IoU / 737 ms vs. 90.5 IoU / 4320 ms) are single point estimates with no error bars, no statistical tests, no ablation on token quantity or generation strategy, and no analysis of attention dilution or context-length effects; this makes the 5.9x speedup claim and the assertion that the method 'consistently improves' both tasks difficult to evaluate.
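
One common way to attach an uncertainty estimate to a single benchmark score is a percentile bootstrap over per-example results. The sketch below is a generic illustration of that technique, not the paper's evaluation code; the function name and the per-example IoU values are hypothetical.

```python
import random
from statistics import mean
from typing import Sequence, Tuple


def bootstrap_mean_ci(scores: Sequence[float], n_resamples: int = 2000,
                      alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    n = len(scores)
    # Resample with replacement and record the mean of each resample.
    resampled_means = sorted(
        mean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    return resampled_means[lo_idx], resampled_means[hi_idx]


# Hypothetical per-example IoU values, NOT results from the paper.
per_example_iou = [0.93, 0.88, 0.95, 0.71, 0.90, 0.97, 0.64, 0.92, 0.89, 0.94]
low, high = bootstrap_mean_ci(per_example_iou)
print(f"mean IoU = {mean(per_example_iou):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

If the intervals for the 8B and 235B models overlapped heavily, a 0.9-point IoU gap would be weak evidence that one model is more accurate than the other.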

minor comments (2)

The method section should provide a precise description of how mid-quality tokens are generated, ranked, and injected into the language-model context so that the approach can be reproduced.
Latency measurements should state the exact hardware, batch size, and inference framework used; without this the 737 ms and 4320 ms figures cannot be compared across papers.
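
A reproducible latency report usually pairs warm-up runs and repeated timings with a record of the setup. The following is a minimal CPU-side sketch using only the standard library; `measure_latency_ms`, `describe_setup`, and the stand-in workload are hypothetical, and the framework string is a placeholder. Timing real GPU inference would additionally require synchronizing the device before reading the clock.

```python
import platform
import statistics
import time
from typing import Callable, Dict, List


def measure_latency_ms(fn: Callable[[], object], warmup: int = 3,
                       runs: int = 20) -> Dict[str, float]:
    """Time `fn` after warm-up and summarize the wall-clock latency in ms."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    for _ in range(warmup):  # warm-up iterations are not timed
        fn()
    samples: List[float] = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return {
        "mean_ms": statistics.mean(samples),
        "median_ms": statistics.median(samples),
        "stdev_ms": statistics.stdev(samples) if runs > 1 else 0.0,
        "runs": float(runs),
    }


def describe_setup(batch_size: int, framework: str) -> Dict[str, str]:
    """Record the context needed to compare latency numbers across papers."""
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "batch_size": str(batch_size),
        "framework": framework,
    }


# Stand-in workload; a real benchmark would call the model's inference here.
report = measure_latency_ms(lambda: sum(i * i for i in range(10_000)),
                            warmup=2, runs=5)
setup = describe_setup(batch_size=1, framework="placeholder-framework")
print(report, setup)
```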

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to strengthen the core premise with direct encoder comparisons and to add rigorous statistical validation, ablations, and analyses for the reported results. Our point-by-point responses follow.

Referee: [Abstract] Abstract: the premise that visual encoders are 'nearly the same' for small and large VLMs and that the grounding gap is caused only by language-model size is load-bearing yet unsupported; no encoder-feature comparison (spatial alignment quality, embedding similarity on RefCOCO images, or grounding-relevant metrics) is supplied to show that simply increasing mid-quality token count can substitute for the 235B encoder's output without new failure modes.

Authors: We acknowledge that the abstract presents the observation on encoder sizes without accompanying feature-level evidence. In the revision we will add a dedicated analysis (new subsection and appendix) comparing the small and large VLMs' visual encoders on RefCOCO images, including cosine similarity of embeddings, spatial alignment quality metrics, and grounding-relevant feature statistics. This will directly support that visual representations are comparable and that the performance gap is driven by language-model scale. We will also examine and report any new failure modes that arise from emitting more mid-quality tokens, such as changes in attention distribution.

Revision: yes
Referee: [Results] Results (RefCOCO and amodal experiments): the headline numbers (91.4 IoU / 737 ms vs. 90.5 IoU / 4320 ms) are single point estimates with no error bars, no statistical tests, no ablation on token quantity or generation strategy, and no analysis of attention dilution or context-length effects; this makes the 5.9x speedup claim and the assertion that the method 'consistently improves' both tasks difficult to evaluate.

Authors: We agree that single-run point estimates limit the strength of the claims. In the revised version we will rerun the key experiments with multiple random seeds to report means and standard deviations, add ablations on token quantity and generation strategies (including beam search variants), and include attention-map visualizations plus quantitative metrics to assess dilution and context-length effects. These additions will allow a more robust evaluation of the 5.9x speedup and the consistent gains on both RefCOCO and the amodal task.

Revision: yes
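
Reporting means and standard deviations over seeds, as the simulated rebuttal promises, amounts to a small aggregation step. The sketch below assumes one scalar score per seed; the function name and the seed scores are hypothetical and are not results from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return mean(scores), stdev(scores)


# Hypothetical per-seed scores, NOT results from the paper.
seed_scores = {0: 91.2, 1: 91.4, 2: 91.6}
avg, spread = summarize_runs(list(seed_scores.values()))
print(f"IoU over {len(seed_scores)} seeds: {avg:.2f} ± {spread:.2f}")
```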

Circularity Check

0 steps flagged

No circularity; purely empirical claims with no derivations or self-referential reductions

The paper introduces an empirical technique (EGM) for generating additional mid-quality tokens from smaller VLMs to approach the grounding performance of larger models. No equations, parameter fittings, or derivation chains are present that reduce by construction to the paper's own inputs. Performance numbers (e.g., 91.4 IoU at 737ms on RefCOCO) are direct benchmark comparisons against public models. The stated observation that visual-encoder sizes are similar across model scales is presented as a premise for the method rather than a derived result from self-citations or fitted data. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method is presented as an empirical engineering adjustment.

pith-pipeline@v0.9.0 · 5570 in / 1057 out tokens · 29978 ms · 2026-05-16T13:03:22.768094+00:00 · methodology

Reference graph

Works this paper leans on

[1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
[2] Alibaba: Qwen3-VL (2025)
[3] Anthropic: Claude-4.5 (2025)
[4] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
[5] Briscoe, R.E.: Mental imagery and the varieties of amodal perception. Pacific Philosophical Quarterly 92(2), 153–173 (2011)
[6] Chen, K., Ramanan, D., Khurana, T.: Using diffusion priors for video amodal segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 22890–22900 (2025)
[7] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 24185–24198 (2024)
[8] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
[9] Deitke, M., Clark, C., Lee, S., Tripathi, R., Yang, Y., Park, J.S., Salehi, M., Muennighoff, N., Lo, K., Soldaini, L., et al.: Molmo and PixMo: Open weights and open data for state-of-the-art multimodal models. arXiv e-prints pp. arXiv–2409 (2024)
[10] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 herd of models. arXiv e-prints pp. arXiv–2407 (2024)
[11] Evans, T., Parthasarathy, N., Merzic, H., Henaff, O.J.: Data curation via joint example selection further accelerates multimodal learning. Proceedings of the Advances in Neural Information Processing Systems (NeurIPS) (2024)
[12] Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al.: DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948 (2025)
[13] Hsieh, C.Y., Khurana, T., Dave, A., Ramanan, D.: Tracking any object amodally. arXiv preprint arXiv:2312.12433 (2023)
[14] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)
[15] Kaup, B., Ulrich, R., Bausenhart, K.M., Bryce, D., Butz, M.V., Dignath, D., Dudschig, C., Franz, V.H., Friedrich, C., Gawrilow, C., et al.: Modal and amodal cognition: an overarching principle in various domains of psychology. Psychological Research 88(2), 307–337 (2024)
[16] Kazemzadeh, S., Ordonez, V., Matten, M., Berg, T.: ReferItGame: Referring to objects in photographs of natural scenes. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 787–798 (2014)
[17] Khazatsky, A., Pertsch, K., Nair, S., Balakrishna, A., Dasari, S., Karamcheti, S., Nasiriany, S., Srirama, M.K., Chen, L.Y., Ellis, K., Fagan, P.D., Hejna, J., Itkina, M., Lepert, M., Ma, Y.J., Miller, P.T., Wu, J., Belkhale, S., Dass, S., Ha, H., Jain, A., Lee, A., Lee, Y., Memmel, M., Park, S., Radosavovic, I., Wang, K., Zhan, A., Black, K., Chi, C., Ha...
[18] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C.H., Gonzalez, J.E., Zhang, H., Stoica, I.: Efficient memory management for large language model serving with PagedAttention. In: Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles (2023)
[19] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al.: LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326 (2024)
[20] Li, K., Malik, J.: Amodal instance segmentation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 677–693. Springer (2016)
[21] Li, Z., Lavreniuk, M., Shi, J., Bhat, S.F., Wonka, P.: Amodal depth anything: Amodal depth estimation in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 9673–9682 (2025)
[22] Li, Z., Ye, W., Jiang, T., Huang, T.: 2D amodal instance segmentation guided by 3D shape prior. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 165–181. Springer (2022)
[23] Li, Z., Ye, W., Jiang, T., Huang, T.: Gin: Generative invariant shape prior for amodal instance segmentation. IEEE Transactions on Multimedia (MM) (2023)
[24] Li, Z., Ye, W., Terven, J., Bennett, Z., Zheng, Y., Jiang, T., Huang, T.: Muva: A new large-scale benchmark for multi-view amodal instance segmentation in the shopping scenario. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). pp. 23504–23513 (2023)
[25] Lin, J., Yin, H., Ping, W., Molchanov, P., Shoeybi, M., Han, S.: VILA: On pre-training for visual language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 26689–26699 (2024)
[26] Liu, Z., Qiao, L., Chu, X., Ma, L., Jiang, T.: Towards efficient foundation model for zero-shot amodal segmentation. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 20254–20264 (2025)
[27] Liu, Z., Zhu, L., Shi, B., Zhang, Z., Lou, Y., Yang, S., Xi, H., Cao, S., Gu, Y., Li, D., et al.: NVILA: Efficient frontier visual language models. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 4122–4134 (2025)
[28] Lu, R., Chen, Y., Liu, Y., Tang, J., Ni, J., Wan, D., Zeng, G., Huang, S.: Taco: Taming diffusion for in-the-wild video amodal completion. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13638–13650 (2025)
[29] Mao, J., Huang, J., Toshev, A., Camburu, O., Yuille, A.L., Murphy, K.: Generation and comprehension of unambiguous object descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11–20 (2016)
[30] Shao, Z., Wang, P., Zhu, Q., Xu, R., Song, J., Bi, X., Zhang, H., Zhang, M., Li, Y.K., Wu, Y., Guo, D.: DeepSeekMath: Pushing the limits of mathematical reasoning in open language models (2024), https://arxiv.org/abs/2402.03300
[31] Sheng, G., Zhang, C., Ye, Z., Wu, X., Zhang, W., Zhang, R., Peng, Y., Lin, H., Wu, C.: HybridFlow: A flexible and efficient RLHF framework. arXiv preprint arXiv:2409.19256 (2024)
[32] Tan, H., Pan, J., Lin, J., Chen, T., Zheng, Z., Tang, Z., Yang, H.: GTPO and GRPO-S: Token and sequence-level reward shaping with policy entropy (2025), https://arxiv.org/abs/2508.04349
[33] Team, G., Anil, R., Borgeaud, S., Alayrac, J.B., Yu, J., Soricut, R., Schalkwyk, J., Dai, A.M., Hauth, A., Millican, K., et al.: Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805 (2023)
[34] Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
[35] Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
[36] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
[37] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
[38] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
[39] Wu, T., Zheng, C., Guan, F., Vedaldi, A., Cham, T.J.: Amodal3R: Amodal 3D reconstruction from occluded 2D images. arXiv preprint arXiv:2503.13439 (2025)
[40] Xia, Y., Ding, R., Qin, Z., Zhan, G., Zhou, K., Yang, L., Dong, H., Cremers, D.: Targo: benchmarking target-driven object grasping under occlusions. International Journal of Computer Vision (IJCV) (2025)
[41] Xu, K., Zhang, L., Shi, J.: Amodal completion via progressive mixed context diffusion. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
[42] Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.Y., Zhang, Y.Q., Yan, L., Qiao, M., Wu, Y., Wang, M.: Dapo: An open-source l...
[43] Zhan, G., Liu, Y., Han, K., Xie, W., Zisserman, A.: Elip: Enhanced visual-language foundation models for image retrieval. Proceedings of the IEEE International Conference on Content-Based Multimedia Indexing (CBMI) (2025)
[44] Zhan, G., Zheng, C., Xie, W., Zisserman, A.: Amodal ground truth and completion in the wild. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2024)
[45] Zhan, X., Pan, X., Dai, B., Liu, Z., Lin, D., Loy, C.C.: Self-supervised scene de-occlusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3784–3792 (2020)
[46] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025)
[47] Zhu, Y., Tian, Y., Metaxas, D., Dollár, P.: Semantic amodal segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1464–1472 (2017)
[48]

Must uniquely identify this object (not confusable with other objects)

work page
[49]

the second/third/fourth xxx from left/right/top/bottom

Use spatial relationships for multiple similar objects: - Ordinal: "the second/third/fourth xxx from left/right/top/bottom" - Position: "the leftmost/rightmost/topmost/bottommost xxx" - Location: "xxx in the top-left/center/bottom-right" - Relative: "xxx behind/next to/above/below the yyy"

work page
[50]

Describe natural features: position, appearance, color (of the actual object, not the annotation), state

work page
[51]

Use lowercase, natural referring expression style

Good examples:
- "the man in yellow coat"
- "the second teddy bear from left"
- "the rightmost hot dog in the container"
- "woman in coveralls at the back"
- "the white computer screen that is on"
- "top sandwich in the left container"

Bad examples (DO NOT do this):
- "the object marked with green box" ❌
- ...

[52]

Does the description match the object/region inside the TARGET box?

[53]

Is the description specific enough that it couldn't refer to any other object in the image?

[54]

Would someone reading this description be able to locate exactly this TARGET region and no other?

Respond with ONLY ONE WORD:
- "YES" if the description uniquely and accurately identifies the TARGET region
- "NO" if the description is ambiguous, incorrect, or could refer to multiple objects

Your response (YES or NO):

Fig. 10: Prompt for amodal grounding pr...

[55]

How to locate this object

[56]

What object(s) are causing the occlusion

[57]

What does this object's complete unoccluded shape look like?

[58]

Therefore, to recover the complete object, explain in which direction(s) ({directions_str}) and how much (using terms: "slightly", "a bit", "moderately", "considerably", or "significantly") should be extended from the visible part

Be extremely concise. Maximum 5 sentences.

DO NOT mention: specific pixel values, image boundaries, red/green boxes, bounding ...

[59]

sofa against the wall

Second: girl with white cap, green skirt, holding racket.
3. Third from right: the girl in white top, green skirt, between the cap girl and the next.

Let’s check the image. Looking at the image, let’s count from the right:
- First (rightmost): [838, 129, 993, 999]
- Second: [742, 299, 922, 999]
- Third from right: [635, 711, 771, 999].

Yes, that’s the ‘th...
