EGM: Efficient Visual Grounding Language Models
Pith reviewed 2026-05-16 13:03 UTC · model grok-4.3
The pith
Small visual language models can match large VLMs on visual grounding by generating many mid-quality tokens instead of few high-quality ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EGM shows that small VLMs can reach or exceed the grounding accuracy of far larger models by increasing the quantity of mid-quality tokens they generate, delivering equivalent or better IoU scores with substantially lower end-to-end latency.
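For readers unfamiliar with the metric, the sketch below shows how IoU (intersection over union) is commonly computed for a pair of axis-aligned boxes. The `(x1, y1, x2, y2)` corner format and the name `box_iou` are assumptions made for illustration; the paper's exact evaluation protocol is not reproduced on this page.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes give zero intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate (zero-area) inputs are reported as no overlap.
    return inter / union if union > 0 else 0.0
```

A reported score such as 91.4 would then be an aggregate over a benchmark on a 0 to 100 scale, while this function returns a value between 0 and 1 for a single prediction.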
What carries the argument
EGM (Efficient visual Grounding language Models), a token-generation strategy that produces many mid-quality tokens from a small VLM to compensate for limited language-model capacity.
If this is right
- Small VLMs become practical for real-time grounding on edge devices.
- The same token-volume method improves both standard and amodal grounding accuracy.
- End-to-end inference becomes 5.9 times faster while maintaining or exceeding large-model IoU.
- Deployment cost drops because the visual encoder stays small and only token count rises.
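One way to see why more tokens from a small model can still be cheaper is a simple autoregressive cost model, in which end-to-end latency is roughly a fixed prefill cost plus the number of generated tokens times the per-token decode cost. The sketch below encodes only that assumption. The function name and every number in it are hypothetical placeholders, not measurements from the paper.

```python
def decode_latency_ms(prefill_ms, n_tokens, per_token_ms):
    """Rough latency model: fixed prefill cost plus a linear decode cost."""
    return prefill_ms + n_tokens * per_token_ms


# Hypothetical placeholder values, chosen only to illustrate the trade-off.
small_many_tokens = decode_latency_ms(prefill_ms=100.0, n_tokens=200, per_token_ms=3.0)
large_few_tokens = decode_latency_ms(prefill_ms=900.0, n_tokens=20, per_token_ms=150.0)
```

Under this model the small model wins whenever its per-token cost is low enough to offset the larger token count, which is the regime the bullets above presuppose.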
Where Pith is reading between the lines
- Token quantity may act as a scalable substitute for raw model size in other perception-language tasks.
- Hardware that processes larger token batches efficiently could widen the advantage of this approach.
- The technique might reduce the need for models in the 235B-parameter class in grounding-heavy applications.
Load-bearing premise
The performance difference between small and large VLMs is caused mainly by language-model size, and simply increasing the count of mid-quality tokens closes the gap without introducing new failure modes.
What would settle it
A direct measurement on RefCOCO showing that the 8B EGM model either falls short of 91.4 IoU or fails to keep total latency under 800 ms when the number of generated mid-quality tokens is increased.
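The test above can be written down as a single predicate. This is a minimal sketch: the thresholds 91.4 and 800 ms come from the sentence above, while the function and parameter names are hypothetical.

```python
def egm_claim_holds(iou, latency_ms, iou_target=91.4, latency_budget_ms=800.0):
    """True if a measured run meets both the accuracy and the latency bar."""
    # The claim fails if IoU falls short OR latency is not under budget.
    return iou >= iou_target and latency_ms < latency_budget_ms
```

With the figures reported in the abstract, `egm_claim_holds(91.4, 737.0)` returns `True`; an independent measurement is what would actually settle the question.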
Original abstract
Visual grounding is an essential capability of Visual Language Models (VLMs) to understand the real physical world. Previous state-of-the-art grounding visual language models usually have large model sizes, making them heavy for deployment and slow for inference. However, we notice that the sizes of visual encoders are nearly the same for small and large VLMs and the major difference is the sizes of the language models. Small VLMs fall behind larger VLMs in grounding because of the difference in language understanding capability rather than visual information handling. To mitigate the gap, we introduce 'Efficient visual Grounding language Models' (EGM): generate many mid-quality tokens (from small models) to match the performance of large VLMs with few high-quality but expensive tokens. This method is deployment-friendly, and yields better end-to-end latency: On the RefCOCO benchmark, our EGM-Qwen3-VL-8B demonstrates 91.4 IoU with an average of 737ms (5.9x faster) latency while Qwen3-VL-235B demands 4,320ms to reach 90.5 IoU. To validate our approach's generality, we further set up a new amodal grounding setting that requires the model to predict both the visible and occluded parts of the objects. Experiments show our method consistently improves both vanilla and amodal grounding capabilities of small models to match or outperform larger models, thereby improving efficiency for visual grounding.
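The headline ratio can be checked directly from the two latencies quoted in the abstract. The snippet below is only that arithmetic; the variable names are illustrative.

```python
# Figures quoted in the abstract (RefCOCO).
egm_latency_ms = 737.0      # EGM-Qwen3-VL-8B
large_latency_ms = 4320.0   # Qwen3-VL-235B
egm_iou = 91.4
large_iou = 90.5

speedup = large_latency_ms / egm_latency_ms   # about 5.86, reported as 5.9x
iou_gain = egm_iou - large_iou                # about 0.9 IoU points
```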
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the performance gap between small and large VLMs on visual grounding stems mainly from language-model scale rather than visual-encoder differences, and introduces EGM to close this gap by emitting many mid-quality tokens from small models instead of few high-quality tokens from large models. On RefCOCO, EGM-Qwen3-VL-8B is reported to reach 91.4 IoU at 737 ms (5.9x faster) versus 90.5 IoU at 4320 ms for Qwen3-VL-235B; similar gains are shown on a newly introduced amodal grounding task that requires predicting both visible and occluded object parts.
Significance. If the empirical claims hold after proper validation, the work would offer a practical route to high-accuracy visual grounding in latency-sensitive and resource-constrained settings without scaling the entire VLM. The introduction of the amodal benchmark is a modest but useful addition for testing robustness to occlusion.
Major comments (2)
- [Abstract] Abstract: the premise that visual encoders are 'nearly the same' for small and large VLMs and that the grounding gap is caused only by language-model size is load-bearing yet unsupported; no encoder-feature comparison (spatial alignment quality, embedding similarity on RefCOCO images, or grounding-relevant metrics) is supplied to show that simply increasing mid-quality token count can substitute for the 235B encoder's output without new failure modes.
- [Results] Results (RefCOCO and amodal experiments): the headline numbers (91.4 IoU / 737 ms vs. 90.5 IoU / 4320 ms) are single point estimates with no error bars, no statistical tests, no ablation on token quantity or generation strategy, and no analysis of attention dilution or context-length effects; this makes the 5.9x speedup claim and the assertion that the method 'consistently improves' both tasks difficult to evaluate.
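The error bars requested in the second major comment could be produced with something as simple as the sketch below, which summarizes a metric over several seeded runs. The function name is hypothetical, no run data from the paper is assumed, and the standard error uses the plain sample formula.

```python
import math
import statistics


def summarize_runs(scores):
    """Return (mean, sample std, standard error) for per-seed scores."""
    mean = statistics.mean(scores)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    sem = std / math.sqrt(len(scores))
    return mean, std, sem
```

With only a handful of seeds the standard error is itself noisy, so reporting the individual runs alongside the summary would be the more cautious choice.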
Minor comments (2)
- The method section should provide a precise description of how mid-quality tokens are generated, ranked, and injected into the language-model context so that the approach can be reproduced.
- Latency measurements should state the exact hardware, batch size, and inference framework used; without this the 737 ms and 4320 ms figures cannot be compared across papers.
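A reproducible latency figure usually comes from a harness with warm-up runs and repeated timing, along the lines of the sketch below. This is an assumption about good practice, not the paper's measurement code; `time_call_ms` and its parameters are hypothetical names. For GPU inference, the callable would also need to wait for the device to finish before returning, since kernels are typically launched asynchronously.

```python
import statistics
import time


def time_call_ms(fn, warmup=3, repeats=10):
    """Time a zero-argument callable; return (mean_ms, median_ms)."""
    # Warm-up calls absorb one-off costs such as caching and lazy initialization.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples), statistics.median(samples)
```

Reporting the hardware, batch size, and inference framework next to these numbers is what makes them comparable across papers.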
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We will revise the manuscript to strengthen the core premise with direct encoder comparisons and to add rigorous statistical validation, ablations, and analyses for the reported results. Our point-by-point responses follow.
- Referee: [Abstract] Abstract: the premise that visual encoders are 'nearly the same' for small and large VLMs and that the grounding gap is caused only by language-model size is load-bearing yet unsupported; no encoder-feature comparison (spatial alignment quality, embedding similarity on RefCOCO images, or grounding-relevant metrics) is supplied to show that simply increasing mid-quality token count can substitute for the 235B encoder's output without new failure modes.
  Authors: We acknowledge that the abstract presents the observation on encoder sizes without accompanying feature-level evidence. In the revision we will add a dedicated analysis (new subsection and appendix) comparing the small and large VLMs' visual encoders on RefCOCO images, including cosine similarity of embeddings, spatial alignment quality metrics, and grounding-relevant feature statistics. This will directly support that visual representations are comparable and that the performance gap is driven by language-model scale. We will also examine and report any new failure modes that arise from emitting more mid-quality tokens, such as changes in attention distribution.
  revision: yes
- Referee: [Results] Results (RefCOCO and amodal experiments): the headline numbers (91.4 IoU / 737 ms vs. 90.5 IoU / 4320 ms) are single point estimates with no error bars, no statistical tests, no ablation on token quantity or generation strategy, and no analysis of attention dilution or context-length effects; this makes the 5.9x speedup claim and the assertion that the method 'consistently improves' both tasks difficult to evaluate.
  Authors: We agree that single-run point estimates limit the strength of the claims. In the revised version we will rerun the key experiments with multiple random seeds to report means and standard deviations, add ablations on token quantity and generation strategies (including beam search variants), and include attention-map visualizations plus quantitative metrics to assess dilution and context-length effects. These additions will allow a more robust evaluation of the 5.9x speedup and the consistent gains on both RefCOCO and the amodal task.
  revision: yes
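The embedding comparison promised in the first response could start from a plain cosine similarity between feature vectors of the same image taken from the two encoders. The sketch below assumes both vectors have the same dimension, which may require a projection step in practice, and the function name is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors given as sequences of floats."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)
```

High similarity alone would not establish that the encoders are interchangeable for grounding, which is why the referee also asks for spatial alignment and grounding-relevant metrics.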
Circularity Check
No circularity; purely empirical claims with no derivations or self-referential reductions
The paper introduces an empirical technique (EGM) for generating additional mid-quality tokens from smaller VLMs to approach the grounding performance of larger models. No equations, parameter fittings, or derivation chains are present that reduce by construction to the paper's own inputs. Performance numbers (e.g., 91.4 IoU at 737ms on RefCOCO) are direct benchmark comparisons against public models. The stated observation that visual-encoder sizes are similar across model scales is presented as a premise for the method rather than a derived result from self-citations or fitted data. This matches the default case of a self-contained empirical contribution with no load-bearing circular steps.