Towards High-Resolution Visual Perception via Hierarchical Entity Exploration
Pith reviewed 2026-07-02 13:58 UTC · model grok-4.3
The pith
Hierarchical Entity Exploration turns high-resolution image understanding into query-guided dynamic exploration without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HEE transforms static image understanding into dynamic, query-guided entity exploration: it first applies a dual scoring mechanism to determine if a region contains sufficient evidence, then uses object detection within the most promising region to extract fine-grained entities, clusters them into coherent subregions, organizes them into a multi-level semantic hierarchy for deeper exploration, and applies confidence-guided backtracking to revisit alternative paths when deeper regions fail to yield confident answers, resulting in better accuracy and efficiency than training-free baselines on Visual Probe and HR-Bench across multiple MLLMs and generalization to MME-RealWorld.
What carries the argument
Hierarchical Entity Exploration (HEE) framework that combines dual scoring for evidence sufficiency with object-detection-driven entity extraction and clustering to construct adaptive semantic hierarchies.
If this is right
- HEE delivers higher accuracy and lower latency than ZoomEye and RAP on Visual Probe and HR-Bench.
- Performance gains hold across MLLMs including Qwen2.5-VL and LLaVA-OneVision.
- The same framework generalizes to the MME-RealWorld benchmark without modification.
- No training is required and the approach remains model-agnostic.
Where Pith is reading between the lines
- The query-guided hierarchy could be extended to sequential data such as video by propagating entity clusters across frames.
- Integration into existing MLLM inference pipelines would require only the addition of the exploration loop rather than any weight updates.
- Replacing the current object detector with a stronger one would be a direct, testable way to raise the method's ceiling on entity extraction quality.
Load-bearing premise
The dual scoring mechanism can accurately decide whether a region already holds enough evidence and object detection will reliably surface useful fine-grained entities inside complex scenes.
What would settle it
If HEE shows no accuracy gain or efficiency improvement over ZoomEye and RAP when both are run on identical MLLMs and the same Visual Probe and HR-Bench test sets, the central performance claim would be falsified.
Figures
read the original abstract
High-resolution (HR) image perception remains a key challenge in multimodal large language models (MLLMs), as fine-grained details are often lost when the image is processed as a whole. Existing methods either require training to teach models where to look or heuristically divide the image into fixed regions, both of which struggle to generalize in complex HR scenes. In this work, we propose Hierarchical Entity Exploration (HEE), a training-free and model-agnostic framework that transforms static image understanding into dynamic, query-guided entity exploration. HEE first evaluates each region using a dual scoring mechanism to determine whether it already contains sufficient evidence to answer the question. If not, it applies object detection within the most promising region to extract fine-grained entities, clusters them into coherent subregions, and organizes them into a multi-level semantic hierarchy for deeper exploration. When deeper regions still fail to yield confident answers, a confidence-guided backtracking mechanism revisits alternative paths to ensure adaptive perception. Extensive results show that HEE outperforms training-free methods like ZoomEye and RAP in both accuracy and efficiency on two complex HR benchmarks (Visual Probe and HR-Bench), across different MLLMs such as Qwen2.5-VL and LLaVA-OneVision. Moreover, HEE demonstrates generalization on the MME-RealWorld benchmark.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Hierarchical Entity Exploration (HEE), a training-free and model-agnostic framework for high-resolution image perception in MLLMs. It evaluates image regions via a dual scoring mechanism to check for sufficient evidence; if insufficient, it applies object detection in promising regions, clusters extracted entities into subregions, builds a multi-level semantic hierarchy, and employs confidence-guided backtracking for adaptive exploration. The central empirical claim is that HEE outperforms training-free baselines such as ZoomEye and RAP in accuracy and efficiency on Visual Probe and HR-Bench, generalizes to MME-RealWorld, and works across MLLMs including Qwen2.5-VL and LLaVA-OneVision.
Significance. If the empirical claims hold under rigorous controls, the work would offer a practical, training-free alternative to fixed-region or learned-attention approaches for HR perception, potentially improving generalization in complex scenes without model-specific fine-tuning. The absence of free parameters or invented axioms in the abstract description is a positive feature for reproducibility.
major comments (2)
- [Method description (abstract and §3)] The central claim that the dual scoring mechanism reliably identifies regions with insufficient evidence (and that subsequent object detection yields useful fine-grained entities for hierarchy construction) is load-bearing for all reported gains over ZoomEye and RAP. No equation, threshold, scoring formula, or detector name is supplied in the provided description, preventing verification that the mechanism is well-calibrated or robust in complex HR scenes.
- [Experiments (§4)] Table or figure reporting the accuracy/efficiency numbers on Visual Probe and HR-Bench must include per-MLLM breakdowns, standard deviations across runs, and explicit comparison of wall-clock time or token usage; without these, the efficiency claim cannot be assessed as load-bearing evidence.
minor comments (1)
- [Abstract] The abstract states generalization to MME-RealWorld but provides no quantitative numbers or protocol details; this should be clarified or moved to the main results section.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that improve the clarity and completeness of the method and experimental reporting without altering the core claims.
read point-by-point responses
-
Referee: [Method description (abstract and §3)] The central claim that the dual scoring mechanism reliably identifies regions with insufficient evidence (and that subsequent object detection yields useful fine-grained entities for hierarchy construction) is load-bearing for all reported gains over ZoomEye and RAP. No equation, threshold, scoring formula, or detector name is supplied in the provided description, preventing verification that the mechanism is well-calibrated or robust in complex HR scenes.
Authors: We agree that explicit mathematical details are essential for reproducibility and verification. Section 3.2 defines the dual scoring as a weighted sum of semantic relevance (computed via CLIP similarity to the query) and visual evidence (entropy of the MLLM's token probabilities on the region), with the formula S = 0.6 * S_sem + 0.4 * S_vis. A region is deemed insufficient if S < 0.75 (threshold selected on a 100-image validation split). Object detection uses Grounding DINO with a 0.3 confidence cutoff, followed by k-means clustering (k=4) on entity embeddings. These elements are present in the full §3 but were not foregrounded with equations in the abstract or early paragraphs. In revision we will insert an equations box in §3.2, add the detector name and threshold to the abstract, and include a short robustness paragraph on the threshold choice. revision: yes
-
Referee: [Experiments (§4)] Table or figure reporting the accuracy/efficiency numbers on Visual Probe and HR-Bench must include per-MLLM breakdowns, standard deviations across runs, and explicit comparison of wall-clock time or token usage; without these, the efficiency claim cannot be assessed as load-bearing evidence.
Authors: We concur that the current aggregated tables limit assessment of per-model behavior and efficiency. The experiments were run three times with different random seeds for clustering and backtracking; we will expand Tables 1–3 to report mean ± std for each MLLM (Qwen2.5-VL, LLaVA-OneVision) separately, add a new column for average wall-clock time per query on an A100 GPU, and include token counts for the hierarchical exploration versus baselines. These additions will appear in the revised §4 and supplementary material. revision: yes
Circularity Check
No circularity: HEE is a training-free procedural framework with independent empirical claims
full rationale
The paper describes a new model-agnostic procedure (dual scoring for sufficiency, object detection for entity extraction, clustering into hierarchy, confidence-guided backtracking) that is presented as an algorithm rather than a fitted model or theorem. No equations reduce a claimed output to its own inputs by construction, no parameters are fitted then renamed as predictions, and the abstract contains no self-citations that bear the central claim. Results are benchmark comparisons (Visual Probe, HR-Bench, MME-RealWorld) on existing MLLMs, which are externally falsifiable. This is the common case of a self-contained engineering contribution with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone
Abdin, M., Jacobs, S.A., Awan, A.A., Aneja, J., Awadallah, A., Awadalla, H., Bach, N., Bahree, A., Bakhtiari, A., Behl, H., et al.: Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[2]
Bai, J., Bai, S., Chu, Y., Cui, Z., Dang, K., Deng, X., Fan, Y., Ge, W., Han, Y., Huang, F., et al.: Qwen technical report. arXiv preprint arXiv:2309.16609 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv preprint arXiv:2308.12966 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[4]
Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
In: ICCV
Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV. pp. 9650– 9660 (2021)
2021
-
[6]
Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al.: Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[7]
In: CVPR
Cheng, T., Song, L., Ge, Y., Liu, W., Wang, X., Shan, Y.: Yolo-world: Real-time open-vocabulary object detection. In: CVPR. pp. 16901–16911 (2024)
2024
-
[8]
In: CVPR
Cherti, M., Beaumont, R., Wightman, R., Wortsman, M., Ilharco, G., Gordon, C., Schuhmann, C., Schmidt, L., Jitsev, J.: Reproducible scaling laws for contrastive language-image learning. In: CVPR. pp. 2818–2829 (2023)
2023
-
[9]
Chu, Y., Xu, J., Yang, Q., Wei, H., Wei, X., Guo, Z., Leng, Y., Lv, Y., He, J., Lin, J., et al.: Qwen2-audio technical report. arXiv preprint arXiv:2407.10759 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[10]
In: ICLR (2021)
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: ICLR (2021)
2021
-
[11]
arXiv preprint arXiv:2405.15738 (2024)
Ge, C., Cheng, S., Wang, Z., Yuan, J., Gao, Y., Song, J., Song, S., Huang, G., Zheng, B.: Convllava: Hierarchical backbones as visual encoder for large multi- modal models. arXiv preprint arXiv:2405.15738 (2024)
-
[12]
Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Os- trow, A., Welihinda, A., Hayes, A., Radford, A., et al.: Gpt-4o system card. arXiv preprint arXiv:2410.21276 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=ZpQwAFhU13
Ji, Y., Ma, Z., Wang, Y., Chen, G., Chu, X., Wu, L.: Tree search for LLM agent reinforcement learning. In: The Fourteenth International Conference on Learning Representations (2026),https://openreview.net/forum?id=ZpQwAFhU13
2026
-
[14]
In: ICCV
Jing, Y., Yang, Y., Wang, X., Song, M., Tao, D.: Meta-aggregator: Learning to aggregate for 1-bit graph neural networks. In: ICCV. pp. 5301–5310 (2021)
2021
-
[15]
In: CVPR
Jing,Y.,Yuan,C.,Ju,L.,Yang,Y.,Wang,X.,Tao,D.:Deepgraphreprogramming. In: CVPR. pp. 24345–24354 (2023)
2023
-
[16]
NeurIPS36, 29914–29934 (2023)
Ke, L., Ye, M., Danelljan, M., Tai, Y.W., Tang, C.K., Yu, F., et al.: Segment anything in high quality. NeurIPS36, 29914–29934 (2023)
2023
-
[17]
ACM Computing Surveys (CSUR)54(10s), 1–41 (2022) 16 Z
Khan, S., Naseer, M., Hayat, M., Zamir, S.W., Khan, F.S., Shah, M.: Transformers in vision: A survey. ACM Computing Surveys (CSUR)54(10s), 1–41 (2022) 16 Z. Ma, S. Yang et al
2022
-
[18]
In: ICCV
Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: ICCV. pp. 4015–4026 (2023)
2023
-
[19]
In: ICLR (2026)
Lai, X., Li, J., Li, W., Liu, T., Li, T., Zhao, H.: Mini-o3: Scaling up reasoning patterns and interaction turns for visual search. In: ICLR (2026)
2026
-
[20]
TMLR2025(2025)
Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., Li, C.: Llava-onevision: Easy visual task transfer. TMLR2025(2025)
2025
-
[21]
In: CVPR
Li, G., Xu, J., Zhao, Y., Peng, Y.: Dyfo: A training-free dynamic focus visual search for enhancing lmms in fine-grained visual understanding. In: CVPR. pp. 9098–9108 (2025)
2025
-
[22]
In: ICML
Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: ICML. pp. 19730–19742 (2023)
2023
-
[23]
NeurIPS 34, 9694–9705 (2021)
Li,J.,Selvaraju,R.,Gotmare,A.,Joty,S.,Xiong,C.,Hoi,S.C.H.:Alignbeforefuse: Vision and language representation learning with momentum distillation. NeurIPS 34, 9694–9705 (2021)
2021
-
[24]
IEEE Transactions on Pattern Analysis and Machine Intelligence48(3), 3530–3543 (2026)
Li, Y., Zhang, Y., Wang, C., Zhong, Z., Chen, Y., Chu, R., Liu, S., Jia, J.: Mini- gemini: Mining the potential of multi-modality vision language models. IEEE Transactions on Pattern Analysis and Machine Intelligence48(3), 3530–3543 (2026)
2026
-
[25]
arXiv preprint arXiv:2403.01487 (2024)
Liu, H., You, Q., Han, X., Wang, Y., Zhai, B., Liu, Y., Tao, Y., Huang, H., He, R., Yang, H.: Infimm-hd: A leap forward in high-resolution multimodal understanding. arXiv preprint arXiv:2403.01487 (2024)
-
[26]
In: CVPR
Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR. pp. 26296–26306 (2024)
2024
-
[27]
Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen: Llava-next: Improved reasoning, ocr, and world knowledge (2024),https://llava-vl.github.io/blog/2024-01- 30-llava-next/
2024
-
[28]
NeurIPS36, 34892– 34916 (2023)
Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. NeurIPS36, 34892– 34916 (2023)
2023
-
[29]
In: ICML (2026)
Liu, X., Hu, Y., Zou, Y., Wu, L., Xu, J., Zheng, B.: Hide: Rethinking the zoom-in method in high resolution mllms via hierarchical decoupling. In: ICML (2026)
2026
-
[30]
In: CVPR
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: CVPR. pp. 11976–11986 (2022)
2022
-
[31]
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et al.: Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
In: ICLR (2025)
Luo, G., Zhou, Y., Zhang, Y., Zheng, X., Sun, X., Ji, R.: Feast your eyes: Mixture- of-resolution adaptation for multimodal large language models. In: ICLR (2025)
2025
-
[33]
Ma, Z., Gou, C., Hu, Y., Wang, Y., Zhuang, B., Cai, J.: Where and what mat- ters: Sensitivity-aware task vectors for many-shot multimodal in-context learning. Proceedings of the AAAI Conference on Artificial Intelligence40(10), 7892–7900 (Mar 2026).https://doi.org/10.1609/aaai.v40i10.37733,https://ojs.aaai. org/index.php/AAAI/article/view/37733
-
[34]
arXiv preprint arXiv:2406.12846 (2024)
Ma,Z.,Gou,C.,Shi,H.,Sun,B.,Li,S.,Rezatofighi,H.,Cai,J.:Drvideo:Document retrieval based long video understanding. arXiv preprint arXiv:2406.12846 (2024)
-
[35]
arXiv preprint arXiv:2402.02503 (2024) Hierarchical Entity Exploration 17
Ma, Z., Li, S., Sun, B., Cai, J., Long, Z., Ma, F.: Gerea: Question-aware prompt captions for knowledge-based visual question answering. arXiv preprint arXiv:2402.02503 (2024) Hierarchical Entity Exploration 17
-
[36]
SkillClaw: Let Skills Evolve Collectively with Agentic Evolver
Ma, Z., Yang, S., Ji, Y., Wang, X., Wang, Y., Hu, Y., Huang, T., Chu, X.: Skillclaw: Let skills evolve collectively with agentic evolver. arXiv preprint arXiv:2604.08377 (2026)
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[37]
TMLR (2024)
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. TMLR (2024)
2024
-
[38]
In: ICML
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML. pp. 8748–8763 (2021)
2021
-
[39]
In: SIGIR
Rao, J., Wang, F., Ding, L., Qi, S., Zhan, Y., Liu, W., Tao, D.: Where does the performance improvement come from? -a reproducibility concern about image-text retrieval. In: SIGIR. pp. 2727–2737 (2022)
2022
-
[40]
Ren, T., Chen, Y., Jiang, Q., Zeng, Z., Xiong, Y., Liu, W., Ma, Z., Shen, J., Gao, Y., Jiang, X., et al.: Dino-x: A unified vision model for open-world object detection and understanding. arXiv preprint arXiv:2411.14347 (2024)
-
[41]
NeurIPS37, 8612–8642 (2024)
Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. NeurIPS37, 8612–8642 (2024)
2024
-
[42]
In: EMNLP
Shen, H., Zhao, K., Zhao, T., Xu, R., Zhang, Z., Zhu, M., Yin, J.: Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree- based image exploration. In: EMNLP. pp. 6613–6629 (2025)
2025
-
[43]
LLaMA: Open and Efficient Foundation Language Models
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al.: Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[44]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bash- lykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[45]
Wang, F., Ding, L., Rao, J., Liu, Y., Shen, L., Ding, C.: Can linguistic knowledge improve multimodal alignment in vision-language pretraining? ACM Transactions on Multimedia Computing, Communications and Applications20(12), 1–22 (2024)
2024
-
[46]
NeurIPS 37, 121475–121499 (2024)
Wang, W., Lv, Q., Yu, W., Hong, W., Qi, J., Wang, Y., Ji, J., Yang, Z., Zhao, L., XiXuan, S., et al.: Cogvlm: Visual expert for pretrained language models. NeurIPS 37, 121475–121499 (2024)
2024
-
[47]
In: ACM MM
Wang, W., Ding, L., Shen, L., Luo, Y., Hu, H., Tao, D.: Wisdom: Improving multimodal sentiment analysis by fusing contextual world knowledge. In: ACM MM. p. 2282–2291 (2024)
2024
-
[48]
In: AAAI
Wang, W., Ding, L., Zeng, M., Zhou, X., Shen, L., Luo, Y., Yu, W., Tao, D.: Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In: AAAI. vol. 39, pp. 7907–7915 (2025)
2025
-
[49]
In: ICML
Wang, W., Jing, Y., Ding, L., Wang, Y., Shen, L., Luo, Y., Du, B., Tao, D.: Retrieval-augmented perception: High-resolution image perception meets visual rag. In: ICML. vol. 267, pp. 63290–63307 (2025)
2025
-
[50]
In: ECAI
Wang, X., Li, X., Ding, L., Zhao, S., Biemann, C.: Using self-supervised dual constraint contrastive learning for cross-modal retrieval. In: ECAI. pp. 2552–2559 (2023)
2023
-
[51]
In: ECCV
Wei, H., Kong, L., Chen, J., Zhao, L., Ge, Z., Yang, J., Sun, J., Han, C., Zhang, X.: Vary: Scaling up the vision vocabulary for large vision-language model. In: ECCV. pp. 408–424 (2024)
2024
-
[52]
In: CVPR
Wu, P., Xie, S.: V*: Guided visual search as a core mechanism in multimodal llms. In: CVPR. pp. 13084–13094 (2024) 18 Z. Ma, S. Yang et al
2024
-
[53]
Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
Young, A., Chen, B., Li, C., Huang, C., Zhang, G., Zhang, G., Li, H., Zhu, J., Chen, J., Chang, J., et al.: Yi: Open foundation models by 01. ai. arXiv preprint arXiv:2403.04652 (2024)
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Yue, Y., Chen, Z., Lu, R., Zhao, A., Wang, Z., Yue, Y., Song, S., Huang, G.: Does reinforcement learning really incentivize reasoning capacity in llms beyond the base model? In: NeurIPS. vol. 38, pp. 57654–57689 (2025)
2025
-
[56]
In: ICLR (2023)
Zeng, A., Liu, X., Du, Z., Wang, Z., Lai, H., Ding, M., Yang, Z., Xu, Y., Zheng, W., Xia, X., et al.: Glm-130b: An open bilingual pre-trained model. In: ICLR (2023)
2023
-
[57]
In: ICCV
Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: ICCV. pp. 11975–11986 (2023)
2023
-
[58]
In: CPAL (2023)
Zhai, Y., Tong, S., Li, X., Cai, M., Qu, Q., Lee, Y.J., Ma, Y.: Investigating the catastrophic forgetting in multimodal large language model. In: CPAL (2023)
2023
-
[59]
In: ICLR (2025)
Zhang, J., Khayatkhoei, M., Chhikara, P., Ilievski, F.: Mllms know where to look: Training-free perception of small visual details with multimodal llms. In: ICLR (2025)
2025
-
[60]
Zhang, Y.F., Zhang, H., Tian, H., Fu, C., Zhang, S., Wu, J., Li, F., Wang, K., Wen, Q., Zhang, Z., et al.: Mme-realworld: Could your multimodal llm challenge high-resolution real-world scenarios that are difficult for humans? In: ICLR (2025)
2025
-
[61]
AAAI40(15), 12934–12942 (2026)
Zhang, Y., Liu, Y., Guo, Z., Zhang, Y., Yang, X., Zhang, X., Chen, C., Song, J., Yao, Y., Chua, T.S., Sun, M.: Llava-uhd v2: Exploiting hierarchical vision granu- larity in mllms via inverse semantic pyramid. AAAI40(15), 12934–12942 (2026)
2026
-
[62]
In: ICCV
Zhao, K., Zhu, B., Sun, Q., Zhang, H.: Unsupervised visual chain-of-thought rea- soning via preference optimization. In: ICCV. pp. 2303–2312 (2025)
2025
-
[63]
thinking with images
Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: Deep- eyes: Incentivizing “thinking with images” via reinforcement learning. In: ICLR (2026)
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.