CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

Bin Chen; Haoqian Kang; Jinpeng Wang; Ke Chen; Liupeng Li; Yaowei Wang; Zhenyu Lu

CVSearch enables multimodal LLMs to perceive high-resolution images by adaptively switching between expert-assisted search and semantic scanning.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.


T0 review · grok-4.3

2026-05-25 04:45 UTC pith:DBL3B7VQ

load-bearing objection CVSearch offers a training-free Assess-then-Search workflow with adaptive patching and bottom-up exploration to cut compute on high-res MLLM inputs, but the advance is mostly in the specific integration rather than new primitives.

arXiv 2605.23655 v1 · pith:DBL3B7VQ · submitted 2026-05-22 · cs.CV, cs.AI, cs.LG, cs.MM


Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang

classification cs.CV, cs.AI, cs.LG, cs.MM

keywords high-resolution image perception, multimodal large language models, visual search, adaptive patching, cognitive visual search, Assess-then-Search, semantic guided patching

verification ladder T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

High-resolution images challenge multimodal large language models because existing visual search methods either miss key details or waste computation on redundant scans. CVSearch introduces a training-free Assess-then-Search workflow that first tries efficient expert-assisted search and only falls back to a new semantic-aware scanning method if that fails. The scanning uses Semantic Guided Adaptive Patching to keep objects whole and Dynamic Bottom-Up Search guided by visual complexity to focus effort where needed. If successful, this approach delivers higher accuracy on high-resolution benchmarks while cutting search time compared to prior methods.

Core claim

CVSearch is a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow: it invokes expert-assisted search when global information is insufficient, and triggers semantic-aware scanning with Semantic Guided Adaptive Patching and Dynamic Bottom-Up Search only upon failure, achieving state-of-the-art accuracy and improved efficiency on HR benchmarks.

What carries the argument

The Assess-then-Search workflow that combines expert-assisted search with semantic-aware scanning triggered on failure, using Semantic Guided Adaptive Patching to avoid object fragmentation and Dynamic Bottom-Up Search driven by a Visual Complexity prior.
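The control flow described here can be sketched in a few lines of Python. This is a minimal illustration built only from the workflow as summarized on this page (a sufficiency check against a threshold, expert proposals, fallback scanning, then a zoom into the best candidate); it is not the authors' implementation. The callables `confidence`, `expert_propose`, and `scan`, the `Box` format, the default `tau_q=0.7`, and the `max_rounds` cap are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # hypothetical region format: (x0, y0, x1, y1)


@dataclass
class SearchResult:
    mode: str                # "direct", "expert", "scan", or "exhausted"
    evidence: Optional[Box]  # region handed to the MLLM; None for direct answers
    rounds: int              # number of search rounds used


def assess_then_search(
    region: Box,
    confidence: Callable[[Box], float],                # assumed stand-in for c_q
    expert_propose: Callable[[Box], List[Box]],        # assumed stand-in for B_e
    scan: Callable[[Box], Tuple[Optional[Box], Box]],  # (evidence, best candidate)
    tau_q: float = 0.7,   # assumed default, not taken from the paper
    max_rounds: int = 3,  # assumed cap on iterative zooming
) -> SearchResult:
    # Assess: answer directly when global information is judged sufficient.
    if confidence(region) >= tau_q:
        return SearchResult("direct", None, 0)
    for round_idx in range(1, max_rounds + 1):
        # Try the cheaper expert-assisted search first.
        proposals = expert_propose(region)
        if proposals:
            return SearchResult("expert", proposals[0], round_idx)
        # The expert returned no proposals, so fall back to scanning.
        evidence, best_candidate = scan(region)
        if evidence is not None:
            return SearchResult("scan", evidence, round_idx)
        # Scanning also failed: zoom into the best candidate and iterate.
        region = best_candidate
    return SearchResult("exhausted", region, max_rounds)
```

Passing the three stages in as callables keeps the scheduling logic separate from any particular expert model or scanning method, which is the part of the design the review identifies as the contribution.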

Load-bearing premise

The Assess-then-Search workflow correctly identifies when global information is insufficient and that failure of expert-assisted search reliably triggers the semantic scanning without introducing new blind spots or excessive overhead.

What would settle it

A benchmark where expert-assisted search proposals miss critical objects but the subsequent semantic scanning also fails to recover them at higher cost than a full grid scan would have required.
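One way to make this falsifier operational is to count, over per-sample logs, how often both search stages miss while the pipeline still costs more than a plain grid scan. The sketch below assumes a hypothetical log record (`SampleLog`) with hit flags and two cost fields measured in the same unit, such as patches or tokens processed; none of these names come from the paper or its code.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SampleLog:
    # Hypothetical per-sample record; field names are illustrative only.
    expert_hit: bool      # expert-assisted search localized the target
    scan_hit: bool        # fallback semantic scanning localized the target
    pipeline_cost: float  # total cost of the adaptive pipeline on this sample
    grid_cost: float      # cost a full grid scan would have required


def falsifying_fraction(logs: Iterable[SampleLog]) -> float:
    """Fraction of samples where both stages miss at higher cost than a grid scan."""
    records = list(logs)
    if not records:
        return 0.0
    bad = sum(
        1
        for s in records
        if not s.expert_hit and not s.scan_hit and s.pipeline_cost > s.grid_cost
    )
    return bad / len(records)
```

A fraction near zero on a benchmark built around small or unusual targets would support the load-bearing premise; a large fraction would count against it.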


Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note



The core takeaway is that this paper gives a clean procedural fix for the coverage-efficiency trade-off in visual search for multimodal models. It starts with expert-assisted search when global context falls short, then switches to semantic scanning only on failure, using Semantic Guided Adaptive Patching to keep regions coherent and Dynamic Bottom-Up Search guided by a visual complexity prior for local detail. The training-free design and released code stand out as practical pluses that let others reproduce the pipeline quickly. The description of how rigid grids cause fragmentation and how the new mechanisms avoid that is straightforward and easy to follow. The workflow itself is falsifiable and avoids obvious circularity or hidden parameters.

That said, the SOTA accuracy and efficiency claims rest entirely on experiments whose details, baselines, ablations, and error bars are not visible in the abstract, so the size of the actual gain is still unclear. The components draw from existing ideas in adaptive search and patching, so the novelty sits in the scheduling logic rather than a fundamental shift. A minor risk is that the failure trigger for scanning could add overhead or miss cases, but nothing in the description suggests a load-bearing flaw.

This paper is for researchers building efficient perception stacks for MLLMs on high-resolution data who need concrete implementation ideas rather than theoretical breakthroughs. It deserves peer review because the method is described precisely enough for referees to test the empirical claims and the code lowers the barrier to verification.

Referee Report

0 major / 3 minor

Summary. The paper introduces CVSearch, a training-free adaptive framework for high-resolution image perception in multimodal LLMs. It uses an Assess-then-Search workflow that first applies expert-assisted search when global information is insufficient and triggers semantic-aware scanning (with Semantic Guided Adaptive Patching and Dynamic Bottom-Up Search driven by a Visual Complexity prior) only upon failure. The central claim is that this resolves the coverage-efficiency trade-off and achieves state-of-the-art accuracy with substantially improved search efficiency on HR benchmarks; code is released.

Significance. If the empirical results hold, the work addresses a practical bottleneck for MLLMs on high-resolution inputs by adaptively combining search primitives without training. The training-free design and released code are explicit strengths that support reproducibility and allow direct falsification of the pipeline. This could meaningfully improve vision-language performance on tasks requiring fine local detail.

minor comments (3)

[Method] The description of how the initial global insufficiency check is implemented (e.g., which MLLM outputs or thresholds are used) should be expanded with pseudocode or a concrete example to make the Assess-then-Search decision reproducible.
[Experiments] Table or figure reporting the efficiency metrics (e.g., number of patches or tokens processed) should include standard deviations across runs or datasets to substantiate the 'substantially improving search efficiency' claim.
[Method] The paper should clarify whether the expert-assisted search component relies on any external models or APIs whose failure modes could affect the overall pipeline.
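The second comment asks for dispersion alongside the efficiency numbers. A minimal sketch of that kind of reporting, using only Python's standard `statistics` module, might look like the following. The helper names `mean_and_std` and `report_row` and the default unit label are assumptions for illustration; the inputs would be per-run or per-dataset measurements such as patches or tokens processed.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_std(values: Sequence[float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation of repeated measurements."""
    if not values:
        raise ValueError("need at least one measurement")
    mean = statistics.mean(values)
    # The sample standard deviation is undefined for a single run; report 0.0 then.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return float(mean), float(std)


def report_row(method: str, values: Sequence[float], unit: str = "patches") -> str:
    """Format one table row as 'method: mean ± std unit (n=runs)'."""
    mean, std = mean_and_std(values)
    return f"{method}: {mean:.1f} ± {std:.1f} {unit} (n={len(values)})"
```

Reporting the run count next to the spread lets a reader judge whether an efficiency gap between two methods exceeds run-to-run variation.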

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our work, the recognition of its practical significance for MLLMs on high-resolution inputs, and the recommendation for minor revision. The training-free design and code release are indeed intended to facilitate reproducibility and direct evaluation.

Circularity Check

0 steps flagged

No significant circularity; procedural framework with external empirical validation


The paper describes a training-free procedural framework (Assess-then-Search workflow with expert-assisted search, Semantic Guided Adaptive Patching, and Dynamic Bottom-Up Search) without any equations, derivations, fitted parameters, or self-referential definitions. Central claims rest on experimental results from HR benchmarks, which are independent of the method description. No self-citations are invoked as load-bearing uniqueness theorems, and no ansatzes or known results are renamed or smuggled. The pipeline is explicitly falsifiable and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are detailed. The framework implicitly assumes that semantic consistency in patching can be reliably computed from existing model features without additional training.

pith-pipeline@v0.9.0 · 5758 in / 1107 out tokens · 15493 ms · 2026-05-25T04:45:09.494205+00:00 · methodology


Original abstract

High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.

Figures

Figures reproduced from arXiv: 2605.23655 by Bin Chen, Haoqian Kang, Jinpeng Wang, Ke Chen, Liupeng Li, Yaowei Wang, Zhenyu Lu.

**Figure 1.** (a) Real-world HR image perception requires handling targets with distinct granularities. (b) Existing methods struggle to balance coverage and efficiency. Visual expert assisted methods lack sufficient coverage for tiny targets, while scan-based methods ensure coverage but suffer from low efficiency. (c) Built upon Qwen2.5-VL-7B, CVSearch achieves the best balance, delivering SOTA accuracy with competitiv…

**Figure 2.** Illustration of the CVSearch framework. (a) Workflow. A cognitive Assess-then-Search mechanism triggers Visual Expert Search when global information is insufficient (c_q < τ_q). Expert failure (proposals B_e = ∅) activates Scene-aware Scanning, which either yields visual evidence upon success or returns the optimal candidate for iterative search upon failure. (b) Visual Expert Search. This module parses queri…

**Figure 4.** Performance analysis of different search modes. The bar chart (left axis) displays the usage frequency of each mode, while the scatter plot (right axis) reports the corresponding accuracy. paradigms. Compared to the lightweight expert-assisted approach (SAM 3), CVSearch delivers substantial accuracy improvements (e.g., +4.7% on HR-4K) while maintaining competitive throughput. More importantly, rather tha…

**Figure 5.** Ablation study on the information sufficiency threshold τ_q on V* Bench. Evaluated with Qwen2.5-VL-7B, we analyze (a) the usage ratio of different search modes and (b) their corresponding accuracy as τ_q varies from 0.5 to 0.9. C. Qualitative Analysis and Case Studies To provide intuitive insights into the operational mechanisms of CVSearch, this section presents qualitative visualizations on challenging sam…

**Figure 6.** Comparison of patching strategies on a text-rich scene. Zoom Eye and RAP impose rigid grids that sever the storefront sign (“LIBROS”) and the entrance, disrupting OCR and scene understanding. In contrast, our CVSearch adaptively partitions the image based on semantic coherence. The annotated values represent Visual Complexity Scores. The high visual complexity score (0.95) of the central storefront trigger…

**Figure 7.** Visualization of semantic preservation in architectural scenes. Rigid partitioning methods (Zoom Eye and RAP) fragment the continuous structure of the church into disjoint blocks, separating the spire from the nave. CVSearch effectively separates the foreground architecture from the low-complexity sky background (0.49). The annotated values represent Visual Complexity Scores. The adaptive patching respects…

**Figure 8.** Impact of patching on object integrity. In the Zoom Eye and RAP examples, the truck is arbitrarily sliced by grid lines, making it difficult to perceive the vehicle as a whole. CVSearch utilizes semantic clustering to maintain the integrity of the truck cabin and the surrounding environment. The annotated values represent Visual Complexity Scores. The resulting patches group the vehicle features together w…

**Figure 9.** Comparison in cluttered scenarios. While rigid grids (Zoom Eye, RAP) indiscriminately divide the scene, CVSearch demonstrates superior flexibility. The annotated values represent Visual Complexity Scores. By calculating visual complexity scores to identify information-dense regions, our method ensures detailed scrutiny where necessary while pruning low-complexity background areas to maintain efficiency. An…

**Figure 10.** Adaptive search modes for efficiency. Left: For prominent targets, CVSearch employs Direct Answer to minimize latency. Right: For small objects, it activates Visual Expert Assisted Search for precise localization, avoiding the cost of exhaustive scanning.

**Figure 11.** Iterative Search for hard samples. When initial searches fail, the system zooms into the best candidate. Left: The enhanced resolution enables the Visual Expert to detect the “tissue box”. Right: For the extremely small “helmet”, the Expert fails again, but the fine-grained Scene-aware Scanning successfully captures the target in the second round. Query: What is the color of the SUV car? Ground Truth: Sil…

**Figure 12.** Failures despite accurate localization. Left: MLLM hallucinates the car color despite correct expert cropping. Right: Answer diverges due to attribute ambiguity (describing the clock face instead of the frame).
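The captions for Figures 6 to 9 describe patches annotated with Visual Complexity Scores, with high-scoring regions examined closely and low-complexity background pruned. A generic way to express that idea is a best-first traversal over scored patches, sketched below. This is not the paper's Dynamic Bottom-Up Search: how the scores are computed, the `prune_below=0.5` cutoff, the `budget` cap, and the `contains_target` callback are all assumptions made for illustration.

```python
import heapq
from typing import Callable, Dict, List, Optional, Tuple


def complexity_ordered_search(
    scores: Dict[str, float],                # patch id -> complexity score (given)
    contains_target: Callable[[str], bool],  # assumed check, e.g. an MLLM query
    prune_below: float = 0.5,                # assumed pruning cutoff
    budget: Optional[int] = None,            # assumed cap on patches inspected
) -> Tuple[Optional[str], List[str]]:
    """Visit patches from highest to lowest score, skipping pruned ones."""
    # heapq is a min-heap, so negate scores to pop the highest score first.
    heap = [(-score, pid) for pid, score in scores.items() if score >= prune_below]
    heapq.heapify(heap)
    visited: List[str] = []
    while heap and (budget is None or len(visited) < budget):
        _, pid = heapq.heappop(heap)
        visited.append(pid)
        if contains_target(pid):
            return pid, visited
    return None, visited
```

With scores such as 0.95 for a storefront patch and 0.49 for a sky patch (the two values quoted in the captions), the storefront would be inspected first and the sky patch would be skipped under the assumed cutoff.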


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 12 internal anchors

[1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

[2] Carion, N., Gustafson, L., Hu, Y.-T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K. V., Khedr, H., Huang, A., et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719.

[3] Ge, C., Cheng, S., Wang, Z., Yuan, J., Gao, Y., Song, J., Song, S., Huang, G., and Zheng, B. Convllava: Hierarchical backbones as visual encoder for large multimodal models. arXiv preprint arXiv:2405.15738.

[4] Huang, M., Liu, Y., Liang, D., Jin, L., and Bai, X. Mini-monkey: Alleviating the semantic sawtooth effect for lightweight MLLMs via complementary image pyramid. arXiv preprint arXiv:2408.02034.

[5] Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. GPT-4o system card. arXiv preprint arXiv:2410.21276.

[6] Li, B., Zhang, Y., Guo, D., Zhang, R., Li, F., Zhang, H., Zhang, K., Zhang, P., Li, Y., Liu, Z., et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a. Li, F., Zhang, R., Zhang, H., Zhang, Y., Li, B., Li, W., Ma, Z., and Li, C. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arX...

[7] Lu, Z., Li, L., Wang, J., Feng, Y., Chen, B., Chen, K., and Wang, Y. CoPRS: Learning positional prior from chain-of-thought for reasoning segmentation. In The Fourteenth International Conference on Learning Representations, 2026a. Lu, Z., Li, L., Wang, J., Kang, H., Feng, Y., Chen, K., and Wang, Y. SegCompass: Exploring interpretable alignment with ...

[8] Müllner, D. Modern hierarchical, agglomerative clustering algorithms. arXiv preprint arXiv:1109.2378.

[9] Pan, J., Wang, R., Qian, T., Mahdi, M., Fu, Y., Xue, X., Huang, X., Van Gool, L., Paudel, D. P., and Fu, Y. V$^{2}$-SAM: Marrying SAM2 with multi-prompt experts for cross-view object correspondence. arXiv preprint arXiv:2511.20886.

[10] Shen, H., Zhao, K., Zhao, T., Xu, R., Zhang, Z., Zhu, M., and Yin, J. Zoomeye: Enhancing multimodal LLMs with human-like zooming capabilities through tree-based image exploration. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 6613–6629.

[11] Team, Q. et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2(3).

[12] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[13] Vardakas, G., Papakostas, I., and Likas, A. Deep clustering using the soft silhouette score: Towards compact and well-separated clusters. arXiv preprint arXiv:2402.00608.

[14] Wang, H., Li, X., Huang, Z., Wang, A., Wang, J., Zhang, T., Zheng, J., Bai, S., Kang, Z., Feng, J., et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025a. Wang, H., Su, A., Ren, W., Lin, F., and Chen, W. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven rei...

[15] ai. arXiv preprint arXiv:2403.04652.

[16] Zhang, J., Khayatkhoei, M., Chhikara, P., and Ilievski, F. MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. arXiv preprint arXiv:2502.17422, 2025a. Zhang, L., Yu, J., Xiong, H., Hu, P., Zhuge, Y., Lu, H., and He, Y. Finers: Fine-grained reasoning and segmentation of small objects with reinforcement learni...

[17] Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479.

[18] Zou, A., Phan, L., Chen, S., Campbell, J., Guo, P., Ren, R., Pan, A., Yin, X., Mazeika, M., Dombrowski, A.-K., et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.
