pith. machine review for the scientific record.

arxiv: 2601.15356 · v5 · submitted 2026-01-21 · 📡 eess.IV · cs.AI

Recognition: no theorem link

Q-Probe: Scaling Image Quality Assessment to High Resolution via Context-Aware Agentic Probing

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 12:40 UTC · model grok-4.3

classification: 📡 eess.IV · cs.AI
keywords: image quality assessment · high-resolution IQA · multimodal large language models · reinforcement learning · context-aware cropping · agentic probing · Vista-Bench

The pith

Q-Probe uses context-aware probing to let multimodal models judge high-resolution image quality without crop or depth-of-field biases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Q-Probe as an agentic framework that adapts reinforcement learning and multimodal models to image quality assessment at high resolutions. It targets the failure of global-view methods to catch subtle local degradations and the tendency of zoom-in approaches to misread crops or natural depth of field as artifacts. A three-stage training process with a novel context-aware cropping strategy progressively aligns the model to human preferences while removing those biases, supported by a new benchmark called Vista-Bench for fine-grained local analysis. This matters because many real-world images are large and require reliable detection of localized quality issues without sacrificing performance at other scales.

Core claim

Q-Probe is the first agentic IQA framework designed to scale image quality assessment to high resolution via context-aware probing. It constructs Vista-Bench for fine-grained local degradation analysis and applies a three-stage training paradigm that progressively aligns the model with human preferences while eliminating causal bias through a novel context-aware cropping strategy.

What carries the argument

The context-aware cropping strategy inside the three-stage training paradigm, which supplies surrounding image context during each probe step to block misreading of local regions as degraded or artifactual.
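The paper's exact probe construction is not reproduced here; as a minimal sketch of the load-bearing idea, assume each probe pairs the full-resolution local crop with an expanded context crop and a downsampled global view (the function and its parameters are illustrative, not the authors' implementation):

```python
from PIL import Image

def context_aware_probe(image_path, box, context_scale=2.0, global_size=448):
    """Illustrative probe input builder (not the paper's implementation).

    box: (left, top, right, bottom) region the agent wants to inspect.
    Returns the sharp local crop plus two context signals: a wider crop
    around the same region and a coarse global view, so the model can
    separate true degradation from bokeh or framing effects.
    """
    img = Image.open(image_path)
    w, h = img.size
    left, top, right, bottom = box

    # Expand the probe box around its own center to expose surroundings.
    cx, cy = (left + right) / 2.0, (top + bottom) / 2.0
    hw = (right - left) / 2.0 * context_scale
    hh = (bottom - top) / 2.0 * context_scale
    ctx_box = (max(0, int(cx - hw)), max(0, int(cy - hh)),
               min(w, int(cx + hw)), min(h, int(cy + hh)))

    local = img.crop(box)                                 # full-resolution region
    context = img.crop(ctx_box)                           # its neighborhood
    global_view = img.resize((global_size, global_size))  # coarse whole image
    return local, context, global_view
```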

If this is right

  • Q-Probe reaches state-of-the-art accuracy on high-resolution IQA tasks.
  • Performance stays superior when the same model is tested on images at many different resolution levels.
  • Vista-Bench supplies a dedicated test set for measuring how well models handle fine local degradations.
  • The approach reduces spurious biases that arise from simple cropping or from misinterpreting natural depth of field.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same context-aware probing pattern could transfer to other high-resolution vision tasks such as defect detection or medical image review where local context is critical.
  • Training stages might be reused with different multimodal backbones to test whether the bias reduction holds in specialized domains like satellite or microscopy imagery.
  • Real-time systems could adopt the probing loop to flag quality problems in large incoming images without needing to downsample first.

Load-bearing premise

The context-aware cropping strategy in the three-stage training fully eliminates the cropping-implies-degradation bias and depth-of-field misinterpretations without introducing new artifacts or lowering alignment with human preferences on local degradations.

What would settle it

Apply Q-Probe to a collection of high-resolution images that contain only natural depth-of-field variations and no actual degradations, then check whether the resulting quality scores stay high and continue to match independent human ratings.
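A sketch of that check, assuming per-image model scores and independent human mean opinion scores are available as parallel arrays on a 1-5 MOS-like scale (the names and the score floor are placeholders):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def dof_only_check(model_scores, human_mos, score_floor=4.0):
    """Evaluate scores on degradation-free, depth-of-field-only images.

    The test passes if (a) most scores stay above score_floor, i.e. the
    model is not penalizing natural bokeh as blur, and (b) the scores
    still correlate with independent human ratings.
    """
    s = np.asarray(model_scores, dtype=float)
    m = np.asarray(human_mos, dtype=float)
    srcc, _ = spearmanr(s, m)                      # rank agreement
    plcc, _ = pearsonr(s, m)                       # linear agreement
    frac_high = float((s >= score_floor).mean())   # share left unpenalized
    return {"SRCC": srcc, "PLCC": plcc, "frac_scored_high": frac_high}
```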

Figures

Figures reproduced from arXiv: 2601.15356 by Chengjun Xie, Weiwei Yu, Xiang Li, Xuanhua He, Xueheng Li, Yu Wang, Zhangchi Hu.

Figure 1. Challenges in detecting subtle distortions via global perception versus local zooming. (a) Existing MLLMs fail to capture subtle local artifacts. (b) Even when visible via cropping, Semantic Robustness Bias causes models to ignore defects in key semantic areas (e.g., face). (c-d) Naive zooming leads to Logic Collapse, where the model misinterprets natural bokeh as blur (c) or falsely learns that Zooming im… view at source ↗

Figure 2. This diagram illustrates the construction pipeline of Vista-Bench and the Data Flywheel for SFT. Specifically, we utilize wavelet transforms to decouple structure from texture, selectively injecting artifacts into texture-rich semantic regions, while employing Gemini-2.5 Pro to generate importance-weighted annotations for fine-grained perception probing. To support SFT, we generate traces that interleave g… view at source ↗

Figure 3. Overview of the three-stage training framework. Initially, RL Pre-training leverages ranking rewards to align global perception with human preferences. Subsequently, hybrid-resolution SFT enables the model to acquire robust logical reasoning. Finally, the RL Post-training stage fine-tunes the model for precise degradation detection and adaptive tool invocation. For localization, I(·) is an indicator functi… view at source ↗

Figure 4. To monitor the training dynamics, we calculated the average standard deviation of predicted scores across multiple inference runs at various checkpoints. The observed monotonic decrease in variance confirms that Q-Probe achieves greater stability throughout Stage-1. … view at source ↗
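The stability metric behind Figure 4 is straightforward to reproduce in outline; a sketch assuming a stochastic `predict_score(checkpoint, image)` callable (a placeholder, e.g. sampling-based decoding of a numeric score):

```python
import numpy as np

def stability_curve(checkpoints, images, predict_score, n_runs=8):
    """Average per-image std of predicted scores at each checkpoint (cf. Fig. 4).

    Repeated stochastic inference on the same image measures how
    consistent the model's quality judgment is; a curve that decreases
    over training indicates growing stability.
    """
    curve = []
    for ckpt in checkpoints:
        per_image_std = [
            np.std([predict_score(ckpt, img) for _ in range(n_runs)])
            for img in images
        ]
        curve.append(float(np.mean(per_image_std)))
    return curve  # one value per checkpoint, expected to decrease
```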
read the original abstract

Reinforcement Learning (RL) has empowered Multimodal Large Language Models (MLLMs) to achieve superior human preference alignment in Image Quality Assessment (IQA). However, existing RL-based IQA models typically rely on coarse-grained global views, failing to capture subtle local degradations in high-resolution scenarios. While emerging "Thinking with Images" paradigms enable multi-scale visual perception via zoom-in mechanisms, their direct adaptation to IQA induces spurious "cropping-implies-degradation" biases and misinterprets natural depth-of-field as artifacts. To address these challenges, we propose Q-Probe, the first agentic IQA framework designed to scale IQA to high resolution via context-aware probing. First, we construct Vista-Bench, a pioneering benchmark tailored for fine-grained local degradation analysis in high-resolution IQA settings. Furthermore, we propose a three-stage training paradigm that progressively aligns the model with human preferences, while simultaneously eliminating causal bias through a novel context-aware cropping strategy. Extensive experiments demonstrate that Q-Probe achieves state-of-the-art performance in high-resolution settings while maintaining superior efficacy across resolution scales.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Q-Probe, the first agentic IQA framework for scaling image quality assessment to high resolutions via context-aware probing in multimodal LLMs. It introduces Vista-Bench as a new benchmark for fine-grained local degradation analysis and a three-stage training paradigm that uses a novel context-aware cropping strategy to align with human preferences while eliminating cropping-implies-degradation bias and depth-of-field misinterpretation. The central claim is that this achieves SOTA performance in high-resolution settings while maintaining superior efficacy across resolution scales.

Significance. If the performance claims and bias-elimination results hold, Q-Probe would represent a meaningful advance in RL-aligned IQA by enabling reliable local degradation detection at high resolutions, with potential impact on applications such as computational photography and medical imaging. The construction of Vista-Bench is a clear positive contribution as a specialized benchmark, and the three-stage training offers a structured approach to addressing known cropping artifacts in agentic visual reasoning.

major comments (3)
  1. [§3.3] §3.3 (context-aware cropping strategy): the description states that the strategy eliminates the cropping-implies-degradation bias and depth-of-field misinterpretation, but supplies no explicit reward formulation, loss term, or selection criterion showing how global-context leakage is prevented while preserving correlation with human judgments on fine-grained local degradations.
  2. [Table 2] Table 2 and §4.2 (high-resolution results): the SOTA claims on Vista-Bench and cross-resolution comparisons report performance gains without error bars, statistical significance tests, or dataset-size details, making it impossible to verify whether the improvements are robust or driven by the context-aware component.
  3. [§4.3] §4.3 (ablation studies): no quantitative ablation isolates the contribution of context-aware cropping versus standard cropping or the three-stage schedule alone, leaving the central assumption that the strategy fully removes spurious bias without new artifacts untested.
minor comments (2)
  1. [Introduction] The introduction uses 'agentic probing' and 'Thinking with Images' without a concise operational definition or explicit contrast to prior zoom-in mechanisms.
  2. [§2.2] Vista-Bench construction details (image sources, degradation types, annotation protocol) are referenced but lack a dedicated table summarizing statistics such as number of images, resolution distribution, and inter-annotator agreement.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications from the manuscript and committing to revisions that strengthen the presentation without altering the core claims.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (context-aware cropping strategy): the description states that the strategy eliminates the cropping-implies-degradation bias and depth-of-field misinterpretation, but supplies no explicit reward formulation, loss term, or selection criterion showing how global-context leakage is prevented while preserving correlation with human judgments on fine-grained local degradations.

    Authors: We agree that an explicit formulation would improve rigor. Section 3.3 describes the context-aware cropping as part of the three-stage training, where global context is retained during probing to avoid misinterpreting depth of field or implying degradation from crops, while aligning with human preferences on local degradations. In the revised manuscript, we will add the precise mathematical selection criterion and associated loss term used to enforce this, formalizing how global-context leakage is prevented (one plausible shape is sketched after this list). revision: yes

  2. Referee: [Table 2] Table 2 and §4.2 (high-resolution results): the SOTA claims on Vista-Bench and cross-resolution comparisons report performance gains without error bars, statistical significance tests, or dataset-size details, making it impossible to verify whether the improvements are robust or driven by the context-aware component.

    Authors: Vista-Bench dataset size and construction details are provided in §4.1. Table 2 reports consistent gains across resolutions and metrics. We acknowledge that error bars and significance tests would better demonstrate robustness. In the revision, we will include standard deviations from multiple runs where computationally feasible, along with statistical tests confirming that the improvements stem from the context-aware component rather than other factors (a standard paired-bootstrap recipe is sketched after this list). revision: yes

  3. Referee: [§4.3] §4.3 (ablation studies): no quantitative ablation isolates the contribution of context-aware cropping versus standard cropping or the three-stage schedule alone, leaving the central assumption that the strategy fully removes spurious bias without new artifacts untested.

    Authors: Section 4.3 presents ablations on the overall three-stage paradigm and framework components. We concur that targeted quantitative isolation of context-aware cropping versus standard cropping would more directly validate bias removal. The revised manuscript will expand §4.3 with additional ablation results comparing these variants to confirm the strategy eliminates spurious biases without introducing new artifacts (the shape of such a harness is sketched after this list). revision: yes
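None of the promised formalizations appear in the text reviewed above, so the following are reader's sketches of plausible shapes, not the authors' definitions. For response 1, one possible selection/reward criterion that taxes the two named biases directly (all terms hypothetical):

```python
def anti_bias_reward(pred_degraded, is_degraded, is_bokeh_only,
                     alpha=1.0, beta=0.5):
    """Illustrative reward shape, not the paper's formulation.

    pred_degraded: model verdict on a context-augmented crop.
    is_degraded:   human ground truth for that region.
    is_bokeh_only: region is clean but shows natural depth of field.
    Correct verdicts earn +alpha; calling bokeh a defect draws an extra
    -beta, directly penalizing the depth-of-field misinterpretation.
    """
    reward = alpha if pred_degraded == is_degraded else -alpha
    if pred_degraded and is_bokeh_only:
        reward -= beta  # the bokeh-as-blur failure mode costs more
    return reward
```

For response 2, a standard paired bootstrap over benchmark images would back the significance claim (array names are placeholders):

```python
import numpy as np
from scipy.stats import spearmanr

def paired_bootstrap_srcc(scores_a, scores_b, human_mos, n_boot=10_000, seed=0):
    """Approximate one-sided p-value for SRCC(A) > SRCC(B) on shared images.

    Resamples images with replacement so both systems are always compared
    on the same bootstrap sample.
    """
    rng = np.random.default_rng(seed)
    a, b, m = (np.asarray(x, dtype=float) for x in (scores_a, scores_b, human_mos))
    n = len(m)
    wins = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        srcc_a, _ = spearmanr(a[idx], m[idx])
        srcc_b, _ = spearmanr(b[idx], m[idx])
        wins += srcc_a > srcc_b
    return 1.0 - wins / n_boot
```

And for response 3, the requested ablation has a natural harness: hold the three-stage schedule fixed and swap only the cropping variant (training and evaluation entry points left as placeholders):

```python
def cropping_ablation(train_fn, eval_fn, bench):
    """Hypothetical ablation harness isolating the cropping strategy.

    train_fn(cropping=...) runs the full three-stage schedule with the
    given variant; eval_fn(model, bench) returns metrics such as SRCC on
    Vista-Bench and the false-positive rate on bokeh-only regions. With
    everything else fixed, gaps between rows attribute bias reduction to
    the cropping strategy itself.
    """
    results = {}
    for variant in ("context_aware", "naive_crop", "global_only"):
        model = train_fn(cropping=variant)
        results[variant] = eval_fn(model, bench)
    return results
```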

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces a new framework (Q-Probe), benchmark (Vista-Bench), and three-stage training procedure with context-aware cropping as original constructions. Claims of SOTA high-resolution IQA performance rest on empirical experiments rather than any equations, fitted parameters, or derivations that reduce to prior inputs by construction. No self-definitional steps, fitted-input predictions, or load-bearing self-citations appear in the abstract or described method; the central results are externally falsifiable via the new benchmark and do not collapse to renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim rests on standard assumptions about RL alignment in MLLMs and the effectiveness of the newly proposed context-aware strategy and benchmark, which lack independent external validation in the abstract.

axioms (2)
  • domain assumption Reinforcement learning can align multimodal large language models with human preferences for image quality assessment
    Stated as the foundation that existing RL-based IQA models build upon.
  • ad hoc to paper Context-aware cropping can eliminate cropping-implies-degradation bias while preserving accurate local degradation detection
    Core of the novel three-stage training paradigm introduced to address limitations of direct zoom-in adaptation.
invented entities (2)
  • Vista-Bench no independent evidence
    purpose: Benchmark for fine-grained local degradation analysis in high-resolution IQA
    Pioneering benchmark constructed specifically for this framework.
  • context-aware cropping strategy no independent evidence
    purpose: Eliminate causal bias during probing in the training process
    Novel component of the three-stage training to avoid spurious biases.

pith-pipeline@v0.9.0 · 5512 in / 1566 out tokens · 42134 ms · 2026-05-16T12:40:34.997143+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 7 internal anchors

  1. [1]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923.

  2. [2]

    Q-Ponder: A Unified Training Pipeline for Reasoning-Based Visual Quality Assessment

    Cai, Z., Zhang, J., Yuan, X., Jiang, P.-T., Chen, W., Tang, B., Yao, L., Wang, Q., Chen, J., and Li, B. Q-Ponder: A unified training pipeline for reasoning-based visual quality assessment. arXiv preprint arXiv:2506.05384.

  3. [3]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261.

  4. [4]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

  5. [5]

    Refine-IQA: Multi-Stage Reinforcement Finetuning for Perceptual Image Quality Assessment

    Jia, Z., Qian, J., Zhang, Z., Chen, Z., and Min, X. Refine-IQA: Multi-stage reinforcement finetuning for perceptual image quality assessment. arXiv preprint arXiv:2508.03763.

  6. [6]

    Crafting papers on machine learning

    Langley, P. Crafting papers on machine learning. In Langley, P. (ed.), Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 1207–1216, Stanford, CA.

  7. [7]

    Q-Insight: Understanding Image Quality via Visual Reinforcement Learning

    Li, W., Zhang, X., Zhao, S., Zhang, Y., Li, J., Zhang, L., and Zhang, J. Q-Insight: Understanding image quality via visual reinforcement learning. arXiv preprint arXiv:2503.22679.

  8. [8]

    Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

    Wang, H., Su, A., Ren, W., Lin, F., and Chen, W. Pixel Reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025a.

    Wang, L. A survey on IQA. arXiv preprint arXiv:2109.00347.

  9. [9]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191.

  10. [10]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Ding, L., Zeng, M., Zhou, X., Shen, L., Luo, Y., Yu, W., and Tao, D. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 7907–7915, 2025b.

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., We...

  11. [11]

    VisualQuality-R1: Reasoning-Induced Image Quality Assessment via Reinforcement Learning to Rank

    Wu, T., Zou, J., Liang, J., Zhang, L., and Ma, K. VisualQuality-R1: Reasoning-induced image quality assessment via reinforcement learning to rank. arXiv preprint arXiv:2505.14460.

  12. [12]

    Reasoning as representation: Rethinking visual reinforcement learning in image quality assessment

    Zhao, S., Zhang, X., Li, W., Li, J., Zhang, L., Xue, T., and Zhang, J. Reasoning as representation: Rethinking visual reinforcement learning in image quality assessment. arXiv preprint arXiv:2510.11369.

  13. [13]

    DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

    Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., and Yu, X. DeepEyes: Incentivizing "Thinking with Images" via reinforcement learning. arXiv preprint arXiv:2505.14362.