Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

Puchao Zhou; Shaohui Liu; Xiaoming Wang; Yachun Mi; Yanfeng Wu; Yu Li

arxiv: 2605.17748 · v1 · pith:AGABMVBOnew · submitted 2026-05-18 · 💻 cs.CV

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

Yu Li , Puchao Zhou , Yachun Mi , Yanfeng Wu , Xiaoming Wang , Shaohui Liu This is my paper

Pith reviewed 2026-05-20 12:44 UTC · model grok-4.3

classification 💻 cs.CV

keywords Blind Image Quality AssessmentVision TransformerGlobal-Local InteractionAdapterAuthentic DistortionsParameter EfficiencyFeature Fusion

0 comments

The pith

Global-Local Interaction Adapter adapts pre-trained Vision Transformers for blind image quality assessment by fusing global semantics with local details to achieve higher accuracy and robustness using far fewer trainable parameters.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Blind image quality assessment struggles with the wide variety of authentic distortions in natural images, where small datasets and high annotation costs limit progress. Large pre-trained Vision Transformers offer strong semantic representations but are hard to apply directly because of their size and the need for efficient adaptation. The paper introduces a dual-stream extraction process that pulls both broad context and fine local cues, then fuses them interactively inside a lightweight adapter. This joint retention of global and local information produces better quality predictions on real-world distorted images while cutting the number of parameters that must be trained from scratch. The result matters because it makes powerful pre-trained models practical for IQA without demanding massive new labeled data or full-scale retraining.

Core claim

The paper presents the Global-Local Interaction Adapter (GLIA) as a framework that harnesses pre-trained Vision Transformers for Blind Image Quality Assessment. It employs a dual-stream feature extraction mechanism together with interactive global-local fusion. By retaining both global semantic information and fine-grained local details, the method yields superior prediction accuracy and robustness on authentically distorted images while using significantly fewer trainable parameters.

What carries the argument

Global-Local Interaction Adapter (GLIA), a dual-stream feature extraction and interactive global-local fusion mechanism that transfers pre-trained Vision Transformer representations to image quality assessment.

If this is right

Produces higher prediction accuracy on multiple public BIQA benchmarks containing authentic distortions.
Increases robustness to the complex and varied distortions found in natural images.
Reduces the number of trainable parameters required relative to direct fine-tuning of large Vision Transformers.
Mitigates the impact of limited labeled IQA datasets by leveraging pre-trained semantic knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same global-local fusion pattern could be tested on other perceptual tasks such as aesthetic assessment or distortion localization where both context and detail matter.
Lower parameter counts open the possibility of running high-accuracy IQA models directly on edge devices for real-time photography or streaming applications.
Similar adapter designs may reduce compute barriers for adapting other large pre-trained vision models to specialized low-data regimes beyond quality assessment.

Load-bearing premise

The dual-stream feature extraction and interactive global-local fusion can transfer semantic capabilities from large pre-trained Vision Transformers to the task of modeling diverse authentic distortions without requiring extensive task-specific data or losing critical representational power.

What would settle it

If the GLIA-adapted model shows no statistically significant gains in PLCC or SRCC on standard authentic-distortion benchmarks such as LIVE, CSIQ or TID2013 compared with full fine-tuning or existing lightweight ViT baselines, while also using comparable or greater parameter counts, the central claim would be falsified.

read the original abstract

In the field of Blind Image Quality Assessment (BIQA), accurately predicting the perceptual quality of authentically distorted images remains highly challenging due to the diverse and complex distortions present in natural environments. Although existing methods have achieved notable accuracy, their scalability is often constrained by the high cost of subjective annotation and the limited size of available datasets. Recent advances in large-scale pre-trained vision models have introduced powerful semantic and representational capabilities, yet their application to IQA tasks is hindered by substantial computational demands and suboptimal fine-tuning efficiency. To overcome these limitations, we introduce the Global-Local Interaction Adapter (GLIA), a novel framework that effectively harnesses pre-trained Vision Transformers through a dual-stream feature extraction mechanism coupled with interactive global-local fusion. By jointly retaining global semantic information and fine-grained local details, our approach delivers superior prediction accuracy and robustness while requiring significantly fewer trainable parameters. Extensive experiments on multiple benchmarks validate the effectiveness and superiority of our approach.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GLIA is a practical but incremental adapter for adapting frozen ViTs to blind IQA; the dual-stream fusion idea is reasonable but needs the experiments to show it actually beats standard adapters on parameter count and accuracy.

read the letter

The main point is that this paper proposes GLIA, a dual-stream adapter with interactive global-local fusion, to let pre-trained Vision Transformers handle blind image quality assessment on authentic distortions while training far fewer parameters than full fine-tuning. The goal is to keep the backbone mostly frozen and still capture both semantic content and local distortion cues through some form of cross-stream interaction. That setup addresses a real practical issue in IQA, where datasets are small and full updates to large models are expensive. If the fusion block is lightweight and the local stream really picks up fine-grained artifacts without destroying pre-trained features, the approach could be useful for people who want to plug ViTs into perceptual tasks without heavy compute. The paper appears to run the usual suite of experiments on standard BIQA benchmarks, which is the right place to test this. Credit for focusing on parameter efficiency and for trying to make the adaptation explicit rather than just adding a generic head. The soft spots are that dual-stream and interactive fusion ideas have shown up in other vision adapter work, so the specific combination here does not look like a large conceptual advance. The abstract's claims of superior accuracy and robustness will stand or fall on the ablations and comparisons; if those tables only show modest gains over recent ViT-IQA baselines or if the parameter savings shrink once the fusion layers are counted, the contribution narrows. The stress-test worry about whether both streams stay frozen or whether the interaction adds hidden parameters is worth checking in the methods section. This is the kind of paper that fits a specialized CV venue or workshop on efficient adaptation or perceptual metrics. Readers already working on ViT fine-tuning for quality assessment or low-data regimes might pick up a design trick or two. It deserves a serious referee because the problem is well-motivated, the method is described at a level that can be reproduced, and the experiments are the right way to settle whether the interactive fusion delivers on the efficiency promise.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes the Global-Local Interaction Adapter (GLIA) for Blind Image Quality Assessment (BIQA). It introduces a dual-stream feature extraction mechanism from pre-trained Vision Transformers paired with an interactive global-local fusion module. The central claim is that this architecture retains both global semantic information and fine-grained local details, yielding superior prediction accuracy and robustness on authentically distorted images while training significantly fewer parameters than full fine-tuning or competing adapters, with validation on multiple benchmarks.

Significance. If the experimental claims hold, the work would be significant for the BIQA community by demonstrating an efficient, low-parameter adaptation strategy for large pre-trained ViTs on small, subjectively annotated datasets. This could improve scalability for real-world distortion modeling without extensive task-specific retraining.

major comments (2)

[Abstract] Abstract: the central claims of 'superior prediction accuracy and robustness' and 'significantly fewer trainable parameters' are load-bearing yet unsupported by any quantitative results, error bars, ablation studies, dataset details, or parameter counts. Without these, the superiority and efficiency assertions cannot be evaluated against the paper's own evidence.
[Method] Method description (dual-stream + interactive fusion): the claim that the GLIA adapter transfers frozen ViT semantics to diverse authentic distortions with minimal new parameters hinges on unshown details such as whether the backbone remains frozen, the exact architecture and parameter scaling of the fusion block (e.g., cross-attention dimension), and how local distortion cues are integrated without destroying pre-trained features. These specifics are required to substantiate the 'without requiring extensive task-specific data' part of the contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and detailed comments. We address each major point below with clarifications drawn from the manuscript and indicate planned revisions where appropriate to improve clarity and accessibility of the claims.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of 'superior prediction accuracy and robustness' and 'significantly fewer trainable parameters' are load-bearing yet unsupported by any quantitative results, error bars, ablation studies, dataset details, or parameter counts. Without these, the superiority and efficiency assertions cannot be evaluated against the paper's own evidence.

Authors: We acknowledge that the abstract's brevity precludes inclusion of specific numbers. The full manuscript supplies the requested evidence: Section 4 reports SRCC/PLCC results across LIVE, CSIQ, TID2013, KonIQ-10k and LIVE Challenge with direct comparisons to prior BIQA methods; Table 1 quantifies trainable parameters (GLIA uses ~2M versus 86M for full ViT fine-tuning); Section 4.3 contains ablation studies isolating the dual-stream and fusion components; and supplementary material includes error bars from five random seeds. To make these central claims immediately verifiable from the abstract, we will add concise quantitative highlights such as 'outperforming prior adapters by 3-5% SRCC while training under 3% of full fine-tuning parameters'. revision: yes
Referee: [Method] Method description (dual-stream + interactive fusion): the claim that the GLIA adapter transfers frozen ViT semantics to diverse authentic distortions with minimal new parameters hinges on unshown details such as whether the backbone remains frozen, the exact architecture and parameter scaling of the fusion block (e.g., cross-attention dimension), and how local distortion cues are integrated without destroying pre-trained features. These specifics are required to substantiate the 'without requiring extensive task-specific data' part of the contribution.

Authors: These design choices are specified in Section 3. The ViT backbone is frozen throughout training; only the GLIA modules and linear head are updated. The interactive fusion block (Figure 2) employs bidirectional cross-attention with hidden dimension 256 and 4 heads, adding roughly 1.8M parameters as listed in Table 1. Local distortion cues are injected via patch-level feature modulation with residual connections to the original ViT embeddings, preserving pre-trained semantics. This low-parameter regime enables effective adaptation on the modest-sized authentic-distortion datasets used in our experiments. We will insert an explicit 'Design Choices and Hyperparameters' paragraph in Section 3.2 that enumerates the frozen status, attention dimension, and residual integration mechanism. revision: yes

Circularity Check

0 steps flagged

No circularity: GLIA is an empirical architecture proposal validated by experiments

full rationale

The paper proposes the Global-Local Interaction Adapter (GLIA) consisting of dual-stream feature extraction and interactive global-local fusion to adapt pre-trained Vision Transformers for blind image quality assessment. No mathematical derivation chain, equations, or predictions appear in the provided text. Claims of superior accuracy with fewer parameters are presented as outcomes of the new design and are said to be validated on benchmarks, without any reduction to fitted inputs, self-citations of uniqueness theorems, or ansatzes smuggled from prior work. The approach is therefore self-contained as a standard empirical contribution in computer vision.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated or derivable from the text.

pith-pipeline@v0.9.0 · 5704 in / 1086 out tokens · 29937 ms · 2026-05-20T12:44:48.583745+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

dual-stream feature extraction mechanism coupled with interactive global-local fusion... Only the GLIA, projection layers, and regression head are trainable, while the ViT backbone remains frozen
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean J_uniquely_calibrated_via_higher_derivative unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

F′d = up(Fd + λd · MHCA(Fd, Fsd)) ... F′sd = Fsd + λs · MHCA(Fsd, Fd)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · 2 internal anchors

[1]

INTRODUCTION With the rapid development of information technology and mo- bile internet, images have become the primary medium for in- formation dissemination on online platforms. Across various do- mains—including social media, e-commerce, and several other spe- cialized fields user demands for high-quality images are continually increasing, necessitatin...

work page
[2]

We introduce a dual-stream semantic-detail feature extraction method that combines global semantic information with local de- tails, mitigating the loss of perceptual features caused by resolution adaptation

work page
[3]

We design a global-local interaction fusion adapter that en- ables interaction between global information and local detail features in the latent space, unlocking the quality perception capabilities of pre-trained ViT

work page
[4]

Extensive experiments on multiple IQA benchmarks demon- strate that our method significantly outperforms existing approaches with substantially fewer trainable parameters, highlighting its effec- tiveness and generalization capability

work page
[5]

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

THE PROPOSED METHOD 2.1. Overview As shown in Fig. 1, our framework, GLIANet, adopts a dual-stream architecture to preserve both global semantics and local details. The global stream resizes the input and extracts semantic tokens from a frozen ViT encoder, while the local stream samples image patches via grid-based cropping and encodes them with frozen Vi...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[6]

EXPERIMENTS 3.1. Experimental Setting Datasets:Our method is evaluated on eight classical IQA datasets, including four synthetic datasets, LIVE [15], CSIQ [16], TID2013 [17], KADID-10k [18] and four authentic datasets, LIVEC [19], KonIQ- 10k [20], SPAQ [21], and FLIVE [22]. Implementation Details:To ensure fair comparison, we follow the same settings as L...

work page
[7]

Corre- spondingly, the top performance on the largest synthetical datasets KADID-10k confirms the superiority of our methods

, ReIQA [26] and Loda [12] ),our model obtains competitive or higher results, showing the effectiveness of our methods. Corre- spondingly, the top performance on the largest synthetical datasets KADID-10k confirms the superiority of our methods. 3.3. Cross-Dataset Evaluation To further evaluate the generalization capability of GLIANet, we follow the cross...

work page
[8]

CONCLUSION In this work, we present the Global-Local Interaction Adapter, an efficient and effective solution for image quality assessment that fully exploits the knowledge priors of pre-trained ViT. By integrat- ing dual-stream feature extraction and global-local interactive fu- sion, our method overcomes the limitations of existing approaches in preserv...

work page
[9]

A feature-enriched com- pletely blind image quality evaluator,

Lin Zhang, Lei Zhang, and Alan C. Bovik, “A feature-enriched com- pletely blind image quality evaluator,”IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2579–2591, 2015

work page 2015
[10]

No- reference image quality assessment in the spatial domain,

Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik, “No- reference image quality assessment in the spatial domain,”IEEE Trans- actions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012

work page 2012
[11]

Blind image quality assessment using a deep bilinear convolutional neural network,

Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 1, pp. 36–47, 2020

work page 2020
[12]

Blindly assess image quality in the wild guided by a self-adaptive hyper network,

Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang, “Blindly assess image quality in the wild guided by a self-adaptive hyper network,” inCVPR, 2020, pp. 3664–3673

work page 2020
[13]

Topiq: A top-down approach from semantics to distortions for image quality assessment,

Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Topiq: A top-down approach from semantics to distortions for image quality assessment,”IEEE Transactions on Image Processing, vol. 33, pp. 2404–2418, 2024

work page 2024
[14]

Musiq: Multi-scale image quality transformer,

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang, “Musiq: Multi-scale image quality transformer,” inICCV, 2021, pp. 5128–5137

work page 2021
[15]

Transformer for image quality as- sessment,

Junyong You and Jari Korhonen, “Transformer for image quality as- sessment,” inICIP, 2021, pp. 1389–1393

work page 2021
[16]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis- senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[17]

Learning transferable visual models from natural language supervision,

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 8748–8763

work page 2021
[18]

Data-efficient image quality assessment with attention-panel decoder,

Guanyi Qin, Runze Hu, Yutao Liu, Xiawu Zheng, Haotian Liu, Xiu Li, and Yan Zhang, “Data-efficient image quality assessment with attention-panel decoder,” inAAAI, 2023, pp. 2091–2100

work page 2023
[19]

Blind image quality assessment via vision-language correspon- dence: A multi-task learning perspective,

Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma, “Blind image quality assessment via vision-language correspon- dence: A multi-task learning perspective,” inCVPR, 2023, pp. 14071– 14081

work page 2023
[20]

Boosting image quality assessment through efficient transformer adaptation with local feature enhancement,

Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, and Weisi Lin, “Boosting image quality assessment through efficient transformer adaptation with local feature enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2024, pp. 2662–2672

work page 2024
[21]

Fast-vqa: Efficient end-to- end video quality assessment with fragment sampling,

Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Fast-vqa: Efficient end-to- end video quality assessment with fragment sampling,” inEuropean conference on computer vision. Springer, 2022, pp. 538–554

work page 2022
[22]

Mllms know where to look: Training-free perception of small visual details with multimodal llms,

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski, “Mllms know where to look: Training-free perception of small visual details with multimodal llms,” inThe Thirteenth International Conference on Learning Representations

work page
[23]

A statis- tical evaluation of recent full reference image quality assessment algo- rithms,

Hamid R. Sheikh, Muhammad F. Sabir, and Alan C. Bovik, “A statis- tical evaluation of recent full reference image quality assessment algo- rithms,”IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006

work page 2006
[24]

Most apparent distortion: full- reference image quality assessment and the role of strategy,

E. C. Larson and D. M. Chandler, “Most apparent distortion: full- reference image quality assessment and the role of strategy,”Journal of Electronic Imaging, vol. 19, no. 1, pp. 011006, 2010

work page 2010
[25]

Image database tid2013: Pe- culiarities, results and perspectives,

Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, and Jaakko Astola, “Image database tid2013: Pe- culiarities, results and perspectives,”Signal Processing: Image Com- munication, vol. 30, pp. 57–77, 2015

work page 2015
[26]

Kadid-10k: A large- scale artificially distorted iqa database,

Hanhe Lin, Vlad Hosu, and Dietmar Saupe, “Kadid-10k: A large- scale artificially distorted iqa database,” in2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), 2019

work page 2019
[27]

Massive online crowdsourced study of subjective and objective picture quality,

Deepti Ghadiyaram and Alan C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,”IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2015

work page 2015
[28]

Koniq- 10k: An ecologically valid database for deep learning of blind image quality assessment,

Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe, “Koniq- 10k: An ecologically valid database for deep learning of blind image quality assessment,”IEEE Transactions on Image Processing, vol. 29, pp. 4041–4056, 2020

work page 2020
[29]

Perceptual quality assessment of smartphone photography,

Yuming Fang, Hanwei Zhu, Yan Zeng, Kede Ma, and Zhou Wang, “Perceptual quality assessment of smartphone photography,” inCVPR, 2020, pp. 3674–3683

work page 2020
[30]

Patch-vq:’patching up’the video quality problem,

Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik, “Patch-vq:’patching up’the video quality problem,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14019–14029

work page 2021
[31]

Metaiqa: Deep meta-learning for no-reference image quality as- sessment,

Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi, “Metaiqa: Deep meta-learning for no-reference image quality as- sessment,” inCVPR, 2020, pp. 14131–14140

work page 2020
[32]

From patches to pictures (paq-2- piq): Mapping the perceptual space of picture quality,

Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram, and Alan C. Bovik, “From patches to pictures (paq-2- piq): Mapping the perceptual space of picture quality,” inCVPR, 2020, pp. 3575–3585

work page 2020
[33]

A distortion aware image quality assessment model,

Ha Thu Nguyen, Katrien De Moor, Mohamed-Chaker Larabi, and Seyed Ali Amirshahi, “A distortion aware image quality assessment model,” inProceedings of the Winter Conference on Applications of Computer Vision, 2025, pp. 207–216

work page 2025
[34]

Re-iqa: Unsu- pervised learning for image quality assessment in the wild,

Avinab Saha, Sandeep Mishra, and Alan C. Bovik, “Re-iqa: Unsu- pervised learning for image quality assessment in the wild,” inCVPR, 2023, pp. 5846–5855

work page 2023
[35]

Q- mamba: On first exploration of vision mamba for image quality assess- ment,

Fengbin Guan, Xin Li, Zihao Yu, Yiting Lu, and Zhibo Chen, “Q- mamba: On first exploration of vision mamba for image quality assess- ment,”ICML 2025, 2025

work page 2025
[36]

Distilling spatially-heterogeneous distortion per- ception for blind image quality assessment,

Xudong Li, Wenjie Nie, Yan Zhang, Runze Hu, Ke Li, Xiawu Zheng, and Liujuan Cao, “Distilling spatially-heterogeneous distortion per- ception for blind image quality assessment,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2344– 2354

work page 2025
[37]

Group mad competition-a new methodology to compare objective image quality models,

Kede Ma, Qingbo Wu, Zhou Wang, Zhengfang Duanmu, Hongwei Yong, Hongliang Li, and Lei Zhang, “Group mad competition-a new methodology to compare objective image quality models,” inProceed- ings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, 2016, pp. 1664–1673

work page 2016
[38]

Vision transformer adapter for dense predictions,

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao, “Vision transformer adapter for dense predictions,” arXiv preprint arXiv:2205.08534, 2022

work page arXiv 2022
[39]

Lora: Low-rank adaptation of large language models,

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al., “Lora: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations

work page
[40]

Visual prompt tuning,

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Be- longie, Bharath Hariharan, and Ser-Nam Lim, “Visual prompt tuning,” inEuropean Conference on Computer Vision, 2022

work page 2022

[1] [1]

INTRODUCTION With the rapid development of information technology and mo- bile internet, images have become the primary medium for in- formation dissemination on online platforms. Across various do- mains—including social media, e-commerce, and several other spe- cialized fields user demands for high-quality images are continually increasing, necessitatin...

work page

[2] [2]

We introduce a dual-stream semantic-detail feature extraction method that combines global semantic information with local de- tails, mitigating the loss of perceptual features caused by resolution adaptation

work page

[3] [3]

We design a global-local interaction fusion adapter that en- ables interaction between global information and local detail features in the latent space, unlocking the quality perception capabilities of pre-trained ViT

work page

[4] [4]

Extensive experiments on multiple IQA benchmarks demon- strate that our method significantly outperforms existing approaches with substantially fewer trainable parameters, highlighting its effec- tiveness and generalization capability

work page

[5] [5]

Unleashing Vision Transformer Potential In Image Quality Assessment via Global-Local Adaptive Interaction

THE PROPOSED METHOD 2.1. Overview As shown in Fig. 1, our framework, GLIANet, adopts a dual-stream architecture to preserve both global semantics and local details. The global stream resizes the input and extracts semantic tokens from a frozen ViT encoder, while the local stream samples image patches via grid-based cropping and encodes them with frozen Vi...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

EXPERIMENTS 3.1. Experimental Setting Datasets:Our method is evaluated on eight classical IQA datasets, including four synthetic datasets, LIVE [15], CSIQ [16], TID2013 [17], KADID-10k [18] and four authentic datasets, LIVEC [19], KonIQ- 10k [20], SPAQ [21], and FLIVE [22]. Implementation Details:To ensure fair comparison, we follow the same settings as L...

work page

[7] [7]

Corre- spondingly, the top performance on the largest synthetical datasets KADID-10k confirms the superiority of our methods

, ReIQA [26] and Loda [12] ),our model obtains competitive or higher results, showing the effectiveness of our methods. Corre- spondingly, the top performance on the largest synthetical datasets KADID-10k confirms the superiority of our methods. 3.3. Cross-Dataset Evaluation To further evaluate the generalization capability of GLIANet, we follow the cross...

work page

[8] [8]

CONCLUSION In this work, we present the Global-Local Interaction Adapter, an efficient and effective solution for image quality assessment that fully exploits the knowledge priors of pre-trained ViT. By integrat- ing dual-stream feature extraction and global-local interactive fu- sion, our method overcomes the limitations of existing approaches in preserv...

work page

[9] [9]

A feature-enriched com- pletely blind image quality evaluator,

Lin Zhang, Lei Zhang, and Alan C. Bovik, “A feature-enriched com- pletely blind image quality evaluator,”IEEE Transactions on Image Processing, vol. 24, no. 8, pp. 2579–2591, 2015

work page 2015

[10] [10]

No- reference image quality assessment in the spatial domain,

Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik, “No- reference image quality assessment in the spatial domain,”IEEE Trans- actions on Image Processing, vol. 21, no. 12, pp. 4695–4708, 2012

work page 2012

[11] [11]

Blind image quality assessment using a deep bilinear convolutional neural network,

Weixia Zhang, Kede Ma, Jia Yan, Dexiang Deng, and Zhou Wang, “Blind image quality assessment using a deep bilinear convolutional neural network,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 1, pp. 36–47, 2020

work page 2020

[12] [12]

Blindly assess image quality in the wild guided by a self-adaptive hyper network,

Shaolin Su, Qingsen Yan, Yu Zhu, Cheng Zhang, Xin Ge, Jinqiu Sun, and Yanning Zhang, “Blindly assess image quality in the wild guided by a self-adaptive hyper network,” inCVPR, 2020, pp. 3664–3673

work page 2020

[13] [13]

Topiq: A top-down approach from semantics to distortions for image quality assessment,

Chaofeng Chen, Jiadi Mo, Jingwen Hou, Haoning Wu, Liang Liao, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Topiq: A top-down approach from semantics to distortions for image quality assessment,”IEEE Transactions on Image Processing, vol. 33, pp. 2404–2418, 2024

work page 2024

[14] [14]

Musiq: Multi-scale image quality transformer,

Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang, “Musiq: Multi-scale image quality transformer,” inICCV, 2021, pp. 5128–5137

work page 2021

[15] [15]

Transformer for image quality as- sessment,

Junyong You and Jari Korhonen, “Transformer for image quality as- sessment,” inICIP, 2021, pp. 1389–1393

work page 2021

[16] [16]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weis- senborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al., “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010

[17] [17]

Learning transferable visual models from natural language supervision,

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PMLR, 2021, pp. 8748–8763

work page 2021

[18] [18]

Data-efficient image quality assessment with attention-panel decoder,

Guanyi Qin, Runze Hu, Yutao Liu, Xiawu Zheng, Haotian Liu, Xiu Li, and Yan Zhang, “Data-efficient image quality assessment with attention-panel decoder,” inAAAI, 2023, pp. 2091–2100

work page 2023

[19] [19]

Blind image quality assessment via vision-language correspon- dence: A multi-task learning perspective,

Weixia Zhang, Guangtao Zhai, Ying Wei, Xiaokang Yang, and Kede Ma, “Blind image quality assessment via vision-language correspon- dence: A multi-task learning perspective,” inCVPR, 2023, pp. 14071– 14081

work page 2023

[20] [20]

Boosting image quality assessment through efficient transformer adaptation with local feature enhancement,

Kangmin Xu, Liang Liao, Jing Xiao, Chaofeng Chen, Haoning Wu, Qiong Yan, and Weisi Lin, “Boosting image quality assessment through efficient transformer adaptation with local feature enhancement,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2024, pp. 2662–2672

work page 2024

[21] [21]

Fast-vqa: Efficient end-to- end video quality assessment with fragment sampling,

Haoning Wu, Chaofeng Chen, Jingwen Hou, Liang Liao, Annan Wang, Wenxiu Sun, Qiong Yan, and Weisi Lin, “Fast-vqa: Efficient end-to- end video quality assessment with fragment sampling,” inEuropean conference on computer vision. Springer, 2022, pp. 538–554

work page 2022

[22] [22]

Mllms know where to look: Training-free perception of small visual details with multimodal llms,

Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski, “Mllms know where to look: Training-free perception of small visual details with multimodal llms,” inThe Thirteenth International Conference on Learning Representations

work page

[23] [23]

A statis- tical evaluation of recent full reference image quality assessment algo- rithms,

Hamid R. Sheikh, Muhammad F. Sabir, and Alan C. Bovik, “A statis- tical evaluation of recent full reference image quality assessment algo- rithms,”IEEE Transactions on Image Processing, vol. 15, no. 11, pp. 3440–3451, 2006

work page 2006

[24] [24]

Most apparent distortion: full- reference image quality assessment and the role of strategy,

E. C. Larson and D. M. Chandler, “Most apparent distortion: full- reference image quality assessment and the role of strategy,”Journal of Electronic Imaging, vol. 19, no. 1, pp. 011006, 2010

work page 2010

[25] [25]

Image database tid2013: Pe- culiarities, results and perspectives,

Nikolay Ponomarenko, Lina Jin, Oleg Ieremeiev, Vladimir Lukin, Karen Egiazarian, and Jaakko Astola, “Image database tid2013: Pe- culiarities, results and perspectives,”Signal Processing: Image Com- munication, vol. 30, pp. 57–77, 2015

work page 2015

[26] [26]

Kadid-10k: A large- scale artificially distorted iqa database,

Hanhe Lin, Vlad Hosu, and Dietmar Saupe, “Kadid-10k: A large- scale artificially distorted iqa database,” in2019 Eleventh International Conference on Quality of Multimedia Experience (QoMEX), 2019

work page 2019

[27] [27]

Massive online crowdsourced study of subjective and objective picture quality,

Deepti Ghadiyaram and Alan C. Bovik, “Massive online crowdsourced study of subjective and objective picture quality,”IEEE Transactions on Image Processing, vol. 25, no. 1, pp. 372–387, 2015

work page 2015

[28] [28]

Koniq- 10k: An ecologically valid database for deep learning of blind image quality assessment,

Vlad Hosu, Hanhe Lin, Tamas Sziranyi, and Dietmar Saupe, “Koniq- 10k: An ecologically valid database for deep learning of blind image quality assessment,”IEEE Transactions on Image Processing, vol. 29, pp. 4041–4056, 2020

work page 2020

[29] [29]

Perceptual quality assessment of smartphone photography,

Yuming Fang, Hanwei Zhu, Yan Zeng, Kede Ma, and Zhou Wang, “Perceptual quality assessment of smartphone photography,” inCVPR, 2020, pp. 3674–3683

work page 2020

[30] [30]

Patch-vq:’patching up’the video quality problem,

Zhenqiang Ying, Maniratnam Mandal, Deepti Ghadiyaram, and Alan Bovik, “Patch-vq:’patching up’the video quality problem,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14019–14029

work page 2021

[31] [31]

Metaiqa: Deep meta-learning for no-reference image quality as- sessment,

Hancheng Zhu, Leida Li, Jinjian Wu, Weisheng Dong, and Guangming Shi, “Metaiqa: Deep meta-learning for no-reference image quality as- sessment,” inCVPR, 2020, pp. 14131–14140

work page 2020

[32] [32]

From patches to pictures (paq-2- piq): Mapping the perceptual space of picture quality,

Zhenqiang Ying, Haoran Niu, Praful Gupta, Dhruv Mahajan, Deepti Ghadiyaram, and Alan C. Bovik, “From patches to pictures (paq-2- piq): Mapping the perceptual space of picture quality,” inCVPR, 2020, pp. 3575–3585

work page 2020

[33] [33]

A distortion aware image quality assessment model,

Ha Thu Nguyen, Katrien De Moor, Mohamed-Chaker Larabi, and Seyed Ali Amirshahi, “A distortion aware image quality assessment model,” inProceedings of the Winter Conference on Applications of Computer Vision, 2025, pp. 207–216

work page 2025

[34] [34]

Re-iqa: Unsu- pervised learning for image quality assessment in the wild,

Avinab Saha, Sandeep Mishra, and Alan C. Bovik, “Re-iqa: Unsu- pervised learning for image quality assessment in the wild,” inCVPR, 2023, pp. 5846–5855

work page 2023

[35] [35]

Q- mamba: On first exploration of vision mamba for image quality assess- ment,

Fengbin Guan, Xin Li, Zihao Yu, Yiting Lu, and Zhibo Chen, “Q- mamba: On first exploration of vision mamba for image quality assess- ment,”ICML 2025, 2025

work page 2025

[36] [36]

Distilling spatially-heterogeneous distortion per- ception for blind image quality assessment,

Xudong Li, Wenjie Nie, Yan Zhang, Runze Hu, Ke Li, Xiawu Zheng, and Liujuan Cao, “Distilling spatially-heterogeneous distortion per- ception for blind image quality assessment,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 2344– 2354

work page 2025

[37] [37]

Group mad competition-a new methodology to compare objective image quality models,

Kede Ma, Qingbo Wu, Zhou Wang, Zhengfang Duanmu, Hongwei Yong, Hongliang Li, and Lei Zhang, “Group mad competition-a new methodology to compare objective image quality models,” inProceed- ings of the IEEE Conference on Computer Vision and Pattern Recogni- tion, 2016, pp. 1664–1673

work page 2016

[38] [38]

Vision transformer adapter for dense predictions,

Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao, “Vision transformer adapter for dense predictions,” arXiv preprint arXiv:2205.08534, 2022

work page arXiv 2022

[39] [39]

Lora: Low-rank adaptation of large language models,

Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al., “Lora: Low-rank adaptation of large language models,” inInternational Conference on Learning Representations

work page

[40] [40]

Visual prompt tuning,

Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Be- longie, Bharath Hariharan, and Ser-Nam Lim, “Visual prompt tuning,” inEuropean Conference on Computer Vision, 2022

work page 2022