Employing Vision-Language Models for Face Image Quality Assessment

Erdi Sar{\i}ta\c{s}; Eren Onaran; Haz{\i}m Kemal Ekenel; Vitomir \v{S}truc

arxiv: 2605.17489 · v1 · pith:QH7HSXABnew · submitted 2026-05-17 · 💻 cs.CV

Employing Vision-Language Models for Face Image Quality Assessment

Erdi Sar{\i}ta\c{s} , Eren Onaran , Vitomir \v{S}truc , Haz{\i}m Kemal Ekenel This is my paper

Pith reviewed 2026-05-20 13:57 UTC · model grok-4.3

classification 💻 cs.CV

keywords face image quality assessmentvision-language modelszero-shot evaluationbiometricsinterpretabilityprompt robustnesssynthetic data ablation

0 comments

The pith

Vision-language models can estimate face image quality zero-shot and align with traditional biometric scores while offering potential explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether off-the-shelf vision-language models can assess the quality of face images without any additional training or fine-tuning. Traditional face image quality methods produce useful scores but act as black boxes that give no reasons for their decisions, which restricts their use in settings where humans need to understand or act on the output. The authors run these models on surveillance, controlled, and synthetic face datasets, comparing the resulting scores to established methods through error-versus-reject analysis and checking how stable the scores remain when prompts change. They find that architecture choice matters more than model size for biometric utility, that scores usually track traditional ones, and that larger models gain consistency yet lose some ability to flag degradations. If correct, this opens a path to add readable justifications to biometric pipelines where human review occurs.

Core claim

Off-the-shelf vision-language models prompted in zero-shot fashion generate face image quality scores whose biometric utility largely matches that of conventional FIQA methods across surveillance, controlled, and synthetic datasets, with the added property that their outputs can be inspected for human-readable reasons.

What carries the argument

Zero-shot prompting of vision-language models to produce scalar quality scores for face images, benchmarked via error-versus-reject curves and prompt-sensitivity tests on diverse datasets.

If this is right

Biometric utility of the VLM approach depends more on model architecture than on total parameter count.
Most tested VLMs produce scores that align with those from established FIQA methods on the chosen datasets.
Both the ranking order and the numeric scores returned by VLMs shift when the text prompt is altered.
Increasing parameter count improves internal score consistency yet reduces performance at detecting image degradations compared with smaller models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Such VLM outputs could be added to existing biometric pipelines to supply short natural-language justifications for quality rejections during human review.
The same prompting strategy might transfer to quality assessment of other image types if the alignment pattern observed here generalizes beyond faces.
Deployment trials that measure end-to-end system accuracy and operator decision time would show whether the interpretability gain justifies any small loss in raw utility.

Load-bearing premise

That close agreement between VLM scores and traditional FIQA methods on the tested datasets means the VLM outputs truly reflect biometric usefulness rather than merely echoing chosen prompts or dataset patterns.

What would settle it

A controlled test on new degraded face images where VLM quality scores do not predict drops in downstream face recognition accuracy as well as traditional FIQA scores do.

Figures

Figures reproduced from arXiv: 2605.17489 by Erdi Sar{\i}ta\c{s}, Eren Onaran, Haz{\i}m Kemal Ekenel, Vitomir \v{S}truc.

**Figure 1.** Figure 1: VLMs for Quality Assessment. While traditional FIQA methods (top) function as opaque ”black boxes” outputting only scalar scores, VLM-driven approaches (bottom) offer transparency by providing both biometric utility scores and actionable semantic justifications. ∗VLM prompt is generated using QWEN2.5-32B. they produce a single scalar score without providing interpretable explanations. This lack of transp… view at source ↗

**Figure 2.** Figure 2: Error-versus-Reject (EvR) curves on LFW. [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Score sensitivity to physical distance in surveillance [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Explainability analysis on SCFace (QWEN2.5-32B). The plot shows the distribution of generated attribute labels [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Prompt Ablation Study: Comparison of score distribu [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Internal Consistency (QWEN2.5-32B). Boxplots of scalar quality scores grouped by the model’s generated text labels. [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗

**Figure 7.** Figure 7: Sample output of VLMs from all four datasets. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

read the original abstract

Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as "black boxes." They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyzed their interpretability, consistency, and robustness to prompt changes. Our results show biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs' outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The codes are available at https://github.com/ThEnded32/VLM4FIQA.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper explores the application of off-the-shelf Vision-Language Models (VLMs) for zero-shot Face Image Quality Assessment (FIQA) to address the lack of interpretability in traditional methods. It describes a comprehensive evaluation framework that includes benchmarking traditional FIQA methods using error-versus-reject curves and analyzing VLMs for their alignment with traditional scores, consistency, prompt robustness, and interpretability on various datasets including surveillance and synthetic ones. Findings highlight that performance depends on model architecture rather than parameter count, with most VLMs aligning with traditional methods but showing sensitivity to prompt changes. A synthetic ablation indicates larger models enhance consistency but perform worse in degradation detection than smaller models. The conclusion is that VLMs offer promise as an interpretability complement to conventional FIQA pipelines, supported by publicly available code.

Significance. Should the results hold, this work could significantly improve transparency in biometric quality assessment, facilitating better human oversight in applications like border control. The emphasis on architecture dependence and prompt sensitivity provides useful insights for future VLM use in biometrics. The availability of code enhances reproducibility and allows for further validation. However, the significance is tempered by the indirect nature of the utility assessment.

major comments (3)

While traditional FIQA methods are evaluated using error-versus-reject curves to demonstrate biometric utility, the VLM assessment is restricted to alignment with traditional methods, interpretability, and synthetic ablations without equivalent direct utility testing via recognition performance curves. This gap is load-bearing for the claim that VLMs can effectively complement FIQA pipelines, as alignment may not guarantee equivalent error reduction in downstream tasks.
The finding that increasing parameter count improves internal consistency but leads to worse degradation-detection performance than smaller models needs clarification on the exact models, datasets, and quantitative metrics used for 'degradation-detection performance' to substantiate the architecture-dependence over parameter count.
The observation that VLM ranking performance and scores vary across prompts should be supported by specific quantitative results, such as correlation values or ranking agreement metrics across the tested datasets, to better assess the robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we intend to make.

read point-by-point responses

Referee: While traditional FIQA methods are evaluated using error-versus-reject curves to demonstrate biometric utility, the VLM assessment is restricted to alignment with traditional methods, interpretability, and synthetic ablations without equivalent direct utility testing via recognition performance curves. This gap is load-bearing for the claim that VLMs can effectively complement FIQA pipelines, as alignment may not guarantee equivalent error reduction in downstream tasks.

Authors: We agree that direct biometric utility evaluation strengthens the complementarity claim. Our primary emphasis was on interpretability and alignment as a complement to existing pipelines, but we acknowledge that alignment alone is an indirect proxy. In the revised manuscript we will add error-versus-reject curve analyses for selected VLM configurations using the same recognition back-ends employed for the traditional methods, thereby providing a more direct utility comparison. revision: yes
Referee: The finding that increasing parameter count improves internal consistency but leads to worse degradation-detection performance than smaller models needs clarification on the exact models, datasets, and quantitative metrics used for 'degradation-detection performance' to substantiate the architecture-dependence over parameter count.

Authors: We thank the referee for noting the need for greater specificity. The ablation compared CLIP variants of differing sizes together with BLIP and LLaVA models on the synthetic degradation dataset. Degradation-detection performance was quantified via precision-recall metrics and correlation with ground-truth degradation severity labels. We will expand the relevant experimental section with explicit model identifiers, dataset composition details, and the precise quantitative metrics to clarify the architecture-versus-size distinction. revision: yes
Referee: The observation that VLM ranking performance and scores vary across prompts should be supported by specific quantitative results, such as correlation values or ranking agreement metrics across the tested datasets, to better assess the robustness.

Authors: We concur that quantitative backing improves the assessment of prompt sensitivity. In the revision we will report Spearman rank correlation coefficients and Kendall tau agreement values that measure ranking and score stability across the prompt variants on each evaluated dataset, thereby providing concrete evidence of the observed variation. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with direct dataset comparisons

full rationale

This paper is an empirical evaluation that benchmarks traditional FIQA methods via error-versus-reject curves and measures VLM zero-shot outputs for alignment, interpretability, consistency, and prompt robustness across external datasets (surveillance and synthetic). No mathematical derivations, parameter fits renamed as predictions, or self-citation chains appear in the load-bearing claims. All reported findings rest on direct experimental comparisons to independent data and methods rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical evaluation that relies on pre-trained VLMs and public datasets without introducing new mathematical parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5817 in / 1227 out tokens · 65637 ms · 2026-05-20T13:57:08.582119+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves... synthetic ablation study
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our results show biometric utility performance depends significantly on architecture, not merely on parameter count.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · 4 internal anchors

[1]

Phi-4 Technical Report

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[2]

Atzori, F

A. Atzori, F. Boutros, and N. Damer. ViT-FIQA: Assessing face image quality using vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops, volume 1, page 3, 2025

work page 2025
[3]

Babnik, P

v. Babnik, P. Peer, and V . ˇStruc. eDifFIQA: Towards efficient face image quality assessment based on denoising diffusion probabilistic models.IEEE Transactions on Biometrics, Behavior, and Identity Science, 6(4):458–474, 2024

work page 2024
[4]

Babnik, P

ˇZ. Babnik, P. Peer, and V . ˇStruc. FaceQAN: Face image quality assessment through adversarial noise exploration. In2022 26th International Conference on Pattern Recognition (ICPR), pages 748–

work page
[5]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, Y . Xu, and J. Lin. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Chaubey, X

A. Chaubey, X. Guan, and M. Soleymani. Face-LLaV A: Facial expression and attribute understanding through instruction tuning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2648–2660, 2026

work page 2026
[7]

T. Chen, J. Zhang, et al. MGFFD-VLM: Multi-granularity prompt learning for face forgery detection with VLM.arXiv:2507.12232, 2025

work page arXiv 2025
[8]

W.-T. Chen, G. Krishnan, Q. Gao, S.-Y . Kuo, S. Ma, and J. Wang. DSL-FIQA: Assessing facial image quality via dual-set degradation learning and landmark-guided transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2931–2941, 2024

work page 2024
[9]

J. Dan, Y . Liu, H. Xie, J. Deng, H. Xie, X. Xie, and B. Sun. TransFace: Calibrating transformer training for face recognition from a data- centric perspective. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20642–20653, 2023

work page 2023
[10]

J. Deng, J. Guo, N. Xue, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690– 4699, 2019

work page 2019
[11]

Y . Gao, X. Min, J. Han, Y . Cao, S. Wu, Y . Dou, and G. Zhai. Multi-dimensional text-to-face image quality assessment using LLM: Database and method. InProceedings of the 33rd ACM International Conference on Multimedia, pages 6948–6957, 2025

work page 2025
[12]

Grgic, K

M. Grgic, K. Delac, and S. Grgic. SCface – surveillance cameras face database.Multimedia tools and applications, 51(3):863–879, 2011

work page 2011
[13]

G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. InWorkshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition (ECCV Workshop), 2008

work page 2008
[14]

B. Jo, D. Cho, I. K. Park, and S. Hong. IFQA: Interpretable face quality assessment. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3444–3453, 2023

work page 2023
[15]

Kabbani, K

W. Kabbani, K. Raja, R. Ramachandra, and C. Busch. FaceOracle: Chat with a face image oracle. InEuropean Conference on Computer Vision Workshops, pages 210–226. Springer, 2024

work page 2024
[16]

Karras, T

T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. InInternational Conference on Learning Representations, 2018

work page 2018
[17]

M. Kim, A. K. Jain, and X. Liu. AdaFace: Quality adaptive margin for face recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18750–18759, 2022

work page 2022
[18]

Laurenc ¸on, L

H. Laurenc ¸on, L. Saulnier, L. Tronchon, S. Bekman, A. Singh, A. Lozhkov, T. Wang, S. Karamcheti, A. Rush, D. Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image- text documents.Advances in Neural Information Processing Systems, 36:71683–71702, 2023

work page 2023
[19]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel, et al. Retrieval- augmented generation for knowledge-intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

work page 2020
[20]

K. Li, Z. Yang, J. Zhao, H. Shen, R. Hou, H. Chang, Y . Yu, and X. Chen. HERM: Benchmarking and enhancing multimodal LLMs for human-centric understanding.arXiv preprint arXiv:2410.06777, 2024

work page arXiv 2024
[21]

Lin, Y .-W

K.-H. Lin, Y .-W. Tseng, K.-Y . Huang, J.-C. Wu, and W.-H. Cheng. InstructFLIP: Exploring unified vision-language model for face anti- spoofing. InProceedings of the 33rd ACM International Conference on Multimedia, pages 2987–2996, 2025

work page 2025
[22]

Ma, W.-T

S. Ma, W.-T. Chen, Q. Gao, J. Wang, C. W. Zhou, W. Sun, W. Zhang, L. Cao, J. Jia, X. Zhu, et al. VQualA 2025 challenge on face image quality assessment: Methods and results. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3448–3457, 2025

work page 2025
[23]

T. Miyata. ZEN-IQA: Zero-shot explainable and no-reference im- age quality assessment with vision language model.IEEE Access, 12:70973–70983, 2024

work page 2024
[24]

Najafzadeh, H

N. Najafzadeh, H. Kashiani, M. S. E. Saadabadi, N. A. Talemi, S. R. Malakshan, and N. M. Nasrabadi. Face image quality vector assess- ment for biometrics applications. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 511– 520, 2023

work page 2023
[25]

F.-Z. Ou, X. Chen, R. Zhang, Y . Huang, S. Li, J. Li, Y . Li, L. Cao, and Y .-G. Wang. SDD-FIQA: Unsupervised face image quality assessment with similarity distribution distance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7670– 7679, 2021

work page 2021
[26]

F.-Z. Ou, C. Li, S. Wang, and S. Kwong. CLIB-FIQA: Face image quality assessment with confidence calibration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1694–1704, 2024

work page 2024
[27]

F.-Z. Ou, C. Li, S. Wang, and S. Kwong. MR-FIQA: Face image quality assessment with multi-reference representations from synthetic data generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12915–12925, 2025

work page 2025
[28]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021

work page 2021
[29]

Saritas ¸ and H

E. Saritas ¸ and H. K. Ekenel. Analyzing the effect of combined degradations on face recognition. In2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG) Work- shops, pages 1–5. IEEE, 2024

work page 2024
[30]

Schlett, C

T. Schlett, C. Rathgeb, O. Henniger, J. Galbally, J. Fierrez, and C. Busch. Face image quality assessment: A literature survey.ACM Computing Surveys, 54(10s):1–49, 2022

work page 2022
[31]

H. O. Shahreza and S. Marcel. FaceLLM: A multimodal large language model for face understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3677– 3687, 2025

work page 2025
[32]

G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ram ´e, M. Rivi `ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Terhorst, J

P. Terhorst, J. N. Kolf, N. Damer, F. Kirchbuchner, and A. Kuijper. SER-FIQ: Unsupervised estimation of face image quality based on stochastic embedding robustness. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5651– 5660, 2020

work page 2020
[34]

H. Wang, Y . Shi, Z. Tao, Y . Gao, L. Zhang, X. Lin, J. Feng, X. Yuan, Z. Yu, and X. Cao. FaceShield: Explainable face anti-spoofing with multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9811–9819, 2026

work page 2026
[35]

J. Wang, K. C. Chan, and C. C. Loy. Exploring CLIP for assessing the look and feel of images. InProceedings of the AAAI conference on Artificial Intelligence, volume 37, pages 2555–2563, 2023

work page 2023
[36]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Whitelam, E

C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen, et al. IARPA Janus Benchmark-B face dataset. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 90–98, 2017

work page 2017
[38]

H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y . Gao, A. Wang, E. Zhang, W. Sun, et al. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. InProceedings of the 41st International Conference on Machine Learning, pages 54015– 54029, 2024

work page 2024
[39]

H. Wu, H. Zhu, Z. Zhang, E. Zhang, C. Chen, L. Liao, C. Li, A. Wang, W. Sun, Q. Yan, et al. Towards open-ended visual quality comparison. InEuropean Conference on Computer Vision, pages 360–

work page
[40]

S. Wu, Y . Li, Z. Xu, Y . Gao, H. Duan, W. Sun, and G. Zhai. FVQ- 20K: A large-scale dataset and an LMM-based method for face video quality assessment. InProceedings of the 33rd ACM International Conference on Multimedia, pages 6928–6937, 2025

work page 2025
[41]

J. You, S. Li, Y . Sun, J. Wei, M. Guo, C. Feng, and J. Ran. LVFace: Progressive cluster optimization for large vision models in face recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11840–11849, 2025

work page 2025
[42]

Z. You, J. Gu, X. Cai, Z. Li, K. Zhu, C. Dong, and T. Xue. Enhancing descriptive image quality assessment with a large-scale multi-modal dataset.IEEE Transactions on Image Processing, 34:8201–8215, 2025

work page 2025
[43]

Zhang, Z

K. Zhang, Z. Zhang, Z. Li, and Y . Qiao. Joint face detection and alignment using multitask cascaded convolutional networks.IEEE Signal Processing Letters, 23(10):1499–1503, 2016

work page 2016
[44]

Zhang, G

W. Zhang, G. Zhai, Y . Wei, X. Yang, and K. Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023

work page 2023
[45]

Zheng, J

Q. Zheng, J. Zhang, J. Gockel, M. B. Wakin, C. Brice, and X. Zhang. QA-VLM: Providing human-interpretable quality assessment for wire- feed laser additive manufacturing parts with vision language models. Journal of Manufacturing Processes, 160:611–623, 2026

work page 2026

[1] [1]

Phi-4 Technical Report

M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[2] [2]

Atzori, F

A. Atzori, F. Boutros, and N. Damer. ViT-FIQA: Assessing face image quality using vision transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops, volume 1, page 3, 2025

work page 2025

[3] [3]

Babnik, P

v. Babnik, P. Peer, and V . ˇStruc. eDifFIQA: Towards efficient face image quality assessment based on denoising diffusion probabilistic models.IEEE Transactions on Biometrics, Behavior, and Identity Science, 6(4):458–474, 2024

work page 2024

[4] [4]

Babnik, P

ˇZ. Babnik, P. Peer, and V . ˇStruc. FaceQAN: Face image quality assessment through adversarial noise exploration. In2022 26th International Conference on Pattern Recognition (ICPR), pages 748–

work page

[5] [5]

S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, Y . Xu, and J. Lin. Qwen2.5-VL technical report.arXiv preprint arXiv:2502.13923, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Chaubey, X

A. Chaubey, X. Guan, and M. Soleymani. Face-LLaV A: Facial expression and attribute understanding through instruction tuning. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2648–2660, 2026

work page 2026

[7] [7]

T. Chen, J. Zhang, et al. MGFFD-VLM: Multi-granularity prompt learning for face forgery detection with VLM.arXiv:2507.12232, 2025

work page arXiv 2025

[8] [8]

W.-T. Chen, G. Krishnan, Q. Gao, S.-Y . Kuo, S. Ma, and J. Wang. DSL-FIQA: Assessing facial image quality via dual-set degradation learning and landmark-guided transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2931–2941, 2024

work page 2024

[9] [9]

J. Dan, Y . Liu, H. Xie, J. Deng, H. Xie, X. Xie, and B. Sun. TransFace: Calibrating transformer training for face recognition from a data- centric perspective. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20642–20653, 2023

work page 2023

[10] [10]

J. Deng, J. Guo, N. Xue, and S. Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690– 4699, 2019

work page 2019

[11] [11]

Y . Gao, X. Min, J. Han, Y . Cao, S. Wu, Y . Dou, and G. Zhai. Multi-dimensional text-to-face image quality assessment using LLM: Database and method. InProceedings of the 33rd ACM International Conference on Multimedia, pages 6948–6957, 2025

work page 2025

[12] [12]

Grgic, K

M. Grgic, K. Delac, and S. Grgic. SCface – surveillance cameras face database.Multimedia tools and applications, 51(3):863–879, 2011

work page 2011

[13] [13]

G. B. Huang, M. Mattar, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. InWorkshop on Faces in ’Real-Life’ Images: Detection, Alignment, and Recognition (ECCV Workshop), 2008

work page 2008

[14] [14]

B. Jo, D. Cho, I. K. Park, and S. Hong. IFQA: Interpretable face quality assessment. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3444–3453, 2023

work page 2023

[15] [15]

Kabbani, K

W. Kabbani, K. Raja, R. Ramachandra, and C. Busch. FaceOracle: Chat with a face image oracle. InEuropean Conference on Computer Vision Workshops, pages 210–226. Springer, 2024

work page 2024

[16] [16]

Karras, T

T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. InInternational Conference on Learning Representations, 2018

work page 2018

[17] [17]

M. Kim, A. K. Jain, and X. Liu. AdaFace: Quality adaptive margin for face recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18750–18759, 2022

work page 2022

[18] [18]

Laurenc ¸on, L

H. Laurenc ¸on, L. Saulnier, L. Tronchon, S. Bekman, A. Singh, A. Lozhkov, T. Wang, S. Karamcheti, A. Rush, D. Kiela, et al. OBELICS: An open web-scale filtered dataset of interleaved image- text documents.Advances in Neural Information Processing Systems, 36:71683–71702, 2023

work page 2023

[19] [19]

Lewis, E

P. Lewis, E. Perez, A. Piktus, F. Petroni, V . Karpukhin, N. Goyal, H. K ¨uttler, M. Lewis, W.-t. Yih, T. Rockt ¨aschel, et al. Retrieval- augmented generation for knowledge-intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

work page 2020

[20] [20]

K. Li, Z. Yang, J. Zhao, H. Shen, R. Hou, H. Chang, Y . Yu, and X. Chen. HERM: Benchmarking and enhancing multimodal LLMs for human-centric understanding.arXiv preprint arXiv:2410.06777, 2024

work page arXiv 2024

[21] [21]

Lin, Y .-W

K.-H. Lin, Y .-W. Tseng, K.-Y . Huang, J.-C. Wu, and W.-H. Cheng. InstructFLIP: Exploring unified vision-language model for face anti- spoofing. InProceedings of the 33rd ACM International Conference on Multimedia, pages 2987–2996, 2025

work page 2025

[22] [22]

Ma, W.-T

S. Ma, W.-T. Chen, Q. Gao, J. Wang, C. W. Zhou, W. Sun, W. Zhang, L. Cao, J. Jia, X. Zhu, et al. VQualA 2025 challenge on face image quality assessment: Methods and results. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3448–3457, 2025

work page 2025

[23] [23]

T. Miyata. ZEN-IQA: Zero-shot explainable and no-reference im- age quality assessment with vision language model.IEEE Access, 12:70973–70983, 2024

work page 2024

[24] [24]

Najafzadeh, H

N. Najafzadeh, H. Kashiani, M. S. E. Saadabadi, N. A. Talemi, S. R. Malakshan, and N. M. Nasrabadi. Face image quality vector assess- ment for biometrics applications. InProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 511– 520, 2023

work page 2023

[25] [25]

F.-Z. Ou, X. Chen, R. Zhang, Y . Huang, S. Li, J. Li, Y . Li, L. Cao, and Y .-G. Wang. SDD-FIQA: Unsupervised face image quality assessment with similarity distribution distance. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7670– 7679, 2021

work page 2021

[26] [26]

F.-Z. Ou, C. Li, S. Wang, and S. Kwong. CLIB-FIQA: Face image quality assessment with confidence calibration. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1694–1704, 2024

work page 2024

[27] [27]

F.-Z. Ou, C. Li, S. Wang, and S. Kwong. MR-FIQA: Face image quality assessment with multi-reference representations from synthetic data generation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 12915–12925, 2025

work page 2025

[28] [28]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning, pages 8748–8763. PmLR, 2021

work page 2021

[29] [29]

Saritas ¸ and H

E. Saritas ¸ and H. K. Ekenel. Analyzing the effect of combined degradations on face recognition. In2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG) Work- shops, pages 1–5. IEEE, 2024

work page 2024

[30] [30]

Schlett, C

T. Schlett, C. Rathgeb, O. Henniger, J. Galbally, J. Fierrez, and C. Busch. Face image quality assessment: A literature survey.ACM Computing Surveys, 54(10s):1–49, 2022

work page 2022

[31] [31]

H. O. Shahreza and S. Marcel. FaceLLM: A multimodal large language model for face understanding. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 3677– 3687, 2025

work page 2025

[32] [32]

G. Team, A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ram ´e, M. Rivi `ere, et al. Gemma 3 technical report.arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[33] [33]

Terhorst, J

P. Terhorst, J. N. Kolf, N. Damer, F. Kirchbuchner, and A. Kuijper. SER-FIQ: Unsupervised estimation of face image quality based on stochastic embedding robustness. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5651– 5660, 2020

work page 2020

[34] [34]

H. Wang, Y . Shi, Z. Tao, Y . Gao, L. Zhang, X. Lin, J. Feng, X. Yuan, Z. Yu, and X. Cao. FaceShield: Explainable face anti-spoofing with multimodal large language models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 9811–9819, 2026

work page 2026

[35] [35]

J. Wang, K. C. Chan, and C. C. Loy. Exploring CLIP for assessing the look and feel of images. InProceedings of the AAAI conference on Artificial Intelligence, volume 37, pages 2555–2563, 2023

work page 2023

[36] [36]

P. Wang, S. Bai, S. Tan, S. Wang, Z. Fan, J. Bai, K. Chen, X. Liu, J. Wang, W. Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Whitelam, E

C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen, et al. IARPA Janus Benchmark-B face dataset. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 90–98, 2017

work page 2017

[38] [38]

H. Wu, Z. Zhang, W. Zhang, C. Chen, L. Liao, C. Li, Y . Gao, A. Wang, E. Zhang, W. Sun, et al. Q-Align: Teaching LMMs for visual scoring via discrete text-defined levels. InProceedings of the 41st International Conference on Machine Learning, pages 54015– 54029, 2024

work page 2024

[39] [39]

H. Wu, H. Zhu, Z. Zhang, E. Zhang, C. Chen, L. Liao, C. Li, A. Wang, W. Sun, Q. Yan, et al. Towards open-ended visual quality comparison. InEuropean Conference on Computer Vision, pages 360–

work page

[40] [40]

S. Wu, Y . Li, Z. Xu, Y . Gao, H. Duan, W. Sun, and G. Zhai. FVQ- 20K: A large-scale dataset and an LMM-based method for face video quality assessment. InProceedings of the 33rd ACM International Conference on Multimedia, pages 6928–6937, 2025

work page 2025

[41] [41]

J. You, S. Li, Y . Sun, J. Wei, M. Guo, C. Feng, and J. Ran. LVFace: Progressive cluster optimization for large vision models in face recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 11840–11849, 2025

work page 2025

[42] [42]

Z. You, J. Gu, X. Cai, Z. Li, K. Zhu, C. Dong, and T. Xue. Enhancing descriptive image quality assessment with a large-scale multi-modal dataset.IEEE Transactions on Image Processing, 34:8201–8215, 2025

work page 2025

[43] [43]

Zhang, Z

K. Zhang, Z. Zhang, Z. Li, and Y . Qiao. Joint face detection and alignment using multitask cascaded convolutional networks.IEEE Signal Processing Letters, 23(10):1499–1503, 2016

work page 2016

[44] [44]

Zhang, G

W. Zhang, G. Zhai, Y . Wei, X. Yang, and K. Ma. Blind image quality assessment via vision-language correspondence: A multitask learning perspective. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14071–14081, 2023

work page 2023

[45] [45]

Zheng, J

Q. Zheng, J. Zhang, J. Gockel, M. B. Wakin, C. Brice, and X. Zhang. QA-VLM: Providing human-interpretable quality assessment for wire- feed laser additive manufacturing parts with vision language models. Journal of Manufacturing Processes, 160:611–623, 2026

work page 2026