arxiv: 2605.17489 · v1 · pith:QH7HSXABnew · submitted 2026-05-17 · 💻 cs.CV

Employing Vision-Language Models for Face Image Quality Assessment

Pith reviewed 2026-05-20 13:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords face image quality assessment · vision-language models · zero-shot evaluation · biometrics · interpretability · prompt robustness · synthetic data ablation

The pith

Vision-language models can estimate face image quality zero-shot and align with traditional biometric scores while offering potential explanations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether off-the-shelf vision-language models can assess the quality of face images without any additional training or fine-tuning. Traditional face image quality methods produce useful scores but act as black boxes that give no reasons for their decisions, which restricts their use in settings where humans need to understand or act on the output. The authors run these models on surveillance, controlled, and synthetic face datasets, comparing the resulting scores to established methods through error-versus-reject analysis and checking how stable the scores remain when prompts change. They find that architecture choice matters more than model size for biometric utility, that scores usually track traditional ones, and that larger models gain consistency yet lose some ability to flag degradations. If correct, this opens a path to add readable justifications to biometric pipelines where human review occurs.

Core claim

Off-the-shelf vision-language models prompted in zero-shot fashion generate face image quality scores whose biometric utility largely matches that of conventional FIQA methods across surveillance, controlled, and synthetic datasets, with the added property that their outputs can be inspected for human-readable reasons.
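
To make the zero-shot setup concrete, the following is a minimal sketch of how a scalar score might be pulled out of a VLM's free-text reply. The prompt wording, the 0 to 100 scale, and the `parse_quality` helper are all assumptions made for illustration; the paper's actual prompts and parsing are not reproduced here.

```python
import re
from typing import Optional

# Hypothetical prompt wording; not taken from the paper.
PROMPT = (
    "Rate the quality of this face image for face recognition "
    "on a scale from 0 to 100. Answer as 'Score: <number>' "
    "followed by a one-sentence reason."
)

# Matches "Score: 73", "score = 12.5", and similar variants.
_SCORE_RE = re.compile(r"score\s*[:=]\s*(\d+(?:\.\d+)?)", re.IGNORECASE)


def parse_quality(reply: str, lo: float = 0.0, hi: float = 100.0) -> Optional[float]:
    """Extract a scalar score from a free-text reply and rescale it to [0, 1].

    Returns None when no score is found or when it falls outside [lo, hi],
    so the caller can decide how to handle malformed replies.
    """
    match = _SCORE_RE.search(reply)
    if match is None:
        return None
    value = float(match.group(1))
    if not lo <= value <= hi:
        return None
    return (value - lo) / (hi - lo)
```

Returning `None` instead of a default score keeps unparseable replies visible, which matters when prompt changes alter the output format.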

What carries the argument

Zero-shot prompting of vision-language models to produce scalar quality scores for face images, benchmarked via error-versus-reject curves and prompt-sensitivity tests on diverse datasets.
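
An error-versus-reject curve can be sketched as follows. This is a simplified illustration, not the paper's protocol: it assumes genuine comparison pairs only, a fixed decision threshold, false non-match rate as the error, and a per-pair quality supplied by the caller (a common choice is the minimum of the two image scores). The function name `evr_curve` is hypothetical.

```python
from typing import List, Sequence, Tuple


def evr_curve(
    pair_quality: Sequence[float],
    pair_similarity: Sequence[float],
    threshold: float,
    reject_fractions: Sequence[float],
) -> List[Tuple[float, float]]:
    """Return (reject fraction, FNMR) points for genuine comparison pairs.

    pair_quality: quality of each genuine pair (assumed: min of both image scores).
    pair_similarity: recognition similarity of the same pairs.
    threshold: fixed decision threshold; a kept pair below it is a false non-match.
    """
    if len(pair_quality) != len(pair_similarity):
        raise ValueError("quality and similarity must have equal length")
    # Sort pair indices from lowest to highest quality so the worst go first.
    order = sorted(range(len(pair_quality)), key=lambda i: pair_quality[i])
    points = []
    for frac in reject_fractions:
        n_reject = int(frac * len(order))
        kept = order[n_reject:]
        if not kept:
            break  # nothing left to evaluate at this reject fraction
        errors = sum(1 for i in kept if pair_similarity[i] < threshold)
        points.append((frac, errors / len(kept)))
    return points
```

A quality measure with good biometric utility should make the error fall as the reject fraction grows, because the discarded low-quality pairs are the ones most likely to fail.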

If this is right

  • Biometric utility of the VLM approach depends more on model architecture than on total parameter count.
  • Most tested VLMs produce scores that align with those from established FIQA methods on the chosen datasets.
  • Both the ranking order and the numeric scores returned by VLMs shift when the text prompt is altered.
  • Increasing parameter count improves internal score consistency yet reduces performance at detecting image degradations compared with smaller models.
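
One way to picture the internal-consistency check is to group scalar scores by the text label the model generated for the same image, as the Figure 6 caption describes, and test whether the group medians rise with the label. The sketch below assumes an ordered label set such as poor, fair, good and a monotone-median criterion; both are illustrative assumptions rather than the paper's definition.

```python
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable, Sequence, Tuple


def medians_by_label(samples: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Group (label, score) samples by label and return each group's median score."""
    groups = defaultdict(list)
    for label, score in samples:
        groups[label].append(score)
    return {label: median(scores) for label, scores in groups.items()}


def is_consistent(medians: Dict[str, float], ordered_labels: Sequence[str]) -> bool:
    """True when medians do not decrease along the assumed worst-to-best label order."""
    present = [medians[label] for label in ordered_labels if label in medians]
    return all(a <= b for a, b in zip(present, present[1:]))
```

A model whose "good" images receive lower median scores than its "poor" images would fail this check, signalling that its text and its numbers disagree.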

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such VLM outputs could be added to existing biometric pipelines to supply short natural-language justifications for quality rejections during human review.
  • The same prompting strategy might transfer to quality assessment of other image types if the alignment pattern observed here generalizes beyond faces.
  • Deployment trials that measure end-to-end system accuracy and operator decision time would show whether the interpretability gain justifies any small loss in raw utility.

Load-bearing premise

That close agreement between VLM scores and traditional FIQA methods on the tested datasets means the VLM outputs truly reflect biometric usefulness rather than merely echoing chosen prompts or dataset patterns.

What would settle it

A controlled test on new degraded face images where VLM quality scores do not predict drops in downstream face recognition accuracy as well as traditional FIQA scores do.

Figures

Figures reproduced from arXiv: 2605.17489 by Erdi Sarıtaş, Eren Onaran, Hazım Kemal Ekenel, Vitomir Štruc.

Figure 1: VLMs for Quality Assessment. While traditional FIQA methods (top) function as opaque "black boxes" outputting only scalar scores, VLM-driven approaches (bottom) offer transparency by providing both biometric utility scores and actionable semantic justifications. *VLM prompt is generated using QWEN2.5-32B.
Figure 2: Error-versus-Reject (EvR) curves on LFW.
Figure 3: Score sensitivity to physical distance in surveillance…
Figure 4: Explainability analysis on SCFace (QWEN2.5-32B). The plot shows the distribution of generated attribute labels…
Figure 5: Prompt Ablation Study: Comparison of score distribu…
Figure 6: Internal Consistency (QWEN2.5-32B). Boxplots of scalar quality scores grouped by the model's generated text labels.
Figure 7: Sample output of VLMs from all four datasets.

Original abstract

Face Image Quality Assessment (FIQA) is a crucial control step in biometric pipelines. It ensures only reliable samples are processed to maintain system accuracy. State-of-the-art FIQA methods achieve high utility but typically operate as "black boxes." They produce scalar scores without human-interpretable justifications. This lack of transparency limits their effectiveness in human-in-the-loop scenarios, such as automated border control, where actionable feedback is essential. In this paper, we investigate the potential of off-the-shelf Vision-Language Models (VLMs) to bridge this gap by performing FIQA in a zero-shot setting. We present a comprehensive evaluation framework for assessing VLM performance. This involves benchmarking traditional FIQA methods through error-versus-reject curves. Additionally, using a diverse set of datasets, ranging from surveillance-oriented to synthetically generated, we analyzed their interpretability, consistency, and robustness to prompt changes. Our results show biometric utility performance depends significantly on architecture, not merely on parameter count. Most VLMs' outputs align with those of traditional methods. We also find that VLM ranking performance and the generated scores may vary across prompts. Our synthetic ablation study shows that while increasing the parameter count can improve internal consistency, it yields worse degradation-detection performance than smaller models. These findings suggest that zero-shot FIQA score estimation using VLMs is promising and could effectively complement conventional FIQA pipelines as an interpretability module. The codes are available at https://github.com/ThEnded32/VLM4FIQA.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper explores the application of off-the-shelf Vision-Language Models (VLMs) for zero-shot Face Image Quality Assessment (FIQA) to address the lack of interpretability in traditional methods. It describes a comprehensive evaluation framework that includes benchmarking traditional FIQA methods using error-versus-reject curves and analyzing VLMs for their alignment with traditional scores, consistency, prompt robustness, and interpretability on various datasets including surveillance and synthetic ones. Findings highlight that performance depends on model architecture rather than parameter count, with most VLMs aligning with traditional methods but showing sensitivity to prompt changes. A synthetic ablation indicates larger models enhance consistency but perform worse in degradation detection than smaller models. The conclusion is that VLMs offer promise as an interpretability complement to conventional FIQA pipelines, supported by publicly available code.

Significance. Should the results hold, this work could significantly improve transparency in biometric quality assessment, facilitating better human oversight in applications like border control. The emphasis on architecture dependence and prompt sensitivity provides useful insights for future VLM use in biometrics. The availability of code enhances reproducibility and allows for further validation. However, the significance is tempered by the indirect nature of the utility assessment.

major comments (3)
  1. While traditional FIQA methods are evaluated using error-versus-reject curves to demonstrate biometric utility, the VLM assessment is restricted to alignment with traditional methods, interpretability, and synthetic ablations without equivalent direct utility testing via recognition performance curves. This gap is load-bearing for the claim that VLMs can effectively complement FIQA pipelines, as alignment may not guarantee equivalent error reduction in downstream tasks.
  2. The finding that increasing parameter count improves internal consistency but leads to worse degradation-detection performance than smaller models needs clarification on the exact models, datasets, and quantitative metrics used for 'degradation-detection performance' to substantiate the architecture-dependence over parameter count.
  3. The observation that VLM ranking performance and scores vary across prompts should be supported by specific quantitative results, such as correlation values or ranking agreement metrics across the tested datasets, to better assess the robustness.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we intend to make.

Point-by-point responses
  1. Referee: While traditional FIQA methods are evaluated using error-versus-reject curves to demonstrate biometric utility, the VLM assessment is restricted to alignment with traditional methods, interpretability, and synthetic ablations without equivalent direct utility testing via recognition performance curves. This gap is load-bearing for the claim that VLMs can effectively complement FIQA pipelines, as alignment may not guarantee equivalent error reduction in downstream tasks.

    Authors: We agree that direct biometric utility evaluation strengthens the complementarity claim. Our primary emphasis was on interpretability and alignment as a complement to existing pipelines, but we acknowledge that alignment alone is an indirect proxy. In the revised manuscript we will add error-versus-reject curve analyses for selected VLM configurations using the same recognition back-ends employed for the traditional methods, thereby providing a more direct utility comparison. revision: yes

  2. Referee: The finding that increasing parameter count improves internal consistency but leads to worse degradation-detection performance than smaller models needs clarification on the exact models, datasets, and quantitative metrics used for 'degradation-detection performance' to substantiate the architecture-dependence over parameter count.

    Authors: We thank the referee for noting the need for greater specificity. The ablation compared CLIP variants of differing sizes together with BLIP and LLaVA models on the synthetic degradation dataset. Degradation-detection performance was quantified via precision-recall metrics and correlation with ground-truth degradation severity labels. We will expand the relevant experimental section with explicit model identifiers, dataset composition details, and the precise quantitative metrics to clarify the architecture-versus-size distinction. revision: yes

  3. Referee: The observation that VLM ranking performance and scores vary across prompts should be supported by specific quantitative results, such as correlation values or ranking agreement metrics across the tested datasets, to better assess the robustness.

    Authors: We concur that quantitative backing improves the assessment of prompt sensitivity. In the revision we will report Spearman rank correlation coefficients and Kendall tau agreement values that measure ranking and score stability across the prompt variants on each evaluated dataset, thereby providing concrete evidence of the observed variation. revision: yes
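
The rank-agreement measures named in this response can be illustrated with a dependency-free sketch. It computes Spearman's coefficient as a Pearson correlation over average ranks and Kendall's tau in its simple tau-a form, where tied pairs count as neither concordant nor discordant. The function names are hypothetical, and a library implementation may use the tau-b variant and differ in its tie handling.

```python
from itertools import combinations
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ValueError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (sx * sy)


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation between scores from two prompts."""
    return pearson(average_ranks(a), average_ranks(b))


def kendall_tau_a(a: Sequence[float], b: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total number of pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(a)), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (len(a) * (len(a) - 1) / 2)
```

Here `a` and `b` would hold the scores that two prompt variants assign to the same images, in the same order; values near 1 indicate that the prompts rank the images almost identically.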

Circularity Check

0 steps flagged

No circularity: empirical benchmarking study with direct dataset comparisons

full rationale

This paper is an empirical evaluation that benchmarks traditional FIQA methods via error-versus-reject curves and measures VLM zero-shot outputs for alignment, interpretability, consistency, and prompt robustness across external datasets (surveillance and synthetic). No mathematical derivations, parameter fits renamed as predictions, or self-citation chains appear in the load-bearing claims. All reported findings rest on direct experimental comparisons to independent data and methods rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is an empirical evaluation that relies on pre-trained VLMs and public datasets without introducing new mathematical parameters, axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5817 in / 1227 out tokens · 65637 ms · 2026-05-20T13:57:08.582119+00:00 · methodology
