pith. machine review for the scientific record.

arxiv: 2512.18073 · v2 · submitted 2025-12-19 · 💻 cs.CV

Recognition: 2 theorem links


FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords fingerprint analysis · multimodal LLMs · benchmark · biometrics · fine-tuning · forensic tasks · vision-language models · domain adaptation

The pith

FPBench tests 20 multimodal LLMs on eight fingerprint tasks across seven datasets and shows that fine-tuning the vision and language encoders raises performance by 7 to 39 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FPBench to measure how well multimodal large language models handle the fine details in fingerprint images for biometric and forensic work. It runs twenty models through eight tasks including pattern classification, verification, and real-versus-synthetic detection on seven real and synthetic datasets, using both zero-shot and chain-of-thought prompting. The key result is that fine-tuning the vision and language components of open-source models lifts accuracy on these tasks. A sympathetic reader would care because fingerprints remain a primary biometric for identification and security, yet most large models are trained on everyday scenes rather than these specialized textures. The benchmark is presented as an initial step toward building foundation models that are actually useful for fingerprint analysis.
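
The prompting split matters when reading the results. As a rough illustration only, the sketch below shows what a zero-shot versus chain-of-thought evaluation loop for the pattern-classification task could look like; `query_model` and the prompt wording are hypothetical stand-ins, not FPBench's actual templates.

```python
# Illustrative evaluation loop for one fingerprint task under the two
# prompting regimes the benchmark uses. `query_model` is a hypothetical
# stand-in for whatever MLLM API is under test; prompts are not FPBench's.
from typing import Callable

ZERO_SHOT = (
    "Classify the fingerprint pattern in the image. "
    "Answer with one of: arch, loop, whorl."
)
CHAIN_OF_THOUGHT = (
    "Examine the ridge flow, core, and delta points in the fingerprint "
    "image step by step, then classify the pattern. "
    "Answer with one of: arch, loop, whorl."
)

def evaluate(query_model: Callable[[str, str], str],
             samples: list[tuple[str, str]],
             prompt: str) -> float:
    """Accuracy of one model over (image_path, label) pairs for one prompt style."""
    correct = 0
    for image_path, label in samples:
        answer = query_model(image_path, prompt).strip().lower()
        correct += int(label in answer)  # lenient match; free-form output varies
    return correct / len(samples)
```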

Core claim

FPBench is a new benchmark that evaluates twenty multimodal LLMs on eight biometric and forensic tasks such as pattern analysis, fingerprint verification, and real-versus-synthetic classification, using seven real and synthetic fingerprint datasets under zero-shot and chain-of-thought prompting. Fine-tuning the vision and language encoders on open-source models improves performance by 7 to 39 percent across the tasks. The benchmark is positioned as a first step toward developing foundation models specialized for fingerprints.
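
The summary does not pin down the fine-tuning recipe beyond "vision and language encoders." A minimal sketch of one common way to do this kind of domain adaptation, LoRA adapters via HuggingFace peft on an open-source MLLM, follows; the model id and target module names are assumptions for illustration, and the authors' actual recipe is in their released code.

```python
# A plausible (not the authors') encoder-adaptation setup: LoRA adapters
# on the attention projections of an open-source MLLM via HuggingFace peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projections in both the vision tower and the language model;
    # real module names depend on the architecture being adapted.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights train
```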

What carries the argument

The FPBench benchmark itself, defined by its eight tasks applied to seven fingerprint datasets to test multimodal LLMs on fine structural and textural details.
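
To make the benchmark's shape concrete, a hypothetical schema for a single FPBench-style item is sketched below; the field names and comments are assumptions for illustration, not the benchmark's actual data format.

```python
# Hypothetical representation of one benchmark item: a task, a source
# dataset, one or more images, a prompt, and a ground-truth answer.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    task: str                # one of the eight biometric/forensic tasks
    dataset: str             # one of the seven real or synthetic datasets
    image_paths: list[str]   # a single print, or a pair for verification
    question: str            # zero-shot or chain-of-thought prompt text
    answer: str              # ground-truth label used for scoring
```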

If this is right

  • Fine-tuned MLLMs become viable tools for fingerprint pattern classification and verification tasks.
  • Domain adaptation through encoder fine-tuning works for the fine textures specific to biometric images.
  • Standardized benchmarks like FPBench allow direct comparison of open-source and proprietary models on forensic tasks.
  • Open-source MLLMs can be improved enough to narrow the gap with proprietary models in this domain.
  • The results support the feasibility of building specialized foundation models for fingerprint analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning approach could be tested on other biometrics such as iris or palm prints.
  • Forensic workflows might use these adapted models to handle partial or degraded prints not fully represented in the current datasets.
  • The performance lift indicates that general MLLMs underperform on biometric textures because of limited exposure during pretraining.
  • Future evaluations could add tasks like aging simulation or cross-sensor matching to stress-test the benchmark.

Load-bearing premise

The eight chosen tasks and seven datasets adequately cover the range of real-world fingerprint challenges in forensics and biometrics.

What would settle it

A new collection of fingerprint images from unseen sensors or populations where the fine-tuned models show no accuracy gain or fall below zero-shot baselines would falsify the reported benefit of fine-tuning.
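
One way to operationalize that test: score the same held-out items from an unseen sensor under both regimes and bootstrap the paired accuracy difference. A minimal sketch, assuming hypothetical per-sample 0/1 correctness arrays:

```python
# Paired bootstrap for the fine-tuned-minus-zero-shot accuracy gain on a
# held-out, unseen-sensor split. Inputs are hypothetical 0/1 arrays.
import numpy as np

def gain_ci(zero_shot: np.ndarray, fine_tuned: np.ndarray,
            n_boot: int = 10_000, seed: int = 0) -> tuple[float, float, float]:
    """Mean accuracy gain and a 95% bootstrap CI over paired outcomes."""
    rng = np.random.default_rng(seed)
    diffs = fine_tuned.astype(float) - zero_shot.astype(float)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return diffs.mean(), lo, hi

# A CI that includes or sits below zero on the unseen-sensor data would
# mean the reported benefit of fine-tuning fails to generalize there.
```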

Figures

Figures reproduced from arXiv: 2512.18073 by Chinmay Hegde, Ekta Gavas, Nasir Memon, Sudipta Banerjee.

Figure 1. FPBench: Overview of the proposed benchmark for fingerprint analysis using MLLMs. We present examples of prompts curated for each task to evaluate the vision and language capabilities of MLLMs in fingerprint-based biometric and forensic tasks.
Figure 2. Distribution of questions across different categories and …
Figure 3. Example ACE-V sheet from the ACE-V analysis task on a sample fingerprint pair in FPBench.
Figure 4. Accuracy (%) of all models across various fingerprint tasks, presented as a heat map. The task 'tool…
Figure 5. Accuracy (%) of the top-5 best-performing models along with mean performance across all tasks.
Figure 6. Performance (%) of the top-5 best-performing models across all tasks under zero-shot prompting.
Figure 7. Performance (%) variation of select models with change in model size (#params). Solid lines represent larger model variants …
Figure 8. Comparison between responses produced by Qwen3-VL-32b …
Original abstract

Multimodal LLMs (MLLMs) are capable of performing complex data analysis, visual question answering, generation, and reasoning tasks. However, their ability to analyze biometric data is relatively underexplored. In this work, we investigate the effectiveness of MLLMs in understanding fine structural and textural details present in fingerprint images. To this end, we design a comprehensive benchmark, FPBench, to evaluate 20 MLLMs (open-source and proprietary models) across 7 real and synthetic datasets on a suite of 8 biometric and forensic tasks (e.g., pattern analysis, fingerprint verification, real versus synthetic classification, etc.) using zero-shot and chain-of-thought prompting strategies. We further fine-tune vision and language encoders on a subset of open-source MLLMs to demonstrate domain adaptation. FPBench is a novel benchmark designed as a first step towards developing foundation models in fingerprints. Our findings indicate fine-tuning of vision and language encoders improves the performance by 7%-39%. Our codes are available at https://github.com/Ektagavas/FPBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FPBench, a benchmark for evaluating 20 multimodal large language models (MLLMs) on fingerprint analysis. It tests these models across 8 biometric and forensic tasks (including pattern analysis, verification, and real-vs-synthetic classification) on 7 real and synthetic datasets using zero-shot and chain-of-thought prompting, then reports 7-39% performance gains after fine-tuning vision and language encoders on open-source MLLMs. The work positions FPBench as an initial step toward foundation models for fingerprints.

Significance. If the chosen tasks and datasets prove representative, the benchmark would provide a useful baseline for MLLM performance in biometrics and demonstrate the value of domain adaptation via fine-tuning. The empirical nature of the study (no circular derivations) and public code release strengthen its potential utility, but the significance hinges on whether the 7-39% gains generalize beyond the selected collection.

major comments (2)
  1. [Abstract] The claim that FPBench is 'comprehensive' and that fine-tuning yields 7%-39% gains rests on the unverified assumption that the 8 tasks and 7 datasets adequately sample real-world forensic distributions (low-quality latents, partial prints, cross-sensor variation, aging). No analysis or justification of coverage is supplied, so the central performance claims cannot be assessed for generalizability.
  2. [Evaluation] The reported performance gains (implied by the summarized results) lack full specification of exact metrics, statistical significance tests, variance across runs, and precise prompting templates. Without these, the 7-39% improvement figures cannot be reproduced or compared reliably to future work.
minor comments (2)
  1. The GitHub repository link is provided, which supports reproducibility; ensure the release includes all prompting templates, fine-tuning hyperparameters, and dataset splits.
  2. Clarify the distinction between open-source and proprietary models in the results tables to avoid ambiguity in zero-shot vs. fine-tuned comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity, transparency, and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] The claim that FPBench is 'comprehensive' and that fine-tuning yields 7%-39% gains rests on the unverified assumption that the 8 tasks and 7 datasets adequately sample real-world forensic distributions (low-quality latents, partial prints, cross-sensor variation, aging). No analysis or justification of coverage is supplied, so the central performance claims cannot be assessed for generalizability.

    Authors: We agree that the term 'comprehensive' requires qualification and that explicit discussion of coverage is needed. In the revised manuscript, we have updated the abstract to describe FPBench as 'a benchmark' (removing 'comprehensive') and added a new Limitations subsection in the Discussion that justifies task and dataset selection based on standard fingerprint biometrics literature while explicitly acknowledging gaps in coverage for low-quality latents, partial prints, cross-sensor variation, and aging. The 7-39% gains are presented as results on the chosen collection, with a statement that broader generalization requires additional validation. This provides the requested transparency without overstating scope. revision: partial

  2. Referee: [Evaluation] The reported performance gains (implied by the summarized results) lack full specification of exact metrics, statistical significance tests, variance across runs, and precise prompting templates. Without these, the 7-39% improvement figures cannot be reproduced or compared reliably to future work.

    Authors: We thank the referee for highlighting this reproducibility gap. The revised manuscript expands the Evaluation section to report exact per-task metrics (accuracy for classification tasks and equal error rate for verification), includes statistical significance testing (McNemar's test for paired comparisons), reports standard deviation over five independent fine-tuning runs, and provides all zero-shot and chain-of-thought prompting templates verbatim in a new Appendix B. The public GitHub repository has been updated with the full evaluation scripts and templates to support direct reproduction and comparison. revision: yes
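
For concreteness, here is a minimal sketch of the two statistical ingredients the rebuttal names: an exact McNemar test for paired model comparisons and an equal error rate for verification. The inputs (correctness arrays, similarity scores) are hypothetical, and this is not the authors' evaluation code.

```python
# Exact McNemar test and equal error rate, as referenced in the rebuttal.
# Built from definitions; not the paper's actual evaluation scripts.
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Two-sided exact McNemar p-value over paired 0/1 outcomes."""
    b = int(np.sum((correct_a == 1) & (correct_b == 0)))  # A right, B wrong
    c = int(np.sum((correct_a == 0) & (correct_b == 1)))  # A wrong, B right
    if b + c == 0:
        return 1.0  # the two models never disagree
    # Under the null of equal accuracy, discordant pairs split 50/50.
    return binomtest(b, n=b + c, p=0.5).pvalue

def eer(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Equal error rate from similarity scores (higher means more similar)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2
```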

Circularity Check

0 steps flagged

Empirical benchmark with no derivations or self-referential reductions

Full rationale

The paper is a purely empirical benchmark study. It defines FPBench with 8 tasks across 7 datasets, evaluates 20 MLLMs in zero-shot and CoT settings, and reports measured accuracy gains from fine-tuning vision/language encoders (7-39%). No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced. The central claims rest on direct experimental measurements against external datasets rather than any derivation that reduces to its own inputs by construction. No self-citations are used to justify load-bearing steps. This is the standard case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, free parameters, or invented entities; relies on standard evaluation practices in machine learning.

pith-pipeline@v0.9.0 · 5499 in / 958 out tokens · 25213 ms · 2026-05-16T20:25:18.685231+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 11 internal anchors

  1. [1] Anguli: Synthetic fingerprint generator.
  2. [2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv, 2023.
  3. [3] Shuai Bai et al. Qwen3-VL technical report. arXiv.
  4. [4] Lang Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Conference on Empirical Methods in Natural Language Processing, 2023.
  5. [5] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right …
  6. [6] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences.
  7. [7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  8. [8] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
  9. [9] Luke Nicholas Darlow and Benjamin Rosman. Fingerprint minutiae extraction using deep learning. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pages 22–30. IEEE, 2017.
  10. [10] Joshua J Engelsma, Kai Cao, and Anil K Jain. Learning a fixed-length fingerprint representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6):1981–1997, 2019.
  11. [11] Parisa Farmanifard and Arun Ross. ChatGPT meets iris biometrics. In 2024 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2024.
  12. [12] Gregory Fiumara, Patricia Flanagan, Matthew Schwarz, Elham Tabassi, and Christopher Boehnen. NIST Special Database 301. Gaithersburg, MD, USA.
  13. [13] Gregory P Fiumara, Patricia A Flanagan, John D Grantham, Kenneth Ko, Karen Marshall, Matthew Schwarz, Elham Tabassi, Bryan Woodgate, and Christopher Boehnen. NIST Special Database 302: Nail to nail fingerprint challenge. 2019.
  14. [14] Ekta Gavas, Kaustubh Olpadkar, and Anoop Namboodiri. Enhancement-driven pretraining for robust fingerprint representation learning. Proceedings Copyright, 821:828, 2024.
  15. [15] Steven A Grosz and Anil K Jain. Universal fingerprint generation: Controllable diffusion model with multimodal conditions. arXiv preprint arXiv:2404.13791.
  16. [16] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
  17. [17] Anil K. Jain, Arun A. Ross, and Karthik Nandakumar. Introduction to Biometrics. Springer Publishing Company, Incorporated, 2011.
  18. [18] Kenneth Ko. User's guide to NIST Biometric Image Software (NBIS). 2007.
  19. [19] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. OBELICS: An open web-scale filtered dataset of interleaved image-text documents, 2023.
  20. [20] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024.
  21. [21] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.
  22. [22] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.
  23. [23] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26763–26773.
  24. [24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  25. [25] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
  26. [26] Mianxin Liu, Weiguo Hu, Jinru Ding, Jie Xu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, et al. MedBench: A comprehensive, standardized, and reliable benchmarking system for evaluating Chinese medical large language models. Big Data Mining and Analytics, 7(4):1116–1128, 2024.
  27. [27] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–.
  28. [28] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. DeepSeek-VL: Towards real-world vision-language understanding, 2024.
  29. [29] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
  30. [30] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  31. [31] Dario Maio, Davide Maltoni, Raffaele Cappelli, James L. Wayman, and Anil K. Jain. FVC2000: Fingerprint verification competition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):402–412, 2002.
  32. [32] Dario Maio, Davide Maltoni, Raffaele Cappelli, James L Wayman, and Anil K Jain. FVC2002: Second fingerprint verification competition. In 2002 International Conference on Pattern Recognition, pages 811–814. IEEE, 2002.
  33. [33] Dario Maio, Davide Maltoni, Raffaele Cappelli, Jim L Wayman, and Anil K Jain. FVC2004: Third fingerprint verification competition. In International Conference on Biometric Authentication, pages 1–7. Springer.
  34. [34] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244.
  35. [35] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209.
  36. [36] Kartik Narayan, Vibashan VS, and Vishal M Patel. FaceXBench: Evaluating multimodal LLMs on face understanding. arXiv preprint arXiv:2501.10360, 2025.
  37. [37] Dinh-Luan Nguyen, Kai Cao, and Anil K Jain. Robust minutiae extractor: Integrating deep networks and fingerprint domain knowledge. In 2018 International Conference on Biometrics (ICB), pages 9–16. IEEE.
  38. [38] NIST. SWGFAST: Document #10 standards for examining friction ridge impressions and resulting conclusions (latent/tenprint). https://www.nist.gov/system/files/documents/2016/10/26/swgfast_examinations-conclusions_2.0_130427.pdf, 2016. Accessed: 2025-06-30.
  39. [39] NIST. SWGFAST: Document #9 standard for the documentation of analysis, comparison, evaluation, and verification (ACE-V) in tenprint operations (tenprint). https://www.nist.gov/system/files/documents/2016/10/26/swgfast_standard-documentation-ace-v-tenprint_2.0_121124.pdf, 2016. Accessed: 2025-06-30.
  40. [40] OpenAI. GPT-5 system card. Technical report, OpenAI, 2025.
  41. [41] Sara Sarto, Marcella Cornia, and Rita Cucchiara. Image captioning evaluation in the age of multimodal LLMs: Challenges and future perspectives. arXiv preprint arXiv:2503.14604, 2025.
  42. [42] Hatef Otroshi Shahreza and Sébastien Marcel. FaceLLM: A multimodal large language model for face understanding. arXiv preprint arXiv:2507.10300.
  43. [43] Yichen Shi, Yuhao Gao, Yingxin Lai, Hongyang Wang, Jun Feng, Lei He, Jun Wan, Changsheng Chen, Zitong Yu, and Xiaochun Cao. SHIELD: An evaluation benchmark for face spoofing and forgery detection with multimodal large language models. Visual Intelligence, 3(1):9, 2025.
  44. [44] Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, and Arun Ross. Benchmarking foundation models for zero-shot biometric tasks. arXiv preprint arXiv:2505.24214, 2025.
  45. [45] Haomiao Sun, Mingjie He, Tianheng Lian, Hu Han, and Shiguang Shan. Face-MLLM: A large face perception model. arXiv preprint arXiv:2410.20717, 2024.
  46. [46] Elham Tabassi, Martin Olsen, Oliver Bausinger, Christoph Busch, Andrew Figlarz, Gregory Fiumara, Olaf Henniger, Johannes Merkle, Timo Ruhland, Christopher Schiel, et al. NIST Fingerprint Image Quality 2. 2021.
  47. [47] Ai Takahashi, Yoshinori Koda, Koichi Ito, and Takafumi Aoki. Fingerprint feature extraction by combining texture, minutiae, and frequency spectrum using multi-task CNN. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1–8. IEEE, 2020.
  48. [48] Saraansh Tandon and Anoop Namboodiri. Transformer based fingerprint feature extraction. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 870–876. IEEE, 2022.
  49. [49] Yao Tang, Fei Gao, Jufu Feng, and Yuhang Liu. FingerNet: An unified deep network for fingerprint minutiae extraction. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pages 108–116. IEEE.
  50. [50] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
  51. [51] Gemma Team. Gemma 3. 2025.
  52. [52] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  53. [53] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
  54. [54] David Temoshok, Diana Proud-Madruga, Yee-Yin Choong, Ryan Galluzzo, Sarbari Gupta, Connie LaSalle, Naomi Lefkovitz, and Andrew Regenscheid. Digital identity guidelines. NIST Special Publication NIST SP 800-63-4, 2025.
  55. [55] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Document collection visual question answering. In International Conference on Document Analysis and Recognition, pages 778–792. Springer, 2021.
  56. [56] Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran Gu, Sichen Xia, Wenjun Li, et al. A comprehensive review of multimodal large language models: Performance and challenges across different tasks. arXiv preprint arXiv:2408.01319, 2024.
  57. [57] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.
  58. [58] Bohao Xing, Zitong Yu, Xin Liu, Kaishen Yuan, Qilang Ye, Weicheng Xie, Huanjing Yue, Jingyu Yang, and Heikki Kälviäinen. Emo-LLaMA: Enhancing facial emotion understanding with instruction tuning. arXiv preprint arXiv:2408.11424, 2024.
  59. [59] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.
  60. [60] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
  61. [61] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415.
  62. [62] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.