pith. machine review for the scientific record.

arxiv: 2512.18073 · v2 · submitted 2025-12-19 · 💻 cs.CV

Recognition: 2 theorem links


FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords fingerprint analysis · multimodal LLMs · benchmark · biometrics · fine-tuning · forensic tasks · vision-language models · domain adaptation

The pith

FPBench tests 20 multimodal LLMs on eight fingerprint tasks across seven datasets and shows that fine-tuning the vision and language encoders raises performance by 7 to 39 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates FPBench to measure how well multimodal large language models handle the fine details in fingerprint images for biometric and forensic work. It runs twenty models through eight tasks including pattern classification, verification, and real-versus-synthetic detection on seven real and synthetic datasets, using both zero-shot and chain-of-thought prompting. The key result is that fine-tuning the vision and language components of open-source models lifts accuracy on these tasks. A sympathetic reader would care because fingerprints remain a primary biometric for identification and security, yet most large models are trained on everyday scenes rather than these specialized textures. The benchmark is presented as an initial step toward building foundation models that are actually useful for fingerprint analysis.
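
The prompting split matters when reading the results. As a rough illustration only, the sketch below shows what a zero-shot versus chain-of-thought evaluation loop for the pattern-classification task could look like; `query_model` and the prompt wording are hypothetical stand-ins, not FPBench's actual templates.

```python
# Illustrative evaluation loop for one fingerprint task under the two
# prompting regimes the benchmark uses. `query_model` is a hypothetical
# stand-in for whatever MLLM API is under test; prompts are not FPBench's.
from typing import Callable

ZERO_SHOT = (
    "Classify the fingerprint pattern in the image. "
    "Answer with one of: arch, loop, whorl."
)
CHAIN_OF_THOUGHT = (
    "Examine the ridge flow, core, and delta points in the fingerprint "
    "image step by step, then classify the pattern. "
    "Answer with one of: arch, loop, whorl."
)

def evaluate(query_model: Callable[[str, str], str],
             samples: list[tuple[str, str]],
             prompt: str) -> float:
    """Accuracy of one model over (image_path, label) pairs for one prompt style."""
    correct = 0
    for image_path, label in samples:
        answer = query_model(image_path, prompt).strip().lower()
        correct += int(label in answer)  # lenient match; free-form output varies
    return correct / len(samples)
```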

Core claim

FPBench is a new benchmark that evaluates twenty multimodal LLMs on eight biometric and forensic tasks such as pattern analysis, fingerprint verification, and real-versus-synthetic classification, using seven real and synthetic fingerprint datasets under zero-shot and chain-of-thought prompting. Fine-tuning the vision and language encoders on open-source models improves performance by 7 to 39 percent across the tasks. The benchmark is positioned as a first step toward developing foundation models specialized for fingerprints.
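
The summary does not pin down the fine-tuning recipe beyond "vision and language encoders." A minimal sketch of one common way to do this kind of domain adaptation, LoRA adapters via HuggingFace peft on an open-source MLLM, follows; the model id and target module names are assumptions for illustration, and the authors' actual recipe is in their released code.

```python
# A plausible (not the authors') encoder-adaptation setup: LoRA adapters
# on the attention projections of an open-source MLLM via HuggingFace peft.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-1.5-7b-hf")

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    # Attention projections in both the vision tower and the language model;
    # real module names depend on the architecture being adapted.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the adapter weights train
```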

What carries the argument

The FPBench benchmark itself, defined by its eight tasks applied to seven fingerprint datasets to test multimodal LLMs on fine structural and textural details.
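
To make the benchmark's shape concrete, a hypothetical schema for a single FPBench-style item is sketched below; the field names and comments are assumptions for illustration, not the benchmark's actual data format.

```python
# Hypothetical representation of one benchmark item: a task, a source
# dataset, one or more images, a prompt, and a ground-truth answer.
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    task: str                # one of the eight biometric/forensic tasks
    dataset: str             # one of the seven real or synthetic datasets
    image_paths: list[str]   # a single print, or a pair for verification
    question: str            # zero-shot or chain-of-thought prompt text
    answer: str              # ground-truth label used for scoring
```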

If this is right

  • Fine-tuned MLLMs become viable tools for fingerprint pattern classification and verification tasks.
  • Domain adaptation through encoder fine-tuning works for the fine textures specific to biometric images.
  • Standardized benchmarks like FPBench allow direct comparison of open-source and proprietary models on forensic tasks.
  • Open-source MLLMs can be improved enough to narrow the gap with proprietary models in this domain.
  • The results support the feasibility of building specialized foundation models for fingerprint analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning approach could be tested on other biometrics such as iris or palm prints.
  • Forensic workflows might use these adapted models to handle partial or degraded prints not fully represented in the current datasets.
  • The performance lift indicates that general MLLMs underperform on biometric textures because of limited exposure during pretraining.
  • Future evaluations could add tasks like aging simulation or cross-sensor matching to stress-test the benchmark.

Load-bearing premise

The eight chosen tasks and seven datasets adequately cover the range of real-world fingerprint challenges in forensics and biometrics.

What would settle it

A new collection of fingerprint images from unseen sensors or populations where the fine-tuned models show no accuracy gain or fall below zero-shot baselines would falsify the reported benefit of fine-tuning.
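
One way to operationalize that test: score the same held-out items from an unseen sensor under both regimes and bootstrap the paired accuracy difference. A minimal sketch, assuming hypothetical per-sample 0/1 correctness arrays:

```python
# Paired bootstrap for the fine-tuned-minus-zero-shot accuracy gain on a
# held-out, unseen-sensor split. Inputs are hypothetical 0/1 arrays.
import numpy as np

def gain_ci(zero_shot: np.ndarray, fine_tuned: np.ndarray,
            n_boot: int = 10_000, seed: int = 0) -> tuple[float, float, float]:
    """Mean accuracy gain and a 95% bootstrap CI over paired outcomes."""
    rng = np.random.default_rng(seed)
    diffs = fine_tuned.astype(float) - zero_shot.astype(float)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot = diffs[idx].mean(axis=1)
    lo, hi = np.percentile(boot, [2.5, 97.5])
    return diffs.mean(), lo, hi

# A CI that includes or sits below zero on the unseen-sensor data would
# mean the reported benefit of fine-tuning fails to generalize there.
```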

Figures

Figures reproduced from arXiv: 2512.18073 by Chinmay Hegde, Ekta Gavas, Nasir Memon, Sudipta Banerjee.

Figure 1. FPBench: Overview of the proposed benchmark for fingerprint analysis using MLLMs. We present examples of prompts curated for each task to evaluate the vision and language capabilities of MLLMs in fingerprint-based biometric and forensic tasks.
Figure 2. Distribution of questions across different categories and …
Figure 3. Example ACE-V sheet from the ACE-V analysis task on a sample fingerprint pair in FPBench.
Figure 4. Accuracy (%) of all models across various fingerprint tasks, presented as a heat map. The task 'tool…
Figure 5. Accuracy (%) of the top-5 best-performing models along with mean performance across all tasks.
Figure 6. Performance (%) of the top-5 best-performing models across all tasks under zero-shot prompting.
Figure 7. Performance (%) variation of select models with change in model size (#params). Solid lines represent larger model variants …
Figure 8. Comparison between responses produced by Qwen3-VL-32b …
Original abstract

Multimodal LLMs (MLLMs) are capable of performing complex data analysis, visual question answering, generation, and reasoning tasks. However, their ability to analyze biometric data is relatively underexplored. In this work, we investigate the effectiveness of MLLMs in understanding fine structural and textural details present in fingerprint images. To this end, we design a comprehensive benchmark, FPBench, to evaluate 20 MLLMs (open-source and proprietary models) across 7 real and synthetic datasets on a suite of 8 biometric and forensic tasks (e.g., pattern analysis, fingerprint verification, real versus synthetic classification, etc.) using zero-shot and chain-of-thought prompting strategies. We further fine-tune vision and language encoders on a subset of open-source MLLMs to demonstrate domain adaptation. FPBench is a novel benchmark designed as a first step towards developing foundation models in fingerprints. Our findings indicate fine-tuning of vision and language encoders improves the performance by 7%-39%. Our codes are available at https://github.com/Ektagavas/FPBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces FPBench, a benchmark for evaluating 20 multimodal large language models (MLLMs) on fingerprint analysis. It tests these models across 8 biometric and forensic tasks (including pattern analysis, verification, and real-vs-synthetic classification) on 7 real and synthetic datasets using zero-shot and chain-of-thought prompting, then reports 7-39% performance gains after fine-tuning vision and language encoders on open-source MLLMs. The work positions FPBench as an initial step toward foundation models for fingerprints.

Significance. If the chosen tasks and datasets prove representative, the benchmark would provide a useful baseline for MLLM performance in biometrics and demonstrate the value of domain adaptation via fine-tuning. The empirical nature of the study (no circular derivations) and public code release strengthen its potential utility, but the significance hinges on whether the 7-39% gains generalize beyond the selected collection.

major comments (2)
  1. [Abstract] The claim that FPBench is 'comprehensive' and that fine-tuning yields 7%-39% gains rests on the unverified assumption that the 8 tasks and 7 datasets adequately sample real-world forensic distributions (low-quality latents, partial prints, cross-sensor variation, aging). No analysis or justification of coverage is supplied, so the central performance claims cannot be assessed for generalizability.
  2. [Evaluation] The reported performance gains (implied by the summarized results) lack full specification of exact metrics, statistical significance tests, variance across runs, and precise prompting templates. Without these, the 7-39% improvement figures cannot be reproduced or compared reliably to future work.
minor comments (2)
  1. The GitHub repository link is provided, which supports reproducibility; ensure the release includes all prompting templates, fine-tuning hyperparameters, and dataset splits.
  2. Clarify the distinction between open-source and proprietary models in the results tables to avoid ambiguity in zero-shot vs. fine-tuned comparisons.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity, transparency, and reproducibility.

Point-by-point responses
  1. Referee: [Abstract] The claim that FPBench is 'comprehensive' and that fine-tuning yields 7%-39% gains rests on the unverified assumption that the 8 tasks and 7 datasets adequately sample real-world forensic distributions (low-quality latents, partial prints, cross-sensor variation, aging). No analysis or justification of coverage is supplied, so the central performance claims cannot be assessed for generalizability.

    Authors: We agree that the term 'comprehensive' requires qualification and that explicit discussion of coverage is needed. In the revised manuscript, we have updated the abstract to describe FPBench as 'a benchmark' (removing 'comprehensive') and added a new Limitations subsection in the Discussion that justifies task and dataset selection based on standard fingerprint biometrics literature while explicitly acknowledging gaps in coverage for low-quality latents, partial prints, cross-sensor variation, and aging. The 7-39% gains are presented as results on the chosen collection, with a statement that broader generalization requires additional validation. This provides the requested transparency without overstating scope. revision: partial

  2. Referee: [Evaluation] The reported performance gains (implied by the summarized results) lack full specification of exact metrics, statistical significance tests, variance across runs, and precise prompting templates. Without these, the 7-39% improvement figures cannot be reproduced or compared reliably to future work.

    Authors: We thank the referee for highlighting this reproducibility gap. The revised manuscript expands the Evaluation section to report exact per-task metrics (accuracy for classification tasks and equal error rate for verification), includes statistical significance testing (McNemar's test for paired comparisons), reports standard deviation over five independent fine-tuning runs, and provides all zero-shot and chain-of-thought prompting templates verbatim in a new Appendix B. The public GitHub repository has been updated with the full evaluation scripts and templates to support direct reproduction and comparison. revision: yes
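
For concreteness, here is a minimal sketch of the two statistical ingredients the rebuttal names: an exact McNemar test for paired model comparisons and an equal error rate for verification. The inputs (correctness arrays, similarity scores) are hypothetical, and this is not the authors' evaluation code.

```python
# Exact McNemar test and equal error rate, as referenced in the rebuttal.
# Built from definitions; not the paper's actual evaluation scripts.
import numpy as np
from scipy.stats import binomtest

def mcnemar_exact(correct_a: np.ndarray, correct_b: np.ndarray) -> float:
    """Two-sided exact McNemar p-value over paired 0/1 outcomes."""
    b = int(np.sum((correct_a == 1) & (correct_b == 0)))  # A right, B wrong
    c = int(np.sum((correct_a == 0) & (correct_b == 1)))  # A wrong, B right
    if b + c == 0:
        return 1.0  # the two models never disagree
    # Under the null of equal accuracy, discordant pairs split 50/50.
    return binomtest(b, n=b + c, p=0.5).pvalue

def eer(genuine: np.ndarray, impostor: np.ndarray) -> float:
    """Equal error rate from similarity scores (higher means more similar)."""
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    frr = np.array([(genuine < t).mean() for t in thresholds])    # false rejects
    far = np.array([(impostor >= t).mean() for t in thresholds])  # false accepts
    i = np.argmin(np.abs(frr - far))
    return (frr[i] + far[i]) / 2
```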

Circularity Check

0 steps flagged

Empirical benchmark with no derivations or self-referential reductions

Full rationale

The paper is a purely empirical benchmark study. It defines FPBench with 8 tasks across 7 datasets, evaluates 20 MLLMs in zero-shot and CoT settings, and reports measured accuracy gains from fine-tuning vision/language encoders (7-39%). No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced. The central claims rest on direct experimental measurements against external datasets rather than any derivation that reduces to its own inputs by construction. No self-citations are used to justify load-bearing steps. This is the standard case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, free parameters, or invented entities; relies on standard evaluation practices in machine learning.

pith-pipeline@v0.9.0 · 5499 in / 958 out tokens · 25213 ms · 2026-05-16T20:25:18.685231+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 11 internal anchors

  1. [1] Anguli: Synthetic fingerprint generator.
  2. [2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv, 2023.
  3. [3] Shuai Bai et al. Qwen3-VL technical report. arXiv.
  4. [4] Lang Cao. Learn to refuse: Making large language models more controllable and reliable through knowledge scope limitation and refusal mechanism. In Conference on Empirical Methods in Natural Language Processing, 2023.
  5. [5] Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Zehui Chen, Haodong Duan, Jiaqi Wang, Yu Qiao, Dahua Lin, et al. Are we on the right …
  6. [6] Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to GPT-4V? Closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences.
  7. [7] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024.
  8. [8] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
  9. [9] Luke Nicholas Darlow and Benjamin Rosman. Fingerprint minutiae extraction using deep learning. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pages 22–30. IEEE, 2017.
  10. [10] Joshua J Engelsma, Kai Cao, and Anil K Jain. Learning a fixed-length fingerprint representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 43(6):1981–1997, 2019.
  11. [11] Parisa Farmanifard and Arun Ross. ChatGPT meets iris biometrics. In 2024 IEEE International Joint Conference on Biometrics (IJCB), pages 1–10. IEEE, 2024.
  12. [12] Gregory Fiumara, Patricia Flanagan, Matthew Schwarz, Elham Tabassi, and Christopher Boehnen. NIST Special Database 301. Gaithersburg, MD, USA.
  13. [13] Gregory P Fiumara, Patricia A Flanagan, John D Grantham, Kenneth Ko, Karen Marshall, Matthew Schwarz, Elham Tabassi, Bryan Woodgate, and Christopher Boehnen. NIST Special Database 302: Nail to nail fingerprint challenge. 2019.
  14. [14] Ekta Gavas, Kaustubh Olpadkar, and Anoop Namboodiri. Enhancement-driven pretraining for robust fingerprint representation learning. Proceedings Copyright, 821:828, 2024.
  15. [15] Steven A Grosz and Anil K Jain. Universal fingerprint generation: Controllable diffusion model with multimodal conditions. arXiv preprint arXiv:2404.13791.
  16. [16] Drew A Hudson and Christopher D Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709, 2019.
  17. [17] Anil K. Jain, Arun A. Ross, and Karthik Nandakumar. Introduction to Biometrics. Springer Publishing Company, Incorporated, 2011.
  18. [18] Kenneth Ko. User's guide to NIST Biometric Image Software (NBIS). 2007.
  19. [19] Hugo Laurençon, Lucile Saulnier, Léo Tronchon, Stas Bekman, Amanpreet Singh, Anton Lozhkov, Thomas Wang, Siddharth Karamcheti, Alexander M. Rush, Douwe Kiela, Matthieu Cord, and Victor Sanh. OBELICS: An open web-scale filtered dataset of interleaved image-text documents, 2023.
  20. [20] Hugo Laurençon, Léo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models?, 2024.
  21. [21] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326.
  22. [22] Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: Tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.
  23. [23] Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. Monkey: Image resolution and text label are important things for large multi-modal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26763–26773.
  24. [24] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.
  25. [25] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
  26. [26] Mianxin Liu, Weiguo Hu, Jinru Ding, Jie Xu, Xiaoyang Li, Lifeng Zhu, Zhian Bai, Xiaoming Shi, Benyou Wang, Haitao Song, et al. MedBench: A comprehensive, standardized, and reliable benchmarking system for evaluating Chinese medical large language models. Big Data Mining and Analytics, 7(4):1116–1128, 2024.
  27. [27] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: Is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–.
  28. [28] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, Yaofeng Sun, Chengqi Deng, Hanwei Xu, Zhenda Xie, and Chong Ruan. DeepSeek-VL: Towards real-world vision-language understanding, 2024.
  29. [29] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS), 2022.
  30. [30] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. arXiv preprint arXiv:2310.02255, 2023.
  31. [31] Dario Maio, Davide Maltoni, Raffaele Cappelli, James L. Wayman, and Anil K. Jain. FVC2000: Fingerprint verification competition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):402–412, 2002.
  32. [32] Dario Maio, Davide Maltoni, Raffaele Cappelli, James L Wayman, and Anil K Jain. FVC2002: Second fingerprint verification competition. In 2002 International Conference on Pattern Recognition, pages 811–814. IEEE, 2002.
  33. [33] Dario Maio, Davide Maltoni, Raffaele Cappelli, Jim L Wayman, and Anil K Jain. FVC2004: Third fingerprint verification competition. In International Conference on Biometric Authentication, pages 1–7. Springer.
  34. [34] Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. ChartQA: A benchmark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244.
  35. [35] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. DocVQA: A dataset for VQA on document images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2200–2209.
  36. [36] Kartik Narayan, Vibashan VS, and Vishal M Patel. FaceXBench: Evaluating multimodal LLMs on face understanding. arXiv preprint arXiv:2501.10360, 2025.
  37. [37] Dinh-Luan Nguyen, Kai Cao, and Anil K Jain. Robust minutiae extractor: Integrating deep networks and fingerprint domain knowledge. In 2018 International Conference on Biometrics (ICB), pages 9–16. IEEE.
  38. [38] NIST. SWGFAST: Document #10 standards for examining friction ridge impressions and resulting conclusions (latent/tenprint). https://www.nist.gov/system/files/documents/2016/10/26/swgfast_examinations-conclusions_2.0_130427.pdf, 2016. Accessed: 2025-06-30.
  39. [39] NIST. SWGFAST: Document #9 standard for the documentation of analysis, comparison, evaluation, and verification (ACE-V) in tenprint operations (tenprint). https://www.nist.gov/system/files/documents/2016/10/26/swgfast_standard-documentation-ace-v-tenprint_2.0_121124.pdf, 2016. Accessed: 2025-06-30.
  40. [40] OpenAI. GPT-5 system card. Technical report, OpenAI, 2025.
  41. [41] Sara Sarto, Marcella Cornia, and Rita Cucchiara. Image captioning evaluation in the age of multimodal LLMs: Challenges and future perspectives. arXiv preprint arXiv:2503.14604, 2025.
  42. [42] Hatef Otroshi Shahreza and Sébastien Marcel. FaceLLM: A multimodal large language model for face understanding. arXiv preprint arXiv:2507.10300.
  43. [43] Yichen Shi, Yuhao Gao, Yingxin Lai, Hongyang Wang, Jun Feng, Lei He, Jun Wan, Changsheng Chen, Zitong Yu, and Xiaochun Cao. SHIELD: An evaluation benchmark for face spoofing and forgery detection with multimodal large language models. Visual Intelligence, 3(1):9, 2025.
  44. [44] Redwan Sony, Parisa Farmanifard, Hamzeh Alzwairy, Nitish Shukla, and Arun Ross. Benchmarking foundation models for zero-shot biometric tasks. arXiv preprint arXiv:2505.24214, 2025.
  45. [45] Haomiao Sun, Mingjie He, Tianheng Lian, Hu Han, and Shiguang Shan. Face-MLLM: A large face perception model. arXiv preprint arXiv:2410.20717, 2024.
  46. [46] Elham Tabassi, Martin Olsen, Oliver Bausinger, Christoph Busch, Andrew Figlarz, Gregory Fiumara, Olaf Henniger, Johannes Merkle, Timo Ruhland, Christopher Schiel, et al. NIST Fingerprint Image Quality 2. 2021.
  47. [47] Ai Takahashi, Yoshinori Koda, Koichi Ito, and Takafumi Aoki. Fingerprint feature extraction by combining texture, minutiae, and frequency spectrum using multi-task CNN. In 2020 IEEE International Joint Conference on Biometrics (IJCB), pages 1–8. IEEE, 2020.
  48. [48] Saraansh Tandon and Anoop Namboodiri. Transformer based fingerprint feature extraction. In 2022 26th International Conference on Pattern Recognition (ICPR), pages 870–876. IEEE, 2022.
  49. [49] Yao Tang, Fei Gao, Jufu Feng, and Yuhang Liu. FingerNet: An unified deep network for fingerprint minutiae extraction. In 2017 IEEE International Joint Conference on Biometrics (IJCB), pages 108–116. IEEE.
  50. [50] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
  51. [51] Gemma Team. Gemma 3. 2025.
  52. [52] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295, 2024.
  53. [53] Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118, 2024.
  54. [54] David Temoshok, Diana Proud-Madruga, Yee-Yin Choong, Ryan Galluzzo, Sarbari Gupta, Connie LaSalle, Naomi Lefkovitz, and Andrew Regenscheid. Digital identity guidelines. NIST Special Publication NIST SP 800-63-4, 2025.
  55. [55] Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Document collection visual question answering. In International Conference on Document Analysis and Recognition, pages 778–792. Springer, 2021.
  56. [56] Jiaqi Wang, Hanqi Jiang, Yiheng Liu, Chong Ma, Xu Zhang, Yi Pan, Mengyuan Liu, Peiran Gu, Sichen Xia, Wenjun Li, et al. A comprehensive review of multimodal large language models: Performance and challenges across different tasks. arXiv preprint arXiv:2408.01319, 2024.
  57. [57] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.
  58. [58] Bohao Xing, Zitong Yu, Xin Liu, Kaishen Yuan, Qilang Ye, Weicheng Xie, Huanjing Yue, Jingyu Yang, and Heikki Kälviäinen. Emo-LLaMA: Enhancing facial emotion understanding with instruction tuning. arXiv preprint arXiv:2408.11424, 2024.
  59. [59] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, 11(12):nwae403, 2024.
  60. [60] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024.
  61. [61] Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415.
  62. [62] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.