Recognition: 2 theorem links
FPBench: A Comprehensive Benchmark of Multimodal Large Language Models for Fingerprint Analysis
Pith reviewed 2026-05-16 20:25 UTC · model grok-4.3
The pith
FPBench tests 20 multimodal LLMs on eight fingerprint tasks across seven datasets and shows fine-tuning vision and language encoders raises performance by 7 to 39 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FPBench is a new benchmark that evaluates twenty multimodal LLMs on eight biometric and forensic tasks such as pattern analysis, fingerprint verification, and real-versus-synthetic classification, using seven real and synthetic fingerprint datasets under zero-shot and chain-of-thought prompting. Fine-tuning the vision and language encoders on open-source models improves performance by 7 to 39 percent across the tasks. The benchmark is positioned as a first step toward developing foundation models specialized for fingerprints.
What carries the argument
The FPBench benchmark itself, defined by its eight tasks applied to seven fingerprint datasets to test multimodal LLMs on fine structural and textural details.
If this is right
- Fine-tuned MLLMs become viable tools for fingerprint pattern classification and verification tasks.
- Domain adaptation through encoder fine-tuning works for the fine textures specific to biometric images.
- Standardized benchmarks like FPBench allow direct comparison of open-source and proprietary models on forensic tasks.
- Open-source MLLMs can be improved enough to narrow the gap with proprietary models in this domain.
- The results support the feasibility of building specialized foundation models for fingerprint analysis.
Where Pith is reading between the lines
- The same fine-tuning approach could be tested on other biometrics such as iris or palm prints.
- Forensic workflows might use these adapted models to handle partial or degraded prints not fully represented in the current datasets.
- The performance lift indicates that general MLLMs underperform on biometric textures because of limited exposure during pretraining.
- Future evaluations could add tasks like aging simulation or cross-sensor matching to stress-test the benchmark.
Load-bearing premise
The eight chosen tasks and seven datasets adequately cover the range of real-world fingerprint challenges in forensics and biometrics.
What would settle it
A new collection of fingerprint images from unseen sensors or populations on which the fine-tuned models show no accuracy gain, or fall below their zero-shot baselines, would falsify the reported benefit of fine-tuning.
Original abstract
Multimodal LLMs (MLLMs) are capable of performing complex data analysis, visual question answering, generation, and reasoning tasks. However, their ability to analyze biometric data is relatively underexplored. In this work, we investigate the effectiveness of MLLMs in understanding fine structural and textural details present in fingerprint images. To this end, we design a comprehensive benchmark, FPBench, to evaluate 20 MLLMs (open-source and proprietary models) across 7 real and synthetic datasets on a suite of 8 biometric and forensic tasks (e.g., pattern analysis, fingerprint verification, real versus synthetic classification, etc.) using zero-shot and chain-of-thought prompting strategies. We further fine-tune vision and language encoders on a subset of open-source MLLMs to demonstrate domain adaptation. FPBench is a novel benchmark designed as a first step towards developing foundation models in fingerprints. Our findings indicate fine-tuning of vision and language encoders improves the performance by 7%-39%. Our codes are available at https://github.com/Ektagavas/FPBench.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces FPBench, a benchmark for evaluating 20 multimodal large language models (MLLMs) on fingerprint analysis. It tests these models across 8 biometric and forensic tasks (including pattern analysis, verification, and real-vs-synthetic classification) on 7 real and synthetic datasets using zero-shot and chain-of-thought prompting, then reports 7-39% performance gains after fine-tuning vision and language encoders on open-source MLLMs. The work positions FPBench as an initial step toward foundation models for fingerprints.
Significance. If the chosen tasks and datasets prove representative, the benchmark would provide a useful baseline for MLLM performance in biometrics and demonstrate the value of domain adaptation via fine-tuning. The empirical nature of the study (no circular derivations) and public code release strengthen its potential utility, but the significance hinges on whether the 7-39% gains generalize beyond the selected collection.
major comments (2)
- [Abstract] The claim that FPBench is 'comprehensive' and that fine-tuning yields 7%-39% gains rests on the unverified assumption that the 8 tasks and 7 datasets adequately sample real-world forensic distributions (low-quality latents, partial prints, cross-sensor variation, aging). No analysis or justification of coverage is supplied, so the central performance claims cannot be assessed for generalizability.
- [Evaluation] The reported performance gains (implied by the summarized results) lack full specification of exact metrics, statistical significance tests, variance across runs, and precise prompting templates. Without these, the 7-39% improvement figures cannot be reproduced or compared reliably to future work.
minor comments (2)
- The GitHub repository link is provided, which supports reproducibility; ensure the release includes all prompting templates, fine-tuning hyperparameters, and dataset splits.
- Clarify the distinction between open-source and proprietary models in the results tables to avoid ambiguity in zero-shot vs. fine-tuned comparisons.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly to improve clarity, transparency, and reproducibility.
Point-by-point responses
-
Referee: [Abstract] The claim that FPBench is 'comprehensive' and that fine-tuning yields 7%-39% gains rests on the unverified assumption that the 8 tasks and 7 datasets adequately sample real-world forensic distributions (low-quality latents, partial prints, cross-sensor variation, aging). No analysis or justification of coverage is supplied, so the central performance claims cannot be assessed for generalizability.
Authors: We agree that the term 'comprehensive' requires qualification and that explicit discussion of coverage is needed. In the revised manuscript, we have updated the abstract to describe FPBench as 'a benchmark' (removing 'comprehensive') and added a new Limitations subsection in the Discussion that justifies task and dataset selection based on standard fingerprint biometrics literature while explicitly acknowledging gaps in coverage for low-quality latents, partial prints, cross-sensor variation, and aging. The 7-39% gains are presented as results on the chosen collection, with a statement that broader generalization requires additional validation. This provides the requested transparency without overstating scope. revision: partial
-
Referee: [Evaluation] The reported performance gains (implied by the summarized results) lack full specification of exact metrics, statistical significance tests, variance across runs, and precise prompting templates. Without these, the 7-39% improvement figures cannot be reproduced or compared reliably to future work.
Authors: We thank the referee for highlighting this reproducibility gap. The revised manuscript expands the Evaluation section to report exact per-task metrics (accuracy for classification tasks and equal error rate for verification), includes statistical significance testing (McNemar's test for paired comparisons), reports standard deviation over five independent fine-tuning runs, and provides all zero-shot and chain-of-thought prompting templates verbatim in a new Appendix B. The public GitHub repository has been updated with the full evaluation scripts and templates to support direct reproduction and comparison. revision: yes
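The metrics named in this response can be made concrete. As an illustration only (this is not the paper's evaluation code, and the score lists and disagreement counts below are hypothetical), here is a minimal sketch of an equal-error-rate computation for verification and an exact McNemar test for paired model comparisons:

```python
import math

def eer(genuine, impostor):
    """Approximate the equal error rate by sweeping candidate thresholds:
    at the EER operating point the false accept rate (impostor scores at
    or above the threshold) equals the false reject rate (genuine scores
    below it)."""
    best_gap, best_rate = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)
        frr = sum(s < t for s in genuine) / len(genuine)
        if abs(far - frr) < best_gap:
            best_gap, best_rate = abs(far - frr), (far + frr) / 2
    return best_rate

def mcnemar_exact(b, c):
    """Two-sided exact McNemar test on paired model outputs.
    b = items model A got right and model B got wrong; c = the reverse.
    Under the null hypothesis the discordant pairs follow
    Binomial(b + c, 0.5)."""
    n, k = b + c, min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical similarity scores (higher = more similar) and a
# hypothetical zero-shot vs fine-tuned disagreement count.
print(eer([0.9, 0.8, 0.7, 0.6], [0.1, 0.2, 0.3, 0.65]))  # → 0.25
print(mcnemar_exact(b=9, c=1))  # → 0.021484375
```

A p-value below 0.05 from the second call would indicate the fine-tuned model's per-item wins over its zero-shot counterpart are unlikely to be chance, which is the kind of evidence the referee asks to accompany the 7-39% figures.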
Circularity Check
Empirical benchmark with no derivations or self-referential reductions
Full rationale
The paper is a purely empirical benchmark study. It defines FPBench with 8 tasks across 7 datasets, evaluates 20 MLLMs in zero-shot and CoT settings, and reports measured accuracy gains from fine-tuning vision/language encoders (7-39%). No equations, fitted parameters, uniqueness theorems, or ansatzes are introduced. The central claims rest on direct experimental measurements against external datasets rather than any derivation that reduces to its own inputs by construction. No self-citations are used to justify load-bearing steps. This is the standard case of a self-contained empirical contribution.
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "We design a comprehensive benchmark, FPBench, to evaluate 20 MLLMs ... across 7 real and synthetic datasets on a suite of 8 biometric and forensic tasks (e.g., pattern analysis, fingerprint verification, real versus synthetic classification, etc.)"
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Linked passage: "fine-tuning of vision and language encoders improves the performance by 7%-39%"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.