Bridging the Modality Gap in Forensic Image Retrieval

Annette Morales-Gonz\'alez; Heydi M\'endez-V\'azquez; Milton Garc\'ia-Borroto; Ricardo Gonz\'alez-Gazapo; Yoanna Mart\'inez-D\'iaz

arxiv: 2606.12294 · v1 · pith:TIO4XRY5new · submitted 2026-06-10 · 💻 cs.CV · eess.IV

Bridging the Modality Gap in Forensic Image Retrieval

Ricardo Gonz\'alez-Gazapo , Annette Morales-Gonz\'alez , Yoanna Mart\'inez-D\'iaz , Heydi M\'endez-V\'azquez , Milton Garc\'ia-Borroto This is my paper

Pith reviewed 2026-06-27 10:04 UTC · model grok-4.3

classification 💻 cs.CV eess.IV

keywords forensic image retrievalmultimodal retrievaltattoo retrievalface sketch retrievalMLLMmodality fusionsentence transformerssketch-based retrieval

0 comments

The pith

Fusing text descriptions from an MLLM with visual embeddings raises retrieval precision on forensic tattoos, sketches and faces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified framework that handles four forensic retrieval tasks by generating textual descriptions for both queries and gallery images using a multimodal large language model, then embedding those texts with sentence transformers for comparison. It tests visual-only embeddings, text-only embeddings, and a fusion of the two similarity scores using task-specific visual extractors. The central result is that the fused approach delivers higher precision and greater robustness than single modalities, especially when images are incomplete, sketched, or described only in words. A sympathetic reader would care because forensic work frequently starts from noisy or non-visual evidence rather than clean photographs. The work therefore positions modern language models as a practical bridge for these modality gaps.

Core claim

The paper claims that a multimodal retrieval pipeline, which automatically produces structured textual descriptions via an MLLM for all inputs and then fuses text-based and image-based similarity scores, consistently improves retrieval precision and robustness over visual-only or text-only baselines across tattoo image retrieval, text-guided tattoo retrieval, sketch-guided tattoo retrieval, and forensic face sketch retrieval.

What carries the argument

The multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task.

If this is right

Tattoo retrieval from witness descriptions or hand-drawn sketches becomes more reliable than visual matching alone.
Face retrieval from forensic sketches gains robustness when text descriptions supplement the sketch.
A single pipeline can address multiple evidence types without task-specific redesign.
Tasks that traditionally require manual expert comparison can be partially automated.
Retrieval performance holds up better under noisy or partial visual input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Investigators could query databases using only a verbal description without needing a reference image.
The same fusion approach might extend to other forensic evidence such as clothing or vehicle descriptions.
Integration into existing case-management systems could shorten the time between receiving a witness statement and generating candidate matches.
Performance on very large galleries would need separate validation to confirm the observed gains scale.

Load-bearing premise

Automatically generated textual descriptions from the MLLM accurately and consistently capture the forensic-relevant features needed for effective text-based retrieval.

What would settle it

A controlled test set in which MLLM descriptions systematically omit or misrepresent key distinguishing marks, after which the fused scores show no gain or a drop in precision relative to visual-only retrieval.

Figures

Figures reproduced from arXiv: 2606.12294 by Annette Morales-Gonz\'alez, Heydi M\'endez-V\'azquez, Milton Garc\'ia-Borroto, Ricardo Gonz\'alez-Gazapo, Yoanna Mart\'inez-D\'iaz.

**Figure 1.** Figure 1: Overview of the proposed multimodal forensic retrieval framework, illustrating the visual and textual branches used to generate embeddings from different types of queries, including tattoo images, sketches, facial sketches, and textual descriptions. • Bunny-v1.1-4B [14]. A compact 4-billion-parameter multimodal model built on a SigLIP vision encoder and a Phi-3-mini language backbone, enhanced with an S-Wr… view at source ↗

**Figure 2.** Figure 2: Samples of the four datasets employed in our experimental evaluation. with observed difference Δobs = Rank@𝐾(𝐴) − Rank@𝐾(𝐵). (5) Mean Average Precision (mAP). Mean Average Precision evaluates the overall quality of the ranking by considering the exact position of the relevant item. Since each query has a single relevant item, the average precision for query 𝑖 reduces to AP𝑖 = 1 𝑟𝑖 . (6) Thus, mAP(𝐴) = 1 𝑄… view at source ↗

**Figure 3.** Figure 3: Rank@K performance comparison on WebTattoo. Combining TattTRN visual features with MLLM-generated textual descriptions consistently improves retrieval accuracy across all ranks [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗

**Figure 4.** Figure 4: illustrates the performance gain across rank thresholds. The multimodal approach consistently outperforms the visual-only baseline at every level, ensuring that relevant matches are not pushed deeper into the retrieval list. 0 20 40 60 80 100 k 0.5 0.6 0.7 0.8 0.9 1.0 rank@k TattTRN TattTRN + MPNet_Bunny TattTRN + MPNet_Deepseek_VL2Tiny TattTRN + MPNet_Qwen2_5_7B_1 TattTRN + MPNet_Qwen [PITH_FULL_IMAGE:fi… view at source ↗

**Figure 5.** Figure 5: Comparison of standalone visual performance (ResNet50) and its combination with various MLLMs (Prompt 3) in terms of mAP [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗

read the original abstract

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Applies standard MLLM description generation plus score fusion to four forensic retrieval tasks but supplies no metrics or validation of the generated text, so the fusion claim stays untested.

read the letter

The paper takes existing pieces—MLLM-generated structured descriptions, sentence-transformer embeddings, and a simple fusion of visual and text similarity scores—and runs them on four forensic scenarios: tattoo image retrieval, text-described tattoo retrieval, sketch-to-tattoo retrieval, and sketch-to-face retrieval. It reports that the fused system works better than either modality alone, especially when the visual input is partial or noisy.

What it does is map those real investigative situations onto a single pipeline and show that modern off-the-shelf components can be repurposed without new algorithms. That framing is reasonable for an applied engineering note.

The soft spots are exactly where the stress-test note says. The abstract asserts consistent gains from fusion but gives no numbers, no datasets, no baselines, and no ablation. More importantly, there is no check on whether the MLLM descriptions actually capture the forensic features that matter—tattoo motifs, facial landmarks, or the details a witness would mention. Without human validation or a side-by-side with expert text on the same queries, any improvement could be an artifact of the prompt or the fusion rule rather than genuine modality bridging. If the full paper contains those controls and the numbers hold up, the practical value rises; otherwise it remains a workflow sketch.

This is for people who build or evaluate tools for law-enforcement image databases. A reader hunting for new multimodal methods or large-scale benchmarks will not find them. It is worth sending to peer review because the tasks are concrete and the setup is easy to test once the missing quantitative evidence and validation steps are added.

Referee Report

2 major / 0 minor

Summary. The paper presents a unified multimodal retrieval framework for four forensic tasks: tattoo retrieval from images, from expert textual descriptions, from hand-drawn sketches, and face retrieval from forensic sketches. It uses an MLLM to auto-generate structured textual descriptions for queries and gallery items, sentence-transformer embeddings for text-based retrieval, and score-level fusion with task-specific visual CNN embeddings. The central claim is that multimodal fusion consistently improves retrieval precision and robustness over visual-only or text-only baselines, especially when visual information is limited or noisy.

Significance. If the MLLM-generated descriptions prove reliable and the reported gains hold under rigorous controls, the work could offer a practical pipeline for forensic workflows that currently rely on manual expert matching of tattoos and sketches. The unified treatment of four distinct tasks and the emphasis on real-world noisy inputs are strengths, but the absence of any validation for the generated text descriptions makes the significance conditional on untested assumptions.

major comments (2)

[Evaluation / Results] The headline result (fusion improves precision and robustness) is load-bearing on the assumption that MLLM-generated descriptions accurately capture the same discriminative forensic attributes used by visual CNNs or human experts. No human validation, inter-annotator agreement, or ablation replacing MLLM text with expert annotations is reported for any of the four tasks.
[Abstract] Abstract and §4 (presumed experimental section): the claim of 'consistent improvement' is stated without any quantitative metrics, datasets, baselines, or error analysis supplied in the provided abstract; the soundness of the fusion claim cannot be evaluated from the given material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on validation of MLLM descriptions and the presentation of quantitative claims. We address each major comment below and outline planned revisions.

read point-by-point responses

Referee: [Evaluation / Results] The headline result (fusion improves precision and robustness) is load-bearing on the assumption that MLLM-generated descriptions accurately capture the same discriminative forensic attributes used by visual CNNs or human experts. No human validation, inter-annotator agreement, or ablation replacing MLLM text with expert annotations is reported for any of the four tasks.

Authors: We agree this is a substantive gap: the manuscript does not report human validation, inter-annotator agreement, or expert-annotation ablations for the MLLM text. The reported fusion gains are empirical but rest on the untested assumption that the generated descriptions align with forensic-discriminative attributes. In revision we will add a dedicated qualitative analysis subsection with side-by-side examples of MLLM outputs versus image content for each task, plus an explicit limitations paragraph acknowledging the lack of expert validation and outlining how such studies could be conducted in future work. revision: yes
Referee: [Abstract] Abstract and §4 (presumed experimental section): the claim of 'consistent improvement' is stated without any quantitative metrics, datasets, baselines, or error analysis supplied in the provided abstract; the soundness of the fusion claim cannot be evaluated from the given material.

Authors: The abstract is written as a high-level summary following standard length constraints and therefore omits specific numbers. All quantitative results (precision@K improvements, datasets, baselines, and error analysis) appear in the experimental section. We will revise the abstract to incorporate representative quantitative findings (e.g., average precision gains under fusion) while remaining within length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of off-the-shelf components

full rationale

The manuscript presents an empirical comparison of visual-only, text-only, and fused retrieval pipelines across four forensic tasks. It applies an existing MLLM for description generation, a sentence-transformer for text embeddings, and task-specific visual CNNs, then reports standard retrieval metrics. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation of results. The central claim (fusion improves performance) is supported by direct experimental outcomes rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the approach depends on off-the-shelf MLLMs and embedding models whose reliability for forensic text generation is assumed but not detailed here.

pith-pipeline@v0.9.1-grok · 5833 in / 1091 out tokens · 35763 ms · 2026-06-27T10:04:25.890365+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

41 extracted references · 16 canonical work pages

[1]

Criminalistic Human Identification from Scar And Tattoo Marks

Balan, L., 2020. Criminalistic Human Identification from Scar And Tattoo Marks. European Journal of Law and Public Administration 7, 61–67

2020
[2]

Sketch-Based Multimodal Image Retrieval Using Deep Learning

Berno, B.C.S., 2021. Sketch-Based Multimodal Image Retrieval Using Deep Learning. Ph.D. thesis. Universidade Tecnológica Federal do Paraná, Curitiba. Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial

2021
[3]

Tattooinformationcreation:Towardsaholisticunderstandingoftattooinformationexperience

Campbell-Meier,J.,Krtalić,M.,2022. Tattooinformationcreation:Towardsaholisticunderstandingoftattooinformationexperience. Library and Information Science Research 44, 101161–101161. doi:doi.org/10.1016/j.lisr.2022.101161

work page doi:10.1016/j.lisr.2022.101161 2022
[4]

Training-freein-contextforensicchainforimagemanipulation detection and localization

Chen,R.,Liu,B.,Miao,C.,Wang,X.,Li,Y.,Gong,T.,Chu,Q.,Yu,N.,2025. Training-freein-contextforensicchainforimagemanipulation detection and localization. URL:https://arxiv.org/abs/2510.10111, arXiv:2510.10111

arXiv 2025
[5]

Mobilefacenets:Efficientcnnsforaccuratereal-timefaceverificationonmobiledevices,in:Chinese conference on biometric recognition, Springer

Chen,S.,Liu,Y.,Gao,X.,Han,Z.,2018. Mobilefacenets:Efficientcnnsforaccuratereal-timefaceverificationonmobiledevices,in:Chinese conference on biometric recognition, Springer. pp. 428–438

2018
[6]

Big Data and Cognitive Computing 9

Colangelo,M.T.,Meleti,M.,Guizzardi,S.,Calciolari,E.,Galli,C.,2025.Acomparativeanalysisofsentencetransformermodelsforautomated journal recommendation using pubmed metadata. Big Data and Cognitive Computing 9. URL:https://www.mdpi.com/2504-2289/9/ 3/67, doi:10.3390/bdcc9030067

work page doi:10.3390/bdcc9030067 2025
[7]

Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Deng, J., Guo, J., Xue, N., Zafeiriou, S., 2019. Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699. First Author et al.:Preprint submitted to Elsevier Page 21 of 23 Bridging the Modality Gap in Forensic Image Retrieval

2019
[8]

A survey on composed image retrieval

Du, L., Deng, S., Li, Y., Li, J., Tian, Q., 2025. A survey on composed image retrieval. ACM Trans. Multimedia Comput. Commun. Appl. doi:10.1145/3723879

work page doi:10.1145/3723879 2025
[9]

A large-scale software-generated face composite sketch database, in: 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pp

Galea, C., Farrugia, R.A., 2016. A large-scale software-generated face composite sketch database, in: 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pp. 1–5. doi:10.1109/BIOSIG.2016.7736902

work page doi:10.1109/biosig.2016.7736902 2016
[10]

González-Gazapo, R., Morales-González, A., Méndez-Vázquez, H., García-Borroto, M., 2025. Comprehensive evaluation of multimodal large language models for tattoo identification, in: 28th Iberoamerican Congress on Pattern Recognition, CIARP 2025, Bogotá, Colombia, November 25-28, 2025, Proceedings, Accepted to be published by Springer-Verlag

2025
[11]

Tatttrn: Template reconstruction network for tattoo retrieval, in: Proc

Gonzalez-Soler, L., Salwowski, M., Rathgeb, C., Fischer, D., 2024. Tatttrn: Template reconstruction network for tattoo retrieval, in: Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)

2024
[12]

Tattoo based identification: Sketch to image matching

Han, H., Jain, A.K., 2013. Tattoo based identification: Sketch to image matching. 2013 International Conference on Biometrics (ICB) , 1–8URL:https://api.semanticscholar.org/CorpusID:18515563

2013
[13]

Tattoo image search at scale: Joint detection and compact representation learning

Han, H., Li, J., Jain, A.K., Shan, S., Chen, X., 2019. Tattoo image search at scale: Joint detection and compact representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2333–2348. doi:10.1109/TPAMI.2019.2891584

work page doi:10.1109/tpami.2019.2891584 2019
[14]

Efficient multimodal learning from data-centric perspective

He, M., Liu, Y., Wu, B., Yuan, J., Wang, Y., Huang, T., Zhao, B., 2024. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530

arXiv 2024
[15]

Vlforgery face triad: Detection, localization and attribution via multimodal large language models

He, X., Zhou, Y., Fan, B., Li, B., Zhu, G., Ding, F., 2025. Vlforgery face triad: Detection, localization and attribution via multimodal large language models. URL:https://arxiv.org/abs/2503.06142, arXiv:2503.06142

arXiv 2025
[16]

Tattoo recognition technology is gaining acceptance as a crime-solving technique

Hodge, Jr., S.D., Meehan, J., 2021. Tattoo recognition technology is gaining acceptance as a crime-solving technique. Northern Illinois University Law Review 42

2021
[17]

Ontheforensicvalueoftattoosforhumanidentification–aliteraturereview

Ibrahim,E.,Canal,R.,Silva,R.F.,Heit,O.F.J.,Franco,A.,2024. Ontheforensicvalueoftattoosforhumanidentification–aliteraturereview. Revista Brasileira de Odontologia Legal 11

2024
[18]

Vision-by-language for training-free compositional image retrieval

Karthik, S., Roth, K., Mancini, M., Akata, Z., 2024. Vision-by-language for training-free compositional image retrieval. International Conference on Learning Representations (ICLR)

2024
[19]

Face photo retrieval by sketch example, in: Proceedings of the 20th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA

Kiani Galoogahi, H., Sim, T., 2012. Face photo retrieval by sketch example, in: Proceedings of the 20th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA. p. 949–952. URL:https://doi.org/10.1145/2393347. 2396354, doi:10.1145/2393347.2396354

work page doi:10.1145/2393347 2012
[20]

Kim, J., Pozo, A.P., Yue, J., Li, H., Delp, E.J., 2015. Robust local and global shape context for tattoo image matching, in: 2015 IEEE International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015, IEEE. pp. 2194–2198. doi:10.1109/ICIP.2015.7351190

work page doi:10.1109/icip.2015.7351190 2015
[21]

Bridging modalities: A survey of cross-modal image-text retrieval

Li, T., Kong, L., Yang, X., Wang, B., Xu, J., 2024. Bridging modalities: A survey of cross-modal image-text retrieval. Chinese Journal of Information Fusion 1, 79–92. URL:https://doi.org/10.62762/CJIF.2024.361895, doi:10.62762/CJIF.2024.361895

work page doi:10.62762/cjif.2024.361895 2024
[22]

Visual instruction tuning, in: NeurIPS

Liu, H., Li, C., Wu, Q., Lee, Y.J., 2023. Visual instruction tuning, in: NeurIPS

2023
[23]

Martindez-Diaz, Y., Luevano, L.S., Mendez-Vazquez, H., Nicolas-Diaz, M., Chang, L., Gonzalez-Mendoza, M., 2019. Shufflefacenet: A lightweight face architecture for efficient and highly-accurate face recognition, in: Proceedings of the IEEE/CVF international conference on computer vision workshops, pp. 0–0

2019
[24]

Bench- marking lightweight face architectures on specific face recognition scenarios

Martínez-Díaz, Y., Nicolás-Díaz, M., Méndez-Vázquez, H., Luevano, L.S., Chang, L., Gonzalez-Mendoza, M., Sucar, L.E., 2021. Bench- marking lightweight face architectures on specific face recognition scenarios. Artif. Intell. Rev. 54, 6201–6244. URL:https://doi.org/ 10.1007/s10462-021-09974-2, doi:10.1007/s10462-021-09974-2

work page doi:10.1007/s10462-021-09974-2 2021
[25]

Morales-González,A.,Méndez-Vázquez,H.,García-Borroto,M.,2025. Multimodaltattoorecognitionbycombiningvisualfeaturesandllm- generated captions, in: IX International Congress on Artificial Intelligence and Pattern Recognition, IWAIPR 2025, Varadero, Cuba, October 14-17, 2025, Proceedings, Accepted to be published by Springer-Verlag

2025
[26]

Nicolás-Díaz, M., Morales-González, A., Méndez-Vázquez, H., 2019. Deep generic features for tattoo identification, in: Progress in Pattern Recognition,ImageAnalysis,ComputerVision,andApplications:24thIberoamericanCongress,CIARP2019,Havana,Cuba,October28-31, 2019, Proceedings, Springer-Verlag, Berlin, Heidelberg. p. 272–282

2019
[27]

Weighted average pooling of deep features for tattoo identification

Nicolás-Díaz, M., Morales-González, A., Méndez-Vázquez, H., 2022. Weighted average pooling of deep features for tattoo identification. Multimedia Tools Appl. 81, 25853–25875

2022
[28]

Approach for tattoo detection and identification based on yolov5 and similarity distance

Pocevič˙e, G., Stefanovič, P., Ramanauskait˙e, S., Pavlov, E., 2024. Approach for tattoo detection and identification based on yolov5 and similarity distance. Applied Sciences 14, 5576. doi:10.3390/app14135576

work page doi:10.3390/app14135576 2024
[29]

Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR. pp. 8748–8763

2021
[30]

A transfer learning approach for the tattoo classification problem, in: 2022 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp

da Silva, R.T., Silvério Lopes, H., 2022. A transfer learning approach for the tattoo classification problem, in: 2022 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6. doi:10.1109/LA-CCI54402.2022.9981650

work page doi:10.1109/la-cci54402.2022.9981650 2022
[31]

Benchmarking multimodal large language models for forensic science and medicine: A comprehensive dataset and evaluation framework

Sohail, A., Patel, O.M., Choi, J., Venditti, J.C.S., Wu, A.J., 2025. Benchmarking multimodal large language models for forensic science and medicine: A comprehensive dataset and evaluation framework. medRxiv URL: https: //www.medrxiv.org/content/early/2025/07/07/2025.07.06.25330972, doi: 10.1101/2025.07.06.25330972, arXiv:https://www.medrxiv.org/content/e...

work page doi:10.1101/2025.07.06.25330972 2025
[32]

Song, K., Tan, X., Qin, T., Lu, J., Liu, T.Y., 2020. Mpnet: masked and permuted pre-training for language understanding, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA

2020
[33]

Análisis forense de imágenes de tatuajes para identificación de individuos

Teruel Leyva, L., Lucena de Ustariz, M.E., 2025. Análisis forense de imágenes de tatuajes para identificación de individuos. RECIMUNDO 9, 238–251

2025
[34]

Artificialintelligenceinmobileforensics:Asurveyofcurrentstatus,ausecaseanalysis and ai alignment objectives

Vasilaras,A.,Papadoudis,N.,Rizomiliotis,P.,2024. Artificialintelligenceinmobileforensics:Asurveyofcurrentstatus,ausecaseanalysis and ai alignment objectives. Forensic Science International: Digital Investigation 49, 301737. URL:https://www.sciencedirect.com/ First Author et al.:Preprint submitted to Elsevier Page 22 of 23 Bridging the Modality Gap in Fore...

work page doi:10.1016/j.fsidi.2024.301737 2024
[35]

Localdeepfeaturesforcompositefacesketch recognition

Vazquez,H.M.,Becerra-Riera,F.,Morales-González,A.,López-Avila,L.,Tistarelli,M.,2019. Localdeepfeaturesforcompositefacesketch recognition. 2019 7th International Workshop on Biometrics and Forensics (IWBF) , 1–6URL:https://api.semanticscholar.org/ CorpusID:195222350

2019
[36]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J., 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191

Pith/arXiv arXiv 2024
[37]

Exploring the potential of large language models for improving digital forensic investigation efficiency

Wickramasekara, A., Breitinger, F., Scanlon, M., 2025. Exploring the potential of large language models for improving digital forensic investigation efficiency. Forensic Science International: Digital Investigation 52, 301859. URL:https://www.sciencedirect.com/ science/article/pii/S2666281724001860, doi:https://doi.org/10.1016/j.fsidi.2024.301859

work page doi:10.1016/j.fsidi.2024.301859 2025
[38]

Training-freezero-shotcomposedimageretrievalviaweightedmodalityfusionandsimilarity

Wu,R.D.,Lin,Y.Y.,Yang,H.F.,2024a. Training-freezero-shotcomposedimageretrievalviaweightedmodalityfusionandsimilarity. URL: https://arxiv.org/abs/2409.04918, arXiv:2409.04918

arXiv
[39]

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

Wu,Z.,Chen,X.,Pan,Z.,Liu,X.,Liu,W.,Dai,D.,Gao,H.,Ma,Y.,Wu,C.,Wang,B.,Xie,Z.,Wu,Y.,Hu,K.,Wang,J.,Sun,Y.,Li,Y.,Piao, Y., Guan, K., Liu, A., Xie, X., You, Y., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y., Ruan, C., 2024b. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. URL:https://arxiv.org/abs/2412.10302, ar...

Pith/arXiv arXiv
[40]

TattoosinForensics:Retrieval,DetectionandSynthesis

Xu,X.,2021. TattoosinForensics:Retrieval,DetectionandSynthesis. Ph.D.thesis.SchoolofComputerScienceandEngineering,Forensics and Security Lab. doi:10.32657/10356/151535

work page doi:10.32657/10356/151535 2021
[41]

Yang,Z.,Xue,D.,Qian,S.,Dong,W.,Xu,C.,2024.Ldre:Llm-baseddivergentreasoningandensembleforzero-shotcomposedimageretrieval, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA. p. 80–90. doi:10.1145/3626772.3657740. First Author et al.:Prep...

work page doi:10.1145/3626772.3657740 2024

[1] [1]

Criminalistic Human Identification from Scar And Tattoo Marks

Balan, L., 2020. Criminalistic Human Identification from Scar And Tattoo Marks. European Journal of Law and Public Administration 7, 61–67

2020

[2] [2]

Sketch-Based Multimodal Image Retrieval Using Deep Learning

Berno, B.C.S., 2021. Sketch-Based Multimodal Image Retrieval Using Deep Learning. Ph.D. thesis. Universidade Tecnológica Federal do Paraná, Curitiba. Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial

2021

[3] [3]

Tattooinformationcreation:Towardsaholisticunderstandingoftattooinformationexperience

Campbell-Meier,J.,Krtalić,M.,2022. Tattooinformationcreation:Towardsaholisticunderstandingoftattooinformationexperience. Library and Information Science Research 44, 101161–101161. doi:doi.org/10.1016/j.lisr.2022.101161

work page doi:10.1016/j.lisr.2022.101161 2022

[4] [4]

Training-freein-contextforensicchainforimagemanipulation detection and localization

Chen,R.,Liu,B.,Miao,C.,Wang,X.,Li,Y.,Gong,T.,Chu,Q.,Yu,N.,2025. Training-freein-contextforensicchainforimagemanipulation detection and localization. URL:https://arxiv.org/abs/2510.10111, arXiv:2510.10111

arXiv 2025

[5] [5]

Mobilefacenets:Efficientcnnsforaccuratereal-timefaceverificationonmobiledevices,in:Chinese conference on biometric recognition, Springer

Chen,S.,Liu,Y.,Gao,X.,Han,Z.,2018. Mobilefacenets:Efficientcnnsforaccuratereal-timefaceverificationonmobiledevices,in:Chinese conference on biometric recognition, Springer. pp. 428–438

2018

[6] [6]

Big Data and Cognitive Computing 9

Colangelo,M.T.,Meleti,M.,Guizzardi,S.,Calciolari,E.,Galli,C.,2025.Acomparativeanalysisofsentencetransformermodelsforautomated journal recommendation using pubmed metadata. Big Data and Cognitive Computing 9. URL:https://www.mdpi.com/2504-2289/9/ 3/67, doi:10.3390/bdcc9030067

work page doi:10.3390/bdcc9030067 2025

[7] [7]

Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Deng, J., Guo, J., Xue, N., Zafeiriou, S., 2019. Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699. First Author et al.:Preprint submitted to Elsevier Page 21 of 23 Bridging the Modality Gap in Forensic Image Retrieval

2019

[8] [8]

A survey on composed image retrieval

Du, L., Deng, S., Li, Y., Li, J., Tian, Q., 2025. A survey on composed image retrieval. ACM Trans. Multimedia Comput. Commun. Appl. doi:10.1145/3723879

work page doi:10.1145/3723879 2025

[9] [9]

A large-scale software-generated face composite sketch database, in: 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pp

Galea, C., Farrugia, R.A., 2016. A large-scale software-generated face composite sketch database, in: 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pp. 1–5. doi:10.1109/BIOSIG.2016.7736902

work page doi:10.1109/biosig.2016.7736902 2016

[10] [10]

González-Gazapo, R., Morales-González, A., Méndez-Vázquez, H., García-Borroto, M., 2025. Comprehensive evaluation of multimodal large language models for tattoo identification, in: 28th Iberoamerican Congress on Pattern Recognition, CIARP 2025, Bogotá, Colombia, November 25-28, 2025, Proceedings, Accepted to be published by Springer-Verlag

2025

[11] [11]

Tatttrn: Template reconstruction network for tattoo retrieval, in: Proc

Gonzalez-Soler, L., Salwowski, M., Rathgeb, C., Fischer, D., 2024. Tatttrn: Template reconstruction network for tattoo retrieval, in: Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)

2024

[12] [12]

Tattoo based identification: Sketch to image matching

Han, H., Jain, A.K., 2013. Tattoo based identification: Sketch to image matching. 2013 International Conference on Biometrics (ICB) , 1–8URL:https://api.semanticscholar.org/CorpusID:18515563

2013

[13] [13]

Tattoo image search at scale: Joint detection and compact representation learning

Han, H., Li, J., Jain, A.K., Shan, S., Chen, X., 2019. Tattoo image search at scale: Joint detection and compact representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2333–2348. doi:10.1109/TPAMI.2019.2891584

work page doi:10.1109/tpami.2019.2891584 2019

[14] [14]

Efficient multimodal learning from data-centric perspective

He, M., Liu, Y., Wu, B., Yuan, J., Wang, Y., Huang, T., Zhao, B., 2024. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530

arXiv 2024

[15] [15]

Vlforgery face triad: Detection, localization and attribution via multimodal large language models

He, X., Zhou, Y., Fan, B., Li, B., Zhu, G., Ding, F., 2025. Vlforgery face triad: Detection, localization and attribution via multimodal large language models. URL:https://arxiv.org/abs/2503.06142, arXiv:2503.06142

arXiv 2025

[16] [16]

Tattoo recognition technology is gaining acceptance as a crime-solving technique

Hodge, Jr., S.D., Meehan, J., 2021. Tattoo recognition technology is gaining acceptance as a crime-solving technique. Northern Illinois University Law Review 42

2021

[17] [17]

Ontheforensicvalueoftattoosforhumanidentification–aliteraturereview

Ibrahim,E.,Canal,R.,Silva,R.F.,Heit,O.F.J.,Franco,A.,2024. Ontheforensicvalueoftattoosforhumanidentification–aliteraturereview. Revista Brasileira de Odontologia Legal 11

2024

[18] [18]

Vision-by-language for training-free compositional image retrieval

Karthik, S., Roth, K., Mancini, M., Akata, Z., 2024. Vision-by-language for training-free compositional image retrieval. International Conference on Learning Representations (ICLR)

2024

[19] [19]

Face photo retrieval by sketch example, in: Proceedings of the 20th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA

Kiani Galoogahi, H., Sim, T., 2012. Face photo retrieval by sketch example, in: Proceedings of the 20th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA. p. 949–952. URL:https://doi.org/10.1145/2393347. 2396354, doi:10.1145/2393347.2396354

work page doi:10.1145/2393347 2012

[20] [20]

Kim, J., Pozo, A.P., Yue, J., Li, H., Delp, E.J., 2015. Robust local and global shape context for tattoo image matching, in: 2015 IEEE International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015, IEEE. pp. 2194–2198. doi:10.1109/ICIP.2015.7351190

work page doi:10.1109/icip.2015.7351190 2015

[21] [21]

Bridging modalities: A survey of cross-modal image-text retrieval

Li, T., Kong, L., Yang, X., Wang, B., Xu, J., 2024. Bridging modalities: A survey of cross-modal image-text retrieval. Chinese Journal of Information Fusion 1, 79–92. URL:https://doi.org/10.62762/CJIF.2024.361895, doi:10.62762/CJIF.2024.361895

work page doi:10.62762/cjif.2024.361895 2024

[22] [22]

Visual instruction tuning, in: NeurIPS

Liu, H., Li, C., Wu, Q., Lee, Y.J., 2023. Visual instruction tuning, in: NeurIPS

2023

[23] [23]

Martindez-Diaz, Y., Luevano, L.S., Mendez-Vazquez, H., Nicolas-Diaz, M., Chang, L., Gonzalez-Mendoza, M., 2019. Shufflefacenet: A lightweight face architecture for efficient and highly-accurate face recognition, in: Proceedings of the IEEE/CVF international conference on computer vision workshops, pp. 0–0

2019

[24] [24]

Bench- marking lightweight face architectures on specific face recognition scenarios

Martínez-Díaz, Y., Nicolás-Díaz, M., Méndez-Vázquez, H., Luevano, L.S., Chang, L., Gonzalez-Mendoza, M., Sucar, L.E., 2021. Bench- marking lightweight face architectures on specific face recognition scenarios. Artif. Intell. Rev. 54, 6201–6244. URL:https://doi.org/ 10.1007/s10462-021-09974-2, doi:10.1007/s10462-021-09974-2

work page doi:10.1007/s10462-021-09974-2 2021

[25] [25]

Morales-González,A.,Méndez-Vázquez,H.,García-Borroto,M.,2025. Multimodaltattoorecognitionbycombiningvisualfeaturesandllm- generated captions, in: IX International Congress on Artificial Intelligence and Pattern Recognition, IWAIPR 2025, Varadero, Cuba, October 14-17, 2025, Proceedings, Accepted to be published by Springer-Verlag

2025

[26] [26]

Nicolás-Díaz, M., Morales-González, A., Méndez-Vázquez, H., 2019. Deep generic features for tattoo identification, in: Progress in Pattern Recognition,ImageAnalysis,ComputerVision,andApplications:24thIberoamericanCongress,CIARP2019,Havana,Cuba,October28-31, 2019, Proceedings, Springer-Verlag, Berlin, Heidelberg. p. 272–282

2019

[27] [27]

Weighted average pooling of deep features for tattoo identification

Nicolás-Díaz, M., Morales-González, A., Méndez-Vázquez, H., 2022. Weighted average pooling of deep features for tattoo identification. Multimedia Tools Appl. 81, 25853–25875

2022

[28] [28]

Approach for tattoo detection and identification based on yolov5 and similarity distance

Pocevič˙e, G., Stefanovič, P., Ramanauskait˙e, S., Pavlov, E., 2024. Approach for tattoo detection and identification based on yolov5 and similarity distance. Applied Sciences 14, 5576. doi:10.3390/app14135576

work page doi:10.3390/app14135576 2024

[29] [29]

Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR. pp. 8748–8763

2021

[30] [30]

A transfer learning approach for the tattoo classification problem, in: 2022 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp

da Silva, R.T., Silvério Lopes, H., 2022. A transfer learning approach for the tattoo classification problem, in: 2022 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6. doi:10.1109/LA-CCI54402.2022.9981650

work page doi:10.1109/la-cci54402.2022.9981650 2022

[31] [31]

Benchmarking multimodal large language models for forensic science and medicine: A comprehensive dataset and evaluation framework

Sohail, A., Patel, O.M., Choi, J., Venditti, J.C.S., Wu, A.J., 2025. Benchmarking multimodal large language models for forensic science and medicine: A comprehensive dataset and evaluation framework. medRxiv URL: https: //www.medrxiv.org/content/early/2025/07/07/2025.07.06.25330972, doi: 10.1101/2025.07.06.25330972, arXiv:https://www.medrxiv.org/content/e...

work page doi:10.1101/2025.07.06.25330972 2025

[32] [32]

Song, K., Tan, X., Qin, T., Lu, J., Liu, T.Y., 2020. Mpnet: masked and permuted pre-training for language understanding, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA

2020

[33] [33]

Análisis forense de imágenes de tatuajes para identificación de individuos

Teruel Leyva, L., Lucena de Ustariz, M.E., 2025. Análisis forense de imágenes de tatuajes para identificación de individuos. RECIMUNDO 9, 238–251

2025

[34] [34]

Artificialintelligenceinmobileforensics:Asurveyofcurrentstatus,ausecaseanalysis and ai alignment objectives

Vasilaras,A.,Papadoudis,N.,Rizomiliotis,P.,2024. Artificialintelligenceinmobileforensics:Asurveyofcurrentstatus,ausecaseanalysis and ai alignment objectives. Forensic Science International: Digital Investigation 49, 301737. URL:https://www.sciencedirect.com/ First Author et al.:Preprint submitted to Elsevier Page 22 of 23 Bridging the Modality Gap in Fore...

work page doi:10.1016/j.fsidi.2024.301737 2024

[35] [35]

Localdeepfeaturesforcompositefacesketch recognition

Vazquez,H.M.,Becerra-Riera,F.,Morales-González,A.,López-Avila,L.,Tistarelli,M.,2019. Localdeepfeaturesforcompositefacesketch recognition. 2019 7th International Workshop on Biometrics and Forensics (IWBF) , 1–6URL:https://api.semanticscholar.org/ CorpusID:195222350

2019

[36] [36]

Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J., 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191

Pith/arXiv arXiv 2024

[37] [37]

Exploring the potential of large language models for improving digital forensic investigation efficiency

Wickramasekara, A., Breitinger, F., Scanlon, M., 2025. Exploring the potential of large language models for improving digital forensic investigation efficiency. Forensic Science International: Digital Investigation 52, 301859. URL:https://www.sciencedirect.com/ science/article/pii/S2666281724001860, doi:https://doi.org/10.1016/j.fsidi.2024.301859

work page doi:10.1016/j.fsidi.2024.301859 2025

[38] [38]

Training-freezero-shotcomposedimageretrievalviaweightedmodalityfusionandsimilarity

Wu,R.D.,Lin,Y.Y.,Yang,H.F.,2024a. Training-freezero-shotcomposedimageretrievalviaweightedmodalityfusionandsimilarity. URL: https://arxiv.org/abs/2409.04918, arXiv:2409.04918

arXiv

[39] [39]

Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

Wu,Z.,Chen,X.,Pan,Z.,Liu,X.,Liu,W.,Dai,D.,Gao,H.,Ma,Y.,Wu,C.,Wang,B.,Xie,Z.,Wu,Y.,Hu,K.,Wang,J.,Sun,Y.,Li,Y.,Piao, Y., Guan, K., Liu, A., Xie, X., You, Y., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y., Ruan, C., 2024b. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. URL:https://arxiv.org/abs/2412.10302, ar...

Pith/arXiv arXiv

[40] [40]

TattoosinForensics:Retrieval,DetectionandSynthesis

Xu,X.,2021. TattoosinForensics:Retrieval,DetectionandSynthesis. Ph.D.thesis.SchoolofComputerScienceandEngineering,Forensics and Security Lab. doi:10.32657/10356/151535

work page doi:10.32657/10356/151535 2021

[41] [41]

Yang,Z.,Xue,D.,Qian,S.,Dong,W.,Xu,C.,2024.Ldre:Llm-baseddivergentreasoningandensembleforzero-shotcomposedimageretrieval, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA. p. 80–90. doi:10.1145/3626772.3657740. First Author et al.:Prep...

work page doi:10.1145/3626772.3657740 2024