pith. sign in

arxiv: 2606.12294 · v1 · pith:TIO4XRY5new · submitted 2026-06-10 · 💻 cs.CV · eess.IV

Bridging the Modality Gap in Forensic Image Retrieval

Pith reviewed 2026-06-27 10:04 UTC · model grok-4.3

classification 💻 cs.CV eess.IV
keywords forensic image retrievalmultimodal retrievaltattoo retrievalface sketch retrievalMLLMmodality fusionsentence transformerssketch-based retrieval
0
0 comments X

The pith

Fusing text descriptions from an MLLM with visual embeddings raises retrieval precision on forensic tattoos, sketches and faces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a unified framework that handles four forensic retrieval tasks by generating textual descriptions for both queries and gallery images using a multimodal large language model, then embedding those texts with sentence transformers for comparison. It tests visual-only embeddings, text-only embeddings, and a fusion of the two similarity scores using task-specific visual extractors. The central result is that the fused approach delivers higher precision and greater robustness than single modalities, especially when images are incomplete, sketched, or described only in words. A sympathetic reader would care because forensic work frequently starts from noisy or non-visual evidence rather than clean photographs. The work therefore positions modern language models as a practical bridge for these modality gaps.

Core claim

The paper claims that a multimodal retrieval pipeline, which automatically produces structured textual descriptions via an MLLM for all inputs and then fuses text-based and image-based similarity scores, consistently improves retrieval precision and robustness over visual-only or text-only baselines across tattoo image retrieval, text-guided tattoo retrieval, sketch-guided tattoo retrieval, and forensic face sketch retrieval.

What carries the argument

The multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task.

If this is right

  • Tattoo retrieval from witness descriptions or hand-drawn sketches becomes more reliable than visual matching alone.
  • Face retrieval from forensic sketches gains robustness when text descriptions supplement the sketch.
  • A single pipeline can address multiple evidence types without task-specific redesign.
  • Tasks that traditionally require manual expert comparison can be partially automated.
  • Retrieval performance holds up better under noisy or partial visual input.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Investigators could query databases using only a verbal description without needing a reference image.
  • The same fusion approach might extend to other forensic evidence such as clothing or vehicle descriptions.
  • Integration into existing case-management systems could shorten the time between receiving a witness statement and generating candidate matches.
  • Performance on very large galleries would need separate validation to confirm the observed gains scale.

Load-bearing premise

Automatically generated textual descriptions from the MLLM accurately and consistently capture the forensic-relevant features needed for effective text-based retrieval.

What would settle it

A controlled test set in which MLLM descriptions systematically omit or misrepresent key distinguishing marks, after which the fused scores show no gain or a drop in precision relative to visual-only retrieval.

Figures

Figures reproduced from arXiv: 2606.12294 by Annette Morales-Gonz\'alez, Heydi M\'endez-V\'azquez, Milton Garc\'ia-Borroto, Ricardo Gonz\'alez-Gazapo, Yoanna Mart\'inez-D\'iaz.

Figure 1
Figure 1. Figure 1: Overview of the proposed multimodal forensic retrieval framework, illustrating the visual and textual branches used to generate embeddings from different types of queries, including tattoo images, sketches, facial sketches, and textual descriptions. • Bunny-v1.1-4B [14]. A compact 4-billion-parameter multimodal model built on a SigLIP vision encoder and a Phi-3-mini language backbone, enhanced with an S-Wr… view at source ↗
Figure 2
Figure 2. Figure 2: Samples of the four datasets employed in our experimental evaluation. with observed difference Δobs = Rank@𝐾(𝐴) − Rank@𝐾(𝐵). (5) Mean Average Precision (mAP). Mean Average Precision evaluates the overall quality of the ranking by consider￾ing the exact position of the relevant item. Since each query has a single relevant item, the average precision for query 𝑖 reduces to AP𝑖 = 1 𝑟𝑖 . (6) Thus, mAP(𝐴) = 1 𝑄… view at source ↗
Figure 3
Figure 3. Figure 3: Rank@K performance comparison on WebTattoo. Combining TattTRN visual features with MLLM-generated textual descriptions consistently improves retrieval accuracy across all ranks [PITH_FULL_IMAGE:figures/full_fig_p014_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: illustrates the performance gain across rank thresholds. The multimodal approach consistently outperforms the visual-only baseline at every level, ensuring that relevant matches are not pushed deeper into the retrieval list. 0 20 40 60 80 100 k 0.5 0.6 0.7 0.8 0.9 1.0 rank@k TattTRN TattTRN + MPNet_Bunny TattTRN + MPNet_Deepseek_VL2Tiny TattTRN + MPNet_Qwen2_5_7B_1 TattTRN + MPNet_Qwen [PITH_FULL_IMAGE:fi… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison of standalone visual performance (ResNet50) and its combination with various MLLMs (Prompt 3) in terms of mAP [PITH_FULL_IMAGE:figures/full_fig_p019_5.png] view at source ↗
read the original abstract

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper presents a unified multimodal retrieval framework for four forensic tasks: tattoo retrieval from images, from expert textual descriptions, from hand-drawn sketches, and face retrieval from forensic sketches. It uses an MLLM to auto-generate structured textual descriptions for queries and gallery items, sentence-transformer embeddings for text-based retrieval, and score-level fusion with task-specific visual CNN embeddings. The central claim is that multimodal fusion consistently improves retrieval precision and robustness over visual-only or text-only baselines, especially when visual information is limited or noisy.

Significance. If the MLLM-generated descriptions prove reliable and the reported gains hold under rigorous controls, the work could offer a practical pipeline for forensic workflows that currently rely on manual expert matching of tattoos and sketches. The unified treatment of four distinct tasks and the emphasis on real-world noisy inputs are strengths, but the absence of any validation for the generated text descriptions makes the significance conditional on untested assumptions.

major comments (2)
  1. [Evaluation / Results] The headline result (fusion improves precision and robustness) is load-bearing on the assumption that MLLM-generated descriptions accurately capture the same discriminative forensic attributes used by visual CNNs or human experts. No human validation, inter-annotator agreement, or ablation replacing MLLM text with expert annotations is reported for any of the four tasks.
  2. [Abstract] Abstract and §4 (presumed experimental section): the claim of 'consistent improvement' is stated without any quantitative metrics, datasets, baselines, or error analysis supplied in the provided abstract; the soundness of the fusion claim cannot be evaluated from the given material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on validation of MLLM descriptions and the presentation of quantitative claims. We address each major comment below and outline planned revisions.

read point-by-point responses
  1. Referee: [Evaluation / Results] The headline result (fusion improves precision and robustness) is load-bearing on the assumption that MLLM-generated descriptions accurately capture the same discriminative forensic attributes used by visual CNNs or human experts. No human validation, inter-annotator agreement, or ablation replacing MLLM text with expert annotations is reported for any of the four tasks.

    Authors: We agree this is a substantive gap: the manuscript does not report human validation, inter-annotator agreement, or expert-annotation ablations for the MLLM text. The reported fusion gains are empirical but rest on the untested assumption that the generated descriptions align with forensic-discriminative attributes. In revision we will add a dedicated qualitative analysis subsection with side-by-side examples of MLLM outputs versus image content for each task, plus an explicit limitations paragraph acknowledging the lack of expert validation and outlining how such studies could be conducted in future work. revision: yes

  2. Referee: [Abstract] Abstract and §4 (presumed experimental section): the claim of 'consistent improvement' is stated without any quantitative metrics, datasets, baselines, or error analysis supplied in the provided abstract; the soundness of the fusion claim cannot be evaluated from the given material.

    Authors: The abstract is written as a high-level summary following standard length constraints and therefore omits specific numbers. All quantitative results (precision@K improvements, datasets, baselines, and error analysis) appear in the experimental section. We will revise the abstract to incorporate representative quantitative findings (e.g., average precision gains under fusion) while remaining within length limits. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation of off-the-shelf components

full rationale

The manuscript presents an empirical comparison of visual-only, text-only, and fused retrieval pipelines across four forensic tasks. It applies an existing MLLM for description generation, a sentence-transformer for text embeddings, and task-specific visual CNNs, then reports standard retrieval metrics. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation of results. The central claim (fusion improves performance) is supported by direct experimental outcomes rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review reveals no explicit free parameters, axioms, or invented entities; the approach depends on off-the-shelf MLLMs and embedding models whose reliability for forensic text generation is assumed but not detailed here.

pith-pipeline@v0.9.1-grok · 5833 in / 1091 out tokens · 35763 ms · 2026-06-27T10:04:25.890365+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

41 extracted references · 16 canonical work pages

  1. [1]

    Criminalistic Human Identification from Scar And Tattoo Marks

    Balan, L., 2020. Criminalistic Human Identification from Scar And Tattoo Marks. European Journal of Law and Public Administration 7, 61–67

  2. [2]

    Sketch-Based Multimodal Image Retrieval Using Deep Learning

    Berno, B.C.S., 2021. Sketch-Based Multimodal Image Retrieval Using Deep Learning. Ph.D. thesis. Universidade Tecnológica Federal do Paraná, Curitiba. Programa de Pós-Graduação em Engenharia Elétrica e Informática Industrial

  3. [3]

    Tattooinformationcreation:Towardsaholisticunderstandingoftattooinformationexperience

    Campbell-Meier,J.,Krtalić,M.,2022. Tattooinformationcreation:Towardsaholisticunderstandingoftattooinformationexperience. Library and Information Science Research 44, 101161–101161. doi:doi.org/10.1016/j.lisr.2022.101161

  4. [4]

    Training-freein-contextforensicchainforimagemanipulation detection and localization

    Chen,R.,Liu,B.,Miao,C.,Wang,X.,Li,Y.,Gong,T.,Chu,Q.,Yu,N.,2025. Training-freein-contextforensicchainforimagemanipulation detection and localization. URL:https://arxiv.org/abs/2510.10111, arXiv:2510.10111

  5. [5]

    Mobilefacenets:Efficientcnnsforaccuratereal-timefaceverificationonmobiledevices,in:Chinese conference on biometric recognition, Springer

    Chen,S.,Liu,Y.,Gao,X.,Han,Z.,2018. Mobilefacenets:Efficientcnnsforaccuratereal-timefaceverificationonmobiledevices,in:Chinese conference on biometric recognition, Springer. pp. 428–438

  6. [6]

    Big Data and Cognitive Computing 9

    Colangelo,M.T.,Meleti,M.,Guizzardi,S.,Calciolari,E.,Galli,C.,2025.Acomparativeanalysisofsentencetransformermodelsforautomated journal recommendation using pubmed metadata. Big Data and Cognitive Computing 9. URL:https://www.mdpi.com/2504-2289/9/ 3/67, doi:10.3390/bdcc9030067

  7. [7]

    Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp

    Deng, J., Guo, J., Xue, N., Zafeiriou, S., 2019. Arcface: Additive angular margin loss for deep face recognition, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 4690–4699. First Author et al.:Preprint submitted to Elsevier Page 21 of 23 Bridging the Modality Gap in Forensic Image Retrieval

  8. [8]

    A survey on composed image retrieval

    Du, L., Deng, S., Li, Y., Li, J., Tian, Q., 2025. A survey on composed image retrieval. ACM Trans. Multimedia Comput. Commun. Appl. doi:10.1145/3723879

  9. [9]

    A large-scale software-generated face composite sketch database, in: 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pp

    Galea, C., Farrugia, R.A., 2016. A large-scale software-generated face composite sketch database, in: 2016 International Conference of the Biometrics Special Interest Group (BIOSIG), pp. 1–5. doi:10.1109/BIOSIG.2016.7736902

  10. [10]

    González-Gazapo, R., Morales-González, A., Méndez-Vázquez, H., García-Borroto, M., 2025. Comprehensive evaluation of multimodal large language models for tattoo identification, in: 28th Iberoamerican Congress on Pattern Recognition, CIARP 2025, Bogotá, Colombia, November 25-28, 2025, Proceedings, Accepted to be published by Springer-Verlag

  11. [11]

    Tatttrn: Template reconstruction network for tattoo retrieval, in: Proc

    Gonzalez-Soler, L., Salwowski, M., Rathgeb, C., Fischer, D., 2024. Tatttrn: Template reconstruction network for tattoo retrieval, in: Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)

  12. [12]

    Tattoo based identification: Sketch to image matching

    Han, H., Jain, A.K., 2013. Tattoo based identification: Sketch to image matching. 2013 International Conference on Biometrics (ICB) , 1–8URL:https://api.semanticscholar.org/CorpusID:18515563

  13. [13]

    Tattoo image search at scale: Joint detection and compact representation learning

    Han, H., Li, J., Jain, A.K., Shan, S., Chen, X., 2019. Tattoo image search at scale: Joint detection and compact representation learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2333–2348. doi:10.1109/TPAMI.2019.2891584

  14. [14]

    Efficient multimodal learning from data-centric perspective

    He, M., Liu, Y., Wu, B., Yuan, J., Wang, Y., Huang, T., Zhao, B., 2024. Efficient multimodal learning from data-centric perspective. arXiv preprint arXiv:2402.11530

  15. [15]

    Vlforgery face triad: Detection, localization and attribution via multimodal large language models

    He, X., Zhou, Y., Fan, B., Li, B., Zhu, G., Ding, F., 2025. Vlforgery face triad: Detection, localization and attribution via multimodal large language models. URL:https://arxiv.org/abs/2503.06142, arXiv:2503.06142

  16. [16]

    Tattoo recognition technology is gaining acceptance as a crime-solving technique

    Hodge, Jr., S.D., Meehan, J., 2021. Tattoo recognition technology is gaining acceptance as a crime-solving technique. Northern Illinois University Law Review 42

  17. [17]

    Ontheforensicvalueoftattoosforhumanidentification–aliteraturereview

    Ibrahim,E.,Canal,R.,Silva,R.F.,Heit,O.F.J.,Franco,A.,2024. Ontheforensicvalueoftattoosforhumanidentification–aliteraturereview. Revista Brasileira de Odontologia Legal 11

  18. [18]

    Vision-by-language for training-free compositional image retrieval

    Karthik, S., Roth, K., Mancini, M., Akata, Z., 2024. Vision-by-language for training-free compositional image retrieval. International Conference on Learning Representations (ICLR)

  19. [19]

    Face photo retrieval by sketch example, in: Proceedings of the 20th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA

    Kiani Galoogahi, H., Sim, T., 2012. Face photo retrieval by sketch example, in: Proceedings of the 20th ACM International Conference on Multimedia, Association for Computing Machinery, New York, NY, USA. p. 949–952. URL:https://doi.org/10.1145/2393347. 2396354, doi:10.1145/2393347.2396354

  20. [20]

    Kim, J., Pozo, A.P., Yue, J., Li, H., Delp, E.J., 2015. Robust local and global shape context for tattoo image matching, in: 2015 IEEE International Conference on Image Processing, ICIP 2015, Quebec City, QC, Canada, September 27-30, 2015, IEEE. pp. 2194–2198. doi:10.1109/ICIP.2015.7351190

  21. [21]

    Bridging modalities: A survey of cross-modal image-text retrieval

    Li, T., Kong, L., Yang, X., Wang, B., Xu, J., 2024. Bridging modalities: A survey of cross-modal image-text retrieval. Chinese Journal of Information Fusion 1, 79–92. URL:https://doi.org/10.62762/CJIF.2024.361895, doi:10.62762/CJIF.2024.361895

  22. [22]

    Visual instruction tuning, in: NeurIPS

    Liu, H., Li, C., Wu, Q., Lee, Y.J., 2023. Visual instruction tuning, in: NeurIPS

  23. [23]

    Martindez-Diaz, Y., Luevano, L.S., Mendez-Vazquez, H., Nicolas-Diaz, M., Chang, L., Gonzalez-Mendoza, M., 2019. Shufflefacenet: A lightweight face architecture for efficient and highly-accurate face recognition, in: Proceedings of the IEEE/CVF international conference on computer vision workshops, pp. 0–0

  24. [24]

    Bench- marking lightweight face architectures on specific face recognition scenarios

    Martínez-Díaz, Y., Nicolás-Díaz, M., Méndez-Vázquez, H., Luevano, L.S., Chang, L., Gonzalez-Mendoza, M., Sucar, L.E., 2021. Bench- marking lightweight face architectures on specific face recognition scenarios. Artif. Intell. Rev. 54, 6201–6244. URL:https://doi.org/ 10.1007/s10462-021-09974-2, doi:10.1007/s10462-021-09974-2

  25. [25]

    Morales-González,A.,Méndez-Vázquez,H.,García-Borroto,M.,2025. Multimodaltattoorecognitionbycombiningvisualfeaturesandllm- generated captions, in: IX International Congress on Artificial Intelligence and Pattern Recognition, IWAIPR 2025, Varadero, Cuba, October 14-17, 2025, Proceedings, Accepted to be published by Springer-Verlag

  26. [26]

    Nicolás-Díaz, M., Morales-González, A., Méndez-Vázquez, H., 2019. Deep generic features for tattoo identification, in: Progress in Pattern Recognition,ImageAnalysis,ComputerVision,andApplications:24thIberoamericanCongress,CIARP2019,Havana,Cuba,October28-31, 2019, Proceedings, Springer-Verlag, Berlin, Heidelberg. p. 272–282

  27. [27]

    Weighted average pooling of deep features for tattoo identification

    Nicolás-Díaz, M., Morales-González, A., Méndez-Vázquez, H., 2022. Weighted average pooling of deep features for tattoo identification. Multimedia Tools Appl. 81, 25853–25875

  28. [28]

    Approach for tattoo detection and identification based on yolov5 and similarity distance

    Pocevič˙e, G., Stefanovič, P., Ramanauskait˙e, S., Pavlov, E., 2024. Approach for tattoo detection and identification based on yolov5 and similarity distance. Applied Sciences 14, 5576. doi:10.3390/app14135576

  29. [29]

    Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I., 2021. Learning transferable visual models from natural language supervision, in: Proceedings of the 38th International Conference on Machine Learning, PMLR. pp. 8748–8763

  30. [30]

    A transfer learning approach for the tattoo classification problem, in: 2022 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp

    da Silva, R.T., Silvério Lopes, H., 2022. A transfer learning approach for the tattoo classification problem, in: 2022 IEEE Latin American Conference on Computational Intelligence (LA-CCI), pp. 1–6. doi:10.1109/LA-CCI54402.2022.9981650

  31. [31]

    Benchmarking multimodal large language models for forensic science and medicine: A comprehensive dataset and evaluation framework

    Sohail, A., Patel, O.M., Choi, J., Venditti, J.C.S., Wu, A.J., 2025. Benchmarking multimodal large language models for forensic science and medicine: A comprehensive dataset and evaluation framework. medRxiv URL: https: //www.medrxiv.org/content/early/2025/07/07/2025.07.06.25330972, doi: 10.1101/2025.07.06.25330972, arXiv:https://www.medrxiv.org/content/e...

  32. [32]

    Song, K., Tan, X., Qin, T., Lu, J., Liu, T.Y., 2020. Mpnet: masked and permuted pre-training for language understanding, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, Curran Associates Inc., Red Hook, NY, USA

  33. [33]

    Análisis forense de imágenes de tatuajes para identificación de individuos

    Teruel Leyva, L., Lucena de Ustariz, M.E., 2025. Análisis forense de imágenes de tatuajes para identificación de individuos. RECIMUNDO 9, 238–251

  34. [34]

    Artificialintelligenceinmobileforensics:Asurveyofcurrentstatus,ausecaseanalysis and ai alignment objectives

    Vasilaras,A.,Papadoudis,N.,Rizomiliotis,P.,2024. Artificialintelligenceinmobileforensics:Asurveyofcurrentstatus,ausecaseanalysis and ai alignment objectives. Forensic Science International: Digital Investigation 49, 301737. URL:https://www.sciencedirect.com/ First Author et al.:Preprint submitted to Elsevier Page 22 of 23 Bridging the Modality Gap in Fore...

  35. [35]

    Localdeepfeaturesforcompositefacesketch recognition

    Vazquez,H.M.,Becerra-Riera,F.,Morales-González,A.,López-Avila,L.,Tistarelli,M.,2019. Localdeepfeaturesforcompositefacesketch recognition. 2019 7th International Workshop on Biometrics and Forensics (IWBF) , 1–6URL:https://api.semanticscholar.org/ CorpusID:195222350

  36. [36]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J., 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191

  37. [37]

    Exploring the potential of large language models for improving digital forensic investigation efficiency

    Wickramasekara, A., Breitinger, F., Scanlon, M., 2025. Exploring the potential of large language models for improving digital forensic investigation efficiency. Forensic Science International: Digital Investigation 52, 301859. URL:https://www.sciencedirect.com/ science/article/pii/S2666281724001860, doi:https://doi.org/10.1016/j.fsidi.2024.301859

  38. [38]

    Training-freezero-shotcomposedimageretrievalviaweightedmodalityfusionandsimilarity

    Wu,R.D.,Lin,Y.Y.,Yang,H.F.,2024a. Training-freezero-shotcomposedimageretrievalviaweightedmodalityfusionandsimilarity. URL: https://arxiv.org/abs/2409.04918, arXiv:2409.04918

  39. [39]

    Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding

    Wu,Z.,Chen,X.,Pan,Z.,Liu,X.,Liu,W.,Dai,D.,Gao,H.,Ma,Y.,Wu,C.,Wang,B.,Xie,Z.,Wu,Y.,Hu,K.,Wang,J.,Sun,Y.,Li,Y.,Piao, Y., Guan, K., Liu, A., Xie, X., You, Y., Dong, K., Yu, X., Zhang, H., Zhao, L., Wang, Y., Ruan, C., 2024b. Deepseek-vl2: Mixture-of-experts vision-language models for advanced multimodal understanding. URL:https://arxiv.org/abs/2412.10302, ar...

  40. [40]

    TattoosinForensics:Retrieval,DetectionandSynthesis

    Xu,X.,2021. TattoosinForensics:Retrieval,DetectionandSynthesis. Ph.D.thesis.SchoolofComputerScienceandEngineering,Forensics and Security Lab. doi:10.32657/10356/151535

  41. [41]

    Yang,Z.,Xue,D.,Qian,S.,Dong,W.,Xu,C.,2024.Ldre:Llm-baseddivergentreasoningandensembleforzero-shotcomposedimageretrieval, in: Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Association for Computing Machinery, New York, NY, USA. p. 80–90. doi:10.1145/3626772.3657740. First Author et al.:Prep...