pith. machine review for the scientific record.

arxiv: 2604.09690 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: no theorem link

Are We Recognizing the Jaguar or Its Background? A Diagnostic Framework for Jaguar Re-Identification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:51 UTC · model grok-4.3

classification 💻 cs.CV
keywords jaguar re-identification · re-ID diagnostics · background leakage · wildlife imagery · coat pattern · inpainting · laterality · citizen science

The pith

Jaguar re-identification models often achieve high scores by matching backgrounds or silhouettes instead of coat patterns.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Jaguar re-identification from citizen-science photos can produce strong retrieval metrics while depending on the wrong visual evidence. Models may match images using background scenery or body outline rather than the unique spotted coat that identifies each animal. The authors introduce a diagnostic framework that measures background dependence through a context ratio comparing performance on inpainted background-only and foreground-only images, and laterality through cross-flank matches and mirror self-similarity. They support the diagnostics with a new Pantanal jaguar benchmark that includes per-pixel segmentation masks and identity-balanced evaluation splits. The framework is applied to representative training methods to determine what evidence each model actually uses.

Core claim

Re-identification models for jaguars in natural images can achieve high performance by exploiting contextual information or non-unique shape features rather than the unique coat markings. The diagnostic framework quantifies this through a background-to-foreground context ratio derived from inpainted images and laterality metrics from cross-flank and mirror comparisons, tested on a new identity-balanced Pantanal jaguar dataset with segmentation masks. Case studies on fine-tuning, regularization, and hyperbolic embeddings illustrate how to evaluate what evidence the models actually use.

What carries the argument

The leakage-controlled context ratio computed from retrieval performance on inpainted background-only versus foreground-only images, together with laterality diagnostics based on cross-flank retrieval and mirror self-similarity.
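
In outline, the ratio reduces to two retrieval runs with the same frozen model over the same identities: one on inpainted background-only images, one on foreground-only cutouts. A minimal sketch, assuming a cosine-similarity mAP protocol (function names are hypothetical; the paper's exact evaluation code is not given here):

```python
import numpy as np

def mean_average_precision(q_emb, g_emb, q_ids, g_ids):
    """Cosine-similarity retrieval mAP: rank the gallery for each query,
    then average precision over the ranks of the relevant (same-identity)
    gallery images."""
    q = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    g = g_emb / np.linalg.norm(g_emb, axis=1, keepdims=True)
    sims = q @ g.T
    g_ids = np.asarray(g_ids)
    aps = []
    for i, qid in enumerate(q_ids):
        order = np.argsort(-sims[i])
        rel = (g_ids[order] == qid).astype(float)
        if rel.sum() == 0:
            continue  # identity absent from gallery: skip this query
        prec = np.cumsum(rel) / np.arange(1, len(rel) + 1)
        aps.append(float((prec * rel).sum() / rel.sum()))
    return float(np.mean(aps))

def context_ratio(map_bg, map_fg):
    """BG/FG: near 0 suggests the model needs the coat pattern;
    near (or above) 1 suggests background alone retrieves as well."""
    return map_bg / map_fg
```

Here `map_bg` would come from embeddings of the inpainted background-only images and `map_fg` from the foreground-only cutouts, with the same query/gallery identity split in both runs.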

If this is right

  • High context ratios indicate that models are matching based on background rather than the jaguar itself.
  • The laterality diagnostic identifies models that fail to match left and right flanks of the same animal.
  • The curated benchmark with segmentation masks supports controlled experiments on visual evidence.
  • Mitigation methods like anti-symmetry regularization can be compared for their impact on these diagnostics.
  • Evaluation protocols should incorporate these checks to ensure reliance on identity-defining features.
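
The mirror self-similarity half of the laterality diagnostic can be sketched in a few lines. Here `embed` stands in for any frozen model's embedding function; this is an illustrative interface, not the paper's code:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def mean_mirror_similarity(embed, images):
    """Sketch of the mirror-similarity axis: average cosine similarity
    between each image's embedding and the embedding of its horizontal
    mirror. Values near 1.0 mean the embedding cannot distinguish a left
    flank from a mirrored right one."""
    sims = [cosine(embed(img), embed(np.flip(img, axis=1))) for img in images]
    return float(np.mean(sims))
```

A flip-invariant embedding (e.g. any pooling that discards spatial order) scores exactly 1.0 on this metric, which is what the diagnostic is designed to flag.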

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar diagnostics could be developed for re-identification of other wildlife species with distinctive markings.
  • If background reliance is widespread, re-ID systems may not transfer well to new locations or camera setups.
  • Incorporating inpainting-based tests during model development could encourage learning of more robust identity features.

Load-bearing premise

Inpainting the images to isolate background or foreground does not introduce artifacts that alter the retrieval behavior of the models being tested.

What would settle it

A model maintaining its ranking performance when tested on background-only inpainted images (with the jaguar removed) would show it is not using coat patterns for identification.
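
One cheap version of that check compares which gallery neighbours a model retrieves under the two image conditions. A sketch, assuming ranked gallery index lists are available per query (names hypothetical):

```python
import numpy as np

def topk_overlap(rank_orig, rank_edited, k=5):
    """Mean Jaccard overlap between the top-k gallery neighbours retrieved
    for the original images and for an edited variant (e.g. background-only
    inpainted). High overlap when the jaguar is removed is evidence of
    background reliance; low overlap there, paired with high overlap on
    foreground-only cutouts, points to the coat pattern."""
    overlaps = []
    for ra, rb in zip(rank_orig, rank_edited):
        a, b = set(ra[:k]), set(rb[:k])
        overlaps.append(len(a & b) / len(a | b))
    return float(np.mean(overlaps))
```
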

Figures

Figures reproduced from arXiv: 2604.09690 by Abigail Allen Martin, Alexandra Schild, Antonio Rueda-Toicen, Daniil Morozov, Davide Panza, Gerard de Melo, Matin Mahmood, Shahabeddin Dayani.

Figure 1. Six diagnostic image variants derived from a single citizen-science photograph using its SAM-3 alpha mask. (A1) Original full RGB input. (A2) Binary silhouette (alpha channel). (B1) BG silhouette: foreground replaced by a black silhouette, retaining background context and shape cues. (B2) Inpainted: foreground removed by FLUX.1-Fill generative inpainting, eliminating the silhouette-shaped hole used for lea…

Figure 2. Foreground vs. inpainted background mAP across frozen models. Green bars show foreground-only mAP; red bars show inpainted background-only mAP. Models are sorted by BG/FG, shown in bold next to each bar pair. MiewID-MSv2 achieves the lowest frozen BG/FG (0.52), consistent with wildlife-specific pre-training reducing non-coat context reliance. Frozen-model takeaways: under the leakage-controlled BG/FG diagn…

Figure 3. Relationship between shortcut axes. Each point is a frozen model plotted by inpainted BG/FG (x) and mean mirror similarity (y), both computed on foreground-only cutouts (lower mean mirror similarity indicates greater laterality awareness). Spearman ρ = 0.307 (p = 0.265; N = 15; 95% bootstrap CI [−0.360, 0.771], B = 20,000, seed 0): no clear monotonic association. Counter-examples include EVA-02 (BG/FG 0.661, …

Figure 4. Qualitative embedding inspection with UMAP [28] in HyperView. The interface shows image thumbnails next to the latent arrangement and includes a lasso tool for selecting regions for qualitative inspection, near-duplicate discovery, and difficult re-identification cases. Hard cases produce overlapping embedding representations. These panels are qualitative explorations and part of the iterative model-guided…

Figure 5. Mask solidity distribution for the train (n=1,895) and test (n=371) splits, computed from SAM 3 alpha masks (Eq. 8). Mask generation (SAM 3): a jaguar foreground mask is generated for each image using SAM 3 [25] with the text prompt "jaguar". The dataset used in the experiments stores this binary mask as the alpha channel of the RGBA PNG; all experiments use these masks, so r…

Figure 6. Lowest-solidity masks: examples from the bottom of the solidity distribution. The most common artifact is partial occlusion (often vegetation), which removes parts of the jaguar or introduces small holes. Motivation for the leakage-controlled background-only variant via inpainting: the companion ratio BG+Sil/FG uses background-only images constructed by zeroing out the jaguar pixels using the alpha mask. …

Figure 7. Solidity-based quality hierarchy. A low-solidity (fragmented) mask at top yields a parent embedding with higher variance. High-quality full-body and close-up crops (bottom) produce tighter child embeddings on the Lorentz manifold.

Figure 9. Long-tail performance breakdown for ArcFace+DINOv3. CMC@1/5/10 by identity frequency tier. Tail identities (10 IDs, 9% of data) still achieve 70.4% CMC@1, with a HEAD-to-TAIL gap of 15.8 pp.

Figure 10. Auxiliary mirrored-query retrieval stress test. Mirror-to-regular mAP ratio for the 14 frozen models (foreground-only cutouts). This figure reports a retrieval-level check, not the canonical definition of Axis 2, which remains mean mirror similarity under each model's native retrieval score. Lower values indicate greater asymmetry awareness. Green: pattern-aware (< 0.90); orange: moderate (0.90–0.95); red…

Figure 11. Two positive-danger-margin cases according to MegaDescriptor-L embeddings. Each row: original foreground crop (left), horizontal mirror (centre), nearest wrong-identity match (right). In both cases the mirrored image is less similar to the original than a different individual (danger margin > 0), confirming that horizontal flips can corrupt identity for laterality-aware models. Both images have below-av…
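
The mask-solidity score referenced in the Figure 5–7 captions (Eq. 8) is, in the conventional definition, the mask area divided by the area of its convex hull, so fragmented or occluded masks score low. A rough sketch under that conventional definition (the paper's exact Eq. 8 may differ in detail):

```python
import numpy as np
from scipy.spatial import ConvexHull

def mask_solidity(mask):
    """Approximate solidity of a binary mask: foreground pixel count over
    the area of the convex hull of the foreground pixel centres. Clipped
    to 1.0, since pixel-centre hulls slightly under-measure area."""
    ys, xs = np.nonzero(mask)
    pts = np.column_stack([xs, ys]).astype(float)
    if len(pts) < 3:
        return 1.0  # degenerate mask: hull undefined
    hull_area = ConvexHull(pts).volume  # for 2-D inputs, .volume is area
    return min(1.0, float(len(pts)) / max(hull_area, 1.0))
```

A mask with vegetation-shaped holes keeps roughly the same hull but loses foreground pixels, so its solidity drops, matching the occlusion artifact described in the Figure 6 caption.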
Original abstract

Jaguar re-identification (re-ID) from citizen-science imagery can look strong on standard retrieval metrics while still relying on the wrong evidence, such as background context or silhouette shape, instead of the coat pattern that defines identity. We introduce a diagnostic framework for wildlife re-ID with two axes: a leakage-controlled context ratio, background/foreground, computed from inpainted background-only versus foreground-only images, and a laterality diagnostic based on cross-flank retrieval and mirror self-similarity. To make these diagnostics measurable, we curate a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced evaluation protocol. We then use representative mitigation families, ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings, as case studies under the same evaluation lens. The goal is not only to ask which model ranks best, but also what visual evidence it uses to do so.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that standard retrieval metrics for jaguar re-identification from citizen-science images can be misleading because models may exploit background context or silhouette shape rather than the biologically defining coat patterns. It introduces a diagnostic framework consisting of a leakage-controlled context ratio (computed via inpainted background-only versus foreground-only images) and a laterality diagnostic (based on cross-flank retrieval and mirror self-similarity). To support this, the authors curate a Pantanal jaguar benchmark with per-pixel segmentation masks and an identity-balanced protocol, then apply the diagnostics as case studies to representative mitigation approaches including ArcFace fine-tuning, anti-symmetry regularization, and Lorentz hyperbolic embeddings.

Significance. If the diagnostics prove robust, the work would be significant for wildlife computer vision by providing concrete tools to detect and mitigate reliance on non-identity cues, improving the reliability of re-ID for conservation applications. The curated benchmark with masks and the dual-axis evaluation protocol represent useful contributions that could be adopted more broadly. The case-study application demonstrates practical utility, though the overall impact depends on addressing the core methodological assumptions.

major comments (1)
  1. [Diagnostic framework description] The leakage-controlled context ratio (introduced in the diagnostic framework and used to compute background/foreground performance) is load-bearing for the central claim that models rely on background leakage. However, the manuscript provides no validation, error analysis, or ablation of the inpainting step (e.g., no comparison of retrieval rankings on original vs. inpainted images or assessment of boundary artifacts in complex Pantanal scenes). This leaves open the possibility that observed context ratios reflect inpainter-specific statistical regularities rather than genuine scene context.
minor comments (2)
  1. [Evaluation and case studies] Ensure that all quantitative results, error bars, and statistical tests for the context ratio and laterality metrics are reported with full details in the evaluation section, as the abstract supplies none.
  2. [Benchmark curation] Clarify the exact identity-balanced evaluation protocol and how it prevents trivial splits; a small table summarizing dataset statistics (number of identities, images per flank, etc.) would improve readability.
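
On the split-construction point, one common identity-balanced recipe caps the number of query images per identity and guarantees every queried identity still has gallery matches. This is a generic illustration of that recipe, not the paper's protocol:

```python
import random
from collections import defaultdict

def identity_balanced_split(records, n_query=2, seed=0):
    """Hypothetical sketch: hold out up to n_query images per identity as
    queries, keep at least one image per identity in the gallery, and send
    single-image identities to the gallery only (they cannot be queried
    without a possible match)."""
    rng = random.Random(seed)
    by_id = defaultdict(list)
    for rec in records:
        by_id[rec["identity"]].append(rec)
    query, gallery = [], []
    for items in by_id.values():
        items = list(items)
        rng.shuffle(items)
        if len(items) < 2:
            gallery.extend(items)
            continue
        k = min(n_query, len(items) - 1)
        query.extend(items[:k])
        gallery.extend(items[k:])
    return query, gallery
```

The cap keeps frequent identities from dominating the query set, which is one way to prevent the trivial splits the comment asks about.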

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive review. The feedback highlights an important aspect of our diagnostic framework that requires additional validation to strengthen the central claims. We address the major comment below and commit to revisions that will incorporate the suggested analyses without altering the core contributions of the Pantanal benchmark or the dual-axis evaluation protocol.

point-by-point responses
  1. Referee: The leakage-controlled context ratio (introduced in the diagnostic framework and used to compute background/foreground performance) is load-bearing for the central claim that models rely on background leakage. However, the manuscript provides no validation, error analysis, or ablation of the inpainting step (e.g., no comparison of retrieval rankings on original vs. inpainted images or assessment of boundary artifacts in complex Pantanal scenes). This leaves open the possibility that observed context ratios reflect inpainter-specific statistical regularities rather than genuine scene context.

    Authors: We agree that explicit validation of the inpainting step is necessary to support the leakage-controlled context ratio as a reliable diagnostic. In the revised manuscript we will add a dedicated ablation subsection that (i) compares retrieval rankings and mAP on the original images versus the inpainted background-only and foreground-only versions for the same model checkpoints, (ii) quantifies boundary artifacts by measuring pixel-level consistency around mask edges in the complex Pantanal vegetation, and (iii) reports an error analysis using a small manually annotated subset of inpainted images to estimate the fraction of cases where inpainting introduces spurious textures. These additions will directly address whether the observed context ratios arise from genuine scene context or from inpainter-specific regularities. revision: yes

Circularity Check

0 steps flagged

No circularity: diagnostics built from external inpainting and segmentation on curated benchmark

full rationale

The paper introduces a diagnostic framework consisting of a leakage-controlled context ratio (computed from inpainted background-only vs. foreground-only images) and a laterality diagnostic (cross-flank retrieval and mirror self-similarity). These are applied to a newly curated Pantanal jaguar benchmark with per-pixel masks and an identity-balanced protocol. Mitigation methods (ArcFace fine-tuning, anti-symmetry regularization, Lorentz embeddings) are then evaluated under this lens. No load-bearing step reduces by construction to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. The inpainting and segmentation steps are external techniques whose fidelity is an assumption (not a tautology), and the paper does not invoke uniqueness theorems or prior self-citations to justify its core choices. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on domain assumptions about the fidelity of inpainting and segmentation rather than introducing free parameters or new entities; limited details available from abstract.

axioms (1)
  • domain assumption Inpainting can produce background-only images that preserve model decision processes without introducing confounding artifacts
    Invoked to compute the context ratio from background-only versus foreground-only images

pith-pipeline@v0.9.0 · 5480 in / 1202 out tokens · 55572 ms · 2026-05-10T19:51:40.692016+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 6 canonical work pages · 3 internal anchors

  [1] Jaguar ID Project. Jaguar field guide. https://www.jaguaridproject.com/product-page/jaguar-field-guide-digital, 2025.

  [2] Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A. Wichmann. Shortcut learning in deep neural networks. Nature Machine Intelligence, 2(11):665–673, 2020.

  [3] Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In Proceedings of the European Conference on Computer Vision (ECCV), pages 456–473, 2018.

  [4] Vojtěch Čermák, Lukas Picek, Lukáš Adam, and Kostas Papafitsoros. WildlifeDatasets: An open-source toolkit for animal re-identification. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 5953–5963, 2024.

  [5] Vojtěch Čermák, Lukas Picek, Lukáš Adam, Lukáš Neumann, and Jiří Matas. WildFusion: Individual animal identification with calibrated similarity fusion. In European Conference on Computer Vision, pages 18–36. Springer, 2024.

  [6] Lasha Otarashvili, Tamilselvan Subramanian, Jason Holmberg, JJ Levenson, and Charles V. Stewart. Multispecies animal re-ID using a large community-curated dataset. arXiv preprint arXiv:2412.05602, 2024.

  [7] Lukáš Adam, Kostas Papafitsoros, Claire Jean, Alan F. Rees, and Vojtěch Čermák. Exploiting facial side similarities to improve AI-driven sea turtle photo-identification systems. Ecological Informatics, 89:103158, 2025.

  [8] Jonathan P. Crall, Charles V. Stewart, Tanya Y. Berger-Wolf, Daniel I. Rubenstein, and Siva R. Sundaresan. HotSpotter—patterned species instance recognition. In 2013 IEEE Workshop on Applications of Computer Vision (WACV), pages 230–237. IEEE, 2013.

  [9] Chris Olah, Nick Cammarata, Chelsea Voss, Ludwig Schubert, and Gabriel Goh. Naturally occurring equivariance in neural networks. Distill, 2020. https://distill.pub/2020/circuits/equivariance

  [10] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  [11] Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. EVA-02: A visual representation for neon genesis. Image and Vision Computing, 149:105171, 2024.

  [12] Mingxing Tan and Quoc V. Le. EfficientNetV2: Smaller models and faster training. In International Conference on Machine Learning, pages 10096–10106, 2021.

  [13] Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619–15629, 2023.

  [14] Sanghyun Woo, Shoubhik Debnath, Ronghang Hu, Xinlei Chen, Zhuang Liu, In So Kweon, and Saining Xie. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16133–16142, 2023.

  [15] Mike Ranzinger, Greg Heinrich, Jan Kautz, and Pavlo Molchanov. C-RADIOv4. arXiv preprint arXiv:2601.17237, 2026.

  [16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

  [17] Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. In International Conference on Learning Representations (ICLR), 2020.

  [18] Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4690–4699, 2019.

  [19] Jiankang Deng, Jia Guo, Tongliang Liu, Mingming Gong, and Stefanos Zafeiriou. Sub-center ArcFace: Boosting face recognition by large-scale noisy web faces. In European Conference on Computer Vision, pages 741–757. Springer, 2020.

  [20] Valentin Khrulkov, Leyla Mirvakhabova, Evgeniya Ustinova, Ivan Oseledets, and Victor Lempitsky. Hyperbolic image embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6418–6428, 2020.

  [21] Gary Bécigneul and Octavian-Eugen Ganea. Riemannian adaptive optimization methods. In International Conference on Learning Representations (ICLR), 2019.

  [22] Hervé Jégou, Matthijs Douze, Cordelia Schmid, and Patrick Pérez. Aggregating local descriptors into a compact image representation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3304–3311, 2010.

  [23] Xiaoming Zhao, Xingming Wu, Weihai Chen, Peter C. Y. Chen, Qingsong Xu, and Zhengguo Li. ALIKED: A lighter keypoint and descriptor extraction network via deformable transformation. IEEE Transactions on Instrumentation and Measurement, 72:1–16, 2023.

  [24] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. LoFTR: Detector-free local feature matching with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8922–8931, 2021.

  [25] Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. SAM 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025.

  [26] Robert Hummel. Image enhancement by histogram transformation. Computer Graphics and Image Processing, 4(2):184–195, 1975.

  [27] Matin Mahmood, Antonio Rueda-Toicen, and Daniil Morozov. HyperView: Open-source dataset curation and model analysis, 2025.

  [28] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.

  [29] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A ConvNet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.

  [30] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 10012–10022, 2021.

  [31] Yury Borets and Stepan Botman. Mefem: Medical face embedding model. arXiv preprint arXiv:2602.14672, 2026.

  [32] Black Forest Labs. FLUX.1: Open source text-to-image model. https://github.com/black-forest-labs/flux, 2024.

  [33] Prannay Khosla, Piotr Teterwak, Chen Wang, Aaron Sarna, Yonglong Tian, Phillip Isola, Aaron Maschinot, Ce Liu, and Dilip Krishnan. Supervised contrastive learning. Advances in Neural Information Processing Systems, 33:18661–18673, 2020.