
arxiv: 2605.16515 · v1 · pith:ZN6DH75Pnew · submitted 2026-05-15 · 💻 cs.CV · cs.LG

SeamCam: Quantifying Seamless Camouflage via Multi-Cue Visual Detectability

Pith reviewed 2026-05-20 18:22 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords seamless camouflage · camouflage quantification · visual detectability · object detection proposals · segmentation masks · diffusion model optimization · human judgment agreement

The pith

SeamCam quantifies seamless camouflage as one minus the highest IoU recoverable from category-conditioned detection proposals and their segmentation masks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a quantitative metric for seamless camouflage by treating it as a localization task where the animal must remain hard to find even when its category is known. SeamCam generates detection proposals conditioned on the target species, extracts their segmentation masks, and finds the subset of masks whose union has the highest IoU with the ground-truth mask. The score is defined as one minus this maximum IoU, so higher values indicate better blending. This matters because existing evaluations of camouflage rely on subjective ratings or uncontrolled images, making it difficult to compare methods or train generators systematically. The metric was validated against human choices and then used to optimize a diffusion model for camouflage generation.

Core claim

The central claim is that seamless camouflage strength equals one minus the maximum intersection-over-union between the true animal mask and the union of segmentation masks obtained from a pool of category-conditioned object detection proposals. This produces a score that aligned with human judgments of detection difficulty in 78.82 percent of 2,390 two-alternative forced-choice trials involving 94 participants, exceeding prior methods by roughly 25 percent. The same score is then used as a preference signal for direct preference optimization to fine-tune a diffusion-based inpainting model, and the work introduces the CamFG-1.5k dataset of 1,521 fully visible animal images to enable clean, occlusion-controlled evaluation.

What carries the argument

SeamCam, the metric computed as one minus the maximum IoU between the ground-truth mask and the union of segmentation masks from category-conditioned detection proposals.
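The definition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the detection, gating, and segmentation steps have already run, and it represents each mask as a set of (row, column) pixel coordinates purely for readability. The function names `iou` and `seamcam_score` are hypothetical.

```python
from itertools import combinations


def iou(a, b):
    """Intersection-over-union of two masks given as sets of pixel coordinates."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def seamcam_score(gt_mask, proposal_masks):
    """One minus the best IoU achievable by the union of any non-empty
    subset of proposal masks (brute-force search, assumed for illustration)."""
    best = 0.0  # assumption: with no proposals, detectability is taken to be 0
    for r in range(1, len(proposal_masks) + 1):
        for subset in combinations(proposal_masks, r):
            merged = set().union(*subset)
            best = max(best, iou(merged, gt_mask))
    return 1.0 - best


# Toy 4x4 image: the animal occupies the left half.
gt = {(r, c) for r in range(4) for c in range(2)}
top = {(r, c) for r in range(2) for c in range(2)}
bottom = {(r, c) for r in range(2, 4) for c in range(2)}
full = {(r, c) for r in range(4) for c in range(4)}

print(seamcam_score(gt, [top]))                # one partial proposal
print(seamcam_score(gt, [top, bottom, full]))  # two parts jointly recover the animal
```

The exhaustive search visits 2^n − 1 subsets for n proposals, so a sketch like this is only practical when gating leaves a small pool.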

If this is right

  • Camouflage can be compared and ranked using an automated procedure instead of repeated human studies.
  • The score supplies an explicit objective for training generative models to produce stronger camouflage via preference optimization.
  • Benchmarking becomes possible on datasets that control for occlusion by starting from fully visible animals.
  • New camouflage generation techniques can be evaluated reproducibly against the same localization-based criterion.
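To make the preference-optimization point concrete, here is one way a scalar camouflage score could be turned into (chosen, rejected) pairs for DPO. The selection rule (best versus worst candidate, with a minimum score margin) and the names `select_preference_pair` and `min_margin` are assumptions made for illustration; the page does not describe the paper's actual selection procedure.

```python
def select_preference_pair(candidates, min_margin=0.1):
    """Pick a (chosen, rejected) pair from scored generations of one input.

    candidates: list of (sample_id, seamcam_score) tuples.
    Returns (chosen_id, rejected_id), or None when the pool is too small or
    the score gap is below min_margin (an assumed filter against noisy pairs).
    """
    if len(candidates) < 2:
        return None
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    chosen, rejected = ranked[0], ranked[-1]
    if chosen[1] - rejected[1] < min_margin:
        return None
    return chosen[0], rejected[0]


# Three hypothetical inpainting outputs for the same animal and background.
pool = [("gen_a", 0.2), ("gen_b", 0.9), ("gen_c", 0.5)]
print(select_preference_pair(pool))
```

Higher SeamCam means stronger camouflage, so the highest-scoring sample plays the role of the preferred output.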

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same localization-shortfall idea could be tested on non-animal concealment tasks such as hiding objects in cluttered scenes.
  • Incorporating motion or multi-frame information might strengthen alignment with how humans actually search for camouflaged targets.
  • Running the metric across multiple detector families would show how much the scores depend on the underlying proposal mechanism.

Load-bearing premise

The maximum IoU recoverable from a finite set of category-conditioned detection proposals and masks acts as a faithful proxy for human visual detectability of seamless camouflage.

What would settle it

The claim that the metric tracks human perception would be falsified if human observers rated images with high SeamCam scores as easy to detect (or images with low scores as hard to detect) when the scores are computed under a different detector architecture or proposal ranking.

Figures

Figures reproduced from arXiv: 2605.16515 by Abolfazl Meyarian, Amin Karimi Monsefi, Anuj Karpatne, Cheng Zhang, Mridul Khurana, Pouyan Navard, Rajiv Ramnath, Shuheng Wang, Wei-Lun Chao.

Figure 1: SeamCam vs. CamOT. SeamCam produces consistent and accurate camouflage difficulty scores across diverse scenarios, whereas CamOT exhibits notable inconsistencies. As illustrated by the polar bear (where superficial color and lighting similarity between subject and background causes CamOT to erroneously overestimate camouflage effectiveness despite the subject being plainly visible), CamOT assigns a disp…
Figure 2: Quality comparison between CamFG-1.5K vs. existing datasets. Existing datasets often include subjects that are cropped, or partially camouflaged, leading to biased evaluation of camouflage models. In contrast, CamFG-1.5K features clearly visible animals with minimal obstructions, enabling unbiased model assessment. Camouflage is fundamentally a problem of perception rather than appearance alone. Biologic…
Figure 3: Overview of SeamCam framework. Given an image and species name, we generate category-conditioned detection proposals via GroundingDINO, apply semantic and confidence gating, and obtain segmentation masks from SAM-2. We then evaluate all proposal subsets, computing IoU between each subset’s mask union and the ground truth. The maximum achievable IoU defines detectability D; the camouflage score is 1 − D. …
Figure 4: SeamCam-based sample selection for Direct Preference Optimization
Figure 5: Per-species accuracy comparison between SeamCam vs. CamOT
Figure 6: Camouflage image generation using SeamCam-based DPO
Original abstract

Animals are described as effectively camouflaged when they blend seamlessly with their surrounding, yet no standardized quantitative measure of this seamlessness exists. We address this gap by framing camouflage evaluation as a visual localization problem: a well-camouflaged animal is one that remains difficult to detect even when its category is known. We introduce SeamCam (Seamless Camouflage), a metric that quantifies how detectable an animal is from the available visual evidence. Given an image and a target species, SeamCam generates category-conditioned detection proposals, extracts segmentation masks, and identifies the subset whose collective union yields the highest IoU with the ground-truth mask. The SeamCam score is one minus this maximum recoverable localization signal, where a higher score indicates stronger camouflage (i.e., lower detectability). In a human two-alternative forced-choice study with 94 participants and 2,390 comparisons, SeamCam achieves 78.82% agreement with human camouflage difficulty judgments, outperforming state-of-the-art by about 25%. We then demonstrate SeamCam's utility as a preference signal for Direct Preference Optimization (DPO) to fine-tune a diffusion-based inpainting model for camouflage generation. This offers an affordable training approach with an objective explicitly suited for camouflage generation, unlike typical diffusion models. To support rigorous benchmarking, we further introduce CamFG-1.5k, a curated dataset of 1,521 high-resolution images in which animals are fully visible prior to camouflage generation, enabling unbiased evaluation by controlling for occlusion artifacts present in existing datasets. https://7amin.github.io/SeamCam/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SeamCam, a metric that quantifies seamless camouflage by framing detection difficulty as a localization task: given an image and target category, it generates category-conditioned detection proposals, extracts their segmentation masks, and computes the maximum IoU recoverable by their union against a provided ground-truth mask; the SeamCam score is then defined as one minus this maximum IoU. The central empirical result is a two-alternative forced-choice human study (94 participants, 2,390 comparisons) reporting 78.82% agreement between SeamCam scores and human camouflage-difficulty judgments, outperforming prior methods by roughly 25%. The authors also release the CamFG-1.5k dataset of 1,521 high-resolution images and demonstrate use of the metric as a preference signal for Direct Preference Optimization (DPO) of a diffusion inpainting model.

Significance. If the max-IoU proxy is shown to be robust rather than detector-specific, SeamCam would supply the first standardized, quantitative measure of seamless camouflage, directly addressing a long-standing gap in perceptual evaluation. The human study supplies direct empirical support for the correlation claim, the CamFG-1.5k dataset removes occlusion confounds present in prior collections, and the DPO application illustrates a concrete downstream use case. These contributions would be of clear interest to the computer-vision community working on camouflage, perceptual metrics, and generative modeling.

major comments (2)
  1. [Human Evaluation and Results] The headline correlation (78.82% human agreement) rests on the assumption that the highest IoU recoverable from a finite set of category-conditioned detection proposals is a faithful proxy for human visual detectability. No ablation is reported that replaces the proposal generator or alters its conditioning mechanism, leaving open the possibility that the reported agreement is an artifact of the particular detector architecture and training distribution rather than a general property of visual seamlessness.
  2. [Human Evaluation and Results] The abstract and results section report the 78.82% agreement figure without accompanying error bars, confidence intervals, or statistical significance tests against the state-of-the-art baselines. Because the central claim is that SeamCam outperforms prior metrics by approximately 25%, the absence of these controls makes it impossible to assess whether the improvement is reliable or could be explained by variance in the human study.
minor comments (2)
  1. [Dataset] The description of the CamFG-1.5k curation process should include explicit statistics on image resolution distribution, species diversity, and how the 'fully visible prior to camouflage' criterion was enforced to allow readers to judge potential selection bias.
  2. [Method] Notation for the union operation over masks and the precise definition of the 'maximum recoverable IoU' should be formalized with an equation in the method section to eliminate ambiguity when readers attempt to re-implement the metric.
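As a rough illustration of the formalization the second minor comment asks for, the definition could be written along the following lines. The notation is assumed here, not taken from the paper: P(I, c) is the gated proposal set for image I and category c, M_p is the segmentation mask of proposal p, and G is the ground-truth mask. Only the symbol D for detectability appears in the Figure 3 caption.

```latex
\[
\mathrm{IoU}(A, B) = \frac{|A \cap B|}{|A \cup B|}, \qquad
D(I, c) = \max_{\emptyset \neq S \subseteq \mathcal{P}(I, c)}
  \mathrm{IoU}\Big( \bigcup_{p \in S} M_p,\; G \Big), \qquad
\mathrm{SeamCam}(I, c) = 1 - D(I, c).
\]
```

How D is defined when no proposal survives gating is not stated on this page; setting D = 0 in that case (so the score is 1) would be one natural convention.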

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments and the positive evaluation of our work. We address each of the major comments point by point below.

  1. Referee: The headline correlation (78.82% human agreement) rests on the assumption that the highest IoU recoverable from a finite set of category-conditioned detection proposals is a faithful proxy for human visual detectability. No ablation is reported that replaces the proposal generator or alters its conditioning mechanism, leaving open the possibility that the reported agreement is an artifact of the particular detector architecture and training distribution rather than a general property of visual seamlessness.

    Authors: We agree that demonstrating robustness across detectors would strengthen the claim. The SeamCam metric is intended to use modern category-conditioned localization as a proxy for detectability, and our choice reflects current SOTA performance. In the revision we will add an ablation replacing the proposal generator with an alternative architecture and conditioning scheme, reporting the resulting human agreement to show the correlation is not detector-specific. revision: yes

  2. Referee: The abstract and results section report the 78.82% agreement figure without accompanying error bars, confidence intervals, or statistical significance tests against the state-of-the-art baselines. Because the central claim is that SeamCam outperforms prior metrics by approximately 25%, the absence of these controls makes it impossible to assess whether the improvement is reliable or could be explained by variance in the human study.

    Authors: We acknowledge that statistical controls are necessary to substantiate the reported improvement. We will revise the results section and abstract to include bootstrap-derived error bars, 95% confidence intervals on the agreement rate, and paired statistical tests (e.g., McNemar or binomial) against each baseline to confirm the ~25% gain is statistically reliable rather than attributable to sampling variance in the 2,390 comparisons. revision: yes
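The statistical controls promised in this response can be sketched with the standard library alone. The code below is an assumed illustration, not the authors' analysis: each trial is coded as 1 when a metric's ranking matches the human choice and 0 otherwise, the bootstrap resamples trials independently (which ignores clustering of trials within the 94 participants), and the function names are hypothetical.

```python
import random
from math import comb


def agreement_rate(hits):
    """Fraction of 2AFC trials where the metric agrees with the human choice."""
    return sum(hits) / len(hits)


def bootstrap_ci(hits, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the agreement rate (trial-level resampling)."""
    rng = random.Random(seed)
    n = len(hits)
    rates = sorted(sum(rng.choices(hits, k=n)) / n for _ in range(n_resamples))
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def mcnemar_exact(hits_a, hits_b):
    """Two-sided exact McNemar p-value for two metrics scored on the same trials."""
    only_a = sum(1 for x, y in zip(hits_a, hits_b) if x and not y)
    only_b = sum(1 for x, y in zip(hits_a, hits_b) if y and not x)
    n = only_a + only_b  # discordant trials
    if n == 0:
        return 1.0
    tail = sum(comb(n, k) for k in range(min(only_a, only_b) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Made-up toy data, not results from the paper.
metric_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
metric_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
print(agreement_rate(metric_a), bootstrap_ci(metric_a))
print(mcnemar_exact(metric_a, metric_b))
```

A participant-level (cluster) bootstrap would likely be the more conservative choice for the real study, since trials from the same person are not independent.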

Circularity Check

0 steps flagged

No significant circularity in SeamCam's metric definition or human validation


The paper explicitly constructs SeamCam as one minus the maximum IoU recoverable from the union of category-conditioned detection proposals and their segmentation masks against a provided ground-truth mask; this is a deliberate proxy definition for detectability rather than a derived claim that reduces to its own inputs. The 78.82% agreement result is obtained from a separate human two-alternative forced-choice study (94 participants, 2,390 comparisons) that serves as external benchmarking, not from fitting or self-referential computation on the same data. The DPO demonstration treats the metric as an independent preference signal for fine-tuning a diffusion inpainting model without any parameter fitting to the metric's outputs or self-citation chains. No self-definitional loops, fitted inputs renamed as predictions, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation appear in the provided derivation chain, leaving the approach self-contained against the human study benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The metric rests on the assumption that detection proposals plus mask union can approximate human localization difficulty; no explicit free parameters are named in the abstract, but implicit choices include the number of proposals retained and the segmentation model.

axioms (1)
  • Domain assumption: Category-conditioned object detectors can generate proposals whose masks meaningfully reflect visual evidence available to humans.
    Invoked when defining the max-IoU recoverable localization signal.

pith-pipeline@v0.9.0 · 5860 in / 1314 out tokens · 32374 ms · 2026-05-20T18:22:15.177139+00:00 · methodology


