Class-specific diffusion models improve military object detection in a low-data domain
Pith reviewed 2026-05-10 04:30 UTC · model grok-4.3
The pith
Fine-tuning diffusion models on just eight real images per military vehicle class generates synthetic training data that raises object-detector mAP50 by up to 8.0 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Class-specific diffusion models, created by LoRA fine-tuning of FLUX on eight or twenty-four real images per class across fifteen vehicle categories, produce synthetic samples that improve an RF-DETR detector trained on the same limited real set. Adding those generated images raises mAP50 by as much as 8.0 percent in the eight-sample regime. When structural guidance is supplied through ControlNet conditioned on Canny edge maps, performance increases by an additional 4.1 percent under the same conditions, yet the extra guidance yields no further benefit once twenty-four real samples are available. The central result is that object-specific diffusion models can substitute for traditional 3D simulation pipelines as a source of training data for military detection systems.
What carries the argument
Class-specific diffusion models obtained by fine-tuning FLUX with LoRA on per-class real images, optionally augmented by ControlNet edge-map conditioning; these models generate synthetic samples that augment the detector's training set without requiring any new real photographs.
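The pipeline's prompts are described only as "automatically generated"; the actual templates are not reproduced in the review. A minimal sketch of what per-class prompt generation might look like, where the template, viewpoint, and setting strings are purely illustrative assumptions:

```python
import itertools
import random

# Hypothetical templates -- the paper does not reproduce its actual prompt
# templates, so the class name, viewpoints, and settings here are illustrative.
TEMPLATE = "a photo of a {cls}, {view}, {setting}"
VIEWS = ["front view", "side view", "rear three-quarter view"]
SETTINGS = ["on a dirt road", "in an open field", "partially occluded by trees"]

def generate_prompts(class_name, n, seed=0):
    """Sample n distinct prompt variants for one vehicle class."""
    rng = random.Random(seed)
    combos = list(itertools.product(VIEWS, SETTINGS))
    rng.shuffle(combos)
    return [TEMPLATE.format(cls=class_name, view=v, setting=s)
            for v, s in combos[:n]]

prompts = generate_prompts("tracked armored vehicle", 4)
```

Each prompt would then drive the corresponding class-specific FLUX-LoRA model; the paper's real templates and class vocabulary may differ substantially.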
If this is right
- Performance improvements are largest precisely when the number of real training images is smallest.
- Structural conditioning supplies additional benefit only in the most data-scarce regime and becomes redundant once more real samples are present.
- The entire pipeline uses the same limited real images both to adapt the generator and to train the detector, eliminating the need for further labeling.
- Class-specific generation avoids the identity mixing that occurs when a single generic model is asked to produce many vehicle types.
- The method offers a data-efficient alternative to physics-based simulation pipelines for training military detection systems.
Where Pith is reading between the lines
- The same per-class adaptation pattern could be tested on other scarce-image domains such as industrial defects or rare plant species.
- If subtle distribution shifts remain between synthetic and real images, periodic insertion of fresh real photographs may still be required to maintain long-term accuracy.
- Directly coupling the generator to an active-learning loop that requests new real labels only when synthetic augmentation saturates could further reduce annotation cost.
- Extending the approach to video or multi-view sequences would test whether the observed gains generalize beyond static frames.
Load-bearing premise
The synthetic images must be sufficiently realistic and free of domain-specific artifacts that they help the detector generalize to new real photographs rather than reinforcing flaws unique to the generated data.
What would settle it
Train the detector on the combined real-plus-synthetic set and evaluate mAP50 on a separate collection of real military-vehicle photographs taken under different lighting, backgrounds, and viewpoints; if the score does not exceed the real-only baseline, the claim is falsified.
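The falsification test above reduces to computing average precision at IoU 0.5 on held-out real images. A minimal single-class sketch, using greedy confidence-ordered matching and a step-wise precision-recall area (COCO-style evaluation additionally interpolates precision and averages over classes):

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0

def ap50(detections, ground_truth):
    """Single-class AP at IoU 0.5.

    detections: list of (image_id, confidence, box), any order.
    ground_truth: dict mapping image_id -> list of boxes.
    """
    n_gt = sum(len(b) for b in ground_truth.values())
    if n_gt == 0:
        return 0.0
    used = {img: [False] * len(b) for img, b in ground_truth.items()}
    ap = tp = fp = 0
    prev_recall = 0.0
    for img, _conf, box in sorted(detections, key=lambda d: -d[1]):
        # greedily match the highest-IoU unused ground-truth box (>= 0.5)
        best, best_iou = -1, 0.5
        for j, g in enumerate(ground_truth.get(img, [])):
            v = iou(box, g)
            if v >= best_iou and not used[img][j]:
                best, best_iou = j, v
        if best >= 0:
            used[img][best] = True
            tp += 1
        else:
            fp += 1
        recall = tp / n_gt
        ap += (recall - prev_recall) * (tp / (tp + fp))
        prev_recall = recall
    return ap

gt = {"img1": [(0, 0, 10, 10)]}
dets = [("img1", 0.9, (1, 1, 10, 10)),    # IoU 0.81 with ground truth: true positive
        ("img1", 0.8, (50, 50, 60, 60))]  # no overlap: false positive
score = ap50(dets, gt)
```

The claim is then a simple comparison: `ap50` of the real-plus-synthetic detector must exceed `ap50` of the real-only baseline on the same held-out set.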
Original abstract
Diffusion-based image synthesis has emerged as a promising source of synthetic training data for AI-based object detection and classification. In this work, we investigate whether images generated with diffusion can improve military vehicle detection under low-data conditions. We fine-tuned the text-to-image diffusion model FLUX.1 [dev] using LoRA with only 8 or 24 real images per class across 15 vehicle categories, resulting in class-specific diffusion models, which were used to generate new samples from automatically generated text prompts. The same real images were used to fine-tune the RF-DETR detector for a 15-class object detection task. Synthetic datasets generated by the diffusion models were then used to further improve detector performance. Importantly, no additional real data was required, as the generative models leveraged the same limited training samples. FLUX-generated images improved detection performance, particularly in the low-data regime (up to +8.0% mAP$_{50}$ with 8 real samples). To address the limited geometric control of text prompt-based diffusion, we additionally generated structurally guided synthetic data using ControlNet with Canny edge-map conditioning, yielding a FLUX-ControlNet (FLUX-CN) dataset with explicit control over viewpoint and pose. Structural guidance further enhanced performance when data is scarce (+4.1% mAP$_{50}$ with 8 real samples), but no additional benefit was observed when more real data is available. This study demonstrates that object-specific diffusion models are effective for improving military object detection in a low-data domain, and that structural guidance is most beneficial when real data is highly limited. These results highlight generative image data as an alternative to traditional simulation pipelines for the training of military AI systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents an empirical study on using class-specific fine-tuned FLUX diffusion models to generate synthetic training data for improving 15-class military vehicle detection with RF-DETR in low-data regimes. Using LoRA fine-tuning on 8 or 24 real images per class, the generated images are shown to boost mAP50 by up to 8%, with further gains from ControlNet Canny conditioning in the scarcest data settings.
Significance. Should the findings prove robust upon closer inspection of the experimental protocol, the results would be significant for low-resource object detection tasks, particularly in specialized domains like defense where acquiring large annotated datasets is challenging. The approach leverages generative AI to augment data without needing additional real samples, which is a practical contribution. The differential benefit of structural guidance in extreme low-data cases is noteworthy.
major comments (2)
- The experimental protocol does not specify the total number of synthetic images generated per class, the exact method and templates used for automatic prompt generation, the number of independent training runs, or any statistical significance testing (e.g., standard deviations or p-values) for the reported mAP50 gains of +8.0% (8-sample regime) and +4.1% (ControlNet). These omissions make it difficult to assess the reliability of the central claim that class-specific diffusion augmentation improves detector generalization on held-out real data.
- No ablation or comparison is provided against standard data-augmentation baselines (e.g., geometric/color jitter, MixUp, or other diffusion models without class-specific LoRA) or against purely real-data training with equivalent compute. This weakens the attribution of gains specifically to the class-specific FLUX-LoRA models rather than generic augmentation effects.
minor comments (2)
- The abstract and results text should explicitly state the baseline (real-data-only RF-DETR) against which the percentage improvements are measured and clarify whether the same 8/24 real images are used in all conditions.
- Notation for mAP should be standardized (e.g., mAP_{50}) and any tables or figures reporting results should include error bars or run counts for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects of experimental rigor and attribution. We address each point below and will revise the manuscript to incorporate the requested details and comparisons.
Point-by-point responses
-
Referee: The experimental protocol does not specify the total number of synthetic images generated per class, the exact method and templates used for automatic prompt generation, the number of independent training runs, or any statistical significance testing (e.g., standard deviations or p-values) for the reported mAP50 gains of +8.0% (8-sample regime) and +4.1% (ControlNet). These omissions make it difficult to assess the reliability of the central claim that class-specific diffusion augmentation improves detector generalization on held-out real data.
Authors: We agree that these protocol details are necessary for full reproducibility and to allow readers to evaluate the reliability of the reported gains. In the revised manuscript we will add a dedicated experimental protocol subsection that states the total number of synthetic images generated per class, describes the automatic prompt generation procedure and templates in full, reports the number of independent training runs with different random seeds, and includes standard deviations together with appropriate statistical significance testing (e.g., paired t-tests) for the mAP50 improvements. These additions will directly address the concern about assessing the robustness of the +8.0% and +4.1% gains. revision: yes
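The paired t-test the authors promise can be sketched over per-seed mAP50 scores. The numbers below are illustrative, not from the paper; in practice `scipy.stats.ttest_rel` performs the same computation and also returns the p-value:

```python
import math
from statistics import mean, stdev

def paired_t(baseline, treatment):
    """Paired t statistic over matched per-seed scores.

    Returns (t, degrees of freedom); compare t against the critical value
    for dof = n - 1, or use scipy.stats.ttest_rel to obtain a p-value.
    """
    assert len(baseline) == len(treatment)
    diffs = [b - a for a, b in zip(baseline, treatment)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Illustrative per-seed mAP50 values -- NOT numbers from the paper
real_only     = [0.412, 0.398, 0.405, 0.421, 0.409]
real_plus_syn = [0.489, 0.471, 0.480, 0.495, 0.486]
t, dof = paired_t(real_only, real_plus_syn)
```

Pairing by seed matters here: the same random seed fixes detector initialization and data ordering across both regimes, so the test isolates the effect of the synthetic data.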
-
Referee: No ablation or comparison is provided against standard data-augmentation baselines (e.g., geometric/color jitter, MixUp, or other diffusion models without class-specific LoRA) or against purely real-data training with equivalent compute. This weakens the attribution of gains specifically to the class-specific FLUX-LoRA models rather than generic augmentation effects.
Authors: We concur that additional baselines are required to isolate the benefit of class-specific LoRA fine-tuning from generic augmentation or compute effects. In the revision we will introduce a new ablation table that compares our method against (i) standard geometric and color jitter applied to the real images, (ii) MixUp augmentation, (iii) synthetic data generated by the unmodified base FLUX model, and (iv) purely real-data training with an equivalent additional compute budget (via oversampling or extended training). Results will be reported for both the 8-sample and 24-sample regimes, allowing clearer attribution of the observed improvements to the class-specific diffusion models. revision: yes
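Of the requested baselines, MixUp is the simplest to pin down. A classification-style sketch of the mixing step (detection variants must also combine box annotations, which the revision would need to specify):

```python
import random

def mixup(x1, y1, x2, y2, alpha=0.2, rng=random):
    """MixUp: convex-combine a pair of samples and their one-hot labels.

    x1, x2: flat pixel lists; y1, y2: one-hot label lists. The mixing
    weight is drawn from Beta(alpha, alpha), as in the original MixUp
    formulation for classification.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam

# Two toy 2x2 "images" (flattened) from different classes
img_a, lab_a = [0.0, 0.0, 0.0, 0.0], [1.0, 0.0]
img_b, lab_b = [1.0, 1.0, 1.0, 1.0], [0.0, 1.0]
x, y, lam = mixup(img_a, lab_a, img_b, lab_b)
```

With alpha = 0.2 the Beta distribution concentrates mass near 0 and 1, so most mixed samples stay close to one of the two originals, a common default choice.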
Circularity Check
No circularity: purely empirical evaluation
Full rationale
The paper is an end-to-end experimental study that measures detector mAP on held-out real images after augmenting training sets with images from class-specific FLUX-LoRA models. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the reported protocol. Performance gains are established by direct comparison of real-only versus real-plus-synthetic training regimes; the central assumption (that generated images are useful) is tested by the same external metric rather than being presupposed by any internal construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- LoRA adaptation rank and scaling
- Number and diversity of generated synthetic samples
axioms (2)
- domain assumption Fine-tuned class-specific diffusion models produce images that are distributionally close enough to real images to improve downstream detector generalization.
- domain assumption Text prompts and Canny edge maps provide sufficient control to generate useful viewpoint and pose variation.
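The second axiom leans on Canny edge maps as the conditioning signal. A simplified gradient-magnitude stand-in shows what such a map encodes; real Canny (e.g. `cv2.Canny`) additionally applies Gaussian smoothing, non-maximum suppression, and hysteresis thresholding:

```python
def sobel_edges(img, thresh=1.0):
    """Gradient-magnitude edge map: a simplified stand-in for Canny.

    img: 2D list of grayscale values; returns a binary 2D edge map
    over the interior pixels (borders are left as 0).
    """
    h, w = len(img), len(img[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Sobel horizontal and vertical gradients
            gx = (img[y-1][x+1] + 2*img[y][x+1] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y][x-1] - img[y+1][x-1])
            gy = (img[y+1][x-1] + 2*img[y+1][x] + img[y+1][x+1]
                  - img[y-1][x-1] - 2*img[y-1][x] - img[y-1][x+1])
            if (gx * gx + gy * gy) ** 0.5 >= thresh:
                edges[y][x] = 1
    return edges

# A tiny image with a vertical step edge between columns 2 and 3
img = [[0.0] * 3 + [1.0] * 3 for _ in range(5)]
edges = sobel_edges(img)
```

The resulting binary map carries object silhouette and pose but no texture or color, which is exactly why it constrains viewpoint without dictating appearance.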
Reference graph
Works this paper leans on
-
[1]
On the use of simulated data for target recognition and mission planning,
Heslinga, F. G., Fokkinga, E. P., Eker, T. H., Liezenga, A. M., den Hollander, R. J. M., Oppeneer, V. O., van Heteren, A. M., van Vossen, R., Kuijf, H. J., van de Sande, J. J. M., van der Burg, D. W., Weyland, L. F., Henderson, H. C., Schadd, M. P. D., and Schutte, K., “On the use of simulated data for target recognition and mission planning,” in [Artific...
2024
-
[2]
Generative AI methods for synthesis of image data to train AI for automated scene understanding in a military context: a review of opportunities,
Fokkinga, E. P., Eker, T. A., van Woerden, J. E., Witon, J.-M., Stallinga, S. O., Visser, A., Schutte, K., and Heslinga, F. G., “Generative AI methods for synthesis of image data to train AI for automated scene understanding in a military context: a review of opportunities,” in [Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techn...
2025
-
[3]
Azizi, S., Kornblith, S., Saharia, C., Norouzi, M., and Fleet, D. J., “Synthetic data from diffusion models improves imagenet classification. arxiv 2023,”arXiv preprint arXiv:2304.08466(2023)
-
[4]
DiffusionDet: Diffusion model for object detection,
Chen, S., Sun, P., Song, Y., and Luo, P., “DiffusionDet: Diffusion model for object detection,” in [IEEE/CVF International Conference on Computer Vision], 19830–19843 (2023)
2023
-
[5]
LoRA: Low-rank adaptation of large language models,
Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al., “LoRA: Low-rank adaptation of large language models,” ICLR 1(2), 3 (2022)
2022
-
[6]
T-LoRA: Single image diffusion model customization without overfitting,
Soboleva, V., Alanov, A., Kuznetsov, A., and Sobolev, K., “T-LoRA: Single image diffusion model customization without overfitting,” in [AAAI Conference on Artificial Intelligence], 40(11), 9051–9059 (2026)
2026
-
[7]
The effect of simulation variety on a deep learning-based military vehicle detector,
Eker, T. A., Heslinga, F. G., Ballan, L., den Hollander, R. J., and Schutte, K., “The effect of simulation variety on a deep learning-based military vehicle detector,” in [Artificial Intelligence for Security and Defence Applications],12742, 183–196, SPIE Sensors + Imaging (2023)
2023
-
[8]
Adding conditional control to text-to-image diffusion models,
Zhang, L., Rao, A., and Agrawala, M., “Adding conditional control to text-to-image diffusion models,” in [Proceedings of the IEEE/CVF international conference on computer vision], 3836–3847 (2023)
2023
-
[9]
Combining simulated data, foundation models, and few real samples for training object detectors,
Heslinga, F. G., Eker, T. A., Fokkinga, E. P., van Woerden, J. E., Ruis, F. A., den Hollander, R. J. M., and Schutte, K., “Combining simulated data, foundation models, and few real samples for training object detectors,” in [Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications II],13035, SPIE Defense + Comme...
2024
-
[10]
Unlocking thermal aerial imaging: Synthetic enhancement of UAV datasets,
Kulas, A. B., Jurasovic, A., and Bogdan, S., “Unlocking thermal aerial imaging: Synthetic enhancement of UAV datasets,” (2025)
2025
-
[11]
3DSM-COS: A 3D model-based synthetic data pipeline for military camouflaged object segmentation with distractor-augmented realism,
Truong, T.-T.-H., Tran, T.-K., Hoang, N.-B., Nguyen, T.-D., Phan, T.-H.-H., and Nguyen, C.-T., “3DSM-COS: A 3D model-based synthetic data pipeline for military camouflaged object segmentation with distractor-augmented realism,” in [2025 International Conference on Multimedia Analysis and Pattern Recognition (MAPR)], 1–6 (2025)
2025
-
[12]
Balancing 3D-model fidelity for training a vehicle detector on simulated data,
Eker, T. A., Fokkinga, E. P., Heslinga, F. G., and Schutte, K., “Balancing 3D-model fidelity for training a vehicle detector on simulated data,” in [Artificial Intelligence for Security and Defence Applications II], 13206, SPIE Sensors + Imaging (2024)
2024
-
[13]
Tang, D., Cao, X., Wu, X., Li, J., Yao, J., Bai, X., and Meng, D., “AeroGen: Enhancing remote sensing object detection with diffusion-driven data generation,”arXiv preprint arXiv:2411.15497(2024)
-
[14]
Improving object detector training on synthetic data by starting with a strong baseline methodology,
Ruis, F. A., Liezenga, A. M., Heslinga, F. G., Ballan, L., den Hollander, R. J., van Leeuwen, M. C., Masinia, B., Dijk, J., and Huizinga, W., “Improving object detector training on synthetic data by starting with a strong baseline methodology,” in [Synthetic Data for Artificial Intelligence and Machine Learning: Tools, Techniques, and Applications II],130...
2024
-
[15]
Impact of style transfer approaches on synthetic data for military camouflaged object detection,
Truong, T.-T.-H., Nguyen, D.-V., Luong, D.-H., Nguyen, A.-T., Nguyen, N.-S., Vu, H.-K., Nguyen, D.-P., and Tran, T.-K., “Impact of style transfer approaches on synthetic data for military camouflaged object detection,” in [Information and Communication Technology], Buntine, W., Fjeld, M., Tran, T., Tran, M.-T., Huynh Thi Thanh, B., and Miyoshi, T., eds., ...
2025
-
[16]
Generative adversarial networks,
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y., “Generative adversarial networks,” Communications of the ACM 63(11), 139–144 (2020)
2020
-
[17]
Implicit multi-spectral transformer: An lightweight and effective visible to infrared image translation model,
Chen, Y., Chen, P., Zhou, X., Lei, Y., Zhou, Z., and Li, M., “Implicit multi-spectral transformer: An lightweight and effective visible to infrared image translation model,” in [2024 International Joint Conference on Neural Networks (IJCNN)], 1–8, IEEE (2024)
2024
-
[18]
InfraGAN: A GAN architecture to transfer visible images to infrared domain,
Özkanoğlu, M. A. and Ozer, S., “InfraGAN: A GAN architecture to transfer visible images to infrared domain,” Pattern Recognition Letters 155, 69–76 (2022)
2022
-
[19]
CycleGAN-based realistic image dataset generation for forward-looking sonar,
Liu, D., Wang, Y., Ji, Y., Tsuchiya, H., Yamashita, A., and Asama, H., “CycleGAN-based realistic image dataset generation for forward-looking sonar,” Advanced Robotics 35(3-4), 242–254 (2021)
2021
-
[20]
FLUX,
Labs, B. F., “FLUX.” https://github.com/black-forest-labs/flux (2024)
2024
-
[21]
Data augmentation for vehicle detection with diffusion-based object inpainting,
Snel, S. P., Eker, T. A., Fokkinga, E. P., Visser, A., Schutte, K., and Heslinga, F. G., “Data augmentation for vehicle detection with diffusion-based object inpainting,” in [Artificial Intelligence for Security and Defence Applications III],13679, 294–307, SPIE Sensors + Imaging (2025)
2025
-
[22]
Team, K., Du, A., Yin, B., Xing, B., Qu, B., Wang, B., Chen, C., Zhang, C., Du, C., Wei, C., et al., “Kimi-VL technical report,”arXiv preprint arXiv:2504.07491(2025)
2025
-
[23]
Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection,
Liu, S., Zeng, Z., Ren, T., Li, F., Zhang, H., Yang, J., Jiang, Q., Li, C., Yang, J., Su, H., et al., “Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection,” in [European Conference on Computer Vision], 38–55, Springer (2024)
2024
-
[24]
Blender - a 3D modelling and rendering package,
Community, B. O.,Blender - a 3D modelling and rendering package. Blender Foundation, Stichting Blender Foundation, Amsterdam (2018)
2018
-
[25]
Robinson, I., Robicheaux, P., Popov, M., Ramanan, D., and Peri, N., “RF-DETR: neural architecture search for real-time detection transformers,”arXiv preprint arXiv:2511.09554(2025)
-
[26]
DINOv2: Learning robust visual features without supervision,
Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.-Y., Li, S.-W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., and Bojanowski, P., “DINOv2: Learning robust visual...
2024
-
[27]
Ultralytics yolo,
Jocher, G., Qiu, J., and Chaurasia, A., “Ultralytics yolo,” (January 2024)
2024
-
[28]
A survey on performance metrics for object-detection algorithms,
Padilla, R., Netto, S. L., and Da Silva, E. A., “A survey on performance metrics for object-detection algorithms,” in [2020 international conference on systems, signals and image processing (IWSSIP)], 237– 242, IEEE (2020)
2020
-
[29]
Conditioning diffusion models via attributes and semantic masks for face generation,
Lisanti, G. and Giambi, N., “Conditioning diffusion models via attributes and semantic masks for face generation,” Computer Vision and Image Understanding 244, 104026 (2024)
2024
-
[30]
Loosecontrol: Lifting controlnet for generalized depth conditioning,
Bhat, S. F., Mitra, N., and Wonka, P., “Loosecontrol: Lifting controlnet for generalized depth conditioning,” in [ACM SIGGRAPH 2024 Conference Papers], 1–11 (2024)
2024
-
[31]
Advancing state of the art object detection (again) with RF-DETR
Gallagher, J. and Nelson, J., “Advancing state of the art object detection (again) with RF-DETR.” https://blog.roboflow.com/rf-detr-nano-small-medium/ (July 2025). Roboflow Blog
2025
-
[32]
Tide: A general toolbox for identifying object detection errors,
Bolya, D., Foley, S., Hays, J., and Hoffman, J., “Tide: A general toolbox for identifying object detection errors,” in [European Conference on Computer Vision], 558–573, Springer (2020)
2020
-
[33]
Appendix: system and user prompts for captioning and prompt generation
The system prompt that was provided to Gemma-3-12b-it, which was used to caption the real image dataset described in Section 3.1, is reported in Listing 1. The corresponding user prompt is reported in Listing 2. The field {vehicle name} is replaced by the vehicle class name. The system prompt that was provided to GPT-4, which was used for generatin...