Dual-Control Frequency-Aware Diffusion Model for Depth-Dependent Optical Microrobot Microscopy Image Generation
Pith reviewed 2026-05-10 15:06 UTC · model grok-4.3
The pith
A dual-control diffusion model generates physically consistent, depth-dependent microscopy images of optical microrobots from small datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Du-FreqNet is a dual-control, frequency-aware diffusion model that encodes microrobot 3D point clouds and depth-specific mesh layers through separate ControlNet branches and applies an adaptive frequency-domain loss that reweights components based on distance to the focal plane using differentiable FFT. This enables controllable synthesis of depth-dependent microscopy images that match physical optical characteristics.
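To make the idea of a depth-reweighted frequency loss concrete, here is a minimal NumPy sketch. The paper's actual weighting formula is not given on this page, so the schedule in `depth_weights` (an exponential fade of high-frequency emphasis with distance from focus) and the parameters `scale` and `gain` are illustrative assumptions, not the authors' method. In training, the same operations would be written in an autodiff framework so that gradients flow through the FFT.

```python
import numpy as np

def radial_frequency(shape):
    """Normalised radial spatial frequency per FFT bin (0 at DC, 1 at the largest radius)."""
    fy = np.fft.fftfreq(shape[0])[:, None]
    fx = np.fft.fftfreq(shape[1])[None, :]
    r = np.sqrt(fx ** 2 + fy ** 2)
    return r / r.max()

def depth_weights(shape, depth_offset, scale=5.0, gain=4.0):
    """Hypothetical schedule: extra weight on high frequencies near focus,
    fading as |depth_offset| (distance to the focal plane) grows."""
    focus = np.exp(-abs(depth_offset) / scale)  # 1 at the focal plane, approaches 0 far away
    return 1.0 + gain * focus * radial_frequency(shape)

def adaptive_frequency_loss(pred, target, depth_offset, scale=5.0, gain=4.0):
    """Weighted squared error between the 2D spectra of two images.
    With gain=0 this reduces to the pixel-space mean squared error (Parseval)."""
    diff = np.fft.fft2(pred) - np.fft.fft2(target)
    w = depth_weights(pred.shape, depth_offset, scale, gain)
    return float(np.mean(w * np.abs(diff) ** 2) / pred.size)
```

Under this assumed schedule, the same high-frequency error costs more for an in-focus image than for a strongly defocused one, which is the qualitative behaviour the core claim describes.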
What carries the argument
The dual ControlNet branches for 3D point clouds and depth mesh layers, combined with the adaptive frequency-domain loss supervised via differentiable FFT.
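As a rough illustration of how two control branches can feed one backbone, the toy sketch below sums the residuals of two independent branches into the backbone features. ControlNet's zero-initialised output layers are a documented part of that architecture; everything else here (the `ControlBranch` class, the linear encoder, and fusion by simple addition in `fuse`) is an assumption made for illustration, since this page does not state how Du-FreqNet combines its branches.

```python
import numpy as np

class ControlBranch:
    """Toy control branch: a small encoder followed by a zero-initialised projection."""

    def __init__(self, in_dim, feat_dim, rng):
        self.encoder = rng.standard_normal((in_dim, feat_dim)) * 0.1
        # Starts at zero so the branch initially leaves the backbone unchanged,
        # mirroring the zero-convolution idea in ControlNet.
        self.zero_proj = np.zeros((feat_dim, feat_dim))

    def __call__(self, condition):
        return np.tanh(condition @ self.encoder) @ self.zero_proj

def fuse(backbone_features, point_cloud_residual, mesh_layer_residual):
    """Assumed fusion rule: add both control residuals to the backbone features."""
    return backbone_features + point_cloud_residual + mesh_layer_residual
```

With this design each conditioning signal can be trained, or switched off, without touching the other branch.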
If this is right
- Achieves controllable depth-dependent image synthesis from limited data.
- Improves SSIM by 20.7% compared to baseline methods.
- Generalizes to poses not seen during training.
- Enhances accuracy of downstream 3D pose and depth estimation tasks.
- Supports robust closed-loop control in microrobotic systems.
Where Pith is reading between the lines
- The method may allow researchers to train perception models without extensive real-world data collection for similar microscale optical systems.
- Frequency reweighting based on focal distance could be applied to generate synthetic data for other depth-sensitive imaging modalities.
- Improved image generation might accelerate the development of autonomous microrobots for biological applications like cell manipulation.
- The approach highlights the value of incorporating physical priors, such as Fourier transforms, into generative models for scientific imaging.
Load-bearing premise
The adaptive frequency-domain loss, which reweights high- and low-frequency components according to distance to the focal plane, accurately captures real physical diffraction and defocus effects without adding artifacts or overfitting the small dataset.
What would settle it
The claim would be undermined if real microscopy images at known depths show frequency distributions that do not match those the model produces when conditioned on the same depth and pose, or if performance on downstream tasks does not improve when the generated images are used.
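One way to run the frequency-distribution check described above is to compare radially averaged power spectra of a real and a generated image at the same depth and pose. The sketch below is a hedged example of that comparison; the function names, the bin count, and the log-spectrum distance are assumptions chosen for illustration, not a metric taken from the paper.

```python
import numpy as np

def radial_power_spectrum(image, n_bins=16):
    """Mean spectral power in n_bins rings of increasing spatial frequency."""
    power = np.abs(np.fft.fft2(image)) ** 2
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, radius.max(), n_bins + 1)
    # Map each FFT bin to a ring index; the clip folds the outermost radius into the last ring.
    idx = np.clip(np.digitize(radius.ravel(), edges) - 1, 0, n_bins - 1)
    total = np.bincount(idx, weights=power.ravel(), minlength=n_bins)
    count = np.bincount(idx, minlength=n_bins)
    return total / np.maximum(count, 1)

def spectral_distance(real, generated, n_bins=16, eps=1e-12):
    """Mean absolute difference between the log radial spectra of two images."""
    a = np.log(radial_power_spectrum(real, n_bins) + eps)
    b = np.log(radial_power_spectrum(generated, n_bins) + eps)
    return float(np.mean(np.abs(a - b)))
```

A distance that grows systematically with depth offset, for matched real and generated pairs, would suggest the generated defocus behaviour departs from what the microscope actually records.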
Original abstract
Optical microrobots actuated by optical tweezers (OT) are important for cell manipulation and microscale assembly, but their autonomous operation depends on accurate 3D perception. Developing such perception systems is challenging because large-scale, high-quality microscopy datasets are scarce, owing to complex fabrication processes and labor-intensive annotation. Although generative AI offers a promising route for data augmentation, existing generative adversarial network (GAN)-based methods struggle to reproduce key optical characteristics, particularly depth-dependent diffraction and defocus effects. To address this limitation, we propose Du-FreqNet, a dual-control, frequency-aware diffusion model for physically consistent microscopy image synthesis. The framework features two independent ControlNet branches to encode microrobot 3D point clouds and depth-specific mesh layers, respectively. We introduce an adaptive frequency-domain loss that dynamically reweights high- and low-frequency components based on the distance to the focal plane. By leveraging differentiable FFT-based supervision, Du-FreqNet captures physically meaningful frequency distributions often missed by pixel-space methods. Trained on a limited dataset (e.g., 80 images per pose), our model achieves controllable, depth-dependent image synthesis, improving SSIM by 20.7% over baselines. Extensive experiments demonstrate that Du-FreqNet generalizes effectively to unseen poses and significantly enhances downstream tasks, including 3D pose and depth estimation, thereby facilitating robust closed-loop control in microrobotic systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Du-FreqNet, a dual-control frequency-aware diffusion model for synthesizing depth-dependent optical microrobot microscopy images. It uses two independent ControlNet branches to encode 3D point clouds and depth-specific mesh layers, respectively, together with an adaptive frequency-domain loss that dynamically reweights high- and low-frequency components via differentiable FFT according to distance to the focal plane. Trained on a small dataset (80 images per pose), the model is reported to achieve controllable depth-dependent synthesis with a 20.7% SSIM improvement over baselines, effective generalization to unseen poses, and improved performance on downstream 3D pose and depth estimation tasks.
Significance. If the frequency-aware loss produces images whose frequency content is physically consistent with diffraction and defocus rather than merely fitting training-set statistics, the approach would provide a useful data-augmentation tool for perception in optical microrobotics, where real annotated datasets are scarce. The dual-control architecture is a constructive design choice for separating pose and depth conditioning, and the use of differentiable FFT supervision is a clear technical strength for frequency-aware generation.
major comments (2)
- [Method (adaptive frequency-domain loss)] The loss is described as dynamically reweighting FFT components based on distance to the focal plane, yet no derivation from the optical transfer function, pupil function, or measured point-spread function is supplied. Without such grounding or an ablation against a physics-based simulator, it remains unclear whether the 20.7% SSIM gain and downstream-task improvements reflect physical consistency or overfitting to the limited training distribution.
- [Experiments] The reported 20.7% SSIM improvement and generalization to unseen poses are presented without baseline implementation details, statistical significance tests, error bars, or explicit train/test split and data-exclusion criteria. Given the small training set size (80 images per pose), these omissions make it difficult to evaluate the robustness of the central empirical claims.
minor comments (1)
- [Abstract] The abstract and results would benefit from a brief statement of the total number of distinct poses and the precise train/validation/test partitioning to allow readers to assess the scale of the generalization experiments.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which have helped us identify areas for improvement in clarity and rigor. We address each major comment point by point below and will revise the manuscript accordingly to strengthen the presentation of our method and experiments.
Point-by-point responses
- Referee: Method section (adaptive frequency-domain loss): The loss is described as dynamically reweighting FFT components based on distance to the focal plane, yet no derivation from the optical transfer function, pupil function, or measured point-spread function is supplied. Without such grounding or an ablation against a physics-based simulator, it remains unclear whether the 20.7% SSIM gain and downstream-task improvements reflect physical consistency or overfitting to the limited training distribution.
  Authors: We appreciate the referee highlighting this point. The adaptive frequency-domain loss was developed from empirical observations of frequency attenuation in our microscopy dataset, where high-frequency components diminish with increasing distance from the focal plane due to defocus. The reweighting is implemented via a differentiable FFT that modulates the loss based on a distance-dependent schedule derived from measured image statistics rather than a closed-form optical model. In the revised manuscript, we will expand the method section with the explicit weighting formula, its empirical motivation, and a new ablation comparing the adaptive loss to a uniform-frequency baseline. We will also add a limitations paragraph acknowledging that the approach approximates observed optical effects without a direct derivation from the optical transfer function or pupil function, and that future work could incorporate physics-based simulators for stricter consistency. These changes will clarify the distinction between data-driven frequency awareness and full physical modeling while demonstrating that the reported gains are supported by improved generalization to unseen poses.
  Revision: partial
- Referee: Experiments section: The reported 20.7% SSIM improvement and generalization to unseen poses are presented without baseline implementation details, statistical significance tests, error bars, or explicit train/test split and data-exclusion criteria. Given the small training set size (80 images per pose), these omissions make it difficult to evaluate the robustness of the central empirical claims.
  Authors: We agree that these details are essential for assessing robustness, particularly with the modest dataset size. In the revised manuscript, we will add a dedicated experimental details subsection that includes full baseline implementations (architectures, hyperparameters, and training protocols), results with standard error bars computed over five independent runs, and statistical significance testing (paired t-tests with p-values) for the SSIM improvements. The train/test protocol will be explicitly stated, specifying that 80 images per pose were used for training with a held-out set of unseen poses for generalization evaluation, along with the precise data-exclusion criteria applied during collection. These additions will enable readers to better judge the reliability of the empirical claims.
  Revision: yes
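The statistics promised in the second response (standard errors over five runs and a paired t-test) can be sketched with the Python standard library alone. The helper name `paired_t`, the pairing of baseline and model scores by run, and the example scores are assumptions; the numbers are placeholders, not results from the paper.

```python
import math
import statistics

def paired_t(baseline_scores, model_scores):
    """Paired t-test ingredients for per-run scores.

    Returns (mean difference, standard error, t statistic, degrees of freedom).
    Assumes both lists come from the same runs, in the same order.
    """
    diffs = [m - b for b, m in zip(baseline_scores, model_scores)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std of the differences
    return mean_diff, std_err, mean_diff / std_err, n - 1

# Placeholder SSIM scores for five runs (illustrative only, not from the paper).
baseline = [0.61, 0.63, 0.60, 0.62, 0.64]
model = [0.74, 0.75, 0.73, 0.76, 0.74]
mean_diff, std_err, t_stat, dof = paired_t(baseline, model)
print(f"mean diff={mean_diff:.3f}, SE={std_err:.3f}, t={t_stat:.2f}, df={dof}")
```

With five runs there are four degrees of freedom, so a two-sided test at the 5% level requires |t| above roughly 2.776; a p-value would come from the Student t distribution with that many degrees of freedom.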
Circularity Check
No circularity: empirical training and external evaluation
The paper proposes a dual-ControlNet diffusion architecture with an adaptive frequency-domain loss, trains it on a small set of real microscopy images (80 per pose), and reports SSIM gains plus downstream improvements on held-out poses and tasks. No load-bearing step reduces a claimed prediction or result to its own inputs by construction, no self-citation chain justifies a uniqueness claim, and the loss is presented as a design choice rather than a derived identity. The central claims rest on standard empirical benchmarks against external image data and baselines, making the work self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- frequency reweighting schedule
axioms (1)
- domain assumption: Differentiable FFT supervision captures physically meaningful frequency distributions missed by pixel-space losses