Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra
Pith reviewed 2026-05-21 12:54 UTC · model grok-4.3
The pith
Apple Silicon optimization for diffusion models follows a different path than NVIDIA GPUs, enabling 22.7 FPS real-time camera img2img on the M3 Ultra.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The optimization landscape for diffusion models on Apple Silicon is fundamentally different from that on NVIDIA GPUs because of the unified memory architecture. Quantization brings no speedup, parallel inference is ineffective, and the Neural Engine is unsuitable for large models. Practical guidelines emerge from the experiments, and the concrete demonstration is real-time camera img2img at 22.7 FPS obtained by CoreML conversion of the distillation-specialized SDXS-512 model running inside a three-thread camera pipeline.
What carries the argument
The 3-thread camera pipeline paired with CoreML-converted SDXS-512, which exploits unified memory to sustain real-time throughput where CUDA-derived techniques do not.
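The pipeline structure can be illustrated with a minimal sketch. This summary does not spell out what each of the three threads does, so the split below (capture, inference, display) is an assumption, and `run_pipeline`, `capture`, `infer`, and `display` are hypothetical names rather than the paper's code. The stand-in callables only show how bounded queues let the three stages overlap; no CoreML model or camera is involved.

```python
import queue
import threading

STOP = object()  # sentinel that tells the next stage to shut down


def run_pipeline(capture, infer, display, n_frames):
    """Run capture -> infer -> display on three threads, one per stage."""
    frames_q = queue.Queue(maxsize=2)   # capture -> inference
    results_q = queue.Queue(maxsize=2)  # inference -> display

    def capture_loop():
        for i in range(n_frames):
            frames_q.put(capture(i))    # blocks when inference falls behind
        frames_q.put(STOP)

    def infer_loop():
        while (frame := frames_q.get()) is not STOP:
            results_q.put(infer(frame))
        results_q.put(STOP)

    def display_loop():
        while (result := results_q.get()) is not STOP:
            display(result)

    threads = [
        threading.Thread(target=stage)
        for stage in (capture_loop, infer_loop, display_loop)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


# Stand-ins: "frames" are integers and "inference" doubles them.
shown = []
run_pipeline(capture=lambda i: i, infer=lambda x: x * 2,
             display=shown.append, n_frames=5)
print(shown)  # [0, 2, 4, 6, 8]
```

The bounded queues apply backpressure, so every frame is processed in order. A live camera pipeline might instead drop stale frames to keep latency low; which policy the paper uses is not stated here.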
If this is right
- Quantization should be skipped when targeting Apple Silicon for diffusion inference.
- Parallel model execution is unlikely to help and may hurt performance on unified-memory chips.
- The Neural Engine is not a reliable accelerator for large diffusion models.
- Real-time camera-driven image-to-image generation is now feasible on high-end Apple Silicon desktops such as the M3 Ultra, without external GPUs.
- Developers should prioritize CoreML conversion and multi-threaded pipelines when porting diffusion workflows to Apple hardware.
Where Pith is reading between the lines
- Mobile and laptop apps could incorporate on-device generative editing at interactive speeds if similar distilled models and pipelines are adopted more widely.
- Platform-specific optimization toolkits may become necessary instead of assuming CUDA results transfer across vendors.
- The unified-memory advantage might extend to other generative tasks beyond diffusion, such as real-time video synthesis on the same hardware.
Load-bearing premise
The ten tested optimization phases and the particular choice of SDXS-512 plus the three-thread pipeline are representative enough to establish that Apple Silicon's optimization landscape differs from NVIDIA GPUs in general.
What would settle it
Re-running the quantization step on the same CoreML model on the M3 Ultra and observing a clear speed increase, or showing that parallel inference yields measurable gains, would falsify the claim of a fundamentally different landscape.
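One way to make this test concrete is to compare per-frame latencies from the unquantized and quantized runs. The sketch below is a hedged illustration, not the paper's method: `compare_latencies` is a hypothetical helper, Welch's t statistic is a choice made here for illustration, and the latency values are made up.

```python
import math
import statistics


def compare_latencies(baseline_ms, candidate_ms):
    """Return (speedup, Welch t statistic) for two samples of per-frame latency."""
    mean_b = statistics.mean(baseline_ms)
    mean_c = statistics.mean(candidate_ms)
    # Standard error of the difference in means (unequal variances allowed).
    se = math.sqrt(
        statistics.variance(baseline_ms) / len(baseline_ms)
        + statistics.variance(candidate_ms) / len(candidate_ms)
    )
    diff = mean_b - mean_c  # positive when the candidate is faster
    if se > 0:
        t = diff / se
    elif diff == 0:
        t = 0.0
    else:
        t = math.copysign(math.inf, diff)
    return mean_b / mean_c, t


# Made-up latencies in milliseconds, for illustration only.
unquantized_ms = [44.0, 44.2, 43.9, 44.1]
quantized_ms = [44.1, 43.9, 44.2, 44.0]
speedup, t = compare_latencies(unquantized_ms, quantized_ms)
print(f"speedup = {speedup:.3f}x, Welch t = {t:.2f}")
```

A speedup close to 1.0 with a small |t| would be consistent with the paper's report of no benefit from quantization. A speedup clearly above 1.0 that persists across repeated runs would count against it.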
Original abstract
While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS at 512x512 resolution. The primary contribution of this work is the systematic demonstration that optimization insights established for CUDA are not necessarily effective on Apple Silicon's unified memory architecture. We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs -- including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models -- and provide practical guidelines for diffusion model inference on Apple Silicon.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript conducts a systematic, 10-phase empirical study of optimization techniques for real-time diffusion-model inference on the Apple M3 Ultra (60-core GPU, 512 GB unified memory). Techniques examined include CoreML conversion, quantization, Token Merging, Neural Engine offload, compact models, frame interpolation, kNN synthesis, pix2pix-turbo, optical-flow skipping, and knowledge distillation. The central result is a 3-thread camera pipeline using CoreML-converted SDXS-512 that reaches 22.7 FPS for 512×512 img2img. The authors conclude that the optimization landscape on Apple Silicon differs fundamentally from NVIDIA GPUs, citing the lack of benefit from quantization, ineffectiveness of parallel inference, and unsuitability of the Neural Engine.
Significance. If the FPS measurements prove robust and the architectural contrasts hold under controlled comparison, the work supplies concrete, hardware-specific guidelines that are currently scarce for non-CUDA platforms. The systematic enumeration of techniques and the achieved real-time camera pipeline constitute a practical contribution for Apple-Silicon deployment of distilled diffusion models.
major comments (3)
- Abstract and §4 (Results): FPS figures (e.g., 22.7 FPS) are presented as point estimates without error bars, standard deviations, number of averaged frames, or measurement protocol details (warm-up, steady-state timing, camera input variability). This weakens the quantitative support for the real-time claim and the cross-technique comparisons.
- §7 (Discussion) and final paragraph: The claim that the optimization landscape is 'fundamentally different' from NVIDIA GPUs rests on indirect contrast with prior CUDA literature rather than side-by-side re-implementation of the same model (SDXS-512), techniques, and measurement protocol on an NVIDIA device. Without matched baselines, observed differences could arise from model choice, pipeline design, or untested variables instead of unified-memory architecture.
- §3 and §5: The ten phases and the exclusive use of SDXS-512 plus the 3-thread pipeline are presented as representative, yet no ablation or justification is given for why these choices suffice to generalize the 'different landscape' conclusion to other distilled diffusion models or inference scenarios on Apple Silicon.
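The measurement details requested in the first major comment (warm-up, number of averaged frames, standard deviation) could be captured by a small timing harness. This is a sketch under stated assumptions: `measure_fps` and its parameters are hypothetical, the defaults are arbitrary, and `step` stands in for one inference call. It times a single stage, so the reported FPS is the reciprocal of mean latency rather than the end-to-end throughput of a multi-threaded pipeline.

```python
import statistics
import time


def measure_fps(step, n_warmup=20, n_frames=200, clock=time.perf_counter):
    """Time `step()` per frame after a warm-up phase and summarize the result."""
    for _ in range(n_warmup):
        step()  # warm-up calls are executed but not timed

    latencies = []
    for _ in range(n_frames):
        start = clock()
        step()
        latencies.append(clock() - start)

    mean_s = statistics.mean(latencies)
    return {
        "frames": n_frames,
        "warmup": n_warmup,
        "mean_ms": mean_s * 1e3,
        "std_ms": statistics.stdev(latencies) * 1e3,  # needs n_frames >= 2
        "fps": 1.0 / mean_s,
    }


# Stand-in workload: sleep roughly 2 ms in place of a model call.
stats = measure_fps(lambda: time.sleep(0.002), n_warmup=5, n_frames=50)
print(stats)
```

Reporting the frame count, warm-up length, and standard deviation alongside the FPS figure lets a reader judge whether differences between techniques exceed run-to-run noise.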
minor comments (2)
- Notation for thread counts and pipeline stages is introduced inconsistently across figures and text; a single glossary or consistent subscript convention would improve readability.
- Several technique descriptions (e.g., kNN search-based synthesis) lack a brief complexity or memory-footprint analysis that would help readers assess why they were ultimately discarded.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions planned for the manuscript.
- Referee (Abstract and §4, Results): FPS figures (e.g., 22.7 FPS) are presented as point estimates without error bars, standard deviations, number of averaged frames, or measurement protocol details (warm-up, steady-state timing, camera input variability). This weakens the quantitative support for the real-time claim and the cross-technique comparisons.
  Authors: We agree that additional statistical details would strengthen the presentation. In the revised manuscript we will expand §4 to describe the measurement protocol, including the number of frames used for averaging, warm-up duration, steady-state timing window, and how camera input variability was handled. Where multiple independent runs were conducted we will report standard deviations; otherwise we will note the single-run nature of the timing.
  Revision: yes
- Referee (§7, Discussion, and final paragraph): The claim that the optimization landscape is 'fundamentally different' from NVIDIA GPUs rests on indirect contrast with prior CUDA literature rather than side-by-side re-implementation of the same model (SDXS-512), techniques, and measurement protocol on an NVIDIA device. Without matched baselines, observed differences could arise from model choice, pipeline design, or untested variables instead of unified-memory architecture.
  Authors: We accept that a direct side-by-side experiment would provide stronger evidence. The manuscript’s claims rest on concrete empirical observations made on the M3 Ultra (no speedup from quantization, limited benefit from parallel threads, and poor Neural Engine scaling for large models). We will revise the wording in §7 and the conclusion to state that these results indicate important differences from typical CUDA-reported behavior rather than asserting a universally “fundamentally different” landscape, and we will explicitly note the absence of matched NVIDIA baselines as a limitation.
  Revision: partial
- Referee (§3 and §5): The ten phases and the exclusive use of SDXS-512 plus the 3-thread pipeline are presented as representative, yet no ablation or justification is given for why these choices suffice to generalize the 'different landscape' conclusion to other distilled diffusion models or inference scenarios on Apple Silicon.
  Authors: SDXS-512 was selected because it is a publicly available distilled model explicitly designed for real-time inference, making it the most relevant candidate for the target 512×512 camera pipeline. The ten phases systematically enumerate the major optimization families applicable to diffusion models on Apple Silicon. We will add a short justification subsection in §3 explaining these choices and their intended generality to other distilled models; we acknowledge that exhaustive cross-model ablations lie outside the scope of the present study.
  Revision: yes
Circularity Check
No significant circularity; purely empirical measurements with external contrast
The paper presents a sequence of direct experimental measurements of FPS under ten optimization phases on the M3 Ultra with SDXS-512 and a 3-thread pipeline. The conclusion that the optimization landscape differs from NVIDIA GPUs rests on observed outcomes (no quantization benefit, parallel inference ineffectiveness) contrasted with prior CUDA literature rather than any internal equations or fitted parameters. No self-definitional steps, predictions that reduce to inputs by construction, or load-bearing self-citations appear; the derivation chain is self-contained as a report of measured performance differences.
Axiom & Free-Parameter Ledger
free parameters (2)
- thread count in camera pipeline
- choice of distilled model (SDXS-512)
axioms (1)
- domain assumption: CoreML conversion of the diffusion model preserves sufficient accuracy and numerical stability for the img2img task.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs -- including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models"
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline... 22.7 FPS at 512x512"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.