Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

Yoichi Ochiai

arXiv: 2605.16259 · v1 · pith:HFZNVY5Bnew · submitted 2026-02-10 · cs.LG · cs.AI · cs.DC

Pith reviewed 2026-05-21 12:54 UTC · model grok-4.3

keywords: diffusion models, real-time inference, Apple Silicon, CoreML, image-to-image generation, M3 Ultra, unified memory, model optimization

The pith

Apple Silicon optimization for diffusion models follows a different path than NVIDIA GPUs, enabling 22.7 FPS real-time camera img2img on the M3 Ultra.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically tests ten phases of techniques on the Apple M3 Ultra to make diffusion-based image-to-image transformation run in real time from a camera feed. It finds that methods effective on NVIDIA hardware, such as quantization and parallel inference, produce no gains or even slowdowns on Apple's unified memory system. A working combination of CoreML conversion for the distilled SDXS-512 model plus a three-thread pipeline reaches 22.7 frames per second at 512x512 resolution. A sympathetic reader would care because this shows how to bring fast on-device generative AI to widely available consumer hardware rather than requiring specialized data-center GPUs.

Core claim

The optimization landscape for diffusion models on Apple Silicon is fundamentally different from that on NVIDIA GPUs because of the unified memory architecture. Quantization brings no speedup, parallel inference is ineffective, and the Neural Engine is unsuitable for large models. Practical guidelines emerge from the experiments, and the concrete demonstration is real-time camera img2img at 22.7 FPS obtained by CoreML conversion of the distillation-specialized SDXS-512 model running inside a three-thread camera pipeline.

What carries the argument

The 3-thread camera pipeline paired with CoreML-converted SDXS-512, which exploits unified memory to sustain real-time throughput where CUDA-derived techniques do not.
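
The summary does not say what the three threads do. A minimal sketch, assuming they are capture, inference, and display stages joined by size-1 queues that drop stale frames, could look like the following. The names `run_pipeline`, `put_latest`, `read_frame`, `infer`, and `show` are hypothetical placeholders, not the paper's code, and `infer` stands in for a call into the CoreML-converted model.

```python
import queue
import threading

STOP = object()  # sentinel that tells the downstream stage to shut down


def put_latest(q, item):
    """Put item into a size-1 queue, discarding a stale entry if one is waiting."""
    while True:
        try:
            q.put_nowait(item)
            return
        except queue.Full:
            try:
                q.get_nowait()  # drop the stale item so the newest one wins
            except queue.Empty:
                pass  # the consumer took it first; retry the put


def run_pipeline(read_frame, infer, show, num_frames):
    """Run capture, inference, and display on three threads (assumed split)."""
    frames = queue.Queue(maxsize=1)
    results = queue.Queue(maxsize=1)

    def capture():
        for _ in range(num_frames):
            put_latest(frames, read_frame())
        put_latest(frames, STOP)

    def inference():
        while True:
            frame = frames.get()
            if frame is STOP:
                put_latest(results, STOP)
                return
            put_latest(results, infer(frame))

    def display():
        while True:
            out = results.get()
            if out is STOP:
                return
            show(out)

    threads = [threading.Thread(target=fn) for fn in (capture, inference, display)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

The size-1 queues are a design choice of this sketch: when inference is slower than the camera, older frames are dropped rather than queued, which keeps latency bounded at the cost of skipped frames. Whether the paper's pipeline behaves this way is not stated in the summary.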

If this is right

Quantization should be skipped when targeting Apple Silicon for diffusion inference.
Parallel model execution is unlikely to help and may hurt performance on unified-memory chips.
The Neural Engine is not a reliable accelerator for large diffusion models.
Real-time camera-driven image-to-image generation is feasible on a high-end Apple Silicon machine (the M3 Ultra) without a discrete GPU.
Developers should prioritize CoreML conversion and multi-threaded pipelines when porting diffusion workflows to Apple hardware.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Mobile and laptop apps could incorporate on-device generative editing at interactive speeds if similar distilled models and pipelines are adopted more widely.
Platform-specific optimization toolkits may become necessary instead of assuming CUDA results transfer across vendors.
The unified-memory advantage might extend to other generative tasks beyond diffusion, such as real-time video synthesis on the same hardware.

Load-bearing premise

The ten tested optimization phases and the particular choice of SDXS-512 plus the three-thread pipeline are representative enough to establish that Apple Silicon's optimization landscape differs from NVIDIA GPUs in general.

What would settle it

Re-running the quantization step on the same CoreML model on the M3 Ultra and observing a clear speed increase, or showing that parallel inference yields measurable gains, would falsify the claim of a fundamentally different landscape.
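
One way to make "a clear speed increase" concrete is to compare per-frame latencies from the baseline and the quantized model. The sketch below is an illustration only: the function name `compare_latencies` and the use of Welch's t statistic are assumptions of this example, not part of the paper's method.

```python
import math
import statistics


def compare_latencies(baseline_ms, candidate_ms):
    """Return the speedup ratio and Welch's t statistic for two latency samples.

    speedup > 1 means the candidate is faster; a large positive welch_t
    suggests the difference is unlikely to be measurement noise.
    Each sample needs at least two measurements.
    """
    mean_b = statistics.mean(baseline_ms)
    mean_c = statistics.mean(candidate_ms)
    var_b = statistics.variance(baseline_ms)
    var_c = statistics.variance(candidate_ms)
    std_err = math.sqrt(var_b / len(baseline_ms) + var_c / len(candidate_ms))
    diff = mean_b - mean_c
    if std_err > 0:
        welch_t = diff / std_err
    elif diff == 0:
        welch_t = 0.0
    else:
        welch_t = math.copysign(math.inf, diff)
    return {"speedup": mean_b / mean_c, "welch_t": welch_t}
```

A speedup near 1.0 with a small t statistic would be consistent with the paper's report that quantization brings no gain; a speedup well above 1.0 with a large t statistic would count against it.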

Original abstract

While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS at 512x512 resolution. The primary contribution of this work is the systematic demonstration that optimization insights established for CUDA are not necessarily effective on Apple Silicon's unified memory architecture. We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs -- including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models -- and provide practical guidelines for diffusion model inference on Apple Silicon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

They hit 22.7 FPS real-time img2img on M3 Ultra with CoreML plus a distilled model and three threads, but the claim of a fundamentally different optimization landscape from CUDA rests on indirect contrasts.

The paper's main concrete result is a working recipe that reaches 22.7 FPS at 512x512 for camera-based image-to-image diffusion on the M3 Ultra. They combined CoreML conversion of the SDXS-512 distilled model with a three-thread pipeline and reported FPS numbers across ten phases of tested techniques. That part is straightforward and potentially useful for anyone targeting Apple hardware for on-device generation.

Referee Report

3 major / 2 minor

Summary. The manuscript conducts a systematic, 10-phase empirical study of optimization techniques for real-time diffusion-model inference on the Apple M3 Ultra (60-core GPU, 512 GB unified memory). Techniques examined include CoreML conversion, quantization, Token Merging, Neural Engine offload, compact models, frame interpolation, kNN synthesis, pix2pix-turbo, optical-flow skipping, and knowledge distillation. The central result is a 3-thread camera pipeline using CoreML-converted SDXS-512 that reaches 22.7 FPS for 512×512 img2img. The authors conclude that the optimization landscape on Apple Silicon differs fundamentally from NVIDIA GPUs, citing the lack of benefit from quantization, ineffectiveness of parallel inference, and unsuitability of the Neural Engine.

Significance. If the FPS measurements prove robust and the architectural contrasts hold under controlled comparison, the work supplies concrete, hardware-specific guidelines that are currently scarce for non-CUDA platforms. The systematic enumeration of techniques and the achieved real-time camera pipeline constitute a practical contribution for Apple-Silicon deployment of distilled diffusion models.

major comments (3)

[Abstract and §4 (Results)] FPS figures (e.g., 22.7 FPS) are presented as point estimates without error bars, standard deviations, number of averaged frames, or measurement protocol details (warm-up, steady-state timing, camera input variability). This weakens the quantitative support for the real-time claim and the cross-technique comparisons.
[§7 (Discussion) and final paragraph] The claim that the optimization landscape is 'fundamentally different' from NVIDIA GPUs rests on indirect contrast with prior CUDA literature rather than side-by-side re-implementation of the same model (SDXS-512), techniques, and measurement protocol on an NVIDIA device. Without matched baselines, observed differences could arise from model choice, pipeline design, or untested variables instead of unified-memory architecture.
[§3 and §5] The ten phases and the exclusive use of SDXS-512 plus the 3-thread pipeline are presented as representative, yet no ablation or justification is given for why these choices suffice to generalize the 'different landscape' conclusion to other distilled diffusion models or inference scenarios on Apple Silicon.

minor comments (2)

Notation for thread counts and pipeline stages is introduced inconsistently across figures and text; a single glossary or consistent subscript convention would improve readability.
Several technique descriptions (e.g., kNN search-based synthesis) lack a brief complexity or memory-footprint analysis that would help readers assess why they were ultimately discarded.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions planned for the manuscript.

Referee: [Abstract and §4 (Results)] FPS figures (e.g., 22.7 FPS) are presented as point estimates without error bars, standard deviations, number of averaged frames, or measurement protocol details (warm-up, steady-state timing, camera input variability). This weakens the quantitative support for the real-time claim and the cross-technique comparisons.

Authors: We agree that additional statistical details would strengthen the presentation. In the revised manuscript we will expand §4 to describe the measurement protocol, including the number of frames used for averaging, warm-up duration, steady-state timing window, and how camera input variability was handled. Where multiple independent runs were conducted we will report standard deviations; otherwise we will note the single-run nature of the timing.
Revision: yes

Referee: [§7 (Discussion) and final paragraph] The claim that the optimization landscape is 'fundamentally different' from NVIDIA GPUs rests on indirect contrast with prior CUDA literature rather than side-by-side re-implementation of the same model (SDXS-512), techniques, and measurement protocol on an NVIDIA device. Without matched baselines, observed differences could arise from model choice, pipeline design, or untested variables instead of unified-memory architecture.

Authors: We accept that a direct side-by-side experiment would provide stronger evidence. The manuscript’s claims rest on concrete empirical observations made on the M3 Ultra (no speedup from quantization, limited benefit from parallel threads, and poor Neural Engine scaling for large models). We will revise the wording in §7 and the conclusion to state that these results indicate important differences from typical CUDA-reported behavior rather than asserting a universally “fundamentally different” landscape, and we will explicitly note the absence of matched NVIDIA baselines as a limitation.
Revision: partial

Referee: [§3 and §5] The ten phases and the exclusive use of SDXS-512 plus the 3-thread pipeline are presented as representative, yet no ablation or justification is given for why these choices suffice to generalize the 'different landscape' conclusion to other distilled diffusion models or inference scenarios on Apple Silicon.

Authors: SDXS-512 was selected because it is a publicly available distilled model explicitly designed for real-time inference, making it the most relevant candidate for the target 512×512 camera pipeline. The ten phases systematically enumerate the major optimization families applicable to diffusion models on Apple Silicon. We will add a short justification subsection in §3 explaining these choices and their intended generality to other distilled models; we acknowledge that exhaustive cross-model ablations lie outside the scope of the present study.
Revision: yes
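
The measurement details requested in the first exchange (warm-up, number of averaged frames, standard deviation) could be reported with a harness along these lines. This is a minimal sketch under stated assumptions: `measure_fps` and its parameters are hypothetical, `step` stands in for one full capture-to-output frame, and the default warm-up and frame counts are illustrative values, not the paper's protocol.

```python
import statistics
import time


def measure_fps(step, warmup=20, frames=200, clock=time.perf_counter):
    """Time `step` over `frames` calls after `warmup` untimed calls.

    Returns the frame count, mean and standard deviation of the per-frame
    latency in milliseconds, and the FPS implied by the mean latency.
    """
    for _ in range(warmup):
        step()  # untimed: lets caches and lazily compiled kernels settle

    durations = []
    for _ in range(frames):
        start = clock()
        step()
        durations.append(clock() - start)

    mean_s = statistics.mean(durations)
    std_s = statistics.stdev(durations) if frames > 1 else 0.0
    return {
        "frames": frames,
        "mean_ms": mean_s * 1e3,
        "std_ms": std_s * 1e3,
        "fps": 1.0 / mean_s,
    }
```

Reporting the mean latency together with its standard deviation and the frame count lets a reader judge whether a figure such as 22.7 FPS is a stable steady-state value or a single favorable run.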

Circularity Check

0 steps flagged

No significant circularity; purely empirical measurements with external contrast

The paper presents a sequence of direct experimental measurements of FPS under ten optimization phases on the M3 Ultra with SDXS-512 and a 3-thread pipeline. The conclusion that the optimization landscape differs from NVIDIA GPUs rests on observed outcomes (no quantization benefit, parallel inference ineffectiveness) contrasted with prior CUDA literature rather than any internal equations or fitted parameters. No self-definitional steps, predictions that reduce to inputs by construction, or load-bearing self-citations appear; the derivation chain is self-contained as a report of measured performance differences.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central performance claim rests on standard assumptions about model conversion fidelity and the representativeness of the chosen test configurations rather than new mathematical axioms or invented entities.

free parameters (2)

thread count in camera pipeline
The value of 3 threads was selected to reach real-time performance; it is a tunable parameter fitted to the target hardware.
choice of distilled model (SDXS-512)
Model selection among compact variants is a configuration choice that directly affects the reported FPS.

axiom (1)

domain assumption: CoreML conversion of the diffusion model preserves sufficient accuracy and numerical stability for the img2img task.
Invoked when reporting the final 22.7 FPS result after CoreML conversion.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs -- including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline... 22.7 FPS at 512x512"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · 1 internal anchor

[1] Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems (NeurIPS), 33, 6840–6851.

[2] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proc. IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 10684–10695.

[3] Sauer, A., Lorenz, D., Blattmann, A., & Rombach, R. (2023). Adversarial diffusion distillation. arXiv preprint arXiv:2311.17042.

[4] Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2024). SDXS: Real-time one-step latent diffusion models with image conditions. arXiv preprint arXiv:2403.16627.

[5] Luo, S., Tan, Y., Huang, L., Li, J., & Zhao, H. (2023). Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378.

[6] Luo, S., Tan, Y., Patil, S., Gu, D., von Platen, P., Passos, A., Huang, L., Li, J., & Zhao, H. (2023). LCM-LoRA: A universal stable-diffusion acceleration module. arXiv preprint arXiv:2311.05556.

[7] Ren, J., Xia, Y., Lu, K., Deng, J., & Luo, Z. (2024). Hyper-SD: Trajectory segmented consistency model for efficient image synthesis. arXiv preprint arXiv:2404.13686.

[8] Kodaira, A., Xu, C., Hazama, T., Yoshimoto, T., Ohno, K., Mitsuhori, S., Sugano, S., Cho, H., Liu, Z., & Keutzer, K. (2023). StreamDiffusion: A pipeline-level solution for real-time interactive generation. arXiv preprint arXiv:2312.12491.

[9] Apple Inc. (2022). ml-stable-diffusion: Stable diffusion with Core ML on Apple Silicon. GitHub repository. https://github.com/apple/ml-stable-diffusion

[10] Parmar, G., Park, T., Narasimhan, S., & Zhu, J.-Y. (2024). One-step image translation with text-to-image models. Proc. European Conf. on Computer Vision (ECCV).

[11] Blattmann, A., Rombach, R., Oktay, O., Müller, J., & Ommer, B. (2022). Retrieval-augmented diffusion models. Advances in Neural Information Processing Systems (NeurIPS), 35.

[12] Johnson, J., Douze, M., & Jégou, H. (2019). Billion-scale similarity search with GPUs. IEEE Trans. on Big Data, 7(3), 535–547.

[13] Bolya, D., & Hoffman, J. (2023). Token merging for fast stable diffusion. CVPR Workshop on Efficient Deep Learning for Computer Vision.
