
arxiv: 2605.19385 · v2 · pith:B2STPUNSnew · submitted 2026-05-19 · 💻 cs.DC · cs.DB

LatentBox: Storing AI-Generated Images at Scale via a Latent-First Design

Pith reviewed 2026-05-21 07:21 UTC · model grok-4.3

classification 💻 cs.DC cs.DB
keywords AI-generated images · latent storage · generative AI · storage optimization · hybrid cache · on-demand reconstruction · production trace · image caching

The pith

LatentBox stores AI-generated images as compact latents to cut persistent storage by 78.7% while matching or beating traditional image storage latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LatentBox, a storage system built for the billions of images created by generative AI models. It shows that full-resolution pixel files are redundant because the same images can be rebuilt on demand from the small latent tensors the model itself uses. By studying real access patterns from a long production trace, the system keeps popular images decoded for speed and stores the rest as latents to save space, then continuously tunes how much of each format to hold in cache. This approach delivers large capacity gains without increasing the time users wait for images.

Core claim

LatentBox treats compressed latent tensors as the durable primary storage format for AI-generated images and performs GPU reconstruction only on the read path when a request arrives. It maintains a hybrid cache that holds frequently requested images in decoded pixel form while keeping less-active objects as latents, using the production trace to drive ongoing adjustments to the split between the two caches. When evaluated against the same trace, the design reduces persistent storage by 78.7 percent while producing mean and tail latencies that are competitive with or better than a conventional image-only store.
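The read path described above can be sketched as a two-tier lookup. This is a minimal illustration, not the paper's implementation: the class name `DualFormatCache`, the slot-count capacities, the LRU eviction, the promote-on-latent-hit rule, and the `fetch_latent` and `decode` callables are all assumptions made for the example.

```python
from collections import OrderedDict


class DualFormatCache:
    """Toy hybrid cache: hot objects held decoded, colder objects held as latents."""

    def __init__(self, image_slots, latent_slots, fetch_latent, decode):
        self.image_slots = image_slots    # capacity of the decoded-image tier (in objects)
        self.latent_slots = latent_slots  # capacity of the latent tier (in objects)
        self.fetch_latent = fetch_latent  # hypothetical: object id -> latent from durable storage
        self.decode = decode              # hypothetical: latent -> pixels (the GPU step)
        self.images = OrderedDict()       # LRU order, oldest first
        self.latents = OrderedDict()      # LRU order, oldest first

    def get(self, obj_id):
        """Return (pixels, outcome) where outcome names the path taken."""
        if obj_id in self.images:
            self.images.move_to_end(obj_id)
            return self.images[obj_id], "image_hit"
        if obj_id in self.latents:
            # Assumed policy: a latent hit promotes the object to the image tier.
            latent = self.latents.pop(obj_id)
            pixels = self.decode(latent)
            self._put(self.images, obj_id, pixels, self.image_slots)
            return pixels, "latent_hit"
        # Full miss: read the durable latent, cache it, and reconstruct.
        latent = self.fetch_latent(obj_id)
        self._put(self.latents, obj_id, latent, self.latent_slots)
        return self.decode(latent), "full_miss"

    @staticmethod
    def _put(tier, key, value, slots):
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > slots:
            tier.popitem(last=False)  # evict the least recently used entry


# Stand-ins for the durable store and the decoder.
store = {"a": b"latent-a", "b": b"latent-b"}
cache = DualFormatCache(1, 2, store.__getitem__, lambda z: z.upper())
outcomes = [cache.get(key)[1] for key in ["a", "a", "a", "b"]]
```

In this toy run the first request for `a` is a full miss, the second is served from the latent tier and promoted, and the third is an image hit. A real system would size tiers in bytes rather than slots, since a latent is much smaller than a decoded image.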

What carries the argument

The hybrid latent-image cache that stores hot objects decoded and cold objects as compressed latents, with dynamic allocation tuned from observed access frequencies.
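Why the split matters can be made concrete with the miss-ratio definitions quoted in the Figure 6 excerpt further down this page: the image-cache miss ratio is measured over all requests, and the latent-cache miss ratio is measured only over requests that already missed the image cache. The sketch below composes them into outcome fractions and a mean latency. The weighted-sum latency model and every number in the example are assumptions for illustration, not values or formulas taken from the paper.

```python
def outcome_fractions(mr_img, mr_lat):
    """Split requests into image hits, latent hits, and full misses.

    mr_img: image-cache miss ratio over the full request stream.
    mr_lat: latent-cache miss ratio over the image-miss stream only.
    """
    image_hit = 1.0 - mr_img
    latent_hit = mr_img * (1.0 - mr_lat)
    full_miss = mr_img * mr_lat
    return image_hit, latent_hit, full_miss


def expected_latency_ms(mr_img, mr_lat, t_image_hit, t_latent_hit, t_full_miss):
    """Mean latency as the fraction-weighted sum of per-outcome latencies (assumed model)."""
    image_hit, latent_hit, full_miss = outcome_fractions(mr_img, mr_lat)
    return image_hit * t_image_hit + latent_hit * t_latent_hit + full_miss * t_full_miss


# Illustrative numbers only; none of these values come from the paper.
fractions = outcome_fractions(mr_img=0.4, mr_lat=0.25)
mean_ms = expected_latency_ms(0.4, 0.25, t_image_hit=5.0, t_latent_hit=30.0, t_full_miss=200.0)
```

Shifting capacity toward the latent tier raises the image-cache miss ratio but lowers the latent-cache miss ratio, so a tuner would look for the allocation that minimizes a quantity like `mean_ms`.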

If this is right

  • Platforms can host several times more images on the same storage hardware without expanding capacity.
  • Storage bandwidth drops because latents are much smaller than full pixel blobs.
  • GPU compute is spent only when a request misses the decoded-image cache and the image must be reconstructed from its latent.
  • User-visible latency stays low by keeping popular images ready in decoded form.
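The "several times more images" figure follows from simple arithmetic on the reported 78.7% reduction. The sketch below assumes the reduction applies uniformly to persistent storage and ignores cache memory and replication overheads; the function name is hypothetical.

```python
def capacity_multiplier(storage_reduction):
    """How many times more objects fit in the same persistent capacity."""
    if not 0.0 <= storage_reduction < 1.0:
        raise ValueError("storage_reduction must be in [0, 1)")
    return 1.0 / (1.0 - storage_reduction)


# 78.7% is the reduction reported in the paper; each object keeps 21.3% of its size.
multiplier = capacity_multiplier(0.787)
print(f"{multiplier:.2f}x")
```

Under these assumptions the same hardware holds roughly 4.7 times as many images.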

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar latent-first designs could extend to other generative outputs such as audio clips or short videos that also have compact internal representations.
  • Object stores might eventually add native support for model-specific latent formats so applications do not need custom reconstruction logic.
  • If reconstruction demand grows very large, batching or specialized hardware accelerators could become necessary to keep tail latencies low.

Load-bearing premise

The paper assumes that the 35-month trace of two billion requests from one platform represents typical future access patterns, and that GPU reconstruction latency will stay acceptable to users even under varying load.

What would settle it

Running a live deployment of LatentBox against real user traffic for several months and directly comparing measured storage consumption and end-to-end request latencies against a pure image-based baseline.

Figures

Figures reproduced from arXiv: 2605.19385 by Haoran Ni, Juncheng Yang, Tingfeng Lan, Yue Cheng, Yunjia Zheng, Zhaoyuan Su, Zirui Wang.

Figure 1: Illustration of latent-first storage. (a) Conventional object stores persist AI-generated images as opaque blobs, whereas LatentBox (LB) stores compact model-native latents (intermediate state) and reconstructs images on demand. (b) Cost–latency tradeoff of five storage strategies. LatentBox achieves low cost and latency.
Surrounding text (truncated): Existing large-scale image storage systems, including Facebook's Haystack [10] and …
Figure 2: Adobe Firefly sees explosive gen-image growth [3].
Diagram labels from an adjacent figure: Text prompt, CLIP/T5, Diffusion, Latent z, VAE decoder, Image x; Stage 1: Denoising (seconds); Stage 2: Decode (tens of ms).
Figure 4: CompanyX trace characterization. (a) Image popularity CDF. (b) Mean access rate vs. age, stratified by lifetime-view quartile. (c) Miss ratio vs. cache size for three policies. (d) CDF of intervals between consecutive accesses.
Surrounding text (truncated): 3 Motivation. This section motivates the design of LatentBox by analyzing a production trace (§3.1) and characterizing the cost of on-demand pixel reconstruction (§3.2). 3.1 Producti…
Figure 5: LatentBox architecture and request flow.
Surrounding text (truncated): … to map each identifier to its owner node and tracks per-GPU queue depths for load-aware dispatch. Every GPU node is functionally identical: it hosts a dual-format cache that holds each object either as a decoded image or as a compressed latent (§4.2), an adaptive resizer that balances the two tiers online (§4.3), and one decode pipeline per GPU streaming CPU decompr…
Figure 6: Dual-format cache design.
Surrounding text (truncated): … over the full request stream: a request counts as an image miss if the requested object is not found in the image cache, regardless of whether it is later found in the latent cache. Let MR_lat(α) denote the latent-cache miss ratio at this allocation, measured over the image-miss stream, i.e., only those requests that already missed the image cache. A latent miss is therefore a full…
Figure 7: End-to-end read performance. (a) CDF of read latency for five store-and-read configurations. (b) Stacked cache hit distribution: image hit, latent hit, and full miss fractions. (c) Mean latency breakdown for cache-hit requests (image hit + latent hit); numbers above bars show total mean (ms). (d) Mean latency breakdown for cache-miss requests.
Figure 8: Cumulative cost at four time horizons (2026, 2030, 2040, 2050), normalized so ImgStore at the trace end (March 2026) equals 1. (Top) constant prices. (Bottom) with annual price decay (GPU −20%/yr, storage −10%/yr from 2026 [17, 18, 38, 49]).
Surrounding text (truncated): … 5090 saves 64%, because cheaper GPUs amplify the decode-based architecture's advantage while storage-price reductions benefit both strategies proportionally. 6.5 Ablat…
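The price-decay scenario in the Figure 8 caption amounts to summing a geometrically shrinking yearly cost. The sketch below shows only that arithmetic, using the caption's decay rates (GPU −20%/yr, storage −10%/yr). It assumes constant demand and a unit first-year cost, which is a simplification; the paper's cost model is not reproduced here, and the function name is hypothetical.

```python
def cumulative_cost(first_year_cost, annual_decay, years):
    """Total spend over `years` when the unit price falls by `annual_decay` each year."""
    return sum(first_year_cost * (1.0 - annual_decay) ** t for t in range(years))


# Four years of spend starting from a unit cost of 1.0 (illustrative baseline).
storage_4y = cumulative_cost(1.0, 0.10, 4)  # storage price decays 10% per year
gpu_4y = cumulative_cost(1.0, 0.20, 4)      # GPU price decays 20% per year
```

Because the GPU term shrinks faster than the storage term, a design that shifts spend from storage to decode compute benefits more from the assumed price trends, which is the direction of the caption's argument.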
Figure 11: Sensitivity analysis. LatentBox is robust to step size, window size, and tail fraction; the promotion threshold h has the largest impact.
Surrounding text (truncated): 6.5.3 Spillover Dispatch. To validate the effectiveness of the spillover path, we replay the same 48-hour trace on a 6-node GPU cluster at 1000× speed and an overflow threshold θ=4. The without-spillover baseline sets θ to infinity, so every request is dispatched to its …
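The spillover experiment quoted above suggests a dispatch rule along the following lines. This is a guess at the mechanism, not the paper's code: the comparison `depth < theta`, the choice of the least-loaded node as the spill target, and the function name are assumptions. Only the threshold θ=4, the 6-node cluster, and the "θ = infinity disables spillover" baseline come from the excerpt.

```python
import math


def dispatch(owner, queue_depths, theta=4):
    """Pick the GPU node that should decode a request.

    owner: index of the node that owns the object.
    queue_depths: pending decode count per node.
    theta: overflow threshold; math.inf disables spillover.
    """
    if queue_depths[owner] < theta:
        return owner
    # Assumed spill target: the node with the shortest queue (first one on ties).
    return min(range(len(queue_depths)), key=queue_depths.__getitem__)


depths = [5, 1, 2, 0, 3, 4]                       # six nodes, as in the quoted experiment
spilled = dispatch(0, depths, theta=4)            # owner is overloaded, so the request moves
pinned = dispatch(0, depths, theta=math.inf)      # baseline without spillover
```

With spillover disabled every request waits in its owner's queue, which is the configuration the excerpt uses as the baseline.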
Figure 9: Per-window latency improvement and α trajectory.
Plot labels from an adjacent figure: Read Latency (ms) for E2E Mean, E2E P99, GPU Wait Mean, and GPU Wait P99, with and without spillover; bar values as extracted: 79, 94, 360, 472, 5, 8, 78, 153.
Figure 12: Reconstruction fidelity over 10K SD 3.5 images (1024×1024). (a) Signed per-channel pixel difference aggregated across 2,000 sampled images; 47% of pixel-channel values are unchanged. (b)-(c) White dot: median; thick bar: interquartile range. "q" in "q95" denotes quality factor; higher PSNR (dB) and SSIM closer to 1 indicate better fidelity.
Surrounding text (truncated): … results indicate that practitioners can deploy LatentBox without …
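The two pixel-level quantities named in the Figure 12 caption, PSNR and the share of unchanged pixel-channel values, can be computed as below. This is a minimal pure-Python sketch using the standard PSNR definition for 8-bit data; it operates on flat sequences of channel values, and the sample values are made up for illustration.

```python
import math


def psnr_db(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length value sequences."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(peak ** 2 / mse)


def unchanged_fraction(original, reconstructed):
    """Share of values that are bit-identical after reconstruction."""
    return sum(a == b for a, b in zip(original, reconstructed)) / len(original)


# Made-up 8-bit channel values, not data from the paper.
src = [10, 20, 30, 40]
rec = [10, 21, 29, 40]
quality = psnr_db(src, rec)
same = unchanged_fraction(src, rec)
```

SSIM, the other metric in the caption, needs windowed statistics over 2-D images and is left out of this sketch.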
Original abstract

The explosive growth of AI-generated images has created a sustainability challenge for storage infrastructure. Platforms like Midjourney and Adobe Firefly already host billions of generative images, yet conventional object stores persist them as blobs with full-resolution pixels, consuming huge amounts of storage capacity and bandwidth. Unlike natural photos, however, AI-generated images can be deterministically reconstructed from compact, model-native latent tensors, making persistent image storage fundamentally redundant. This paper presents LatentBox, a latent-first storage system for AI-generated images. LatentBox treats compressed latents as durable storage objects and uses on-demand GPU reconstruction on the read path to trade inexpensive compute for large persistent storage savings. Our design is guided by the first large-scale analysis of AI-generated image access we are aware of, based on a 35-month, 2-billion-request production trace from a major generative-content platform. Motivated by the trace analysis, LatentBox keeps frequently accessed images in decoded pixel format for fast hits, stores less-active objects as compressed latents to expand effective cache capacity, and continuously adjusts the splits between the image and latent cache to optimize user-perceived access latency. We build a LatentBox prototype and evaluate it with the production trace. LatentBox reduces persistent storage by 78.7% with competitive or even lower mean and tail latency over a pure image-based storage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents LatentBox, a latent-first storage system for AI-generated images that persists compact compressed latents as durable objects and performs on-demand GPU reconstruction on the read path. Motivated by analysis of a 35-month, 2-billion-request production trace, the design keeps hot objects in decoded pixel format while storing colder objects as latents, dynamically adjusting the image/latent cache split to optimize user-perceived latency. The prototype evaluation reports a 78.7% reduction in persistent storage with competitive or lower mean and tail latency versus a pure image-based baseline.

Significance. If the latency results hold under realistic load, the work offers a practical approach to the storage sustainability problem for generative-AI platforms by trading inexpensive compute for large capacity savings. The grounding in a real production trace and the concrete prototype measurements are strengths that increase relevance for systems research on AI infrastructure.

major comments (2)
  1. [Evaluation] Evaluation section: the headline claim that mean and tail latency remain competitive (or better) with pure pixel storage depends on on-demand reconstruction plus the dynamic cache split, yet the manuscript provides no measured per-request GPU reconstruction time, explicit queuing model for concurrent cold-object reconstructions, or sensitivity results when the trace exhibits bursts that exceed available GPUs. Without these, it is impossible to verify that tail latency does not exceed the baseline under the reported access patterns.
  2. [Evaluation] Trace-driven evaluation: the 78.7% storage-reduction figure and latency competitiveness rest on the assumption that the 35-month trace accurately represents access patterns and that reconstruction remains acceptable under varying load, but no additional experiments (e.g., synthetic burst workloads or different GPU counts) are reported to test this assumption.
minor comments (2)
  1. [Abstract] Abstract and §4: the comparison baseline (pure image storage) and exact latency metrics (mean, p99, etc.) should be stated more explicitly so readers can reproduce the competitiveness claim.
  2. [Design] Notation for cache-split thresholds and reconstruction cost model could be formalized with a short equation or pseudocode to improve clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the evaluation below and commit to revisions that directly strengthen the presentation of our latency and trace-driven results.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the headline claim that mean and tail latency remain competitive (or better) with pure pixel storage depends on on-demand reconstruction plus the dynamic cache split, yet the manuscript provides no measured per-request GPU reconstruction time, explicit queuing model for concurrent cold-object reconstructions, or sensitivity results when the trace exhibits bursts that exceed available GPUs. Without these, it is impossible to verify that tail latency does not exceed the baseline under the reported access patterns.

    Authors: We agree that the current manuscript would benefit from greater transparency on these points. The reported mean and tail latencies are end-to-end measurements obtained by replaying the production trace on the prototype, so reconstruction costs are already embedded in the results. To make this explicit, the revised version will add (1) a table and CDF of per-request GPU reconstruction times measured on our test hardware, (2) a simple M/M/k-style queuing analysis parameterized by the observed request rates and GPU count from the prototype runs, and (3) sensitivity curves for synthetic burst workloads that temporarily exceed the provisioned GPUs. These additions will allow readers to verify tail-latency behavior directly. revision: yes

  2. Referee: [Evaluation] Trace-driven evaluation: the 78.7% storage-reduction figure and latency competitiveness rest on the assumption that the 35-month trace accurately represents access patterns and that reconstruction remains acceptable under varying load, but no additional experiments (e.g., synthetic burst workloads or different GPU counts) are reported to test this assumption.

    Authors: The 78.7% storage reduction is obtained by comparing latent versus pixel sizes for every object referenced in the trace and is therefore independent of load assumptions. The latency comparison is likewise a direct trace replay. We acknowledge, however, that the manuscript does not vary GPU count or inject synthetic bursts. In the revision we will add two new experiments: (a) replaying the trace while varying the number of available GPUs from 1 to 8, and (b) synthetic burst workloads that double the peak request rate for short intervals. These results will be reported alongside the original trace-driven numbers. revision: yes
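The M/M/k-style analysis promised in the first response can be sketched with the standard Erlang C formulas. This illustrates the textbook model only: it assumes Poisson arrivals and exponentially distributed decode times, which real GPU decodes may not follow, and the rates in the example are made up rather than measured.

```python
import math


def erlang_c(arrival_rate, service_rate, servers):
    """Probability that an arriving request must queue in an M/M/k system."""
    offered = arrival_rate / service_rate  # offered load in Erlangs
    rho = offered / servers                # per-server utilization
    if rho >= 1.0:
        raise ValueError("unstable: offered load meets or exceeds capacity")
    top = offered ** servers / math.factorial(servers) / (1.0 - rho)
    bottom = sum(offered ** n / math.factorial(n) for n in range(servers)) + top
    return top / bottom


def mean_queue_wait(arrival_rate, service_rate, servers):
    """Mean time a request spends waiting for a free server (same time unit as the rates)."""
    p_wait = erlang_c(arrival_rate, service_rate, servers)
    return p_wait / (servers * service_rate - arrival_rate)


# Illustrative: 2 reconstructions/s arriving, each GPU finishes 2/s, 2 GPUs.
p_wait = erlang_c(arrival_rate=2.0, service_rate=2.0, servers=2)
wait_s = mean_queue_wait(2.0, 2.0, 2)
```

Sweeping `servers` from 1 to 8 with the trace's observed reconstruction rate would give a first-order check on the GPU-count experiment described in the second response.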

Circularity Check

0 steps flagged

No significant circularity; claims rest on external trace and prototype measurements

Full rationale

The paper's core results—the 78.7% persistent storage reduction and competitive mean/tail latency—are presented as direct empirical outcomes from evaluating a built prototype against the 35-month, 2-billion-request external production trace. Design choices (dynamic image/latent cache splits) are motivated by trace analysis but do not tautologically define the reported savings or latency figures; those are measured quantities. No self-citations, uniqueness theorems, fitted parameters renamed as predictions, or ansatzes appear in the derivation chain. The evaluation is self-contained against independent external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The design rests on the domain assumption that latents allow deterministic high-quality reconstruction and that the observed trace reflects future access behavior; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption AI-generated images can be deterministically reconstructed from compact, model-native latent tensors
    This premise is stated directly in the abstract as the foundation for treating latents as durable storage objects.

pith-pipeline@v0.9.0 · 5787 in / 1103 out tokens · 40347 ms · 2026-05-21T07:21:36.948751+00:00 · methodology



Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · 4 internal anchors

  1. [1]

    https://onnx.ai, 2019

    ONNX: Open neural network exchange. https://onnx.ai, 2019

  2. [2]

    https://www.midjourney.com, 2026

    Midjourney. https://www.midjourney.com, 2026. Accessed: April 2026

  3. [3]

    Introducing firefly foundry

    Adobe. Introducing firefly foundry. https://business.adobe.com/blog/introducing-firefly-foundry, 2023. Accessed: 2026-04

  4. [4]

    Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

    Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 117–134, Santa Clara, CA, July 2024. USENIX Association

  5. [5]

    Amazon S3 pricing

    Amazon Web Services. Amazon S3 pricing. Online, 2024

  6. [6]

    Amazon S3 Glacier instant retrieval storage class, 2026

    Amazon Web Services. Amazon S3 Glacier instant retrieval storage class, 2026. Accessed: 2026-04-23

  7. [7]

    Amazon S3 Glacier Instant Retrieval storage class

    Amazon Web Services. Amazon S3 Glacier Instant Retrieval storage class. https://aws.amazon.com/s3/storage-classes/glacier/instant-retrieval/, 2026. Accessed: 2026-04

  8. [8]

    Best practices design patterns: Optimizing amazon s3 performance

    Amazon Web Services. Best practices design patterns: Optimizing amazon s3 performance. https://docs.aws.amazon.com/AmazonS3/latest/userguide/optimizing-performance.html, 2026. Accessed: 2026-04-23

  9. [9]

    Pelican: A building block for exascale cold data storage

    Shobana Balakrishnan, Richard Black, Austin Donnelly, Paul England, Adam Glass, Dave Harper, Sergey Legtchenko, Aaron Ogus, Eric Peterson, and Antony Rowstron. Pelican: A building block for exascale cold data storage. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 351–365, 2014

  10. [10]

    Finding a needle in haystack: Facebook’s photo storage

    Doug Beaver, Sanjeev Kumar, Harry C Li, Jason Sobel, and Peter Vajgel. Finding a needle in haystack: Facebook’s photo storage. In 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI 10), 2010

  11. [11]

    L. A. Belady. A study of replacement algorithms for a virtual-storage computer. IBM Systems Journal, 5(2):78–101, 1966

  12. [12]

    Windows azure storage: a highly available cloud storage service with strong consistency

    Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, et al. Windows azure storage: a highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, pages 143–157, 2011

  13. [13]

    PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis

    Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, and Zhenguo Li. PixArt-α: Fast training of diffusion transformer for photorealistic text-to-image synthesis. In International Conference on Learning Representations (ICLR), 2024

  14. [14]

    Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024

    Junyu Chen, Han Cai, Junsong Chen, Enze Xie, Shang Yang, Haotian Tang, Muyang Li, Yao Lu, and Song Han. Deep compression autoencoder for efficient high-resolution diffusion models. arXiv preprint arXiv:2410.10733, 2024

  15. [15]

    Reproducible scaling laws for contrastive language-image learning

    Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2818–2829, 2023

  16. [16]

    Cliffhanger: Scaling performance cliffs in web memory caches

    Asaf Cidon, Assaf Eisenman, Mohammad Alizadeh, and Sachin Katti. Cliffhanger: Scaling performance cliffs in web memory caches. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 379–392, 2016

  17. [17]

    Disk prices — current hard drive cost per gigabyte

    DiskPrices.com. Disk prices — current hard drive cost per gigabyte. https://diskprices.com/, 2025. Accessed 2026-04

  18. [18]

    Data on machine learning hardware

    Epoch AI. Data on machine learning hardware. https://epoch.ai/data/machine-learning-hardware,

  19. [19]

    CC-BY 4.0

    Dataset of >170 AI accelerators (GPUs, TPUs) with performance, price, and efficiency metrics. CC-BY 4.0. Accessed 2026-04

  20. [20]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  21. [21]

    Taming transformers for high-resolution image synthesis

    Patrick Esser, Robin Rombach, and Bjorn Ommer. Taming transformers for high-resolution image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12873–12883, 2021

  22. [22]

    Ai image statistics

    Everypixel Journal. Ai image statistics. https://journal.everypixel.com/ai-image-statistics,

  23. [23]

    Accessed: 2026-04.

  24. [24]

    Zstandard - Fast real-time compression algorithm

    Facebook. Zstandard - Fast real-time compression algorithm. https://github.com/facebook/zstd. Accessed: 2026-04

  25. [25]

    NVIDIA RTX 5090 gpu guide and pricing

    GetDeploying. NVIDIA RTX 5090 gpu guide and pricing. https://getdeploying.com/gpus/nvidia-rtx-5090, 2026. Accessed: 2026-04

  26. [26]

    Digital Image Processing

    Rafael C. Gonzalez and Richard E. Woods. Digital Image Processing. Pearson, 4th edition, 2018

  27. [27]

    Image quality metrics: Psnr vs. ssim

    Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In 2010 20th international conference on pattern recognition, pages 2366–2369. IEEE, 2010

  28. [28]

    NVIDIA H100 price guide 2026: GPU costs, cloud pricing & buy vs rent

    JarvisLabs. NVIDIA H100 price guide 2026: GPU costs, cloud pricing & buy vs rent. https://jarvislabs.ai/blog/h100-price, 2026. Accessed: 2026-04

  29. [29]

    Elucidating the design space of diffusion-based generative models

    Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine. Elucidating the design space of diffusion-based generative models. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  30. [30]

    Auto-Encoding Variational Bayes

    Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

  31. [31]

    Flux.1 kontext: Flow matching for in-context image generation and editing in latent space, 2025

    Black Forest Labs, Stephen Batifol, Andreas Blattmann, Frederic Boesel, Saksham Consul, Cyril Diagne, Tim Dockhorn, Jack English, Zion English, Patrick Esser, Sumith Kulal, Kyle Lacey, Yam Levi, Cheng Li, Dominik Lorenz, Jonas Müller, Dustin Podell, Robin Rombach, Harry Saini, Axel Sauer, and Luke Smith. Flux.1 kontext: Flow matching for in-context im...

  32. [32]

    Software-defined far memory in warehouse-scale computers

    Andres Lagar-Cavilla, Junwhan Ahn, Suleiman Souhlal, Neha Agarwal, Radoslaw Burny, Shakeel Butt, Jichuan Chang, Ashwin Chaugule, Nan Deng, Junaid Shahid, Greg Thorat, Adrian Yurtsever, Daniel Zolnowski, Kim Hazelwood, Martin Maas, Thomas Mccauley, and Rohit Sen. Software-defined far memory in warehouse-scale computers. In Proceedings of the 24th Internatio...

  33. [33]

    Pseudo numerical methods for diffusion models on manifolds

    Luping Liu, Yi Ren, Zhijie Lin, and Zhou Zhao. Pseudo numerical methods for diffusion models on manifolds. In International Conference on Learning Representations (ICLR), 2022

  34. [34]

    Pcodec: Better compression for numerical sequences, 2025

    Martin Loncaric, Niels Jeppesen, and Ben Zinberg. Pcodec: Better compression for numerical sequences, 2025

  35. [35]

    Latent Consistency Models: Synthesizing High-Resolution Images with Few-Step Inference

    Simian Luo, Yiqin Tan, Longbo Huang, Jian Li, and Hang Zhao. Latent consistency models: Synthesizing high-resolution images with few-step inference. arXiv preprint arXiv:2310.04378, 2023

  36. [36]

    LZ4 - Extremely fast compression

    lz4. LZ4 - Extremely fast compression. https://github.com/lz4/lz4. Accessed: 2026-04

  37. [37]

    Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps

    Nanye Ma, Shangyuan Tong, Haolin Jia, Hexiang Hu, Yu-Chuan Su, Mingda Zhang, Xuan Yang, Yandong Li, Tommi Jaakkola, Xuhui Jia, and Saining Xie. Inference-time scaling for diffusion models beyond scaling denoising steps. arXiv preprint arXiv:2501.09732, 2025

  38. [38]

    Deepcache: Accelerating diffusion models for free

    Xinyin Ma, Gongfan Fang, and Xinchao Wang. Deepcache: Accelerating diffusion models for free. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15762–15772, 2024

  39. [39]

    R. L. Mattson, J. Gecsei, D. R. Slutz, and I. L. Traiger. Evaluation techniques for storage hierarchies. IBM Syst. J., 9(2):78–117, June 1970

  40. [40]

    Disk drive prices (1955–2023)

    John C. McCallum. Disk drive prices (1955–2023). https://jcmit.net/diskprice.htm, 2023. Monthly survey of consumer HDD prices from NewEgg.com, 1955–2023. Archived at https://web.archive.org/web/2024/https://jcmit.net/diskprice.htm. Accessed 2026-04

  41. [41]

    Ray: A distributed framework for emerging AI applications

    Philipp Moritz, Robert Nishihara, Stephanie Wang, Alexey Tumanov, Richard Liaw, Eric Liang, Melih Elibol, Zongheng Yang, William Paul, Michael I. Jordan, and Ion Stoica. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 561–577, 2018

  42. [42]

    f4: Facebook’s warm BLOB storage system

    Subramanian Muralidhar, Wyatt Lloyd, Sabyasachi Roy, Cory Hill, Ernest Lin, Weiwen Liu, Satadru Pan, Shiva Shankar, Viswanath Sivakumar, Linpeng Tang, et al. f4: Facebook’s warm BLOB storage system. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 383–398, 2014

  43. [43]

    Ambry: Linkedin’s scalable geo-distributed object store

    Shadi A Noghabi, Sriram Subramanian, Priyesh Narayanan, Sivabalan Narayanan, Gopalakrishna Holla, Mammad Zadeh, Tianwei Li, Indranil Gupta, and Roy H Campbell. Ambry: Linkedin’s scalable geo-distributed object store. In Proceedings of the 2016 International Conference on Management of Data, pages 253–265, 2016

  44. [44]

    CUDA graphs

    NVIDIA. CUDA graphs. https://developer.nvidia.com/blog/cuda-graphs/, 2019

  45. [45]

    TensorRT: High-performance deep learning inference sdk

    NVIDIA. TensorRT: High-performance deep learning inference sdk. https://developer.nvidia.com/tensorrt, 2025. Accessed: 2026-04.

  46. [46]

    Price trends: GeForce RTX 4090

    PCPartPicker. Price trends: GeForce RTX 4090. https://pcpartpicker.com/trends/price/video-card/#gpu.chipset.geforce-rtx-4090, 2026. Accessed: 2026-04

  47. [47]

    Price trends: GeForce RTX 5090

    PCPartPicker. Price trends: GeForce RTX 5090. https://pcpartpicker.com/trends/price/video-card/#gpu.chipset.geforce-rtx-5090, 2026. Accessed: 2026-04

  48. [48]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  49. [49]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  50. [50]

    Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67, 2020

  51. [51]

    Performance per dollar improves around 30% each year

    Robi Rahman. Performance per dollar improves around 30% each year. https://epoch.ai/data-insights/price-performance-hardware, 2024. Epoch AI data insight. Accessed 2026-04

  52. [52]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022

  53. [53]

    Stochastic backpropagation and approximate inference in deep generative models

    Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International conference on machine learning, pages 1278–

  54. [54]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022

  55. [55]

    Denoising diffusion implicit models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations (ICLR), 2021

  56. [56]

    Adaptive main memory compression

    Irina C. Tuduce and Thomas Gross. Adaptive main memory compression. In USENIX Annual Technical Conference (ATC), pages 237–250, 2005

  57. [57]

    Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

    Aaron Van Den Oord, Oriol Vinyals, et al. Neural discrete representation learning. Advances in neural information processing systems, 30, 2017

  58. [58]

    Efficient MRC construction with SHARDS

    Carl A. Waldspurger, Nohhyun Park, Alexander Garthwaite, and Irfan Ahmad. Efficient MRC construction with SHARDS. In 13th USENIX Conference on File and Storage Technologies (FAST 15), pages 95–110, Santa Clara, CA, February 2015. USENIX Association

  59. [59]

    TMO: Transparent memory offloading in datacenters

    Johannes Weiner, Niket Agarwal, Dan Schatzberg, Leon Yang, Hao Wang, Blaise Sanouillet, Bikash Sharma, Tejun Heo, Mayank Jain, Chunqiang Tang, and Dimitrios Skarlatos. TMO: Transparent memory offloading in datacenters. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASP- ...

  60. [60]

    The case for compressed caching in virtual memory systems

    Paul R. Wilson, Scott F. Kaplan, and Yannis Smaragdakis. The case for compressed caching in virtual memory systems. In USENIX Annual Technical Conference (ATC), pages 101–116, 1999

  61. [61]

    Jake Wires, Stephen Ingram, Zachary Drudi, Nicholas J. A. Harvey, and Andrew Warfield. Characterizing storage workloads with counter stacks. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 335–349, Broomfield, CO, October 2014. USENIX Association

  62. [62]

    zexpander: a key-value cache with both high performance and fewer misses

    Xingbo Wu, Li Zhang, Yandong Wang, Yufei Ren, Michel Hack, and Song Jiang. zexpander: a key-value cache with both high performance and fewer misses. In Proceedings of the Eleventh European Conference on Computer Systems, EuroSys ’16, New York, NY, USA,

  63. [63]

    Association for Computing Machinery

  64. [64]

    Fifo queues are all you need for cache eviction

    Juncheng Yang, Yazhuo Zhang, Ziyue Qiu, Yao Yue, and Rashmi Vinayak. Fifo queues are all you need for cache eviction. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, page 130–149, New York, NY, USA, 2023. Association for Computing Machinery

  65. [65]

    One-step diffusion with distribution matching distillation

    Tianwei Yin, Michaël Gharbi, Richard Zhang, Eli Shechtman, Fredo Durand, William T Freeman, and Taesung Park. One-step diffusion with distribution matching distillation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6613–6623, 2024

  66. [66]

    Bidm: Pushing the limit of quantization for diffusion models. Advances in Neural Information Processing Systems, 37:39009–39035, 2024

    Xingyu Zheng, Xianglong Liu, Yichen Bian, Xudong Ma, Yulun Zhang, Jiakai Wang, Jinyang Guo, and Haotong Qin. Bidm: Pushing the limit of quantization for diffusion models. Advances in Neural Information Processing Systems, 37:39009–39035, 2024.

  67. [67]

    Demystifying cache policies for photo stores at scale: A tencent case study

    Ke Zhou, Si Sun, Hua Wang, Ping Huang, Xubin He, Rui Lan, Wenyan Li, Wenjie Liu, and Tianming Yang. Demystifying cache policies for photo stores at scale: A tencent case study. In Proceedings of the 2018 International Conference on Supercomputing, pages 284–294, 2018.