arxiv: 1906.09122 · v2 · pith:QEDO4K34new · submitted 2019-06-21 · 💻 cs.DC

Semantics-aware Virtual Machine Image Management in IaaS Clouds

Pith reviewed 2026-05-25 18:37 UTC · model grok-4.3

classification 💻 cs.DC
keywords virtual machine images · IaaS clouds · semantic graphs · storage optimization · VMI management · decomposition · image assembly

The pith

Modeling VMIs as semantic graphs shrinks cloud storage 2.2-16 times

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Infrastructure clouds must store and deliver many large virtual machine images to users, but duplicate content across images wastes space and slows operations. Current techniques either cannot spot reusable components effectively or incur high costs when saving and loading images. The paper presents a system that builds semantic graphs from each image to find overlaps at the base system and package level, stores only the unique pieces, and reconstructs full images when requested. This method is shown to cut the size of image repositories substantially while speeding up the process of making images available. A sympathetic reader would see this as a practical way to handle growing demand for customized cloud environments without proportional increases in storage infrastructure.

Core claim

The Expelliarmus system models virtual machine images as semantic graphs to speed up similarity calculations, performs semantic-aware decomposition to select and store only non-redundant base images and software packages, and assembles images from these parts on user request, resulting in repository size optimizations of 2.2-16 times over existing systems along with improved publish and retrieval performance.

What carries the argument

Semantic graphs of VMIs that enable similarity computation, decomposition into unique components, and on-demand assembly.
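As a rough illustration of how a graph representation can make similarity cheap to compute, the sketch below reduces each VMI to a set of typed nodes (one base image plus its packages) and compares two images with the Jaccard index. This is a minimal sketch under stated assumptions: the text above does not say which similarity metric Expelliarmus uses, dependency edges are omitted for brevity, and `VMIGraph`, `make_vmi`, and all package names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# A node is a (kind, name, version) triple. Kinds and names are illustrative only.
Node = Tuple[str, str, str]


@dataclass(frozen=True)
class VMIGraph:
    """Toy semantic graph: one base-image node plus package nodes (edges omitted)."""
    nodes: FrozenSet[Node]


def make_vmi(base: Tuple[str, str], packages: Dict[str, str]) -> VMIGraph:
    """Build a graph from a (name, version) base image and a name -> version package map."""
    nodes = {("base", base[0], base[1])}
    nodes.update(("pkg", name, version) for name, version in packages.items())
    return VMIGraph(frozenset(nodes))


def jaccard_similarity(a: VMIGraph, b: VMIGraph) -> float:
    """Jaccard index over node sets (an assumed metric, not confirmed by the source)."""
    union = a.nodes | b.nodes
    if not union:
        return 1.0
    return len(a.nodes & b.nodes) / len(union)


web = make_vmi(("ubuntu", "16.04"), {"nginx": "1.10", "openssl": "1.0.2"})
ide = make_vmi(("ubuntu", "16.04"), {"openjdk": "8", "openssl": "1.0.2"})

# The two images share the base image and openssl: 2 common nodes out of 4 distinct.
print(jaccard_similarity(web, ide))  # 0.5
```

Comparing hashed node sets avoids any byte-level scan of the images, which is one plausible reading of why a semantic model would speed up the similarity step.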

Load-bearing premise

The semantic graph accurately captures the structure of base images and software packages so that decomposition and re-assembly maintain complete functionality and acceptable performance.

What would settle it

Deploy a set of original and reassembled VMIs in a test cloud and check whether applications run without errors or differences in behavior, while measuring any increase in boot time or resource use.
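A cheap first pass at that test could compare the package manifests of an original image and its reassembled counterpart before any application is run. The sketch below assumes each manifest is available as a plain name-to-version mapping; `diff_manifests` and the sample data are hypothetical. Manifest equality is only a necessary condition: it says nothing about configuration state, boot time, or runtime behavior.

```python
from typing import Dict, List


def diff_manifests(original: Dict[str, str], reassembled: Dict[str, str]) -> Dict[str, List[str]]:
    """Report packages that are missing, extra, or version-shifted after reassembly."""
    orig_names, new_names = set(original), set(reassembled)
    return {
        "missing": sorted(orig_names - new_names),
        "extra": sorted(new_names - orig_names),
        "version_changed": sorted(
            name for name in orig_names & new_names
            if original[name] != reassembled[name]
        ),
    }


# Hypothetical manifests, for illustration only.
original = {"nginx": "1.10", "openssl": "1.0.2", "curl": "7.47"}
reassembled = {"nginx": "1.10", "openssl": "1.0.2g", "wget": "1.17"}

report = diff_manifests(original, reassembled)
for kind, names in report.items():
    print(kind, names)

# An image passes this first-pass check only if every list is empty.
equivalent = not any(report.values())
```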

Figures

Figures reproduced from arXiv: 1906.09122 by Dragi Kimovski, Jorge G. Barbosa, Julian Remmers, Nishant Saurabh, Radu Prodan.

Figure 2: Expelliarmus architecture and VMI management.
Figure 3: Repository size growth with different numbers of successively stored VMIs.
Figure 4: VMI publishing time analysis. Excerpt from the accompanying text: "… scenario represents the sequential upload of four VMIs used in previous studies [1], [15] (Mini, Base, Desktop, and IDE), while the second scenario evaluates all VMIs listed in Table II (including the ones from the first scenario). Figure 4a shows the VMI publishing time for the first scenario. The Expelliarmus optimizes not only the storage cost as previously discussed, but a…"
Figure 5: Retrieval time analysis of VMIs. Excerpt from the accompanying text: "… the size of a software packaged in the .deb or .rpm format. The different software packages with varying installation sizes largely affect the time to create a binary software package (e.g. .deb) resulting in a higher export time of the same to the repository. The total installation size of the exported software packages for the Desktop VMI is the largest, and hence require…"
Abstract

Infrastructure-as-a-service (IaaS) Clouds concurrently accommodate diverse sets of user requests, requiring an efficient strategy for storing and retrieving virtual machine images (VMIs) at a large scale. VMI storage management requires dealing with multiple VMIs, typically gigabytes in size, which entails VMI sprawl issues that hinder elastic resource management and provisioning. Nevertheless, existing techniques to facilitate VMI management overlook VMI semantics (i.e., at the level of base image and software packages), with either a restricted ability to identify and extract reusable functionalities or higher VMI publish and retrieval overheads. In this paper, we design, implement and evaluate Expelliarmus, a novel VMI management system that helps to minimize storage, publish and retrieval overheads. To achieve this goal, Expelliarmus incorporates three complementary features. First, it makes use of VMIs modelled as semantic graphs to expedite the similarity computation between multiple VMIs. Second, Expelliarmus provides semantic-aware VMI decomposition and base image selection to extract and store non-redundant base images and software packages. Third, Expelliarmus can also assemble VMIs based on the required software packages upon user request. We evaluate Expelliarmus through a representative set of synthetic Cloud VMIs on a real test-bed. Experimental results show that our semantic-centric approach is able to optimize repository size by 2.2-16 times compared to state-of-the-art systems (e.g. IBM's Mirage and Hemera) with significant VMI publish and slight retrieval performance improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents Expelliarmus, a VMI management system for IaaS clouds that models VMIs as semantic graphs to compute similarities, perform semantic-aware decomposition into non-redundant base images and software packages, and assemble VMIs on demand. It claims 2.2-16x repository size reduction versus Mirage and Hemera, plus publish/retrieval improvements, based on evaluation with synthetic Cloud VMIs on a real test-bed.

Significance. If the decomposition is shown to be lossless and low-overhead, the semantic-graph approach could meaningfully address VMI sprawl by enabling finer-grained deduplication at the package level. The design integrates similarity computation, decomposition, and assembly in one system, which is a coherent contribution to cloud storage management.
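To make the package-level deduplication idea concrete, the sketch below stores each distinct (name, version) component once and keeps a per-image recipe listing what the image needs. It is an illustration under assumptions, not the paper's implementation: `PackageStore`, the image names, and every size are hypothetical, and the resulting ratio is not one of the paper's measurements.

```python
from typing import Dict, List, Tuple

PackageKey = Tuple[str, str]  # (name, version)


class PackageStore:
    """Toy repository: each distinct component is stored once, images become recipes."""

    def __init__(self) -> None:
        self.blobs: Dict[PackageKey, int] = {}            # component -> size in MB
        self.recipes: Dict[str, List[PackageKey]] = {}    # image id -> components

    def publish(self, image_id: str, packages: Dict[PackageKey, int]) -> None:
        for key, size_mb in packages.items():
            self.blobs.setdefault(key, size_mb)  # skip components already stored
        self.recipes[image_id] = sorted(packages)

    def stored_mb(self) -> int:
        return sum(self.blobs.values())


# Hypothetical images and sizes, for illustration only.
images = {
    "web": {("base-os", "16.04"): 400, ("nginx", "1.10"): 20},
    "ide": {("base-os", "16.04"): 400, ("openjdk", "8"): 180},
}

store = PackageStore()
naive_mb = 0
for image_id, pkgs in images.items():
    store.publish(image_id, pkgs)
    naive_mb += sum(pkgs.values())  # cost of storing every image in full

# 1000 MB stored naively versus 600 MB with shared components kept once.
reduction = naive_mb / store.stored_mb()
print(f"{naive_mb} MB -> {store.stored_mb()} MB ({reduction:.2f}x)")
```

The gain in this toy model depends entirely on how much the images overlap, which is why the representativeness of the evaluated image set matters for the reported range.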

major comments (2)
  1. [§4] §4 (Evaluation): The reported 2.2-16x size reductions rest on synthetic VMIs that deliberately avoid complex dependency cases (shared libraries, kernel modules, post-install scripts). Without end-to-end functional equivalence tests on realistic workloads, the storage gains cannot be treated as generally valid.
  2. [§3.2] §3.2 (Semantic Graph Model): No formal semantics are supplied for dependency resolution, version pinning, or configuration-state preservation in the graph decomposition/re-assembly process. This is load-bearing for the central claim that the approach is lossless.
minor comments (1)
  1. [Abstract] Abstract: quantitative performance claims are stated without error bars, dataset sizes, or comparison methodology; these details belong in the abstract or a summary table for quick assessment.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. Below we provide point-by-point responses to the major comments.

  1. Referee: [§4] §4 (Evaluation): The reported 2.2-16x size reductions rest on synthetic VMIs that deliberately avoid complex dependency cases (shared libraries, kernel modules, post-install scripts). Without end-to-end functional equivalence tests on realistic workloads, the storage gains cannot be treated as generally valid.

    Authors: We acknowledge that our evaluation is based on synthetic VMIs, which were designed to isolate the effects of semantic decomposition while covering a range of package dependencies from real cloud images. The paper does not include end-to-end functional equivalence tests on realistic workloads. This is a valid point, and in the revised version we will include additional discussion on the scope of the evaluation and the representativeness of the synthetic VMIs. We maintain that the reported gains demonstrate the potential of the approach under controlled conditions. revision: partial

  2. Referee: [§3.2] §3.2 (Semantic Graph Model): No formal semantics are supplied for dependency resolution, version pinning, or configuration-state preservation in the graph decomposition/re-assembly process. This is load-bearing for the central claim that the approach is lossless.

    Authors: The semantic graph in §3.2 models VMIs with nodes for base images and packages and edges for dependencies derived from package metadata. The decomposition algorithm selects non-redundant components based on these dependencies. We do not provide formal semantics or a proof of losslessness in the current manuscript. We will revise the section to provide a more detailed description of how dependency resolution, version information, and configuration states are handled during decomposition and assembly, drawing from standard package management practices. revision: partial
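The assembly step described in that response can be pictured as a dependency-closure computation followed by a topological sort, as in the sketch below. It assumes dependency edges are available as a simple mapping from package name to its direct dependencies; `dependency_closure`, `install_order`, and the package names are hypothetical. Version pinning and configuration state are deliberately left out, which is exactly the gap the referee points to. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, Iterable, List, Set


def dependency_closure(requested: Iterable[str], depends_on: Dict[str, List[str]]) -> Set[str]:
    """Collect the requested packages plus everything they transitively depend on."""
    closure: Set[str] = set()
    stack = list(requested)
    while stack:
        pkg = stack.pop()
        if pkg in closure:
            continue
        if pkg not in depends_on:
            raise KeyError(f"package not in repository: {pkg}")
        closure.add(pkg)
        stack.extend(depends_on[pkg])
    return closure


def install_order(requested: Iterable[str], depends_on: Dict[str, List[str]]) -> List[str]:
    """Order the closure so that every dependency precedes the packages that need it."""
    closure = dependency_closure(requested, depends_on)
    subgraph = {pkg: depends_on[pkg] for pkg in closure}
    # TopologicalSorter treats each mapped value as the node's predecessors.
    return list(TopologicalSorter(subgraph).static_order())


# Hypothetical dependency edges, for illustration only.
depends_on = {
    "nginx": ["libssl", "libc"],
    "libssl": ["libc"],
    "vim": ["libc"],
    "libc": [],
}

order = install_order(["nginx"], depends_on)
print(order)
```

A cyclic dependency would raise `graphlib.CycleError` here; a real package manager resolves such cases with its own rules, which this sketch does not attempt to model.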

Circularity Check

0 steps flagged

No circularity: system design and experimental claims are self-contained

The paper describes a new VMI management system (Expelliarmus) based on semantic graph modeling, decomposition, and assembly, with performance claims supported by evaluation on synthetic workloads against external baselines (Mirage, Hemera). No equations, fitted parameters, self-citations as load-bearing premises, or reductions of predictions to inputs are present in the provided text. The central claims rest on an independent system implementation and empirical measurements rather than any definitional or citation-based circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The modeling choice of semantic graphs is a domain assumption whose validity is not independently evidenced in the provided text.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 1 internal anchor

  1. [1] G. Ammons, V. Bala, T. Mummert, D. Reimer, and X. Zhang. Virtual machine images as structured data: The mirage image library. In Proceedings of the 3rd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’11, pages 22–22, Berkeley, CA, USA, 2011. USENIX Association

  2. [2] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 164–177, New York, NY, USA, 2003. ACM

  3. [3] B. Beach. Pro PowerShell for Amazon Web Services: DevOps for the AWS Cloud. Apress, Berkeley, CA, USA, 1st edition, 2014

  4. [4] R. Buyya, C.S. Yeo, S. Venugopal, J. Broberg, and I. Brandic. Cloud computing and emerging it platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Gener. Comput. Syst., 25(6):599–616, June 2009

  5. [5] D. Campello, C. Crespo, A. Verma, R. Rangaswami, and P. Jayachandran. Coriolis: Scalable VM clustering in clouds. In Proceedings of the 10th International Conference on Autonomic Computing (ICAC 13), pages 101–105, San Jose, CA, 2013. USENIX

  6. [6] R. J. Creasy. The origin of the vm/370 time-sharing system. IBM Journal of Research and Development, 25(5):483–490, Sep 1981

  7. [7] A. V. Dastjerdi and R. Buyya. Fog computing: Helping the internet of things realize its potential. Computer, 49(8):112–116, Aug 2016

  8. [8] W. Felter, A. Ferreira, R. Rajamony, and J. Rubio. An updated performance comparison of virtual machines and linux containers. In 2015 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 171–172, March 2015

  9. [9] P. Jaccard. The distribution of the flora in the alpine zone. New Phytologist, 11(2):37–50, February 1912

  10. [10] K.R. Jayaram, C. Peng, Z. Zhang, M. Kim, H. Chen, and H. Lei. An empirical analysis of similarity in virtual machine images. In Proceedings of the Middleware 2011 Industry Track Workshop, Middleware ’11, pages 6:1–6:6, New York, NY, USA, 2011. ACM

  11. [11] K. Jin and E.L. Miller. The effectiveness of deduplication on virtual machine disk images. In Proceedings of SYSTOR 2009: The Israeli Experimental Systems Conference, SYSTOR ’09, pages 7:1–7:12, New York, NY, USA, 2009. ACM

  12. [12] A. Kochut, A. Karve, and B. Nicolae. Towards efficient on-demand vm provisioning: Study of vm runtime i/o access patterns to shared image content. In 2015 IFIP/IEEE International Symposium on Integrated Network Management (IM), pages 321–329, May 2015

  13. [13] A. Liguori and E.V. Hensbergen. Experiences with content addressable storage and virtual disks. In Proceedings of the First Conference on I/O Virtualization, WIOV’08, pages 5–5, Berkeley, CA, USA, 2008. USENIX Association

  14. [14] X. Lin, M. Hibler, E. Eide, and R. Ricci. Using deduplicating storage for efficient disk image deployment. In TRIDENTCOM, 2015

  15. [15] H. Liu, B. He, X. Liao, and H. Jin. Towards declarative and data-centric virtual machine image management in iaas clouds. IEEE Transactions on Cloud Computing, page 1

  16. [16] N. Mandagere, P. Zhou, M.A. Smith, and S. Uttamchandani. Demystifying data deduplication. In Proceedings of the ACM/IFIP/USENIX Middleware ’08 Conference Companion, Companion ’08, pages 12–17, New York, NY, USA, 2008. ACM

  17. [17] P.M. Mell and T. Grance. Sp 800-145. the nist definition of cloud computing. Technical report, Gaithersburg, MD, United States, 2011

  18. [18] Chun-Ho Ng and P.P.C. Lee. Revdedup: A reverse deduplication storage system optimized for reads to latest backups. In Proceedings of the 4th Asia-Pacific Workshop on Systems, APSys ’13, pages 15:1–15:7, New York, NY, USA, 2013. ACM

  19. [19] Chun-Ho Ng, M. Ma, Tsz-Yeung Wong, P.P.C. Lee, and J.C.S. Lui. Live deduplication storage of virtual machine images in an open-source cloud. In Proceedings of the 12th ACM/IFIP/USENIX International Conference on Middleware, Middleware’11, pages 81–100, Berlin, Heidelberg, 2011. Springer-Verlag

  20. [20] B. Nicolae, A. Kochut, and A. Karve. Discovering and leveraging content similarity to optimize collective on-demand data access to iaas cloud storage. In 2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pages 211–220, May 2015

  21. [21] M.O. Rabin. Fingerprinting by Random Polynomials. Center for Research in Computing Technology, Aiken Computation Laboratory, Univ., 1981

  22. [22] K. Razavi and T. Kielmann. Scalable virtual machine deployment using vm image caches. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’13, pages 65:1–65:12, New York, NY, USA, 2013. ACM

  23. [23] D. Reimer, A. Thomas, G. Ammons, T. Mummert, B. Alpern, and V. Bala. Opening black boxes: Using semantic information to combat virtual machine image sprawl. In Proceedings of the Fourth ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, VEE ’08, pages 111–120, New York, NY, USA, 2008. ACM

  24. [24] J. Sahoo, S. Mohapatra, and R. Lath. Virtualization: A survey on concepts, taxonomy and associated security issues. In 2010 Second International Conference on Computer and Network Technology, pages 222–226, April 2010

  25. [25] W. Shi and S. Dustdar. The promise of edge computing. Computer, 49(5):78–81, May 2016

  26. [26] J. E. Smith and R. Nair. The architecture of virtual machines. Computer, 38(5):32–38, May 2005

  27. [27] C. Sun, L. He, Q. Wang, and R. Willenborg. Simplifying service deployment with virtual appliances. In 2008 IEEE International Conference on Services Computing, volume 2, pages 265–272, July 2008

  28. [28] B. Varghese and R. Buyya. Next generation cloud computing: New trends and research directions. CoRR, abs/1707.07452, 2017

  29. [29] J. Xu, W. Zhang, S. Ye, J. Wei, and T. Huang. A lightweight virtual machine image deduplication backup approach in cloud environment. In 2014 IEEE 38th Annual Computer Software and Applications Conference, pages 503–508, July 2014

  30. [30] J. Xu, W. Zhang, Z. Zhang, T. Wang, and T. Huang. Clustering-based acceleration for virtual machine image deduplication in the cloud environment. J. Syst. Softw., 121(C):144–156, November 2016

  31. [31] X. Zhao, Y. Zhang, Y. Wu, K. Chen, J. Jiang, and K. Li. Liquid: A scalable deduplication file system for virtual machine images. IEEE Transactions on Parallel and Distributed Systems, 25(5):1257–1266, May 2014