Semantics-aware Virtual Machine Image Management in IaaS Clouds
Pith reviewed 2026-05-25 18:37 UTC · model grok-4.3
The pith
Modeling VMIs as semantic graphs shrinks cloud image repositories by a factor of 2.2-16
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Expelliarmus system models virtual machine images (VMIs) as semantic graphs to speed up similarity computation, decomposes them semantically so that only non-redundant base images and software packages are stored, and assembles images from these parts on user request. The reported result is a repository 2.2-16 times smaller than with existing systems, with significantly faster publishing and slightly faster retrieval.
What carries the argument
Semantic graphs of VMIs that enable similarity computation, decomposition into unique components, and on-demand assembly.
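A minimal sketch of what similarity over such graphs could look like. It assumes a VMI is reduced to one base-image node plus (name, version) package nodes, and it uses the Jaccard index over node sets as the similarity measure. The `VMIGraph` class, the field names, and the choice of Jaccard are illustrative assumptions, not the paper's confirmed data model; dependency edges are left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VMIGraph:
    """Hypothetical semantic graph of a VMI (nodes only, edges omitted)."""
    base_image: str
    packages: frozenset  # frozenset of (name, version) tuples

    def nodes(self):
        # Treat the base image as one more node so that two VMIs built on
        # different base images share less.
        return frozenset({("base-image", self.base_image)}) | self.packages


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(g1, g2):
    return jaccard(g1.nodes(), g2.nodes())


# Illustrative images; names and versions are made up.
web = VMIGraph("ubuntu-20.04", frozenset({("nginx", "1.18"), ("openssl", "1.1.1")}))
db = VMIGraph("ubuntu-20.04", frozenset({("postgres", "12"), ("openssl", "1.1.1")}))
score = similarity(web, db)  # 2 shared nodes out of 4 distinct nodes
```

Comparing small sets of package identifiers is far cheaper than comparing gigabyte-scale disk contents, which is the intuition behind the claimed speed-up.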
Load-bearing premise
The semantic graph accurately captures the structure of base images and software packages so that decomposition and re-assembly maintain complete functionality and acceptable performance.
What would settle it
Deploy a set of original and reassembled VMIs in a test cloud and check whether applications run without errors or differences in behavior, while measuring any increase in boot time or resource use.
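A first, static step toward that test could be a manifest comparison between an original image and its reassembled counterpart. The sketch below assumes each image can be summarized as a mapping from package name to installed version; `manifest_diff` and the sample data are hypothetical. A clean diff is only a necessary condition: it says nothing about runtime behavior, boot time, or configuration state.

```python
def manifest_diff(original, reassembled):
    """Compare two {package_name: version} manifests.

    Returns the packages missing from the reassembled image, the packages
    that appear only in it, and the packages whose versions differ.
    """
    orig_names, new_names = set(original), set(reassembled)
    return {
        "missing": sorted(orig_names - new_names),
        "extra": sorted(new_names - orig_names),
        "changed": sorted(
            name for name in orig_names & new_names
            if original[name] != reassembled[name]
        ),
    }


def is_equivalent(original, reassembled):
    """True when the manifests match exactly (static check only)."""
    return not any(manifest_diff(original, reassembled).values())


# Made-up manifests for illustration.
original = {"nginx": "1.18", "openssl": "1.1.1", "curl": "7.68"}
reassembled = {"nginx": "1.18", "openssl": "3.0", "vim": "8.2"}
report = manifest_diff(original, reassembled)
```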
Original abstract
Infrastructure-as-a-service (IaaS) Clouds concurrently accommodate diverse sets of user requests, requiring an efficient strategy for storing and retrieving virtual machine images (VMIs) at a large scale. VMI storage management requires dealing with multiple VMIs, typically in the magnitude of gigabytes, which entails VMI sprawl issues hindering elastic resource management and provisioning. Nevertheless, existing techniques to facilitate VMI management overlook VMI semantics (i.e., at the level of base image and software packages), with either restricted possibility to identify and extract reusable functionalities or higher VMI publish and retrieval overheads. In this paper, we design, implement and evaluate Expelliarmus, a novel VMI management system that helps to minimize storage, publish and retrieval overheads. To achieve this goal, Expelliarmus incorporates three complementary features. First, it makes use of VMIs modelled as semantic graphs to expedite the similarity computation between multiple VMIs. Second, Expelliarmus provides a semantic-aware VMI decomposition and base image selection to extract and store non-redundant base images and software packages. Third, Expelliarmus can also assemble VMIs based on the required software packages upon user request. We evaluate Expelliarmus through a representative set of synthetic Cloud VMIs on a real test-bed. Experimental results show that our semantic-centric approach is able to optimize repository size by 2.2-16 times compared to state-of-the-art systems (e.g. IBM's Mirage and Hemera) with significant VMI publish and slight retrieval performance improvement.
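The second and third features can be illustrated with a toy repository that stores each base image and each package once and keeps a per-VMI recipe for reassembly. This is a sketch under stated assumptions: the `Repository` class, its method names, and all sizes (in MB) are invented for illustration and do not reproduce the paper's implementation or results.

```python
class Repository:
    """Toy component store: each base image and package is kept once."""

    def __init__(self):
        self.bases = {}     # base image name -> size in MB
        self.packages = {}  # (name, version) -> size in MB
        self.recipes = {}   # VMI name -> (base image name, package keys)

    def publish(self, name, base, base_size, packages):
        # Decompose: store only components that are not already present.
        self.bases.setdefault(base, base_size)
        for key, size in packages.items():
            self.packages.setdefault(key, size)
        self.recipes[name] = (base, sorted(packages))

    def retrieve(self, name):
        # Assemble: look up the recipe and gather its components.
        base, keys = self.recipes[name]
        size = self.bases[base] + sum(self.packages[k] for k in keys)
        return {"base": base, "packages": keys, "size": size}

    def stored_size(self):
        return sum(self.bases.values()) + sum(self.packages.values())


repo = Repository()
repo.publish("web", "ubuntu-20.04", 500,
             {("nginx", "1.18"): 5, ("openssl", "1.1.1"): 3})
repo.publish("db", "ubuntu-20.04", 500,
             {("postgres", "12"): 40, ("openssl", "1.1.1"): 3})

# Size if each VMI were stored as a monolithic image.
naive_size = repo.retrieve("web")["size"] + repo.retrieve("db")["size"]
deduplicated_size = repo.stored_size()
```

In this made-up case the shared base image and the shared `openssl` package are stored once, so the repository holds 548 MB instead of 1051 MB. The saving grows with the number of VMIs that share components, which is the effect the abstract's 2.2-16x figure refers to.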
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Expelliarmus, a VMI management system for IaaS clouds that models VMIs as semantic graphs to compute similarities, perform semantic-aware decomposition into non-redundant base images and software packages, and assemble VMIs on demand. It claims 2.2-16x repository size reduction versus Mirage and Hemera, plus publish/retrieval improvements, based on evaluation with synthetic Cloud VMIs on a real test-bed.
Significance. If the decomposition is shown to be lossless and low-overhead, the semantic-graph approach could meaningfully address VMI sprawl by enabling finer-grained deduplication at the package level. The design integrates similarity computation, decomposition, and assembly in one system, which is a coherent contribution to cloud storage management.
major comments (2)
- [§4] §4 (Evaluation): The reported 2.2-16x size reductions rest on synthetic VMIs that deliberately avoid complex dependency cases (shared libraries, kernel modules, post-install scripts). Without end-to-end functional equivalence tests on realistic workloads, the storage gains cannot be treated as generally valid.
- [§3.2] §3.2 (Semantic Graph Model): No formal semantics are supplied for dependency resolution, version pinning, or configuration-state preservation in the graph decomposition/re-assembly process. This is load-bearing for the central claim that the approach is lossless.
minor comments (1)
- [Abstract] Abstract: quantitative performance claims are stated without error bars, dataset sizes, or comparison methodology; these details belong in the abstract or a summary table for quick assessment.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. Below we provide point-by-point responses to the major comments.
- Referee: [§4] §4 (Evaluation): The reported 2.2-16x size reductions rest on synthetic VMIs that deliberately avoid complex dependency cases (shared libraries, kernel modules, post-install scripts). Without end-to-end functional equivalence tests on realistic workloads, the storage gains cannot be treated as generally valid.
  Authors: We acknowledge that our evaluation is based on synthetic VMIs, which were designed to isolate the effects of semantic decomposition while covering a range of package dependencies from real cloud images. The paper does not include end-to-end functional equivalence tests on realistic workloads. This is a valid point, and in the revised version we will include additional discussion on the scope of the evaluation and the representativeness of the synthetic VMIs. We maintain that the reported gains demonstrate the potential of the approach under controlled conditions.
  revision: partial
- Referee: [§3.2] §3.2 (Semantic Graph Model): No formal semantics are supplied for dependency resolution, version pinning, or configuration-state preservation in the graph decomposition/re-assembly process. This is load-bearing for the central claim that the approach is lossless.
  Authors: The semantic graph in §3.2 models VMIs with nodes for base images and packages and edges for dependencies derived from package metadata. The decomposition algorithm selects non-redundant components based on these dependencies. We do not provide formal semantics or a proof of losslessness in the current manuscript. We will revise the section to provide a more detailed description of how dependency resolution, version information, and configuration states are handled during decomposition and assembly, drawing from standard package management practices.
  revision: partial
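To make the dependency-resolution concern concrete, here is a sketch of one standard way to resolve a package request against a dependency graph: compute the transitive closure of the requested packages, then order it so that every dependency precedes its dependents. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+). The `install_order` function and the sample graph are hypothetical, and nothing here is claimed to be the paper's algorithm; version pinning and configuration state, which the referee also raises, are not modeled.

```python
from graphlib import TopologicalSorter


def install_order(requested, depends_on):
    """Return the requested packages plus all transitive dependencies,
    ordered so that each package comes after everything it depends on.

    depends_on maps a package name to the names it directly depends on.
    Raises graphlib.CycleError if the dependencies are cyclic.
    """
    needed, stack = set(), list(requested)
    while stack:
        pkg = stack.pop()
        if pkg not in needed:
            needed.add(pkg)
            stack.extend(depends_on.get(pkg, ()))
    # TopologicalSorter expects node -> predecessors, i.e. dependencies.
    graph = {pkg: set(depends_on.get(pkg, ())) for pkg in needed}
    return list(TopologicalSorter(graph).static_order())


# Made-up dependency metadata.
depends_on = {
    "nginx": ["openssl", "libc"],
    "openssl": ["libc"],
    "postgres": ["libc"],
}
order = install_order(["nginx"], depends_on)
```

A formal treatment of the kind the referee asks for would additionally have to state what happens when two requested packages need conflicting versions of the same dependency.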
Circularity Check
No circularity: system design and experimental claims are self-contained
The paper describes a new VMI management system (Expelliarmus) based on semantic graph modeling, decomposition, and assembly, with performance claims supported by evaluation on synthetic workloads against external baselines (Mirage, Hemera). No equations, fitted parameters, self-citations as load-bearing premises, or reductions of predictions to inputs are present in the provided text. The central claims rest on an independent system implementation and empirical measurements rather than any definitional or citation-based circularity.