GreenMalloc: Allocator Optimisation for Industrial Workloads

Aidan Dakhama; Hector D. Menendez; Karine Even-Mendoza; W.B. Langdon

arXiv: 2510.21405 · v1 · submitted 2025-10-24 · cs.SE · cs.AR · cs.PF

Pith reviewed 2026-05-18 04:46 UTC · model grok-4.3

classification: cs.SE · cs.AR · cs.PF

keywords: memory allocator configuration · search-based optimization · heap usage reduction · evolutionary algorithms · execution traces · system simulation · performance tuning

The pith

A search framework tunes memory allocator settings from execution traces to cut average heap usage by up to 4.1 percent with no loss in speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to automatically adjust parameters in common memory allocators so programs use less heap space on average while running at the same speed. It does this by running a multi-objective search on lightweight traces first, then moving the top settings into a full system simulator for validation across varied workloads. If the reductions hold, developers could adopt smaller memory footprints in production code without trading off performance. The work focuses on two standard allocators and reports concrete savings that appear consistent rather than one-off.

Core claim

The paper claims that allocator configurations found by running NSGA-II against execution traces on a lightweight proxy (rand_malloc) can be transferred to a detailed simulator (gem5) and there produce up to 4.1 percent lower average heap usage with no runtime penalty across the tested workloads. The abstract also reports a 0.25 percent reduction, which in context appears to refer to runtime.

What carries the argument

Multi-objective evolutionary search that explores allocator parameter spaces from execution traces using a lightweight proxy before transfer to full simulation.
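The selection step at the heart of a multi-objective search can be illustrated with a Pareto-dominance filter. The sketch below is not the paper's implementation (the paper reports using NSGA-II via pymoo); it only shows how candidate configurations are compared when both heap usage and runtime are minimised. The configuration names and measurements are hypothetical.

```python
from typing import Dict, List, Tuple

# (heap bytes, runtime seconds); both objectives are minimised.
Objectives = Tuple[float, float]


def dominates(a: Objectives, b: Objectives) -> bool:
    """True if a is no worse than b on every objective and strictly better on one."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(scores: Dict[str, Objectives]) -> List[str]:
    """Return the names of configurations that no other configuration dominates."""
    return sorted(
        name
        for name, score in scores.items()
        if not any(
            dominates(other, score)
            for other_name, other in scores.items()
            if other_name != name
        )
    )


# Hypothetical proxy measurements, for illustration only.
scores = {
    "default": (100.0, 10.0),
    "cfg_a": (96.0, 10.0),   # less heap at equal runtime: dominates "default"
    "cfg_b": (98.0, 9.5),    # trades heap for runtime against "cfg_a"
    "cfg_c": (99.0, 10.5),   # dominated by "cfg_a"
}
front = pareto_front(scores)
print(front)
```

Only the non-dominated configurations (here `cfg_a` and `cfg_b`) would be candidates for transfer to the more expensive simulator stage.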

If this is right

Standard allocators can be reconfigured per workload to reduce average memory demand.
The same search process applies to multiple allocators without manual tuning.
Trace-based proxy evaluation makes the search cheap enough to repeat on new programs.
Lower heap usage can occur without any measured increase in execution time.
The method scales to diverse workloads rather than single benchmarks.
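As an illustration of what "reconfiguring" a standard allocator can look like, the sketch below formats a candidate parameter set as a `GLIBC_TUNABLES` environment value, which is one documented way to set glibc malloc parameters at process start. The source does not say which parameters GreenMalloc tunes or how it applies them, so the three tunables, their values, and the helper names are assumptions chosen for illustration.

```python
import os
from typing import Dict, Mapping


def glibc_tunables(params: Mapping[str, int]) -> str:
    """Format malloc parameters as a GLIBC_TUNABLES value (colon-separated name=value pairs)."""
    return ":".join(
        f"glibc.malloc.{name}={value}" for name, value in sorted(params.items())
    )


def env_with_tunables(params: Mapping[str, int]) -> Dict[str, str]:
    """Copy the current environment and add the tunables for a child process."""
    env = dict(os.environ)
    env["GLIBC_TUNABLES"] = glibc_tunables(params)
    return env


# Hypothetical candidate configuration; values are illustrative, not from the paper.
candidate = {"trim_threshold": 262144, "mmap_threshold": 131072, "arena_max": 2}
tunables = glibc_tunables(candidate)
print(tunables)
```

A search loop could then launch each evaluation with something like `subprocess.run(["./workload"], env=env_with_tunables(candidate))`, where `./workload` stands in for whatever benchmark binary is being measured.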

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar proxy-plus-transfer searches could optimize other low-level system components such as thread schedulers or cache policies.
If the reductions persist on real machines, energy use in memory-bound servers might drop proportionally to heap savings.
Industrial teams could embed the search step into their build pipelines to generate allocator settings automatically for each release.
The approach suggests that small parameter changes in mature allocators still have untapped headroom when guided by workload traces.

Load-bearing premise

That the best settings found on the lightweight proxy from traces will still deliver the same memory and speed benefits once moved into the detailed simulator and that the chosen programs reflect real industrial use.

What would settle it

Running the same workloads on physical hardware with the discovered allocator configurations and measuring whether heap usage drops by similar percentages while runtime stays flat or improves.

Figures

Figures reproduced from arXiv: 2510.21405 by Aidan Dakhama, Hector D. Menendez, Karine Even-Mendoza, W.B. Langdon.

**Figure 1.** General GreenMalloc workflow: starting with rand_malloc optimisation to identify efficient allocation parameters, and ending with validation on gem5 to assess improvements in memory usage and runtime. We show it is possible to automatically tune memory allocator parameters to reduce heap usage and energy consumption in industrial workloads. By targeting both memory and runtime, we identify allocator configuratio…

**Figure 2.** Comparison of default and GreenMalloc-optimised configurations of glibc malloc (glibc) and TCMalloc (tcmalloc). From left to right: average heap size, memory release rate, peak heap size, and instruction counts, as measured with Valgrind, perf, and gem5. All values shown are Pareto-optimal. Results: RQ2. For average heap usage, glibc shows a clear improvement: Tuning reduced the mean from 180 428 315 to …

Original abstract

We present GreenMalloc, a multi objective search-based framework for automatically configuring memory allocators. Our approach uses NSGA II and rand_malloc as a lightweight proxy benchmarking tool. We efficiently explore allocator parameters from execution traces and transfer the best configurations to gem5, a large system simulator, in a case study on two allocators: the GNU C/CPP compiler's glibc malloc and Google's TCMalloc. Across diverse workloads, our empirical results show up to 4.1 percantage reduction in average heap usage without loss of runtime efficiency; indeed, we get a 0.25 percantage reduction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces GreenMalloc, a multi-objective optimization framework that applies NSGA-II to tune parameters of glibc malloc and TCMalloc. It uses rand_malloc as a lightweight proxy on execution traces to discover configurations, then transfers the best ones to gem5 for full-system evaluation, reporting up to 4.1% reduction in average heap usage with no runtime loss (and a 0.25% reduction in one case) across diverse workloads.

Significance. If the proxy-to-gem5 transfer holds and the workloads are representative, the work demonstrates a practical, automated method for allocator tuning that could reduce memory pressure in industrial systems without performance cost. The separation of lightweight proxy search from full simulation is a methodological strength worth highlighting if supported by validation data.

major comments (2)

[Evaluation / Results] The central claim of heap-usage reduction (up to 4.1%) rests on the assumption that NSGA-II configurations found by rand_malloc on traces transfer to glibc/TCMalloc inside gem5 with only minor discrepancies. The manuscript provides no quantitative bound on proxy-vs-gem5 discrepancy for the final parameter sets, nor an ablation showing that proxy ranking is preserved under gem5's memory model. This is load-bearing for the empirical results.
[Abstract and §4] Abstract and results sections report concrete percentage reductions but supply no workload details, number of independent runs, statistical significance tests, or error bars. Without these, it is impossible to judge whether the observed 4.1% and 0.25% figures are robust or could be explained by measurement noise.
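The ranking-preservation check requested in the first major comment could be approximated with a rank correlation between proxy and simulator measurements for the same configurations. The sketch below computes Kendall's tau-a in plain Python; it is one possible way to quantify agreement, not something the paper reports, and the heap figures are hypothetical.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall tau-a: (concordant - discordant) / total pairs; tied pairs count as neither."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least two items")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / pairs


# Hypothetical average heap usage for four configurations, measured twice:
# once on the lightweight proxy and once in the full simulator.
proxy_heap = [100.0, 96.0, 98.0, 99.0]
sim_heap = [200.0, 193.0, 197.0, 196.0]
tau = kendall_tau(proxy_heap, sim_heap)
print(round(tau, 3))
```

A value near 1 would indicate that the proxy orders configurations the same way the simulator does; values near 0 would suggest that proxy rankings carry little information about simulator outcomes.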

minor comments (2)

[Abstract] Correct spelling: 'percantage' appears twice in the abstract and should read 'percentage'.
[Methodology] Define 'average heap usage' precisely and state how it is computed identically in rand_malloc and gem5; the current description leaves room for measurement mismatch.
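To make the measurement-mismatch concern concrete, the sketch below shows one candidate definition of average heap usage: a time-weighted mean over a trace of (timestamp, live heap bytes) samples. The paper's actual definition is not given in the material above, so this definition, the function name, and the sample trace are all assumptions.

```python
from typing import List, Tuple


def time_weighted_average(samples: List[Tuple[float, int]], end_time: float) -> float:
    """Average heap size, weighting each sample by how long it stays in effect.

    samples: (timestamp, live heap bytes), sorted by timestamp; each value is
    assumed to hold until the next sample, and the last one until end_time.
    """
    if not samples:
        raise ValueError("trace is empty")
    start = samples[0][0]
    if end_time <= start:
        raise ValueError("end_time must be after the first sample")
    boundaries = [t for t, _ in samples[1:]] + [end_time]
    area = sum(heap * (t_next - t) for (t, heap), t_next in zip(samples, boundaries))
    return area / (end_time - start)


# Hypothetical trace, for illustration only.
trace = [(0.0, 100), (1.0, 300), (3.0, 200)]
weighted = time_weighted_average(trace, end_time=4.0)
unweighted = sum(heap for _, heap in trace) / len(trace)
print(weighted, unweighted)
```

On this trace the time-weighted mean and the plain per-sample mean differ, which is the kind of discrepancy that could arise if two tools sample or average the heap differently.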

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where we agree and will revise the paper accordingly.

Referee: [Evaluation / Results] The central claim of heap-usage reduction (up to 4.1%) rests on the assumption that NSGA-II configurations found by rand_malloc on traces transfer to glibc/TCMalloc inside gem5 with only minor discrepancies. The manuscript provides no quantitative bound on proxy-vs-gem5 discrepancy for the final parameter sets, nor an ablation showing that proxy ranking is preserved under gem5's memory model. This is load-bearing for the empirical results.

Authors: We agree that a quantitative validation of the proxy-to-gem5 transfer is necessary to support the central claims. The original manuscript reports the gem5 results for the top proxy-derived configurations but does not include explicit discrepancy measurements or a ranking-preservation ablation. In the revised manuscript we will add a dedicated subsection to the evaluation that reports per-configuration heap-usage differences between rand_malloc and gem5 for the final parameter sets, together with an ablation on a representative subset of workloads demonstrating that the proxy ranking is largely preserved under the full-system memory model.
Revision: yes
Referee: [Abstract and §4] Abstract and results sections report concrete percentage reductions but supply no workload details, number of independent runs, statistical significance tests, or error bars. Without these, it is impossible to judge whether the observed 4.1% and 0.25% figures are robust or could be explained by measurement noise.

Authors: We accept that the current reporting lacks sufficient experimental detail and statistical context. The manuscript describes the workloads only as 'diverse' and omits run counts, error bars, and significance testing. We will revise both the abstract and §4 to (i) list the specific workload categories and benchmarks employed, (ii) state that each configuration was evaluated over 10 independent runs, (iii) add error bars to the reported figures, and (iv) include a short statistical analysis (paired t-tests) confirming that the observed reductions exceed measurement variability.
Revision: yes
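The paired t-test named in the response above can be sketched with the standard library alone. The example computes only the t statistic for matched baseline and tuned measurements; converting it to a p-value would need a t-distribution table or a statistics package. The run values are hypothetical and do not come from the paper.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(baseline: Sequence[float], tuned: Sequence[float]) -> float:
    """t statistic for paired samples: mean difference over its standard error."""
    if len(baseline) != len(tuned) or len(baseline) < 2:
        raise ValueError("need two equal-length samples with at least two pairs")
    diffs = [b - t for b, t in zip(baseline, tuned)]
    spread = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if spread == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean(diffs) / (spread / math.sqrt(len(diffs)))


# Hypothetical heap usage per run (arbitrary units), same workloads in both lists.
baseline_runs = [100.0, 102.0, 98.0, 101.0]
tuned_runs = [96.0, 97.0, 95.0, 96.0]
t_stat = paired_t_statistic(baseline_runs, tuned_runs)
print(round(t_stat, 2))
```

The statistic has n - 1 degrees of freedom, so with 10 runs per configuration it would be compared against a t distribution with 9 degrees of freedom.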

Circularity Check

0 steps flagged

No circularity: empirical search results independent of fitted inputs

The paper describes an empirical multi-objective search using NSGA-II on rand_malloc proxy traces, followed by transfer of discovered configurations to gem5 for glibc and TCMalloc. Reported heap reductions (up to 4.1%) are measured outcomes from simulation runs on workloads, not quantities derived from equations or parameters that are defined in terms of the same measurements. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided abstract or described methodology. The derivation chain consists of independent search and simulation steps whose outputs are not forced by construction from the inputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The framework depends on the proxy benchmark faithfully representing allocator behavior and on the transferability of configurations to the full simulator; these are domain assumptions rather than derived results.

free parameters (1)

NSGA-II population size and generation count: standard evolutionary algorithm controls that must be chosen to balance search effort and solution quality.

axioms (1)

Domain assumption: the rand_malloc proxy produces representative performance signals for allocator parameter search. It is used as a lightweight benchmarking tool to explore parameters from execution traces before gem5 transfer.

invented entities (1)

GreenMalloc framework (no independent evidence)
Purpose: automated multi-objective search for allocator configuration.
A newly introduced tool that orchestrates the NSGA-II, proxy, and simulator stages.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "We employ the genetic algorithm (GA), NSGA-II, implemented using pymoo... We formulate the optimisation as a multi-objective problem, jointly targeting peak heap usage and execution time"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "Synthetic Benchmarking with rand_malloc... transfer the best-performing parameter configurations to gem5"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages

[1] Binkert, N., et al.: The gem5 simulator. SIGARCH Comput. Archit. News 39(2), 1–7 (2011). https://doi.org/10.1145/2024716.2024718
[2] Dakhama, A., et al.: Enhancing search-based testing with LLMs for finding bugs in system simulators. Automated Software Engineering 32(2) (2025)
[3] Durner, D., et al.: On the impact of memory allocation on high-performance query processing. In: DaMoN 2019. https://doi.org/10.1145/3329785.3329918
[4] Even-Mendoza, et al.: Search+LLM-based testing for ARM simulators. In: ICSE-SEIP 2025. pp. 469–480. https://doi.org/10.1109/ICSE-SEIP66354.2025.00047
[5] Ghemawat, S.: TCMalloc: Thread-caching malloc (2024), https://gperftools.github.io/gperftools/tcmalloc.html, accessed: Sep. 2025
[6] Google: TCMalloc. https://github.com/google/tcmalloc, accessed: Sep. 2025
[7] GreenMalloc: This paper’s artifact (2025). https://doi.org/10.5281/zenodo.17171047
[8] Langdon, W.B.: A genetic improvement parameter benchmark: rand_malloc.c. In: UKCI (2025), https://gpbib.cs.ucl.ac.uk/gp-html/Langdon_2025_UKCI.html
[9] Loosemore, S., et al.: The GNU C Library Reference Manual. GNU Project, https://sourceware.org/glibc/manual/latest/pdf/libc.pdf, accessed: Sep. 2025
[10] Nethercote, N., et al.: Valgrind: a framework for heavyweight dynamic binary instrumentation. In: PLDI 2007. p. 89–100
[11] Pereira, R., et al.: Energy efficiency across programming languages: how do energy, time, and memory relate? In: SLE 2017. p. 256–267. ACM
[12] Zhou, Z., et al.: Characterizing a memory allocator at warehouse scale. In: ASPLOS
[13] p. 192–206. ACM. https://doi.org/10.1145/3620666.3651350
