
arxiv: 1907.07776 · v1 · pith:FGXWUCQTnew · submitted 2019-07-17 · 💻 cs.AR · cs.LG

CADS: Core-Aware Dynamic Scheduler for Multicore Memory Controllers

Pith reviewed 2026-05-24 19:40 UTC · model grok-4.3

classification 💻 cs.AR cs.LG
keywords: memory controller scheduling · multicore processors · reinforcement learning · DRAM access patterns · locality · bank parallelism · resource fairness · performance optimization

The pith

CADS uses reinforcement learning to dynamically adjust memory scheduling for multiple cores at runtime, delivering 20% better CPI on PARSEC benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CADS, a memory controller scheduler that uses reinforcement learning to change its policy on the fly according to the requests arriving from different cores. It works by preserving locality within each core's data stream, spreading accesses across DRAM banks for parallelism, and dividing bandwidth fairly. Traditional controllers were built for single-core use and therefore lose efficiency when cores compete. A reader would care because this targets the shared DRAM bottleneck that limits how much extra performance extra cores can actually deliver.

Core claim

CADS is a core-aware dynamic scheduler for multicore memory controllers that employs reinforcement learning to alter its scheduling strategy dynamically at runtime, utilizing locality among data requests from multiple cores, exploiting parallelism in accessing multiple banks of DRAM, and sharing the DRAM while guaranteeing fairness to all cores. Using CADS policy, we achieve 20% better cycles per instruction (CPI) in running memory intensive and compute intensive PARSEC parallel benchmarks simultaneously, and 16% better CPI with SPEC 2006 benchmarks.

What carries the argument

Core-Aware Dynamic Scheduler (CADS) that uses a reinforcement learning agent to choose among scheduling strategies according to observed per-core memory access patterns.
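
As a rough illustration of how an agent might choose among scheduling strategies, here is a minimal Python sketch assuming tabular Q-learning with epsilon-greedy exploration. The action names, the state encoding, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are hypothetical placeholders; the paper's actual feature and action sets are not listed on this page.

```python
import random
from collections import defaultdict

# Hypothetical action set, for illustration only.
ACTIONS = ("favor_row_hits", "favor_bank_parallelism", "favor_starved_core")


class StrategySelector:
    """Tabular Q-learning over (state, action) pairs with epsilon-greedy choice."""

    def __init__(self, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
        self.alpha = alpha      # learning rate (assumed value)
        self.gamma = gamma      # discount factor (assumed value)
        self.epsilon = epsilon  # exploration probability (assumed value)
        self.q = defaultdict(float)  # unseen (state, action) pairs start at 0.0
        self.rng = random.Random(seed)

    def choose(self, state):
        # Explore with probability epsilon, otherwise take the best-known action.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: self.q[(state, a)])

    def update(self, state, action, reward, next_state):
        # Standard one-step Q-learning update.
        best_next = max(self.q[(next_state, a)] for a in ACTIONS)
        target = reward + self.gamma * best_next
        self.q[(state, action)] += self.alpha * (target - self.q[(state, action)])
```

A state here can be any hashable summary of per-core access patterns, for example a tuple such as `("core0_heavy", "bank_conflict")`. A hardware scheduler would need a far more compact encoding than a Python dictionary; the sketch only shows the control loop.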

If this is right

  • Mixed memory-intensive and compute-intensive workloads on the same multicore chip run with higher effective processor throughput.
  • Memory controllers no longer require workload-specific static tuning.
  • DRAM bandwidth is divided fairly among cores without manual intervention.
  • Gains appear on both benchmark suites: 20% better CPI on the PARSEC parallel benchmarks and 16% better CPI on SPEC 2006.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same runtime-learning method could be applied to scheduling decisions at other shared resources such as last-level cache or on-chip interconnect.
  • If the learned policies remain stable on workloads never seen during design, future processors could ship with less hand-tuned scheduling logic.
  • Extending the fairness and locality objectives to include power or thermal limits would produce an energy-aware variant of the same scheduler.

Load-bearing premise

A reinforcement learning agent can learn effective, stable scheduling policies at runtime that generalize beyond the training workloads while adding negligible overhead to the memory controller hardware.

What would settle it

A cycle-accurate simulation or hardware prototype of CADS on a multicore processor running the listed PARSEC workload mix that reports CPI improvement below 10 percent or controller area overhead above 5 percent.
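
The two thresholds can be written as a small check. This is a sketch under one assumption that the page does not confirm: that "better CPI" means a fractional reduction in CPI relative to the baseline scheduler. The function and parameter names are illustrative.

```python
def cpi_improvement(baseline_cpi: float, cads_cpi: float) -> float:
    """Fractional CPI reduction relative to the baseline (lower CPI is better)."""
    if baseline_cpi <= 0:
        raise ValueError("baseline CPI must be positive")
    return (baseline_cpi - cads_cpi) / baseline_cpi


def claim_is_undermined(baseline_cpi: float, cads_cpi: float,
                        area_overhead: float,
                        min_gain: float = 0.10,
                        max_area: float = 0.05) -> bool:
    """True when the CPI gain falls below min_gain or area overhead exceeds max_area."""
    gain = cpi_improvement(baseline_cpi, cads_cpi)
    return gain < min_gain or area_overhead > max_area
```

For example, a baseline CPI of 2.0 and a CADS CPI of 1.6 is a reduction of about 20%, which passes the first threshold; the verdict then depends on the measured area overhead.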

Figures

Figures reproduced from arXiv: 1907.07776 by Eduardo Olmedo Sanchez, Xian-He Sun.

Figure 1: CADS structure. The number of environment-features and possible actions should achieve a good tradeoff between complexity and scheduling performance. Including more features and actions achieves better scheduling decisions [8]; however, the learning time will be longer and it will require more memory and computational resources [27]. Hence, we have limited the number of features and actions as low as possibl…
Figure 2: Core Aware Dynamic Scheduling, Algorithm. In the following, we describe the elements that compose the hardware implementation
Figure 3: Hardware Implementation
Figure 4: Number of requests for 4 cores
Figure 5: Number of requests for 8 cores
Figure 6: Number of requests for 16 cores. For PARSEC benchmarks, CADS policy outperforms the FR-FCFS policy by 14% on average for Intensive workloads and by 11% on average for Non-intensive workloads. With CPU2006 benchmarks, the average performance improvement with CADS is 11% and 10% for Intensive…
Original abstract

Memory controller scheduling is crucial in multicore processors, where DRAM bandwidth is shared. Since increased number of requests from multiple cores of processors becomes a source of bottleneck, scheduling the requests efficiently is necessary to utilize all the computing power these processors offer. However, current multicore processors are using traditional memory controllers, which are designed for single-core processors. They are unable to adapt to changing characteristics of memory workloads that run simultaneously on multiple cores. Existing schedulers may disrupt locality and bank parallelism among data requests coming from different cores. Hence, novel memory controllers that consider and adapt to the memory access characteristics, and share memory resources efficiently and fairly are necessary. We introduce Core-Aware Dynamic Scheduler (CADS) for multicore memory controller. CADS uses Reinforcement Learning (RL) to alter its scheduling strategy dynamically at runtime. Our scheduler utilizes locality among data requests from multiple cores and exploits parallelism in accessing multiple banks of DRAM. CADS is also able to share the DRAM while guaranteeing fairness to all cores accessing memory. Using CADS policy, we achieve 20% better cycles per instruction (CPI) in running memory intensive and compute intensive PARSEC parallel benchmarks simultaneously, and 16% better CPI with SPEC 2006 benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper proposes Core-Aware Dynamic Scheduler (CADS), a reinforcement learning-based memory controller scheduler for multicore processors. CADS dynamically adapts its policy at runtime to exploit request locality and bank parallelism while maintaining fairness across cores. The central empirical claim is a 20% CPI improvement on simultaneously running memory- and compute-intensive PARSEC benchmarks and a 16% CPI improvement on SPEC 2006 benchmarks relative to conventional schedulers.

Significance. If the RL policy can be shown to generalize stably to unseen workloads and to incur only negligible hardware overhead, the work would demonstrate a practical adaptive alternative to static FR-FCFS scheduling in shared DRAM systems, potentially improving multicore performance under varying memory pressure.
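
For readers unfamiliar with the static baseline, the following Python sketch shows the core FR-FCFS rule: serve a request that hits the currently open row if one exists, otherwise serve the oldest request. It is a simplification that ignores DRAM timing constraints and command-level scheduling, and the `Request` fields and `open_rows` mapping are illustrative names rather than anything taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Request:
    arrival: int  # cycle at which the request entered the queue
    core: int     # issuing core (unused by FR-FCFS, which is core-unaware)
    bank: int     # target DRAM bank
    row: int      # target row within that bank


def fr_fcfs_pick(queue: List[Request],
                 open_rows: Dict[int, int]) -> Optional[Request]:
    """Return a row-buffer hit if any exists, otherwise the oldest request."""
    if not queue:
        return None
    # Sort key: row hits first (False sorts before True), then earliest arrival.
    return min(queue,
               key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))
```

Because the rule never looks at `core`, a core with many row hits can keep jumping ahead of others, which is the kind of behaviour a core-aware scheduler is meant to address.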

major comments (3)
  1. [Abstract] Abstract: the headline CPI gains (20% PARSEC, 16% SPEC) are stated without any description of the RL state/action representation, reward function, training procedure, baseline schedulers, or statistical error bars, making it impossible to assess whether the central performance claim is supported by the data.
  2. [Evaluation] Evaluation sections: the results appear to train and test the RL agent on the same benchmark suites without held-out workload cross-validation or transfer experiments, leaving the claim that the learned policy generalizes beyond the training workloads unsupported.
  3. [Hardware Implementation] Hardware overhead discussion: the assertion that the RL component adds negligible area, power, and latency is made without any synthesis results, gate counts, or comparison against a baseline FR-FCFS controller, so the practicality of the approach cannot be verified.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of our work. We address each major point below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the headline CPI gains (20% PARSEC, 16% SPEC) are stated without any description of the RL state/action representation, reward function, training procedure, baseline schedulers, or statistical error bars, making it impossible to assess whether the central performance claim is supported by the data.

    Authors: The abstract is intentionally concise per conference guidelines. The body of the paper details the RL formulation (state includes per-core request queues and bank states; actions are priority assignments; reward combines weighted CPI and fairness metric; trained via online Q-learning updates during simulation) and baselines (FR-FCFS and other static policies). Results are averaged over multiple runs with variance shown in plots. We will revise the abstract to include one sentence summarizing the RL approach and baselines while retaining the performance claims. revision: partial

  2. Referee: [Evaluation] Evaluation sections: the results appear to train and test the RL agent on the same benchmark suites without held-out workload cross-validation or transfer experiments, leaving the claim that the learned policy generalizes beyond the training workloads unsupported.

    Authors: The policy is updated online at runtime using RL, allowing adaptation to workload changes without offline retraining on specific suites. The reported results use standard, diverse PARSEC and SPEC mixes to show consistent gains. To further support generalization, we will add a subsection with held-out cross-validation (train on half the benchmarks, test on the remainder) and note transfer behavior on additional synthetic traces in the revised evaluation section. revision: yes

  3. Referee: [Hardware Implementation] Hardware overhead discussion: the assertion that the RL component adds negligible area, power, and latency is made without any synthesis results, gate counts, or comparison against a baseline FR-FCFS controller, so the practicality of the approach cannot be verified.

    Authors: The manuscript argues negligibility based on the small size of the Q-table and lookup logic relative to a standard memory controller. We agree quantitative data would strengthen this. In revision we will add estimated gate counts and latency overheads drawn from comparable lightweight RL hardware designs in the literature, along with a direct comparison table versus FR-FCFS. Full place-and-route synthesis remains future work. revision: partial
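
The held-out protocol mentioned in the second response (train on half the benchmarks, test on the remainder) could be set up along these lines. This is a sketch only: the helper name and the seeded shuffle are assumptions, and the page does not say how the authors would choose the two halves.

```python
import random


def held_out_split(benchmarks, seed=0):
    """Shuffle a copy of the benchmark names and split it into two halves."""
    names = list(benchmarks)
    random.Random(seed).shuffle(names)  # seeded so the split is reproducible
    mid = len(names) // 2
    return names[:mid], names[mid:]
```

Calling `held_out_split` on a list of benchmark names returns a training half and a disjoint test half; repeating it with different seeds would give several splits over which results can be averaged.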

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper presents CADS as an RL-based scheduler and reports CPI gains as direct empirical outcomes from running PARSEC and SPEC benchmarks. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text to support the central claims. The results are framed as experimental measurements rather than quantities derived from the paper's own inputs by construction, rendering the evaluation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities. Standard RL assumptions (e.g., MDP formulation of the scheduling environment) are implicit but not stated or justified in the provided text.



Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages

  1. [1]

    S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens. Memory access scheduling. In ISCA 27, 2000

  2. [2]

    Jun Shao and B. T. Davis. A Burst Scheduling Access Reordering Mechanism. In HPCA 13, 2007

  3. [3]

    Zhao Zhang et al., A permutation based Page Interleaving Scheme to Reduce Row Buffer Conflicts and Exploit Data Locality. In MICRO 33, 2000

  4. [4]

    Z. Fang, X. H. Sun, Y. Chen, S. Byna, Core aware Memory Access Scheduling Schemes. In IPDPS 23, 2009

  5. [5]

    O. Mutlu, T. Moscibroda. Stall Time Fair Memory Access Scheduling for Chip Multiprocessors. In MICRO 40, 2007

  6. [6]

    B. T. Davis. Modern DRAM Architectures. Ph.D. dissertation, Dept. of EECS, University of Michigan, 2000

  7. [7]

    T. Mitchell. Machine Learning. McGraw Hill, Boston, MA, 1997

  8. [8]

    S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach. 2nd edition, Prentice Hall, 2003

  9. [9]

    R. M. Tomasulo. An efficient algorithm for exploiting multiple arithmetic units. In IBM Journal of Research and Development, Volume 11, Number 1, Page 25, 1967

    [10] K. Murakami et al., SIMP: A novel high Speed Single Processor Architecture. In ISCA 16, 1989

  10. [10]

    T. F. Chen and J. L. Baer, Reducing memory latency via non blocking and prefetching caches. In ASPLOS V, 1992

  11. [11]

    Wm. A. Wulf, Sally A. McKee, Hitting the memory wall: implications of the obvious. In ACM SIGARCH Computer Architecture News, 1995

  12. [12]

    J. Hassan, S. Chandra, T. N. Vijaykumar. Efficient Use of Memory Bandwidth to Improve Network Processor Throughput. In ISCA 30, 2003

  13. [13]

    M. Irodova, R. H. Sloan. Reinforcement Learning and Function Approximation. In FLAIRS 18, 2005

  14. [14]

    R. Sutton and A. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998

  15. [15]

    D. T. Wang. Modern DRAM Memory Systems Performance Analysis and a High Performance, Power Constrained DRAM Scheduling Algorithm. Ph.D. Dissertation, Dept. of ECE, University of Maryland, 2005

  16. [16]

    O. Mutlu, T. Moscibroda. Parallelism Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. In ISCA 35, 2008

  17. [17]

    Advanced Micro Devices, Inc. Family 10h AMD Phenom™ II Processor Product Data, 2009

  18. [18]

    Intel, Inc. Intel® Core™ i7 Processor Extreme Edition Series and Intel® Core™ i7 Processor Datasheet, 2008

  19. [19]

    V. Cuppu et al., A performance comparison of contemporary DRAM architectures. In ISCA 26, 1999

  20. [20]

    I. Hur and C. Lin. Adaptive history based memory schedulers. In MICRO 37, 2004

  21. [21]

    H. Zheng et al., Memory Access Scheduling Schemes for Systems with Multi Core Processors. In ICPP 37, 2008

  22. [22]

    N. L. Binkert et al., The M5 simulator: Modeling networked systems. In MICRO 39, 2006

  23. [23]

    D. Wang et al., DRAMsim: A memory system simulator. ACM SIGARCH Computer Architecture News, 2005

  24. [24]

    C. Bienia et al. The PARSEC Benchmark Suite: Characterization and Architectural Implications. In PACT 17, 2008

  25. [25]

    J. L. Henning. SPEC CPU2006 benchmark descriptions. ACM SIGARCH Computer Architecture News, 2006

  26. [26]

    E. Ipek et al., Self Optimizing Memory Controllers: A Reinforcement Learning Approach. In ISCA 35, 2008

  27. [27]

    C. Natarajan et al. A study of performance impact of memory controller features in multi processor server environment. In WMPI 3, 2004

  28. [28]

    Z. Zhu and Z. Zhang. A performance comparison of DRAM memory system optimizations for SMT processors. In HPCA 11, 2005

  29. [29]

    University of Maryland Memory System Simulator Manual. http://www.ece.umd.edu/DRAMsim/download/DRAMsimManual.pdf

    162 Computer Science & Information Technology (CS & IT)

  30. [30]

    Ganapathi, A., K. Datta, A. Fox, and D. Patterson, “A Case for Machine Learning to Optimize Multicore Performance”, HotPar09, Berkeley, CA, 3/2009

  31. [31]

    Jose F. Martinez, Engin Ipek. Dynamic Multi core Resource Management: A Machine Learning Approach. In Micro 42, 2009

  32. [32]

    Ramazan Bitirgen et al., Coordinated Management of Multiple Interacting Resources in Chip Multiprocessors: A Machine Learning Approach. In Micro 41, 2008

  33. [33]

    Xiangping Bu et al., A Reinforcement Learning Approach to Online Web System Auto configuration. In ICDCS, 2009

  34. [34]

    Hiroyuki Usui et al., DASH: Deadline Aware High Performance Memory Scheduler for Heterogeneous Systems with Hardware Accelerators. In ACM Transactions on Architecture and Code Optimization 2016

  35. [35]

    Onur Mutlu, Lavanya Subramanian. Research Problems and Opportunities in Memory Systems. In Journal of Supercomputing Frontiers and Innovations 2014

  36. [36]

    Rachata Ausavarungnirun et al. High Performance and Energy Efficient Memory Scheduler Design for Heterogeneous Systems. 2018

AUTHORS

Eduardo Olmedo Sanchez, graduated from Technical University of Madrid as an engineer in Automation and Electronics researcher in topics related to the application of automation to computer engineering and computer arc...