
arxiv: 2606.10705 · v1 · pith:YA5MLPN5new · submitted 2026-06-09 · 💻 cs.LG · cs.AI · cs.SY · eess.SY

Event-Driven Reinforcement Learning Enables Long-Horizon Control in Semiconductor Fabrication

Pith reviewed 2026-06-27 14:12 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.SY · eess.SY
keywords reinforcement learning · semiconductor manufacturing · event-driven control · long-horizon optimization · throughput improvement · production planning · multi-objective policy · simulation validation

The pith

An event-driven reinforcement learning framework delivers significant, consistent gains in throughput and utilization in high-fidelity simulations of semiconductor fabrication.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a deep reinforcement learning approach tailored to the long-horizon, event-driven nature of semiconductor manufacturing, where wafers move through hundreds of steps on complex equipment. It formulates the problem as a centralized policy that coordinates decisions while the system evolves through discrete events, using a custom temporal-difference learning method that works with various algorithms. Extensive tests in high-fidelity simulations of real-world scenarios show consistent improvements in production metrics whether training happens offline or online. A sympathetic reader would care because semiconductor fabs are high-value, high-variance operations where even modest efficiency gains translate to substantial output increases.
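To make "the system evolves through discrete events" concrete, here is a minimal sketch of an event-driven control loop. It is a toy single-machine model, not the paper's simulator: the function name `simulate_fifo`, the event encoding, and the single-machine setup are all assumptions made for illustration. The FIFO rule stands in for the baseline the paper compares against; a learned policy would make its choice at the same decision points.

```python
import heapq
from collections import deque

ARRIVAL, MACHINE_FREE = 0, 1  # hypothetical event kinds for this toy model


def simulate_fifo(arrivals, proc_time):
    """Toy single-machine discrete-event loop with FIFO dispatching.

    arrivals[i] is the arrival time of lot i; proc_time[i] is its processing time.
    Returns a list of (lot, start_time, finish_time) in dispatch order.
    """
    # Event tuples sort by time first, then kind, so arrivals at the same
    # instant are registered before a machine-free event is handled.
    events = [(t, ARRIVAL, lot) for lot, t in enumerate(arrivals)]
    heapq.heapify(events)
    queue = deque()   # lots waiting for the machine, in arrival order
    busy = False
    schedule = []
    while events:
        now, kind, lot = heapq.heappop(events)
        if kind == ARRIVAL:
            queue.append(lot)
        else:
            busy = False
        # Decision point: reached only when an event fires, so the time
        # between consecutive decisions is irregular.
        if not busy and queue:
            chosen = queue.popleft()  # FIFO rule; a policy would choose here
            busy = True
            finish = now + proc_time[chosen]
            schedule.append((chosen, now, finish))
            heapq.heappush(events, (finish, MACHINE_FREE, chosen))
    return schedule
```

In this sketch the decisions fall wherever arrivals and completions happen to land rather than on a fixed clock, which is the property that motivates an event-driven learning formulation.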

Core claim

The authors claim that their event-driven temporal-difference formulation, integrated into a centralized multi-objective RL framework, enables effective policy optimization in stochastic, constrained semiconductor systems, yielding significant and consistent gains in throughput and utilization across offline and online training in diverse industry-real simulations, while also clarifying relative strengths of different RL algorithms.

What carries the argument

The event-driven temporal-difference formulation, which represents system evolution as an interconnected temporal process driven by discrete events and supports integration with policy optimization methods.
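The page does not reproduce the paper's equations, so the following is only a sketch of one standard way to write a temporal-difference target when decisions are separated by irregular time intervals: semi-Markov-style discounting, where the bootstrap term is discounted by `gamma ** elapsed` rather than by a fixed per-step factor. Whether the authors use this exact form is an assumption; `EventTransition`, `event_td_target`, and `td_error` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class EventTransition:
    """One decision-to-decision transition (hypothetical structure)."""
    reward: float         # reward accumulated between the two decision events
    elapsed: float        # time elapsed between the two decision events
    next_value: float     # bootstrap value estimate at the next decision event
    done: bool = False    # True if the episode ended before another decision


def event_td_target(tr: EventTransition, gamma: float = 0.99) -> float:
    """Semi-Markov-style target: discount by gamma ** elapsed time.

    This is an illustrative assumption, not the paper's stated formulation.
    """
    if tr.done:
        return tr.reward
    return tr.reward + (gamma ** tr.elapsed) * tr.next_value


def td_error(value: float, tr: EventTransition, gamma: float = 0.99) -> float:
    """TD error of the current value estimate against the event-driven target."""
    return event_td_target(tr, gamma) - value
```

Because the target depends only on a reward, an elapsed time, and a bootstrap value, the same quantity could in principle feed a Q-learning loss or an actor-critic critic loss, which is consistent with the claim that the formulation plugs into several policy optimization methods.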

If this is right

  • Agents achieve significant and consistent gains in throughput and utilization in both offline and online settings.
  • The framework scales to systems with hundreds of processing steps and extensive equipment networks.
  • Performance generalizes across different training phases and operating scenarios.
  • Different model-free RL algorithms can be incorporated, with relative strengths clarified.
  • The approach supports transferability to controlling other event-driven complex adaptive systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar event-driven formulations might apply to other manufacturing or logistics domains with delayed feedback and discrete events.
  • If the simulation-to-reality gap is small, the framework could enable real-time adaptive control in operating fabs.
  • Extending the centralized agent to distributed decision making could address even larger scales.
  • Combining the method with physics-informed constraints might further improve sample efficiency.

Load-bearing premise

High-fidelity simulations of diverse industry-real operating scenarios accurately represent the stochasticity, constraints, and event dynamics of actual semiconductor fabrication systems.

What would settle it

Running the trained policies on a physical semiconductor production line and measuring whether throughput and utilization match the simulated gains would confirm or refute the results.

Figures

Figures reproduced from arXiv: 2606.10705 by Andrea Matta, Daniele Pagano, Mahsa Shekari, Nicla Frigerio, Yavar Yeganeh.

Figure 1: Actions induce discrete, temporally extended events that may overlap across the system.
Figure 2: Overview of the modular control architecture for policy interaction with the environment through ...
Figure 3: Training pipelines, with simulations on CPU and training on GPU.
Figure 4: Comparison of KPI gains for offline agents relative to the FIFO baseline against the random and ...
Figure 5: KPI gains (%) of selected offline-trained DQL checkpoints relative to FIFO across shifts. The black ...
Figure 6: Comparison of KPI gains for online agents relative to the FIFO baseline against the random and ...
Figure 7: Comparison of KPI gains for two SAC agents with different target-entropy coefficients, relative to ...
Figure 8: General illustrative layout of the considered semi...
Figure 9: Conceptual ERD of the industrial use case, showing the main entity types and their relationships.
Figure 10: Scenario-dependent distribution of an anonymized decision-time metric. The figure reports (a) ...
Figure 11: Distributions of an anonymized timing-related metric conditioned on queue pressure. Decisions ...
Figure 12: Distribution of processing steps per product in the dataset. The density histogram and kernel ...
Figure 13: Shift-wise policy fluctuation measured by the mean KL divergence between consecutive action ...
Figure 14: Timeline of detected policy change-points and delayed KPI response windows. The vertical ...
Figure 15: Anonymized Directly-Follows Graph (DFG) based on transition frequency. Nodes represent ...
Figure 16: Sectors distribution across scenarios for DQL and ...
Figure 17: Sectors distribution across scenarios for DQL, ...
Figure 18: Sector distribution across scenarios for SAC and ...
Figure 19: Sector distribution across scenarios for SAC, ...
Figure 20: KPI gains (%) across IQL agents relative to FIFO. Solid lines indicate the mean and shaded regions ...
Figure 21: TD loss across training iterations with red bullets showing the selected checkpoints used for ...
Figure 22: KPI gains (%) of selected offline-trained DQL checkpoints relative to FIFO across shift. The black ...
Figure 23: Comparison of the two CQL variants across shifts. The dark blue curve corresponds to Entropy ...
Figure 24: Scenario-by-scenario Throughput gain (%) of the offline agents relative to the FIFO baseline for ...
Figure 25: Scenario-by-scenario Throughput gain (%) of the offline agents relative to the FIFO baseline for ...
Figure 26: Scenario-by-scenario Saturation gain (%) of the offline agents relative to the FIFO baseline for the ...
Figure 27: Scenario-by-scenario Saturation gain (%) of the offline agents relative to the FIFO baseline for the ...
Figure 28: Scenario-by-scenario Load gain (%) of the offline agents relative to the FIFO baseline for the ...
Figure 29: Scenario-by-scenario Load gain (%) of the offline agents relative to the FIFO baseline for the ...
Figure 30: Scenario-by-scenario Throughput gain (%) of the online agents relative to the FIFO baseline for ...
Figure 31: Scenario-by-scenario Throughput gain (%) of the online agents relative to the FIFO baseline for ...
Figure 32: Scenario-by-scenario Saturation gain (%) of the online agents relative to the FIFO baseline for the ...
Figure 33: Scenario-by-scenario Saturation gain (%) of the online agents relative to the FIFO baseline for the ...
Figure 34: Scenario-by-scenario Load gain (%) of the online agents relative to the FIFO baseline for the first ...
Figure 35: Scenario-by-scenario Load gain (%) of the online agents relative to the FIFO baseline for the ...
Figure 36: Scenario-by-scenario Throughput gain (%) of the plain PPO trained agent with segment level ...
Figure 37: Scenario-by-scenario throughput gains (%) relative to the FIFO baseline for plain PPO trained ...
Figure 38: Scenario-by-scenario saturation gains (%) relative to the FIFO baseline for plain PPO trained ...
Figure 39: Scenario-by-scenario saturation gains (%) relative to the FIFO baseline for plain PPO trained ...
Figure 40: Scenario-by-scenario load gains (%) relative to the FIFO baseline for plain PPO trained with ...
Figure 41: Scenario-by-scenario load gains (%) relative to the FIFO baseline for plain PPO trained with ...
Original abstract

Reinforcement learning promises to optimize sequential decisions in large-scale systems. Semiconductor manufacturing systems are stochastic and highly constrained environments where heterogeneous wafers traverse hundreds of processing steps across extensive equipment networks. These characteristics yield complex, high-dimensional decision problems with delayed feedback and long-horizon requirements, complicating production planning and control. We propose a deep reinforcement learning framework for multi-objective policy optimization at this scale. Specifically, we formulate control as a centralized-agent problem, where a core policy coordinates system-wide decisions, while system evolution is represented as an interconnected temporal process driven by discrete events. Accordingly, we develop a tailored event-driven temporal-difference formulation that remains general and can be integrated with various policy optimization methods under relevant training settings. We investigate several core model-free algorithms incorporated into this framework and evaluate their effectiveness using high-fidelity simulations of diverse, industry-real operating scenarios. Across extensive validation experiments, agents trained in both offline and online settings show significant and consistent gains in throughput and utilization. We further evaluate performance and generalization across training phases, clarifying the relative strengths of alternative reinforcement learning formulations and algorithms. Overall, the results support the scalability, generality, and transferability of the proposed framework for controlling event-driven complex adaptive systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper proposes a deep reinforcement learning framework for multi-objective policy optimization in semiconductor fabrication systems. It formulates the problem as a centralized-agent task with system evolution modeled as an interconnected temporal process driven by discrete events, develops a tailored event-driven temporal-difference formulation integrable with various model-free algorithms, and evaluates it via high-fidelity simulations of industry-real scenarios. Agents trained offline and online show significant consistent gains in throughput and utilization, with further analysis of performance and generalization across training phases.

Significance. If the simulation results hold under rigorous experimental controls, the work offers a scalable, general approach to applying RL in high-dimensional, stochastic, long-horizon manufacturing domains. The event-driven formulation and investigation of multiple algorithms under offline/online regimes are strengths that could support broader use in event-driven complex adaptive systems.

minor comments (2)
  1. The abstract refers to 'several core model-free algorithms' without naming them; the main text should explicitly list the algorithms (e.g., DQN, PPO) and their integration with the event-driven TD formulation.
  2. The evaluation description mentions 'extensive validation experiments' and 'significant and consistent gains'; the paper should include quantitative effect sizes, baseline descriptions, and statistical reporting (e.g., confidence intervals or p-values) to allow assessment of the internal validity of the gains.
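As an illustration of the statistical reporting requested in the second comment, the sketch below computes a percent gain over a baseline together with a percentile-bootstrap confidence interval. It assumes paired per-scenario KPI values (the agent and the FIFO baseline evaluated on the same scenarios); the function names and the choice of a percentile bootstrap are assumptions, not anything the paper or the referee specifies.

```python
import random
from statistics import mean


def percent_gain(agent, baseline):
    """Percent gain of the agent's mean KPI over the baseline's mean KPI."""
    return 100.0 * (mean(agent) - mean(baseline)) / mean(baseline)


def bootstrap_gain_ci(agent, baseline, n_boot=2000, alpha=0.05, seed=0):
    """Point estimate and percentile-bootstrap CI for the percent gain.

    agent[i] and baseline[i] are assumed to come from the same scenario,
    so scenarios are resampled as pairs.
    """
    if len(agent) != len(baseline) or not agent:
        raise ValueError("need two non-empty, equally long KPI lists")
    rng = random.Random(seed)
    n = len(agent)
    gains = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gains.append(percent_gain([agent[i] for i in idx],
                                  [baseline[i] for i in idx]))
    gains.sort()
    # Clamp indices so small n_boot values cannot index out of range.
    lo_idx = max(0, int((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int((1 - alpha / 2) * n_boot) - 1)
    return percent_gain(agent, baseline), gains[lo_idx], gains[hi_idx]
```

Reporting the interval alongside the point estimate lets a reader judge whether a gain over FIFO is distinguishable from scenario-to-scenario variation.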

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work and the recommendation of minor revision. The assessment correctly captures the core contributions of the event-driven temporal-difference formulation and its evaluation under offline and online regimes. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in derivation or claims

full rationale

The paper presents an event-driven RL framework for semiconductor fabrication control, formulates it as a centralized policy problem with a tailored temporal-difference method, and reports empirical performance gains from offline and online training in high-fidelity simulations. No load-bearing derivation reduces a claimed prediction or result to its own inputs by construction, no fitted parameters are relabeled as predictions, and no self-citation chain or uniqueness theorem is invoked to force the central outcomes. The evaluation rests on observed simulation metrics rather than any self-referential mathematical equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim depends on simulation fidelity and the generality of the event-driven formulation; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: High-fidelity simulations accurately capture real semiconductor fabrication dynamics and constraints.
    Evaluation and claims rest on results from these simulations (abstract evaluation paragraph).

pith-pipeline@v0.9.1-grok · 5763 in / 1082 out tokens · 11569 ms · 2026-06-27T14:12:42.231674+00:00 · methodology


Reference graph

Works this paper leans on

84 extracted references · 13 canonical work pages · 9 internal anchors

  1. [1]

    Springer Science & Business Media, 2012

    Lars Mönch, John W Fowler, and Scott J Mason.Production planning and control for semiconductor wafer fabrication facilities: modeling, analysis, and systems, volume 52. Springer Science & Business Media, 2012. 1, 2, 4, 10, 24

  2. [2]

    Reinforcement learning for adaptive order dispatching in the semiconductor industry.CIRP Annals, 67(1):511–514, 2018

    Nicole Stricker, Andreas Kuhnle, Roland Sturm, and Simon Friess. Reinforcement learning for adaptive order dispatching in the semiconductor industry.CIRP Annals, 67(1):511–514, 2018. 2, 27, 29, 30

  3. [3]

    Deep reinforcement learning for semiconductor production scheduling

    Bernd Waschneck, André Reichstaller, Lenz Belzner, Thomas Altenmüller, Thomas Bauernhansl, Alexan- der Knapp, and Andreas Kyek. Deep reinforcement learning for semiconductor production scheduling. In2018 29th annual SEMI advanced semiconductor manufacturing conference (ASMC), pages 301–306. IEEE, 2018. 2, 3, 26, 30, 31

  4. [4]

    Re- inforcement learning for online optimization of job-shop scheduling in a smart manufacturing factory

    Tong Zhou, Haihua Zhu, Dunbing Tang, Changchun Liu, Qixiang Cai, Wei Shi, and Yong Gui. Re- inforcement learning for online optimization of job-shop scheduling in a smart manufacturing factory. Advances in Mechanical Engineering, 14(3):16878132221086120, 2022. 3, 27, 30, 31

  5. [5]

    Semiconductor fab scheduling with self-supervised and reinforcement learning

    Pierre Tassel, Benjamin Kovács, Martin Gebser, Konstantin Schekotihin, Patrick Stöckermann, and Georg Seidel. Semiconductor fab scheduling with self-supervised and reinforcement learning. In2023 Winter Simulation Conference (WSC), pages 1924–1935. IEEE, 2023. 2, 3

  6. [6]

    MIT press, 2018

    Richard S Sutton and Andrew G Barto.Reinforcement learning: An introduction. MIT press, 2018. 2, 7, 8, 9

  7. [7]

    Deep learning.nature, 521(7553):436–444, 2015

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.nature, 521(7553):436–444, 2015. 2

  8. [8]

    Springer Nature, 2023

    Christopher M Bishop and Hugh Bishop.Deep learning: Foundations and concepts. Springer Nature, 2023

  9. [9]

    The expressive power of neural networks: A view from the width.Advances in neural information processing systems, 30, 2017

    Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive power of neural networks: A view from the width.Advances in neural information processing systems, 30, 2017. 2

  10. [10]

    Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning.nature, 518(7540):529–533, 2015. 2

  11. [11]

    Magnetic control of tokamak plasmas through deep reinforcement learning.Nature, 602(7897):414–419, 2022

    Jonas Degrave, Federico Felici, Jonas Buchli, Michael Neunert, Brendan Tracey, Francesco Carpanese, Timo Ewalds, Roland Hafner, Abbas Abdolmaleki, Diego de Las Casas, et al. Magnetic control of tokamak plasmas through deep reinforcement learning.Nature, 602(7897):414–419, 2022

  12. [12]

    A graph placement methodology for fast chip design.Nature, 594(7862):207–212, 2021

    Azalia Mirhoseini, Anna Goldie, Mustafa Yazgan, Joe Wenjie Jiang, Ebrahim Songhori, Shen Wang, Young-Joon Lee, Eric Johnson, Omkar Pathak, Azade Nova, et al. A graph placement methodology for fast chip design.Nature, 594(7862):207–212, 2021

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

  14. [14]

    Deep reinforcement learning for machine scheduling: Methodology, the state-of-the-art, and future directions.Computers & Industrial Engineering, 200:110856, 2025

    Maziyar Khadivi, Todd Charter, Marjan Yaghoubi, Masoud Jalayer, Maryam Ahang, Ardeshir Sho- jaeinasab, and Homayoun Najjaran. Deep reinforcement learning for machine scheduling: Methodology, the state-of-the-art, and future directions.Computers & Industrial Engineering, 200:110856, 2025. 2, 3

  15. [15]

    Autonomous order dispatching in the semiconductor industry using reinforcement learning.Procedia Cirp, 79:391–396, 2019

    Andreas Kuhnle, Nicole Röhrig, and Gisela Lanza. Autonomous order dispatching in the semiconductor industry using reinforcement learning.Procedia Cirp, 79:391–396, 2019. 2, 3, 27, 29, 30, 31 19 Event-Driven RL for Long-Horizon ControlPreprint

  16. [16]

    Active inference meeting energy-efficient control of parallel and identical machines

    Yavar Taheri Yeganeh, Mohsen Jafari, and Andrea Matta. Active inference meeting energy-efficient control of parallel and identical machines. InInternational Conference on Machine Learning, Optimization, and Data Science, pages 479–493. Springer, 2024. 2, 6

  17. [17]

    Discovering temporal structure: An overview of hierarchical reinforcement learning.arXiv preprint arXiv:2506.14045, 2025

    Martin Klissarov, Akhil Bagaria, Ziyan Luo, George Konidaris, Doina Precup, and Marlos C Machado. Discovering temporal structure: An overview of hierarchical reinforcement learning.arXiv preprint arXiv:2506.14045, 2025. 2, 3

  18. [18]

    Patrick Stöckermann, Henning Südfeld, Alessandro Immordino, Thomas Altenmüller, Marc Wegmann, Martin Gebser, Konstantin Schekotihin, Georg Seidel, Chew Wye Chan, and Fei Fei Zhang. Scalability of reinforcement learning methods for dispatching in semiconductor frontend fabs: a comparison of open-source models with real industry datasets.The International J...

  19. [19]

    Supporting fab operations using multi-agent reinforcement learning

    Ishaan Sood, Abhinav Kaushik, Tom Bulgerin, Prashant Kumar, Subham Rath, Abdelhak Khemiri, Johnny Chang, Sam Hsu, and Jeroen Bédorf. Supporting fab operations using multi-agent reinforcement learning. In2024 35th Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC), pages 1–6. IEEE, 2024. 2, 3

  20. [20]

    Agentic large language models, a survey.Journal of Artificial Intelligence Research, 84, 2025

    Aske Plaat, Max van Duijn, Niki Van Stein, Mike Preuss, Peter van der Putten, and Kees Joost Batenburg. Agentic large language models, a survey.Journal of Artificial Intelligence Research, 84, 2025. 2

  21. [21]

    Revisiting bellman errors for offline model selection

    Joshua P Zitovsky, Daniel De Marchi, Rishabh Agarwal, and Michael Rene Kosorok. Revisiting bellman errors for offline model selection. InInternational conference on machine learning, pages 43369–43406. PMLR, 2023. 2, 8, 17

  22. [22]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. InInternational conference on machine learning, pages 1861–1870. Pmlr, 2018. 8, 9

  23. [23]

    Offline Reinforcement Learning with Implicit Q-Learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169, 2021. 2, 8, 9, 17

  24. [24]

    A reinforcement learning approach for improved photolithography schedules

    Tao Zhang, Kamil Erkan Kabak, Cathal Heavey, and Oliver Rose. A reinforcement learning approach for improved photolithography schedules. In2023 Winter Simulation Conference (WSC), pages 2136–2147. IEEE, 2023. 3

  25. [25]

    Deep reinforcement learning based scheduling within production plan in semiconductor fabrication.Expert Systems with Applications, 191:116222, 2022

    Young Hoon Lee and Seunghoon Lee. Deep reinforcement learning based scheduling within production plan in semiconductor fabrication.Expert Systems with Applications, 191:116222, 2022. 3, 27, 30

  26. [26]

    Simulation and deep reinforcement learning for adaptive dispatching in semiconductor manufacturing systems.Journal of Intelligent Manufacturing, 34(3):1311–1324, 2023

    Ahmed H Sakr, Ayman Aboelhassan, Soumaya Yacout, and Samuel Bassetto. Simulation and deep reinforcement learning for adaptive dispatching in semiconductor manufacturing systems.Journal of Intelligent Manufacturing, 34(3):1311–1324, 2023. 6, 27, 30, 31

  27. [27]

    Dynamic scheduling method for job-shop manufacturing systems by deep reinforcement learning with proximal policy optimization

    Ming Zhang, Yang Lu, Youxi Hu, Nasser Amaitik, and Yuchun Xu. Dynamic scheduling method for job-shop manufacturing systems by deep reinforcement learning with proximal policy optimization. sustainability, 14(9):5177, 2022. 3, 30, 31

  28. [28]

    A novel double q-learning with invalid action masking for semiconductor ion implantation scheduling problem.Computers & Industrial Engineering, page 111620, 2025

    Hung-Kai Wang, Ting-Yun Yang, and Yan-Cheng Lin. A novel double q-learning with invalid action masking for semiconductor ion implantation scheduling problem.Computers & Industrial Engineering, page 111620, 2025. 3

  29. [29]

    Distributed scheduling method for smart shop floor based on qmix

    Jianmin Xing, Yumin Ma, Jingwen Cai, Jiaxuan Shi, and Juan Liu. Distributed scheduling method for smart shop floor based on qmix. In2023 IEEE 19th International Conference on Automation Science and Engineering (CASE), pages 1–6. IEEE, 2023. 3, 30, 31

  30. [30]

    Juan Liu, Fei Qiao, Minjie Zou, Jonas Zinn, Yumin Ma, and Birgit Vogel-Heuser. Dynamic scheduling for semiconductor manufacturing systems with uncertainties using convolutional neural networks and reinforcement learning.Complex & Intelligent Systems, 8(6):4641–4662, 2022. 3, 26, 27

  31. [31]

    Deep learning enabling digital twin applications in production scheduling: Case of flexible job shop manufac- turing environment

    Amir Ghasemi, Yavar Taheri Yeganeh, Andrea Matta, Kamil Erkan Kabak, and Cathal Heavey. Deep learning enabling digital twin applications in production scheduling: Case of flexible job shop manufac- turing environment. In2023 Winter Simulation Conference (WSC), pages 2148–2159. IEEE, 2023. 3, 4

  32. [32]

    A compre- hensive survey on graph neural networks.IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020

    Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and Philip S Yu. A compre- hensive survey on graph neural networks.IEEE transactions on neural networks and learning systems, 32(1):4–24, 2020. 3 20 Event-Driven RL for Long-Horizon ControlPreprint

  33. [33]

    Graph representation and embedding for semiconductor manufacturing fab states

    Benedikt Schulz, Christoph Jacobi, Andrej Gisbrecht, Angelidis Evangelos, Chew Wye Chan, and Boon Ping Gan. Graph representation and embedding for semiconductor manufacturing fab states. In 2022 Winter Simulation Conference (WSC), pages 3382–3393. IEEE, 2022. 3

  34. [34]

    Flexible job shop scheduling problem using graph neural networks and reinforcement learning.Computers & Operations Research, 182:107139, 2025

    Xi Liu, Xin Chen, Vincent Chau, Jedrzej Musial, and Jacek Blazewicz. Flexible job shop scheduling problem using graph neural networks and reinforcement learning.Computers & Operations Research, 182:107139, 2025. 3

  35. [35]

    Learning to dispatch for job shop scheduling via deep reinforcement learning.Advances in neural information processing systems, 33:1621–1632, 2020

    Cong Zhang, Wen Song, Zhiguang Cao, Jie Zhang, Puay Siew Tan, and Xu Chi. Learning to dispatch for job shop scheduling via deep reinforcement learning.Advances in neural information processing systems, 33:1621–1632, 2020. 3

  36. [36]

    Deep reinforcement learning for dynamic flexible job shop scheduling with random job arrival.Processes, 10(4):760, 2022

    Jingru Chang, Dong Yu, Yi Hu, Wuwei He, and Haoyu Yu. Deep reinforcement learning for dynamic flexible job shop scheduling with random job arrival.Processes, 10(4):760, 2022. 3, 30, 31

  37. [37]

    A fuzzy hierarchical reinforce- ment learning based scheduling method for semiconductor wafer manufacturing systems.Journal of Manufacturing Systems, 61:239–248, 2021

    Junliang Wang, Pengjie Gao, Peng Zheng, Jie Zhang, and WH Ip. A fuzzy hierarchical reinforce- ment learning based scheduling method for semiconductor wafer manufacturing systems.Journal of Manufacturing Systems, 61:239–248, 2021

  38. [38]

    Deep reinforcement learning for queue-time management in semiconductor manufacturing

    Harel Yedidsion, Prafulla Dawadi, David Norman, and Emrah Zarifoglu. Deep reinforcement learning for queue-time management in semiconductor manufacturing. In2022 Winter Simulation Conference (WSC), pages 3275–3284. IEEE, 2022. 3

  39. [39]

    Dynamic scheduling for flexible job shop with new job insertions by deep reinforcement learning

    Shu Luo. Dynamic scheduling for flexible job shop with new job insertions by deep reinforcement learning. Applied Soft Computing, 91:106208, 2020. 3, 27, 30, 31

  40. [40]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017. 3, 7, 9, 10

  41. [41]

    An adaptive multi-objective multi-task scheduling method by hierarchical deep reinforcement learning

    Jianxiong Zhang, Bing Guo, Xuefeng Ding, Dasha Hu, Jun Tang, Ke Du, Chao Tang, and Yuming Jiang. An adaptive multi-objective multi-task scheduling method by hierarchical deep reinforcement learning. Applied Soft Computing, 154:111342, 2024. 3, 30, 31

  42. [42]

    Explainable ai for reinforcement learning based dynamic scheduling solutions in semiconductor manufacturing: A

    Alessandro Immordino, Patrick Stöckermann, Niels Hayen, Thomas Altenmüller, Gian Antonio Susto, Martin Gebser, Konstantin Schekotihin, and Georg Seidel. Explainable ai for reinforcement learning based dynamic scheduling solutions in semiconductor manufacturing: A. immordino et al.Journal of Intelligent Manufacturing, pages 1–17, 2025. 3, 30, 31

  43. [43]

    Dispatching in real frontend fabs with industrial grade discrete-event simulations by deep reinforcement learning with evolution strategies

    Patrick Stöckermann, Alessandro Immordino, Thomas Altenmüller, Georg Seidel, Martin Gebser, Pierre Tassel, Chew Wye Chan, and Feifei Zhang. Dispatching in real frontend fabs with industrial grade discrete-event simulations by deep reinforcement learning with evolution strategies. In2023 Winter Simulation Conference (WSC), pages 3047–3058. IEEE, 2023. 3

  44. [44]

    Deep active inference agents for delayed and long-horizon environments

    Yavar Taheri Yeganeh, Mohsen Jafari, and Andrea Matta. Deep active inference agents for delayed and long-horizon environments. arXiv preprint arXiv:2505.19867, 2025. 3, 6, 17, 18

  45. [45]

    Rudder: Return decomposition for delayed rewards

    Jose A Arjona-Medina, Michael Gillhofer, Michael Widrich, Thomas Unterthiner, Johannes Brandstetter, and Sepp Hochreiter. Rudder: Return decomposition for delayed rewards. Advances in Neural Information Processing Systems, 32, 2019. 3, 7, 18

  46. [46]

    AWAC: Accelerating Online Reinforcement Learning with Offline Datasets

    Ashvin Nair, Abhishek Gupta, Murtaza Dalal, and Sergey Levine. Awac: Accelerating online reinforcement learning with offline datasets. arXiv preprint arXiv:2006.09359, 2020. 3, 8, 9

  47. [47]

    Conservative q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems, 33:1179–1191, 2020. 3, 8, 9

  48. [48]

    Temporal abstraction in reinforcement learning with the successor representation

    Marlos C Machado, Andre Barreto, Doina Precup, and Michael Bowling. Temporal abstraction in reinforcement learning with the successor representation. Journal of Machine Learning Research, 24(80):1–69, 2023. 4

  49. [49]

    Hierarchical decision-making for qualification management in wafer fabs: a simulation study

    Denny Kopp and Lars Mönch. Hierarchical decision-making for qualification management in wafer fabs: a simulation study. IEEE Transactions on Automation Science and Engineering, 20(1):320–333, 2022. 4

  50. [50]

    Average-reward learning and planning with options

    Yi Wan, Abhishek Naik, and Rich Sutton. Average-reward learning and planning with options. Advances in Neural Information Processing Systems, 34:22758–22769, 2021. 4

  51. [51]

    Mean field multi-agent reinforcement learning

    Yaodong Yang, Rui Luo, Minne Li, Ming Zhou, Weinan Zhang, and Jun Wang. Mean field multi-agent reinforcement learning. In International conference on machine learning, pages 5571–5580. PMLR, 2018. 5

  52. [52]

    Digital twins paradigm: A systematic review from the reinforcement learning perspective

    Shahmir Khan Mohammed, Shakti Singh, Rabeb Mizouni, Hadi Otrok, and Ernesto Damiani. Digital twins paradigm: A systematic review from the reinforcement learning perspective. ACM Computing Surveys, 58(7):1–33, 2026. 5

  53. [53]

    Hsuan-An Kuo, Tzu-Yen Hong, and Chen-Fu Chien. A deep reinforcement learning based digital twin framework for resilient production planning under demand uncertainty and an empirical study in semiconductor wafer fabrication. Computers & Industrial Engineering, page 111389, 2025. 5

  54. [54]

    Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications

    Sinan Ibrahim, Mostafa Mostafa, Ali Jnadi, Hadi Salloum, and Pavel Osinenko. Comprehensive overview of reward engineering and shaping in advancing reinforcement learning applications. IEEE Access, 12:175473–175500, 2024. 6, 11, 17

  55. [55]

    Rainbow: Combining improvements in deep reinforcement learning

    Matteo Hessel, Joseph Modayil, Hado Van Hasselt, Tom Schaul, Georg Ostrovski, Will Dabney, Dan Horgan, Bilal Piot, Mohammad Azar, and David Silver. Rainbow: Combining improvements in deep reinforcement learning. In Proceedings of the AAAI conference on artificial intelligence, volume 32, 2018. 7

  56. [56]

    Addressing function approximation error in actor-critic methods

    Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018. 7, 10

  57. [57]

    Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning

    Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999. 8

  58. [58]

    Deep Reinforcement Learning and the Deadly Triad

    Hado Van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, and Joseph Modayil. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018. 8

  59. [59]

    Deep reinforcement learning with double q-learning

    Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double q-learning. In Proceedings of the AAAI conference on artificial intelligence, volume 30, 2016. 8

  60. [60]

    Soft actor-critic for discrete action settings

    Petros Christodoulou. Soft actor-critic for discrete action settings. arXiv preprint arXiv:1910.07207,

  61. [61]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020. 9

  62. [62]

    Boltzmann exploration done right

    Nicolò Cesa-Bianchi, Claudio Gentile, Gábor Lugosi, and Gergely Neu. Boltzmann exploration done right. Advances in Neural Information Processing Systems, 30, 2017. 9

  63. [63]

    Trust region policy optimization

    John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR, 2015. 9

  64. [64]

    Asynchronous methods for deep reinforcement learning

    Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, pages 1928–1937. PMLR, 2016. 10

  65. [65]

    High-Dimensional Continuous Control Using Generalized Advantage Estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015. 10

  66. [66]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 10

  67. [67]

    General agents contain world models

    Jonathan Richens, David Abel, Alexis Bellot, and Tom Everitt. General agents contain world models. arXiv preprint arXiv:2506.01622, 2025. 10

  68. [68]

    Manufacturing cycle time reduction using balance control in the semiconductor fabrication line

    Young Hoon Lee and Taeheon Kim. Manufacturing cycle time reduction using balance control in the semiconductor fabrication line. Production Planning & Control, 13(6):529–540, 2002. 11

  69. [69]

    Training Agents Inside of Scalable World Models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527, 2025. 18

  70. [70]

    A customizable simulator for artificial intelligence research to schedule semiconductor fabs

    Benjamin Kovács, Pierre Tassel, Ramsha Ali, Mohammed El-Kholany, Martin Gebser, and Georg Seidel. A customizable simulator for artificial intelligence research to schedule semiconductor fabs. In 2022 33rd Annual SEMI Advanced Semiconductor Manufacturing Conference (ASMC), pages 1–6. IEEE, 2022. 26, 27, 30, 31

  71. [71]

    Lean Production simplified: A plain-language guide to the world’s most powerful production system

    Pascal Dennis. Lean Production simplified: A plain-language guide to the world’s most powerful production system. CRC Press, 2017. 26

  72. [72]

    Operations management

    William J Stevenson, Mehran Hojati, and James Cao. Operations management. McGraw-Hill Education New York, 2014. 27

  73. [73]

    Introduction to materials management

    JR Tony Arnold, Stephen N Chapman, Lloyd M Clive, and Ann K Gatewood. Introduction to materials management. Prentice Hall Upper Saddle River, NJ, 1998. 27

  74. [74]

    Operations management: Sustainability and supply chain management

    Jay Heizer, Barry Render, Charles Lee Munson, and Paul Griffin. Operations management: Sustainability and supply chain management. 2020. 27

  75. [75]

    Microchip fabrication

    Peter Van Zant. Microchip fabrication. McGraw-Hill, Inc., 2004. 27

  76. [76]

    Simulation based multi-objective fab scheduling by using reinforcement learning

    Won-Jun Lee, Byung-Hee Kim, Keyhoon Ko, and Hayong Shin. Simulation based multi-objective fab scheduling by using reinforcement learning. In 2019 Winter Simulation Conference (WSC), pages 2236–2247. IEEE, 2019. 27, 30

  77. [77]

    Production and operations analysis

    Steven Nahmias and Tava Lennon Olsen. Production and operations analysis. Waveland Press, 2015. 27

  78. [78]

    Manufacturing planning and control for supply chain management: the cpim reference

    F Robert Jacobs, William L Berry, D Clay Whybark, and Thomas E Vollmann. Manufacturing planning and control for supply chain management: the cpim reference. 2018. 27

  79. [79]

    Deep reinforcement learning approach for a dynamic flexible job shop problem with sequence dependent setup times

    Binxiao Yan, Xinbao Liu, Shaojun Lu, Chaoming Hu, Xubiao Wang, and Zhiping Zhou. Deep reinforcement learning approach for a dynamic flexible job shop problem with sequence dependent setup times. Computers & Industrial Engineering, page 111310, 2025. 30, 31

  80. [80]

    Tzu-Yen Hong and Kuan-Han Li. Large language model driven adaptive deep reinforcement learning with dynamic definition refinement for flexible job shop scheduling problem. Applied Soft Computing, page 114253, 2025. 30, 31
