
arXiv: 2605.19457 · v1 · pith:6KCTGYEJ · submitted 2026-05-19 · 💻 cs.AI

Generative Auto-Bidding with Unified Modeling and Exploration

Pith reviewed 2026-05-20 05:31 UTC · model grok-4.3

classification 💻 cs.AI
keywords: auto-bidding, decision transformer, generative models, exploration, safety, online advertising, reinforcement learning, Taobao

The pith

The GUIDE framework uses a Decision Transformer with Q-value guidance and an inverse-dynamics fallback to balance exploration and safety in automated bidding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GUIDE to solve problems in automated bidding for digital advertising, where early rule-based systems lacked flexibility and later reinforcement learning methods struggled with long-term dependencies while risking unsafe actions. GUIDE jointly models bidding actions and state transitions using a Decision Transformer. A Q-value module regularizes the transformer's exploration, and an Inverse Dynamics Module infers safe fallback actions from predicted future states, with the Q-value selecting the final output to unify efficiency and safety. If this holds, advertising platforms could optimize bids more effectively in live auctions without elevated financial risk. A sympathetic reader would care because it promises measurable gains in key metrics like revenue and return on investment in competitive real-world settings.
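To make the inverse-dynamics idea concrete, here is a minimal sketch under strong assumptions: a scalar state and a toy linear model in which the state change is proportional to the action. The names `fit_inverse_dynamics` and `infer_action`, the linear form, and the logged transitions are all hypothetical; the paper's IDM is presumably a learned network, and nothing here reproduces it.

```python
from typing import List, Tuple

# (state, action, next_state); scalar values are an illustrative simplification.
Transition = Tuple[float, float, float]


def fit_inverse_dynamics(data: List[Transition]) -> float:
    """Least-squares fit of gain k in the toy model: next_state - state = k * action."""
    num = sum(a * (s_next - s) for s, a, s_next in data)
    den = sum(a * a for _, a, _ in data)
    return num / den


def infer_action(state: float, predicted_next_state: float, k: float) -> float:
    """Invert the toy dynamics: which action would move state to predicted_next_state?"""
    return (predicted_next_state - state) / k


# Hypothetical logged transitions that follow next_state = state + 0.5 * action.
logged = [(0.0, 1.0, 0.5), (0.5, 2.0, 1.5), (1.5, -1.0, 1.0)]
k_hat = fit_inverse_dynamics(logged)

# A sequence model would supply the predicted next state; here it is hard-coded.
fallback_action = infer_action(state=1.0, predicted_next_state=1.4, k=k_hat)
```

Because the inferred action is tied to transitions seen in the logs, it stays close to historical behavior, which is the intuition behind using it as a fallback.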

Core claim

The central claim is that the GUIDE framework synergistically integrates directed exploration with a safe fallback mechanism through three components. A Decision Transformer (DT) jointly models historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints. An Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, forming an integrated explore-safeguard-select pipeline that unifies efficiency and safety, consistently outperforming state-of-the-art in a

What carries the argument

The explore-safeguard-select pipeline, where a Decision Transformer jointly models sequences of bidding actions and state transitions, a Q-value module applies regularization to guide exploration, and an Inverse Dynamics Module infers safe fallback actions from predicted future states.
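The final "select" step of this pipeline can be sketched as follows. This is an illustration only: `select_action`, `toy_q`, the tie-breaking rule, and the numeric values are assumptions, and the paper's actual selector may differ.

```python
from typing import Callable, Sequence, Tuple

State = Sequence[float]


def select_action(
    state: State,
    dt_action: float,
    idm_action: float,
    q_value: Callable[[State, float], float],
) -> Tuple[float, str]:
    """Return the candidate action the critic scores higher, plus its source."""
    q_dt = q_value(state, dt_action)
    q_idm = q_value(state, idm_action)
    # Assumption: ties go to the IDM action, so the fallback wins when the
    # critic is indifferent between the two candidates.
    if q_dt > q_idm:
        return dt_action, "dt"
    return idm_action, "idm"


def toy_q(state: State, action: float) -> float:
    """Hypothetical critic that prefers bids close to the first state feature."""
    return -((action - state[0]) ** 2)


# The exploratory DT bid is further from the critic's preferred value,
# so the IDM fallback is selected here.
chosen, source = select_action([1.0, 0.5], dt_action=1.4, idm_action=1.1, q_value=toy_q)
```

Scoring both candidates with a single critic keeps the choice comparable across the two action sources, which matches the page's description of the Q-value module as the arbiter.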

If this is right

  • Consistent outperformance over state-of-the-art baselines on public datasets, simulated auction environments, and large-scale online deployment.
  • In real-world Taobao deployment, achieves +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI.
  • Reduces financial risk for advertising platforms by providing a safety fallback during exploration.
  • Enables unified modeling of long-term dependencies in bidding sequences that prior methods handled separately.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach of pairing sequence modeling with value-based guidance and dynamics-based fallbacks may extend to other high-stakes sequential decisions such as dynamic pricing or inventory control.
  • Testing variants without the Q-value regularization could reveal whether the safety component primarily drives the observed stability gains.
  • If the pipeline generalizes, it could reduce reliance on separate exploration heuristics in generative reinforcement learning systems for e-commerce.

Load-bearing premise

The Q-value module can reliably guide the Decision Transformer's exploration via regularization without destabilizing the model, and the Inverse Dynamics Module can infer behaviorally consistent safe actions from DT-predicted future states that serve as an effective fallback in live auction environments.
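One common way to regularize an imitation-style policy with a critic is to subtract a weighted Q-value term from the behavior-cloning loss, in the style of TD3+BC (Fujimoto and Gu, 2021). The sketch below assumes that form purely for illustration; the page does not state GUIDE's actual objective, and `regularized_loss`, `lam`, and the numbers are hypothetical.

```python
from typing import Sequence


def regularized_loss(
    pred_actions: Sequence[float],
    logged_actions: Sequence[float],
    q_of_pred: Sequence[float],
    lam: float,
) -> float:
    """Mean squared imitation error minus lam times the mean critic value.

    lam = 0 recovers pure imitation; a larger lam pushes the policy toward
    actions the critic rates highly, at the cost of drifting from the logs.
    """
    n = len(pred_actions)
    imitation = sum((p - t) ** 2 for p, t in zip(pred_actions, logged_actions)) / n
    value_term = sum(q_of_pred) / n
    return imitation - lam * value_term


# Hypothetical batch of two predictions.
pred = [1.0, 2.0]
logged = [1.5, 2.0]
q_vals = [0.4, 0.6]

loss_no_guidance = regularized_loss(pred, logged, q_vals, lam=0.0)
loss_with_guidance = regularized_loss(pred, logged, q_vals, lam=0.5)
```

Under this assumed form, the premise reduces to a question about `lam`: too small and exploration is undirected, too large and critic errors dominate the imitation signal.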

What would settle it

A controlled online A/B test on a live advertising platform where GUIDE is run without the IDM fallback or with disabled Q-value regularization, showing either lower performance than baselines or increased financial risk exposure compared to the full system.

Figures

Figures reproduced from arXiv: 2605.19457 by Chenliang Li, Feiqing Zhuang, Fei Xiao, Junxiong Zhu, Keping Yang, Lixin Zou, Mingming Zhang, Na Li, Shengjie Sun, Xiaowei Chen.

Figure 1: Different Modeling Approaches in Ad Bidding.
Figure 2 (accompanying text from §4.1 Unified Modeling of Bid Trajectories, §4.1.1 Trajectory Construction and Modeling): In the auto-bidding task, each round of bidding can be represented as a temporal trajectory that sequentially records the advertising environment states, bidding actions, and the resulting rewards. Formally, a trajectory can be represented as $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T)$ (5), where $s_t$ denotes t…
Figure 2: Overview architecture. a) Training of the unified modeling framework. b) Inference with bid selection.
Figure 3: Ablation Study. Second, random selection instead of Q-value-based selection also reduces performance, with results lying between those of using the two action sources separately. This can be attributed to DT actions being higher in overall quality than IDM actions. Third, removing Q-value regularization optimization causes a significant drop in performance, though still outperforming the …
Figure 5: Action preferences of different advertisers.
Figure 4: Two-stage Training Analysis. 5.4 RQ3: Co-operation between DT and IDM. As shown in …
Figure 6: Volatility comparison between DT and IDM bid
Figure 7 visually compares the actual cost trajectories managed by a baseline method and our proposed GUIDE against this ideal trajectory. As can be seen, while both methods attempt to follow the general trend, the cost distribution controlled by GUIDE (right panel) tracks the ideal cost much more closely across the different time steps. The baseline method (left panel), in contrast, shows more significant deviatio…
Abstract

Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce GUIDE, a novel framework for automated bidding in digital advertising. It integrates a Decision Transformer (DT) to model historical bidding actions and environmental state transitions, a Q-value module to guide the DT's exploration through regularization, and an Inverse Dynamics Module (IDM) to infer safe, behaviorally consistent actions from DT-predicted future states as a fallback. An adaptive selector then chooses the final action to balance exploration and safety. The authors demonstrate consistent outperformance over state-of-the-art baselines on public datasets and in simulated environments, with real-world deployment on Taobao yielding +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI.

Significance. Should the empirical findings be confirmed with additional controls, this work offers a significant contribution to the field of AI for online advertising by proposing an integrated approach to exploration and safety in generative bidding models. The large-scale online deployment provides valuable real-world validation. The use of standard components in a new synergistic way is noted as a strength.

major comments (2)
  1. [§3.2] §3.2, Q-value module: the claim that regularization reliably guides DT exploration without destabilizing the model lacks any analysis of regularization strength selection, sensitivity, or robustness to non-stationary auction dynamics. This is load-bearing for the safety and stability guarantees of the explore-safeguard-select pipeline.
  2. [§5] §5, experimental results: the reported gains (e.g., +4.10% GMV) are presented without statistical significance tests, run-to-run variance, or ablation studies isolating the Q-value regularization and IDM fallback contributions. This undermines verification of the central outperformance claims.
minor comments (2)
  1. The abstract refers to 'public datasets' without naming them; this should be stated explicitly in the introduction or experimental setup.
  2. [§3] A diagram of the overall architecture in §3 would benefit from explicit labels on the adaptive selector and data flow between DT, Q-value, and IDM modules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating revisions that will be incorporated to strengthen the empirical and analytical rigor of the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2, Q-value module: the claim that regularization reliably guides DT exploration without destabilizing the model lacks any analysis of regularization strength selection, sensitivity, or robustness to non-stationary auction dynamics. This is load-bearing for the safety and stability guarantees of the explore-safeguard-select pipeline.

    Authors: We acknowledge that the original submission did not provide explicit sensitivity analysis or robustness checks for the Q-value regularization strength. In the revised manuscript, we will expand §3.2 with a dedicated analysis of the regularization coefficient λ, including performance curves over a range of λ values, and additional experiments simulating non-stationary auction dynamics (e.g., varying bid landscape distributions). These additions will directly support the stability claims of the explore-safeguard-select pipeline. revision: yes

  2. Referee: [§5] §5, experimental results: the reported gains (e.g., +4.10% GMV) are presented without statistical significance tests, run-to-run variance, or ablation studies isolating the Q-value regularization and IDM fallback contributions. This undermines verification of the central outperformance claims.

    Authors: We agree that additional statistical reporting and ablations would improve verifiability. The revised §5 will include ablation studies isolating the Q-value regularization and IDM fallback, plus standard deviations and error bars from multiple simulation runs. For the online Taobao deployment, we will add confidence intervals based on available traffic data. Full multi-run variance is inherently limited in a single large-scale A/B test, which we will now explicitly discuss as a limitation. revision: partial
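The confidence intervals mentioned in the second response could be obtained in several ways; one option is a percentile bootstrap over per-bucket metrics. The sketch below assumes that approach and uses made-up daily GMV values for the control and treatment arms. `relative_lift`, `bootstrap_ci`, and the data are hypothetical and not taken from the paper.

```python
import random
from statistics import mean
from typing import List, Tuple


def relative_lift(treat: List[float], control: List[float]) -> float:
    """Relative change of the treatment mean over the control mean."""
    return mean(treat) / mean(control) - 1.0


def bootstrap_ci(
    treat: List[float],
    control: List[float],
    n_boot: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for the relative lift."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    lifts = []
    for _ in range(n_boot):
        # Resample each arm independently, with replacement.
        t = rng.choices(treat, k=len(treat))
        c = rng.choices(control, k=len(control))
        lifts.append(relative_lift(t, c))
    lifts.sort()
    lower = lifts[int((alpha / 2) * n_boot)]
    upper = lifts[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical per-day GMV for each arm of an A/B test.
control_gmv = [100.0, 102.0, 98.0, 101.0, 99.0, 100.5, 99.5, 100.0]
treat_gmv = [104.0, 105.5, 103.0, 104.5, 103.5, 105.0, 104.0, 104.5]

point = relative_lift(treat_gmv, control_gmv)
ci_low, ci_high = bootstrap_ci(treat_gmv, control_gmv)
```

An interval that excludes zero would support a reported gain; with a single large-scale A/B test, as the response notes, the buckets being resampled are the only available source of variance.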

Circularity Check

0 steps flagged

No circularity: framework assembles standard DT/Q/IDM components with new integration; claims rest on empirical results rather than self-referential definitions or fitted predictions.

full rationale

The paper presents GUIDE as an engineering integration of Decision Transformer for trajectory modeling, Q-value regularization for directed exploration, and an Inverse Dynamics Module for safe fallback, followed by an adaptive selector. No derivation chain reduces a claimed prediction or uniqueness result to its own inputs by construction. The abstract and described pipeline introduce no self-definitional loops, no fitted parameters renamed as predictions, and no load-bearing self-citations that substitute for independent justification. Results are reported from public datasets, simulations, and online deployment on Taobao, making the central claims externally falsifiable rather than tautological. This is the expected non-finding for an applied systems paper whose novelty lies in composition rather than mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from reinforcement learning and sequence modeling applied to bidding, plus the novel integration of components; no explicit free parameters or invented physical entities are detailed in the abstract.

axioms (1)
  • domain assumption: Bidding can be effectively modeled as a sequence prediction task using historical actions and state transitions.
    Invoked by the use of Decision Transformer to jointly model bidding actions and environmental state transitions.
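As an illustration of this sequence-prediction view, the sketch below turns a bidding trajectory of (state, action, reward) steps into return-conditioned tokens, following the returns-to-go convention of the original Decision Transformer (Chen et al., 2021). Whether GUIDE tokenizes trajectories exactly this way is an assumption, and the function names and example values are hypothetical.

```python
from typing import List, Sequence, Tuple

Step = Tuple[Sequence[float], float, float]    # (state, action, reward)
Token = Tuple[float, Sequence[float], float]   # (return_to_go, state, action)


def returns_to_go(rewards: Sequence[float]) -> List[float]:
    """Suffix sums: entry t is the total reward from step t to the end."""
    rtg: List[float] = []
    running = 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    rtg.reverse()
    return rtg


def to_tokens(trajectory: Sequence[Step]) -> List[Token]:
    """Pair each (state, action) with the return still to be collected."""
    rtg = returns_to_go([reward for _, _, reward in trajectory])
    return [(g, state, action) for g, (state, action, _) in zip(rtg, trajectory)]


# Hypothetical three-step trajectory: state = [remaining budget share, time share].
trajectory = [
    ([1.0, 0.0], 0.8, 2.0),
    ([0.7, 0.1], 0.9, 1.0),
    ([0.4, 0.2], 1.1, 3.0),
]
tokens = to_tokens(trajectory)
```

Conditioning each step on the remaining return is what lets a sequence model be prompted with a target outcome at inference time.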

pith-pipeline@v0.9.0 · 5843 in / 1465 out tokens · 69892 ms · 2026-05-20T05:31:54.440895+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

  1. [1]

    Deepak Agarwal, Souvik Ghosh, Kai Wei, and Siyu You. 2014. Budget pacing for targeted online advertisements at LinkedIn. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1613–1619.

  2. [2]

    Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. 2022. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657 (2022).

  3. [3]

    Michael Bain and Claude Sammut. 1995. A Framework for Behavioural Cloning. In Machine intelligence 15. 103–129.

  4. [4]

    Rakesh P Borase, DK Maghade, SY Sondkar, and SN Pawar. 2021. A review of PID control, tuning methods and applications. International Journal of Dynamics and Control 9, 2 (2021), 818–827.

  5. [5]

    Nikolay Borissov, Dirk Neumann, and Christof Weinhardt. 2010. Automated bidding in computational markets: an application in market-based allocation of computing services. Autonomous Agents and Multi-Agent Systems 21, 2 (2010), 115–142.

  6. [6]

    Craig Boutilier, Thomas Dean, and Steve Hanks. 1999. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11 (1999), 1–94.

  7. [7]

    Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the tenth ACM international conference on web search and data mining. 661–670.

  8. [8]

    Leng Cai, Junxuan He, Yikai Li, Junjie Liang, Yuanping Lin, Ziming Quan, Yawen Zeng, and Jin Xu. 2025. RTBAgent: A LLM-based Agent System for Real-Time Bidding. In Companion Proceedings of the ACM on Web Conference 2025. 104–113.

  9. [9]

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling. Advances in neural information processing systems 34 (2021), 15084–15097.

  10. [10]

    Ye Chen, Pavel Berkhin, Bo Anderson, and Nikhil R Devanur. 2011. Real-time bidding algorithms for performance-based display ad allocation. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 1307–1315.

  11. [11]

    George B Dantzig. 2016. Linear programming and extensions. (2016)

  12. [12]

    Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau

  13. [13]

    Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708 (2019).

  14. [14]

    Scott Fujimoto and Shixiang Shane Gu. 2021. A minimalist approach to offline reinforcement learning. Advances in neural information processing systems 34 (2021), 20132–20145.

  15. [15]

    Jingtong Gao, Yewen Li, Shuai Mao, Peng Jiang, Nan Jiang, Yejing Wang, Qingpeng Cai, Fei Pan, Peng Jiang, Kun Gai, et al. 2025. Generative Auto-Bidding with Value-Guided Explorations. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 244–254.

  16. [16]

    Jiayan Guo, Yusen Huo, Zhilin Zhang, Tianyu Wang, Chuan Yu, Jian Xu, Bo Zheng, and Yan Zhang. 2024. Generative auto-bidding via conditional diffusion modeling. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5038–5049.

  17. [17]

    Yue He, Xiujun Chen, Di Wu, Junwei Pan, Qing Tan, Chuan Yu, Jian Xu, and Xiaoqiang Zhu. 2021. A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2993–3001.

  18. [18]

    Jiahao Ji, Tianyu Wang, Yeshu Li, Yusen Huo, Zhilin Zhang, Chuan Yu, Jian Xu, and Bo Zheng. 2025. Bid2X: Revealing Dynamics of Bidding Environment in Online Advertising from A Foundation Model Lens. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 4543–4554.

  19. [19]

    Hao Jiang, Yongxiang Tang, Yanxiang Zeng, Pengjia Yuan, Yanhua Cheng, Teng Sha, Xialong Liu, and Peng Jiang. 2025. Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement. In Companion Proceedings of the ACM on Web Conference 2025. 1033–1037.

  20. [20]

    Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM international conference on information and knowledge management. 2193–2201.

  21. [21]

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).

    Generative Auto-Bidding with Unified Modeling and Exploration. SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia.

  22. [22]

    Carl Knospe. 2006. PID control. IEEE Control Systems Magazine 26, 1 (2006), 30–31.

  23. [23]

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2021. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169 (2021).

  24. [24]

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems 33 (2020), 1179–1191.

  25. [25]

    Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, and Fan Wu. 2024. Trajectory-wise iterative reinforcement learning framework for auto-bidding. In Proceedings of the ACM Web Conference 2024. 4193–4203.

  26. [26]

    Juncheng Li and Pingzhong Tang. 2022. Auto-bidding equilibrium in ROI-constrained online advertising markets. arXiv preprint arXiv:2210.06107 (2022).

  27. [27]

    Yewen Li, Shuai Mao, Jingtong Gao, Nan Jiang, Yunjian Xu, Qingpeng Cai, Fei Pan, Peng Jiang, and Bo An. 2025. GAS: Generative Auto-bidding with Post-training Search. In Companion Proceedings of the ACM on Web Conference 2025. 315–324.

  28. [28]

    Mengjuan Liu, Li Jiaxing, Zhengning Hu, Jinyu Liu, and Xuyun Nie. 2020. A dynamic bidding strategy based on model-free reinforcement learning in display advertising. IEEE Access 8 (2020), 213587–213601.

  29. [29]

    Haofei Lu, Dongqi Han, Yifei Shen, and Dongsheng Li. 2025. What makes a good diffusion planner for decision making? arXiv preprint arXiv:2503.00535 (2025).

  30. [30]

    Zhiyu Mou, Yusen Huo, Rongquan Bai, Mingzhou Xie, Chuan Yu, Jian Xu, and Bo Zheng. 2022. Sustainable online reinforcement learning for auto-bidding. Advances in Neural Information Processing Systems 35 (2022), 2651–2663.

  31. [31]

    Yunshan Peng, Wenzheng Shu, Jiahao Sun, Yanxiang Zeng, Jinan Pang, Wentao Bai, Yunke Bai, Xialong Liu, and Peng Jiang. 2025. Expert-Guided Diffusion Planner for Auto-bidding. arXiv preprint arXiv:2508.08687 (2025).

  32. [32]

    Martin L Puterman. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.

  33. [33]

    Kefan Su, Yusen Huo, Zhilin Zhang, Shuai Dou, Chuan Yu, Jian Xu, Zongqing Lu, and Bo Zheng. 2024. Auctionnet: A novel benchmark for decision-making in large-scale games. Advances in Neural Information Processing Systems 37 (2024), 94428–94452.

  34. [34]

    Chao Wen, Miao Xu, Zhilin Zhang, Zhenzhe Zheng, Yuhui Wang, Xiangyu Liu, Yu Rong, Dong Xie, Xiaoyang Tan, Chuan Yu, et al. 2022. A cooperative-competitive multi-agent framework for auto-bidding in online advertising. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1129–1139.

  35. [35]

    Xun Yang, Yasong Li, Hao Wang, Di Wu, Qing Tan, Jian Xu, and Kun Gai. 2019. Bid optimization by multivariable control in display advertising. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 1966–1974.

  36. [36]

    Hao Yu, Michael Neely, and Xiaohan Wei. 2017. Online convex optimization with stochastic constraints. Advances in Neural Information Processing Systems 30 (2017).

  37. [37]

    Congde Yuan, Mengzhuo Guo, Chaoneng Xiang, Shuangyang Wang, Guoqing Song, and Qingpeng Zhang. 2022. An actor-critic reinforcement learning model for optimal bidding in online display advertising. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3604–3613.

  38. [38]

    Shuai Yuan, Jun Wang, and Xiaoxue Zhao. 2013. Real-time bidding for online advertising: measurement and analysis. In Proceedings of the seventh international workshop on data mining for online advertising. 1–8.

  39. [39]

    Haoqi Zhang, Lvyin Niu, Zhenzhe Zheng, Zhilin Zhang, Shan Gu, Fan Wu, Chuan Yu, Jian Xu, Guihai Chen, and Bo Zheng. 2023. A personalized automated bidding framework for fairness-aware online advertising. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5544–5553.

  40. [40]

    Weinan Zhang, Yifei Rong, Jun Wang, Tianchi Zhu, and Xiaofan Wang. 2016. Feedback control of real-time display advertising. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. 407–416.

  41. [41]

    Weinan Zhang, Shuai Yuan, and Jun Wang. 2014. Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1077–1086.

  42. [42]

    Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, and Weinan Zhang. 2024. Madiff: Offline multi-agent learning with diffusion models. Advances in Neural Information Processing Systems 37 (2024), 4177–4206.