Generative Auto-Bidding with Unified Modeling and Exploration

Chenliang Li; Feiqing Zhuang; Fei Xiao; Junxiong Zhu; Keping Yang; Lixin Zou; Mingming Zhang; Na Li; Shengjie Sun; Xiaowei Chen

arxiv: 2605.19457 · v1 · pith:6KCTGYEJnew · submitted 2026-05-19 · 💻 cs.AI

Pith reviewed 2026-05-20 05:31 UTC · model grok-4.3

keywords: auto-bidding, decision transformer, generative models, exploration, safety, online advertising, reinforcement learning, Taobao

The pith

GUIDE framework uses a Decision Transformer with Q-value guidance and inverse dynamics fallback to balance exploration and safety in automated bidding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GUIDE to solve problems in automated bidding for digital advertising, where early rule-based systems lacked flexibility and later reinforcement learning methods struggled with long-term dependencies while risking unsafe actions. GUIDE jointly models bidding actions and state transitions using a Decision Transformer. A Q-value module regularizes the transformer's exploration, and an Inverse Dynamics Module infers safe fallback actions from predicted future states, with the Q-value selecting the final output to unify efficiency and safety. If this holds, advertising platforms could optimize bids more effectively in live auctions without elevated financial risk. A sympathetic reader would care because it promises measurable gains in key metrics like revenue and return on investment in competitive real-world settings.

Core claim

The central claim is that the GUIDE framework synergistically integrates directed exploration with a safe fallback mechanism by employing a Decision Transformer to jointly model historical bidding actions and environmental state transitions, a Q-value module to guide the DT's exploration via regularization constraints, and an Inverse Dynamics Module to leverage DT-predicted future states for inferring robust, behaviorally consistent actions as a safe policy fallback, with the Q-value module adaptively selecting the final action between these two options to form an integrated explore-safeguard-select pipeline that unifies efficiency and safety, consistently outperforming state-of-the-art in a

What carries the argument

The explore-safeguard-select pipeline, where a Decision Transformer jointly models sequences of bidding actions and state transitions, a Q-value module applies regularization to guide exploration, and an Inverse Dynamics Module infers safe fallback actions from predicted future states.
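The explore-safeguard-select flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `select_action`, `Decision`, and the three toy callables are hypothetical names, and the rule "pick whichever candidate has the higher Q-value, preferring the DT action on ties" is an assumption about how the adaptive selection might work.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

State = Sequence[float]


@dataclass
class Decision:
    action: float
    source: str  # "dt" (exploratory) or "idm" (safe fallback)
    q_value: float


def select_action(
    state: State,
    dt_policy: Callable[[State], Tuple[float, State]],
    idm: Callable[[State, State], float],
    q_fn: Callable[[State, float], float],
) -> Decision:
    # Explore: the DT proposes an action and predicts the next state.
    dt_action, predicted_next = dt_policy(state)
    # Safeguard: the IDM infers an action consistent with that predicted transition.
    idm_action = idm(state, predicted_next)
    # Select: score both candidates with the Q-value module (argmax is an assumption).
    q_dt, q_idm = q_fn(state, dt_action), q_fn(state, idm_action)
    if q_dt >= q_idm:
        return Decision(dt_action, "dt", q_dt)
    return Decision(idm_action, "idm", q_idm)


# Toy stand-ins (hypothetical): a bold DT bid, a conservative IDM bid,
# and a Q-function that peaks at a bid multiplier of 1.1.
def toy_dt(state: State) -> Tuple[float, State]:
    return 1.5, [state[0] * 0.9]


def toy_idm(state: State, next_state: State) -> float:
    return 1.0


def toy_q(state: State, action: float) -> float:
    return -((action - 1.1) ** 2)


decision = select_action([100.0], toy_dt, toy_idm, toy_q)
print(decision.source, decision.action)
```

With these toy functions the fallback wins, because the conservative bid sits closer to the Q-function's peak than the exploratory one. In a real system all three callables would be learned models.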

If this is right

Consistent outperformance over state-of-the-art baselines on public datasets, simulated auction environments, and large-scale online deployment.
In real-world Taobao deployment, achieves +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI.
Reduces financial risk for advertising platforms by providing a safety fallback during exploration.
Enables unified modeling of long-term dependencies in bidding sequences that prior methods handled separately.
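A quick arithmetic check shows how the reported lifts relate to one another. The sketch below assumes ROI is defined as aggregate GMV divided by aggregate cost over the same traffic. That definition is an assumption: the paper may define ROI differently (for example, as a per-advertiser average), so a gap between the implied and reported figures is not necessarily an inconsistency.

```python
# Reported online lifts, as fractions.
gmv_lift = 0.0410
cost_lift = 0.0166
reported_roi_lift = 0.0352

# Under the assumed definition ROI = GMV / cost, relative changes compose as a ratio.
implied_roi_lift = (1 + gmv_lift) / (1 + cost_lift) - 1

print(f"implied aggregate ROI lift: {implied_roi_lift:.2%}")  # about 2.40%
print(f"reported ROI lift:          {reported_roi_lift:.2%}")
```

Under this assumption the implied aggregate lift is about 2.40%, lower than the reported +3.52%, which suggests the reported ROI is computed with some other aggregation.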

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach of pairing sequence modeling with value-based guidance and dynamics-based fallbacks may extend to other high-stakes sequential decisions such as dynamic pricing or inventory control.
Testing variants without the Q-value regularization could reveal whether the safety component primarily drives the observed stability gains.
If the pipeline generalizes, it could reduce reliance on separate exploration heuristics in generative reinforcement learning systems for e-commerce.

Load-bearing premise

The Q-value module can reliably guide the Decision Transformer's exploration via regularization without destabilizing the model, and the Inverse Dynamics Module can infer behaviorally consistent safe actions from DT-predicted future states that serve as an effective fallback in live auction environments.
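One way to picture what Q-value guidance "via regularization" might look like is a sequence-modeling loss with a value term subtracted, similar in spirit to TD3+BC-style offline objectives. The form below is an illustrative assumption, not the paper's stated loss: $\theta$ (DT parameters), $\phi$ (Q-network parameters), $\mathcal{D}$ (the offline dataset), and the squared-error terms are all hypothetical choices, and $\lambda$ is the regularization coefficient whose sensitivity the premise depends on.

```latex
\mathcal{L}(\theta)
  = \underbrace{\mathbb{E}_{\tau \sim \mathcal{D}}
      \Big[ \sum_{t=1}^{T} \lVert \hat{a}_t - a_t \rVert_2^2
          + \sum_{t=1}^{T-1} \lVert \hat{s}_{t+1} - s_{t+1} \rVert_2^2 \Big]}_{\text{sequence-modeling loss (actions and state transitions)}}
  \; - \; \lambda \,
    \underbrace{\mathbb{E}_{\tau \sim \mathcal{D}}
      \Big[ \sum_{t=1}^{T} Q_\phi(s_t, \hat{a}_t) \Big]}_{\text{Q-value guidance}}
```

In a form like this, $\lambda = 0$ recovers pure imitation of logged behavior, while a large $\lambda$ pushes the DT toward actions the Q-network rates highly, including ones poorly covered by the data. That trade-off is why the choice of $\lambda$ matters for stability.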

What would settle it

A controlled online A/B test on a live advertising platform where GUIDE is run without the IDM fallback or with disabled Q-value regularization, showing either lower performance than baselines or increased financial risk exposure compared to the full system.

Figures

Figures reproduced from arXiv: 2605.19457 by Chenliang Li, Feiqing Zhuang, Fei Xiao, Junxiong Zhu, Keping Yang, Lixin Zou, Mingming Zhang, Na Li, Shengjie Sun, Xiaowei Chen.

**Figure 2.** 4.1 Unified Modeling of Bid Trajectories. 4.1.1 Trajectory Construction and Modeling. In the auto-bidding task, each round of bidding can be represented as a temporal trajectory that sequentially records the advertising environment states, bidding actions, and the resulting rewards. Formally, a trajectory can be represented as follows: $\tau = (s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_T, a_T, r_T)$ (5), where $s_t$ denotes t…
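The trajectory in Eq. (5) maps naturally onto a small data structure. The sketch below is illustrative only: `Step`, `flatten`, and `returns_to_go` are hypothetical names, and conditioning on returns-to-go is the convention of the original Decision Transformer rather than something this page confirms for GUIDE.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Step:
    state: Sequence[float]  # s_t: advertising environment state
    action: float           # a_t: bidding action
    reward: float           # r_t: resulting reward


def flatten(trajectory: List[Step]) -> List[Tuple[str, int, object]]:
    """Interleave (s_t, a_t, r_t) into one token sequence, in the order of Eq. (5)."""
    tokens: List[Tuple[str, int, object]] = []
    for t, step in enumerate(trajectory, start=1):
        tokens += [("s", t, step.state), ("a", t, step.action), ("r", t, step.reward)]
    return tokens


def returns_to_go(trajectory: List[Step]) -> List[float]:
    """Sum of rewards from step t to the end (standard DT conditioning signal)."""
    rtg, total = [], 0.0
    for step in reversed(trajectory):
        total += step.reward
        rtg.append(total)
    return rtg[::-1]


traj = [Step([1.0], 0.5, 1.0), Step([0.8], 0.6, 2.0), Step([0.5], 0.7, 3.0)]
tokens = flatten(traj)
rtg = returns_to_go(traj)
print(len(tokens), rtg)  # 9 tokens; returns-to-go [6.0, 5.0, 3.0]
```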

**Figure 2.** Overview architecture. a) Training of the unified modeling framework. b) Inference with bid selection.

**Figure 3.** Ablation Study. Second, random selection instead of Q-value-based selection also reduces performance, with results lying between those of using the two action sources separately. This can be attributed to DT actions being higher in overall quality than IDM actions. Third, removing Q-value regularization optimization causes a significant drop in performance, though still outperforming the …

**Figure 5.** Action preferences of different advertisers.

**Figure 4.** Two-stage Training Analysis. (Section 5.4, RQ3: Co-operation between DT and IDM.)

**Figure 6.** Volatility comparison between DT and IDM bid.

**Figure 7.** Figure 7 visually compares the actual cost trajectories managed by a baseline method and our proposed GUIDE against this ideal trajectory. As can be seen, while both methods attempt to follow the general trend, the cost distribution controlled by GUIDE (right panel) tracks the ideal cost much more closely across the different time steps. The baseline method (left panel), in contrast, shows more significant deviatio…

Original abstract

Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GUIDE adds Q-regularization and IDM fallback to a Decision Transformer for safer generative bidding and reports real Taobao lifts, but the stability of that balance in live auctions is the part that needs more checking.

Colleague, the main point is that this paper builds GUIDE around a Decision Transformer that models bidding sequences, then layers on a Q-value module to regularize exploration and an Inverse Dynamics Module to generate safe fallback actions from predicted states, with an adaptive selector picking the final bid. They show it running on Taobao with gains of about 4% GMV and 3.5% ROI plus smaller lifts in clicks and cost, and it beats baselines in public data and simulations too. That online deployment is the part that carries weight here.

The integration itself is straightforward but useful: it takes the long-horizon strength of the DT and tries to fix the missing safety piece that earlier generative bidding work left to simple perturbations. The experiments span the usual public sets, controlled auctions, and actual platform traffic, which gives a fuller picture than most papers manage.

The soft spot sits right where the stress-test note points. The Q-regularizer is supposed to steer the DT without breaking it, and the IDM is meant to produce consistent safe actions even when the DT's state predictions are off. In real auctions the environment shifts fast, so small prediction drift could make the selector pick the wrong branch or let risky actions through. The paper does not appear to include detailed sensitivity checks on regularization strength or clear ablations showing what happens when the two branches disagree, so the safety guarantee still feels more asserted than demonstrated.

This is aimed at people working on RL or generative methods for ad allocation and online decision systems. Anyone building or evaluating bidding agents would find the deployment numbers worth reading. I would send it to referees. The live results are concrete enough to justify the time, even if the robustness section needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce GUIDE, a novel framework for automated bidding in digital advertising. It integrates a Decision Transformer (DT) to model historical bidding actions and environmental state transitions, a Q-value module to guide the DT's exploration through regularization, and an Inverse Dynamics Module (IDM) to infer safe, behaviorally consistent actions from DT-predicted future states as a fallback. An adaptive selector then chooses the final action to balance exploration and safety. The authors demonstrate consistent outperformance over state-of-the-art baselines on public datasets and in simulated environments, with real-world deployment on Taobao yielding +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI.

Significance. Should the empirical findings be confirmed with additional controls, this work offers a significant contribution to the field of AI for online advertising by proposing an integrated approach to exploration and safety in generative bidding models. The large-scale online deployment provides valuable real-world validation. The use of standard components in a new synergistic way is noted as a strength.

major comments (2)

[§3.2] Q-value module: the claim that regularization reliably guides DT exploration without destabilizing the model lacks any analysis of regularization strength selection, sensitivity, or robustness to non-stationary auction dynamics. This is load-bearing for the safety and stability guarantees of the explore-safeguard-select pipeline.
[§5] Experimental results: the reported gains (e.g., +4.10% GMV) are presented without statistical significance tests, run-to-run variance, or ablation studies isolating the Q-value regularization and IDM fallback contributions. This undermines verification of the central outperformance claims.
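To make the second comment concrete, the kind of uncertainty estimate being requested could look like a percentile bootstrap on the relative lift between control and treatment groups. This is a minimal sketch on synthetic data: `bootstrap_lift_ci` is a hypothetical helper, the Gaussian samples stand in for per-unit GMV, and nothing here reproduces the paper's actual experiment.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_lift_ci(
    control: List[float],
    treatment: List[float],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile-bootstrap confidence interval for mean(treatment)/mean(control) - 1."""
    rng = random.Random(seed)
    lifts = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        c = [rng.choice(control) for _ in control]
        t = [rng.choice(treatment) for _ in treatment]
        lifts.append(mean(t) / mean(c) - 1.0)
    lifts.sort()
    lower = lifts[int((alpha / 2) * n_resamples)]
    upper = lifts[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Synthetic data (assumption): the true lift is 4% with noisy per-unit outcomes.
data_rng = random.Random(42)
control = [data_rng.gauss(100.0, 10.0) for _ in range(300)]
treatment = [data_rng.gauss(104.0, 10.0) for _ in range(300)]

ci_low, ci_high = bootstrap_lift_ci(control, treatment)
print(f"95% CI for relative lift: [{ci_low:.2%}, {ci_high:.2%}]")
```

An interval that excludes zero would support a claim of a real lift. For heavy-tailed revenue data, a larger number of resamples or a stratified scheme might be more appropriate.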

minor comments (2)

The abstract refers to 'public datasets' without naming them; this should be stated explicitly in the introduction or experimental setup.
[§3] A diagram of the overall architecture in §3 would benefit from explicit labels on the adaptive selector and data flow between DT, Q-value, and IDM modules.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, providing clarifications and indicating revisions that will be incorporated to strengthen the empirical and analytical rigor of the manuscript.

Referee: [§3.2] Q-value module: the claim that regularization reliably guides DT exploration without destabilizing the model lacks any analysis of regularization strength selection, sensitivity, or robustness to non-stationary auction dynamics. This is load-bearing for the safety and stability guarantees of the explore-safeguard-select pipeline.

Authors: We acknowledge that the original submission did not provide explicit sensitivity analysis or robustness checks for the Q-value regularization strength. In the revised manuscript, we will expand §3.2 with a dedicated analysis of the regularization coefficient λ, including performance curves over a range of λ values, and additional experiments simulating non-stationary auction dynamics (e.g., varying bid landscape distributions). These additions will directly support the stability claims of the explore-safeguard-select pipeline.
Revision: yes

Referee: [§5] Experimental results: the reported gains (e.g., +4.10% GMV) are presented without statistical significance tests, run-to-run variance, or ablation studies isolating the Q-value regularization and IDM fallback contributions. This undermines verification of the central outperformance claims.

Authors: We agree that additional statistical reporting and ablations would improve verifiability. The revised §5 will include ablation studies isolating the Q-value regularization and IDM fallback, plus standard deviations and error bars from multiple simulation runs. For the online Taobao deployment, we will add confidence intervals based on available traffic data. Full multi-run variance is inherently limited in a single large-scale A/B test, which we will now explicitly discuss as a limitation.
Revision: partial

Circularity Check

0 steps flagged

No circularity: framework assembles standard DT/Q/IDM components with new integration; claims rest on empirical results rather than self-referential definitions or fitted predictions.

The paper presents GUIDE as an engineering integration of Decision Transformer for trajectory modeling, Q-value regularization for directed exploration, and an Inverse Dynamics Module for safe fallback, followed by an adaptive selector. No derivation chain reduces a claimed prediction or uniqueness result to its own inputs by construction. The abstract and described pipeline introduce no self-definitional loops, no fitted parameters renamed as predictions, and no load-bearing self-citations that substitute for independent justification. Results are reported from public datasets, simulations, and online deployment on Taobao, making the central claims externally falsifiable rather than tautological. This is the expected non-finding for an applied systems paper whose novelty lies in composition rather than mathematical reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from reinforcement learning and sequence modeling applied to bidding, plus the novel integration of components; no explicit free parameters or invented physical entities are detailed in the abstract.

axioms (1)

Domain assumption: Bidding can be effectively modeled as a sequence prediction task using historical actions and state transitions.
Invoked by: the use of a Decision Transformer to jointly model bidding actions and environmental state transitions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 4 internal anchors

[1] Deepak Agarwal, Souvik Ghosh, Kai Wei, and Siyu You. 2014. Budget pacing for targeted online advertisements at LinkedIn. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1613–1619.
[2] Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. 2022. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657 (2022).
[3] Michael Bain and Claude Sammut. 1995. A Framework for Behavioural Cloning. In Machine Intelligence 15. 103–129.
[4] Rakesh P Borase, DK Maghade, SY Sondkar, and SN Pawar. 2021. A review of PID control, tuning methods and applications. International Journal of Dynamics and Control 9, 2 (2021), 818–827.
[5] Nikolay Borissov, Dirk Neumann, and Christof Weinhardt. 2010. Automated bidding in computational markets: an application in market-based allocation of computing services. Autonomous Agents and Multi-Agent Systems 21, 2 (2010), 115–142.
[6] Craig Boutilier, Thomas Dean, and Steve Hanks. 1999. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11 (1999), 1–94.
[7] Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display advertising. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 661–670.
[8] Leng Cai, Junxuan He, Yikai Li, Junjie Liang, Yuanping Lin, Ziming Quan, Yawen Zeng, and Jin Xu. 2025. RTBAgent: A LLM-based Agent System for Real-Time Bidding. In Companion Proceedings of the ACM on Web Conference 2025. 104–113.
[9] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems 34 (2021), 15084–15097.
[10] Ye Chen, Pavel Berkhin, Bo Anderson, and Nikhil R Devanur. 2011. Real-time bidding algorithms for performance-based display ad allocation. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1307–1315.
[11] George B Dantzig. 2016. Linear programming and extensions. (2016).
[12, 13] Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau. 2019. Benchmarking batch deep reinforcement learning algorithms. arXiv preprint arXiv:1910.01708 (2019).
[14] Scott Fujimoto and Shixiang Shane Gu. 2021. A minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems 34 (2021), 20132–20145.
[15] Jingtong Gao, Yewen Li, Shuai Mao, Peng Jiang, Nan Jiang, Yejing Wang, Qingpeng Cai, Fei Pan, Peng Jiang, Kun Gai, et al. 2025. Generative Auto-Bidding with Value-Guided Explorations. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 244–254.
[16] Jiayan Guo, Yusen Huo, Zhilin Zhang, Tianyu Wang, Chuan Yu, Jian Xu, Bo Zheng, and Yan Zhang. 2024. Generative auto-bidding via conditional diffusion modeling. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5038–5049.
[17] Yue He, Xiujun Chen, Di Wu, Junwei Pan, Qing Tan, Chuan Yu, Jian Xu, and Xiaoqiang Zhu. 2021. A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2993–3001.
[18] Jiahao Ji, Tianyu Wang, Yeshu Li, Yusen Huo, Zhilin Zhang, Chuan Yu, Jian Xu, and Bo Zheng. 2025. Bid2X: Revealing Dynamics of Bidding Environment in Online Advertising from A Foundation Model Lens. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 4543–4554.
[19] Hao Jiang, Yongxiang Tang, Yanxiang Zeng, Pengjia Yuan, Yanhua Cheng, Teng Sha, Xialong Liu, and Peng Jiang. 2025. Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement. In Companion Proceedings of the ACM on Web Conference 2025. 1033–1037.
[20] Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2193–2201.
[21] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[22] Carl Knospe. 2006. PID control. IEEE Control Systems Magazine 26, 1 (2006), 30–31.
[23] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2021. Offline reinforcement learning with implicit Q-learning. arXiv preprint arXiv:2110.06169 (2021).
[24] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative Q-learning for offline reinforcement learning. Advances in Neural Information Processing Systems 33 (2020), 1179–1191.
[25] Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, and Fan Wu. 2024. Trajectory-wise iterative reinforcement learning framework for auto-bidding. In Proceedings of the ACM Web Conference 2024. 4193–4203.
[26] Juncheng Li and Pingzhong Tang. 2022. Auto-bidding equilibrium in ROI-constrained online advertising markets. arXiv preprint arXiv:2210.06107 (2022).
[27] Yewen Li, Shuai Mao, Jingtong Gao, Nan Jiang, Yunjian Xu, Qingpeng Cai, Fei Pan, Peng Jiang, and Bo An. 2025. GAS: Generative Auto-bidding with Post-training Search. In Companion Proceedings of the ACM on Web Conference 2025. 315–324.
[28] Mengjuan Liu, Li Jiaxing, Zhengning Hu, Jinyu Liu, and Xuyun Nie. 2020. A dynamic bidding strategy based on model-free reinforcement learning in display advertising. IEEE Access 8 (2020), 213587–213601.
[29] Haofei Lu, Dongqi Han, Yifei Shen, and Dongsheng Li. 2025. What makes a good diffusion planner for decision making? arXiv preprint arXiv:2503.00535 (2025).
[30] Zhiyu Mou, Yusen Huo, Rongquan Bai, Mingzhou Xie, Chuan Yu, Jian Xu, and Bo Zheng. 2022. Sustainable online reinforcement learning for auto-bidding. Advances in Neural Information Processing Systems 35 (2022), 2651–2663.
[31] Yunshan Peng, Wenzheng Shu, Jiahao Sun, Yanxiang Zeng, Jinan Pang, Wentao Bai, Yunke Bai, Xialong Liu, and Peng Jiang. 2025. Expert-Guided Diffusion Planner for Auto-bidding. arXiv preprint arXiv:2508.08687 (2025).
[32] Martin L Puterman. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.
[33] Kefan Su, Yusen Huo, Zhilin Zhang, Shuai Dou, Chuan Yu, Jian Xu, Zongqing Lu, and Bo Zheng. 2024. AuctionNet: A novel benchmark for decision-making in large-scale games. Advances in Neural Information Processing Systems 37 (2024), 94428–94452.
[34] Chao Wen, Miao Xu, Zhilin Zhang, Zhenzhe Zheng, Yuhui Wang, Xiangyu Liu, Yu Rong, Dong Xie, Xiaoyang Tan, Chuan Yu, et al. 2022. A cooperative-competitive multi-agent framework for auto-bidding in online advertising. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1129–1139.
[35] Xun Yang, Yasong Li, Hao Wang, Di Wu, Qing Tan, Jian Xu, and Kun Gai. 2019. Bid optimization by multivariable control in display advertising. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1966–1974.
[36] Hao Yu, Michael Neely, and Xiaohan Wei. 2017. Online convex optimization with stochastic constraints. Advances in Neural Information Processing Systems 30 (2017).
[37] Congde Yuan, Mengzhuo Guo, Chaoneng Xiang, Shuangyang Wang, Guoqing Song, and Qingpeng Zhang. 2022. An actor-critic reinforcement learning model for optimal bidding in online display advertising. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3604–3613.
[38] Shuai Yuan, Jun Wang, and Xiaoxue Zhao. 2013. Real-time bidding for online advertising: measurement and analysis. In Proceedings of the Seventh International Workshop on Data Mining for Online Advertising. 1–8.
[39] Haoqi Zhang, Lvyin Niu, Zhenzhe Zheng, Zhilin Zhang, Shan Gu, Fan Wu, Chuan Yu, Jian Xu, Guihai Chen, and Bo Zheng. 2023. A personalized automated bidding framework for fairness-aware online advertising. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5544–5553.
[40] Weinan Zhang, Yifei Rong, Jun Wang, Tianchi Zhu, and Xiaofan Wang. 2016. Feedback control of real-time display advertising. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. 407–416.
[41] Weinan Zhang, Shuai Yuan, and Jun Wang. 2014. Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 1077–1086.
[42] Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, and Weinan Zhang. 2024. MADiff: Offline multi-agent learning with diffusion models. Advances in Neural Information Processing Systems 37 (2024), 4177–4206.

Paper running header: Generative Auto-Bidding with Unified Modeling and Exploration, SIGIR '26, July 20–24, 2026, Melbourne, VIC, Australia.

[1] [1]

Deepak Agarwal, Souvik Ghosh, Kai Wei, and Siyu You. 2014. Budget pacing for targeted online advertisements at linkedin. InProceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1613– 1619

work page 2014

[2] [2]

Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. 2022. Is conditional generative modeling all you need for decision- making?arXiv preprint arXiv:2211.15657(2022)

work page internal anchor Pith review Pith/arXiv arXiv 2022

[3] [3]

Michael Bain and Claude Sammut. 1995. A Framework for Behavioural Cloning.. InMachine intelligence 15. 103–129

work page 1995

[4] [4]

Rakesh P Borase, DK Maghade, SY Sondkar, and SN Pawar. 2021. A review of PID control, tuning methods and applications.International Journal of Dynamics and Control9, 2 (2021), 818–827

work page 2021

[5] [5]

Nikolay Borissov, Dirk Neumann, and Christof Weinhardt. 2010. Automated bidding in computational markets: an application in market-based allocation of computing services.Autonomous Agents and Multi-Agent Systems21, 2 (2010), 115–142

work page 2010

[6] [6]

Craig Boutilier, Thomas Dean, and Steve Hanks. 1999. Decision-theoretic plan- ning: Structural assumptions and computational leverage.Journal of Artificial Intelligence Research11 (1999), 1–94

work page 1999

[7] [7]

Han Cai, Kan Ren, Weinan Zhang, Kleanthis Malialis, Jun Wang, Yong Yu, and Defeng Guo. 2017. Real-time bidding by reinforcement learning in display adver- tising. InProceedings of the tenth ACM international conference on web search and data mining. 661–670

work page 2017

[8] [8]

Leng Cai, Junxuan He, Yikai Li, Junjie Liang, Yuanping Lin, Ziming Quan, Yawen Zeng, and Jin Xu. 2025. RTBAgent: A LLM-based Agent System for Real-Time Bidding. InCompanion Proceedings of the ACM on Web Conference 2025. 104–113

work page 2025

[9] [9]

Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. 2021. Decision transformer: Reinforcement learning via sequence modeling.Advances in neural information processing systems34 (2021), 15084–15097

work page 2021

[10] [10]

Ye Chen, Pavel Berkhin, Bo Anderson, and Nikhil R Devanur. 2011. Real-time bidding algorithms for performance-based display ad allocation. InProceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining. 1307–1315

work page 2011

[11] [11]

George B Dantzig. 2016. Linear programming and extensions. (2016)

work page 2016

[12] [12]

Scott Fujimoto, Edoardo Conti, Mohammad Ghavamzadeh, and Joelle Pineau

work page

[13] [13]

Benchmarking batch deep reinforcement learning algorithms.arXiv preprint arXiv:1910.01708(2019)

work page internal anchor Pith review Pith/arXiv arXiv 1910

[14] [14]

Scott Fujimoto and Shixiang Shane Gu. 2021. A minimalist approach to offline reinforcement learning.Advances in neural information processing systems34 (2021), 20132–20145

work page 2021

[15] [15]

Jingtong Gao, Yewen Li, Shuai Mao, Peng Jiang, Nan Jiang, Yejing Wang, Qingpeng Cai, Fei Pan, Peng Jiang, Kun Gai, et al. 2025. Generative Auto-Bidding with Value-Guided Explorations. In Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval. 244–254.

[16]

Jiayan Guo, Yusen Huo, Zhilin Zhang, Tianyu Wang, Chuan Yu, Jian Xu, Bo Zheng, and Yan Zhang. 2024. Generative auto-bidding via conditional diffusion modeling. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5038–5049.

[17]

Yue He, Xiujun Chen, Di Wu, Junwei Pan, Qing Tan, Chuan Yu, Jian Xu, and Xiaoqiang Zhu. 2021. A unified solution to constrained bidding in online display advertising. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining. 2993–3001.

[18]

Jiahao Ji, Tianyu Wang, Yeshu Li, Yusen Huo, Zhilin Zhang, Chuan Yu, Jian Xu, and Bo Zheng. 2025. Bid2X: Revealing Dynamics of Bidding Environment in Online Advertising from A Foundation Model Lens. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2. 4543–4554.

[19]

Hao Jiang, Yongxiang Tang, Yanxiang Zeng, Pengjia Yuan, Yanhua Cheng, Teng Sha, Xialong Liu, and Peng Jiang. 2025. Optimal Return-to-Go Guided Decision Transformer for Auto-Bidding in Advertisement. In Companion Proceedings of the ACM on Web Conference 2025. 1033–1037.

[20]

Junqi Jin, Chengru Song, Han Li, Kun Gai, Jun Wang, and Weinan Zhang. 2018. Real-time bidding with multi-agent reinforcement learning in display advertising. In Proceedings of the 27th ACM international conference on information and knowledge management. 2193–2201.

[21]

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. 2020. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).

Generative Auto-Bidding with Unified Modeling and Exploration SIGIR ’26, July 20–24, 2026, Melbourne, VIC, Australia

[22]

Carl Knospe. 2006. PID control. IEEE Control Systems Magazine 26, 1 (2006), 30–31.

[23]

Ilya Kostrikov, Ashvin Nair, and Sergey Levine. 2021. Offline reinforcement learning with implicit q-learning. arXiv preprint arXiv:2110.06169 (2021).

[24]

Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. 2020. Conservative q-learning for offline reinforcement learning. Advances in neural information processing systems 33 (2020), 1179–1191.

[25]

Haoming Li, Yusen Huo, Shuai Dou, Zhenzhe Zheng, Zhilin Zhang, Chuan Yu, Jian Xu, and Fan Wu. 2024. Trajectory-wise iterative reinforcement learning framework for auto-bidding. In Proceedings of the ACM Web Conference 2024. 4193–4203.

[26]

Juncheng Li and Pingzhong Tang. 2022. Auto-bidding equilibrium in ROI-constrained online advertising markets. arXiv preprint arXiv:2210.06107 (2022).

[27]

Yewen Li, Shuai Mao, Jingtong Gao, Nan Jiang, Yunjian Xu, Qingpeng Cai, Fei Pan, Peng Jiang, and Bo An. 2025. GAS: Generative Auto-bidding with Post-training Search. In Companion Proceedings of the ACM on Web Conference 2025. 315–324.

[28]

Mengjuan Liu, Li Jiaxing, Zhengning Hu, Jinyu Liu, and Xuyun Nie. 2020. A dynamic bidding strategy based on model-free reinforcement learning in display advertising. IEEE Access 8 (2020), 213587–213601.

[29]

Haofei Lu, Dongqi Han, Yifei Shen, and Dongsheng Li. 2025. What makes a good diffusion planner for decision making? arXiv preprint arXiv:2503.00535 (2025).

[30]

Zhiyu Mou, Yusen Huo, Rongquan Bai, Mingzhou Xie, Chuan Yu, Jian Xu, and Bo Zheng. 2022. Sustainable online reinforcement learning for auto-bidding. Advances in Neural Information Processing Systems 35 (2022), 2651–2663.

[31]

Yunshan Peng, Wenzheng Shu, Jiahao Sun, Yanxiang Zeng, Jinan Pang, Wentao Bai, Yunke Bai, Xialong Liu, and Peng Jiang. 2025. Expert-Guided Diffusion Planner for Auto-bidding. arXiv preprint arXiv:2508.08687 (2025).

[32]

Martin L Puterman. 2014. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons.

[33]

Kefan Su, Yusen Huo, Zhilin Zhang, Shuai Dou, Chuan Yu, Jian Xu, Zongqing Lu, and Bo Zheng. 2024. Auctionnet: A novel benchmark for decision-making in large-scale games. Advances in Neural Information Processing Systems 37 (2024), 94428–94452.

[34]

Chao Wen, Miao Xu, Zhilin Zhang, Zhenzhe Zheng, Yuhui Wang, Xiangyu Liu, Yu Rong, Dong Xie, Xiaoyang Tan, Chuan Yu, et al. 2022. A cooperative-competitive multi-agent framework for auto-bidding in online advertising. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining. 1129–1139.

[35]

Xun Yang, Yasong Li, Hao Wang, Di Wu, Qing Tan, Jian Xu, and Kun Gai. 2019. Bid optimization by multivariable control in display advertising. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining. 1966–1974.

[36]

Hao Yu, Michael Neely, and Xiaohan Wei. 2017. Online convex optimization with stochastic constraints. Advances in Neural Information Processing Systems 30 (2017).

[37]

Congde Yuan, Mengzhuo Guo, Chaoneng Xiang, Shuangyang Wang, Guoqing Song, and Qingpeng Zhang. 2022. An actor-critic reinforcement learning model for optimal bidding in online display advertising. In Proceedings of the 31st ACM International Conference on Information & Knowledge Management. 3604–3613.

[38]

Shuai Yuan, Jun Wang, and Xiaoxue Zhao. 2013. Real-time bidding for online advertising: measurement and analysis. In Proceedings of the seventh international workshop on data mining for online advertising. 1–8.

[39]

Haoqi Zhang, Lvyin Niu, Zhenzhe Zheng, Zhilin Zhang, Shan Gu, Fan Wu, Chuan Yu, Jian Xu, Guihai Chen, and Bo Zheng. 2023. A personalized automated bidding framework for fairness-aware online advertising. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. 5544–5553.

[40]

Weinan Zhang, Yifei Rong, Jun Wang, Tianchi Zhu, and Xiaofan Wang. 2016. Feedback control of real-time display advertising. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining. 407–416.

[41]

Weinan Zhang, Shuai Yuan, and Jun Wang. 2014. Optimal real-time bidding for display advertising. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1077–1086.

[42]

Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, and Weinan Zhang. 2024. Madiff: Offline multi-agent learning with diffusion models. Advances in Neural Information Processing Systems 37 (2024), 4177–4206.