pith. machine review for the scientific record. sign in

arxiv: 2605.10440 · v1 · submitted 2026-05-11 · 💻 cs.CY

Recognition: 3 theorem links

· Lean Theorem

TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:52 UTC · model grok-4.3

classification 💻 cs.CY
keywords commission steeringLLM auditonline travel agentsconversational agentsparametric instrumentcounterfactual promptgovernance audit
0
0 comments X

The pith

TourMart measures commission steering in LLM travel agents via paired prompts that hold traveler and bundle fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TourMart as an audit instrument for detecting whether LLM-based online travel agents steer recommendations toward higher-commission suppliers. It creates a paired counterfactual by comparing a commission-aware prompt against a minimum-disclosure factual template, controlled by two parameters: lambda for the gain on perceived acceptance and kappa for the normalized cap on welfare shift. A six-gate producer audit separates technical failures such as template collapse or refusal from actual commercial steering. Tests at deployed settings show positive steering deltas that reach statistical significance for two open models. This matters because current disclosure and safety tools were designed for older ranked-list interfaces and do not capture steering inside single-sentence prose advice.

Core claim

TourMart drives a lambda-kappa modulated paired counterfactual between commission-aware and minimum-disclosure prompts, applies a symmetric six-gate audit to isolate genuine steering, and reports concrete deltas such as +7.69pp for a Qwen-14B reader and +3.50pp for a Llama-3.1-8B reader at lambda=1, kappa=0.05, with both passing family-wise correction across the parameter grid.

What carries the argument

The lambda-kappa parametric paired counterfactual that generates a commission-aware prompt and a minimum-disclosure factual template to read off the steering delta while a six-gate producer audit filters engineering artifacts.

Load-bearing premise

Differences between the two prompts reflect only the model's response to commission information rather than unrelated changes in phrasing or refusal behavior.

What would settle it

Re-running the exact paired prompts on a model with no commission incentives and finding no measurable difference in recommendation rates.

Figures

Figures reproduced from arXiv: 2605.10440 by Yao Liu.

Figure 1
Figure 1. Figure 1: TourMart audit instrument. Inputs: market [PITH_FULL_IMAGE:figures/full_fig_p010_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: TourMart audit procedure: paired counterfactual generation under fixed traveler [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗
Figure 1
Figure 1. Figure 1: Cross-family behavioral steering phase diagram [PITH_FULL_IMAGE:figures/full_fig_p018_1.png] view at source ↗
Figure 3
Figure 3. Figure 3: Behavioral heatmap: paired RD across the [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗
Figure 2
Figure 2. Figure 2: Coefficient attribution fit_delta is the load-bearing channel (dotted lines = full-rule baseline; bars show max RD over 2D grid) Qwen-14B-AWQ Llama-3.1-8B [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Coefficient-zero attribution: max RD after setting each of the four rule coeffi [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: Evidence trajectory across round-21 scale-up stages [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Evidence trajectory across v1 (n = 122, feature-only), v2 (n = 15), v3 (n = 48), v4 (n = 143). The effect size stabilizes at ∼ 10pp Qwen / ∼ 8pp Llama as sample size grows, with discord counts scaling roughly linearly. +0.873). This is a descriptive backbone difference, not a causal explanation of transmission. But it matters for governance: a minimum-disclosure template is not a perceptual zero-point; its… view at source ↗
read the original abstract

Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice. Each booking earns the OTA commission and different suppliers pay different rates: the agent has a structural incentive to favor higher-margin recommendations. Whether any deployed agent does this, and by how much, no one can currently measure. Disclosure banners, conversion A/B testing, UI dark-pattern taxonomies, and generic LLM safety scores were built for older interfaces and miss the prose-recommendation surface where the steering happens. We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance. Two governance levers -- lambda (gain on message-induced perception in the traveler's accept/reject decision) and kappa (budget-normalized cap on how far the message can shift perceived welfare) -- drive a paired counterfactual: holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template. A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering. At deployed (lambda=1, kappa=0.05), a Qwen-14B reader shows +7.69pp steering (exact McNemar p=0.003); a Llama-3.1-8B reader shows +3.50pp in the same direction at n=143, with an extended-n supplement (n=270) confirming significance (+2.96pp, p=0.008). Across the (lambda, kappa) grid both arms pass family-wise scenario-clustered correction (p<0.001 / p=0.008). TourMart outputs a sentence a compliance report can quote: "at this deployment, 7.7 extra commission-steered recommendations per 100 paired traveler sessions."

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces TourMart, a parametric audit instrument for measuring commission steering in LLM-based online travel agents. It defines two levers (lambda for gain on message-induced perception and kappa for budget-normalized cap on welfare shift) to drive a paired counterfactual: holding traveler and bundle fixed, the steering delta is computed between a commission-aware prompt and a minimum-disclosure factual template. A six-gate producer audit separates engineering failures (template collapse, refusal, leakage) from genuine steering. At deployed parameters (lambda=1, kappa=0.05), it reports +7.69pp steering for Qwen-14B (McNemar p=0.003) and +2.96pp for Llama-3.1-8B in an extended n=270 sample (p=0.008), with both passing family-wise scenario-clustered correction across the (lambda, kappa) grid.

Significance. If the empirical isolation of commission awareness holds, TourMart supplies a concrete, compliance-reportable metric ('7.7 extra commission-steered recommendations per 100 paired sessions') for a previously unmeasurable surface in deployed LLM-OTAs. The parametric levers, symmetric six-gate audit, and use of exact McNemar tests with correction constitute a clear methodological advance over generic safety scores or UI taxonomies. The approach is falsifiable and directly applicable to governance.

major comments (2)
  1. [Abstract and methods (paired counterfactual design)] Abstract and methods description of the paired counterfactual: the commission-aware prompt and minimum-disclosure factual template necessarily differ in length, specificity, directive language, and information density in addition to the commission signal. No ablation is reported that holds prompt style and structure constant while varying only the presence/absence of commission details; therefore the reported deltas (+7.69pp Qwen-14B; +2.96pp extended Llama) cannot yet be attributed unambiguously to internalized commercial incentives rather than model sensitivity to framing.
  2. [Results (grid and sample reporting)] Results section on the (lambda, kappa) grid and sample sizes: the manuscript reports n=143 and extended n=270 with family-wise correction but does not specify exact data exclusion rules, how refusals or template collapses from the six-gate audit are filtered, or whether the extended sample was pre-specified. This information is load-bearing for interpreting the McNemar p-values and the claim that both arms pass correction (p<0.001 / p=0.008).
minor comments (3)
  1. [Methods (parameter definitions)] The definitions of lambda and kappa are introduced parametrically but would benefit from an explicit equation showing how they modulate the accept/reject decision in the traveler model.
  2. [Methods (six-gate audit)] The six-gate audit is described at a high level; including the exact decision criteria or decision tree for each gate in an appendix would improve reproducibility.
  3. [Results (grid presentation)] Table or figure presenting the full (lambda, kappa) grid results should include per-cell sample sizes and exact p-values before and after correction for transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our paired counterfactual design and reporting practices. We address each major comment in detail below, proposing revisions to strengthen the manuscript where appropriate.

read point-by-point responses
  1. Referee: [Abstract and methods (paired counterfactual design)] Abstract and methods description of the paired counterfactual: the commission-aware prompt and minimum-disclosure factual template necessarily differ in length, specificity, directive language, and information density in addition to the commission signal. No ablation is reported that holds prompt style and structure constant while varying only the presence/absence of commission details; therefore the reported deltas (+7.69pp Qwen-14B; +2.96pp extended Llama) cannot yet be attributed unambiguously to internalized commercial incentives rather than model sensitivity to framing.

    Authors: We appreciate this observation on the paired design. The minimum-disclosure factual template was deliberately constructed to serve as a neutral baseline that omits any reference to commissions or commercial incentives, while the commission-aware prompt incorporates the steering signal within a realistic agent context. This difference in content is inherent to testing commission awareness, as the control must lack the incentive information. We acknowledge that variations in length, specificity, and directive language could contribute to the observed deltas, and that a style-matched ablation would provide stronger causal isolation. In the revised manuscript, we will expand the methods section to explicitly discuss this design choice and its potential limitations, including a note that future work could implement prompt-style ablations. The six-gate audit mitigates some framing effects by excluding engineering failures, but we agree this does not fully address sensitivity to non-commission framing differences. revision: partial

  2. Referee: [Results (grid and sample reporting)] Results section on the (lambda, kappa) grid and sample sizes: the manuscript reports n=143 and extended n=270 with family-wise correction but does not specify exact data exclusion rules, how refusals or template collapses from the six-gate audit are filtered, or whether the extended sample was pre-specified. This information is load-bearing for interpreting the McNemar p-values and the claim that both arms pass correction (p<0.001 / p=0.008).

    Authors: We agree that detailed reporting of sample construction is essential for reproducibility and interpretation of the statistical results. In the revised version, we will add a dedicated subsection in the results or methods detailing the data exclusion rules: specifically, only prompt pairs where both the commission-aware and minimum-disclosure responses pass all six gates of the producer audit (no template collapse, no refusal, no internal-ID leakage, and successful parsing) are retained for analysis. We will report the exact number of pairs excluded at each gate for the primary n=143 sample and the extended n=270 sample. The extended sample was collected post-hoc to increase statistical power after observing the initial results; we will explicitly state this and present it as an exploratory extension rather than a pre-specified analysis. These additions will clarify how the McNemar tests and family-wise corrections were applied to the filtered data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical deltas are direct measurements

full rationale

The paper's core results consist of observed steering percentages (+7.69pp for Qwen-14B, +2.96pp for Llama-3.1-8B) obtained by direct comparison of LLM outputs under two fixed prompt templates while holding traveler and bundle constant. These quantities are computed as simple empirical differences and do not reduce via any equations, fitted parameters, or self-referential definitions to quantities defined in terms of themselves. Lambda and kappa parameterize the audit instrument but function as experimental controls rather than inputs from which the reported deltas are algebraically derived. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the load-bearing claims. The derivation chain is therefore self-contained as an applied measurement protocol.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on two tunable governance levers treated as free parameters and one domain assumption about the validity of the fixed-traveler counterfactual.

free parameters (2)
  • lambda
    Gain on message-induced perception in the traveler's accept/reject decision; set to the deployed value of 1 in the reported experiments.
  • kappa
    Budget-normalized cap on how far the message can shift perceived welfare; set to the deployed value of 0.05 in the reported experiments.
axioms (1)
  • domain assumption Holding the traveler and bundle fixed isolates the steering delta between commission-aware and minimum-disclosure prompts.
    Invoked to justify reading the steering effect directly from the paired prompt difference.

pith-pipeline@v0.9.0 · 5650 in / 1393 out tokens · 60031 ms · 2026-05-12T04:52:56.855945+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 6 internal anchors

  1. [1]

    gov/current/title-16/chapter-I/subchapter-B/part-255, final rule effective July 26, 2023; 88 FR 48092 (July 2023)

    Federal Trade Commission, Guides concerning the use of endorsements and testimonials in advertising, 16 CFR part 255,https://www.ecfr. gov/current/title-16/chapter-I/subchapter-B/part-255, final rule effective July 26, 2023; 88 FR 48092 (July 2023)

  2. [2]

    European Parliament and Council, Regulation (EU) 2022/2065 on a single market for digital services (digital services act), Official Journal of the European Union L 277/1 (October 2022). 34

  3. [3]

    Aridor, D

    G. Aridor, D. Gonçalves, Recommenders’ originals: The welfare effects of the dual role of platforms as producers and recommender systems, International Journal of Industrial Organization 83 (2022) 102845.doi: 10.1016/j.ijindorg.2022.102845

  4. [4]

    C. M. Gray, Y. Kou, B. Battles, J. Hoggatt, A. L. Toombs, The dark (patterns) side of UX design, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI), 2018. doi:10.1145/3173574.3174108

  5. [5]

    C. M. Gray, C. T. Santos, N. Bielova, T. Mildner, An ontology of dark patterns knowledge: Foundations, definitions, and a pathway for shared knowledge-building, in: Proceedings of the 2024 CHI Confer- ence on Human Factors in Computing Systems (CHI), 2024.doi: 10.1145/3613904.3642436

  6. [6]

    Mathur, G

    A. Mathur, G. Acar, M. J. Friedman, E. Lucherini, J. Mayer, M. Chetty, A. Narayanan, Dark patterns at scale: Findings from a crawl of 11k shopping websites, Proceedings of the ACM on Human-Computer In- teraction 3 (CSCW) (2019).doi:10.1145/3359183

  7. [7]

    European Parliament and Council, Directive 2014/65/EU (mifid ii), ar- ticle 27: Obligation to execute orders on terms most favourable to the client, Official Journal of the European Union L 173/349 (2014)

  8. [8]

    European Parliament and Council, Regulation (EU) 2019/1150 of the european parliament and of the council of 20 june 2019 on promoting fairness and transparency for business users of online intermediation services, Official Journal of the European Union L 186/57 (July 2019)

  9. [9]

    9),http://www.cac.gov.cn/2022-01/04/c_1642894606364259

    Cyberspace Administration of China and Ministry of Industry and In- formation Technology and Ministry of Public Security and State Admin- istration for Market Regulation, Provisions on the administration of al- gorithmic recommendations of internet information services (CAC order no. 9),http://www.cac.gov.cn/2022-01/04/c_1642894606364259. htm, effective 2...

  10. [10]

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, M. S. Bern- stein, Generative agents: Interactive simulacra of human behavior, in: 35 Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023

  11. [11]

    J. J. Horton, A. Filippas, B. S. Manning, Large language models as simulated economic agents: What can we learn from Homo Silicus?, NBER Working Paper 31122, National Bureau of Economic Research (2023).doi:10.3386/w31122

  12. [12]

    L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, D. Wingate, Out of one, many: Using language models to simu- late human samples, Political Analysis 31 (3) (2023) 337–351.doi: 10.1017/pan.2023.2

  13. [13]

    Nature Human Behaviour9(8), 1645–1653 (2025)

    F. Salvi, M. Horta Ribeiro, R. Gallotti, R. West, On the conversational persuasiveness of GPT-4, Nature Human Behaviour 9 (2025) 1645–1653. doi:10.1038/s41562-025-02194-6

  14. [14]

    Commercial Persuasion in AI-Mediated Conversations

    F. Salvi, A. Cuevas, M. Horta Ribeiro, Commercial persuasion in AI-mediated conversations, arXiv preprint arXiv:2604.04263 (2026). arXiv:2604.04263

  15. [15]

    https://arxiv.org/abs/2510.25779 (2025)

    G. Bansal, W. Hua, Z. Huang, A. Fourney, A. Swearngin, W. Epperson, T. Payne, J. M. Hofman, B. Lucier, C. Singh, M. Mobius, A. Nambi, A. Yadav, K. Gao, D. M. Rothschild, A. Slivkins, D. G. Goldstein, H. Mozannar, N. Immorlica, M. Murad, M. Vogel, S. Kambhampati, E. Horvitz, S. Amershi, Magentic marketplace: An open-source envi- ronment for studying agenti...

  16. [16]

    Sandvig, K

    C. Sandvig, K. Hamilton, K. Karahalios, C. Langbort, Auditing algo- rithms: Research methods for detecting discrimination on internet plat- forms, in: 64th Annual Meeting of the International Communication Association (ICA), Data and Discrimination Preconference, 2014

  17. [17]

    Hannak, P

    A. Hannak, P. Sapiezynski, A. Molavi Kakhki, B. Krishnamurthy, D. Lazer, A. Mislove, C. Wilson, Measuring personalization of web search, in: Proceedings of the 22nd International Conference on World Wide Web (WWW), ACM, 2013, pp. 527–538.doi:10.1145/2488388. 2488435. 36

  18. [18]

    Metaxa, J

    D. Metaxa, J. S. Park, R. E. Robertson, K. Karahalios, C. Wil- son, J. Hancock, C. Sandvig, Auditing algorithms: Understanding al- gorithmic systems from the outside in, Foundations and Trends in Human-Computer Interaction 14 (4) (2021) 272–344.doi:10.1561/ 1100000083

  19. [19]

    W. G. Kim, S. Pillai, K. Haldorai, W. Ahmad, Dark patterns used by online travel agency websites, Annals of Tourism Research 88 (2021) 103055.doi:10.1016/j.annals.2020.103055

  20. [20]

    E. Kran, H. M. Nguyen, A. Kundu, S. Jawhar, J. Park, M. M. Ju- rewicz, DarkBench: Benchmarking dark patterns in large language mod- els, in: International Conference on Learning Representations (ICLR), Oral, 2025.arXiv:2503.10728

  21. [21]

    H. Wu, B. Mitra, C. Ma, F. Diaz, X. Liu, Joint multisided exposure fairness for recommendation, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2022, pp. 703–714.doi:10.1145/3477495. 3532007

  22. [22]

    Federal Trade Commission, Bringing dark patterns to light: FTC staff report,https://www.ftc.gov/reports/ bringing-dark-patterns-light, p214800 (September 2022)

  23. [23]

    Reviglio, M

    U. Reviglio, M. Fabbri, The regulation of recommender systems under the DSA: A transition from default to multiple and dynamic controls?, DSA Observatory Policy Analysis (November 2024)

  24. [24]

    J.Xie, K.Zhang, J.Chen, T.Zhu, R.Lou, Y.Tian, Y.Xiao, Y.Su, Trav- elPlanner: A benchmark for real-world planning with language agents, in: Proceedings of the 41st International Conference on Machine Learn- ing (ICML), Spotlight, 2024.arXiv:2402.01622

  25. [25]

    Tripcraft: A benchmark for spatio-temporally fine grained travel planning.arXiv preprint arXiv:2502.20508,

    S. Chaudhuri, P. Purkar, R. Raghav, S. Mallick, M. Gupta, A. Jana, S. Ghosh, TripCraft: A benchmark for spatio-temporally fine grained travel planning, arXiv preprint arXiv:2502.20508 (2025).arXiv:2502. 20508. 37

  26. [26]

    Y. Qu, H. Xiao, F. Li, G. Li, H. Zhou, X. Dai, X. Dai, TripScore: Bench- marking and rewarding real-world travel planning with fine-grained eval- uation, arXiv preprint arXiv:2510.09011 (2025).arXiv:2510.09011

  27. [27]

    Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

    X. Cheng, Y. Hu, X. Zhang, L. Xu, L. Tan, Z. Pan, X. Li, Y. Liu, Beyond itinerary planning—a real-world benchmark for multi-turn and tool-using travel tasks, arXiv preprint arXiv:2512.22673 (2025).arXiv: 2512.22673

  28. [28]

    D. Yang, et al., Wide-horizon thinking and simulation-based evaluation forreal-worldLLMplanningwithmultifacetedconstraints, in: Advances in Neural Information Processing Systems (NeurIPS), 2025.arXiv: 2506.12421

  29. [29]

    A. Chen, X. Ge, Z. Fu, Y. Xiao, J. Chen, TravelAgent: An AI assis- tant for personalized travel planning, arXiv preprint arXiv:2409.08069 (2024)

  30. [30]

    Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, C. Wang, AutoGen: Enabling next-gen LLM applications via multi-agent conver- sations, in: Conference on Language Modeling (COLM), 2024.arXiv: 2308.08155

  31. [31]

    G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, B. Ghanem, CAMEL:Communicativeagentsfor“mind” explorationoflargelanguage model society, in: Advances in Neural Information Processing Systems (NeurIPS), 2023

  32. [32]

    S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, J. Schmidhuber, MetaGPT: Meta programming for a multi-agent col- laborative framework, in: International Conference on Learning Repre- sentations (ICLR), Oral, 2024.arXiv:2308.00352

  33. [33]

    W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, H. Yu, Y. Lu, Y.-H. Hung, C. Qian, et al., AgentVerse: Facilitating multi-agent collab- oration and exploring emergent behaviors, in: International Conference on Learning Representations (ICLR), 2024. 38

  34. [34]

    X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L.-P. Morency, Y. Bisk, D. Fried, G. Neubig, M. Sap, SOTOPIA: Interactive evaluation for social intelligence in language agents, in: International Conference on Learning Representations (ICLR), Spotlight, 2024

  35. [35]

    Piatti, Z

    G. Piatti, Z. Jin, M. Kleiman-Weiner, B. Schölkopf, M. Sachan, R. Mi- halcea, Cooperate or collapse: Emergence of sustainable cooperation in a society of LLM agents, in: Advances in Neural Information Processing Systems (NeurIPS), 2024

  36. [36]

    Abdelnabi, A

    S. Abdelnabi, A. Gomaa, S. Sivaprasad, L. Schönherr, M. Fritz, Coop- eration, competition, and maliciousness: LLM-stakeholders interactive negotiation, in: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024

  37. [37]

    N. Li, C. Gao, M. Li, Y. Li, Q. Liao, EconAgent: Large language model- empowered agents for simulating macroeconomic activities, in: Proceed- ings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

  38. [38]

    Q. Zhao, J. Wang, Y. Zhang, Y. Jin, K. Zhu, H. Chen, X. Xie, Com- peteAI: Understanding the competition dynamics of large language model-based agents, in: Proceedings of the 41st International Confer- ence on Machine Learning (ICML), Oral, 2024

  39. [39]

    L. Wang, J. Zhang, H. Yang, Z. Chen, J. Tang, Z. Zhang, X. Chen, Y. Lin, R. Song, W. X. Zhao, J. Xu, Z. Dou, J. Wang, J.-R. Wen, User behavior simulation with large language model-based agents for recom- mender systems, ACM Transactions on Information Systems (TOIS) (2025)

  40. [40]

    G. V. Aher, R. I. Arriaga, A. T. Kalai, Using large language models to simulate multiple humans and replicate human subject studies, in: Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

  41. [41]

    X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al., AgentBench: Evaluating LLMs as agents, in: Interna- tional Conference on Learning Representations (ICLR), 2024. 39

  42. [42]

    C. Ma, J. Zhang, Z. Zhu, C. Yang, Y. Yang, Y. Jin, Z. Lan, L. Kong, J. He, AgentBoard: An analytical evaluation board of multi-turn LLM agents, in: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024

  43. [43]

    Towards Understanding Sycophancy in Language Models

    M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. John- ston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, E. Perez, Towards understanding syco- phancy in language models, in: International Conference on Learning Representations (ICLR), 2024.arX...

  44. [44]

    Richard G¨ollner, Rebecca Lazarides, and Philipp Stark

    A. Fanous, J. Goldberg, A. Agarwal, J. Lin, A. Zhou, S. Xu, V. Bikia, R. Daneshjou, S. Koyejo, SycEval: Evaluating LLM sycophancy, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), Vol. 8, 2025, pp. 893–900.arXiv:2502.08177,doi:10.1609/ aies.v8i1.36598

  45. [45]

    Durmus, L

    E. Durmus, L. Lovitt, A. Tamkin, S. Ritchie, J. Clark, D. Gan- guli, Measuring the persuasiveness of language models,https:// www.anthropic.com/news/measuring-model-persuasiveness(April 2024)

  46. [46]

    Hackenburg, H

    K. Hackenburg, H. Margetts, Evaluating the persuasive influence of po- litical microtargeting with large language models, Proceedings of the National Academy of Sciences (PNAS) 121 (24) (2024) e2403116121. doi:10.1073/pnas.2403116121

  47. [47]

    S.C.Matz, J.D.Teeny, S.S.Vaid, H.Peters, G.M.Harari, M.Cerf, The potential of generative AI for personalized persuasion at scale, Scientific Reports 14 (2024) 4692.doi:10.1038/s41598-024-53755-0

  48. [48]

    H. Bai, J. G. Voelkel, S. Muldowney, J. C. Eichstaedt, R. Willer, LLM- generated messages can persuade humans on policy issues, Nature Com- munications 16 (2025) 6037.doi:10.1038/s41467-025-61345-5

  49. [49]

    Persuasion with Large Language Models: A Survey of Empirical Evidence, Study Methodologies, and Ethical Implications

    A. Rogiers, S. Noels, M. Buyl, T. De Bie, Persuasion with large language models: A survey, arXiv preprint arXiv:2411.06837 (2024)

  50. [50]

    Di Geronimo, L

    L. Di Geronimo, L. Braz, E. Fregnan, F. Palomba, A. Bacchelli, UI dark patterns and where to find them: A study on mobile applications and 40 user perception, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI), 2020.doi:10.1145/3313831. 3376600

  51. [51]

    Luguri, L

    J. Luguri, L. J. Strahilevitz, Shining a light on dark patterns, Journal of Legal Analysis 13 (1) (2021) 43–109.doi:10.1093/jla/laaa006

  52. [52]

    Ersoy, B

    D. Ersoy, B. Lee, A. Shreekumar, A. Arunasalam, M. Ibrahim, A. Bianchi, Z. B. Celik, Investigating the impact of dark patterns on LLM-based web agents, in: Proceedings of the 47th IEEE Symposium on Security and Privacy (S&P), 2026, to appear.arXiv:2510.18113

  53. [53]

    G. K. Patro, A. Biswas, N. Ganguly, K. P. Gummadi, A. Chakraborty, FairRec: Two-sided fairness for personalized recommendations in two- sided platforms, in: Proceedings of The Web Conference 2020 (WWW), 2020.doi:10.1145/3366423.3380196

  54. [54]

    Duetting, V

    P. Duetting, V. Mirrokni, R. Paes Leme, H. Xu, S. Zuo, Mechanism design for large language models, in: Proceedings of the ACM Web Conference 2024 (WWW), Best Paper, 2024.doi:10.1145/3589334. 3645511

  55. [55]

    Truthful aggregation of llms with an application to online advertising.arXiv preprint arXiv:2405.05905, 2024

    E. Soumalias, M. J. Curry, S. Seuken, Truthful aggregation of LLMs with an application to online advertising, in: International Conference on Learning Representations (ICLR), 2025.arXiv:2405.05905

  56. [56]

    K. A. Dubey, Z. Feng, R. Kidambi, A. Mehta, D. Wang, Auctions with LLM summaries, in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2024, pp. 713– 722.doi:10.1145/3637528.3672022

  57. [57]

    S. Fish, Y. A. Gonczarowski, R. I. Shorrer, Algorithmic collusion by large language models, arXiv preprint arXiv:2404.00806 (2024)

  58. [58]

    Ben-Shahar, C

    O. Ben-Shahar, C. E. Schneider, More Than You Wanted to Know: The Failure of Mandated Disclosure, Princeton University Press, 2014

  59. [59]

    Loewenstein, C

    G. Loewenstein, C. R. Sunstein, R. Golman, Disclosure: Psychology changes everything, Annual Review of Economics 6 (2014) 391–419. doi:10.1146/annurev-economics-080213-041341. 41