arxiv: 2605.10440 · v1 · submitted 2026-05-11 · 💻 cs.CY

Recognition: 3 theorem links

· Lean Theorem

TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents

Yao Liu

Authors on Pith no claims yet

Pith reviewed 2026-05-12 04:52 UTC · model grok-4.3

classification 💻 cs.CY

keywords commission steeringLLM auditonline travel agentsconversational agentsparametric instrumentcounterfactual promptgovernance audit

0 comments

The pith

TourMart measures commission steering in LLM travel agents via paired prompts that hold traveler and bundle fixed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TourMart as an audit instrument for detecting whether LLM-based online travel agents steer recommendations toward higher-commission suppliers. It creates a paired counterfactual by comparing a commission-aware prompt against a minimum-disclosure factual template, controlled by two parameters: lambda for the gain on perceived acceptance and kappa for the normalized cap on welfare shift. A six-gate producer audit separates technical failures such as template collapse or refusal from actual commercial steering. Tests at deployed settings show positive steering deltas that reach statistical significance for two open models. This matters because current disclosure and safety tools were designed for older ranked-list interfaces and do not capture steering inside single-sentence prose advice.

Core claim

TourMart drives a lambda-kappa modulated paired counterfactual between commission-aware and minimum-disclosure prompts, applies a symmetric six-gate audit to isolate genuine steering, and reports concrete deltas such as +7.69pp for a Qwen-14B reader and +3.50pp for a Llama-3.1-8B reader at lambda=1, kappa=0.05, with both passing family-wise correction across the parameter grid.

What carries the argument

The lambda-kappa parametric paired counterfactual that generates a commission-aware prompt and a minimum-disclosure factual template to read off the steering delta while a six-gate producer audit filters engineering artifacts.

Load-bearing premise

Differences between the two prompts reflect only the model's response to commission information rather than unrelated changes in phrasing or refusal behavior.

What would settle it

Re-running the exact paired prompts on a model with no commission incentives and finding no measurable difference in recommendation rates.

Figures

Figures reproduced from arXiv: 2605.10440 by Yao Liu.

**Figure 2.** Figure 2: TourMart audit procedure: paired counterfactual generation under fixed traveler [PITH_FULL_IMAGE:figures/full_fig_p012_2.png] view at source ↗

**Figure 1.** Figure 1: Cross-family behavioral steering phase diagram [PITH_FULL_IMAGE:figures/full_fig_p018_1.png] view at source ↗

**Figure 3.** Figure 3: Behavioral heatmap: paired RD across the [PITH_FULL_IMAGE:figures/full_fig_p018_3.png] view at source ↗

**Figure 2.** Figure 2: Coefficient attribution fit_delta is the load-bearing channel (dotted lines = full-rule baseline; bars show max RD over 2D grid) Qwen-14B-AWQ Llama-3.1-8B [PITH_FULL_IMAGE:figures/full_fig_p020_2.png] view at source ↗

**Figure 4.** Figure 4: Coefficient-zero attribution: max RD after setting each of the four rule coeffi [PITH_FULL_IMAGE:figures/full_fig_p020_4.png] view at source ↗

**Figure 3.** Figure 3: Evidence trajectory across round-21 scale-up stages [PITH_FULL_IMAGE:figures/full_fig_p021_3.png] view at source ↗

**Figure 5.** Figure 5: Evidence trajectory across v1 (n = 122, feature-only), v2 (n = 15), v3 (n = 48), v4 (n = 143). The effect size stabilizes at ∼ 10pp Qwen / ∼ 8pp Llama as sample size grows, with discord counts scaling roughly linearly. +0.873). This is a descriptive backbone difference, not a causal explanation of transmission. But it matters for governance: a minimum-disclosure template is not a perceptual zero-point; its… view at source ↗

read the original abstract

Online travel agents (Booking, Trip.com, Expedia) have replaced ranked-list interfaces with conversational LLM agents that compress many options into one sentence of advice. Each booking earns the OTA commission and different suppliers pay different rates: the agent has a structural incentive to favor higher-margin recommendations. Whether any deployed agent does this, and by how much, no one can currently measure. Disclosure banners, conversion A/B testing, UI dark-pattern taxonomies, and generic LLM safety scores were built for older interfaces and miss the prose-recommendation surface where the steering happens. We propose TourMart, an applied intelligent-system audit instrument for LLM-OTA commission governance. Two governance levers -- lambda (gain on message-induced perception in the traveler's accept/reject decision) and kappa (budget-normalized cap on how far the message can shift perceived welfare) -- drive a paired counterfactual: holding the traveler and bundle fixed, the steering delta is read off between a commission-aware prompt and a minimum-disclosure factual template. A symmetric six-gate producer audit separates LLM-engineering failures (template collapse, refusal, internal-ID leakage) from genuine commercial steering. At deployed (lambda=1, kappa=0.05), a Qwen-14B reader shows +7.69pp steering (exact McNemar p=0.003); a Llama-3.1-8B reader shows +3.50pp in the same direction at n=143, with an extended-n supplement (n=270) confirming significance (+2.96pp, p=0.008). Across the (lambda, kappa) grid both arms pass family-wise scenario-clustered correction (p<0.001 / p=0.008). TourMart outputs a sentence a compliance report can quote: "at this deployment, 7.7 extra commission-steered recommendations per 100 paired traveler sessions."

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces TourMart, a parametric audit instrument for measuring commission steering in LLM-based online travel agents. It defines two levers (lambda for gain on message-induced perception and kappa for budget-normalized cap on welfare shift) to drive a paired counterfactual: holding traveler and bundle fixed, the steering delta is computed between a commission-aware prompt and a minimum-disclosure factual template. A six-gate producer audit separates engineering failures (template collapse, refusal, leakage) from genuine steering. At deployed parameters (lambda=1, kappa=0.05), it reports +7.69pp steering for Qwen-14B (McNemar p=0.003) and +2.96pp for Llama-3.1-8B in an extended n=270 sample (p=0.008), with both passing family-wise scenario-clustered correction across the (lambda, kappa) grid.

Significance. If the empirical isolation of commission awareness holds, TourMart supplies a concrete, compliance-reportable metric ('7.7 extra commission-steered recommendations per 100 paired sessions') for a previously unmeasurable surface in deployed LLM-OTAs. The parametric levers, symmetric six-gate audit, and use of exact McNemar tests with correction constitute a clear methodological advance over generic safety scores or UI taxonomies. The approach is falsifiable and directly applicable to governance.

major comments (2)

[Abstract and methods (paired counterfactual design)] Abstract and methods description of the paired counterfactual: the commission-aware prompt and minimum-disclosure factual template necessarily differ in length, specificity, directive language, and information density in addition to the commission signal. No ablation is reported that holds prompt style and structure constant while varying only the presence/absence of commission details; therefore the reported deltas (+7.69pp Qwen-14B; +2.96pp extended Llama) cannot yet be attributed unambiguously to internalized commercial incentives rather than model sensitivity to framing.
[Results (grid and sample reporting)] Results section on the (lambda, kappa) grid and sample sizes: the manuscript reports n=143 and extended n=270 with family-wise correction but does not specify exact data exclusion rules, how refusals or template collapses from the six-gate audit are filtered, or whether the extended sample was pre-specified. This information is load-bearing for interpreting the McNemar p-values and the claim that both arms pass correction (p<0.001 / p=0.008).

minor comments (3)

[Methods (parameter definitions)] The definitions of lambda and kappa are introduced parametrically but would benefit from an explicit equation showing how they modulate the accept/reject decision in the traveler model.
[Methods (six-gate audit)] The six-gate audit is described at a high level; including the exact decision criteria or decision tree for each gate in an appendix would improve reproducibility.
[Results (grid presentation)] Table or figure presenting the full (lambda, kappa) grid results should include per-cell sample sizes and exact p-values before and after correction for transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of our paired counterfactual design and reporting practices. We address each major comment in detail below, proposing revisions to strengthen the manuscript where appropriate.

read point-by-point responses

Referee: [Abstract and methods (paired counterfactual design)] Abstract and methods description of the paired counterfactual: the commission-aware prompt and minimum-disclosure factual template necessarily differ in length, specificity, directive language, and information density in addition to the commission signal. No ablation is reported that holds prompt style and structure constant while varying only the presence/absence of commission details; therefore the reported deltas (+7.69pp Qwen-14B; +2.96pp extended Llama) cannot yet be attributed unambiguously to internalized commercial incentives rather than model sensitivity to framing.

Authors: We appreciate this observation on the paired design. The minimum-disclosure factual template was deliberately constructed to serve as a neutral baseline that omits any reference to commissions or commercial incentives, while the commission-aware prompt incorporates the steering signal within a realistic agent context. This difference in content is inherent to testing commission awareness, as the control must lack the incentive information. We acknowledge that variations in length, specificity, and directive language could contribute to the observed deltas, and that a style-matched ablation would provide stronger causal isolation. In the revised manuscript, we will expand the methods section to explicitly discuss this design choice and its potential limitations, including a note that future work could implement prompt-style ablations. The six-gate audit mitigates some framing effects by excluding engineering failures, but we agree this does not fully address sensitivity to non-commission framing differences. revision: partial
Referee: [Results (grid and sample reporting)] Results section on the (lambda, kappa) grid and sample sizes: the manuscript reports n=143 and extended n=270 with family-wise correction but does not specify exact data exclusion rules, how refusals or template collapses from the six-gate audit are filtered, or whether the extended sample was pre-specified. This information is load-bearing for interpreting the McNemar p-values and the claim that both arms pass correction (p<0.001 / p=0.008).

Authors: We agree that detailed reporting of sample construction is essential for reproducibility and interpretation of the statistical results. In the revised version, we will add a dedicated subsection in the results or methods detailing the data exclusion rules: specifically, only prompt pairs where both the commission-aware and minimum-disclosure responses pass all six gates of the producer audit (no template collapse, no refusal, no internal-ID leakage, and successful parsing) are retained for analysis. We will report the exact number of pairs excluded at each gate for the primary n=143 sample and the extended n=270 sample. The extended sample was collected post-hoc to increase statistical power after observing the initial results; we will explicitly state this and present it as an exploratory extension rather than a pre-specified analysis. These additions will clarify how the McNemar tests and family-wise corrections were applied to the filtered data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical deltas are direct measurements

full rationale

The paper's core results consist of observed steering percentages (+7.69pp for Qwen-14B, +2.96pp for Llama-3.1-8B) obtained by direct comparison of LLM outputs under two fixed prompt templates while holding traveler and bundle constant. These quantities are computed as simple empirical differences and do not reduce via any equations, fitted parameters, or self-referential definitions to quantities defined in terms of themselves. Lambda and kappa parameterize the audit instrument but function as experimental controls rather than inputs from which the reported deltas are algebraically derived. No self-citation chains, uniqueness theorems, or ansatz smuggling appear in the load-bearing claims. The derivation chain is therefore self-contained as an applied measurement protocol.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The central claim rests on two tunable governance levers treated as free parameters and one domain assumption about the validity of the fixed-traveler counterfactual.

free parameters (2)

lambda
Gain on message-induced perception in the traveler's accept/reject decision; set to the deployed value of 1 in the reported experiments.
kappa
Budget-normalized cap on how far the message can shift perceived welfare; set to the deployed value of 0.05 in the reported experiments.

axioms (1)

domain assumption Holding the traveler and bundle fixed isolates the steering delta between commission-aware and minimum-disclosure prompts.
Invoked to justify reading the steering effect directly from the paired prompt difference.

pith-pipeline@v0.9.0 · 5650 in / 1393 out tokens · 60031 ms · 2026-05-12T04:52:56.855945+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

acc(ϕ, ut, pβ, bt, τt;λ, κ) = ⊮[(ut(β)−pβ) + clip(λ⃗c·ϕ·bt,[±κbt]) ≥ τt bt]
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

paired counterfactual replay... commission-aware OTA prompt and a minimum-disclosure factual template
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

scenario-clustered grid max-stat permutation... exact McNemar p=0.003

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 6 internal anchors

[1]

gov/current/title-16/chapter-I/subchapter-B/part-255, final rule effective July 26, 2023; 88 FR 48092 (July 2023)

Federal Trade Commission, Guides concerning the use of endorsements and testimonials in advertising, 16 CFR part 255,https://www.ecfr. gov/current/title-16/chapter-I/subchapter-B/part-255, final rule effective July 26, 2023; 88 FR 48092 (July 2023)

work page 2023
[2]

European Parliament and Council, Regulation (EU) 2022/2065 on a single market for digital services (digital services act), Official Journal of the European Union L 277/1 (October 2022). 34

work page 2022
[3]

Aridor, D

G. Aridor, D. Gonçalves, Recommenders’ originals: The welfare effects of the dual role of platforms as producers and recommender systems, International Journal of Industrial Organization 83 (2022) 102845.doi: 10.1016/j.ijindorg.2022.102845

work page doi:10.1016/j.ijindorg.2022.102845 2022
[4]

C. M. Gray, Y. Kou, B. Battles, J. Hoggatt, A. L. Toombs, The dark (patterns) side of UX design, in: Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (CHI), 2018. doi:10.1145/3173574.3174108

work page doi:10.1145/3173574.3174108 2018
[5]

C. M. Gray, C. T. Santos, N. Bielova, T. Mildner, An ontology of dark patterns knowledge: Foundations, definitions, and a pathway for shared knowledge-building, in: Proceedings of the 2024 CHI Confer- ence on Human Factors in Computing Systems (CHI), 2024.doi: 10.1145/3613904.3642436

work page doi:10.1145/3613904.3642436 2024
[6]

Mathur, G

A. Mathur, G. Acar, M. J. Friedman, E. Lucherini, J. Mayer, M. Chetty, A. Narayanan, Dark patterns at scale: Findings from a crawl of 11k shopping websites, Proceedings of the ACM on Human-Computer In- teraction 3 (CSCW) (2019).doi:10.1145/3359183

work page doi:10.1145/3359183 2019
[7]

European Parliament and Council, Directive 2014/65/EU (mifid ii), ar- ticle 27: Obligation to execute orders on terms most favourable to the client, Official Journal of the European Union L 173/349 (2014)

work page 2014
[8]

European Parliament and Council, Regulation (EU) 2019/1150 of the european parliament and of the council of 20 june 2019 on promoting fairness and transparency for business users of online intermediation services, Official Journal of the European Union L 186/57 (July 2019)

work page 2019
[9]

9),http://www.cac.gov.cn/2022-01/04/c_1642894606364259

Cyberspace Administration of China and Ministry of Industry and In- formation Technology and Ministry of Public Security and State Admin- istration for Market Regulation, Provisions on the administration of al- gorithmic recommendations of internet information services (CAC order no. 9),http://www.cac.gov.cn/2022-01/04/c_1642894606364259. htm, effective 2...

work page 2022
[10]

J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, M. S. Bern- stein, Generative agents: Interactive simulacra of human behavior, in: 35 Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), 2023

work page 2023
[11]

J. J. Horton, A. Filippas, B. S. Manning, Large language models as simulated economic agents: What can we learn from Homo Silicus?, NBER Working Paper 31122, National Bureau of Economic Research (2023).doi:10.3386/w31122

work page doi:10.3386/w31122 2023
[12]

L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, D. Wingate, Out of one, many: Using language models to simu- late human samples, Political Analysis 31 (3) (2023) 337–351.doi: 10.1017/pan.2023.2

work page doi:10.1017/pan.2023.2 2023
[13]

Nature Human Behaviour9(8), 1645–1653 (2025)

F. Salvi, M. Horta Ribeiro, R. Gallotti, R. West, On the conversational persuasiveness of GPT-4, Nature Human Behaviour 9 (2025) 1645–1653. doi:10.1038/s41562-025-02194-6

work page doi:10.1038/s41562-025-02194-6 2025
[14]

Commercial Persuasion in AI-Mediated Conversations

F. Salvi, A. Cuevas, M. Horta Ribeiro, Commercial persuasion in AI-mediated conversations, arXiv preprint arXiv:2604.04263 (2026). arXiv:2604.04263

work page internal anchor Pith review Pith/arXiv arXiv 2026
[15]

https://arxiv.org/abs/2510.25779 (2025)

G. Bansal, W. Hua, Z. Huang, A. Fourney, A. Swearngin, W. Epperson, T. Payne, J. M. Hofman, B. Lucier, C. Singh, M. Mobius, A. Nambi, A. Yadav, K. Gao, D. M. Rothschild, A. Slivkins, D. G. Goldstein, H. Mozannar, N. Immorlica, M. Murad, M. Vogel, S. Kambhampati, E. Horvitz, S. Amershi, Magentic marketplace: An open-source envi- ronment for studying agenti...

work page arXiv 2025
[16]

Sandvig, K

C. Sandvig, K. Hamilton, K. Karahalios, C. Langbort, Auditing algo- rithms: Research methods for detecting discrimination on internet plat- forms, in: 64th Annual Meeting of the International Communication Association (ICA), Data and Discrimination Preconference, 2014

work page 2014
[17]

Hannak, P

A. Hannak, P. Sapiezynski, A. Molavi Kakhki, B. Krishnamurthy, D. Lazer, A. Mislove, C. Wilson, Measuring personalization of web search, in: Proceedings of the 22nd International Conference on World Wide Web (WWW), ACM, 2013, pp. 527–538.doi:10.1145/2488388. 2488435. 36

work page doi:10.1145/2488388 2013
[18]

Metaxa, J

D. Metaxa, J. S. Park, R. E. Robertson, K. Karahalios, C. Wil- son, J. Hancock, C. Sandvig, Auditing algorithms: Understanding al- gorithmic systems from the outside in, Foundations and Trends in Human-Computer Interaction 14 (4) (2021) 272–344.doi:10.1561/ 1100000083

work page 2021
[19]

W. G. Kim, S. Pillai, K. Haldorai, W. Ahmad, Dark patterns used by online travel agency websites, Annals of Tourism Research 88 (2021) 103055.doi:10.1016/j.annals.2020.103055

work page doi:10.1016/j.annals.2020.103055 2021
[20]

E. Kran, H. M. Nguyen, A. Kundu, S. Jawhar, J. Park, M. M. Ju- rewicz, DarkBench: Benchmarking dark patterns in large language mod- els, in: International Conference on Learning Representations (ICLR), Oral, 2025.arXiv:2503.10728

work page arXiv 2025
[21]

H. Wu, B. Mitra, C. Ma, F. Diaz, X. Liu, Joint multisided exposure fairness for recommendation, in: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), ACM, 2022, pp. 703–714.doi:10.1145/3477495. 3532007

work page doi:10.1145/3477495 2022
[22]

Federal Trade Commission, Bringing dark patterns to light: FTC staff report,https://www.ftc.gov/reports/ bringing-dark-patterns-light, p214800 (September 2022)

work page 2022
[23]

Reviglio, M

U. Reviglio, M. Fabbri, The regulation of recommender systems under the DSA: A transition from default to multiple and dynamic controls?, DSA Observatory Policy Analysis (November 2024)

work page 2024
[24]

J.Xie, K.Zhang, J.Chen, T.Zhu, R.Lou, Y.Tian, Y.Xiao, Y.Su, Trav- elPlanner: A benchmark for real-world planning with language agents, in: Proceedings of the 41st International Conference on Machine Learn- ing (ICML), Spotlight, 2024.arXiv:2402.01622

work page arXiv 2024
[25]

Tripcraft: A benchmark for spatio-temporally fine grained travel planning.arXiv preprint arXiv:2502.20508,

S. Chaudhuri, P. Purkar, R. Raghav, S. Mallick, M. Gupta, A. Jana, S. Ghosh, TripCraft: A benchmark for spatio-temporally fine grained travel planning, arXiv preprint arXiv:2502.20508 (2025).arXiv:2502. 20508. 37

work page arXiv 2025
[26]

Y. Qu, H. Xiao, F. Li, G. Li, H. Zhou, X. Dai, X. Dai, TripScore: Bench- marking and rewarding real-world travel planning with fine-grained eval- uation, arXiv preprint arXiv:2510.09011 (2025).arXiv:2510.09011

work page arXiv 2025
[27]

Beyond Itinerary Planning-A Real-World Benchmark for Multi-Turn and Tool-Using Travel Tasks

X. Cheng, Y. Hu, X. Zhang, L. Xu, L. Tan, Z. Pan, X. Li, Y. Liu, Beyond itinerary planning—a real-world benchmark for multi-turn and tool-using travel tasks, arXiv preprint arXiv:2512.22673 (2025).arXiv: 2512.22673

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

D. Yang, et al., Wide-horizon thinking and simulation-based evaluation forreal-worldLLMplanningwithmultifacetedconstraints, in: Advances in Neural Information Processing Systems (NeurIPS), 2025.arXiv: 2506.12421

work page arXiv 2025
[29]

A. Chen, X. Ge, Z. Fu, Y. Xiao, J. Chen, TravelAgent: An AI assis- tant for personalized travel planning, arXiv preprint arXiv:2409.08069 (2024)

work page arXiv 2024
[30]

Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, C. Wang, AutoGen: Enabling next-gen LLM applications via multi-agent conver- sations, in: Conference on Language Modeling (COLM), 2024.arXiv: 2308.08155

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, B. Ghanem, CAMEL:Communicativeagentsfor“mind” explorationoflargelanguage model society, in: Advances in Neural Information Processing Systems (NeurIPS), 2023

work page 2023
[32]

S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, C. Zhang, J. Wang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, J. Schmidhuber, MetaGPT: Meta programming for a multi-agent col- laborative framework, in: International Conference on Learning Repre- sentations (ICLR), Oral, 2024.arXiv:2308.00352

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C.-M. Chan, H. Yu, Y. Lu, Y.-H. Hung, C. Qian, et al., AgentVerse: Facilitating multi-agent collab- oration and exploring emergent behaviors, in: International Conference on Learning Representations (ICLR), 2024. 38

work page 2024
[34]

X. Zhou, H. Zhu, L. Mathur, R. Zhang, H. Yu, Z. Qi, L.-P. Morency, Y. Bisk, D. Fried, G. Neubig, M. Sap, SOTOPIA: Interactive evaluation for social intelligence in language agents, in: International Conference on Learning Representations (ICLR), Spotlight, 2024

work page 2024
[35]

Piatti, Z

G. Piatti, Z. Jin, M. Kleiman-Weiner, B. Schölkopf, M. Sachan, R. Mi- halcea, Cooperate or collapse: Emergence of sustainable cooperation in a society of LLM agents, in: Advances in Neural Information Processing Systems (NeurIPS), 2024

work page 2024
[36]

Abdelnabi, A

S. Abdelnabi, A. Gomaa, S. Sivaprasad, L. Schönherr, M. Fritz, Coop- eration, competition, and maliciousness: LLM-stakeholders interactive negotiation, in: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024

work page 2024
[37]

N. Li, C. Gao, M. Li, Y. Li, Q. Liao, EconAgent: Large language model- empowered agents for simulating macroeconomic activities, in: Proceed- ings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL), 2024

work page 2024
[38]

Q. Zhao, J. Wang, Y. Zhang, Y. Jin, K. Zhu, H. Chen, X. Xie, Com- peteAI: Understanding the competition dynamics of large language model-based agents, in: Proceedings of the 41st International Confer- ence on Machine Learning (ICML), Oral, 2024

work page 2024
[39]

L. Wang, J. Zhang, H. Yang, Z. Chen, J. Tang, Z. Zhang, X. Chen, Y. Lin, R. Song, W. X. Zhao, J. Xu, Z. Dou, J. Wang, J.-R. Wen, User behavior simulation with large language model-based agents for recom- mender systems, ACM Transactions on Information Systems (TOIS) (2025)

work page 2025
[40]

G. V. Aher, R. I. Arriaga, A. T. Kalai, Using large language models to simulate multiple humans and replicate human subject studies, in: Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

work page 2023
[41]

X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, et al., AgentBench: Evaluating LLMs as agents, in: Interna- tional Conference on Learning Representations (ICLR), 2024. 39

work page 2024
[42]

C. Ma, J. Zhang, Z. Zhu, C. Yang, Y. Yang, Y. Jin, Z. Lan, L. Kong, J. He, AgentBoard: An analytical evaluation board of multi-turn LLM agents, in: Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2024

work page 2024
[43]

Towards Understanding Sycophancy in Language Models

M. Sharma, M. Tong, T. Korbak, D. Duvenaud, A. Askell, S. R. Bowman, N. Cheng, E. Durmus, Z. Hatfield-Dodds, S. R. John- ston, S. Kravec, T. Maxwell, S. McCandlish, K. Ndousse, O. Rausch, N. Schiefer, D. Yan, M. Zhang, E. Perez, Towards understanding syco- phancy in language models, in: International Conference on Learning Representations (ICLR), 2024.arX...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[44]

Richard G¨ollner, Rebecca Lazarides, and Philipp Stark

A. Fanous, J. Goldberg, A. Agarwal, J. Lin, A. Zhou, S. Xu, V. Bikia, R. Daneshjou, S. Koyejo, SycEval: Evaluating LLM sycophancy, in: Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society (AIES), Vol. 8, 2025, pp. 893–900.arXiv:2502.08177,doi:10.1609/ aies.v8i1.36598

work page arXiv 2025
[45]

Durmus, L

E. Durmus, L. Lovitt, A. Tamkin, S. Ritchie, J. Clark, D. Gan- guli, Measuring the persuasiveness of language models,https:// www.anthropic.com/news/measuring-model-persuasiveness(April 2024)

work page 2024
[46]

Hackenburg, H

K. Hackenburg, H. Margetts, Evaluating the persuasive influence of po- litical microtargeting with large language models, Proceedings of the National Academy of Sciences (PNAS) 121 (24) (2024) e2403116121. doi:10.1073/pnas.2403116121

work page doi:10.1073/pnas.2403116121 2024
[47]

S.C.Matz, J.D.Teeny, S.S.Vaid, H.Peters, G.M.Harari, M.Cerf, The potential of generative AI for personalized persuasion at scale, Scientific Reports 14 (2024) 4692.doi:10.1038/s41598-024-53755-0

work page doi:10.1038/s41598-024-53755-0 2024
[48]

H. Bai, J. G. Voelkel, S. Muldowney, J. C. Eichstaedt, R. Willer, LLM- generated messages can persuade humans on policy issues, Nature Com- munications 16 (2025) 6037.doi:10.1038/s41467-025-61345-5

work page doi:10.1038/s41467-025-61345-5 2025
[49]

Persuasion with Large Language Models: A Survey of Empirical Evidence, Study Methodologies, and Ethical Implications

A. Rogiers, S. Noels, M. Buyl, T. De Bie, Persuasion with large language models: A survey, arXiv preprint arXiv:2411.06837 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[50]

Di Geronimo, L

L. Di Geronimo, L. Braz, E. Fregnan, F. Palomba, A. Bacchelli, UI dark patterns and where to find them: A study on mobile applications and 40 user perception, in: Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (CHI), 2020.doi:10.1145/3313831. 3376600

work page doi:10.1145/3313831 2020
[51]

Luguri, L

J. Luguri, L. J. Strahilevitz, Shining a light on dark patterns, Journal of Legal Analysis 13 (1) (2021) 43–109.doi:10.1093/jla/laaa006

work page doi:10.1093/jla/laaa006 2021
[52]

Ersoy, B

D. Ersoy, B. Lee, A. Shreekumar, A. Arunasalam, M. Ibrahim, A. Bianchi, Z. B. Celik, Investigating the impact of dark patterns on LLM-based web agents, in: Proceedings of the 47th IEEE Symposium on Security and Privacy (S&P), 2026, to appear.arXiv:2510.18113

work page arXiv 2026
[53]

G. K. Patro, A. Biswas, N. Ganguly, K. P. Gummadi, A. Chakraborty, FairRec: Two-sided fairness for personalized recommendations in two- sided platforms, in: Proceedings of The Web Conference 2020 (WWW), 2020.doi:10.1145/3366423.3380196

work page doi:10.1145/3366423.3380196 2020
[54]

Duetting, V

P. Duetting, V. Mirrokni, R. Paes Leme, H. Xu, S. Zuo, Mechanism design for large language models, in: Proceedings of the ACM Web Conference 2024 (WWW), Best Paper, 2024.doi:10.1145/3589334. 3645511

work page doi:10.1145/3589334 2024
[55]

Truthful aggregation of llms with an application to online advertising.arXiv preprint arXiv:2405.05905, 2024

E. Soumalias, M. J. Curry, S. Seuken, Truthful aggregation of LLMs with an application to online advertising, in: International Conference on Learning Representations (ICLR), 2025.arXiv:2405.05905

work page arXiv 2025
[56]

K. A. Dubey, Z. Feng, R. Kidambi, A. Mehta, D. Wang, Auctions with LLM summaries, in: Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2024, pp. 713– 722.doi:10.1145/3637528.3672022

work page doi:10.1145/3637528.3672022 2024
[57]

S. Fish, Y. A. Gonczarowski, R. I. Shorrer, Algorithmic collusion by large language models, arXiv preprint arXiv:2404.00806 (2024)

work page arXiv 2024
[58]

Ben-Shahar, C

O. Ben-Shahar, C. E. Schneider, More Than You Wanted to Know: The Failure of Mandated Disclosure, Princeton University Press, 2014

work page 2014
[59]

Loewenstein, C

G. Loewenstein, C. R. Sunstein, R. Golman, Disclosure: Psychology changes everything, Annual Review of Economics 6 (2014) 391–419. doi:10.1146/annurev-economics-080213-041341. 41

work page doi:10.1146/annurev-economics-080213-041341 2014