arXiv: 2606.11217 · v1 · pith:HK3I5QUY · submitted 2026-05-03 · cs.CY · cs.AI · cs.HC

Preregistration for Experiments with AI Agents

Pith reviewed 2026-07-01 00:12 UTC · model grok-4.3

keywords: preregistration · AI agents · behavioral experiments · research methodology · reproducibility · LLMs · credibility

The pith

Preregistration should be extended to experiments with AI agents to reduce hidden researcher flexibility.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that experiments using AI agents face methodological vulnerabilities similar to those in human subjects research, including choices over which model to use, how to word prompts, and whether to redesign based on early results. These choices are especially easy to exploit and hard to detect because iteration is low-cost and reporting norms are absent. Extending preregistration requires researchers to commit in advance to models, prompts, settings, and analysis plans, which would make results more trustworthy. A template is offered to make this practical for AI agent studies. If true, this would increase the reliability of findings about how AI agents behave in social and decision-making contexts.

Core claim

Experiments with AI agents introduce researcher degrees of freedom such as model selection, prompt wording, settings, and outcome-contingent redesign; the low cost of iteration and lack of reporting norms make these choices easy to exploit and difficult to detect; therefore preregistration practices from human subjects research should be extended to this domain through a tailored template to improve credibility.

What carries the argument

A preregistration template tailored to AI agent experiments that requires advance specification of model choice, prompt details, experimental settings, and analysis plans.
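
As a rough illustration of what advance specification could look like in practice, the sketch below freezes a plan in a small record and derives a SHA-256 fingerprint that could be deposited before any data are collected. The field names, the example values, and the hashing step are assumptions made for illustration; they are not taken from the paper's template.

```python
import hashlib
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Preregistration:
    """Hypothetical record; field names are illustrative, not the paper's template."""
    model: str            # exact model identifier, including version
    system_prompt: str    # verbatim prompt text
    temperature: float    # sampling setting
    n_runs: int           # planned number of runs
    exclusion_rule: str   # how unparseable responses are handled
    analysis_plan: str    # confirmatory test, stated in advance


def fingerprint(plan: Preregistration) -> str:
    """Return a SHA-256 digest of the plan's canonical JSON form."""
    canonical = json.dumps(asdict(plan), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


plan = Preregistration(
    model="example-model-v1",  # placeholder name, not a real model
    system_prompt="You are a participant in a pricing study.",
    temperature=0.7,
    n_runs=200,
    exclusion_rule="drop responses with no numeric answer",
    analysis_plan="two-sided t-test on the mean estimate, alpha = 0.05",
)
digest = fingerprint(plan)

# Changing any committed field after the fact yields a different digest.
changed = fingerprint(Preregistration(**{**asdict(plan), "temperature": 0.0}))
print(digest[:16], digest != changed)
```

Serializing with sorted keys keeps the digest stable across runs, so a later reader could recompute it from the published plan and compare it with the deposited value.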

If this is right

  • Journals and conferences would require preregistration for papers involving AI agent experiments.
  • Researchers would need to declare model and prompt choices before collecting results.
  • Outcome-contingent redesigns would be labeled as exploratory rather than planned.
  • Funding agencies could condition support on use of the preregistration template.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption might prompt creation of logging tools that automatically record AI experiment parameters.
  • The approach could reveal whether AI-specific issues like prompt sensitivity require additional safeguards beyond standard preregistration.
  • Success here might encourage similar registration norms in other low-cost computational behavioral studies.
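
A logging tool of the kind imagined in the first bullet might be as small as the sketch below, which appends each run's parameters to a JSON Lines file. The function names, the file layout, and the example parameters are hypothetical; no existing tool is being described.

```python
import json
import tempfile
import time
from pathlib import Path


def log_run(path: Path, **params) -> dict:
    """Append one run's parameters to a JSON Lines file and return the record."""
    record = {"logged_at": time.time(), **params}
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def read_log(path: Path) -> list[dict]:
    """Load every logged run, oldest first."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


# Usage: two runs with placeholder parameters, written to a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "runs.jsonl"
    log_run(log_path, model="example-model-v1", temperature=0.7, prompt_id="p1")
    log_run(log_path, model="example-model-v1", temperature=0.0, prompt_id="p2")
    runs = read_log(log_path)

print(len(runs))
```

An append-only file is used here so that earlier runs are never overwritten, which is the property an audit of outcome-contingent redesign would need.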

Load-bearing premise

The methodological problems in AI agent experiments are similar enough to those in human studies that preregistration will address them effectively.

What would settle it

An audit comparing published AI agent studies that finds equivalent rates of selective reporting or post-hoc changes in both preregistered and non-preregistered work would undermine the claim.

Figures

Figures reproduced from arXiv: 2606.11217 by Michelle Vaccaro.

Figure 1: Garden of forking paths (Gelman & Loken, 2013) in experiments with AI agents. In experiments with AI agents, a single research question branches into a combinatorial space of experimental configurations as researchers make choices about model specifications, prompt design, sampling parameters, parsing and exclusion rules, and statistical analysis procedures. A preregistered confirmatory study follows a sin…

Figure 2: Cost–flexibility tradeoff across research paradigms. The risk of specification search depends jointly on two factors: the marginal cost of running additional experimental configurations (x-axis) and the number of defensible “forks” in the experimental pipeline (y-axis). Human subjects behavioral experiments (blue) involve substantial analytic flexibility but high marginal costs: participant recruitment, com…

Figure 3: Specification curve for anchoring effects across 2,430 experimental configurations. Each point represents the anchoring index from a unique combination of model family (9 models across Gemini, OpenAI, and Llama families), system prompt (human-like, incentivized, none), anchor distance (high, medium, low), delivery method (comparative, embedded, incidental), question content (Q1–Q5), and outlier handling (i…
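
The configuration count in the Figure 3 caption can be checked with a short enumeration. The listed factors multiply to 9 × 3 × 3 × 3 × 5 = 1,215, so a total of 2,430 implies two outlier-handling options. That is an inference from the arithmetic: the caption is cut off before naming them, so the two labels below are placeholders, as are the model names.

```python
from itertools import product

factors = {
    "model": [f"model_{i}" for i in range(1, 10)],  # 9 models; placeholder names
    "system_prompt": ["human-like", "incentivized", "none"],
    "anchor_distance": ["high", "medium", "low"],
    "delivery_method": ["comparative", "embedded", "incidental"],
    "question": ["Q1", "Q2", "Q3", "Q4", "Q5"],
    # The caption is truncated at this factor. Two levels are inferred from
    # the stated total; the labels are placeholders, not the paper's terms.
    "outlier_handling": ["rule_a", "rule_b"],
}

# One dict per unique combination of factor levels.
configurations = [
    dict(zip(factors, levels)) for levels in product(*factors.values())
]
print(len(configurations))  # 2430
```

Each added factor multiplies the number of defensible paths, which is why a single research question can fan out into thousands of configurations.
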
Original abstract

The proliferation of large language models (LLMs) and autonomous AI agents has given rise to a rapidly growing methodological paradigm: "in silico" behavioral experiments. Originally conceived as a way to use AI agents as proxies for human participants in studies of cognition, decision-making, and social dynamics, this approach has taken on new significance -- as AI agents increasingly negotiate, transact, and make consequential decisions on behalf of people and organizations, understanding their behavior has become a research priority in its own right. While these experiments with AI agents offer unprecedented advantages in terms of scalability, cost efficiency, and experimental control, they also inherit, and in some cases amplify, methodological vulnerabilities that have long plagued human subjects research. To address these issues, this paper argues that preregistration practices -- central to improving the credibility of human subjects experiments -- should now be extended to experiments with AI agents. We systematically catalog the researcher degrees of freedom that experiments with AI agents introduce -- model selection, prompt wording, settings, and outcome-contingent redesign, for example -- and show how the low cost of iteration and lack of reporting norms make these choices both easy to exploit and difficult to detect. We propose a preregistration template tailored to experiments with AI agents and call on conferences, journals, and funding agencies to make preregistration standard practice for this emerging research paradigm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that preregistration practices from human subjects research should be extended to 'in silico' behavioral experiments with AI agents. It catalogs researcher degrees of freedom including model selection, prompt wording, settings, and outcome-contingent redesign; notes that low iteration costs and absent reporting norms make these choices easy to exploit; and proposes a tailored preregistration template while calling for its adoption by conferences, journals, and funders.

Significance. If adopted, the proposal could help establish methodological norms in an emerging research area where scalability and low costs amplify flexibility in experimental design. The systematic catalog of vulnerabilities is a useful contribution, but the significance is constrained by the absence of any empirical test of whether the template actually constrains behavior or improves credibility.

major comments (2)
  1. [Abstract / proposal section] Abstract and main argument: the central recommendation that the same preregistration mechanism will mitigate the listed degrees of freedom rests on an untested analogy to human-subjects research; no pilot data, simulation, or formal argument is supplied showing that advance locking of model/prompt choices reduces selective reporting when parallel evaluation of multiple LLMs is near-zero cost.
  2. [Catalog of researcher degrees of freedom] Section cataloging vulnerabilities (model selection, prompt wording, outcome-contingent redesign): the claim that these are 'sufficiently analogous' to p-hacking is load-bearing for the extension argument, yet the manuscript provides no concrete test or counter-example analysis addressing whether low-cost iteration allows researchers to evade the template in ways not possible with human subjects.
minor comments (2)
  1. [Proposed template] The template itself is described at a high level; including an explicit example filled-out template (even a short one) would improve clarity and usability.
  2. [Call to action] No discussion of enforcement mechanisms or incentives for adoption is provided, which is a practical gap for a normative proposal.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the value of the systematic catalog of vulnerabilities. We respond to each major comment below and note planned revisions.

Point-by-point responses
  1. Referee: [Abstract / proposal section] Abstract and main argument: the central recommendation that the same preregistration mechanism will mitigate the listed degrees of freedom rests on an untested analogy to human-subjects research; no pilot data, simulation, or formal argument is supplied showing that advance locking of model/prompt choices reduces selective reporting when parallel evaluation of multiple LLMs is near-zero cost.

    Authors: We agree that the proposal rests on an analogy without new empirical validation or simulation specific to AI agents. The manuscript's primary contribution is the identification of AI-specific degrees of freedom and the design of a corresponding template; the claim that preregistration can mitigate them follows from the mechanism of advance commitment rather than from cost considerations alone. Advance locking of model, prompt, and analysis choices still constrains post-outcome adjustments even when parallel runs are inexpensive. To address the concern directly, we will add a subsection providing explicit reasoning on this point, including why low iteration costs do not eliminate the value of pre-commitment. Revision: partial.

  2. Referee: [Catalog of researcher degrees of freedom] Section cataloging vulnerabilities (model selection, prompt wording, outcome-contingent redesign): the claim that these are 'sufficiently analogous' to p-hacking is load-bearing for the extension argument, yet the manuscript provides no concrete test or counter-example analysis addressing whether low-cost iteration allows researchers to evade the template in ways not possible with human subjects.

    Authors: The analogy is grounded in the shared structural problem of researcher flexibility enabling selective reporting, which the paper illustrates with concrete examples of outcome-contingent redesign. We acknowledge that no formal test or counter-example analysis is supplied. In revision we will expand the catalog section with additional hypothetical scenarios that explicitly consider low-cost evasion routes (e.g., running many parallel models before locking) and show how the template's requirements for pre-specifying the full evaluation plan are intended to limit them. We will also note remaining limitations. Revision: partial.
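
The evasion route raised in this exchange, running many configurations before locking one in, can be given a rough number. Under simplifying assumptions introduced here rather than taken from the paper (k independent configurations, no true effect, and a per-test false-positive rate alpha), the chance that at least one configuration looks significant is 1 - (1 - alpha)^k. The function name is hypothetical.

```python
def prob_any_false_positive(k: int, alpha: float = 0.05) -> float:
    """Chance that at least one of k independent null tests is significant."""
    return 1.0 - (1.0 - alpha) ** k


for k in (1, 5, 20, 100):
    print(k, round(prob_any_false_positive(k), 3))
```

With alpha = 0.05 this gives about 0.23 for five configurations and about 0.64 for twenty. Real configurations are usually correlated, so the independence assumption makes this an illustration of the direction of the effect, not an estimate for any particular study.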

Circularity Check

0 steps flagged

No circularity detected; normative proposal draws on external field practices without self-referential derivation or fitted predictions.

full rationale

The paper is a conceptual/normative argument advocating extension of preregistration from human-subjects research to AI-agent experiments. It catalogs researcher degrees of freedom (model selection, prompt wording, etc.) and proposes a template, but contains no equations, derivations, fitted parameters, or self-citations that reduce the central claim to its own inputs by construction. The analogy to human-subjects preregistration is presented as an external precedent rather than a self-defined or self-cited uniqueness theorem. No load-bearing step matches any enumerated circularity pattern; the argument is self-contained as a direct proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is a methodological advocacy paper with no quantitative model, data analysis, or formal derivation. The central claim rests on the unexamined transfer of preregistration benefits from human subjects research to AI agent experiments and on the premise that the listed degrees of freedom are both widespread and harmful in practice.

axioms (1)
  • Domain assumption: Preregistration improves the credibility of experiments by limiting researcher degrees of freedom.
    The paper invokes this established claim from human subjects research as the basis for extending the practice, without new justification or evidence specific to AI agents.


Reference graph

Works this paper leans on

79 extracted references · 16 canonical work pages · 6 internal anchors

  1. Using Large Language Models to Simulate Multiple Humans and Replicate Human Subject Studies. Proceedings of the 40th International Conference on Machine Learning, 2023.
  2. Playing repeated games with large language models. Nature Human Behaviour.
  3. Out of One, Many: Using Language Models to Simulate Human Samples. Political Analysis, 2023.
  4. When Will Negotiation Agents Be Able to Represent Us? The Challenges and Opportunities for Autonomous Negotiators. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence.
  5. Bianchi, Federico; Chia, Patrick John; Yuksekgonul, Mert; Tagliabue, Jacopo; Jurafsky, Dan; Zou, James. How Well Can … arXiv preprint arXiv:2402.05863.
  6. Binz, Marcel; Schulz, Eric. Using Cognitive Psychology to Understand … 2023.
  7. Synthetic Replacements for Human Survey Data? The Perils of Large Language Models. Political Analysis, 2023.
  8. On the Opportunities and Risks of Foundation Models. 2021.
  9. Variability in the analysis of a single neuroimaging dataset by many teams. Nature.
  10. Power Failure: Why Small Sample Size Undermines the Reliability of Neuroscience. Nature Reviews Neuroscience, 2013.
  11. Evaluating Replicability of Laboratory Experiments in Economics. Science, 2016.
  12. Camerer, Colin F; Dreber, Anna; Holzmeister, Felix; Ho, Teck-Hua; Huber, J. … Evaluating the Replicability of Social Science Experiments in … Nature Human Behaviour, 2018.
  13. Specializing Large Language Models to Simulate Survey Response Distributions for Global Populations. Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), April.
  14. Chambers, Christopher D. Registered Reports: A New Publishing Initiative at … 2013.
  15. Chen, Lingjiao; Zaharia, Matei; Zou, James. 2024.
  16. Trends and Charts on Registered Studies.
  17. Pfeiffer, Nici; Call, Mark. Surpassing 100,000 Registrations on …
  18. Registered Reports.
  19. Cui, Justin; Chiang, Wei-Lin; Stoica, Ion; Hsieh, Cho-Jui. 2024.
  20. Dillion, Danica; Tandon, Niket; Gu, Yuling; Gray, Kurt. Can … 2023.
  21. Questioning the Survey Responses of Large Language Models. Advances in Neural Information Processing Systems (NeurIPS 2024).
  22. The reusable holdout: Preserving validity in adaptive data analysis. Science.
  23. Errica, Federico; Siracusano, Giuseppe; Sanvito, Davide; Bifulco, Roberto. What Did …
  24. The Garden of Forking Paths: Why Multiple Comparisons Can Be a Problem, Even When There Is No "Fishing Expedition" or "p-Hacking" and the Research Hypothesis Was Posited Ahead of Time. 2013.
  25. State of the Art: Reproducibility in Artificial Intelligence. Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  26. Deep Reinforcement Learning That Matters. Proceedings of the AAAI Conference on Artificial Intelligence.
  27. Herrera-Poyatos, David; Peláez-González, Carlos; Zuheros, Cristina; Herrera-Poyatos, Andrés; Tejedor, Virilo; Herrera, Francisco; Montes, Rosana. Frontiers in Artificial Intelligence, 2025.
  28. The Curious Case of Neural Text Degeneration. International Conference on Learning Representations. arXiv:1904.09751.
  29. … Horton, Apostolos Filippas, and Benjamin S … Large Language Models as Simulated Economic Agents: What Can We Learn from Homo Silicus? 2023. doi:10.48550/arXiv.2301.07543.
  30. Artificial Intelligence Faces Reproducibility Crisis. Science, 2018.
  31. Why Most Published Research Findings Are False. PLoS Medicine, 2005.
  32. Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling. Psychological Science, 2012.
  33. Leakage and the reproducibility crisis in machine-learning-based science. Patterns, 2023.
  34. Larsen, Erik. The Instability of Safety: … 2512.12066.
  35. Lewis, Patrick; Perez, Ethan; Piktus, Aleksandra; Petroni, Fabio; Karpukhin, Vladimir; Goyal, Naman; K. … Retrieval-Augmented Generation for Knowledge-Intensive … Advances in Neural Information Processing Systems (NeurIPS).
  36. Li, Xinyi; Wang, Sai; Zeng, Siqi; Wu, Yu; Yang, Yi. A survey on … 2024.
  37. Holistic Evaluation of Language Models. 2022.
  38. Lin, Ting. Registered report adoption in academic journals: … Scientometrics.
  39. Pre-train, Prompt, and Predict: A Systematic Survey of Prompting Methods in Natural Language Processing. ACM Computing Surveys.
  40. Ma, Wei; Yang, Yixiao; Ge, Jingquan; Xie, Xiaofei; Jiang, Lingxiao. Prompt Stability in Code … doi:10.48550/arXiv.2509.13680.
  41. Marraffini, Gonzalo F. The Greatest Good Benchmark: Measuring … Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing.
  42. Mei, Qiaozhu; Xie, Yutong; Yuan, Walter; Jackson, Matthew O. A … 2024.
  43. Rethinking the Role of Demonstrations: What Makes In-Context Learning Work? Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, December.
  44. Model Cards for Model Reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT* '19). doi:10.1145/3287560.3287596.
  45. 2016, September.
  46. A 25-Year Journey to a Half-Million Registered Studies. April 2025.
  47. The Preregistration Revolution. Proceedings of the National Academy of Sciences, 2018. doi:10.1073/pnas.1708274114.
  48. Robustness of large language models in moral judgements. Royal Society Open Science.
  49. Estimating the reproducibility of psychological science. Science.
  50. Improving Reproducibility in Machine Learning Research. Journal of Machine Learning Research. arXiv:2003.12206.
  51. A Step Toward Quantifying Independently Reproducible Machine Learning Research. Advances in Neural Information Processing Systems. arXiv:1909.06674.
  52. Rao, Abhinav; Khandelwal, Aditi; Tanmay, Kumar; Agrawal, Utkarsh; Chadha, Aman. Ethical Reasoning over Moral Alignment: A Case and Framework for In-Context Ethical Policies in …
  53. Prompt Programming for Large Language Models: Beyond the Few-Shot Paradigm. Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (CHI EA '21). doi:10.1145/3411763.3451760.
  54. A Meta-Analysis of Overfitting in Machine Learning. Advances in Neural Information Processing Systems (NeurIPS 2019).
  55. The Butterfly Effect of Altering Prompts: How Small Changes and Jailbreaks Affect Large Language Model Performance. Findings of the Association for Computational Linguistics: ACL 2024, August.
  56. Using Large Language Models to Generate Silicon Samples in Consumer and Marketing Research: Challenges, Opportunities, and Guidelines. Psychology & Marketing, 2024.
  57. The Meaningfulness of Effect Sizes in Psychological Research: Differences Between Sub-Disciplines and the Impact of Potential Biases. Frontiers in Psychology, 2019.
  58. An Excess of Positive Results: Comparing the Standard Psychology Literature with Registered Reports. Advances in Methods and Practices in Psychological Science, 2021.
  59. Scherrer, Nino; Shi, Claudia; Feder, Amir; Blei, David M. Evaluating the Moral Beliefs Encoded in … 2024.
  60. Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I learned to start worrying about prompt formatting. arXiv preprint arXiv:2310.11324.
  61. False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant. Psychological Science.
  62. Specification Curve Analysis. Nature Human Behaviour.
  63. Snell, Charlie; Lee, Jaehoon; Xu, Kelvin; Kumar, Aviral. Scaling … arXiv:2408.03314.
  64. Increasing Transparency Through a Multiverse Analysis. Perspectives on Psychological Science.
  65. Why prospective registration of systematic reviews makes sense. Systematic Reviews, 2012. doi:10.1186/2046-4053-1-7.
  66. Preregistration in practice: A comparison of preregistered and non-preregistered studies in psychology. Behavior Research Methods.
  67. An Agenda for Purely Confirmatory Research. Perspectives on Psychological Science, 2012.
  68. Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), July.
  69. Self-Consistency Improves Chain of Thought Reasoning in Language Models. International Conference on Learning Representations. arXiv:2203.11171.
  70. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems. arXiv:2201.11903.
  71. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation Framework. 2023. doi:10.48550/arXiv.2308.08155.
  72. Zamfirescu-Pereira, J. D.; Wong, Richmond; Hartmann, Bjoern; Yang, Qian. Why Johnny Can't Prompt: How Non- … 2023.
  73. Zarin, Deborah A; Tse, Tony; Williams, Rebecca J; Rajakannan, Thiyagu. Update on Trial Registration 11 Years After the … 2017.
  74. Calibrate Before Use: Improving Few-shot Performance of Language Models. Proceedings of the 38th International Conference on Machine Learning, 2021.
  75. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 2023.
  76. Judgment under Uncertainty: Heuristics and Biases. Science, 1974.
  77. Measures of Anchoring in Estimation Tasks. Personality and Social Psychology Bulletin, 1995.
  78. A Literature Review of the Anchoring Effect. The Journal of Socio-Economics, 2011.
  79. The open anchoring quest dataset: Anchored estimates from 96 studies on anchoring effects. Journal of Open Psychology Data, 2022.