Unpredictability dissociates from structured control in language agents

Xiao Jia

arxiv: 2605.09692 · v2 · submitted 2026-05-10 · 💻 cs.AI

Pith reviewed 2026-05-15 05:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords language agents · structured control · stochastic unpredictability · agent lesions · action coupling · reason and veto · behavioral dissociation

The pith

Stochastic unpredictability does not produce structured action control in language agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether random sampling can stand in for explicit mechanisms that tie reasons, memory, self-state, and inhibition to action choices in implemented language agents. It disables specific control components through lesions and compares the resulting behavior against high-stochasticity versions across tens of thousands of runs on seven datasets. Structured agents keep stronger links between internal states and actions than stochastic, scrambled, post-hoc, and verbosity controls, while lesions to reasons or vetoes reliably weaken those links. Strict entropy matching and strict token/compute matching did not meet their gates, so the comparison is not fully matched on those dimensions. This separation matters because unpredictability is often taken as a sign of control, yet here the two can be pulled apart by design choices. The work therefore shows that, in this agent family, randomness alone does not deliver the action-coupling profiles that intact reason-and-veto machinery produces.

Core claim

In a language-agent family whose control components can be selectively disabled, high-stochasticity comparators produced greater unpredictability than structured variants across all seven datasets, yet targeted lesions to reason coupling and veto inhibition reduced the expected structured-control behavioral profiles in every case. Matched-interface tests spanning thousands of generations showed the intact structured agent outperforming stochastic, scrambled, post-hoc, and verbosity controls on action-field coupling measures; removing free-form traces and running blinded audits preserved the pattern. Extensions to additional model families and scaffolds confirmed that no-fields, scrambled-context, and distribution-matched controls failed to recover structured action control.

What carries the argument

Selective lesioning of structured control components (reason coupling and veto inhibition) inside a language-agent architecture, measured against stochastic sampling baselines through action-field coupling scores.

If this is right

Structured control produces distinct action-coupling profiles that stochastic dispersion alone does not replicate in these agents.
Disabling reason coupling or veto inhibition reliably lowers expected control metrics in every tested dataset.
Stochastic, scrambled-context, and distribution-matched controls fail to recover structured action coupling, although strict entropy matching and strict token/compute matching did not meet their gates and remain limitations.
Behavioral tests that remove free-form wording still show the dissociation between unpredictability and structured control.
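
The dissociation described in this list can be made concrete with a toy simulation. The sketch below is not the paper's agent or its metrics: it assumes a hypothetical binary veto field and two made-up policies, and it uses empirical mutual information as a stand-in for a coupling score. It only illustrates how a policy can have higher action entropy while its actions carry almost no information about an internal field.

```python
import math
import random
from collections import Counter


def entropy(xs):
    """Shannon entropy (bits) of the empirical distribution of xs."""
    n = len(xs)
    return -sum((c / n) * math.log2(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) in bits: H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


rng = random.Random(0)
actions = ["proceed", "defer", "refuse", "ask"]  # hypothetical action codes

# Hypothetical internal field: whether a veto was raised on each trial.
veto = [rng.random() < 0.5 for _ in range(4000)]

# Toy "structured" policy: a veto forces refusal; otherwise proceed or defer.
structured = ["refuse" if v else rng.choice(["proceed", "defer"]) for v in veto]

# Toy "high-stochasticity" policy: uniform over actions, ignoring the veto.
stochastic = [rng.choice(actions) for _ in veto]

h_structured = entropy(structured)
h_stochastic = entropy(stochastic)
mi_structured = mutual_information(veto, structured)
mi_stochastic = mutual_information(veto, stochastic)

print(f"entropy  structured={h_structured:.3f}  stochastic={h_stochastic:.3f}")
print(f"coupling structured={mi_structured:.3f}  stochastic={mi_stochastic:.3f}")
```

With these toy policies the uniform sampler has the higher action entropy, while only the structured policy's actions carry information about the veto field. Tracking the two quantities separately is the point; the specific numbers have no bearing on the paper's results.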

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Designers may need explicit coupling mechanisms rather than added randomness when reliable action selection matters.
Evaluation protocols for agents should track action coupling separately from raw unpredictability metrics.
The observed dissociation suggests that safety or alignment tests relying only on entropy measures may miss gaps in structured decision making.

Load-bearing premise

The lesion methods and stochastic comparators cleanly separate structured control effects from random dispersion without hidden implementation differences shaping the behavioral outcomes.

What would settle it

A pure stochastic sampling procedure that matches or exceeds the structured agent's action-field coupling scores on the predefined behavioral components across the same datasets while keeping reasons and vetoes disabled.

Figures

Figures reproduced from arXiv: 2605.09692 by Xiao Jia.

**Figure 1.** Finite-action-code behavior provides the primary format-independent behavioral evi…

**Figure 2.** Matched-interface action-field coupling. The action-field coupling index (AFCI) tests…

**Figure 2.** Baseline dissociation between stochastic dispersion and structured-control traces.

**Figure 3.** Finite-action-code behavior provides the primary format-independent behavioral evi…

**Figure 3.** Matched-interface action-field coupling. The action-field coupling index (AFCI) tests…

**Figure 4.** Entropy calibration constrains stochastic substitution under a predefined four-level close…

**Figure 5.** Robustness and validation across tested prompts, models, open-weight inference and…

**Figure 5.** Validation summary and scope of inference.

Original abstract

Unpredictable behavior is often taken as evidence of control, yet stochastic dispersion and structured action control need not coincide. This paper tests whether stochastic sampling can substitute for structured mechanisms that couple reasons, memory, self-state and inhibition to action selection in a language-agent implementation whose control components can be selectively disabled. In a seven-dataset baseline lesion matrix comprising 74,352 calls, the high-stochasticity comparator was more unpredictable than the structured-control variant in 7/7 datasets, whereas targeted reason and veto lesions reduced the expected structured-control profiles in 7/7 datasets each. In a matched-interface control spanning 26,946 generations, the structured agent maintained stronger action-field coupling than all stochastic, post-hoc, scrambled and verbosity controls across every dataset. The primary behavioral test removed free-form trace wording from the evaluation: 57,816 scored records showed the structured-control variant exceeding the high-stochasticity comparator or the reason/veto lesions in 7/7 datasets for all predefined behavioral components. Later open-weight runs extended the no-context controls to Qwen2.5 7B, 14B and 32B and to an independent Mistral-7B family across 20 task families and three agent scaffolds; no-fields, scrambled-context and distribution-matched controls failed to recover structured action control. A three-annotator blinded audit over 1,200 overlap items preserved high agreement. Strict entropy matching, strict token/compute matching and a formal counterfactual-flip stress test did not meet their gates and are treated as limitations. Stochastic unpredictability did not reproduce structured, action-coupled control in this implemented agent family.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that stochastic unpredictability does not reproduce structured action-coupled control in language agents. Across a 7-dataset lesion matrix (74k calls), high-stochasticity comparators exceed structured variants in unpredictability but show weaker reason/veto and action-field coupling; targeted lesions reduce those profiles in 7/7 datasets. Matched-interface controls (27k generations), no-context extensions to Qwen/Mistral families, and a blinded 3-annotator audit on 1.2k items support the dissociation, though strict entropy/token matching and counterfactual-flip tests failed their gates.

Significance. If the dissociation holds after addressing matching gaps, the result clarifies that randomness alone cannot substitute for explicit coupling of reasons, memory, self-state and inhibition to action selection. The scale (74k+ calls, 57k scored records, multi-family replication) and blinded audit are strengths; the work supplies falsifiable behavioral metrics that can be reused to test other agent scaffolds.

major comments (2)

[Abstract] Abstract and Limitations: the high-stochasticity comparator is reported as failing strict entropy matching, token/compute matching, and counterfactual-flip gates. Because these controls did not succeed, differences in reason/veto coupling and action-field metrics could arise from unmeasured shifts in output distribution or sampling variance rather than the presence/absence of the structured components; this directly affects the central dissociation claim.
[§4] §4 (lesion matrix) and §5 (matched-interface control): the paper states that the lesion methods and stochastic comparators mitigate some confounds, yet the primary unpredictability comparator itself fails the matching criteria. Additional post-hoc distribution-matching analyses or explicit reporting of effective entropy per condition are needed to confirm that behavioral differences are not driven by verbosity or token-level statistics.

minor comments (2)

[Abstract] Abstract: the numerical claims (74,352 calls, 57,816 scored records, 7/7 datasets) are dense; a short table summarizing per-dataset effect directions would improve readability.
[Methods] The blinded-audit protocol is described only at high level; adding inter-annotator agreement statistics (e.g., Fleiss' kappa) would strengthen the reliability claim.
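
For readers unfamiliar with the statistic named in the last comment, the following is a minimal sketch of Fleiss' kappa for a fixed number of annotators per item. The rating table is invented for illustration and does not come from the paper, which is not reported here as using this statistic.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a table of per-item category counts.

    ratings[i][j] is the number of annotators who assigned item i to
    category j. Every item must have the same total number of ratings.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_cats = len(ratings[0])

    # Share of all assignments that fall in each category.
    p_cat = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_cats)
    ]
    # Observed agreement per item: fraction of annotator pairs that agree.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ]
    p_bar = sum(p_item) / n_items      # mean observed agreement
    p_exp = sum(p * p for p in p_cat)  # agreement expected by chance
    if p_exp == 1.0:
        return float("nan")            # undefined when only one category is used
    return (p_bar - p_exp) / (1.0 - p_exp)


# Hypothetical audit slice: 4 items, 3 annotators, 2 labels.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
split = [[2, 1], [1, 2], [2, 1], [1, 2]]
print(fleiss_kappa(perfect), fleiss_kappa(split))
```

Reporting a chance-corrected value like this alongside raw percent agreement would let readers judge how much of the "high agreement" survives correction for label imbalance.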

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and for highlighting the implications of the failed matching gates. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of the dissociation claim while maintaining full transparency about the limitations.

Referee: [Abstract] Abstract and Limitations: the high-stochasticity comparator is reported as failing strict entropy matching, token/compute matching, and counterfactual-flip gates. Because these controls did not succeed, differences in reason/veto coupling and action-field metrics could arise from unmeasured shifts in output distribution or sampling variance rather than the presence/absence of the structured components; this directly affects the central dissociation claim.

Authors: We acknowledge that the strict entropy, token/compute, and counterfactual-flip controls did not meet their gates, as already stated in the limitations section of the manuscript. This is a genuine constraint on the strength of the primary comparator. However, the matched-interface control (26,946 generations) and the additional post-hoc, scrambled-context, and verbosity controls were explicitly designed to isolate structured components from distributional shifts; the dissociation in reason/veto and action-field coupling persisted across all of them. In revision we will add explicit reporting of effective entropy per condition and post-hoc distribution-matching statistics to further address the possibility of unmeasured variance. revision: yes
Referee: [§4] §4 (lesion matrix) and §5 (matched-interface control): the paper states that the lesion methods and stochastic comparators mitigate some confounds, yet the primary unpredictability comparator itself fails the matching criteria. Additional post-hoc distribution-matching analyses or explicit reporting of effective entropy per condition are needed to confirm that behavioral differences are not driven by verbosity or token-level statistics.

Authors: We agree that additional reporting is warranted. In the revised manuscript we will insert post-hoc distribution-matching analyses (including token-length histograms and entropy estimates per condition) into §4 and §5. These will be presented alongside the existing lesion matrix and matched-interface results to test whether the observed differences in behavioral profiles are driven by verbosity or token-level statistics. The blinded audit and multi-family replication already provide convergent support, but the requested analyses will be added to close this gap. revision: yes
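
The per-condition reporting discussed in this exchange could take roughly the following shape. This is a sketch under stated assumptions: the record layout `(condition, action, n_tokens)`, the condition names, and all values are hypothetical and are not taken from the paper's data.

```python
import math
from collections import Counter, defaultdict
from statistics import median


def action_entropy(actions):
    """Shannon entropy (bits) of the empirical action distribution."""
    n = len(actions)
    return -sum((c / n) * math.log2(c / n) for c in Counter(actions).values())


def summarize_by_condition(records):
    """Group (condition, action, n_tokens) records and summarize each group."""
    grouped = defaultdict(list)
    for condition, action, n_tokens in records:
        grouped[condition].append((action, n_tokens))
    summary = {}
    for condition, rows in grouped.items():
        summary[condition] = {
            "n": len(rows),
            "action_entropy_bits": action_entropy([a for a, _ in rows]),
            "median_tokens": median(t for _, t in rows),
        }
    return summary


# Hypothetical records; real input would be the scored generations.
records = [
    ("structured", "refuse", 42), ("structured", "refuse", 40),
    ("structured", "proceed", 45), ("structured", "refuse", 41),
    ("high_stochasticity", "refuse", 60), ("high_stochasticity", "proceed", 58),
    ("high_stochasticity", "defer", 63), ("high_stochasticity", "ask", 61),
]
summary = summarize_by_condition(records)
for condition, stats in summary.items():
    print(condition, stats)
```

Placing entropy and token length side by side per condition makes it visible when a coupling difference coincides with a verbosity difference, which is the confound the referee raises.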

Circularity Check

0 steps flagged

No circularity: empirical lesion comparisons are self-contained

The paper reports results from direct implementation of language agents, selective component lesions, and comparisons against stochastic, scrambled, and matched-interface controls across tens of thousands of generations and multiple datasets. No equations, fitted parameters renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatzes smuggled via prior work appear in the text. Claims rest on measured behavioral profiles (action-field coupling, reason/veto effects) under explicit experimental conditions, with limitations on matching controls openly stated rather than hidden. This is a standard empirical dissociation study whose central result does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that the agent architecture permits clean, selective disabling of control components and that the chosen behavioral metrics validly capture structured control.

axioms (1)

Domain assumption: Structured control components (reason coupling, memory, inhibition) can be selectively disabled via lesions without side effects that confound the comparison to stochastic sampling.
Invoked in the lesion matrix design and the claim that targeted lesions reduce expected profiles.

pith-pipeline@v0.9.0 · 5585 in / 1278 out tokens · 38621 ms · 2026-05-15T05:26:32.959996+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: lesion-based evaluation framework separating stochastic dispersion, action-field coupling and hidden-label finite-action responsiveness... structured-control protocol... high-stochasticity sampling increased action entropy but did not recover reason-, memory-, self-state- or veto-coupled behavior

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Passage: entropy analyses... four-level closeness criterion... $\Gamma(\theta) = \max_{\ell} \left| H^{(\ell)}_{d,\mathrm{HS}}(\theta) - H^{(\ell)}_{d,\mathrm{SC}} \right|$
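
The closeness criterion quoted in the passage above takes, for a dataset d and a sampling setting θ, the largest absolute entropy gap between the high-stochasticity (HS) and structured-control (SC) conditions over the levels ℓ. A minimal sketch of that computation follows. The level names, entropy values, and tolerance are hypothetical placeholders: the passage does not state what the four levels are or what threshold defines the gate.

```python
def closeness_gap(h_hs, h_sc):
    """Gamma: max over levels of |H_HS(level) - H_SC(level)|."""
    if h_hs.keys() != h_sc.keys():
        raise ValueError("both conditions must report the same levels")
    return max(abs(h_hs[level] - h_sc[level]) for level in h_sc)


# Hypothetical per-level entropies (bits) for one dataset.
h_sc = {"level_1": 0.80, "level_2": 1.10, "level_3": 1.45, "level_4": 2.00}

# Hypothetical HS entropies at three sampling settings theta.
h_hs_by_theta = {
    0.7: {"level_1": 0.60, "level_2": 1.00, "level_3": 1.40, "level_4": 1.90},
    1.0: {"level_1": 0.85, "level_2": 1.25, "level_3": 1.50, "level_4": 2.05},
    1.3: {"level_1": 1.20, "level_2": 1.60, "level_3": 1.90, "level_4": 2.60},
}

gaps = {theta: closeness_gap(h, h_sc) for theta, h in h_hs_by_theta.items()}
best_theta = min(gaps, key=gaps.get)

TOLERANCE = 0.10  # hypothetical gate, not a value from the paper
gate_met = gaps[best_theta] <= TOLERANCE
print(best_theta, round(gaps[best_theta], 3), gate_met)
```

Because the criterion takes a maximum over levels, a single badly matched level is enough to fail the gate even when the other levels agree closely, as happens with these placeholder numbers.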

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
