
arxiv: 2606.02556 · v1 · pith:4HF35XRJnew · submitted 2026-06-01 · 💻 cs.CL

HERO'S JOURNEY: Testing Complex Rule Induction with Text Games

Pith reviewed 2026-06-28 14:48 UTC · model grok-4.3

classification 💻 cs.CL
keywords rule induction · text games · large language models · benchmark · attribute induction · procedural induction · episodic tasks · steering methods

The pith

Large language models show evidence of rule induction in text games, but the ability is limited and uneven across tasks, and multi-step execution is a further bottleneck.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents HERO'S JOURNEY, a benchmark of eight tasks in which agents must infer hidden rules from demonstrations in goal-directed episodic text games and then execute multi-step actions based on those rules. The tasks span attribute and procedural induction families, each with four structural rule forms plus controls for lexical grounding and identifiability. Evaluations of current LLMs find that models can induce some rules from examples, yet performance is inconsistent across tasks. Process execution creates a clear bottleneck, while surface-level semantic changes have little impact. Targeted steering improves attribute tasks but produces no reliable gains on procedural ones.

Core claim

Models show evidence of rule induction from demonstrations in episodic text game tasks, yet this ability remains limited and varies unevenly across the eight tasks in the benchmark. Multi-step process execution adds a further bottleneck, while surface semantics has minimal effect on performance. Induction-specific steering improves results on attribute induction tasks but yields no reliable gains on procedural tasks.

What carries the argument

HERO'S JOURNEY benchmark of eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions.
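
The two-family, four-form layout can be made concrete with a small sketch. The Python snippet below enumerates one task per (family, rule form) pair. The A-/P- prefixes and the form name Comp follow the Figure 1 caption; the other three form labels are hypothetical placeholders, because the forms are not enumerated on this page.

```python
from itertools import product

# Family prefixes follow the "A-Comp" / "P-Comp" naming in the Figure 1 caption.
FAMILIES = {"A": "attribute", "P": "procedural"}

# Only "Comp" is named in the text; the remaining labels are hypothetical
# placeholders, not the paper's terminology.
RULE_FORMS = ["Comp", "Form2", "Form3", "Form4"]


def task_grid():
    """Return one task identifier per (family prefix, rule form) pair."""
    return [f"{prefix}-{form}" for prefix, form in product(FAMILIES, RULE_FORMS)]


tasks = task_grid()
print(len(tasks))  # 2 families x 4 forms
print(tasks)
```

Lexical grounding and identifiability conditions would then be extra settings applied to each of these task identifiers, rather than additional tasks.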

If this is right

  • Models that induce rules from demonstrations could generalize across new goal-directed scenarios within the same task families.
  • The execution bottleneck implies that even correctly induced rules do not guarantee successful multi-step application.
  • Steering methods effective only on attribute tasks indicate that procedural rule induction requires distinct techniques.
  • Minimal effect of surface semantics suggests models rely primarily on structural patterns rather than lexical cues.
  • The gap in procedural induction remains an open challenge for improving model performance on sequence-based rules.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark design could be adapted to test rule induction in partially observable or multi-agent text environments.
  • Persistent procedural gaps may constrain AI agents that must chain actions in planning or game-like settings.
  • Identifiability conditions could inform construction of training curricula that strengthen rule extraction in LLMs.
  • Results on steering suggest hybrid methods combining induction with explicit execution planning merit further tests.

Load-bearing premise

The eight tasks and their identifiability conditions isolate rule induction ability without confounding effects from prompting format, game length, or lexical choice.

What would settle it

If models achieve equal performance on procedural and attribute tasks when rules are induced from demonstrations, or if execution accuracy matches induction accuracy once rules are known, the claimed execution bottleneck and uneven induction would be undermined.
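
The second condition can be expressed with the format gap from the paper's Figure 9 caption (QA accuracy minus ECSR), where a positive gap points to an execution bottleneck and a negative gap to a QA bottleneck. The following is a minimal Python sketch, assuming both scores are rates in [0, 1]. The function names, the tolerance parameter, the model name, and the example scores are illustrative assumptions, not values from the paper.

```python
def format_gap(qa_accuracy: float, ecsr: float) -> float:
    """Format gap as in the Figure 9 caption: QA accuracy minus ECSR."""
    return qa_accuracy - ecsr


def bottleneck(gap: float, tol: float = 0.0) -> str:
    """Label the sign of a gap; `tol` is a hypothetical dead zone around zero."""
    if gap > tol:
        return "execution"
    if gap < -tol:
        return "QA"
    return "none"


# Made-up scores for illustration only (not results from the paper).
scores = {
    ("model-x", "A-Comp"): (0.80, 0.50),
    ("model-x", "P-Comp"): (0.40, 0.45),
}
for (model, task), (qa, ecsr) in scores.items():
    gap = format_gap(qa, ecsr)
    print(f"{model} {task}: gap={gap:+.2f} -> {bottleneck(gap)} bottleneck")
```

Under this reading, gaps that are statistically indistinguishable from zero across models and tasks would weaken the execution-bottleneck claim.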

Figures

Figures reproduced from arXiv: 2606.02556 by Anshun Asher Zheng, David I. Beaver, Junyi Jessy Li, Kanishka Misra.

Figure 1
Overview of HERO’S JOURNEY. (1) Rule interactions: four structural forms varying how entity attributes jointly determine the required item or process. (2) Attribute induction tasks: illustrated with A-Comp tasks (§3.2.1; each attribute independently governs a separate output dimension); entity class (ranger, captain) and role (prophet, chirurgeon). (3) Procedural induction tasks: illustrated with P-Comp tasks …
Figure 2
Task curation and the eight induction tasks.
Figure 3
ECSR vs. RV across all model × task conditions. Each point represents one model on one task. Accompanying text from §4 (Evaluation): For each task in §3, we generate 20 variants by randomly sampling surface names from the lexicon and varying the source-gen splits. In each episode the agent receives (see prompt in Appx. H.1): (1) a world listing of all entities, attributes, and locations; (2) source-split demonstrations …
Figure 4
ECSR (solid colored lines) and contextual_bias(k) (dotted gray triangles) across different coverage k/k∗, for all eight tasks with GPT-5.4-mini and GPT-OSS-120B. The vertical dotted line marks the identifiability threshold k∗ = 1 (full source split). Bias is zero and omitted for A/P-Comp. Accompanying text: … when the rule is not identified, reflecting that humans who miss the rule generally cannot complete the task efficie…
Figure 5
Task curation illustrated on the Dragon’s Keep example (cf. Figure …
Figure 6
Attribute induction task grids (one instantiation of the source/gen split). Rows index class ( …
Figure 7
Procedural induction task grids (one instantiation of the source/gen split). Rows index class ( …
Figure 8
Task success rate (left) and normalized efficiency on successful episodes (right) per model and task, with …
Figure 9
(A–B) Format gap (QA accuracy − ECSR) per model and task, for attribute (A) and procedural (B) tasks. Bars above zero indicate an execution bottleneck; bars below zero a QA bottleneck. Stars mark gaps significantly non-zero (Bonferroni-corrected by task family; ∗p < 0.05, ∗∗p < 0.01, ∗∗∗p < 0.001). (C–D) ECSR and RV under semantic vs. nonce conditions for attribute (C) and procedural (D) tasks. Brackets ma…
Figure 10
ECSR and RV for GPT-5.4-mini and Qwen3.5-27B under semantic vs. nonce lexical conditions, for attribute (left) and procedural (right) tasks. Brackets mark significant pairwise differences (Bonferroni-corrected by task family; p < 0.05); unmarked pairs are non-significant. [Plot residue from a neighboring figure: axis Δ ECSR; methods ReAct, ACE, IDEA, HR; models GPT, Qwen; task families Attribute, Procedural.]
Figure 12
Episodic environment interface for humans to play the games; we use the same instructions as the one …
Figure 13
Interface to annotate the rule underlying the demonstrations after each episode.
Original abstract

We introduce HERO'S JOURNEY, a benchmark for rule induction in goal-directed episodic tasks, where agents must infer hidden rules from demonstrations and act on them through multi-step execution. HERO'S JOURNEY covers eight tasks across attribute and procedural induction families, each with four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluating state-of-the-art LLMs, we find that models show evidence of rule induction, but the ability is limited and uneven across tasks. Meanwhile, process execution adds an execution bottleneck for models, whereas surface semantics has minimal effect. Induction-specific steering methods improve performance on attribute tasks but show no reliable gains on procedural tasks, suggesting the gap in procedural induction remains an open challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces the HERO'S JOURNEY benchmark for evaluating rule induction in LLMs via eight goal-directed episodic text-game tasks spanning attribute and procedural induction families. Each task includes four structural rule forms, controllable lexical grounding, and identifiability conditions. Evaluation of state-of-the-art models shows limited and uneven evidence of rule induction, with process execution as a bottleneck and minimal impact from surface semantics; induction-specific steering improves attribute tasks but yields no reliable gains on procedural tasks.

Significance. If the central empirical claims hold under the stated identifiability conditions, the benchmark offers a structured way to probe complex rule induction beyond surface patterns, with the distinction between attribute and procedural families and the steering results highlighting an open challenge in procedural induction. The design elements of controllable lexical grounding and multiple rule forms per task are positive features for systematic evaluation.

major comments (2)
  1. [Abstract] Abstract and the description of identifiability conditions: the central claim that performance differences can be attributed to rule induction (rather than prompting format, episode length, or lexical choice) rests on the assertion that the eight tasks plus identifiability conditions isolate the target construct; however, the manuscript provides no explicit ablations or quantitative checks demonstrating that these variables were held constant or shown to be inert, leaving the attribution of 'limited and uneven' induction and 'minimal effect' of surface semantics unsecured.
  2. [Abstract] Abstract, results on steering methods: the differential effect (gains on attribute tasks, none on procedural) is load-bearing for the conclusion that 'the gap in procedural induction remains an open challenge,' yet without reported statistical tests, error analysis, or controls confirming that the steering interventions were applied identically across families, the unevenness cannot be confidently localized to induction ability versus execution or prompting confounds.
minor comments (1)
  1. The abstract refers to 'four structural rule forms' per task without enumerating or exemplifying them; adding a brief table or figure in the main text would improve clarity on how these forms vary across the attribute/procedural families.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address the major comments point by point below and will make the necessary revisions to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] Abstract and the description of identifiability conditions: the central claim that performance differences can be attributed to rule induction (rather than prompting format, episode length, or lexical choice) rests on the assertion that the eight tasks plus identifiability conditions isolate the target construct; however, the manuscript provides no explicit ablations or quantitative checks demonstrating that these variables were held constant or shown to be inert, leaving the attribution of 'limited and uneven' induction and 'minimal effect' of surface semantics unsecured.

    Authors: We agree that explicit ablations would provide stronger evidence that the identifiability conditions effectively isolate rule induction from confounds such as prompting format, episode length, and lexical choice. The manuscript describes the design elements intended to achieve this isolation, including controllable lexical grounding and the four structural rule forms. However, we acknowledge the absence of quantitative checks or ablations in the current version. In the revised manuscript, we will add ablations and analyses to verify that these variables are inert under the stated conditions. revision: yes

  2. Referee: [Abstract] Abstract, results on steering methods: the differential effect (gains on attribute tasks, none on procedural) is load-bearing for the conclusion that 'the gap in procedural induction remains an open challenge,' yet without reported statistical tests, error analysis, or controls confirming that the steering interventions were applied identically across families, the unevenness cannot be confidently localized to induction ability versus execution or prompting confounds.

    Authors: We concur that statistical tests and additional controls are important to support the differential effects observed with steering methods. The current results indicate gains on attribute tasks but no reliable gains on procedural tasks. To address this, the revised version will include statistical significance tests, error analyses, and explicit confirmation that steering interventions were applied consistently across the attribute and procedural families. This will help localize the effects more confidently. revision: yes
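
One way the significance tests promised above could be run is sketched below in Python. The paper's figure captions mention Bonferroni correction by task family but this page does not say which test is used, so the paired sign-flip permutation test here is a substituted choice, not the authors' method. The sketch assumes 20 paired per-variant deltas per task (steered minus baseline), matching the 20 variants per task mentioned in the evaluation text. Task names and all numbers are invented.

```python
import random


def sign_flip_p_value(diffs, n_resamples=10_000, seed=0):
    """Two-sided paired sign-flip permutation test on the mean difference."""
    rng = random.Random(seed)
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(n_resamples):
        # Under the null, each paired difference is equally likely to be +d or -d.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    # Add-one smoothing so the estimated p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


def bonferroni(p_values):
    """Multiply each p-value by the number of tests in the family, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Hypothetical per-variant deltas (steered minus baseline) for the four tasks
# of one family; the numbers are invented for illustration.
family_deltas = {
    "task_1": [0.10] * 20,
    "task_2": [0.05, -0.05] * 10,
    "task_3": [0.02] * 10 + [-0.01] * 10,
    "task_4": [0.00] * 20,
}
raw = [sign_flip_p_value(d) for d in family_deltas.values()]
for task, p in zip(family_deltas, bonferroni(raw)):
    print(task, round(p, 4))
```

Correcting within each family separately (four tests per family) mirrors the "Bonferroni-corrected by task family" wording in the captions. Correcting across all eight tasks at once would be a stricter alternative.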

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

Full rationale

The paper introduces an empirical benchmark (HERO'S JOURNEY) consisting of eight tasks for evaluating LLM rule induction in text games, reports model performance results, and discusses effects of execution bottlenecks and surface semantics. No equations, fitted parameters, predictions derived from inputs, or mathematical derivations are present. The central claims rest on experimental measurements against external model evaluations rather than any self-referential reduction, self-citation chain, or ansatz smuggled via prior work. Identifiability conditions are design choices for the benchmark tasks, not a derivation that collapses to its own inputs. This is a standard self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities; the contribution is an empirical benchmark and evaluation protocol.


