pith. machine review for the scientific record.

arxiv: 2605.06142 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI

Recognition: unknown

IRC-Bench: Recognizing Entities from Contextual Cues in First-Person Reminiscences

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 10:43 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords implicit entity recognition · reminiscence narratives · contextual cues · non-local inference · benchmark dataset · entity linking · LLM evaluation · narrative understanding

The pith

IRC-Bench evaluates models on recognizing entities inferred from distributed contextual cues in first-person reminiscence transcripts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IRC-Bench to assess how well models can identify entities such as people, places, and events when they appear only through clues spread across multiple non-contiguous parts of personal memory stories. This matters because reminiscences in therapy, archives, and social contexts commonly avoid direct names, yet existing named entity recognition and linking methods assume local mentions and therefore cannot handle the required inference. The benchmark constructs 25,136 samples by linking entities from 1,994 real transcripts to Wikidata and creating paired versions where direct mentions are removed to force reliance on the remaining dispersed evidence. Evaluations across 19 model configurations reveal that QLoRA-adapted Llama 3.1 8B leads in open-world exact match while fine-tuned dense passage retrieval leads in closed-world ranking. The full dataset, code, and tools are released to enable further progress on implicit reference resolution.

Core claim

We introduce IRC-Bench, the Implicit Reminiscence Context Benchmark, for evaluating implicit entity recognition in reminiscence transcripts. The benchmark targets non-locality: entity-identifying cues are distributed across multiple, non-contiguous clauses, unlike named entity recognition, entity linking, or coreference resolution. IRC-Bench comprises 25,136 samples constructed from 12,337 Wiki-data-linked entities across 1,994 transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative, in which the target entity is explicitly mentioned, with an Entity-Elided Narrative, in which direct mentions are removed.

What carries the argument

The Entity-Elided Narrative, created by removing direct mentions of the target entity from a grounded transcript, forces models to recover the entity solely from dispersed, non-contiguous contextual cues.
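
The page does not reproduce the elision procedure, but the construction described (removing direct mentions while keeping surrounding cues) can be sketched at string level. A minimal sketch, assuming elision is removal of all known surface forms of the entity; the function name, placeholder handling, and example narrative are illustrative, not the authors' actual pipeline, which also validates for leakage and naturalness:

```python
import re

def elide_entity(narrative: str, surface_forms: list[str]) -> str:
    """Remove every direct mention (name or alias) of the target entity,
    leaving only dispersed contextual cues. Hypothetical sketch; the
    paper's removal scope and residual-signal safeguards are not shown
    on this page."""
    elided = narrative
    # Longest surface forms first, so a longer alias is removed before a
    # shorter one can partially match inside it.
    for form in sorted(surface_forms, key=len, reverse=True):
        elided = re.sub(r"\b" + re.escape(form) + r"\b", "", elided,
                        flags=re.IGNORECASE)
    # Collapse the whitespace gaps left behind by the removals.
    return re.sub(r"\s{2,}", " ", elided).strip()

grounded = ("We drove to Coney Island every summer; "
            "Coney Island still had the old wooden coaster.")
elided = elide_entity(grounded, ["Coney Island"])
```

Under this sketch the elided text keeps the cues ("drove to ... every summer", "old wooden coaster") while no direct mention survives, which is exactly the property the benchmark's validation stage would need to check.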

If this is right

  • Models achieving high scores on IRC-Bench demonstrate improved ability to process real reminiscence data containing only indirect references.
  • Open-world generation with adapted large language models outperforms other approaches when the set of possible entities is not restricted in advance.
  • Closed-world retrieval methods such as fine-tuned DPR can rank candidate entities effectively, reaching over 70 percent Hit@10.
  • The 11-domain coverage of the benchmark enables testing whether inference performance generalizes across different types of personal memories.
  • Public release of the 25,136 paired samples and evaluation code supports standardized comparison of future implicit entity recognition systems.
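
The open-world and closed-world scores in these bullets can be made concrete with the metric shapes the benchmark reports. A hedged sketch, in which the normalization and alias handling are assumptions; the released evaluation code is authoritative:

```python
def alias_aware_match(pred: str, gold: str, aliases: tuple[str, ...] = ()) -> bool:
    """Open-world scoring: a generated name counts as correct if it matches
    the gold entity name or any known alias after light normalization.
    (Assumed normalization; the paper's exact rules may differ.)"""
    norm = lambda s: " ".join(s.lower().split())
    return norm(pred) in {norm(gold), *(norm(a) for a in aliases)}

def hit_at_k(ranked_candidates: list[str], gold_id: str, k: int) -> bool:
    """Closed-world scoring: Hit@K is whether the gold entity appears
    among the top-K retrieved candidates."""
    return gold_id in ranked_candidates[:k]
```

Under these definitions, fine-tuned DPR's 71.49% Hit@10 means the gold entity sits somewhere in the top ten ranked candidates for roughly seven samples in ten.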

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The elision technique used to build the benchmark could be applied to other narrative genres to create tests for implicit reasoning beyond entity names.
  • Systems that succeed on IRC-Bench may improve AI tools that help users reflect on or organize personal life stories.
  • Weak results would indicate that current models still lack robust mechanisms for integrating evidence across long narrative spans.
  • Performance on this benchmark could be checked against success in downstream tasks such as summarizing therapy sessions or extracting life events from oral histories.

Load-bearing premise

That removing direct mentions from Entity-Grounded Narratives produces Entity-Elided Narratives that accurately simulate real-world implicit references requiring inference from dispersed, non-contiguous contextual cues without introducing artifacts or altering the underlying difficulty.

What would settle it

If human readers cannot identify the intended entity from the Entity-Elided Narratives at rates substantially above chance, the benchmark would fail to capture genuine implicit recognition.
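
Such a human check is a small statistics exercise: with a panel of annotators attempting each elided narrative, an exact one-sided binomial test against the chance rate settles whether identification is above chance. A sketch; the chance level itself would have to be defined per sample (e.g. from the closed-world candidate pool size), and no human study is reported on this page:

```python
import math

def p_above_chance(successes: int, trials: int, chance: float) -> float:
    """One-sided exact binomial p-value: the probability of seeing at
    least this many correct identifications if annotators were guessing
    at the given chance rate. Illustrative only."""
    return sum(math.comb(trials, i) * chance**i * (1 - chance)**(trials - i)
               for i in range(successes, trials + 1))
```

For example, 10 correct identifications out of 20 attempts at a 10% chance rate would be decisive evidence, while 2 out of 20 would be indistinguishable from guessing.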

Figures

Figures reproduced from arXiv: 2605.06142 by Alexander Apartsin, Eden Moran, Yehudit Aperstein.

Figure 1. IRC-Bench construction pipeline. Raw oral history transcripts undergo cleaning, named entity recognition with Wikidata linking, entity-grounded narrative generation, and entity elision to produce implicit entity recognition evaluation samples. The final validation stage checks for leakage, cue sufficiency, and narrative naturalness.
Figure 4. Comparison of open-world and closed-world methods on the IRC-Bench test set. Open-world methods are measured by exact match and alias-aware accuracy; closed-world methods by Hit@1 and Hit@10. The most striking open-world result is the effect of QLoRA fine-tuning: open-world configuration O10 (QLoRA Llama 3.1 8B) achieves 38.94% exact match, nearly tripling the base model's zero-shot performance (13.92%).
Figure 5. Hit@K curves for closed-world retrieval methods. Fine-tuned DPR with description representations (C5) substantially outperforms all baseline configurations across all K values.
Figure 6. Effect of DPR fine-tuning on retrieval performance. Fine-tuning more than doubles Hit@1 across all entity representation strategies, with the largest absolute gain for descriptions (+18.74 pp).
Figure 7. Heatmap of performance (alias-aware Hit@1) by entity type and method. Person entities are consistently the hardest across all methods; Events are notably strong for both open-world and closed-world approaches.
Figure 8. Relationship between model scale and open-world accuracy. Larger models achieve substantially higher accuracy, with the relationship appearing roughly log-linear in model parameter count. QLoRA fine-tuning (O10) breaks this trend, enabling an 8B model to outperform much larger models.
Original abstract

When people recount personal memories, they often refer to people, places, and events indirectly, relying on contextual cues rather than explicit names. Such implicit references are central to reminiscence narratives: first-person accounts of lived experience used in therapeutic, archival, and social settings. They pose a difficult computational problem because the intended entity must be inferred from dispersed narrative evidence rather than from a local mention. We introduce IRC-Bench, the Implicit Reminiscence Context Benchmark, for evaluating implicit entity recognition in reminiscence transcripts. The benchmark targets non-locality: entity-identifying cues are distributed across multiple, non-contiguous clauses, unlike named entity recognition, entity linking, or coreference resolution. IRC-Bench comprises 25,136 samples constructed from 12,337 Wiki-data-linked entities across 1,994 transcripts spanning 11 thematic domains. Each sample pairs an Entity-Grounded Narrative, in which the target entity is explicitly mentioned, with an Entity-Elided Narrative, in which direct mentions are removed. We evaluate 19 configurations across LLM generation, dense retrieval, RAG, and fine-tuning. QLoRA-adapted Llama 3.1 8B performs best in the open-world setting (38.94% exact match; 51.59% Jaccard), while fine-tuned DPR leads closed-world retrieval (35.38% Hit@1; 71.49% Hit@10). We release IRC-Bench with data, code, and evaluation tools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces IRC-Bench, a benchmark of 25,136 samples drawn from 1,994 WikiData-linked reminiscence transcripts across 11 domains. Each sample consists of an Entity-Grounded Narrative (with explicit entity mentions) paired with an Entity-Elided Narrative (direct mentions removed). The benchmark is positioned as testing non-local implicit entity recognition from dispersed, non-contiguous contextual cues, distinct from standard NER, entity linking, or coreference. Evaluations across 19 LLM, retrieval, RAG, and fine-tuning configurations report QLoRA-adapted Llama 3.1 8B as strongest in open-world settings (38.94% exact match, 51.59% Jaccard) and fine-tuned DPR as strongest in closed-world retrieval (35.38% Hit@1, 71.49% Hit@10). The dataset, code, and evaluation tools are released.

Significance. If the central construction claim holds, IRC-Bench would fill a gap by providing a controlled testbed for non-local contextual inference in personal narratives, with direct relevance to therapeutic, archival, and social applications. The public release of data, code, and tools is a clear strength that supports reproducibility and extension by the community.

major comments (2)
  1. [§3] §3 (Benchmark Construction): The central claim requires that Entity-Elided Narratives simulate real-world implicit references by forcing inference from dispersed, non-contiguous cues. The manuscript describes construction as taking WikiData-linked Entity-Grounded Narratives and removing direct mentions, but supplies no details on the elision procedure (mention detection, removal scope, or safeguards against local residual signals), no quality validation (human ratings of cue locality or narrative coherence), and no comparison to naturally occurring implicit reminiscences. This is load-bearing: without such evidence the benchmark may measure surface pattern completion rather than the intended non-local reasoning, rendering the reported model rankings uninterpretable for the stated purpose.
  2. [§4] §4 (Model Evaluations): Performance figures (e.g., 38.94% exact match for QLoRA Llama 3.1 8B, 35.38% Hit@1 for fine-tuned DPR) are presented as evidence of model capability on implicit recognition. Because the construction validation gap in §3 is unresolved, these metrics cannot be confidently attributed to success on non-contiguous inference; an error analysis or cue-locality probe would be required to substantiate the interpretation.
minor comments (2)
  1. [Abstract] Abstract and §1: 'Wiki-data-linked' appears inconsistently; standardize to 'WikiData-linked' throughout.
  2. [§2] §2 (Related Work): The distinction from coreference resolution is asserted but would benefit from a short table contrasting task definitions and input assumptions.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and insightful comments on our manuscript. We address each of the major comments point by point below.

Point-by-point responses
  1. Referee: [§3] §3 (Benchmark Construction): The central claim requires that Entity-Elided Narratives simulate real-world implicit references by forcing inference from dispersed, non-contiguous cues. The manuscript describes construction as taking WikiData-linked Entity-Grounded Narratives and removing direct mentions, but supplies no details on the elision procedure (mention detection, removal scope, or safeguards against local residual signals), no quality validation (human ratings of cue locality or narrative coherence), and no comparison to naturally occurring implicit reminiscences. This is load-bearing: without such evidence the benchmark may measure surface pattern completion rather than the intended non-local reasoning, rendering the reported model rankings uninterpretable for the stated purpose.

    Authors: We acknowledge the need for greater transparency in the benchmark construction process. The manuscript outlines the overall approach but does not elaborate on the technical details of elision. In the revised version, we will provide a complete description of the elision procedure, including the mention detection method based on WikiData entity names and aliases, the scope of removal, and measures taken to avoid leaving local residual signals. We will also include a human evaluation study assessing cue locality and narrative coherence. A direct empirical comparison to naturally occurring implicit reminiscences is beyond the current scope, as it would require a new annotated corpus; we will add a discussion of this limitation while maintaining that the elision method enforces the non-local nature of the task by construction. revision: partial

  2. Referee: [§4] §4 (Model Evaluations): Performance figures (e.g., 38.94% exact match for QLoRA Llama 3.1 8B, 35.38% Hit@1 for fine-tuned DPR) are presented as evidence of model capability on implicit recognition. Because the construction validation gap in §3 is unresolved, these metrics cannot be confidently attributed to success on non-contiguous inference; an error analysis or cue-locality probe would be required to substantiate the interpretation.

    Authors: We agree that additional analysis is needed to link the performance metrics to non-local inference capabilities. We will revise §4 to include an error analysis that categorizes failures and successes based on the distribution of contextual cues. Furthermore, we will introduce a cue-locality probe experiment to demonstrate that models must rely on non-contiguous information. These changes will provide stronger evidence for the interpretation of the results. revision: yes
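
The promised cue-locality probe could take a simple form: run the same model on every contiguous local window of a narrative and flag samples that any single window already solves, since those measure local pattern completion rather than non-contiguous inference. A hypothetical sketch; the window granularity and the `predict` interface are assumptions, not the authors' design:

```python
def locality_probe(sentences: list[str], window: int, predict) -> dict:
    """Check whether a prediction recoverable from the full narrative is
    already recoverable from some contiguous window of `window` sentences.
    `predict` maps text -> entity guess (any model wrapped as a callable)."""
    full_pred = predict(" ".join(sentences))
    n_windows = max(1, len(sentences) - window + 1)
    window_preds = [predict(" ".join(sentences[i:i + window]))
                    for i in range(n_windows)]
    return {"prediction": full_pred,
            "locally_solvable": full_pred in window_preds}
```

A sample where no single window recovers the entity but the full narrative does is exactly the non-local case the benchmark claims to test; samples solvable from one window would be candidates for exclusion or separate reporting.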

standing simulated objections not resolved
  • Direct comparison to a corpus of naturally occurring implicit reminiscences, which would require substantial additional data collection and annotation.

Circularity Check

0 steps flagged

No circularity: benchmark construction and empirical evaluations are self-contained

full rationale

The paper defines IRC-Bench by constructing Entity-Elided Narratives from Entity-Grounded Narratives through direct mention removal and then reports empirical performance of models (LLMs, DPR, RAG) on the resulting dataset. No equations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the derivation chain. All claims rest on explicit data construction steps and external model evaluations rather than reducing to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on domain assumptions about narrative structure and entity grounding rather than new parameters or invented entities.

axioms (2)
  • domain assumption Implicit references in reminiscence narratives can be resolved using distributed contextual cues across non-contiguous clauses.
    This premise underpins the benchmark's focus on non-locality and the elision process.
  • domain assumption WikiData-linked entities provide reliable and accurate grounding for constructing and evaluating the narrative samples.
    Used to create the 12,337 entities across 1,994 transcripts.

pith-pipeline@v0.9.0 · 5572 in / 1525 out tokens · 69542 ms · 2026-05-08T10:43:03.208350+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 6 canonical work pages · 3 internal anchors

  1. [1] Boyd, D. (2012). Achieving the Promise of Oral History in a Digital Age. In D. A. Ritchie (Ed.), The Oxford Handbook of Oral History. Oxford Academic. https://doi.org/10.1093/oxfordhb/9780195339550.013.0021

  2. [2] Lazar, A., Demiris, G., & Thompson, H. (2016). Evaluation of a multifunctional technology system in a memory care unit: Opportunities for innovation in dementia care. Informatics for Health and Social Care, 41(4), 373-389.

  3. [3] Subramaniam, P., & Woods, B. (2012). The impact of individual reminiscence therapy for people with dementia. Expert Review of Neurotherapeutics, 12(5), 545-555.

  4. [4] Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3-26.

  5. [5] Li, J., Sun, A., Han, J., & Li, C. (2020). A survey on deep learning for named entity recognition. IEEE Transactions on Knowledge and Data Engineering, 34(1), 50-70.

  6. [6] Ganea, O. E., & Hofmann, T. (2017). Deep joint entity disambiguation with local neural attention. In Proceedings of EMNLP 2017 (pp. 2619-2629).

  7. [7] Kolitsas, N., Ganea, O. E., & Hofmann, T. (2018). End-to-end neural entity linking. In Proceedings of CoNLL 2018 (pp. 519-529).

  8. [8] Lee, K., He, L., Lewis, M., & Zettlemoyer, L. (2017). End-to-end neural coreference resolution. In Proceedings of EMNLP 2017 (pp. 188-197).

  9. [9] Hosseini, H. (2022). Implicit entity recognition and linking in tweets. PhD thesis, Toronto Metropolitan University.

  10. [10] Hosseini, H., & Bagheri, E. (2021). Learning to rank implicit entities on Twitter. Information Processing & Management, 58(3), 102503.

  11. [11] Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of NAACL-HLT 2016 (pp. 260-270).

  12. [12] Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019, Volume 1 (pp. 4171-4186).

  13. [13] Xie, T., Li, Q., Zhang, J., Zhang, Y., Liu, Z., & Wang, H. (2023). Empirical study of zero-shot NER with ChatGPT. In Proceedings of EMNLP 2023 (pp. 7935-7956).

  14. [14] Ashok, D., & Lipton, Z. C. (2023). PromptNER: Prompting for named entity recognition. arXiv preprint arXiv:2305.15444.

  15. [15] Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of CoNLL at HLT-NAACL 2003 (pp. 142-147).

  16. [16] Malmasi, S., Fang, A., Fetahu, B., Kar, S., & Rokhlenko, O. (2022). MultiCoNER: A large-scale multilingual dataset for complex named entity recognition. In Proceedings of COLING 2022 (pp. 3798-3809).

  17. [17] Li, J., Fei, H., Liu, J., Wu, S., Zhang, M., Teng, C., ... & Li, F. (2022). Unified named entity recognition as word-word relation classification. In Proceedings of AAAI 2022, 36(10), 10965-10973.

  18. [18] Zhou, W., Zhang, S., Gu, Y., Chen, M., & Poon, H. (2023). UniversalNER: Targeted distillation from large language models for open named entity recognition. arXiv preprint arXiv:2308.03279.

  19. [19] Wu, L., Petroni, F., Josifoski, M., Riedel, S., & Zettlemoyer, L. (2020). Scalable zero-shot entity linking with dense entity retrieval. In Proceedings of EMNLP 2020 (pp. 6397-6407).

  20. [20] De Cao, N., Izacard, G., Riedel, S., & Petroni, F. (2021). Autoregressive entity retrieval. In ICLR 2021.

  21. [21] Ayoola, T., Tyagi, S., Fisher, J., Christodoulopoulos, C., & Pierleoni, A. (2022). ReFinED: An efficient zero-shot-capable approach to end-to-end entity linking. In Proceedings of NAACL-HLT 2022, Industry Track (pp. 209-220).

  22. [22] Botha, J. A., Shan, Z., & Gillick, D. (2020). Entity linking in 100 languages. In Proceedings of EMNLP 2020 (pp. 7833-7845).

  23. [23] Butler, R. N. (1963). The life review: An interpretation of reminiscence in the aged. Psychiatry, 26(1), 65-76.

  24. [24] Webster, J. D. (1993). Construction and validation of the Reminiscence Functions Scale. Journal of Gerontology, 48(5), P256-P262.

  25. [25] Nikitina, S., Callaioli, S., & Baez, M. (2018). Smart conversational agents for reminiscence. In Proceedings of the 1st International Workshop on Software Engineering for Cognitive Services (pp. 52-57).

  26. [26] Pessanha, F., & Salah, A. A. (2021). A computational look at oral history archives. ACM Journal on Computing and Cultural Heritage, 15(1), 1-16.

  27. [27] Perera, N., Dehmer, M., & Emmert-Streib, F. (2020). Named entity recognition and relation detection for biomedical information extraction. Frontiers in Cell and Developmental Biology, 8, 673.

  28. [28] Hou, Y. (2020). Bridging anaphora resolution as question answering. In Proceedings of ACL 2020 (pp. 1428-1438).

  29. [29] Poesio, M., Stuckardt, R., & Versley, Y. (2016). Anaphora Resolution: Algorithms, Resources, and Applications. Springer.

  30. [30] Treder, M. S., Lee, S., & Tsvetanov, K. A. (2024). Introduction to large language models (LLMs) for dementia care and research. Frontiers in Dementia, 3, 1385303.

  31. [31] Broadbent, E., Stafford, R., & MacDonald, B. (2009). Acceptance of healthcare robots for the older population: Review and future directions. International Journal of Social Robotics, 1(4), 319-330.

  32. [32] De Jager, A., Fogarty, A., Tewson, A., Lenette, C., & Boydell, K. (2017). Digital storytelling in research: A systematic review. The Qualitative Report.

  33. [33] Yang, Z., Qi, P., Zhang, S., Bengio, Y., Cohen, W., Salakhutdinov, R., & Manning, C. D. (2018). HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of EMNLP 2018 (pp. 2369-2380).

  34. [34] Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., & Miller, A. (2019). Language models as knowledge bases? In Proceedings of EMNLP-IJCNLP 2019 (pp. 2463-2473).

  35. [35] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459-9474.

  36. [36] Karpukhin, V., Oguz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., & Yih, W. (2020). Dense passage retrieval for open-domain question answering. In Proceedings of EMNLP 2020 (pp. 6769-6781).

  37. [37] Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A. J., et al. (2024). GPT-4o system card. arXiv preprint arXiv:2410.21276.

  38. [38] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

  39. [39] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., et al. (2024). The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  40. [40] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. In ICLR 2022.

  41. [41] Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. Advances in Neural Information Processing Systems, 36, 10088-10115.

  42. [42] Xiao, S., Liu, Z., Zhang, P., Muennighoff, N., Lian, D., & Nie, J. Y. (2024). C-Pack: Packed resources for general Chinese embeddings. In Proceedings of SIGIR 2024 (pp. 641-649).

  43. [43] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.