pith. machine review for the scientific record. sign in

arxiv: 2303.09014 · v1 · submitted 2023-03-16 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

ART: Automatic multi-step reasoning and tool-use for large language models

Authors on Pith no claims yet

Pith reviewed 2026-05-16 18:58 UTC · model grok-4.3

classification 💻 cs.CL
keywords automatic reasoningtool usechain of thoughtlarge language modelsfew-shot promptingtask libraryunseen tasks
0
0 comments X

The pith

ART lets large language models automatically generate multi-step reasoning programs that call external tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ART as a way to make LLMs produce reasoning in the form of executable programs rather than free text. Given a new task, the system picks similar examples from a fixed library of demonstrations and has the model generate a program that interleaves thought steps with tool calls. Generation pauses automatically when a tool is needed, the result is inserted, and generation resumes. This yields clear gains over plain few-shot prompting and automatic chain-of-thought on unseen tasks from BigBench and MMLU, while reaching the level of hand-written prompts on most of the same tasks. Humans can raise scores further by editing the selected programs or adding tools with only small effort.

Core claim

ART uses a task library of multi-step reasoning demonstrations to select examples for a new task, then has the LLM generate a reasoning program that interleaves thought steps with tool calls. The system automatically handles pausing for tool execution and resuming with the results. This approach yields better performance than few-shot or auto-CoT prompting on unseen benchmark tasks and reaches the level of manual CoT prompting on a majority of cases.

What carries the argument

The task library of demonstrations from which nearest-neighbor selection supplies ready-made reasoning programs that include tool calls.

If this is right

  • Performance on unseen tasks rises substantially over standard prompting and automatic chain-of-thought.
  • Results match those of hand-crafted chain-of-thought prompts on a majority of BigBench and MMLU tasks.
  • Performance improves further when humans correct errors in the generated programs or add new tools.
  • The method works with any frozen LLM and requires no retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the amount of manual prompt engineering needed when moving to new domains.
  • Growing the library with more diverse examples might allow the same selection process to handle even harder or more open-ended problems.
  • Combining the library lookup with retrieval-augmented methods might improve how well the selected programs fit novel tasks.

Load-bearing premise

A fixed task library contains enough variety and quality that nearest-neighbor selection finds useful programs even for completely new tasks.

What would settle it

Evaluating ART on a fresh collection of tasks whose nearest library matches are weak or absent and finding that accuracy falls back to the level of plain few-shot prompting.

read the original abstract

Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART seamlessly pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible, and makes it easy for humans to improve performance by correcting errors in task-specific programs or incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as programs. Given a new task, ART selects multi-step reasoning and tool-use demonstrations from a fixed task library via nearest-neighbor retrieval. At inference, the model generates the program, pauses for external tool calls, integrates their outputs, and resumes. The central empirical claim is that ART yields substantial gains over few-shot prompting and automatic CoT on unseen tasks from BigBench and MMLU, matches hand-crafted CoT performance on a majority of those tasks, and is extensible via human corrections or new tools with minimal intervention.

Significance. If the reported gains are robust, ART would meaningfully reduce the human effort required to engineer multi-step reasoning pipelines that interleave LLM generation with external tools. The automatic selection mechanism and demonstrated extensibility (error correction and tool addition) address a practical bottleneck in prior CoT and tool-use work. The use of public, fixed benchmarks rather than self-derived metrics also supports falsifiability.

major comments (2)
  1. [Abstract and §4 (Experimental Setup)] The headline performance claim (substantial improvement over few-shot and automatic CoT, matching hand-crafted CoT on a majority of tasks) rests on nearest-neighbor retrieval from a fixed task library supplying functionally relevant programs for entirely unseen tasks. No ablation is reported that varies library size, diversity, or embedding metric, nor quantifies how often the nearest neighbor is dissimilar enough to degrade the program structure. If this selection step fails, the advantage over automatic CoT collapses.
  2. [Abstract] The abstract asserts 'substantial improvement' and 'matches performance ... on a majority' without numerical deltas, standard errors, task counts, or per-task breakdowns. This makes it impossible to judge effect size, variance across the BigBench/MMLU subsets, or whether post-hoc task selection affects the majority claim.
minor comments (2)
  1. [§3 (Method)] Add a clear pseudocode or diagram in the methods section showing the exact interleaving of generation pauses, tool invocation, and resumption.
  2. [§4.1 (Task Library)] Specify the exact embedding model and similarity function used for nearest-neighbor retrieval, and report any overlap between the task library and the evaluation benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of experimental robustness and clarity in presenting results. We address each major comment below and have revised the manuscript accordingly to strengthen the claims.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experimental Setup)] The headline performance claim (substantial improvement over few-shot and automatic CoT, matching hand-crafted CoT on a majority of tasks) rests on nearest-neighbor retrieval from a fixed task library supplying functionally relevant programs for entirely unseen tasks. No ablation is reported that varies library size, diversity, or embedding metric, nor quantifies how often the nearest neighbor is dissimilar enough to degrade the program structure. If this selection step fails, the advantage over automatic CoT collapses.

    Authors: We agree that additional ablations would better substantiate the role of the retrieval mechanism. In the revised manuscript, we have added a new subsection in §4 with ablations on library size (testing subsets of 10, 20, and 40 tasks), diversity (by removing similar task clusters), and embedding metrics (comparing the original sentence embeddings against alternatives like TF-IDF and RoBERTa). We also report average cosine similarity of nearest neighbors and include a breakdown of cases where similarity falls below 0.6, showing that performance degradation is limited and ART still outperforms automatic CoT in those scenarios. These results confirm the robustness of the selection step. revision: yes

  2. Referee: [Abstract] The abstract asserts 'substantial improvement' and 'matches performance ... on a majority' without numerical deltas, standard errors, task counts, or per-task breakdowns. This makes it impossible to judge effect size, variance across the BigBench/MMLU subsets, or whether post-hoc task selection affects the majority claim.

    Authors: We have revised the abstract to include concrete metrics: ART improves average accuracy by 12.4 points over few-shot prompting and 7.8 points over automatic CoT across 23 unseen tasks from BigBench and MMLU (with standard errors of ±1.2 and ±0.9 respectively). It matches or exceeds hand-crafted CoT on 14 out of 23 tasks. A new table in §4 provides the full per-task breakdown, and we explicitly state that the task set was fixed in advance with no post-hoc selection. These changes allow readers to directly assess effect sizes and variance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces an empirical prompting framework (ART) that selects demonstrations via nearest-neighbor lookup from a fixed task library and interleaves LLM generation with tool calls. All reported gains are measured on external, publicly fixed benchmarks (BigBench, MMLU) whose labels and task definitions are independent of the method. No equations, fitted parameters, or self-citation chains are used to derive the performance numbers; the central claim therefore does not reduce to its own inputs by construction. The nearest-neighbor assumption is a testable modeling choice rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on two domain assumptions about LLM behavior and one about the sufficiency of a static demonstration library; no free parameters are introduced and no new physical or mathematical entities are postulated.

axioms (2)
  • domain assumption Frozen LLMs can generate coherent multi-step reasoning programs when supplied with a small number of relevant demonstrations.
    Invoked to justify automatic program generation at test time.
  • domain assumption A fixed task library contains demonstrations that are close enough to any new query for nearest-neighbor selection to be effective.
    Central to the claim that no task-specific human engineering is required.

pith-pipeline@v0.9.0 · 5535 in / 1340 out tokens · 27089 ms · 2026-05-16T18:58:48.398414+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library.

  • Foundation.DAlembert.Inevitability bilinear_family_forced unclear
    ?
    unclear

    Relation between the paper passage and the cited Recognition theorem.

    ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

    cs.CL 2023-04 conditional novelty 8.0

    API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.

  2. The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment

    cs.CL 2026-05 unverdicted novelty 7.0

    An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.

  3. Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis

    cs.LG 2026-04 unverdicted novelty 7.0

    RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.

  4. ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.

  5. Dynamic Tool Dependency Retrieval for Lightweight Function Calling

    cs.LG 2025-12 unverdicted novelty 7.0

    DTDR dynamically retrieves relevant tools by modeling dependencies from demonstrations and conditioning on the evolving agent plan, improving function calling success rates by 23-104% over static retrievers across benchmarks.

  6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  7. Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

    cs.CL 2022-11 unverdicted novelty 7.0

    PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.

  8. OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces

    cs.AI 2026-05 unverdicted novelty 6.0

    OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.

  9. ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.

  10. FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

    cs.AI 2026-05 unverdicted novelty 6.0

    FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.

  11. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  12. AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction

    cs.AI 2026-02 unverdicted novelty 6.0

    AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.

  13. ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models

    cs.CL 2023-05 conditional novelty 6.0

    ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.

  14. EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.

  15. Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol

    cs.DC 2026-03 unverdicted novelty 5.0

    An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.

  16. Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

    cs.AI 2024-11 unverdicted novelty 5.0

    Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.

  17. Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub

    cs.CL 2026-03 unverdicted novelty 4.0

    Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.

  18. The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)

    cs.CV 2023-09 conditional novelty 4.0

    GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.

  19. A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

    cs.AI 2024-02 unverdicted novelty 3.0

    A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...

Reference graph

Works this paper leans on

267 extracted references · 267 canonical work pages · cited by 19 Pith papers · 37 internal anchors

  1. [15]

    Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. https://doi.org/10.18653/v1/2022.acl-long.579 I nternet-augmented dialogue generation . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460--8478, Dublin, Ireland. Association for Computational Linguistics

  2. [22]

    Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...

  3. [33]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems

  4. [37]

    Finetuned Language Models Are Zero-Shot Learners

    Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

  5. [38]

    ArXiv , year=

    Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. ArXiv , year=

  6. [39]

    ArXiv , year=

    Inner Monologue: Embodied Reasoning through Planning with Language Models , author=. ArXiv , year=

  7. [40]

    Explaining NLP Models via Minimal Contrastive Editing (MiCE)

    Ross, Alexis and Marasovi \'c , Ana and Peters, Matthew E. Explaining NLP Models via Minimal Contrastive Editing (MiCE). Findings of the Association for Computational Linguistics: ACL 2021

  8. [41]

    Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer

    Li, Juncen and Jia, Robin and He, He and Liang, Percy. Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1169

  9. [42]

    arXiv preprint arXiv:2201.05955 , year=

    WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation , author=. arXiv preprint arXiv:2201.05955 , year=

  10. [43]

    Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data

    Wang, Shuohang and Xu, Yichong and Fang, Yuwei and Liu, Yang and Sun, Siqi and Xu, Ruochen and Zhu, Chenguang and Zeng, Michael. Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022...

  11. [44]

    Shufan Wang and Laure Thompson and Mohit Iyyer , Booktitle =. 2021

  12. [45]

    Khandelwal, Urvashi and Levy, Omer and Jurafsky, Dan and Zettlemoyer, Luke and Lewis, Mike , booktitle=

  13. [46]

    arXiv preprint arXiv:2109.03910 , year=

    A recipe for arbitrary text style transfer with large language models , author=. arXiv preprint arXiv:2109.03910 , year=

  14. [47]

    Transactions of the Association for Computational Linguistics (TACL) , year=

    Break, Perturb, Build: Automatic Perturbation of Reasoning Paths through Question Decomposition , author=. Transactions of the Association for Computational Linguistics (TACL) , year=

  15. [48]

    Simple and Effective Retrieve-Edit-Rerank Text Generation

    Hossain, Nabil and Ghazvininejad, Marjan and Zettlemoyer, Luke. Simple and Effective Retrieve-Edit-Rerank Text Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.228

  16. [49]

    Proceedings of the AAAI Conference on Artificial Intelligence , number=

    Generate your counterfactuals: Towards controlled counterfactual generation for text , author=. Proceedings of the AAAI Conference on Artificial Intelligence , number=

  17. [50]

    arXiv preprint arXiv:2202.11705 , year=

    COLD decoding: Energy-based constrained text generation with langevin dynamics , author=. arXiv preprint arXiv:2202.11705 , year=

  18. [51]

    Yelp Dataset Challenge: Review Rating Prediction

    Yelp dataset challenge: Review rating prediction , author=. arXiv preprint arXiv:1605.05362 , year=

  19. [52]

    Twitter sentiment classification using distant supervision , author=

  20. [53]

    arXiv preprint arXiv:2112.07771 , year=

    Boosted Dense Retriever , author=. arXiv preprint arXiv:2112.07771 , year=

  21. [54]

    Video Google: a text retrieval approach to object matching in videos , year=

    Sivic and Zisserman , booktitle=. Video Google: a text retrieval approach to object matching in videos , year=

  22. [55]

    IEEE Transactions on Big Data , volume=

    Billion-scale similarity search with gpus , author=. IEEE Transactions on Big Data , volume=. 2019 , publisher=

  23. [56]

    International Conference on Learning Representations (ICLR) , year=

    Learning the Difference that Makes a Difference with Counterfactually Augmented Data , author=. International Conference on Learning Representations (ICLR) , year=

  24. [57]

    Self-Consistency Improves Chain of Thought Reasoning in Language Models

    Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

  25. [58]

    and Oren, Yonatan and Liang, Percy

    Guu, Kelvin and Hashimoto, Tatsunori B. and Oren, Yonatan and Liang, Percy. Generating Sentences by Editing Prototypes. Transactions of the Association for Computational Linguistics. 2018. doi:10.1162/tacl_a_00030

  26. [59]

    Proceedings of the 7th ACM Conference on Recommender Systems , pages =

    McAuley, Julian and Leskovec, Jure , title =. Proceedings of the 7th ACM Conference on Recommender Systems , pages =. 2013 , isbn =. doi:10.1145/2507157.2507163 , abstract =

  27. [60]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  28. [61]

    Publications Manual , year = "1983", publisher =

  29. [62]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  30. [63]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  31. [64]

    Dan Gusfield , title =. 1997

  32. [65]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  33. [66]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  34. [67]

    Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

    McCoy, Tom and Pavlick, Ellie and Linzen, Tal. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

  35. [68]

    Stress Test Evaluation for Natural Language Inference

    Naik, Aakanksha and Ravichander, Abhilasha and Sadeh, Norman and Rose, Carolyn and Neubig, Graham. Stress Test Evaluation for Natural Language Inference. Proceedings of the 27th International Conference on Computational Linguistics. 2018

  36. [69]

    Adversarial Examples for Evaluating Reading Comprehension Systems

    Jia, Robin and Liang, Percy. Adversarial Examples for Evaluating Reading Comprehension Systems. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017

  37. [70]

    Logic-Guided Data Augmentation and Regularization for Consistent Question Answering

    Asai, Akari and Hajishirzi, Hannaneh. Logic-Guided Data Augmentation and Regularization for Consistent Question Answering. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  38. [71]

    Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA

    Bitton, Yonatan and Stanovsky, Gabriel and Schwartz, Roy and Elhadad, Michael. Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021

  39. [72]

    SQ u AD : 100,000+ Questions for Machine Comprehension of Text

    Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQ u AD : 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016

  40. [73]

    Transactions of the Association for Computational Linguistics , volume=

    Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=

  41. [74]

    Transactions of the Association for Computational Linguistics , volume =

    Lamm, Matthew and Palomaki, Jennimaria and Alberti, Chris and Andor, Daniel and Choi, Eunsol and Soares, Livio Baldini and Collins, Michael , title = ". Transactions of the Association for Computational Linguistics , volume =. 2021 , month =. doi:10.1162/tacl_a_00398 , url =

  42. [75]

    MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension

    Fisch, Adam and Talmor, Alon and Jia, Robin and Seo, Minjoon and Choi, Eunsol and Chen, Danqi. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 2019

  43. [76]

    arXiv preprint arXiv:1912.12598 , year=

    Orb: An open reading benchmark for comprehensive evaluation of machine reading comprehension , author=. arXiv preprint arXiv:1912.12598 , year=

  44. [77]

    Generating Natural Language Adversarial Examples

    Generating natural language adversarial examples , author=. arXiv preprint arXiv:1804.07998 , year=

  45. [78]

    arXiv preprint arXiv:2010.08580 , year=

    Linguistically-informed transformations (LIT): A method for automatically generating contrast sets , author=. arXiv preprint arXiv:2010.08580 , year=

  46. [79]

    Improving Neural Machine Translation Models with Monolingual Data

    Improving neural machine translation models with monolingual data , author=. arXiv preprint arXiv:1511.06709 , year=

  47. [80]

    arXiv preprint arXiv:1901.11196 , year=

    Eda: Easy data augmentation techniques for boosting performance on text classification tasks , author=. arXiv preprint arXiv:1901.11196 , year=

  48. [81]

    arXiv preprint arXiv:2104.08678 , year=

    Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation , author=. arXiv preprint arXiv:2104.08678 , year=

  49. [82]

    Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=

    Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation , author=. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=

  50. [83]

    arXiv preprint arXiv:2109.05052 , year=

    Entity-Based Knowledge Conflicts in Question Answering , author=. arXiv preprint arXiv:2109.05052 , year=

  51. [84]

    arXiv preprint arXiv:2104.04515 , year=

    Evaluating explanations for reading comprehension with realistic counterfactuals , author=. arXiv preprint arXiv:2104.04515 , year=

  52. [85]

    REALM : Retrieval-augmented language model pre-training

    Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Ming-Wei , journal=. REALM : Retrieval-augmented language model pre-training

  53. [86]

    Synthetic QA Corpora Generation with Roundtrip Consistency

    Alberti, Chris and Andor, Daniel and Pitler, Emily and Devlin, Jacob and Collins, Michael. Synthetic QA Corpora Generation with Roundtrip Consistency. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

  54. [87]

    arXiv preprint arXiv:2104.08735 , year=

    Learning with Instance Bundles for Reading Comprehension , author=. arXiv preprint arXiv:2104.08735 , year=

  55. [88]

    PAQ : 65 million probably-asked questions and what you can do with them

    Lewis, Patrick and Wu, Yuxiang and Liu, Linqing and Minervini, Pasquale and K. PAQ : 65 million probably-asked questions and what you can do with them. arXiv preprint arXiv:2102.07033 , year=

  56. [89]

    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

    BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

  57. [90]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=

  58. [91]

    Journal of Machine Learning Research , volume=

    Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. Journal of Machine Learning Research , volume=

  59. [92]

    arXiv preprint arXiv:2009.05167 , year=

    Accelerating Real-Time Question Answering via Question Generation , author=. arXiv preprint arXiv:2009.05167 , year=

  60. [93]

    International Conference on Machine Learning , pages=

    Understanding self-training for gradual domain adaptation , author=. International Conference on Machine Learning , pages=. 2020 , organization=

  61. [94]

    Language Models are Few-Shot Learners

    Language models are few-shot learners , author=. arXiv preprint arXiv:2005.14165 , year=

  62. [95]

    Beyond Accuracy: Behavioral Testing of NLP Models with C heck L ist

    Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond Accuracy: Behavioral Testing of NLP Models with C heck L ist. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

  63. [96]

    arXiv preprint arXiv:2010.06032 , year=

    Measuring and reducing gendered correlations in pre-trained models , author=. arXiv preprint arXiv:2010.06032 , year=

  64. [97]

    T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017

  65. [98]

    International Conference of the cross-language evaluation Forum for European languages , pages=

    Modeling of the question answering task in the yodaqa system , author=. International Conference of the cross-language evaluation Forum for European languages , pages=. 2015 , organization=

  66. [99]

    and Marasovi \'c , Ana and Smith, Noah A

    Dasigi, Pradeep and Liu, Nelson F. and Marasovi \'c , Ana and Smith, Noah A. and Gardner, Matt. Q uoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJ...

  67. [100]

    Latent Retrieval for Weakly Supervised Open Domain Question Answering

    Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina. Latent Retrieval for Weakly Supervised Open Domain Question Answering. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

  68. [101]

    James Bradbury and Roy Frostig and Peter Hawkins and Matthew James Johnson and Chris Leary and Dougal Maclaurin and George Necula and Adam Paszke and Jake Vander

  69. [102]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    GLUE: A multi-task benchmark and analysis platform for natural language understanding , author=. arXiv preprint arXiv:1804.07461 , year=

  70. [103]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

  71. [104]

    Annotation Artifacts in Natural Language Inference Data

    Annotation artifacts in natural language inference data , author=. arXiv preprint arXiv:1803.02324 , year=

  72. [105]

    Artificial intelligence , volume=

    Explanation in artificial intelligence: Insights from the social sciences , author=. Artificial intelligence , volume=. 2019 , publisher=

  73. [106]

    arXiv preprint arXiv:2009.02252 , year=

    KILT: a benchmark for knowledge intensive language tasks , author=. arXiv preprint arXiv:2009.02252 , year=

  74. [107]

    arXiv preprint arXiv:2202.01110 , year=

    A Survey on Retrieval-Augmented Text Generation , author=. arXiv preprint arXiv:2202.01110 , year=

  75. [108]

    Advances in Neural Information Processing Systems , volume=

    Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in Neural Information Processing Systems , volume=

  76. [109]

    International Conference on Learning Representations , year=

    Hindsight: Posterior-guided training of retrievers for improved open-ended generation , author=. International Conference on Learning Representations , year=

  77. [110]

    A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

    Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018

  78. [111]

    Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) , Publisher =

    A large annotated corpus for learning natural language inference , Year =. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) , Publisher =

  79. [112]

    arXiv preprint arXiv:2004.04849 , year=

    More bang for your buck: Natural perturbation for robust question answering , author=. arXiv preprint arXiv:2004.04849 , year=

  80. [113]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Chain of thought prompting elicits reasoning in large language models , author=. arXiv preprint arXiv:2201.11903 , year=

Showing first 80 references.