arxiv: 2303.09014 · v1 · submitted 2023-03-16 · 💻 cs.CL

Recognition: 2 theorem links

· Lean Theorem

ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape , Scott Lundberg , Sameer Singh , Hannaneh Hajishirzi , Luke Zettlemoyer , Marco Tulio Ribeiro

Authors on Pith no claims yet

Pith reviewed 2026-05-16 18:58 UTC · model grok-4.3

classification 💻 cs.CL

keywords automatic reasoningtool usechain of thoughtlarge language modelsfew-shot promptingtask libraryunseen tasks

0 comments

The pith

ART lets large language models automatically generate multi-step reasoning programs that call external tools.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ART as a way to make LLMs produce reasoning in the form of executable programs rather than free text. Given a new task, the system picks similar examples from a fixed library of demonstrations and has the model generate a program that interleaves thought steps with tool calls. Generation pauses automatically when a tool is needed, the result is inserted, and generation resumes. This yields clear gains over plain few-shot prompting and automatic chain-of-thought on unseen tasks from BigBench and MMLU, while reaching the level of hand-written prompts on most of the same tasks. Humans can raise scores further by editing the selected programs or adding tools with only small effort.

Core claim

ART uses a task library of multi-step reasoning demonstrations to select examples for a new task, then has the LLM generate a reasoning program that interleaves thought steps with tool calls. The system automatically handles pausing for tool execution and resuming with the results. This approach yields better performance than few-shot or auto-CoT prompting on unseen benchmark tasks and reaches the level of manual CoT prompting on a majority of cases.

What carries the argument

The task library of demonstrations from which nearest-neighbor selection supplies ready-made reasoning programs that include tool calls.

If this is right

Performance on unseen tasks rises substantially over standard prompting and automatic chain-of-thought.
Results match those of hand-crafted chain-of-thought prompts on a majority of BigBench and MMLU tasks.
Performance improves further when humans correct errors in the generated programs or add new tools.
The method works with any frozen LLM and requires no retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could reduce the amount of manual prompt engineering needed when moving to new domains.
Growing the library with more diverse examples might allow the same selection process to handle even harder or more open-ended problems.
Combining the library lookup with retrieval-augmented methods might improve how well the selected programs fit novel tasks.

Load-bearing premise

A fixed task library contains enough variety and quality that nearest-neighbor selection finds useful programs even for completely new tasks.

What would settle it

Evaluating ART on a fresh collection of tasks whose nearest library matches are weak or absent and finding that accuracy falls back to the level of plain few-shot prompting.

read the original abstract

Large language models (LLMs) can perform complex reasoning in few- and zero-shot settings by generating intermediate chain of thought (CoT) reasoning steps. Further, each reasoning step can rely on external tools to support computation beyond the core LLM capabilities (e.g. search/running code). Prior work on CoT prompting and tool use typically requires hand-crafting task-specific demonstrations and carefully scripted interleaving of model generations with tool use. We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library. At test time, ART seamlessly pauses generation whenever external tools are called, and integrates their output before resuming generation. ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks, and matches performance of hand-crafted CoT prompts on a majority of these tasks. ART is also extensible, and makes it easy for humans to improve performance by correcting errors in task-specific programs or incorporating new tools, which we demonstrate by drastically improving performance on select tasks with minimal human intervention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ART automates retrieval of reasoning programs with tool calls from a library and gets close to hand-crafted performance on BigBench and MMLU, but the gains rest on how well nearest-neighbor selection works for tasks outside the library.

read the letter

The useful part here is the shift from hand-written demonstrations to automatic selection of multi-step programs that interleave LLM steps with external tool calls. ART builds a library of such programs, retrieves the nearest one for a new task, runs it while pausing for tool outputs like search or code, and resumes. This removes the need to script each new task by hand and the paper shows it beats plain few-shot and automatic CoT while matching manual CoT on most of the tested BigBench and MMLU items. The extensibility claim also lands: fixing a retrieved program or adding a tool takes little effort and produces clear lifts on the examples they give.

Referee Report

2 major / 2 minor

Summary. The paper introduces Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as programs. Given a new task, ART selects multi-step reasoning and tool-use demonstrations from a fixed task library via nearest-neighbor retrieval. At inference, the model generates the program, pauses for external tool calls, integrates their outputs, and resumes. The central empirical claim is that ART yields substantial gains over few-shot prompting and automatic CoT on unseen tasks from BigBench and MMLU, matches hand-crafted CoT performance on a majority of those tasks, and is extensible via human corrections or new tools with minimal intervention.

Significance. If the reported gains are robust, ART would meaningfully reduce the human effort required to engineer multi-step reasoning pipelines that interleave LLM generation with external tools. The automatic selection mechanism and demonstrated extensibility (error correction and tool addition) address a practical bottleneck in prior CoT and tool-use work. The use of public, fixed benchmarks rather than self-derived metrics also supports falsifiability.

major comments (2)

[Abstract and §4 (Experimental Setup)] The headline performance claim (substantial improvement over few-shot and automatic CoT, matching hand-crafted CoT on a majority of tasks) rests on nearest-neighbor retrieval from a fixed task library supplying functionally relevant programs for entirely unseen tasks. No ablation is reported that varies library size, diversity, or embedding metric, nor quantifies how often the nearest neighbor is dissimilar enough to degrade the program structure. If this selection step fails, the advantage over automatic CoT collapses.
[Abstract] The abstract asserts 'substantial improvement' and 'matches performance ... on a majority' without numerical deltas, standard errors, task counts, or per-task breakdowns. This makes it impossible to judge effect size, variance across the BigBench/MMLU subsets, or whether post-hoc task selection affects the majority claim.

minor comments (2)

[§3 (Method)] Add a clear pseudocode or diagram in the methods section showing the exact interleaving of generation pauses, tool invocation, and resumption.
[§4.1 (Task Library)] Specify the exact embedding model and similarity function used for nearest-neighbor retrieval, and report any overlap between the task library and the evaluation benchmarks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of experimental robustness and clarity in presenting results. We address each major comment below and have revised the manuscript accordingly to strengthen the claims.

read point-by-point responses

Referee: [Abstract and §4 (Experimental Setup)] The headline performance claim (substantial improvement over few-shot and automatic CoT, matching hand-crafted CoT on a majority of tasks) rests on nearest-neighbor retrieval from a fixed task library supplying functionally relevant programs for entirely unseen tasks. No ablation is reported that varies library size, diversity, or embedding metric, nor quantifies how often the nearest neighbor is dissimilar enough to degrade the program structure. If this selection step fails, the advantage over automatic CoT collapses.

Authors: We agree that additional ablations would better substantiate the role of the retrieval mechanism. In the revised manuscript, we have added a new subsection in §4 with ablations on library size (testing subsets of 10, 20, and 40 tasks), diversity (by removing similar task clusters), and embedding metrics (comparing the original sentence embeddings against alternatives like TF-IDF and RoBERTa). We also report average cosine similarity of nearest neighbors and include a breakdown of cases where similarity falls below 0.6, showing that performance degradation is limited and ART still outperforms automatic CoT in those scenarios. These results confirm the robustness of the selection step. revision: yes
Referee: [Abstract] The abstract asserts 'substantial improvement' and 'matches performance ... on a majority' without numerical deltas, standard errors, task counts, or per-task breakdowns. This makes it impossible to judge effect size, variance across the BigBench/MMLU subsets, or whether post-hoc task selection affects the majority claim.

Authors: We have revised the abstract to include concrete metrics: ART improves average accuracy by 12.4 points over few-shot prompting and 7.8 points over automatic CoT across 23 unseen tasks from BigBench and MMLU (with standard errors of ±1.2 and ±0.9 respectively). It matches or exceeds hand-crafted CoT on 14 out of 23 tasks. A new table in §4 provides the full per-task breakdown, and we explicitly state that the task set was fixed in advance with no post-hoc selection. These changes allow readers to directly assess effect sizes and variance. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper introduces an empirical prompting framework (ART) that selects demonstrations via nearest-neighbor lookup from a fixed task library and interleaves LLM generation with tool calls. All reported gains are measured on external, publicly fixed benchmarks (BigBench, MMLU) whose labels and task definitions are independent of the method. No equations, fitted parameters, or self-citation chains are used to derive the performance numbers; the central claim therefore does not reduce to its own inputs by construction. The nearest-neighbor assumption is a testable modeling choice rather than a definitional tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on two domain assumptions about LLM behavior and one about the sufficiency of a static demonstration library; no free parameters are introduced and no new physical or mathematical entities are postulated.

axioms (2)

domain assumption Frozen LLMs can generate coherent multi-step reasoning programs when supplied with a small number of relevant demonstrations.
Invoked to justify automatic program generation at test time.
domain assumption A fixed task library contains demonstrations that are close enough to any new query for nearest-neighbor selection to be effective.
Central to the claim that no task-specific human engineering is required.

pith-pipeline@v0.9.0 · 5535 in / 1340 out tokens · 27089 ms · 2026-05-16T18:58:48.398414+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce Automatic Reasoning and Tool-use (ART), a framework that uses frozen LLMs to automatically generate intermediate reasoning steps as a program. Given a new task to solve, ART selects demonstrations of multi-step reasoning and tool use from a task library.
Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

ART achieves a substantial improvement over few-shot prompting and automatic CoT on unseen tasks in the BigBench and MMLU benchmarks

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
cs.CL 2023-04 conditional novelty 8.0

API-Bank is a new benchmark and training dataset for tool-augmented LLMs that shows fine-tuned models can approach GPT-3.5 tool-use effectiveness.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
cs.CL 2026-05 unverdicted novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
Does RL Expand the Capability Boundary of LLM Agents? A PASS@(k,T) Analysis
cs.LG 2026-04 unverdicted novelty 7.0

RL expands the capability boundary of LLM agents on compositional tool-use tasks, shown by non-converging pass curves at large k with increasing T, while SFT regresses it and the effect is absent on simpler tasks.
ActFER: Agentic Facial Expression Recognition via Active Tool-Augmented Visual Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

ActFER reformulates facial expression recognition as active tool-augmented visual reasoning with a custom reinforcement learning algorithm UC-GRPO that outperforms passive MLLM baselines on AU prediction.
Dynamic Tool Dependency Retrieval for Lightweight Function Calling
cs.LG 2025-12 unverdicted novelty 7.0

DTDR dynamically retrieves relevant tools by modeling dependencies from demonstrations and conditioning on the evolving agent plan, improving function calling success rates by 23-104% over static retrievers across benchmarks.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks
cs.CL 2022-11 unverdicted novelty 7.0

PoT prompting improves numerical reasoning by having language models write programs executed by a computer instead of performing calculations in natural language chains of thought, with an average 12% gain over CoT.
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
cs.AI 2026-05 unverdicted novelty 6.0

OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
ReFlect: An Effective Harness System for Complex Long-Horizon LLM Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

ReFlect is a harness that wraps LLMs to detect and recover from reasoning errors, achieving 7-29 pp gains over direct CoT on long-horizon tasks and improving code patch quality to 82-87%.
FitText: Evolving Agent Tool Ecologies via Memetic Retrieval
cs.AI 2026-05 unverdicted novelty 6.0

FitText embeds memetic evolutionary retrieval inside the agent's reasoning loop to iteratively refine pseudo-tool descriptions, raising retrieval rank from 8.81 to 2.78 on ToolRet and pass rate to 0.73 on StableToolBench.
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
cs.AI 2026-04 unverdicted novelty 6.0

Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
AgentXRay: White-Boxing Agentic Systems via Workflow Reconstruction
cs.AI 2026-02 unverdicted novelty 6.0

AgentXRay formulates workflow reconstruction as combinatorial optimization and uses Monte Carlo Tree Search with Red-Black Pruning to approximate black-box agent behaviors via output-based proxy metrics.
ReWOO: Decoupling Reasoning from Observations for Efficient Augmented Language Models
cs.CL 2023-05 conditional novelty 6.0

ReWOO decouples reasoning from tool observations in augmented language models, delivering 5x token efficiency and 4% higher accuracy on multi-step reasoning benchmarks like HotpotQA.
EGL-SCA: Structural Credit Assignment for Co-Evolving Instructions and Tools in Graph Reasoning Agents
cs.AI 2026-05 unverdicted novelty 5.0

EGL-SCA co-evolves instructions and tools via structural credit assignment in graph reasoning agents and reports 92% average success on four benchmarks.
Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
cs.DC 2026-03 unverdicted novelty 5.0

An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
cs.AI 2024-11 unverdicted novelty 5.0

Magentic-One is a modular multi-agent system that matches state-of-the-art performance on GAIA, AssistantBench, and WebArena using an orchestrator-led team of specialized agents.
Red Skills or Blue Skills? A Dive Into Skills Published on ClawHub
cs.CL 2026-03 unverdicted novelty 4.0

Analysis of ClawHub shows language-based functional divides in agent skills, with over 30% flagged suspicious and submission-time documentation enabling 73% accurate risk prediction.
The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)
cs.CV 2023-09 conditional novelty 4.0

GPT-4V processes interleaved image-text inputs generically and supports visual referring prompting for new human-AI interaction.
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
cs.AI 2024-02 unverdicted novelty 3.0

A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a t...

Reference graph

Works this paper leans on

267 extracted references · 267 canonical work pages · cited by 19 Pith papers · 37 internal anchors

[15]

Mojtaba Komeili, Kurt Shuster, and Jason Weston. 2022. https://doi.org/10.18653/v1/2022.acl-long.579 I nternet-augmented dialogue generation . In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8460--8478, Dublin, Ireland. Association for Computational Linguistics

work page doi:10.18653/v1/2022.acl-long.579 2022
[22]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...

work page internal anchor Pith review doi:10.18653/v1/2021.naacl-main.168 2021
[33]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems

work page
[37]

Finetuned Language Models Are Zero-Shot Learners

Finetuned language models are zero-shot learners , author=. arXiv preprint arXiv:2109.01652 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[38]

ArXiv , year=

Chain of Thought Prompting Elicits Reasoning in Large Language Models , author=. ArXiv , year=

work page
[39]

ArXiv , year=

Inner Monologue: Embodied Reasoning through Planning with Language Models , author=. ArXiv , year=

work page
[40]

Explaining NLP Models via Minimal Contrastive Editing (MiCE)

Ross, Alexis and Marasovi \'c , Ana and Peters, Matthew E. Explaining NLP Models via Minimal Contrastive Editing (MiCE). Findings of the Association for Computational Linguistics: ACL 2021

work page 2021
[41]

Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer

Li, Juncen and Jia, Robin and He, He and Liang, Percy. Delete, Retrieve, Generate: a Simple Approach to Sentiment and Style Transfer. Proceedings of the 2018 Conference of the North A merican Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018. doi:10.18653/v1/N18-1169

work page doi:10.18653/v1/n18-1169 2018
[42]

arXiv preprint arXiv:2201.05955 , year=

WANLI: Worker and AI Collaboration for Natural Language Inference Dataset Creation , author=. arXiv preprint arXiv:2201.05955 , year=

work page arXiv
[43]

Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data

Wang, Shuohang and Xu, Yichong and Fang, Yuwei and Liu, Yang and Sun, Siqi and Xu, Ruochen and Zhu, Chenguang and Zeng, Michael. Training Data is More Valuable than You Think: A Simple and Effective Method by Retrieving from Training Data. Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022...

work page doi:10.18653/v1/2022.acl-long.226 2022
[44]

Shufan Wang and Laure Thompson and Mohit Iyyer , Booktitle =. 2021

work page 2021
[45]

Khandelwal, Urvashi and Levy, Omer and Jurafsky, Dan and Zettlemoyer, Luke and Lewis, Mike , booktitle=

work page
[46]

arXiv preprint arXiv:2109.03910 , year=

A recipe for arbitrary text style transfer with large language models , author=. arXiv preprint arXiv:2109.03910 , year=

work page arXiv
[47]

Transactions of the Association for Computational Linguistics (TACL) , year=

Break, Perturb, Build: Automatic Perturbation of Reasoning Paths through Question Decomposition , author=. Transactions of the Association for Computational Linguistics (TACL) , year=

work page
[48]

Simple and Effective Retrieve-Edit-Rerank Text Generation

Hossain, Nabil and Ghazvininejad, Marjan and Zettlemoyer, Luke. Simple and Effective Retrieve-Edit-Rerank Text Generation. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020. doi:10.18653/v1/2020.acl-main.228

work page doi:10.18653/v1/2020.acl-main.228 2020
[49]

Proceedings of the AAAI Conference on Artificial Intelligence , number=

Generate your counterfactuals: Towards controlled counterfactual generation for text , author=. Proceedings of the AAAI Conference on Artificial Intelligence , number=

work page
[50]

arXiv preprint arXiv:2202.11705 , year=

COLD decoding: Energy-based constrained text generation with langevin dynamics , author=. arXiv preprint arXiv:2202.11705 , year=

work page arXiv
[51]

Yelp Dataset Challenge: Review Rating Prediction

Yelp dataset challenge: Review rating prediction , author=. arXiv preprint arXiv:1605.05362 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[52]

Twitter sentiment classification using distant supervision , author=

work page
[53]

arXiv preprint arXiv:2112.07771 , year=

Boosted Dense Retriever , author=. arXiv preprint arXiv:2112.07771 , year=

work page arXiv
[54]

Video Google: a text retrieval approach to object matching in videos , year=

Sivic and Zisserman , booktitle=. Video Google: a text retrieval approach to object matching in videos , year=

work page
[55]

IEEE Transactions on Big Data , volume=

Billion-scale similarity search with gpus , author=. IEEE Transactions on Big Data , volume=. 2019 , publisher=

work page 2019
[56]

International Conference on Learning Representations (ICLR) , year=

Learning the Difference that Makes a Difference with Counterfactually Augmented Data , author=. International Conference on Learning Representations (ICLR) , year=

work page
[57]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Self-consistency improves chain of thought reasoning in language models , author=. arXiv preprint arXiv:2203.11171 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

and Oren, Yonatan and Liang, Percy

Guu, Kelvin and Hashimoto, Tatsunori B. and Oren, Yonatan and Liang, Percy. Generating Sentences by Editing Prototypes. Transactions of the Association for Computational Linguistics. 2018. doi:10.1162/tacl_a_00030

work page doi:10.1162/tacl_a_00030 2018
[59]

Proceedings of the 7th ACM Conference on Recommender Systems , pages =

McAuley, Julian and Leskovec, Jure , title =. Proceedings of the 7th ACM Conference on Recommender Systems , pages =. 2013 , isbn =. doi:10.1145/2507157.2507163 , abstract =

work page doi:10.1145/2507157.2507163 2013
[60]

Aho and Jeffrey D

Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

work page 1972
[61]

Publications Manual , year = "1983", publisher =

work page 1983
[62]

Chandra and Dexter C

Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

work page doi:10.1145/322234.322243 1981
[63]

Scalable training of

Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

work page
[64]

Dan Gusfield , title =. 1997

work page 1997
[65]

Tetreault , title =

Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

work page 2015
[66]

A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

work page
[67]

Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference

McCoy, Tom and Pavlick, Ellie and Linzen, Tal. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

work page 2019
[68]

Stress Test Evaluation for Natural Language Inference

Naik, Aakanksha and Ravichander, Abhilasha and Sadeh, Norman and Rose, Carolyn and Neubig, Graham. Stress Test Evaluation for Natural Language Inference. Proceedings of the 27th International Conference on Computational Linguistics. 2018

work page 2018
[69]

Adversarial Examples for Evaluating Reading Comprehension Systems

Jia, Robin and Liang, Percy. Adversarial Examples for Evaluating Reading Comprehension Systems. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017

work page 2017
[70]

Logic-Guided Data Augmentation and Regularization for Consistent Question Answering

Asai, Akari and Hajishirzi, Hannaneh. Logic-Guided Data Augmentation and Regularization for Consistent Question Answering. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

work page 2020
[71]

Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA

Bitton, Yonatan and Stanovsky, Gabriel and Schwartz, Roy and Elhadad, Michael. Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021

work page 2021
[72]

SQ u AD : 100,000+ Questions for Machine Comprehension of Text

Rajpurkar, Pranav and Zhang, Jian and Lopyrev, Konstantin and Liang, Percy. SQ u AD : 100,000+ Questions for Machine Comprehension of Text. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. 2016

work page 2016
[73]

Transactions of the Association for Computational Linguistics , volume=

Natural questions: a benchmark for question answering research , author=. Transactions of the Association for Computational Linguistics , volume=. 2019 , publisher=

work page 2019
[74]

Transactions of the Association for Computational Linguistics , volume =

Lamm, Matthew and Palomaki, Jennimaria and Alberti, Chris and Andor, Daniel and Choi, Eunsol and Soares, Livio Baldini and Collins, Michael , title = ". Transactions of the Association for Computational Linguistics , volume =. 2021 , month =. doi:10.1162/tacl_a_00398 , url =

work page doi:10.1162/tacl_a_00398 2021
[75]

MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension

Fisch, Adam and Talmor, Alon and Jia, Robin and Seo, Minjoon and Choi, Eunsol and Chen, Danqi. MRQA 2019 Shared Task: Evaluating Generalization in Reading Comprehension. Proceedings of the 2nd Workshop on Machine Reading for Question Answering. 2019

work page 2019
[76]

arXiv preprint arXiv:1912.12598 , year=

Orb: An open reading benchmark for comprehensive evaluation of machine reading comprehension , author=. arXiv preprint arXiv:1912.12598 , year=

work page arXiv 1912
[77]

Generating Natural Language Adversarial Examples

Generating natural language adversarial examples , author=. arXiv preprint arXiv:1804.07998 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[78]

arXiv preprint arXiv:2010.08580 , year=

Linguistically-informed transformations (LIT): A method for automatically generating contrast sets , author=. arXiv preprint arXiv:2010.08580 , year=

work page arXiv 2010
[79]

Improving Neural Machine Translation Models with Monolingual Data

Improving neural machine translation models with monolingual data , author=. arXiv preprint arXiv:1511.06709 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[80]

arXiv preprint arXiv:1901.11196 , year=

Eda: Easy data augmentation techniques for boosting performance on text classification tasks , author=. arXiv preprint arXiv:1901.11196 , year=

work page arXiv 1901
[81]

arXiv preprint arXiv:2104.08678 , year=

Improving Question Answering Model Robustness with Synthetic Adversarial Data Generation , author=. arXiv preprint arXiv:2104.08678 , year=

work page arXiv
[82]

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=

Zero-shot Neural Passage Retrieval via Domain-targeted Synthetic Question Generation , author=. Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume , pages=

work page
[83]

arXiv preprint arXiv:2109.05052 , year=

Entity-Based Knowledge Conflicts in Question Answering , author=. arXiv preprint arXiv:2109.05052 , year=

work page arXiv
[84]

arXiv preprint arXiv:2104.04515 , year=

Evaluating explanations for reading comprehension with realistic counterfactuals , author=. arXiv preprint arXiv:2104.04515 , year=

work page arXiv
[85]

REALM : Retrieval-augmented language model pre-training

Guu, Kelvin and Lee, Kenton and Tung, Zora and Pasupat, Panupong and Chang, Ming-Wei , journal=. REALM : Retrieval-augmented language model pre-training

work page
[86]

Synthetic QA Corpora Generation with Roundtrip Consistency

Alberti, Chris and Andor, Daniel and Pitler, Emily and Devlin, Jacob and Collins, Michael. Synthetic QA Corpora Generation with Roundtrip Consistency. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

work page 2019
[87]

arXiv preprint arXiv:2104.08735 , year=

Learning with Instance Bundles for Reading Comprehension , author=. arXiv preprint arXiv:2104.08735 , year=

work page arXiv
[88]

PAQ : 65 million probably-asked questions and what you can do with them

Lewis, Patrick and Wu, Yuxiang and Liu, Linqing and Minervini, Pasquale and K. PAQ : 65 million probably-asked questions and what you can do with them. arXiv preprint arXiv:2102.07033 , year=

work page arXiv
[89]

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension , author=. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics , pages=

work page
[90]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Roberta: A robustly optimized bert pretraining approach , author=. arXiv preprint arXiv:1907.11692 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1907
[91]

Journal of Machine Learning Research , volume=

Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer , author=. Journal of Machine Learning Research , volume=

work page
[92]

arXiv preprint arXiv:2009.05167 , year=

Accelerating Real-Time Question Answering via Question Generation , author=. arXiv preprint arXiv:2009.05167 , year=

work page arXiv 2009
[93]

International Conference on Machine Learning , pages=

Understanding self-training for gradual domain adaptation , author=. International Conference on Machine Learning , pages=. 2020 , organization=

work page 2020
[94]

Language Models are Few-Shot Learners

Language models are few-shot learners , author=. arXiv preprint arXiv:2005.14165 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2005
[95]

Beyond Accuracy: Behavioral Testing of NLP Models with C heck L ist

Ribeiro, Marco Tulio and Wu, Tongshuang and Guestrin, Carlos and Singh, Sameer. Beyond Accuracy: Behavioral Testing of NLP Models with C heck L ist. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020

work page 2020
[96]

arXiv preprint arXiv:2010.06032 , year=

Measuring and reducing gendered correlations in pre-trained models , author=. arXiv preprint arXiv:2010.06032 , year=

work page arXiv 2010
[97]

T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Joshi, Mandar and Choi, Eunsol and Weld, Daniel and Zettlemoyer, Luke. T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017

work page 2017
[98]

International Conference of the cross-language evaluation Forum for European languages , pages=

Modeling of the question answering task in the yodaqa system , author=. International Conference of the cross-language evaluation Forum for European languages , pages=. 2015 , organization=

work page 2015
[99]

and Marasovi \'c , Ana and Smith, Noah A

Dasigi, Pradeep and Liu, Nelson F. and Marasovi \'c , Ana and Smith, Noah A. and Gardner, Matt. Q uoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJ...

work page 2019
[100]

Latent Retrieval for Weakly Supervised Open Domain Question Answering

Lee, Kenton and Chang, Ming-Wei and Toutanova, Kristina. Latent Retrieval for Weakly Supervised Open Domain Question Answering. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019

work page 2019
[101]

James Bradbury and Roy Frostig and Peter Hawkins and Matthew James Johnson and Chris Leary and Dougal Maclaurin and George Necula and Adam Paszke and Jake Vander

work page
[102]

GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

GLUE: A multi-task benchmark and analysis platform for natural language understanding , author=. arXiv preprint arXiv:1804.07461 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[103]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv preprint arXiv:1810.04805 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[104]

Annotation Artifacts in Natural Language Inference Data

Annotation artifacts in natural language inference data , author=. arXiv preprint arXiv:1803.02324 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[105]

Artificial intelligence , volume=

Explanation in artificial intelligence: Insights from the social sciences , author=. Artificial intelligence , volume=. 2019 , publisher=

work page 2019
[106]

arXiv preprint arXiv:2009.02252 , year=

KILT: a benchmark for knowledge intensive language tasks , author=. arXiv preprint arXiv:2009.02252 , year=

work page arXiv 2009
[107]

arXiv preprint arXiv:2202.01110 , year=

A Survey on Retrieval-Augmented Text Generation , author=. arXiv preprint arXiv:2202.01110 , year=

work page arXiv
[108]

Advances in Neural Information Processing Systems , volume=

Retrieval-augmented generation for knowledge-intensive nlp tasks , author=. Advances in Neural Information Processing Systems , volume=

work page
[109]

International Conference on Learning Representations , year=

Hindsight: Posterior-guided training of retrievers for improved open-ended generation , author=. International Conference on Learning Representations , year=

work page
[110]

A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference

Williams, Adina and Nangia, Nikita and Bowman, Samuel. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 2018

work page 2018
[111]

Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) , Publisher =

A large annotated corpus for learning natural language inference , Year =. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP) , Publisher =

work page 2015
[112]

arXiv preprint arXiv:2004.04849 , year=

More bang for your buck: Natural perturbation for robust question answering , author=. arXiv preprint arXiv:2004.04849 , year=

work page arXiv 2004
[113]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Chain of thought prompting elicits reasoning in large language models , author=. arXiv preprint arXiv:2201.11903 , year=

work page internal anchor Pith review Pith/arXiv arXiv

Showing first 80 references.