Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks
read the original abstract
One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems can currently not solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.
This paper has not been read by Pith yet.
Forward citations
Cited by 15 Pith papers
-
QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi
QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fa...
-
Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts
MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...
-
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.
-
Concrete Problems in AI Safety
The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.
-
Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching
Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.
-
Scaling Data-Constrained Language Models
Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.
-
Hindi Question Generation Using Dependency Structures
A rule-based system using karaka-dependency structures and IndoWordNet generates significantly more diverse Hindi questions than input sentences.
-
Universal Transformers
Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.
-
MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems
MINTEval benchmark shows current memory-augmented systems average 27.9% accuracy on long-horizon interference tasks, limited by retrieval and memory construction with degradation from intervening updates.
-
Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents
A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer...
-
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.
-
Be Consistent! Improving Procedural Text Comprehension using Label Consistency
A label consistency training framework improves F1 on the ProPara benchmark for procedural text comprehension by using multiple independent descriptions of the same process.
-
UW-BHI at MEDIQA 2019: An Analysis of Representation Methods for Medical Natural Language Inference
Compares BERT, ESP, and Cui2Vec embeddings within ESIM on the MedNLI shared-task dataset to assess performance and internal representations for medical inference.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.