arxiv: 1502.05698 · v10 · pith:O3IPJBL4new · submitted 2015-02-19 · 💻 cs.AI · cs.CL · stat.ML

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

classification 💻 cs.AI cs.CL stat.ML
keywords tasks, able, answering, goal, learning, many, measure, question

One long-term goal of machine learning research is to produce methods that are applicable to reasoning and natural language, in particular building an intelligent dialogue agent. To measure progress towards that goal, we argue for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering. Our tasks measure understanding in several ways: whether a system is able to answer questions via chaining facts, simple induction, deduction and many more. The tasks are designed to be prerequisites for any system that aims to be capable of conversing with a human. We believe many existing learning systems currently cannot solve them, and hence our aim is to classify these tasks into skill sets, so that researchers can identify (and then rectify) the failings of their systems. We also extend and improve the recently introduced Memory Networks model, and show it is able to solve some, but not all, of the tasks.

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

    cs.AI 2026-05 accept novelty 8.0

    QSTRBench is a new benchmark evaluating LLMs on compositional reasoning, converse relations, and conceptual neighbourhoods across QSTR calculi including a newly published RCC-22 CN, showing models exceed chance but fa...

  2. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    cs.LG 2022-01 unverdicted novelty 8.0

    Neural networks exhibit grokking on small algorithmic datasets, achieving perfect generalization well after overfitting.

  3. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  4. MIXAR: Scaling Autoregressive Pixel-based Language Models to Multiple Languages and Scripts

    cs.CL 2026-04 unverdicted novelty 7.0

    MIXAR is the first autoregressive pixel-based language model for eight languages and scripts, with empirical gains on multilingual tasks, robustness to unseen languages, and further improvements when scaled to 0.5B pa...

  5. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    cs.CL 2016-11 accept novelty 7.0

    MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.

  6. Concrete Problems in AI Safety

    cs.AI 2016-06 accept novelty 7.0

    The paper categorizes five concrete AI safety problems arising from flawed objectives, costly evaluation, and learning dynamics.

  7. Towards Faster Language Model Inference Using Mixture-of-Experts Flow Matching

    cs.AI 2026-04 unverdicted novelty 6.0

    Mixture-of-experts flow matching enables non-autoregressive language models to achieve autoregressive-level quality in three sampling steps, delivering up to 1000x faster inference than diffusion models.

  8. Scaling Data-Constrained Language Models

    cs.CL 2023-05 conditional novelty 6.0

    Repeating training data up to 4 epochs yields negligible loss increase versus unique data for fixed compute, and a new scaling law accounts for the decaying value of repeated tokens and excess parameters.

  9. Hindi Question Generation Using Dependency Structures

    cs.CL 2019-06 unverdicted novelty 6.0

    A rule-based system using karaka-dependency structures and IndoWordNet generates significantly more diverse Hindi questions than input sentences.

  10. Universal Transformers

    cs.CL 2018-07 unverdicted novelty 6.0

    Universal Transformers combine Transformer parallelism with recurrent updates and dynamic halting to achieve Turing-completeness under assumptions and outperform standard Transformers on algorithmic and language tasks.

  11. MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

    cs.CL 2026-05 unverdicted novelty 5.0

    MINTEval benchmark shows current memory-augmented systems average 27.9% accuracy on long-horizon interference tasks, limited by retrieval and memory construction with degradation from intervening updates.

  12. Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

    cs.AI 2026-05 unverdicted novelty 5.0

    A dual-process memory architecture for scientific AI agents maintains 70-85% accuracy over 15,000 messages by using a constant 10-message episodic window and domain-specific semantic consolidation, consuming 62% fewer...

  13. HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory

    cs.AI 2026-05 unverdicted novelty 5.0

    HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, allowing quantification of cognitive effort where complex tasks demand more and standard SFT can reduce it.

  14. Be Consistent! Improving Procedural Text Comprehension using Label Consistency

    cs.CL 2019-06 unverdicted novelty 5.0

    A label consistency training framework improves F1 on the ProPara benchmark for procedural text comprehension by using multiple independent descriptions of the same process.

  15. UW-BHI at MEDIQA 2019: An Analysis of Representation Methods for Medical Natural Language Inference

    cs.IR 2019-07 unverdicted novelty 2.0

    Compares BERT, ESP, and Cui2Vec embeddings within ESIM on the MedNLI shared-task dataset to assess performance and internal representations for medical inference.