pith. machine review for the scientific record. sign in

arxiv: 1810.12885 · v1 · submitted 2018-10-30 · 💻 cs.CL

Recognition: unknown

ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension

Authors on Pith no claims yet
classification 💻 cs.CL
keywords recordcommonsensecomprehensionhumanmachinereadingdatasetperformance
0
0 comments X
read the original abstract

We present a large-scale dataset, ReCoRD, for machine reading comprehension requiring commonsense reasoning. Experiments on this dataset demonstrate that the performance of state-of-the-art MRC systems fall far behind human performance. ReCoRD represents a challenge for future research to bridge the gap between human and machine commonsense reading comprehension. ReCoRD is available at http://nlp.jhu.edu/record.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  2. Visual Text Compression as Measure Transport

    cs.CV 2026-05 unverdicted novelty 7.0

    Framing visual text compression as measure transport decomposes encoding loss into precision and coverage costs, enabling a label-free routing rule that matches oracle performance on 17 of 24 NLP datasets while using ...

  3. GAIA: a benchmark for General AI Assistants

    cs.CL 2023-11 unverdicted novelty 7.0

    GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.

  4. RWKV: Reinventing RNNs for the Transformer Era

    cs.CL 2023-05 unverdicted novelty 7.0

    RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

  5. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  6. The Power of Scale for Parameter-Efficient Prompt Tuning

    cs.CL 2021-04 unverdicted novelty 7.0

    Prompt tuning matches full model tuning performance on large language models while tuning only a small fraction of parameters and improves robustness to domain shifts.

  7. DeBERTa: Decoding-enhanced BERT with Disentangled Attention

    cs.CL 2020-06 unverdicted novelty 7.0

    DeBERTa improves BERT-style models by separating content and relative position in attention and adding absolute positions to the decoder, yielding consistent gains on NLU and NLG tasks and the first single-model super...

  8. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  9. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

    cs.CL 2019-05 accept novelty 7.0

    BoolQ introduces naturally occurring yes/no questions as a challenging benchmark where BERT fine-tuned on MultiNLI reaches 80.4% accuracy against 90% human performance.

  10. MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    cs.CL 2016-11 accept novelty 7.0

    MS MARCO is a new large-scale machine reading comprehension dataset built from real Bing search queries, human-generated answers, and web passages, supporting three tasks including answer synthesis and passage ranking.

  11. Universally Empowering Zeroth-Order Optimization via Adaptive Layer-wise Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    AdaLeZO uses a non-stationary multi-armed bandit to adaptively allocate perturbation budget across layers in zeroth-order optimization and applies inverse probability weighting to reduce variance while preserving unbi...

  12. The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

    cs.CL 2023-06 unverdicted novelty 6.0

    Properly filtered web data from CommonCrawl alone trains LLMs that significantly outperform models trained on The Pile, with 600 billion tokens and 1.3B/7.5B parameter models released.

  13. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  14. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  15. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    cs.CL 2020-02 accept novelty 6.0

    Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.

  16. SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems

    cs.CL 2019-05 accept novelty 6.0

    SuperGLUE is a new benchmark with more difficult language understanding tasks, a toolkit, and leaderboard to drive further progress beyond GLUE.