
arxiv: 1905.10044 · v1 · submitted 2019-05-24 · 💻 cs.CL

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Pith reviewed 2026-05-13 09:42 UTC · model grok-4.3

classification 💻 cs.CL
keywords yes/no questions · reading comprehension · natural language inference · transfer learning · question answering · BERT · BoolQ

The pith

Natural yes/no questions prove harder for models than expected even after strong pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds BoolQ, a reading comprehension dataset drawn from naturally occurring yes/no questions rather than crowdsourced prompts. These questions typically ask for complex inferences that resemble entailment judgments instead of simple fact extraction from the passage. Transfer from entailment corpora such as MultiNLI improves results more than transfer from paraphrase or extractive QA sources. The benefit persists even when the starting point is a large pre-trained model like BERT. The strongest system reaches 80.4 percent accuracy, against 90 percent for human annotators and 62 percent for a majority baseline.
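The 62 percent majority baseline is the accuracy of always predicting the more frequent answer. A minimal sketch of that computation follows; the function name and the toy labels are assumptions for illustration, not BoolQ data, and the toy set arbitrarily makes "yes" the majority answer.

```python
from collections import Counter

def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy of always predicting the most frequent training label."""
    majority_label, _ = Counter(train_labels).most_common(1)[0]
    hits = sum(1 for y in eval_labels if y == majority_label)
    return majority_label, hits / len(eval_labels)

# Hypothetical toy labels (True = "yes"), sized so the ratio mirrors 62%.
toy_train = [True] * 62 + [False] * 38
toy_eval = [True] * 62 + [False] * 38

label, acc = majority_baseline_accuracy(toy_train, toy_eval)
print(label, acc)
```

Any model score is best read against this floor rather than against 50 percent, since the two answers are not equally frequent.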

Core claim

BoolQ consists of yes/no questions generated in unprompted settings paired with Wikipedia passages. Solving them often requires difficult entailment-like reasoning over non-factoid information. Training on MultiNLI before fine-tuning on BoolQ is the most effective transfer strategy, and it continues to help even when the model begins as BERT. This procedure yields 80.4 percent accuracy, leaving a sizable gap relative to the 90 percent human ceiling.

What carries the argument

The BoolQ dataset of naturally occurring yes/no questions, used to measure how well models perform complex inference beyond fact lookup.

If this is right

  • Transfer from MultiNLI data improves accuracy on BoolQ more than transfer from paraphrase or extractive QA data.
  • Even BERT continues to benefit from an intermediate MultiNLI training stage before fine-tuning on BoolQ.
  • Natural yes/no questions frequently require non-factoid information and entailment-style inference rather than direct span extraction.
  • A performance gap of roughly ten points remains between the best model and human annotators.
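The "roughly ten points" figure follows directly from the three reported accuracies. The short calculation below makes the arithmetic explicit; the "headroom closed" ratio is an illustrative quantity introduced here, not a metric from the paper.

```python
# Accuracies reported in the abstract, in percent.
best_model = 80.4
human = 90.0
majority = 62.0

gap_to_human = human - best_model           # points still missing
gain_over_majority = best_model - majority  # points above the trivial baseline
# Illustrative ratio (an assumption of this sketch): share of the
# majority-to-human range that the best model covers.
headroom_closed = gain_over_majority / (human - majority)

print(round(gap_to_human, 1), round(gain_over_majority, 1), round(headroom_closed, 3))
```

Under these numbers the best system covers about two thirds of the distance between the majority baseline and human accuracy.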

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Success on BoolQ would likely improve models on other realistic query types that mix reasoning with passage understanding.
  • The pattern that entailment pre-training helps question answering may apply to additional tasks that hinge on implicit inference.
  • Datasets built from unprompted user questions could expose similar gaps in other language-understanding benchmarks.

Load-bearing premise

The collected questions faithfully represent the distribution of yes/no questions that arise in everyday language use, and the provided answers contain little ambiguity or annotator bias.

What would settle it

A model trained without any entailment data that reaches or exceeds 90 percent accuracy on the BoolQ test set would undermine the claim that these natural questions systematically demand harder reasoning than current techniques can supply.

Original abstract

In this paper we study yes/no questions that are naturally occurring, meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy compared to 90% accuracy of human annotators (and 62% majority-baseline), leaving a significant gap for future work.
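The best method in the abstract is a two-stage recipe: train on a large related task first, then continue training on the target train set. The sketch below shows only that ordering, using a toy logistic-regression model on synthetic data as a stand-in. Everything in it (the model, the data generator, the hyperparameters, all names) is an assumption for illustration; it is not BERT, MultiNLI, or the authors' code.

```python
import math
import random

def predict(w, x):
    z = sum(wi * xi for wi, xi in zip(w, x))
    z = max(-30.0, min(30.0, z))  # clamp so exp() stays well-behaved
    return 1.0 / (1.0 + math.exp(-z))

def sgd(w, data, epochs, lr=0.1):
    """Continue training from weights w; returns a new weight list."""
    w = list(w)
    for _ in range(epochs):
        for x, y in data:
            err = y - predict(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w

def accuracy(w, data):
    return sum((predict(w, x) >= 0.5) == (y == 1) for x, y in data) / len(data)

def make_task(rng, n, slope):
    """Synthetic binary task: label is 1 when x0 + slope * x1 > 0."""
    data = []
    for _ in range(n):
        x = [rng.uniform(-1, 1), rng.uniform(-1, 1), 1.0]  # last entry is a bias term
        data.append((x, 1 if x[0] + slope * x[1] > 0 else 0))
    return data

rng = random.Random(0)
source_train = make_task(rng, 500, slope=1.0)  # stand-in for the large intermediate task
target_train = make_task(rng, 20, slope=0.8)   # stand-in for the small target train set
target_test = make_task(rng, 200, slope=0.8)

init = [0.0, 0.0, 0.0]
direct = sgd(init, target_train, epochs=5)                               # target only
staged = sgd(sgd(init, source_train, epochs=5), target_train, epochs=5)  # source, then target

print(accuracy(direct, target_test), accuracy(staged, target_test))
```

The only point of the sketch is that the second call to `sgd` starts from the weights the first call produced, which is the sense in which the intermediate task "transfers".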

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces BoolQ, a reading comprehension dataset of naturally occurring yes/no questions paired with Wikipedia passages. It argues that these questions are unexpectedly difficult, often requiring complex non-factoid inference akin to textual entailment. The authors evaluate transfer learning baselines and find that fine-tuning BERT after MultiNLI pre-training achieves 80.4% accuracy, compared to a 62% majority baseline and 90% human performance, leaving a substantial gap for future work.

Significance. If the dataset construction and labels are reliable, BoolQ provides a useful benchmark highlighting limitations of current models on natural yes/no questions even after strong pre-training. The empirical finding that entailment transfer outperforms paraphrase or extractive QA transfer is a concrete, actionable result that could guide future QA and NLI research. The work ships a new dataset with concrete accuracy numbers and baseline comparisons, which strengthens its contribution as an empirical resource.

major comments (3)
  1. [§3] §3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.
  2. [§4.2] §4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.
  3. [Table 2] Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.
minor comments (2)
  1. [Abstract] The abstract and introduction use 'surprising difficulty' without a quantitative comparison to prior yes/no QA datasets (e.g., on SQuAD or NewsQA yes/no subsets); adding this would better motivate the contribution.
  2. [Figure 1] Figure 1 (example questions) would benefit from explicit annotation of the inference steps required, to illustrate the 'entailment-like' nature claimed in the text.
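Major comment 3 asks for accuracy broken down by question type or passage length. A minimal sketch of such a breakdown is given below; the record layout, the group names, and the toy rows are hypothetical and do not come from the paper.

```python
from collections import defaultdict

def accuracy_by_group(records):
    """records: iterable of (group, gold_answer, predicted_answer) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, gold, pred in records:
        totals[group] += 1
        hits[group] += int(gold == pred)
    return {group: hits[group] / totals[group] for group in totals}

# Hypothetical toy predictions, tagged with an assumed question-type label.
toy_records = [
    ("factoid", "yes", "yes"),
    ("factoid", "no", "no"),
    ("inference", "yes", "no"),
    ("inference", "no", "no"),
]

breakdown = accuracy_by_group(toy_records)
print(breakdown)
```

The same function works for passage-length buckets if the first tuple field holds a bucket name instead of a question type.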

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the valuable comments. We address each of the major comments below and have made revisions to the manuscript to incorporate additional details on dataset construction, experimental variance, and performance breakdowns.

Point-by-point responses
  1. Referee: [§3] §3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.

    Authors: We agree that inter-annotator agreement and details on the annotation process are important to establish label quality. Although not reported in the initial submission, we have now computed inter-annotator agreement on a sample of the data and included the statistics, the adjudication protocol, and an analysis of ambiguous cases in the revised manuscript. These additions show that agreement is high and ambiguous cases are few, which supports the view that the gap is not primarily due to label noise. revision: yes

  2. Referee: [§4.2] §4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.

    Authors: We acknowledge that reporting variance and statistical significance would make the transfer learning results more convincing. We have conducted additional experiments across multiple random seeds and performed significance testing. The revised manuscript now includes the standard deviations and confirms that the improvement from MultiNLI pre-training is statistically significant. revision: yes

  3. Referee: [Table 2] Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.

    Authors: We agree that breaking down the results would help identify where models struggle. We have added such an analysis to the revised paper, including performance by question type and passage length. This breakdown reveals that the performance gap persists across different categories, indicating a broad challenge rather than localized issues. revision: yes
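The second response refers to standard deviations across random seeds and a significance test. One common way to make that comparison is Welch's t-test on per-seed accuracies, sketched below. The choice of test is an assumption of this sketch, and the per-seed numbers are invented for illustration; they are not results from the paper or its revision.

```python
import math
import statistics

def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    var_a, var_b = statistics.variance(a), statistics.variance(b)  # sample variances
    n_a, n_b = len(a), len(b)
    se2 = var_a / n_a + var_b / n_b
    t = (mean_a - mean_b) / math.sqrt(se2)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = se2 ** 2 / ((var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1))
    return t, df

# Invented per-seed accuracies (fractions), five seeds per configuration.
with_intermediate = [0.80, 0.82, 0.81, 0.79, 0.83]
without_intermediate = [0.76, 0.78, 0.77, 0.75, 0.79]

t_stat, dof = welch_t(with_intermediate, without_intermediate)
print(statistics.stdev(with_intermediate), statistics.stdev(without_intermediate))
print(t_stat, dof)
```

The statistic is then compared against a t distribution with `dof` degrees of freedom; for 8 degrees of freedom the two-sided 0.05 critical value is about 2.31.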

Circularity Check

0 steps flagged

No circularity: purely empirical dataset and evaluation

Full rationale

The paper constructs a new reading-comprehension dataset BoolQ from naturally occurring yes/no questions and reports direct empirical accuracies (80.4% for the best BERT+MultiNLI transfer baseline versus 90% human and 62% majority). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; performance figures are measured on held-out test data after standard training, with no reduction to inputs by construction. Human annotation serves as an external benchmark rather than a self-referential definition. The work is therefore self-contained against external data and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard NLP assumptions about annotation quality and model transferability rather than new postulates.

axioms (1)
  • Domain assumption: human annotators provide reliable ground-truth labels for yes/no questions.
    Invoked when reporting 90% human accuracy as the upper bound.
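One way to probe this axiom is a chance-corrected agreement score between two annotators, such as Cohen's kappa. The sketch below is a plain implementation of the standard formula; the two label sequences are hypothetical and are not taken from BoolQ.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a.keys() | freq_b.keys()) / (n * n)
    # Undefined when expected == 1 (both annotators always give the same single label).
    return (observed - expected) / (1 - expected)

# Hypothetical annotations for ten questions.
annotator_a = ["yes", "yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]
annotator_b = ["yes", "yes", "no", "no", "no", "yes", "no", "yes", "yes", "yes"]

kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

In this toy case the annotators agree on 8 of 10 items, but because both say "yes" 60 percent of the time, chance agreement is already 0.52, so kappa comes out well below the raw 0.8.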

pith-pipeline@v0.9.0 · 5466 in / 1071 out tokens · 25195 ms · 2026-05-13T09:42:05.357041+00:00 · methodology



Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Discovering Latent Knowledge in Language Models Without Supervision

    cs.CL 2022-12 conditional novelty 8.0

    An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...

  2. Language Models are Few-Shot Learners

    cs.CL 2020-05 accept novelty 8.0

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  3. Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...

  4. Winner-Take-All Spiking Transformer for Language Modeling

    cs.NE 2026-04 unverdicted novelty 7.0

    Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.

  5. A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

    cs.AR 2026-03 unverdicted novelty 7.0

    SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

  6. Path-Constrained Mixture-of-Experts

    cs.LG 2026-03 unverdicted novelty 7.0

    PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.

  7. EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

    cs.LG 2026-03 conditional novelty 7.0

    EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

  8. HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training

    cs.LG 2026-01 unverdicted novelty 7.0

    HOSL reduces client memory up to 3.7x versus full first-order split learning while staying within 0.20-4.23% accuracy on OPT models by pairing client zeroth-order estimation with server first-order optimization.

  9. Deep Delta Learning

    cs.LG 2026-01 unverdicted novelty 7.0

    Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accu...

  10. MIDUS: Memory-Infused Depth Up-Scaling

    cs.LG 2025-12 unverdicted novelty 7.0

    MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.

  11. Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

    cs.LG 2025-12 unverdicted novelty 7.0

    Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.

  12. Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

    cs.CL 2025-12 conditional novelty 7.0

    Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.

  13. SpinQuant: LLM quantization with learned rotations

    cs.LG 2024-05 conditional novelty 7.0

    SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.

  14. The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

    cs.CL 2024-02 unverdicted novelty 7.0

    BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

  15. Massive Activations in Large Language Models

    cs.CL 2024-02 unverdicted novelty 7.0

    Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

  16. Multitask Prompted Training Enables Zero-Shot Task Generalization

    cs.LG 2021-10 conditional novelty 7.0

    Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.

  17. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer

    cs.LG 2019-10 unverdicted novelty 7.0

    T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...

  18. PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts

    cs.CL 2026-05 unverdicted novelty 6.0

    PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.

  19. SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask

    cs.LG 2026-05 unverdicted novelty 6.0

    SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.

  20. Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio

    cs.LG 2026-05 unverdicted novelty 6.0

    MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.

  21. Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting

    cs.LG 2026-05 unverdicted novelty 6.0

    Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.

  22. Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM

    cs.CL 2026-05 unverdicted novelty 6.0

    A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalizati...

  23. GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models

    cs.AI 2026-04 unverdicted novelty 6.0

    GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...

  24. River-LLM: Large Language Model Seamless Exit Based on KV Share

    cs.CL 2026-04 unverdicted novelty 6.0

    River-LLM enables seamless token-level early exit in decoder-only LLMs via a KV-shared river mechanism and similarity-based error prediction, delivering 1.71-2.16x practical speedup on reasoning tasks while preserving...

  25. TLoRA: Task-aware Low Rank Adaptation of Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...

  26. Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate

    cs.LG 2026-04 unverdicted novelty 6.0

    DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.

  27. BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation

    cs.NE 2026-04 unverdicted novelty 6.0

    BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...

  28. Rethinking Residual Errors in Compensation-based LLM Quantization

    cs.LG 2026-04 conditional novelty 6.0

    Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.

  29. SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.

  30. Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters

    cs.LG 2026-04 unverdicted novelty 6.0

    PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.

  31. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  32. M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

    cs.LG 2026-03 unverdicted novelty 6.0

    M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.

  33. CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization

    cs.LG 2026-02 unverdicted novelty 6.0

    CoreQ delivers adaptive mismatch correction via closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.

  34. L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts

    cs.LG 2026-01 unverdicted novelty 6.0

    L2R improves MoE performance by routing in a low-rank space with Lipschitz-controlled saturated inner-product scoring and multi-anchor mechanisms.

  35. HyperAdapt: Simple High-Rank Adaptation

    cs.LG 2025-09 unverdicted novelty 6.0

    HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.

  36. Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model

    cs.AI 2024-08 unverdicted novelty 6.0

    A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.

  37. Chameleon: Mixed-Modal Early-Fusion Foundation Models

    cs.CL 2024-05 unverdicted novelty 6.0

    Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...

  38. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  39. Chain-of-Verification Reduces Hallucination in Large Language Models

    cs.CL 2023-09 unverdicted novelty 6.0

    Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.

  40. AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

    cs.CL 2023-04 accept novelty 6.0

    AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.

  41. How Much Knowledge Can You Pack Into the Parameters of a Language Model?

    cs.CL 2020-02 accept novelty 6.0

    Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.

  42. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  43. Adaptive Spiking Neurons for Vision and Language Modeling

    cs.NE 2026-04 unverdicted novelty 5.0

    ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.

  44. Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse

    cs.CL 2026-02 unverdicted novelty 5.0

    Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.

  45. BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning

    cs.LG 2025-09 unverdicted novelty 5.0

    BoHA partitions frozen weights into a b by b grid and applies independent low-rank Hadamard factors per block, outperforming LoRA on matched-budget single-task averages while retaining 57.66% first-stage accuracy in a...

  46. SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution

    cs.AI 2025-08 unverdicted novelty 5.0

    SMoE substitutes low-importance experts with cached similar ones in MoE inference on edge devices to achieve 48% lower decoding latency and over 60% cache hit rate with nearly lossless accuracy.

  47. MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices

    cs.CV 2023-12 unverdicted novelty 5.0

    MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetso...

  48. Galactica: A Large Language Model for Science

    cs.CL 2022-11 unverdicted novelty 5.0

    Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

  49. Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence

    cs.LG 2025-11 unverdicted novelty 4.0

    Dynamic nested hierarchies let models self-adjust their multi-level optimization structures to support lifelong learning and adaptation to shifting data distributions.

  50. From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems

    cs.CL 2025-07 unverdicted novelty 4.0

    Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.

  51. Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model

    cs.CV 2025-02 unverdicted novelty 4.0

    Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.

  52. Gemma: Open Models Based on Gemini Research and Technology

    cs.CL 2024-03 accept novelty 4.0

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  53. Gemma 2: Improving Open Language Models at a Practical Size

    cs.CL 2024-07 conditional novelty 3.0

    Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.

  54. Large Language Models: A Survey

    cs.CL 2024-02 accept novelty 3.0

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 54 Pith papers · 2 internal anchors

  1. [1]

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In Proceedings of the IEEE International Conference on Computer Vision

  2. [2]

    Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. The Sixth PASCAL Recognizing Textual Entailment Challenge. In TAC

  3. [3]

    Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2011. The Seventh PASCAL Recognizing Textual Entailment Challenge. In TAC

  4. [4]

    Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A Large Annotated Corpus for Learning Natural Language Inference. In EMNLP. https://doi.org/10.18653/v1/D15-1075

  5. [5]

    Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. Enhanced LSTM for Natural Language Inference. In ACL. https://doi.org/10.18653/v1/P17-1152

  6. [6]

    Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. QuAC: Question Answering in Context. In EMNLP. https://www.aclweb.org/anthology/D18-1241

  7. [7]

    Alexis Conneau and Douwe Kiela. 2018. SentEval: An Evaluation Toolkit for Universal Sentence Representations. In LREC. https://www.aclweb.org/anthology/L18-1269

  8. [8]

    Dorottya Demszky, Kelvin Guu, and Percy Liang. 2018. Transforming Question Answering Datasets Into Natural Language Inference Datasets. Computing Research Repository, arXiv:1809.02922. Version 2. https://arxiv.org/abs/1809.02922

  9. [9]

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Computing Research Repository, arXiv:1810.04805. Version 1. https://arxiv.org/abs/1810.04805

  10. [10]

    Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking NLI Systems with Sentences that Require Simple Lexical Inferences. In ACL. https://www.aclweb.org/anthology/P18-2103

  11. [11]

    Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. 2018. Annotation Artifacts in Natural Language Inference Data. In NAACL. https://doi.org/10.18653/v1/N18-2017

  12. [12]

    Minghao Hu, Yuxing Peng, Zhen Huang, Nan Yang, Ming Zhou, et al. 2018. Read+Verify: Machine Reading Comprehension with Unanswerable Questions. In CoRR

  13. [13]

    Robin Jia and Percy Liang. 2017. https://doi.org/10.18653/v1/D17-1215 Adversarial Examples for Evaluating Reading Comprehension Systems. In EMNLP

  14. [14]

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In ACL

  15. [15]

    Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A Textual Entailment Dataset from Science Question Answering. In AAAI

  16. [16]

    Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In ICLR

  17. [17]

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Resea...

  18. [18]

    Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. https://doi.org/10.18653/v1/D17-1082 RACE: Large-Scale Reading Comprehension Dataset from Examinations. In EMNLP

  19. [19]

    R Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. https://arxiv.org/abs/1902.01007 Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Computing Research Repository, arXiv:1902.01007. Version 1

  20. [20]

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. https://www.aclweb.org/anthology/D18-1260 Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP

  21. [21]

    Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In LREC

  22. [22]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. https://arxiv.org/abs/1611.09268 MS MARCO: A Human Generated Machine Reading Comprehension Dataset. Computing Research Repository, arXiv:1611.09268. Version 3

  23. [23]

    Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. https://doi.org/10.18653/v1/D16-1244 A Decomposable Attention Model for Natural Language Inference. In EMNLP

  24. [24]

    Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. https://doi.org/10.18653/v1/N18-1202 Deep Contextualized Word Representations. In NAACL

  25. [25]

    Jason Phang, Thibault Févry, and Samuel R Bowman. 2018. https://arxiv.org/abs/1811.01088 Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. Computing Research Repository, arXiv:1811.01088. Version 2

  26. [26]

    Adam Poliak, Aparajita Haldar, Rachel Rudinger, J Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. https://www.aclweb.org/anthology/D18-1007 Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation. In EMNLP

  27. [27]

    Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf Improving Language Understanding by Generative Pre-training

  28. [28]

    Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. https://www.aclweb.org/anthology/P18-2124 Know What You Don't Know: Unanswerable Questions for SQuAD. In ACL

  29. [29]

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. https://doi.org/10.18653/v1/D16-1264 SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP

  30. [30]

    Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. CoQA: A Conversational Question Answering Challenge. In TACL

  31. [31]

    Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. https://www.aclweb.org/anthology/D18-1233 Interpretation of Natural Language Rules in Conversational Machine Reading. In EMNLP

  32. [32]

    Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine Comprehension. In ICLR

  33. [33]

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. https://www.aclweb.org/anthology/W18-5446 GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In EMNLP

  34. [34]

    Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. https://www.aclweb.org/anthology/Q18-1021 Constructing Datasets for Multi-hop Reading Comprehension Across Documents. In ACL

  35. [35]

    Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. In ICLR

  36. [36]

    Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. https://doi.org/10.18653/v1/N18-1101 A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL

  37. [37]

    Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2017. Visual Question Answering: A Survey of Methods and Datasets. In Computer Vision and Image Understanding. Elsevier

  38. [38]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. https://www.aclweb.org/anthology/D18-1259 HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP

  39. [39]

    Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. https://www.aclweb.org/anthology/D18-1009 SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. In EMNLP

  40. [40]

    Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. https://arxiv.org/abs/1810.12885 ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension. Computing Research Repository, arXiv:1810.12885. Version 1

  41. [41]

    Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27