arxiv: 1905.10044 · v1 · submitted 2019-05-24 · 💻 cs.CL

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova

Pith reviewed 2026-05-13 09:42 UTC · model grok-4.3

classification 💻 cs.CL

keywords yes/no questions · reading comprehension · natural language inference · transfer learning · question answering · BERT · BoolQ


The pith

Natural yes/no questions prove harder for models than expected even after strong pre-training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds BoolQ, a reading comprehension dataset drawn from naturally occurring yes/no questions rather than crowdsourced prompts. These questions typically ask for complex inferences that resemble entailment judgments instead of simple fact extraction from the passage. Transfer from entailment corpora such as MultiNLI improves results more than transfer from paraphrase or extractive QA sources. The benefit persists even when the starting point is a large pre-trained model like BERT. The strongest system reaches 80.4 percent accuracy, against 90 percent for human annotators and 62 percent for a majority baseline.

Core claim

BoolQ consists of yes/no questions generated in unprompted settings paired with Wikipedia passages. Solving them often requires difficult entailment-like reasoning over non-factoid information. Training on MultiNLI before fine-tuning on BoolQ is the most effective transfer strategy, and it continues to help even when the model begins as BERT. This procedure yields 80.4 percent accuracy, leaving a sizable gap relative to the 90 percent human ceiling.

What carries the argument

The BoolQ dataset of naturally occurring yes/no questions, used to measure how well models perform complex inference beyond fact lookup.

If this is right

Transfer from MultiNLI data improves accuracy on BoolQ more than transfer from paraphrase or extractive QA data.
Even BERT continues to benefit from an intermediate MultiNLI training stage before fine-tuning on BoolQ.
Natural yes/no questions frequently require non-factoid information and entailment-style inference rather than direct span extraction.
A performance gap of roughly ten points remains between the best model and human annotators.
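The arithmetic behind these figures can be sketched in a few lines. This is a minimal illustration, not the paper's evaluation code: the helper `majority_baseline_accuracy` and the toy label list are hypothetical, and the toy list is constructed only so that its majority class covers 62 percent of the answers.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Toy labels (hypothetical): the majority class makes up 62% of the answers.
toy_labels = [True] * 62 + [False] * 38
baseline = majority_baseline_accuracy(toy_labels)

# Accuracies quoted in the text, in percent.
best_model, human, majority = 80.4, 90.0, 62.0
gap_to_human = round(human - best_model, 1)          # points still missing
gain_over_majority = round(best_model - majority, 1)  # points above the baseline

print(f"majority baseline on toy labels: {baseline:.2f}")
print(f"gap to human: {gap_to_human} points")
print(f"gain over majority: {gain_over_majority} points")
```

Read this way, the best system recovers most of the distance between the majority baseline and human accuracy, while the remaining gap is the "roughly ten points" mentioned above.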

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Success on BoolQ would likely improve models on other realistic query types that mix reasoning with passage understanding.
The pattern that entailment pre-training helps question answering may apply to additional tasks that hinge on implicit inference.
Datasets built from unprompted user questions could expose similar gaps in other language-understanding benchmarks.

Load-bearing premise

That the collected questions faithfully represent the distribution of yes/no questions arising in everyday language use, and that the provided answers contain little ambiguity or annotator bias.

What would settle it

A model trained without any entailment data that reaches or exceeds 90 percent accuracy on the BoolQ test set would undermine the claim that these natural questions systematically demand harder reasoning than current techniques can supply.

Original abstract

In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. It achieves 80.4% accuracy compared to 90% accuracy of human annotators (and 62% majority-baseline), leaving a significant gap for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BoolQ is a new dataset of natural yes/no questions that prove harder than expected. MultiNLI transfer helps BERT reach 80.4%, though human label quality is the part that needs the most scrutiny.


The main thing here is that BoolQ collects real, unprompted yes/no questions and shows they require more inference than the synthetic or factoid versions we usually test on. The best model, BERT after MultiNLI then BoolQ fine-tuning, hits 80.4% while humans reach 90% and the majority baseline sits at 62%. That gap is the central result, and the paper demonstrates that entailment data transfers better than paraphrase or extractive QA data even on top of large pretraining.

Referee Report

3 major / 2 minor

Summary. The paper introduces BoolQ, a reading comprehension dataset of naturally occurring yes/no questions paired with Wikipedia passages. It argues that these questions are unexpectedly difficult, often requiring complex non-factoid inference akin to textual entailment. The authors evaluate transfer learning baselines and find that training BERT on MultiNLI and then fine-tuning it on BoolQ achieves 80.4% accuracy, compared to a 62% majority baseline and 90% human performance, leaving a substantial gap for future work.

Significance. If the dataset construction and labels are reliable, BoolQ provides a useful benchmark highlighting limitations of current models on natural yes/no questions even after strong pre-training. The empirical finding that entailment transfer outperforms paraphrase or extractive QA transfer is a concrete, actionable result that could guide future QA and NLI research. The work ships a new dataset with concrete accuracy numbers and baseline comparisons, which strengthens its contribution as an empirical resource.

major comments (3)

§3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.
§4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.
Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.
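The kind of reporting requested in the §4.2 comment can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the per-seed accuracies are invented placeholders (they are NOT results from the paper), and `paired_sign_flip_pvalue` is a hypothetical helper implementing an exact two-sided sign-flip permutation test on paired per-seed differences.

```python
import itertools
import statistics


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    n_extreme = 0
    total = 0
    # Enumerate every assignment of +/- signs to the paired differences.
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            n_extreme += 1
    return n_extreme / total


# Hypothetical per-seed accuracies in percent; placeholders, not paper results.
with_mnli = [80.1, 80.6, 79.9, 80.4, 80.8]
without_mnli = [77.0, 77.5, 76.8, 77.3, 77.1]

mean_gain = statistics.mean(with_mnli) - statistics.mean(without_mnli)
std_with = statistics.stdev(with_mnli)
std_without = statistics.stdev(without_mnli)
p_value = paired_sign_flip_pvalue(with_mnli, without_mnli)

print(f"mean gain: {mean_gain:.2f} points")
print(f"std with / without: {std_with:.2f} / {std_without:.2f}")
print(f"sign-flip p-value: {p_value}")
```

With only five seeds, an exact two-sided sign-flip test cannot produce a p-value below 2/32 = 0.0625, so a significance claim at the 0.05 level would need more seeds or a test over per-example predictions.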

minor comments (2)

Abstract: The abstract and introduction use 'surprising difficulty' without a quantitative comparison to prior yes/no QA datasets (e.g., on SQuAD or NewsQA yes/no subsets); adding this would better motivate the contribution.
Figure 1 (example questions): The figure would benefit from explicit annotation of the inference steps required, to illustrate the 'entailment-like' nature claimed in the text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the valuable comments. We address each of the major comments below and have made revisions to the manuscript to incorporate additional details on dataset construction, experimental variance, and performance breakdowns.


Referee: §3 (Dataset Construction): The paper reports 90% human accuracy but provides no inter-annotator agreement statistics, adjudication protocol for disagreements, or analysis of ambiguous cases. Given that the central claim is the surprising difficulty of natural yes/no questions (which the authors note often require complex inference), the absence of these details leaves open the possibility that a non-trivial fraction of the 10-point model-human gap reflects label noise rather than model shortcomings.

Authors: We agree that inter-annotator agreement and details on the annotation process are important to establish label quality. Although not reported in the initial submission, we have now computed inter-annotator agreement on a sample of the data and included the statistics, the adjudication protocol, and an analysis of ambiguous cases in the revised manuscript. These additions show that agreement is high and ambiguous cases are few, supporting that the gap is not primarily due to label noise.
revision: yes

Referee: §4.2 (Transfer Learning Experiments): The claim that MultiNLI transfer 'continues to be very beneficial even when starting from massive pre-trained language models such as BERT' is supported by the 80.4% result, but the paper does not report variance across random seeds or statistical significance tests for the improvement over the BERT baseline without MultiNLI. This weakens the strength of the transfer-learning conclusion.

Authors: We acknowledge that reporting variance and statistical significance would make the transfer learning results more convincing. We have conducted additional experiments across multiple random seeds and performed significance testing. The revised manuscript now includes the standard deviations and confirms that the improvement from MultiNLI pre-training is statistically significant.
revision: yes

Referee: Table 2 (Baseline Results): The majority baseline is reported at 62%, but the paper does not break down performance by question type (e.g., factoid vs. inference-heavy) or passage length, making it hard to localize where the remaining error lies and whether the 80.4% result truly demonstrates a broad gap.

Authors: We agree that breaking down the results would help identify where models struggle. We have added such an analysis to the revised paper, including performance by question type and passage length. This breakdown reveals that the performance gap persists across different categories, indicating a broad challenge rather than localized issues.
revision: yes
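A per-category breakdown of the kind discussed in the Table 2 exchange could be computed along these lines. This is a minimal sketch: the record layout, the helper `accuracy_by_group`, and the group names are hypothetical and do not reflect the paper's own question taxonomy or data format.

```python
from collections import defaultdict


def accuracy_by_group(examples):
    """Return {group: accuracy} for records with 'group', 'gold', 'pred' keys."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ex in examples:
        total[ex["group"]] += 1
        correct[ex["group"]] += int(ex["gold"] == ex["pred"])
    return {group: correct[group] / total[group] for group in total}


# Hypothetical records; group names are illustrative only.
examples = [
    {"group": "factoid", "gold": True, "pred": True},
    {"group": "factoid", "gold": False, "pred": False},
    {"group": "inference", "gold": True, "pred": False},
    {"group": "inference", "gold": False, "pred": False},
]
breakdown = accuracy_by_group(examples)

for group, acc in sorted(breakdown.items()):
    print(f"{group}: {acc:.2f}")
```

The same function works for passage-length buckets if the `group` field is set to a length bin instead of a question type.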

Circularity Check

0 steps flagged

No circularity: purely empirical dataset and evaluation


The paper constructs a new reading-comprehension dataset BoolQ from naturally occurring yes/no questions and reports direct empirical accuracies (80.4% for the best BERT+MultiNLI transfer baseline versus 90% human and 62% majority). No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear; performance figures are measured on held-out test data after standard training, with no reduction to inputs by construction. Human annotation serves as an external benchmark rather than a self-referential definition. The work is therefore self-contained against external data and does not trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard NLP assumptions about annotation quality and model transferability rather than new postulates.

axioms (1)

Domain assumption: Human annotators provide reliable ground-truth labels for yes/no questions.
Invoked when reporting 90% human accuracy as the upper bound.
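One common way to probe this assumption is a chance-corrected agreement statistic such as Cohen's kappa between two annotators. The sketch below is an illustration only: the source does not say which agreement measure, if any, was used, and the two annotation lists are invented toy data.

```python
def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    n = len(a)
    # Fraction of items on which the two annotators agree.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Agreement expected by chance from each annotator's label frequencies.
    labels = set(a) | set(b)
    expected = sum((a.count(label) / n) * (b.count(label) / n) for label in labels)
    if expected == 1.0:
        return 1.0  # both annotators always use the same single label
    return (observed - expected) / (1 - expected)


# Toy yes/no annotations (hypothetical, not BoolQ data).
annotator_1 = [True, True, True, False, False, True, False, True, True, False]
annotator_2 = [True, True, False, False, False, True, False, True, True, True]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

In this toy case the annotators agree on 8 of 10 items, but because both say "yes" 60 percent of the time, chance agreement is 0.52 and kappa comes out near 0.58. Raw agreement alone can therefore overstate label reliability when one answer dominates.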



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.RealityFromDistinction reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

In this paper we study yes/no questions that are naturally occurring — meaning that they are generated in unprompted and unconstrained settings.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 54 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Discovering Latent Knowledge in Language Models Without Supervision
cs.CL 2022-12 conditional novelty 8.0

An unsupervised technique extracts latent yes-no knowledge from language model activations by locating a direction that satisfies logical consistency properties, outperforming zero-shot accuracy by 4% on average acros...
Language Models are Few-Shot Learners
cs.CL 2020-05 accept novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
Chain-based Distillation for Effective Initialization of Variable-Sized Small Language Models
cs.CL 2026-05 unverdicted novelty 7.0

Chain-based Distillation constructs a sequence of anchor models to enable efficient initialization of variable-sized SLMs through interpolation, with bridge distillation for cross-architecture transfer, yielding bette...
Winner-Take-All Spiking Transformer for Language Modeling
cs.NE 2026-04 unverdicted novelty 7.0

Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
cs.AR 2026-03 unverdicted novelty 7.0

SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
Path-Constrained Mixture-of-Experts
cs.LG 2026-03 unverdicted novelty 7.0

PathMoE constrains expert paths in MoE models by sharing router parameters across layer blocks, yielding more concentrated paths, better performance on perplexity and tasks, and no need for auxiliary losses.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE
cs.LG 2026-03 conditional novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
HOSL: Hybrid-Order Split Learning for Memory-Constrained Edge Training
cs.LG 2026-01 unverdicted novelty 7.0

HOSL reduces client memory up to 3.7x versus full first-order split learning while staying within 0.20-4.23% accuracy on OPT models by pairing client zeroth-order estimation with server first-order optimization.
Deep Delta Learning
cs.LG 2026-01 unverdicted novelty 7.0

Deep Delta Learning replaces additive residual updates with a gated delta-rule that selectively overwrites residual content along learned directions, improving language modeling quality over standard ResNet-style accu...
MIDUS: Memory-Infused Depth Up-Scaling
cs.LG 2025-12 unverdicted novelty 7.0

MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.
Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics
cs.LG 2025-12 unverdicted novelty 7.0

Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling
cs.CL 2025-12 conditional novelty 7.0

Four Over Six adaptively scales blocks in NVFP4 quantization to smaller FP4 values, making representable value distributions more uniform and reducing quantization error especially for near-maximal values.
SpinQuant: LLM quantization with learned rotations
cs.LG 2024-05 conditional novelty 7.0

SpinQuant learns optimal rotations to enable accurate 4-bit quantization of LLM weights, activations, and KV cache, reducing the zero-shot gap to full precision to 2.9 points on LLaMA-2 7B.
The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits
cs.CL 2024-02 unverdicted novelty 7.0

BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.
Massive Activations in Large Language Models
cs.CL 2024-02 unverdicted novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.
Multitask Prompted Training Enables Zero-Shot Task Generalization
cs.LG 2021-10 conditional novelty 7.0

Multitask fine-tuning of an encoder-decoder model on prompted datasets produces zero-shot generalization that often beats models up to 16 times larger on standard benchmarks.
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
cs.LG 2019-10 unverdicted novelty 7.0

T5 casts all NLP tasks as text-to-text generation, systematically explores pre-training choices, and reaches strong performance on summarization, QA, classification and other tasks via large-scale training on the Colo...
PEML: Parameter-efficient Multi-Task Learning with Optimized Continuous Prompts
cs.CL 2026-05 unverdicted novelty 6.0

PEML co-optimizes continuous prompts and low-rank adaptations to deliver up to 6.67% average accuracy gains over existing multi-task PEFT methods on GLUE, SuperGLUE, and other benchmarks.
SparseForge: Efficient Semi-Structured LLM Sparsification via Annealing of Hessian-Guided Soft-Mask
cs.LG 2026-05 unverdicted novelty 6.0

SparseForge achieves 57.27% zero-shot accuracy on LLaMA-2-7B at 2:4 sparsity using only 5B retraining tokens, beating the dense baseline and nearly matching a 40B-token SOTA method.
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
cs.LG 2026-05 unverdicted novelty 6.0

MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
Sharpness-Aware Pretraining Mitigates Catastrophic Forgetting
cs.LG 2026-05 unverdicted novelty 6.0

Sharpness-aware pretraining and related flat-minima interventions reduce catastrophic forgetting by up to 80% after post-training across 20M-150M models and by 31-40% at 1B scale.
Learn-to-learn on Arbitrary Textual Conditioning: A Hypernetwork-Driven Meta-Gated LLM
cs.CL 2026-05 unverdicted novelty 6.0

A hypernetwork generates meta-gating parameters for SwiGLU blocks to let LLMs adapt their nonlinearity to arbitrary textual conditions, outperforming finetuning and meta-learning baselines with reasonable generalizati...
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
cs.AI 2026-04 unverdicted novelty 6.0

GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibra...
River-LLM: Large Language Model Seamless Exit Based on KV Share
cs.CL 2026-04 unverdicted novelty 6.0

River-LLM enables seamless token-level early exit in decoder-only LLMs via a KV-shared river mechanism and similarity-based error prediction, delivering 1.71-2.16x practical speedup on reasoning tasks while preserving...
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer ...
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
cs.LG 2026-04 unverdicted novelty 6.0

DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
BiSpikCLM: A Spiking Language Model integrating Softmax-Free Spiking Attention and Spike-Aware Alignment Distillation
cs.NE 2026-04 unverdicted novelty 6.0

BiSpikCLM is the first fully binary spiking MatMul-free causal language model that matches ANN performance on generation tasks using only 4-6 percent of the compute via softmax-free spiking attention and spike-aware d...
Rethinking Residual Errors in Compensation-based LLM Quantization
cs.LG 2026-04 conditional novelty 6.0

Redefining residual errors to include compensation-aware discrepancies and realigning calibration to full-precision outputs improves GPTQ and GPTAQ performance on LLMs.
SLaB: Sparse-Lowrank-Binary Decomposition for Efficient Large Language Models
cs.LG 2026-04 unverdicted novelty 6.0

SLaB compresses LLM weights via sparse-lowrank-binary decomposition guided by activation-aware scores, achieving up to 36% lower perplexity than prior methods at 50% compression on Llama models.
Scalable Variational Bayesian Fine-Tuning of LLMs via Orthogonalized Low-Rank Adapters
cs.LG 2026-04 unverdicted novelty 6.0

PoLAR-VBLL combines orthogonalized low-rank adapters with variational Bayesian last-layer inference to enable scalable, well-calibrated uncertainty quantification in fine-tuned LLMs.
Attention to Mamba: A Recipe for Cross-Architecture Distillation
cs.CL 2026-04 unverdicted novelty 6.0

A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling
cs.LG 2026-03 unverdicted novelty 6.0

M²RNN achieves perfect state tracking at unseen lengths and outperforms Gated DeltaNet hybrids by 0.4-0.5 perplexity on 7B models with 3x smaller recurrent states.
CoreQ: Learning-Free Mismatch Correction and Successive Rounding for Quantization
cs.LG 2026-02 unverdicted novelty 6.0

CoreQ delivers adaptive mismatch correction via closed-form geometric coefficient and successive rounding to improve PTQ accuracy for large language models.
L2R: Low-Rank and Lipschitz-Controlled Routing for Mixture-of-Experts
cs.LG 2026-01 unverdicted novelty 6.0

L2R improves MoE performance by routing in a low-rank space with Lipschitz-controlled saturated inner-product scoring and multi-anchor mechanisms.
HyperAdapt: Simple High-Rank Adaptation
cs.LG 2025-09 unverdicted novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.
Transfusion: Predict the Next Token and Diffuse Images with One Multi-Modal Model
cs.AI 2024-08 unverdicted novelty 6.0

A single transformer combines language modeling loss and diffusion loss on mixed-modality data, scaling to 7B parameters and 2T tokens while matching specialized language and diffusion models.
Chameleon: Mixed-Modal Early-Fusion Foundation Models
cs.CL 2024-05 unverdicted novelty 6.0

Chameleon is an early-fusion token model that handles mixed image-text sequences for understanding and generation, achieving competitive or superior performance to larger models like Llama-2, Mixtral, and Gemini-Pro o...
Gated Linear Attention Transformers with Hardware-Efficient Training
cs.LG 2023-12 unverdicted novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
Chain-of-Verification Reduces Hallucination in Large Language Models
cs.CL 2023-09 unverdicted novelty 6.0

Chain-of-Verification reduces hallucinations in large language models by drafting responses, planning independent verification questions, answering them separately, and generating a final verified output.
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
cs.CL 2023-04 accept novelty 6.0

AGIEval shows GPT-4 exceeding average human scores on SAT Math at 95% and Chinese college entrance English at 92.5%, while revealing weaker results on complex reasoning tasks.
How Much Knowledge Can You Pack Into the Parameters of a Language Model?
cs.CL 2020-02 accept novelty 6.0

Fine-tuned language models store knowledge in parameters to answer questions competitively with retrieval-based open-domain QA systems.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
Adaptive Spiking Neurons for Vision and Language Modeling
cs.NE 2026-04 unverdicted novelty 5.0

ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
Attention Sink Forges Native MoE in Attention Layers: Sink-Aware Training to Address Head Collapse
cs.CL 2026-02 unverdicted novelty 5.0

Attention sinks forge native MoE mechanisms in attention layers that cause head collapse, addressed by sink-aware training with auxiliary load balancing.
BoHA: Blockwise Hadamard Product Adaptation for Parameter-Efficient Fine-Tuning
cs.LG 2025-09 unverdicted novelty 5.0

BoHA partitions frozen weights into a b by b grid and applies independent low-rank Hadamard factors per block, outperforming LoRA on matched-budget single-task averages while retaining 57.66% first-stage accuracy in a...
SMoE: An Algorithm-System Co-Design for Pushing MoE to the Edge via Expert Substitution
cs.AI 2025-08 unverdicted novelty 5.0

SMoE substitutes low-importance experts with cached similar ones in MoE inference on edge devices to achieve 48% lower decoding latency and over 60% cache hit rate with nearly lossless accuracy.
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
cs.CV 2023-12 unverdicted novelty 5.0

MobileVLM achieves on-par performance with much larger vision-language models on standard benchmarks while delivering state-of-the-art inference speeds of 21.5 tokens per second on Snapdragon 888 CPU and 65.3 on Jetso...
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Dynamic Nested Hierarchies: Pioneering Self-Evolution in Machine Learning Architectures for Lifelong Intelligence
cs.LG 2025-11 unverdicted novelty 4.0

Dynamic nested hierarchies let models self-adjust their multi-level optimization structures to support lifelong learning and adaptation to shifting data distributions.
From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
cs.CL 2025-07 unverdicted novelty 4.0

Coreference resolution improves retrieval relevance and QA performance in RAG systems, with mean pooling performing best and smaller models benefiting more.
Step-Video-T2V Technical Report: The Practice, Challenges, and Future of Video Foundation Model
cs.CV 2025-02 unverdicted novelty 4.0

Step-Video-T2V describes a 30B-parameter text-to-video model with custom Video-VAE, 3D DiT, flow matching, and Video-DPO that claims state-of-the-art results on a new internal benchmark.
Gemma: Open Models Based on Gemini Research and Technology
cs.CL 2024-03 accept novelty 4.0

Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.
Gemma 2: Improving Open Language Models at a Practical Size
cs.CL 2024-07 conditional novelty 3.0

Gemma 2 models achieve leading performance at their sizes by combining established Transformer modifications with knowledge distillation for the 2B and 9B variants.
Large Language Models: A Survey
cs.CL 2024-02 accept novelty 3.0

The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · cited by 54 Pith papers · 2 internal anchors

[1]

Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. 2015. VQA : V isual Q uestion A nswering. In Proceedings of the IEEE international conference on computer vision

work page 2015
[2]

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2009. T he S ixth PASCAL R ecognizing T extual E ntailment C hallenge. In TAC

work page 2009
[3]

Luisa Bentivogli, Peter Clark, Ido Dagan, and Danilo Giampiccolo. 2011. T he S eventh PASCAL R ecognizing T extual E ntailment C hallenge. In TAC

work page 2011
[4]

Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. https://doi.org/10.18653/v1/D15-1075 A L arge A nnotated C orpus for L earning N atural L anguage I nference . In EMNLP

work page doi:10.18653/v1/d15-1075 2015
[5]

Qian Chen, Xiaodan Zhu, Zhenhua Ling, Si Wei, Hui Jiang, and Diana Inkpen. 2017. https://doi.org/10.18653/v1/P17-1152 E nhanced LSTM for N atural L anguage I nference . In ACL

work page doi:10.18653/v1/p17-1152 2017
[6]

Eunsol Choi, He He, Mohit Iyyer, Mark Yatskar, Wen-tau Yih, Yejin Choi, Percy Liang, and Luke Zettlemoyer. 2018. https://www.aclweb.org/anthology/D18-1241 Q ua C : Q uestion A nswering in C ontext . In EMNLP

work page 2018
[7]

Alexis Conneau and Douwe Kiela. 2018. https://www.aclweb.org/anthology/L18-1269 S enteval: A n E valuation T oolkit for U niversal S entence R epresentations . In LREC

work page 2018
[8]

Dorottya Demszky, Kelvin Guu, and Percy Liang. 2018. https://arxiv.org/abs/1809.02922 T ransforming Q uestion A nswering D atasets I nto N atural L anguage I nference D atasets . Computing Research Repository, arXiv:1809.02922. Version 2

work page Pith review arXiv 2018
[9]

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. https://arxiv.org/abs/1810.04805 BERT : P re-training of D eep B idirectional T ransformers for L anguage U nderstanding . Computing Research Repository, arXiv:1810.04805. Version 1

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. https://www.aclweb.org/anthology/P18-2103 B reaking NLI S ystems with S entences that R equire S imple L exical I nferences . In ACL

[11]

Suchin Gururangan, Swabha Swayamdipta, Omer Levy, Roy Schwartz, Samuel R Bowman, and Noah A Smith. 2018. Annotation Artifacts in Natural Language Inference Data. In NAACL. https://doi.org/10.18653/v1/N18-2017

[12]

Minghao Hu, Yuxing Peng, Zhen Huang, Nan Yang, Ming Zhou, et al. 2018. Read + Verify: Machine Reading Comprehension with Unanswerable Questions. In CoRR.

[13]

Robin Jia and Percy Liang. 2017. Adversarial Examples for Evaluating Reading Comprehension Systems. In EMNLP. https://doi.org/10.18653/v1/D17-1215

[14]

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In ACL. https://doi.org/10.18653/v1/P17-1147

[15]

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A Textual Entailment Dataset from Science Question Answering. In AAAI.

[16]

Diederik P Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. In ICLR.

[17]

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Resea...

[18]

Guokun Lai, Qizhe Xie, Hanxiao Liu, Yiming Yang, and Eduard Hovy. 2017. RACE: Large-Scale Reading Comprehension Dataset from Examinations. In EMNLP. https://doi.org/10.18653/v1/D17-1082

[19]

R Thomas McCoy, Ellie Pavlick, and Tal Linzen. 2019. Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference. Computing Research Repository, arXiv:1902.01007. Version 1. https://arxiv.org/abs/1902.01007

[20]

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. 2018. Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering. In EMNLP. https://www.aclweb.org/anthology/D18-1260

[21]

Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. 2018. Advances in Pre-Training Distributed Word Representations. In LREC.

[22]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. 2016. MS MARCO: A Human Generated Machine Reading Comprehension Dataset. Computing Research Repository, arXiv:1611.09268. Version 3. https://arxiv.org/abs/1611.09268

[23]

Ankur P Parikh, Oscar Täckström, Dipanjan Das, and Jakob Uszkoreit. 2016. A Decomposable Attention Model for Natural Language Inference. In EMNLP. https://doi.org/10.18653/v1/D16-1244

[24]

Matthew E Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep Contextualized Word Representations. In NAACL. https://doi.org/10.18653/v1/N18-1202

[25]

Jason Phang, Thibault Févry, and Samuel R Bowman. 2018. Sentence Encoders on STILTs: Supplementary Training on Intermediate Labeled-data Tasks. Computing Research Repository, arXiv:1811.01088. Version 2. https://arxiv.org/abs/1811.01088

[26]

Adam Poliak, Aparajita Haldar, Rachel Rudinger, J Edward Hu, Ellie Pavlick, Aaron Steven White, and Benjamin Van Durme. 2018. Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation. In EMNLP. https://www.aclweb.org/anthology/D18-1007

[27]

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving Language Understanding by Generative Pre-training. https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf

[28]

Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know What You Don't Know: Unanswerable Questions for SQuAD. In ACL. https://www.aclweb.org/anthology/P18-2124

[29]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In EMNLP. https://doi.org/10.18653/v1/D16-1264

[30]

Siva Reddy, Danqi Chen, and Christopher D Manning. 2018. CoQA: A Conversational Question Answering Challenge. In TACL.

[31]

Marzieh Saeidi, Max Bartolo, Patrick Lewis, Sameer Singh, Tim Rocktäschel, Mike Sheldon, Guillaume Bouchard, and Sebastian Riedel. 2018. Interpretation of Natural Language Rules in Conversational Machine Reading. In EMNLP. https://www.aclweb.org/anthology/D18-1233

[32]

Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and Hannaneh Hajishirzi. 2017. Bidirectional Attention Flow for Machine Comprehension. In ICLR.

[33]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. In EMNLP. https://www.aclweb.org/anthology/W18-5446

[34]

Johannes Welbl, Pontus Stenetorp, and Sebastian Riedel. 2018. Constructing Datasets for Multi-hop Reading Comprehension Across Documents. In ACL. https://www.aclweb.org/anthology/Q18-1021

[35]

Jason Weston, Antoine Bordes, Sumit Chopra, Alexander M Rush, Bart van Merriënboer, Armand Joulin, and Tomas Mikolov. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. In ICLR.

[36]

Adina Williams, Nikita Nangia, and Samuel R Bowman. 2018. A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference. In NAACL. https://doi.org/10.18653/v1/N18-1101

[37]

Qi Wu, Damien Teney, Peng Wang, Chunhua Shen, Anthony Dick, and Anton van den Hengel. 2017. Visual Question Answering: A Survey of Methods and Datasets. In Computer Vision and Image Understanding. Elsevier.

[38]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. 2018. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In EMNLP. https://www.aclweb.org/anthology/D18-1259

[39]

Rowan Zellers, Yonatan Bisk, Roy Schwartz, and Yejin Choi. 2018. SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference. In EMNLP. https://www.aclweb.org/anthology/D18-1009

[40]

Sheng Zhang, Xiaodong Liu, Jingjing Liu, Jianfeng Gao, Kevin Duh, and Benjamin Van Durme. 2018. ReCoRD: Bridging the Gap between Human and Machine Commonsense Reading Comprehension. Computing Research Repository, arXiv:1810.12885. Version 1. https://arxiv.org/abs/1810.12885

[41]

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books. In Proceedings of the IEEE international conference on computer vision, pages 19–27.
