{"work":{"id":"511eeb84-4b95-46d5-b14f-50da43f4f19f","openalex_id":null,"doi":null,"arxiv_id":"1905.10044","raw_key":null,"title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions","authors":null,"authors_text":"Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, Kristina Toutanova","year":2019,"venue":"cs.CL","abstract":"In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT. Our best method trains BERT on MultiNLI and then re-trains it on our train set. 
It achieves 80.4% accuracy compared to 90% accuracy of human annotators (and 62% majority-baseline), leaving a significant gap for future work.","external_url":"https://arxiv.org/abs/1905.10044","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-24T09:14:16.458351+00:00","pith_arxiv_id":"1905.10044","created_at":"2026-05-09T06:20:42.401780+00:00","updated_at":"2026-05-24T09:14:16.458351+00:00","title_quality_ok":true,"display_title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions","render_title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions"},"hub":{"state":{"work_id":"511eeb84-4b95-46d5-b14f-50da43f4f19f","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":73,"external_cited_by_count":null,"distinct_field_count":6,"first_pith_cited_at":"2019-10-23T17:37:36+00:00","last_pith_cited_at":"2026-05-18T14:30:58+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-05-31T05:01:58.700340+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":6},{"context_role":"dataset","n":3},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":6},{"context_polarity":"use_dataset","n":3},{"context_polarity":"baseline","n":1}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-17T02:39:33.570195+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Think you have Solved Question Answering? 
Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":32},{"title":"HellaSwag: Can a Machine Really Finish Your Sentence?","work_id":"79f44c0c-96f4-4edb-bc50-a3c9d6b85936","shared_citers":22},{"title":"Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering","work_id":"b9d68fc0-5b23-4def-bc5e-6ad71d64eec6","shared_citers":17},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":14},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":13},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"SocialIQA: Commonsense Reasoning about Social Interactions","work_id":"3f93670e-0ae7-40e5-bed5-74c216638dd1","shared_citers":11},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":10},{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","work_id":"2c6b3f6d-54e4-4df7-baa7-475a490799af","shared_citers":10},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":10},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":9},{"title":"TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension","work_id":"f20e62ba-6265-4b97-aa8c-ddefaf2f5762","shared_citers":9},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language 
Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":8},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":8},{"title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers","work_id":"19ed8c44-202a-48f6-8169-637d5a5f2408","shared_citers":8},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":7},{"title":"Pointer Sentinel Mixture Models","work_id":"fef3833e-dc80-42a3-a1e0-ffbfafee6fff","shared_citers":7},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":7},{"title":"RWKV: Reinventing RNNs for the Transformer Era","work_id":"524dc80d-f4ef-4f89-bf1a-9a8c1e4b6a81","shared_citers":7}],"time_series":[{"n":1,"year":2019},{"n":2,"year":2020},{"n":1,"year":2021},{"n":2,"year":2022},{"n":3,"year":2023},{"n":7,"year":2024},{"n":3,"year":2025},{"n":27,"year":2026}],"dependency_candidates":[{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network","primary_cat":"cs.AR","context_text":"els with different architectures, including LLaMA-2 [62], Mistral [28], and Mixtral [29], where Mixtral is a Mixture-of-Experts (MoE) model. Model accuracy is measured on multiple datasets, including the Massive Multitask Language Understanding (MMLU) [25] in the five-shot setting and zero-shot Commonsense QA benchmarks such as WinoGrande [58], PIQA [12], HellaSwag [68], ARC [16], BoolQ [15] and OBQA [45]. 
All evaluations are conducted using the Language Model Evaluation Harness [21]. 4.2 In-Network Quantized All-Reduce We first compare INQ All-Reduce with RQ All-Reduce. Table 1 reports the perplexity (PPL) results under different quantization bit widths and block sizes with TP = 8. When All-Reduce is quantized to INT8, INQ All-Reduce preserves nearly the same perplexity as","citing_arxiv_id":"2603.28239"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","context_text":"MultiRC dataset encompasses around 6,000 multi-sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five. B. Datasets for Emergent: ICL, reasoning (CoT), instruction following This section centers on the benchmarks and datasets employed to evaluate the emergent abilities of LLMs. • GSM8K [190] is designed to evaluate the model's ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistically diverse grade school math word problems written by humans. The dataset is split into two sets: a training set with 7.5K problems, and a test set with 1K problems. These problems need 2 to 8 steps to be solved. 
Solutions mainly are a series of elementary calculations using basic","citing_arxiv_id":"2402.06196"}]},"error":null,"updated_at":"2026-05-17T02:39:27.657507+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-17T02:39:31.147103+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions","claims":[{"claim_text":"In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingl","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"els with different architectures, including LLaMA-2 [62], Mistral [28], and Mixtral [29], where Mixtral is a Mixture-of-Experts (MoE) model. Model accuracy is measured on multiple datasets, including the Massive Multitask Language Understanding (MMLU) [25] in the five-shot setting and zero-shot Commonsense QA benchmarks such as WinoGrande [58], PIQA [12], HellaSwag [68], ARC [16], BoolQ [15] and OBQA [45]. All evaluations are conducted using the Language Model Evaluation Harness [21]. 
4.2 In-Net","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"MultiRC dataset encompasses around 6,000 multi-sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five. B. Datasets for Emergent: ICL, reasoning (CoT), instruction following This section centers on the benchmarks and datasets employed to evaluate the emergent abilities of LLMs. • GSM8K [190] is designed to evaluate the model's ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistic","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"the chunk size C is the small fixed constant (More detail in the § D with Algorithm 1). Further, the chunk form of bt and theγ t,i are computed as following Eq.(10) and (11), b[t] = exp(µlog [t] +c log [t] )∈R C,(10) Γ[t] = exp(eAlog [t] )⊙ \u0010 1−exp(S log [t] ) \u0011 ∈R C×C ,(11) where the chunk matrix eAlog [t] ,S log [t] ∈R C×C is computed, (eAlog [t] )ij = ( ¯αlog [t] +c log [t] )i −( ¯µlog [t] )j fori≥j,(12) (Slog [t] )ij = (clog [t] )j−1 −(c log [t] )i fori≥j.(13) These lower triangular matrices","claim_type":"background","confidence":0.6,"evidence_strength":"citation_context"},{"claim_text":"with the hope of stimulating further study of test-time behavior of language models. 3.9.1 Arithmetic To test GPT-3's ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language: • 2 digit addition (2D+) - The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. \"Q: What is 48 plus 76? 
A: 124.\" • 2 digit subtr","claim_type":"background","confidence":0.6,"evidence_strength":"citation_context"},{"claim_text":"Moreover, to ensure that the whole Mamba block output matches that of Hedgehog at initialization, we also set the parameters of the gate branch and the convolution so that they reduce to the identity operator. Additional details can be found in App. B. Attention scores normalization With the substitution in (7), the SSM mixer outputs Yϕ := ϕMLP(Q)ϕMLP(K)T\u0001 V . (8) However, the Attention scores in this formula come in an un-normalized fashion. For the Attention scores formulation to more closely ","claim_type":"background","confidence":0.5,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (3 contexts).","role_counts":[{"n":3,"context_role":"background"},{"n":2,"context_role":"dataset"}]},"error":null,"updated_at":"2026-05-17T02:39:27.662890+00:00"}},"summary":{"title":"BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions","claims":[{"claim_text":"In this paper we study yes/no questions that are naturally occurring --- meaning that they are generated in unprompted and unconstrained settings. We build a reading comprehension dataset, BoolQ, of such questions, and show that they are unexpectedly challenging. They often query for complex, non-factoid information, and require difficult entailment-like inference to solve. We also explore the effectiveness of a range of transfer learning baselines. 
We find that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingl","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"els with different architectures, including LLaMA-2 [62], Mistral [28], and Mixtral [29], where Mixtral is a Mixture-of-Experts (MoE) model. Model accuracy is measured on multiple datasets, including the Massive Multitask Language Understanding (MMLU) [25] in the five-shot setting and zero-shot Commonsense QA benchmarks such as WinoGrande [58], PIQA [12], HellaSwag [68], ARC [16], BoolQ [15] and OBQA [45]. All evaluations are conducted using the Language Model Evaluation Harness [21]. 4.2 In-Net","claim_type":"dataset","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"MultiRC dataset encompasses around 6,000 multi-sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five. B. Datasets for Emergent: ICL, reasoning (CoT), instruction following This section centers on the benchmarks and datasets employed to evaluate the emergent abilities of LLMs. • GSM8K [190] is designed to evaluate the model's ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistic","claim_type":"dataset","confidence":0.8,"evidence_strength":"citation_context"},{"claim_text":"the chunk size C is the small fixed constant (More detail in the § D with Algorithm 1). 
Further, the chunk form of bt and theγ t,i are computed as following Eq.(10) and (11), b[t] = exp(µlog [t] +c log [t] )∈R C,(10) Γ[t] = exp(eAlog [t] )⊙ \u0010 1−exp(S log [t] ) \u0011 ∈R C×C ,(11) where the chunk matrix eAlog [t] ,S log [t] ∈R C×C is computed, (eAlog [t] )ij = ( ¯αlog [t] +c log [t] )i −( ¯µlog [t] )j fori≥j,(12) (Slog [t] )ij = (clog [t] )j−1 −(c log [t] )i fori≥j.(13) These lower triangular matrices","claim_type":"background","confidence":0.6,"evidence_strength":"citation_context"},{"claim_text":"with the hope of stimulating further study of test-time behavior of language models. 3.9.1 Arithmetic To test GPT-3's ability to perform simple arithmetic operations without task-specific training, we developed a small battery of 10 tests that involve asking GPT-3 a simple arithmetic problem in natural language: • 2 digit addition (2D+) - The model is asked to add two integers sampled uniformly from [0, 100), phrased in the form of a question, e.g. \"Q: What is 48 plus 76? A: 124.\" • 2 digit subtr","claim_type":"background","confidence":0.6,"evidence_strength":"citation_context"},{"claim_text":"Moreover, to ensure that the whole Mamba block output matches that of Hedgehog at initialization, we also set the parameters of the gate branch and the convolution so that they reduce to the identity operator. Additional details can be found in App. B. Attention scores normalization With the substitution in (7), the SSM mixer outputs Yϕ := ϕMLP(Q)ϕMLP(K)T\u0001 V . (8) However, the Attention scores in this formula come in an un-normalized fashion. For the Attention scores formulation to more closely ","claim_type":"background","confidence":0.5,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions because it crossed a citation-hub threshold. 
Current citing contexts most often use it as background evidence (3 contexts).","role_counts":[{"n":3,"context_role":"background"},{"n":2,"context_role":"dataset"}]},"graph":{"co_cited":[{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":32},{"title":"HellaSwag: Can a Machine Really Finish Your Sentence?","work_id":"79f44c0c-96f4-4edb-bc50-a3c9d6b85936","shared_citers":22},{"title":"Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering","work_id":"b9d68fc0-5b23-4def-bc5e-6ad71d64eec6","shared_citers":17},{"title":"Measuring Massive Multitask Language Understanding","work_id":"e87ec49a-544b-4ec8-8991-75298c64ff5e","shared_citers":14},{"title":"Llama 2: Open Foundation and Fine-Tuned Chat Models","work_id":"68a5177f-d644-44c1-bd4f-4e5278c22f5d","shared_citers":13},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":12},{"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","shared_citers":12},{"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","shared_citers":11},{"title":"SocialIQA: Commonsense Reasoning about Social Interactions","work_id":"3f93670e-0ae7-40e5-bed5-74c216638dd1","shared_citers":11},{"title":"GLU Variants Improve Transformer","work_id":"17d0763c-1016-41ab-a478-478e890765eb","shared_citers":10},{"title":"Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer","work_id":"2c6b3f6d-54e4-4df7-baa7-475a490799af","shared_citers":10},{"title":"The Llama 3 Herd of Models","work_id":"1549a635-88af-4ac1-acfe-51ae7bb53345","shared_citers":10},{"title":"Mistral 7B","work_id":"eb5e1305-ad11-4875-ad8d-ad8b8f697599","shared_citers":9},{"title":"TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading 
Comprehension","work_id":"f20e62ba-6265-4b97-aa8c-ddefaf2f5762","shared_citers":9},{"title":"BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding","work_id":"ed240a10-5b19-406c-baa5-30803f465785","shared_citers":8},{"title":"Distilling the Knowledge in a Neural Network","work_id":"d927ab1f-17b8-4002-9d09-c3d55764fbad","shared_citers":8},{"title":"GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers","work_id":"19ed8c44-202a-48f6-8169-637d5a5f2408","shared_citers":8},{"title":"Mixtral of Experts","work_id":"0de8c352-9daa-4e1e-8c7b-3d0dec69f369","shared_citers":8},{"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","shared_citers":7},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":7},{"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","shared_citers":7},{"title":"Pointer Sentinel Mixture Models","work_id":"fef3833e-dc80-42a3-a1e0-ffbfafee6fff","shared_citers":7},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":7},{"title":"RWKV: Reinventing RNNs for the Transformer Era","work_id":"524dc80d-f4ef-4f89-bf1a-9a8c1e4b6a81","shared_citers":7}],"time_series":[{"n":1,"year":2019},{"n":2,"year":2020},{"n":1,"year":2021},{"n":2,"year":2022},{"n":3,"year":2023},{"n":7,"year":2024},{"n":3,"year":2025},{"n":27,"year":2026}],"dependency_candidates":[{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network","primary_cat":"cs.AR","context_text":"els with different architectures, including LLaMA-2 [62], Mistral [28], and Mixtral [29], where Mixtral is a Mixture-of-Experts (MoE) model. 
Model accuracy is measured on multiple datasets, including the Massive Multitask Language Understanding (MMLU) [25] in the five-shot setting and zero-shot Commonsense QA benchmarks such as WinoGrande [58], PIQA [12], HellaSwag [68], ARC [16], BoolQ [15] and OBQA [45]. All evaluations are conducted using the Language Model Evaluation Harness [21]. 4.2 In-Network Quantized All-Reduce We first compare INQ All-Reduce with RQ All-Reduce. Table 1 reports the perplexity (PPL) results under different quantization bit widths and block sizes with TP = 8. When All-Reduce is quantized to INT8, INQ All-Reduce preserves nearly the same perplexity as","citing_arxiv_id":"2603.28239"},{"n":1,"role":"dataset","polarity":"use_dataset","paper_title":"Large Language Models: A Survey","primary_cat":"cs.CL","context_text":"MultiRC dataset encompasses around 6,000 multi-sentence questions gathered from over 800 paragraphs. On average, each question offers about two valid answer alternatives out of a total of five. B. Datasets for Emergent: ICL, reasoning (CoT), instruction following This section centers on the benchmarks and datasets employed to evaluate the emergent abilities of LLMs. • GSM8K [190] is designed to evaluate the model's ability for multi-step mathematical reasoning. GSM8K includes 8.5K linguistically diverse grade school math word problems written by humans. The dataset is split into two sets: a training set with 7.5K problems, and a test set with 1K problems. These problems need 2 to 8 steps to be solved. Solutions mainly are a series of elementary calculations using basic","citing_arxiv_id":"2402.06196"}]},"authors":[]}}