CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor; Jonathan Herzig; Nicholas Lourie; Jonathan Berant

arxiv: 1811.00937 · v2 · pith:MA33J7FO · submitted 2018-11-02 · cs.CL · cs.AI · cs.LG

keywords answering, knowledge, question, questions, commonsense, commonsenseqa, concept, concepts

When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present CommonsenseQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from ConceptNet (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.
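
The extraction step described in the abstract can be illustrated with a minimal sketch. It assumes ConceptNet edges are available as plain (source, relation, target) triples; the toy triples, the function names `build_question_sets` and `authoring_tasks`, and the `min_targets=3` threshold are all hypothetical choices for illustration, not the authors' pipeline.

```python
from collections import defaultdict

# Toy (source, relation, target) triples in the style of ConceptNet edges.
# Illustrative values only; they are not loaded from ConceptNet.
TRIPLES = [
    ("river", "AtLocation", "waterfall"),
    ("river", "AtLocation", "bridge"),
    ("river", "AtLocation", "valley"),
    ("river", "RelatedTo", "water"),
    ("dog", "CapableOf", "bark"),
    ("dog", "CapableOf", "guard house"),
]


def build_question_sets(triples, min_targets=3):
    """Group targets that share the same relation to one source concept.

    min_targets is an assumed threshold: groups with fewer targets are
    dropped because they cannot supply enough competing answer choices.
    """
    groups = defaultdict(list)
    for source, relation, target in triples:
        if target not in groups[(source, relation)]:
            groups[(source, relation)].append(target)
    return {
        key: targets
        for key, targets in groups.items()
        if len(targets) >= min_targets
    }


def authoring_tasks(question_sets):
    """Create one authoring task per target concept.

    Each task asks a worker for a question that mentions the source
    concept, has `answer` as the correct choice, and uses the sibling
    targets as distractors.
    """
    tasks = []
    for (source, relation), targets in question_sets.items():
        for answer in targets:
            tasks.append({
                "source": source,
                "relation": relation,
                "answer": answer,
                "distractors": [t for t in targets if t != answer],
            })
    return tasks


if __name__ == "__main__":
    for task in authoring_tasks(build_question_sets(TRIPLES)):
        print(task)
```

Because every distractor holds the same relation to the source concept as the correct answer, a question cannot be answered by association with the source concept alone, which is the property the abstract highlights.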

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MIDUS: Memory-Infused Depth Up-Scaling
cs.LG 2025-12 unverdicted novelty 7.0

MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.
DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning
cs.AI 2025-11 unverdicted novelty 7.0

DecompSR is a large, symbolically verified benchmark dataset and generation framework that independently varies productivity, substitutivity, overgeneralisation, and systematicity to probe compositional multihop spati...
LLM DNA: Tracing Model Evolution via Functional Representations
cs.LG 2025-09 unverdicted novelty 7.0

LLM DNA is introduced as a low-dimensional bi-Lipschitz functional representation proven to satisfy inheritance and genetic determinism, with a training-free extraction pipeline tested on 305 models to reveal relation...
Soft Head Selection for Injecting ICL-Derived Task Embeddings
cs.CL 2025-07 conditional novelty 7.0

SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B pa...
PRIMETIME: Limits of LLMs in Temporal Primitives
cs.NE 2025-04 unverdicted novelty 7.0

The PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Federated Co-tuning Framework for Large and Small Language Models
cs.CL 2024-11 unverdicted novelty 7.0

FedCoLLM is a parameter-efficient federated co-tuning framework that improves client SLMs via server LLMs and enriches LLMs with client domain insights using adapters on NLP text generation tasks.
Detecting Pretraining Data from Large Language Models
cs.CL 2023-10 conditional novelty 7.0

Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.
Parcae: Scaling Laws For Stable Looped Language Models
cs.LG 2026-04 unverdicted novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth...
Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models
cs.AI 2026-03 unverdicted novelty 6.0

An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution
cs.CL 2025-09 unverdicted novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and g...
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
cs.AI 2025-08 unverdicted novelty 6.0

BlindGuard introduces an unsupervised hierarchical agent encoder plus corruption-guided contrastive detector that identifies malicious agents in LLM-based multi-agent systems without any attack labels or prior knowled...
LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws
cs.LG 2025-02 unverdicted novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.
DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks
cs.LG 2025-02 unverdicted novelty 6.0

DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergen...
LaMI: Augmenting Large Language Models via Late Multi-Image Fusion
cs.CL 2024-06 unverdicted novelty 6.0

LaMI augments LLMs with visual commonsense via late fusion of predictions from multiple text-generated images, outperforming prior augmented LLMs on visual tasks while matching VLMs and preserving or improving NLP per...
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
cs.LG 2023-06 unverdicted novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
Training Verifiers to Solve Math Word Problems
cs.LG 2021-10 conditional novelty 6.0

Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space
cs.CL 2026-03 unverdicted novelty 5.0

Inclusion-of-Thoughts purifies multiple-choice questions by keeping only plausible options, stabilizing LLM preferences and improving chain-of-thought results on reasoning benchmarks.
Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining
cs.CL 2025-11 unverdicted novelty 5.0

Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.
Mixtral of Experts
cs.LG 2024-01 unverdicted novelty 5.0

Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.
Mistral 7B
cs.CL 2023-10 accept novelty 5.0

Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.
Galactica: A Large Language Model for Science
cs.CL 2022-11 unverdicted novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.
Retrieval-Augmented Generation for Large Language Models: A Survey
cs.CL 2023-12 unverdicted novelty 3.0

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.