hub

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Alon Talmor, Jonathan Herzig, Nicholas Lourie, Jonathan Berant · 2018 · cs.CL · arXiv 1811.00937

23 Pith papers cite this work. Polarity classification is still indexing.

23 Pith papers citing it

open full Pith review browse 23 citing papers arXiv PDF

abstract

When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present CommonsenseQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from ConceptNet (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 2 dataset 1

citation-polarity summary

background 2 use dataset 1

representative citing papers

MIDUS: Memory-Infused Depth Up-Scaling

cs.LG · 2025-12-15 · unverdicted · novelty 7.0

MIDUS replaces duplicated FFN branches in depth up-scaling with head-wise memory layers using product-key retrieval and HIVE to deliver lightweight, head-conditioned residual capacity.

LLM DNA: Tracing Model Evolution via Functional Representations

cs.LG · 2025-09-29 · unverdicted · novelty 7.0

LLM DNA is introduced as a low-dimensional bi-Lipschitz functional representation proven to satisfy inheritance and genetic determinism, with a training-free extraction pipeline tested on 305 models to reveal relationships and construct phylogenetic trees.

Soft Head Selection for Injecting ICL-Derived Task Embeddings

cs.CL · 2025-07-28 · conditional · novelty 7.0

SITE applies soft gradient-based head selection to inject ICL-derived task embeddings, outperforming prior embedding adaptation and few-shot ICL across generation, reasoning, and NLU tasks on 12 LLMs from 4B to 70B parameters.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Federated Co-tuning Framework for Large and Small Language Models

cs.CL · 2024-11-18 · unverdicted · novelty 7.0

FedCoLLM is a parameter-efficient federated co-tuning framework that improves client SLMs via server LLMs and enriches LLMs with client domain insights using adapters on NLP text generation tasks.

Detecting Pretraining Data from Large Language Models

cs.CL · 2023-10-25 · conditional · novelty 7.0

Min-K% Prob detects pretraining data in LLMs by flagging outlier low-probability words in text, achieving 7.4% better performance than prior methods on the new WIKIMIA benchmark.

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models

cs.AI · 2026-03-26 · unverdicted · novelty 6.0

An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · novelty 6.0

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks

cs.AI · 2025-08-11 · unverdicted · novelty 6.0

BlindGuard introduces an unsupervised hierarchical agent encoder plus corruption-guided contrastive detector that identifies malicious agents in LLM-based multi-agent systems without any attack labels or prior knowledge of malicious behaviors.

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

cs.LG · 2025-02-17 · unverdicted · novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.

DUET: Optimizing Training Data Mixtures via Feedback from Unseen Evaluation Tasks

cs.LG · 2025-02-01 · unverdicted · novelty 6.0

DUET is a global-to-local method that optimizes LLM training data mixtures via Bayesian optimization guided by influence-based selection and feedback from unseen evaluation tasks, with a regret bound showing convergence to the optimal mixture.

LaMI: Augmenting Large Language Models via Late Multi-Image Fusion

cs.CL · 2024-06-19 · unverdicted · novelty 6.0

LaMI augments LLMs with visual commonsense via late fusion of predictions from multiple text-generated images, outperforming prior augmented LLMs on visual tasks while matching VLMs and preserving or improving NLP performance.

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

cs.LG · 2023-06-24 · unverdicted · novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

Training Verifiers to Solve Math Word Problems

cs.LG · 2021-10-27 · conditional · novelty 6.0

Introduces GSM8K dataset and demonstrates that verifier-based selection of solutions from multiple candidates outperforms fine-tuning baselines on math word problems.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay

cs.LG · 2026-05-25 · unverdicted · novelty 5.0

Self-generated replay from language models nearly eliminates catastrophic forgetting during finetuning except when models are pretrained close to saturation.

Beyond URLs: Metadata Diversity and Position for Efficient LLM Pretraining

cs.CL · 2025-11-26 · unverdicted · novelty 5.0

Fine-grained metadata such as document quality indicators accelerate LLM pretraining when prepended, and metadata appending plus learnable meta-tokens recover additional speedup via auxiliary tasks and latent structure.

Mixtral of Experts

cs.LG · 2024-01-08 · unverdicted · novelty 5.0

Mixtral 8x7B is a sparse MoE LLM activating 2 of 8 experts per layer that matches or exceeds Llama 2 70B and GPT-3.5 on benchmarks while using only 13B active parameters.

Mistral 7B

cs.CL · 2023-10-10 · accept · novelty 5.0

Mistral 7B is a 7B-parameter LLM that outperforms Llama 2 13B across benchmarks via grouped-query attention and sliding-window attention while remaining efficient.

Galactica: A Large Language Model for Science

cs.CL · 2022-11-16 · unverdicted · novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

Retrieval-Augmented Generation for Large Language Models: A Survey

cs.CL · 2023-12-18 · unverdicted · novelty 3.0

A survey of RAG paradigms, components, benchmarks, and challenges for improving LLMs on knowledge-intensive tasks.

Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space

cs.CL · 2026-03-15

DecompSR: A dataset for decomposed analyses of compositional multihop spatial reasoning

cs.AI · 2025-11-04

citing papers explorer

Showing 4 of 4 citing papers after filters.

Beyond Content Safety: Real-Time Monitoring for Reasoning Vulnerabilities in Large Language Models cs.AI · 2026-03-26 · unverdicted · none · ref 30 · internal anchor
An external zero-shot monitor detects nine unsafe reasoning behaviors in LLMs at 87% step-level accuracy with low false positives and low latency.
Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 77
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
Forgetting in Language Models: Capacity, Optimization, and Self-Generated Replay cs.LG · 2026-05-25 · unverdicted · none · ref 36 · internal anchor
Self-generated replay from language models nearly eliminates catastrophic forgetting during finetuning except when models are pretrained close to saturation.
Inclusion-of-Thoughts: Mitigating Preference Instability via Purifying the Decision Space cs.CL · 2026-03-15 · unreviewed · ref 33 · internal anchor

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer