hub
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering
28 Pith papers cite this work. Polarity classification is still indexing.
abstract
We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1329 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic---in the context of common knowledge---and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple neural baselines we develop. Our oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.
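As a concrete illustration of the task structure described in the abstract, here is a minimal sketch of how a single OpenBookQA-style item could be represented and how an open-book fact might be combined with a common-knowledge fact to score answer choices. The field names and the toy lexical-overlap scorer are illustrative assumptions, not the dataset's official schema or any baseline from the paper; the example item mirrors the heat-conduction style of question the paper uses to motivate the task.

```python
# Minimal sketch of the OpenBookQA task format described in the abstract.
# The field names, the toy lexical-overlap scorer, and the exact wording of
# the example are illustrative assumptions, not the official dataset schema
# or a baseline from the paper.
from dataclasses import dataclass

@dataclass
class OpenBookItem:
    question: str
    choices: dict[str, str]   # choice label -> answer text
    book_fact: str            # science fact drawn from the 1,329-fact open book
    answer_key: str           # label of the correct choice

item = OpenBookItem(
    question="Which of these would let the most heat travel through?",
    choices={
        "A": "a new pair of jeans",
        "B": "a steel spoon in a cafeteria",
        "C": "a cotton candy at a store",
        "D": "a calvin klein cotton hat",
    },
    book_fact="metal is a thermal conductor",
    answer_key="B",
)

# The "second hop": common knowledge the open book itself does not contain.
common_knowledge = "a steel spoon is made of metal"

def overlap_score(choice_text: str, *evidence: str) -> int:
    """Toy scorer: count word overlaps between a choice and the evidence facts."""
    words = set(choice_text.lower().split())
    return sum(len(words & set(fact.lower().split())) for fact in evidence)

# Combine the book fact with the common-knowledge fact to rank the choices.
scores = {label: overlap_score(text, item.book_fact, common_knowledge)
          for label, text in item.choices.items()}
prediction = max(scores, key=scores.get)
print(prediction, prediction == item.answer_key)   # -> B True
```

In a real system the relevant book fact and the supporting common-knowledge fact must themselves be retrieved, which is exactly the multi-hop retrieval bottleneck the abstract leaves as an open challenge.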
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
Transformers and SSMs are unified through structured state space duality, producing Mamba-2, whose core layer is 2-8x faster than Mamba's selective SSM while remaining competitive with Transformers on language modeling.
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.
SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer trainable parameters.
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization, outperforming prior methods on the forget-retain trade-off across LLM benchmarks.
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
Parcae stabilizes looped LLMs via spectral-norm constraints on injection parameters, yielding power-law scaling in training FLOPs and saturating exponential scaling at test time, which improves quality over fixed-depth baselines under fixed parameter budgets.
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and exceeding average human performance on many BIG-bench tasks.
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
citing papers explorer
-
Language Models are Few-Shot Learners
GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.
-
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
-
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
-
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
-
Winner-Take-All Spiking Transformer for Language Modeling
Winner-take-all spiking self-attention replaces softmax in spiking transformers to support language modeling on 16 datasets with spike-driven, energy-efficient architectures.
-
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
-
Moshi: a speech-text foundation model for real-time dialogue
Moshi is the first real-time full-duplex spoken large language model that casts dialogue as speech-to-speech generation using parallel audio streams and an inner monologue of time-aligned text tokens.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing Mamba-2, whose core layer is 2-8x faster than Mamba's selective SSM while remaining competitive with Transformers on language modeling.
-
Self-Rewarding Language Models
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
-
OPT: Open Pre-trained Transformer Language Models
OPT releases open decoder-only transformers up to 175B parameters that match GPT-3 performance at one-seventh the carbon cost, along with code and training logs.
-
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs
Extremely quantized LLMs degrade in smoothness, sparsifying the decoding tree and hurting generation quality; a smoothness-preserving principle delivers gains beyond numerical fitting.
-
Revealing Modular Gradient Noise Imbalance in LLMs: Calibrating Adam via Signal-to-Noise Ratio
MoLS scales Adam updates using module-level SNR estimates to correct gradient noise imbalance and improve LLM training convergence and generalization.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
-
GRASPrune: Global Gating for Budgeted Structured Pruning of Large Language Models
GRASPrune removes 50% of parameters from LLaMA-2-7B via global gating and projected straight-through estimation, reaching 12.18 WikiText-2 perplexity and competitive zero-shot accuracy after four epochs on 512 calibration sequences.
-
SAMoRA: Semantic-Aware Mixture of LoRA Experts for Task-Adaptive Learning
SAMoRA is a parameter-efficient fine-tuning framework that uses semantic-aware routing and task-adaptive scaling within a Mixture of LoRA Experts to improve multi-task performance and generalization over prior methods.
-
TLoRA: Task-aware Low Rank Adaptation of Large Language Models
TLoRA jointly optimizes LoRA initialization via task-data SVD and sensitivity-driven rank allocation, delivering stronger results than standard LoRA across NLU, reasoning, math, code, and chat tasks while using fewer trainable parameters.
-
Representation-Guided Parameter-Efficient LLM Unlearning
REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization, outperforming prior methods on the forget-retain trade-off across LLM benchmarks.
-
Robust Ultra Low-Bit Post-Training Quantization via Stable Diagonal Curvature Estimate
DASH-Q uses a stable diagonal curvature estimate and weighted least squares to achieve robust ultra-low-bit post-training quantization of LLMs, improving zero-shot accuracy by 7% on average over baselines.
-
Parcae: Scaling Laws For Stable Looped Language Models
Parcae stabilizes looped LLMs via spectral-norm constraints on injection parameters, yielding power-law scaling in training FLOPs and saturating exponential scaling at test time, which improves quality over fixed-depth baselines under fixed parameter budgets.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs (a toy sketch of this perturb-and-aggregate idea appears after the explorer list).
-
Textbooks Are All You Need II: phi-1.5 technical report
phi-1.5 is a 1.3B parameter model trained on synthetic textbook data that matches the reasoning performance of models five times larger on natural language, math, and basic coding tasks.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and exceeding average human performance on many BIG-bench tasks.
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
HyperLens: Quantifying Cognitive Effort in LLMs with Fine-grained Confidence Trajectory
HyperLens reveals that deeper transformer layers magnify small confidence changes into fine-grained trajectories, enabling quantification of cognitive effort: complex tasks demand more of it, and standard SFT can reduce it.
-
The Cognitive Circuit Breaker: A Systems Engineering Framework for Intrinsic AI Reliability
The Cognitive Circuit Breaker detects LLM hallucinations by computing the Cognitive Dissonance Delta between semantic confidence and latent certainty from hidden states, adding negligible overhead.
-
Adaptive Spiking Neurons for Vision and Language Modeling
ASN uses trainable parameters for adaptive membrane dynamics and firing in SNNs, with NASN adding normalization, and reports effectiveness across 19 vision and language datasets.
-
Large Language Models: A Survey
The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.
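The SmoothLLM entry in the explorer above describes a concrete perturb-and-aggregate defense; the sketch below illustrates that general idea under stated assumptions. The `query_model` callable and the `is_refusal` keyword heuristic are hypothetical placeholders, and the perturbation and voting details are simplifications rather than the paper's exact procedure.

```python
# Illustrative sketch of a perturb-and-aggregate jailbreak defense in the
# spirit of the SmoothLLM summary above. query_model and is_refusal are
# hypothetical placeholders; the paper's exact perturbation and aggregation
# rules may differ.
import random
import string

def perturb(prompt: str, swap_frac: float = 0.1) -> str:
    """Randomly replace a fraction of characters with random printable characters."""
    if not prompt:
        return prompt
    chars = list(prompt)
    n_swaps = max(1, int(len(chars) * swap_frac))
    for i in random.sample(range(len(chars)), n_swaps):
        chars[i] = random.choice(string.printable)
    return "".join(chars)

def is_refusal(response: str) -> bool:
    """Crude keyword heuristic for whether the model declined the request."""
    return any(kw in response.lower() for kw in ("i'm sorry", "i cannot", "i can't"))

def smooth_defense(prompt: str, query_model, n_copies: int = 8) -> str:
    """Query perturbed copies of the prompt and return a response from the majority side."""
    responses = [query_model(perturb(prompt)) for _ in range(n_copies)]
    refused = [r for r in responses if is_refusal(r)]
    answered = [r for r in responses if not is_refusal(r)]
    # Adversarial suffixes tend to break under character-level noise, so a
    # majority of refusals is treated as evidence the prompt was adversarial.
    return random.choice(refused if len(refused) > n_copies / 2 else answered)
```

A caller would wrap its actual chat endpoint as `query_model(prompt) -> str`; the names here are only to keep the sketch self-contained.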