Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering

Todor Mihaylov, Peter Clark, Tushar Khot, Ashish Sabharwal · 2018

9 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

dataset 2

citation-polarity summary

use dataset 2

representative citing papers

Scaling and evaluating sparse autoencoders

cs.LG · 2024-06-06 · unverdicted · novelty 7.0

K-sparse autoencoders with dead-latent fixes produce clean scaling laws and better feature quality metrics that improve with size, shown by training a 16-million-latent model on GPT-4 activations.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing identified as favorable on the stability-support trade-off.

Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts

cs.CL · 2026-04-09 · conditional · novelty 6.0

Loss-based pruning of training data to limit facts and flatten their frequency distribution enables a 110M-parameter GPT-2 model to memorize 1.3 times more entity facts than standard training, matching a 1.3B-parameter model on the full dataset.

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale

cs.CL · 2024-06-25 · unverdicted · novelty 6.0

FineWeb is a curated 15T-token web dataset that produces stronger LLMs than prior open collections, while its educational subset sharply improves performance on MMLU and ARC benchmarks.

Efficient Streaming Language Models with Attention Sinks

cs.CL · 2023-09-29 · accept · novelty 6.0

StreamingLLM lets finite-window LLMs generalize to infinite-length sequences by retaining initial-token KV states as attention sinks, enabling stable streaming inference up to 4M tokens.

H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models

cs.LG · 2023-06-24 · unverdicted · novelty 6.0

H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.

Detecting Language Model Attacks with Perplexity

cs.CL · 2023-08-27 · unverdicted · novelty 5.0

Jailbreak prompts with adversarial suffixes have high GPT-2 perplexity, and a LightGBM model on perplexity and length detects most attacks.

Galactica: A Large Language Model for Science

cs.CL · 2022-11-16 · unverdicted · novelty 5.0

Galactica, a science-specialized LLM, reports higher scores than GPT-3, Chinchilla, and PaLM on LaTeX knowledge, mathematical reasoning, and medical QA benchmarks while outperforming general models on BIG-bench.

citing papers explorer

Showing 9 of 9 citing papers.

Scaling and evaluating sparse autoencoders · cs.LG · 2024-06-06 · unverdicted · none · ref 44
Self-Rewarding Language Models · cs.CL · 2024-01-18 · conditional · none · ref 107
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models · cs.LG · 2026-05-08 · unverdicted · none · ref 60
Cram Less to Fit More: Training Data Pruning Improves Memorization of Facts · cs.CL · 2026-04-09 · conditional · none · ref 59
The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale · cs.CL · 2024-06-25 · unverdicted · none · ref 45
Efficient Streaming Language Models with Attention Sinks · cs.CL · 2023-09-29 · accept · none · ref 34
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models · cs.LG · 2023-06-24 · unverdicted · none · ref 44
Detecting Language Model Attacks with Perplexity · cs.CL · 2023-08-27 · unverdicted · none · ref 21
Galactica: A Large Language Model for Science · cs.CL · 2022-11-16 · unverdicted · none · ref 212
