hub Mixed citations

PIQA: Reasoning about Physical Commonsense in Natural Language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, Yejin Choi · 2019 · cs.CL · arXiv 1911.11641

Mixed citation behavior. Most common role is background (56%).

41 Pith papers citing it

Background 56% of classified citations

open full Pith review browse 41 citing papers arXiv PDF

abstract

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains - such as news articles and encyclopedia entries, where text is plentiful - in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 5 dataset 3 baseline 1

citation-polarity summary

background 5 use dataset 3 baseline 1

representative citing papers

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures

cs.LG · 2026-07-02 · unverdicted · novelty 7.0

HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

cs.AI · 2026-06-01 · conditional · novelty 7.0

2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.

LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.

Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

cs.CL · 2026-04-20 · unverdicted · novelty 7.0

Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network

cs.AR · 2026-03-30 · unverdicted · novelty 7.0

SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

BitNet b1.58 shows that ternary 1.58-bit LLMs can match full-precision performance at substantially lower inference cost.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication

cs.LG · 2026-07-02 · conditional · novelty 6.0

SCAPE enables 90-99% sparse gradient communication in sharded Adam-style LLM training by deriving masks from first-moment statistics, achieving up to 43.3% faster pre-training on Llama-500M with no loss in validation loss or downstream accuracy.

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.

Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation

cs.IR · 2026-06-26 · unverdicted · novelty 6.0

R2LM combines causal attention with a reverse Mamba SSM sidecar to supply right-side context in dLLMs, claiming 2.4x-12.9x throughput gains over bidirectional dLLMs and 1.9x-2.9x over AR baselines while matching or exceeding quality.

From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability

cs.LG · 2026-06-16 · unverdicted · novelty 6.0

Derives an upper bound on frozen LM expected risk from proxy risk, SAE reconstruction gap, concept-pool mismatch and sparse complexity, with non-vacuous bounds observed on GPT-2, Gemma-2B and Llama-3-8B.

Self-Generated Error Training for Token Editing in Diffusion Language Models

cs.CL · 2026-06-15 · unverdicted · novelty 6.0

Self-generated T2T training on LLaDA2.1-mini improves benchmark accuracy and lowers edit intensity by supervising recovery from model-generated corruptions instead of random ones.

TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization

cs.LG · 2026-06-11 · unverdicted · novelty 6.0

TWLA is a PTQ method using E2M-ATQ, KOTMS, and ILA-AMP to enable W1.58A4 quantization for LLMs with maintained accuracy.

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

cs.AI · 2026-06-07 · unverdicted · novelty 6.0

STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.

Do Value Vectors in Deep Layers Need Context from the Residual Stream?

cs.CL · 2026-06-01 · unverdicted · novelty 6.0

Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

cs.CL · 2026-05-22 · unverdicted · novelty 6.0

LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Short window attention enables long-term memorization

cs.LG · 2025-09-29 · unverdicted · novelty 6.0

Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.

HyperAdapt: Simple High-Rank Adaptation

cs.LG · 2025-09-23 · unverdicted · novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

cs.CL · 2025-06-13 · conditional · novelty 6.0

MoE models with activation rates in an optimal region outperform dense LLMs of identical total parameter count, training compute, and data budget, with the optimal region consistent across scales.

LLMs on the Line: Data Determines Loss-to-Loss Scaling Laws

cs.LG · 2025-02-17 · unverdicted · novelty 6.0

Pretraining data determines loss-to-loss scaling laws in LLMs, while model size, optimization, tokenizer, and architecture have limited impact.

citing papers explorer

Showing 20 of 20 citing papers after filters.

HERMES: A Multi-Granularity Labeling Substrate for Pre-training Data Mixtures cs.LG · 2026-07-02 · unverdicted · none · ref 31 · internal anchor
HERMES provides a reusable hierarchical labeling substrate for pre-training data that reveals granularity-specific effects in data mixing rules during model training.
Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery cs.AI · 2026-06-01 · conditional · none · ref 3 · internal anchor
2-bit quantized reasoning models exhibit process failures like loops and delayed commitment that degrade end-to-end performance, but FP16 planning and loop rescue recover accuracy on MATH-500 from 17.2% to 74.2% for Qwen3-8B while retaining speed gains.
LoopUS: Recasting Pretrained LLMs into Looped Latent Refinement Models cs.LG · 2026-05-10 · unverdicted · none · ref 58 · internal anchor
LoopUS converts pretrained LLMs into looped latent refinement models via block decomposition, selective gating, random deep supervision, and confidence-based early exiting to improve reasoning performance.
Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models cs.CL · 2026-04-20 · unverdicted · none · ref 4 · internal anchor
Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.
A Switch-Centric In-Network Architecture for Accelerating LLM Inference in Shared-Memory Network cs.AR · 2026-03-30 · unverdicted · none · ref 13 · internal anchor
SCIN uses an in-switch accelerator for direct memory access and 8-bit in-network quantization during All-Reduce, delivering up to 8.7x faster small-message reduction and 1.74x TTFT speedup on LLaMA-2 models.
SCAPE: Accurate and Efficient LLM Training with Extreme Sparse Communication cs.LG · 2026-07-02 · conditional · none · ref 38 · internal anchor
SCAPE enables 90-99% sparse gradient communication in sharded Adam-style LLM training by deriving masks from first-moment statistics, achieving up to 43.3% faster pre-training on Llama-500M with no loss in validation loss or downstream accuracy.
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining cs.LG · 2026-06-29 · unverdicted · none · ref 5 · internal anchor
One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
Bifocal Diffusion Language Models: Asymmetric Bidirectional Context for Parallel Generation cs.IR · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
R2LM combines causal attention with a reverse Mamba SSM sidecar to supply right-side context in dLLMs, claiming 2.4x-12.9x throughput gains over bidirectional dLLMs and 1.9x-2.9x over AR baselines while matching or exceeding quality.
From Sparse Features to Trustworthy Proxies: Certifying SAE-Based Interpretability cs.LG · 2026-06-16 · unverdicted · none · ref 2 · internal anchor
Derives an upper bound on frozen LM expected risk from proxy risk, SAE reconstruction gap, concept-pool mismatch and sparse complexity, with non-vacuous bounds observed on GPT-2, Gemma-2B and Llama-3-8B.
Self-Generated Error Training for Token Editing in Diffusion Language Models cs.CL · 2026-06-15 · unverdicted · none · ref 5 · internal anchor
Self-generated T2T training on LLaDA2.1-mini improves benchmark accuracy and lowers edit intensity by supervising recovery from model-generated corruptions instead of random ones.
TWLA: Achieving Ternary Weights and Low-Bit Activations for LLMs via Post-Training Quantization cs.LG · 2026-06-11 · unverdicted · none · ref 2 · internal anchor
TWLA is a PTQ method using E2M-ATQ, KOTMS, and ILA-AMP to enable W1.58A4 quantization for LLMs with maintained accuracy.
STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning cs.AI · 2026-06-07 · unverdicted · none · ref 1 · internal anchor
STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.
Do Value Vectors in Deep Layers Need Context from the Residual Stream? cs.CL · 2026-06-01 · unverdicted · none · ref 63 · internal anchor
Deeper transformer layers benefit from context-free token-specific value vectors in a Bank of Values lookup table, improving performance over standard attention with less compute.
Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions cs.CL · 2026-05-22 · unverdicted · none · ref 2 · internal anchor
LINK improves cross-lingual knowledge transfer via lexical substitutions in English pretraining data, yielding notable downstream gains and up to 2x training speedup across eight languages and five model sizes.
In-Place Test-Time Training cs.LG · 2026-04-07 · conditional · none · ref 7 · internal anchor
In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.
Q-Delta: Beyond Key-Value Associative State Evolution cs.AI · 2026-06-07 · unverdicted · none · ref 52 · internal anchor
Q-Delta extends linear attention by introducing a query-conditioned delta rule that incorporates mixed key-query errors into recurrent state updates for improved stability and performance.
Fitting Is Not Enough: Smoothness in Extremely Quantized LLMs cs.CL · 2026-05-09 · unverdicted · none · ref 4 · 2 links · internal anchor
Extremely quantized LLMs exhibit systematic smoothness degradation that reduces effective token candidates and degrades generation; a smoothness-preserving principle in PTQ and QAT delivers gains beyond numerical accuracy.
SOLAR: A Self-Optimizing Open-Ended Autonomous Agent for Lifelong Learning and Continual Adaptation cs.AI · 2026-03-23 · unverdicted · none · ref 51 · internal anchor
SOLAR introduces a self-optimizing agent using meta-learning on model weights and RL-driven strategy discovery for lifelong adaptation in LLMs, claiming superior performance on reasoning tasks across domains.
$\text{Log}_\text{b}$Quant: Quantizing Language Models in Logarithmic Space cs.CL · 2026-07-01 · unverdicted · none · ref 3 · internal anchor
Log_b Quant is an adjustable-base logarithmic quantization technique that outperforms tensor-wise asymmetric linear quantization at 4-bit precision on language model benchmarks while providing memory savings.
JT-SAFE-V2: Safety-by-Design Foundation Model with World-Context Data cs.AI · 2026-05-23 · unverdicted · none · ref 30 · internal anchor
JT-Safe-V2 is a safety-by-design LLM that reports SOTA scores on both capability and safety benchmarks while Safe-MoMA cuts inference cost over 30 percent.

PIQA: Reasoning about Physical Commonsense in Natural Language

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer