pith. sign in

hub Mixed citations

PIQA: Reasoning about Physical Commonsense in Natural Language

Mixed citation behavior. Most common role is background (56%).

41 Pith papers citing it
Background 56% of classified citations
abstract

To apply eyeshadow without a brush, should I use a cotton swab or a toothpick? Questions requiring this kind of physical commonsense pose a challenge to today's natural language understanding systems. While recent pretrained models (such as BERT) have made progress on question answering over more abstract domains - such as news articles and encyclopedia entries, where text is plentiful - in more physical domains, text is inherently limited due to reporting bias. Can AI systems learn to reliably answer physical common-sense questions without experiencing the physical world? In this paper, we introduce the task of physical commonsense reasoning and a corresponding benchmark dataset Physical Interaction: Question Answering or PIQA. Though humans find the dataset easy (95% accuracy), large pretrained models struggle (77%). We provide analysis about the dimensions of knowledge that existing models lack, which offers significant opportunities for future research.

hub tools

citation-role summary

background 5 dataset 3 baseline 1

citation-polarity summary

clear filters

representative citing papers

Measuring Massive Multitask Language Understanding

cs.CY · 2020-09-07 · accept · novelty 8.0

Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

Language Models are Few-Shot Learners

cs.CL · 2020-05-28 · accept · novelty 8.0

GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

PRIMETIME : Limits of LLMs in Temporal Primitives

cs.NE · 2025-04-22 · unverdicted · novelty 7.0

PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.

Massive Activations in Large Language Models

cs.CL · 2024-02-27 · unverdicted · novelty 7.0

Massive activations are constant large values in LLMs that function as indispensable bias terms and concentrate attention probabilities on specific tokens.

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

cs.AI · 2026-06-07 · unverdicted · novelty 6.0

STAR rethinks MoE routing as structure-aware subspace learning by adding a GHA-tracked principal subspace to standard routers, yielding more stable specialization and better performance on synthetic, language, and vision tasks.

In-Place Test-Time Training

cs.LG · 2026-04-07 · conditional · novelty 6.0

In-Place TTT adapts LLM MLP projection matrices at test time with a next-token-aligned objective and chunk-wise updates, enabling better long-context performance as a drop-in enhancement.

Short window attention enables long-term memorization

cs.LG · 2025-09-29 · unverdicted · novelty 6.0

Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.

HyperAdapt: Simple High-Rank Adaptation

cs.LG · 2025-09-23 · unverdicted · novelty 6.0

HyperAdapt performs parameter-efficient fine-tuning by row- and column-wise diagonal scaling to induce high-rank updates with only n+m trainable parameters.

citing papers explorer

Showing 7 of 7 citing papers after filters.

  • Measuring Massive Multitask Language Understanding cs.CY · 2020-09-07 · accept · none · ref 58 · internal anchor

    Introduces the MMLU benchmark of 57 tasks and shows that current models, including GPT-3, achieve low accuracy far below expert level across academic and professional domains.

  • Language Models are Few-Shot Learners cs.CL · 2020-05-28 · accept · none · ref 5 · internal anchor

    GPT-3 shows that scaling an autoregressive language model to 175 billion parameters enables strong few-shot performance across diverse NLP tasks via in-context prompting without fine-tuning.

  • Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone cs.CL · 2024-04-22 · accept · none · ref 2 · internal anchor

    Phi-3-mini (3.8B params, 3.3T tokens) reaches 69% MMLU and 8.38 MT-bench, matching larger models, with scaled-up 7B/14B variants and phi-3.5 extensions for multilingual, MoE, and vision capabilities.

  • PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 15 · internal anchor

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  • Gemma 3 Technical Report cs.CL · 2025-03-25 · accept · none · ref 7 · internal anchor

    Gemma 3 introduces multimodal open models with architectural changes for efficient long context, trained via distillation and a new post-training recipe that makes the 4B version competitive with prior 27B models and the 27B version comparable to Gemini-1.5-Pro.

  • Gemma: Open Models Based on Gemini Research and Technology cs.CL · 2024-03-13 · accept · none · ref 69 · internal anchor

    Gemma introduces open 2B and 7B LLMs derived from Gemini technology that beat comparable open models on 11 of 18 text tasks and come with safety assessments.

  • Large Language Models: A Survey cs.CL · 2024-02-09 · accept · none · ref 196 · internal anchor

    The paper surveys key large language models, their training methods, datasets, evaluation benchmarks, and future research directions in the field.