Measuring Causal Effects of Data Statistics on Language Model's 'Factual' Predictions

Abhilasha Ravichander; Amir Feder; Hinrich Schütze; Marius Mosbach; Nora Kassner; Shauli Ravfogel; Yanai Elazar; Yoav Goldberg; Yonatan Belinkov

arXiv: 2207.14251 · v2 · pith:TTIOGXLWnew · submitted 2022-07-28 · cs.CL

Yanai Elazar, Nora Kassner, Shauli Ravfogel, Amir Feder, Abhilasha Ravichander, Marius Mosbach, Yonatan Belinkov, Hinrich Schütze, Yoav Goldberg

keywords: data, models, causal, framework, language, predictions, statistics, training

Large amounts of training data are one of the major reasons for the high performance of state-of-the-art NLP models. But what exactly in the training data causes a model to make a certain prediction? We seek to answer this question by providing a language for describing how training data influences predictions, through a causal framework. Importantly, our framework bypasses the need to retrain expensive models and allows us to estimate causal effects based on observational data alone. Addressing the problem of extracting factual knowledge from pretrained language models (PLMs), we focus on simple data statistics such as co-occurrence counts and show that these statistics do influence the predictions of PLMs, suggesting that such models rely on shallow heuristics. Our causal framework and our results demonstrate the importance of studying datasets and the benefits of causality for understanding NLP models.

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling
cs.CL 2023-04 accept novelty 8.0

Pythia releases 16 identically trained LLMs with full checkpoints and data tools to study training dynamics, scaling, memorization, and bias in language models.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 7.0

LLM first-answer accuracy on procedural arithmetic drops from 61% on 5-step tasks to 20% on 95-step tasks, with frequent failures including skipped steps, premature answers, and hallucinated operations.

OR-VSKC: Resolving Visual-Semantic Knowledge Conflicts in Operating Rooms with Synthetic Data-Guided Alignment
cs.CV 2025-06 conditional novelty 7.0

OR-VSKC provides 28,190 synthetic operating room images plus an expert subset to expose and reduce visual-semantic knowledge conflicts in multimodal models for surgical risk detection.

Validity Threats for Foundation Model Research
cs.LG 2026-06 accept novelty 6.0

Maps common low-compute research strategies for foundation models onto statistical, internal, external, and construct validity threats via a causal-inference lens.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

LLM accuracy on controlled procedural arithmetic drops from 61% at 5 steps to 20% at 95 steps, with failures including skipped steps, premature answers, and hallucinated operations.

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models
cs.CL 2026-05 unverdicted novelty 6.0

A new benchmark shows LLM first-answer accuracy on procedural arithmetic drops from 63% (5 steps) to 20% (95 steps) due to execution failures like skipped steps and premature answers.

Measuring Distribution Shift in User Prompts and Its Effects on LLM Performance
cs.CL 2026-04 unverdicted novelty 6.0

The LENS framework applied to 192 real-world settings shows moderate natural prompt distribution shifts cause 73% average performance loss in deployed LLMs, especially across user groups and regions.

Truth as a Compression Artifact in Language Model Training
cs.CL 2026-03 unverdicted novelty 6.0

Controlled experiments show language models extract correct answers from contradictory data only when errors are structurally incoherent, supporting the hypothesis that gradient descent selects the most compressible a...

Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
cs.AI 2023-08 accept novelty 5.0

Survey organizes LLM trustworthiness into seven categories and 29 sub-categories, measures eight sub-categories on popular models, and finds that more aligned models generally score higher but with varying effectiveness.

Benchmark Data Contamination of Large Language Models: A Survey
cs.CL 2024-06 unverdicted novelty 3.0

A survey reviewing benchmark data contamination in LLMs, its impact on evaluation, and alternative assessment approaches.