In-Context Examples Suppress Scientific Knowledge Recall in LLMs
Pith reviewed 2026-05-07 09:26 UTC · model grok-4.3
The pith
In-context examples cause LLMs to suppress recall of scientific formulas and shift to pattern matching instead.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adding in-context examples makes LLMs rely less on pretrained domain knowledge in latent structure recovery tasks and shifts computation toward empirical pattern fitting, even when the examples are produced by the true scientific formula. The displacement is documented across 60 tasks, 6,000 trials, and four models.
What carries the argument
The knowledge displacement effect: on latent structure tasks, in-context examples override pretrained formula recall in favor of surface pattern matching.
If this is right
- The same shift can lower accuracy when pretrained knowledge outperforms pattern fitting, leave accuracy unchanged, or raise it when patterns match better.
- Displacement occurs even when examples are generated from the correct formula rather than from incorrect or noisy data.
- The effect holds across chemistry, economics, physics, biology, and other domains tested.
- Practitioners cannot assume in-context examples will reinforce or activate existing scientific knowledge in LLMs.
Where Pith is reading between the lines
- The suppression may extend to other knowledge-intensive domains where LLMs hold strong pretrained priors.
- Prompt designs that explicitly instruct models to use internal knowledge before looking at examples could be tested as a countermeasure.
- The finding suggests that scaling up in-context examples might widen rather than close the gap between pattern fitting and knowledge use.
Load-bearing premise
The tasks truly isolate recall of hidden scientific structure from surface pattern matching, and the observed change is suppression of knowledge rather than a general change in computation strategy.
What would settle it
The suppression claim would be falsified by an experiment in which models keep deriving answers from their pretrained formulas even when the in-context examples follow a pattern that differs from the formula.
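One way such a conflict test could be scored is sketched below. It assumes each trial yields a single numeric answer, and that both the formula's value and the value implied by the conflicting examples are known in advance. The function names, labels, and the 5% relative tolerance are hypothetical choices made for illustration, not the paper's procedure.

```python
from typing import Optional


def classify_answer(answer: float, formula_value: float,
                    pattern_value: float, rel_tol: float = 0.05) -> str:
    """Label an answer by which candidate strategy it matches (hypothetical scheme)."""
    def close(a: float, b: float) -> bool:
        # Relative closeness measured against the reference value b.
        return abs(a - b) <= rel_tol * abs(b)

    near_formula = close(answer, formula_value)
    near_pattern = close(answer, pattern_value)
    if near_formula and near_pattern:
        return "ambiguous"  # the two strategies are not separable on this trial
    if near_formula:
        return "formula"
    if near_pattern:
        return "pattern"
    return "neither"


def formula_share(labels: list[str]) -> Optional[float]:
    """Fraction of decided trials that follow the formula, or None if none are decided."""
    decided = [label for label in labels if label in ("formula", "pattern")]
    if not decided:
        return None
    return sum(label == "formula" for label in decided) / len(decided)


# Illustrative values only: the formula gives 0.2, the conflicting examples imply 0.4.
labels = [classify_answer(a, 0.2, 0.4) for a in (0.205, 0.39, 1.0, 0.198)]
share = formula_share(labels)
```

Under this scoring, a formula share that stays high when examples conflict with the formula would count against the suppression claim, while a drop toward zero would be consistent with it.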
Original abstract
Scientific reasoning rarely stops at what is directly observable; it often requires uncovering hidden structure from data. From estimating reaction constants in chemistry to inferring demand elasticities in economics, this latent structure recovery is what distinguishes scientific reasoning from curve fitting. Large language models (LLMs) can often recall and apply relevant scientific formulas, but we show that this ability is surprisingly easy to suppress. We show that adding in-context examples makes models rely less on pretrained domain knowledge, even when those examples are generated by the very same formula. Rather than reinforcing knowledge-driven derivation, examples shift computation toward empirical pattern fitting. We document this knowledge displacement on 60 latent structure recovery tasks across five scientific domains, 6,000 trials, and four models. This displacement is consistent across domains, but its accuracy consequences depend on how the displaced strategy compares to the one that replaces it: the same shift can lower accuracy, leave it unchanged, or appear to improve it. In all cases, however, the model shifts away from knowledge-driven reasoning. For practitioners deploying LLMs on scientific tasks, the message is cautionary: in-context examples may displace, rather than reinforce, the knowledge they are intended to support.
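To make the setup concrete, the two prompt conditions can be sketched for one hypothetical chemistry task. The first-order rate law C(t) = C0 * exp(-k t), the prompt wording, the parameter ranges, and every function name below are illustrative assumptions; the paper's actual tasks and templates are not reproduced on this page.

```python
import math
import random


def rate_constant(c0: float, c: float, t: float) -> float:
    """Knowledge-driven answer: C(t) = C0 * exp(-k t) implies k = ln(C0 / C) / t."""
    return math.log(c0 / c) / t


def sample_observation(rng: random.Random) -> dict:
    """Draw a hidden k and generate the observable data from the true formula."""
    k = rng.uniform(0.05, 0.3)
    c0 = rng.uniform(0.5, 2.0)
    t = rng.uniform(1.0, 5.0)
    return {"c0": c0, "t": t, "c": c0 * math.exp(-k * t), "k": k}


def format_query(obs: dict) -> str:
    """Hypothetical question template; the latent k is never shown in the query."""
    return (f"Initial concentration: {obs['c0']:.3f} M. "
            f"After {obs['t']:.2f} s: {obs['c']:.3f} M. "
            "What is the first-order rate constant k in 1/s?")


def build_prompt(query: dict, examples=()) -> str:
    """Zero-shot when examples is empty; few-shot otherwise."""
    lines = [f"{format_query(ex)} k = {ex['k']:.4f}" for ex in examples]
    lines.append(format_query(query))
    return "\n".join(lines)


rng = random.Random(0)
examples = [sample_observation(rng) for _ in range(3)]
query = sample_observation(rng)
zero_shot = build_prompt(query)            # condition without examples
few_shot = build_prompt(query, examples)   # examples come from the same formula
```

In this sketch the few-shot examples are correct by construction, which mirrors the condition the abstract describes: any change in behavior between the two prompts cannot be blamed on noisy or wrong examples.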
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that in-context examples suppress LLMs' recall and application of pretrained scientific knowledge on latent structure recovery tasks, shifting computation toward empirical pattern fitting even when examples are generated from the identical underlying formula. This displacement is documented across 60 tasks in five domains, 6,000 trials, and four models, with accuracy consequences varying by whether the displaced knowledge-driven strategy outperforms or underperforms the replacement strategy.
Significance. If the result holds after addressing measurement and isolation issues, it would be a practically important finding for scientific applications of LLMs, showing that standard few-shot prompting can displace rather than reinforce domain knowledge. The scale (60 tasks, multiple models and domains) is a strength, but the absence of direct probes for knowledge access limits the strength of the causal interpretation.
major comments (3)
- [Abstract and §3] Abstract and §3 (experimental setup): the central claim requires that zero-shot performance reflects pretrained formula recall and that few-shot shifts specifically suppress that recall rather than induce a general strategy change. No direct measurement (e.g., formula-token activation, knowledge-editing ablation, or counterfactual examples preserving surface statistics) is described to rule out the alternative that models simply reweight toward prompt-local fitting.
- [§4] §4 (results): accuracy shifts are reported as evidence of displacement, yet without statistical controls for prompt length, example quality, or multiple solution strategies per task, the patterns remain compatible with non-specific changes in computation rather than targeted knowledge suppression. The claim that 'the model shifts away from knowledge-driven reasoning' in all cases therefore rests on an untested assumption about what zero-shot behavior measures.
- [Task construction] Task construction (throughout): the 60 latent-structure tasks must demonstrably force reliance on pretrained formulas in the zero-shot condition. If tasks admit surface heuristics or multiple valid strategies whose relative weighting changes with the addition of examples, the observed displacement does not isolate suppression of scientific knowledge recall.
minor comments (2)
- [Methods] Clarify the exact prompting templates, example generation procedure, and how 'knowledge use' was operationalized (e.g., via accuracy alone or additional probes) in the methods section.
- [Results] Add error bars, confidence intervals, or per-task variance for the 6,000 trials and report whether differences are statistically significant after multiple-comparison correction.
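The reporting requested in the last comment could look roughly like the sketch below: a percentile bootstrap interval for per-condition accuracy and a Holm step-down correction over per-task p-values. Both are standard techniques chosen here for illustration; the report does not say which interval or correction the authors should use, and the function names and example numbers are hypothetical.

```python
import random


def bootstrap_ci(correct: list[int], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> tuple[float, float]:
    """Percentile bootstrap interval for mean accuracy over 0/1 trial outcomes."""
    rng = random.Random(seed)
    n = len(correct)
    # Resample trials with replacement and record the accuracy of each resample.
    means = sorted(sum(rng.choices(correct, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down procedure: returns one reject decision per hypothesis."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared with alpha / m, the next with alpha / (m - 1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values are retained
    return reject


# Made-up outcomes: 70 correct answers out of 100 trials in one condition.
lo, hi = bootstrap_ci([1] * 70 + [0] * 30)
# Made-up per-task p-values for the zero-shot vs. few-shot difference.
decisions = holm_reject([0.01, 0.04, 0.03, 0.005])
```

Holm is used in the sketch because it controls the family-wise error rate without assuming independence between tasks; a false-discovery-rate procedure would be a reasonable alternative for 60 tasks.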
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed report. The comments highlight important issues around causal isolation and measurement that we have addressed through clarifications, additional controls, and expanded discussion in the revised manuscript. Below we respond point by point to the major comments.
- Referee: [Abstract and §3] the central claim requires that zero-shot performance reflects pretrained formula recall and that few-shot shifts specifically suppress that recall rather than induce a general strategy change. No direct measurement (e.g., formula-token activation, knowledge-editing ablation, or counterfactual examples preserving surface statistics) is described to rule out the alternative that models simply reweight toward prompt-local fitting.
Authors: We agree that direct probes such as activation analysis or knowledge editing would provide stronger causal evidence. Our current design infers recall from task construction: each latent-structure task was selected so that zero-shot success requires applying a specific pretrained formula (e.g., reaction-rate equations or elasticity formulas), and we report non-trivial zero-shot accuracies that exceed surface-heuristic baselines. To address the reweighting alternative, the revised manuscript adds (i) length-matched zero-shot controls, (ii) shuffled-example ablations that preserve surface statistics while breaking formula structure, and (iii) explicit discussion of why general strategy change is unlikely given the error patterns observed. Full internal-state probes remain outside the present API-based experimental scope.
revision: partial
- Referee: [§4] accuracy shifts are reported as evidence of displacement, yet without statistical controls for prompt length, example quality, or multiple solution strategies per task, the patterns remain compatible with non-specific changes in computation rather than targeted knowledge suppression.
Authors: We appreciate this observation. The revised §4 now includes: (a) prompt-length controls via padded zero-shot baselines of matched token count, (b) verification that all in-context examples were generated from the identical underlying formula to ensure quality and relevance, and (c) per-task analysis of error types showing that zero-shot mistakes align with formula misapplication rather than generic heuristics. We also added statistical tests (mixed-effects models) that control for these factors when reporting accuracy shifts. These additions support the targeted-suppression interpretation while acknowledging residual uncertainty.
revision: yes
- Referee: [Task construction] the 60 latent-structure tasks must demonstrably force reliance on pretrained formulas in the zero-shot condition. If tasks admit surface heuristics or multiple valid strategies whose relative weighting changes with the addition of examples, the observed displacement does not isolate suppression of scientific knowledge recall.
Authors: We have substantially expanded the task-construction section and added an appendix table. For each domain we now detail why surface heuristics fail on the chosen parameter ranges and held-out test distributions, and we report that zero-shot model performance reliably exceeds heuristic baselines. We further include a new analysis showing that adding examples changes not only accuracy but also the qualitative form of errors in a manner consistent with formula displacement rather than simple reweighting among multiple strategies. These revisions strengthen the claim that the observed shift isolates suppression of pretrained formula recall.
revision: yes
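Two of the controls named in the responses, the shuffled-example ablation and the padded zero-shot baseline, could be built along the lines sketched below. This is a guess at one reasonable construction, not the authors' implementation: permuting answers across examples is only one way to "preserve surface statistics", whitespace splitting stands in for a real tokenizer, and all names and sample strings are hypothetical.

```python
import random


def shuffle_targets(examples: list[tuple[str, str]], seed: int = 0) -> list[tuple[str, str]]:
    """Permute answers across examples: same inputs, same set of answers, broken pairing."""
    rng = random.Random(seed)
    inputs = [question for question, _ in examples]
    targets = [answer for _, answer in examples]
    # A random permutation can leave some pairs intact; a stricter ablation
    # would resample until no answer stays with its original input.
    rng.shuffle(targets)
    return list(zip(inputs, targets))


def count_tokens(text: str) -> int:
    """Whitespace token count, used here as a stand-in for a model tokenizer."""
    return len(text.split())


def pad_to_length(prompt: str, target_tokens: int, filler: str = ".") -> str:
    """Prepend neutral filler tokens so a zero-shot prompt matches a few-shot length."""
    missing = max(0, target_tokens - count_tokens(prompt))
    if missing == 0:
        return prompt
    return " ".join([filler] * missing) + "\n" + prompt


# Made-up example strings for illustration.
examples = [("C0=1.0 t=2.0 C=0.67", "k=0.20"),
            ("C0=1.5 t=1.0 C=1.36", "k=0.10"),
            ("C0=0.8 t=4.0 C=0.24", "k=0.30")]
shuffled = shuffle_targets(examples)

few_shot = "\n".join(f"{q} {a}" for q, a in examples) + "\nC0=2.0 t=3.0 C=1.10 k=?"
padded_zero_shot = pad_to_length("C0=2.0 t=3.0 C=1.10 k=?", count_tokens(few_shot))
```

If accuracy under the padded baseline matches plain zero-shot, prompt length alone is an unlikely explanation for the shift; if the shuffled examples produce the same shift as correct ones, the effect would be driven by the presence of examples rather than by their content.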
Circularity Check
No circularity: purely empirical measurement of prompt effects
The paper reports experimental results on 60 tasks across domains and models, documenting accuracy shifts when in-context examples are added. No equations, derivations, fitted parameters, or first-principles claims appear; the central observation is a measured behavioral change under controlled prompt conditions. The load-bearing step is the experimental design itself (comparing no-example vs. example prompts), which does not reduce to any self-referential definition or prior self-citation. Self-citations, if present, are not invoked to justify uniqueness or to close a derivation loop. This is a standard empirical study whose validity rests on task construction and measurement, not on any chain that collapses by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The 60 tasks require genuine latent structure recovery that cannot be solved by surface pattern matching alone.