
arxiv: 2604.27540 · v1 · submitted 2026-04-30 · 💻 cs.AI

In-Context Examples Suppress Scientific Knowledge Recall in LLMs

Pith reviewed 2026-05-07 09:26 UTC · model grok-4.3

keywords in-context learning · scientific reasoning · knowledge recall · large language models · latent structure recovery · prompting effects · few-shot prompting · knowledge displacement

The pith

In-context examples cause LLMs to suppress recall of scientific formulas and shift to pattern matching instead.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often know scientific formulas from pretraining and can apply them to recover hidden structure from data. The paper shows that providing in-context examples suppresses this knowledge use, pushing models toward fitting the surface patterns in the examples even when those examples were generated by the correct formula. The effect appears in sixty tasks spanning five domains and four models. Accuracy may drop, stay flat, or rise after the shift, but the model consistently moves away from knowledge-driven reasoning. For anyone using LLMs on scientific problems, this means standard few-shot prompting can undermine the very knowledge it aims to support.

Core claim

Adding in-context examples makes LLMs rely less on pretrained domain knowledge for latent structure recovery tasks, shifting computation toward empirical pattern fitting even when the examples are produced by the true scientific formula, with the displacement documented across sixty tasks, six thousand trials, and four models.

What carries the argument

Knowledge displacement effect, in which in-context examples override pretrained formula recall in favor of surface pattern matching on latent structure tasks.
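
To make the two reasoning modes concrete, the sketch below contrasts them on Newton's cooling, the example named in Figure 1. It is an illustration, not the paper's task or code: the ambient temperature, initial temperature, observation time, example range, and the function names `k_from_formula` and `k_from_pattern` are all assumptions, and a least-squares line stands in for whatever surface fitting a model might actually do.

```python
import math

# All constants below are illustrative assumptions, not values from the paper.
T_ENV = 20.0   # ambient temperature
T_0 = 90.0     # initial temperature
T_OBS = 10.0   # time at which the temperature is observed


def temperature(k, t=T_OBS):
    """Newton's law of cooling: T(t) = T_env + (T_0 - T_env) * exp(-k t)."""
    return T_ENV + (T_0 - T_ENV) * math.exp(-k * t)


def k_from_formula(temp, t=T_OBS):
    """Knowledge-driven route: invert the cooling law for the latent constant k."""
    return -math.log((temp - T_ENV) / (T_0 - T_ENV)) / t


def k_from_pattern(examples, temp):
    """Pattern-fitting route: least-squares line through (temperature, k) examples."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in examples)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    return mean_y + (sxy / sxx) * (temp - mean_x)


# In-context examples generated by the true formula, over a narrow range of k.
examples = [(temperature(k), k) for k in (0.02, 0.04, 0.06, 0.08, 0.10)]

# Query whose latent constant lies outside the range the examples cover.
true_k = 0.30
query_temp = temperature(true_k)

formula_k = k_from_formula(query_temp)
pattern_k = k_from_pattern(examples, query_temp)
print(f"true k = {true_k:.3f}, formula = {formula_k:.3f}, pattern fit = {pattern_k:.3f}")
```

Under these assumed values the formula route recovers k, while the line fitted to formula-generated examples underestimates it once the query leaves the example range. That is one way the same shift could cost accuracy even though every example is correct.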

If this is right

  • The same shift can lower accuracy when pretrained knowledge outperforms pattern fitting, leave accuracy unchanged, or raise it when patterns match better.
  • Displacement occurs even when examples are generated from the correct formula rather than from incorrect or noisy data.
  • The effect holds across chemistry, economics, physics, biology, and other domains tested.
  • Practitioners cannot assume in-context examples will reinforce or activate existing scientific knowledge in LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The suppression may extend to other knowledge-intensive domains where LLMs hold strong pretrained priors.
  • Prompt designs that explicitly instruct models to use internal knowledge before looking at examples could be tested as a countermeasure.
  • The finding suggests that scaling up in-context examples might widen rather than close the gap between pattern fitting and knowledge use.
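
The countermeasure floated above could be tested with three prompt conditions: zero-shot, few-shot, and few-shot preceded by an instruction to recall the governing law. The sketch below only assembles such prompts. The template, the instruction wording, and the names `build_prompt` and `KNOWLEDGE_FIRST` are hypothetical, and nothing here shows that the instruction actually reduces displacement.

```python
# Hypothetical instruction text; the paper does not specify any such wording.
KNOWLEDGE_FIRST = (
    "Before reading any examples, state the scientific law that governs "
    "this problem and solve from that law."
)


def format_examples(examples):
    """Render (input, output) pairs in a plain few-shot layout."""
    return "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)


def build_prompt(question, examples=(), knowledge_first=False):
    """Assemble a zero-shot, few-shot, or knowledge-first few-shot prompt."""
    parts = []
    if knowledge_first:
        parts.append(KNOWLEDGE_FIRST)
    if examples:
        parts.append(format_examples(examples))
    parts.append(f"Input: {question}\nOutput:")
    return "\n\n".join(parts)


question = "T = 23.5 at t = 10; find k"
demo_examples = [("T = 77.3 at t = 10", "k = 0.02"), ("T = 45.8 at t = 10", "k = 0.10")]

zero_shot = build_prompt(question)
few_shot = build_prompt(question, demo_examples)
knowledge_first = build_prompt(question, demo_examples, knowledge_first=True)
print(knowledge_first)
```

Keeping the question and examples identical across the three conditions means any behavioral difference can be attributed to the examples or to the added instruction.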

Load-bearing premise

The tasks truly isolate recall of hidden scientific structure from surface pattern matching, and the observed change is suppression of knowledge rather than a general change in computation strategy.

What would settle it

The suppression claim would be falsified by an experiment in which models keep deriving answers from their pretrained formulas even when the in-context examples follow a pattern that differs from the formula.
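
One way such an experiment could be scored is to label each numeric answer by whether it lands near the formula's prediction or near the prediction implied by the differing example pattern. The sketch below assumes numeric answers and a 5% relative tolerance. The function names, the tolerance, and the sample answers are all made up for illustration and are not taken from the paper.

```python
def classify_answer(answer, formula_pred, decoy_pred, rel_tol=0.05):
    """Label one answer by which prediction it falls within rel_tol of.

    Tolerance is relative to each prediction, so a prediction of exactly
    zero only matches an answer of exactly zero.
    """
    near_formula = abs(answer - formula_pred) <= rel_tol * abs(formula_pred)
    near_decoy = abs(answer - decoy_pred) <= rel_tol * abs(decoy_pred)
    if near_formula and near_decoy:
        return "ambiguous"
    if near_formula:
        return "formula"
    if near_decoy:
        return "decoy"
    return "neither"


def label_shares(answers, formula_pred, decoy_pred):
    """Fraction of answers carrying each label."""
    labels = [classify_answer(a, formula_pred, decoy_pred) for a in answers]
    names = ("formula", "decoy", "ambiguous", "neither")
    return {name: labels.count(name) / len(labels) for name in names}


# Made-up answers for a single task, for illustration only.
answers = [0.30, 0.29, 0.16, 0.155, 0.50]
shares = label_shares(answers, formula_pred=0.30, decoy_pred=0.16)
print(shares)
```

A high "formula" share under mismatched examples would count against suppression, while a high "decoy" share would be consistent with it. The scoring is only informative when the two predictions are far enough apart that few answers are labeled "ambiguous".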

Figures

Figures reproduced from arXiv: 2604.27540 by Chaemin Jang, Dongman Lee, Hyeok Yun, Jihee Kim, Woojin Park.

Figure 1: Two competing reasoning modes on Newton’s cooling.
Figure 2: Cross-model replication (30 trials per task, except GPT-5.2: 50 trials). Blue = …
Figure 3: Case A (Economics): The model correctly derives consumer surplus from Cournot equilibrium under zero-shot, but under 10-shot fits a numerical heuristic from examples that is inconsistent across trials. Case B (Geoscience): The model recalls the correct law but assumes wrong parameters; 10-shot replaces derivation with pattern-fitting that produces the right number.
Figure 4: Domain representations in hidden-state space (Qwen2.5-7B, MDS projection …
Original abstract

Scientific reasoning rarely stops at what is directly observable; it often requires uncovering hidden structure from data. From estimating reaction constants in chemistry to inferring demand elasticities in economics, this latent structure recovery is what distinguishes scientific reasoning from curve fitting. Large language models (LLMs) can often recall and apply relevant scientific formulas, but we show that this ability is surprisingly easy to suppress. We show that adding in-context examples makes models rely less on pretrained domain knowledge, even when those examples are generated by the very same formula. Rather than reinforcing knowledge-driven derivation, examples shift computation toward empirical pattern fitting. We document this knowledge displacement on 60 latent structure recovery tasks across five scientific domains, 6,000 trials, and four models. This displacement is consistent across domains, but its accuracy consequences depend on how the displaced strategy compares to the one that replaces it: the same shift can lower accuracy, leave it unchanged, or appear to improve it. In all cases, however, the model shifts away from knowledge-driven reasoning. For practitioners deploying LLMs on scientific tasks, the message is cautionary: in-context examples may displace, rather than reinforce, the knowledge they are intended to support.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that in-context examples suppress LLMs' recall and application of pretrained scientific knowledge on latent structure recovery tasks, shifting computation toward empirical pattern fitting even when examples are generated from the identical underlying formula. This displacement is documented across 60 tasks in five domains, 6,000 trials, and four models, with accuracy consequences varying by whether the displaced knowledge-driven strategy outperforms or underperforms the replacement strategy.

Significance. If the result holds after addressing measurement and isolation issues, it would be a practically important finding for scientific applications of LLMs, showing that standard few-shot prompting can displace rather than reinforce domain knowledge. The scale (60 tasks, multiple models and domains) is a strength, but the absence of direct probes for knowledge access limits the strength of the causal interpretation.

major comments (3)
  1. [Abstract and §3] (experimental setup): The central claim requires that zero-shot performance reflects pretrained formula recall and that few-shot shifts specifically suppress that recall rather than induce a general strategy change. No direct measurement (e.g., formula-token activation, knowledge-editing ablation, or counterfactual examples preserving surface statistics) is described to rule out the alternative that models simply reweight toward prompt-local fitting.
  2. [§4] (results): Accuracy shifts are reported as evidence of displacement, yet without statistical controls for prompt length, example quality, or multiple solution strategies per task, the patterns remain compatible with non-specific changes in computation rather than targeted knowledge suppression. The claim that 'the model shifts away from knowledge-driven reasoning' in all cases therefore rests on an untested assumption about what zero-shot behavior measures.
  3. [Task construction] (throughout): The 60 latent-structure tasks must demonstrably force reliance on pretrained formulas in the zero-shot condition. If tasks admit surface heuristics or multiple valid strategies whose relative weighting changes with the addition of examples, the observed displacement does not isolate suppression of scientific knowledge recall.
minor comments (2)
  1. [Methods] Clarify the exact prompting templates, example generation procedure, and how 'knowledge use' was operationalized (e.g., via accuracy alone or additional probes) in the methods section.
  2. [Results] Add error bars, confidence intervals, or per-task variance for the 6,000 trials and report whether differences are statistically significant after multiple-comparison correction.
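
The second minor comment asks for interval estimates and multiple-comparison correction. As a rough sketch of what that could involve, the code below computes a Wilson score interval for one task's accuracy and applies the Holm step-down procedure to a set of per-task p-values. Both are standard techniques chosen here for illustration. The referee does not name a specific method, and the counts and p-values are made up (the 30 trials echo the per-task figure in the Figure 2 caption).

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction: returns one reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Illustrative numbers only: 24 correct answers out of 30 trials on one task.
low, high = wilson_interval(24, 30)
print(f"accuracy 0.80, interval [{low:.3f}, {high:.3f}]")

# Illustrative per-task p-values for a zero-shot versus few-shot comparison.
flags = holm_reject([0.001, 0.04, 0.03])
print(flags)
```

With 60 tasks the Holm threshold for the smallest p-value would be alpha / 60, so per-task differences that look significant in isolation may not survive correction.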

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important issues around causal isolation and measurement that we have addressed through clarifications, additional controls, and expanded discussion in the revised manuscript. Below we respond point by point to the major comments.

Point-by-point responses
  1. Referee: [Abstract and §3] the central claim requires that zero-shot performance reflects pretrained formula recall and that few-shot shifts specifically suppress that recall rather than induce a general strategy change. No direct measurement (e.g., formula-token activation, knowledge-editing ablation, or counterfactual examples preserving surface statistics) is described to rule out the alternative that models simply reweight toward prompt-local fitting.

    Authors: We agree that direct probes such as activation analysis or knowledge editing would provide stronger causal evidence. Our current design infers recall from task construction: each latent-structure task was selected so that zero-shot success requires applying a specific pretrained formula (e.g., reaction-rate equations or elasticity formulas), and we report non-trivial zero-shot accuracies that exceed surface-heuristic baselines. To address the reweighting alternative, the revised manuscript adds (i) length-matched zero-shot controls, (ii) shuffled-example ablations that preserve surface statistics while breaking formula structure, and (iii) explicit discussion of why general strategy change is unlikely given the error patterns observed. Full internal-state probes remain outside the present API-based experimental scope. revision: partial

  2. Referee: [§4] accuracy shifts are reported as evidence of displacement, yet without statistical controls for prompt length, example quality, or multiple solution strategies per task, the patterns remain compatible with non-specific changes in computation rather than targeted knowledge suppression.

    Authors: We appreciate this observation. The revised §4 now includes: (a) prompt-length controls via padded zero-shot baselines of matched token count, (b) verification that all in-context examples were generated from the identical underlying formula to ensure quality and relevance, and (c) per-task analysis of error types showing that zero-shot mistakes align with formula misapplication rather than generic heuristics. We also added statistical tests (mixed-effects models) that control for these factors when reporting accuracy shifts. These additions support the targeted-suppression interpretation while acknowledging residual uncertainty. revision: yes

  3. Referee: [Task construction] the 60 latent-structure tasks must demonstrably force reliance on pretrained formulas in the zero-shot condition. If tasks admit surface heuristics or multiple valid strategies whose relative weighting changes with the addition of examples, the observed displacement does not isolate suppression of scientific knowledge recall.

    Authors: We have substantially expanded the task-construction section and added an appendix table. For each domain we now detail why surface heuristics fail on the chosen parameter ranges and held-out test distributions, and we report that zero-shot model performance reliably exceeds heuristic baselines. We further include a new analysis showing that adding examples changes not only accuracy but also the qualitative form of errors in a manner consistent with formula displacement rather than simple reweighting among multiple strategies. These revisions strengthen the claim that the observed shift isolates suppression of pretrained formula recall. revision: yes
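
Two of the controls named in the responses, shuffled-example ablations and length-matched zero-shot baselines, can be sketched in a few lines. The code below is an assumption-laden illustration rather than the authors' procedure: it uses a deterministic rotation of outputs as a simple stand-in for shuffling, counts tokens by whitespace instead of a model tokenizer, and the function names and filler sentence are hypothetical.

```python
def rotate_outputs(examples):
    """Re-pair each input with the next example's output.

    The sets of inputs and outputs are unchanged, but every input-output
    pairing is broken as long as the outputs are distinct.
    """
    inputs = [x for x, _ in examples]
    outputs = [y for _, y in examples]
    rotated = outputs[1:] + outputs[:1]
    return list(zip(inputs, rotated))


def pad_to_length(prompt, target_tokens, filler="This sentence is neutral filler."):
    """Prepend filler words until the whitespace-token count reaches target_tokens."""
    words = prompt.split()
    filler_words = filler.split()
    padding = []
    while len(words) + len(padding) < target_tokens:
        padding.append(filler_words[len(padding) % len(filler_words)])
    if not padding:
        return prompt
    return " ".join(padding) + "\n\n" + prompt


# Illustrative (temperature, k) examples, not data from the paper.
examples = [(77.3, 0.02), (66.9, 0.04), (58.4, 0.06)]
ablated = rotate_outputs(examples)
print(ablated)

padded = pad_to_length("Input: T = 23.5 at t = 10; find k", target_tokens=40)
print(len(padded.split()))
```

If displacement persisted under the rotated examples, that would suggest the mere presence of examples drives the shift. If it vanished under padding alone, prompt length would be ruled out as the cause.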

Circularity Check

0 steps flagged

No circularity: purely empirical measurement of prompt effects

Full rationale

The paper reports experimental results on 60 tasks across domains and models, documenting accuracy shifts when in-context examples are added. No equations, derivations, fitted parameters, or first-principles claims appear; the central observation is a measured behavioral change under controlled prompt conditions. The load-bearing step is the experimental design itself (comparing no-example vs. example prompts), which does not reduce to any self-referential definition or prior self-citation. Self-citations, if present, are not invoked to justify uniqueness or to close a derivation loop. This is a standard empirical study whose validity rests on task construction and measurement, not on any chain that collapses by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that the chosen tasks isolate knowledge-driven reasoning from pattern matching and that the measurement of 'knowledge use' is valid; no free parameters or invented entities are described.

axioms (1)
  • domain assumption The 60 tasks require genuine latent structure recovery that cannot be solved by surface pattern matching alone
    Invoked to interpret the shift as knowledge suppression rather than strategy change


