
arxiv: 2605.26350 · v1 · pith:DP2UAVVBnew · submitted 2026-05-25 · 💻 cs.LG · cs.AI

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

Pith reviewed 2026-06-29 22:22 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords in-context learning · demonstrations · perturbations · contextual evidence shift · sentiment classification · logical reasoning · math word problems

The pith

Correct demonstrations can reduce in-context learning accuracy even when they remain valid task examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that correctness of demonstrations does not guarantee their utility in in-context learning (ICL), and that some correct examples can decrease accuracy. To isolate this gap, the authors introduce task-preserving perturbations that alter only the exemplar input and assign it the target induced by the same task mapping. They formalize the resulting failure as contextual evidence shift, in which the perturbation changes the mixture of evidence the model uses for inference without breaking correctness. Experiments on sentiment classification, logical reasoning, and math word problems show substantial drops in ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. The results indicate that robust ICL requires checking how demonstrations shape contextual inference, not only whether they are correct.
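The two perturbation families named in the abstract (label-updating and target-preserving) can be illustrated on a toy addition task. This is a minimal sketch under stated assumptions, not the authors' construction: `task_mapping`, `label_updating`, `target_preserving`, and the example sentence are all hypothetical names chosen for illustration.

```python
import re
from typing import Tuple


def task_mapping(question: str) -> int:
    """Toy task mapping (assumption): the answer is the sum of all integers in the text."""
    return sum(int(n) for n in re.findall(r"\d+", question))


def label_updating(question: str, delta: int = 3) -> Tuple[str, int]:
    """Change task-relevant content (the numbers), then recompute the target."""
    perturbed = re.sub(r"\d+", lambda m: str(int(m.group()) + delta), question)
    return perturbed, task_mapping(perturbed)


def target_preserving(question: str, target: int) -> Tuple[str, int]:
    """Change task-irrelevant surface content (a name); the original target stays valid."""
    perturbed = question.replace("Ana", "Bo")
    return perturbed, target


original = "Ana has 2 apples and buys 5 more. How many apples does Ana have?"
original_target = task_mapping(original)

# Both perturbed exemplars remain correct instances of the same toy task.
lu_input, lu_target = label_updating(original)
tp_input, tp_target = target_preserving(original, original_target)
```

In both cases the exemplar stays correct under the task mapping; only the input text differs from the original. Whether such a change hurts ICL accuracy is the empirical question the paper studies.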

Core claim

Task-preserving perturbations separate exemplar correctness from utility by changing the effective mixture of evidence the model uses for contextual inference, allowing some correct demonstrations to reduce ICL performance.

What carries the argument

Contextual evidence shift, the mechanism by which task-preserving perturbations alter the mixture of evidence for contextual inference while preserving exemplar correctness.

If this is right

  • Perturbed correct demonstrations degrade ICL more for smaller models than for larger ones.
  • Degradation grows with higher ratios of perturbed demonstrations.
  • Harder tasks exhibit larger negative effects from the perturbations.
  • Evaluating ICL robustness requires assessing influence on contextual inference beyond label correctness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Demonstration selection procedures could gain from checking semantic alignment of inputs with the query in addition to correctness.
  • The evidence-shift effect may appear in other few-shot prompting settings that rely on example mixtures.
  • Models could be tested for sensitivity to evidence mixture through controlled perturbation experiments on new tasks.

Load-bearing premise

Task-preserving perturbations change only the effective mixture of evidence for contextual inference and introduce no other uncontrolled factors.

What would settle it

Apply task-preserving perturbations to the same set of correct demonstrations and measure whether ICL accuracy on held-out queries drops compared with the unperturbed set.
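A minimal sketch of that comparison, assuming exact-match scoring and an `Input:`/`Output:` prompt template. The helper names (`mix_exemplars`, `build_prompt`, `icl_accuracy`, `accuracy_drop`) and the `model` callable are hypothetical stand-ins for illustration; they are not taken from the paper's released code.

```python
import random
from typing import Callable, List, Tuple

Exemplar = Tuple[str, str]  # (input text, target text)


def mix_exemplars(clean: List[Exemplar],
                  perturb: Callable[[Exemplar], Exemplar],
                  rho: float,
                  rng: random.Random) -> List[Exemplar]:
    """Perturb a randomly chosen fraction rho of exemplars; keep the rest unchanged."""
    k = round(rho * len(clean))  # number of exemplars to perturb
    chosen = set(rng.sample(range(len(clean)), k))
    return [perturb(ex) if i in chosen else ex for i, ex in enumerate(clean)]


def build_prompt(exemplars: List[Exemplar], query: str) -> str:
    """Assumed prompt template: exemplars first, then the unanswered query."""
    shots = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in exemplars)
    return f"{shots}\nInput: {query}\nOutput:"


def icl_accuracy(model: Callable[[str], str],
                 exemplars: List[Exemplar],
                 queries: List[Exemplar]) -> float:
    """Exact-match accuracy of the model on held-out (query, gold) pairs."""
    hits = sum(model(build_prompt(exemplars, q)).strip() == gold
               for q, gold in queries)
    return hits / len(queries)


def accuracy_drop(model: Callable[[str], str],
                  clean: List[Exemplar],
                  perturb: Callable[[Exemplar], Exemplar],
                  queries: List[Exemplar],
                  rho: float,
                  seed: int = 0) -> float:
    """Clean-context accuracy minus mixed-context accuracy (positive means degradation)."""
    mixed = mix_exemplars(clean, perturb, rho, random.Random(seed))
    return icl_accuracy(model, clean, queries) - icl_accuracy(model, mixed, queries)
```

Here `model` is any function that maps a prompt string to a completion string. Averaging `accuracy_drop` over several seeds would reduce the variance that comes from which exemplars are selected for perturbation.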

Figures

Figures reproduced from arXiv: 2605.26350 by Chenghao Qiu, Chunli Peng, Kuan-Hao Huang, Yi Zhou, Yufeng Yang.

Figure 1: Overview of Task Preserving Exemplar Perturbations. We study task preserving exemplar perturbations across (1) sentiment analysis, (2) logical reasoning, and (3) math word tasks. Top: Exemplar construction under perturbation ratio ρ, where a proportion ρ of exemplars is randomly selected for perturbation while the remaining exemplars are kept unchanged. Green denotes original exemplars, orange denotes pert…
Figure 2: Sentiment Analysis Performance. We evaluate SST-2 with 32 in-context exemplars under different exemplar perturbation ratios. Top row: selected exemplars are replaced by task preserving perturbed exemplars constructed using our input side perturbation method. Bottom row: selected exemplars are instead replaced by task irrelevant factual sentences. The x-axis shows the perturbation ratio, and the y-axis repo…
Figure 3: Math Reasoning Performance. We evaluate LLAMA-2 models on the PROBLEMATHIC dataset (Anantheswaran et al. [2025]) with 16 in-context exemplars under different input perturbation ratios denoted by different colors. The left and right panels report results on the Simple and Complex splits. Accuracy is measured by Exact Match (EM) and averaged over 10 runs. Results
Figure 4: Attention Map under Tail Perturbation. Attention Map. Motivated by positional effect in Section 4.5, we visualize exemplar level attention under tail perturbations where only the first 4 exemplars are clean out of 32
Figure 5: SST-2 accuracy under reduced exemplar-format similarity.
Figure 6: Complete SST-2 sentiment analysis results.
Original abstract

In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness-utility gap, we introduce task-preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label-updating perturbations, where task-relevant semantics change and targets are recomputed, and stricter target-preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task-preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task-preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao-Qiu/Task-Preserving-ICL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that correctness of demonstrations does not guarantee utility in in-context learning and that some correct exemplars can degrade performance. It introduces task-preserving perturbations (label-updating and target-preserving) that keep examples valid under the task mapping while altering only the input, formalizing the resulting degradation as contextual evidence shift that changes the mixture of evidence for contextual inference. Experiments across sentiment classification, logical reasoning, and math word problems report substantial ICL accuracy drops under these perturbations, especially for smaller models, harder tasks, and higher perturbation ratios.

Significance. If the central claim holds after addressing controls, the work usefully challenges the assumption that correct exemplars are always beneficial in ICL and motivates evaluating how demonstrations affect inference. The public code release is a clear strength for reproducibility. The proposed distinction between correctness and utility is novel but hinges on isolating the evidence-shift mechanism from confounds.

major comments (1)
  1. The load-bearing assumption that task-preserving perturbations change only the effective mixture of contextual evidence (without rendering exemplars inconsistent with the model's learned mapping) requires explicit support. The manuscript should report zero-shot accuracy on the perturbed inputs using their assigned targets to confirm the pairs remain valid from the model's perspective; absent this check, the observed degradation could arise from uncontrolled factors orthogonal to contextual evidence shift.
minor comments (1)
  1. [Abstract] The abstract states results across three task types but omits key methodological details such as the number of models tested, dataset sizes, perturbation ratios used, and whether error bars or statistical tests accompany the reported degradations.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the major comment below.

Point-by-point responses
  1. Referee: The load-bearing assumption that task-preserving perturbations change only the effective mixture of contextual evidence (without rendering exemplars inconsistent with the model's learned mapping) requires explicit support. The manuscript should report zero-shot accuracy on the perturbed inputs using their assigned targets to confirm the pairs remain valid from the model's perspective; absent this check, the observed degradation could arise from uncontrolled factors orthogonal to contextual evidence shift.

    Authors: We agree that an explicit empirical check is valuable to isolate the evidence-shift mechanism. By construction, our label-updating and target-preserving perturbations assign targets that satisfy the task mapping, but we acknowledge that this does not automatically guarantee consistency with a given model's internal mapping. In the revision we will add zero-shot accuracy results on the perturbed inputs paired with their assigned targets across all tasks and models. These results will be reported alongside the main ICL experiments to confirm that the perturbed pairs remain valid from the model's perspective and that the observed ICL degradation is not driven by outright inconsistency. If the zero-shot numbers are high, this will strengthen the claim that the performance drop stems from altered evidence mixture rather than uncontrolled factors.

    revision: yes
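The zero-shot validity check discussed in this exchange can be sketched as follows. This is an illustrative outline only: `zero_shot_agreement`, the prompt template, and the case-insensitive exact-match rule are assumptions, and `model` is a hypothetical callable standing in for whatever inference backend is used.

```python
from typing import Callable, Iterable, Tuple


def zero_shot_agreement(model: Callable[[str], str],
                        pairs: Iterable[Tuple[str, str]],
                        template: str = "Input: {x}\nOutput:") -> float:
    """Fraction of perturbed inputs whose zero-shot prediction equals the assigned target.

    No exemplars are placed in the prompt, so a high value suggests the
    (perturbed input, assigned target) pairs are consistent with the model's
    own mapping before any in-context evidence is added.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("need at least one (input, target) pair")
    hits = 0
    for x, assigned in pairs:
        prediction = model(template.format(x=x)).strip().lower()
        hits += prediction == assigned.strip().lower()
    return hits / len(pairs)
```

A low agreement rate would point to the confound the referee raises: the perturbed pairs may look inconsistent to the model, independent of any shift in contextual evidence.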

Circularity Check

0 steps flagged

No circularity: purely empirical investigation with no reductive derivations

full rationale

The paper reports experimental results on task-preserving perturbations and their impact on ICL accuracy across multiple tasks. No equations, fitted parameters, or first-principles derivations appear in the provided text. The introduced concepts (task-preserving perturbations, contextual evidence shift) are defined operationally to describe observed phenomena rather than derived from prior fitted quantities or self-citations. Claims rest on direct performance measurements, not on any chain that reduces by construction to the inputs. This is a standard empirical study with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based solely on abstract; ledger is therefore minimal and provisional.

axioms (1)
  • domain assumption: In-context learning performs contextual inference by mixing evidence from demonstrations.
    Invoked to explain why perturbations separate correctness from utility.
invented entities (1)
  • contextual evidence shift (no independent evidence)
    purpose: Formal name for the mechanism that decouples exemplar correctness from utility.
    New term introduced in the abstract to describe the failure mode.

pith-pipeline@v0.9.1-grok · 5777 in / 1144 out tokens · 27761 ms · 2026-06-29T22:22:05.345953+00:00 · methodology


additionally used to accelerate inference.

D Reproducibility
D.1 Model and Inference Configuration

Table 5: Model and core inference settings.

Model family | Model IDs | Backend | Dtype | Decoding
Llama-2 Chat | meta-llama/Llama-2-{7b,13b,70b}-chat-hf | vLLM | bfloat16 | temperature = 0.0, top-p = 1.0
Llama-3.1 Instruct | meta-llama/Llama-3.1-{8B,70B}-Instruct | vLLM | bfloat16 | temp...