pith. machine review for the scientific record.

arxiv: 2605.08295 · v1 · submitted 2026-05-08 · 💻 cs.LG · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

In-Context Fixation: When Demonstrated Labels Override Semantics in Few-Shot Classification

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:39 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords: in-context learning · few-shot classification · label fixation · vocabulary retrieval · activation patching · semantic override · constrained vocabulary

The pith

Homogeneous demonstration labels force in-context models to treat shown tokens as the full answer vocabulary, collapsing accuracy even when the labels are semantically correct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that few-shot in-context learning does not infer task semantics from examples but instead retrieves answers exclusively from the token inventory placed in the label position of the demonstrations. When those labels are uniform, performance falls to 12 percent or below across six models and four tasks, even if the uniform labels are valid. Experiments with varied nonsense tokens confirm the model assigns substantial probability mass to the demonstrated set while ignoring plausible alternatives. Mechanistic interventions localize the effect to a specific circuit whose patching restores nearly all lost performance.
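
A minimal sketch of that set-level measurement, assuming a Hugging Face causal LM; the model choice and prompt are illustrative placeholders, not the paper's exact setup:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder model from one of the paper's families; any causal LM works.
    name = "EleutherAI/pythia-1b"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    demo_set = ["foo", "bar", "vex", "nit", "orb"]  # varied nonsense labels
    prompt = (
        "Input: a loyal animal that barks\nLabel: foo\n"
        "Input: a pet that fetches sticks\nLabel: vex\n"
        "Input: man's best friend\nLabel:"
    )

    with torch.no_grad():
        logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
    probs = torch.softmax(logits, dim=-1)

    def first_token_prob(word):
        # Leading space so tokenization matches the "Label: foo" context.
        return probs[tok.encode(" " + word, add_special_tokens=False)[0]].item()

    set_mass = sum(first_token_prob(w) for w in demo_set)
    print(f"P(demonstrated set) = {set_mass:.3f}  P(' dog') = {first_token_prob('dog'):.5f}")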

Core claim

In-context learning output is constrained vocabulary retrieval: the model binds its output to the demonstrated token inventory regardless of semantic plausibility, with homogeneity as the maximally collapsed case.

What carries the argument

Label-slot content fixation, where tokens occupying the demonstration label position are adopted as an exhaustive answer vocabulary, with the effect localized to a layer-7-centered circuit.
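
To make the manipulated variable concrete, a minimal prompt-construction sketch: template, inputs, and separators are held fixed while only the label-slot tokens vary by condition. The example inputs are illustrative, not drawn from the paper's tasks.

    import random

    TEMPLATE = "Input: {x}\nLabel: {y}\n"

    def build_prompt(inputs, labels, query):
        demos = "".join(TEMPLATE.format(x=x, y=y) for x, y in zip(inputs, labels))
        return demos + f"Input: {query}\nLabel:"

    inputs = ["a loyal animal that barks", "a pet that fetches sticks",
              "an animal that wags its tail", "a four-legged companion"]

    homogeneous = ["dog"] * 4                                   # valid but uniform
    nonsense = random.choices(["foo", "bar", "vex", "nit", "orb"], k=4)

    print(build_prompt(inputs, homogeneous, "man's best friend"))
    print(build_prompt(inputs, nonsense, "man's best friend"))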

If this is right

  • Four-way classification accuracy falls to zero percent across tested model sizes.
  • Multi-token verbalizers separate into dissociable format-level template adoption and content-level polarity override.
  • Per-item activation patching on Pythia-1B recovers 98.4 percent of the performance gap and identifies a rank-2 circuit (a patching sketch follows this list).
  • The encode-then-override trajectory replicates across Llama architectures with causal confirmation in top layers.
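
A minimal sketch of the paired-patching idea referenced above, assuming a GPT-NeoX-style model: cache one layer's final-position hidden state from a control run, then overwrite that state during the garden-path (GP) run. The layer index, hook granularity, and prompts are placeholders; the paper localizes a specific rank-2 circuit, which this sketch does not reproduce.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/pythia-1b"
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()
    layer = model.gpt_neox.layers[7]  # near the paper's layer-7-centered circuit

    def run(prompt, patch=None):
        cache = {}
        def hook(mod, inp, out):
            h = out[0] if isinstance(out, tuple) else out
            if patch is not None:
                h[:, -1, :] = patch              # overwrite final-position state
            cache["h"] = h[:, -1, :].detach().clone()
        handle = layer.register_forward_hook(hook)
        with torch.no_grad():
            logits = model(**tok(prompt, return_tensors="pt")).logits[0, -1]
        handle.remove()
        return cache["h"], logits

    control_prompt = "Input: a loyal animal that barks\nLabel: dog\nInput: man's best friend\nLabel:"
    gp_prompt = "Input: a loyal animal that barks\nLabel: foo\nInput: man's best friend\nLabel:"

    control_h, _ = run(control_prompt)            # cache the control activation
    _, patched = run(gp_prompt, patch=control_h)  # write it into the GP pass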

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Prompt engineering must prioritize label diversity to prevent unintended vocabulary constraints.
  • The retrieval bias implies that simply scaling model size will not restore semantic flexibility without targeted interventions.
  • The dissociation between format and content components suggests separate training objectives could mitigate each.
  • The nonsense-token probability shift provides a diagnostic test for whether a new model exhibits the same binding behavior.

Load-bearing premise

The accuracy collapse is triggered specifically by the content of the label slots rather than other correlated prompt features or training artifacts.

What would settle it

A model that maintains high accuracy on homogeneous yet semantically valid labels after the same activation-patching intervention that previously recovered 98 percent of the gap would falsify the fixation account.

Figures

Figures reproduced from arXiv: 2605.08295 by Ming Liu.

Figure 1. Dose–response: accuracy on dog-descriptive …
Figure 2. Per-item paired activation patching (Pythia …)
Figure 3. Logit lens on Llama-3.2-1B. Both GP and control encode the correct answer at near-ceiling accuracy (L9–L11: ∼100%); upper layers override the GP signal to 0% while control retains 49%. Divergence at L3 (Bonferroni p = 4.7 × 10⁻²⁴). … control prompt into the GP forward pass, testing sufficiency: the top head (L10-H5) alone restores 16.7% of the gap, and the top 4 heads (spanning L7, L8, L10) together achieve …
Figure 4. Set-level fixation across four models. Un…
read the original abstract

While random demonstration labels barely hurt in-context learning (Min et al., 2022), we show that homogeneous labels--even semantically valid ones--collapse accuracy to <=12% across six models (Pythia, Llama, Qwen; 0.8B--8B) and four tasks. The trigger is label-slot content: the model treats tokens occupying the label position as an exhaustive answer vocabulary, with homogeneity as the maximally collapsed case. A novel set-level fixation finding confirms this: when demonstrations carry varied nonsense tokens from {foo,bar,vex,nit,orb}, the model places 42--67% of probability on the demonstrated set while P(dog) remains below 0.2%. This is inconsistent with latent-concept Bayesian accounts (Xie et al., 2022) and reveals that ICL output is constrained vocabulary retrieval--the model binds its output to the demonstrated token inventory regardless of semantic plausibility. The effect generalizes to 4-way classification (0% accuracy across three models, 1B--8B) and multi-token verbalizers ("very positive"), where we decompose fixation into format-level (template adoption) and content-level (polarity override) components that are experimentally dissociable. Mechanistically, per-item paired activation patching on Pythia-1B recovers 98.4% of the gap (95% CI [84%, 112%]), localizing fixation to a layer-7-centered circuit (rank 2/560, 99.8th percentile; 4-fold CV mean 103%). Cross-architecture logit lens on Llama-3.2-1B replicates the encode-then-override trajectory with causal confirmation (top-5 layers: 89% recovery).
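
A minimal logit-lens sketch of the encode-then-override readout, assuming a Llama-style Hugging Face model and the standard logit-lens projection (nostalgebraist, 2020); not the authors' exact code:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "meta-llama/Llama-3.2-1B"  # matches the paper's cross-architecture check
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    prompt = "Input: man's best friend\nLabel:"  # placeholder GP-style query
    with torch.no_grad():
        out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

    # Project each layer's residual stream through the final norm + unembedding.
    # Index 0 is the embedding layer; one entry per transformer layer follows.
    for i, h in enumerate(out.hidden_states):
        logits = model.lm_head(model.model.norm(h[:, -1, :]))
        print(f"layer {i:2d}: {tok.decode(logits.argmax(-1))!r}")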

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that in-context learning for classification tasks causes LLMs to treat tokens in the demonstration label slots as an exhaustive output vocabulary, overriding semantic content. This 'fixation' collapses accuracy to <=12% under homogeneous labels (even valid ones), extends to nonsense token sets where 42-67% probability mass stays within the demonstrated inventory, generalizes to 4-way classification (0% accuracy) and multi-token verbalizers, and is mechanistically localized via activation patching (98.4% recovery on Pythia-1B) and logit lens (replicated on Llama-3.2-1B).

Significance. If the results hold after addressing controls, the work supplies strong interventional evidence against latent-concept Bayesian accounts of ICL and reframes output generation as constrained retrieval from the demonstrated token set. Strengths include the multi-model/multi-task scope, the set-level probability measurements, the dissociable format- vs. content-level effects in verbalizer experiments, and the high-recovery activation-patching circuit localization with cross-architecture confirmation.

major comments (2)
  1. [Experimental setup and set-level fixation experiments (referenced in abstract and §4)] The central claim that fixation is specifically triggered by label-slot content (rather than correlated prompt features) rests on conditions that all appear to use fixed prompt templates. The set-level fixation result with {foo,bar,vex,nit,orb} therefore does not yet isolate label positions from repeated tokens, separators, or overall format; an ablation that permutes template structure while holding label-slot content constant is required to support the 'label-slot content' trigger stated in the abstract.
  2. [4-way generalization experiments] The 4-way classification result reports 0% accuracy across three models, but the prompt construction details (how the four label slots are populated and whether the template remains identical to the binary case) are needed to confirm that the collapse is not an artifact of increased label-set size interacting with the same fixed format.
minor comments (2)
  1. [Abstract and results tables] The abstract reports '42--67% of probability on the demonstrated set' for the nonsense-token condition; adding the exact per-model breakdown and the corresponding P(dog) values in a table would improve readability.
  2. [Mechanistic analysis] The activation-patching recovery is given as 98.4% with 95% CI [84%, 112%]; clarifying whether the interval is across items or across random seeds would help assess stability.
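
The two readings of that interval differ concretely; a sketch of a per-item bootstrap, with synthetic placeholder recovery values (not the paper's data):

    import numpy as np

    rng = np.random.default_rng(0)
    # Hypothetical per-item recovery fractions; the paper reports a 98.4% mean.
    per_item = rng.normal(loc=0.984, scale=1.0, size=200)

    boots = [rng.choice(per_item, size=per_item.size).mean() for _ in range(10_000)]
    lo, hi = np.percentile(boots, [2.5, 97.5])
    print(f"95% bootstrap CI across items: [{lo:.3f}, {hi:.3f}]")
    # A CI across seeds would instead resample whole-run means, and is typically
    # tighter when per-item noise dominates seed-to-seed variation.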

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below with clarifications from the manuscript and indicate the revisions we will incorporate to strengthen the claims.

read point-by-point responses
  1. Referee: [Experimental setup and set-level fixation experiments (referenced in abstract and §4)] The central claim that fixation is specifically triggered by label-slot content (rather than correlated prompt features) rests on conditions that all appear to use fixed prompt templates. The set-level fixation result with {foo,bar,vex,nit,orb} therefore does not yet isolate label positions from repeated tokens, separators, or overall format; an ablation that permutes template structure while holding label-slot content constant is required to support the 'label-slot content' trigger stated in the abstract.

    Authors: We agree that an explicit ablation permuting template structure would provide stronger isolation. Our current design holds the full prompt template fixed while systematically varying only the tokens inserted into the label slots (including homogeneous valid labels, random labels, and the nonsense set {foo,bar,vex,nit,orb}). The fact that fixation persists, with probability mass remaining concentrated on the demonstrated inventory, even when those tokens are semantically empty directly implicates label-slot content over other fixed prompt features. To address the referee's point, we will add a new control experiment in the revised §4 that permutes template elements (e.g., demonstration order, separator tokens, or instruction phrasing) while keeping the exact label-slot tokens constant, confirming that fixation is preserved. This ablation will be reported with the same set-level probability metrics (a sketch of the control follows these responses). revision: yes

  2. Referee: [4-way generalization experiments] The 4-way classification result reports 0% accuracy across three models, but the prompt construction details (how the four label slots are populated and whether the template remains identical to the binary case) are needed to confirm that the collapse is not an artifact of increased label-set size interacting with the same fixed format.

    Authors: We will include the full prompt templates and construction details for the 4-way experiments in the revised manuscript (new Appendix or expanded §4.2). The 4-way template is a direct extension of the binary template: the same instruction and separator structure is retained, with four distinct label positions populated in each demonstration (one token per class). Demonstrations are constructed by sampling one example per class and placing the corresponding label token in its slot. The 0% accuracy is therefore measured under an otherwise identical format, supporting that the collapse scales with set size under the same fixation mechanism rather than arising from format changes. These details will be added with example prompts for transparency (see the 4-way construction sketch below). revision: yes
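
Two editorial sketches of the controls these responses describe, under stated assumptions; the separators, instruction variants, class names, and example inputs below are illustrative placeholders, not the paper's templates. First, the template-permutation ablation from response 1: vary structure while holding the label-slot tokens fixed.

    import itertools
    import random

    SEPARATORS = ["\n", "\n\n", " | "]
    INSTRUCTIONS = ["", "Classify the input.\n"]

    def build(inputs, labels, query, sep, instr, order):
        pairs = [(inputs[i], labels[i]) for i in order]       # permuted demo order
        demos = sep.join(f"Input: {x} Label: {y}" for x, y in pairs)
        return instr + demos + sep + f"Input: {query} Label:"

    inputs = ["ex1", "ex2", "ex3", "ex4"]          # placeholder demonstrations
    labels = ["foo", "bar", "vex", "nit"]          # label-slot content held constant

    for sep, instr in itertools.product(SEPARATORS, INSTRUCTIONS):
        order = random.sample(range(4), 4)
        prompt = build(inputs, labels, "ex5", sep, instr, order)
        # ...score set-level probability mass on this prompt, as in §4...

Second, the 4-way construction from response 2: the binary template unchanged, with one sampled demonstration per class carrying its class label in the label slot.

    import random

    CLASSES = {
        "dog":  ["a loyal animal that barks"],
        "cat":  ["a pet that purrs"],
        "bird": ["an animal that can fly"],
        "fish": ["a pet that lives in water"],
    }

    def build_4way(query):
        demos = "".join(f"Input: {random.choice(xs)}\nLabel: {y}\n"
                        for y, xs in CLASSES.items())         # one demo per class
        return demos + f"Input: {query}\nLabel:"

    print(build_4way("man's best friend"))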

Circularity Check

0 steps flagged

No circularity: purely empirical interventional study

full rationale

The paper presents no derivation chain, equations, or first-principles claims that reduce to fitted parameters or self-referential definitions. All central results (accuracy collapse under homogeneous labels, set-level fixation on nonsense tokens, 4-way generalization, activation patching recovery of 98.4%, logit-lens trajectories) are obtained via direct experiments on multiple models, with explicit controls, confidence intervals, and cross-architecture replication. No self-citations are load-bearing; the work contrasts with prior Bayesian accounts but does not rely on them for its own validity. The skeptic concern about prompt-feature confounds is a question of experimental design strength, not circularity in any claimed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests entirely on experimental observations across models and tasks; no free parameters, mathematical axioms, or postulated entities with independent evidence are required or introduced.

pith-pipeline@v0.9.0 · 5621 in / 1084 out tokens · 55756 ms · 2026-05-12T01:39:28.128363+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 3 internal anchors

  1. [1]

    Rishabh Agarwal, Avi Singh, Lei M. Zhang, Bernd Bohnet, Luis Rosias, Stephanie Chan, et al. 2024. Many-shot in-context learning. In Advances in Neural Information Processing Systems (NeurIPS)

  2. [2]

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. 2023. What learning algorithm is in-context learning? Investigations with linear models. In International Conference on Learning Representations (ICLR)

  3. [3]

    Nora Belrose, Zach Furman, Logan Smith, Danny Halawi, Igor Ostrovsky, Lev McKinney, Stella Biderman, and Jacob Steinhardt. 2023. Eliciting latent predictions from transformers with the tuned lens. arXiv preprint arXiv:2303.08112

  4. [4]

    Amanda Bertsch, Maor Ivgi, Emily Xiao, Uri Alon, Jonathan Berant, Matthew R. Gormley, and Graham Neubig. 2025. In-context learning with long-context models: An in-depth exploration. In North American Chapter of the Association for Computational Linguistics (NAACL)

  5. [5]

    Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, et al. 2023. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning (ICML)

  6. [6]

    Marcel Binz and Eric Schulz. 2023. Using cognitive psychology to understand GPT-3. Proceedings of the National Academy of Sciences, 120(6):e2218523120

  7. [7]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, et al. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS)

  8. [8]

    Stephanie C. Y. Chan, Adam Santoro, Andrew K. Lampinen, Jane X. Wang, Aaditya Singh, Pierre H. Richemond, James L. McClelland, and Felix Hill. 2022. Data distributional properties drive emergent in-context learning in transformers. In Advances in Neural Information Processing Systems (NeurIPS)

  9. [9]

    Arthur Conmy, Augustine N. Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adrià Garriga-Alonso. 2023. Towards automated circuit discovery for mechanistic interpretability. In Advances in Neural Information Processing Systems (NeurIPS)

  10. [10]

    Damai Dai, Yutao Sun, Li Dong, Yaru Hao, Shuming Ma, Zhifang Sui, and Furu Wei. 2023. Why can GPT learn in-context? language models implicitly perform gradient descent as meta-optimizers. In Findings of the Association for Computational Linguistics: ACL

  11. [11]

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. 2021. A mathematical framework for transformer circuits. Transformer Circuits Thread

  12. [12]

    Yu Fei, Yifan Hou, Zeming Chen, and Antoine Bosselut. 2023. Mitigating label biases for in-context learning. In Annual Meeting of the Association for Computational Linguistics (ACL)

  13. [13]

    Shivam Garg, Dimitris Tsipras, Percy Liang, and Gregory Valiant. 2022. What can transformers learn in-context? a case study of simple function classes. In Advances in Neural Information Processing Systems (NeurIPS)

  14. [14]

    Mor Geva, Avi Caciularu, Kevin Ro Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Empirical Methods in Natural Language Processing (EMNLP)

  15. [15]

    Nicholas Goldowsky-Dill, Chris MacLeod, Lucas Sato, and Aryaman Arora. 2023. Localizing model behavior with path patching. arXiv preprint arXiv:2304.05969

  16. [16]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783

  17. [17]

    Karan Gupta, Sumegh Roychowdhury, Siva Rajesh Kasa, Santhosh Kumar Kasa, Anish Bhanushali, Nikhil Pattisapu, and Prasanna Srinivasa Murthy. 2024. How robust are LLMs to in-context majority label bias? In Workshop on Responsible Language Modeling, AAAI

  18. [18]

    Danny Halawi, Jean-Stanislas Denain, and Jacob Steinhardt. 2023. Overthinking the truth: Understanding how language models process false demonstrations. arXiv preprint arXiv:2307.09476

  19. [19]

    Roee Hendel, Mor Geva, and Amir Globerson. 2023. In-context learning creates task vectors. In Findings of the Association for Computational Linguistics: EMNLP

  20. [20]

    Ari Holtzman, Peter West, Vered Shwartz, Yejin Choi, and Luke Zettlemoyer. 2021. Surface form competition: Why the highest probability answer isn't always right. In Empirical Methods in Natural Language Processing (EMNLP)

  21. [21]

    Jannik Kossen, Yarin Gal, and Tom Rainforth. 2024. In-context learning learns label relationships but is not conventional learning. In International Conference on Learning Representations (ICLR)

  22. [22]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics (TACL), 12

  23. [23]

    Quanyu Long, Yin Wu, Wenya Wang, and Sinno Jialin Pan. 2024. Does in-context learning really learn? rethinking how large language models respond and solve tasks via in-context learning. In Conference on Language Modeling (COLM)

  24. [24]

    Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, and Pontus Stenetorp. 2022. Fantastically ordered prompts and where to find them: Overcoming few-shot prompt order sensitivity. In Annual Meeting of the Association for Computational Linguistics (ACL)

  25. [25]

    Callum McDougall, Arthur Conmy, Cody Rushing, Thomas McGrath, and Neel Nanda. 2024. Copy suppression: Comprehensively understanding a motif in language model attention heads. In Proceedings of the 7th BlackBoxNLP Workshop (EMNLP)

  26. [26]

    Thomas McGrath, Matthew Rahtz, Janos Kramar, Vladimir Mikulik, and Shane Legg. 2023. The hydra effect: Emergent self-repair in language model computations. arXiv preprint arXiv:2307.15771

  27. [27]

    Ian R. McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, et al. 2023. Inverse scaling: When bigger isn't better. Transactions on Machine Learning Research (TMLR)

  28. [28]

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS)

  29. [29]

    Sewon Min, Xinxi Lyu, Ari Holtzman, Mikel Artetxe, Mike Lewis, Hannaneh Hajishirzi, and Luke Zettlemoyer. 2022. Rethinking the role of demonstrations: What makes in-context learning work? In Empirical Methods in Natural Language Processing (EMNLP)

  30. [30]

    Neel Nanda and Joseph Bloom. 2022. TransformerLens. https://github.com/TransformerLensOrg/TransformerLens

  31. [31]

    nostalgebraist. 2020. Interpreting GPT: The logit lens. https://www.lesswrong.com/posts/AcKRB8wDpdaN6v6ru/interpreting-gpt-the-logit-lens

  32. [32]

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. 2022. In-context learning and induction heads. Transformer Circuits Thread

  33. [33]

    Jane Pan, Tianyu Gao, Howard Chen, and Danqi Chen. 2023. What in-context learning ``learns'' in-context: Disentangling task recognition and task learning. In Findings of the Association for Computational Linguistics: ACL

  34. [34]

    Alexander Peysakhovich and Adam Lerer. 2023. Attention sorting combats recency bias in long context language models. arXiv preprint arXiv:2310.01427

  35. [35]

    Gautam Reddy. 2024. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. In International Conference on Learning Representations (ICLR)

  36. [36]

    Cody Rushing and Neel Nanda. 2024. Explorations of self-repair in language models. In International Conference on Machine Learning (ICML)

  37. [37]

    Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. 2023. Are emergent abilities of large language models a mirage? In Advances in Neural Information Processing Systems (NeurIPS)

  38. [38]

    Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. 2024. Quantifying language models' sensitivity to spurious features in prompt design. In International Conference on Learning Representations (ICLR)

  39. [39]

    Eric Todd, Millicent L. Li, Arnab Sen Sharma, Aaron Mueller, Byron C. Wallace, and David Bau. 2024. Function vectors in large language models. In International Conference on Learning Representations (ICLR)

  40. [40]

    Johannes von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. 2023. Transformers learn in-context by gradient descent. In International Conference on Machine Learning (ICML)

  41. [41]

    Kevin Ro Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. 2022. Interpretability in the wild: A circuit for indirect object identification in GPT-2 small. In International Conference on Learning Representations (ICLR)

  42. [42]

    Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. 2023. Label words are anchors: An information flow perspective for understanding in-context learning. In Empirical Methods in Natural Language Processing (EMNLP)

  43. [43]

    Jason Wei, Najoung Kim, Yi Tay, and Quoc V. Le. 2023a. Inverse scaling can become U-shaped. In Empirical Methods in Natural Language Processing (EMNLP)

  44. [44]

    Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022. Emergent abilities of large language models. Transactions on Machine Learning Research (TMLR)

  45. [45]

    Jerry Wei, Jason Wei, Yi Tay, Dustin Tran, Albert Webson, Yifeng Lu, Xinyun Chen, Hanxiao Liu, Da Huang, Denny Zhou, and Tengyu Ma. 2023b. Larger language models do in-context learning differently. arXiv preprint arXiv:2303.03846

  46. [46]

    Sang Michael Xie, Aditi Raghunathan, Percy Liang, and Tengyu Ma. 2022. An explanation of in-context learning as implicit Bayesian inference. In International Conference on Learning Representations (ICLR)

  47. [47]

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, et al. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

  48. [48]

    Kang Min Yoo, Junyeob Kim, Hyuhng Joon Kim, Hyunsoo Cho, Hwiyeol Jo, Sang-Woo Lee, Sang-goo Lee, and Taeuk Kim. 2022. Ground-truth labels matter: A deeper look into input-label demonstrations. In Empirical Methods in Natural Language Processing (EMNLP)

  49. [49]

    Zeping Yu and Sophia Ananiadou. 2024. How do large language models learn in-context? query and key matrices of in-context heads are two towers for metric learning. In Empirical Methods in Natural Language Processing (EMNLP)

  50. [50]

    Fred Zhang and Neel Nanda. 2024. Towards best practices of activation patching in language models: Metrics and methods. In International Conference on Learning Representations (ICLR)

  51. [51]

    Tony Z. Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning (ICML)

  52. [52]

    Han Zhou, Xingchen Wan, Lev Proleev, Diana Mincu, Jilin Chen, Katherine Heller, and Subhrajit Roy. 2024. Batch calibration: Rethinking calibration for in-context learning and prompt engineering. In International Conference on Learning Representations (ICLR)