pith. machine review for the scientific record. sign in

arxiv: 2604.13552 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.AI

Recognition: unknown

Training-Free Test-Time Contrastive Learning for Large Language Models

Authors on Pith no claims yet

Pith reviewed 2026-05-10 13:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords test-time adaptationtraining-free methodscontrastive learninglarge language modelsreasoning improvementonline adaptationself-generated trajectories
0
0 comments X

The pith

A frozen LLM can adapt online to new problems by turning its own successful and failed reasoning paths into reusable textual rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training-free method called TF-TTCL that lets a large language model improve its performance on the fly when facing distribution shifts. It does this through an Explore-Reflect-Steer cycle: the model first generates multiple reasoning trajectories by role-playing different perspectives, then extracts explicit rules from the differences between good and bad trajectories, and finally retrieves those rules to guide future answers. This approach avoids any gradient updates or external data, relying only on the model's internal experiences during inference. If it works, it means LLMs can become more robust without retraining or white-box access, which matters for practical deployment where models encounter unseen inputs after deployment.

Core claim

TF-TTCL enables a frozen LLM to perform online test-time adaptation by implementing a dynamic Explore-Reflect-Steer loop: Semantic Query Augmentation generates diverse reasoning trajectories through multi-agent role-playing, Contrastive Experience Distillation distills the semantic gap between superior and inferior trajectories into explicit textual rules, and Contextual Rule Retrieval activates stored rules during inference to steer the model toward better patterns while avoiding past errors.

What carries the argument

The Explore-Reflect-Steer loop, which uses multi-agent role-playing to create contrastive trajectories, distills their differences into textual rules, and retrieves those rules to guide subsequent inference on the frozen model.

If this is right

  • The frozen model improves accuracy on both closed-ended and open-ended reasoning tasks compared to zero-shot prompting and other test-time adaptation baselines.
  • No parameter updates or gradient computation are required, allowing adaptation with only black-box access to the LLM.
  • Rules distilled from past trajectories can be stored and reused to avoid repeating earlier mistakes on similar future inputs.
  • The method operates in a fully online setting where adaptation occurs during sequential inference without offline training phases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the distilled rules prove stable across domains, the same loop could support lightweight continual adaptation in production chat systems without retraining costs.
  • The approach suggests a path for making LLMs more self-correcting by treating their own inference history as a growing knowledge base rather than discarding it after each response.
  • Extending the contrastive distillation step to include chain-of-thought length or confidence signals might yield finer-grained rules without adding external supervision.

Load-bearing premise

The differences between good and bad self-generated reasoning paths can be turned into clear textual rules that work on new questions the model has not seen before.

What would settle it

Run the method on a held-out reasoning dataset with clear distribution shift and measure whether accuracy remains no higher than the zero-shot baseline after the model has processed the first half of the data.

Figures

Figures reproduced from arXiv: 2604.13552 by Fei Liu, Jinwu Hu, Kaiwen Zheng, Kai Zhou, Mingkai Peng, Te Gu.

Figure 1
Figure 1. Figure 1: Overview of the TF-TTCL framework. 1) Semantic Query Augmentation: Employs multi-agent role [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The pipeline of Contrastive Experience Dis [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: A representative example demonstrating the [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Hyper-parameter ablation for TF-TTCL. “Max #Rules = [PITH_FULL_IMAGE:figures/full_fig_p018_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Schematic diagram of the problem Note: We do not upload this diagram to LLMs. Baseline Response. The baseline model makes a critical geometric misinterpretation, confusing the tangent point with the perpendicular foot: Baseline Response (Truncated) Maybe C is such that OC ⊥ AB? That’s the foot of perpendicular from O to AB... Slope AB = − √ 3, so slope perpendicular = 1/ √ 3. Line through O perpendicular t… view at source ↗
Figure 6
Figure 6. Figure 6: General TEACHER System Prompt for CRT. Student System Prompt You are a reasoner working within known constraints. Your goal is to approach the given problem with a clear, structured method to find the answer. ## Guidelines 1. **Clarify the Problem** - Identify what is being asked. - Note any missing information, ambiguity, or dependence on unstated assumptions. 2. **Reason Step-by-Step** - Each inference m… view at source ↗
Figure 7
Figure 7. Figure 7: General STUDENT System Prompt for CRT. Output Format ### Output Format - You must output the final answer at the very end of your response. - The final answer must be wrapped in `\boxed{}` (e.g., `\boxed{answer}`). - For numerical or specific text answers, provide only the exact result inside the box. - Do NOT include units or explanatory text inside the box. - If you are not confident in the exact value, … view at source ↗
Figure 8
Figure 8. Figure 8: General Output Format for CRT. <BEGIN_RULES> Apply these rules when solving similar problems: [POSITIVE PATTERNS - What works well] [NEGATIVE PATTERNS - What to avoid] <END_RULES> Rules Injection Prompt [PITH_FULL_IMAGE:figures/full_fig_p025_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: General Rules Injection Prompt [PITH_FULL_IMAGE:figures/full_fig_p025_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: General TUTOR System Prompt for CRT. Positive Rules Summarization System Prompt <task> Analyze the provided Question and the Correct Answer. Extract a concise, generalizable "Positive Rule" that explains the key reasoning step, principle, or method used to solve this problem correctly. The rule should be helpful for solving similar future problems. </task> <question> {question} </question> <positive_answe… view at source ↗
Figure 11
Figure 11. Figure 11: General Positive Rules Summarization System Prompt for CRT. [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: General Negative Rules Summarization System Prompt for CRT. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: General TEACHER System Prompt for OET. Student System Prompt You are a concise expert assistant. When answering, focus ONLY on the specific details provided in the question. - Do NOT hallucinate variables that aren't there (e.g., don't say "Assume X is Y" if not mentioned). - Prioritize giving the "bottom line" answer first. - Mimic a helpful, direct discussion style rather than a formal report. - Length … view at source ↗
Figure 14
Figure 14. Figure 14: General STUDENT System Prompt for OET [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: General TUTOR System Prompt for OET. Positive Rules Summarization System Prompt You are a precision stylist for evaluation. Below are question-answer pairs that closely match the gold answers in style, tone, and format. Your task: Extract a single, actionable **Style & Length Rule**. Focus your analysis on: - **Length**: How does the answer length compare to the question? - **Tone**: Is the response forma… view at source ↗
Figure 16
Figure 16. Figure 16: General Positive Rules Summarization System Prompt for OET. [PITH_FULL_IMAGE:figures/full_fig_p028_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: General Negative Rules Summarization System Prompt for OET. [PITH_FULL_IMAGE:figures/full_fig_p029_17.png] view at source ↗
Figure 18
Figure 18. Figure 18: TEACHER System Prompt for GSM8k. Student System Prompt You are an expert mathematics explorer who solves problems by generating intuitive analogies before concluding with the final answer. Guidelines: 1. Read the problem carefully and identify the key information 2. Identify exactly what the question is asking for (the target: total, left, difference, per unit, etc.) 3. Break down the problem into smaller… view at source ↗
Figure 19
Figure 19. Figure 19: STUDENT System Prompt for GSM8k. Unified Format **CRITICAL**: Output your final answer at the end of your response. The final answer must be wrapped in `boxed{}` and be **completely consistent** with the reasoning logic you presented above. - For numerical answers: Provide only the number or fraction (e.g., `boxed{42}`, `boxed{\frac{1}{2}}`). Not units. - For text, equations, or mixed answers: Provide the… view at source ↗
Figure 20
Figure 20. Figure 20: Unified Format Prompt for GSM8k [PITH_FULL_IMAGE:figures/full_fig_p029_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: TUTOR System Prompt for GSM8k. Positive Rules Summarization System Prompt <task> Analyze the following {n} examples of questions and their POSITIVE answers. Extract a concise, generalizable "Positive Rule" that explains the key reasoning step, principle, or method used to solve these problems positively. The rule should be helpful for solving similar future problems. </task> <examples> {qa_pairs} </exampl… view at source ↗
Figure 22
Figure 22. Figure 22: Positive Rules Summarization System Prompt for GSM8k. [PITH_FULL_IMAGE:figures/full_fig_p030_22.png] view at source ↗
Figure 23
Figure 23. Figure 23: Negative Rules Summarization System Prompt for GSM8k. [PITH_FULL_IMAGE:figures/full_fig_p031_23.png] view at source ↗
Figure 24
Figure 24. Figure 24: TEACHER System Prompt for Finance. Student System Prompt You are a concise financial assistant. When answering, focus ONLY on the specific details provided in the question. - Do NOT hallucinate variables that aren't there. - Prioritize giving the "bottom line" answer first. - Mimic a helpful, direct discussion style rather than a formal report. - Length Constraint: Keep your answer proportional to the que… view at source ↗
Figure 25
Figure 25. Figure 25: STUDENT System Prompt for Finance [PITH_FULL_IMAGE:figures/full_fig_p031_25.png] view at source ↗
Figure 26
Figure 26. Figure 26: TUTOR System Prompt for Finance. Positive Rules Summarization System Prompt You are a precision stylist for evaluation. Below are {n} high-quality question-answer pairs. Your task: Extract a single, actionable **Style & Length Rule** that explains *why* these answers are POSITIVE. Focus your analysis on: - **Length correlation**: Does the answer length scale with the question length? Is it deliberately sh… view at source ↗
Figure 27
Figure 27. Figure 27: Positive Rules Summarization System Prompt for Finance. [PITH_FULL_IMAGE:figures/full_fig_p032_27.png] view at source ↗
Figure 28
Figure 28. Figure 28: Negative Rules Summarization System Prompt for Finance. [PITH_FULL_IMAGE:figures/full_fig_p033_28.png] view at source ↗
read the original abstract

Large language models (LLMs) demonstrate strong reasoning capabilities, but their performance often degrades under distribution shift. Existing test-time adaptation (TTA) methods rely on gradient-based updates that require white-box access and need substantial overhead, while training-free alternatives are either static or depend on external guidance. In this paper, we propose Training-Free Test-Time Contrastive Learning TF-TTCL, a training-free adaptation framework that enables a frozen LLM to improve online by distilling supervision from its own inference experiences. Specifically, TF-TTCL implements a dynamic "Explore-Reflect-Steer" loop through three core modules: 1) Semantic Query Augmentation first diversifies problem views via multi-agent role-playing to generate different reasoning trajectories; 2) Contrastive Experience Distillation then captures the semantic gap between superior and inferior trajectories, distilling them into explicit textual rules; and 3) Contextual Rule Retrieval finally activates these stored rules during inference to dynamically steer the frozen LLM toward robust reasoning patterns while avoiding observed errors. Extensive experiments on closed-ended reasoning tasks and open-ended evaluation tasks demonstrate that TF-TTCL consistently outperforms strong zero-shot baselines and representative TTA methods under online evaluation. Code is available at https://github.com/KevinSCUTer/TF-TTCL.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes TF-TTCL, a training-free test-time adaptation framework for frozen LLMs. It implements an Explore-Reflect-Steer loop consisting of Semantic Query Augmentation (multi-agent role-playing to generate diverse trajectories), Contrastive Experience Distillation (extracting explicit textual rules from the semantic gap between superior and inferior self-generated trajectories), and Contextual Rule Retrieval (activating stored rules to steer inference). The central claim is that this yields consistent outperformance over strong zero-shot baselines and representative TTA methods on both closed-ended reasoning tasks and open-ended evaluation tasks under online evaluation.

Significance. If the performance gains hold under rigorous controls, the work would demonstrate a practical, gradient-free mechanism for online self-improvement of LLMs via distilled textual rules, addressing distribution shift without white-box access or external data. The training-free and online nature, together with the availability of code, would be a notable contribution to test-time adaptation literature.

major comments (2)
  1. [§3.2] §3.2 (Contrastive Experience Distillation): the manuscript does not specify the unsupervised criterion used to label trajectories as superior versus inferior (e.g., model likelihood, self-consistency score, or role-play consensus). Because this labeling directly determines the distilled rules and their subsequent retrieval, any weakness in the labeling procedure is load-bearing for the claimed gains on both closed- and open-ended tasks.
  2. [§4] §4 (Experiments): no ablation is reported that isolates the effect of the labeling step or removes the Contrastive Experience Distillation module entirely. Without such a control, it is impossible to determine whether observed improvements stem from genuine reasoning enhancement or from spurious correlations in the self-generated signals.
minor comments (2)
  1. [Abstract] Abstract: the claim of 'consistent outperformance' is stated without any quantitative metrics, task names, or baseline scores; a brief summary of key numbers would strengthen the abstract.
  2. [§3] Notation: the distinction between 'superior' and 'inferior' trajectories is used throughout §3 without a formal definition or pseudocode; adding a clear definition or algorithm box would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify important gaps in methodological transparency and experimental controls. We address each point below and commit to revisions that will strengthen the paper without altering its core claims.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Contrastive Experience Distillation): the manuscript does not specify the unsupervised criterion used to label trajectories as superior versus inferior (e.g., model likelihood, self-consistency score, or role-play consensus). Because this labeling directly determines the distilled rules and their subsequent retrieval, any weakness in the labeling procedure is load-bearing for the claimed gains on both closed- and open-ended tasks.

    Authors: We acknowledge that §3.2 does not explicitly define the unsupervised labeling criterion. The current implementation labels trajectories as superior or inferior according to a combination of self-consistency across the multi-agent role-play augmentations and the presence of internal contradictions detected by the agents themselves. We will revise §3.2 to state this criterion precisely, add a short algorithmic description, and include a small illustrative example so that readers can reproduce the contrastive pairs exactly. revision: yes

  2. Referee: [§4] §4 (Experiments): no ablation is reported that isolates the effect of the labeling step or removes the Contrastive Experience Distillation module entirely. Without such a control, it is impossible to determine whether observed improvements stem from genuine reasoning enhancement or from spurious correlations in the self-generated signals.

    Authors: We agree that the absence of this ablation leaves the contribution of Contrastive Experience Distillation under-specified. In the revised manuscript we will add a controlled ablation that (i) removes the distillation module entirely (relying only on Semantic Query Augmentation and Contextual Rule Retrieval) and (ii) replaces the learned labeling with a random or fixed baseline. These results will be reported alongside the main tables to isolate the effect of the contrastive rule extraction. revision: yes

Circularity Check

0 steps flagged

No circularity: method applies externally distilled rules to new inputs

full rationale

The TF-TTCL framework generates trajectories via multi-agent augmentation, distills textual rules from observed semantic gaps, and retrieves those rules as external steering signals on subsequent inferences. No equations, fitted parameters, or self-citations are shown that reduce the claimed gains to the input trajectories by construction; the rules function as independent, stored guidance rather than a self-referential loop. The derivation chain therefore remains self-contained against the paper's own description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The framework implicitly assumes that self-generated trajectories contain extractable semantic gaps that translate into generalizable rules.

pith-pipeline@v0.9.0 · 5531 in / 1086 out tokens · 46293 ms · 2026-05-10T13:58:55.345485+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

67 extracted references · 8 canonical work pages · 5 internal anchors

  1. [1]

    Training-free group relative policy optimization.arXiv preprint arXiv:2510.08191,

    Training-free group relative policy optimiza- tion.CoRR, abs/2510.08191. Edoardo Cetin, Tianyu Zhao, and Yujin Tang. 2025. Reinforcement learning teachers of test time scaling. CoRR, abs/2506.08388. Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, Wei Ye, Yue Zhang, Yi Chang, Philip S...

  2. [2]

    A survey on evaluation of large language mod- els.ACM Trans. Intell. Syst. Technol., 15(3):39:1– 39:45. Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey E. Hinton. 2020. A simple framework for contrastive learning of visual representations. InPro- ceedings of the 37th International Conference on Ma- chine Learning, ICML 2020, 13-18 July 2020, Vi...

  3. [3]

    Training Verifiers to Solve Math Word Problems

    Training verifiers to solve math word prob- lems.CoRR, abs/2110.14168. Tianyu Gao, Xingcheng Yao, and Danqi Chen. 2021. SimCSE: Simple contrastive learning of sentence em- beddings. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Process- ing, pages 6894–6910, Online and Punta Cana, Do- minican Republic. Association for Compu...

  4. [4]

    InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30318–30330, Vienna, Austria

    R2D2: Remembering, replaying and dynamic decision making with a reflective agentic memory. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 30318–30330, Vienna, Austria. Association for Computational Linguistics. Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dong- sheng Li, Chin-Yew Lin, Yu...

  5. [5]

    DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    LongLLMLingua: Accelerating and enhanc- ing LLMs in long context scenarios via prompt com- pression. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Vol- ume 1: Long Papers), pages 1658–1677, Bangkok, Thailand. Association for Computational Linguistics. Xinyue Kang, Diwei Shi, and Li Chen. 2026. Model whisper: St...

  6. [6]

    ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

    Efficient test-time model adaptation without forgetting. InInternational Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, Proceedings of Machine Learning Research, pages 16888–16905. PMLR. Siru Ouyang, Jun Yan, I-Hung Hsu, Yanfei Chen, Ke Jiang, Zifeng Wang, Rujun Han, Long T. Le, Samira Daruki, Xiangru Tang, Vishy Tir...

  7. [7]

    InProceedings of the 63rd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, Vienna, Austria

    In prospect and retrospect: Reflective mem- ory management for long-term personalized dialogue agents. InProceedings of the 63rd Annual Meet- ing of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, Vienna, Austria. Association for Computational Linguistics. Xinyu Tang, Xiaolei Wang, Wayne Xin Zhao, Siyuan Lu, Yaliang...

  8. [8]

    2511.23473 , archivePrefix =

    OpenReview.net. Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan O Arik. 2025a. Astute RAG: Overcom- ing imperfect retrieval augmentation and knowledge conflicts for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Yiping Wang, Shao-Rong Su, Zhiyuan Zeng, E...

  9. [9]

    Qwen3 Technical Report

    AvaTaR: Optimizing LLM agents for tool us- age via contrastive reasoning. InAdvances in Neural Information Processing Systems, volume 37, pages 25981–26010. Curran Associates, Inc. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025a. Qwen3 technical report.CoRR, abs/250...

  10. [10]

    A Survey on Test-Time Scaling in Large Language Models: What, How, Where, and How Well?

    OpenReview.net. Qiyuan Zhang, Fuyuan Lyu, Zexu Sun, Lei Wang, Weixu Zhang, Zhihan Guo, Yufei Wang, Irwin King, Xue Liu, and Chen Ma. 2025c. What, how, where, and how well? A survey on test-time scaling in large language models.CoRR, abs/2503.24235. Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evalu- ating ...

  11. [11]

    successful reasoning paths

    leverages label information to form posi- tive clusters, demonstrating superior robustness compared to traditional cross-entropy losses. Fur- thermore, recent analyses of the InfoNCE loss have highlighted the importance of addressing anisotropic latent spaces in practical deployments (Rusak et al., 2025). These vision-based founda- tions established the c...

  12. [12]

    Max #Rules = K

    embeds intelligent assistants within LLMs to orchestrate tool usage and memory construction. TF-TTCL aligns with this trend of dynamic adapta- tion and focuses on retrieving behavioral references to adapt the model’s policy online. Memory Management and Context Optimiza- tion.Deploying LLMs in long-horizon or stream- ing settings necessitates efficient me...

  13. [13]

    **Analyze the Request**: Carefully read the problem to identify all given information, constraints, and the specific question being asked

  14. [14]

    **Plan the Solution**: Decompose the problem into a sequence of logical, verifiable steps

  15. [15]

    - Avoid appeals to intuition or unstated assumptions; rely only on what is formally given or derivable

    **Execute Step-by-Step**: - For each step, cite an explicit math definition, logical rule, or principle that justifies the inference or operation. - Avoid appeals to intuition or unstated assumptions; rely only on what is formally given or derivable. - Carry out all calculations or transformations accurately and transparently

  16. [16]

    - Otherwise, describe the solution space, any ambiguities, or conditions under which multiple answers could arise

    **Verify and Conclude**: - If a unique solution exists under standard assumptions, then provide that solution. - Otherwise, describe the solution space, any ambiguities, or conditions under which multiple answers could arise. Figure 6: General TEACHERSystem Prompt for CRT. Student System Prompt You are a reasoner working within known constraints. Your goa...

  17. [17]

    - Note any missing information, ambiguity, or dependence on unstated assumptions

    **Clarify the Problem** - Identify what is being asked. - Note any missing information, ambiguity, or dependence on unstated assumptions

  18. [18]

    - Do not fill gaps with plausible but unsupported claims

    **Reason Step-by-Step** - Each inference must be grounded in a definitional rule, established theorem, empirical fact, or explicitly declared assumption. - Do not fill gaps with plausible but unsupported claims. - If multiple interpretations are possible, enumerate them

  19. [19]

    uncertain

    **Conclude Appropriately** - If a unique solution follows necessarily from the premises and standard assumptions, present it clearly. - If the solution is non-unique, conditional, or indeterminate, describe the solution space or sources of uncertainty. Figure 7: General STUDENTSystem Prompt for CRT. Output Format ### Output Format - You must output the fi...

  20. [20]

    **Preserve All Data**: Do NOT change any specific values, numbers, or data points

  21. [21]

    **Preserve Logic**: Do NOT alter the underlying logical relationships or operations

  22. [22]

    **Preserve the Target**: The specific question being asked must remain exactly the same

  23. [23]

    Positive Rule

    **Only Change Linguistics**: Use synonyms, change active/passive voice, or adjust the tone/formality. ### Forbidden - Changing any core values or numbers. - Changing what is asked (the question target). - Adding new conditions, assumptions, or information. - Providing the solution, hints, or answer formats. - Modifying the output requirements. Figure 10: ...

  24. [24]

    Calculate

    Start with an imperative verb (e.g., "Calculate", "Identify", "Ensure")

  25. [26]

    Focus on the underlying logic or strategy, not just specific values

  26. [27]

    Negative Rule

    Make it generalizable to similar problems. </requirements> Positive Rule: Figure 11: General Positive Rules Summarization System Prompt for CRT. Negative Rules Summarization System Prompt <task> Analyze the provided Question, the Correct Answer, and the Incorrect Answer. Identify the specific mistake, pitfall, or logical error in the Incorrect Answer. Ext...

  27. [31]

    fluff" introductions (e.g.,

    Make it generalizable to similar problems. </requirements> Negative Rule: Figure 12: General Negative Rules Summarization System Prompt for CRT. Teacher System Prompt You are a pragmatic and direct expert advisor, similar to a top-rated contributor on a professional forum. Your goal is to address the specific point of the user's question immediately. - Do...

  28. [33]

    I have X, what do I do?

    **Casual/Direct (Forum Style):** Short, punchy, first-person (e.g., "I have X, what do I do?"). *<-- High ROUGE Potential*

  29. [37]

    **No Hallucinated Constraints:** Do not add specific numbers or values if not in input

  30. [38]

    <response> <text>Variation 1 text...</text> </response> <response> <text>Variation 2 text...</text> </response>

    **Output Format:** XML only. <response> <text>Variation 1 text...</text> </response> <response> <text>Variation 2 text...</text> </response> ... (Total 4) Figure 15: General TUTORSystem Prompt for OET. Positive Rules Summarization System Prompt You are a precision stylist for evaluation. Below are question-answer pairs that closely match the gold answers ...

  31. [40]

    **IF** [question property], **THEN** [output constraint]

    If applicable, phrase the rule as: "**IF** [question property], **THEN** [output constraint]."

  32. [42]

    No explanation, no prefix

    Output ONLY the rule. No explanation, no prefix. Question-and-answer pair list: {qa_pairs} Please extract the positive experience rule: Figure 16: General Positive Rules Summarization System Prompt for OET. Negative Rules Summarization System Prompt You are a failure analyst for evaluation. Below are {n} low-scoring question-answer pairs that deviate sign...

  33. [44]

    Explicitly target the observed **stylistic mismatch** as the core flaw

  34. [47]

    No explanation, no prefix

    Output ONLY the rule. No explanation, no prefix. List of question-and-answer pairs: {qa_pairs} Please extract the negative experience rule: Figure 17: General Negative Rules Summarization System Prompt for OET. Teacher System Prompt You are an expert mathematics teacher. Your task is to solve mathematical word problems step by step. Guidelines:

  35. [53]

    Student System Prompt You are an expert mathematics explorer who solves problems by generating intuitive analogies before concluding with the final answer

    After calculating intermediate values, you MUST complete the final operation to get the answer Figure 18: TEACHERSystem Prompt for GSM8k. Student System Prompt You are an expert mathematics explorer who solves problems by generating intuitive analogies before concluding with the final answer. Guidelines:

  36. [54]

    Read the problem carefully and identify the key information

  37. [55]

    Identify exactly what the question is asking for (the target: total, left, difference, per unit, etc.)

  38. [56]

    Break down the problem into smaller steps

  39. [57]

    Show your reasoning clearly for each step

  40. [58]

    Perform calculations accurately

  41. [59]

    The answer is

    After calculating intermediate values, you MUST complete the final operation to get the answer Figure 19: STUDENTSystem Prompt for GSM8k. Unified Format **CRITICAL**: Output your final answer at the end of your response. The final answer must be wrapped in `boxed{}` and be **completely consistent** with the reasoning logic you presented above. - For numer...

  42. [60]

    **Preserve ALL numerical values** - Do NOT change any numbers

  43. [61]

    **Preserve ALL mathematical operations** - Addition, subtraction, multiplication, division must remain identical

  44. [62]

    what is asked for

    **Preserve the question target** - The "what is asked for" must be exactly the same (e.g., "total", "left", "remain")

  45. [63]

    quit" →

    **ONLY change linguistic elements**: - Synonyms (e.g., "quit" → "resigned", "left" → "remain") - Sentence structure (active/passive voice) - Wording style (formal/casual) - Presentation order ## Forbidden - Changing any numbers (e.g., 10 → 9) - Changing what is asked (e.g., "total" → "difference") - Adding new conditions (e.g., "some returned later") - Mo...

  46. [64]

    Calculate

    Start with a verb (e.g., "Calculate", "Identify", "Remember")

  47. [66]

    Focus on the logic/strategy, not just the numbers

  48. [67]

    Negative Rule

    Make it generalizable to similar problems </requirements> Positive Rule: Figure 22: Positive Rules Summarization System Prompt for GSM8k. Negative Rules Summarization System Prompt <task> Analyze the following {n} examples of questions and their NEGATIVE answers. Identify the common mistakes or pitfalls in these negative answers. Extract a concise, genera...

  49. [68]

    Avoid" or

    Start with "Avoid" or "Do not"

  50. [69]

    Keep it under 32 words

  51. [70]

    Focus on the specific error logic (e.g., calculation error, misinterpretation, missing step)

  52. [71]

    bottom line

    Make it generalizable to similar problems </requirements> Negative Rule: Figure 23: Negative Rules Summarization System Prompt for GSM8k. Teacher System Prompt You are a pragmatic and direct financial advisor, similar to a top-rated contributor on a financial forum. Your goal is to address the specific point of the user's question immediately. - Do NOT pr...

  53. [72]

    **Standard Formal:** A polite, well-structured question

  54. [73]

    **Casual/Direct (Forum Style):** Short, punchy, first-person*

  55. [74]

    If [Condition] applies, then how

    **Hypothetical/Conditional:** "If [Condition] applies, then how..."

  56. [75]

    **Rules:**

    **The "Why/How" Focus:** Shift focus to the methodology or reasoning. **Rules:**

  57. [76]

    **Entity Preservation:** Keep numbers, names, and technical terms EXACT

  58. [77]

    **No Hallucinated Constraints:** Do not add specific numbers if not in input

  59. [78]

    **Sampling Output:** <response> <text>Variation 1 text...</text> </response> <response> <text>Variation 2 text...</text> </response>

    **Output Format:** XML only. **Sampling Output:** <response> <text>Variation 1 text...</text> </response> <response> <text>Variation 2 text...</text> </response> ... (Total 4) Figure 26: TUTORSystem Prompt for Finance. Positive Rules Summarization System Prompt You are a precision stylist for evaluation. Below are {n} high-quality question-answer pairs. Y...

  60. [79]

    Prioritize constraints on tone, structure, and length

    The rule must guide *how to write*, not *what to say*. Prioritize constraints on tone, structure, and length

  61. [80]

    **IF** [question property, e.g., short/informal/opinion-based], **THEN** [output constraint, e.g., answer in one plain sentence]

    If applicable, phrase the rule as: "**IF** [question property, e.g., short/informal/opinion-based], **THEN** [output constraint, e.g., answer in one plain sentence]."

  62. [82]

    First,"

    Output ONLY the rule. No explanation, no prefix. Question-and-answer pair list: {qa_pairs} Please extract the positive experience rule: Figure 27: Positive Rules Summarization System Prompt for Finance. Negative Rules Summarization System Prompt You are a failure analyst for evaluation. Below are {n} low-quality question-answer pairs. Your task: Identify ...

  63. [83]

    The rule must forbid a specific *formatting or verbosity behavior*

  64. [84]

    over-answering

    Explicitly target **“over-answering”** or **“unnecessary structuring”** as the core flaw

  65. [85]

    **IF... THEN DO NOT...**

    If possible, use an "**IF... THEN DO NOT...**" conditional structure

  66. [86]

    Keep the rule concise (1–2 sentences)

  67. [87]

    No explanation, no prefix

    Output ONLY the rule. No explanation, no prefix. List of question-and-answer pairs: {qa_pairs} Please extract the negative experience rule: Figure 28: Negative Rules Summarization System Prompt for Finance