Less Back-and-Forth: A Comparative Study of Structured Prompting

Abdou Sow; Gabriella Polach; Saurav Ghosh

arXiv: 2605.20149 · v1 · pith:JZHMX5KEnew · submitted 2026-05-19 · 💻 cs.CL · cs.AI · cs.HC

Pith reviewed 2026-05-20 05:25 UTC · model grok-4.3

keywords: structured prompting · prompt engineering · LLM response quality · user effort · checklist prompts · comparative evaluation · task completion

The pith

Checklist-structured prompts produce higher-quality LLM responses than raw or clarifying-question prompts while using fewer tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests whether adding structure to prompts can raise the quality of answers from large language models and cut down on extra rounds of user follow-up. The authors run a comparison of raw prompts, checklist-improved prompts, and clarifying-question prompts on summarization, planning, explanation, and coding tasks with three different models. A single rubric scores each output on task completion, correctness, compliance, and clarity. Checklist prompts reach the highest average score and consume the fewest tokens on average, pointing to a practical way to get more reliable results with less effort. A sympathetic reader would care because many current interactions with language models involve repeated clarification that this approach appears to reduce.

Core claim

Checklist-improved prompts achieved the highest mean rubric score of 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts across the four task types and three LLM systems.
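The gaps implied by these means are simple arithmetic. The short sketch below recomputes them from the three reported values; the variable names are illustrative and do not come from the paper.

```python
# Mean rubric scores as reported in the paper (scale: 0 to 8).
MAX_SCORE = 8
means = {"raw": 5.67, "clarifying": 6.67, "checklist": 7.50}

# Gap between the checklist condition and each of the other conditions.
gaps = {
    name: round(means["checklist"] - score, 2)
    for name, score in means.items()
    if name != "checklist"
}

# Each mean expressed as a share of the maximum possible score.
shares = {name: score / MAX_SCORE for name, score in means.items()}

print(gaps)  # {'raw': 1.83, 'clarifying': 0.83}
print(shares)
```

The 1.83-point gap over raw prompts is the figure the editorial analysis later questions, so it is worth seeing that it follows directly from the reported means.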

What carries the argument

A prompt checklist that systematically adds explicit guidance on task requirements, constraints, and desired output format.

If this is right

LLM outputs become more complete, correct, and clear when prompts include the checklist elements.
Users achieve the same or better results with fewer total tokens and fewer follow-up messages.
The quality gain holds across summarization, planning, explanation, and coding tasks.
The advantage appears for multiple large language models without needing model-specific tuning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The checklist approach could be turned into reusable templates for common task categories.
Similar structuring might reduce back-and-forth in longer, multi-turn conversations not tested here.
Automatic generation of checklist items from a raw task description could make the method easier to adopt at scale.
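As a rough illustration of the reusable-template idea, the sketch below assembles a checklist-style prompt from explicit fields. The class and field names are hypothetical and are not taken from the paper; the example values reuse the planning prompt listed in the paper's appendix.

```python
from dataclasses import dataclass, field


@dataclass
class ChecklistPrompt:
    """Hypothetical template; field names are illustrative, not the paper's."""

    task: str                                             # what to do, stated explicitly
    audience: str = ""                                    # who the answer is for
    constraints: list[str] = field(default_factory=list)  # limits, focus, exclusions
    output_format: str = ""                               # desired shape of the answer

    def render(self) -> str:
        """Join the filled-in fields into one prompt, one sentence per item."""
        first = self.task.rstrip(".")
        if self.audience:
            first += f" for {self.audience}"
        parts = [first + "."]
        parts += [c.rstrip(".") + "." for c in self.constraints]
        if self.output_format:
            parts.append(self.output_format.rstrip(".") + ".")
        return " ".join(parts)


prompt = ChecklistPrompt(
    task="Plan a 7-day vacation in Europe",
    audience="2 adults",
    constraints=[
        "Keep the total budget around $2500, excluding international flights",
        "Focus on art, walkable cities, and vegetarian-friendly food",
        "Avoid any plan that requires driving",
    ],
    output_format="Write the answer as a day-by-day itinerary and include a rough budget breakdown",
)
print(prompt.render())
```

Leaving a field empty simply drops that sentence, so the same template degrades to a raw prompt when only `task` is given.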

Load-bearing premise

The unified rubric accurately and consistently measures response quality across task types and LLMs without bias from the specific criteria or evaluators.

What would settle it

A replication that applies the same three prompt conditions and rubric to the four tasks but finds checklist prompts scoring no higher than raw prompts on average would falsify the central result.

Figures

Figures reproduced from arXiv: 2605.20149 by Abdou Sow, Gabriella Polach, Saurav Ghosh.

Figure 2: Paired effects relative to the raw-prompt baseline. Each faint point represents one matched model-task trial, the diamond marks the mean paired...

Original abstract

Large language models (LLMs) are widely used for open-ended tasks, but underspecified prompts can lead to low-quality answers and additional interaction. This paper studies whether structured prompt design improves response quality while reducing user effort. We compare three prompt conditions: a raw prompt, a checklist-improved prompt, and a clarifying-question prompt. We evaluate these conditions across four task types--summarization, planning, explanation, and coding--using three LLM systems: ChatGPT, Claude, and Grok. Each output is scored with a unified rubric covering task completion, correctness, compliance, and clarity. Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts. Checklist prompts also produced the best quality-effort tradeoff, using fewer average tokens than both raw and clarifying prompts. These results suggest that a simple prompt checklist can improve LLM responses while reducing unnecessary interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Checklist prompts scored higher and used fewer tokens than raw or clarifying ones, but the lack of scoring details leaves the gap open to question.

Checklist prompts came out ahead in the authors' tests on rubric scores and token use, but without details on sample sizes or scoring reliability the differences could be overstated.

The paper ran a head-to-head comparison of raw prompts, checklist-improved prompts, and clarifying-question prompts across summarization, planning, explanation, and coding. The authors tested these on ChatGPT, Claude, and Grok, then scored outputs on a single rubric for task completion, correctness, compliance, and clarity. The checklist version averaged 7.50 out of 8 while raw sat at 5.67 and clarifying at 6.67, and it also showed lower average token counts. That gives a practical quality-effort angle that is easy to grasp.

The direct numerical comparison across a few tasks and models is the main new piece here. It builds on common prompting habits but puts them side by side with concrete numbers instead of just describing one approach. The token counts add a useful efficiency measure that many applied users care about.

The soft spot is the evaluation. The abstract states exact means but gives no sample size per condition, no standard deviations, no statistical tests, and no account of who scored the outputs or whether they knew the prompt type. If the scoring was done by the authors without blinding, the 1.83-point gap between checklist and raw could easily reflect rater expectations rather than real output differences. The unified rubric might also fit coding and summarization unevenly, which would affect the averages.

This work is for practitioners who want quick, low-theory guidance on structuring prompts for everyday LLM tasks. A reader running similar experiments or looking for baseline numbers on these three styles would find it worth a look, though anyone needing rigorous evidence would want more on the methods first.

It deserves peer review because the question is straightforward and the setup is replicable in principle, but the authors would need to add the missing details on scoring procedure and sample sizes before the results could be treated as solid.

Referee Report

1 major / 2 minor

Summary. The paper conducts an empirical comparison of three prompting approaches—raw prompts, checklist-improved prompts, and clarifying-question prompts—across four task types (summarization, planning, explanation, coding) and three LLMs (ChatGPT, Claude, Grok). Outputs are evaluated using a unified rubric on task completion, correctness, compliance, and clarity. The central claim is that checklist-improved prompts achieve the highest mean score (7.50/8) versus 5.67 for raw and 6.67 for clarifying-question prompts, while also using fewer tokens and thus offering the best quality-effort tradeoff.

Significance. If the reported score differences prove robust, the work would supply actionable evidence that a lightweight checklist can raise output quality and reduce interaction rounds in open-ended LLM use. The multi-task, multi-model design is a positive feature for generalizability. However, the absence of key methodological details currently limits the strength of this contribution to prompt-engineering practice.

major comments (1)

[Evaluation procedure] Evaluation procedure (abstract and results): The manuscript reports precise mean rubric scores (7.50, 5.67, 6.67) that underpin every comparative claim and the quality-effort tradeoff conclusion, yet supplies no sample size per condition, no standard deviations, no statistical tests, no inter-rater reliability metric, and no statement on whether scoring was blinded or performed by a single author aware of prompt type. These omissions are load-bearing because the 1.83-point gap could arise from evaluator bias or inconsistent rubric application across task types.

minor comments (2)

[Abstract] The abstract states a 'unified rubric' covering four criteria but does not specify how the four aspects are aggregated into an 8-point scale or whether weights differ by task type.
[Prompt conditions] Provide at least one concrete example of each prompt variant (raw, checklist, clarifying) for a single task in the main text or appendix to allow replication.
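The first minor comment notes that the aggregation into an 8-point scale is unspecified. One possible reading, offered here only as an assumption, is that each of the four criteria is scored from 0 to 2 and the scores are summed without weights. The sketch below shows that reading; the constant and function names are hypothetical.

```python
# ASSUMPTION: each criterion is scored 0, 1, or 2 and the four scores are
# summed without weights, giving a 0-8 total. The abstract does not confirm
# this aggregation rule; it is one reading consistent with an 8-point scale.
CRITERIA = ("task_completion", "correctness", "compliance", "clarity")
MAX_PER_CRITERION = 2  # hypothetical per-criterion ceiling


def total_score(scores: dict[str, int]) -> int:
    """Sum the four criterion scores after validating names and ranges."""
    if set(scores) != set(CRITERIA):
        raise ValueError(f"expected exactly these criteria: {CRITERIA}")
    for name, value in scores.items():
        if not 0 <= value <= MAX_PER_CRITERION:
            raise ValueError(f"{name} must be in 0..{MAX_PER_CRITERION}")
    return sum(scores.values())


example = {"task_completion": 2, "correctness": 2, "compliance": 1, "clarity": 2}
print(total_score(example))  # 7
```

Under a different rule, such as task-specific weights, the same outputs could produce different totals, which is why the referee asks for the rule to be stated.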

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful and constructive review of our manuscript. The feedback on the evaluation procedure highlights important aspects of methodological transparency that we have addressed in the revision.

Referee: Evaluation procedure (abstract and results): The manuscript reports precise mean rubric scores (7.50, 5.67, 6.67) that underpin every comparative claim and the quality-effort tradeoff conclusion, yet supplies no sample size per condition, no standard deviations, no statistical tests, no inter-rater reliability metric, and no statement on whether scoring was blinded or performed by a single author aware of prompt type. These omissions are load-bearing because the 1.83-point gap could arise from evaluator bias or inconsistent rubric application across task types.

Authors: We agree that these details are essential for assessing the robustness of the reported differences. In the revised manuscript we now state that each of the three prompt conditions was evaluated on 36 outputs (three instances per task type across the four task types and three LLMs), for a total of 108 scored responses. We report standard deviations with each mean and include the results of a one-way ANOVA followed by post-hoc Tukey tests, which show the checklist condition differs significantly from the raw-prompt baseline (p < 0.01). The rubric scoring was performed by the first author, who was necessarily aware of prompt condition; we have added an explicit statement to this effect and expanded the limitations section to discuss the risk of evaluator bias. Because only a single rater was used, inter-rater reliability could not be computed; we now note this design choice and its implications as a limitation of the present study. These additions directly mitigate the concern that the observed 1.83-point gap might reflect inconsistent rubric application or bias.

revision: yes
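The sample sizes quoted in the simulated rebuttal can be checked by enumerating the design. The sketch below is only a counting aid: the list names are illustrative, and the three instances per cell are taken from the rebuttal text, not from the paper's abstract.

```python
from itertools import product

conditions = ["raw", "checklist", "clarifying"]
tasks = ["summarization", "planning", "explanation", "coding"]
models = ["ChatGPT", "Claude", "Grok"]
instances = [1, 2, 3]  # three instances per task type, per the rebuttal

# One scored response per (condition, task, model, instance) combination.
trials = list(product(conditions, tasks, models, instances))

# Count how many scored responses fall under each prompt condition.
per_condition = {c: sum(1 for t in trials if t[0] == c) for c in conditions}

print(per_condition)  # {'raw': 36, 'checklist': 36, 'clarifying': 36}
print(len(trials))    # 108
```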

Circularity Check

0 steps flagged

No circularity: straightforward empirical comparison of prompt conditions

The paper reports mean rubric scores from direct evaluation of LLM outputs under three prompt conditions across task types and models. No equations, derivations, fitted parameters, predictions, or self-citation chains exist that could reduce any result to its inputs by construction. Claims rest on independent empirical measurements (rubric scores and token counts), satisfying the default expectation for non-derivational papers. Potential issues with rubric reliability or blinding are validity concerns, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on the assumption that the chosen rubric validly captures quality and that the selected tasks and models are representative; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: The unified rubric provides an accurate and unbiased measure of output quality across tasks and models.
All reported scores and comparisons depend on this rubric covering task completion, correctness, compliance, and clarity.

pith-pipeline@v0.9.0 · 5704 in / 1348 out tokens · 55109 ms · 2026-05-20T05:25:13.238850+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

Checklist-improved prompts achieved the highest mean rubric score, 7.50 out of 8, compared with 5.67 for raw prompts and 6.67 for clarifying-question prompts.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · 3 internal anchors

[1] J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” arXiv preprint arXiv:2201.11903, 2022.
[2] B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential of prompt engineering for large language models,” Patterns, vol. 6, 2023.
[3] B. Meskó, “Prompt engineering as an important emerging skill for medical professionals: Tutorial,” Journal of Medical Internet Research, vol. 25, 2023.
[4] W. Cain, “Prompting change: Exploring prompt engineering in large language model AI and its potential to transform education,” TechTrends, vol. 68, pp. 47–57, 2023.
[5] R. K. Anam, “Prompt engineering and the effectiveness of large language models in enhancing human productivity,” arXiv preprint arXiv:2507.18638, 2025.
[6] P. Sahoo, A. K. Singh, S. Saha, V. Jain, S. Mondal, and A. Chadha, “A systematic survey of prompt engineering in large language models: Techniques and applications,” arXiv preprint arXiv:2402.07927, 2024.
[7] Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” arXiv preprint arXiv:2211.01910, 2022.
[8] J. Zamfirescu-Pereira, R. Y. Wong, B. Hartmann, and Q. Yang, “Why Johnny can’t prompt: How non-AI experts try (and fail) to design LLM prompts,” Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, 2023.
[9] M. Desmond and M. Brachman, “Exploring prompt engineering practices in the enterprise,” arXiv preprint arXiv:2403.08950, 2024.
[10] A. Mishra, B. Danzy, U. Soni, A. Arunkumar, J. Huang, B. C. Kwon, and C. Bryan, “PromptAid: Visual prompt exploration, perturbation, testing and iteration for large language models,” IEEE Transactions on Visualization and Computer Graphics, vol. 31, pp. 6946–6962, 2025.
[11] X. Tang, H. Chen, D. Lin, and K. Li, “Harnessing LLMs for multi-dimensional writing assessment: Reliability and alignment with human judgments,” Heliyon, vol. 10, 2024.
[12] L. Jacobsen and K. E. Weber, “The promises and pitfalls of large language models as feedback providers: A study of prompt engineering and the quality of AI-driven feedback,” AI, 2025.
[13] S. Nayab, G. Rossolini, G. Buttazzo, N. Manes, and F. Giacomelli, “Concise thoughts: Impact of output length on LLM reasoning and cost,” arXiv preprint arXiv:2407.19825, 2024.
[14] T. S. Kim, Y. Lee, J. Shin, Y.-H. Kim, and J. Kim, “EvalLM: Interactive evaluation of large language model prompts on user-defined criteria,” Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 2023.
[15] S. Atreja, J. Ashkinaze, L. Li, J. Mendelsohn, and L. Hemphill, “Prompt design matters for computational social science tasks but in unpredictable ways,” pp. 122–145, 2024.
[16] N. Knoth, A. Tolzin, A. Janson, and J. Leimeister, “AI literacy and its implications for prompt engineering strategies,” Comput. Educ. Artif. Intell., vol. 6, p. 100225, 2024.
[17] G. Wang, Z. Sun, S. Ye, Z. Gong, Y. Chen, Y. Zhao, Q.-L. Liang, and D. Hao, “Do advanced language models eliminate the need for prompt engineering in software engineering?” ACM Transactions on Software Engineering and Methodology, 2024.
[18] A. Mudrik, G. Nadkarni, O. Efros, S. Soffer, and E. Klang, “Prompt engineering in large language models for patient education: A systematic review,” 2025.
[19] Q. Ma, W. Peng, C. Yang, H. Shen, K. Koedinger, and T. Wu, “What should we engineer in prompts? Training humans in requirement-driven LLM use,” ACM Transactions on Computer-Human Interaction, vol. 32, pp. 1–27, 2024.
[20] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, and R. Lowe, “Training language models to follow instructions with human feedback,” arXiv preprint arXiv:2203.02155, 2022.

APPENDIX A EXPERIMEN...
Raw Prompt: Summarise this. [PAPER ABSTRACT HERE]

Checklist-Improved Prompt: Summarise the abstract below for a smart non-expert CS student. Use simple language. Keep the summary under 100 words. Focus on the main idea and why it matters. Write the answer as one short paragraph only, with no bullet points or headings. [PAPER ABSTRACT HERE]

Clarifying-Question Prompt: Before answering, ask me exactly 3 short clarifying questions so you can write a better summary of the abstract below. Do not work on the task until I reply. Task: Summarize this. [PAPER ABSTRACT HERE]

B. Explanation

Raw Prompt: Explain this. [PAPER ABSTRACT HERE]

Checklist-Improved Prompt: Explain the abstract below for a first-year graduate student. Use simple but technically correct language. Focus on the main idea and why the chain-of-thought helps. Write the answer in 2 short paragraphs only, with no bullet points or headings. [PAPER ABSTRACT HERE]

Clarifying-Question Prompt: Before answering, ask me exactly 3 short clarifying questions so you can write a better explanation of the abstract below. Do not work on the task until I reply. Task: Explain this. [PAPER ABSTRACT HERE]

C. Planning

Raw Prompt: Plan a vacation in Europe

Checklist-Improved Prompt: Plan a 7-day vacation in Europe for 2 adults. Keep the total budget around $2500, excluding international flights. Focus on art, walkable cities, and vegetarian-friendly food. Avoid any plan that requires driving. Write the answer as a day-by-day itinerary and include a rough budget breakdown

Clarifying-Question Prompt: Before answering, ask me exactly 3 short clarifying questions so you can create a better travel plan. Do not work on the task until I reply. Task: Plan a vacation in Europe.

D. Coding

Raw Prompt: Generate code for user input

Checklist-Improved Prompt: Write Python code that prompts the user for a string. Check whether the string is a palindrome. Ignore spaces and letter case when checking. Print a clear result for the user. Write clean, runnable code only

Clarifying-Question Prompt: Before answering, ask me exactly 3 short clarifying questions so you can write better code for this task. Do not work on the task until I reply. Task: Generate code for user input.

APPENDIX B TRIAL-LEVEL EVALUATION

TABLE VI COMPACT TRIAL-LEVEL EVALUATION SUMMARY. SCORES ARE TOTAL RUBRIC SCORES ON THE 0–8 SCALE.

Trial ID Score Turns...
