PromptDecipher: Supporting AI Tutor Authoring Through Editable Simulated Interactions
Pith reviewed 2026-05-20 15:45 UTC · model grok-4.3
The pith
Teachers can author reliable AI tutors by correcting live bot responses instead of editing abstract system prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PromptDecipher restructures the authoring workflow around a direct correction-based interaction rather than writing abstract system prompts. Teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip.
What carries the argument
The correction-based interaction, in which teachers edit bot responses inside a live preview and an automated pipeline converts those edits into prompt rewrites plus validation checks.
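The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PromptDecipher's implementation: the names `Correction`, `Scenario`, `apply_correction`, `propose_rewrite`, and `run_bot` are hypothetical, and the two callables stand in for model calls that the abstract describes only at a high level.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Correction:
    """One teacher edit made in the chat preview (hypothetical structure)."""
    student_message: str
    bot_response: str      # what the bot originally said
    teacher_response: str  # what the teacher changed it to


@dataclass
class Scenario:
    """A pre-defined test case: an input plus a pass/fail check on the reply."""
    student_message: str
    passes: Callable[[str], bool]


def apply_correction(
    prompt: str,
    correction: Correction,
    scenarios: List[Scenario],
    propose_rewrite: Callable[[str, Correction], str],  # assumed model-backed
    run_bot: Callable[[str, str], str],                 # assumed model-backed
) -> Tuple[str, List[bool]]:
    """Turn one correction into a candidate prompt and validate it."""
    candidate = propose_rewrite(prompt, correction)
    results = [s.passes(run_bot(candidate, s.student_message)) for s in scenarios]
    return candidate, results


# Stand-ins for the model calls, so the sketch runs without any API.
def fake_rewrite(prompt: str, c: Correction) -> str:
    return prompt + "\nNever reveal the final answer; give a hint instead."


def fake_bot(prompt: str, message: str) -> str:
    return "Here is a hint." if "hint" in prompt else "The answer is 4."


correction = Correction("What is 2+2?", "The answer is 4.", "Try counting up from 2.")
scenarios = [Scenario("What is 3+3?", lambda reply: "answer is" not in reply)]
new_prompt, results = apply_correction(
    "You are a math tutor.", correction, scenarios, fake_rewrite, fake_bot
)
```

Passing the rewrite and bot functions in as arguments keeps the control flow testable on its own; whether the real system separates these steps the same way is not stated in the source.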
If this is right
- Teachers will treat testing as a routine part of authoring rather than an optional extra step.
- Each correction will automatically trigger prompt updates and cross-scenario validation.
- Educators will receive scaffolding for the designer and QA roles they currently skip.
- The system will support deployment in large courses enrolling hundreds of higher-education instructors.
Where Pith is reading between the lines
- The same correction-driven loop could apply to non-experts authoring AI agents in domains beyond tutoring.
- Reducing reliance on abstract prompt writing may lower the barrier for subject-matter experts to create interactive tools.
- Long-term use could reveal whether routine QA during authoring correlates with better student learning outcomes.
- Extending the validation step to include student feedback data might further strengthen the bots.
Load-bearing premise
The automated pipeline can reliably analyze one teacher correction and produce a prompt rewrite that improves performance across test scenarios without introducing new errors.
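One way to make this premise checkable is to compare per-scenario results before and after a rewrite and to reject any rewrite that breaks a scenario that previously passed. The sketch below assumes such an acceptance rule; the function names, the scenario labels, and the rule itself are illustrative and are not taken from the paper.

```python
from typing import Dict, List


def diff_results(before: Dict[str, bool], after: Dict[str, bool]) -> Dict[str, List[str]]:
    """Compare per-scenario pass/fail before and after a prompt rewrite.

    Assumes both dicts cover the same scenario names.
    """
    fixed = sorted(k for k in before if not before[k] and after[k])
    regressed = sorted(k for k in before if before[k] and not after[k])
    return {"fixed": fixed, "regressed": regressed}


def accept_rewrite(before: Dict[str, bool], after: Dict[str, bool]) -> bool:
    """Accept only if something was fixed and nothing that passed now fails."""
    d = diff_results(before, after)
    return bool(d["fixed"]) and not d["regressed"]


# Made-up results for three hypothetical scenarios.
before = {"asks_for_answer": False, "off_topic": True, "frustrated_student": True}
after = {"asks_for_answer": True, "off_topic": True, "frustrated_student": False}

outcome = diff_results(before, after)
accepted = accept_rewrite(before, after)
```

In this made-up case the rewrite fixes the targeted behavior but breaks another scenario, so the gate rejects it. This is the kind of over-generalization the premise has to rule out.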
What would settle it
A controlled comparison in which bots authored with PromptDecipher fail more test scenarios, or pick up more new errors after corrections, than bots authored by direct prompt editing.
Original abstract
Chatbots have long been explored as tools to support learning, and recent advances in large language models have significantly expanded the availability of platforms for educators to author AI tutoring chatbots. Yet effective authorship demands more than writing a system prompt; it requires educators to act as learning designers, AI interaction designers, and QA engineers. In practice, however, teachers rarely fulfill these roles. Our formative study found that virtually none systematically tested their bots before deploying them to students. To address this gap, we present PromptDecipher, a system that restructures the authoring workflow around a direct correction-based interaction rather than writing abstract system prompts: teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip. PromptDecipher will be deployed in an AI for Educators course enrolling hundreds of higher-education instructors. A live prototype (https://teacher-prompting.vercel.app/), an anonymized codebase (https://anonymous.4open.science/r/teacher-prompting-2EDF/), and an anonymized demo (https://tinyurl.com/las-prompt-decipher-demo) are available via links in the footnote.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents PromptDecipher, a system that restructures AI tutor authoring around direct correction of simulated chatbot responses rather than abstract system prompt writing. Teachers interact with a live preview, edit undesirable outputs, and an automated pipeline analyzes the correction to propose a targeted system prompt rewrite that is validated against pre-defined test scenarios. This is motivated by a formative study finding that teachers rarely test bots before deployment. The approach aims to enforce QA as a first-class activity and scaffold educators in learning design, interaction design, and QA roles. A prototype, anonymized codebase, and demo are provided, with planned deployment in an AI for Educators course.
Significance. If the pipeline reliably converts single corrections into prompt updates that improve behavior on test scenarios without regressions, the work could meaningfully advance HCI in educational AI by lowering barriers to effective chatbot authoring. The correction-based workflow directly addresses an observed gap in teacher practices and integrates QA into the primary interaction, which is a substantive design contribution. The open prototype and codebase enable reproducibility and extension.
major comments (2)
- [Automated pipeline description] Section describing the automated pipeline (the 'analyzes the correction, proposes a targeted system prompt rewrite, and validates the change' component): no quantitative metrics are reported on rewrite success rate, frequency of introduced errors or regressions on test scenarios, coverage of the pre-defined scenarios, or failure modes such as over-generalization. This is load-bearing for the central claim that the system scaffolds QA roles, as the workflow's effectiveness rests on the pipeline's reliability without introducing new errors.
- [Formative study] Formative study section: the claim that 'virtually none systematically tested their bots' is presented without details on study design, participant count, data collection method, or quantitative breakdown. This weakens the motivation for the correction-based workflow, as the design rationale depends directly on this finding.
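The quantities requested in the first major comment could be computed from a simple log of correction attempts. The sketch below assumes a hypothetical log format (`Attempt`) and uses made-up numbers purely to show the arithmetic; none of these figures come from the paper, which reports no such metrics.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Attempt:
    """One logged correction (hypothetical log format)."""
    target_fixed: bool        # did the corrected behavior pass after the rewrite?
    scenarios_run: int        # scenarios executed during validation
    scenarios_regressed: int  # scenarios that passed before and failed after


def summarize(log: List[Attempt], total_scenarios: int) -> Dict[str, float]:
    """Aggregate success, regression, and coverage rates over all attempts."""
    n = len(log)
    if n == 0 or total_scenarios == 0:
        raise ValueError("need at least one attempt and one scenario")
    return {
        # share of corrections whose targeted behavior was fixed
        "rewrite_success_rate": sum(a.target_fixed for a in log) / n,
        # share of corrections that broke at least one scenario
        "regression_rate": sum(a.scenarios_regressed > 0 for a in log) / n,
        # average fraction of the scenario suite exercised per correction
        "mean_coverage": sum(a.scenarios_run for a in log) / (n * total_scenarios),
    }


# Illustrative values only.
log = [Attempt(True, 10, 0), Attempt(True, 10, 2), Attempt(False, 5, 0), Attempt(True, 10, 0)]
stats = summarize(log, total_scenarios=10)
```

Counting a correction as a regression when it breaks any scenario is a deliberately strict choice; a per-scenario rate would be an equally reasonable alternative.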
minor comments (2)
- [Abstract and system overview] The abstract references the live prototype, codebase, and demo links; ensure these are also clearly footnoted or linked in the main body and system description sections for reader accessibility.
- [Figures] Workflow diagrams or interface screenshots would benefit from expanded captions that explicitly map user actions (correction) to pipeline steps (analysis, rewrite, validation).
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential significance of the correction-based authoring workflow. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: Section describing the automated pipeline (the 'analyzes the correction, proposes a targeted system prompt rewrite, and validates the change' component): no quantitative metrics are reported on rewrite success rate, frequency of introduced errors or regressions on test scenarios, coverage of the pre-defined scenarios, or failure modes such as over-generalization. This is load-bearing for the central claim that the system scaffolds QA roles, as the workflow's effectiveness rests on the pipeline's reliability without introducing new errors.
  Authors: We agree that quantitative metrics on pipeline performance would provide stronger evidence for the reliability of the QA scaffolding. The current manuscript prioritizes the description of the pipeline architecture and its integration into the live correction workflow. In the revision we will expand the pipeline section to detail the mechanisms by which corrections are analyzed and rewrites are proposed, and to explain how the validation step against pre-defined scenarios is intended to surface regressions. We will also add an explicit limitations subsection that acknowledges the absence of reported success rates, regression frequencies, coverage statistics, and failure-mode analysis at this prototype stage, while outlining plans to gather such data during the upcoming deployment in the AI for Educators course.
  Revision: partial
- Referee: Formative study section: the claim that 'virtually none systematically tested their bots' is presented without details on study design, participant count, data collection method, or quantitative breakdown. This weakens the motivation for the correction-based workflow, as the design rationale depends directly on this finding.
  Authors: We accept that the formative study section would benefit from greater transparency. In the revised manuscript we will expand this section to include a full account of the study design, participant recruitment and demographics, data collection procedures, and the quantitative results that support the observation that virtually none of the educators systematically tested their bots prior to deployment. These additions will make the motivation for the correction-based workflow more robust and directly traceable to the empirical findings.
  Revision: yes
Circularity Check
No significant circularity; systems-design contribution is self-contained
This is a systems-design paper presenting an implemented workflow and formative study observations for AI tutor authoring. There are no mathematical derivations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations. The central claims rest on the described correction-based interaction pipeline and empirical notes about teacher behavior rather than any self-referential logic or ansatz smuggled via prior work. The work is therefore self-contained against external benchmarks as a practical design intervention.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.