PromptDecipher: Supporting AI Tutor Authoring Through Editable Simulated Interactions

John Stamper; Miina Koyama; Ruiwei Xiao

arxiv: 2605.16605 · v1 · pith:R5ZB7CGH · submitted 2026-05-15 · 💻 cs.HC · cs.AI

Pith reviewed 2026-05-20 15:45 UTC · model grok-4.3

keywords AI tutor authoring · prompt engineering · interactive editing · quality assurance · educational chatbots · human-AI interaction · workflow redesign

The pith

Teachers can author reliable AI tutors by correcting live bot responses instead of editing abstract system prompts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that writing system prompts for AI tutoring chatbots forces educators into roles as learning designers, interaction designers, and QA engineers, yet most teachers skip systematic testing entirely. PromptDecipher replaces direct prompt writing with a workflow where teachers chat with a live preview and simply edit any undesirable bot replies. An automated pipeline then turns each correction into a targeted prompt rewrite and checks the update against pre-defined test scenarios. This structure makes quality assurance a natural first step rather than an afterthought. If the approach holds, more teachers could produce bots that behave consistently without needing advanced technical expertise.

Core claim

PromptDecipher restructures the authoring workflow around a direct correction-based interaction rather than writing abstract system prompts. Teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip.

What carries the argument

The correction-based interaction, in which teachers edit bot responses inside a live preview and an automated pipeline converts those edits into prompt rewrites plus validation checks.
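
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Correction`, `Scenario`, `validate`, and `apply_correction` are hypothetical, and the bot and rewriter are toy stand-ins for what would be LLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Correction:
    """One teacher edit made in the live chat preview (hypothetical structure)."""
    student_message: str  # the message the teacher sent while testing
    bot_reply: str        # the reply the teacher found undesirable
    teacher_edit: str     # the reply the teacher wrote in its place


@dataclass
class Scenario:
    """A pre-defined test scenario with a pass/fail check (hypothetical structure)."""
    name: str
    student_message: str
    check: Callable[[str], bool]  # returns True when the reply is acceptable


# Assumed interfaces; in a real system both would wrap LLM calls.
Bot = Callable[[str, str], str]              # (system_prompt, student_message) -> reply
Rewriter = Callable[[str, Correction], str]  # (system_prompt, correction) -> new prompt


def validate(prompt: str, bot: Bot, scenarios: List[Scenario]) -> Dict[str, bool]:
    """Run every scenario against the bot and record whether its check passes."""
    return {s.name: s.check(bot(prompt, s.student_message)) for s in scenarios}


def apply_correction(prompt: str, correction: Correction, rewriter: Rewriter,
                     bot: Bot, scenarios: List[Scenario]) -> Tuple[str, Dict[str, bool]]:
    """Turn one correction into a candidate prompt, then validate the candidate."""
    candidate = rewriter(prompt, correction)
    return candidate, validate(candidate, bot, scenarios)


# --- Toy stand-ins so the sketch runs without any model ---
RULE = "Never state the final answer; ask a guiding question instead."


def toy_bot(system_prompt: str, student_message: str) -> str:
    if RULE in system_prompt:
        return "What have you tried so far?"
    return "The answer is 42."


def toy_rewriter(system_prompt: str, correction: Correction) -> str:
    # A real rewriter would analyze the correction; this one appends a fixed rule.
    return system_prompt + "\n" + RULE


scenarios = [
    Scenario("no_direct_answer", "What is 6 times 7?", lambda r: "answer is" not in r.lower()),
    Scenario("asks_a_question", "Can you just tell me?", lambda r: r.strip().endswith("?")),
]
correction = Correction("What is 6 times 7?", "The answer is 42.", "What have you tried so far?")

old_prompt = "You are a math tutor."
before = validate(old_prompt, toy_bot, scenarios)
new_prompt, after = apply_correction(old_prompt, correction, toy_rewriter, toy_bot, scenarios)
print(before)  # {'no_direct_answer': False, 'asks_a_question': False}
print(after)   # {'no_direct_answer': True, 'asks_a_question': True}
```

Passing the bot and rewriter in as plain callables keeps the validation logic testable without a model, which is the property the workflow depends on.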

If this is right

Teachers will treat testing as a routine part of authoring rather than an optional extra step.
Each correction will automatically trigger prompt updates and cross-scenario validation.
Educators will receive scaffolding for the designer and QA roles they currently skip.
The system will hold up at scale, starting with the planned AI for Educators course enrolling hundreds of higher-education instructors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same correction-driven loop could apply to non-experts authoring AI agents in domains beyond tutoring.
Reducing reliance on abstract prompt writing may lower the barrier for subject-matter experts to create interactive tools.
Long-term use could reveal whether routine QA during authoring correlates with better student learning outcomes.
Extending the validation step to include student feedback data might further strengthen the bots.

Load-bearing premise

The automated pipeline can reliably analyze one teacher correction and produce a prompt rewrite that improves performance across test scenarios without introducing new errors.
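
One way to make this premise checkable is to compare scenario results before and after a rewrite. The sketch below is an assumption about how such a check could look, not something the paper specifies: `compare_runs` and `Verdict` are hypothetical names, and the scenario names and results are made up for illustration.

```python
from typing import Dict, List, NamedTuple


class Verdict(NamedTuple):
    fixed: List[str]      # scenarios that failed before and pass now
    regressed: List[str]  # scenarios that passed before and fail now
    accept: bool          # True only if the target is fixed and nothing regressed


def compare_runs(before: Dict[str, bool], after: Dict[str, bool], target: str) -> Verdict:
    """Diff two pass/fail maps keyed by scenario name.

    A scenario missing from `after` is treated as failing, so a rewrite
    cannot be accepted just because a check was silently skipped.
    """
    fixed = sorted(n for n in before if not before[n] and after.get(n, False))
    regressed = sorted(n for n in before if before[n] and not after.get(n, False))
    accept = bool(after.get(target, False)) and not regressed
    return Verdict(fixed, regressed, accept)


# Illustrative data only: the rewrite fixes the targeted scenario
# but breaks one that used to pass, so it should be rejected.
before = {"no_direct_answer": False, "stays_on_topic": True, "encourages_effort": True}
after = {"no_direct_answer": True, "stays_on_topic": False, "encourages_effort": True}

verdict = compare_runs(before, after, target="no_direct_answer")
print(verdict.fixed)      # ['no_direct_answer']
print(verdict.regressed)  # ['stays_on_topic']
print(verdict.accept)     # False
```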

What would settle it

A controlled comparison in which bots authored with PromptDecipher fail more test scenarios, or pick up more new errors after corrections, than bots authored by direct prompt editing.

Figures

Figures reproduced from arXiv: 2605.16605 by John Stamper, Miina Koyama, Ruiwei Xiao.

**Figure 1.** The PromptDecipher workflow. (1) A teacher edits an unsatisfactory bot response in the test case chat. (2) The system ...

Original abstract

Chatbots have long been explored as tools to support learning, and recent advances in large language models have significantly expanded the availability of platforms for educators to author AI tutoring chatbots. Yet effective authorship demands more than writing a system prompt; it requires educators to act as learning designers, AI interaction designers, and QA engineers. In practice, however, teachers rarely fulfill these roles. Our formative study found that virtually none systematically tested their bots before deploying them to students. To address this gap, we present PromptDecipher, a system that restructures the authoring workflow around a direct correction-based interaction: rather than writing abstract system prompts, teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip. PromptDecipher will be deployed in an AI for Educators course enrolling hundreds of higher-education instructors. A live prototype (https://teacher-prompting.vercel.app/), an anonymized codebase (https://anonymous.4open.science/r/teacher-prompting-2EDF/), and an anonymized demo (https://tinyurl.com/las-prompt-decipher-demo) are available via links in the footnote.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents PromptDecipher, a system that restructures AI tutor authoring around direct correction of simulated chatbot responses rather than abstract system prompt writing. Teachers interact with a live preview, edit undesirable outputs, and an automated pipeline analyzes the correction to propose a targeted system prompt rewrite that is validated against pre-defined test scenarios. This is motivated by a formative study finding that teachers rarely test bots before deployment. The approach aims to enforce QA as a first-class activity and scaffold educators in learning design, interaction design, and QA roles. A prototype, anonymized codebase, and demo are provided, with planned deployment in an AI for Educators course.

Significance. If the pipeline reliably converts single corrections into prompt updates that improve behavior on test scenarios without regressions, the work could meaningfully advance HCI in educational AI by lowering barriers to effective chatbot authoring. The correction-based workflow directly addresses an observed gap in teacher practices and integrates QA into the primary interaction, which is a substantive design contribution. The open prototype and codebase enable reproducibility and extension.

major comments (2)

[Automated pipeline description] Section describing the automated pipeline (the 'analyzes the correction, proposes a targeted system prompt rewrite, and validates the change' component): no quantitative metrics are reported on rewrite success rate, frequency of introduced errors or regressions on test scenarios, coverage of the pre-defined scenarios, or failure modes such as over-generalization. This is load-bearing for the central claim that the system scaffolds QA roles, as the workflow's effectiveness rests on the pipeline's reliability without introducing new errors.
[Formative study] Formative study section: the claim that 'virtually none systematically tested their bots' is presented without details on study design, participant count, data collection method, or quantitative breakdown. This weakens the motivation for the correction-based workflow, as the design rationale depends directly on this finding.
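
The quantities requested in the first major comment could be aggregated from a log of corrections along the following lines. This is a sketch under assumptions: `CorrectionOutcome` and `summarize` are hypothetical names, the metric definitions are one reasonable reading of the comment, and the sample log is invented rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CorrectionOutcome:
    """Result of validating one proposed rewrite (hypothetical record)."""
    target_fixed: bool  # did the corrected behavior pass after the rewrite?
    regressions: int    # previously passing scenarios that now fail


def summarize(outcomes: List[CorrectionOutcome]) -> Dict[str, float]:
    """Aggregate per-correction outcomes into the rates a reader would want reported."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one outcome")
    return {
        # share of corrections whose targeted behavior was fixed
        "rewrite_success_rate": sum(o.target_fixed for o in outcomes) / n,
        # share of corrections that broke at least one passing scenario
        "regression_rate": sum(o.regressions > 0 for o in outcomes) / n,
        # share that fixed the target without breaking anything else
        "clean_success_rate": sum(o.target_fixed and o.regressions == 0 for o in outcomes) / n,
    }


# Invented log of four corrections, for illustration only.
log = [
    CorrectionOutcome(target_fixed=True, regressions=0),
    CorrectionOutcome(target_fixed=True, regressions=1),
    CorrectionOutcome(target_fixed=False, regressions=0),
    CorrectionOutcome(target_fixed=True, regressions=0),
]
report = summarize(log)
print(report)
# {'rewrite_success_rate': 0.75, 'regression_rate': 0.25, 'clean_success_rate': 0.5}
```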

minor comments (2)

[Abstract and system overview] The abstract references the live prototype, codebase, and demo links; ensure these are also clearly footnoted or linked in the main body and system description sections for reader accessibility.
[Figures] Workflow diagrams or interface screenshots would benefit from expanded captions that explicitly map user actions (correction) to pipeline steps (analysis, rewrite, validation).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of the correction-based authoring workflow. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: Section describing the automated pipeline (the 'analyzes the correction, proposes a targeted system prompt rewrite, and validates the change' component): no quantitative metrics are reported on rewrite success rate, frequency of introduced errors or regressions on test scenarios, coverage of the pre-defined scenarios, or failure modes such as over-generalization. This is load-bearing for the central claim that the system scaffolds QA roles, as the workflow's effectiveness rests on the pipeline's reliability without introducing new errors.

Authors: We agree that quantitative metrics on pipeline performance would provide stronger evidence for the reliability of the QA scaffolding. The current manuscript prioritizes the description of the pipeline architecture and its integration into the live correction workflow. In the revision we will expand the pipeline section to detail the mechanisms by which corrections are analyzed and rewrites are proposed, and to explain how the validation step against pre-defined scenarios is intended to surface regressions. We will also add an explicit limitations subsection that acknowledges the absence of reported success rates, regression frequencies, coverage statistics, and failure-mode analysis at this prototype stage, while outlining plans to gather such data during the upcoming deployment in the AI for Educators course.
revision: partial

Referee: Formative study section: the claim that 'virtually none systematically tested their bots' is presented without details on study design, participant count, data collection method, or quantitative breakdown. This weakens the motivation for the correction-based workflow, as the design rationale depends directly on this finding.

Authors: We accept that the formative study section would benefit from greater transparency. In the revised manuscript we will expand this section to include a full account of the study design, participant recruitment and demographics, data collection procedures, and the quantitative results that support the observation that virtually none of the educators systematically tested their bots prior to deployment. These additions will make the motivation for the correction-based workflow more robust and directly traceable to the empirical findings.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; systems-design contribution is self-contained

This is a systems-design paper presenting an implemented workflow and formative study observations for AI tutor authoring. There are no mathematical derivations, fitted parameters, predictions that reduce to inputs by construction, or load-bearing self-citations. The central claims rest on the described correction-based interaction pipeline and empirical notes about teacher behavior rather than any self-referential logic or ansatz smuggled via prior work. The work is therefore self-contained against external benchmarks as a practical design intervention.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied systems paper; it introduces no free parameters, mathematical axioms, or new postulated entities. The design rests on standard HCI assumptions about user behavior and the reliability of LLM prompt rewriting.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

7 extracted references · 7 canonical work pages

[1]

Gwo-Jen Hwang and Ching-Yi Chang. 2023. A review of opportunities and challenges of chatbots in education. Interactive Learning Environments 31, 7 (2023), 4099–4112. doi:10.1080/10494820.2021.1952615

work page doi:10.1080/10494820.2021.1952615 2023

[2]

Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, Stephan Krusche, Gitta Kutyniok, Tilman Michaeli, Claudia Nerdel, Jürgen Pfeffer, Oleksandra Poquet, Michael Sailer, Albrecht Schmidt, Tina Seidel, Matthias Stadler, Jochen Weller, Jochen Kuhn,...

work page doi:10.1016/j.lindif.2023 2023

[3]

Qianou Ma, Weirui Peng, Chenyang Yang, Hua Shen, Ken Koedinger, and Tongshuang Wu. 2025. What Should We Engineer in Prompts? Training Humans in Requirement-Driven LLM Use. ACM Trans. Comput.-Hum. Interact. 32, 4, Article 41 (Aug. 2025), 27 pages. doi:10.1145/3731756

work page doi:10.1145/3731756 2025

[4]

Shixian Xie, John Zimmerman, and Motahhare Eslami. 2025. Exploring What People Need to Know to be AI Literate: Tailoring for a Diversity of AI Roles and Responsibilities. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI ’25). Association for Computing Machinery, New York, NY, USA, Article 1018, 16 pages. doi:10.1145/3706...

work page doi:10.1145/3706598.3713841 2025

[5]

Minju Yoo, Hyoungwook Jin, and Juho Kim. 2025. How Do Teachers Create Pedagogical Chatbots?: Current Practices and Challenges. arXiv:2503.00967 [cs.HC] https://arxiv.org/abs/2503.00967

work page arXiv 2025

[6]

J.D. Zamfirescu-Pereira, Richmond Y. Wong, Bjoern Hartmann, and Qian Yang

[7]

Why Johnny Can’t Prompt: How Non-AI Experts Try (and Fail) to Design LLM Prompts. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 437, 21 pages. doi:10.1145/3544548.3581388

work page doi:10.1145/3544548.3581388 2023
