pith. sign in

arxiv: 2606.26809 · v1 · pith:566ZF7UDnew · submitted 2026-06-25 · 🧮 math.HO

The Educational Proof Assistant Waterproof in an Introductory Proof Course: Proof Construction and Learning Processes

Pith reviewed 2026-06-26 01:51 UTC · model grok-4.3

classification 🧮 math.HO
keywords educational proof assistantsproof constructionmathematics educationexplicit proofsquasi-experimentcontrolled natural languageintroductory proof course
0
0 comments X

The pith

Waterproof proof assistant leads to more explicit student proofs on paper even across English and Dutch.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether an educational proof assistant improves first-year university students' ability to construct valid mathematical proofs. It reports evidence from a quasi-experiment that students who used Waterproof produced more explicit proofs when writing on paper, with the effect appearing even though the assistant used English and the course proofs were in Dutch. A reader would care because first-year students commonly struggle to make their reasoning steps clear, and explicitness matters for producing correct proofs. The study also notes higher grades among Mathematics-Computer Science students who used the tool. Results remain suggestive because students chose whether to use Waterproof rather than being assigned at random.

Core claim

Students using Waterproof constructed more explicit proofs on paper than non-users. Effects carried over from the English-language assistant to Dutch pen-and-paper work. Students in the Mathematics-Computer Science program received higher grades when using the tool. The central conclusion is that an educational proof assistant can help students be more explicit in their proofs.

What carries the argument

Waterproof, a proof assistant built around controlled natural language to support transfer of proof skills from the computer to paper.

If this is right

  • Effects of the assistant carry over to the pen-and-paper context.
  • Students in the Mathematics-Computer Science program achieve higher grades when using Waterproof.
  • The assistant helps students make their reasoning steps more explicit.
  • The benefits appear even when the assistant language differs from the language of the written proofs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A follow-up study with random assignment could test whether the observed differences are caused by the tool.
  • Similar controlled-natural-language assistants might produce comparable gains in other languages or proof courses.
  • Classroom observations of how students interact with the assistant could reveal specific mechanisms behind the increase in explicitness.

Load-bearing premise

That differences in explicitness and grades between users and non-users come from the tool itself rather than from pre-existing differences in the self-selected groups.

What would settle it

A randomized trial that assigns students to use or not use Waterproof and then scores the explicitness of their pen-and-paper proofs.

Figures

Figures reproduced from arXiv: 2606.26809 by Jim Portegies, Johan Commelin, Pim Otte, Rogier Bos.

Figure 1
Figure 1. Figure 1: Survey results on perceptions about feedback [PITH_FULL_IMAGE:figures/full_fig_p008_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Survey results on time spent on the course [PITH_FULL_IMAGE:figures/full_fig_p008_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Observations on time spent in class. The vertical axis shows the proportion of observed in [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Observations on number of questions asked by students [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Screenshot showing a Waterproof proof and example of feedback. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗
read the original abstract

We study the use of an educational proof assistant in an introductory proof course through a quasi-experiment in a varied setting: multiple teachers, students with different study programs, and a mixed Dutch-English language environment. First-year university students are known to struggle with writing proofs. Waterproof is a proof assistant that is designed to support the transfer of skills to paper proofs by working with controlled natural language. We focus on the students' ability to construct valid mathematical proofs, and on their learning process. We study this through in-class observation, surveys, and analysis of student performance and proof structure. We present evidence that effects of using an educational proof assistant carry over to the pen-and-paper context,even when the assistant is English and the proof is given in Dutch. We also present evidence that suggests students in the Mathematics-Computer Science program achieve higher grades when using Waterproof. Our most important conclusion is that an educational proof assistant can help students be more explicit in their proofs. As students self-selected into using Waterproof rather than being randomly assigned, these results are suggestive rather than causal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper reports results from a quasi-experiment on the use of the Waterproof educational proof assistant (with controlled natural language) in a first-year introductory proof course involving multiple teachers, mixed study programs, and a Dutch-English language setting. Through in-class observations, surveys, and analysis of student performance and proof structure, it presents evidence that tool use carries over to more explicit pen-and-paper proofs (even when the assistant is in English and proofs in Dutch) and that Mathematics-Computer Science students achieve higher grades when using it. The authors' most important conclusion is that such an assistant can help students be more explicit; they explicitly note that self-selection into tool use means results are suggestive rather than causal.

Significance. If the observed differences can be attributed to the tool rather than pre-existing group differences, the findings would offer concrete support for the value of controlled-natural-language proof assistants in developing transferable proof-construction skills, a persistent challenge in undergraduate mathematics education. The mixed-language setting and multi-teacher design add ecological validity that is uncommon in tool-evaluation studies.

major comments (2)
  1. [Abstract] Abstract: The central claim that Waterproof 'can help students be more explicit in their proofs' and produces carry-over effects rests on direct comparisons of self-selected users versus non-users. No information is provided on baseline equivalence (prior mathematics performance, motivation, or program-specific traits), nor any post-hoc statistical adjustment for confounding. This directly undermines attribution of the observed differences in explicitness and grades to the intervention itself.
  2. [Abstract] Abstract and results sections: Although the authors correctly qualify the findings as 'suggestive rather than causal,' the manuscript still frames the group differences as evidence supporting the tool's effectiveness. Without additional analysis (e.g., propensity-score methods, pre-test measures, or qualitative evidence of mechanism), the load-bearing inference from observation to tool effect remains unsupported.
minor comments (1)
  1. The manuscript would benefit from a clearer description of how proof explicitness was operationalized and coded from the pen-and-paper artifacts.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive comments. We agree that the quasi-experimental design with self-selection requires careful framing, which we already attempt but will strengthen further. Below we respond point by point to the major comments.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that Waterproof 'can help students be more explicit in their proofs' and produces carry-over effects rests on direct comparisons of self-selected users versus non-users. No information is provided on baseline equivalence (prior mathematics performance, motivation, or program-specific traits), nor any post-hoc statistical adjustment for confounding. This directly undermines attribution of the observed differences in explicitness and grades to the intervention itself.

    Authors: We acknowledge that the absence of baseline measures on prior performance or motivation prevents ruling out pre-existing group differences or performing adjustments such as propensity-score matching. The manuscript already states in the abstract that 'students self-selected into using Waterproof rather than being randomly assigned, these results are suggestive rather than causal,' but we will revise the abstract, results, and a new limitations subsection to more prominently foreground this constraint and rephrase all claims in terms of observed associations rather than tool effects. revision: partial

  2. Referee: [Abstract] Abstract and results sections: Although the authors correctly qualify the findings as 'suggestive rather than causal,' the manuscript still frames the group differences as evidence supporting the tool's effectiveness. Without additional analysis (e.g., propensity-score methods, pre-test measures, or qualitative evidence of mechanism), the load-bearing inference from observation to tool effect remains unsupported.

    Authors: We will revise the abstract and results sections to eliminate any phrasing that could be read as implying effectiveness or causal support (e.g., changing 'evidence that suggests students ... achieve higher grades when using Waterproof' to 'observed associations between tool use and grades'). We will also add explicit discussion of the lack of mechanism evidence. Because no pre-intervention data were collected, we cannot add the requested analyses, but the textual revisions will make the inferential limits clearer. revision: yes

standing simulated objections not resolved
  • The study did not collect baseline equivalence data on prior mathematics performance or motivation, so post-hoc statistical adjustments or propensity-score methods cannot be performed.

Circularity Check

0 steps flagged

No circularity: empirical study with no derivations or fitted predictions

full rationale

The paper reports a quasi-experimental educational study relying on in-class observations, surveys, and direct analysis of student proofs and grades. It contains no equations, parameters, derivations, or mathematical claims that could reduce by construction to inputs. The authors explicitly flag self-selection as limiting causal attribution, and no self-citation chains or ansatzes are invoked to support core conclusions. All evidence is presented as observational and suggestive.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical education-research study with no mathematical derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5724 in / 1096 out tokens · 35920 ms · 2026-06-26T01:51:25.944743+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 11 canonical work pages · 1 internal anchor

  1. [1]

    39–59, doi:10.1007/s10817-015-9326-4

    Francisco Botana, Markus Hohenwarter, Predrag Jani ˇci´c, Zoltán Kovács, Ivan Petrovi ´c, Tomás Recio & Simon Weitzhofer (2015):Automated Theorem Proving in GeoGebra: Current Achievements.Journal of Automated Reasoning55(1), pp. 39–59, doi:10.1007/s10817-015-9326-4. Available athttps://doi.org/ 10.1007/s10817-015-9326-4

  2. [2]

    59–70, doi:10.4204/eptcs.354.5

    Merlin Carl, Hinrich Lorenzen & Michael Schmitz (2022):Natural Language Proof Checking in Introduction to Proof Classes – First Experiences with Diproche.Electronic Proceedings in Theoretical Computer Science 354, p. 59–70, doi:10.4204/eptcs.354.5. Available athttp://dx.doi.org/10.4204/EPTCS.354.5

  3. [3]

    CoqPL 2023

    Pablo Donato, Benjamin Werner & Kaustuv Chaudhuri (2023):Integrating graphical proofs in Coq. CoqPL 2023

  4. [4]

    Available athttps://github.com/Paper-Proof/ paperproof

    Evgenia Karunus & Anton Kovsharov:Paperproof. Available athttps://github.com/Paper-Proof/ paperproof

  5. [5]

    Available athttps://hal.science/hal-03979238

    Marie Kerjean, Frédéric Le Roux, Patrick Massot, Micaela Mayero, Zoé Mesnil, Simon Modeste, Julien Narboux & Pierre Rousselin (2022):Utilisation des assistants de preuves pour l’enseignement en L1.La Gazette de la Société mathématique de France174. Available athttps://hal.science/hal-03979238

  6. [6]

    In:Proceedings of the 2017 ACM Conference on International Computing Education Research, ICER ’17, Association for Computing Machinery, New York, NY , USA, p

    Maria Knobelsdorf, Christiane Frede, Sebastian Böhne & Christoph Kreitz (2017):Theorem Provers as a Learning Tool in Theory of Computation. In:Proceedings of the 2017 ACM Conference on International Computing Education Research, ICER ’17, Association for Computing Machinery, New York, NY , USA, p. 83–92, doi:10.1145/3105726.3106184. Available athttps://do...

  7. [7]

    Patrick Massot (2024):Teaching Mathematics Using Lean and Controlled Natural Language. In Yves Bertot, Temur Kutsia & Michael Norrish, editors:15th International Conference on Interactive Theorem Proving (ITP 2024),Leibniz International Proceedings in Informatics (LIPIcs)309, Schloss Dagstuhl – Leibniz- Zentrum für Informatik, Dagstuhl, Germany, pp. 27:1–...

  8. [8]

    Roman Matuszewski & Piotr Rudnicki (2005):Mizar: the first 30 years.Mechanized mathematics and its applications4(1), pp. 3–24. 12The Educational Proof Assistant Waterproof in an Introductory Proof Course

  9. [9]

    Moore (1994):Making the transition to formal proof.Educational Studies in Mathematics27(3), pp

    Robert C. Moore (1994):Making the transition to formal proof.Educational Studies in Mathematics27(3), pp. 249–266, doi:10.1007/BF01273731. Available athttps://doi.org/10.1007/BF01273731

  10. [10]

    161–180, doi:10.1007/s10817-007-9071-4

    Julien Narboux (2007):A Graphical User Interface for Formal Proofs in Geometry.Journal of Auto- mated Reasoning39(2), pp. 161–180, doi:10.1007/s10817-007-9071-4. Available athttps://doi.org/ 10.1007/s10817-007-9071-4

  11. [11]

    Pedro Quaresma & Vanda Santos (2019):Computer-Generated Geometry Proofs in a Learning Context, pp. 237–253. Springer International Publishing, Cham, doi:10.1007/978-3-030-28483-1_11. Available at https://doi.org/10.1007/978-3-030-28483-1_11

  12. [12]

    TryLogic tutorial: an approach to Learning Logic by proving and refuting

    Patrick Terrematte & João Marcos (2015):TryLogic tutorial: an approach to Learning Logic by proving and refuting, doi:10.48550/arXiv.1507.03685. arXiv:1507.03685

  13. [13]

    64–93, doi:10.1007/s40753-021-00140-1

    Athina Thoma & Paola Iannone (2022):Learning about Proof with the Theorem Prover LEAN: the Abundant Numbers Task.International Journal of Research in Undergraduate Mathematics Education8(1), pp. 64–93, doi:10.1007/s40753-021-00140-1. Available athttps://doi.org/10.1007/s40753-021-00140-1

  14. [14]

    Electronic Proceedings in Theoretical Computer Science419, p

    Frédéric Tran Minh, Laure Gonnord & Julien Narboux (2025):Proof Assistants for Teaching: a Survey. Electronic Proceedings in Theoretical Computer Science419, p. 1–27, doi:10.4204/eptcs.419.1. Available at http://dx.doi.org/10.4204/EPTCS.419.1

  15. [15]

    In João F

    Jørgen Villadsen & Frederik Krogsdal Jacobsen (2021):Using Isabelle in Two Courses on Logic and Au- tomated Reasoning. In João F. Ferreira, Alexandra Mendes & Claudio Menghi, editors:Formal Methods Teaching, Springer International Publishing, Cham, pp. 117–132

  16. [16]

    96–119, doi:10.4204/eptcs.400.7

    Jelle Wemmenhove, Dick Arends, Thijs Beurskens, Maitreyee Bhaid, Sean McCarren, Jan Moraal, Diego Rivera Garrido, David Tuin, Malcolm Vassallo, Pieter Wils & Jim Portegies (2024):Waterproof: Educational Software for Learning How to Write Mathematical Proofs.Electronic Proceedings in Theoretical Computer Science400, p. 96–119, doi:10.4204/eptcs.400.7. Avai...

  17. [17]

    To appear in post-proceedings of ThEdu 25

    Jelle Wemmenhove, Nick Hoofd, Alexander Schüler-Meyer & Jim Portegies (2026):Comparative analysis of student proof structure following the use of Waterproof. To appear in post-proceedings of ThEdu 25

  18. [18]

    Bewijzen in de Wiskunde

    Cezary Kaliszyk Freek Wiedijk & Maxim Hendriks Femke van Raamsdonk (2007):Teaching logic using a state-of-the-art proof assistant. In:International Workshop on Proof Assistants and Types in Education (PA TE’07).Available athttps://www.cs.ru.nl/~freek/pubs/webded1.pdf. P . Otte, R. Bos, J. Commelin and J. Portegies13 A Waterproof screenshot Figure 5: Scree...

  19. [19]

    Did you fill in the consent form? ◦No ◦Yes (Students who answered “No” were directed to the end of the survey.)

  20. [20]

    What is your student number?

  21. [21]

    What is your gender? ◦Male ◦Female ◦Other, feel free to specify, if you want: ◦Prefer not to say

  22. [22]

    What is your study programme? ◦Mathematics (no double bachelor) ◦Mathematics and applications ◦Mathematics/Physics ◦Mathematics/Computer Science ◦Mathematics/Economics ◦Other:

  23. [23]

    I have been in higher education for . . . years. (So 0 if you are a first-year student.) Previous Grades We will ask for your grades for central exams and total final grades for secondary education. If you were not in the Dutch secondary system, please enter grades on a 10 point scale that best correspond to the respective grade

  24. [24]

    What was your grade for the central exam for Mathematics B (or other obligatory mathematics course)?

  25. [25]

    What was your final grade for Mathematics B (or other obligatory mathematics course)?

  26. [26]

    What was your final grade for Mathematics D (or other optional mathematics course)? ◦Grade: P . Otte, R. Bos, J. Commelin and J. Portegies19 ◦I did not take Mathematics D, but my school did offer it. ◦My school did not offer Mathematics D

  27. [27]

    Have you ever written a mathematical proof, outside the secondary school curriculum? (Select all that apply.) □No □Yes, in (training for) mathematics competitions □Yes, during a prior study □Yes, during self-study □Yes, in another way: F Final Survey About Me

  28. [28]

    Did you fill in the consent form at the start of the course? ◦No ◦Yes (Students who answered “No” were directed to the end of the survey.)

  29. [29]

    What is your student number? Feedback

  30. [30]

    Feedback refers to all kinds of feedback received during the course: from teachers, TAs, written, oral and through tools

    For each of the following statements, choose the most appropriate option. Feedback refers to all kinds of feedback received during the course: from teachers, TAs, written, oral and through tools. Options: Strongly disagree – Somewhat disagree – Neither agree nor disagree – Somewhat agree – Strongly agree. (a) I would have liked to have gotten more feedbac...

  31. [31]

    hours on Bewijzen in de Wiskunde (excluding the 8 hours of weekly lectures and tutorials)

    I spent . . . hours on Bewijzen in de Wiskunde (excluding the 8 hours of weekly lectures and tutorials)

  32. [32]

    hours on Bewijzen in de Wiskunde in total

    I spent . . . hours on Bewijzen in de Wiskunde in total

  33. [33]

    hours on my studies in total from 1 September 2025 up to and including 4 November 2025

    I spent . . . hours on my studies in total from 1 September 2025 up to and including 4 November 2025

  34. [34]

    number of exercises listed in the course planning

    I seriously attempted or completed . . . number of exercises listed in the course planning. Waterproof in Exercises

  35. [35]

    ◦No ◦Yes (Students who answered “No” were directed to the end of this block.)

    I used Waterproof to do 1 or more exercises. ◦No ◦Yes (Students who answered “No” were directed to the end of this block.)

  36. [36]

    exercises using Waterproof

    I seriously attempted or completed . . . exercises using Waterproof

  37. [37]

    Options: Strongly disagree – Somewhat disagree – Neither agree nor disagree – Somewhat agree – Strongly agree

    Select the option that best matches your opinion. Options: Strongly disagree – Somewhat disagree – Neither agree nor disagree – Somewhat agree – Strongly agree. (a) I enjoyed using Waterproof. (b) Doing exercises in Waterproof helped me develop my proof writing skills. (c) Using Waterproof was a pleasant experience. (d) Doing an exercise in Waterproof is ...

  38. [38]

    List three comments about how doing exercises in Waterproof did or did not influence your learn- ing process

  39. [39]

    List three reasons you continued or stopped using Waterproof

  40. [40]

    Waterproof in Class

    If you have any other comments regarding the use of Waterproof for exercises, leave them here. Waterproof in Class

  41. [41]

    ◦No ◦Yes (Students who answered “No” were directed to the end of this block.)

    I was in a group where Waterproof was mentioned in class (Groups 1 and 2, taught by Johan and Pim). ◦No ◦Yes (Students who answered “No” were directed to the end of this block.)

  42. [42]

    Options: Strongly disagree – Somewhat disagree – Neither agree nor disagree – Somewhat agree – Strongly agree

    Select the most appropriate option. Options: Strongly disagree – Somewhat disagree – Neither agree nor disagree – Somewhat agree – Strongly agree. (a) I would have liked to see more use of Waterproof by the teacher. (b) The use of Waterproof by the teacher helped me better understand proofs. (c) The use of Waterproof by the teacher had a negative effect o...

  43. [43]

    Give three comments about how the use of Waterproof in class contributed to your learning or not

  44. [44]

    If you have any suggestions for the use of Waterproof in this course, or the Waterproof software in general, leave them here. P . Otte, R. Bos, J. Commelin and J. Portegies21 General

  45. [45]

    G Translation of Dutch proof We need to prove that∀y∈f(A),∃x∈A,y=g(x)

    If you have any other remarks concerning the course Bewijzen in de Wiskunde, or anything related to this research, leave them here. G Translation of Dutch proof We need to prove that∀y∈f(A),∃x∈A,y=g(x). Takey∈f(A). f(A) ={f(x)|x∈A}, so∃x ′ ∈A,y=f(x). Obtain thisx ′ ∈A. So it holds thaty∈f(x ′). We need to prove that∃x∈A,y=g(x). Choosex:=x ′. (Indeed,x∈A.)...