
arxiv: 2604.19142 · v1 · submitted 2026-04-21 · 💻 cs.SE · cs.CY

Towards More Empathic Programming Environments: An Experimental Empathic AI-Enhanced IDE

Pith reviewed 2026-05-10 02:41 UTC · model grok-4.3

classification 💻 cs.SE cs.CY
keywords empathic AI · novice programmers · programming IDE · AI-assisted learning · error correction · user study · C programming · learning outcomes

The pith

An empathic AI IDE for novice C programmers matches standard tools on most measures but users find it more helpful for fixing errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Ceci as a Caring Empathic C IDE that uses AI to offer emotional support and encourage learning instead of supplying code solutions directly. It describes a small pilot study in which eleven novice programmers completed a task either with Ceci or with VSCode plus ChatGPT, then completed workload surveys and usability questions. Results showed no meaningful differences in how effective or educational the tools seemed or how much effort they required, yet participants using Ceci rated it significantly higher for help during error correction. A sympathetic reader would care because the work tests whether adding empathy to AI coding tools can reduce over-reliance while still supporting beginners in the moment they get stuck.

Core claim

The study reports that the Caring Empathic C IDE called Ceci, which prioritizes learning and emotional support over direct code generation, shows no statistically significant differences from VSCode paired with ChatGPT in perceived effectiveness, learning outcomes, or workload among eleven novice programmers, although Ceci receives significantly higher ratings for helpfulness when correcting errors.

What carries the argument

Ceci, the Caring Empathic C IDE, which embeds empathic AI responses into the programming environment to focus on emotional encouragement and learner growth rather than immediate code provision.

If this is right

  • Empathic responses can be added to an IDE without raising users' reported workload.
  • The main observed benefit appears in how helpful the tool feels when users are fixing mistakes.
  • Empathic features by themselves are unlikely to produce broad gains in learning or reduced effort.
  • Future designs should test larger groups, varied tasks, and deeper integration of empathic elements with other supports.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pairing empathic feedback with guided hints that still require the user to write code might produce stronger learning gains than empathy alone.
  • The same interface approach could be tested in other languages or technical subjects where novices often feel frustrated.
  • AI coding assistants may need separate controls for empathy level so users can choose support without automatic code generation.

Load-bearing premise

The empathic features were correctly designed and placed in the interface, and the chosen survey measures plus small participant group were sufficient to detect any real differences in learning or workload.

What would settle it

A follow-up experiment with fifty or more novices, multiple distinct coding tasks, and direct measures of skill retention or code quality over repeated sessions would show whether the empathic condition produces measurably better learning results than the control condition.

Figures

Figures reproduced from arXiv: 2604.19142 by Aaron Daniel Go, Jocelynn Cu, Justin Rainier Go, Kurt Christian Andaya, Roemer Gabriel Caliboso.

Figure 1: Schematic Diagram of Methodology
… facing a cryptic compiler error. Furthermore, while studies explore empathic responses, few have investigated the use of visual, non-verbal cues, such as an animated agent with different emotional poses, to deliver this feedback in a debugging context. Therefore, our project aims to address this gap by designing and evaluating a system that provides empathic feedback specif…
Figure 2: Prototype Interface Design
3.1 Preparation and Participant Plan. The first step in the methodology is the preparation of necessary components for the study such as the prototype itself and the participants. 3.1.1 Prototype Design. The prototype, named Ceci, is an Integrated Development Environment (IDE) run locally through Python Flask. Within the IDE there would be three sections: the AI chatbot interfa…
Figure 3: NASA TLX Results per Group. The Ceci group seems to experience a slightly higher task load index than the ChatGPT group, but the difference is not statistically significant.
original abstract

As generative AI becomes integral to software development, the risk of over-reliance and diminished critical thinking grows. This study introduces "Ceci," our Caring Empathic C IDE designed to support novice programmers by prioritizing learning and emotional support over direct code generation. The researchers conducted a comparative pilot study between Ceci and VSCode + ChatGPT [9, 40]. Participants completed a coding task and were evaluated using the NASA-TLX workload assessment and a post-test usability survey. Although the sample size was small (n = 11), results show that there is no significant difference in perceived effectiveness, learning and workload between the Experimental Ceci group and the Control group, though Ceci users reported significantly greater perceived helpfulness in error correction (p = 0.0220). These findings suggest that empathic responses may not be sufficient on their own to enhance the learner's outcomes, perceptions, or reduce workload. Overall, this study provides a foundational framework for future research. Such research should explore larger sample sizes, diverse programming tasks, and additional empathic features to better understand the potential of empathic programming environments in supporting novice programmers; they must also ensure that the empathic features are well-integrated in the user interface.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces 'Ceci', a Caring Empathic C IDE that prioritizes emotional support and learning for novice programmers over direct code generation. It reports a pilot comparative study (n=11) of Ceci versus VSCode+ChatGPT on a coding task, assessed via NASA-TLX workload scores and a post-task usability survey. No significant differences were found in perceived effectiveness, learning, or workload, but Ceci users reported significantly greater helpfulness in error correction (p=0.022). The authors conclude that empathic responses may not be sufficient on their own and recommend larger studies with additional features.

Significance. If replicated with adequate power and refined measures, the work would usefully constrain expectations for empathic AI in programming education: it suggests that empathy alone may not outperform standard generative-AI assistance on core outcomes such as perceived learning or workload. The study supplies a concrete experimental framework (task, instruments, and comparison) that future work can build upon, particularly if it incorporates power analyses and effect-size reporting.

major comments (1)
  1. [Abstract and Conclusion] Abstract and Conclusion: the claim that empathic responses 'may not be sufficient on their own to enhance the learner's outcomes, perceptions, or reduce workload' rests on interpreting non-significant differences in effectiveness, learning, and workload as positive evidence of insufficiency. With n=11 split across groups, standard power calculations for typical Likert-scale or usability differences (Cohen's d ≈ 0.5–0.8) yield power well below 0.5; the null results are therefore inconclusive rather than supportive of the central claim. The single significant result (p=0.022) is reported without effect size, confidence intervals, or multiplicity correction.
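
The power argument in this comment can be checked with a rough calculation. The sketch below uses a normal approximation to a two-sided two-sample test at α = 0.05 and assumes a 6/5 split of the eleven participants, which is not stated on this page; the function name `approx_power` is illustrative. The approximation tends to overstate the power of an exact small-sample t-test, so the true figures would likely be lower still.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected shift of the test statistic under the alternative hypothesis
    shift = d * sqrt(n1 * n2 / (n1 + n2))
    # Probability of landing in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

# Assumed 6/5 split of n = 11; the actual group sizes are not given here.
for d in (0.5, 0.8):
    print(f"d = {d}: approximate power = {approx_power(d, 6, 5):.2f}")
```

Under these assumptions both effect sizes give power well below 0.5, which is consistent with the referee's reading that the null results are inconclusive.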
minor comments (2)
  1. [Methods] Methods: the manuscript provides limited detail on the concrete empathic response mechanisms implemented in Ceci and their precise integration into the IDE UI; explicit examples or screenshots would allow readers to evaluate whether the features were delivered as intended.
  2. [Results] Results: the p=0.022 finding on error-correction helpfulness should be accompanied by the exact test statistic, degrees of freedom, effect size, and any pre-specified analysis plan to permit assessment of its robustness.
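
One way the requested effect size could be reported is sketched below, assuming the comparison was a Mann-Whitney U test (the reference list includes a source on that test, but this page does not confirm which test was used). The ratings are invented placeholders, not the study's data, and the function names are illustrative. The rank-biserial correlation is computed from U as r = 2U / (n1 · n2) − 1.

```python
def mann_whitney_u(x, y):
    """U statistic for x versus y: each pair with x > y counts 1, ties count 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def rank_biserial(x, y):
    """Rank-biserial correlation in [-1, 1]; positive when x tends to exceed y."""
    u = mann_whitney_u(x, y)
    return 2 * u / (len(x) * len(y)) - 1

# Hypothetical 5-point ratings, invented purely for illustration.
group_a = [5, 4, 5, 4, 5, 4]
group_b = [3, 4, 2, 3, 3]

print("U =", mann_whitney_u(group_a, group_b))
print("rank-biserial r =", round(rank_biserial(group_a, group_b), 3))
```

Reporting U, both group sizes, and r alongside the p-value would let readers judge the magnitude of the difference rather than only its significance.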

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our pilot study. We agree that the small sample size makes the non-significant results inconclusive and that the reporting of the significant finding requires improvement. We will revise the abstract and conclusion to more cautiously frame the findings as preliminary, emphasize the pilot nature of the work, and avoid any implication that null results constitute evidence of insufficiency. We will also enhance statistical reporting for the significant result.

point-by-point responses
  1. Referee: the claim that empathic responses 'may not be sufficient on their own to enhance the learner's outcomes, perceptions, or reduce workload' rests on interpreting non-significant differences in effectiveness, learning, and workload as positive evidence of insufficiency. With n=11 split across groups, standard power calculations for typical Likert-scale or usability differences (Cohen's d ≈ 0.5–0.8) yield power well below 0.5; the null results are therefore inconclusive rather than supportive of the central claim.

    Authors: We agree that the non-significant results cannot be interpreted as positive evidence of insufficiency given the low power of the pilot (n=11). Our intent was to present the absence of observed benefits in this initial comparison as motivation for further research rather than a definitive conclusion. However, we recognize that the current phrasing risks overinterpretation. In the revised manuscript we will reword the abstract and conclusion to state that the pilot findings are inconclusive on the question of sufficiency, explicitly note the limited statistical power, and stress that larger, pre-registered studies with power analyses are needed to evaluate whether empathic features can improve the measured outcomes. revision: yes

  2. Referee: The single significant result (p=0.022) is reported without effect size, confidence intervals, or multiplicity correction.

    Authors: We thank the referee for this point. In the revision we will add the effect size (Cohen’s d or appropriate non-parametric equivalent) and 95% confidence interval for the difference in perceived helpfulness during error correction. We will also clarify that the analysis was exploratory with a modest number of comparisons and discuss whether a multiplicity adjustment is warranted; if applied, we will report both unadjusted and adjusted values. These additions will improve transparency and allow readers to better assess the result. revision: yes
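
The effect of a multiplicity adjustment on the reported p = 0.022 can be sketched as follows. Under both the Bonferroni and the Holm procedures, the smallest p-value in a family of m tests is multiplied by m (capped at 1). The number of comparisons in the study is not stated on this page, so the values of m below are assumptions, and the function name is illustrative.

```python
def adjusted_smallest_p(p_min, m):
    """Bonferroni/Holm adjustment for the smallest p-value among m tests."""
    return min(1.0, m * p_min)

P_REPORTED = 0.022  # the one significant result reported in the abstract

# m is hypothetical: the size of the comparison family is not given.
for m in (1, 2, 3, 4):
    adj = adjusted_smallest_p(P_REPORTED, m)
    verdict = "below" if adj < 0.05 else "at or above"
    print(f"m = {m}: adjusted p = {adj:.3f} ({verdict} 0.05)")
```

Under this sketch the result stays below 0.05 only if the family contains at most two tests, which is why reporting both unadjusted and adjusted values matters.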

Circularity Check

0 steps flagged

Purely empirical pilot study with no derivations or fitted predictions

full rationale

The paper is a comparative user study (n=11) reporting survey and NASA-TLX results between Ceci and VSCode+ChatGPT. No equations, parameter fitting, predictions derived from prior fits, or self-citation chains appear in the abstract or described methods. Conclusions follow directly from the observed p-values and null findings without any reduction to inputs by construction. This matches the default expectation of a self-contained empirical evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of standard survey instruments (NASA-TLX) and the assumption that the empathic responses were implemented as intended in the tool.

axioms (2)
  • domain assumption NASA-TLX workload assessment is a valid and sensitive measure for this context
    Used to compare groups without additional validation in the study.
  • ad hoc to paper The empathic responses in Ceci were correctly designed and delivered to participants
    Core to the experimental manipulation but not independently verified in the abstract.



Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages

  1. [1]

    S. K. Ahmed, R. A. Mohammed, A. J. Nashwan, R. H. Ibrahim, A. Q. Abdalla, B. M. Ameen, and R. M. Khdhir. 2025. Using thematic analysis in qualitative research. Journal of Medicine, Surgery, and Public Health 6 (2025), 100198

  2. [2]

    K. Aldrup, B. Carstensen, and U. Klusmann. 2022. Is empathy the key to effective teaching? a systematic review of its association with teacher-student interactions and student outcomes. Educational Psychology Review 34, 1 (2022), 1177–1216

  3. [3]

    N. Bosch and S. D’Mello. 2015. The affective experience of novice computer programmers. International Journal of Artificial Intelligence in Education 27, 1 (2015), 181–206

  4. [4]

    G. Carreira, L. Silva, A. J. Mendes, and H. G. Oliveira. 2022. Pyo, a chatbot assistant for introductory programming students. In 2022 International Symposium on Computers in Education (SIIE)

  5. [5]

    G. Castellano, A. Paiva, A. Kappas, R. Aylett, H. Hastie, W. Barendregt, F. Nabais, and S. Bull. 2013. Towards empathic virtual and robotic tutors. In Lecture Notes in Computer Science. 733–736

  6. [6]

    C. K. Y. Chan and L. H. Y. Tsi. 2023. The AI revolution in education: Will AI replace or assist teachers in higher education? (2023). arXiv:2305.01185

  7. [7]

    L. Colligan, H. W. W. Potts, C. T. Finn, and R. A. Sinkin. 2015. Cognitive workload changes for nurses transitioning from a legacy system with paper documentation to a commercial electronic health record. International Journal of Medical Informatics 84, 7 (2015), 469–476

  8. [8]

    L. M. Collins. 2007. Research design and methods. In Cronbach, L. J. (1951). Coefficient alpha and the internal structure of tests. Psychometrika, 16(3):297–334. Elsevier, 433–442

  9. [9]

    D. Deckker and S. Sumanasekara. 2024. The role of ChatGPT in software development and code generation. Wrexham University, Tech report

  10. [10]

    P. Eibl, S. Sabouri, and S. Chattopadhyay. 2025. Exploring the challenges and opportunities of AI-assisted codebase generation. (2025). arXiv:2508.07966

  11. [11]

    D. Ford and C. Parnin. 2015. Exploring causes of frustration for software developers. (2015). doi:10.1109/chase.2015.19

  12. [12]

    R. García-Pérez, J.-M. Santos-Delgado, and O. Buzón-García. 2016. Virtual empathy as digital competence in education 3.0. International Journal of Educational Technology in Higher Education 13 (2016)

  13. [13]

    M. Goroshit and M. Hen. 2014. Does emotional self-efficacy predict teachers’ self-efficacy and empathy? Journal of Education and Training Studies 2, 3 (2014)

  14. [14]

    S. Groothuijsen, A. van den Beemt, J. C. Remmers, and L. W. van Meeuwen. 2024. AI chatbots in programming education. Computers and Education: Artificial Intelligence, 7:100290

  15. [15]

    F. Gu, Z. Liang, H. Li, and J. Ma. 2025. The Matthew effect of AI programming assistants: A hidden bias in software evolution. (2025). arXiv:2509.23261

  16. [16]

    R. Gupta, H. Goyal, D. Kumar, A. Mehra, S. Sharma, K. Mittal, and J. S. Challa

  17. [17]

    Sakshm AI: Advancing AI-assisted coding education for engineering students in india through socratic tutoring and comprehensive feedback. (2025). arXiv:2503.12479

  18. [18]

    S. G. Hart and L. E. Staveland. 1988. Development of NASA-TLX (Task Load Index): Results of empirical and theoretical research. Advances in Psychology 52 (1988), 139–183

  19. [19]

    S. Hobert. 2019. Say hello to “Coding Tutor”! design and evaluation of a chatbot-based learning system supporting students. (2019)

  20. [20]

    Kate Hone. 2006. Empathic agents to reduce user frustration: The effects of varying agent characteristics. Interacting with Computers 18, 2 (2006), 227–245. doi:10.1016/j.intcom.2005.05.003

  21. [21]

    I. Hossain, C. Hundhausen, A. Tariq, S. Haque, Y. Qiao, and B. Mulanda. 2025. The effects of GitHub Copilot on computing students’ programming effectiveness, efficiency, and processes in brownfield programming tasks. (2025). doi:10.1145/3702652.3744219

  22. [22]

    M. L. Kamins and C. S. Dweck. 1999. Person versus process praise and criticism: Implications for contingent self-worth and coping. Developmental Psychology 35, 3 (1999), 835–847

  23. [23]

    O. Kurniawan, E. Chandra, C. M. Poskitt, Y. Noller, K. Tsu, and C. Jegourel. 2025. Designing for novice debuggers: A pilot study on an AI-assisted debugging tool. (2025). arXiv:2509.21067

  24. [24]

    L. Lasha, M. Grigolia, and L. Machaidze. 2023. Role of AI chatbots in education: systematic literature review. (2023)

  25. [25]

    Chenxi Li. 2025. AIVA: An AI-based Virtual Companion for Emotion-aware Interaction. arXiv:2509.03212 [cs.CV] https://arxiv.org/abs/2509.03212

  26. [26]

    H. Li, Z. Wang, L. Ding, J. Zhang, and G. Wang. 2025. The facts about the effects of pedagogical agents on learners’ cognitive load: a meta-analysis based on 24 studies. Frontiers in Psychology 16 (2025)

  27. [27]

    J. Liu, X. Tang, L. Li, P. Chen, and Y. Liu. 2023. Which is a better programming assistant? A comparative study between ChatGPT and stack overflow. (2023). arXiv:2308.13851

  28. [28]

    R. Liu, J. Zhao, B. Xu, C. Perez, and D. J. Malan. 2025. Improving AI in CS50: Leveraging human feedback for better learning. SIGCSE TS 2025 (2025), 715–721

  29. [29]

    S. Marwan, G. Gao, S. Fisk, T. W. Price, and T. Barnes. 2020. Adaptive immediate feedback can improve novice programming engagement and intention to persist in computer science. In Proceedings of the 2020 ACM Conference on International Computing Education Research

  30. [30]

    V. May, D. Misra, Y. Luo, A. Sridhar, J. Gehring, and J. Silvio. 2025. Freshbrew: A benchmark for evaluating AI agents on java code migration. (2025). arXiv:2510.04852

  31. [31]

    S. Mondal, C. K. Roy, H. Wang, J. Arguello, and S. Mathan. 2025. Can we trust the AI pair programmer? Copilot for API misuse detection and correction. (2025). arXiv:2509.16795

  32. [32]

    H. Mozannar, G. Bansal, A. Fourney, and E. Horvitz. 2024. Reading between the lines: Modeling user behavior and costs in AI-assisted programming. (2024)

  33. [33]

    C. M. Mueller and C. S. Dweck. 1998. Praise for intelligence can undermine children’s motivation and performance. Journal of Personality and Social Psychology 75, 1 (1998), 33–52

  34. [34]

    E. Novak, K. McDaniel, and J. Li. 2023. Factors that impact student frustration in digital learning environments. Computers and Education Open 5 (2023), 100153

  35. [35]

    E. Ortega-Ochoa, J. Q. Pérez, M. Arguedas, T. Daradoumis, and J. Manuel. 2024. The effectiveness of empathic chatbot feedback for developing computer competencies, motivation, self-regulation, and metacognitive reasoning in online higher education. Internet of Things 25 (2024), 101101

  36. [36]

    P. Phanudom, T. Hirao, R. Gaikovina Kula, and H. Iida. 2021. Interactive chatbot for supporting students in online Python programming class. (2021)

  37. [37]

    R. A. Poldrack, T. Lu, and G. Beguš. 2023. AI-assisted coding: Experiments with GPT-4. (2023). arXiv:2304.13187

  38. [38]

    A. S. Raamkumar and Y. Yang. 2022. Empathetic conversational systems: A review of current advances, gaps, and opportunities. IEEE Transactions on Affective Computing (2022), 1–20

  39. [39]

    T. Raykov and G. A. Marcoulides. 2017. Thanks coefficient alpha, we still need you! Educational and Psychological Measurement 79, 1 (2017), 200–210

  40. [40]

    J. Schmidhuber, S. Schlögl, and C. Ploder. 2021. Cognitive load and productivity implications in human-chatbot interaction. arXiv / tech report. (2021)

  41. [41]

    A. Silver. 2025. Celebrating 50 Million Developers. Microsoft

  42. [42]

    F. Sun, L. Li, S. Meng, X. Teng, T. Payne, and P. Craig. 2025. Integrating emotional intelligence, memory architecture, and gestures to achieve empathetic humanoid robot interaction in an educational setting. (2025). arXiv:2505.19803

  43. [43]

    J. H. Sundjaja, R. Shrestha, and K. Krishan. 2023. McNemar and Mann-Whitney U Tests. StatPearls

  44. [44]

    M. Vijayvergiya, M. Salawa, I. Budiselić, D. Zheng, P. Lamblin, M. Ivanković, J. Carin, M. Lewko, J. Andonov, G. Petrović, D. Tarlow, P. Maniatis, and R. Just

  45. [45]

    AI-assisted assessment of coding practices in modern code review. arXiv / conference report. (2024)

  46. [46]

    Z. Zhou. 2022. Empathy in education: A critical review. International Journal for the Scholarship of Teaching and Learning 16, 3 (2022)