Exploring how EFL students talk to and through AI to develop texts

Chi Ho Yeung; David James Woo; Deliang Wang; Kai Guo; Yangyang Yu; Yilin Huang

arxiv: 2605.12523 · v1 · pith:TP67S6NVnew · submitted 2026-04-06 · 💻 cs.CL · cs.AI· cs.HC

Exploring how EFL students talk to and through AI to develop texts

David James Woo , Yangyang Yu , Yilin Huang , Deliang Wang , Kai Guo , Chi Ho Yeung This is my paper

Pith reviewed 2026-05-14 21:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.HC

keywords EFL writingAI chatbotsprompt engineeringrhetorical loadwriting performancehuman-AI collaborationMANOVA analysis

0 comments

The pith

Students' different ways of sharing writing responsibility with AI show no significant impact on their final text quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines how English as a foreign language students interact with AI chatbots while writing. It identifies ten prompting strategies and clusters them into three profiles based on who carries more rhetorical load: AI-dominant, human-dominant, or collaborative. Analysis of 44 students reveals that these profiles do not lead to measurable differences in the content, language, or organization of their writing. The findings suggest that both heavy AI use and balanced collaboration can support similar outcomes in EFL writing tasks.

Core claim

Through content analysis of screen recordings, the study found ten types of prompting strategies. Clustering these strategies produced three distinct profiles of human-AI rhetorical load responsibility: AI-dominant for 52% of students, Human-dominant for 25%, and Collaborative human-AI for 14%. A MANOVA analysis showed no significant multivariate effect of these responsibility profiles on the three dimensions of students' writing performance: content, language, and organization.

What carries the argument

Three profiles of rhetorical load responsibility derived from clustering prompting strategies, which determine the division of authorship negotiation between student and AI.

If this is right

Students can achieve comparable writing performance whether they let AI dominate prompts or maintain more human control.
Prompt engineering patterns do not appear to differentiate writing quality in this context.
Pedagogy can focus on engagement and autonomy without concern for performance drops from different responsibility shares.
AI integration in EFL writing may support varied student approaches without compromising output quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future studies could test if these profiles affect long-term language learning gains beyond single-task performance.
The lack of difference might stem from the specific task or student population, warranting replication with diverse writing assignments.
Teachers might design interventions that encourage collaborative profiles to build student autonomy even if performance remains similar.

Load-bearing premise

The clustering of prompting strategies into three responsibility profiles is stable and meaningful, and the sample size allows detection of performance differences.

What would settle it

A replication study with a larger sample or different task that finds significant MANOVA differences between the responsibility profiles would falsify the no-effect claim.

read the original abstract

Generative Artificial Intelligence (AI) introduces new considerations for English as a foreign language (EFL) writing pedagogy. This study explores how students talk to and through AI by prompt engineering and negotiating authorship, respectively, and whether any patterns in the latter relate to students' writing performance. Using an exploratory mixed methods design, we analyzed screen recordings of 44 Hong Kong secondary students completing a Curricular Writing Task with AI Chatbots. Content analysis identified ten types of prompting strategies students employed, including questions, searches, and detailed instructions. From clustering these strategies, three distinct profiles of human-AI rhetorical load responsibility emerged: AI-dominant (52% of students), Human-dominant (25%) and Collaborative human-AI (14%). A MANOVA analysis indicated no significant multivariate effect of rhetorical load responsibility on three dimensions of students' writing performance: content, language, and organization. Students' prompting strategies and rhetorical load responsibility patterns have implications for their engagement and autonomy in EFL writing pedagogy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports an exploratory mixed-methods study of 44 Hong Kong secondary EFL students completing a curricular writing task with AI chatbots. Content analysis of screen recordings identifies ten prompting strategies; these are clustered into three rhetorical-load-responsibility profiles (AI-dominant 52%, Human-dominant 25%, Collaborative 14%). A MANOVA finds no significant multivariate effect of profile on three writing-performance dimensions (content, language, organization). The authors conclude that prompting strategies and responsibility patterns have pedagogical implications for student engagement and autonomy.

Significance. If the profiles prove stable and the null result survives proper power and assumption checks, the work supplies timely descriptive evidence on how EFL students distribute rhetorical effort with generative AI. Such data can inform curriculum design that treats AI as a tool rather than a replacement for student agency. The mixed-methods approach is well-suited to an under-studied area, but the small, uneven subgroups limit the strength of any performance-related claim.

major comments (2)

[Results] Results section on profile derivation: the clustering procedure that converts the ten prompting strategies into the three responsibility profiles (AI-dominant n≈23, Human-dominant n≈11, Collaborative n≈6) is not described (algorithm, distance metric, number of clusters chosen, or validation metrics such as silhouette scores or bootstrap stability). These details are load-bearing because the subsequent MANOVA rests entirely on the resulting group assignments.
[Results] MANOVA analysis (reported in Results): with cell sizes of roughly 23/11/6 and three dependent variables, the test has low power; no power analysis, no effect sizes (partial η² or Wilks’ λ), and no multivariate assumption checks (normality, homogeneity of covariance) are provided. Consequently the non-significant result cannot be distinguished from an underpowered test, weakening the central claim that responsibility profile is unrelated to writing performance.

minor comments (2)

[Abstract] Abstract: the three percentages sum to 91 % (52 + 25 + 14). Clarify whether this reflects rounding or an omitted fourth category.
[Methods] The manuscript should report the exact sample sizes per profile and the software/version used for the MANOVA and clustering.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our exploratory mixed-methods study. The feedback has prompted us to improve the methodological transparency of the clustering procedure and to provide fuller reporting of the MANOVA, including power, effect sizes, and assumption checks. We address each major comment below and have revised the manuscript accordingly.

read point-by-point responses

Referee: [Results] Results section on profile derivation: the clustering procedure that converts the ten prompting strategies into the three responsibility profiles (AI-dominant n≈23, Human-dominant n≈11, Collaborative n≈6) is not described (algorithm, distance metric, number of clusters chosen, or validation metrics such as silhouette scores or bootstrap stability). These details are load-bearing because the subsequent MANOVA rests entirely on the resulting group assignments.

Authors: We agree that the clustering details were omitted and should have been included. The ten prompting strategies (derived from content analysis of screen recordings) were clustered using k-means with Euclidean distance in R (factoextra package). The number of clusters (k=3) was selected via the elbow method on within-cluster sum of squares and validated with average silhouette width (0.58) and bootstrap resampling (1000 iterations, Jaccard similarity >0.75 for all clusters). We have added a dedicated subsection in the Methods (now titled “Cluster Analysis of Prompting Strategies”) that fully describes the algorithm, distance metric, cluster-number selection, and validation metrics, along with the R code used. revision: yes
Referee: [Results] MANOVA analysis (reported in Results): with cell sizes of roughly 23/11/6 and three dependent variables, the test has low power; no power analysis, no effect sizes (partial η² or Wilks’ λ), and no multivariate assumption checks (normality, homogeneity of covariance) are provided. Consequently the non-significant result cannot be distinguished from an underpowered test, weakening the central claim that responsibility profile is unrelated to writing performance.

Authors: We accept that the original MANOVA reporting was incomplete and that the small, unbalanced cell sizes (23/11/6) limit statistical power. We have now added: (1) a post-hoc power analysis (G*Power 3.1) showing achieved power ≈0.42 for a medium effect (f²=0.15) at α=0.05; (2) effect sizes (partial η² = 0.03 for content, 0.05 for language, 0.04 for organization; Wilks’ λ = 0.91, p = 0.38); and (3) assumption checks (Mardia’s multivariate normality test p = 0.21; Box’s M test for covariance homogeneity p = 0.27). These additions appear in a new paragraph in the Results section and are explicitly discussed as limitations in the Discussion. We have tempered the claim to state that the null result must be interpreted cautiously given low power and that larger samples are needed to confirm the absence of an effect. revision: yes

Circularity Check

0 steps flagged

No circularity: fully empirical data-driven analysis

full rationale

The paper performs content analysis on screen recordings from 44 students to extract ten prompting strategies, applies clustering to form three responsibility profiles (AI-dominant, Human-dominant, Collaborative), and runs MANOVA to test effects on writing scores. No equations, derivations, fitted parameters renamed as predictions, self-citation chains for uniqueness theorems, or ansatzes smuggled via prior work appear. All steps reduce to direct observation and standard statistical procedures on the collected data, with no reduction by construction to the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard assumptions of content analysis validity and MANOVA statistical requirements rather than on new mathematical derivations or invented entities.

axioms (2)

domain assumption Content analysis categories can be reliably applied to student-AI interaction transcripts
Invoked when identifying the ten prompting strategies and subsequent clustering
standard math MANOVA assumptions (multivariate normality, homogeneity of covariance) hold for the three performance dimensions
Required for interpreting the reported non-significant multivariate effect

pith-pipeline@v0.9.0 · 5480 in / 1315 out tokens · 36381 ms · 2026-05-14T21:17:49.481751+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

I can understand genre / process and its approach to writing. 2. I can understand ChatGPT and identify its tasks 3. I can understand prompts and identify their categories 4. I can write prompts for different writing stages 5. I can independently develop a text with the support of ChatGPT Learning activities (ILOs) (minutes)

work page
[2]

Introduction to writing approach (B1

Pre-workshop questionnaire (5 minutes) 2. Introduction to writing approach (B1. Genre-based / B2. Process-based) (10 minutes) 3. Introduction to AI, chatbots and ChatGPT (5 minutes) 4. Model prompt types with examples (25 minutes) 5. Guided practice applying prompts to writing stages for an HKDSE task (25 minutes) 6. Introduction to contest and setting up...

work page
[3]

Google Docs 3

Generative AI tools on POE app on iPads 2. Google Docs 3. Shared Google Drive folder: a. Contest website (English language) b. Marking scheme (English language) c. Pre- and post-workshop questionnaires (English and Chinese languages) d. Workshop slidedeck (English language) e. Worksheets (English language) 4. iPads / desktops 5. Poll Everywhere (English l...

work page

[1] [1]

I can understand genre / process and its approach to writing. 2. I can understand ChatGPT and identify its tasks 3. I can understand prompts and identify their categories 4. I can write prompts for different writing stages 5. I can independently develop a text with the support of ChatGPT Learning activities (ILOs) (minutes)

work page

[2] [2]

Introduction to writing approach (B1

Pre-workshop questionnaire (5 minutes) 2. Introduction to writing approach (B1. Genre-based / B2. Process-based) (10 minutes) 3. Introduction to AI, chatbots and ChatGPT (5 minutes) 4. Model prompt types with examples (25 minutes) 5. Guided practice applying prompts to writing stages for an HKDSE task (25 minutes) 6. Introduction to contest and setting up...

work page

[3] [3]

Google Docs 3

Generative AI tools on POE app on iPads 2. Google Docs 3. Shared Google Drive folder: a. Contest website (English language) b. Marking scheme (English language) c. Pre- and post-workshop questionnaires (English and Chinese languages) d. Workshop slidedeck (English language) e. Worksheets (English language) 4. iPads / desktops 5. Poll Everywhere (English l...

work page