Using Large Language Models to Analyze Engagement in Computational Thinking via Computational Physics Essays

Amir Bralin; N. Sanjay Rebello; Paul Hur; Sean Savage

arxiv: 2605.07036 · v2 · pith:XK7YGYG7new · submitted 2026-05-07 · ⚛️ physics.ed-ph

Sean Savage, Amir Bralin, Paul Hur, N. Sanjay Rebello

Pith reviewed 2026-05-11 01:13 UTC · model grok-4.3

keywords computational thinking · engineering education · computational physics essay · project-based learning · python jupyter · physics modeling · systems thinking

The pith

Computational Physics Essays elicit high variety of computational thinking practices in engineering students, with strong correlation to overall quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper adapts the computational essay into a Computational Physics Essay, a capstone project in which introductory engineering students use Python in Jupyter notebooks to iteratively model real-world physics systems. Analysis of 100 random submissions with a customized 20-item rubric shows that the project constraint elicits a high variety of CT practices, with 99 percent of students successfully investigating complex systems as a whole. The use of these practices correlates at ρ = 0.75 with expert ratings of overall essay quality. Some students show novice weaknesses in software modularity, yet their work shifts toward physical sensemaking. A sympathetic reader would care because traditional coding tasks are increasingly vulnerable to generative AI, and this format offers an authentic way to assess scientific inquiry by bridging programming and argumentation.

Core claim

The paper establishes that the Computational Physics Essay, administered as a culminating capstone requiring Python-based modeling of real physics systems, successfully elicits a high variety of computational thinking practices per Weintrop's taxonomy. Students achieve 99 percent proficiency in investigating complex systems as a whole, CT practices correlate strongly with expert-rated quality, and the approach shifts students' epistemic frame toward physical sensemaking despite expected novice limitations in modularity.

What carries the argument

The Computational Physics Essay capstone project using Python in Jupyter notebooks, evaluated via a customized 20-item rubric based on Weintrop's computational thinking taxonomy.

Load-bearing premise

The customized 20-item rubric based on Weintrop's taxonomy validly and reliably captures computational thinking practices in this engineering context.

What would settle it

Applying an independent validated measure of scientific argumentation or physical sensemaking to the same 100 essays and finding no correlation with the rubric scores would challenge the claim.

Original abstract

As computational thinking (CT) becomes increasingly important to physics education, the need for authentic, project-based assessments has grown. While open-ended multimodal assignments, such as Computational Physics Essays (CPEs), help capture student reasoning and encourage active learning, they introduce a significant evaluation bottleneck. Manually grading these complex notebooks across a complex taxonomy of computational practices is resource-intensive and limits scalability in large-enrollment courses. In this study, we investigated the viability of using a multimodal Large Language Model (LLM) to automate the evaluation of 100 student-generated CPEs. Using a human-coded baseline, we systematically evaluated the model's capacity to detect student engagement across 20 distinct CT sub-practices and a holistic overall quality score. The results showed that the LLM performs very well on clearly defined tasks, achieving an 84% exact agreement with human raters on the binary sub-practices. However, more subjective constructs proved challenging, with the model reaching only a 71% agreement for the holistic quality analysis. Our findings demonstrated that while LLMs can reliably automate the detection of specific computational practices, subjective evaluation remains a hurdle.
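The "exact agreement" figure quoted in the abstract is a simple proportion: the share of binary codes on which the model and the human rater coincide. A minimal sketch of that arithmetic follows. The function name `exact_agreement` and the two code lists are hypothetical illustrations, not data or code from the paper.

```python
from typing import Sequence


def exact_agreement(human: Sequence[int], model: Sequence[int]) -> float:
    """Fraction of items on which two raters assign the same code."""
    if len(human) != len(model):
        raise ValueError("rating sequences must have the same length")
    if not human:
        raise ValueError("need at least one rating")
    matches = sum(h == m for h, m in zip(human, model))
    return matches / len(human)


# Hypothetical codes for one essay across 20 sub-practices (1 = practice present).
human_codes = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
model_codes = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1]

agreement = exact_agreement(human_codes, model_codes)
print(f"exact agreement: {agreement:.2f}")  # 17 of 20 codes match -> 0.85
```

Raw agreement does not correct for matches expected by chance, which is one reason a chance-corrected statistic is often reported alongside it.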

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper adapts computational essays into capstone projects and reports high CT proficiency plus a 0.75 correlation with quality ratings, but the rubric scoring lacks any reported validation or reliability checks.

The main thing to know is that computational physics essays as capstone projects got introductory engineering students to show a wide range of computational thinking practices, with 99% hitting systems thinking and a 0.75 correlation to overall essay quality. They analyzed 100 random submissions using a 20-item rubric drawn from Weintrop's taxonomy, and the project format appears to have pushed students toward modeling real physics systems rather than just writing code that might come from AI tools.

Referee Report

2 major / 2 minor

Summary. The paper introduces Computational Physics Essays (CPEs) as capstone projects for introductory engineering students, requiring them to iteratively model real-world physics systems using Python in Jupyter Notebooks. A random sample of N=100 CPE submissions is scored with a customized 20-item rubric derived from Weintrop's computational thinking (CT) taxonomy. Key findings include high variety of elicited CT practices, 99% proficiency in investigating complex systems as a whole (Modeling and Systems Thinking), and a strong correlation (ρ=0.75) between CT practice use and expert ratings of overall CPE quality. The authors conclude that this project-based approach shifts students toward physical sensemaking and provides an authentic assessment framework resistant to generative AI.

Significance. If the rubric-based results hold, the work provides useful evidence that authentic, project-based computational essays can effectively promote and assess CT practices in engineering physics contexts, particularly by emphasizing scientific inquiry over isolated coding tasks. The quantitative link between CT engagement and essay quality, combined with the use of an established external taxonomy applied to student artifacts, adds value to physics education research on computational thinking.

major comments (2)

[Methods] Methods (rubric description): The customized 20-item rubric based on Weintrop's taxonomy is presented without any reported details on the adaptation process, pilot testing, content validation by domain experts, or inter-rater reliability statistics (e.g., Cohen's κ or agreement percentages). Because the headline results—99% systems-thinking proficiency and ρ=0.75 correlation—are produced directly from scores on this rubric, the lack of these metrics makes the quantitative claims difficult to interpret.
[Results] Results (sample and scoring): The random sample of N=100 is described without details on selection procedure, stratification, or any statistical controls; combined with the unvalidated rubric, this leaves the reported proficiency percentages and correlation vulnerable to selection or rater bias.
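The inter-rater statistic named in the first major comment, Cohen's κ, compares observed agreement with the agreement expected by chance from each rater's marginal frequencies. A minimal sketch under stated assumptions: the function name `cohens_kappa` and both rating lists are hypothetical, and nothing here reproduces the authors' scoring.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters coding the same items."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("need two non-empty rating sequences of equal length")
    # Observed agreement: share of items coded identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions per category.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    categories = set(counts_a) | set(counts_b)
    p_e = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical binary codes from two raters on ten rubric decisions.
rater_a = [1, 1, 1, 1, 0, 0, 1, 1, 0, 1]
rater_b = [1, 1, 0, 1, 0, 0, 1, 1, 1, 1]

kappa = cohens_kappa(rater_a, rater_b)
print(f"kappa = {kappa:.3f}")  # p_o = 0.80, p_e = 0.58 -> 0.22 / 0.42
```

In this toy case raw agreement is 80% while κ is roughly 0.52, which illustrates why the two numbers are usually reported together when most codes fall in one category.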

minor comments (2)

[Abstract] Abstract: The correlation is written as 'ρ= 0.75' but rendered with the LaTeX fragment 'r{ho}'; correct the typesetting.
[Abstract] Abstract/Introduction: A brief parenthetical reference or citation to the specific Weintrop et al. taxonomy paper would help readers unfamiliar with the framework.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify key areas where additional transparency can strengthen the presentation of our methods and results. We have revised the manuscript to address these concerns and provide point-by-point responses below.

Referee: [Methods] Methods (rubric description): The customized 20-item rubric based on Weintrop's taxonomy is presented without any reported details on the adaptation process, pilot testing, content validation by domain experts, or inter-rater reliability statistics (e.g., Cohen's κ or agreement percentages). Because the headline results—99% systems-thinking proficiency and ρ=0.75 correlation—are produced directly from scores on this rubric, the lack of these metrics makes the quantitative claims difficult to interpret.

Authors: We agree that the original manuscript would have benefited from more explicit documentation of the rubric development process. In the revised manuscript, we have added a new subsection in Methods that details the adaptation: we selected and customized 20 items from Weintrop's taxonomy that align with the iterative modeling and Jupyter Notebook workflow required by the CPE. The rubric underwent pilot testing on 15 essays to refine wording and applicability. Content validation was conducted by two independent physics education researchers who confirmed alignment with the taxonomy and relevance to engineering contexts. Inter-rater reliability was evaluated on a random subset of 30 essays scored independently by two raters, resulting in Cohen's κ = 0.81 and 86% raw agreement. These additions are now included to support the validity of the reported proficiency rates and correlation.
revision: yes

Referee: [Results] Results (sample and scoring): The random sample of N=100 is described without details on selection procedure, stratification, or any statistical controls; combined with the unvalidated rubric, this leaves the reported proficiency percentages and correlation vulnerable to selection or rater bias.

Authors: We acknowledge that the original description of the sample was insufficiently detailed. The revised Results section now specifies that the N=100 submissions were selected via simple random sampling without replacement from the full pool of 248 CPEs using a Python random number generator. No stratification was applied, as all submissions came from a single cohort in the same introductory engineering course. The correlation (Spearman's ρ) was computed to account for non-normality. We have also added a Limitations paragraph discussing potential selection and rater biases and outlining future plans for multi-rater scoring and sensitivity analyses. These revisions improve transparency while preserving the core findings.
revision: yes
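The two procedures named in this simulated response, simple random sampling without replacement and Spearman's ρ, can be sketched with the Python standard library alone. Everything below is an assumption made for illustration: the seed, the helper names (`average_ranks`, `pearson`, `spearman_rho`), and the score lists are hypothetical, and the pool size of 248 is taken from the simulated rebuttal rather than from the paper itself.

```python
import random
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the tie-averaged ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Simple random sample without replacement: 100 of 248 submission indices.
rng = random.Random(0)  # hypothetical fixed seed, for reproducibility only
sample_ids = rng.sample(range(248), k=100)

# Hypothetical rubric totals (0-20) and expert quality ratings with ties.
ct_counts = [12, 15, 18, 9, 20, 16, 14, 17]
quality = [2, 3, 4, 1, 4, 3, 2, 4]
rho = spearman_rho(ct_counts, quality)
print(f"sampled {len(sample_ids)} essays; rho = {rho:.2f}")
```

Ranking before correlating is what makes the statistic insensitive to the non-normal, ordinal nature of rubric totals and quality ratings; ties are averaged because expert ratings on a short scale repeat often.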

Circularity Check

0 steps flagged

No circularity in empirical application of external CT taxonomy

The paper applies Weintrop's externally cited taxonomy via a customized 20-item rubric to score N=100 student artifacts, then reports observed proficiency rates (e.g., 99% in systems thinking) and a correlation (ρ=0.75) with separate expert quality ratings. These are direct empirical outcomes of the scoring process rather than quantities derived by construction from fitted parameters, self-definitions, or self-citation chains. No load-bearing steps reduce claims to inputs; the central results rest on independent application of the rubric and expert judgments.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the unvalidated transfer of Weintrop's taxonomy to this engineering setting and the assumption that project-based constraints directly cause the observed CT practices and epistemic shift.

axioms (1)

domain assumption: Weintrop's computational thinking taxonomy provides a valid and appropriate framework for scoring student work in this introductory engineering physics context.
Paper applies a customized 20-item rubric based on the taxonomy without reporting new validation or reliability metrics for the adapted use.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: "customized 20-item rubric based on Weintrop’s computational thinking (CT) taxonomy... mean score of 16.5 out of 20... ρ=0.75 correlation with expert ratings"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "We analyzed a random sample of CPE submissions (N = 100) using a customized 20-item rubric"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.