arxiv: 2604.25689 · v1 · submitted 2026-04-28 · 💻 cs.SE · cs.AI

Spreadsheet Modeling Experiments Using GPTs on Small Problem Statements and the Wall Task

Pith reviewed 2026-05-07 16:16 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords spreadsheet modeling · GPT tools · Excel AI · ERFR criteria · reproducibility · analytical models · AI assistance · workflow challenges

The pith

GPT tools can build structured spreadsheet models from simple text but produce inconsistent and non-reproducible results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether GPT-based extensions can assist in creating reusable analytical spreadsheet models by conducting structured experiments on small problem statements. After screening five tools, it focuses on Excel AI and evaluates the generated models against the ERFR criteria of placing each input in a cell, using cell formulas, avoiding hardwired numbers, including labels, and maintaining accuracy. The results show that the tool often produces well-structured models yet delivers varying outputs on repeated runs and struggles with consistency. A reader would care because spreadsheets support much business and analytical work, so reliable AI help could shorten the time or lower the expertise needed to build them, provided the reliability issues are solved.

Core claim

Through experiments on simple problem statements and the Wall Task, Excel AI produces models that sometimes satisfy the ERFR criteria yet remains inconsistent and often non-reproducible across runs. The paper identifies two central challenges: the problem of confidence, in which users cannot trust the generated models without verification, and the problem of workflow, in which the tools do not fit smoothly into standard spreadsheet development practices. The authors conclude that GPTs show promise for generating draft models that may reduce development time or lower skill requirements, but current tools remain unreliable for professional use without skilled users to verify and adapt the work.

What carries the argument

The ERFR criteria for spreadsheet quality, which require each input in its own cell, use of cell formulas rather than constants, no hardwired numbers, presence of labels, and overall accuracy.
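
To make the five criteria concrete, here is a minimal sketch of how they might be checked mechanically. It is illustrative only: the paper scores models by direct inspection, and everything below is an assumption, including the dictionary representation of a sheet (cell address to number, label, or formula string), the function name `check_erfr`, the rule that a label sits in the cell immediately to the left, and the tiny evaluator, which handles only plain A1-style references with `+ - * /` in single-letter columns. The example model uses the apple-spending problem statement (two apples a day for 30 consecutive days at $2 each).

```python
import re

CELL_REF = re.compile(r"\b[A-Z]+[0-9]+\b")
# A numeric literal that is not part of a cell reference such as B12.
NUMBER = re.compile(r"(?<![A-Z0-9.])\d+(?:\.\d+)?")


def is_formula(value):
    return isinstance(value, str) and value.startswith("=")


def is_number(value):
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def evaluate(model, cell):
    """Evaluate a cell holding a number or a simple arithmetic formula."""
    value = model[cell]
    if is_number(value):
        return value
    if not is_formula(value):
        raise ValueError(f"{cell} holds text, not a number or formula")
    # Substitute every referenced cell with its recursively evaluated value.
    expr = CELL_REF.sub(lambda m: f"({evaluate(model, m.group())!r})", value[1:])
    if not re.fullmatch(r"[0-9eE.+\-*/() ]+", expr):
        raise ValueError(f"unsupported formula in {cell}: {value}")
    return eval(expr, {"__builtins__": {}})


def left_neighbor(cell):
    """Address one column to the left (assumes single-letter columns)."""
    col, row = cell[0], cell[1:]
    return None if col == "A" else chr(ord(col) - 1) + row


def check_erfr(model, expected_inputs, output_cell, expected_output):
    """Return a pass/fail flag for each ERFR criterion (hypothetical scoring)."""
    numbers = [c for c, v in model.items() if is_number(v)]
    formulas = [c for c, v in model.items() if is_formula(v)]

    def labeled(cell):
        label = model.get(left_neighbor(cell))
        return isinstance(label, str) and not is_formula(label) and label.strip() != ""

    try:
        accurate = abs(evaluate(model, output_cell) - expected_output) < 1e-9
    except (KeyError, ValueError, ZeroDivisionError, RecursionError):
        accurate = False

    return {
        "each_input_in_a_cell": sorted(model[c] for c in numbers) == sorted(expected_inputs),
        "cell_formulas": is_formula(model.get(output_cell)),
        "no_hardwired_numbers": not any(NUMBER.search(model[c]) for c in formulas),
        "labels": all(labeled(c) for c in numbers + formulas),
        "accurate": accurate,
    }


# A model that satisfies every criterion.
good = {
    "A1": "Apples per day", "B1": 2,
    "A2": "Days", "B2": 30,
    "A3": "Price per apple ($)", "B3": 2,
    "A4": "Total spending ($)", "B4": "=B1*B2*B3",
}

# Same answer, but the 30 days are typed into the formula.
hardwired = {k: v for k, v in good.items() if k not in ("A2", "B2")}
hardwired["B4"] = "=B1*30*B3"

# Correct number, no formula at all.
values_only = dict(good, B4=120)

for name, model in [("good", good), ("hardwired", hardwired), ("values_only", values_only)]:
    print(name, check_erfr(model, [2, 30, 2], "B4", 120))
```

Note that both flawed variants still report the correct total, so an accuracy check alone would not separate them from the first model; the structural criteria do that.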

If this is right

  • Skilled users must still verify and adapt GPT-generated spreadsheets before use.
  • GPT tools may reduce development time or lower the skill level needed for initial model drafts.
  • Reproducibility must be improved before the tools can support professional modeling work.
  • Future research should examine prompt engineering, reproducibility fixes, and performance on larger modeling tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If inconsistency on small tasks persists at scale, hybrid human-AI workflows will likely remain necessary for spreadsheet modeling.
  • Testing multiple GPT tools across varied spreadsheet environments would show whether the problems of confidence and workflow are general or tool-specific.
  • Improved prompt techniques that achieve reproducibility could enable automated creation of model families for complex analyses.

Load-bearing premise

That the performance seen on small, simple problem statements with one chosen tool represents how GPT extensions would behave on realistic, larger-scale spreadsheet modeling tasks.

What would settle it

Applying the same structured tests to a collection of large, real-world spreadsheet problems with interdependent calculations and checking whether the outputs become consistent and reproducible across multiple runs.
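
One way such a consistency check might be set up is to fingerprint each generated model and count how many distinct fingerprints appear across repeated runs of the same prompt. The sketch below is an assumption about how this could be done, not the paper's procedure; the cell dictionaries, the function names `fingerprint` and `reproducibility`, and the three example runs are all hypothetical.

```python
import hashlib
import json
from collections import Counter


def fingerprint(model):
    """Hash a model's cell contents so identical models share a digest."""
    canonical = json.dumps(sorted(model.items()), separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reproducibility(runs):
    """Return (distinct outputs, share of runs equal to the most common output)."""
    counts = Counter(fingerprint(model) for model in runs)
    most_common_count = counts.most_common(1)[0][1]
    return len(counts), most_common_count / len(runs)


# Three hypothetical runs of the same prompt: two identical, one with a
# hardwired result instead of a formula.
run_a = {"A1": "Length", "B1": 10, "A2": "Width", "B2": 12, "A3": "Area", "B3": "=B1*B2"}
run_b = dict(run_a)
run_c = dict(run_a, B3=120)

distinct, agreement = reproducibility([run_a, run_b, run_c])
print(f"{distinct} distinct outputs, {agreement:.0%} of runs agree with the most common one")
```

An exact-match fingerprint is deliberately strict: two models that differ only in label wording count as different outputs. A looser comparison (for example, formulas only) would be a separate design choice.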

Figures

Figures reproduced from arXiv: 2604.25689 by Sopiko Datuashvili, Thomas A. Grossman, Yuan Chen.

Figure 1: Two “New chat” options are in red ovals. Two GPTs named “Excel AI” are in red rectangles.
Figure 2: Spreadsheet Outputs for Problem Statements 4 (left) and 6 (center and right).
Figure 3: Correct Output, No Cell Formulas
Figure 4: ERFR Spreadsheet Output, Providing Inputs and Outputs Modules. (Top shows values, ...
Figure 5: Correct Output, No Cell Formulas. Again, there was an option to provide a “dynamic” spreadsheet model. We selected this option, and EAI generated a spreadsheet file that generated an error message on opening requiring Excel to “recover” the contents. The resulting spreadsheet ( ...
Figure 6: Spreadsheet Output Providing Inputs and (Unlabeled) Outputs Modules. Cell Formulas Are ...
Original abstract

This paper investigates how GPT-based tools can assist in building reusable analytical spreadsheet models. After a screening, we evaluate five GPT extensions and select Excel AI by pulsrai.com for detailed testing. Through structured experiments on simple problem statements, we assess Excel AI's performance against the ERFR criteria (each input in a cell; cell formulas; no hardwired numbers; labels; accurate). Results show that while Excel AI can produce well-structured models, it is inconsistent and often non-reproducible. We identify two central challenges - "the problem of confidence" and "the problem of workflow" - which highlight the need for skilled users to verify and adapt GPT-generated spreadsheets. Though GPTs show promise for generating draft models that may reduce development time or lower skill requirements, current tools remain unreliable for professional use. We conclude with recommendations for future research into prompt engineering, reproducibility, and larger-scale modeling tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the use of GPT-based tools for building analytical spreadsheet models. It screens five GPT extensions and selects Excel AI for detailed testing on simple problem statements, scoring the outputs against the ERFR criteria (each input in a cell, cell formulas, no hardwired numbers, labels, accurate). The experiments reveal that while Excel AI can generate well-structured models, its outputs are inconsistent and often non-reproducible. The paper identifies 'the problem of confidence' and 'the problem of workflow', concluding that GPTs show promise for draft models but that current tools are unreliable for professional use, and it recommends future research on prompt engineering, reproducibility, and larger-scale tasks.

Significance. If the observed inconsistencies hold, this study provides valuable empirical evidence on the limitations of current GPT tools in spreadsheet modeling, emphasizing the necessity for user verification and adaptation. It contributes to the understanding of AI-assisted software engineering in the context of spreadsheets, potentially guiding improvements in tool design and prompting strategies. The focus on small problems limits the generalizability to complex professional scenarios.

major comments (2)
  1. [Abstract and Conclusion] The central claim that 'current tools remain unreliable for professional use' relies on experiments conducted exclusively on small, simple problem statements using only the Excel AI tool. Although the paper acknowledges the need for future work on larger-scale modeling tasks, the evidence presented does not demonstrate that the inconsistencies observed in toy problems persist or are more severe in realistic, multi-sheet, data-intensive scenarios, thereby weakening support for the broad professional unreliability conclusion.
  2. [Experimental Setup] The manuscript lacks specific details on the number of problem statements tested, the exact prompts employed, the number of trials per statement to assess reproducibility, and any statistical analysis of inconsistency rates. This omission makes it challenging to fully evaluate the robustness of the findings regarding non-reproducibility and inconsistency.
minor comments (2)
  1. [ERFR Criteria] Provide a more detailed explanation or table showing how each of the ERFR criteria (each input in a cell; cell formulas; no hardwired numbers; labels; accurate) was applied and scored in the experiments.
  2. [Terminology] Ensure consistent use of terms like 'GPT extensions' and 'GPT-based tools' throughout the paper.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate planned revisions to strengthen the manuscript.

  1. Referee: [Abstract and Conclusion] The central claim that 'current tools remain unreliable for professional use' relies on experiments conducted exclusively on small, simple problem statements using only the Excel AI tool. Although the paper acknowledges the need for future work on larger-scale modeling tasks, the evidence presented does not demonstrate that the inconsistencies observed in toy problems persist or are more severe in realistic, multi-sheet, data-intensive scenarios, thereby weakening support for the broad professional unreliability conclusion.

    Authors: We agree that the scope of our experiments limits direct claims about complex professional scenarios. The focus on small problem statements was intentional to isolate core GPT behaviors such as non-reproducibility and the identified problems of confidence and workflow. These issues arise from fundamental generation mechanisms and are expected to persist or intensify with scale, but we lack direct evidence from multi-sheet tasks. To address this, we will revise the abstract and conclusion to more explicitly qualify the unreliability conclusion as based on small-scale evidence, while strengthening the emphasis on the need for future larger-scale studies. This partial revision will improve precision without altering the core empirical findings. revision: partial

  2. Referee: [Experimental Setup] The manuscript lacks specific details on the number of problem statements tested, the exact prompts employed, the number of trials per statement to assess reproducibility, and any statistical analysis of inconsistency rates. This omission makes it challenging to fully evaluate the robustness of the findings regarding non-reproducibility and inconsistency.

    Authors: We accept this critique and will expand the experimental setup section. The revised manuscript will report the precise number of problem statements tested, reproduce the full prompts used, specify the number of trials per statement (conducted to evaluate reproducibility), and add quantitative summaries of inconsistency rates, such as the proportion of trials meeting all ERFR criteria. These details will enhance transparency and allow readers to better assess the strength of the non-reproducibility observations. revision: yes
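
A quantitative summary of the kind described in this response, the proportion of trials meeting all ERFR criteria per problem statement, might be tabulated as in the sketch below. The trial records are invented placeholders used only to show the calculation; they are not results from the paper, and the criterion keys and the function name `pass_rates` are assumptions.

```python
from collections import defaultdict

CRITERIA = (
    "each_input_in_a_cell",
    "cell_formulas",
    "no_hardwired_numbers",
    "labels",
    "accurate",
)


def pass_rates(trials):
    """Map each problem statement to the share of its trials passing every criterion.

    trials: iterable of (statement_id, {criterion: bool}) pairs.
    """
    passed_by_statement = defaultdict(list)
    for statement_id, scores in trials:
        passed_by_statement[statement_id].append(all(scores[c] for c in CRITERIA))
    return {s: sum(flags) / len(flags) for s, flags in passed_by_statement.items()}


def scored(**failures):
    """Build a score record that passes everything except the named criteria."""
    return {c: c not in failures for c in CRITERIA}


# Placeholder data: four trials of statement 3, two trials of statement 4.
trials = [
    (3, scored()),
    (3, scored()),
    (3, scored(cell_formulas=True)),
    (3, scored(no_hardwired_numbers=True, labels=True)),
    (4, scored()),
    (4, scored(accurate=True)),
]

for statement_id, rate in sorted(pass_rates(trials).items()):
    print(f"Problem Statement {statement_id}: {rate:.0%} of trials meet all ERFR criteria")
```

Reporting the rate per statement, rather than one pooled figure, keeps visible whether inconsistency is concentrated in particular prompt variants.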

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations or self-referential constructions

The paper performs screening of GPT extensions, selects Excel AI, runs structured experiments on small problem statements, and evaluates outputs against the explicitly defined ERFR criteria via direct observation. No equations, fitted parameters, derivations, or mathematical claims exist. Central conclusions rest on reported inconsistencies in tool outputs rather than any reduction to inputs by construction. Self-citations (if any) are not load-bearing for the empirical findings, and the paper explicitly flags limitations for larger-scale tasks without claiming generality from the small cases. This is a standard non-circular empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical evaluation study with no mathematical derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.0 · 5454 in / 1122 out tokens · 60711 ms · 2026-05-07T16:16:52.365616+00:00 · methodology

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages

  1. [1]

    industrial quality analytical spreadsheet models

    INTRODUCTION It is widely understood that Artificial Intelligence general pre-trained transformers (GPTs) are poised to have a large impact on many aspects of business activity. Analytics is perceived to be a particularly attractive candidate for GPTs, including spreadsheet analytics. A quick search of the internet reveals countless resources purporting t...

  2. [2]

    We did not attempt to catalog them

    SELECTION OF A SPREADSHEET MODELING GPT There are any number of GPTs that claim to assist with spreadsheet analytics, including data analysis, data visualization, writing cell formulas, and creating spreadsheet models. We did not attempt to catalog them. We did an initial screen to narrow the field to five readily-available GPTs that claim to produce spre...

  3. [3]

    High conversation quality includes step-by-step guidance, clear communication of concepts, and explanations of formulas and calculations

    Conversation Quality - Refers to how detailed, informative, and easy to understand the explanations are during the interaction (e.g., in GPT chat). High conversation quality includes step-by-step guidance, clear communication of concepts, and explanations of formulas and calculations

  4. [4]

    Strong interaction also includes offering suggestions on how to improve the model or what features could be added

    User Interaction - Measures how well the tool engages with the user, whether it asks clarifying questions, adapts to feedback, and guides the user through the process. Strong interaction also includes offering suggestions on how to improve the model or what features could be added

  5. [5]

    Error Handling - Assesses how accurately the tool builds the model and how it handles mistakes. Good error handling results in fewer errors, and when errors do occur, they are easy to identify and correct thanks to a clear structure and step-by-step guidance in conversation chat

  6. [6]

    A reliable file downloads correctly, opens without corruption in Excel, and includes working formulas and proper formatting with minimal issues

    File Reliability - Evaluates the quality and functionality of the generated spreadsheet files. A reliable file downloads correctly, opens without corruption in Excel, and includes working formulas and proper formatting with minimal issues

  7. [7]

    provide a downloadable Excel file

    Spreadsheet Design - Describes how well the spreadsheet is structured. A modular structure clearly separates inputs and outputs, labels provide meaning, and consistent formats make the model easy to read and understand. Evaluation Protocol We experimented using 16 short prompts to assess how different GPT tools responded to variations in prompt structure ...

  8. [8]

    word problem

    SPREADSHEET MODELING EXPERIMENTS We performed a series of experiments using EAI to evaluate its ability to build ERFR spreadsheets. We first discuss our prompting approach, and then describe a series of GPT runs. 3.1 Distinguish Between Problem Statement and Instructions A prompt is the set of information we provide to the GPT. For our goal of getting the...

  9. [9]

    Create an Excel model

  10. [10]

    We make no claim that these are somehow the best, or even good, instructions

    Provide a downloadable Excel file. We make no claim that these are somehow the best, or even good, instructions. However, they did result in the GPT producing downloadable Excel files with cell formulas. (Caveat: additional work that is not presented in this paper, finds that these instructions are not as effective for larger models.) 3.2 Exp...

  11. [11]

    Parameters? The distinction is between a problem where the variable values are presented, and when variable names without values (parameters) are presented

    How does the GPT perform when given Data v. Parameters? The distinction is between a problem where the variable values are presented, and when variable names without values (parameters) are presented. We compare Problem Statements 1 and 2 for a simple situation, and Problem Statements 3 and 4 for a slightly less simple situation

  12. [12]

    30 days” v. “month

    How does the GPT handle “30 days” v. “month”? The distinction is between an explicit statement of 30 days, and an implicit indication of 30 days by stating a month. (Although it is a common analytic approximation to treat months as having 30 days, there is ambiguity because a month can have 28, 29, 30, or 31 days.) We compare Problem Statements 3 and 5 (f...

  13. [13]

    30 consecutive days

    How does the GPT handle a known noun v. a made-up noun? This distinction here is between a known noun and a made-up word that functions as a noun. We compare Problem Statements 3 and 7 (for a prompt using Data), and Problem Statements 4 and 8 (for a prompt using Parameters). A prompt was created by appending the three instruction statements (section 3.1) ...

  14. [14]

    Create an Excel model. Use cell formulas. Provide a downloadable Excel file

    Experiments on a Larger Problem Statement (The Wall Task) The Wall Task problem statement (in Appendix 4) appeared in Panko 1999 where it was used to investigate student performance on a modest spreadsheet modeling task. It has since been used by faculty at least a few universities as an exercise for students that connects them to the research literature....

  15. [15]

    problem of confidence

    WHAT MIGHT BE THE VALUE OF A GPT FOR BUILDING A SPREADSHEET MODEL? Let’s suppose that a GPT has provided a spreadsheet model. How can we think about its value? If our expectation is that the GPT has produced a satisfactory spreadsheet model that satisfies ERFR, how can we confirm this expectation? We call this the “problem of confidence”. If the spreadshe...

  16. [16]

    problem statement

    CONCLUSIONS This paper provides a summary of simple, entry-level experiments into using GPTs to build spreadsheet models. There is a large space of GPTs that claim an ability to build spreadsheet models. We found that the tools are temperamental and unreliable. However, they sometimes work very well, although the results are often not reproducible. Initia...

  17. [17]

    A Use Case-Engineering Resources Taxonomy for Analytical Spreadsheet Models

    REFERENCES Grossman, T. A., Mehrotra V. (2023), “A Use Case-Engineering Resources Taxonomy for Analytical Spreadsheet Models”, European Spreadsheet Risks Interest Group 2023 Conference, London, England, July. Grossman, T. A., Mehrotra V., Sander J. (2011) “Towards Evaluating the Quality of a Spreadsheet: The Case of the Analytical Spreadshee...

  18. [18]

    output” text and session. model title and clear “output

    APPENDICES Appendix 1 - Summary of Evaluation of five GPTs In Tables 1a and 1b we provide a summary of the basic facts and our evaluation of the 5 GPTs. (Note: these tables were generated by ChatGPT based on our research notes.) GPT Excel Doc Maker Document Maker Provider By NAIF J ALOTAIBI By aidocmaker.com By community builder User Ratings Ratings (1K+)...

  19. [19]

    Area, using Parameters

    Calculate the square footage of a rectangular room that is 10 feet long and 12 feet wide. Area, using Parameters

  20. [20]

    Spending, using Data

    Calculate the square footage of a rectangular room given its length and width. Spending, using Data

  21. [21]

    What is total spending? Spending, using Parameters

    A man buys two apples every day for 30 consecutive days, spending $2 per apple. What is total spending? Spending, using Parameters

  22. [22]

    Given the price per apple and the number of apples he buys daily, what is total spending? Spending, using Data, using Month

    A man buys apples every day for 30 consecutive days. Given the price per apple and the number of apples he buys daily, what is total spending? Spending, using Data, using Month

  23. [23]

    What is total spending? Spending, using Parameters; using Month

    A man buys two apples every day for a month, spending $2 per apple. What is total spending? Spending, using Parameters; using Month

  24. [24]

    Given the price per apple and the number of apples he buys daily, what is total spending? Spending, using Data; using Made-Up Noun

    A man buys apples every day for a month. Given the price per apple and the number of apples he buys daily, what is total spending? Spending, using Data; using Made-Up Noun

  25. [25]

    What is total spending? Spending, using Parameters, using Made-Up Noun

    A man buys two snapplees every day for 30 consecutive days, spending $2 per snapplee. What is total spending? Spending, using Parameters, using Made-Up Noun

  26. [26]

    the original consulting firm

    A man buys snapplees every day for 30 consecutive days. Given the price per snapplee and the number of snapplees he buys daily, what is total spending? Appendix 4 – Problem Statement for the Wall Task Suppose that you are working for a general contractor (“the original consulting firm”) who has asked you to build a spreadsheet model to help her to create ...