CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding

Bingduo Liao; Chiming Ni; Hanan Salam; Hanjun Luo; Jiaheng Wen; Sylvia Chung; Wenyuan Xu; Xiaofeng Wang; Xinfeng Li; Yingbin Jin

arXiv: 2512.04111 · v3 · pith:WBZXHBKGnew · submitted 2025-11-30 · cs.SE · cs.AI · cs.HC

Hanjun Luo, Chiming Ni, Jiaheng Wen, Zhimu Huang, Yiran Wang, Bingduo Liao, Sylvia Chung, Yingbin Jin, Xinfeng Li, Wenyuan Xu, XiaoFeng Wang, Hanan Salam

Pith reviewed 2026-05-22 11:55 UTC · model grok-4.3

keywords: human-AI collaboration · agentic coding · coding benchmarks · LLM evaluation · human-in-the-loop · co-reasoning · software engineering

The pith

Human-AI collaboration reaches 31.11 percent success on coding problems designed to be intractable for humans or LLMs working alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CentaurEval as a benchmark designed to capture the value added when humans and AI coding agents work together on tasks. It defines Collaboration-Necessary templates that standalone humans or LLMs fail to complete but that become solvable once both contribute. Experiments with 45 participants and five LLMs across four intervention levels show solo pass rates of 18.89 percent for humans and 0.67 percent for LLMs, rising to 31.11 percent in collaboration. The results also indicate that strategic insights arise from either partner, pointing to a two-way co-reasoning dynamic rather than a one-directional tool relationship.
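The size of the reported gain can be checked with simple arithmetic. The sketch below uses only the three pass rates quoted above; the function names are illustrative and not taken from the paper.

```python
# Pass rates (percent) as quoted on this page from the paper's abstract.
SOLO_LLM = 0.67
SOLO_HUMAN = 18.89
COLLABORATION = 31.11


def point_lift(baseline: float, treated: float) -> float:
    """Absolute difference between two rates, in percentage points."""
    return round(treated - baseline, 2)


def relative_lift(baseline: float, treated: float) -> float:
    """Improvement over the baseline, expressed as a percent of the baseline."""
    return round(100.0 * (treated - baseline) / baseline, 1)


if __name__ == "__main__":
    # Collaboration versus the stronger solo condition (humans alone).
    print(point_lift(SOLO_HUMAN, COLLABORATION))
    print(relative_lift(SOLO_HUMAN, COLLABORATION))
```

Comparing against the human-only rate, the stronger of the two solo conditions, gives a lift of about 12 percentage points, which is the figure the referee report refers to.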

Core claim

CentaurEval supplies 45 Collaboration-Necessary problem templates that are intractable for standalone LLMs or humans yet solvable through effective collaboration. Dynamic instantiation produces a 450-task toolkit paired with a standardized IDE. Benchmarking yields 0.67 percent success for LLMs alone, 18.89 percent for humans alone, and 31.11 percent under human-AI collaboration, accompanied by evidence that breakthroughs can originate from either side and thereby challenge the conventional human-over-tool hierarchy.
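To make "dynamic instantiation" concrete, here is a minimal sketch of how a fixed set of templates could be expanded into a reproducible task set. Everything in it is an assumption for illustration: the `Template` and `Task` classes, the integer parameter ranges, the seeding scheme, and the choice of 10 instances per template (inferred only from 450 / 45). The paper's actual generator is not described on this page.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    """Hypothetical template: an id plus inclusive integer ranges per parameter."""
    template_id: str
    param_ranges: dict  # parameter name -> (low, high)


@dataclass(frozen=True)
class Task:
    """Hypothetical concrete task produced from a template."""
    task_id: str
    template_id: str
    params: dict


def instantiate(template: Template, n_instances: int, seed: int) -> list:
    """Draw n_instances parameter settings for one template, deterministically."""
    # Seeding per template keeps each template's tasks stable even if the
    # list of templates is reordered or extended.
    rng = random.Random(f"{seed}:{template.template_id}")
    tasks = []
    for i in range(n_instances):
        params = {
            name: rng.randint(low, high)
            for name, (low, high) in sorted(template.param_ranges.items())
        }
        tasks.append(Task(f"{template.template_id}-{i:02d}", template.template_id, params))
    return tasks


def build_toolkit(templates: list, n_instances: int = 10, seed: int = 0) -> list:
    """Expand every template into n_instances tasks (10 is an assumed default)."""
    return [task for tpl in templates for task in instantiate(tpl, n_instances, seed)]


if __name__ == "__main__":
    demo = [Template(f"T{k:02d}", {"n": (1, 100)}) for k in range(45)]
    print(len(build_toolkit(demo)))
```

Fixing the seed is what would make such a toolkit reproducible: two runs with the same templates and seed yield identical tasks.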

What carries the argument

Collaboration-Necessary problem templates that isolate tasks requiring both human strategic reasoning and AI implementation efficiency.

If this is right

Human-AI teams can address a distinct class of coding problems beyond solo capabilities.
Strategic breakthroughs in agentic coding need not originate only from the human partner.
Evaluation of coding agents should incorporate controlled human-in-the-loop conditions.
Dynamic template instantiation offers a reproducible method for testing collaborative performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interfaces for coding agents may need to support bidirectional idea exchange instead of one-way instruction.
The same template approach could be adapted to measure hybrid value in domains such as debugging or software design.
Training objectives for LLMs might shift toward recognizing and building on partial human contributions.

Load-bearing premise

The 45 templates are genuinely intractable for either humans or LLMs working alone.

What would settle it

Showing that current LLMs or skilled humans can solve a substantial share of the templates at high rates without the other party would remove the measured collaborative gain.

Figures

Figures reproduced from arXiv: 2512.04111 by Bingduo Liao, Chiming Ni, Hanan Salam, Hanjun Luo, Jiaheng Wen, Sylvia Chung, Wenyuan Xu, Xiaofeng Wang, Xinfeng Li, Yingbin Jin, Yiran Wang, Zhimu Huang.

**Figure 2.** The overall architecture of HAI-Eval.

**Figure 3.** The design-validation pipeline for transforming algorithmic cores into templates.

**Figure 4.** Visualization of key participant feedback.

**Figure 5.** Visual representations of the participants’ demographic data.

**Figure 6.** The standard workspace interface after initialization. The file explorer displays the core…

**Figure 7.** An example of a typical README.md. Step 3: Implementation. Participants are required to write their code in the designated solution.py file to complete each task. For every task, we provide a starter code template that includes a basic framework and helper functions, allowing participants to focus on implementing the core logic. The total time limit for completing all tasks is two hours.

**Figure 8.** An example of a typical solution.py. Step 4: Solution Submission. After completing their coding and local testing, participants are required to submit their solution by executing a shell script in the integrated terminal. They must first navigate to the corresponding problem directory and then run the submission command. This script automatically packages all necessary files and sends them to the backend e…

**Figure 9.** An example of the submission process.

**Figure 10.** Averaged ratings on overall user experience and usability.

Abstract

LLM-powered coding agents are reshaping the development paradigm. However, existing evaluation systems, whether traditional tests for humans or benchmarks for LLMs, fail to capture this shift, excluding problems that require both human reasoning to guide solutions and AI efficiency for implementation. We introduce CentaurEval, a unified, ecologically valid benchmark for measuring human-in-the-loop value in coding. CentaurEval's core innovation is its "Collaboration-Necessary" problem templates, which are intractable for standalone LLMs or humans, but solvable through effective collaboration. CentaurEval dynamically instantiates tasks from 45 templates, providing a standardized IDE for humans and a reproducible 450-task toolkit for LLMs. We benchmark 45 participants against 5 LLMs under 4 levels of human intervention. Results show that while LLMs or humans alone achieve poor pass rates (0.67% and 18.89%), human-AI collaboration significantly improves the pass rate to 31.11%. Our analysis reveals an emerging co-reasoning partnership, challenging the traditional human-tool hierarchy by showing that strategic breakthroughs can originate from either humans or AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces CentaurEval, a benchmark for human-in-the-loop value in agentic coding. Its central innovation is 45 'Collaboration-Necessary' problem templates claimed to be intractable for standalone LLMs or humans yet solvable through collaboration. The benchmark provides a standardized IDE and a reproducible 450-task toolkit. Experiments with 45 human participants and 5 LLMs across 4 intervention levels report pass rates of 0.67% for LLMs alone, 18.89% for humans alone, and 31.11% for collaboration, with analysis suggesting an emerging co-reasoning partnership that challenges the traditional human-tool hierarchy.

Significance. If the intractability claim is substantiated, the work supplies a valuable, ecologically valid benchmark that quantifies genuine collaborative gains in coding tasks and offers a reproducible toolkit for future studies. The emphasis on bidirectional strategic breakthroughs (from either humans or AI) could inform agent design and evaluation in software engineering.

major comments (1)

Abstract and template description: The headline result (solo LLMs 0.67%, solo humans 18.89%, collaboration 31.11%) rests on the assertion that the 45 Collaboration-Necessary templates are genuinely intractable for either party alone. No evidence is supplied for how this property was established—e.g., no report of exhaustive solo-LLM trials with modern prompting techniques, no expert-human time limits or failure criteria, and no pre-registration or validation protocol for template difficulty. Without this, the 12-point lift cannot be confidently attributed to collaboration value rather than task-selection artifacts.

minor comments (2)

The abstract and results sections report aggregate pass rates without error bars, confidence intervals, or statistical tests comparing conditions, which would clarify the reliability of the observed differences.
Details on participant recruitment, expertise levels, task randomization, and selection criteria for the 45 templates are not provided, limiting assessment of generalizability and potential biases.
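The first minor comment asks for confidence intervals. As one possible way to report them, the sketch below computes a Wilson score interval for a pass rate. The counts in the usage example are placeholders picked only to give a rate near 31 percent; this page does not report the underlying numbers of attempts, so they should not be read as the paper's data.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95 percent interval.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z * z / (4.0 * trials * trials))
    ) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # Placeholder counts (assumption): 14 passes out of 45 attempts.
    low, high = wilson_interval(14, 45)
    print(f"{low:.3f} to {high:.3f}")
```

The Wilson interval is used here instead of the simpler normal approximation because it behaves better for small samples and for rates near zero, such as the 0.67 percent LLM-only condition.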

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for highlighting the need to better substantiate the intractability claim for the Collaboration-Necessary templates. We address this point directly below and commit to revisions that will strengthen the manuscript without altering its core claims.

point-by-point responses

Referee: Abstract and template description: The headline result (solo LLMs 0.67%, solo humans 18.89%, collaboration 31.11%) rests on the assertion that the 45 Collaboration-Necessary templates are genuinely intractable for either party alone. No evidence is supplied for how this property was established—e.g., no report of exhaustive solo-LLM trials with modern prompting techniques, no expert-human time limits or failure criteria, and no pre-registration or validation protocol for template difficulty. Without this, the 12-point lift cannot be confidently attributed to collaboration value rather than task-selection artifacts.

Authors: We agree that the submitted manuscript provides insufficient detail on the validation process used to confirm the Collaboration-Necessary property. The templates were developed through iterative pilot testing in which both LLMs (across multiple prompting regimes) and human participants consistently failed to solve the problems within reasonable bounds; however, these steps were not fully documented. We will add a dedicated subsection to the Methods section that reports: (1) the specific LLM prompting techniques evaluated during validation (including chain-of-thought, few-shot, and self-consistency variants), (2) the human trial protocol (30-minute time limit per task, clear failure criteria based on inability to produce a passing solution), and (3) the iterative refinement criteria that led to the final 45 templates. While the study was not formally pre-registered, the validation followed a documented internal protocol. These additions will allow readers to assess whether the observed 12-point gain reflects genuine collaborative value rather than selection effects. We view this as a necessary clarification.

Revision: yes
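The validation criteria named in the response could be encoded roughly as follows. This is a hypothetical sketch, not the authors' protocol: the `Trial` record, the condition labels, and the `max_solo_rate` threshold are all assumptions. Only the 30-minute human time limit and the "no passing solution" failure criterion come from the text above.

```python
from dataclasses import dataclass

HUMAN_TIME_LIMIT_MIN = 30.0  # per-task limit stated in the simulated rebuttal


@dataclass(frozen=True)
class Trial:
    """One pilot attempt at a template (hypothetical record layout)."""
    template_id: str
    condition: str  # assumed labels: "llm", "human", or "collab"
    passed: bool
    minutes: float = 0.0


def solved(trial: Trial) -> bool:
    """A human attempt counts only if it passes within the time limit."""
    if trial.condition == "human" and trial.minutes > HUMAN_TIME_LIMIT_MIN:
        return False
    return trial.passed


def pass_rate(trials: list, condition: str):
    """Fraction of solved trials in one condition, or None if there are none."""
    group = [t for t in trials if t.condition == condition]
    if not group:
        return None
    return sum(solved(t) for t in group) / len(group)


def is_collaboration_necessary(trials: list, max_solo_rate: float = 0.0) -> bool:
    """Keep a template only if both solo conditions stay at or below
    max_solo_rate while collaboration exceeds it (assumed decision rule)."""
    llm = pass_rate(trials, "llm")
    human = pass_rate(trials, "human")
    collab = pass_rate(trials, "collab")
    if llm is None or human is None or collab is None:
        return False  # missing a condition: not enough evidence to decide
    return llm <= max_solo_rate and human <= max_solo_rate and collab > max_solo_rate
```

Making `max_solo_rate` an explicit parameter matters for the referee's concern: a strict threshold of zero and a looser one would select different template sets, so the chosen value would need to be reported alongside the results.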

Circularity Check

0 steps flagged

No circularity detected in empirical benchmark results

full rationale

The paper reports direct empirical measurements of pass rates for solo LLMs (0.67%), solo humans (18.89%), and human-AI collaboration (31.11%) on tasks dynamically instantiated from 45 templates. No equations, derivations, fitted parameters, or self-referential definitions appear in the provided text that would reduce the reported collaboration gains to inputs by construction. The core claims rest on observable experimental outcomes under controlled conditions with a reproducible toolkit, allowing independent verification and falsification outside any internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen templates isolate collaboration value; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: the 45 Collaboration-Necessary problem templates are intractable for standalone LLMs or humans but solvable through effective collaboration.
This premise is invoked to justify the benchmark's ecological validity and is stated as the core innovation in the abstract.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: HAI-Eval’s core innovation is its “Collaboration-Necessary” problem templates, which are intractable for both standalone LLMs and unaided humans, but solvable through effective collaboration.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Results show that while LLMs or humans alone achieve poor pass rates (0.67% and 18.89%), human-AI collaboration significantly improves to 31.11%.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
cs.SE 2026-04 unverdicted novelty 5.0

Agentic Consensus replaces code as the main artifact with a typed property graph world model that maintains commitments and evidence through synchronization operators, shifting evaluation to alignment fidelity and con...

[1] [1]

A Survey on Code Generation with LLM-based Agents

Accessed: 2025-09-14. TopCoder. Topcoder.https://www.topcoder.com/, 2001. Accessed: September 7, 2025. Nabeel Ullah, Marcus Liwicki, and Mats Sj ¨oberg. Towards enhancing ecological validity in user studies: a systematic review of guidelines and implications for qoe research.Quality and User Experience, 8(1):1–32, 2023. doi: 10.1007/s41233-023-00059-2. Mi...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1007/s41233-023-00059-2 2025

[2] [2]

Based on the potential tool list provided by the template, determine which tools are needed

work page

[3] [3]

According to the tool dependency information provided in{tool info}, plan the invo- cation sequence

work page

[4] [4]

template_id

Based on the scenario and value ranges provided by the template, set appropriate pa- rameters for each tool You can: • Adjust tool invocation order based on template characteristics • Skip unnecessary tools • Repeatedly invoke the same tool for parameter refinement Below are two examples. Please return the execution plan following this JSON format. Exampl...

work page

[5] [5]

The execution plan (execution plan)

work page

[6] [6]

Tool interface specifications (tool interfaces)

work page

[7] [7]

Current step information (current step) Your tasks are:

work page

[8] [8]

Execute strictly in the step order specified in the plan

work page

[9] [9]

Invoke tools using parameters specified in the plan

work page

[10] [10]

Use outputs from previous steps as dependency inputs for current step

work page

[11] [11]

step": 1,

Handle potential errors from tool invocations If tool invocation fails, you should: • Analyze the failure reason • Adjust parameters and retry (maximum 3 attempts) • If failures persist, mark the step as failed with explanation Please return execution results for each step in the following format: Execution Example: { "step": 1, "tool": "TechnicalParamete...

work page 2022

[12] [12]

INTRODUCTION You are invited to participate in an academic study aimed to develop a novel framework for assessing programming abilities with coding agents. Your participation will provide valuable scientific data for understanding the core value of programmers in the AI era and for improving future engineering education and technical interviews. This test...

work page

[13] [13]

10 minutes) to provide basic information, educational background, and technical experience

PROCEDURES If you agree to participate in this study, you will be asked to complete the following: •Pre-Test Questionnaire:You will first complete an online questionnaire (approx. 10 minutes) to provide basic information, educational background, and technical experience. This helps us ensure you meet the study’s criteria. •Programming Tasks & Conditions:I...

work page

[14] [14]

You may experience some stress or frustration

RISKS ANDBENEFITS •Risks:As approved by the IRB, the risks associated with this study are minimal. You may experience some stress or frustration. All your personal data will be strictly anonymized. •Benefits:You will gain insight into a novel evaluation method

work page

[15] [15]

Payment will be made via one of the following methods: Amazon Gift Card, PayPal, Zelle, Alipay, or WeChat

COMPENSATION Upon completion of all four programming tasks and the questionnaires, selected participants will be compensated with 40 USD or the equivalent amount in another currency. Payment will be made via one of the following methods: Amazon Gift Card, PayPal, Zelle, Alipay, or WeChat. The specific method will be determined in consultation with you aft...

work page

[16] [16]

•Anonymous Access:To ensure your anonymity, you will not use your personal GitHub account

CONFIDENTIALITY We will take strict measures to protect your privacy. •Anonymous Access:To ensure your anonymity, you will not use your personal GitHub account. You will be provided with a uniformly assigned, anonymous GitHub account to access Codespaces for the tasks. The credentials for this account will be sent to your registered email address. •Data U...

work page

[17] [17]

You may withdraw at any time without penalty

VOLUNTARYPARTICIPATION Your participation is voluntary. You may withdraw at any time without penalty. J.2 PRE-TESTQUESTIONNAIRE The following questionnaire was administered to screen and assign participants. For inclusion in this appendix, all interactive input fields have been removed. 34 Preprint. Work in progress. This questionnaire is designed to unde...

work page

[18] [18]

Which role are you applying for? (Select one) • Software Development Engineer (SDE) • Machine Learning Engineer (MLE) • Data Scientist (DS) PART2: DEMOGRAPHICINFORMATION

work page

[19] [19]

Gender: • Male • Female • Non-binary • Prefer not to say

work page

[20] [20]

Race/Ethnicity (Please select all that apply): • Arabic • Black or African American • East Asian • Hispanic or Latino • Native American • Native Hawaiian or Other Pacific Islander • South Asian • White • Other • Prefer not to say PART3: EDUCATIONALBACKGROUND

What is your current or highest level of education? • Year 1 Undergraduate • Year 2 Undergraduate • Year 3 Undergraduate • Year 4 Undergraduate • Master’s Student • PhD Student • Bachelor’s Graduate (not current student) • Master’s Graduate (not current student) • PhD Graduate

University/Institution Name:

Graduation Year / Expected Graduation Year:

How would you rate your overall academic performance in courses most relevant to your selected role? • Excellent (Top 10%) • Good (Top 10%-30%) • Average (Top 30%-60%) • Fair (Below Top 60%)

PART 4: ENGLISH PROFICIENCY

Is English your native language? (Yes / No)

(If No) Standardized English test scores (if applicable): • TOEFL • IELTS • Duolingo • CET-4 • CET-6 • Other • I have not taken any

PART 5: TECHNICAL & PROFESSIONAL EXPERIENCE

Number of relevant internships or full-time jobs: • 0 • 1 • 2 • 3 or more

Brief description of most relevant work experience:

Have you published any peer-reviewed research papers? (Yes / No)

(If Yes) List of significant publications or link to academic profile:

Have you completed any significant personal/open-source projects? (Yes / No)

(If Yes) Link or description of the project you are most proud of:

Frequency of recent programming tasks WITHOUT AI assistance: • Daily • A few times a week • A few times a month • Rarely • Almost never

PART 6: FAMILIARITY WITH DEVELOPMENT ENVIRONMENTS & AI TOOLS

Primarily used IDEs or code editors (Select all that apply): • Visual Studio Code (VS Code) • JetBrains IDEs (e.g., PyCharm, IntelliJ) • Vim / Neovim • Jupyter Notebook / JupyterLab • Other

On a scale of 1 to 5 (1 = Novice, 5 = Expert), please rate your proficiency with Visual Studio Code (VS Code):

On a scale of 1 to 5 (1 = Not familiar at all, 5 = Very familiar), please rate your familiarity with container-based development or cloud-based IDEs:

On a scale of 1 to 5 (1 = Never, 5 = Almost always), please rate your frequency of relying on AI-powered coding assistants in your daily workflow:

Usage of GitHub Copilot specifically: • I use it daily as my primary AI assistant • I use it frequently (a few times a week) • I use it occasionally • I have tried it but do not use it regularly • I have never used it

Other AI coding tools used:

J.3 POST-TEST QUESTIONNAIRE
The following questionnaire was administered after participants completed all tasks to collect subjective feedback.

PART 1: OVERALL EXPERIENCE & USABILITY

On a scale of 1 (Strongly Disagree) to 5 (Strongly Agree), please rate your agreement with the following statements:
• The GitHub Codespaces environment was stable and easy to use.
• The instructions in the README.md for each task were clear.
• The submission process (./scripts/submit.sh) was straightforward.

Did you encounter any significant technical issues or confusion? (Open-ended response)

PART 2: COMPARISON OF CONDITIONS (HUMAN-ONLY VS. HUMAN-AI)

Compared to tasks WITHOUT AI, how did tasks WITH AI affect your:
• Problem-Solving Speed: (Much Slower / Slower / About the Same / Faster / Much Faster)
• Final Solution Quality/Correctness: (Much Lower / Lower / About the Same / Higher / Much Higher)

On a scale of 1 to 5 (1 = Very Low, 5 = Very High), please rate the Mental Effort (Cognitive Load) for each condition: • Human-Only Condition: • Human-AI Collaboration Condition:

On a scale of 1 to 5 (1 = Not Confident at All, 5 = Very Confident), please rate your Confidence in your solution for each condition: • Human-Only Condition: • Human-AI Collaboration Condition:

In the Human-AI condition, which of the following roles did the AI play during your problem-solving process? (Select all that apply): • Brainstorming or exploring different solution strategies • Suggesting a fundamentally different approach or algorithm (including a change in the core algorithmic logic, different architectures, and the use of a completely...

On a scale of 1 (Strongly Disagree) to 5 (Strongly Agree), please rate your agreement with the following statements about the AI assistant:
• I trusted the code suggestions provided by the AI assistant.
• I felt the explanations from the AI assistant were reliable.

PART 3: ORDER EFFECTS

On a scale of 1 (Strongly Disagree) to 5 (Strongly Agree), please rate your agreement with the following statements:
• My performance in later tasks was influenced by the tasks I completed earlier.
• My strategy for Human-Only tasks was affected by my experience in Human-AI tasks.
• My strategy for Human-AI tasks was affected by my experience in Human-Only tasks.

If you felt there was an influence, please briefly describe it. (Open-ended response)

PART 4: TASK & EVALUATION FEEDBACK

On a scale of 1 to 5 (1 = Not at all realistic, 5 = Very realistic), please rate how well the tasks reflected real-world programming challenges:

On a scale of 1 (Strongly Disagree) to 5 (Strongly Agree), please rate your agreement with the report’s accuracy:
• Regarding Human-Only tasks:
  – The Functional Correctness score accurately reflected my performance.
  – The Efficiency Metrics accurately reflected my effort.
• Regarding Human-AI Collaboration tasks:
  – The Functional Correctness score accurately reflect...

Please explain your ratings on the evaluation report’s accuracy. (Open-ended response)

PART 5: FINAL OPEN-ENDED FEEDBACK

What was the most positive or satisfying part of your experience? (Open-ended response)

What was the most negative or frustrating part of your experience? (Open-ended response)

Easy” tasks are nearly as low as those for “Hard

Do you have any other suggestions for improving HAI-Eval? (Open-ended response)

J.4 EXPERIMENTAL PROTOCOL
This section outlines the full procedural workflow experienced by participants during a single session. It illustrates the evaluation process used in HAI-Eval for human developers and highlights the framework’s ecological validity. The entire protocol ...
