pith. machine review for the scientific record.

arxiv: 2604.27727 · v1 · submitted 2026-04-30 · 💻 cs.SE

Recognition: unknown

LLM-as-a-Judge for Human-AI Co-Creation: A Reliability-Aware Evaluation Framework for Coding

Pith reviewed 2026-05-07 08:32 UTC · model grok-4.3

classification 💻 cs.SE
keywords LLM-as-a-Judge · human-AI co-creation · coding evaluation · reliability metrics · trajectory analysis · rubric-based assessment · software engineering

The pith

A rubric-driven framework with schema constraints and grouped splitting lets LLM judges evaluate human-AI coding co-creation reliably and comparably across models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an evaluation system for using LLMs as judges in multi-turn human-AI programming sessions. It enforces structured outputs through rubrics and validation steps, then splits data by user and problem to block information leakage while supplying participant context to judges. Multiple judges are scored on discrimination, calibration, and agreement, and the same data is analyzed for co-creation patterns such as turn-wise success rates and revision behavior. If the approach holds, teams can run consistent, auditable assessments of AI-assisted coding without constant human re-labeling. The work also shows that successful outcomes cluster early and that revision styles vary widely.
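
The pipeline itself is not reproduced on this page, but the turn-wise view is easy to make concrete. Below is a minimal Python sketch of a Success-at-Turn curve, assuming each session is encoded as a list of per-turn success flags; the encoding is illustrative, not the authors' data format.

```python
def success_at_turn(sessions: list[list[bool]], turn: int) -> float:
    """Share of sessions judged successful by `turn` (1-indexed).

    A session counts as successful once any of its first `turn` observed
    turns carries a success flag, so the curve is cumulative in turns.
    """
    if not sessions:
        return 0.0
    hits = sum(any(flags[:turn]) for flags in sessions)
    return hits / len(sessions)

# Toy data: three sessions encoded as per-turn success flags.
sessions = [
    [True],                        # solved at the first observed turn
    [False, False, True],          # solved at turn 3
    [False, False, False, False],  # never solved
]
print([round(success_at_turn(sessions, t), 3) for t in range(1, 5)])
# -> [0.333, 0.333, 0.667, 0.667]
```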

Core claim

The framework combines schema-constrained judge outputs, automated repair for invalid responses, grouped splitting by user and problem, and participant-level non-blind context to produce LLM judgments that are comparable across models and linked to trajectory signals like Success-at-Turn and revision churn. On the reported data the best judges reach 0.5937 ROC-AUC, 0.6904 PR-AUC, and 0.5000 MCC on held-out sets, while co-creation success rises to 0.8533 at the first observed turn and stabilizes near 0.8641 by turn 6, with heterogeneous revision patterns.
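
The headline numbers are standard classification metrics. As a minimal sketch of how such scores fall out of judge confidences and binary outcome labels (illustrative data, computed with scikit-learn):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, matthews_corrcoef

# Illustrative data only: judge confidence that each held-out trajectory
# succeeded, plus the binary outcome label the judge is scored against.
y_true  = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_score = np.array([0.9, 0.4, 0.7, 0.6, 0.5, 0.8, 0.3, 0.2])

roc_auc = roc_auc_score(y_true, y_score)            # ranking quality (ROC-AUC)
pr_auc  = average_precision_score(y_true, y_score)  # PR-AUC (average precision)

# MCC scores a hard decision, so the confidences are thresholded first;
# the paper reports per-judge selected thresholds, 0.5 here is arbitrary.
y_pred = (y_score >= 0.5).astype(int)
mcc = matthews_corrcoef(y_true, y_pred)

print(f"ROC-AUC={roc_auc:.4f}  PR-AUC={pr_auc:.4f}  MCC={mcc:.4f}")
```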

What carries the argument

Rubric-driven LLM-as-a-Judge with schema-constrained outputs, validation-repair loops, and grouped splitting by user and problem.
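
The paper's exact rubric, schema, and prompts are not reproduced on this page (the referee's second minor comment below flags this), so the following is only a generic sketch of a schema-constrained validation-repair loop: `call_judge` stands in for whatever LLM API the judge runs on, and the three-field schema is invented for illustration.

```python
import json
from jsonschema import ValidationError, validate  # pip install jsonschema

# Invented three-field rubric schema, for illustration only.
JUDGE_SCHEMA = {
    "type": "object",
    "properties": {
        "success": {"type": "boolean"},
        "confidence": {"type": "number", "minimum": 0.0, "maximum": 1.0},
        "rationale": {"type": "string"},
    },
    "required": ["success", "confidence", "rationale"],
}

def judge_with_repair(call_judge, prompt: str, max_attempts: int = 3) -> dict:
    """Ask for a schema-conforming JSON verdict, re-prompting on failure.

    `call_judge(prompt) -> str` is a placeholder for an LLM API call.
    """
    last_error = None
    for _ in range(max_attempts):
        raw = call_judge(prompt)
        try:
            verdict = json.loads(raw)
            validate(instance=verdict, schema=JUDGE_SCHEMA)
            return verdict  # passed both parsing and schema validation
        except (json.JSONDecodeError, ValidationError) as err:
            last_error = err
            # Repair step: feed the error back so the model can correct itself.
            prompt += (f"\n\nYour previous reply was invalid ({err}); "
                       "respond with JSON matching the schema and nothing else.")
    raise RuntimeError(f"judge output never validated: {last_error}")
```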

If this is right

  • Co-creation success concentrates early, exceeding 85 percent by the first observed turn and leveling near 86 percent by turn 6.
  • Revision behavior stays heterogeneous, so progress can occur through either small incremental edits or larger restructurings.
  • The strongest judges reach roughly 0.59 ROC-AUC and 0.50 MCC on held-out data under the reported protocol.
  • Inter-judge agreement remains modest, with mean pairwise Cohen's kappa around 0.16 and Fleiss' kappa around 0.07.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same splitting and rubric structure could be reused to compare LLM judges in other open-ended creative tasks such as design or writing.
  • Low inter-judge agreement suggests that future work may need ensembles or targeted fine-tuning to raise consistency without losing the framework's auditability.
  • Early success clustering implies that real-time feedback tools should focus support on the first few turns rather than later debugging stages.
  • Trajectory signals such as time-to-success and CodeBLEU could be turned into live developer dashboards if the judging pipeline is made lightweight; a minimal revision-churn computation in that spirit is sketched after this list.
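
On the churn side of those trajectory signals: the paper's figures reference a prompt-code normalized edit distance (NED), and a standard-library proxy for revision churn is Levenshtein distance between consecutive code snapshots, normalized by the longer snapshot's length. A minimal sketch under that assumption, not the authors' exact definition:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def revision_churn(snapshots: list[str]) -> list[float]:
    """Normalized edit distance between each pair of consecutive snapshots."""
    return [
        levenshtein(old, new) / max(len(old), len(new), 1)
        for old, new in zip(snapshots, snapshots[1:])
    ]

# Toy trajectory: a small incremental edit, then a larger restructuring.
print(revision_churn(["print(1)", "print(1)\nprint(2)", "print(2)"]))
```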

Load-bearing premise

Schema constraints plus grouped splitting are enough to make LLM judgments reliable and comparable without large-scale human validation.
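
Because the premise leans on splitting discipline, it helps to make "grouped by user and problem" concrete. A minimal scikit-learn sketch under one plausible reading, a composite user-problem group key so that no trajectory's turns straddle the train/test boundary (a stricter reading would also hold each user or problem wholly on one side):

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# Illustrative trajectory table: one row per turn, tagged with session ids.
users    = np.array(["u1", "u1", "u2", "u2", "u3", "u3", "u3", "u4"])
problems = np.array(["p1", "p1", "p1", "p2", "p2", "p2", "p1", "p1"])
X = np.arange(len(users)).reshape(-1, 1)  # stand-in features

# One group per (user, problem) pair: every turn of a trajectory shares a
# group, so no trajectory leaks across the train/test boundary.
groups = np.array([f"{u}|{p}" for u, p in zip(users, problems)])

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, groups=groups))
print("train groups:", sorted(set(groups[train_idx])))
print("test groups: ", sorted(set(groups[test_idx])))
```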

What would settle it

A new set of coding trajectories where the same rubric and splitting rules produce LLM judgments that still disagree with human raters on more than half the cases or yield Cohen's kappa below 0.2 across judges.
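
The kappa bar named here is a standard computation. A minimal sketch of mean pairwise Cohen's kappa and Fleiss' kappa over illustrative judge verdicts, using scikit-learn and statsmodels:

```python
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Illustrative binary verdicts from three judges over the same six items.
judges = np.array([
    [1, 0, 1, 1, 0, 1],  # judge A
    [1, 1, 1, 0, 0, 1],  # judge B
    [0, 0, 1, 1, 1, 1],  # judge C
])

# Mean pairwise Cohen's kappa across all judge pairs.
pairwise = [cohen_kappa_score(judges[i], judges[j])
            for i, j in combinations(range(len(judges)), 2)]
print(f"mean pairwise Cohen's kappa = {np.mean(pairwise):.4f}")

# Fleiss' kappa wants an items x categories count table.
counts, _ = aggregate_raters(judges.T)  # rows: items, cols: category counts
print(f"Fleiss' kappa = {fleiss_kappa(counts):.4f}")
```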

Figures

Figures reproduced from arXiv: 2604.27727 by Daniel M. Muepu, Haruto Suzuki, Kenta Nanaumi, Md Faizul Ibne Amin, Md Mostafizer Rahman, Yutaka Watanobe.

Figure 1: Overview of the LLM-as-a-Judge and human-AI co-creation workflow. view at source ↗
Figure 3: Judge-level reliability curves; the extracted caption spills body text reporting per-judge MCC_test values and selected thresholds (0.5000 at a relatively high threshold of 0.81, supporting a conservative accept decision; Claude 0.3371 at 0.01; OpenAI 0.0755 at 0.04; Gemini 0.1348 at 0.99). view at source ↗
Figure 2: Judge-level curves for (a) ROC-AUC, (b) PR-AUC. view at source ↗
Figure 4: Cross-metric comparison: (a) ROC-AUC/PR-AUC vs. … view at source ↗
Figure 6: Histogram of prompt-code NED on TEST. view at source ↗
Figure 8: Struggle curve of mean judge confidence across ob… view at source ↗
Figure 11: Participant-level prompt-space map using TF-IDF and … view at source ↗
Figure 12: Relationship between code churn and turn-wise im… view at source ↗
Figure 14: Distribution of CodeBLEU similarity to the first … view at source ↗

original abstract

LLMs are increasingly employed both as judges for evaluating open-ended outputs and as co-creation partners in AI-assisted programming; yet rigorous evaluation in human-AI co-creation settings remains underdeveloped as judgments must be reliable, comparable across models, and interpretable over multi-turn interaction. To address this gap, a rubric-driven LLM-as-a-Judge framework is presented for contest-style human-AI co-creation in coding and software engineering (SE). The framework is built around schema-constrained judge outputs, validation and repair mechanisms, grouped and split by user and problem to prevent trajectory leakage, and participant-level NONBLIND context. Multiple LLM judges are assessed through a multi-metric protocol covering discrimination (ROC-AUC, PR-AUC), thresholded decision quality (MCC), probabilistic reliability (LogLoss, Brier score, ECE), and inter-judge agreement (Cohen's and Fleiss' k). Human-AI co-creation is further examined through trajectory-level signals, including turn-wise confidence, Success-at-Turn, time-to-success, revision churn, and CodeBLEU. Co-creation success is found to concentrate early, with Success-at-Turn rising to 0.8533 at the first observed turn and stabilizing at 0.8641 by turn 6. Revision behavior, however, remains heterogeneous, suggesting that productive progress can emerge through either incremental refinement or broader restructuring. On the judging side, the best held-out scores reach 0.5937 for ROC-AUC, 0.6904 for PR-AUC, and 0.5000 for MCC test, while inter-judge consistency remains modest overall (mean pairwise Cohen's k = 0.1592, Fleiss' k = 0.0696). Taken together, this work offers an auditable and reproducible evaluation methodology that links reliability-aware LLM judging with trajectory-based analysis of human-AI co-creation, providing a practical evaluation template for future AI-assisted coding and SE.
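
Of the calibration metrics the abstract names, LogLoss and Brier score come straight from scikit-learn, while ECE is usually a hand-rolled binned statistic with several variants in the literature; the paper's exact binning is not specified here. A minimal sketch on illustrative data:

```python
import numpy as np
from sklearn.metrics import brier_score_loss, log_loss

# Illustrative judge probabilities and binary outcomes.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_prob = np.array([0.9, 0.4, 0.7, 0.6, 0.5, 0.8, 0.3, 0.2])

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """Binned ECE: occupancy-weighted gap between mean predicted probability
    and empirical positive rate per bin (one common variant of ECE)."""
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_prob[mask].mean() - y_true[mask].mean())
    return ece

print(f"LogLoss = {log_loss(y_true, y_prob):.4f}")
print(f"Brier   = {brier_score_loss(y_true, y_prob):.4f}")
print(f"ECE     = {expected_calibration_error(y_true, y_prob):.4f}")
```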

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a rubric-driven LLM-as-a-Judge framework for evaluating human-AI co-creation trajectories in contest-style coding and software engineering tasks. Key design elements include schema-constrained judge outputs, validation/repair mechanisms, user- and problem-grouped data splitting to prevent leakage, and participant-level non-blind context. LLM judges are assessed via a multi-metric protocol (ROC-AUC, PR-AUC, MCC, LogLoss, Brier, ECE, Cohen's/Fleiss' kappa), while co-creation is analyzed through turn-wise signals such as Success-at-Turn (reaching 0.8533 at turn 1 and 0.8641 by turn 6), revision churn, and CodeBLEU. The work reports modest held-out performance (best ROC-AUC 0.5937, MCC 0.5000) and low inter-judge agreement (mean Cohen's k=0.1592, Fleiss' k=0.0696), yet positions the framework as an auditable, reproducible template for future AI-assisted SE evaluation.

Significance. If the grouped splitting, schema constraints, and multi-metric protocol can be shown to produce judgments that are meaningfully more reliable and comparable than standard LLM prompting, the paper would supply a reusable evaluation template that links judging reliability to trajectory-level insights in human-AI coding. The explicit reporting of low agreement and discrimination metrics, together with the emphasis on leakage prevention, constitutes a strength in methodological transparency. However, the modest empirical results limit the immediate practical impact unless the design choices are demonstrated to drive the observed (limited) performance.

major comments (3)
  1. Abstract and evaluation results: The central claim that schema-constrained outputs plus grouped splitting yield 'reliable' and 'comparable across models' judgments is load-bearing for the contribution, yet the reported metrics (ROC-AUC 0.5937, MCC 0.5000, Fleiss' k=0.0696) show only marginal discrimination and near-chance consistency. Without ablations that isolate the contribution of each component (e.g., schema vs. no-schema, grouped vs. random split) or a human inter-rater baseline on the same items, it is unclear whether the framework improves reliability or merely standardizes noisy outputs.
  2. Trajectory analysis section: Success-at-Turn is stated to stabilize at 0.8641 by turn 6, but the operational definition of 'success' (whether it derives from the LLM judge, an external oracle, or human annotation) and its validation against ground truth are not specified. This directly affects the interpretability of all downstream signals (time-to-success, revision churn) and their linkage to the reliability-aware judging framework.
  3. Framework description: The paper asserts that the rubric-driven approach with schema constraints and grouped splitting 'ensures reliable LLM judgments comparable across models without needing extensive human validation.' Given the low Fleiss' k and ROC-AUC, this assumption requires explicit testing; a direct head-to-head with human judgments on a held-out subset would be necessary to substantiate that the framework reduces rather than conceals the need for human oversight.
minor comments (2)
  1. The acronym 'NONBLIND' is introduced without expansion or precise definition; clarifying whether it denotes full participant history, problem context, or another form of conditioning would aid reproducibility.
  2. The exact rubric text, JSON schema, and prompt templates used for the LLM judges are not reproduced in the main text; providing them (or a link to supplementary material) would strengthen the auditable/reproducible claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful for the referee's insightful comments, which highlight important areas for improving the clarity and rigor of our work. We address each major comment in detail below, indicating where revisions will be made to the manuscript.

point-by-point responses
  1. Referee: Abstract and evaluation results: The central claim that schema-constrained outputs plus grouped splitting yield 'reliable' and 'comparable across models' judgments is load-bearing for the contribution, yet the reported metrics (ROC-AUC 0.5937, MCC 0.5000, Fleiss' k=0.0696) show only marginal discrimination and near-chance consistency. Without ablations that isolate the contribution of each component (e.g., schema vs. no-schema, grouped vs. random split) or a human inter-rater baseline on the same items, it is unclear whether the framework improves reliability or merely standardizes noisy outputs.

    Authors: We recognize that the modest ROC-AUC of 0.5937 and low Fleiss' kappa of 0.0696 indicate limited discrimination and agreement, which we report explicitly to promote transparency. The framework's value is in its design for preventing data leakage through user- and problem-grouped splitting, schema-constrained outputs for consistency, and a comprehensive multi-metric evaluation protocol. We did not perform the suggested ablations in the original submission but will add them in the revision, including comparisons of schema-constrained versus unconstrained prompting and grouped versus random splits. A full human inter-rater baseline on the held-out items would strengthen the work further; we will include a discussion of this as a limitation and outline plans for such validation in future extensions, while maintaining that the current approach provides a reproducible template with quantified reliability. revision: partial

  2. Referee: Trajectory analysis section: Success-at-Turn is stated to stabilize at 0.8641 by turn 6, but the operational definition of 'success' (whether it derives from the LLM judge, an external oracle, or human annotation) and its validation against ground truth are not specified. This directly affects the interpretability of all downstream signals (time-to-success, revision churn) and their linkage to the reliability-aware judging framework.

    Authors: The Success-at-Turn metric is derived from the LLM judge's assessment of whether the co-creation trajectory has reached a successful state by that turn, based on the rubric and schema. We will clarify this operational definition in the revised manuscript, specify that it relies on the judge outputs (whose reliability is evaluated separately), and provide details on any cross-validation with human annotations or external oracles available in the dataset. This will better link the trajectory analysis to the judging framework. revision: yes

  3. Referee: Framework description: The paper asserts that the rubric-driven approach with schema constraints and grouped splitting 'ensures reliable LLM judgments comparable across models without needing extensive human validation.' Given the low Fleiss' k and ROC-AUC, this assumption requires explicit testing; a direct head-to-head with human judgments on a held-out subset would be necessary to substantiate that the framework reduces rather than conceals the need for human oversight.

    Authors: The assertion in the framework description was intended to convey that the structured approach with constraints and splitting reduces arbitrariness and leakage compared to standard prompting, thereby making judgments more comparable across models with less ad-hoc human intervention. However, given the empirical results, we agree the wording should be tempered. In the revision, we will revise the language to emphasize that the framework quantifies reliability and provides auditability, while acknowledging that it does not eliminate the value of human validation. We will also add a note on the need for further head-to-head comparisons with human judgments on held-out data. revision: yes

Circularity Check

0 steps flagged

No significant circularity; framework and metrics are empirically derived on held-out data.

full rationale

The paper defines a rubric-driven LLM-as-a-Judge framework via explicit design choices (schema-constrained outputs, validation/repair, user/problem-grouped splitting, NONBLIND context) that are motivated by leakage prevention and reliability goals rather than by fitting to any target metric. These choices are then evaluated post-hoc with standard discrimination, calibration, and agreement metrics (ROC-AUC, MCC, Fleiss' k, etc.) computed on held-out splits; the reported values are presented as observed outcomes, not as quantities forced by the framework definition. No equations, self-definitional loops, or load-bearing self-citations reduce the central methodology or its performance claims to tautologies. Minor self-citation of prior LLM-judge work may exist but is not used to justify uniqueness or to substitute for the present empirical results. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on standard ML metrics and assumptions about rubric effectiveness. No free parameters or new entities detailed in abstract.

axioms (1)
  • domain assumption LLM outputs can be reliably constrained and validated via schemas for judging coding tasks.
    Core to the framework but untested in abstract.

pith-pipeline@v0.9.0 · 10321 in / 1080 out tokens · 98862 ms · 2026-05-07T08:32:46.659991+00:00 · methodology

