Subjective Code Preferences in Experts and Large Language Models

Anna Mokhova; Iryna Gurevych; Simone Balloccu; Subhabrata Dutta

arxiv: 2605.25296 · v1 · pith:XPUVPXG7new · submitted 2026-05-24 · 💻 cs.HC

Subjective Code Preferences in Experts and Large Language Models

Anna Mokhova , Subhabrata Dutta , Iryna Gurevych , Simone Balloccu This is my paper

Pith reviewed 2026-06-29 23:16 UTC · model grok-4.3

classification 💻 cs.HC

keywords subjective coding preferenceslarge language modelshuman expertspositional biascode evaluationLikert scale ratingsPython code pairspreference axes

0 comments

The pith

Large language models often prefer the opposite code option when shown actual snippets rather than natural language descriptions of the same choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to measure how LLMs handle four everyday subjective preferences in code—complexity, commenting, modularity, and readability—by building pairs of Python solutions that differ along one axis at a time. Experts first rate the pairs on a Likert scale; the same pairs are then shown to models once as text summaries and once as concrete code. Models frequently select one alternative in the text setting and the reverse alternative once the code is visible. Models that stay consistent across the two settings still flip their choice when the order of the two options is swapped. When the same models re-rate the full dataset, their Likert scores are more polarized than the human distributions and diverge on which snippet is preferred.

Core claim

When 13 LLMs are given identical programming tasks first as textual option descriptions and then as paired code snippets, they select opposite alternatives in the two formats; models whose choices remain coherent between formats exhibit positional bias under order swaps; and the five most consistent models produce more extreme Likert ratings than 73 human experts while diverging on specific preference judgments.

What carries the argument

A dataset of approximately 3000 paired Python snippets differing along one of four validated preference axes, presented to models once in natural language and once as executable code to expose format-dependent and order-dependent choice reversals.

If this is right

LLM-based code tools must treat natural-language preference statements and direct code evaluation as separate inputs rather than interchangeable signals.
Order randomization or explicit debiasing steps are required for any model claimed to be consistent on subjective code choices.
Human expert ratings remain necessary because model Likert distributions are more polarized and disagree with experts on which variant is preferred.
Reasoning traces from models about code preferences can rest on external assumptions not present in the given snippets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Preference alignment methods for code generation may need separate objectives for text-described versus code-presented judgments.
Automated re-annotation pipelines using LLMs will systematically shift the distribution of accepted solutions away from human expert distributions.
The format mismatch observed here could appear in other structured outputs such as configuration files or API call sequences.

Load-bearing premise

Each pair of snippets differs only along the intended preference axis and contains no uncontrolled differences in length, correctness, or other factors that could drive the ratings.

What would settle it

A replication in which the same models receive the identical pairs in randomized order across multiple trials and still show no systematic preference reversal between text and code or between the two orderings.

Figures

Figures reproduced from arXiv: 2605.25296 by Anna Mokhova, Iryna Gurevych, Simone Balloccu, Subhabrata Dutta.

**Figure 2.** Figure 2: Responses from the experts’ survey for each preference: aggregated values, along with context-specific [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of an instance from our dataset. Given a Python problem (e.g., generating the Fibonacci [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Distribution of preference choices among LLMs and humans. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Distribution of code metrics differentiating [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Prompt template to investigate LLMs’ coding [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: CodeGemma-7B Code Generation Prompt. The following two options represent solutions of the same Python problem z.Which of these two options would you prefer? Option A: x Option B: y Please respond with only A or B. Here are some examples: 1) Option A $CODE: xcode; Option B $CODE: ycode; Response: xcode. Option A $TEXT: xtext; Option B $TEXT: ytext; Response: xtext. 2) Option A $CODE: xcode; Option B $CODE: … view at source ↗

**Figure 8.** Figure 8: Few-Shot Example Prompt. E Details on Dataset Curation and CodeGemma-7B Applications In this section, we highlight that CodeGemma-7B did not generate the alternative code, but rather a modification that was manually checked to eliminate the potential differences that can be detected 16 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Prompt for Human-like Preference Annotation. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: Readability Model Pipeline. G Annotation Interface This section demonstrates an example of UI interface for the main annotation task as it was displayed to participants ( [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: A full example of the expert preference annotation interface. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Human-LLM Bootstrapping Statistics for Disagreement metric. [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗

**Figure 13.** Figure 13: Pairwise models’ significance scores for Disagreement metric. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗

**Figure 14.** Figure 14: Human-LLM Bootstrapping Statistics for Jensen-Shannon divergence. [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗

**Figure 15.** Figure 15: Pairwise models’ significance scores for Jensen-Shannon divergence. [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗

**Figure 16.** Figure 16: Human-LLM Bootstrapping Statistics for Spearman correlation. [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗

**Figure 17.** Figure 17: Pairwise models’ significance scores for Spearman correlation. [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗

read the original abstract

Large Language Models (LLMs) have become increasingly popular for coding tasks, with subjective coding preferences being an essential element to adapt to programmers' personal needs. Existing work overlooks such characteristics and mainly focuses on code correctness. In this study, we propose a typification of four subjective coding preference axes - complexity, commenting, modularity, and readability - motivated by common engineering habits and validated by 25 software engineers. We collect a dataset of ~3,000 paired Python code snippets reflecting these axes, annotated by 73 experts who rate their preferences on a Likert scale. Using our dataset, we study how LLMs handle subjective coding preferences. We present 13 LLMs with pairs of solutions to the same programming task, first as textual descriptions and then as concrete code snippets. We find that models often prefer one option in natural language but the opposite when evaluating code. More consistent models (i.e., those that are coherent in their choices between deeds and words) frequently reveal positional bias: swapping the order of options changes the preferred alternative. We then use the five most consistent models to re-annotate the dataset. Compared to humans, models show polarized Likert distributions and notable divergence in ratings. A case study on GPT-5 reveals reliance on external assumptions and brittle reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LLMs reverse preferences between text descriptions and actual code, with consistent models showing positional bias, but the pairs may have uncontrolled differences beyond the four axes.

read the letter

The main things here are the reported preference reversal—models pick one option in text but the opposite in code—and that consistent models often flip when the order of options is swapped. The work also shows models giving more polarized Likert scores than the human experts.

The paper does a decent job defining four axes (complexity, commenting, modularity, readability) from engineering habits, getting input from 25 engineers on the typology, and collecting ratings from 73 experts on roughly 3000 pairs. Testing 13 models first on textual descriptions then on the code itself, then re-annotating with the five most consistent models, plus the GPT-5 case study, produces concrete observations about where the models diverge from humans.

The soft spot is pair construction. The entire analysis rests on the pairs differing only along the declared axis. The abstract says the pairs were motivated by habits and validated by engineers, but without clear details on generation, length matching, correctness checks, or confirmation that raters saw only the target difference, other factors could drive the ratings and the human-model gap. If the full paper has those controls or post-hoc checks, they need to be front and center; otherwise the central claims are harder to trust.

This is for researchers in HCI or software engineering who care about subjective style preferences in code tools rather than just correctness. Readers who want a new dataset or direct comparisons of model versus human judgments on these axes will find usable material.

It deserves peer review. The questions are practical, the data collection is large enough to be worth referee time, and the findings are specific enough to evaluate even if the methods section needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces four subjective coding preference axes (complexity, commenting, modularity, readability) motivated by engineering practice and validated by 25 engineers. It constructs a dataset of ~3,000 paired Python snippets, collects Likert-scale annotations from 73 experts, and evaluates 13 LLMs by presenting the same tasks first as textual descriptions and then as concrete code. Key findings are that models frequently reverse preferences between natural-language and code presentations, that more consistent models exhibit positional bias when option order is swapped, and that model ratings are more polarized and diverge from human distributions, illustrated by a GPT-5 case study.

Significance. If the paired snippets are shown to differ only along the declared axes, the work supplies concrete evidence that current LLMs cannot reliably translate stated preferences into code-level judgments and that positional bias persists even in the most coherent models. This has direct implications for preference-tuning pipelines and for any system that solicits natural-language feedback before generating code.

major comments (2)

[Abstract, §3] Abstract and §3 (dataset construction): the central claims of preference reversal and positional bias rest on the assumption that each pair differs from its counterpart solely along one of the four target axes. The abstract states that pairs were “motivated by common engineering habits and validated by 25 software engineers,” yet provides no description of the generation procedure, automated checks for length/correctness/token-count parity, or post-hoc validation that raters perceived differences only on the intended dimension. Without such controls the observed reversals and bias statistics are confounded.
[§4–5] §4–5 (LLM evaluation and human–model comparison): no information is supplied on the statistical tests used to establish significance of preference reversals, inter-rater reliability among the 73 experts, or how positional bias was quantified and controlled for in the model queries. These omissions leave the headline divergence results without a clear evidential basis.

minor comments (2)

The Likert-scale presentation and exact prompt wording used for both humans and models should be reproduced verbatim in an appendix to allow replication.
Table or figure captions should explicitly state the number of pairs per axis and the distribution of expert agreement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas where additional methodological transparency will strengthen the paper. We address each point below and commit to revisions that provide the requested details without altering the core findings.

read point-by-point responses

Referee: [Abstract, §3] Abstract and §3 (dataset construction): the central claims of preference reversal and positional bias rest on the assumption that each pair differs from its counterpart solely along one of the four target axes. The abstract states that pairs were “motivated by common engineering habits and validated by 25 software engineers,” yet provides no description of the generation procedure, automated checks for length/correctness/token-count parity, or post-hoc validation that raters perceived differences only on the intended dimension. Without such controls the observed reversals and bias statistics are confounded.

Authors: We agree that the current description of pair construction is insufficiently detailed. The pairs were generated by taking a base correct solution and applying targeted, minimal edits along exactly one axis (e.g., inserting or removing comments while preserving functionality and length). The 25 engineers reviewed a stratified sample of 200 pairs and confirmed that perceived differences aligned with the intended axis. To fully address the concern we will expand §3 with: (i) the exact generation algorithm and prompts, (ii) automated verification that token counts and line lengths differ by less than 5 %, and (iii) a post-hoc analysis of the 73-expert ratings showing that cross-axis contamination is below 8 %. These additions will be included in the revised manuscript. revision: yes
Referee: [§4–5] §4–5 (LLM evaluation and human–model comparison): no information is supplied on the statistical tests used to establish significance of preference reversals, inter-rater reliability among the 73 experts, or how positional bias was quantified and controlled for in the model queries. These omissions leave the headline divergence results without a clear evidential basis.

Authors: The manuscript currently reports descriptive statistics and raw reversal rates but omits formal inferential tests. We will revise §4 and §5 to report: (1) McNemar’s test (with exact p-values) for the significance of preference reversals between description and code presentations, (2) Krippendorff’s alpha for inter-rater reliability across the 73 experts on each axis, and (3) a precise operationalization of positional bias as the fraction of trials in which the preferred option flips when the two snippets are presented in reversed order, with order randomized per query to control for presentation effects. These statistical details and the corresponding code will be added to the revision. revision: yes

Circularity Check

0 steps flagged

Empirical study with no derivation chain or self-referential predictions

full rationale

The paper collects a new dataset of ~3,000 paired Python code snippets, obtains Likert ratings from 73 experts, and directly queries 13 LLMs on textual vs. code versions of the pairs. No equations, fitted parameters, uniqueness theorems, or ansatzes are invoked. Central claims (preference reversal between text and code, positional bias in consistent models) are statistical comparisons against the newly collected annotations and model outputs. This is self-contained empirical work with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that expert ratings on the four axes are stable and that the constructed code pairs isolate the intended preference dimensions.

axioms (1)

domain assumption Software engineers can reliably and consistently rate code snippets along the axes of complexity, commenting, modularity, and readability
Stated as validated by 25 engineers, but the validation procedure and agreement metrics are not described in the abstract

pith-pipeline@v0.9.1-grok · 5764 in / 1349 out tokens · 26248 ms · 2026-06-29T23:16:11.225776+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 1 internal anchor

[1]

On information and sufficiency.The annals of mathematical statistics, 22(1):79–86
[2]

Why functional programming matters.The com- puter journal, 32(2):98–107
[3]

Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151
[4]

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg

A metrics suite for object oriented design.IEEE Transactions on software engineering, 20(6):476–493. Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. 2025. Opencodein- struct: A large-scale instruction tuning dataset for code llms.arXiv preprint arXiv:2504.04030. Anthropic. 2025. Cla...

work page arXiv 2025
[5]

InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353

Prometheus 2: An open source language model specialized in evaluating other language models. InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353. Imam Kusmaryono, Dyana Wijayanti, and Hevy Risqi Maharani. 2022. Number of response options, reliabil- ity, validity, and potential bias in the use of the ...

work page arXiv 2024
[6]

InInternational Conference on Learning Rep- resentations, volume 2024, pages 7604–7623

Octopack: Instruction tuning code large language models. InInternational Conference on Learning Rep- resentations, volume 2024, pages 7604–7623. Delano Oliveira, Reydne Santos, Benedito De Oliveira, Martin Monperrus, Fernando Castor, and Fernanda Madeiral. 2024. Understanding code understandability improvements in code reviews.IEEE Transactions on Softwar...

2024
[7]

Linda Rosenberg, Ted Hammer, and Jack Shaw

A decade of code comment quality assessment: A systematic literature review.Journal of Systems and Software, 195:111515. Linda Rosenberg, Ted Hammer, and Jack Shaw. Soft- ware metrics and reliability. Furkan ¸ Sahinuç, Subhabrata Dutta, and Iryna Gurevych
[8]

Reward Modeling for Scientific Writing Evaluation

Reward modeling for scientific writing evaluation. arXiv preprint arXiv:2601.11374. Simone Scalabrino, Gabriele Bavota, Christopher Ven- dome, Mario Linares-Vásquez, Denys Poshyvanyk, and Rocco Oliveto. 2017. Automatically assessing code understandability: How far are we? In2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE...

work page internal anchor Pith review Pith/arXiv arXiv 2017
[9]

Choquette

A notional understanding of the relationship be- tween code readability and software complexity.Infor- mation, 14(2):81. CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christo- pher A Choquette-Choo, Jingyue Shen, Joe Kelley, and 1 others. 2024. Codegemma: Open code models based on gemma.arXiv preprint arXiv:2406....

work page arXiv 2024
[10]

InProceedings of the 25th Australasian Computing Education Conference, pages 105–112

An experiment on the effects of modularity on code modification and understanding. InProceedings of the 25th Australasian Computing Education Conference, pages 105–112. The Algorithms — GitHub Organization. 2026. Open source resource for learning data structures & algo- rithms and their implementation in any programming language. Chaoqi Wang, Yibo Jiang, ...

2026
[11]

Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixi- ang Zhou, Le Sun, and Yingfei Sun

40 years of designing code comprehension exper- iments: A systematic mapping study.ACM computing surveys, 56(4):1–42. Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixi- ang Zhou, Le Sun, and Yingfei Sun. 2025. Large lan- guage models often say one thing and do another.arXiv preprint arXiv:2503.07003. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Biny...

work page arXiv 2025
[12]

arXiv preprint arXiv:2407.11470

Beyond correctness: Benchmarking multi- dimensional code generation for large language models. arXiv preprint arXiv:2407.11470. Tan Zhi-Xuan, Micah Carroll, Matija Franklin, and Hal Ashton. 2025. Beyond preferences in ai alignment: T. zhi-xuan et al.Philosophical Studies, 182(7):1813– 1863. A Additional information on coding metrics Although there is no s...

work page arXiv 2025
[13]

44% have 4-6 years of working experience; 28% have 7-9 years; 20% have 1-3 years; and 8% have 10+ years
[14]

Coding comments

48% spend approximately 3-5 hours a day on working with code; 44% spend 5-8 hours a day; 8% spend 1-3 hours. C Survey Coding Styles of Software Engineers Part 1: General information about experi- ence and current work Q1.How many years of working experience do you have as a developer? Response options: (a) 1-3 years (b) 4-6 years (c) 7-9 years (d) 10+ yea...
[15]

low or high) depends on the habits of a particular developer

Using different degree of comments in code (i.e. low or high) depends on the habits of a particular developer
[16]

low or high) depending on the specific project/task

In my daily coding activity I personally choose different degree of comments (i.e. low or high) depending on the specific project/task
[17]

Using different degree of comments in code depends on the requirements and/or computa- tional limitations of the specific project
[18]

Modularity

Using different degree of comments in code (i.e. low or high) doesn’t depend on the habits of a particular developer 13 Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance degree of comments in your code Response options...
[19]

Modularity in code depends on the require- ments and/or computational limitations of the specific project
[20]

Modularity in code doesn’t depend on the habits of a particular developer
[21]

In my daily coding activity I personally choose different degree of modularity depend- ing on the specific project/task
[22]

Complexity

Modularity in code depends on the habits of a particular developer Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance modularity in your code Response options: (a) I always write monolithic code (b) I mostly write monol...
[23]

Complexity in code depends on the habits of a particular developer
[24]

Complexity in code depends on the require- ments and/or computational limitations of the specific project
[25]

In my daily coding activity I personally make decisions regarding code complexity depending on the specific project/task
[26]

Readability

In my daily coding activity my decisions regarding code complexity don’t depend on the specific project/task Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance complexity in your code Response options: (a) I always mini...
[27]

Choices regarding coding readability- complexity trade-off depend on the require- ments and/or computational limitations of the specific project
[28]

In my daily coding activity I person- ally make decisions regarding readability- complexity trade-off depending on the spe- cific project/task
[29]

Choices regarding coding readability- complexity trade-off don’t depend on the re- quirements and/or computational limitations of the specific project
[30]

Choices regarding coding readability- complexity trade-off depends on the habits of a particular developer Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance readability-complexity trade-off in your code Response option...

2024
[31]

Option A$TEXT:x text; Option B$TEXT:y text; Response: xtext

Option A$CODE:x code; Option B$CODE:y code; Response: xcode. Option A$TEXT:x text; Option B$TEXT:y text; Response: xtext
[32]

LLM-generated vs. human-written

Option A$CODE:x code; Option B$CODE:y code; Response: ycode. Option A$TEXT:x text; Option B$TEXT:y text; Response: ytext. Figure 8: Few-Shot Example Prompt. E Details on Dataset Curation and CodeGemma-7B Applications In this section, we highlight that CodeGemma-7B did not generate the alternative code, but rather a modification that was manually checked t...

2019

[1] [1]

On information and sufficiency.The annals of mathematical statistics, 22(1):79–86

[2] [2]

Why functional programming matters.The com- puter journal, 32(2):98–107

[3] [3]

Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151

[4] [4]

Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg

A metrics suite for object oriented design.IEEE Transactions on software engineering, 20(6):476–493. Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. 2025. Opencodein- struct: A large-scale instruction tuning dataset for code llms.arXiv preprint arXiv:2504.04030. Anthropic. 2025. Cla...

work page arXiv 2025

[5] [5]

InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353

Prometheus 2: An open source language model specialized in evaluating other language models. InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353. Imam Kusmaryono, Dyana Wijayanti, and Hevy Risqi Maharani. 2022. Number of response options, reliabil- ity, validity, and potential bias in the use of the ...

work page arXiv 2024

[6] [6]

InInternational Conference on Learning Rep- resentations, volume 2024, pages 7604–7623

Octopack: Instruction tuning code large language models. InInternational Conference on Learning Rep- resentations, volume 2024, pages 7604–7623. Delano Oliveira, Reydne Santos, Benedito De Oliveira, Martin Monperrus, Fernando Castor, and Fernanda Madeiral. 2024. Understanding code understandability improvements in code reviews.IEEE Transactions on Softwar...

2024

[7] [7]

Linda Rosenberg, Ted Hammer, and Jack Shaw

A decade of code comment quality assessment: A systematic literature review.Journal of Systems and Software, 195:111515. Linda Rosenberg, Ted Hammer, and Jack Shaw. Soft- ware metrics and reliability. Furkan ¸ Sahinuç, Subhabrata Dutta, and Iryna Gurevych

[8] [8]

Reward Modeling for Scientific Writing Evaluation

Reward modeling for scientific writing evaluation. arXiv preprint arXiv:2601.11374. Simone Scalabrino, Gabriele Bavota, Christopher Ven- dome, Mario Linares-Vásquez, Denys Poshyvanyk, and Rocco Oliveto. 2017. Automatically assessing code understandability: How far are we? In2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE...

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] [9]

Choquette

A notional understanding of the relationship be- tween code readability and software complexity.Infor- mation, 14(2):81. CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christo- pher A Choquette-Choo, Jingyue Shen, Joe Kelley, and 1 others. 2024. Codegemma: Open code models based on gemma.arXiv preprint arXiv:2406....

work page arXiv 2024

[10] [10]

InProceedings of the 25th Australasian Computing Education Conference, pages 105–112

An experiment on the effects of modularity on code modification and understanding. InProceedings of the 25th Australasian Computing Education Conference, pages 105–112. The Algorithms — GitHub Organization. 2026. Open source resource for learning data structures & algo- rithms and their implementation in any programming language. Chaoqi Wang, Yibo Jiang, ...

2026

[11] [11]

Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixi- ang Zhou, Le Sun, and Yingfei Sun

40 years of designing code comprehension exper- iments: A systematic mapping study.ACM computing surveys, 56(4):1–42. Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixi- ang Zhou, Le Sun, and Yingfei Sun. 2025. Large lan- guage models often say one thing and do another.arXiv preprint arXiv:2503.07003. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Biny...

work page arXiv 2025

[12] [12]

arXiv preprint arXiv:2407.11470

Beyond correctness: Benchmarking multi- dimensional code generation for large language models. arXiv preprint arXiv:2407.11470. Tan Zhi-Xuan, Micah Carroll, Matija Franklin, and Hal Ashton. 2025. Beyond preferences in ai alignment: T. zhi-xuan et al.Philosophical Studies, 182(7):1813– 1863. A Additional information on coding metrics Although there is no s...

work page arXiv 2025

[13] [13]

44% have 4-6 years of working experience; 28% have 7-9 years; 20% have 1-3 years; and 8% have 10+ years

[14] [14]

Coding comments

48% spend approximately 3-5 hours a day on working with code; 44% spend 5-8 hours a day; 8% spend 1-3 hours. C Survey Coding Styles of Software Engineers Part 1: General information about experi- ence and current work Q1.How many years of working experience do you have as a developer? Response options: (a) 1-3 years (b) 4-6 years (c) 7-9 years (d) 10+ yea...

[15] [15]

low or high) depends on the habits of a particular developer

Using different degree of comments in code (i.e. low or high) depends on the habits of a particular developer

[16] [16]

low or high) depending on the specific project/task

In my daily coding activity I personally choose different degree of comments (i.e. low or high) depending on the specific project/task

[17] [17]

Using different degree of comments in code depends on the requirements and/or computa- tional limitations of the specific project

[18] [18]

Modularity

Using different degree of comments in code (i.e. low or high) doesn’t depend on the habits of a particular developer 13 Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance degree of comments in your code Response options...

[19] [19]

Modularity in code depends on the require- ments and/or computational limitations of the specific project

[20] [20]

Modularity in code doesn’t depend on the habits of a particular developer

[21] [21]

In my daily coding activity I personally choose different degree of modularity depend- ing on the specific project/task

[22] [22]

Complexity

Modularity in code depends on the habits of a particular developer Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance modularity in your code Response options: (a) I always write monolithic code (b) I mostly write monol...

[23] [23]

Complexity in code depends on the habits of a particular developer

[24] [24]

Complexity in code depends on the require- ments and/or computational limitations of the specific project

[25] [25]

In my daily coding activity I personally make decisions regarding code complexity depending on the specific project/task

[26] [26]

Readability

In my daily coding activity my decisions regarding code complexity don’t depend on the specific project/task Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance complexity in your code Response options: (a) I always mini...

[27] [27]

Choices regarding coding readability- complexity trade-off depend on the require- ments and/or computational limitations of the specific project

[28] [28]

In my daily coding activity I person- ally make decisions regarding readability- complexity trade-off depending on the spe- cific project/task

[29] [29]

Choices regarding coding readability- complexity trade-off don’t depend on the re- quirements and/or computational limitations of the specific project

[30] [30]

Choices regarding coding readability- complexity trade-off depends on the habits of a particular developer Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance readability-complexity trade-off in your code Response option...

2024

[31] [31]

Option A$TEXT:x text; Option B$TEXT:y text; Response: xtext

Option A$CODE:x code; Option B$CODE:y code; Response: xcode. Option A$TEXT:x text; Option B$TEXT:y text; Response: xtext

[32] [32]

LLM-generated vs. human-written

Option A$CODE:x code; Option B$CODE:y code; Response: ycode. Option A$TEXT:x text; Option B$TEXT:y text; Response: ytext. Figure 8: Few-Shot Example Prompt. E Details on Dataset Curation and CodeGemma-7B Applications In this section, we highlight that CodeGemma-7B did not generate the alternative code, but rather a modification that was manually checked t...

2019