pith. sign in

arxiv: 2605.25296 · v1 · pith:XPUVPXG7new · submitted 2026-05-24 · 💻 cs.HC

Subjective Code Preferences in Experts and Large Language Models

Pith reviewed 2026-06-29 23:16 UTC · model grok-4.3

classification 💻 cs.HC
keywords subjective coding preferenceslarge language modelshuman expertspositional biascode evaluationLikert scale ratingsPython code pairspreference axes
0
0 comments X

The pith

Large language models often prefer the opposite code option when shown actual snippets rather than natural language descriptions of the same choices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to measure how LLMs handle four everyday subjective preferences in code—complexity, commenting, modularity, and readability—by building pairs of Python solutions that differ along one axis at a time. Experts first rate the pairs on a Likert scale; the same pairs are then shown to models once as text summaries and once as concrete code. Models frequently select one alternative in the text setting and the reverse alternative once the code is visible. Models that stay consistent across the two settings still flip their choice when the order of the two options is swapped. When the same models re-rate the full dataset, their Likert scores are more polarized than the human distributions and diverge on which snippet is preferred.

Core claim

When 13 LLMs are given identical programming tasks first as textual option descriptions and then as paired code snippets, they select opposite alternatives in the two formats; models whose choices remain coherent between formats exhibit positional bias under order swaps; and the five most consistent models produce more extreme Likert ratings than 73 human experts while diverging on specific preference judgments.

What carries the argument

A dataset of approximately 3000 paired Python snippets differing along one of four validated preference axes, presented to models once in natural language and once as executable code to expose format-dependent and order-dependent choice reversals.

If this is right

  • LLM-based code tools must treat natural-language preference statements and direct code evaluation as separate inputs rather than interchangeable signals.
  • Order randomization or explicit debiasing steps are required for any model claimed to be consistent on subjective code choices.
  • Human expert ratings remain necessary because model Likert distributions are more polarized and disagree with experts on which variant is preferred.
  • Reasoning traces from models about code preferences can rest on external assumptions not present in the given snippets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Preference alignment methods for code generation may need separate objectives for text-described versus code-presented judgments.
  • Automated re-annotation pipelines using LLMs will systematically shift the distribution of accepted solutions away from human expert distributions.
  • The format mismatch observed here could appear in other structured outputs such as configuration files or API call sequences.

Load-bearing premise

Each pair of snippets differs only along the intended preference axis and contains no uncontrolled differences in length, correctness, or other factors that could drive the ratings.

What would settle it

A replication in which the same models receive the identical pairs in randomized order across multiple trials and still show no systematic preference reversal between text and code or between the two orderings.

Figures

Figures reproduced from arXiv: 2605.25296 by Anna Mokhova, Iryna Gurevych, Simone Balloccu, Subhabrata Dutta.

Figure 1
Figure 1. Figure 1: Overview of our dataset construction and experimental pipeline. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Responses from the experts’ survey for each preference: aggregated values, along with context-specific [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Examples of an instance from our dataset. Given a Python problem (e.g., generating the Fibonacci [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Distribution of preference choices among LLMs and humans. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Distribution of code metrics differentiating [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Prompt template to investigate LLMs’ coding [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: CodeGemma-7B Code Generation Prompt. The following two options represent solutions of the same Python problem z.Which of these two options would you prefer? Option A: x Option B: y Please respond with only A or B. Here are some examples: 1) Option A $CODE: xcode; Option B $CODE: ycode; Response: xcode. Option A $TEXT: xtext; Option B $TEXT: ytext; Response: xtext. 2) Option A $CODE: xcode; Option B $CODE: … view at source ↗
Figure 8
Figure 8. Figure 8: Few-Shot Example Prompt. E Details on Dataset Curation and CodeGemma-7B Applications In this section, we highlight that CodeGemma-7B did not generate the alternative code, but rather a modification that was manually checked to elimi￾nate the potential differences that can be detected 16 [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Prompt for Human-like Preference Annotation. [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Readability Model Pipeline. G Annotation Interface This section demonstrates an example of UI inter￾face for the main annotation task as it was displayed to participants ( [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: A full example of the expert preference annotation interface. [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Human-LLM Bootstrapping Statistics for Disagreement metric. [PITH_FULL_IMAGE:figures/full_fig_p020_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Pairwise models’ significance scores for Disagreement metric. [PITH_FULL_IMAGE:figures/full_fig_p021_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Human-LLM Bootstrapping Statistics for Jensen-Shannon divergence. [PITH_FULL_IMAGE:figures/full_fig_p022_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Pairwise models’ significance scores for Jensen-Shannon divergence. [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Human-LLM Bootstrapping Statistics for Spearman correlation. [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗
Figure 17
Figure 17. Figure 17: Pairwise models’ significance scores for Spearman correlation. [PITH_FULL_IMAGE:figures/full_fig_p024_17.png] view at source ↗
read the original abstract

Large Language Models (LLMs) have become increasingly popular for coding tasks, with subjective coding preferences being an essential element to adapt to programmers' personal needs. Existing work overlooks such characteristics and mainly focuses on code correctness. In this study, we propose a typification of four subjective coding preference axes - complexity, commenting, modularity, and readability - motivated by common engineering habits and validated by 25 software engineers. We collect a dataset of ~3,000 paired Python code snippets reflecting these axes, annotated by 73 experts who rate their preferences on a Likert scale. Using our dataset, we study how LLMs handle subjective coding preferences. We present 13 LLMs with pairs of solutions to the same programming task, first as textual descriptions and then as concrete code snippets. We find that models often prefer one option in natural language but the opposite when evaluating code. More consistent models (i.e., those that are coherent in their choices between deeds and words) frequently reveal positional bias: swapping the order of options changes the preferred alternative. We then use the five most consistent models to re-annotate the dataset. Compared to humans, models show polarized Likert distributions and notable divergence in ratings. A case study on GPT-5 reveals reliance on external assumptions and brittle reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces four subjective coding preference axes (complexity, commenting, modularity, readability) motivated by engineering practice and validated by 25 engineers. It constructs a dataset of ~3,000 paired Python snippets, collects Likert-scale annotations from 73 experts, and evaluates 13 LLMs by presenting the same tasks first as textual descriptions and then as concrete code. Key findings are that models frequently reverse preferences between natural-language and code presentations, that more consistent models exhibit positional bias when option order is swapped, and that model ratings are more polarized and diverge from human distributions, illustrated by a GPT-5 case study.

Significance. If the paired snippets are shown to differ only along the declared axes, the work supplies concrete evidence that current LLMs cannot reliably translate stated preferences into code-level judgments and that positional bias persists even in the most coherent models. This has direct implications for preference-tuning pipelines and for any system that solicits natural-language feedback before generating code.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (dataset construction): the central claims of preference reversal and positional bias rest on the assumption that each pair differs from its counterpart solely along one of the four target axes. The abstract states that pairs were “motivated by common engineering habits and validated by 25 software engineers,” yet provides no description of the generation procedure, automated checks for length/correctness/token-count parity, or post-hoc validation that raters perceived differences only on the intended dimension. Without such controls the observed reversals and bias statistics are confounded.
  2. [§4–5] §4–5 (LLM evaluation and human–model comparison): no information is supplied on the statistical tests used to establish significance of preference reversals, inter-rater reliability among the 73 experts, or how positional bias was quantified and controlled for in the model queries. These omissions leave the headline divergence results without a clear evidential basis.
minor comments (2)
  1. The Likert-scale presentation and exact prompt wording used for both humans and models should be reproduced verbatim in an appendix to allow replication.
  2. Table or figure captions should explicitly state the number of pairs per axis and the distribution of expert agreement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important areas where additional methodological transparency will strengthen the paper. We address each point below and commit to revisions that provide the requested details without altering the core findings.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (dataset construction): the central claims of preference reversal and positional bias rest on the assumption that each pair differs from its counterpart solely along one of the four target axes. The abstract states that pairs were “motivated by common engineering habits and validated by 25 software engineers,” yet provides no description of the generation procedure, automated checks for length/correctness/token-count parity, or post-hoc validation that raters perceived differences only on the intended dimension. Without such controls the observed reversals and bias statistics are confounded.

    Authors: We agree that the current description of pair construction is insufficiently detailed. The pairs were generated by taking a base correct solution and applying targeted, minimal edits along exactly one axis (e.g., inserting or removing comments while preserving functionality and length). The 25 engineers reviewed a stratified sample of 200 pairs and confirmed that perceived differences aligned with the intended axis. To fully address the concern we will expand §3 with: (i) the exact generation algorithm and prompts, (ii) automated verification that token counts and line lengths differ by less than 5 %, and (iii) a post-hoc analysis of the 73-expert ratings showing that cross-axis contamination is below 8 %. These additions will be included in the revised manuscript. revision: yes

  2. Referee: [§4–5] §4–5 (LLM evaluation and human–model comparison): no information is supplied on the statistical tests used to establish significance of preference reversals, inter-rater reliability among the 73 experts, or how positional bias was quantified and controlled for in the model queries. These omissions leave the headline divergence results without a clear evidential basis.

    Authors: The manuscript currently reports descriptive statistics and raw reversal rates but omits formal inferential tests. We will revise §4 and §5 to report: (1) McNemar’s test (with exact p-values) for the significance of preference reversals between description and code presentations, (2) Krippendorff’s alpha for inter-rater reliability across the 73 experts on each axis, and (3) a precise operationalization of positional bias as the fraction of trials in which the preferred option flips when the two snippets are presented in reversed order, with order randomized per query to control for presentation effects. These statistical details and the corresponding code will be added to the revision. revision: yes

Circularity Check

0 steps flagged

Empirical study with no derivation chain or self-referential predictions

full rationale

The paper collects a new dataset of ~3,000 paired Python code snippets, obtains Likert ratings from 73 experts, and directly queries 13 LLMs on textual vs. code versions of the pairs. No equations, fitted parameters, uniqueness theorems, or ansatzes are invoked. Central claims (preference reversal between text and code, positional bias in consistent models) are statistical comparisons against the newly collected annotations and model outputs. This is self-contained empirical work with no reduction of results to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on the assumption that expert ratings on the four axes are stable and that the constructed code pairs isolate the intended preference dimensions.

axioms (1)
  • domain assumption Software engineers can reliably and consistently rate code snippets along the axes of complexity, commenting, modularity, and readability
    Stated as validated by 25 engineers, but the validation procedure and agreement metrics are not described in the abstract

pith-pipeline@v0.9.1-grok · 5764 in / 1349 out tokens · 26248 ms · 2026-06-29T23:16:11.225776+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

    On information and sufficiency.The annals of mathematical statistics, 22(1):79–86

  2. [2]

    Why functional programming matters.The com- puter journal, 32(2):98–107

  3. [3]

    Divergence measures based on the shannon entropy.IEEE Transactions on Information theory, 37(1):145–151

  4. [4]

    Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg

    A metrics suite for object oriented design.IEEE Transactions on software engineering, 20(6):476–493. Wasi Uddin Ahmad, Aleksander Ficek, Mehrzad Samadi, Jocelyn Huang, Vahid Noroozi, Somshubra Majumdar, and Boris Ginsburg. 2025. Opencodein- struct: A large-scale instruction tuning dataset for code llms.arXiv preprint arXiv:2504.04030. Anthropic. 2025. Cla...

  5. [5]

    InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353

    Prometheus 2: An open source language model specialized in evaluating other language models. InPro- ceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353. Imam Kusmaryono, Dyana Wijayanti, and Hevy Risqi Maharani. 2022. Number of response options, reliabil- ity, validity, and potential bias in the use of the ...

  6. [6]

    InInternational Conference on Learning Rep- resentations, volume 2024, pages 7604–7623

    Octopack: Instruction tuning code large language models. InInternational Conference on Learning Rep- resentations, volume 2024, pages 7604–7623. Delano Oliveira, Reydne Santos, Benedito De Oliveira, Martin Monperrus, Fernando Castor, and Fernanda Madeiral. 2024. Understanding code understandability improvements in code reviews.IEEE Transactions on Softwar...

  7. [7]

    Linda Rosenberg, Ted Hammer, and Jack Shaw

    A decade of code comment quality assessment: A systematic literature review.Journal of Systems and Software, 195:111515. Linda Rosenberg, Ted Hammer, and Jack Shaw. Soft- ware metrics and reliability. Furkan ¸ Sahinuç, Subhabrata Dutta, and Iryna Gurevych

  8. [8]

    Reward Modeling for Scientific Writing Evaluation

    Reward modeling for scientific writing evaluation. arXiv preprint arXiv:2601.11374. Simone Scalabrino, Gabriele Bavota, Christopher Ven- dome, Mario Linares-Vásquez, Denys Poshyvanyk, and Rocco Oliveto. 2017. Automatically assessing code understandability: How far are we? In2017 32nd IEEE/ACM International Conference on Automated Software Engineering (ASE...

  9. [9]

    Choquette

    A notional understanding of the relationship be- tween code readability and software complexity.Infor- mation, 14(2):81. CodeGemma Team, Heri Zhao, Jeffrey Hui, Joshua Howland, Nam Nguyen, Siqi Zuo, Andrea Hu, Christo- pher A Choquette-Choo, Jingyue Shen, Joe Kelley, and 1 others. 2024. Codegemma: Open code models based on gemma.arXiv preprint arXiv:2406....

  10. [10]

    InProceedings of the 25th Australasian Computing Education Conference, pages 105–112

    An experiment on the effects of modularity on code modification and understanding. InProceedings of the 25th Australasian Computing Education Conference, pages 105–112. The Algorithms — GitHub Organization. 2026. Open source resource for learning data structures & algo- rithms and their implementation in any programming language. Chaoqi Wang, Yibo Jiang, ...

  11. [11]

    Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixi- ang Zhou, Le Sun, and Yingfei Sun

    40 years of designing code comprehension exper- iments: A systematic mapping study.ACM computing surveys, 56(4):1–42. Ruoxi Xu, Hongyu Lin, Xianpei Han, Jia Zheng, Weixi- ang Zhou, Le Sun, and Yingfei Sun. 2025. Large lan- guage models often say one thing and do another.arXiv preprint arXiv:2503.07003. An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Biny...

  12. [12]

    arXiv preprint arXiv:2407.11470

    Beyond correctness: Benchmarking multi- dimensional code generation for large language models. arXiv preprint arXiv:2407.11470. Tan Zhi-Xuan, Micah Carroll, Matija Franklin, and Hal Ashton. 2025. Beyond preferences in ai alignment: T. zhi-xuan et al.Philosophical Studies, 182(7):1813– 1863. A Additional information on coding metrics Although there is no s...

  13. [13]

    44% have 4-6 years of working experience; 28% have 7-9 years; 20% have 1-3 years; and 8% have 10+ years

  14. [14]

    Coding comments

    48% spend approximately 3-5 hours a day on working with code; 44% spend 5-8 hours a day; 8% spend 1-3 hours. C Survey Coding Styles of Software Engineers Part 1: General information about experi- ence and current work Q1.How many years of working experience do you have as a developer? Response options: (a) 1-3 years (b) 4-6 years (c) 7-9 years (d) 10+ yea...

  15. [15]

    low or high) depends on the habits of a particular developer

    Using different degree of comments in code (i.e. low or high) depends on the habits of a particular developer

  16. [16]

    low or high) depending on the specific project/task

    In my daily coding activity I personally choose different degree of comments (i.e. low or high) depending on the specific project/task

  17. [17]

    Using different degree of comments in code depends on the requirements and/or computa- tional limitations of the specific project

  18. [18]

    Modularity

    Using different degree of comments in code (i.e. low or high) doesn’t depend on the habits of a particular developer 13 Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance degree of comments in your code Response options...

  19. [19]

    Modularity in code depends on the require- ments and/or computational limitations of the specific project

  20. [20]

    Modularity in code doesn’t depend on the habits of a particular developer

  21. [21]

    In my daily coding activity I personally choose different degree of modularity depend- ing on the specific project/task

  22. [22]

    Complexity

    Modularity in code depends on the habits of a particular developer Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance modularity in your code Response options: (a) I always write monolithic code (b) I mostly write monol...

  23. [23]

    Complexity in code depends on the habits of a particular developer

  24. [24]

    Complexity in code depends on the require- ments and/or computational limitations of the specific project

  25. [25]

    In my daily coding activity I personally make decisions regarding code complexity depending on the specific project/task

  26. [26]

    Readability

    In my daily coding activity my decisions regarding code complexity don’t depend on the specific project/task Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance complexity in your code Response options: (a) I always mini...

  27. [27]

    Choices regarding coding readability- complexity trade-off depend on the require- ments and/or computational limitations of the specific project

  28. [28]

    In my daily coding activity I person- ally make decisions regarding readability- complexity trade-off depending on the spe- cific project/task

  29. [29]

    Choices regarding coding readability- complexity trade-off don’t depend on the re- quirements and/or computational limitations of the specific project

  30. [30]

    Choices regarding coding readability- complexity trade-off depends on the habits of a particular developer Response options:1 = Strongly disagree, 2 = Dis- agree, 3 = Somewhat disagree, 4 = Neutral, 5 = Somewhat agree, 6 = Agree, 7 = Strongly agree. Q2.Please mark here how you usually bal- ance readability-complexity trade-off in your code Response option...

  31. [31]

    Option A$TEXT:x text; Option B$TEXT:y text; Response: xtext

    Option A$CODE:x code; Option B$CODE:y code; Response: xcode. Option A$TEXT:x text; Option B$TEXT:y text; Response: xtext

  32. [32]

    LLM-generated vs. human-written

    Option A$CODE:x code; Option B$CODE:y code; Response: ycode. Option A$TEXT:x text; Option B$TEXT:y text; Response: ytext. Figure 8: Few-Shot Example Prompt. E Details on Dataset Curation and CodeGemma-7B Applications In this section, we highlight that CodeGemma-7B did not generate the alternative code, but rather a modification that was manually checked t...