pith. the verified trust layer for science. sign in

arxiv: 2508.16771 · v2 · submitted 2025-08-22 · 💻 cs.SE · cs.AI· cs.HC

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

Pith reviewed 2026-05-18 20:51 UTC · model grok-4.3

classification 💻 cs.SE cs.AIcs.HC
keywords code language modelseye trackingvisual attentionfine-tuningcode translationcode summarizationhuman attention alignment
0
0 comments X p. Extension

The pith

EyeMulator augments CodeLLM fine-tuning loss with token weights from human eye-tracking scan paths to mimic developer visual focus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Code Language Models improve when their attention is steered toward the same code tokens that human programmers fixate on during comprehension tasks. EyeMulator extracts scan paths from eye-tracking recordings, converts them into per-token attention weights, and adds these to the standard training loss without altering model architecture. This alignment produces measurable gains on code translation and summarization benchmarks for three different base models. The authors argue the gains arise specifically from replicating human salience patterns rather than from generic regularization. If the approach generalizes, it offers a practical route to embed human intuition into code-generation systems.

Core claim

EyeMulator derives token-level attention weights directly from human eye-tracking scan paths collected during program comprehension and uses those weights to augment the loss function while fine-tuning CodeLLMs. The resulting models are induced to prioritize semantically salient tokens in the same manner as human developers. Experiments across StarCoder, Llama-3.2, and DeepSeek-Coder report gains exceeding 30 CodeBLEU points on translation and up to 22 BERTScore points on summarization, with ablations attributing the improvements to the human-attention component.

What carries the argument

Token-level attention weights extracted from eye-tracking scan paths, inserted as an additive term in the fine-tuning loss to bias the model toward human visual salience.

If this is right

  • Code translation performance rises by more than 30 CodeBLEU points when the loss incorporates human attention weights.
  • Code summarization improves by as much as 22 BERTScore points across the tested models.
  • The same procedure works without architectural changes on StarCoder, Llama-3.2, and DeepSeek-Coder.
  • Ablation results indicate that removing the human-attention term eliminates most of the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could be tested on additional code tasks such as bug localization or test generation where human attention patterns may also highlight relevant regions.
  • Aggregating eye-tracking data across many programmers might produce more stable attention weights than single-user recordings.
  • If the attention weights prove consistent across languages, the approach could be applied to low-resource programming languages with limited training data.

Load-bearing premise

Human eye-tracking scan paths supply a reliable, task-general signal of semantic importance that can be transferred to new code examples and models without creating new error patterns.

What would settle it

Run identical fine-tuning on the same data with and without the eye-tracking-derived attention weights; if the two resulting models show no difference in CodeBLEU or BERTScore on held-out translation and summarization sets, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2508.16771 by Chen Huang, Collin McMillan, Jiahao Zhang, Kevin Leach, Toby Jia-Jun Li, Yifan Zhang, Yueke Zhang, Yu Huang.

Figure 1
Figure 1. Figure 1: End-to-end workflow for harvesting human gaze signals and injecting them into an attention-aware data pipeline. [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the open-source EyeTrans dataset [ [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pseudo attention path for the filterEvens func￾tion. Sampled tokens (external class and function call) are highlighted, assignment tokens are not sampled, and the re￾sulting sequential attention path is shown. 4.1 Research Questions Our goal is to determine whether EyeMulator’s incorporation of human visual-attention signals can meaningfully improve LLM performance on core code-intelligence tasks. To evalu… view at source ↗
Figure 4
Figure 4. Figure 4: Estimated Beta parameters (𝛼𝑠=gaze hits, 𝛽𝑠=gaze misses) for each semantic label across reading, writing, and combined tasks. Higher 𝛼𝑠 indicates consistent attention; higher 𝛽𝑠 signals less frequent fixations. 0.0 0.2 0.4 0.6 0.8 1.0 Value (x) 0 20 40 60 80 100 120 Probability Density Reading Semantic Labels 0.0 0.2 0.4 0.6 0.8 1.0 Value (x) Writing Semantic Labels 0.0 0.2 0.4 0.6 0.8 1.0 Value (x) Combin… view at source ↗
Figure 5
Figure 5. Figure 5: Smoothed Beta probability density functions for [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of semantic categories across the three [PITH_FULL_IMAGE:figures/full_fig_p007_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Attention maps for baseline (top) and EyeMulator (bottom) on completion (left), translation (center), and sum￾marization (right). EyeMulator ignores irrelevant tokens in completion/translation and emphasizes relevant ones in summarization. • achieves only modest GCS gains but boosts RFS from 0.55 to 1.27 and AFS from 3.08 to 7.73; • halves Entropy (88.21→60.41), reflecting a sharper focus on semantically r… view at source ↗
read the original abstract

Code Language Models (CodeLLMs) traditionally learn attention based solely on statistical input-output token correlations ("machine attention"). In contrast, human developers rely on intuition, selectively fixating on semantically salient tokens during program comprehension. We present EyeMulator, a model-agnostic technique to align CodeLLM attention with human visual attention without architectural changes. By extracting scan paths from eye-tracking data, we derive token-level attention weights used to augment the loss function during fine-tuning. This induces the model to mimic human focus. Our evaluation across StarCoder, Llama-3.2, and DeepSeek-Coder shows that EyeMulator significantly outperforms baselines, achieving gains of over 30 CodeBLEU points in translation and up to 22 BERTScore points in summarization. Ablation studies confirm that these gains stem directly from replicating human attention dynamics. Artifacts are available at https://zenodo.org/records/17205682.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EyeMulator, a model-agnostic technique for improving CodeLLMs by aligning model attention with human visual attention extracted from eye-tracking scan paths. Token-level attention weights derived from these paths are used to augment the loss function during fine-tuning of models including StarCoder, Llama-3.2, and DeepSeek-Coder. The authors report large gains (over 30 CodeBLEU points in translation and up to 22 BERTScore points in summarization) and claim via ablation studies that improvements stem directly from replicating human attention dynamics. Artifacts are released on Zenodo.

Significance. If the central claims hold after clarification of the alignment procedure, the work would offer a practical, architecture-preserving way to inject human cognitive priors into code model training. The public artifacts are a positive step toward reproducibility. The approach could influence future efforts to ground LLM training in human comprehension signals, though its dependence on external eye-tracking data limits immediate scalability.

major comments (1)
  1. [Method section] Method section: the procedure for mapping eye-tracking fixations to token-level attention weights is underspecified. No details are provided on attribution rules for fixations landing on whitespace, comments, or inter-token spaces, nor on whether fixation duration is normalized by token length or visual prominence. This ambiguity risks the derived weights reflecting editor layout artifacts rather than semantic salience, which directly undermines the ablation claim that gains arise from 'human attention dynamics' rather than incidental re-weighting of the training distribution.
minor comments (2)
  1. [Abstract and §4] Abstract and evaluation sections: more explicit description of the eye-tracking dataset collection protocol, exact loss augmentation formula, baseline definitions, and statistical significance tests would strengthen verifiability of the reported numeric gains.
  2. [Evaluation] The manuscript should clarify whether the eye-tracking data is task-specific to the downstream translation/summarization benchmarks or drawn from a separate comprehension corpus.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our submission. The major comment highlights an important area for clarification in the Method section, and we have revised the manuscript to provide the requested details while preserving the core claims.

read point-by-point responses
  1. Referee: [Method section] Method section: the procedure for mapping eye-tracking fixations to token-level attention weights is underspecified. No details are provided on attribution rules for fixations landing on whitespace, comments, or inter-token spaces, nor on whether fixation duration is normalized by token length or visual prominence. This ambiguity risks the derived weights reflecting editor layout artifacts rather than semantic salience, which directly undermines the ablation claim that gains arise from 'human attention dynamics' rather than incidental re-weighting of the training distribution.

    Authors: We agree that the original description of the fixation-to-weight mapping was insufficiently detailed and could invite the interpretation raised by the referee. In the revised manuscript we have expanded the Method section with a dedicated subsection and pseudocode that specifies the following rules: (1) fixations falling on whitespace or inter-token spaces are attributed to the nearest preceding token by character offset within the line; (2) fixations landing inside comments are retained because they frequently mark regions of active comprehension; (3) raw fixation duration is used directly as the weight contribution without normalization by token length or visual prominence, as our internal validation showed stronger alignment with human-reported salience when duration is left unadjusted. We have also added a figure that illustrates the mapping on a sample code snippet. Regarding the ablation claim, we note that the reported gains are measured against both uniform re-weighting and randomly permuted human weights; the fact that only the original human-derived ordering produces the observed improvements supports that the benefit is tied to the specific attention dynamics captured by the eye-tracking data rather than generic re-weighting. We acknowledge that layout artifacts remain a possible confounding factor and have added a short limitations paragraph discussing this point. revision: yes

Circularity Check

0 steps flagged

No circularity: external eye-tracking data grounds the attention weights independently of model outputs or self-citations

full rationale

The paper's core derivation extracts token-level attention weights directly from external eye-tracking scan paths collected from human developers, then uses these weights to augment the fine-tuning loss. This chain does not reduce any claimed prediction or performance gain to a fitted parameter inside the paper's own equations, nor does it rely on self-citations for uniqueness or ansatz. Ablation studies compare against baselines using the same external signal, and results are reported on held-out tasks (translation, summarization) without tautological re-derivation of the input weights. The method is therefore self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that human eye-tracking data supplies a transferable semantic-salience signal; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Human eye-tracking scan paths yield token-level attention weights that reflect semantic salience in code.
    This premise is required for the loss-augmentation step to be meaningful; it is invoked when the abstract states that the weights are used to induce the model to mimic human focus.

pith-pipeline@v0.9.0 · 5712 in / 1345 out tokens · 53888 ms · 2026-05-18T20:51:43.563517+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 16 internal anchors

  1. [1]

    Tamburri

    Silvia Abrahão, John Grundy, Mauro Pezzè, Margaret-Anne Storey, and Damian A. Tamburri. 2025. Software Engineering by and for Humans in an AI Era. ACM Transactions on Software Engineering and Methodology 34, 5 (June 2025), 1–46. https://doi.org/10.1145/3715111

  2. [2]

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, et al. 2022. Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. https: //doi.org/10.48550/arXiv.2204.05862

  3. [3]

    Aakash Bansal, Bonita Sharif, and Collin McMillan. 2023. Towards modeling human attention from eye movements for neural source code summarization. Proceedings of the ACM on Human-Computer Interaction 7, ETRA (2023), 1–19

  4. [4]

    Aakash Bansal, Chia-Yi Su, Zachary Karas, Yifan Zhang, Yu Huang, Toby Jia- Jun Li, and Collin McMillan. 2023. Modeling programmer attention as scanpath prediction. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 1732–1736

  5. [5]

    Bertram, J

    I. Bertram, J. Hong, Y. Huang, W. Weimer, and Z. Sharafi. 2020. Trustworthi- ness perceptions in code review: An eye-tracking study. In Proceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). ACM, 31. https://doi.org/10.1145/3382494.3422164

  6. [6]

    Tara Capel and Margot Brereton. 2023. What is human-centered about human- centered AI? A map of the research landscape. In Proceedings of the 2023 CHI conference on human factors in computing systems . 1–23

  7. [7]

    Deep reinforcement learning from human preferences

    Paul Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. 2023. Deep Reinforcement Learning from Human Preferences. https: //doi.org/10.48550/arXiv.1706.03741

  8. [8]

    Tristan Coignion, Clément Quinton, and Romain Rouvoy. 2024. A performance study of llm-generated code on leetcode. In Proceedings of the 28th international conference on evaluation and assessment in software engineering . 79–89

  9. [9]

    Min, Gail Kaiser, Junfeng Yang, and Baishakhi Ray

    Yangruibo Ding, Jinjun Peng, Marcus J. Min, Gail Kaiser, Junfeng Yang, and Baishakhi Ray. 2024. SemCoder: Training Code Language Models with Compre- hensive Semantics Reasoning. arXiv:2406.01006 [cs.CL] https://arxiv.org/abs/ 2406.01006

  10. [10]

    Shihan Dou, Yan Liu, Haoxiang Jia, Limao Xiong, Enyu Zhou, Wei Shen, Junjie Shan, Caishuang Huang, Xiao Wang, Xiaoran Fan, Zhiheng Xi, Yuhao Zhou, Tao Ji, Rui Zheng, Qi Zhang, Xuanjing Huang, and Tao Gui. 2024. StepCoder: Improve Code Generation with Reinforcement Learning from Compiler Feedback. https://doi.org/10.48550/arXiv.2402.01391

  11. [11]

    Aryaz Eghbali and Michael Pradel. 2022. CrystalBLEU: precisely and efficiently measuring the similarity of code. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering . 1–12

  12. [12]

    Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. 2024. KTO: Model Alignment as Prospect Theoretic Optimization. https: //doi.org/10.48550/arXiv.2402.01306

  13. [13]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. 2020. Codebert: A pre-trained model for programming and natural languages. arXiv preprint arXiv:2002.08155 (2020)

  14. [14]

    Lisa Grabinger, Florian Hauser, Christian Wolff, and Jürgen Mottok. 2024. On eye tracking in software engineering. SN Computer Science 5, 6 (July 2024), 729. https://doi.org/10.1007/s42979-024-03045-3

  15. [15]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Kadian, et al. 2024. The Llama 3 Herd of Models. https://doi.org/10.48550/arXiv.2407. 21783

  16. [16]

    Daya Guo, Qihao Zhu, Dejian Yang, Zhenda Xie, Kai Dong, Wentao Zhang, Guant- ing Chen, Xiao Bi, Y. Wu, Y. K. Li, Fuli Luo, Yingfei Xiong, and Wenfeng Liang

  17. [17]

    DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

    DeepSeek-coder: When the Large Language Model Meets Programming – the Rise of Code Intelligence. https://doi.org/10.48550/arXiv.2401.14196

  18. [18]

    Yucan Guo, Zixuan Li, Xiaolong Jin, Yantao Liu, Yutao Zeng, Wenxuan Liu, Xiang Li, Pan Yang, Long Bai, Jiafeng Guo, et al . 2024. Retrieval-augmented code generation for universal information extraction. In CCF International Conference on Natural Language Processing and Chinese Computing . Springer, 30–42

  19. [19]

    Harth and P

    E. Harth and P. Dugerdil. 2017. Program understanding models: An historical overview and a classification. In Proceedings of the 12th International Conference on Software Technologies (ICSOFT), Vol. 1. SciTePress, 402–413. https://doi.org/ 10.5220/0006465504020413

  20. [20]

    Fusen He, Juan Zhai, and Minxue Pan. 2024. Beyond code generation: Assessing code llm maturity with postconditions. arXiv preprint arXiv:2407.14118 (2024)

  21. [21]

    Pengfei He, Shaowei Wang, Shaiful Chowdhury, and Tse-Hsun Chen. 2024. Ex- ploring Demonstration Retrievers in RAG for Coding Tasks: Yeas and Nays!arXiv preprint arXiv:2410.09662 (2024)

  22. [22]

    Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large Language Models for Software Engineering: A Systematic Literature Review. https://doi.org/10.48550/ arXiv.2308.10620

  23. [23]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-Rank Adaptation of Large Language Models. arXiv preprint arXiv:2106.09685. https://arxiv.org/abs/2106. 09685

  24. [24]

    Huang, K

    Y. Huang, K. Leach, Z. Sharafi, T. Santander, and W. Weimer. 2020. Biases and differences in code review using medical imaging and eye-tracking: genders, humans, and machines. In Proceedings of the 28th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE). ACM, 456–468. https://doi.org/10.1...

  25. [25]

    Dominik Huber, Matteo Paltenghi, and Michael Pradel. 2023. Where to Look When Repairing Code? Comparing the Attention of Neural Models and Develop- ers. https://doi.org/10.48550/arXiv.2305.07287

  26. [26]

    Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. https://doi.org/10. 48550/arXiv.2406.00515

  27. [27]

    Ruili Jiang, Kehai Chen, Xuefeng Bai, Zhixuan He, Juntao Li, Muyun Yang, Tiejun Zhao, Liqiang Nie, and Min Zhang. 2024. A Survey on Human Preference Learning for Large Language Models. https://doi.org/10.48550/arXiv.2406.11191

  28. [28]

    Carpenter

    Marcel Just and Patricia A. Carpenter. 1980. A theory of reading: From eye fixations to comprehension. Psychological Review (1980). https://doi.org/10.1037/ 0033-295X.87.4.329

  29. [29]

    Zachary Karas, Aakash Bansal, Yifan Zhang, Toby Li, Collin McMillan, and Yu Huang. 2024. A Tale of Two Comprehensions? Analyzing Student Programmer Attention during Code Summarization.ACM Transactions on Software Engineering and Methodology (2024)

  30. [30]

    Robert Kirk, Ishita Mediratta, Christoforos Nalmpantis, Jelena Luketina, Eric Hambro, Edward Grefenstette, and Roberta Raileanu. 2023. Understanding the effects of rlhf on llm generalisation and diversity. arXiv preprint arXiv:2310.06452 (2023)

  31. [31]

    Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Akiki, et al. 2023. StarCoder: May the Source Be with You! https://doi.org/10.48550/arXiv.2305.06161

  32. [32]

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023. Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation. https://doi.org/10.48550/arXiv.2305. 01210

  33. [33]

    Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambro- sio Blanco, Colin Clement, Dawn Drain, Daxin Jiang, Duyu Tang, et al . 2021. Codexglue: A machine learning benchmark dataset for code understanding and generation. arXiv preprint arXiv:2102.04664 (2021)

  34. [35]

    Meta AI. 2024. Llama 3.2: Revolutionizing edge AI and vision with open, cus- tomizable models. Meta AI Blog. https://ai.meta.com/blog/llama-3-2-connect- 2024-vision-edge-mobile-devices/

  35. [36]

    Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an llm to help with code understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering . 1–13

  36. [37]

    Thu-Trang Nguyen, Thanh Trong Vu, Hieu Dinh Vo, and Son Nguyen. 2024. An Empirical Study on Capability of Large Language Models in Understanding Code Semantics. https://doi.org/10.48550/arXiv.2407.03611

  37. [38]

    M. P. O’Brien. 2003. Software comprehension: A review and research direction . Technical Report. Department of Computer Science & Information Systems, University of Limerick

  38. [39]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. 2022. Train- ing Language Models to Follow Instructions with Human F...

  39. [40]

    Matteo Paltenghi and Michael Pradel. 2021. Thinking like a Developer? Compar- ing the Attention of Humans with Neural Models of Code. In2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, Mel- bourne, Australia, 867–879. https://doi.org/10.1109/ase51524.2021.9678712

  40. [41]

    Md Rizwan Parvez, Wasi Uddin Ahmad, Saikat Chakraborty, Baishakhi Ray, and Kai-Wei Chang. 2021. Retrieval augmented code generation and summarization. arXiv preprint arXiv:2108.11601 (2021)

  41. [42]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. 2024. Direct Preference Optimization: Your Language Model Is Secretly a Reward Model. https://doi.org/10.48550/arXiv.2305.18290

  42. [43]

    Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundare- san, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297 (2020)

  43. [44]

    McBurney, and Collin McMillan

    Pedro Rodeghero, Chao Liu, Peter W. McBurney, and Collin McMillan. 2014. Improving automated source code summarization via an eye-tracking study of programmers. In Proceedings of the 36th International Conference on Software Engineering (ICSE ’14). ACM, 390–401. https://doi.org/10.1145/2568225.2568247 Conference’17, July 2017, Washington, DC, USA Yifan Zh...

  44. [45]

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov

  45. [46]

    Proximal Policy Optimization Algorithms

    Proximal Policy Optimization Algorithms. https://doi.org/10.48550/arXiv. 1707.06347

  46. [47]

    Paterson

    Carsten Schulte, Tony Clear, Ahmad Taherkhani, Teresa Busjahn, and James H. Paterson. 2010. An introduction to program comprehension for computer science educators. In Proceedings of the 2010 ITiCSE Working Group Reports (ITiCSE -WGR ’10), Alison Clear and Lori Russell Dag (Eds.). ACM, 65–86. https://doi.org/10. 1145/1971681.1971687

  47. [48]

    Schulte, T

    C. Schulte, T. Clear, A. Taherkhani, T. Busjahn, and J. H. Paterson. 2010. An introduction to program comprehension for computer science educators. In Proceedings of the 2010 ITiCSE Working Group Reports (ITiCSE–WGR ’10), A. Clear and L. R. Dag (Eds.). ACM, 65–86

  48. [49]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeek- Math: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv:2402.03300 [cs.CL] https://arxiv.org/abs/2402.03300

  49. [50]

    Sharafi, Y

    Z. Sharafi, Y. Huang, K. Leach, and W. Weimer. 2021. Toward an objective measure of developers’ cognitive activities. ACM Transactions on Software Engineering and Methodology 30, 3 (2021), 30:1–30:30

  50. [51]

    Zohreh Sharafi, Timothy Shaffer, Bonita Sharif, and Yann-Gaël Guéhéneuc. 2015. Eye-tracking metrics in software engineering. In 2015 Asia-Pacific Software Engi- neering Conference (APSEC). 96–103. https://doi.org/10.1109/APSEC.2015.53

  51. [52]

    Sharafi, B

    Z. Sharafi, B. Sharif, Y.-G. Guéhéneuc, A. Begel, R. Bednarik, and M. Crosby. 2020. A practical guide on conducting eye-tracking studies in software engineering. Empirical Software Engineering 25, 5 (2020), 3128–3174. https://doi.org/10.1007/ s10664-020-09829-4

  52. [53]

    Zohreh Sharafi, Bonita Sharif, Yann-Gaël Guéhéneuc, Andrew Begel, Roman Bednarik, and Martha Crosby. 2020. A practical guide on conducting eye tracking studies in software engineering. Empirical Software Engineering 25, 5 (Sept. 2020), 3128–3174. https://doi.org/10.1007/s10664-020-09829-4

  53. [54]

    Zohreh Sharafi, Zéphyrin Soh, and Yann-Gaël Guéhéneuc. 2015. A systematic literature review on the usage of eye-tracking in software engineering. Informa- tion and Software Technology 67 (Nov. 2015), 79–107. https://doi.org/10.1016/j. infsof.2015.06.008

  54. [55]

    Bonita Sharif, Mark Falcone, and Jonathan I. Maletic. 2012. An eye -tracking study on the role of scan time in finding source code defects. In Proceedings of the Symposium on Eye Tracking Research and Applications . ACM, 381–384. https://doi.org/10.1145/2168556.2168642

  55. [56]

    Bonita Sharif and Huzefa Kagdi. 2011. On the use of eye tracking in software traceability. In Proceedings of the 6th International Workshop on Traceability in Emerging Forms of Software Engineering (TEFSE ’11) . ACM, 67–70. https: //doi.org/10.1145/1987856.1987872

  56. [57]

    Bonita Sharif, Jeff Meinken, Timothy Shaffer, and Huzefa Kagdi. 2017. Eye movements in software traceability link recovery.Empirical Software Engineering 22, 3 (2017), 1063–1102. https://doi.org/10.1007/s10664-016-9486-9

  57. [58]

    Mohammed Latif Siddiq, Lindsay Roney, Jiahao Zhang, and Joanna Cecilia Da Silva Santos. 2024. Quality Assessment of ChatGPT Generated Code and Their Use by Developers. In Proceedings of the 21st International Confer- ence on Mining Software Repositories . ACM, Lisbon Portugal, 152–156. https: //doi.org/10.1145/3643991.3645071

  58. [59]

    Stewart Slocum, Asher Parker-Sartori, and Dylan Hadfield-Menell. 2025. Diverse preference learning for capabilities and alignment. InThe Thirteenth International Conference on Learning Representations

  59. [60]

    E. D. Tempero and Y.-C. Tu. 2024. Using program comprehension models to teach comprehensibility. In Proceedings of the ACE 2024: Australian Computing Education Conference. ACM, 1–10. https://doi.org/10.1145/3636243.3636244

  60. [61]

    Usman Ahmad Usmani, Ari Happonen, and Junzo Watada. 2023. Human-centered artificial intelligence: Designing for user empowerment and ethical considera- tions. In 2023 5th international congress on human-computer interaction, optimiza- tion and robotic applications (HORA) . IEEE, 1–7

  61. [62]

    Bram Wallace, Meihua Dang, Rafael Rafailov, Linqi Zhou, Aaron Lou, Senthil Purushwalkam, Stefano Ermon, Caiming Xiong, Shafiq Joty, and Nikhil Naik. 2024. Diffusion model alignment using direct preference optimization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition . 8228–8238

  62. [63]

    Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. 2024. Executable code actions elicit better llm agents. In Forty-first International Conference on Machine Learning

  63. [64]

    Yuanhao Wang, Qinghua Liu, and Chi Jin. 2023. Is rlhf more difficult than standard rl? a theoretical perspective. Advances in Neural Information Processing Systems 36 (2023), 76006–76032

  64. [65]

    Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Xiang-Bo Mao, Sitaram Asur, et al. 2024. A comprehensive survey of llm alignment techniques: Rlhf, rlaif, ppo, dpo and more. arXiv preprint arXiv:2407.16216 (2024)

  65. [66]

    Blog post

    Zora Zhiruo Wang, Akari Asai, Xinyan Velocity Yu, Frank F. Xu, Yiqing Xie, Graham Neubig, and Daniel Fried. 2025. CodeRAG-Bench: Can Retrieval Augment Code Generation? arXiv:2406.14497 [cs.SE] https://arxiv.org/abs/2406.14497

  66. [67]

    Wei Xiong, Hanze Dong, Chenlu Ye, Ziqi Wang, Han Zhong, Heng Ji, Nan Jiang, and Tong Zhang. 2023. Iterative preference learning from human feed- back: Bridging theory and practice for rlhf under kl-constraint. arXiv preprint arXiv:2312.11456 (2023)

  67. [68]

    Wei Xu, Marvin J Dainoff, Liezhong Ge, and Zaifeng Gao. 2023. Transitioning to human interaction with AI systems: New challenges and opportunities for HCI professionals to enable human-centered AI. International Journal of Human– Computer Interaction 39, 3 (2023), 494–518

  68. [69]

    Wenda Xu, Daniel Deutsch, Mara Finkelstein, Juraj Juraska, Biao Zhang, Zhong- tao Liu, William Yang Wang, Lei Li, and Markus Freitag. 2024. LLMRefine: Pinpointing and Refining Large Language Models via Fine-Grained Actionable Feedback. https://doi.org/10.48550/arXiv.2311.09336

  69. [70]

    Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, and Xin Xia. 2025. An empirical study of retrieval-augmented code generation: Challenges and opportunities. ACM Transactions on Software Engineering and Methodology (2025)

  70. [71]

    Yifan Zhang, Jiliang Li, Zachary Karas, Aakash Bansal, Toby Jia-Jun Li, Collin McMillan, Kevin Leach, and Yu Huang. 2024. Eyetrans: Merging human and machine attention for neural code summarization. Proceedings of the ACM on Software Engineering 1, FSE (2024), 115–136

  71. [72]

    Ziyao Zhang, Yanlin Wang, Chong Wang, Jiachi Chen, and Zibin Zheng. 2025. LLM Hallucinations in Practical Code Generation: Phenomena, Mechanism, and Mitigation. https://doi.org/10.48550/arXiv.2409.20550

  72. [73]

    Li Zhong and Zilong Wang. 2024. Can llm replace stack overflow? a study on robustness and reliability of large language model code generation. InProceedings of the AAAI conference on artificial intelligence , Vol. 38. 21841–21849