arXiv: 2508.16771 · v2 · submitted 2025-08-22 · 💻 cs.SE · cs.AI · cs.HC

EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention

Yifan Zhang, Chen Huang, Yueke Zhang, Jiahao Zhang, Toby Jia-Jun Li, Collin McMillan, Kevin Leach, Yu Huang

Pith reviewed 2026-05-18 20:51 UTC · model grok-4.3

classification 💻 cs.SE · cs.AI · cs.HC

keywords code language models · eye tracking · visual attention · fine-tuning · code translation · code summarization · human attention alignment

The pith

EyeMulator augments CodeLLM fine-tuning loss with token weights from human eye-tracking scan paths to mimic developer visual focus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Code Language Models improve when their attention is steered toward the same code tokens that human programmers fixate on during comprehension tasks. EyeMulator extracts scan paths from eye-tracking recordings, converts them into per-token attention weights, and adds these to the standard training loss without altering model architecture. This alignment produces measurable gains on code translation and summarization benchmarks for three different base models. The authors argue the gains arise specifically from replicating human salience patterns rather than from generic regularization. If the approach generalizes, it offers a practical route to embed human intuition into code-generation systems.

Core claim

EyeMulator derives token-level attention weights directly from human eye-tracking scan paths collected during program comprehension and uses those weights to augment the loss function while fine-tuning CodeLLMs. The resulting models are induced to prioritize semantically salient tokens in the same manner as human developers. Experiments across StarCoder, Llama-3.2, and DeepSeek-Coder report gains exceeding 30 CodeBLEU points on translation and up to 22 BERTScore points on summarization, with ablations attributing the improvements to the human-attention component.

What carries the argument

Token-level attention weights extracted from eye-tracking scan paths, inserted as an additive term in the fine-tuning loss to bias the model toward human visual salience.
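
A minimal sketch of what such an additive loss term could look like, in plain Python. This page does not give the exact loss formula, so the form used here (mean negative log-likelihood plus a weight-normalized, human-weighted negative log-likelihood) is an assumption, and the names `attention_augmented_loss`, `human_weights`, and `lam` are hypothetical.

```python
import math

def token_nll(probs_of_target):
    """Per-token negative log-likelihood, given the probability the model
    assigned to each reference token."""
    return [-math.log(p) for p in probs_of_target]

def attention_augmented_loss(probs_of_target, human_weights, lam=1.0):
    """Standard mean NLL plus an additive human-attention term.

    Assumed form, not the paper's exact formula:
        loss = mean(nll) + lam * sum(w_j * nll_j) / sum(w_j)
    lam (hypothetical) scales the additive term; lam=0 recovers the
    ordinary fine-tuning loss.
    """
    if len(probs_of_target) != len(human_weights):
        raise ValueError("need exactly one weight per token")
    nll = token_nll(probs_of_target)
    base = sum(nll) / len(nll)
    total_weight = sum(human_weights)
    if total_weight == 0:
        # No gaze signal for this example: fall back to the plain loss.
        return base
    attended = sum(w * l for w, l in zip(human_weights, nll)) / total_weight
    return base + lam * attended
```

Under this form, a token the model predicts poorly costs more when humans fixated on it than when they skipped it, which is the biasing effect described above. The model architecture is untouched; only the scalar being minimized changes.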

If this is right

Code translation performance rises by more than 30 CodeBLEU points when the loss incorporates human attention weights.
Code summarization improves by as much as 22 BERTScore points across the tested models.
The same procedure works without architectural changes on StarCoder, Llama-3.2, and DeepSeek-Coder.
Ablation results indicate that removing the human-attention term eliminates most of the observed gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be tested on additional code tasks such as bug localization or test generation where human attention patterns may also highlight relevant regions.
Aggregating eye-tracking data across many programmers might produce more stable attention weights than single-user recordings.
If the attention weights prove consistent across languages, the approach could be applied to low-resource programming languages with limited training data.

Load-bearing premise

Human eye-tracking scan paths supply a reliable, task-general signal of semantic importance that can be transferred to new code examples and models without creating new error patterns.

What would settle it

Run identical fine-tuning on the same data with and without the eye-tracking-derived attention weights; if the two resulting models show no difference in CodeBLEU or BERTScore on held-out translation and summarization sets, the central claim does not hold.
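
One way to judge "no difference" in the comparison above is a paired bootstrap over per-example scores. The sketch below assumes both models were scored on the same held-out examples with the same metric (CodeBLEU or BERTScore); the function name, the resample count, and the seed are illustrative choices, not something the paper specifies.

```python
import random

def paired_bootstrap(scores_with, scores_without, n_resamples=2000, seed=0):
    """Compare two models scored on the same held-out examples.

    Returns (mean_diff, frac_not_better), where frac_not_better is the
    fraction of bootstrap resamples in which the model trained with
    eye-tracking weights fails to beat the model trained without them.
    """
    if len(scores_with) != len(scores_without) or not scores_with:
        raise ValueError("need two equal-length, non-empty score lists")
    rng = random.Random(seed)  # fixed seed so the check is repeatable
    diffs = [a - b for a, b in zip(scores_with, scores_without)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    not_better = 0
    for _ in range(n_resamples):
        # Resample example indices with replacement, keeping pairs intact.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(resample) <= 0:
            not_better += 1
    return mean_diff, not_better / n_resamples
```

A `frac_not_better` near 0 would be consistent with the central claim; a value near 0.5 or above would suggest the weights add nothing measurable on that benchmark.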

Figures

Figures reproduced from arXiv: 2508.16771 by Chen Huang, Collin McMillan, Jiahao Zhang, Kevin Leach, Toby Jia-Jun Li, Yifan Zhang, Yueke Zhang, Yu Huang.

**Figure 1.** End-to-end workflow for harvesting human gaze signals and injecting them into an attention-aware data pipeline.

**Figure 2.** Overview of the open-source EyeTrans dataset […]

**Figure 3.** Pseudo attention path for the filterEvens function. Sampled tokens (external class and function call) are highlighted, assignment tokens are not sampled, and the resulting sequential attention path is shown. [Adjacent body text, truncated in the source: "4.1 Research Questions. Our goal is to determine whether EyeMulator’s incorporation of human visual-attention signals can meaningfully improve LLM performance on core code-intelligence tasks. To evalu…"]

**Figure 4.** Estimated Beta parameters (α_s = gaze hits, β_s = gaze misses) for each semantic label across reading, writing, and combined tasks. Higher α_s indicates consistent attention; higher β_s signals less frequent fixations. [Plot text recovered from the source: x-axis "Value (x)", y-axis "Probability Density"; panels "Reading Semantic Labels", "Writing Semantic Labels", "Combin…"]

**Figure 5.** Smoothed Beta probability density functions for …

**Figure 6.** Distribution of semantic categories across the three …

**Figure 7.** Attention maps for baseline (top) and EyeMulator (bottom) on completion (left), translation (center), and summarization (right). EyeMulator ignores irrelevant tokens in completion/translation and emphasizes relevant ones in summarization. [Adjacent results text, truncated in the source: "achieves only modest GCS gains but boosts RFS from 0.55 to 1.27 and AFS from 3.08 to 7.73; halves Entropy (88.21→60.41), reflecting a sharper focus on semantically r…"]

Original abstract

Code Language Models (CodeLLMs) traditionally learn attention based solely on statistical input-output token correlations ("machine attention"). In contrast, human developers rely on intuition, selectively fixating on semantically salient tokens during program comprehension. We present EyeMulator, a model-agnostic technique to align CodeLLM attention with human visual attention without architectural changes. By extracting scan paths from eye-tracking data, we derive token-level attention weights used to augment the loss function during fine-tuning. This induces the model to mimic human focus. Our evaluation across StarCoder, Llama-3.2, and DeepSeek-Coder shows that EyeMulator significantly outperforms baselines, achieving gains of over 30 CodeBLEU points in translation and up to 22 BERTScore points in summarization. Ablation studies confirm that these gains stem directly from replicating human attention dynamics. Artifacts are available at https://zenodo.org/records/17205682.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EyeMulator, a model-agnostic technique for improving CodeLLMs by aligning model attention with human visual attention extracted from eye-tracking scan paths. Token-level attention weights derived from these paths are used to augment the loss function during fine-tuning of models including StarCoder, Llama-3.2, and DeepSeek-Coder. The authors report large gains (over 30 CodeBLEU points in translation and up to 22 BERTScore points in summarization) and claim via ablation studies that improvements stem directly from replicating human attention dynamics. Artifacts are released on Zenodo.

Significance. If the central claims hold after clarification of the alignment procedure, the work would offer a practical, architecture-preserving way to inject human cognitive priors into code model training. The public artifacts are a positive step toward reproducibility. The approach could influence future efforts to ground LLM training in human comprehension signals, though its dependence on external eye-tracking data limits immediate scalability.

major comments (1)

[Method section] The procedure for mapping eye-tracking fixations to token-level attention weights is underspecified. No details are provided on attribution rules for fixations landing on whitespace, comments, or inter-token spaces, nor on whether fixation duration is normalized by token length or visual prominence. This ambiguity risks the derived weights reflecting editor layout artifacts rather than semantic salience, which directly undermines the ablation claim that gains arise from 'human attention dynamics' rather than incidental re-weighting of the training distribution.

minor comments (2)

[Abstract and §4] A more explicit description of the eye-tracking data collection protocol, the exact loss augmentation formula, the baseline definitions, and the statistical significance tests would make the reported numeric gains easier to verify.
[Evaluation] The manuscript should clarify whether the eye-tracking data is task-specific to the downstream translation/summarization benchmarks or drawn from a separate comprehension corpus.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our submission. The major comment highlights an important area for clarification in the Method section, and we have revised the manuscript to provide the requested details while preserving the core claims.

Point-by-point responses

Referee: [Method section] The procedure for mapping eye-tracking fixations to token-level attention weights is underspecified. No details are provided on attribution rules for fixations landing on whitespace, comments, or inter-token spaces, nor on whether fixation duration is normalized by token length or visual prominence. This ambiguity risks the derived weights reflecting editor layout artifacts rather than semantic salience, which directly undermines the ablation claim that gains arise from 'human attention dynamics' rather than incidental re-weighting of the training distribution.

Authors: We agree that the original description of the fixation-to-weight mapping was insufficiently detailed and could invite the interpretation raised by the referee. In the revised manuscript we have expanded the Method section with a dedicated subsection and pseudocode that specifies the following rules: (1) fixations falling on whitespace or inter-token spaces are attributed to the nearest preceding token by character offset within the line; (2) fixations landing inside comments are retained because they frequently mark regions of active comprehension; (3) raw fixation duration is used directly as the weight contribution without normalization by token length or visual prominence, as our internal validation showed stronger alignment with human-reported salience when duration is left unadjusted. We have also added a figure that illustrates the mapping on a sample code snippet.

Regarding the ablation claim, we note that the reported gains are measured against both uniform re-weighting and randomly permuted human weights; the fact that only the original human-derived ordering produces the observed improvements supports that the benefit is tied to the specific attention dynamics captured by the eye-tracking data rather than generic re-weighting. We acknowledge that layout artifacts remain a possible confounding factor and have added a short limitations paragraph discussing this point.

revision: yes
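
The three mapping rules stated in the rebuttal can be sketched as follows. This is an illustration under assumptions, not the authors' pseudocode: the tuple layouts for tokens and fixations and the function name `map_fixations_to_tokens` are hypothetical, and the handling of a fixation that precedes every token on its line (dropped here) is not covered by the stated rules.

```python
from collections import defaultdict

def map_fixations_to_tokens(tokens, fixations):
    """Sum raw fixation durations onto tokens.

    tokens:    list of (line, start_col, end_col), end_col exclusive
    fixations: list of (line, col, duration_ms)
    Returns one accumulated duration per token, in token order.
    """
    spans_by_line = defaultdict(list)
    for idx, (line, start, _end) in enumerate(tokens):
        spans_by_line[line].append((start, idx))
    for spans in spans_by_line.values():
        spans.sort()

    weights = [0.0] * len(tokens)
    for line, col, duration in fixations:
        target = None
        for start, idx in spans_by_line.get(line, []):
            if start <= col:
                # Either the fixation is on this token, or this is the
                # nearest preceding token so far (rule 1).
                target = idx
            else:
                break
        if target is None:
            # Assumption: nothing precedes the fixation on this line
            # (e.g. leading indentation), so it is dropped.
            continue
        # Rule 3: raw duration, no normalization by token length.
        weights[target] += duration
    return weights
```

Rule 2 needs no special case in this sketch: comment tokens are simply included in `tokens` like any other span. The permuted-weights control mentioned in the rebuttal would shuffle the returned list before training.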

Circularity Check

0 steps flagged

No circularity: external eye-tracking data grounds the attention weights independently of model outputs or self-citations

The paper's core derivation extracts token-level attention weights directly from external eye-tracking scan paths collected from human developers, then uses these weights to augment the fine-tuning loss. This chain does not reduce any claimed prediction or performance gain to a fitted parameter inside the paper's own equations, nor does it rely on self-citations for uniqueness or ansatz. Ablation studies compare against baselines using the same external signal, and results are reported on held-out tasks (translation, summarization) without tautological re-derivation of the input weights. The method is therefore self-contained against external benchmarks rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the untested premise that human eye-tracking data supplies a transferable semantic-salience signal; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Human eye-tracking scan paths yield token-level attention weights that reflect semantic salience in code.
This premise is required for the loss-augmentation step to be meaningful; it is invoked when the abstract states that the weights are used to induce the model to mimic human focus.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We extract scan paths from eye-tracking data, which reflects the order in which tokens are read by humans. We use these scan paths to assign attention weights to each token in the input samples used to train an LLM.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

w_j = w_base + 1/log(freq(g_j)+2) + E[θ_sj]
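
The quoted weight formula can be evaluated with a few lines of Python. Several readings here are assumptions, since this page does not define the symbols: `log` is taken as the natural logarithm, `freq(g_j)` as a non-negative frequency count for token j, and `E[θ_sj]` as the mean α/(α+β) of a Beta distribution whose parameters are the gaze hits and gaze misses for the token's semantic label (following the Figure 4 caption). The default `w_base=1.0` is a placeholder, not a value from the paper.

```python
import math

def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution: alpha / (alpha + beta)."""
    if alpha + beta <= 0:
        raise ValueError("alpha + beta must be positive")
    return alpha / (alpha + beta)

def token_weight(freq, alpha, beta, w_base=1.0):
    """w_j = w_base + 1/log(freq + 2) + E[theta], per the quoted formula.

    freq:        assumed frequency count for the token (>= 0)
    alpha, beta: assumed gaze hits / gaze misses for the token's
                 semantic label
    w_base:      placeholder base weight (hypothetical default)
    """
    if freq < 0:
        raise ValueError("freq must be non-negative")
    # The +2 keeps the logarithm strictly positive even when freq is 0.
    rarity = 1.0 / math.log(freq + 2)
    return w_base + rarity + beta_mean(alpha, beta)
```

Read this way, the weight grows for rarer tokens and for semantic labels that attract gaze more consistently, while `w_base` keeps every token's weight above zero.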

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
