pith. machine review for the scientific record.

arxiv: 2605.06901 · v1 · submitted 2026-05-07 · 💻 cs.CL

Recognition: no theorem link

Reflections and New Directions for Human-Centered Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords human-centered AI · large language models · AI ethics · model development pipeline · responsible AI · NLP · HCI · future of work

The pith

Large language model developers must address human concerns and values at every stage of the pipeline, not only after training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework for Human-Centered Large Language Models that combines ideas from natural language processing, human-computer interaction, and responsible AI. It claims that to handle the growing influence of these models in education, healthcare, finance, and other fields, developers need to treat human preferences, ethics, and goals as core requirements from initial system design through data sourcing, training, evaluation, and deployment. A closing case study applies the same lens to the future of work. A sympathetic reader would care because current models already shape private and professional decisions, and late-stage fixes alone have not prevented misalignment with real user needs.

Core claim

The central claim is that human-centered considerations must receive rigorous attention at every stage of the language modeling pipeline, from design and data sourcing to training, evaluation, and responsible deployment, rather than being limited to a cursory post-training adjustment.

What carries the argument

The HCLLM framework, which integrates perspectives from NLP, HCI, and responsible AI to guide decisions at each pipeline stage while balancing ethics, economics, and technical goals.

Load-bearing premise

That systematically folding human perspectives into every stage of development will produce better real-world outcomes without unacceptable losses in model capability or added cost.

What would settle it

An empirical comparison showing that LLMs built with only standard post-training alignment achieve equal or higher user satisfaction, ethical compliance, and task performance than those developed with the integrated human-centered process at every stage.
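The comparison described above could be operationalized as a simple two-cohort analysis: collect per-user satisfaction scores for a model built with post-training alignment only and for one built with the integrated human-centered process, then estimate the difference with uncertainty. The sketch below is purely illustrative, using Python's standard library and synthetic ratings; the function name, cohort data, and 1–5 rating scale are assumptions, not anything from the paper.

```python
# Hypothetical sketch of the deciding experiment: bootstrap a 95% CI
# for the difference in mean user satisfaction between two development
# processes. All data below are synthetic and illustrative.
import random
import statistics

def bootstrap_mean_diff(baseline, integrated, n_boot=2000, seed=0):
    """Return (observed mean difference, 95% bootstrap CI)."""
    rng = random.Random(seed)
    observed = statistics.mean(integrated) - statistics.mean(baseline)
    diffs = []
    for _ in range(n_boot):
        # Resample each cohort with replacement and recompute the gap.
        b = [rng.choice(baseline) for _ in baseline]
        i = [rng.choice(integrated) for _ in integrated]
        diffs.append(statistics.mean(i) - statistics.mean(b))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot)]
    return observed, (lo, hi)

# Synthetic satisfaction ratings (1-5) from two hypothetical user cohorts.
post_training_only = [3.1, 3.4, 2.9, 3.3, 3.0, 3.2, 2.8, 3.5]
pipeline_integrated = [3.6, 3.9, 3.4, 3.8, 3.5, 4.0, 3.3, 3.7]

diff, (lo, hi) = bootstrap_mean_diff(post_training_only, pipeline_integrated)
print(f"mean diff {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

A confidence interval excluding zero in either direction would bear on the paper's load-bearing premise; an interval straddling zero would leave it unsettled.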

Figures

Figures reproduced from arXiv: 2605.06901 by Aadesh Salecha, Aditri Bhagirath, Advit Deepak, Ahmad Rushdi, Angela Liu, Anshika Agarwal, Anthony Qin, Arpandeep Khatua, Caleb Ziems, Cat Gonzales Fergesen, Cathy Zhou, Daniel Frees, Diyi Yang, Dora Zhao, Emily Redmond, Gabriela Aranguiz-Dias, Haocheng Fan, Harshvardhan Agarwal, Henry Jin Weng, Huanxing Chen, Irawadee Thawornbut, Isabelle Levent, James Landay, Justine Breuch, Kirill Kalinin, Matthew Jörke, Mei Tan, Michael J. Ryan, Michelle Vinocour, Minsik Oh, Miranda Li, Nathan J. Paek, Patrick Ye, Pawan Wirawarn, Pranava Singhal, Rose E. Wang, Ruishi Chen, Sarah Chen, Sean Zhang, Shashanka Subrahmanya, Shirley Wu, Sophie Wu, Su Doga Karaca, Sunny Yu, Teland La, Tian Gao, Vishal Jain, William Fang, Xiaoxian Shen, Xiaoyue Wang, Xinling Li, Yichen Xie, Yiling Zhao, Yongce Li, Yucheng Jiang, Yutong Zhang, Zheng Wang, Ziqing Huang.

Figure 1. This survey has three core sections, focused on (1) defining, (2) developing, and (3) deploying …
Figure 2. We can draw on the field of human-computer interaction (HCI) to help inform human …
Figure 3. This chapter focuses on the human origins of data (§ …)
Figure 4. This chapter applies human-centered considerations to …
Figure 5. In this chapter, we discuss common pitfalls and best practices for evaluating HCLLMs, …
Figure 6. We enumerate three properties for the responsible HCLLM deployment: …
Figure 7. We present a case study on HCLLMs and the future of work, covering the three key areas of …
Original abstract

Large Language Models (LLMs) are increasingly shaping the private and professional lives of users, with numerous applications in business, education, finance, healthcare, law, and science. With this rise in global influence comes greater urgency to build, evaluate, and deploy these systems in a manner that prioritizes not only technical capabilities but also human priorities. This work presents a framework for developing Human-Centered Large Language Models (HCLLMs), which integrates perspectives from Natural Language Processing (NLP), Human-Computer Interaction (HCI), and responsible AI. Considering the ethics, economics, and technical objectives of language modeling, we argue that model developers need to address human concerns, preferences, values, and goals, not only during a cursory post-training stage, but rather with rigor and care at every stage of the pipeline. This paper offers human-centered insights and recommendations for developers at each stage, from system design to data sourcing, model training, evaluation, and responsible deployment. Then we conclude with a case study, applying these insights to understand the future of work with HCLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for Human-Centered Large Language Models (HCLLMs) that integrates insights from NLP, HCI, and responsible AI. It argues that human concerns, preferences, values, and goals must be addressed with rigor at every stage of the LLM pipeline, from system design to data sourcing, model training, evaluation, and responsible deployment, rather than only in post-training. The paper provides recommendations for each stage and concludes with a case study applying these to the future of work.

Significance. The synthesis of human-centered principles across the full development pipeline represents a valuable organizational contribution that could help guide more responsible LLM development practices. By highlighting the need for consistent attention to human factors, the paper encourages a shift from ad-hoc post-training fixes to systematic integration. Its significance is enhanced by the structured recommendations and the illustrative case study, though its impact will depend on how actionable and balanced the suggestions are in practice.

major comments (2)
  1. [Introduction] The central claim (Introduction and Abstract) that systematically incorporating human-centered perspectives at each pipeline stage will produce meaningfully better outcomes rests on an unexamined assumption that this can be done without prohibitive trade-offs in capability or cost; no section addresses how alignment objectives interact with scaling or performance metrics.
  2. [Case Study] The case study applies the framework to the future of work but provides no concrete metrics, baselines, or falsifiable predictions for success, which weakens its role in supporting the pipeline-wide recommendations.
minor comments (2)
  1. Clarify the distinction between the proposed framework and prior work in responsible AI and HCI to better highlight the paper's unique contributions.
  2. Some stage-specific recommendations could be strengthened with references to concrete examples from the literature or industry practices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and outline targeted revisions to strengthen the paper while preserving its reflective and framework-oriented scope.

Point-by-point responses
  1. Referee: [Introduction] The central claim (Introduction and Abstract) that systematically incorporating human-centered perspectives at each pipeline stage will produce meaningfully better outcomes rests on an unexamined assumption that this can be done without prohibitive trade-offs in capability or cost; no section addresses how alignment objectives interact with scaling or performance metrics.

    Authors: We acknowledge that the manuscript does not explicitly analyze potential trade-offs between human-centered integration and scaling laws or performance metrics. The central claim is framed as a call for systematic attention to human factors across the pipeline, based on the view that such integration supports more robust long-term outcomes, but we agree this leaves the interaction with capability and cost unexamined. In revision, we will add a concise discussion subsection (likely in the Introduction or following the pipeline recommendations) that references existing work on alignment tax, scaling behaviors, and performance trade-offs, while clarifying that the framework does not assume zero-cost integration. revision: partial

  2. Referee: [Case Study] The case study applies the framework to the future of work but provides no concrete metrics, baselines, or falsifiable predictions for success, which weakens its role in supporting the pipeline-wide recommendations.

    Authors: The case study is intended as an illustrative application of the HCLLM framework to the domain of future work, consistent with the paper's reflective and directional character rather than an empirical validation study. As such, it deliberately avoids quantitative metrics or baselines. To strengthen its supportive role, we will revise the case study to include a brief discussion of potential evaluation criteria and example metrics (e.g., human-centered outcome measures in workplace LLM deployment) that could be used to test the framework's recommendations in future empirical work, without introducing new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a normative position piece that advocates integrating human-centered perspectives from NLP, HCI, and responsible AI across the LLM pipeline. It contains no equations, empirical predictions, fitted parameters, or formal derivations whose validity depends on internal reductions. The central argument is conceptual and draws on external literatures without self-referential definitions or load-bearing self-citations that collapse the claim to its own inputs. Recommendations are framed as insights and reorganizations of known principles rather than novel results derived by construction from the paper's own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on domain assumptions about the feasibility and value of integrating human-centered perspectives throughout technical pipelines, without new evidence or invented entities.

axioms (1)
  • domain assumption: Human concerns, preferences, values, and goals can and should be addressed with rigor at every stage of the LLM development pipeline.
    Stated directly in the abstract as the core argument.

pith-pipeline@v0.9.0 · 5733 in / 1149 out tokens · 41743 ms · 2026-05-11T01:23:17.538473+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors
