pith. machine review for the scientific record.

arxiv: 2605.06901 · v1 · submitted 2026-05-07 · 💻 cs.CL

Recognition: no theorem link

Reflections and New Directions for Human-Centered Large Language Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords human-centered AI · large language models · AI ethics · model development pipeline · responsible AI · NLP · HCI · future of work

The pith

Large language model developers must address human concerns and values at every stage of the pipeline, not only after training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a framework for Human-Centered Large Language Models that combines ideas from natural language processing, human-computer interaction, and responsible AI. It claims that to handle the growing influence of these models in education, healthcare, finance, and other fields, developers need to treat human preferences, ethics, and goals as core requirements from initial system design through data sourcing, training, evaluation, and deployment. A closing case study applies the same lens to the future of work. A sympathetic reader would care because current models already shape private and professional decisions, and late-stage fixes alone have not prevented misalignment with real user needs.

Core claim

The central claim is that human-centered considerations must receive rigorous attention at every stage of the language modeling pipeline, from design and data sourcing to training, evaluation, and responsible deployment, rather than being limited to a cursory post-training adjustment.

What carries the argument

The HCLLM framework, which integrates perspectives from NLP, HCI, and responsible AI to guide decisions at each pipeline stage while balancing ethics, economics, and technical goals.

Load-bearing premise

That systematically folding human perspectives into every stage of development will produce better real-world outcomes without unacceptable losses in model capability or added cost.

What would settle it

An empirical comparison showing that LLMs built with only standard post-training alignment achieve equal or higher user satisfaction, ethical compliance, and task performance than those developed with the integrated human-centered process at every stage.
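The comparison described above could be operationalized as a simple two-cohort analysis: collect per-user satisfaction scores for a model built with post-training alignment only and for one built with the integrated human-centered process, then estimate the difference with uncertainty. The sketch below is purely illustrative, using Python's standard library and synthetic ratings; the function name, cohort data, and 1–5 rating scale are assumptions, not anything from the paper.

```python
# Hypothetical sketch of the deciding experiment: bootstrap a 95% CI
# for the difference in mean user satisfaction between two development
# processes. All data below are synthetic and illustrative.
import random
import statistics

def bootstrap_mean_diff(baseline, integrated, n_boot=2000, seed=0):
    """Return (observed mean difference, 95% bootstrap CI)."""
    rng = random.Random(seed)
    observed = statistics.mean(integrated) - statistics.mean(baseline)
    diffs = []
    for _ in range(n_boot):
        # Resample each cohort with replacement and recompute the gap.
        b = [rng.choice(baseline) for _ in baseline]
        i = [rng.choice(integrated) for _ in integrated]
        diffs.append(statistics.mean(i) - statistics.mean(b))
    diffs.sort()
    lo = diffs[int(0.025 * n_boot)]
    hi = diffs[int(0.975 * n_boot)]
    return observed, (lo, hi)

# Synthetic satisfaction ratings (1-5) from two hypothetical user cohorts.
post_training_only = [3.1, 3.4, 2.9, 3.3, 3.0, 3.2, 2.8, 3.5]
pipeline_integrated = [3.6, 3.9, 3.4, 3.8, 3.5, 4.0, 3.3, 3.7]

diff, (lo, hi) = bootstrap_mean_diff(post_training_only, pipeline_integrated)
print(f"mean diff {diff:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

A confidence interval excluding zero in either direction would bear on the paper's load-bearing premise; an interval straddling zero would leave it unsettled.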

Figures

Figures reproduced from arXiv: 2605.06901 by Aadesh Salecha, Aditri Bhagirath, Advit Deepak, Ahmad Rushdi, Angela Liu, Anshika Agarwal, Anthony Qin, Arpandeep Khatua, Caleb Ziems, Cat Gonzales Fergesen, Cathy Zhou, Daniel Frees, Diyi Yang, Dora Zhao, Emily Redmond, Gabriela Aranguiz-Dias, Haocheng Fan, Harshvardhan Agarwal, Henry Jin Weng, Huanxing Chen, Irawadee Thawornbut, Isabelle Levent, James Landay, Justine Breuch, Kirill Kalinin, Matthew Jörke, Mei Tan, Michael J. Ryan, Michelle Vinocour, Minsik Oh, Miranda Li, Nathan J. Paek, Patrick Ye, Pawan Wirawarn, Pranava Singhal, Rose E. Wang, Ruishi Chen, Sarah Chen, Sean Zhang, Shashanka Subrahmanya, Shirley Wu, Sophie Wu, Su Doga Karaca, Sunny Yu, Teland La, Tian Gao, Vishal Jain, William Fang, Xiaoxian Shen, Xiaoyue Wang, Xinling Li, Yichen Xie, Yiling Zhao, Yongce Li, Yucheng Jiang, Yutong Zhang, Zheng Wang, Ziqing Huang.

Figure 1. This survey has three core sections, focused on (1) defining, (2) developing, and (3) deploying …
Figure 2. We can draw on the field of human-computer interaction (HCI) to help inform human …
Figure 3. This chapter focuses on the human origins of data (§ …)
Figure 4. This chapter applies human-centered considerations to …
Figure 5. In this chapter, we discuss common pitfalls and best practices for evaluating HCLLMs, …
Figure 6. We enumerate three properties for the responsible HCLLM deployment: …
Figure 7. We present a case study on HCLLMs and the future of work, covering the three key areas of …
Original abstract

Large Language Models (LLMs) are increasingly shaping the private and professional lives of users, with numerous applications in business, education, finance, healthcare, law, and science. With this rise in global influence comes greater urgency to build, evaluate, and deploy these systems in a manner that prioritizes not only technical capabilities but also human priorities. This work presents a framework for developing Human-Centered Large Language Models (HCLLMs), which integrates perspectives from Natural Language Processing (NLP), Human-Computer Interaction (HCI), and responsible AI. Considering the ethics, economics, and technical objectives of language modeling, we argue that model developers need to address human concerns, preferences, values, and goals, not only during a cursory post-training stage, but rather with rigor and care at every stage of the pipeline. This paper offers human-centered insights and recommendations for developers at each stage, from system design to data sourcing, model training, evaluation, and responsible deployment. Then we conclude with a case study, applying these insights to understand the future of work with HCLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a framework for Human-Centered Large Language Models (HCLLMs) that integrates insights from NLP, HCI, and responsible AI. It argues that human concerns, preferences, values, and goals must be addressed with rigor at every stage of the LLM pipeline, from system design to data sourcing, model training, evaluation, and responsible deployment, rather than only in post-training. The paper provides recommendations for each stage and concludes with a case study applying these to the future of work.

Significance. The synthesis of human-centered principles across the full development pipeline represents a valuable organizational contribution that could help guide more responsible LLM development practices. By highlighting the need for consistent attention to human factors, the paper encourages a shift from ad-hoc post-training fixes to systematic integration. Its significance is enhanced by the structured recommendations and the illustrative case study, though its impact will depend on how actionable and balanced the suggestions are in practice.

major comments (2)
  1. [Introduction] The central claim (Introduction and Abstract) that systematically incorporating human-centered perspectives at each pipeline stage will produce meaningfully better outcomes rests on an unexamined assumption that this can be done without prohibitive trade-offs in capability or cost; no section addresses how alignment objectives interact with scaling or performance metrics.
  2. [Case Study] The case study applies the framework to the future of work but provides no concrete metrics, baselines, or falsifiable predictions for success, which weakens its role in supporting the pipeline-wide recommendations.
minor comments (2)
  1. Clarify the distinction between the proposed framework and prior work in responsible AI and HCI to better highlight the paper's unique contributions.
  2. Some stage-specific recommendations could be strengthened with references to concrete examples from the literature or industry practices.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our manuscript. We address each major comment below and outline targeted revisions to strengthen the paper while preserving its reflective and framework-oriented scope.

Point-by-point responses
  1. Referee: [Introduction] The central claim (Introduction and Abstract) that systematically incorporating human-centered perspectives at each pipeline stage will produce meaningfully better outcomes rests on an unexamined assumption that this can be done without prohibitive trade-offs in capability or cost; no section addresses how alignment objectives interact with scaling or performance metrics.

    Authors: We acknowledge that the manuscript does not explicitly analyze potential trade-offs between human-centered integration and scaling laws or performance metrics. The central claim is framed as a call for systematic attention to human factors across the pipeline, based on the view that such integration supports more robust long-term outcomes, but we agree this leaves the interaction with capability and cost unexamined. In revision, we will add a concise discussion subsection (likely in the Introduction or following the pipeline recommendations) that references existing work on alignment tax, scaling behaviors, and performance trade-offs, while clarifying that the framework does not assume zero-cost integration. revision: partial

  2. Referee: [Case Study] The case study applies the framework to the future of work but provides no concrete metrics, baselines, or falsifiable predictions for success, which weakens its role in supporting the pipeline-wide recommendations.

    Authors: The case study is intended as an illustrative application of the HCLLM framework to the domain of future work, consistent with the paper's reflective and directional character rather than an empirical validation study. As such, it deliberately avoids quantitative metrics or baselines. To strengthen its supportive role, we will revise the case study to include a brief discussion of potential evaluation criteria and example metrics (e.g., human-centered outcome measures in workplace LLM deployment) that could be used to test the framework's recommendations in future empirical work, without introducing new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is a normative position piece that advocates integrating human-centered perspectives from NLP, HCI, and responsible AI across the LLM pipeline. It contains no equations, empirical predictions, fitted parameters, or formal derivations whose validity depends on internal reductions. The central argument is conceptual and draws on external literatures without self-referential definitions or load-bearing self-citations that collapse the claim to its own inputs. Recommendations are framed as insights and reorganizations of known principles rather than novel results derived by construction from the paper's own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on domain assumptions about the feasibility and value of integrating human-centered perspectives throughout technical pipelines, without new evidence or invented entities.

axioms (1)
  • domain assumption: Human concerns, preferences, values, and goals can and should be addressed with rigor at every stage of the LLM development pipeline.
    Stated directly in the abstract as the core argument.

pith-pipeline@v0.9.0 · 5733 in / 1149 out tokens · 41743 ms · 2026-05-11T01:23:17.538473+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 5 internal anchors
