From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap

Ahmed E. Hassan; Dayi Lin; Gopi Krishnan Rajbahadur; Gustavo A. Oliva; Jiho Shin

arxiv: 2410.20791 · v3 · submitted 2024-10-28 · 💻 cs.SE · cs.AI

Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Dayi Lin, Jiho Shin, Ahmed E. Hassan

Pith reviewed 2026-05-23 19:05 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords foundation models · FMware · production challenges · software engineering · LLM deployment · model alignment · agent orchestration · AI systems testing

The pith

Production-ready FMware requires solving distinct challenges in model selection, data alignment, prompt engineering, agent orchestration, testing, and deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a semi-structured thematic synthesis to map the obstacles that prevent foundation-model-based software from moving beyond demonstrations. It draws on the authors' experience building the FMArts platform and integrating it into Huawei Cloud, together with grey literature, academic work, OPEA participation, AIware events, and the ISO SPDX SBOM working group on AI and datasets. This matters because many teams can build working prototypes yet run into reliability, cost, scalability, and compliance failures at scale. The synthesis highlights both lifecycle-specific problems and cross-cutting needs such as memory management and observability, then outlines technologies and practices that could close the gap.

Core claim

The paper claims that critical issues arise in FM(s) selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment, alongside cross-cutting concerns such as memory management, observability, and feedback integration; these issues are identified through a semi-structured thematic synthesis of industry experience, grey literature, academic publications, OPEA involvement, AIware events, and ISO SPDX work on AI and datasets.

What carries the argument

Semi-structured thematic synthesis that aggregates and thematically codes evidence from the authors' FMArts platform work, Huawei Cloud integration, grey literature, academic papers, OPEA, AIware events, and ISO SPDX SBOM efforts on AI and datasets.

If this is right

Better methods and criteria for selecting suitable foundation models become necessary for production systems.
Dedicated techniques for aligning data and models are required to achieve consistent behavior.
Specialized tooling and practices for prompt engineering and multi-agent orchestration are needed.
Testing and deployment pipelines must incorporate FM-specific uncertainties and compliance checks.
Observability, memory management, and continuous feedback mechanisms must be built into the system architecture.
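To make the point about FM-specific uncertainty in testing concrete, here is a minimal sketch of one possible approach: replacing a single exact-match assertion with a pass-rate check over repeated samples. It is an illustration only and is not taken from the paper. The names `pass_rate`, `make_stub_model`, and `mentions_paris` are hypothetical, the stub stands in for a real model call, and the 0.8 threshold is an arbitrary assumption.

```python
import random
from typing import Callable


def pass_rate(generate: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              n_samples: int = 20) -> float:
    """Return the fraction of sampled outputs that satisfy `check`."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    passed = sum(1 for _ in range(n_samples) if check(generate(prompt)))
    return passed / n_samples


def make_stub_model(seed: int, failure_rate: float) -> Callable[[str], str]:
    """Hypothetical stand-in for a nondeterministic FM call (assumption)."""
    rng = random.Random(seed)

    def generate(prompt: str) -> str:
        # With probability `failure_rate`, return an unhelpful answer.
        if rng.random() < failure_rate:
            return "I am not sure."
        return "The capital of France is Paris."

    return generate


def mentions_paris(output: str) -> bool:
    """Property check that tolerates variation in wording."""
    return "paris" in output.lower()


if __name__ == "__main__":
    model = make_stub_model(seed=0, failure_rate=0.1)
    rate = pass_rate(model, mentions_paris,
                     "What is the capital of France?", n_samples=50)
    threshold = 0.8  # arbitrary assumption; tune per use case
    print(f"pass rate: {rate:.2f}, meets threshold: {rate >= threshold}")
```

The design choice here is to assert on a property of the output and on an aggregate rate, so that a single unlucky sample does not fail the build while a systematic regression still does.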

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Widespread adoption of the identified practices could shorten the time from prototype to reliable enterprise deployment.
Multi-industry efforts may be required to develop the supporting platforms and standards referenced in the roadmap.
Regulatory and standards bodies could incorporate the listed challenges into future compliance frameworks for AI software.

Load-bearing premise

The sources consulted comprehensively capture the main production challenges across the FMware domain.

What would settle it

A broad survey of deployed production FMware systems that identifies a substantially different primary set of obstacles not surfaced by the thematic synthesis.

Figures

Figures reproduced from arXiv: 2410.20791 by Ahmed E. Hassan, Dayi Lin, Gopi Krishnan Rajbahadur, Gustavo A. Oliva, Jiho Shin.

**Figure 1.** FMware Lifecycle.

Original abstract

The rapid expansion of foundation models (FMs), such as large language models (LLMs), has given rise to FMware, software systems that integrate FM(s) as core components. While building demonstration-level FMware is relatively straightforward, transitioning to production-ready systems presents numerous challenges, including reliability, high implementation costs, scalability, and compliance with privacy regulations. Our paper conducts a semi-structured thematic synthesis to identify key challenges in productionizing FMware across diverse data sources, including our industry experience developing FMArts, a FMware lifecycle engineering platform, and its integration into Huawei Cloud; grey literature; academic publications; hands-on involvement in the Open Platform for Enterprise AI (OPEA); organizing the AIware conference and bootcamp; and co-leading the ISO SPDX SBOM working group on AI and datasets. We identify critical issues in FM(s) selection, data and model alignment, prompt engineering, agent orchestration, system testing, and deployment, alongside cross-cutting concerns such as memory management, observability, and feedback integration. We discuss necessary technologies and strategies to address these challenges and offer guidance to enable the transition from demonstration systems to scalable, production-ready FMware solutions. Our findings underscore the importance of continued research and multi-industry collaboration to advance the development of production-ready FMware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that a semi-structured thematic synthesis drawing on the authors' industry experience with FMArts and Huawei Cloud, grey literature, academic publications, OPEA involvement, AIware events, and ISO SPDX SBOM work identifies core challenges in productionizing FMware (foundation-model software systems). These include FM(s) selection, data and model alignment, prompt engineering, agent orchestration, system testing, deployment, and cross-cutting concerns such as memory management, observability, and feedback integration. The paper discusses necessary technologies and strategies and calls for continued research and multi-industry collaboration.

Significance. If the synthesis holds, the paper would offer a structured, practitioner-oriented roadmap that consolidates emerging challenges in an important area of software engineering. Its grounding in standards work (ISO) and open platforms (OPEA) adds practical relevance and could help focus research efforts on production concerns rather than demonstration-level capabilities.

major comments (1)

[Abstract / synthesis description] The description of the semi-structured thematic synthesis (abstract and any methods section) names the data sources but provides no detail on the coding scheme, inter-rater reliability, or how conflicts between sources were resolved. Because the central claim rests on the synthesis yielding the enumerated challenges, this omission is load-bearing for assessing the reliability of the findings.
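For readers unfamiliar with the statistic this comment alludes to, inter-rater reliability between two coders is often reported as Cohen's kappa, which compares observed agreement with the agreement expected by chance. The sketch below is illustrative only: the function name `cohens_kappa` and the theme labels are hypothetical, and no data from the paper is used.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable],
                 rater_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)

    # p_o: fraction of items on which both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # p_e: chance agreement, from each rater's marginal label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Made-up theme codes assigned by two hypothetical coders.
    coder_1 = ["testing", "deployment", "testing", "prompting", "deployment"]
    coder_2 = ["testing", "deployment", "prompting", "prompting", "deployment"]
    print(f"kappa = {cohens_kappa(coder_1, coder_2):.3f}")
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance.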

minor comments (1)

[Introduction] The term 'FMware' is introduced as an invented entity without an explicit, concise definition early in the manuscript; adding one would improve accessibility for readers outside the immediate community.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback on the methodology description. We address the major comment below and will revise the manuscript to provide greater transparency.

Point-by-point responses

Referee: The description of the semi-structured thematic synthesis (abstract and any methods section) names the data sources but provides no detail on the coding scheme, inter-rater reliability, or how conflicts between sources were resolved. Because the central claim rests on the synthesis yielding the enumerated challenges, this omission is load-bearing for assessing the reliability of the findings.

Authors: We agree that additional detail on the synthesis process would strengthen the paper. In the revised version we will insert a dedicated Methods section describing the semi-structured thematic synthesis. It will explain: (1) the iterative coding process, beginning with challenges observed during FMArts development and Huawei Cloud integration, then refined against grey literature, academic publications, OPEA contributions, AIware events, and ISO SPDX work; (2) resolution of conflicts across sources through structured author discussions that weighted empirical production issues more heavily than purely theoretical ones; and (3) the rationale for not computing formal inter-rater reliability statistics, given the expert-driven, single-team nature of the synthesis, with reliability instead supported by triangulation across independent data sources. This addition will allow readers to evaluate the findings more rigorously while preserving the paper's practitioner-oriented focus.

revision: yes

Circularity Check

0 steps flagged

No significant circularity: synthesis paper whose claim is the act of synthesis itself

full rationale

The paper performs a semi-structured thematic synthesis to enumerate production challenges for FMware. Its central claim is satisfied by reporting the synthesis performed on the listed sources (industry experience, grey literature, academic papers, OPEA, conferences, ISO work). No equations, quantitative predictions, fitted parameters, or uniqueness theorems appear. No step reduces a claimed derivation to its own inputs by construction, self-citation chain, or renaming. The reliance on authors' experience is disclosed as one input among others and does not create a load-bearing loop that forces the enumerated challenges. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the representativeness of the chosen data sources and the validity of thematic synthesis as a method for surfacing production challenges. No free parameters or invented physical entities are introduced. FMware is a coined umbrella term rather than a new postulated mechanism.

axioms (1)

domain assumption: Thematic synthesis of the listed practitioner and literature sources yields a complete and unbiased set of production challenges.
Invoked in the description of the study method and source selection in the abstract.

invented entities (1)

FMware · no independent evidence
purpose: Umbrella term for software systems that integrate foundation models as core components.
Introduced to name the class of systems under study; no independent falsifiable prediction attached.


2024. OPEA Community Events - LF AI Foundation - Confluence — lf-aidata.atlassian.net. https://lf-aidata.atlassian.net/wiki/spaces/DL/pages/ 14094763/OPEA+Community+Events. [Accessed 11-10-2024]

work page 2024

[15] [15]

Assad Abbas

Dr. Assad Abbas. 2024. The Financial Challenges of Leading in AI: A Look at OpenAI’s Operating Costs — unite.ai. https://www.unite.ai/the- financial-challenges-of-leading-in-ai-a-look-at-openais-operating-costs/. [Accessed 11-10-2024]

work page 2024

[16] [16]

Bhashithe Abeysinghe and Ruhan Circi. 2024. The Challenges of Evaluating LLM Applications: An Analysis of Automated, Human, and LLM-Based Approaches. arXiv preprint arXiv:2406.03339 (2024)

work page arXiv 2024

[17] [17]

Ibrahim Ahmed, Sahil Parmar, Matthew Boyd, Michael Beidler, Kris Kang, Bill Liu, Kyle Roach, John Kim, and Dennis Abts. 2022. Answer fast: Accelerating BERT on the tensor streaming processor. In 2022 IEEE 33rd International Conference on Application-specific Systems, Architectures and Manuscript submitted to ACM 20 Rajbahadur et al. Processors (ASAP). IEEE, 80–87

work page 2022

[18] [18]

Toufique Ahmed and Premkumar Devanbu. 2023. Better patching using LLM prompting, via Self-Consistency. In 2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE) . IEEE, 1742–1746

work page 2023

[19] [19]

SM Didar Al Alam, Maleknaz Nayebi, Dietmar Pfahl, and Guenther Ruhe. 2017. A two-staged survey on release readiness. In Proceedings of the 21st International Conference on Evaluation and Assessment in Software Engineering . 374–383

work page 2017

[20] [20]

SM Didar Al Alam, SM Shahnewaz, Dietmar Pfahl, and Guenther Ruhe. 2014. Monitoring bottlenecks in achieving release readiness: A retrospective case study across ten oss projects. In Proceedings of the 8th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement . 1–4

work page 2014

[21] [21]

Alon Albalak, Yanai Elazar, Sang Michael Xie, Shayne Longpre, Nathan Lambert, Xinyi Wang, Niklas Muennighoff, Bairu Hou, Liangming Pan, Haewon Jeong, et al. 2024. A survey on data selection for language models. arXiv preprint arXiv:2402.16827 (2024)

work page arXiv 2024

[22] [22]

Peter Alvaro and Andrew Quinn. 2024. Deterministic Record-and-Replay: Zeroing in only on the nondeterministic actions of the process. Queue 22, 4 (2024), 120–129

work page 2024

[23] [23]

Oluyemi Enoch Amujo and Shanchieh Jay Yang. 2024. How Good Is It? Evaluating the Efficacy of Common versus Domain-Specific Prompts on Foundational Large Language Models. arXiv preprint arXiv:2407.11006 (2024)

work page arXiv 2024

[24] [24]

Narges Ashtari, Ryan Mullins, Crystal Qian, James Wexler, Ian Tenney, and Mahima Pushkarna. 2023. From Discovery to Adoption: Understanding the ML Practitioners’ Interpretability Journey. InProceedings of the 2023 ACM Designing Interactive Systems Conference . 2304–2325

work page 2023

[25] [25]

Abhaya Asthana and Jack Olivieri. 2009. Quantifying software reliability and readiness. In 2009 IEEE international workshop technical committee on communications quality and reliability . IEEE, 1–6

work page 2009

[26] [26]

Suriya Ganesh Ayyamperumal and Limin Ge. 2024. Current state of LLM Risks and AI Guardrails. arXiv preprint arXiv:2406.12934 (2024)

work page arXiv 2024

[27] [27]

Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D Sculley. 2017. The ML test score: A rubric for ML production readiness and technical debt reduction. In 2017 IEEE international conference on big data (big data) . IEEE, 1123–1132

work page 2017

[28] [28]

Jenna L Butler, Thomas Zimmermann, and Christian Bird. 2024. Objectives and Key Results in Software Teams: Challenges, Opportunities and Impact on Development. In Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice . 358–368

work page 2024

[29] [29]

Kaiyan Chang, Songcheng Xu, Chenglong Wang, Yingfeng Luo, Tong Xiao, and Jingbo Zhu. 2024. Efficient Prompting Methods for Large Language Models: A Survey. arXiv preprint arXiv:2404.01077 (2024)

work page arXiv 2024

[30] [30]

Arnav Chavan, Raghav Magazine, Shubham Kushwaha, Mérouane Debbah, and Deepak Gupta. 2024. Faster and Lighter LLMs: A Survey on Current Challenges and Way Forward. arXiv preprint arXiv:2402.01799 (2024)

work page arXiv 2024

[31] [31]

Dong Chen, Shaoxin Lin, Muhan Zeng, Daoguang Zan, Jian-Gang Wang, Anton Cheshkov, Jun Sun, Hao Yu, Guoliang Dong, Artem Aliev, et al

work page

[32] [32]

arXiv preprint arXiv:2406.01304 (2024)

CodeR: Issue Resolving with Multi-Agent and Task Graphs. arXiv preprint arXiv:2406.01304 (2024)

work page arXiv 2024

[33] [33]

Guiming Hardy Chen, Shunian Chen, Ziche Liu, Feng Jiang, and Benyou Wang. 2024. Humans or llms as the judge? a study on judgement biases. arXiv preprint arXiv:2402.10669 (2024)

work page arXiv 2024

[34] [34]

Hao Chen, Yiming Zhang, Qi Zhang, Hantao Yang, Xiaomeng Hu, Xuetao Ma, Yifan Yanggong, and Junbo Zhao. 2023. Maybe only 0.5% data is needed: A preliminary exploration of low training data instruction tuning. arXiv preprint arXiv:2305.09246 (2023)

work page arXiv 2023

[35] [35]

Hung-Ting Chen, Michael JQ Zhang, and Eunsol Choi. 2022. Rich knowledge sources bring complex knowledge conflicts: Recalibrating models to reflect conflicting evidence. arXiv preprint arXiv:2210.13701 (2022)

work page arXiv 2022

[36] [36]

Lihu Chen and Gaël Varoquaux. 2024. What is the Role of Small Models in the LLM Era: A Survey. arXiv preprint arXiv:2409.06857 (2024)

work page arXiv 2024

[37] [37]

Xiang Chen, Chaoyang Gao, Chunyang Chen, Guangbei Zhang, and Yong Liu. 2025. An Empirical Study on Challenges for LLM Application Developers. ACM Transactions on Software Engineering and Methodology (2025)

work page 2025

[38] [38]

Yuyan Chen, Zhihao Wen, Ge Fan, Zhengyu Chen, Wei Wu, Dayiheng Liu, Zhixu Li, Bang Liu, and Yanghua Xiao. 2024. Mapo: Boosting large language model performance with model-adaptive prompt optimization. arXiv preprint arXiv:2407.04118 (2024)

work page arXiv 2024

[39] [39]

Daniela S Cruzes and Tore Dybå. 2010. Synthesizing evidence in software engineering research. In Proceedings of the 2010 ACM-IEEE International Symposium on Empirical Software Engineering and Measurement . 1–10

work page 2010

[40] [40]

Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The power of noise: Redefining retrieval for rag systems. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval . 719–729

work page 2024

[41] [41]

James Cusick. 2013. Architecture and Production Readiness Reviews in Practice. arXiv preprint arXiv:1305.2402 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013

[42] [42]

Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E Ho. 2024. Hallucinating law: Legal mistakes with large language models are pervasive

work page 2024

[43] [43]

Ximing Dong, Dayi Lin, Shaowei Wang, and Ahmed E Hassan. 2024. A Framework for Real-time Safeguarding the Text Generation of Large Language. arXiv preprint arXiv:2404.19048 (2024)

work page arXiv 2024

[44] [44]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitya Lyubarskiy, Shubho Sengupta, Shin Yoo, and Jie M Zhang. 2023. Large language models for software engineering: Survey and open problems. In2023 IEEE/ACM International Conference on Software Engineering: Future of Software Engineering (ICSE-FoSE). IEEE, 31–53

work page 2023

[45] [45]

Lizhou Fan, Lingyao Li, Zihui Ma, Sanggyu Lee, Huizi Yu, and Libby Hemphill. 2024. A bibliometric review of large language models research from 2017 to 2023. ACM Transactions on Intelligent Systems and Technology 15, 5 (2024), 1–25

work page 2024

[46] [46]

Mingqi Gao, Xinyu Hu, Jie Ruan, Xiao Pu, and Xiaojun Wan. 2024. Llm-based nlg evaluation: Current status and challenges. arXiv preprint arXiv:2402.01383 (2024). Manuscript submitted to ACM From Cool Demos to Production-Ready FMware: Core Challenges and a Technology Roadmap 21

work page arXiv 2024

[47] [47]

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé Iii, and Kate Crawford. 2021. Datasheets for datasets. Commun. ACM 64, 12 (2021), 86–92

work page 2021

[48] [48]

Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. 2024. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[49] [49]

Tiezheng Guo, Qingwen Yang, Chen Wang, Yanyi Liu, Pan Li, Jiawei Tang, Dapeng Li, and Yingyou Wen. 2024. Knowledgenavigator: Leveraging large language models for enhanced reasoning over knowledge graph. Complex & Intelligent Systems 10, 5 (2024), 7063–7076

work page 2024

[50] [50]

Philipp Hacker, Andreas Engel, and Marco Mauer. 2023. Regulating ChatGPT and other large generative AI models. In Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency . 1112–1123

work page 2023

[51] [51]

Hosein Hasanbeig, Hiteshi Sharma, Leo Betthauser, Felipe Vieira Frujeri, and Ida Momennejad. 2023. ALLURE: auditing and improving llm-based evaluation of text using iterative in-context-learning. arXiv e-prints (2023), arXiv–2309

work page 2023

[52] [52]

Hassan, Bram Adams, Haoxiang Zhang, Thomas Zimmermann, Foutse Khomh, and Nachi Nagappan

Ahmed E. Hassan, Bram Adams, Haoxiang Zhang, Thomas Zimmermann, Foutse Khomh, and Nachi Nagappan. 2024. Challenges and Opportunities in the Road Ahead . Technical Report. Queen’s University, Canada. https://sail.cs.queensu.ca/data/pdfs/2024_FMSE_Vision_2030_Summary_ Report__Challenges_and_Opportunities_in_the_Road_Ahead.pdf Accessed: 2024-10-10

work page 2024

[53] [53]

Ahmed E. Hassan, Dayi Lin, Gopi Krishnan Rajbahadur, Keheliya Gallaba, Filipe Roseiro Cogo, Boyuan Chen, Haoxiang Zhang, Kishanthan Thangarajah, Gustavo Oliva, Jiahuei (Justina) Lin, Wali Mohammad Abdullah, and Zhen Ming (Jack) Jiang. 2024. Rethinking Software Engineering in the Era of Foundation Models: A Curated Catalogue of Challenges in the Developmen...

work page 2024

[54] [54]

Hassan, Gustavo A

Ahmed E. Hassan, Gustavo A. Oliva, Dayi Lin, Boyuan Chen, Zhen Ming, and Jiang. 2024. Towards AI-Native Software Engineering (SE 3.0): A Vision and a Challenge Roadmap. arXiv:2410.06107

work page arXiv 2024

[55] [55]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2024. Large language models for software engineering: A systematic literature review. ACM Transactions on Software Engineering and Methodology 33, 8 (2024), 1–79

work page 2024

[56] [56]

Xu Huang, Weiwen Liu, Xiaolong Chen, Xingmei Wang, Hao Wang, Defu Lian, Yasheng Wang, Ruiming Tang, and Enhong Chen. 2024. Understanding the planning of LLM agents: A survey. arXiv preprint arXiv:2402.02716 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[57] [57]

Yingbing Huang, Lily Jiaxin Wan, Hanchen Ye, Manvi Jha, Jinghua Wang, Yuhong Li, Xiaofan Zhang, and Deming Chen. 2024. New Solutions on LLM Acceleration, Optimization, and Application. arXiv preprint arXiv:2406.10903 (2024)

work page arXiv 2024

[58] [58]

Chandra Irugalbandara, Ashish Mahendra, Roland Daynauth, Tharuka Kasthuri Arachchige, Jayanaka Dantanarayana, Krisztian Flautner, Lingjia Tang, Yiping Kang, and Jason Mars. 2024. Scaling Down to Scale Up: A Cost-Benefit Analysis of Replacing OpenAI’s LLM with Open Source SLMs in Production. In 2024 IEEE International Symposium on Performance Analysis of S...

work page 2024

[59] [59]

Juyong Jiang, Fan Wang, Jiasi Shen, Sungju Kim, and Sunghun Kim. 2024. A Survey on Large Language Models for Code Generation. arXiv preprint arXiv:2406.00515 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[60] [60]

Haolin Jin, Linghan Huang, Haipeng Cai, Jun Yan, Bo Li, and Huaming Chen. 2024. From llms to llm-based agents for software engineering: A survey of current, challenges and future. arXiv preprint arXiv:2408.02479 (2024)

work page arXiv 2024

[61] [61]

Uday Kamath, Kevin Keenan, Garrett Somers, and Sarah Sorenson. 2024. LLMs in Production. In Large Language Models: A Deep Dive: Bridging Theory and Practice. Springer, 315–373

work page 2024

[62] [62]

Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, et al. 2024. Evaluating LLMs at Detecting Errors in LLM Responses. arXiv preprint arXiv:2404.03602 (2024)

work page arXiv 2024

[63] [63]

Samantha Murphy Kelly. [n. d.]. Apple’s new China problem: ChatGPT is banned there | CNN Business — cnn.com. https://www.cnn.com/2024/06/ 21/tech/apple-ai-chatgpt-ban-china/index.html. [Accessed 04-10-2024]

work page 2024

[64] [64]

Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2024. Benchmarking Cognitive Biases in Large Language Models as Evaluators. arXiv:2309.17012 [cs.CL] https://arxiv.org/abs/2309.17012

work page arXiv 2024

[65] [65]

Nico Koprowski, M Firdaus Harun, and Horst Lichter. 2014. Release readiness measurement: a comparison of best practices. In 2014 8th. Malaysian Software Engineering Conference (MySEC) . IEEE, 166–171

work page 2014

[66] [66]

Michael Kuchnik, Virginia Smith, and George Amvrosiadis. 2023. Validating large language models with relm. Proceedings of Machine Learning and Systems 5 (2023), 457–476

work page 2023

[67] [67]

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles . 611–626

work page 2023

[68] [68]

Philippe Laban, Alexander R Fabbri, Caiming Xiong, and Chien-Sheng Wu. 2024. Summary of a haystack: A challenge to long-context llms and rag systems. arXiv preprint arXiv:2407.01370 (2024)

work page arXiv 2024

[69] [69]

Baolin Li, Yankai Jiang, Vijay Gadepally, and Devesh Tiwari. 2024. LLM Inference Serving: Survey of Recent Advances and Opportunities. arXiv preprint arXiv:2407.12391 (2024)

work page arXiv 2024

[70] [70]

Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. 2023. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[71] [71]

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics 12 (2024), 157–173. Manuscript submitted to ACM 22 Rajbahadur et al

work page 2024

[72] [72]

Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, et al. 2024. Best practices and lessons learned on synthetic data for language models. arXiv preprint arXiv:2404.07503 (2024)

work page arXiv 2024

[73] [73]

Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Qingwei Lin, Jianguang Lou, Shifeng Chen, Yansong Tang, and Weizhu Chen. 2024. Arena Learning: Build Data Flywheel for LLMs Post-training via Simulated Chatbot Arena. arXiv preprint arXiv:2407.10627 (2024)

work page arXiv 2024

[74] [74]

Wanqin Ma, Chenyang Yang, and Christian Kästner. 2024. (Why) Is My Prompt Getting Worse? Rethinking Regression Testing for Evolving LLM APIs. In Proceedings of the IEEE/ACM 3rd International Conference on AI Engineering-Software Engineering for AI . 166–171

work page 2024

[75] [75]

Audris Mockus. 2003. Analogy based prediction of work item flow in software projects: a case study. In 2003 International Symposium on Empirical Software Engineering, 2003. ISESE 2003. Proceedings. IEEE, 110–119

work page 2003

[76] [76]

Nadia Nahar, Christian Kästner, Jenna Butler, Chris Parnin, Thomas Zimmermann, and Christian Bird. 2024. Beyond the Comfort Zone: Emerging Solutions to Overcome Challenges in Integrating LLMs into Software Products. arXiv preprint arXiv:2410.12071 (2024)

work page arXiv 2024

[77] [77]

OPEA. [n. d.]. Open Platform For Enterprise AI — opea.dev. https://opea.dev/. [Accessed 03-10-2024]

work page 2024

[78] [78]

Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. 2023. Memgpt: Towards llms as operating systems. arXiv preprint arXiv:2310.08560 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[79] [79]

Chris Parnin, Gustavo Soares, Rahul Pandita, Sumit Gulwani, Jessica Rich, and Austin Z Henley. 2023. Building Your Own Product Copilot: Challenges, Opportunities, and Needs. arXiv preprint arXiv:2312.14231 (2023)

work page arXiv 2023

[80] [80]

Venkatesh Balavadhani Parthasarathy, Ahtsham Zafar, Aafaq Khan, and Arsalan Shahid. 2024. The Ultimate Guide to Fine-Tuning LLMs from Basics to Breakthroughs: An Exhaustive Review of Technologies, Research, Best Practices, Applied Research Challenges and Opportunities. arXiv preprint arXiv:2408.13296 (2024)

work page arXiv 2024