pith. machine review for the scientific record. sign in

arxiv: 2605.02765 · v1 · submitted 2026-05-04 · 💻 cs.AI · cs.HC· cs.LG

Recognition: 2 theorem links

· Lean Theorem

U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:13 UTC · model grok-4.3

classification 💻 cs.AI cs.HCcs.LG
keywords LLM-based planninghard constraintssoft constraintsconstraint typesuser workflowsmodel checkingLLM-as-judgeuser studies
0
0 comments X

The pith

User-defined hard and soft constraint types improve usefulness and satisfaction in LLM planning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines ways to help users control LLM-based task planning through constraints, noting that purely hard constraints are too rigid and weighted flexibility confuses users. It proposes U-Define, where users write constraints in everyday language and mark them as hard (must obey, checked formally) or soft (can bend, checked by another LLM). Technical and user studies with both general and expert users demonstrate that this typing leads to better usefulness, performance, and satisfaction ratings, with no loss in ease of use. The work matters for making AI planners more reliable and user-aligned in variable real-world settings.

Core claim

U-Define lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility. It verifies these through formal model checking for hard constraints and LLM-as-judge evaluation for soft ones. A technical evaluation and user studies find that this improves perceived usefulness, performance, and satisfaction while maintaining usability.

What carries the argument

The abstraction of constraint strictness into user-chosen hard and soft types with distinct verification methods in U-Define.

Load-bearing premise

That users can reliably categorize their constraints as hard or soft in natural language and that the verification methods correctly reflect real user intent.

What would settle it

An experiment showing that users often misclassify the same constraints or that the verification methods approve plans that violate the user's actual intent.

Figures

Figures reproduced from arXiv: 2605.02765 by Aws Albarghouthi, Bilge Mutlu, Christine P Lee, David Porfirio, Xinyu Jessica Wang.

Figure 1
Figure 1. Figure 1: In a conventional workflow (top), users receive an LLM plan but must manually detect errors and missing preferences. In view at source ↗
Figure 2
Figure 2. Figure 2: Front-end interface of U-Define — The user begins by defining soft or hard constraints for their planning task into U-Define. The user then inputs their request, and in response, an initial set of plans is generated. Each plan is then evaluated by distinct verification mechanisms on its adherence to the user-defined soft and hard constraints. The verification results and explanations are presented to the u… view at source ↗
Figure 3
Figure 3. Figure 3: U-Define Rule Translator — Pipeline of the rule translator described in §4.2.2. The Rule Translator is an LLM-based agent fine-tuned on a natural language to LTL dataset that translates user-defined hard constraints into LTL properties for model checking. These properties are encoded in PRISM language for interpretation by the model checker, then converted back into natural language for user validation. To… view at source ↗
Figure 4
Figure 4. Figure 4: U-Define Model Checker — Pipeline of the model checker described in §4.3.2. The model checker takes in the LLM-generated plan in PRISM language and the user-defined hard constraints from the Definition stage, and then evaluates the plan against these rules. The verification result (i.e., validity of the generated plan), along with any violated rules, is presented to the user through the interface. Finally,… view at source ↗
Figure 5
Figure 5. Figure 5: Quantitative Data from User Study — Bar graphs on participants’ perceived performance of LLM, usefulness, satisfaction, usability scores, and constraint-fixing iterations across different conditions. Horizontal lines indicate significant pairwise comparisons with Dunnett’s test (𝑝 < 0.05∗ , 𝑝 < 0.01∗∗ , 𝑝 < 0.001∗∗∗). Vertical lines in each bar graph indicate standard error. This distinction shaped how par… view at source ↗
Figure 6
Figure 6. Figure 6: User interaction patterns across study conditions — Each dot marks when a participant added or modified a constraint during a 15-minute session. Grey bars show engagement time; conditions vary by constraint support: C1 (U-Define), C2 (soft only), C3 (hard only), C4 (none). Overall, task completion was much quicker, as shown by the fewer constraint adjustments in C1 (U-Define) compared to C4 (None) in view at source ↗
Figure 5
Figure 5. Figure 5: Instead, the two constraint types supported more effective task completion, resulting in greater usefulness and view at source ↗
Figure 7
Figure 7. Figure 7: Quantitative Data from User Study 2 — Bar graphs on experts’ perceived performance, usefulness, satisfaction, and usability scores across different conditions. Red lines represent results from Study 1, overlaid for context. Vertical lines in each bar graph indicate standard error. 7 USER STUDY 2: DOMAIN EXPERTS To assess U-Define in professional contexts where planning is more complex and high-stakes, we c… view at source ↗
Figure 7
Figure 7. Figure 7: To address these challenges, participants proposed ways to streamline constraint input. Five participants suggested template- or protocol-based approaches, where frequently used or repetitive constraints could be filled in semi￾automatically. Other three participants envisioned a learning process, where the system would save plans and constraints, gather feedback from the user on the plan outcome, and reus… view at source ↗
read the original abstract

LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control. While recent systems incorporate verification techniques, it remains unclear how users can effectively apply such rigid constraints to represent intent or adapt to real-world variability. For example, prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users. We investigate how interaction workflows can better support users in applying constraints to guide LLM-generated plans, examining whether abstracting strictness into high-level types (i.e., hard and soft) paired with distinct verification mechanisms helps users more reliably express and align intent. We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility. U-Define verifies these types through complementary methods: formal model checking for hard constraints and LLM-as-judge evaluation for soft ones. Through a technical evaluation and user studies with general and expert participants, we find that user-defined constraint types improve perceived usefulness, performance, and satisfaction while maintaining usability. These findings provide insights for designing flexible yet reliable constraint-based workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents U-Define, a system enabling users to specify natural language constraints for LLM-based planning and classify them as hard (verified via model checking) or soft (via LLM-as-judge). Based on technical evaluation and user studies with general and expert participants, it concludes that user-defined constraint types enhance perceived usefulness, performance, and satisfaction while maintaining usability.

Significance. This contribution is potentially significant for improving user control and reliability in LLM planning applications. By addressing limitations of purely hard constraints and confusing numeric weights through high-level typing and dual verification, it offers a practical workflow. The inclusion of both general and expert user studies is a strength, providing insights into real-world applicability. However, the low level of detail on empirical methods tempers the significance until more evidence is provided.

major comments (2)
  1. [User Studies] The abstract states positive outcomes from user studies without reporting sample sizes, statistical tests, exclusion criteria, or raw data. Since these studies are central to validating the improvements in usefulness, performance, and satisfaction, this omission makes it impossible to assess the robustness of the central empirical claim.
  2. [§3 (System Design)] The assumption that the chosen verification methods accurately reflect user intent is load-bearing; the technical evaluation should provide concrete metrics on how well model checking and LLM-as-judge align with actual user expectations to avoid introducing new errors.
minor comments (1)
  1. [Abstract] Consider adding a sentence on the scale of the user studies or key quantitative findings to better summarize the evidence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive review and for acknowledging the potential significance of U-Define in improving user control and reliability in LLM planning. We address each major comment below with point-by-point responses and indicate the revisions made to strengthen the manuscript.

read point-by-point responses
  1. Referee: [User Studies] The abstract states positive outcomes from user studies without reporting sample sizes, statistical tests, exclusion criteria, or raw data. Since these studies are central to validating the improvements in usefulness, performance, and satisfaction, this omission makes it impossible to assess the robustness of the central empirical claim.

    Authors: We agree that the abstract should provide sufficient detail for readers to evaluate the robustness of the user study results. While the full methodology—including sample sizes, statistical tests (such as t-tests and ANOVA), exclusion criteria, and data availability—is reported in Section 5 and the supplementary materials, we have revised the abstract to include a concise summary of these elements. This change improves transparency without altering the reported findings or conclusions. revision: yes

  2. Referee: [§3 (System Design)] The assumption that the chosen verification methods accurately reflect user intent is load-bearing; the technical evaluation should provide concrete metrics on how well model checking and LLM-as-judge align with actual user expectations to avoid introducing new errors.

    Authors: We acknowledge that alignment between verification methods and user intent is a critical assumption requiring explicit support. Section 4 already reports accuracy metrics for model checking on hard constraints and agreement rates for the LLM-as-judge on soft constraints using benchmark and held-out data. To directly address alignment with user expectations, we have added a new analysis in the revised manuscript that correlates verification outcomes with participant judgments from the expert user study, providing additional quantitative evidence on error rates and intent fidelity. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper describes an empirical system-building effort (U-Define) that lets users categorize constraints as hard or soft in natural language, then evaluates the approach via technical verification methods and user studies with general and expert participants. No equations, parameter-fitting procedures, derivations, or self-referential definitions appear in the abstract or described workflow. The central claims rest on external user-study outcomes rather than any reduction of predictions back to fitted inputs or self-citations, rendering the evaluation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The design rests on domain assumptions about LLM behavior and user categorization ability rather than new mathematical entities or fitted constants.

axioms (3)
  • domain assumption Users can meaningfully label constraints as hard or soft in natural language
    Core premise of the U-Define interface and user-study design.
  • domain assumption Formal model checking can be applied to LLM-generated plans to verify hard constraints
    Assumed to be feasible and reliable for the hard-constraint path.
  • domain assumption An LLM acting as judge can evaluate soft constraints in a way that aligns with user intent
    Assumed for the soft-constraint verification mechanism.

pith-pipeline@v0.9.0 · 5527 in / 1487 out tokens · 49766 ms · 2026-05-08T18:13:29.226954+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LearnMate^2: Design and Evaluation of an LLM-powered Personalized and Adaptive Support System for Online Learning

    cs.HC 2026-05 unverdicted novelty 6.0

    LearnMate^2, an LLM-driven personalized learning support system, improves learning outcomes and user experience over existing online platforms combined with generic LLM assistance in small-scale user studies.

Reference graph

Works this paper leans on

127 extracted references · 40 canonical work pages · cited by 1 Pith paper · 7 internal anchors

  1. [1]

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report.arXiv preprint arXiv:2303.08774(2023)

  2. [2]

    Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona Diab, and Marjan Ghazvininejad. 2022. A review on language models as knowledge bases. arXiv preprint arXiv:2204.06031(2022)

  3. [3]

    Fouad Amer, Hui Yi Koh, and Mani Golparvar-Fard. 2021. Automated methods and systems for construction planning and scheduling: Critical review of three decades of research.Journal of Construction Engineering and Management147, 7 (2021), 03121002

  4. [4]

    Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz

    Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems(Glasgow, Scotland Uk)(CHI ’19). Associat...

  5. [5]

    Suriya Ganesh Ayyamperumal and Limin Ge. 2024. Current state of LLM Risks and AI Guardrails.arXiv preprint arXiv:2406.12934(2024)

  6. [6]

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. 2022. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073(2022)

  7. [7]

    2008.Principles of model checking

    Christel Baier and Joost-Pieter Katoen. 2008.Principles of model checking. MIT press

  8. [8]

    Kristian Gonzalez Barman, Nathan Wood, and Pawel Pawlowski. 2024. Beyond transparency and explainability: on the need for adequate and contextualized user guidelines for LLM use.Ethics and Information Technology26, 3 (2024), 47

  9. [9]

    Charles Bellemare, Luc Bissonnette, and Sabine Kröger. 2014. Statistical power of within and between-subjects designs in economic experiments. (2014)

  10. [10]

    Stefano Bistarelli, Ugo Montanari, and Francesca Rossi. 1997. Semiring-based constraint satisfaction and optimization.Journal of the ACM (JACM) 44, 2 (1997), 201–236

  11. [11]

    Ali Borji. 2023. A categorical archive of chatgpt failures.arXiv preprint arXiv:2302.03494(2023)

  12. [12]

    Craig Boutilier, Ronen I Brafman, Carmel Domshlak, Holger H Hoos, and David Poole. 2004. CP-nets: A tool for representing and reasoning withconditional ceteris paribus preference statements.Journal of artificial intelligence research21 (2004), 135–191

  13. [13]

    John Brooke et al. 1996. SUS-A quick and dirty usability scale.Usability evaluation in industry189, 194 (1996), 4–7. 28

  14. [14]

    Cristian Cadar and Koushik Sen. 2013. Symbolic execution for software testing: three decades later.Commun. ACM56, 2 (2013), 82–90

  15. [15]

    Marco Cascella, Jonathan Montomoli, Valentina Bellini, and Elena Bignami. 2023. Evaluating the feasibility of ChatGPT in healthcare: an analysis of multiple clinical and research scenarios.Journal of medical systems47, 1 (2023), 33

  16. [16]

    Inyoung Cheong, King Xia, KJ Kevin Feng, Quan Ze Chen, and Amy X Zhang. 2024. (A) I am not A lawyer, but...: engaging legal experts towards responsible LLM policies for legal advice. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency. 2454–2469

  17. [17]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality. https://lmsys.org/blog/2023- 03-30-vicuna/

  18. [18]

    Edmund M Clarke, William Klieber, Miloš Nováček, and Paolo Zuliani. 2011. Model checking and the state explosion problem. InLASER Summer School on Software Engineering. Springer, 1–30

  19. [19]

    Victoria Clarke and Virginia Braun. 2014. Thematic analysis. InEncyclopedia of critical psychology. Springer, 1947–1952

  20. [20]

    cRick. [n. d.]. NL-to-LTL-Synthetic-Dataset. Hugging Face. https://huggingface.co/datasets/cRick/NL-to-LTL-Synthetic-Dataset Accessed: 2025-04-03

  21. [21]

    Wes Davis. 2023. A lawyer used ChatGPT and now has to answer for its ‘bogus’ citations.The Verge(2023)

  22. [22]

    Emanuele De Pellegrin and Ronald P. A. Petrick. 2024. Planning Domain Simulation: An Interactive System for Plan Visualisation.Proceedings of the International Conference on Automated Planning and Scheduling34, 1 (May 2024), 133–141. https://doi.org/10.1609/icaps.v34i1.31469

  23. [23]

    2003.Constraint processing

    Rina Dechter. 2003.Constraint processing. Elsevier

  24. [24]

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms.Advances in neural information processing systems36 (2023), 10088–10115

  25. [25]

    The fridge door is open

    Clare Dixon, Matt Webster, Joe Saunders, Michael Fisher, and Kerstin Dautenhahn. 2014. “The fridge door is open”–Temporal Verification of a Robotic Assistant’s Behaviours. InAdvances in Autonomous Robotics Systems: 15th Annual Conference, TAROS 2014, Birmingham, UK, September 1-3,

  26. [26]

    Springer, 97–108

    Proceedings 15. Springer, 97–108

  27. [27]

    Minh Do and Subbarao Kambhampati. 2003. Sapa: A multi-objective metric temporal planner.Journal of Artificial Intelligence Research20 (2003), 155–194

  28. [28]

    Fernando Fradique Duarte, Nuno Lau, Artur Pereira, and Luis Paulo Reis. 2020. A survey of planning and learning in games.Applied Sciences10, 13 (2020), 4529

  29. [29]

    Cody Dunne, Nathalie Henry Riche, Bongshin Lee, Ronald Metoyer, and George Robertson. 2012. GraphTrail: Analyzing large multivariate, heterogeneous networks while supporting exploration history. InProceedings of the SIGCHI conference on human factors in computing systems. 1663–1672

  30. [30]

    Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models.Transactions of the Association for Computational Linguistics9 (2021), 1012–1031

  31. [31]

    Maria Fox and Derek Long. 2003. PDDL2. 1: An extension to PDDL for expressing temporal planning domains.Journal of artificial intelligence research20 (2003), 61–124

  32. [32]

    Caelan Reed Garrett, Rohan Chitnis, Rachel Holladay, Beomjoon Kim, Tom Silver, Leslie Pack Kaelbling, and Tomás Lozano-Pérez. 2021. Integrated task and motion planning.Annual review of control, robotics, and autonomous systems4, 1 (2021), 265–293

  33. [33]

    2016.Automated Planning and Acting

    Malik Ghallab, Dana Nau, and Paolo Traverso. 2016.Automated Planning and Acting. Cambridge University Press, Cambridge, England

  34. [34]

    2025.Acting, Planning, and Learning

    Malik Ghallab, Dana Nau, and Paolo Traverso. 2025.Acting, Planning, and Learning. Cambridge University Press

  35. [35]

    Lin Guan, Karthik Valmeekam, Sarath Sreedharan, and Subbarao Kambhampati. 2023. Leveraging pre-trained large language models to construct and utilize world models for model-based task planning.Advances in Neural Information Processing Systems36 (2023), 79081–79094

  36. [36]

    Atharva Gundawar, Mudit Verma, Lin Guan, Karthik Valmeekam, Siddhant Bhambri, and Subbarao Kambhampati. 2024. Robust Planning with LLM-Modulo Framework: Case Study in Travel Planning. arXiv:2405.20625 [cs.AI] https://arxiv.org/abs/2405.20625

  37. [37]

    Muhammad Usman Hadi, Rizwan Qureshi, Abbas Shah, Muhammad Irfan, Anas Zafar, Muhammad Bilal Shaikh, Naveed Akhtar, Jia Wu, Seyedali Mirjalili, et al. 2023. A survey on large language models: Applications, challenges, limitations, and practical usage.Authorea Preprints3 (2023)

  38. [38]

    Yohei Hayamizu, Saeid Amiri, Kishan Chandan, Keiki Takadama, and Shiqi Zhang. 2021. Guiding robot exploration in reinforcement learning via automated planning. InProceedings of the International Conference on Automated Planning and Scheduling, Vol. 31. 625–633

  39. [39]

    Gaole He, Gianluca Demartini, and Ujwal Gadiraju. 2025. Plan-Then-Execute: An Empirical Study of User Trust and Team Performance When Using LLM Agents As A Daily Assistant.arXiv preprint arXiv:2502.01390(2025)

  40. [40]

    Dan Hendrycks, Nicholas Carlini, John Schulman, and Jacob Steinhardt. 2021. Unsolved problems in ml safety.arXiv preprint arXiv:2109.13916 (2021)

  41. [41]

    Christian Hensel, Sebastian Junges, Joost-Pieter Katoen, Tim Quatmann, and Matthias Volk. 2022. The probabilistic model checker Storm. International Journal on Software Tools for Technology Transfer(2022), 1–22

  42. [42]

    Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. 2024. An empirical study of llm-as-a-judge for llm evaluation: Fine-tuned judge models are task-specific classifiers.arXiv preprint arXiv:2403.02839(2024)

  43. [43]

    Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2025. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.ACM Transactions on Information Systems43, 2 (2025), 1–55. U-Define: Designing User Workflows for Hard an...

  44. [44]

    Dominik Hurnaus and Herbert Prähofer. 2010. Programming assistance based on contracts and modular verification in the automation domain. In Proceedings of the 2010 ACM Symposium on Applied Computing. 2544–2551

  45. [45]

    Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. 2023. Survey of hallucination in natural language generation.Comput. Surveys55, 12 (2023), 1–38

  46. [46]

    Junfeng Jiao, Saleh Afroogh, Yiming Xu, and Connor Phillips. 2024. Navigating llm ethics: Advancements, challenges, and future directions.arXiv preprint arXiv:2406.18841(2024)

  47. [47]

    Samia Kabir, David N Udo-Imeh, Bonan Kou, and Tianyi Zhang. 2023. Who answers it better? an in-depth analysis of chatgpt and stack overflow answers to software engineering questions.CoRR(2023)

  48. [48]

    Jean Kaddour, Joshua Harris, Maximilian Mozes, Herbie Bradley, Roberta Raileanu, and Robert McHardy. 2023. Challenges and applications of large language models.arXiv preprint arXiv:2307.10169(2023)

  49. [49]

    Subbarao Kambhampati, Karthik Valmeekam, Lin Guan, Kaya Stechly, Mudit Verma, Siddhant Bhambri, Lucas Saldyt, and Anil Murthy. 2024. LLMs Can’t Plan, But Can Help Planning in LLM-Modulo Frameworks.arXiv preprint arXiv:2402.01817(2024)

  50. [50]

    Zahir Kanjee, Byron Crowe, and Adam Rodman. 2023. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. Jama330, 1 (2023), 78–80

  51. [51]

    K Kapellos, A Micheli, and A Valentini. [n. d.]. AIPlan4EU: Planning and Scheduling for Space Applications. ([n. d.])

  52. [52]

    Sunnie SY Kim, Jennifer Wortman Vaughan, Q Vera Liao, Tania Lombrozo, and Olga Russakovsky. 2025. Fostering Appropriate Reliance on Large Language Models: The Role of Explanations, Sources, and Inconsistencies.arXiv preprint arXiv:2502.08554(2025)

  53. [53]

    Yoonsu Kim, Jueon Lee, Seoyoung Kim, Jaehyuk Park, and Juho Kim. 2024. Understanding users’ dissatisfaction with chatgpt responses: Types, resolving tactics, and the effect of knowledge level. InProceedings of the 29th International Conference on Intelligent User Interfaces. 385–404

  54. [54]

    René F Kizilcec. 2016. How much information? Effects of transparency on trust in an algorithmic interface. InProceedings of the 2016 CHI conference on human factors in computing systems. 2390–2395

  55. [55]

    Ryan Koo, Minhwa Lee, Vipul Raheja, Jong Inn Park, Zae Myung Kim, and Dongyeop Kang. 2023. Benchmarking cognitive biases in large language models as evaluators.arXiv preprint arXiv:2309.17012(2023)

  56. [56]

    Hadas Kress-Gazit, Morteza Lahijanian, and Vasumathi Raman. 2018. Synthesis for robots: Guarantees and feedback for robot behavior.Annual Review of Control, Robotics, and Autonomous Systems1 (2018), 211–236

  57. [57]

    Caitlin Kuhlman, Diana Doherty, Malika Nurbekova, Goutham Deva, Zarni Phyo, Paul-Henry Schoenhagen, MaryAnn VanValkenburg, Elke Rundensteiner, and Lane Harrison. 2019. Evaluating preference collection methods for interactive ranking analytics. InProceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–11

  58. [58]

    Lorenz Kuhn, Yarin Gal, and Sebastian Farquhar. 2023. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation.arXiv preprint arXiv:2302.09664(2023)

  59. [59]

    Kwiatkowska, G

    M. Kwiatkowska, G. Norman, and D. Parker. 2011. PRISM 4.0: Verification of Probabilistic Real-time Systems. InProc. 23rd International Conference on Computer Aided Verification (CA V’11) (LNCS, Vol. 6806), G. Gopalakrishnan and S. Qadeer (Eds.). Springer, 585–591

  60. [60]

    Marta Kwiatkowska, Gethin Norman, and David Parker. 2011. PRISM 4.0: Verification of probabilistic real-time systems. InComputer Aided Verification: 23rd International Conference, CA V 2011, Snowbird, UT, USA, July 14-20, 2011. Proceedings 23. Springer, 585–591

  61. [61]

    Philippe Laban, Jesse Vig, Marti Hearst, Caiming Xiong, and Chien-Sheng Wu. 2024. Beyond the chat: Executable and verifiable text-editing with llms. InProceedings of the 37th Annual ACM Symposium on User Interface Software and Technology. 1–23

  62. [62]

    Claire Le Goues, Michael Dewey-Vogt, Stephanie Forrest, and Westley Weimer. 2012. A systematic study of automated program repair: Fixing 55 out of 105 bugs for $8 each. In2012 34th international conference on software engineering (ICSE). IEEE, 3–13

  63. [63]

    Claire Le Goues, Michael Pradel, and Abhik Roychoudhury. 2019. Automated program repair.Commun. ACM62, 12 (2019), 56–65

  64. [64]

    Christine Lee, Jihye Choi, and Bilge Mutlu. 2025. MAP: Multi-user Personalization with Collaborative LLM-powered Agents.arXiv preprint arXiv:2503.12757(2025)

  65. [65]

    Christine Lee, David Porfirio, Xinyu Jessica Wang, Kevin Zhao, and Bilge Mutlu. 2025. VeriPlan: Integrating Formal Verification and LLMs into End-User Planning.arXiv preprint arXiv:2502.17898(2025)

  66. [66]

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. 2020. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems33 (2020), 9459–9474

  67. [67]

    Rhydian Lewis. 2008. A survey of metaheuristic-based techniques for university timetabling problems.OR spectrum30, 1 (2008), 167–190

  68. [68]

    Zhenwen Liang, Ye Liu, Tong Niu, Xiangliang Zhang, Yingbo Zhou, and Semih Yavuz. 2024. Improving llm reasoning through scaling inference computation with collaborative verification.arXiv preprint arXiv:2410.05318(2024)

  69. [69]

    Q Vera Liao and Jennifer Wortman Vaughan. 2023. Ai transparency in the age of llms: A human-centered research roadmap.arXiv preprint arXiv:2306.01941(2023), 5368–5393

  70. [70]

    Stephanie Lin, Jacob Hilton, and Owain Evans. 2021. Truthfulqa: Measuring how models mimic human falsehoods.arXiv preprint arXiv:2109.07958 (2021)

  71. [71]

    Jason Xinyu Liu, Ziyi Yang, Ifrah Idrees, Sam Liang, Benjamin Schornstein, Stefanie Tellex, and Ankit Shah. 2023. Grounding complex natural language commands for temporal tasks in unseen environments. InConference on Robot Learning. PMLR, 1084–1110. 30

  72. [72]

    we need structured output

    Michael Xieyang Liu, Frederick Liu, Alexander J Fiannaca, Terry Koo, Lucas Dixon, Michael Terry, and Carrie J Cai. 2024. " we need structured output": Towards user-centered constraints on large language model output. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–9

  73. [73]

    Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, and Jianfeng Gao. 2023. Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 43...

  74. [74]

    Arnold Lund. 2001. Measuring Usability with the USE Questionnaire.Usability and User Experience Newsletter of the STC Usability SIG8 (01 2001)

  75. [75]

    Xiao Ma, Swaroop Mishra, Ariel Liu, Sophie Ying Su, Jilin Chen, Chinmay Kulkarni, Heng-Tze Cheng, Quoc Le, and Ed Chi. 2024. Beyond chatbots: Explorellm for structured thoughts and personalized model responses. InExtended Abstracts of the CHI Conference on Human Factors in Computing Systems. 1–12

  76. [76]

    Matias Martinez and Martin Monperrus. 2016. Astor: A program repair library for java. InProceedings of the 25th international symposium on software testing and analysis. 441–444

  77. [77]

    Nora McDonald, Sarita Schoenebeck, and Andrea Forte. 2019. Reliability and Inter-rater Reliability in Qualitative Research: Norms and Guidelines for CSCW and HCI Practice.Proceedings of the ACM on Human-Computer Interaction3 (11 2019), 1–23. https://doi.org/10.1145/3359174

  78. [78]

    Meta Platforms, Inc. 2013. React: A JavaScript library for building user interfaces. https://reactjs.org/. Accessed: 2025-04-08

  79. [79]

    Pinedo Michael. 1995. Scheduling. Theory, Algorithms and systems.ISBN0-13-706757-7(1995)

  80. [80]

    Mahdi Mostajabdaveh, Timothy T Yu, Rindranirina Ramamonjison, Giuseppe Carenini, Zirui Zhou, and Yong Zhang. 2024. Optimization modeling and verification from problem specifications using a multi-agent multi-stage LLM framework.INFOR: Information Systems and Operational Research (2024), 1–19

Showing first 80 references.