Recognition: 2 theorem links · Lean Theorem
U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning
Pith reviewed 2026-05-08 18:13 UTC · model grok-4.3
The pith
User-defined hard and soft constraint types improve usefulness and satisfaction in LLM planning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
U-Define lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility. It verifies these through formal model checking for hard constraints and LLM-as-judge evaluation for soft ones. A technical evaluation and user studies find that this improves perceived usefulness, performance, and satisfaction while maintaining usability.
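The abstraction at the heart of the claim — one user-facing type choice routing to two different verifiers — can be sketched roughly as follows. `Constraint`, the toy verifiers, and the 0.5 acceptance threshold are all illustrative assumptions, not U-Define's actual implementation:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List

class Strictness(Enum):
    HARD = "hard"  # must never be violated; checked formally
    SOFT = "soft"  # preference; judged for degree of satisfaction

@dataclass
class Constraint:
    text: str              # the user's natural-language statement
    strictness: Strictness

def verify(plan: List[str], c: Constraint,
           model_check: Callable[[List[str], str], bool],
           llm_judge: Callable[[List[str], str], float]) -> bool:
    """Dispatch on strictness: hard -> boolean model check,
    soft -> graded judge score compared to a tolerance threshold."""
    if c.strictness is Strictness.HARD:
        return model_check(plan, c.text)
    return llm_judge(plan, c.text) >= 0.5  # illustrative threshold

# Toy stand-ins for the two verifiers (not the paper's implementations).
def toy_model_check(plan, rule):
    return not any("skip lunch" in step for step in plan)

def toy_llm_judge(plan, pref):
    return 0.8 if any("walk" in step for step in plan) else 0.2

plan = ["walk to office", "meeting", "lunch"]
hard = Constraint("never skip lunch", Strictness.HARD)
soft = Constraint("prefer walking over driving", Strictness.SOFT)
print(verify(plan, hard, toy_model_check, toy_llm_judge))  # True
print(verify(plan, soft, toy_model_check, toy_llm_judge))  # True
```

The point of the dispatch is that the user only chooses hard versus soft; the verification machinery behind each type stays hidden from them.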
What carries the argument
The abstraction of constraint strictness into user-chosen hard and soft types with distinct verification methods in U-Define.
Load-bearing premise
That users can reliably categorize their constraints as hard or soft in natural language and that the verification methods correctly reflect real user intent.
What would settle it
An experiment showing that users frequently classify the same constraints inconsistently, or that the verification methods approve plans that violate users' actual intent.
Original abstract
LLMs are increasingly used for end-user task planning, yet their black-box nature limits users' ability to ensure reliability and control. While recent systems incorporate verification techniques, it remains unclear how users can effectively apply such rigid constraints to represent intent or adapt to real-world variability. For example, prior work finds that hard-only constraints are too rigid, and numeric flexibility weights confuse users. We investigate how interaction workflows can better support users in applying constraints to guide LLM-generated plans, examining whether abstracting strictness into high-level types (i.e., hard and soft) paired with distinct verification mechanisms helps users more reliably express and align intent. We present U-Define, a system that lets users define constraints in natural language and categorize them as either hard rules that must not be violated or soft preferences that allow flexibility. U-Define verifies these types through complementary methods: formal model checking for hard constraints and LLM-as-judge evaluation for soft ones. Through a technical evaluation and user studies with general and expert participants, we find that user-defined constraint types improve perceived usefulness, performance, and satisfaction while maintaining usability. These findings provide insights for designing flexible yet reliable constraint-based workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents U-Define, a system enabling users to specify natural language constraints for LLM-based planning and classify them as hard (verified via model checking) or soft (via LLM-as-judge). Based on technical evaluation and user studies with general and expert participants, it concludes that user-defined constraint types enhance perceived usefulness, performance, and satisfaction while maintaining usability.
Significance. This contribution is potentially significant for improving user control and reliability in LLM planning applications. By addressing the limitations of purely hard constraints and confusing numeric weights through high-level typing and dual verification, it offers a practical workflow. The inclusion of both general and expert user studies is a strength, providing insights into real-world applicability. However, the limited detail on empirical methods tempers that significance until more evidence is provided.
major comments (2)
- [User Studies] The abstract states positive outcomes from user studies without reporting sample sizes, statistical tests, exclusion criteria, or raw data. Since these studies are central to validating the improvements in usefulness, performance, and satisfaction, this omission makes it impossible to assess the robustness of the central empirical claim.
- [§3 (System Design)] The assumption that the chosen verification methods accurately reflect user intent is load-bearing; the technical evaluation should provide concrete metrics on how well model checking and LLM-as-judge align with actual user expectations to avoid introducing new errors.
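A concrete form of the alignment evidence this comment asks for is an agreement analysis between verifier verdicts and user judgments on the same plans. A minimal sketch with invented labels, reporting raw agreement plus Cohen's kappa (which corrects raw agreement for chance):

```python
# Hypothetical alignment check: compare a verifier's accept/reject
# verdicts against user judgments on the same plans. Raw agreement can
# be inflated by chance, so Cohen's kappa corrects for it.
def cohens_kappa(a, b):
    assert len(a) == len(b) and a
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n
    # chance agreement from the marginal accept rates
    pa, pb = sum(a) / n, sum(b) / n
    p_chance = pa * pb + (1 - pa) * (1 - pb)
    return (p_obs - p_chance) / (1 - p_chance) if p_chance < 1 else 1.0

verifier = [1, 1, 0, 1, 0, 0, 1, 1]   # 1 = plan accepted by verifier
users    = [1, 1, 0, 0, 0, 0, 1, 1]   # 1 = user says intent was met
obs = sum(x == y for x, y in zip(verifier, users)) / len(users)
print(f"raw agreement = {obs:.2f}, kappa = {cohens_kappa(verifier, users):.2f}")
```

The labels here are made up; the actual metric would be computed over the paper's study data.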
minor comments (1)
- [Abstract] Consider adding a sentence on the scale of the user studies or key quantitative findings to better summarize the evidence.
Simulated Author's Rebuttal
We thank the referee for their constructive review and for acknowledging the potential significance of U-Define in improving user control and reliability in LLM planning. We address each major comment below with point-by-point responses and indicate the revisions made to strengthen the manuscript.
Point-by-point responses
- Referee: [User Studies] The abstract states positive outcomes from user studies without reporting sample sizes, statistical tests, exclusion criteria, or raw data. Since these studies are central to validating the improvements in usefulness, performance, and satisfaction, this omission makes it impossible to assess the robustness of the central empirical claim.
Authors: We agree that the abstract should provide sufficient detail for readers to evaluate the robustness of the user study results. While the full methodology—including sample sizes, statistical tests (such as t-tests and ANOVA), exclusion criteria, and data availability—is reported in Section 5 and the supplementary materials, we have revised the abstract to include a concise summary of these elements. This change improves transparency without altering the reported findings or conclusions. revision: yes
- Referee: [§3 (System Design)] The assumption that the chosen verification methods accurately reflect user intent is load-bearing; the technical evaluation should provide concrete metrics on how well model checking and LLM-as-judge align with actual user expectations to avoid introducing new errors.
Authors: We acknowledge that alignment between verification methods and user intent is a critical assumption requiring explicit support. Section 4 already reports accuracy metrics for model checking on hard constraints and agreement rates for the LLM-as-judge on soft constraints using benchmark and held-out data. To directly address alignment with user expectations, we have added a new analysis in the revised manuscript that correlates verification outcomes with participant judgments from the expert user study, providing additional quantitative evidence on error rates and intent fidelity. revision: partial
Circularity Check
No significant circularity identified
full rationale
The paper describes an empirical system-building effort (U-Define) that lets users categorize constraints as hard or soft in natural language, then evaluates the approach via technical verification methods and user studies with general and expert participants. No equations, parameter-fitting procedures, derivations, or self-referential definitions appear in the abstract or described workflow. The central claims rest on external user-study outcomes rather than any reduction of predictions back to fitted inputs or self-citations, rendering the evaluation self-contained.
Axiom & Free-Parameter Ledger
axioms (3)
- domain assumption: Users can meaningfully label constraints as hard or soft in natural language
- domain assumption: Formal model checking can be applied to LLM-generated plans to verify hard constraints
- domain assumption: An LLM acting as judge can evaluate soft constraints in a way that aligns with user intent
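The second axiom can be illustrated on a finite plan trace with simple temporal predicates — toy analogues of LTL's "always" (G) and "eventually" (F) operators plus an ordering check. This is a sketch of the idea, not the paper's formalism:

```python
# Minimal sketch: checking an LLM-generated plan (a finite trace of
# steps) against temporal-logic-style hard constraints. All names and
# the example plan are illustrative.
def always(trace, pred):      # G pred: holds at every step
    return all(pred(s) for s in trace)

def eventually(trace, pred):  # F pred: holds at some step
    return any(pred(s) for s in trace)

def before(trace, first, then):  # `first` must occur before `then`
    idx = {s: i for i, s in enumerate(trace)}
    return first in idx and then in idx and idx[first] < idx[then]

trace = ["book flight", "reserve hotel", "pack bags", "depart"]
print(always(trace, lambda s: s != "cancel trip"))  # True
print(eventually(trace, lambda s: s == "depart"))   # True
print(before(trace, "book flight", "depart"))       # True
```

A real model checker (the references include PRISM and Storm) works over a state-transition model rather than a single trace, but the axiom only requires that plans be expressible in a form such checks can consume.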
Forward citations
Cited by 1 Pith paper
- LearnMate^2: Design and Evaluation of an LLM-powered Personalized and Adaptive Support System for Online Learning
LearnMate^2, an LLM-driven personalized learning support system, improves learning outcomes and user experience over existing online platforms combined with generic LLM assistance in small-scale user studies.