arxiv: 2605.19940 · v1 · pith:CXDYJXCCnew · submitted 2026-05-19 · 💻 cs.AI · cs.RO

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

Pith reviewed 2026-05-20 05:40 UTC · model grok-4.3

classification 💻 cs.AI cs.RO
keywords foundation models · guardrails · interaction trajectories · robotics control · behavioral constraints · Grounded Observer · social AI safety · runtime enforcement

The pith

Guardrails for foundation models in social domains can be treated as runtime control of entire interaction trajectories using robotics-inspired formal constraints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing guardrail methods for foundation models fall short because they judge single outputs and offer only empirical risk reduction rather than enforceable guarantees over time. It draws on robotics to treat safety as the control of closed-loop interaction trajectories in uncertain environments. The authors introduce the Grounded Observer framework to enforce constraints at runtime and illustrate it with three deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. A sympathetic reader would care because cumulative failures in education, mental health, and caregiving can cause real harm that per-output moderation does not reliably prevent.
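To make the contrast between per-output moderation and trajectory-level control concrete, here is a minimal sketch. The per-turn risk scores, the 0.5 threshold, and the cumulative budget of 1.0 are hypothetical values chosen for illustration; none of them come from the paper.

```python
from typing import List

# Hypothetical values for illustration only; not taken from the paper.
PER_TURN_THRESHOLD = 0.5   # assumed per-output moderation cutoff
TRAJECTORY_BUDGET = 1.0    # assumed cap on cumulative risk over a session


def per_output_ok(scores: List[float]) -> bool:
    """Per-output moderation: every turn is judged in isolation."""
    return all(s < PER_TURN_THRESHOLD for s in scores)


def first_budget_breach(scores: List[float]) -> int:
    """Trajectory check: index of the first turn at which cumulative risk
    exceeds the budget, or -1 if the trajectory stays within it."""
    total = 0.0
    for t, s in enumerate(scores):
        total += s
        if total > TRAJECTORY_BUDGET:
            return t
    return -1


scores = [0.3, 0.3, 0.3, 0.3]          # each turn looks harmless on its own
print(per_output_ok(scores))           # per-output view of the session
print(first_budget_breach(scores))     # trajectory view of the same session
```

In this toy session every turn passes the per-output check, yet the cumulative check flags the fourth turn. That is the kind of gradual drift the paper says output-level guardrails miss.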

Core claim

Guardrails should be reframed as a problem of runtime behavioral control over interaction trajectories in uncertain closed-loop systems. Formal constructs for constraint enforcement are drawn from robotics and instantiated in the Grounded Observer framework. When applied across small talk, in-home autism therapy, and behavioral de-escalation in schools, the framework enables interventions that mitigate drift into undesirable regimes while adapting to different social contexts.

What carries the argument

The Grounded Observer framework, which adapts formal constructs from robotics control to enforce constraints on interaction trajectories at runtime.
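The Figure 2 caption describes the observer as sampling candidate actions, scoring them with feature extractors and overlays that carry rigidity parameters, and building an admissible action set. The sketch below is one possible reading of that description, not the authors' implementation. The `Overlay` class, the toy feature extractor, and the rule that a rigidity of 1.0 means a hard constraint are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, float]


@dataclass
class Overlay:
    """A constraint on one extracted feature (all names hypothetical)."""
    feature: str      # key produced by the feature extractor
    limit: float      # feature values above this count as a violation
    rigidity: float   # assumed: 1.0 = hard constraint, below 1.0 = soft weight


def extract_features(text: str) -> Features:
    """Toy extractor: word count and number of question marks."""
    return {"length": float(len(text.split())),
            "questions": float(text.count("?"))}


def admissible_set(candidates: List[str], overlays: List[Overlay],
                   extractor: Callable[[str], Features] = extract_features
                   ) -> List[str]:
    """Drop candidates that violate a hard overlay, then rank the rest by
    the rigidity-weighted excess over the soft limits (lowest first)."""
    scored = []
    for cand in candidates:
        feats = extractor(cand)
        penalty, rejected = 0.0, False
        for ov in overlays:
            excess = max(0.0, feats[ov.feature] - ov.limit)
            if excess > 0 and ov.rigidity >= 1.0:
                rejected = True
                break
            penalty += ov.rigidity * excess
        if not rejected:
            scored.append((penalty, cand))
    scored.sort(key=lambda pair: pair[0])
    return [cand for _, cand in scored]


overlays = [Overlay("length", limit=10, rigidity=1.0),     # brevity: hard
            Overlay("questions", limit=1, rigidity=0.5)]   # questions: soft
candidates = [
    "Nice weather today, is it not? Any plans? Going out?",
    "Nice weather today. Any plans for the weekend?",
    "Let me explain in great detail every aspect of the weather forecast for this week.",
]
print(admissible_set(candidates, overlays))
```

Here the long third candidate is rejected by the hard brevity overlay, and the first candidate is kept but ranked below the second because it exceeds the soft question limit. Separating hard from soft overlays is a design choice of this sketch; the paper may combine rigidity parameters differently.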

If this is right

  • Runtime interventions become possible that act on the full sequence of exchanges rather than isolated outputs.
  • The same framework can be deployed across varied social settings such as therapy sessions and classroom de-escalation.
  • Extensions of the framework can be explored to achieve stronger behavioral guarantees over longer interactions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar trajectory-control ideas might transfer to other closed-loop AI systems that operate over extended user sessions.
  • Hybrid systems combining foundation models with physical robots could use the same constraint machinery for coordinated safety.
  • Empirical testing in additional domains would help identify where the robotics transfer succeeds or requires adjustment.

Load-bearing premise

Formal constructs for constraint enforcement taken from robotics can be transferred to foundation model interactions to deliver enforceable behavioral guarantees in uncertain social contexts.

What would settle it

A real-world deployment of the Grounded Observer framework in which interaction trajectories still drift into undesirable regimes despite the applied robotics-inspired constraints.

Figures

Figures reproduced from arXiv: 2605.19940 by Brian Scassellati, Drazen Brscic, Rebecca Ramnauth.

Figure 1: Guardrails as Constraint Enforcement Over Interaction Trajectories. A deployed foundation model induces a trajectory $\tau = (s_0, a_0, s_1, a_1, \ldots)$ through state space $S$. A safe set $S_{\text{safe}} \subseteq S$ defines acceptable behavioral states. At each timestep, the model proposes actions according to policy $\pi_\theta(a_t \mid s_t)$, but a “guardrail” restricts execution to the admissible action set $A_{\text{safe}}(s_t)$, ensuring that transition…
Figure 2: The Grounded Observer Framework. An unconstrained base policy and a runtime observer enforces behavioral constraints over interaction trajectories. Given the current interaction state, candidate actions are sampled from a base model. The observer evaluates this sample using feature extractors and overlays with associated rigidity parameters, thereby constructing an admissible action set. Admissible actions…
Figure 3: Observer-Enabled Small Talk. The observer enforces conversational boundaries (e.g., brevity, tone, specificity, and thematic coherence). Through continuous filtering and feedback, small talk emerges as a stable interaction trajectory shaped by the intersection of multiple soft constraints (Ramnauth, Brščić, et al. 2024). […] overly detailed answers or slight conversational imbalance) were more likely to be int…
Figure 4: Observer-Enabled Autism Training. Example interactions from the in-home deployment. State-based classifiers govern when training is socially appropriate, while a grounded observer enforces hierarchical conversational constraints during generation. A separate module provides pedagogical feedback to support user learning. See Ramnauth, Brščić, et al. 2025a for further details on the software stack and custom…
Figure 5: Observer-Mediated Robot-Assisted Activities in a School-Based De-Escalation Setting. The robot engages students in structured routines (e.g., small talk, guided breathing, and collaborative tasks), each governed by an activity-specific observer and overlay set. A higher-level supervisory observer regulates when to initiate, terminate, or switch activities based on the student’s interaction state, enabling…
Original abstract

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript reframes guardrails for foundation models in socially sensitive domains (education, mental health, caregiving) as a runtime behavioral control problem over interaction trajectories. Drawing on robotics, it introduces formal constructs for constraint enforcement in uncertain closed-loop systems, instantiates them in the Grounded Observer framework, and applies the framework to three real-world deployments (small talk, in-home autism therapy, behavioral de-escalation in schools), claiming that the approach enables runtime interventions that mitigate drift into undesirable regimes while adapting to context.

Significance. If the robotics-derived formal constructs can be shown to deliver enforceable behavioral guarantees when transferred to foundation-model interaction trajectories, the work would offer a substantive shift from empirical risk reduction to control-theoretic safety mechanisms. This could strengthen reliability in high-stakes social deployments and open new research directions on closed-loop constraint enforcement for language models. The three deployments provide an initial test bed, but the significance is currently limited by the absence of supporting derivations or quantitative validation.

major comments (3)
  1. [Abstract] Abstract (reframing paragraph): the claim that robotics formal constructs (constraint sets, barrier functions, closed-loop enforcement) can be transferred to foundation-model interactions rests on an unshown assumption of observability and actuation; the manuscript supplies no state-estimation mechanism, invariance proof, or definition of how partial text observations and next-token actuation would realize these constructs.
  2. [Grounded Observer framework] Grounded Observer framework instantiation: the central assertion that the framework mitigates drift across the three deployments is load-bearing yet unsupported by any equations, quantitative results, error analysis, or before/after metrics demonstrating that the claimed runtime interventions actually enforce constraints or adapt to context.
  3. [Deployments] Deployments section: without explicit mapping of robotics-style closed-loop control to the open-loop disturbances introduced by human responses and the discrete, stochastic nature of token sampling, it remains unclear whether the framework can deliver the enforceable guarantees asserted in the abstract.
minor comments (1)
  1. [Abstract] The abstract lists three deployments but reports no outcome metrics or qualitative observations; adding a brief summary table of observed drift mitigation would improve readability.
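For readers unfamiliar with the constructs named in the first major comment, a standard discrete-time barrier formulation from the control literature is sketched below. It is not taken from the manuscript. The symbols $h$, $f$, and $\alpha$ are assumptions introduced here, and $f$ stands for a transition map that a text-based setting would still have to define. The sets $S_{\text{safe}}$ and $A_{\text{safe}}(s_t)$ follow the notation of the Figure 1 caption.

```latex
\begin{align}
  S_{\text{safe}} &= \{\, s \in S : h(s) \ge 0 \,\}, \\
  A_{\text{safe}}(s_t) &= \{\, a : h\bigl(f(s_t, a)\bigr) \ge (1 - \alpha)\, h(s_t) \,\},
  \qquad 0 < \alpha \le 1 .
\end{align}
```

If $s_0 \in S_{\text{safe}}$ and every executed action lies in $A_{\text{safe}}(s_t)$, then $h(s_{t+1}) \ge (1 - \alpha)\, h(s_t) \ge 0$ by induction, so the trajectory never leaves $S_{\text{safe}}$. The argument presupposes that $s_t$ can be observed and that the next state is determined by $f$, which is exactly where partial text observations and human responses complicate the transfer.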

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We appreciate the emphasis on the need for clearer mappings and supporting evidence when transferring robotics concepts to foundation model trajectories. Below we respond point by point to the major comments, clarifying the current scope of the work as a conceptual framework with illustrative case studies while indicating revisions where appropriate.

  1. Referee: [Abstract] Abstract (reframing paragraph): the claim that robotics formal constructs (constraint sets, barrier functions, closed-loop enforcement) can be transferred to foundation-model interactions rests on an unshown assumption of observability and actuation; the manuscript supplies no state-estimation mechanism, invariance proof, or definition of how partial text observations and next-token actuation would realize these constructs.

    Authors: We agree that the abstract states the transfer at a high level without spelling out the realization details. The full manuscript defines the interaction state as the grounded trajectory (dialogue history plus contextual variables extracted from the domain), treats partial text outputs as observations, and realizes actuation via runtime filtering or guidance of the token distribution to keep the trajectory inside the constraint set. However, we do not supply a formal state estimator or invariance proof; the contribution is the reframing and framework rather than a complete control-theoretic derivation. We will revise the abstract to qualify the claim and add an explicit subsection mapping observability and actuation to the text-based setting. revision: partial

  2. Referee: [Grounded Observer framework] Grounded Observer framework instantiation: the central assertion that the framework mitigates drift across the three deployments is load-bearing yet unsupported by any equations, quantitative results, error analysis, or before/after metrics demonstrating that the claimed runtime interventions actually enforce constraints or adapt to context.

    Authors: The deployments are presented as qualitative illustrations of how the framework can be instantiated in sensitive real-world contexts rather than as quantitative validation experiments. Ethical and privacy constraints in autism therapy and school de-escalation settings precluded collection of before/after metrics or error analysis in this initial report. The mitigation of drift is described through the specific intervention points applied in each case study. We accept that stronger evidence is needed for the load-bearing claim and will expand the framework section with additional equations describing the observer update and constraint enforcement steps, together with any available qualitative logs of interventions. revision: partial

  3. Referee: [Deployments] Deployments section: without explicit mapping of robotics-style closed-loop control to the open-loop disturbances introduced by human responses and the discrete, stochastic nature of token sampling, it remains unclear whether the framework can deliver the enforceable guarantees asserted in the abstract.

    Authors: We acknowledge that the current text does not provide a detailed mapping of how human responses (treated as exogenous disturbances) and stochastic token sampling are accommodated within the closed-loop formulation. In the manuscript the observer monitors the evolving trajectory and issues corrective actions at each turn to steer the system back toward the safe set despite these disturbances. We will add an explicit mapping table in the deployments section that relates each robotics construct (state, observation, actuation, barrier) to its counterpart in the foundation-model interaction loop, including how discrete stochasticity is handled through set-valued constraints over token sequences. revision: yes
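One way a set-valued constraint over token sequences could absorb stochastic sampling is rejection with a fallback: resample until a candidate satisfies every constraint, and otherwise emit a pre-approved reply. The sketch below illustrates that idea only. The function names, the two constraints, the candidate pool, and the fallback are hypothetical, and the paper's observer is not claimed to work this way.

```python
import random
from typing import Callable, List, Sequence

Tokens = Sequence[str]
Constraint = Callable[[Tokens], bool]


def enforce(sample: Callable[[], Tokens], constraints: List[Constraint],
            fallback: Tokens, max_tries: int = 5) -> Tokens:
    """Resample from a stochastic base policy until a candidate lies in the
    admissible set (satisfies every constraint); otherwise use the fallback.
    The fallback is assumed to be admissible by construction."""
    for _ in range(max_tries):
        candidate = sample()
        if all(check(candidate) for check in constraints):
            return candidate
    return fallback


# Hypothetical constraints defined over whole token sequences.
def is_brief(tokens: Tokens) -> bool:
    return len(tokens) <= 6


def avoids_banned(tokens: Tokens) -> bool:
    return "diagnosis" not in tokens


rng = random.Random(0)  # seeded stand-in for stochastic token sampling
POOL = [
    "that sounds hard , want to pause ?".split(),   # too long
    "here is my diagnosis of you".split(),          # banned term
    "let us take a breath".split(),                 # admissible
]
FALLBACK = "let us take a break".split()

history: List[Tokens] = []
for human_turn in ["i am upset", "leave me alone"]:   # exogenous disturbance
    history.append(human_turn.split())
    # A real base policy would condition on history; this toy one does not.
    reply = enforce(lambda: rng.choice(POOL), [is_brief, avoids_banned], FALLBACK)
    history.append(reply)

print([" ".join(turn) for turn in history])
```

Whatever the human says and whatever the sampler draws, every executed reply satisfies both constraints, because inadmissible samples are never emitted. The guarantee is only as strong as the constraint checks and the assumed admissibility of the fallback.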

Circularity Check

0 steps flagged

No significant circularity detected

The paper reframes guardrails for foundation models as runtime behavioral control over interaction trajectories by drawing on robotics concepts for constraint enforcement in uncertain closed-loop systems, then instantiates the ideas in a Grounded Observer framework applied to three deployments. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the abstract or described structure. The central proposal is a conceptual transfer and instantiation rather than a derivation that reduces to its own inputs by construction. The chain remains self-contained as an independent proposal without renaming known results or smuggling ansatzes via citation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the transfer of robotics control ideas to LLM interactions plus the introduction of the Grounded Observer as a new construct; no numerical free parameters are defined and no additional invented physical entities are postulated.

axioms (1)
  • [domain assumption] Formal constructs for constraint enforcement in uncertain closed-loop systems from robotics can be transferred to foundation model interaction trajectories to provide enforceable guarantees.
    Core premise invoked when the abstract reframes guardrails as runtime behavioral control.
invented entities (1)
  • Grounded Observer framework (no independent evidence)
    purpose: To monitor interactions and enable runtime interventions that mitigate drift into undesirable regimes while adapting to social contexts.
    New framework introduced to instantiate the robotics-inspired constructs; no independent falsifiable evidence outside the paper is supplied in the abstract.

pith-pipeline@v0.9.0 · 5689 in / 1390 out tokens · 49409 ms · 2026-05-20T05:40:47.219159+00:00 · methodology

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 18 internal anchors

  1. [1]

    Control barrier functions: Theory and applications

    “Control barrier functions: Theory and applications.” In: 2019 18th European Control Conference (ECC). IEEE, 3420–3431. D. Amodei, C. Olah, J. Steinhardt, P. Christiano, J. Schulman, and D. Mané

  2. [2]

    Concrete Problems in AI Safety

    “Concrete problems in AI safety.” arXiv preprint arXiv:1606.06565. P. Anantaprayoon, N. Babina, N. Asgharbeygi, and J. Tarifi

  3. [3]

    Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs

    “Learning to Negotiate: Multi-Agent Deliberation for Collective Value Alignment in LLMs.” arXiv preprint arXiv:2603.10476. Y. Bai et al.

  4. [4]

    Constitutional AI: Harmlessness from AI Feedback

    “Constitutional AI: Harmlessness from AI feedback.” arXiv preprint arXiv:2212.08073. E. M. Bender, T. Gebru, A. McMillan-Major, and S. Shmitchell

  5. [5]

    On the dangers of stochastic parrots: Can language models be too big?

    “On the dangers of stochastic parrots: Can language models be too big?” In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–623. D. Bertsekas. 2012. Dynamic programming and optimal control: Volume I. Vol

  6. [6]

    Action priors for large action spaces in robotics

    “Action priors for large action spaces in robotics.” arXiv preprint arXiv:2101.04178. F. Blanchini

  7. [7]

    Set invariance in control

    “Set invariance in control.” Automatica, 35, 11, 1747–1767. [Preprint. Under review at JAIR.] S. Brown. 2014. Autism spectrum disorder and de-escalation strategies: A practical guide to positive behavioural interventions for children and young people. Jessica Kingsley Publish...

  8. [8]

    Safe learning in robotics: From learning-based control to safe reinforcement learning

    “Safe learning in robotics: From learning-based control to safe reinforcement learning.” Annual Review of Control, Robotics, and Autonomous Systems, 5, 1, 411–444. Capgemini. 2023. Why Consumers Love Generative AI: Report from the Capgemini Research Institute. Accessed: 2024-08-08. (2023). https://www.capgemini.com/insights/research-library/creative-and-g...

  9. [9]

    Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks

    “Meta SecAlign: A Secure Foundation LLM Against Prompt Injection Attacks.” arXiv preprint arXiv:2507.02735. P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei

  10. [10]

    Plug and play language models: A simple approach to controlled text generation

    “Plug and play language models: A simple approach to controlled text generation.” arXiv preprint arXiv:1912.02164. C. Dawson, S. Gao, and C. Fan

  11. [11]

    SOTER: a runtime assurance framework for programming safe robotics systems

    “SOTER: a runtime assurance framework for programming safe robotics systems.” In: 2019 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN). IEEE, 138–150. Y. Dong, R. Mu, G. Jin, Y. Qi, J. Hu, X. Zhao, J. Meng, W. Ruan, and X. Huang

  12. [12]

    Building guardrails for large language models

    “Building guardrails for large language models.” arXiv preprint arXiv:2402.01822. Y. Dong, R. Mu, Y. Zhang, et al.

  13. [13]

    Identification of the dynamic parameters of a closed loop robot

    “Identification of the dynamic parameters of a closed loop robot.” In: Proceedings of 1995 IEEE International Conference on Robotics and Automation. Vol

  14. [14]

    RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

    “RealToxicityPrompts: Evaluating neural toxic degeneration in language models.” arXiv preprint arXiv:2009.11462. N. C. Georgiou, R. Ramnauth, E. Adeniran, M. Lee, L. Selin, and B. Scassellati

  15. [15]

    Safety controller synthesis for collaborative robots

    “Safety controller synthesis for collaborative robots.” In: 2020 25th International Conference on Engineering of Complex Computer Systems (ICECCS). IEEE, 83–92. J. Grace

  16. [16]

    Guardrails AI

    Guardrails AI. 2023. Guardrails AI. https://github.com/guardrails-ai/guardrails. Accessed: 2026-02-08. (2023). J. Guiochet, M. Machin, and H. Waeselynck

  17. [17]

    Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search

    “Lexically constrained decoding for sequence generation using grid beam search.” arXiv preprint arXiv:1704.07138. Y. Hu et al.

  18. [18]

    Toward general-purpose robots via foundation models: A survey and meta-analysis

    “Toward general-purpose robots via foundation models: A survey and meta-analysis.” arXiv preprint arXiv:2312.08782. Q. Huang, X. Liu, T. Ko, B. Wu, W. Wang, Y. Zhang, and L. Tang

  19. [19]

    Selective Prompting Tuning for Personalized Conversations with LLMs

    “Selective Prompting Tuning for Personalized Conversations with LLMs.” arXiv preprint arXiv:2406.18187. A. Hurst et al.

  20. [20]

    GPT-4o System Card

    “GPT-4o system card.” arXiv preprint arXiv:2410.21276. H. Inan et al.

  21. [21]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    “Llama Guard: LLM-based input-output safeguard for human-AI conversations.” arXiv preprint arXiv:2312.06674. N. Inkawhich, G. McDonald, and R. Luley

  22. [22]

    Adversarial attacks on foundational vision models

    “Adversarial attacks on foundational vision models.” arXiv preprint arXiv:2308.14597. B. W. Israelsen and N. R. Ahmed

  23. [23]

    AI Alignment: A Comprehensive Survey

    “AI alignment: A comprehensive survey.” arXiv preprint arXiv:2310.19852. W. Khalil and J. Kleinfinger

  24. [24]

    A new geometric notation for open and closed-loop robots

    “A new geometric notation for open and closed-loop robots.” In: Proceedings. 1986 IEEE International Conference on Robotics and Automation. Vol

  25. [25]

    A Review of 40 Years of Cognitive Architecture Research: Core Cognitive Abilities and Practical Applications

    “A review of 40 years of cognitive architecture research: Focus on perception, attention, learning and applications.” arXiv preprint arXiv:1610.08602, 1–74. H. Kress-Gazit, M. Lahijanian, and V. Raman

  26. [26]

    Backdoor threats from compromised foundation models to federated learning

    “Backdoor threats from compromised foundation models to federated learning.” arXiv preprint arXiv:2311.00144. P. Liang et al.

  27. [27]

    Holistic Evaluation of Language Models

    “Holistic evaluation of language models.” arXiv preprint arXiv:2211.09110. A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi

  28. [28]

    DExperts: Decoding-time controlled text generation with experts and anti-experts

    “DExperts: Decoding-time controlled text generation with experts and anti-experts.” arXiv preprint arXiv:2105.03023. M. Liu, Y. Tan, and V. Padois

  29. [29]

    Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates

    “Keeping LLMs aligned after fine-tuning: The crucial role of prompt templates.” arXiv preprint arXiv:2402.18540. P. Manakul, A. Liusie, and M. Gales

  30. [30]

    Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models

    “Selfcheckgpt: Zero-resource black-box hallucination detection for generative large language models.” In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 9004–9017. K. Matheus, R. Ramnauth, B. Scassellati, and N. Salomons

  31. [31]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    “Long-Term Interactions with Social Robots: Trends, Insights, and Recommendations.” ACM Transactions on Human-Robot Interaction, 14, 3, 1–42. M. Mazeika et al. 2024. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. (2024). https://arxiv.org/abs/2402.04249 arXiv: 2402.04249 (cs.LG). B. Meskó

  32. [32]

    Veriguard: Enhancing LLM agent safety via verified code generation

    “Veriguard: Enhancing LLM agent safety via verified code generation.” arXiv preprint arXiv:2510.05156. T. M. Moldovan, P. Abbeel, M. Jordan, and F. Borrelli

  33. [33]

    The ABCs of assured autonomy

    “The ABCs of assured autonomy.” In: 2019 IEEE International Symposium on Technology and Society (ISTAS). IEEE, 1–5. C. Ng

  34. [34]

    Designing for Disagreement: Front-End Guardrails for Assistance Allocation in LLM-Enabled Robots

    “Designing for Disagreement: Front-End Guardrails for Assistance Allocation in LLM-Enabled Robots.” arXiv preprint arXiv:2603.16537. Q. Nguyen and K. Sreenath

  35. [35]

    Building a Domain-specific Guardrail Model in Production

    “Building a Domain-specific Guardrail Model in Production.” arXiv preprint arXiv:2408.01452. S. NTT Disruption Europe. 2020. Jibo | Together for You. Retrieved September 22, 2020 from https://jibo.com/. NVIDIA. 2023. NeMo Guardrails. https://github.com/NVIDIA/NeMo-Guardrails. Accessed: 2026-02-08. (2023). OpenAI. 2022. OpenAI Moderation API. https://openai.c...

  36. [36]

    Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation

    “Fast lexically constrained decoding with dynamic beam allocation for neural machine translation.” arXiv preprint arXiv:1804.06609. S. S. Raman, V. Cohen, E. Rosen, I. Idrees, D. Paulius, and S. Tellex

  37. [37]

    Planning with large language models via corrective re-prompting

    “Planning with large language models via corrective re-prompting.” In: NeurIPS 2022 Foundation Models for Decision Making Workshop. R. Ramnauth

  38. [38]

    Chapter 3: Robots for Autism Therapy

    “Chapter 3: Robots for Autism Therapy.” Ph.D. Dissertation. Yale University. https://rramnauth2220.github.io/ramnauth-20250719-f2.pdf. Systematic review appearing in Building Intelligent Robots for Social Regulation Therapy. R. Ramnauth, D. Brščić, and B. Scassellati. June 2025a. “A Robot-Assisted Approach to Small Talk Training for Adults with ASD.” In...

  39. [39]

    More than Chit-Chat: Developing Robots for Small-Talk Interactions

    “More than Chit-Chat: Developing Robots for Small-Talk Interactions.” arXiv preprint arXiv:2412.18023. R. Ramnauth and B. Scassellati

  40. [40]

    When Robots Should Break the Rules

    “When Robots Should Break the Rules.” In: Proceedings of the 2026 ACM/IEEE International Conference on Human-Robot Interaction (HRI). IEEE Press. S. S. Ravi, B. Mielczarek, A. Kannappan, D. Kiela, and R. Qian

  41. [41]

    Lynx: An open source hallucination evaluation model

    “Lynx: An open source hallucination evaluation model.” arXiv preprint arXiv:2407.08488. Z. Ravichandran, A. Robey, V. Kumar, G. J. Pappas, and H. Hassani

  42. [42]

    Safety Guardrails for LLM-Enabled Robots

    “Safety Guardrails for LLM-Enabled Robots.” arXiv preprint arXiv:2503.07885. T. Rebedea, R. Dinu, M. N. Sreedhar, C. Parisien, and J. Cohen

  43. [43]

    Nemo guardrails: A toolkit for controllable and safe LLM applications with programmable rails

    “Nemo guardrails: A toolkit for controllable and safe LLM applications with programmable rails.” In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 431–445. S. Russell, P. Norvig, and A. Intelligence

  44. [44]

    Proximal Policy Optimization Algorithms

    “Proximal policy optimization algorithms.” arXiv preprint arXiv:1707.06347. M. Schwenzer, M. Ay, T. Bergs, and D. Abel

  45. [45]

    Towards bidirectional human-AI alignment: A systematic review for clarifications, framework, and future directions

    “Towards bidirectional human-AI alignment: A systematic review for clarifications, framework, and future directions.” arXiv preprint arXiv:2406.09264, 2406, 1–56. B. Siciliano, L. Sciavicco, L. Villani, and G. Oriolo. 2009. Robotics: modelling, planning and control. Springer. M. Siegel

  46. [46]

    Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment

    “Beyond Compromise: Pareto-Lenient Consensus for Efficient Multi-Preference LLM Alignment.” arXiv preprint arXiv:2604.05965. Y. Tao, O. Viberg, R. S. Baker, and R. F. Kizilcec

  47. [47]

    Assured autonomy: Path toward living with autonomous systems we can trust

    “Assured autonomy: Path toward living with autonomous systems we can trust.” arXiv preprint arXiv:2010.14443. B. Vidgen, T. Thrush, Z. Talat, and D. Kiela

  48. [48]

    Adapting LLM agents through communication

    “Adapting LLM agents through communication.” arXiv preprint arXiv:2310.01444. J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt

  49. [49]

    A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT

    “A prompt pattern catalog to enhance prompt engineering with ChatGPT.” arXiv preprint arXiv:2302.11382. B. T. Willard and R. Louf

  50. [50]

    Efficient Guided Generation for Large Language Models

    “Efficient guided generation for large language models.” arXiv preprint arXiv:2307.09702. C. Wu, X. Li, and J. Wang

  51. [51]

    Vulnerabilities of foundation model integrated federated learning under adversarial threats

    “Vulnerabilities of foundation model integrated federated learning under adversarial threats.” arXiv preprint arXiv:2401.10375. Z. Xiang et al.

  52. [52]

    Guardagent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning

    “Guardagent: Safeguard LLM agents by a guard agent via knowledge-enabled reasoning.” arXiv preprint arXiv:2406.09187. S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao

  53. [53]

    Jailbreak Attacks and Defenses Against Large Language Models: A Survey

    “Jailbreak attacks and defenses against large language models: A survey.” arXiv preprint arXiv:2407.04295. Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba

  54. [54]

    Large language models are human-level prompt engineers

    “Large language models are human-level prompt engineers.” arXiv preprint arXiv:2211.01910. A. Zou, Z. Wang, N. Carlini, M. Nasr, J. Z. Kolter, and M. Fredrikson

  55. [55]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    “Universal and transferable adversarial attacks on aligned language models.” arXiv preprint arXiv:2307.15043.