Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains
Pith reviewed 2026-05-20 05:40 UTC · model grok-4.3
The pith
Guardrails for foundation models in social domains can be treated as runtime control of entire interaction trajectories using robotics-inspired formal constraints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Guardrails should be reframed as a problem of runtime behavioral control over interaction trajectories in uncertain closed-loop systems. Formal constructs for constraint enforcement are drawn from robotics and instantiated in the Grounded Observer framework. When applied across small talk, in-home autism therapy, and behavioral de-escalation in schools, the framework enables interventions that mitigate drift into undesirable regimes while adapting to different social contexts.
What carries the argument
The Grounded Observer framework, which supplies formal constructs for runtime constraint enforcement on interaction trajectories drawn from robotics control methods.
If this is right
- Runtime interventions become possible that act on the full sequence of exchanges rather than isolated outputs.
- The same framework can be deployed across varied social settings such as therapy sessions and classroom de-escalation.
- Extensions of the framework can be explored to achieve stronger behavioral guarantees over longer interactions.
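The difference between checking isolated outputs and checking a whole trajectory can be sketched in a few lines. This is an illustration only, not the Grounded Observer implementation: the class name `TrajectoryMonitor`, the per-turn risk scores, and every threshold below are hypothetical. The monitor keeps a decayed running total, so a sequence of turns that each pass the per-output check can still trigger an intervention once their accumulated effect crosses a limit.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TrajectoryMonitor:
    """Tracks a running risk score over a whole interaction, not one output.

    All names and thresholds are hypothetical, chosen only for illustration.
    """
    per_turn_limit: float = 0.8    # a single turn above this is blocked
    cumulative_limit: float = 1.5  # decayed running total above this triggers an intervention
    decay: float = 0.7             # how quickly earlier turns are discounted
    running: float = 0.0
    history: List[float] = field(default_factory=list)

    def observe(self, turn_risk: float) -> str:
        """Fold one turn into the running score and return an action."""
        self.history.append(turn_risk)
        self.running = self.decay * self.running + turn_risk
        if turn_risk > self.per_turn_limit:
            return "block"      # per-output check: this turn alone is over the limit
        if self.running > self.cumulative_limit:
            return "intervene"  # trajectory check: accumulated drift is over the limit
        return "allow"


monitor = TrajectoryMonitor()
# No single turn exceeds 0.8, yet the fourth pushes the running total past 1.5.
actions = [monitor.observe(risk) for risk in [0.6, 0.6, 0.6, 0.6]]
print(actions)
```

A per-output filter with the same 0.8 limit would allow all four turns; only the trajectory-level score flags the fourth.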
Where Pith is reading between the lines
- Similar trajectory-control ideas might transfer to other closed-loop AI systems that operate over extended user sessions.
- Hybrid systems combining foundation models with physical robots could use the same constraint machinery for coordinated safety.
- Empirical testing in additional domains would help identify where the robotics transfer succeeds or requires adjustment.
Load-bearing premise
Formal constructs for constraint enforcement taken from robotics can be transferred to foundation model interactions to deliver enforceable behavioral guarantees in uncertain social contexts.
What would settle it
A real-world deployment of the Grounded Observer framework in which interaction trajectories still drift into undesirable regimes despite the applied robotics-inspired constraints.
Original abstract
Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reframes guardrails for foundation models in socially sensitive domains (education, mental health, caregiving) as a runtime behavioral control problem over interaction trajectories. Drawing on robotics, it introduces formal constructs for constraint enforcement in uncertain closed-loop systems, instantiates them in the Grounded Observer framework, and applies the framework to three real-world deployments (small talk, in-home autism therapy, behavioral de-escalation in schools), claiming that the approach enables runtime interventions that mitigate drift into undesirable regimes while adapting to context.
Significance. If the robotics-derived formal constructs can be shown to deliver enforceable behavioral guarantees when transferred to foundation-model interaction trajectories, the work would offer a substantive shift from empirical risk reduction to control-theoretic safety mechanisms. This could strengthen reliability in high-stakes social deployments and open new research directions on closed-loop constraint enforcement for language models. The three deployments provide an initial test bed, but the significance is currently limited by the absence of supporting derivations or quantitative validation.
Major comments (3)
- [Abstract] Abstract (reframing paragraph): the claim that robotics formal constructs (constraint sets, barrier functions, closed-loop enforcement) can be transferred to foundation-model interactions rests on an unshown assumption of observability and actuation; the manuscript supplies no state-estimation mechanism, invariance proof, or definition of how partial text observations and next-token actuation would realize these constructs.
- [Grounded Observer framework] Grounded Observer framework instantiation: the central assertion that the framework mitigates drift across the three deployments is load-bearing yet unsupported by any equations, quantitative results, error analysis, or before/after metrics demonstrating that the claimed runtime interventions actually enforce constraints or adapt to context.
- [Deployments] Deployments section: without explicit mapping of robotics-style closed-loop control to the open-loop disturbances introduced by human responses and the discrete, stochastic nature of token sampling, it remains unclear whether the framework can deliver the enforceable guarantees asserted in the abstract.
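For reference, the constructs named in these comments have a standard discrete-time statement in the control literature. What follows is the textbook form, not a derivation taken from the manuscript, and the symbols are notation assumed here: $x_k$ is the state at turn $k$, $u_k$ the control action, $w_k$ an exogenous disturbance, $h$ a barrier function, and $\gamma$ a decay rate.

```latex
\begin{align}
  x_{k+1} &= f(x_k, u_k, w_k), \\
  \mathcal{C} &= \{\, x : h(x) \ge 0 \,\}, \\
  h(x_{k+1}) - h(x_k) &\ge -\gamma\, h(x_k), \qquad 0 < \gamma \le 1 .
\end{align}
```

If the last inequality holds at every step and $h(x_0) \ge 0$, then $h(x_k) \ge (1-\gamma)^k h(x_0) \ge 0$, so the state never leaves the safe set $\mathcal{C}$. The comments above amount to asking what $x_k$, $u_k$, $w_k$, and $h$ are in a dialogue setting, and whether $u_k$ can be chosen to satisfy the inequality for every admissible $w_k$.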
Minor comments (1)
- [Abstract] The abstract lists three deployments but reports no outcome metrics or qualitative observations; adding a brief summary table of observed drift mitigation would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We appreciate the emphasis on the need for clearer mappings and supporting evidence when transferring robotics concepts to foundation model trajectories. Below we respond point by point to the major comments, clarifying the current scope of the work as a conceptual framework with illustrative case studies while indicating revisions where appropriate.
- Referee: [Abstract] Abstract (reframing paragraph): the claim that robotics formal constructs (constraint sets, barrier functions, closed-loop enforcement) can be transferred to foundation-model interactions rests on an unshown assumption of observability and actuation; the manuscript supplies no state-estimation mechanism, invariance proof, or definition of how partial text observations and next-token actuation would realize these constructs.
  Authors: We agree that the abstract states the transfer at a high level without spelling out the realization details. The full manuscript defines the interaction state as the grounded trajectory (dialogue history plus contextual variables extracted from the domain), treats partial text outputs as observations, and realizes actuation via runtime filtering or guidance of the token distribution to keep the trajectory inside the constraint set. However, we do not supply a formal state estimator or invariance proof; the contribution is the reframing and framework rather than a complete control-theoretic derivation. We will revise the abstract to qualify the claim and add an explicit subsection mapping observability and actuation to the text-based setting.
  Revision: partial
- Referee: [Grounded Observer framework] Grounded Observer framework instantiation: the central assertion that the framework mitigates drift across the three deployments is load-bearing yet unsupported by any equations, quantitative results, error analysis, or before/after metrics demonstrating that the claimed runtime interventions actually enforce constraints or adapt to context.
  Authors: The deployments are presented as qualitative illustrations of how the framework can be instantiated in sensitive real-world contexts rather than as quantitative validation experiments. Ethical and privacy constraints in autism therapy and school de-escalation settings precluded collection of before/after metrics or error analysis in this initial report. The mitigation of drift is described through the specific intervention points applied in each case study. We accept that stronger evidence is needed for the load-bearing claim and will expand the framework section with additional equations describing the observer update and constraint enforcement steps, together with any available qualitative logs of interventions.
  Revision: partial
- Referee: [Deployments] Deployments section: without explicit mapping of robotics-style closed-loop control to the open-loop disturbances introduced by human responses and the discrete, stochastic nature of token sampling, it remains unclear whether the framework can deliver the enforceable guarantees asserted in the abstract.
  Authors: We acknowledge that the current text does not provide a detailed mapping of how human responses (treated as exogenous disturbances) and stochastic token sampling are accommodated within the closed-loop formulation. In the manuscript the observer monitors the evolving trajectory and issues corrective actions at each turn to steer the system back toward the safe set despite these disturbances. We will add an explicit mapping table in the deployments section that relates each robotics construct (state, observation, actuation, barrier) to its counterpart in the foundation-model interaction loop, including how discrete stochasticity is handled through set-valued constraints over token sequences.
  Revision: yes
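The rebuttal describes actuation as "runtime filtering or guidance of the token distribution". One simple way to read that phrase is hard masking: remove disallowed tokens from the next-token distribution and renormalize what remains. The sketch below assumes that reading; the function name `filter_next_token`, the toy vocabulary, and the probabilities are hypothetical and do not come from the manuscript.

```python
from typing import Dict, Set


def filter_next_token(probs: Dict[str, float], disallowed: Set[str]) -> Dict[str, float]:
    """Drop disallowed tokens and renormalize the remaining probability mass.

    Raises ValueError when no admissible token has positive probability; a caller
    would then need some fallback action (hypothetical: a fixed safe reply).
    """
    kept = {tok: p for tok, p in probs.items() if tok not in disallowed and p > 0.0}
    total = sum(kept.values())
    if total <= 0.0:
        raise ValueError("constraint set leaves no admissible token")
    return {tok: p / total for tok, p in kept.items()}


# Toy next-token distribution; tokens and probabilities are made up.
probs = {"calm": 0.5, "shout": 0.3, "wait": 0.2}
safe = filter_next_token(probs, disallowed={"shout"})
print(safe)
```

Masking of this kind gives a hard guarantee only about which tokens can be sampled. Whether a token-level constraint implies the trajectory-level property the referee asks about is a separate question that this sketch does not address.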
Circularity Check
No significant circularity detected
The paper reframes guardrails for foundation models as runtime behavioral control over interaction trajectories by drawing on robotics concepts for constraint enforcement in uncertain closed-loop systems, then instantiates the ideas in a Grounded Observer framework applied to three deployments. No equations, fitted parameters, self-referential definitions, or load-bearing self-citations appear in the abstract or described structure. The central proposal is a conceptual transfer and instantiation rather than a derivation that reduces to its own inputs by construction. The chain remains self-contained as an independent proposal without renaming known results or smuggling ansatzes via citation.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Formal constructs for constraint enforcement in uncertain closed-loop systems from robotics can be transferred to foundation model interaction trajectories to provide enforceable guarantees.
Invented entities (1)
- Grounded Observer framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean (Jcost uniqueness); Foundation/AlexanderDuality.lean (D=3) reality_from_one_distinction; washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems... safe sets, reachability, barrier functions, and runtime assurance
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.