Seven clinician-informed safety criteria enable LLM-as-a-Judge to reach substantial agreement with human consensus (Cohen's κ up to 0.75) on evaluating LLM responses to users demonstrating psychosis.
Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, and Ronald A
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1
polarities
background 1
representative citing papers
IdeaBlocks modularizes divergent intents into Exploration Blocks with multi-level reuse options, enabling 2.13 times more images explored and 12.5% greater visual diversity than baseline in a comparative user study.
MAESTRO adds a shared preference memory plus GUI-adaptation and workflow-navigation mechanisms to conversational agents with GUIs and tests them in a 33-person movie-booking study.
LLM safety evaluations for personal advice must test responses against diverse user vulnerability profiles, since context-blind ratings overestimate safety and realistic prompt context does not fix the problem.
Proof-of-concept shows fine-tuned small language models achieve adequate quality for real-time game content generation in a scoped RPG loop via retry-until-success and LLM-as-judge evaluation.
citing papers explorer
- Using LLM-as-a-Judge/Jury to Advance Scalable, Clinically-Validated Safety Evaluations of Model Responses to Users Demonstrating Psychosis
- IdeaBlocks: Expressing and Reusing Divergent Intents for Graphic Design Exploration using Generative AI
- MAESTRO: Adapting GUIs and Guiding Navigation with User Preferences in Conversational Agents with GUIs
- Safe for Whom? Rethinking How We Evaluate the Safety of LLMs for Real Users
- High-quality generation of dynamic game content via small language models: A proof of concept