Evaluating Human-Language Model Interaction
read the original abstract
Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.
This paper has not been read by Pith yet.
Forward citations
Cited by 11 Pith papers
-
CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
CIG scores utterances using novelty, relevance, and implication scope derived from a dynamic semantic memory model, outperforming traditional heuristics in correlating with human judgments on deliberative segments.
-
ResearchCube: Multi-Dimensional Trade-off Exploration for Research Ideation
ResearchCube provides a 3D spatial interface with bipolar trade-off dimensions and direct-manipulation interactions to support multi-dimensional research ideation, shown helpful in a study with 11 researchers for exte...
-
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
-
A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.
-
LLMs Get Lost In Multi-Turn Conversation
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
-
A Survey on Large Language Model based Autonomous Agents
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
-
BloombergGPT: A Large Language Model for Finance
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.
-
Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles
LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.
-
CandorMD: An AI-Assisted Audio Simulation and Feedback System for Training Clinicians for Medical Error Disclosure
CandorMD is a new AI simulation and feedback system for training clinicians in medical error disclosure, informed by interviews with physicians, risk managers, and experts.
-
Interactive Evaluation Requires a Design Science
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...
-
Semantic Reality: Interactive Context-Aware Visualization of Inter-Object Relationships in Augmented Reality
Semantic Reality maintains a persistent connectivity graph of objects in AR via multimodal reasoning and action recognition, then visualizes relationships to aid understanding and task guidance.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.