Evaluating Human-Language Model Interaction

arXiv: 2212.09746 · v5 · pith:HSVQEW3Wnew · submitted 2022-12-19 · cs.CL

Mina Lee, Megha Srivastava, Amelia Hardy, John Thickstun, Esin Durmus, Ashwin Paranjape, Ines Gerard-Ursin, Xiang Lisa Li, Faisal Ladhak, Frieda Rong, Rose E. Wang, Minae Kwon, Joon Sung Park, Hancheng Cao, Tony Lee, Rishi Bommasani, Michael Bernstein, Percy Liang

keywords: interaction, evaluation, human-lm, non-interactive, interactive, better, halie, metrics

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics
cs.CL · 2026-04 · unverdicted · novelty 7.0
CIG scores utterances using novelty, relevance, and implication scope derived from a dynamic semantic memory model, outperforming traditional heuristics in correlating with human judgments on deliberative segments.

ResearchCube: Multi-Dimensional Trade-off Exploration for Research Ideation
cs.HC · 2026-04 · unverdicted · novelty 7.0
ResearchCube provides a 3D spatial interface with bipolar trade-off dimensions and direct-manipulation interactions to support multi-dimensional research ideation, shown helpful in a study with 11 researchers for exte...

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild
cs.SE · 2026-01 · conditional · novelty 7.0
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

A Meta Reinforcement Learning Approach to Goals-Based Wealth Management
cs.LG · 2026-05 · unverdicted · novelty 6.0
MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

LLMs Get Lost In Multi-Turn Conversation
cs.CL · 2025-05 · unverdicted · novelty 6.0
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

A Survey on Large Language Model based Autonomous Agents
cs.AI · 2023-08 · accept · novelty 6.0
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

BloombergGPT: A Large Language Model for Finance
cs.LG · 2023-03 · conditional · novelty 6.0
BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles
cs.HC · 2026-05 · unverdicted · novelty 5.0
LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.

CandorMD: An AI-Assisted Audio Simulation and Feedback System for Training Clinicians for Medical Error Disclosure
cs.HC · 2026-05 · unverdicted · novelty 5.0
CandorMD is a new AI simulation and feedback system for training clinicians in medical error disclosure, informed by interviews with physicians, risk managers, and experts.

Interactive Evaluation Requires a Design Science
cs.AI · 2026-05 · unverdicted · novelty 5.0
Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...

Semantic Reality: Interactive Context-Aware Visualization of Inter-Object Relationships in Augmented Reality
cs.HC · 2026-04 · unverdicted · novelty 4.0
Semantic Reality maintains a persistent connectivity graph of objects in AR via multimodal reasoning and action recognition, then visualizes relationships to aid understanding and task guidance.