arxiv: 2212.09746 · v5 · pith:HSVQEW3Wnew · submitted 2022-12-19 · 💻 cs.CL

Evaluating Human-Language Model Interaction

classification 💻 cs.CL
keywords interaction, evaluation, human-lm, non-interactive, interactive, better, halie, metrics

Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CIG: Measuring Conversational Information Gain in Deliberative Dialogues with Semantic Memory Dynamics

    cs.CL 2026-04 unverdicted novelty 7.0

    CIG scores utterances using novelty, relevance, and implication scope derived from a dynamic semantic memory model, outperforming traditional heuristics in correlating with human judgments on deliberative segments.

  2. ResearchCube: Multi-Dimensional Trade-off Exploration for Research Ideation

    cs.HC 2026-04 unverdicted novelty 7.0

    ResearchCube provides a 3D spatial interface with bipolar trade-off dimensions and direct-manipulation interactions to support multi-dimensional research ideation, shown helpful in a study with 11 researchers for exte...

  3. Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

    cs.SE 2026-01 conditional novelty 7.0

    Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

  4. A Meta Reinforcement Learning Approach to Goals-Based Wealth Management

    cs.LG 2026-05 unverdicted novelty 6.0

    MetaRL pre-trained on GBWM problems delivers near-optimal dynamic strategies in 0.01s achieving 97.8% of DP optimal utility and handles larger problems where DP fails.

  5. LLMs Get Lost In Multi-Turn Conversation

    cs.CL 2025-05 unverdicted novelty 6.0

    LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.

  6. A Survey on Large Language Model based Autonomous Agents

    cs.AI 2023-08 accept novelty 6.0

    A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...

  7. BloombergGPT: A Large Language Model for Finance

    cs.LG 2023-03 conditional novelty 6.0

    BloombergGPT is a 50B parameter LLM trained on a 708B token mixed financial and general dataset that outperforms prior models on financial benchmarks while preserving general LLM performance.

  8. Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles

    cs.HC 2026-05 unverdicted novelty 5.0

    LLM support roles in Alzheimer's caregiving queries systematically alter interactional risk prevalence and composition, with directive roles rated higher in quality despite elevated risks.

  9. CandorMD: An AI-Assisted Audio Simulation and Feedback System for Training Clinicians for Medical Error Disclosure

    cs.HC 2026-05 unverdicted novelty 5.0

    CandorMD is a new AI simulation and feedback system for training clinicians in medical error disclosure, informed by interviews with physicians, risk managers, and experts.

  10. Interactive Evaluation Requires a Design Science

    cs.AI 2026-05 unverdicted novelty 5.0

    Interactive evaluation of AI must be reframed as a distinct paradigm that maps interaction trajectories to judgments on process, recoverability, coordination, robustness, and system performance, supported by a two-axi...

  11. Semantic Reality: Interactive Context-Aware Visualization of Inter-Object Relationships in Augmented Reality

    cs.HC 2026-04 unverdicted novelty 4.0

    Semantic Reality maintains a persistent connectivity graph of objects in AR via multimodal reasoning and action recognition, then visualizes relationships to aid understanding and task guidance.