Evaluating Multi-turn Human-AI Interaction

Shi Ding; Sijian Tan

arxiv: 2605.18660 · v1 · pith:67TKQPGGnew · submitted 2026-05-18 · 💻 cs.HC

Pith reviewed 2026-05-20 08:29 UTC · model grok-4.3

classification 💻 cs.HC

keywords human-AI interaction · multi-turn evaluation · LLM assistants · transparency · consistency · refinement · NLP evaluation · educational AI

The pith

Current NLP metrics miss multi-turn behaviors, so TCR evaluates transparency, consistency, and refinement in human-AI interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard evaluation of language models relies on aggregate measures such as accuracy and fluency, which do not account for how AI systems behave over repeated exchanges with users. The paper identifies this gap in settings such as educational assistants, where staying consistent and refining answers as the dialogue proceeds matters. It proposes TCR as a way to assess transparency about the AI's process, consistency across responses, and the ability to refine outputs iteratively. Structured prompts make the evaluation practical and illustrate how it can supplement existing methods, including LLM-as-a-judge approaches. The aim is evaluation that is better aligned with real human interactions.

Core claim

The paper introduces TCR as a structured framework for evaluating human-AI interaction using educational LLM assistants as an illustrative example, emphasizing dimensions such as transparency, consistency, and refinement. It examines limitations of current NLP evaluation practices centered on aggregate metrics and presents structured evaluation prompts and illustrative interaction examples to demonstrate how structured evaluation complements aggregate metrics and LLM-as-a-judge approaches.

What carries the argument

TCR framework, a set of dimensions and structured prompts focused on transparency, consistency, and refinement to assess interactive behaviors in multi-turn conversations.

Load-bearing premise

The dimensions of transparency, consistency, and refinement together with the provided structured prompts and illustrative examples are sufficient to meaningfully complement aggregate metrics and LLM-as-a-judge methods in practice.

What would settle it

Apply TCR and standard aggregate metrics to the same collection of multi-turn educational dialogues and check whether TCR identifies inconsistencies or refinement failures that the aggregate scores overlook.
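The check described above can be sketched in a few lines. This is a minimal illustration, not the paper's procedure: the `Dialogue` record, the per-turn accuracy scores, the 0.8 threshold, and the boolean failure flags per TCR dimension are all assumptions introduced here, and the data are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Dialogue:
    # Hypothetical record; field names are illustrative, not from the paper.
    dialogue_id: str
    turn_accuracy: list[float]   # per-turn aggregate-style score in [0, 1]
    tcr_flags: dict[str, bool]   # True means a rater flagged a failure on that dimension


def overlooked_by_aggregate(dialogues, accuracy_threshold=0.8):
    """Return IDs of dialogues that pass the aggregate threshold
    yet carry at least one TCR failure flag."""
    missed = []
    for d in dialogues:
        passes_aggregate = mean(d.turn_accuracy) >= accuracy_threshold
        if passes_aggregate and any(d.tcr_flags.values()):
            missed.append(d.dialogue_id)
    return missed


# Invented example data.
dialogues = [
    Dialogue("d1", [0.9, 0.9, 0.9],
             {"transparency": False, "consistency": True, "refinement": False}),
    Dialogue("d2", [0.9, 1.0, 0.8],
             {"transparency": False, "consistency": False, "refinement": False}),
    Dialogue("d3", [0.4, 0.5, 0.6],
             {"transparency": True, "consistency": False, "refinement": True}),
]

print(overlooked_by_aggregate(dialogues))
```

A non-empty result on real dialogues would be evidence that TCR surfaces problems the aggregate score hides; an empty result on a large, varied collection would count against the claim.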

Original abstract

Large language models (LLMs) are increasingly used as collaborative assistants, yet dominant NLP evaluation practices remain centered on aggregate metrics such as accuracy and fluency. These approaches often overlook behaviors that are critical in human-facing settings (e.g., consistency across multiple turns and iterative refinement). In this paper, we examine limitations of current NLP evaluation practices and introduce TCR, a structured framework for evaluating human-AI interaction using educational LLM assistants as an illustrative example. TCR emphasizes dimensions such as transparency, consistency, and refinement. We further present structured evaluation prompts and illustrative interaction examples demonstrating how structured evaluation can complement aggregate metrics and LLM-as-a-judge approaches. Our work highlights the need for more human-centered evaluation practices for interactive LLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TCR is a reasonable attempt to define better evaluation for multi-turn educational AI chats, but it stops at prompts and examples with no reliability data or comparisons.

The main takeaway is that this paper flags how standard accuracy and fluency scores miss key aspects of ongoing human-AI exchanges and offers TCR (transparency, consistency, and refinement) as a structured alternative for educational assistants. It supplies evaluation prompts and a handful of dialogue examples to illustrate the point. That is the core contribution: a shift toward dimensions that matter in collaborative, iterative settings rather than one-shot metrics. The prompts look concrete enough that someone could try them out on their own logs without much extra work. The examples make the dimensions feel applicable to real tutoring-style interactions. This is useful for anyone already frustrated with aggregate scores in HCI work on LLMs.

The soft spot is the complete absence of any check on whether TCR actually works in practice. There is no report of multiple raters scoring the same sessions to measure agreement, no head-to-head comparison against baseline metrics on the same data, and no test of whether the three dimensions are independent or just restate each other. Without that, the claim that TCR complements existing methods stays at the level of suggestion.

The paper is aimed at HCI researchers and educational technology developers who need evaluation tools for conversational systems. A reader already working on multi-turn assistants could borrow the prompts as a quick starting point even if the framework is not yet proven. It deserves peer review so referees can ask for the missing validation steps and help turn the idea into something more testable.
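The agreement check the letter asks for needs very little machinery. The sketch below computes Cohen's kappa for two raters who labelled the same sessions on one TCR dimension. It assumes a categorical label per session (the labels "pass" and "fail" are hypothetical); the paper reports no such ratings, so the data are invented for illustration.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    and p_e is the agreement expected by chance from each rater's
    label frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label throughout.
        # Kappa is formally undefined here; returning 1.0 is a convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented ratings of four sessions on the consistency dimension.
rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(rater_a, rater_b))  # prints 0.5
```

Kappa is used here rather than raw percent agreement because it discounts agreement that two raters would reach by chance. For more than two raters, or for ordinal rubric scores, a different statistic would be the better fit.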

Referee Report

3 major / 2 minor

Summary. The paper argues that dominant NLP evaluation practices centered on aggregate metrics such as accuracy and fluency overlook critical behaviors in multi-turn human-AI interactions, including consistency across turns and iterative refinement. It introduces the TCR framework (transparency, consistency, refinement) as a structured approach for evaluating such interactions, using educational LLM assistants as an illustrative domain. The manuscript provides structured evaluation prompts and example interactions to demonstrate how TCR can complement aggregate metrics and LLM-as-a-judge methods.

Significance. If the TCR dimensions can be shown to be reliably scorable and to capture non-redundant information beyond existing metrics, the framework could help shift evaluation of interactive LLM systems toward more human-centered criteria. The structured prompts are a practical element that might aid reproducibility in human evaluations of multi-turn dialogues.

major comments (3)

[Abstract] Abstract and illustrative examples section: The central claim that TCR 'meaningfully complements aggregate metrics and LLM-as-a-judge approaches' is not supported by any comparative data, inter-rater reliability statistics, or outcome improvement metrics; the manuscript presents only definitions and unvalidated examples.
[TCR Framework] TCR framework description: The three dimensions are defined without an operationalization that demonstrates they are non-redundant (e.g., no discussion of potential correlations between transparency and refinement scores or a scoring rubric with example annotations).
[Illustrative Examples] Illustrative interaction examples: No quantitative validation, error analysis, or baseline comparison is reported on the same set of interactions, so it remains unclear whether TCR scores provide additional insight or stable judgments beyond what aggregate metrics already capture.
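One way to probe the non-redundancy concern in the second major comment is to correlate per-dialogue scores across the three dimensions. The following is a minimal sketch under stated assumptions: it supposes each dialogue receives a numeric score per dimension on a hypothetical 1-to-5 scale (the manuscript defines no such rubric), and the scores are invented. Pearson correlation is chosen only for simplicity.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(scores):
    """Correlate every pair of dimensions; keys are (dimension, dimension) tuples."""
    return {
        (a, b): pearson(scores[a], scores[b])
        for a, b in combinations(scores, 2)
    }


# Invented 1-to-5 scores for five dialogues (hypothetical rubric).
scores = {
    "transparency": [1, 2, 3, 4, 5],
    "consistency": [2, 1, 4, 3, 5],
    "refinement": [1, 2, 3, 4, 5],
}

for pair, r in pairwise_correlations(scores).items():
    print(pair, round(r, 2))
```

A correlation near 1 between two dimensions across many dialogues would suggest they restate each other; low or moderate correlations would support treating them as separate axes. For ordinal rubric scores a rank-based coefficient may be more appropriate than Pearson.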

minor comments (2)

[Abstract] The abstract could more concisely separate the problem statement from the proposed framework and its intended use cases.
Consider adding a summary table that lists each TCR dimension, its definition, and the corresponding structured prompt for quick reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the manuscript, as a conceptual introduction to the TCR framework with illustrative examples, would benefit from clearer scoping, added operational details, and explicit discussion of limitations. We address each major comment below and outline planned revisions.

Referee: [Abstract] Abstract and illustrative examples section: The central claim that TCR 'meaningfully complements aggregate metrics and LLM-as-a-judge approaches' is not supported by any comparative data, inter-rater reliability statistics, or outcome improvement metrics; the manuscript presents only definitions and unvalidated examples.

Authors: We agree that the manuscript provides no quantitative comparative data, reliability statistics, or outcome metrics to support the claim of meaningful complementarity. The work is positioned as a conceptual framework introduction using educational LLM assistants as an illustrative domain, with examples intended to show potential rather than prove superiority. In revision we will update the abstract and conclusion to state that TCR provides a structured, human-centered lens that can complement existing approaches, as illustrated by the examples, while adding an explicit limitations section and outline of planned empirical validation studies including inter-rater reliability and direct comparisons.
revision: yes

Referee: [TCR Framework] TCR framework description: The three dimensions are defined without an operationalization that demonstrates they are non-redundant (e.g., no discussion of potential correlations between transparency and refinement scores or a scoring rubric with example annotations).

Authors: We acknowledge the absence of a detailed scoring rubric and discussion of dimension interdependence. The current definitions are intentionally high-level to introduce the framework. We will revise the TCR framework section to include a concrete scoring rubric with annotated examples for each dimension and add a subsection addressing potential correlations (e.g., transparency enabling refinement) and approaches for assessing non-redundancy in future applications.
revision: yes

Referee: [Illustrative Examples] Illustrative interaction examples: No quantitative validation, error analysis, or baseline comparison is reported on the same set of interactions, so it remains unclear whether TCR scores provide additional insight or stable judgments beyond what aggregate metrics already capture.

Authors: The examples serve to demonstrate TCR application rather than constitute a validation study. We will add an error analysis subsection that examines the provided interactions for cases where TCR surfaces issues (such as cross-turn inconsistency) not directly captured by aggregate metrics. We will also explicitly note the lack of quantitative validation and baseline comparisons as a limitation and an avenue for subsequent work.
revision: partial

Circularity Check

0 steps flagged

No circularity: TCR is a definitional framework without reduction to inputs

The paper proposes TCR (transparency, consistency, refinement) as a new structured evaluation framework for multi-turn human-AI interactions in educational settings, supported by definitions, structured prompts, and illustrative examples. No mathematical derivations, equations, fitted parameters, or predictions are presented that could reduce to prior data or self-referential constructions. The text does not invoke self-citations as load-bearing justifications, uniqueness theorems, or ansatzes smuggled from prior work; the contribution is explicitly positioned as complementing existing aggregate metrics rather than deriving from them. This is a self-contained definitional and illustrative proposal with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Only the abstract is available, so the ledger reflects the high-level claims and assumptions stated there; the main addition is the TCR framework itself.

axioms (2)

domain assumption: Current NLP evaluation practices centered on aggregate metrics such as accuracy and fluency overlook critical multi-turn behaviors including consistency and iterative refinement.
This premise directly motivates the introduction of TCR in the abstract.

domain assumption: Transparency, consistency, and refinement are the appropriate core dimensions for a structured evaluation framework in human-AI educational interactions.
The TCR framework is explicitly built around these three dimensions.

invented entities (1)

TCR framework (no independent evidence)
purpose: To provide structured evaluation of multi-turn human-AI interactions by emphasizing transparency, consistency, and refinement.
Newly defined in the paper as an alternative to dominant aggregate-metric practices.

pith-pipeline@v0.9.0 · 5635 in / 1483 out tokens · 45564 ms · 2026-05-20T08:29:40.922379+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce TCR as a lightweight framework for evaluating model behaviors in multi-turn human-facing AI systems... TCR emphasizes dimensions such as transparency, consistency, and refinement.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Dominant NLP evaluation practices remain centered on aggregate metrics such as accuracy and fluency.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
