arxiv: 2605.18660 · v1 · pith:67TKQPGGnew · submitted 2026-05-18 · 💻 cs.HC

Evaluating Multi-turn Human-AI Interaction

Pith reviewed 2026-05-20 08:29 UTC · model grok-4.3

classification 💻 cs.HC
keywords human-AI interaction · multi-turn evaluation · LLM assistants · transparency · consistency · refinement · NLP evaluation · educational AI

The pith

Current NLP metrics miss multi-turn behaviors, so TCR evaluates transparency, consistency, and refinement in human-AI interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard evaluation of language models relies on aggregate measures such as accuracy and fluency, which do not capture how AI systems behave over repeated exchanges with users. The paper identifies this gap in settings such as educational assistants, where staying consistent across turns and refining answers in response to ongoing dialogue matter. It proposes TCR as a way to assess transparency about the AI's process, consistency across responses, and the ability to refine outputs iteratively. Structured prompts are provided to make this evaluation practical, and illustrative examples suggest it can supplement existing methods such as LLM-as-a-judge. The approach aims to align evaluation more closely with real human interactions.

Core claim

The paper introduces TCR as a structured framework for evaluating human-AI interaction using educational LLM assistants as an illustrative example, emphasizing dimensions such as transparency, consistency, and refinement. It examines limitations of current NLP evaluation practices centered on aggregate metrics and presents structured evaluation prompts and illustrative interaction examples to demonstrate how this complements aggregate metrics and LLM-as-a-judge approaches.

What carries the argument

TCR framework, a set of dimensions and structured prompts focused on transparency, consistency, and refinement to assess interactive behaviors in multi-turn conversations.

Load-bearing premise

The dimensions of transparency, consistency, and refinement together with the provided structured prompts and illustrative examples are sufficient to meaningfully complement aggregate metrics and LLM-as-a-judge methods in practice.

What would settle it

Apply TCR and standard aggregate metrics to the same collection of multi-turn educational dialogues and check whether TCR identifies inconsistencies or refinement failures that the aggregate scores overlook.
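
The check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's procedure: the `Turn` record, the `question_id` field, the exact-match notion of consistency, and the 0.75 threshold are all hypothetical, and the two dialogues are made-up data.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Turn:
    question_id: str  # hypothetical: identifies the underlying question asked
    answer: str       # normalized assistant answer for that turn
    correct: bool     # per-turn correctness label feeding the aggregate metric


def aggregate_accuracy(dialogue):
    """Fraction of turns labeled correct (a simple aggregate score)."""
    return mean(1.0 if t.correct else 0.0 for t in dialogue)


def inconsistent_questions(dialogue):
    """Question ids whose answer changed between turns (assumed consistency check)."""
    first_answer = {}
    flagged = set()
    for t in dialogue:
        if t.question_id in first_answer and first_answer[t.question_id] != t.answer:
            flagged.add(t.question_id)
        first_answer.setdefault(t.question_id, t.answer)
    return flagged


def overlooked_by_aggregate(dialogues, threshold=0.75):
    """Dialogues that pass the aggregate threshold yet contain an inconsistency."""
    return [
        name
        for name, d in dialogues.items()
        if aggregate_accuracy(d) >= threshold and inconsistent_questions(d)
    ]


# Made-up data: d1 scores 0.75 overall but contradicts itself on q1.
dialogues = {
    "d1": [Turn("q1", "4", True), Turn("q2", "x", True),
           Turn("q3", "y", True), Turn("q1", "5", False)],
    "d2": [Turn("q1", "4", True), Turn("q1", "4", True)],
}
result = overlooked_by_aggregate(dialogues)
print(result)  # ['d1']
```

If a list like `result` is routinely non-empty on real dialogues, that would be evidence that a consistency dimension surfaces failures the aggregate score hides; if it is always empty, the dimension adds little beyond the aggregate.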

Original abstract

Large language models (LLMs) are increasingly used as collaborative assistants, yet dominant NLP evaluation practices remain centered on aggregate metrics such as accuracy and fluency. These approaches often overlook behaviors that are critical in human-facing settings (e.g., consistency across multiple turns and iterative refinement). In this paper, we examine limitations of current NLP evaluation practices and introduce TCR, a structured framework for evaluating human-AI interaction using educational LLM assistants as an illustrative example. TCR emphasizes dimensions such as transparency, consistency, and refinement. We further present structured evaluation prompts and illustrative interaction examples demonstrating how structured evaluation can complement aggregate metrics and LLM-as-a-judge approaches. Our work highlights the need for more human-centered evaluation practices for interactive LLM systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that dominant NLP evaluation practices centered on aggregate metrics such as accuracy and fluency overlook critical behaviors in multi-turn human-AI interactions, including consistency across turns and iterative refinement. It introduces the TCR framework (transparency, consistency, refinement) as a structured approach for evaluating such interactions, using educational LLM assistants as an illustrative domain. The manuscript provides structured evaluation prompts and example interactions to demonstrate how TCR can complement aggregate metrics and LLM-as-a-judge methods.

Significance. If the TCR dimensions can be shown to be reliably scorable and to capture non-redundant information beyond existing metrics, the framework could help shift evaluation of interactive LLM systems toward more human-centered criteria. The structured prompts are a practical element that might aid reproducibility in human evaluations of multi-turn dialogues.

major comments (3)
  1. [Abstract] Abstract and illustrative examples section: The central claim that TCR 'meaningfully complements aggregate metrics and LLM-as-a-judge approaches' is not supported by any comparative data, inter-rater reliability statistics, or outcome improvement metrics; the manuscript presents only definitions and unvalidated examples.
  2. [TCR Framework] TCR framework description: The three dimensions are defined without an operationalization that demonstrates they are non-redundant (e.g., no discussion of potential correlations between transparency and refinement scores or a scoring rubric with example annotations).
  3. [Illustrative Examples] Illustrative interaction examples: No quantitative validation, error analysis, or baseline comparison is reported on the same set of interactions, so it remains unclear whether TCR scores provide additional insight or stable judgments beyond what aggregate metrics already capture.
minor comments (2)
  1. [Abstract] The abstract could more concisely separate the problem statement from the proposed framework and its intended use cases.
  2. Consider adding a summary table that lists each TCR dimension, its definition, and the corresponding structured prompt for quick reference.
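
Major comment 1 asks for inter-rater reliability statistics. One common choice for two raters assigning categorical labels is Cohen's kappa, sketched below in plain Python. The pass/fail labels and the two rating lists are hypothetical and do not come from the paper; returning 1.0 when chance agreement is already 1.0 is a convention chosen here for the degenerate case.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four dialogues on one TCR dimension.
rater_a = ["pass", "pass", "fail", "fail"]
rater_b = ["pass", "fail", "fail", "fail"]
kappa = cohens_kappa(rater_a, rater_b)
print(kappa)  # 0.5
```

Reporting a value like this per dimension would show whether the structured prompts yield stable judgments across annotators.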

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We agree that the manuscript, as a conceptual introduction to the TCR framework with illustrative examples, would benefit from clearer scoping, added operational details, and explicit discussion of limitations. We address each major comment below and outline planned revisions.

Point-by-point responses
  1. Referee: [Abstract] Abstract and illustrative examples section: The central claim that TCR 'meaningfully complements aggregate metrics and LLM-as-a-judge approaches' is not supported by any comparative data, inter-rater reliability statistics, or outcome improvement metrics; the manuscript presents only definitions and unvalidated examples.

    Authors: We agree that the manuscript provides no quantitative comparative data, reliability statistics, or outcome metrics to support the claim of meaningful complementarity. The work is positioned as a conceptual framework introduction using educational LLM assistants as an illustrative domain, with examples intended to show potential rather than prove superiority. In revision we will update the abstract and conclusion to state that TCR provides a structured, human-centered lens that can complement existing approaches, as illustrated by the examples, while adding an explicit limitations section and outline of planned empirical validation studies including inter-rater reliability and direct comparisons. revision: yes

  2. Referee: [TCR Framework] TCR framework description: The three dimensions are defined without an operationalization that demonstrates they are non-redundant (e.g., no discussion of potential correlations between transparency and refinement scores or a scoring rubric with example annotations).

    Authors: We acknowledge the absence of a detailed scoring rubric and discussion of dimension interdependence. The current definitions are intentionally high-level to introduce the framework. We will revise the TCR framework section to include a concrete scoring rubric with annotated examples for each dimension and add a subsection addressing potential correlations (e.g., transparency enabling refinement) and approaches for assessing non-redundancy in future applications. revision: yes

  3. Referee: [Illustrative Examples] Illustrative interaction examples: No quantitative validation, error analysis, or baseline comparison is reported on the same set of interactions, so it remains unclear whether TCR scores provide additional insight or stable judgments beyond what aggregate metrics already capture.

    Authors: The examples serve to demonstrate TCR application rather than constitute a validation study. We will add an error analysis subsection that examines the provided interactions for cases where TCR surfaces issues (such as cross-turn inconsistency) not directly captured by aggregate metrics. We will also explicitly note the lack of quantitative validation and baseline comparisons as a limitation and an avenue for subsequent work. revision: partial
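
The non-redundancy question raised in major comment 2 could be probed by correlating per-dialogue scores across the three dimensions. The sketch below computes pairwise Pearson correlations in plain Python. The 1 to 5 scores are invented for illustration, and the 0.9 cutoff for calling a pair redundant is an arbitrary assumption, not a recommendation from the paper.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Invented per-dialogue scores (1 to 5) for five dialogues.
scores = {
    "transparency": [1, 2, 3, 4, 5],
    "consistency": [2, 1, 4, 3, 5],
    "refinement": [1, 2, 3, 4, 5],
}

pairwise = {
    (a, b): pearson(scores[a], scores[b]) for a, b in combinations(scores, 2)
}
# Assumed cutoff: treat |r| >= 0.9 as a sign the two dimensions overlap.
redundant = [pair for pair, r in pairwise.items() if abs(r) >= 0.9]
print(redundant)  # [('transparency', 'refinement')]
```

A high correlation on real annotations would not by itself prove two dimensions are the same construct, but it would flag the pair for the closer rubric-level discussion the authors promise.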

Circularity Check

0 steps flagged

No circularity: TCR is a definitional framework without reduction to inputs

full rationale

The paper proposes TCR (transparency, consistency, refinement) as a new structured evaluation framework for multi-turn human-AI interactions in educational settings, supported by definitions, structured prompts, and illustrative examples. No mathematical derivations, equations, fitted parameters, or predictions are presented that could reduce to prior data or self-referential constructions. The text does not invoke self-citations as load-bearing justifications, uniqueness theorems, or ansatzes smuggled from prior work; the contribution is explicitly positioned as complementing existing aggregate metrics rather than deriving from them. This is a self-contained definitional and illustrative proposal with no detectable circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Only the abstract is available, so the ledger reflects the high-level claims and assumptions stated there; the main addition is the TCR framework itself.

axioms (2)
  • [domain assumption] Current NLP evaluation practices centered on aggregate metrics such as accuracy and fluency overlook critical multi-turn behaviors including consistency and iterative refinement.
    This premise directly motivates the introduction of TCR in the abstract.
  • [domain assumption] Transparency, consistency, and refinement are the appropriate core dimensions for a structured evaluation framework in human-AI educational interactions.
    The TCR framework is explicitly built around these three dimensions.
invented entities (1)
  • TCR framework [no independent evidence]
    purpose: To provide structured evaluation of multi-turn human-AI interactions by emphasizing transparency, consistency, and refinement.
    Newly defined in the paper as an alternative to dominant aggregate-metric practices.

pith-pipeline@v0.9.0 · 5635 in / 1483 out tokens · 45564 ms · 2026-05-20T08:29:40.922379+00:00 · methodology
