
arxiv: 2507.11198 · v2 · submitted 2025-07-15 · 💻 cs.CL · cs.AI

Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Pith reviewed 2026-05-19 03:58 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords LLM agents · multi-agent systems · qualitative coding · temperature · persona · consensus · accuracy · educational data

The pith

Temperature and persona settings shape when LLM multi-agent systems reach consensus but produce little accuracy gain over single agents in qualitative coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how temperature and persona assignments affect consensus formation and coding accuracy when multi-agent LLM systems deductively code dialog segments using a fixed codebook. It reports that these factors reliably alter the timing and likelihood of consensus yet fail to deliver consistent accuracy improvements relative to single LLM agents. A sympathetic reader would care because the results suggest that elaborate multi-agent arrangements may add complexity without proportional benefit for accuracy-driven annotation tasks against human gold standards. The experiments draw on over 77,000 decisions from six open-source models applied to real math-tutoring transcripts.

Core claim

Temperature significantly impacted whether and when consensus was reached across all six LLMs, multiple personas delayed consensus in four models, and higher temperatures diminished those persona effects in three models; however, neither temperature nor persona pairing produced robust improvements in coding accuracy, with single agents matching or outperforming MAS consensus in most conditions.

What carries the argument

The open-source multi-agent system that emulates deductive human coding through structured agent discussion and consensus arbitration.

If this is right

  • Consensus timing varies with temperature in every tested LLM.
  • Multiple personas delay consensus relative to uniform personas in four LLMs.
  • Higher temperatures reduce the delaying effect of multiple personas in three LLMs.
  • Single agents match or exceed MAS accuracy in most tested conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Analysis of coding disagreements within the MAS could guide refinements to codebook design.
  • The pattern of minimal accuracy gains may hold for other deductive annotation tasks beyond educational dialogues.
  • Researchers might default to single-agent prompting unless the consensus process itself is the object of study.

Load-bearing premise

The structured agent discussion and consensus arbitration in the multi-agent system accurately emulates the deductive human coding process captured by the gold-standard annotations.

What would settle it

A replication in which MAS consensus accuracy exceeds single-agent accuracy by a statistically significant margin across a majority of models and experimental conditions.
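
One way such a replication could test the margin is sketched below. It assumes the MAS and the single agent code the same segments, so each segment yields a paired correct/incorrect outcome against the gold standard. The source does not say which test the authors use; the exact McNemar test here is a swapped-in illustration, and the function name and inputs are hypothetical.

```python
from math import comb


def mcnemar_exact(mas_correct: list[bool], single_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    if len(mas_correct) != len(single_correct):
        raise ValueError("both lists must cover the same segments")
    # Only discordant pairs (exactly one system correct) carry information.
    b = sum(m and not s for m, s in zip(mas_correct, single_correct))
    c = sum(s and not m for m, s in zip(mas_correct, single_correct))
    n = b + c
    if n == 0:
        return 1.0  # the two systems never disagree on correctness
    # Under the null, discordant pairs split 50/50: a two-sided binomial test.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2**n
    return min(1.0, 2 * tail)


# Hypothetical toy data: the MAS is right on 5 segments the single agent misses.
mas = [True] * 5 + [True] * 3
single = [False] * 5 + [True] * 3
print(mcnemar_exact(mas, single))  # 0.0625
```

Because the claim spans many models and conditions, the resulting p-values would also need a multiple-comparison correction before counting how many conditions show a significant gain.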

Figures

Figures reproduced from arXiv: 2507.11198 by Bahar Shahrokhian, Conrad Borchers, Elham Tajik, Francesco Balzan, Sebastian Simon, Sreecharan Sankaranarayanan.

Figure 1: Distribution of consensus outcomes over total number of dialog segments and experi…
Figure 2: Mean difference in alignment between the MAS and single LLM coding with the…

Abstract

Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including annotation and qualitative coding of educational data. While LLM-based multi-agent systems (MAS) can emulate human coding workflows, their benefits over single LLM agents for coding remain poorly understood. To that end, we conducted an experimental study of how persona and temperature of component agents of a MAS shapes consensus-building and coding accuracy for dialog segments. LLMs were prompted to code these segments deductively using a mature codebook with 8 codes and high inter-rater reliability derived from prior research. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions facilitated by educational software. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Qualitative analysis of MAS collaboration and coding disagreement may, however, improve codebook design and human-AI coding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study of multi-agent LLM systems (MAS) for deductive qualitative coding of educational dialog segments against a human gold-standard dataset. Using six open-source LLMs (3B–32B parameters) and 18 configurations, the authors examine how temperature and persona diversity affect consensus timing and coding accuracy across >77,000 decisions. Key claims are that temperature modulates consensus, multiple personas delay consensus in four of six models, and neither factor produces robust accuracy gains; single agents match or exceed MAS performance in most conditions.

Significance. If the central empirical comparison holds after methodological clarification, the work provides useful evidence that added complexity of structured MAS discussion and arbitration may not improve accuracy over single-agent prompting for deductive coding with a mature, high-reliability codebook. The scale of the experiment and the open-source MAS implementation are strengths that could inform efficient LLM deployment in qualitative research pipelines.

major comments (2)
  1. [Methods] Methods (MAS description): The consensus arbitration procedure is described only at a high level as 'structured agent discussion and consensus arbitration.' It is not specified whether final codes are produced by majority vote, moderator override, iterative refinement until agreement, or another rule. This detail is load-bearing for the accuracy comparison because any aggregation rule that systematically favors high-frequency codes or penalizes rare ones could artifactually lower MAS accuracy relative to single agents and the gold-standard distribution.
  2. [Results] Results (accuracy claims): The assertion that 'single agents matched or outperformed MAS consensus in most conditions' is not supported by per-configuration accuracy numbers, confidence intervals, or statistical tests. Without these, it is impossible to evaluate whether the observed pattern is robust across the 18 experimental cells or driven by a subset of LLMs or code frequencies.
minor comments (2)
  1. [Abstract] Abstract: The final sentence states that 'Qualitative analysis of MAS collaboration and coding disagreement may, however, improve codebook design,' yet no such qualitative analysis is referenced in the provided text; either include a brief summary or revise the claim.
  2. [Methods] Notation: 'MAS' and 'single agents' are used throughout; a short table or paragraph defining the exact prompting templates and output formats for each would improve reproducibility.
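
To make major comment 1 concrete, the sketch below implements one of the rules it lists: a majority vote, with ties passed to a moderator. It is an assumption-laden illustration, not the paper's arbitration procedure. The function names and the code labels `code_a` and `code_b` are hypothetical placeholders, since the source does not name the eight codes.

```python
from collections import Counter
from typing import Callable, Sequence


def arbitrate(votes: Sequence[str], tie_breaker: Callable[[list[str]], str]) -> str:
    """Pick the most frequent code; defer to the tie-breaker when codes tie."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    top = max(counts.values())
    # Sort tied codes so the tie-breaker sees a deterministic candidate list.
    tied = sorted(code for code, n in counts.items() if n == top)
    return tied[0] if len(tied) == 1 else tie_breaker(tied)


def first_candidate(candidates: list[str]) -> str:
    """Hypothetical moderator: always picks the alphabetically first code."""
    return candidates[0]


print(arbitrate(["code_b", "code_b", "code_a"], first_candidate))  # code_b
print(arbitrate(["code_b", "code_a"], first_candidate))            # code_a
```

The second call shows why the rule matters for the accuracy comparison: whenever agents split evenly, the final code depends entirely on the moderator's preference, so a moderator biased toward frequent codes would shift the MAS output distribution.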

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

  1. Referee: [Methods] Methods (MAS description): The consensus arbitration procedure is described only at a high level as 'structured agent discussion and consensus arbitration.' It is not specified whether final codes are produced by majority vote, moderator override, iterative refinement until agreement, or another rule. This detail is load-bearing for the accuracy comparison because any aggregation rule that systematically favors high-frequency codes or penalizes rare ones could artifactually lower MAS accuracy relative to single agents and the gold-standard distribution.

    Authors: We agree that the arbitration procedure requires greater specificity. In the revised manuscript we will expand the Methods section to describe the exact consensus rules, including the voting mechanism (majority vote with moderator tie-breaker), the number of discussion rounds permitted, and how the final code is selected when agents disagree.

    Revision: yes

  2. Referee: [Results] Results (accuracy claims): The assertion that 'single agents matched or outperformed MAS consensus in most conditions' is not supported by per-configuration accuracy numbers, confidence intervals, or statistical tests. Without these, it is impossible to evaluate whether the observed pattern is robust across the 18 experimental cells or driven by a subset of LLMs or code frequencies.

    Authors: We accept that the accuracy comparison would be more convincing with disaggregated results. We will add a table (or supplementary table) reporting accuracy for each of the 18 configurations together with 95% confidence intervals and note any pairwise statistical comparisons between single-agent and MAS conditions.

    Revision: yes
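
The 95% confidence intervals promised in response 2 could be computed per configuration along the following lines. This is a minimal sketch using the Wilson score interval, which is an assumption: the source does not say which interval the authors would report. The function name and the example counts are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("need total > 0 and 0 <= correct <= total")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


# Hypothetical cell: 50 of 100 coding decisions match the gold standard.
low, high = wilson_interval(50, 100)
print(f"{low:.4f} to {high:.4f}")  # 0.4038 to 0.5962
```

The interval treats decisions as independent. Decisions on the same dialog segment across configurations are not, so an analysis that models segment-level dependence would give more trustworthy intervals.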

Circularity Check

0 steps flagged

Empirical comparison to human gold standard; no derivations or fitted predictions

full rationale

The paper reports direct experimental measurements of coding accuracy and consensus rates for single LLM agents versus MAS configurations across 18 setups and 77,000 decisions, benchmarked against an external human-annotated gold standard. No equations, parameter fits, or first-principles derivations are present that could reduce reported outcomes to inputs by construction. All load-bearing claims rest on observable experimental contrasts rather than self-referential definitions or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the human gold-standard annotations and the assumption that the MAS discussion protocol mirrors human deductive coding; no free parameters, new entities, or ad-hoc axioms beyond standard domain assumptions about inter-rater reliability are introduced.

axioms (2)
  • Domain assumption: The codebook with 8 codes has high inter-rater reliability derived from prior research.
    Invoked to justify deductive coding against the gold standard.
  • Domain assumption: Structured agent discussion and consensus arbitration in the MAS emulates human coding workflows.
    Stated as the design goal of the open-source MAS.
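
The first axiom rests on an inter-rater reliability figure. The source does not say which statistic the prior research reported; as an illustration only, the sketch below computes Cohen's kappa, a common choice for two raters assigning categorical codes. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: probability both raters pick the same code at random.
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / n**2
    if expected == 1.0:
        return 1.0  # both raters used one identical code throughout
    return (observed - expected) / (1 - expected)


# Hypothetical toy labels: the raters agree on 3 of 4 segments.
print(cohens_kappa(["x", "x", "y", "y"], ["x", "x", "y", "x"]))  # 0.5
```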

pith-pipeline@v0.9.0 · 5813 in / 1397 out tokens · 70609 ms · 2026-05-19T03:58:11.564941+00:00 · methodology

