Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding

Bahar Shahrokhian; Conrad Borchers; Elham Tajik; Francesco Balzan; Sebastian Simon; Sreecharan Sankaranarayanan

arxiv: 2507.11198 · v2 · submitted 2025-07-15 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-19 03:58 UTC · model grok-4.3

keywords LLM agents · multi-agent systems · qualitative coding · temperature · persona · consensus · accuracy · educational data

The pith

Temperature and persona settings shape when LLM multi-agent systems reach consensus but produce little accuracy gain over single agents in qualitative coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how temperature and persona assignments affect consensus formation and coding accuracy when multi-agent LLM systems deductively code dialog segments using a fixed codebook. It reports that these factors reliably alter the timing and likelihood of consensus yet fail to deliver consistent accuracy improvements relative to single LLM agents. A sympathetic reader would care because the results suggest that elaborate multi-agent arrangements may add complexity without proportional benefit for accuracy-driven annotation tasks against human gold standards. The experiments draw on over 77,000 decisions from six open-source models applied to real math-tutoring transcripts.

Core claim

Temperature significantly impacted whether and when consensus was reached across all six LLMs, multiple personas delayed consensus in four models, and higher temperatures diminished those persona effects in three models; however, neither temperature nor persona pairing produced robust improvements in coding accuracy, with single agents matching or outperforming MAS consensus in most conditions.

What carries the argument

The open-source multi-agent system that emulates deductive human coding through structured agent discussion and consensus arbitration.

If this is right

Consensus timing varies with temperature in every tested LLM.
Multiple personas delay consensus relative to uniform personas in four LLMs.
Higher temperatures reduce the delaying effect of multiple personas in three LLMs.
Single agents match or exceed MAS accuracy in most tested conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analysis of coding disagreements within the MAS could guide refinements to codebook design.
The pattern of minimal accuracy gains may hold for other deductive annotation tasks beyond educational dialogues.
Researchers might default to single-agent prompting unless the consensus process itself is the object of study.

Load-bearing premise

The structured agent discussion and consensus arbitration in the multi-agent system accurately emulates the deductive human coding process captured by the gold-standard annotations.

What would settle it

A replication in which MAS consensus accuracy exceeds single-agent accuracy by a statistically significant margin across a majority of models and experimental conditions.
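One way such a replication could check for a "statistically significant margin" on a single model and condition is a paired test over segments that both systems coded. The sketch below uses an exact McNemar test, which is a swapped-in illustration and not the analysis the paper reports; the function names and input layout are hypothetical.

```python
from math import comb


def discordant_counts(mas_codes, single_codes, gold_codes):
    """Count segments where exactly one of the two systems matches the gold code."""
    mas_only = single_only = 0
    for mas, single, gold in zip(mas_codes, single_codes, gold_codes):
        if mas == gold and single != gold:
            mas_only += 1
        elif single == gold and mas != gold:
            single_only += 1
    return mas_only, single_only


def exact_mcnemar_p(mas_only_correct, single_only_correct):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = mas_only_correct + single_only_correct
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(mas_only_correct, single_only_correct)
    # Binomial(n, 0.5) tail probability for the smaller discordant count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because the claim spans a majority of models and conditions, p-values from many such tests would need a multiple-comparison correction before being counted.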

Figures

Figures reproduced from arXiv: 2507.11198 by Bahar Shahrokhian, Conrad Borchers, Elham Tajik, Francesco Balzan, Sebastian Simon, Sreecharan Sankaranarayanan.

**Figure 2.** Mean difference in alignment between the MAS and single LLM coding with the ...

Original abstract

Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including annotation and qualitative coding of educational data. While LLM-based multi-agent systems (MAS) can emulate human coding workflows, their benefits over single LLM agents for coding remain poorly understood. To that end, we conducted an experimental study of how persona and temperature of component agents of a MAS shape consensus-building and coding accuracy for dialog segments. LLMs were prompted to code these segments deductively using a mature codebook with 8 codes and high inter-rater reliability derived from prior research. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions facilitated by educational software. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Qualitative analysis of MAS collaboration and coding disagreement may, however, improve codebook design and human-AI coding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports an empirical study of multi-agent LLM systems (MAS) for deductive qualitative coding of educational dialog segments against a human gold-standard dataset. Using six open-source LLMs (3B–32B parameters) and 18 configurations, the authors examine how temperature and persona diversity affect consensus timing and coding accuracy across >77,000 decisions. Key claims are that temperature modulates consensus, multiple personas delay consensus in four of six models, and neither factor produces robust accuracy gains; single agents match or exceed MAS performance in most conditions.

Significance. If the central empirical comparison holds after methodological clarification, the work provides useful evidence that added complexity of structured MAS discussion and arbitration may not improve accuracy over single-agent prompting for deductive coding with a mature, high-reliability codebook. The scale of the experiment and the open-source MAS implementation are strengths that could inform efficient LLM deployment in qualitative research pipelines.

major comments (2)

[Methods] Methods (MAS description): The consensus arbitration procedure is described only at a high level as 'structured agent discussion and consensus arbitration.' It is not specified whether final codes are produced by majority vote, moderator override, iterative refinement until agreement, or another rule. This detail is load-bearing for the accuracy comparison because any aggregation rule that systematically favors high-frequency codes or penalizes rare ones could artifactually lower MAS accuracy relative to single agents and the gold-standard distribution.
[Results] Results (accuracy claims): The assertion that 'single agents matched or outperformed MAS consensus in most conditions' is not supported by per-configuration accuracy numbers, confidence intervals, or statistical tests. Without these, it is impossible to evaluate whether the observed pattern is robust across the 18 experimental cells or driven by a subset of LLMs or code frequencies.
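To make the first major comment concrete, two of the candidate arbitration rules it names can be sketched as follows. This is a minimal illustration under assumed inputs, not the paper's implementation: the function names, the `moderator_code` argument, and the placeholder code labels are all hypothetical.

```python
from collections import Counter


def majority_vote(codes, moderator_code=None):
    """Return the most frequent code among agents.

    On a tie, defer to the moderator's code if it is one of the tied codes;
    otherwise return None to signal that no consensus was reached.
    """
    counts = Counter(codes)
    top = max(counts.values())
    tied = sorted(code for code, n in counts.items() if n == top)
    if len(tied) == 1:
        return tied[0]
    if moderator_code in tied:
        return moderator_code
    return None


def moderator_override(codes, moderator_code):
    """Return the moderator's code regardless of how the agents voted."""
    return moderator_code


# Hypothetical agent outputs for one dialog segment.
agent_codes = ["code_1", "code_1", "code_2"]
print(majority_vote(agent_codes))                 # code_1
print(moderator_override(agent_codes, "code_2"))  # code_2
```

The two rules return different final codes for the same agent outputs, which is why the referee treats the choice of rule as relevant to the accuracy comparison.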

minor comments (2)

[Abstract] Abstract: The final sentence states that 'Qualitative analysis of MAS collaboration and coding disagreement may, however, improve codebook design,' yet no such qualitative analysis is referenced in the provided text; either include a brief summary or revise the claim.
[Methods] Notation: 'MAS' and 'single agents' are used throughout; a short table or paragraph defining the exact prompting templates and output formats for each would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

Referee: [Methods] Methods (MAS description): The consensus arbitration procedure is described only at a high level as 'structured agent discussion and consensus arbitration.' It is not specified whether final codes are produced by majority vote, moderator override, iterative refinement until agreement, or another rule. This detail is load-bearing for the accuracy comparison because any aggregation rule that systematically favors high-frequency codes or penalizes rare ones could artifactually lower MAS accuracy relative to single agents and the gold-standard distribution.

Authors: We agree that the arbitration procedure requires greater specificity. In the revised manuscript we will expand the Methods section to describe the exact consensus rules, including the voting mechanism (majority vote with moderator tie-breaker), the number of discussion rounds permitted, and how the final code is selected when agents disagree. revision: yes

Referee: [Results] Results (accuracy claims): The assertion that 'single agents matched or outperformed MAS consensus in most conditions' is not supported by per-configuration accuracy numbers, confidence intervals, or statistical tests. Without these, it is impossible to evaluate whether the observed pattern is robust across the 18 experimental cells or driven by a subset of LLMs or code frequencies.

Authors: We accept that the accuracy comparison would be more convincing with disaggregated results. We will add a table (or supplementary table) reporting accuracy for each of the 18 configurations together with 95% confidence intervals and note any pairwise statistical comparisons between single-agent and MAS conditions. revision: yes
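The disaggregated table promised in the second response could be computed along these lines. This is a sketch under stated assumptions: the record layout `(configuration, predicted_code, gold_code)` and the configuration names are hypothetical, and the Wilson score interval is one common choice for a 95% interval on a proportion, not necessarily the one the authors would use.

```python
import math
from collections import defaultdict


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the boundaries.
    return max(0.0, centre - half), min(1.0, centre + half)


def per_config_accuracy(records):
    """Map each configuration to (accuracy, ci_low, ci_high)."""
    tallies = defaultdict(lambda: [0, 0])  # config -> [correct, total]
    for config, predicted, gold in records:
        tallies[config][0] += int(predicted == gold)
        tallies[config][1] += 1
    return {
        config: (correct / total, *wilson_interval(correct, total))
        for config, (correct, total) in tallies.items()
    }


# Toy records for illustration only; these are not results from the paper.
toy = [("cfg_a", "code_1", "code_1"), ("cfg_a", "code_2", "code_1"),
       ("cfg_b", "code_1", "code_1")]
for config, (acc, low, high) in per_config_accuracy(toy).items():
    print(f"{config}: accuracy={acc:.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

Reporting one row per configuration in this form would let a reader see whether the single-agent advantage holds across all 18 cells or only in a subset.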

Circularity Check

0 steps flagged

Empirical comparison to human gold standard; no derivations or fitted predictions

The paper reports direct experimental measurements of coding accuracy and consensus rates for single LLM agents versus MAS configurations across 18 setups and 77,000 decisions, benchmarked against an external human-annotated gold standard. No equations, parameter fits, or first-principles derivations are present that could reduce reported outcomes to inputs by construction. All load-bearing claims rest on observable experimental contrasts rather than self-referential definitions or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the validity of the human gold-standard annotations and the assumption that the MAS discussion protocol mirrors human deductive coding; no free parameters, new entities, or ad-hoc axioms beyond standard domain assumptions about inter-rater reliability are introduced.

axioms (2)

domain assumption: The codebook with 8 codes has high inter-rater reliability derived from prior research. (Invoked to justify deductive coding against the gold standard.)
domain assumption: Structured agent discussion and consensus arbitration in the MAS emulates human coding workflows. (Stated as the design goal of the open-source MAS.)

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · 7 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION article output.bibitem format.authors "author" output.check author format.key output output.year.check new.block format.title "title" output.check new.block crossref missing format.jour.vol output format.article.crossref output.nonnull format.pages output if new.block note output fin.entry FUNCTION b...

work page
[2]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Abdin, M. , Aneja, J. , Awadalla, H. , Awadallah, A. , Awan, A. A. , Bach, N. , Bahree, A. , Bakhtiari, A. , Bao, J. , Behl, H. , et al . 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219\/

work page internal anchor Pith review Pith/arXiv arXiv 2024
[3]

, Iqbal, W

Ahmad, K. , Iqbal, W. , El-Hassan, A. , Qadir, J. , Benhaddou, D. , Ayyash, M. , and Al-Fuqaha, A. 2023. Data-driven artificial intelligence in education: A comprehensive review. IEEE Transactions on Learning Technologies\/ 17 , 12--31

work page 2023
[4]

Baker, R. S. , Ga s evi \'c , D. , and Karumbaiah, S. 2021. Four paradigms in learning analytics: Why paradigm convergence matters. Computers and Education: Artificial Intelligence\/ 2 , 100021

work page 2021
[5]

, Nasiar, N

Barany, A. , Nasiar, N. , Porter, C. , Zambrano, A. F. , Andres, A. L. , Bright, D. , Shah, M. , Liu, X. , Gao, S. , Zhang, J. , et al . 2024. Chatgpt for education research: exploring the potential of large language models for qualitative codebook development. In International conference on artificial intelligence in education . Springer, 134--149

work page 2024
[6]

Barrick, M. R. , Stewart, G. L. , Neubert, M. J. , and Mount, M. K. 1998. Relating member ability and personality to work-team processes and team effectiveness. Journal of applied psychology\/ 83,\/ 3, 377

work page 1998
[7]

and Stewart, G

Barry, B. and Stewart, G. L. 1997. Composition, process, and performance in self-managed groups: the role of personality. Journal of Applied psychology\/ 82,\/ 1, 62

work page 1997
[8]

, M \"a chler, M

Bates, D. , M \"a chler, M. , Bolker, B. , and Walker, S. 2015. Fitting linear mixed-effects models using lme4. Journal of statistical software\/ 67 , 1--48

work page 2015
[9]

and Hochberg, Y

Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological)\/ 57,\/ 1, 289--300

work page 1995
[10]

, Thomas, D

Borchers, C. , Thomas, D. R. , Lin, J. , Abboud, R. , and Koedinger, K. R. 2025. Augmenting human-annotated training data with large language model generation and distillation in open-response assessment. arXiv preprint arXiv:2501.09126\/

work page arXiv 2025
[11]

, Zhang, J

Borchers, C. , Zhang, J. , Baker, R. S. , and Aleven, V. 2024. Using think-aloud data to understand relations between self-regulation cycle characteristics and student performance in intelligent tutoring systems. In Proceedings of the 14th Learning Analytics and Knowledge Conference . 529--539

work page 2024
[12]

and Clarke, V

Braun, V. and Clarke, V. 2006. Using thematic analysis in psychology. Qualitative research in psychology\/ 3,\/ 2, 77--101

work page 2006
[13]

and Clarke, V

Braun, V. and Clarke, V. 2021. One size fits all? what counts as quality practice in (reflexive) thematic analysis? Qualitative research in psychology\/ 18,\/ 3, 328--352

work page 2021
[14]

, Breideband, T

Chandler, C. , Breideband, T. , Reitman, J. G. , Chitwood, M. , Bush, J. B. , Howard, A. , Leonhart, S. , Foltz, P. W. , Penuel, W. R. , and D'Mello, S. K. 2024. Computational modeling of collaborative discourse to enable feedback and reflection in middle school classrooms. In Proceedings of the 14th Learning Analytics and Knowledge Conference . 576--586

work page 2024
[15]

, Chan, C

Chen, G. , Chan, C. K. , Chan, K. K. , Clarke, S. N. , and Resnick, L. B. 2020. Efficacy of video-based teacher professional development for increasing classroom discourse and student learning. Journal of the Learning Sciences\/ 29,\/ 4-5, 642--680

work page 2020
[16]

Chen, H. , Ji, W. , Xu, L. , and Zhao, S. 2023. Multi-agent consensus seeking via large language models. arXiv preprint arXiv:2310.20151\/

work page arXiv 2023
[17]

Chen, J. C.-Y. , Saha, S. , and Bansal, M. 2023. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007\/

work page arXiv 2023
[18]

Cheung, K. K. C. and Tai, K. W. 2023. The use of intercoder reliability in qualitative interview data analysis in science education. Research in Science & Technological Education\/ 41,\/ 3, 1155--1175

work page 2023
[19]

, Bollenbacher, J

Chew, R. , Bollenbacher, J. , Wenger, M. , Speer, J. , and Kim, A. 2023. Llm-assisted content analysis: Using large language models to support deductive coding. arXiv preprint arXiv:2306.14924\/

work page arXiv 2023
[20]

, Shrivastava, A

Chittem, A. , Shrivastava, A. , Pendela, S. T. , Challa, J. S. , and Kumar, D. 2025. Sac: A framework for measuring and inducing personality traits in llms with dynamic intensity control. arXiv preprint arXiv:2506.20993\/

work page arXiv 2025
[21]

Cohen, M. C. , Su, Z. , Kao, H.-T. , Nguyen, D. , Lynch, S. , Sap, M. , and Volkova, S. 2025. Exploring big five personality and ai capability effects in llm-simulated negotiation dialogues. arXiv preprint arXiv:2506.15928\/

work page arXiv 2025
[22]

De Paoli, S. 2024. Performing an inductive thematic analysis of semi-structured interviews with a large language model: An exploration and provocation on the limits of the approach. Social Science Computer Review\/ 42,\/ 4, 997--1019

work page 2024
[23]

and Russ, R

D \'o sa, K. and Russ, R. 2016. Beyond correctness: Using qualitative methods to uncover nuances of student learning in undergraduate stem education. Journal of College Science Teaching\/ 46,\/ 2, 70--81

work page 2016
[24]

, Pardos, Z

Fischer, C. , Pardos, Z. A. , Baker, R. S. , Williams, J. J. , Smyth, P. , Yu, R. , Slater, S. , Baker, R. , and Warschauer, M. 2020. Mining big data in education: Affordances and challenges. Review of research in education\/ 44,\/ 1, 130--160

work page 2020
[25]

, Guo, Y

Gao, J. , Guo, Y. , Lim, G. , Zhang, T. , Zhang, Z. , Li, T. J.-J. , and Perrault, S. T. 2024. Collabcoder: a lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems . 1--29

work page 2024
[26]

, Chen, Y

Gao, Y. , Chen, Y. , Wang, M. , Wu, J. , Kim, Y. , Zhou, K. , Li, M. , Liu, X. , Fu, X. , Wu, J. , et al . 2024. Optimising the paradigms of human ai collaborative clinical coding. npj Digital Medicine\/ 7,\/ 1, 368

work page 2024
[27]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D. , Yang, D. , Zhang, H. , Song, J. , Zhang, R. , Xu, R. , Zhu, Q. , Ma, S. , Wang, P. , Bi, X. , et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948\/

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

, Song, X

Gupta, A. , Song, X. , and Anumanchipalli, G. 2023. Self-assessment tests are unreliable measures of llm personality. arXiv preprint arXiv:2309.08163\/

work page arXiv 2023
[29]

Hilal, A. H. and Alabri, S. S. 2013. Using nvivo for data analysis in qualitative research. International interdisciplinary journal of education\/ 2,\/ 2, 181--186

work page 2013
[30]

OliverPJohnandSanjaySrivastava.1999

Jiang, H. , Zhang, X. , Cao, X. , Breazeal, C. , Roy, D. , and Kabbara, J. 2023. Personallm: Investigating the ability of large language models to express personality traits. arXiv preprint arXiv:2305.02547\/

work page arXiv 2023
[31]

big five

John, O. P. 1990. The" big five" factor taxonomy: Dimensions of personality in the natural language and in questionnaires. Handbook of personality: Theory and research\/

work page 1990
[32]

, Stechly, K

Kambhampati, S. , Stechly, K. , Valmeekam, K. , Saldyt, L. , Bhambri, S. , Palod, V. , Gundawar, A. , Samineni, S. R. , Kalwar, D. , and Biswas, U. 2025. Stop anthropomorphizing intermediate tokens as reasoning/thinking traces! arXiv preprint arXiv:2504.09762\/

work page arXiv 2025
[33]

Khan, A. H. , Kegalle, H. , D'Silva, R. , Watt, N. , Whelan-Shamy, D. , Ghahremanlou, L. , and Magee, L. 2024. Automating thematic analysis: How llms analyse controversial topics. arXiv preprint arXiv:2405.06919\/

work page arXiv 2024
[34]

and R \"a diker, S

Kuckartz, U. and R \"a diker, S. 2019. Analyzing qualitative data with MAXQDA . Springer

work page 2019
[35]

and Tagarelli, A

La Cava, L. and Tagarelli, A. 2025. Open models, closed minds? on agents capabilities in mimicking human personalities through open large language models. In Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 39. 1355--1363

work page 2025
[36]

LLMs Get Lost In Multi-Turn Conversation

Laban, P. , Hayashi, H. , Zhou, Y. , and Neville, J. 2025. Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120\/

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Landis, J. R. and Koch, G. G. 1977. The measurement of observer agreement for categorical data. biometrics\/ , 159--174

work page 1977
[38]

Llm generated persona is a promise with a catch.arXiv preprint arXiv:2503.16527, 2025

Li, A. , Chen, H. , Namkoong, H. , and Peng, T. 2025. Llm generated persona is a promise with a catch. arXiv preprint arXiv:2503.16527\/

work page arXiv 2025
[39]

, Chen, L

Li, G. , Chen, L. , Tang, C. , S v \'a bensk \`y , V. , Deguchi, D. , Yamashita, T. , and Shimada, A. 2025. Single-agent vs. multi-agent llm strategies for automated student reflection assessment. In Pacific-Asia Conference on Knowledge Discovery and Data Mining . Springer, 300--311

work page 2025
[40]

Internal consistency and self-feedback in large language models: A survey, 2024

Liang, X. , Song, S. , Zheng, Z. , Wang, H. , Yu, Q. , Li, X. , Li, R.-H. , Wang, Y. , Wang, Z. , Xiong, F. , et al . 2024. Internal consistency and self-feedback in large language models: A survey. arXiv preprint arXiv:2407.14507\/

work page arXiv 2024
[41]

Liu, L. 2016. Using generic inductive approach in qualitative educational research: A case study analysis. Journal of Education and Learning\/ 5,\/ 2, 129--135

work page 2016
[42]

, Zambrano, A

Liu, X. , Zambrano, A. F. , Baker, R. S. , Barany, A. , Ocumpaugh, J. , Zhang, J. , Pankiewicz, M. , Nasiar, N. , and Wei, Z. 2025. Qualitative coding with gpt-4: Where it works better. Journal of Learning Analytics\/ , 1--17

work page 2025
[43]

McCrae, R. R. and John, O. P. 1992. An introduction to the five-factor model and its applications. Journal of personality\/ 60,\/ 2, 175--215

work page 1992
[44]

, Xia, F

Mirchandani, S. , Xia, F. , Florence, P. , Ichter, B. , Driess, D. , Arenas, M. G. , Rao, K. , Sadigh, D. , and Zeng, A. 2023. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721\/

work page arXiv 2023
[45]

Mistral-AI . 2024. Mistral-small-3.2-24b-instruct-2506. [Mistral-Small] Hugging Face

work page 2024
[46]

Mount, M. K. , Barrick, M. R. , and Stewart, G. L. 1998. Five-factor model of personality and performance in jobs involving interpersonal interactions. Human performance\/ 11,\/ 2-3, 145--165

work page 1998
[47]

, Ozuem, W

Naeem, M. , Ozuem, W. , Howell, K. , and Ranfagni, S. 2023. A step-by-step process of thematic analysis to develop a conceptual model in qualitative research. International journal of qualitative methods\/ 22 , 16094069231205789

work page 2023
[48]

Personality-driven decision-making in llm-based au- tonomous agents

Newsham, L. and Prince, D. 2025. Personality-driven decision-making in llm-based autonomous agents. arXiv preprint arXiv:2504.00727\/

work page arXiv 2025
[49]

Ng, A. 2024. Agentic design patterns part 5: Multi-agent collaboration

work page 2024
[50]

Ollama . 2023. Ollama: Run large language models locally. https://ollama.com. Accessed: 2025-07-09

work page 2023
[51]

and Joffe, H

O’Connor, C. and Joffe, H. 2020. Intercoder reliability in qualitative research: debates and practical guidelines. International journal of qualitative methods\/ 19 , 1609406919899220

work page 2020
[52]

and Zeng, Y

Pan, K. and Zeng, Y. 2023. Do llms possess a personality? making the mbti test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180\/

work page arXiv 2023
[53]

, Bowman, S

Panickssery, A. , Bowman, S. , and Feng, S. 2024. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems\/ 37 , 68772--68802

work page 2024
[54]

Pinkwart, N. 2016. Another 25 years of aied? challenges and opportunities for intelligent educational technologies of the future. International journal of artificial intelligence in education\/ 26 , 771--783

work page 2016
[55]

Pugh, S. L. , Rao, A. , Stewart, A. E. , and D'Mello, S. K. 2022. Do speech-based collaboration analytics generalize across task contexts? In LAK22: 12th International Learning Analytics and Knowledge Conference . 208--218

work page 2022
[56]

, Walker, C

Qiao, T. , Walker, C. , Cunningham, C. W. , and Koh, Y. S. 2025. Thematic-lm: a llm-based multi-agent system for large-scale thematic analysis. In Proceedings of the ACM on Web Conference 2025 . 649--658

work page 2025
[57]

, Sharma, A

Rafailov, R. , Sharma, A. , Mitchell, E. , Manning, C. D. , Ermon, S. , and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems\/ 36 , 53728--53741

work page 2023
[58]

, Lim, L.-A

Ramanathan, S. , Lim, L.-A. , Mottaghi, N. R. , and Buckingham Shum, S. 2025. When the prompt becomes the codebook: Grounded prompt engineering (groproe) and its application to belonging analytics. In Proceedings of the 15th International Learning Analytics and Knowledge Conference . 713--725

work page 2025
[59]

, Waseem, M

Rasheed, Z. , Waseem, M. , Ahmad, A. , Kemell, K.-K. , Xiaofeng, W. , Duc, A. N. , and Abrahamsson, P. 2024. Can large language models serve as data analysts? a multi-agent assisted approach for qualitative data analysis. arXiv preprint arXiv:2402.01386\/

work page arXiv 2024
[60]

A Survey of Hallucination in Large Foundation Models

Rawte, V. , Sheth, A. , and Das, A. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922\/

work page internal anchor Pith review Pith/arXiv arXiv 2023
[61]

, Anastasopoulos, I

Reza, M. , Anastasopoulos, I. , Bhandari, S. , and Pardos, Z. A. 2025. Prompthive: Bringing subject matter experts back to the forefront with collaborative prompt engineering for educational content creation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems . 1--22

work page 2025
[62]

and Hemphill, M

Richards, K. and Hemphill, M. 2017. A practical guide to collaborative qualitative data analysis. Journal of Teaching in Physical Education\/ 37 , 1--20

work page 2017
[63]

, Borchers, C

Sankaranarayanan, S. , Borchers, C. , Simon, S. , Tajik, E. , Ata s , A. H. , Celik, B. , Balzan, F. , and Shahrokhian, B. 2025. Automating thematic analysis with multi-agent llm systems. EdArXiv Preprints (https://doi.org/10.35542/osf.io/kq8zh\_v1)\/

work page doi:10.35542/osf.io/kq8zh 2025
[64]

, Safdari, M

Serapio-Garc \' a, G. , Safdari, M. , Crepy, C. , Sun, L. , Fitz, S. , Abdulhai, M. , Faust, A. , and Matari \'c , M. 2023. Personality traits in large language models

work page 2023
[65]

, Sankaranarayanan, S

Simon, S. , Sankaranarayanan, S. , Tajik, E. , Borchers, C. , Bahar, s. , Balzan, F. , Strau , S. , Viswanathan, S. , Ata s , A. , C arapina, M. , Liang, L. , and Celik, B. 2025. Comparing human and llm-generated inductive thematic analyses: Assessing agreement in coding consistency and interpretative accuracy. Proceedings of 26th International Conference...

work page 2025
[66]

Smit, B. 2002. Atlas. ti for qualitative data analysis. Perspectives in education\/ 20,\/ 3, 65--75

work page 2002
[67]

Tai, R. H. , Bentley, L. R. , Xia, X. , Sitt, J. M. , Fankhauser, S. C. , Chicas-Mosier, A. M. , and Monteith, B. G. 2024. An examination of the use of large language models to aid analysis of textual data. International Journal of Qualitative Methods\/ 23 , 16094069241231168

work page 2024
[68]

, Masumori, A

Takata, R. , Masumori, A. , and Ikegami, T. 2024. Spontaneous emergence of agent individuality through social interactions in llm-based communities. arXiv preprint arXiv:2411.03252\/

work page arXiv 2024
[69]

Teknium . 2023. Openhermes-2-mistral-7b. [Openhermes2-7B] Hugging Face

work page 2023
[70]

, Hayfield, N

Terry, G. , Hayfield, N. , Clarke, V. , Braun, V. , et al . 2017. Thematic analysis. The SAGE handbook of qualitative research in psychology\/ 2,\/ 17-37, 25

work page 2017
[71]

, Hegazy, M

Tommaso, T. , Hegazy, M. , Lemay, D. , Abukalam, M. , Rish, I. , and Dumas, G. 2024. Llms and personalities: Inconsistencies across scales. In NeurIPS 2024 Workshop on Behavioral Machine Learning

work page 2024
[72]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H. , Lavril, T. , Izacard, G. , Martinet, X. , Lachaux, M.-A. , Lacroix, T. , Rozi \`e re, B. , Goyal, N. , Hambro, E. , Azhar, F. , et al . 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971\/

work page internal anchor Pith review Pith/arXiv arXiv 2023
[73]

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Tran, K.-T. , Dao, D. , Nguyen, M.-D. , Pham, Q.-V. , O'Sullivan, B. , and Nguyen, H. D. 2025. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322\/

work page internal anchor Pith review Pith/arXiv arXiv 2025
[74]

, Yan, Z

Venugopalan, D. , Yan, Z. , Borchers, C. , Lin, J. , and Aleven, V. 2025. Combining large language models with tutoring system intelligence: A case study in caregiver homework support. In Proceedings of the 15th International Learning Analytics and Knowledge Conference . 373--383

work page 2025
[75]

, Spitale, G

Vinay, R. , Spitale, G. , Biller-Andorno, N. , and Germani, F. 2025. Emotional prompting amplifies disinformation generation in ai large language models. Frontiers in Artificial Intelligence\/ 8 , 1543603

work page 2025
[76]

, Yuan, X

Xiao, Z. , Yuan, X. , Liao, Q. V. , Abdelghani, R. , and Oudeyer, P.-Y. 2023. Supporting qualitative analysis with large language models: Combining codebook with gpt-3 for deductive coding. In Companion proceedings of the 28th international conference on intelligent user interfaces . 75--78

work page 2023
[77]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Xu, C. , Sun, Q. , Zheng, K. , Geng, X. , Zhao, P. , Feng, J. , Tao, C. , and Jiang, D. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244\/

work page internal anchor Pith review Pith/arXiv arXiv 2023
[78]

, Echeverria, V

Yan, L. , Echeverria, V. , Fernandez-Nieto, G. M. , Jin, Y. , Swiecki, Z. , Zhao, L. , Ga s evi \'c , D. , and Martinez-Maldonado, R. 2024. Human-ai collaboration in thematic analysis using chatgpt: A user study and design recommendations. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems . 1--7

work page 2024
[79]

Zambrano, A. F. , Liu, X. , Barany, A. , Baker, R. S. , Kim, J. , and Nasiar, N. 2023. From ncoder to chatgpt: From automated coding to refining human coding. In International conference on quantitative ethnography . Springer, 470--485

work page 2023
[80]

Zhang, J. , Xu, X. , Zhang, N. , Liu, R. , Hooi, B. , and Deng, S. 2023. Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124\/


Showing first 80 references.


[2] [2]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Abdin, M. , Aneja, J. , Awadalla, H. , Awadallah, A. , Awan, A. A. , Bach, N. , Bahree, A. , Bakhtiari, A. , Bao, J. , Behl, H. , et al . 2024. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219\/

work page internal anchor Pith review Pith/arXiv arXiv 2024

[3] [3]

, Iqbal, W

Ahmad, K. , Iqbal, W. , El-Hassan, A. , Qadir, J. , Benhaddou, D. , Ayyash, M. , and Al-Fuqaha, A. 2023. Data-driven artificial intelligence in education: A comprehensive review. IEEE Transactions on Learning Technologies\/ 17 , 12--31

work page 2023

[4] [4]

Baker, R. S. , Ga s evi \'c , D. , and Karumbaiah, S. 2021. Four paradigms in learning analytics: Why paradigm convergence matters. Computers and Education: Artificial Intelligence\/ 2 , 100021

work page 2021

[5] [5]

, Nasiar, N

Barany, A. , Nasiar, N. , Porter, C. , Zambrano, A. F. , Andres, A. L. , Bright, D. , Shah, M. , Liu, X. , Gao, S. , Zhang, J. , et al . 2024. Chatgpt for education research: exploring the potential of large language models for qualitative codebook development. In International conference on artificial intelligence in education . Springer, 134--149

work page 2024

[6] [6]

Barrick, M. R. , Stewart, G. L. , Neubert, M. J. , and Mount, M. K. 1998. Relating member ability and personality to work-team processes and team effectiveness. Journal of applied psychology\/ 83,\/ 3, 377

work page 1998

[7] [7]

and Stewart, G

Barry, B. and Stewart, G. L. 1997. Composition, process, and performance in self-managed groups: the role of personality. Journal of Applied psychology\/ 82,\/ 1, 62

work page 1997

[8] [8]

, M \"a chler, M

Bates, D. , M \"a chler, M. , Bolker, B. , and Walker, S. 2015. Fitting linear mixed-effects models using lme4. Journal of statistical software\/ 67 , 1--48

work page 2015

[9] [9]

and Hochberg, Y

Benjamini, Y. and Hochberg, Y. 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal statistical society: series B (Methodological)\/ 57,\/ 1, 289--300

work page 1995

[10] [10]

, Thomas, D

Borchers, C. , Thomas, D. R. , Lin, J. , Abboud, R. , and Koedinger, K. R. 2025. Augmenting human-annotated training data with large language model generation and distillation in open-response assessment. arXiv preprint arXiv:2501.09126\/

work page arXiv 2025

[11] [11]

, Zhang, J

Borchers, C. , Zhang, J. , Baker, R. S. , and Aleven, V. 2024. Using think-aloud data to understand relations between self-regulation cycle characteristics and student performance in intelligent tutoring systems. In Proceedings of the 14th Learning Analytics and Knowledge Conference . 529--539

work page 2024

[12] [12]

and Clarke, V

Braun, V. and Clarke, V. 2006. Using thematic analysis in psychology. Qualitative research in psychology\/ 3,\/ 2, 77--101

work page 2006

[13] [13]

and Clarke, V

Braun, V. and Clarke, V. 2021. One size fits all? what counts as quality practice in (reflexive) thematic analysis? Qualitative research in psychology\/ 18,\/ 3, 328--352

work page 2021

[14] [14]

, Breideband, T

Chandler, C. , Breideband, T. , Reitman, J. G. , Chitwood, M. , Bush, J. B. , Howard, A. , Leonhart, S. , Foltz, P. W. , Penuel, W. R. , and D'Mello, S. K. 2024. Computational modeling of collaborative discourse to enable feedback and reflection in middle school classrooms. In Proceedings of the 14th Learning Analytics and Knowledge Conference . 576--586

work page 2024

[15] [15]

, Chan, C

Chen, G. , Chan, C. K. , Chan, K. K. , Clarke, S. N. , and Resnick, L. B. 2020. Efficacy of video-based teacher professional development for increasing classroom discourse and student learning. Journal of the Learning Sciences\/ 29,\/ 4-5, 642--680

work page 2020

[16] [16]

Chen, H. , Ji, W. , Xu, L. , and Zhao, S. 2023. Multi-agent consensus seeking via large language models. arXiv preprint arXiv:2310.20151\/

work page arXiv 2023

[17] [17]

Chen, J. C.-Y. , Saha, S. , and Bansal, M. 2023. Reconcile: Round-table conference improves reasoning via consensus among diverse llms. arXiv preprint arXiv:2309.13007\/

work page arXiv 2023

[18] [18]

Cheung, K. K. C. and Tai, K. W. 2023. The use of intercoder reliability in qualitative interview data analysis in science education. Research in Science & Technological Education\/ 41,\/ 3, 1155--1175

work page 2023

[19] [19]

, Bollenbacher, J

Chew, R. , Bollenbacher, J. , Wenger, M. , Speer, J. , and Kim, A. 2023. Llm-assisted content analysis: Using large language models to support deductive coding. arXiv preprint arXiv:2306.14924\/

work page arXiv 2023

[20] [20]

, Shrivastava, A

Chittem, A. , Shrivastava, A. , Pendela, S. T. , Challa, J. S. , and Kumar, D. 2025. Sac: A framework for measuring and inducing personality traits in llms with dynamic intensity control. arXiv preprint arXiv:2506.20993\/

work page arXiv 2025

[21] [21]

Cohen, M. C. , Su, Z. , Kao, H.-T. , Nguyen, D. , Lynch, S. , Sap, M. , and Volkova, S. 2025. Exploring big five personality and ai capability effects in llm-simulated negotiation dialogues. arXiv preprint arXiv:2506.15928\/

work page arXiv 2025

[22] [22]

De Paoli, S. 2024. Performing an inductive thematic analysis of semi-structured interviews with a large language model: An exploration and provocation on the limits of the approach. Social Science Computer Review\/ 42,\/ 4, 997--1019

work page 2024

[23] [23]

and Russ, R

D \'o sa, K. and Russ, R. 2016. Beyond correctness: Using qualitative methods to uncover nuances of student learning in undergraduate stem education. Journal of College Science Teaching\/ 46,\/ 2, 70--81

work page 2016

[24] [24]

, Pardos, Z

Fischer, C. , Pardos, Z. A. , Baker, R. S. , Williams, J. J. , Smyth, P. , Yu, R. , Slater, S. , Baker, R. , and Warschauer, M. 2020. Mining big data in education: Affordances and challenges. Review of research in education\/ 44,\/ 1, 130--160

work page 2020

[25] [25]

, Guo, Y

Gao, J. , Guo, Y. , Lim, G. , Zhang, T. , Zhang, Z. , Li, T. J.-J. , and Perrault, S. T. 2024. Collabcoder: a lower-barrier, rigorous workflow for inductive collaborative qualitative analysis with large language models. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems . 1--29

work page 2024

[26] [26]

, Chen, Y

Gao, Y. , Chen, Y. , Wang, M. , Wu, J. , Kim, Y. , Zhou, K. , Li, M. , Liu, X. , Fu, X. , Wu, J. , et al . 2024. Optimising the paradigms of human ai collaborative clinical coding. npj Digital Medicine\/ 7,\/ 1, 368

work page 2024

[27] [27]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D. , Yang, D. , Zhang, H. , Song, J. , Zhang, R. , Xu, R. , Zhu, Q. , Ma, S. , Wang, P. , Bi, X. , et al . 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948\/

work page internal anchor Pith review Pith/arXiv arXiv 2025

[28] [28]

, Song, X

Gupta, A. , Song, X. , and Anumanchipalli, G. 2023. Self-assessment tests are unreliable measures of llm personality. arXiv preprint arXiv:2309.08163\/

work page arXiv 2023

[29] [29]

Hilal, A. H. and Alabri, S. S. 2013. Using nvivo for data analysis in qualitative research. International interdisciplinary journal of education\/ 2,\/ 2, 181--186

work page 2013

[30] [30]

OliverPJohnandSanjaySrivastava.1999

Jiang, H. , Zhang, X. , Cao, X. , Breazeal, C. , Roy, D. , and Kabbara, J. 2023. Personallm: Investigating the ability of large language models to express personality traits. arXiv preprint arXiv:2305.02547\/

work page arXiv 2023

[31] [31]

big five

John, O. P. 1990. The" big five" factor taxonomy: Dimensions of personality in the natural language and in questionnaires. Handbook of personality: Theory and research\/

work page 1990

[32] [32]

, Stechly, K

Kambhampati, S. , Stechly, K. , Valmeekam, K. , Saldyt, L. , Bhambri, S. , Palod, V. , Gundawar, A. , Samineni, S. R. , Kalwar, D. , and Biswas, U. 2025. Stop anthropomorphizing intermediate tokens as reasoning/thinking traces! arXiv preprint arXiv:2504.09762\/

work page arXiv 2025

[33] [33]

Khan, A. H. , Kegalle, H. , D'Silva, R. , Watt, N. , Whelan-Shamy, D. , Ghahremanlou, L. , and Magee, L. 2024. Automating thematic analysis: How llms analyse controversial topics. arXiv preprint arXiv:2405.06919\/

work page arXiv 2024

[34] [34]

and R \"a diker, S

Kuckartz, U. and R \"a diker, S. 2019. Analyzing qualitative data with MAXQDA . Springer

work page 2019

[35] [35]

and Tagarelli, A

La Cava, L. and Tagarelli, A. 2025. Open models, closed minds? on agents capabilities in mimicking human personalities through open large language models. In Proceedings of the AAAI Conference on Artificial Intelligence . Vol. 39. 1355--1363

work page 2025

[36] [36]

LLMs Get Lost In Multi-Turn Conversation

Laban, P. , Hayashi, H. , Zhou, Y. , and Neville, J. 2025. Llms get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120\/

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Landis, J. R. and Koch, G. G. 1977. The measurement of observer agreement for categorical data. biometrics\/ , 159--174

work page 1977

[38] [38]

Llm generated persona is a promise with a catch.arXiv preprint arXiv:2503.16527, 2025

Li, A. , Chen, H. , Namkoong, H. , and Peng, T. 2025. Llm generated persona is a promise with a catch. arXiv preprint arXiv:2503.16527\/

work page arXiv 2025

[39] [39]

, Chen, L

Li, G. , Chen, L. , Tang, C. , S v \'a bensk \`y , V. , Deguchi, D. , Yamashita, T. , and Shimada, A. 2025. Single-agent vs. multi-agent llm strategies for automated student reflection assessment. In Pacific-Asia Conference on Knowledge Discovery and Data Mining . Springer, 300--311

work page 2025

[40] [40]

Internal consistency and self-feedback in large language models: A survey, 2024

Liang, X. , Song, S. , Zheng, Z. , Wang, H. , Yu, Q. , Li, X. , Li, R.-H. , Wang, Y. , Wang, Z. , Xiong, F. , et al . 2024. Internal consistency and self-feedback in large language models: A survey. arXiv preprint arXiv:2407.14507\/

work page arXiv 2024

[41] [41]

Liu, L. 2016. Using generic inductive approach in qualitative educational research: A case study analysis. Journal of Education and Learning\/ 5,\/ 2, 129--135

work page 2016

[42] [42]

, Zambrano, A

Liu, X. , Zambrano, A. F. , Baker, R. S. , Barany, A. , Ocumpaugh, J. , Zhang, J. , Pankiewicz, M. , Nasiar, N. , and Wei, Z. 2025. Qualitative coding with gpt-4: Where it works better. Journal of Learning Analytics\/ , 1--17

work page 2025

[43] [43]

McCrae, R. R. and John, O. P. 1992. An introduction to the five-factor model and its applications. Journal of personality\/ 60,\/ 2, 175--215

work page 1992

[44] [44]

, Xia, F

Mirchandani, S. , Xia, F. , Florence, P. , Ichter, B. , Driess, D. , Arenas, M. G. , Rao, K. , Sadigh, D. , and Zeng, A. 2023. Large language models as general pattern machines. arXiv preprint arXiv:2307.04721\/

work page arXiv 2023

[45] [45]

Mistral-AI . 2024. Mistral-small-3.2-24b-instruct-2506. [Mistral-Small] Hugging Face

work page 2024

[46] [46]

Mount, M. K. , Barrick, M. R. , and Stewart, G. L. 1998. Five-factor model of personality and performance in jobs involving interpersonal interactions. Human performance\/ 11,\/ 2-3, 145--165

work page 1998

[47] [47]

, Ozuem, W

Naeem, M. , Ozuem, W. , Howell, K. , and Ranfagni, S. 2023. A step-by-step process of thematic analysis to develop a conceptual model in qualitative research. International journal of qualitative methods\/ 22 , 16094069231205789

work page 2023

[48] [48]

Personality-driven decision-making in llm-based au- tonomous agents

Newsham, L. and Prince, D. 2025. Personality-driven decision-making in llm-based autonomous agents. arXiv preprint arXiv:2504.00727\/

work page arXiv 2025

[49] [49]

Ng, A. 2024. Agentic design patterns part 5: Multi-agent collaboration

work page 2024

[50] [50]

Ollama . 2023. Ollama: Run large language models locally. https://ollama.com. Accessed: 2025-07-09

work page 2023

[51] [51]

and Joffe, H

O’Connor, C. and Joffe, H. 2020. Intercoder reliability in qualitative research: debates and practical guidelines. International journal of qualitative methods\/ 19 , 1609406919899220

work page 2020

[52] [52]

and Zeng, Y

Pan, K. and Zeng, Y. 2023. Do llms possess a personality? making the mbti test an amazing evaluation for large language models. arXiv preprint arXiv:2307.16180\/

work page arXiv 2023

[53] [53]

, Bowman, S

Panickssery, A. , Bowman, S. , and Feng, S. 2024. Llm evaluators recognize and favor their own generations. Advances in Neural Information Processing Systems\/ 37 , 68772--68802

work page 2024

[54] [54]

Pinkwart, N. 2016. Another 25 years of aied? challenges and opportunities for intelligent educational technologies of the future. International journal of artificial intelligence in education\/ 26 , 771--783

work page 2016

[55] [55]

Pugh, S. L. , Rao, A. , Stewart, A. E. , and D'Mello, S. K. 2022. Do speech-based collaboration analytics generalize across task contexts? In LAK22: 12th International Learning Analytics and Knowledge Conference . 208--218

work page 2022

[56] [56]

, Walker, C

Qiao, T. , Walker, C. , Cunningham, C. W. , and Koh, Y. S. 2025. Thematic-lm: a llm-based multi-agent system for large-scale thematic analysis. In Proceedings of the ACM on Web Conference 2025 . 649--658

work page 2025

[57] [57]

, Sharma, A

Rafailov, R. , Sharma, A. , Mitchell, E. , Manning, C. D. , Ermon, S. , and Finn, C. 2023. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems\/ 36 , 53728--53741

work page 2023

[58] [58]

, Lim, L.-A

Ramanathan, S. , Lim, L.-A. , Mottaghi, N. R. , and Buckingham Shum, S. 2025. When the prompt becomes the codebook: Grounded prompt engineering (groproe) and its application to belonging analytics. In Proceedings of the 15th International Learning Analytics and Knowledge Conference . 713--725

work page 2025

[59] [59]

, Waseem, M

Rasheed, Z. , Waseem, M. , Ahmad, A. , Kemell, K.-K. , Xiaofeng, W. , Duc, A. N. , and Abrahamsson, P. 2024. Can large language models serve as data analysts? a multi-agent assisted approach for qualitative data analysis. arXiv preprint arXiv:2402.01386\/

work page arXiv 2024

[60] [60]

A Survey of Hallucination in Large Foundation Models

Rawte, V. , Sheth, A. , and Das, A. 2023. A survey of hallucination in large foundation models. arXiv preprint arXiv:2309.05922\/

work page internal anchor Pith review Pith/arXiv arXiv 2023

[61] [61]

, Anastasopoulos, I

Reza, M. , Anastasopoulos, I. , Bhandari, S. , and Pardos, Z. A. 2025. Prompthive: Bringing subject matter experts back to the forefront with collaborative prompt engineering for educational content creation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems . 1--22

work page 2025

[62] [62]

and Hemphill, M

Richards, K. and Hemphill, M. 2017. A practical guide to collaborative qualitative data analysis. Journal of Teaching in Physical Education\/ 37 , 1--20

work page 2017

[63] [63]

, Borchers, C

Sankaranarayanan, S. , Borchers, C. , Simon, S. , Tajik, E. , Ata s , A. H. , Celik, B. , Balzan, F. , and Shahrokhian, B. 2025. Automating thematic analysis with multi-agent llm systems. EdArXiv Preprints (https://doi.org/10.35542/osf.io/kq8zh\_v1)\/

work page doi:10.35542/osf.io/kq8zh 2025

[64] [64]

, Safdari, M

Serapio-Garc \' a, G. , Safdari, M. , Crepy, C. , Sun, L. , Fitz, S. , Abdulhai, M. , Faust, A. , and Matari \'c , M. 2023. Personality traits in large language models

work page 2023

[65] [65]

, Sankaranarayanan, S

Simon, S. , Sankaranarayanan, S. , Tajik, E. , Borchers, C. , Bahar, s. , Balzan, F. , Strau , S. , Viswanathan, S. , Ata s , A. , C arapina, M. , Liang, L. , and Celik, B. 2025. Comparing human and llm-generated inductive thematic analyses: Assessing agreement in coding consistency and interpretative accuracy. Proceedings of 26th International Conference...

work page 2025

[66] [66]

Smit, B. 2002. Atlas. ti for qualitative data analysis. Perspectives in education\/ 20,\/ 3, 65--75

work page 2002

[67] [67]

Tai, R. H. , Bentley, L. R. , Xia, X. , Sitt, J. M. , Fankhauser, S. C. , Chicas-Mosier, A. M. , and Monteith, B. G. 2024. An examination of the use of large language models to aid analysis of textual data. International Journal of Qualitative Methods\/ 23 , 16094069241231168

work page 2024

[68] [68]

, Masumori, A

Takata, R. , Masumori, A. , and Ikegami, T. 2024. Spontaneous emergence of agent individuality through social interactions in llm-based communities. arXiv preprint arXiv:2411.03252\/

work page arXiv 2024

[69] [69]

Teknium . 2023. Openhermes-2-mistral-7b. [Openhermes2-7B] Hugging Face

work page 2023

[70] [70]

, Hayfield, N

Terry, G. , Hayfield, N. , Clarke, V. , Braun, V. , et al . 2017. Thematic analysis. The SAGE handbook of qualitative research in psychology\/ 2,\/ 17-37, 25

work page 2017

[71] [71]

, Hegazy, M

Tommaso, T. , Hegazy, M. , Lemay, D. , Abukalam, M. , Rish, I. , and Dumas, G. 2024. Llms and personalities: Inconsistencies across scales. In NeurIPS 2024 Workshop on Behavioral Machine Learning

work page 2024

[72] [72]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H. , Lavril, T. , Izacard, G. , Martinet, X. , Lachaux, M.-A. , Lacroix, T. , Rozi \`e re, B. , Goyal, N. , Hambro, E. , Azhar, F. , et al . 2023. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971\/

work page internal anchor Pith review Pith/arXiv arXiv 2023

[73] [73]

Multi-Agent Collaboration Mechanisms: A Survey of LLMs

Tran, K.-T. , Dao, D. , Nguyen, M.-D. , Pham, Q.-V. , O'Sullivan, B. , and Nguyen, H. D. 2025. Multi-agent collaboration mechanisms: A survey of llms. arXiv preprint arXiv:2501.06322\/

work page internal anchor Pith review Pith/arXiv arXiv 2025

[74] [74]

, Yan, Z

Venugopalan, D. , Yan, Z. , Borchers, C. , Lin, J. , and Aleven, V. 2025. Combining large language models with tutoring system intelligence: A case study in caregiver homework support. In Proceedings of the 15th International Learning Analytics and Knowledge Conference . 373--383

work page 2025

[75] [75]

, Spitale, G

Vinay, R. , Spitale, G. , Biller-Andorno, N. , and Germani, F. 2025. Emotional prompting amplifies disinformation generation in ai large language models. Frontiers in Artificial Intelligence\/ 8 , 1543603

work page 2025

[76] [76]

, Yuan, X

Xiao, Z. , Yuan, X. , Liao, Q. V. , Abdelghani, R. , and Oudeyer, P.-Y. 2023. Supporting qualitative analysis with large language models: Combining codebook with gpt-3 for deductive coding. In Companion proceedings of the 28th international conference on intelligent user interfaces . 75--78

work page 2023

[77] [77]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Xu, C. , Sun, Q. , Zheng, K. , Geng, X. , Zhao, P. , Feng, J. , Tao, C. , and Jiang, D. 2023. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244\/

work page internal anchor Pith review Pith/arXiv arXiv 2023

[78] [78]

, Echeverria, V

Yan, L. , Echeverria, V. , Fernandez-Nieto, G. M. , Jin, Y. , Swiecki, Z. , Zhao, L. , Ga s evi \'c , D. , and Martinez-Maldonado, R. 2024. Human-ai collaboration in thematic analysis using chatgpt: A user study and design recommendations. In Extended Abstracts of the CHI Conference on Human Factors in Computing Systems . 1--7

work page 2024

[79] [79]

Zambrano, A. F. , Liu, X. , Barany, A. , Baker, R. S. , Kim, J. , and Nasiar, N. 2023. From ncoder to chatgpt: From automated coding to refining human coding. In International conference on quantitative ethnography . Springer, 470--485

work page 2023

[80] [80]

Zhang, J. , Xu, X. , Zhang, N. , Liu, R. , Hooi, B. , and Deng, S. 2023. Exploring collaboration mechanisms for llm agents: A social psychology view. arXiv preprint arXiv:2310.02124\/

work page arXiv 2023