Multi-agent AI systems outperform human teams in creativity

David Stillwell; Haotian Li; José Hernández-Orallo; Luning Sun; Nigel Collier; Tiancheng Hu; Xing Xie; Yixuan Jiang

arxiv: 2605.17885 · v1 · pith:I5RNUMAKnew · submitted 2026-05-18 · 💻 cs.CL · cs.AI

Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

Pith reviewed 2026-05-20 11:21 UTC · model grok-4.3

keywords: multi-agent systems, large language models, creativity, semantic space, human-AI comparison, problem solving, conversational dynamics

The pith

Multi-agent LLM teams generate more creative ideas than human teams by exploring semantic space more efficiently.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that teams of multiple large language model agents produce more creative solutions than groups of humans on problem-solving tasks. This conclusion rests on comparing thousands of ideas generated by AI teams against hundreds from human teams across six different tasks, revealing a large performance gap driven mainly by higher novelty while usefulness remains similar. Mapping conversations as paths through semantic space shows that both AI and human groups improve when they explore widely rather than fixating on one theme, yet AI teams gain from rapid broad coverage in shorter exchanges while humans gain from steady local connections. Model selection and discussion format together shape a sizable share of how creatively the AI conversations proceed.

Core claim

Multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots).

What carries the argument

Representation of team conversations as paths through semantic space using neural language model embeddings, which quantifies global coherence, semantic spread, local coherence, and path length to compare generative processes.
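To make the four trajectory features concrete, here is a minimal Python sketch. The paper's exact definitions are not given in this summary, so every formula below is an assumption: local coherence is taken as the mean cosine similarity between consecutive turns, global coherence as the mean similarity of each turn to the conversation centroid, semantic spread as the mean pairwise cosine distance, and path length as the summed cosine distance along the conversation. The function names are hypothetical, and the embeddings are assumed to come from whichever sentence-embedding model is in use.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def trajectory_features(turns):
    """Summarise a conversation given one embedding vector per turn, in order.

    All four definitions are illustrative assumptions, not the paper's own.
    """
    if len(turns) < 2:
        raise ValueError("need at least two turns")
    dim = len(turns[0])
    # Centroid of the whole conversation in embedding space.
    centroid = [sum(t[i] for t in turns) / len(turns) for i in range(dim)]
    # Similarity between each turn and the next one.
    steps = [cosine(a, b) for a, b in zip(turns, turns[1:])]
    # Cosine distance between every pair of turns, regardless of order.
    pairs = [1.0 - cosine(a, b) for a, b in combinations(turns, 2)]
    return {
        "global_coherence": sum(cosine(t, centroid) for t in turns) / len(turns),
        "local_coherence": sum(steps) / len(steps),
        "semantic_spread": sum(pairs) / len(pairs),
        "path_length": sum(1.0 - s for s in steps),
    }
```

Under these assumed definitions, a conversation that jumps between distant topics scores lower on global coherence and higher on path length than one that stays close to a single theme.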

If this is right

LLM teams reach higher creativity via high semantic spread combined with shorter paths.
Human teams reach higher creativity via high local coherence and frequent pivots.
Model choice and discussion structure together explain 26.8 percent of variance in LLM conversational dynamics.
These levers enable systematic design of multi-agent systems for greater creative output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hybrid human-AI teams could combine human conversational flow with AI semantic spread to exceed either alone.
The semantic path technique could be applied to study creativity in scientific or design collaborations beyond the tested tasks.
Training objectives that reward efficient wide exploration might further increase LLM team performance.

Load-bearing premise

Creativity can be measured comparably for AI and human ideas through novelty and usefulness ratings without systematic bias from the chosen tasks or evaluation method.

What would settle it

Independent raters blind to idea source rate human-team ideas as equally or more novel than multi-agent LLM ideas on the same six tasks, or the performance gap vanishes when the tasks are replaced with new real-world innovation challenges.

Original abstract

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Multi-agent LLM teams beat human teams on novelty-driven creativity here with a large effect size (Cohen's d=1.50), but the rating setup carries a clear risk of source-detection bias.

The main thing to know is that this paper reports multi-agent LLM teams producing more creative ideas than human teams across six tasks, with a Cohen's d of 1.50 driven by higher novelty while usefulness stays comparable. They back this with 4,541 AI-generated ideas against 341 from human teams and add semantic path analysis to show how conversation dynamics differ between the two groups. Both do better when they range widely rather than staying on one theme, but LLM teams gain from efficient spread and shorter paths while human teams gain from smooth local flow and frequent pivots. Model choice and discussion structure together explain about a quarter (26.8%) of the variance in LLM conversational dynamics.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical comparison of creative idea generation by multi-agent LLM teams versus single LLMs and human teams across six problem-solving tasks. It reports that multi-agent LLM teams substantially outperform human teams in overall creativity (Cohen's d=1.50) based on pooled novelty and usefulness ratings of 4,541 LLM-generated ideas and 341 human-team ideas, with the advantage driven by novelty. The work further represents team conversations as paths in semantic embedding space, showing that both groups benefit from low global coherence but differ in other predictors (efficient exploration for LLMs, local coherence for humans), and identifies model choice plus discussion structure as levers explaining 26.8% of variance in LLM dynamics.

Significance. If the measurement and comparison protocols prove robust, the finding that multi-agent LLM systems can exceed human teams in creativity would be a notable contribution to AI-augmented innovation research, with direct implications for designing collaborative AI systems. The semantic-path representation of conversational trajectories provides a useful quantitative lens for comparing generative processes, and the variance decomposition offers actionable design insights. The scale of the LLM idea sample is a clear empirical strength.

major comments (3)

[Methods] Methods section: the rating protocol description provides no inter-rater reliability statistics (e.g., ICC or Cohen's kappa), no details on rater blinding verification, and no source-guessing test or residual-source regression to quantify whether stylistic cues allowed raters to detect LLM versus human origin. This directly undermines confidence in the commensurability of novelty scores that drive the reported Cohen's d=1.50.
[Results] Results section on creativity comparison: the headline effect size is presented without error bars, confidence intervals, or explicit reporting of data exclusion criteria and participant-matching procedures between the human-team (n=341 ideas) and LLM conditions. These omissions are load-bearing for interpreting the magnitude and generalizability of the claimed superiority.
[Semantic path analysis] Semantic path analysis subsection: the assumption that the same embedding space represents LLM and human conversational trajectories without differential distortion is not tested (e.g., via domain-adaptation checks or cross-source alignment metrics), yet it underpins the differential predictor findings for LLM versus human teams.
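One common way to report the inter-rater reliability requested in the first major comment is a two-way random-effects, single-rater intraclass correlation, often written ICC(2,1). The sketch below computes it from an items-by-raters matrix using the standard ANOVA mean squares. Which ICC form best fits the paper's rating design is an assumption here, and the function name and input layout are hypothetical.

```python
def icc_2_1(ratings):
    """ICC(2,1) for a complete items-by-raters matrix (list of equal-length rows).

    Assumes every rater scored every item; degenerate inputs with no
    variance at all will raise ZeroDivisionError.
    """
    n = len(ratings)      # number of rated ideas
    k = len(ratings[0])   # number of raters
    if n < 2 or k < 2:
        raise ValueError("need at least two items and two raters")
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
```

Because ICC(2,1) treats raters as a random sample and penalises systematic differences in rater leniency, it is a fairly conservative choice; an averaged-rater form would give higher values for the same data.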

minor comments (2)

[Abstract] Abstract: the six tasks are referred to as 'diverse' but not enumerated; adding their names would improve immediate clarity for readers.
[Figures] Figure captions for semantic path visualizations: axis labels and color legends could be expanded to explicitly state what 'global coherence' and 'semantic spread' quantify in the embedding space.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of methodological transparency and analytical rigor. We address each major comment below and outline revisions to strengthen the manuscript.

Referee: [Methods] Methods section: the rating protocol description provides no inter-rater reliability statistics (e.g., ICC or Cohen's kappa), no details on rater blinding verification, and no source-guessing test or residual-source regression to quantify whether stylistic cues allowed raters to detect LLM versus human origin. This directly undermines confidence in the commensurability of novelty scores that drive the reported Cohen's d=1.50.

Authors: We agree that inter-rater reliability and blinding details are essential for validating the novelty and usefulness ratings. In the revised manuscript, we will report intraclass correlation coefficients (ICC) for the ratings across the four raters. We will also add explicit confirmation that raters were blinded to idea origin (LLM vs. human) and include a source-guessing test on a subset of ideas to quantify any detectable stylistic cues, along with residual-source regression if appropriate. These additions will directly support the commensurability of the scores underlying the d=1.50 effect.
revision: yes

Referee: [Results] Results section on creativity comparison: the headline effect size is presented without error bars, confidence intervals, or explicit reporting of data exclusion criteria and participant-matching procedures between the human-team (n=341 ideas) and LLM conditions. These omissions are load-bearing for interpreting the magnitude and generalizability of the claimed superiority.

Authors: We acknowledge that error bars, confidence intervals, and detailed exclusion/matching criteria are necessary for proper interpretation. The revised results section will include 95% confidence intervals for Cohen's d and mean novelty/usefulness scores. We will also explicitly report all data exclusion criteria applied to both the 4,541 LLM ideas and 341 human ideas, as well as the participant- and task-matching procedures used to align the human-team and multi-agent LLM conditions.
revision: yes

Referee: [Semantic path analysis] Semantic path analysis subsection: the assumption that the same embedding space represents LLM and human conversational trajectories without differential distortion is not tested (e.g., via domain-adaptation checks or cross-source alignment metrics), yet it underpins the differential predictor findings for LLM versus human teams.

Authors: This is a fair point regarding the embedding space. In revision, we will add explicit tests for differential distortion, including cross-source alignment metrics (e.g., Procrustes superimposition) and domain-adaptation checks comparing intra- vs. inter-source trajectory distances. Results of these checks will be reported, and any implications for the distinct predictor patterns (efficient exploration for LLMs vs. local coherence for humans) will be discussed.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical comparison

The paper conducts an empirical study generating and rating 4,541 multi-agent LLM ideas against 341 human-team ideas across six tasks, using human novelty/usefulness scores and semantic path representations from pre-trained embeddings. No equations or derivations are presented that reduce any claimed result (such as Cohen's d=1.50 or the 26.8% variance figure) to a fitted parameter defined by the same data or to a self-citation chain. The variance result is reported as an observational outcome from regression on model choice and discussion structure, not as a definitional prediction. All central claims rest on external human ratings and standard statistical methods without self-referential loops, rendering the derivation self-contained.
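As an illustration of what "variance explained by model choice and discussion structure" can mean, the sketch below partitions the variance of one conversational metric across two categorical factors. It assumes a balanced design (every model paired with every structure equally often), in which case the two main-effect sums of squares are orthogonal and their combined share equals the R² of an additive model. The paper's actual regression specification is not described in this summary, and the function name, factor labels, and input layout are hypothetical.

```python
from collections import defaultdict


def explained_share(rows):
    """Share of variance attributable to two factors in a balanced design.

    rows: list of (model, structure, value) tuples. Raises
    ZeroDivisionError if all values are identical.
    """
    values = [value for _, _, value in rows]
    grand = sum(values) / len(values)
    ss_total = sum((v - grand) ** 2 for v in values)

    def main_effect_ss(index):
        # Between-group sum of squares for the factor stored at rows[i][index].
        groups = defaultdict(list)
        for row in rows:
            groups[row[index]].append(row[2])
        return sum(
            len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups.values()
        )

    ss_model = main_effect_ss(0)
    ss_structure = main_effect_ss(1)
    return {
        "model": ss_model / ss_total,
        "structure": ss_structure / ss_total,
        "combined": (ss_model + ss_structure) / ss_total,
    }
```

With unbalanced data the two shares no longer add up cleanly, and a regression with an explicit choice of sums-of-squares type would be needed instead.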

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical comparative study that relies on standard statistical reporting (Cohen's d, variance explained) and established NLP techniques for semantic embeddings; no new physical entities or ad-hoc constants are introduced.

axioms (1)

standard math: Standard assumptions underlying Cohen's d effect size and linear variance partitioning hold for the creativity ratings and semantic coherence metrics.
Invoked when reporting d=1.50 and the 26.8% variance figure.
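For readers who want to see what the effect-size assumption involves, here is a minimal Python sketch of Cohen's d with a pooled standard deviation, plus a percentile bootstrap interval of the kind the referee report asks for. The pooled-variance form and the bootstrap are assumptions about how such numbers could be computed, not a description of the paper's procedure; the function names and default parameters are hypothetical.

```python
import math
import random
from statistics import mean, variance


def cohens_d(a, b):
    """Standardised mean difference (a minus b) using the pooled sample variance."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var)


def bootstrap_ci(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Cohen's d.

    Resamples each group independently with replacement. Very small or
    near-constant samples can produce a zero-variance resample and fail.
    """
    rng = random.Random(seed)
    stats = sorted(
        cohens_d(rng.choices(a, k=len(a)), rng.choices(b, k=len(b)))
        for _ in range(n_boot)
    )
    lower = stats[max(0, int((alpha / 2) * n_boot))]
    upper = stats[max(0, int((1 - alpha / 2) * n_boot) - 1)]
    return lower, upper
```

The pooled form assumes roughly equal variances in the two groups. With group sizes as unequal as 4,541 and 341, a reader may also want to compare against a variant that does not pool.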

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
multi-agent LLM teams ... outperform human teams in creativity (Cohen’s d=1.50) ... represent conversations as paths through semantic space using neural language model representations ... trajectory features ... global coherence, semantic spread, local coherence

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Jcost uniqueness, φ-ladder, 8-tick period, parameter-free constants

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

66 extracted references · 66 canonical work pages · 8 internal anchors

[1]

Westview press, Boulder, CO (1996)

Amabile, T.M.: Creativity in Context: Update to the Social Psychology of Creativity. Westview press, Boulder, CO (1996)

work page 1996
[2]

Routledge, London, UK (2004)

Boden, M.A.: The Creative Mind: Myths and Mechanisms, 2nd edn. Routledge, London, UK (2004). https://doi.org/10.4324/9780203508527 . https://doi.org/10.4324/9780203508527

work page doi:10.4324/9780203508527 2004
[3]

Creativity Research Journal24(1), 92–96 (2012)

Runco, M.A., Jaeger, G.J.: The standard definition of creativity. Creativity Research Journal24(1), 92–96 (2012)

work page 2012
[4]

ECAI’12, pp

Colton, S., Wiggins, G.A.: Computational creativity: the final frontier? In: Pro- ceedings of the 20th European Conference on Artificial Intelligence. ECAI’12, pp. 21–26. IOS Press, NLD (2012)

work page 2012
[5]

Delving Deep into Rectifiers:

He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1026–1034. IEEE Computer Society, Washington, DC (2015). https: //doi.org/10.1109/ICCV.2015.123 .https://doi.org/10...

work page doi:10.1109/iccv.2015.123 2015
[6]

and Ko, Justin and Swetter, Susan M

Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature542(7639), 115–118 (2017) https://doi.org/10.1038/nature21056

work page doi:10.1038/nature21056 2017
[7]

GPT-4 Technical Report

OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.-L., Brockman,...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

The Impact of AI on Developer Productivity: Evidence from GitHub Copilot

Peng, S., Kalliamvakou, E., Cihon, P., Demirer, M.: The Impact of AI on Devel- oper Productivity: Evidence from GitHub Copilot (2023). https://arxiv.org/abs/ 2302.06590

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Paradis, E., Grey, K., Madison, Q., Nam, D., Macvean, A., Meimand, V., Zhang, N., Ferrari-Church, B., Chandra, S.: How much does AI impact development speed?Anenterprise-basedrandomizedcontrolledtrial(2024).https://arxiv.org/ abs/2410.12944

work page arXiv 2024
[10]

AlphaEvolve: A coding agent for scientific and algorithmic discovery

Novikov, A., V˜ u, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A.Z., Shirobokov, S., Kozlovskii, B., Ruiz, F.J.R., Mehrabian, A., Kumar, M.P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., Balog, M.: AlphaEvolve: A coding agent for scientific and algorithmic discovery (2025). https://arxiv.org/abs/2506.13131

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Nature (2026) https://doi.org/10.1038/s41586-025-10072-4

Asai, A., He, J., Shao, R., Shi, W., Singh, A., Chang, J.C., Lo, K., Soldaini, L., Feldman, S., D’Arcy, M., Wadden, D., Latzke, M., Sparks, J., Hwang, J.D., Kishore, V., Tian, M., Ji, P., Liu, S., Tong, H., Wu, B., Xiong, Y., Zettlemoyer, L., Neubig, G., Weld, D.S., Downey, D., Yih, W.-t., Koh, P.W., Hajishirzi, H.: Syn- thesizing scientific literature wi...

work page doi:10.1038/s41586-025-10072-4 2026
[12]

Nature651(8107), 914–919 (2026)

Lu, C., Lu, C., Lange, R.T., Yamada, Y., Hu, S., Foerster, J., Ha, D., Clune, 21 J.: Towards end-to-end automation of ai research. Nature651(8107), 914–919 (2026)

work page 2026
[13]

Nature550(7676), 354–359 (2017) https://doi.org/10.1038/ nature24270

Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of go without human knowledge. Nature550(7676), 354–359 (2017) https://doi.org/10.1038/ nature24270

work page 2017
[14]

Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., O. Pinto, H.P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., Zhang, S.: Dota 2 with Large Scale Deep Reinforcemen...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[15]

Science 330(6004), 686–688 (2010) https://doi.org/10.1126/science.1193147

Woolley, A.W., Chabris, C.F., Pentland, A., Hashmi, N., Malone, T.W.: Evidence for a collective intelligence factor in the performance of human groups. Science 330(6004), 686–688 (2010) https://doi.org/10.1126/science.1193147

work page doi:10.1126/science.1193147 2010
[16]

In: Proceedings of the 41st International Conference on Machine Learning

Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. In: Proceedings of the 41st International Conference on Machine Learning. ICML’24. JMLR.org, Vienna, Austria (2024)

work page 2024
[17]

In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N

Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Shi, S., Tu, Z.: Encouraging divergent thinking in large language models through multi-agent debate. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Lan- guage Processing, pp. 17889–17904. Association for Computational...

work page doi:10.18653/v1/2024.emnlp-main.992 2024
[18]

doi: 10.18653/v1/2024.acl-long

Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang, C., Chen, W., Su, Y., Cong, X., Xu, J., Li, D., Liu, Z., Sun, M.: ChatDev: Communicative agents for software development. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Proceed- ings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15174–...

work page doi:10.18653/v1/2024.acl-long 2024
[19]

In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V

Lin, Y.-C., Chen, K.-C., Li, Z.-Y., Wu, T.-H., Wu, T.-H., Chen, K.-Y., Lee, H.-y., Chen, Y.-N.: Creativity in LLM-based multi-agent systems: A sur- vey. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Proceedings of the 2025 Conference on Empirical Methods in Natural Lan- guage Processing, pp. 27584–27607. Association for Computatio...

work page doi:10.18653/v1/2025.emnlp-main.1403 2025
[20]

Journal of Personality and Social Psychology43(5), 997–1013 (1982)

Amabile, T.M.: Social psychology of creativity: A consensual assessment tech- nique. Journal of Personality and Social Psychology43(5), 997–1013 (1982)

work page 1982
[21]

Journal of Personality and Social Psychology53(3), 497–509 (1987) https://doi.org/10.1037/0022-3514.53.3.497

Diehl, M., Stroebe, W.: Productivity loss in brainstorming groups: Toward the solution of a riddle. Journal of Personality and Social Psychology53(3), 497–509 (1987) https://doi.org/10.1037/0022-3514.53.3.497

work page doi:10.1037/0022-3514.53.3.497 1987
[22]

https://arxiv.org/abs/2601.13295

Khatua, A., Zhu, H., Tran, P., Prabhudesai, A., Sadrieh, F., Lieberwirth, J.K., Yu, X., Fu, Y., Ryan, M.J., Pei, J., Yang, D.: CooperBench: Why Coding Agents Cannot be Your Teammates Yet (2026). https://arxiv.org/abs/2601.13295

work page arXiv 2026
[23]

doi: 10.1177/0003122419877135

Kozlowski, A.C., Taddy, M., Evans, J.A.: The geometry of culture: Ana- lyzing the meanings of class through word embeddings. American Sociolog- ical Review84(5), 905–949 (2019) https://doi.org/10.1177/0003122419877135 https://doi.org/10.1177/0003122419877135

work page doi:10.1177/0003122419877135 2019
[24]

Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space

Toro-Hernández, F.D., Filho, J.V., Cabral-Carvalho, R.M.: Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embed- ding Space (2026). https://arxiv.org/abs/2602.05971

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

Proceedings of the National Academy of Sciences120(42), 2305290120 (2023) https://doi.org/10

Nour, M.M., McNamee, D.C., Liu, Y., Dolan, R.J.: Trajectories through semantic spaces in schizophrenia and the relationship to ripple bursts. Proceedings of the National Academy of Sciences120(42), 2305290120 (2023) https://doi.org/10. 1073/pnas.2305290120

work page 2023
[26]

Psychological Review119(2), 431–440 (2012) https://doi.org/10.1037/a0027373

Hills, T.T., Jones, M.N., Todd, P.M.: Optimal foraging in semantic memory. Psychological Review119(2), 431–440 (2012) https://doi.org/10.1037/a0027373

work page doi:10.1037/a0027373 2012
[27]

Trends in Cognitive Sciences19(1), 46–54 (2015)

Hills, T.T., Todd, P.M., Lazer, D., Redish, A.D., Couzin, I.D., Group, C.S.R.: Exploration versus exploitation in space, mind, and society. Trends in Cognitive Sciences19(1), 46–54 (2015)

work page 2015
[28]

Measuring Semantic Coherence of a Conversation

Vakulenko, S., Rijke, M., Cochez, M., Savenkov, V., Polleres, A.: Measuring Semantic Coherence of a Conversation (2018). https://arxiv.org/abs/1806.06411

work page internal anchor Pith review Pith/arXiv arXiv 2018
[29]

Efficient Estimation of Word Representations in Vector Space

Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

work page internal anchor Pith review Pith/arXiv arXiv 2013
[30]

and Neumann, Mark and Iyyer, Mohit and Gardner, Matt and Clark, Christopher and Lee, Kenton and Zettlemoyer, Luke

Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettle- moyer,L.:Deepcontextualizedwordrepresentations.In:Walker,M.,Ji,H.,Stent, A. (eds.) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Association ...

work page doi:10.18653/v1/n18-1202 2018
[31]

Nature Human Behaviour10(3), 531–540 (2026) https://doi.org/10.1038/s41562-025-02331-1

Wang, D., Huang, D., Shen, H., Uzzi, B.: A large-scale comparison of divergent creativity in humans and large language models. Nature Human Behaviour10(3), 531–540 (2026) https://doi.org/10.1038/s41562-025-02331-1

work page doi:10.1038/s41562-025-02331-1 2026
[32]

Scientific Reports16, 1279 (2026) https://doi.org/10.1038/ s41598-025-25157-3

Bellemare-Pepin, A., Lespinasse, F., Thölke, P., Harel, Y., Mathewson, K., Olson, J.A., Bengio, Y., Jerbi, K.: Divergent creativity in humans and large language models. Scientific Reports16, 1279 (2026) https://doi.org/10.1038/ s41598-025-25157-3

work page 2026
[33]

arXiv (2022)

Stevenson, C., Smal, I., Baas, M., Grasman, R., Maas, H.: Putting GPT-3’s Cre- ativity to the (Alternative Uses) Test. arXiv (2022). https://doi.org/10.48550/ arXiv.2206.08932 . http://arxiv.org/abs/2206.08932 Accessed 2024-02-18

work page arXiv 2022
[34]

arXiv (2024)

Nath, S.S., Dayan, P., Stevenson, C.: Characterising the Creative Process in Humans and Large Language Models. arXiv (2024). https://doi.org/10.48550/ arXiv.2405.00899 . http://arxiv.org/abs/2405.00899 Accessed 2026-01-22

work page arXiv 2024
[35]

Proceedings of the National Academy of Sciences118(25), 2022340118 (2021)

Olson, J.A., Nahas, J., Chmoulevitch, D., Cropper, S.J., Webb, M.E.: Nam- ing unrelated words predicts creativity. Proceedings of the National Academy of Sciences118(25), 2022340118 (2021)

work page 2021
[36]

Behavior Research Methods 53(2), 757–780 (2021)

Beaty, R.E., Johnson, D.R.: Automating creativity assessment with semdis: An open platform for computing semantic distance. Behavior Research Methods 53(2), 757–780 (2021)

work page 2021
[37]

European Review of Social Psychology21(1), 34–77 (2010)

Nijstad, B.A., De Dreu, C.K., Rietzschel, E.F., Baas, M.: The dual pathway to creativity model: Creative ideation as a function of flexibility and persistence. European Review of Social Psychology21(1), 34–77 (2010)

work page 2010
[38]

Administrative Science Quarterly44(2), 350–383 (1999)

Edmondson, A.: Psychological safety and learning behavior in work teams. Administrative Science Quarterly44(2), 350–383 (1999)

work page 1999
[39]

Academy of Management Review39(3), 324–343 (2014)

Harvey, S.: Creative synthesis: Exploring the process of extraordinary group creativity. Academy of Management Review39(3), 324–343 (2014)

work page 2014
[40]

https://arxiv.org/abs/2601.10825

Kim, J., Lai, S., Scherrer, N., Arcas, B.A., Evans, J.: Reasoning Models Generate Societies of Thought (2026). https://arxiv.org/abs/2601.10825

work page arXiv 2026
[41]

Psychological Bulletin53(4), 267–293 (1956) https://doi.org/10.1037/h0040755

Guilford, J.P.: The structure of intellect. Psychological Bulletin53(4), 267–293 (1956) https://doi.org/10.1037/h0040755

work page doi:10.1037/h0040755 1956
[42]

Applied Psychology49(2), 237–262 (2000)

Paulus, P.B.: Groups, teams, and creativity: The creative potential of idea- generating groups. Applied Psychology49(2), 237–262 (2000)

work page 2000
[43]

, year 2007

Paulus, P.B., Nijstad, B.A.: Group creativity: An introduction. In: Paulus, P.B., Nijstad, B.A. (eds.) Group Creativity: Innovation Through Collaboration, pp. 3–12. Oxford University Press, Oxford, 24 UK (2003). https://doi.org/10.1093/acprof:oso/9780195147308.003.0001 . https://doi.org/10.1093/acprof:oso/9780195147308.003.0001

work page doi:10.1093/acprof:oso/9780195147308.003.0001 2003
[44]

https://openai.com/index/gpt-4-1/ (2025)

OpenAI: Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/ (2025)

work page 2025
[45]

https://openai.com/index/ introducing-o3-and-o4-mini/ (2025)

OpenAI: Introducing OpenAI o3 and o4-mini. https://openai.com/index/ introducing-o3-and-o4-mini/ (2025)

work page 2025
[46]

Nature645, 633–638 (2025) https://doi.org/10.1038/ s41586-025-09422-z

Guo, D., Yang, D., Zhang, H., Song, J., Wang, P., Zhu, Q., Xu, R., Zhang, R., Ma, S., Bi, X.,et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature645, 633–638 (2025) https://doi.org/10.1038/ s41586-025-09422-z

work page 2025
[47]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Quantifying the persona effect in LLM simulations

Hu, T., Collier, N.: Quantifying the persona effect in LLM simulations. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 10289–10307. Association for Computational Linguis- tics, Bangkok, Thailand (2024). https://doi.org/10.18653/v1/2024.acl...

work page doi:10.18653/v1/2024.acl-long.554 2024
[49]

https://arxiv.org/abs/2502.20859

Duan, Y., Tang, Y., Bai, X., Chen, K., Li, J., Zhang, M.: The Power of Per- sonality: A Human Simulation Perspective to Investigate Large Language Model Agents (2025). https://arxiv.org/abs/2502.20859

work page arXiv 2025
[50]

Scribner, New York (1953)

Osborn, A.F.: Applied Imagination: Principles and Procedures of Creative Thinking. Scribner, New York (1953)

work page 1953
[51] Zhang, Y., Li, M., Long, D., Zhang, X., Lin, H., Yang, B., Xie, P., Yang, A., Liu, D., Lin, J., Huang, F., Zhou, J.: Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models (2025). https://arxiv.org/abs/2506.05176
[52] Troyer, A.K., Moscovitch, M., Winocur, G.: Clustering and switching as two components of verbal fluency: Evidence from younger and older healthy adults. Neuropsychology 11(1), 138–146 (1997) https://doi.org/10.1037/0894-4105.11.1.138
[53] Cropley, A.: In praise of convergent thinking. Creativity Research Journal 18(3), 391–404 (2006) https://doi.org/10.1207/s15326934crj1803_13
[54] Kenett, Y.N., Anaki, D., Faust, M.: Investigating the structure of semantic networks in low and high creative persons. Frontiers in Human Neuroscience 8, 407 (2014)
[55] Elvevåg, B., Foltz, P.W., Weinberger, D.R., Goldberg, T.E.: Quantifying incoherence in speech: An automated methodology and novel application to schizophrenia. Schizophrenia Research 93(1-3), 304–316 (2007)
[56] Iter, D., Yoon, J., Jurafsky, D.: Automatic detection of incoherent speech for diagnosing schizophrenia. In: Loveys, K., Niederhoffer, K., Prud’hommeaux, E., Resnik, R., Resnik, P. (eds.) Proceedings of the Fifth Workshop on Computational Linguistics and Clinical Psychology: From Keyboard to Clinic, pp. 136–146. Association for Computational Linguistics... (doi:10.18653/v1/w18-0615)
[57] Li, J., Hovy, E.: A model of coherence based on distributed sentence representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 2039–2048 (2014)
[58] Malaie, S., Spivey, M.J., Marghetis, T.: Divergent and convergent creativity are different kinds of foraging. Psychological Science 35(7), 749–759 (2024)
[59] Guilford, J.P.: Creativity: Yesterday, today and tomorrow. The Journal of Creative Behavior 1(1), 3–14 (1967) https://doi.org/10.1002/j.2162-6057.1967.tb00002.x
[60] Girotra, K., Terwiesch, C., Ulrich, K.T.: Idea generation and the quality of the best idea. Management Science 56(4), 591–605 (2010)
[61] Johnson, D.R., Kaufman, J.C., Baker, B.S., Patterson, J.D., Barbot, B., Green, A.E., Hell, J., Kennedy, E., Sullivan, G.F., Taylor, C.L., Ward, T., Beaty, R.E.: Divergent semantic integration (DSI): Extracting creativity from narratives with distributional semantic modeling. Behavior Research Methods 55(7), 3726–3759 (2023) https://doi.org/10.3758/s13428-0... (doi:10.3758/s13428-022-01986-2)
[62] Zarrieß, S., Junker, S., Sieker, J., Alacam, Ö.: Components of creativity: Language model-based predictors for clustering and switching in verbal fluency. In: Boleda, G., Roth, M. (eds.) Proceedings of the 29th Conference on Computational Natural Language Learning, pp. 216–232. Association for Computational Linguistics, Vienna, Austria (2025). https:/... (doi:10.18653/v1/2025.conll-1.15)
[63] Coursey, L.E., Gertner, R.T., Williams, B.C., Kenworthy, J.B., Paulus, P.B., Doboli, S.: Linking the divergent and convergent processes of collaborative creativity: The impact of expertise levels and elaboration processes. Frontiers in Psychology 10 (2019) https://doi.org/10.3389/fpsyg.2019.00699

Supplementary information. Supplementary information is p...
[64] One agent generated a new candidate idea intended to be more creative than existing ideas.
[65] All agents rated the new candidate, the current idea, and recently explored past ideas on creativity (1-10 scale).
[66] evolve [X] into [Y]

The highest-rated idea became the new current idea for the next iteration. This allowed the system to either adopt the new modification, retain the current idea, or backtrack to a recently considered alternative. This proposal-and-evaluation cycle continued until one of two stopping criteria was met:
• Convergence: The same idea was selected (rated highes...
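The proposal-and-evaluation cycle described in items [64] to [66] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `propose_and_evaluate`, `propose`, `raters`, `memory_size`, `patience`, and `max_iterations` are hypothetical, averaging the agents' 1-10 ratings is an assumed aggregation rule, and because the text above is cut off after the convergence criterion, the iteration cap used here as the second stopping criterion is a guess.

```python
from collections import deque
from statistics import mean
from typing import Callable, Deque, Sequence

# Hypothetical interfaces: in practice these would presumably wrap LLM calls.
Proposer = Callable[[str, Sequence[str]], str]  # (current idea, recent ideas) -> candidate
Rater = Callable[[str], float]                  # idea -> creativity rating (1-10 scale)


def propose_and_evaluate(
    seed_idea: str,
    propose: Proposer,
    raters: Sequence[Rater],
    memory_size: int = 3,      # assumed number of recent ideas kept for backtracking
    patience: int = 3,         # assumed consecutive retentions that count as convergence
    max_iterations: int = 20,  # assumed second stopping criterion (not visible in the text)
) -> str:
    """Return the idea that is current when the loop stops."""
    current = seed_idea
    recent: Deque[str] = deque(maxlen=memory_size)
    retained_in_a_row = 0

    for _ in range(max_iterations):
        # Item [64]: one agent proposes a new candidate idea.
        candidate = propose(current, list(recent))

        # Item [65]: all agents rate the current idea, the candidate, and recent ideas.
        # dict.fromkeys drops duplicates while preserving order.
        pool = list(dict.fromkeys([current, candidate, *recent]))
        scores = {idea: mean([rate(idea) for rate in raters]) for idea in pool}

        # Item [66]: the highest-rated idea becomes the current idea.
        # max() returns the first maximum, so ties keep the current idea
        # (an assumed tie-break rule).
        best = max(pool, key=lambda idea: scores[idea])

        if best == current:
            retained_in_a_row += 1
            if retained_in_a_row >= patience:
                break  # convergence: the same idea kept being selected
        else:
            if best in recent:
                recent.remove(best)  # backtracking to a recently explored idea
            recent.append(current)
            current = best
            retained_in_a_row = 0

    return current
```

Passing `propose` and the raters as plain callables keeps the control flow testable with simple stubs; with noisy raters the same loop can adopt, retain, or backtrack as described in item [66].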

[1] Amabile, T.M.: Creativity in Context: Update to the Social Psychology of Creativity. Westview Press, Boulder, CO (1996)

[2] Boden, M.A.: The Creative Mind: Myths and Mechanisms, 2nd edn. Routledge, London, UK (2004). https://doi.org/10.4324/9780203508527

[3] Runco, M.A., Jaeger, G.J.: The standard definition of creativity. Creativity Research Journal 24(1), 92–96 (2012)

[4] Colton, S., Wiggins, G.A.: Computational creativity: the final frontier? In: Proceedings of the 20th European Conference on Artificial Intelligence. ECAI’12, pp. 21–26. IOS Press, NLD (2012)

[5] He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015, pp. 1026–1034. IEEE Computer Society, Washington, DC (2015). https://doi.org/10.1109/ICCV.2015.123

[6] Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., Thrun, S.: Dermatologist-level classification of skin cancer with deep neural networks. Nature 542(7639), 115–118 (2017) https://doi.org/10.1038/nature21056

[7] OpenAI, Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., Avila, R., Babuschkin, I., Balaji, S., Balcom, V., Baltescu, P., Bao, H., Bavarian, M., Belgum, J., Bello, I., Berdine, J., Bernadett-Shapiro, G., Berner, C., Bogdonoff, L., Boiko, O., Boyd, M., Brakman, A.-L., Brockman,... GPT-4 Technical Report. arXiv (2024)

[8] Peng, S., Kalliamvakou, E., Cihon, P., Demirer, M.: The Impact of AI on Developer Productivity: Evidence from GitHub Copilot (2023). https://arxiv.org/abs/2302.06590

[9] Paradis, E., Grey, K., Madison, Q., Nam, D., Macvean, A., Meimand, V., Zhang, N., Ferrari-Church, B., Chandra, S.: How much does AI impact development speed? An enterprise-based randomized controlled trial (2024). https://arxiv.org/abs/2410.12944

[10] Novikov, A., Vũ, N., Eisenberger, M., Dupont, E., Huang, P.-S., Wagner, A.Z., Shirobokov, S., Kozlovskii, B., Ruiz, F.J.R., Mehrabian, A., Kumar, M.P., See, A., Chaudhuri, S., Holland, G., Davies, A., Nowozin, S., Kohli, P., Balog, M.: AlphaEvolve: A coding agent for scientific and algorithmic discovery (2025). https://arxiv.org/abs/2506.13131

[11] Asai, A., He, J., Shao, R., Shi, W., Singh, A., Chang, J.C., Lo, K., Soldaini, L., Feldman, S., D’Arcy, M., Wadden, D., Latzke, M., Sparks, J., Hwang, J.D., Kishore, V., Tian, M., Ji, P., Liu, S., Tong, H., Wu, B., Xiong, Y., Zettlemoyer, L., Neubig, G., Weld, D.S., Downey, D., Yih, W.-t., Koh, P.W., Hajishirzi, H.: Synthesizing scientific literature wi... Nature (2026) https://doi.org/10.1038/s41586-025-10072-4

[12] Lu, C., Lu, C., Lange, R.T., Yamada, Y., Hu, S., Foerster, J., Ha, D., Clune, J.: Towards end-to-end automation of AI research. Nature 651(8107), 914–919 (2026)

[13] Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T., Baker, L., Lai, M., Bolton, A., Chen, Y., Lillicrap, T., Hui, F., Sifre, L., Driessche, G., Graepel, T., Hassabis, D.: Mastering the game of Go without human knowledge. Nature 550(7676), 354–359 (2017) https://doi.org/10.1038/nature24270

[14] Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., Józefowicz, R., Gray, S., Olsson, C., Pachocki, J., Petrov, M., O. Pinto, H.P., Raiman, J., Salimans, T., Schlatter, J., Schneider, J., Sidor, S., Sutskever, I., Tang, J., Wolski, F., Zhang, S.: Dota 2 with Large Scale Deep Reinforcemen... (arXiv, 2019)

[15] Woolley, A.W., Chabris, C.F., Pentland, A., Hashmi, N., Malone, T.W.: Evidence for a collective intelligence factor in the performance of human groups. Science 330(6004), 686–688 (2010) https://doi.org/10.1126/science.1193147

[16] Du, Y., Li, S., Torralba, A., Tenenbaum, J.B., Mordatch, I.: Improving factuality and reasoning in language models through multiagent debate. In: Proceedings of the 41st International Conference on Machine Learning. ICML’24. JMLR.org, Vienna, Austria (2024)

[17] Liang, T., He, Z., Jiao, W., Wang, X., Wang, Y., Wang, R., Yang, Y., Shi, S., Tu, Z.: Encouraging divergent thinking in large language models through multi-agent debate. In: Al-Onaizan, Y., Bansal, M., Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 17889–17904. Association for Computational... (doi:10.18653/v1/2024.emnlp-main.992)

[18] Qian, C., Liu, W., Liu, H., Chen, N., Dang, Y., Li, J., Yang, C., Chen, W., Su, Y., Cong, X., Xu, J., Li, D., Liu, Z., Sun, M.: ChatDev: Communicative agents for software development. In: Ku, L.-W., Martins, A., Srikumar, V. (eds.) Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 15174–...

[19] Lin, Y.-C., Chen, K.-C., Li, Z.-Y., Wu, T.-H., Wu, T.-H., Chen, K.-Y., Lee, H.-y., Chen, Y.-N.: Creativity in LLM-based multi-agent systems: A survey. In: Christodoulopoulos, C., Chakraborty, T., Rose, C., Peng, V. (eds.) Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 27584–27607. Association for Computatio... (doi:10.18653/v1/2025.emnlp-main.1403)

[20] Amabile, T.M.: Social psychology of creativity: A consensual assessment technique. Journal of Personality and Social Psychology 43(5), 997–1013 (1982)

[21] Diehl, M., Stroebe, W.: Productivity loss in brainstorming groups: Toward the solution of a riddle. Journal of Personality and Social Psychology 53(3), 497–509 (1987) https://doi.org/10.1037/0022-3514.53.3.497

[22] Khatua, A., Zhu, H., Tran, P., Prabhudesai, A., Sadrieh, F., Lieberwirth, J.K., Yu, X., Fu, Y., Ryan, M.J., Pei, J., Yang, D.: CooperBench: Why Coding Agents Cannot be Your Teammates Yet (2026). https://arxiv.org/abs/2601.13295

[23] Kozlowski, A.C., Taddy, M., Evans, J.A.: The geometry of culture: Analyzing the meanings of class through word embeddings. American Sociological Review 84(5), 905–949 (2019) https://doi.org/10.1177/0003122419877135

[24] Toro-Hernández, F.D., Filho, J.V., Cabral-Carvalho, R.M.: Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space (2026). https://arxiv.org/abs/2602.05971

[25] Nour, M.M., McNamee, D.C., Liu, Y., Dolan, R.J.: Trajectories through semantic spaces in schizophrenia and the relationship to ripple bursts. Proceedings of the National Academy of Sciences 120(42), 2305290120 (2023) https://doi.org/10.1073/pnas.2305290120

[26] Hills, T.T., Jones, M.N., Todd, P.M.: Optimal foraging in semantic memory. Psychological Review 119(2), 431–440 (2012) https://doi.org/10.1037/a0027373

[27] Hills, T.T., Todd, P.M., Lazer, D., Redish, A.D., Couzin, I.D., Group, C.S.R.: Exploration versus exploitation in space, mind, and society. Trends in Cognitive Sciences 19(1), 46–54 (2015)

[28] Vakulenko, S., Rijke, M., Cochez, M., Savenkov, V., Polleres, A.: Measuring Semantic Coherence of a Conversation (2018). https://arxiv.org/abs/1806.06411

[29] Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013)

[30] Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. In: Walker, M., Ji, H., Stent, A. (eds.) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 2227–2237. Association ... (doi:10.18653/v1/n18-1202)

[31] Wang, D., Huang, D., Shen, H., Uzzi, B.: A large-scale comparison of divergent creativity in humans and large language models. Nature Human Behaviour 10(3), 531–540 (2026) https://doi.org/10.1038/s41562-025-02331-1

[32] Bellemare-Pepin, A., Lespinasse, F., Thölke, P., Harel, Y., Mathewson, K., Olson, J.A., Bengio, Y., Jerbi, K.: Divergent creativity in humans and large language models. Scientific Reports 16, 1279 (2026) https://doi.org/10.1038/s41598-025-25157-3

[33] Stevenson, C., Smal, I., Baas, M., Grasman, R., Maas, H.: Putting GPT-3’s Creativity to the (Alternative Uses) Test. arXiv (2022). https://doi.org/10.48550/arXiv.2206.08932. http://arxiv.org/abs/2206.08932. Accessed 2024-02-18

[34] Nath, S.S., Dayan, P., Stevenson, C.: Characterising the Creative Process in Humans and Large Language Models. arXiv (2024). https://doi.org/10.48550/arXiv.2405.00899. http://arxiv.org/abs/2405.00899. Accessed 2026-01-22

[35] Olson, J.A., Nahas, J., Chmoulevitch, D., Cropper, S.J., Webb, M.E.: Naming unrelated words predicts creativity. Proceedings of the National Academy of Sciences 118(25), 2022340118 (2021)

[36] Beaty, R.E., Johnson, D.R.: Automating creativity assessment with SemDis: An open platform for computing semantic distance. Behavior Research Methods 53(2), 757–780 (2021)

[37] Nijstad, B.A., De Dreu, C.K., Rietzschel, E.F., Baas, M.: The dual pathway to creativity model: Creative ideation as a function of flexibility and persistence. European Review of Social Psychology 21(1), 34–77 (2010)

[38] Edmondson, A.: Psychological safety and learning behavior in work teams. Administrative Science Quarterly 44(2), 350–383 (1999)

[39] Harvey, S.: Creative synthesis: Exploring the process of extraordinary group creativity. Academy of Management Review 39(3), 324–343 (2014)

[40] Kim, J., Lai, S., Scherrer, N., Arcas, B.A., Evans, J.: Reasoning Models Generate Societies of Thought (2026). https://arxiv.org/abs/2601.10825

[41] Guilford, J.P.: The structure of intellect. Psychological Bulletin 53(4), 267–293 (1956) https://doi.org/10.1037/h0040755

[42] Paulus, P.B.: Groups, teams, and creativity: The creative potential of idea-generating groups. Applied Psychology 49(2), 237–262 (2000)

[43] Paulus, P.B., Nijstad, B.A.: Group creativity: An introduction. In: Paulus, P.B., Nijstad, B.A. (eds.) Group Creativity: Innovation Through Collaboration, pp. 3–12. Oxford University Press, Oxford, UK (2003). https://doi.org/10.1093/acprof:oso/9780195147308.003.0001

[44] OpenAI: Introducing GPT-4.1 in the API. https://openai.com/index/gpt-4-1/ (2025)

[45] OpenAI: Introducing OpenAI o3 and o4-mini. https://openai.com/index/introducing-o3-and-o4-mini/ (2025)
