LLM Jaggedness Unlocks Scientific Creativity

Esther H. R. Tsai; J. Anibal Boscoboinik; Kevin G. Yager; Shray Mathur

arxiv: 2605.10574 · v2 · pith:XVWEDMQ3new · submitted 2026-05-11 · 💻 cs.AI

Pith reviewed 2026-05-21 08:13 UTC · model grok-4.3

classification 💻 cs.AI

keywords: LLM jaggedness · scientific creativity · idea generation · model ensembles · SciAidanBench · inference-time compute · knowledge pooling

The pith

LLM jaggedness in scientific idea generation can be harnessed to build ensembles that outperform any single model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models advance unevenly rather than uniformly, producing jagged performance profiles when asked to generate scientific ideas. The work introduces SciAidanBench, which measures creative potential by counting the number of unique coherent ideas a model produces for open-ended scientific questions. Across dozens of models the authors observe this jaggedness at three scales: general versus scientific tasks, prompt to prompt within one model, and across scientific subfields. They then show that the same unevenness supplies complementary strengths that can be combined through inference-time compute, knowledge pooling, and brainstorming. The resulting meta-model ensembles generate more valid ideas than the strongest individual model, turning a seeming limitation into a practical resource for AI-assisted science.

Core claim

Jaggedness appears both across models and inside single models, with high variability across questions and fragmented strengths across domains. This structural feature of current LLM progress can be leveraged via inference-time compute, knowledge pooling, and brainstorming to construct meta-model ensembles that outperform any single model on the benchmark.
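
One way to put a number on the prompt-level unevenness described above is the coefficient of variation of a model's per-question idea counts. This is a minimal sketch under stated assumptions: the source does not say which statistic the authors use, and the model names and counts below are hypothetical.

```python
from statistics import mean, pstdev


def coefficient_of_variation(counts):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(counts)
    if m == 0:
        return 0.0
    return pstdev(counts) / m


def jaggedness_by_model(counts_by_model):
    """Map each model name to the spread of its per-question idea counts."""
    return {
        model: coefficient_of_variation(counts)
        for model, counts in counts_by_model.items()
    }


# Hypothetical per-question counts of valid ideas (not taken from the paper).
counts_by_model = {
    "steady_model": [10, 10, 10, 10],
    "bursty_model": [0, 20, 0, 20],
}
scores = jaggedness_by_model(counts_by_model)
```

Both hypothetical models average ten ideas per question, yet only the second shows bursts on some questions and nothing on others, which is the pattern a mean score alone would hide.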

What carries the argument

Jagged capability profiles of LLMs, where performance varies unevenly across tasks, prompts, and scientific domains and is then combined in meta-model ensembles.
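
The pooling idea can be illustrated with a deduplicated union of idea lists from several models. This is a sketch, not the paper's procedure: the source does not describe how knowledge pooling is implemented, `pool_ideas` and `normalize` are hypothetical helpers, and exact matching after normalization stands in for whatever uniqueness check the benchmark applies.

```python
def normalize(idea):
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(idea.lower().split())


def pool_ideas(ideas_by_model):
    """Return (model, idea) pairs, keeping the first occurrence of each idea."""
    pooled = []
    seen = set()
    for model, ideas in ideas_by_model.items():
        for idea in ideas:
            key = normalize(idea)
            if key not in seen:
                seen.add(key)
                pooled.append((model, idea))
    return pooled


# Hypothetical outputs from three models (not taken from the paper).
ideas_by_model = {
    "model_a": ["perovskite tandem cells", "self-healing polymers", "DNA data storage"],
    "model_b": ["Self-healing polymers", "quantum dot sensors"],
    "model_c": ["DNA data storage", "enzyme-driven recycling", "quantum dot sensors"],
}
pooled = pool_ideas(ideas_by_model)
best_single = max(
    len({normalize(idea) for idea in ideas}) for ideas in ideas_by_model.values()
)
```

A deduplicated union can never contain fewer ideas than its largest member, so the interesting quantity is how much the models' strengths fail to overlap: here the pool holds five ideas against three for the best single model.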

Load-bearing premise

The total number of unique valid ideas generated serves as a reliable proxy for scientific creative potential.
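
To make the premise concrete, the sketch below counts ideas that pass a coherence check and are not near-duplicates of an idea already accepted. Everything specific here is an assumption: the source gives no validity criteria, word-set Jaccard similarity is a crude stand-in for a real similarity measure, and the 0.5 threshold and the `long_enough` filter are illustrative only.

```python
import re


def tokens(text):
    """Lowercased set of alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-set overlap between two ideas, from 0.0 to 1.0."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def count_unique_valid(ideas, is_coherent, max_similarity=0.5):
    """Count ideas that are coherent and not too similar to earlier accepted ones."""
    accepted = []
    for idea in ideas:
        if not is_coherent(idea):
            continue
        if any(jaccard(idea, prev) > max_similarity for prev in accepted):
            continue
        accepted.append(idea)
    return len(accepted), accepted


def long_enough(idea):
    """Hypothetical coherence filter: require at least four distinct words."""
    return len(tokens(idea)) >= 4


# Hypothetical responses to one question (not taken from the paper).
ideas = [
    "Use graphene membranes to desalinate seawater",
    "Use graphene membranes to desalinate sea water",
    "Engineer algae that secrete biodegradable plastic precursors",
    "Cool idea",
]
count, kept = count_unique_valid(ideas, long_enough)
```

The second idea is dropped as a near-duplicate and the fourth fails the coherence filter, so the score is two. How strict these two filters are largely determines whether the count rewards insight or verbosity.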

What would settle it

A follow-up evaluation in which the proposed ensembles produce no more valid ideas than the single best model on a fresh set of scientific questions.
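
Such a follow-up could be scored with a paired bootstrap over per-question differences between the ensemble and the best single model. This is one possible analysis, not one the source reports; the function name, the resample count, and the example numbers are hypothetical.

```python
import random


def paired_bootstrap(ensemble_counts, single_counts, n_resamples=2000, seed=0):
    """Return (mean gain, fraction of resamples where the ensemble is not better)."""
    if len(ensemble_counts) != len(single_counts):
        raise ValueError("both lists must cover the same questions")
    diffs = [e - s for e, s in zip(ensemble_counts, single_counts)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(diffs)
    not_better = 0
    for _ in range(n_resamples):
        # Resample questions with replacement and sum the paired differences.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) <= 0:
            not_better += 1
    return sum(diffs) / n, not_better / n_resamples


# Hypothetical valid-idea counts on four fresh questions (not from the paper).
mean_gain, frac_not_better = paired_bootstrap([12, 9, 15, 7], [10, 8, 11, 6])
```

A fraction near zero would support the ensemble claim on the new questions, while a fraction near one half or higher would be the kind of result that undercuts it.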

Figures

Figures reproduced from arXiv: 2605.10574 by Esther H. R. Tsai, J. Anibal Boscoboinik, Kevin G. Yager, Shray Mathur.

**Figure 2.** Response count ranges across models on SciAidanBench. Each row ...

**Figure 3.** Probability density of response counts for three models of increasing ...

**Figure 4.** Domain-level response profiles for the top five distinct models on ...

**Figure 5.** Relationship between average reasoning token usage and SciAidan...

**Figure 6.** Domain-level performance across SciAidanBench subfields for the ...

Original abstract

As artificial intelligence advances, models are not improving uniformly. Instead, progress unfolds in a jagged fashion, with capabilities growing unevenly across tasks, domains, and model scales. In this work, we examine this dynamic jaggedness through the lens of scientific idea generation. We introduce SciAidanBench, a benchmark of open-ended scientific questions designed to measure the scientific creativity of large language models (LLMs). Given a scientific question, models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential. Evaluating 19 base models across 8 providers (30 total variants including reasoning versions), we find that jaggedness manifests both across models and within models. First, in a cross-task comparison between general and scientific creativity, improvements in general creativity do not translate uniformly to scientific creativity, revealing divergent capability profiles across models. Second, at the prompt level, stronger models do not improve uniformly; instead, they exhibit high variability, with bursts of creativity on some questions and limited performance on others. Third, at the domain level, individual models display uneven strengths across scientific subfields, reflecting fragmented internal capability profiles. Finally, we show that this jaggedness can be harnessed. We explore mechanisms of inference-time compute, knowledge pooling, and brainstorming to combine models effectively and construct meta-model ensembles that outperform any single model. Our results position jaggedness not as a limitation, but as a resource, a structural feature of AI progress that, when understood and leveraged, can amplify LLM-driven scientific creativity.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces SciAidanBench and shows that ensembles can beat single models by exploiting uneven LLM strengths on scientific idea tasks, but the use of valid-response counts as a creativity proxy needs clearer validation.

Colleague, the main point is that this paper builds SciAidanBench to test open-ended scientific idea generation and reports that meta-model ensembles, built through inference-time compute, knowledge pooling, and brainstorming, produce more valid ideas than any individual model by capitalizing on jagged capability patterns. They evaluate 19 base models plus variants and document three kinds of unevenness: general creativity gains do not map evenly onto scientific questions, stronger models still show high prompt-to-prompt variability, and individual models have patchy strengths across scientific subfields. Those empirical patterns and the practical ensemble results are the concrete additions here. The work does a reasonable job of running the same generation prompt across many providers and scales, then testing a few combination strategies that actually move the needle on their count metric.

The soft spot sits right at the measurement step. The proxy is simply the total number of unique and coherent ideas, yet the abstract gives no operational definition of validity, no inter-annotator numbers, and no correlation check against expert ratings for novelty or feasibility. If the count mostly tracks verbosity or loose coherence filters instead of deeper scientific insight, then the jaggedness observations and the ensemble gains do not yet demonstrate that jaggedness unlocks genuine creativity.

This is aimed at people working on AI-assisted research ideation or on mapping capability distributions in LLMs. A reader who wants a domain-specific benchmark and some initial ensemble numbers will find usable starting points, even if the evaluation needs tightening. The paper is coherent enough on its own terms and introduces a fresh benchmark with reproducible-looking experiments, so it deserves a serious referee rather than a desk reject. I would send it out for review and ask the authors to add explicit validity criteria plus at least a small expert rating study to ground the proxy.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SciAidanBench, a benchmark for assessing scientific creativity in LLMs by prompting models to generate as many unique and coherent ideas as possible for open-ended scientific questions, using the count of valid responses as a proxy for creative potential. It evaluates 19 base models across 8 providers (30 total variants, including reasoning versions) and identifies jaggedness in capabilities across tasks (general vs. scientific), across prompts, and across domains. It then proposes harnessing this jaggedness through meta-model ensembles built via inference-time compute, knowledge pooling, and brainstorming, and claims that these ensembles outperform single models.

Significance. Should the count of valid ideas prove to be a reliable indicator of scientific creativity, this work would demonstrate that non-uniform progress in LLMs can be leveraged through ensemble methods to enhance idea generation in science. The broad evaluation across multiple models and the exploration of combination strategies could inform future approaches to AI-supported scientific discovery by emphasizing diversity over monolithic scaling.

major comments (2)

The abstract describes the evaluation of 19 models and ensemble methods but provides no details on the criteria for determining 'valid' responses, inter-annotator agreement, statistical tests for the observed patterns, or controls for prompt sensitivity. This is a load-bearing issue because the jaggedness findings and the superiority of ensembles are all quantified using this unelaborated proxy.
The central assumption that the total number of valid responses serves as a proxy for creative potential lacks any reported correlation to external measures such as expert-assessed novelty, feasibility, or impact. If higher counts primarily capture verbosity or tolerance for repetition rather than deeper insight, the claims about harnessing jaggedness for scientific creativity do not follow from the presented results.
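
The correlation check requested above could take the form of a Spearman rank correlation between idea counts and expert scores. The sketch is self-contained and uses average ranks for ties; the data are hypothetical placeholders, since the source reports no expert ratings.

```python
def average_ranks(values):
    """Rank values from 1 upward, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of positions i..j, converted to 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-model idea counts and mean expert novelty ratings.
idea_counts = [42, 17, 30, 8, 25]
expert_novelty = [3.9, 2.1, 3.0, 2.4, 3.5]
rho = spearman(idea_counts, expert_novelty)
```

A rank correlation is a reasonable first choice here because counts and ratings sit on different scales, and only the ordering of models needs to agree for the proxy to be useful.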

minor comments (1)

Consider adding a sentence on the scale of SciAidanBench (e.g., number of questions or domains covered) to provide immediate context for the evaluation scope.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our evaluation methodology. We address each major comment below and indicate the revisions we plan to incorporate.

Referee: The abstract describes the evaluation of 19 models and ensemble methods but provides no details on the criteria for determining 'valid' responses, inter-annotator agreement, statistical tests for the observed patterns, or controls for prompt sensitivity. This is a load-bearing issue because the jaggedness findings and the superiority of ensembles are all quantified using this unelaborated proxy.

Authors: We agree that the abstract should be more self-contained. In the revised version we will expand the abstract to briefly state the validity criteria (unique and coherent ideas, per the benchmark definition in Section 3), note that responses were annotated by multiple raters with inter-annotator agreement reported in the main text, mention that statistical tests were applied to the observed jaggedness patterns, and indicate that prompt-sensitivity controls were included in the experimental design. These elements already exist in the body of the paper; the revision will surface them in the abstract as well.

revision: yes

Referee: The central assumption that the total number of valid responses serves as a proxy for creative potential lacks any reported correlation to external measures such as expert-assessed novelty, feasibility, or impact. If higher counts primarily capture verbosity or tolerance for repetition rather than deeper insight, the claims about harnessing jaggedness for scientific creativity do not follow from the presented results.

Authors: We acknowledge the value of external validation. The benchmark definition already requires both uniqueness and coherence precisely to limit the influence of verbosity or repetition; these constraints are detailed in Section 3. Nevertheless, we accept that an explicit correlation study with expert ratings on novelty and feasibility would strengthen the proxy interpretation. We will therefore add a targeted validation experiment (or, if space is limited, a clear limitations paragraph) that reports such correlations on a sampled subset of ideas. Until that analysis is complete we will also revise the language to emphasize that the count serves as a proxy under the stated constraints rather than a direct measure of insight.

revision: partial

Circularity Check

0 steps flagged

No circularity: empirical benchmark with independent measurements

full rationale

The paper is an empirical study that defines SciAidanBench, instructs models to generate ideas, counts valid responses as a proxy metric, evaluates 19 models across tasks/domains, and reports observed performance differences plus ensemble gains. No equations, derivations, or first-principles claims exist that reduce by construction to fitted parameters, self-definitions, or self-citations. The proxy is an explicit operational choice whose validity can be assessed externally; the reported patterns and ensemble results are direct measurements against that metric rather than any loop that forces the outcome from the inputs. This is a standard benchmark evaluation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claims rest on empirical measurement choices and the assumption that idea count proxies creativity; no new theoretical entities or derivations are introduced.

free parameters (2)

Model selection and variants
Choice of which 19 base models and 30 total variants to evaluate is a selection decision that affects reported jaggedness patterns.
Validity and uniqueness criteria
Rules for counting an idea as valid and unique are not detailed in the abstract yet determine the creativity scores.

axioms (1)

Domain assumption: The total number of unique coherent ideas generated serves as a valid proxy for scientific creative potential.
Explicitly stated in the abstract as the measurement basis for the benchmark.

pith-pipeline@v0.9.0 · 5823 in / 1324 out tokens · 67298 ms · 2026-05-21T08:13:57.184772+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "models are asked to generate as many unique and coherent ideas as possible, with the total number of valid responses serving as a proxy for creative potential"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "jaggedness manifests both across models and within models... domain-level response profiles... uneven strengths"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Fabrizio Dell’Acqua, Edward McFowland III, Ethan R Mollick, Hila Lifshitz-Assaf, Katherine Kellogg, Saran Rajendran, Lisa Krayer, Fran¸ cois Candelon, and Karim R Lakhani. Navigating the jagged technological fron- tier: Field experimental evidence of the effects of ai on knowledge worker productivity and quality.Harvard Business School Technology & Opera-...

work page 2023

[2] [2]

Inverse scaling: When bigger isn’t better.arXiv [cs.CL], June 2023

Ian R McKenzie, Alexander Lyzhov, Michael Pieler, Alicia Parrish, Aaron Mueller, Ameya Prabhu, Euan McLean, Aaron Kirtland, Alexis Ross, Alisa Liu, Andrew Gritsevskiy, Daniel Wurgaft, Derik Kauffman, Gabriel Rec- chia, Jiacheng Liu, Joe Cavanagh, Max Weiss, Sicong Huang, The Floating Droid, Tom Tseng, Tomasz Korbak, Xudong Shen, Yuhui Zhang, Zheng- ping Z...

work page 2023

[3] [3]

Inverse scaling in test-time compute.arXiv [cs.AI], July 2025

Aryo Pradipta Gema, Alexander H¨ agele, Runjin Chen, Andy Arditi, Jacob Goldman-Wetzler, Kit Fraser-Taliente, Henry Sleight, Linda Petrini, Julian Michael, Beatrice Alex, Pasquale Minervini, Yanda Chen, Joe Benton, and Ethan Perez. Inverse scaling in test-time compute.arXiv [cs.AI], July 2025. 16

work page 2025

[4] [4]

The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity.arXiv [cs.AI], July 2025

Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar. The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity.arXiv [cs.AI], July 2025

work page 2025

[5] [5]

Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity.arXiv [cs.CL], October 2025

Jiayi Zhang, Simon Yu, Derek Chong, Anthony Sicilia, Michael R Tomz, Christopher D Manning, and Weiyan Shi. Verbalized sampling: How to mitigate mode collapse and unlock LLM diversity.arXiv [cs.CL], October 2025

work page 2025

[6] [6]

AidanBench: Evaluating novel idea generation on open-ended questions

Aidan McLaughlin, James Campbell, and Anuja Uppuluri. AidanBench: Evaluating novel idea generation on open-ended questions

work page

[7] [7]

Sciaidanbench: Evaluating llm scientific creativity

Shray Mathur, Noah van der Vleuten, J Anibal Boscoboinik, Esther Tsai, and Kevin Yager. Sciaidanbench: Evaluating llm scientific creativity. In New York Scientific Data Summit 2025: Powering the Future of Science with Artificial Intelligence, pages 25–28. SIAM, 2025

work page 2025

[8] [8]

The AI scientist: Towards fully automated open-ended scientific discovery.arXiv [cs.AI], August 2024

Chris Lu, Cong Lu, Robert Tjarko Lange, Jakob Foerster, Jeff Clune, and David Ha. The AI scientist: Towards fully automated open-ended scientific discovery.arXiv [cs.AI], August 2024

work page 2024

[9] [9]

ResearchTown: Simulator of human research community.arXiv [cs.CL], December 2024

Haofei Yu, Zhaochen Hong, Zirui Cheng, Kunlun Zhu, Keyang Xuan, Jinwei Yao, Tao Feng, and Jiaxuan You. ResearchTown: Simulator of human research community.arXiv [cs.CL], December 2024

work page 2024

[10] [10]

Large language models as biomedical hypothesis generators: A comprehensive evaluation.arXiv [cs.CL], July 2024

Biqing Qi, Kaiyan Zhang, Kai Tian, Haoxiang Li, Zhang-Ren Chen, Sihang Zeng, Ermo Hua, Hu Jinfang, and Bowen Zhou. Large language models as biomedical hypothesis generators: A comprehensive evaluation.arXiv [cs.CL], July 2024

work page 2024

[11] [11]

Towards a unified framework for reference retrieval and related work generation

Zhengliang Shi, Shen Gao, Zhen Zhang, Xiuying Chen, Zhumin Chen, Pengjie Ren, and Zhaochun Ren. Towards a unified framework for reference retrieval and related work generation. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 5785–5799, 2023

work page 2023

[12] [12]

Chime: Llm-assisted hierarchical organization of scientific studies for literature review support

Chao-Chun Hsu, Erin Bransom, Jenna Sparks, Bailey Kuehl, Chenhao Tan, David Wadden, Lucy Lu Wang, and Aakanksha Naik. Chime: Llm-assisted hierarchical organization of scientific studies for literature review support. arXiv preprint arXiv:2407.16148, 2024

work page arXiv 2024

[13] [13]

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 3(4), 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] Kaixuan Huang, Yuanhao Qu, Henry Cousins, William A Johnson, Di Yin, Mihir Shah, Denny Zhou, Russ Altman, Mengdi Wang, and Le Cong. CRISPR-GPT: An LLM agent for automated design of gene-editing experiments. arXiv preprint arXiv:2404.18021, 2024

[15] Shray Mathur, Noah van der Vleuten, Kevin G Yager, and Esther HR Tsai. VISION: a modular AI assistant for natural human-instrument interaction at scientific user facilities. Machine Learning: Science and Technology, 6(2):025051, 2025

[16] John Joon Young Chung, Vishakh Padmakumar, Melissa Roemmele, Yuqian Sun, and Max Kreminski. Modifying large language model post-training for diverse creative writing. arXiv preprint arXiv:2503.17126, 2025

[17] Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, et al. Sotopia: Interactive evaluation for social intelligence in language agents. arXiv preprint arXiv:2310.11667, 2023

[18] Haw-Shiuan Chang, Nanyun Peng, Mohit Bansal, Anil Ramakrishna, and Tagyoung Chung. Real sampling: Boosting factuality and diversity of open-ended generation by extrapolating the entropy of an infinitely large LM. Transactions of the Association for Computational Linguistics, 13:760–783, 2025

[19] Minh Nhat Nguyen, Andrew Baker, Clement Neo, Allen Roush, Andreas Kirsch, and Ravid Shwartz-Ziv. Turning up the heat: Min-p sampling for creative and coherent LLM outputs. arXiv preprint arXiv:2407.01082, 2024

[20] Haoyang Su, Renqi Chen, Shixiang Tang, Zhenfei Yin, Xinzhe Zheng, Jinzhe Li, Biqing Qi, Qi Wu, Hui Li, Wanli Ouyang, Philip Torr, Bowen Zhou, and Nanqing Dong. Many heads are better than one: Improved scientific idea generation by a LLM-based multi-agent system. arXiv [cs.AI], October 2024

[21] Marissa Radensky, Simra Shahid, Raymond Fok, Pao Siangliulue, Tom Hope, and Daniel S Weld. Scideator: Human-LLM scientific idea generation grounded in research-paper facet recombination. arXiv [cs.HC], September 2024

[22] Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. ResearchAgent: Iterative research idea generation over scientific literature with large language models. arXiv [cs.CL], April 2024

[23] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

[24] Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

[25] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022

[26] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

[27] Lu Hong and Scott E Page. Groups of diverse problem solvers can outperform groups of high-ability problem solvers. Proceedings of the National Academy of Sciences, 101(46):16385–16389, 2004

[28] Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. In Forty-first international conference on machine learning, 2024

[29] Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 17889–17904, 2024

[30] Fan Wu, Emily Black, and Varun Chandrasekaran. Generative monoculture in large language models. arXiv preprint arXiv:2407.02209, 2024

[31] Pulin Agrawal and Prasoon Goyal. Addressing LLM diversity by infusing random concepts. arXiv preprint arXiv:2601.18053, 2026

[32] Qihan Wang, Shidong Pan, Tal Linzen, and Emily Black. Multilingual prompting for improving LLM generation diversity. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6378–6400, 2025