pith. the verified trust layer for science.

arxiv: 2509.11295 · v2 · submitted 2025-09-14 · 💻 cs.CL

The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

Pith reviewed 2026-05-18 16:34 UTC · model grok-4.3

classification 💻 cs.CL
keywords prompt engineering · large language models · life sciences · zero-shot prompting · few-shot prompting · self-criticism · task decomposition · literature summarization

The pith

Life sciences researchers can achieve substantial efficiency gains by mastering six core prompt engineering techniques for common workflows like summarization and data extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distills a large report on prompt engineering down to six techniques that target the repetitive tasks biologists and medical researchers do with language models. These techniques are zero-shot prompting, few-shot examples, thought generation, ensembling multiple outputs, self-criticism, and task decomposition. The authors show how to apply each one to literature reviews, data extraction from papers, and document editing while warning about pitfalls such as hallucination and loss of quality in long conversations. They argue that the upfront effort to learn these structured approaches pays off quickly because it reduces trial-and-error time and improves output reliability across different models. The result is a practical guide that moves prompting from ad-hoc use to a repeatable part of research practice.
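
As a minimal sketch of how the first two techniques differ in practice, the Python helper below assembles a zero-shot prompt when no examples are supplied and a few-shot prompt when worked examples are passed in. The function name `build_prompt`, the `Input:`/`Output:` labels, and the layout are illustrative assumptions, not templates taken from the paper.

```python
from typing import Sequence, Tuple


def build_prompt(
    instruction: str,
    query: str,
    examples: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt: zero-shot if `examples` is empty, few-shot otherwise.

    The labels and ordering are assumptions made for this sketch.
    """
    parts = [instruction.strip()]
    # Each worked example is rendered as an input/output pair.
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    # The real query comes last, with the output left open for the model.
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)
```

Calling `build_prompt("Summarize the abstract in one sentence.", abstract_text)` would produce a zero-shot prompt; passing a list of `(abstract, summary)` pairs as `examples` turns the same call into a few-shot prompt.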

Core claim

The paper selects six techniques (zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition) and grounds them in life-sciences use cases. Combined with explicit rules for prompt structure and warnings about multi-turn degradation, hallucinations, and model differences, this shows how researchers can move from opportunistic prompting to a low-friction systematic practice that raises output quality and delivers net time savings.

What carries the argument

The six distilled techniques (zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition) that organize prompt construction for repeated life-sciences tasks and include explicit do-and-don't guidance plus model-specific caveats.

If this is right

  • Structured prompts using these techniques reduce hallucinations during extraction of experimental results from papers.
  • Self-criticism and ensembling steps improve consistency when editing research drafts or grant sections.
  • Decomposition breaks complex data-processing jobs into smaller LLM calls that fit within context limits.
  • Awareness of reasoning versus non-reasoning model differences prevents wasted effort on unsuitable tasks.
  • Use of the techniques augments rather than replaces existing data-processing and editing habits.
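
The ensembling point above can be sketched as a simple majority vote over answers collected from independent runs of the same prompt. This is one basic variant, shown under stated assumptions: the function name `majority_vote` and the whitespace and case normalization are choices made for this sketch, and the paper may describe other aggregation schemes.

```python
from collections import Counter
from typing import Sequence, Tuple


def majority_vote(answers: Sequence[str]) -> Tuple[str, float]:
    """Return the most common answer and the fraction of runs that agree.

    Answers are compared after stripping whitespace and lowercasing,
    which is an assumption of this sketch.
    """
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [answer.strip().lower() for answer in answers]
    winner, count = Counter(normalized).most_common(1)[0]
    return winner, count / len(normalized)
```

A low agreement fraction is a useful signal in its own right: it suggests the extraction is unstable and the value should be checked against the source by hand.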

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same six-technique skeleton could be re-grounded in other domains such as materials science or clinical trial reporting with only minor example changes.
  • Systematic prompting of this kind might serve as a lightweight alternative to building custom agents for many routine analysis steps.
  • If the techniques scale to newer models, they could become a standard training module for graduate students entering data-heavy fields.

Load-bearing premise

That these six techniques will be sufficient for most life-sciences workflows and that following the structure and pitfall advice will reliably improve LLM output quality across different models and tasks.

What would settle it

A controlled test in which researchers perform the same literature summarization or data-extraction task with and without the six techniques and measure both total time spent and an independent quality score; if the structured prompts show no clear time or quality advantage, the efficiency claim does not hold.
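
A rough sketch of how such a paired comparison could be summarized follows. It assumes each participant performs the task once without and once with the techniques, and that one number per condition is recorded (minutes spent, or a quality score). The function name `paired_summary` and the returned fields are hypothetical; a real study would add a proper statistical test.

```python
from statistics import mean
from typing import Dict, Sequence


def paired_summary(
    baseline: Sequence[float], structured: Sequence[float]
) -> Dict[str, float]:
    """Summarize per-participant differences (structured minus baseline)."""
    if not baseline or len(baseline) != len(structured):
        raise ValueError("need two non-empty sequences of equal length")
    diffs = [s - b for b, s in zip(baseline, structured)]
    return {
        "mean_diff": mean(diffs),
        # Participants whose value went down (good for time, bad for quality).
        "n_lower": sum(d < 0 for d in diffs),
        # Participants whose value went up (bad for time, good for quality).
        "n_higher": sum(d > 0 for d in diffs),
    }
```

For time measurements a negative `mean_diff` favors the structured prompts; for quality scores a positive one does. If both summaries sit near zero, the efficiency claim would not be supported.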

Original abstract

Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We break down the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. We provide detailed recommendations for how prompts should and shouldn't be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations and agentic tools like Claude Code, and analyze the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms, discussing current limitations. We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper distills a 2025 report containing 58 prompt engineering techniques into six core approaches (zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition). It supplies structured templates, use-case examples for life-sciences tasks such as literature summarization and data extraction, pitfall warnings (multi-turn degradation, hallucinations), platform comparisons (Deep Research tools, context windows), and guidance on reasoning versus non-reasoning models, claiming that systematic adoption of these techniques will produce efficiency gains that substantially exceed the initial mastery cost.

Significance. If the guidance proves effective in practice, the manuscript supplies a concise, actionable quick-start resource that could lower the barrier for life-sciences researchers to move from ad-hoc to systematic LLM prompting. The explicit treatment of pitfalls, model distinctions, and tool limitations is a practical strength; however, the absence of any quantitative validation or controlled measurements restricts the work to the status of an advisory tutorial rather than a validated methodology.

major comments (2)
  1. Abstract: The central claim that 'researchers could achieve substantial efficiency gains that far exceed the initial time investment' is unsupported by any empirical data. The manuscript contains no time logs, accuracy benchmarks, before/after comparisons, or user studies on life-sciences tasks; the qualifiers 'substantial' and 'far exceed' therefore rest on untested extrapolation from descriptive examples.
  2. Use-cases and recommendations sections: The assertion that the six selected techniques are broadly sufficient for life-sciences workflows (literature summarization, data extraction, editorial tasks) is presented without explicit justification for why these six were chosen over the remaining 52 techniques from the source report and without a coverage analysis for common tasks such as statistical analysis scripting or experimental design.
minor comments (2)
  1. The manuscript would benefit from a short table summarizing the six techniques, their typical prompt structures, and the specific life-sciences use cases to which each is applied.
  2. Platform comparisons (OpenAI, Google, Anthropic, Perplexity) would be clearer if accompanied by one or two concrete prompt-output examples illustrating differences in handling the same life-sciences query.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major point below, agreeing that the claims require qualification and that selection criteria should be made explicit. We will implement revisions accordingly while maintaining the manuscript's focus as an advisory guide.

  1. Referee: Abstract: The central claim that 'researchers could achieve substantial efficiency gains that far exceed the initial time investment' is unsupported by any empirical data. The manuscript contains no time logs, accuracy benchmarks, before/after comparisons, or user studies on life-sciences tasks; the qualifiers 'substantial' and 'far exceed' therefore rest on untested extrapolation from descriptive examples.

    Authors: We agree the manuscript presents no new empirical measurements or controlled studies, as its scope is to distill the 2025 report and illustrate application to life-sciences tasks. The efficiency statement was intended as a forward-looking observation drawn from the use-case examples rather than a validated result. In revision we will replace the phrasing with 'may achieve efficiency gains that exceed the initial investment, as illustrated by the structured examples' and add a sentence in the introduction noting the advisory character of the work and absence of quantitative benchmarks. revision: yes

  2. Referee: Use-cases and recommendations sections: The assertion that the six selected techniques are broadly sufficient for life-sciences workflows (literature summarization, data extraction, editorial tasks) is presented without explicit justification for why these six were chosen over the remaining 52 techniques from the source report and without a coverage analysis for common tasks such as statistical analysis scripting or experimental design.

    Authors: The six techniques were chosen because they map to the principal categories in the source report (direct prompting, example-based, reasoning augmentation, and iterative improvement) and together address the majority of routine LLM interactions. We will add a short subsection explaining this selection rationale and the coverage it provides. We will also extend the use-cases section with brief illustrations for statistical scripting (via decomposition) and experimental design (via thought generation plus self-criticism) to demonstrate applicability beyond the original examples. revision: yes
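
The decomposition idea mentioned in this response can be sketched generically: a task is split into ordered sub-problems, and each answer is fed into the prompt for the next step. This is not the authors' planned illustration. The `llm` callable stands in for whatever model client is used, and the function name and prompt wording are assumptions of this sketch.

```python
from typing import Callable, List, Sequence


def solve_by_decomposition(
    task: str,
    sub_problems: Sequence[str],
    llm: Callable[[str], str],
) -> List[str]:
    """Solve sub-problems in order, passing earlier answers to later prompts.

    `llm` is a placeholder: any function that maps a prompt to a response.
    """
    solutions: List[str] = []
    for index, sub_problem in enumerate(sub_problems, start=1):
        prompt = f"Overall task: {task}\n"
        # Include every earlier result so the current step can build on it.
        for step, solution in enumerate(solutions, start=1):
            prompt += f"Step {step} result: {solution}\n"
        prompt += f"Current step {index}: {sub_problem}"
        solutions.append(llm(prompt))
    return solutions
```

Because each call receives a fresh, fully specified prompt, this layout also sidesteps the multi-turn degradation the paper warns about, at the cost of resending earlier results on every step.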

Circularity Check

0 steps flagged

No significant circularity detected

The paper distills six prompt engineering techniques from an external 2025 report and offers practical templates, pitfall guidance, and platform comparisons for life sciences tasks such as literature summarization. Its central claim about efficiency gains is presented as a qualitative expectation rather than a derived prediction from fitted parameters or equations. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear; the manuscript references standard LLM behaviors and the external report without reducing any result to its own inputs by construction. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that structured prompting reliably improves LLM performance for text-processing tasks in life sciences; no free parameters, invented entities, or additional ad-hoc axioms are introduced.

axioms (1)
  • domain assumption: Structured prompt engineering techniques can reliably improve the quality and reliability of LLM responses for life sciences workflows such as summarization and data extraction.
    Invoked when the abstract claims substantial efficiency gains and addresses pitfalls like hallucinations and multi-turn degradation.

pith-pipeline@v0.9.0 · 5792 in / 1331 out tokens · 81821 ms · 2026-05-18T16:34:34.082266+00:00 · methodology
