arxiv: 2509.11295 · v2 · submitted 2025-09-14 · 💻 cs.CL

The Prompt Engineering Report Distilled: Quick Start Guide for Life Sciences

Valentin Romanov, Steven A Niederer

Pith reviewed 2026-05-18 16:34 UTC · model grok-4.3

keywords prompt engineering · large language models · life sciences · zero-shot prompting · few-shot prompting · self-criticism · task decomposition · literature summarization

The pith

Life sciences researchers can achieve substantial efficiency gains by mastering six core prompt engineering techniques for common workflows like summarization and data extraction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper distills a large report on prompt engineering down to six techniques that target the repetitive tasks biologists and medical researchers do with language models. These techniques are zero-shot prompting, few-shot examples, thought generation, ensembling multiple outputs, self-criticism, and task decomposition. The authors show how to apply each one to literature reviews, data extraction from papers, and document editing while warning about pitfalls such as hallucination and loss of quality in long conversations. They argue that the upfront effort to learn these structured approaches pays off quickly because it reduces trial-and-error time and improves output reliability across different models. The result is a practical guide that moves prompting from ad-hoc use to a repeatable part of research practice.
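
Four of the six techniques are mostly a matter of how the prompt text is assembled. The sketch below is a minimal illustration under stated assumptions, not the paper's own templates: the function names, the `Input:`/`Output:` labels, and the exact wording of the added instructions are all hypothetical choices made for this example.

```python
# Illustrative prompt builders; names and wording are assumptions, not taken from the paper.

def zero_shot(task: str, text: str) -> str:
    """Instruction plus input, with no worked examples."""
    return f"{task}\n\nText:\n{text}"


def few_shot(task: str, examples: list[tuple[str, str]], text: str) -> str:
    """Instruction first, then worked input/output pairs, then the new input."""
    shots = "\n\n".join(f"Input: {src}\nOutput: {out}" for src, out in examples)
    return f"{task}\n\n{shots}\n\nInput: {text}\nOutput:"


def thought_generation(task: str, text: str) -> str:
    """Zero-shot prompt that also asks for intermediate reasoning."""
    return zero_shot(task, text) + "\n\nThink step by step before giving the final answer."


def self_criticism(draft: str) -> str:
    """Follow-up prompt asking the model to review and correct an earlier draft."""
    return (
        "Review the draft below for factual errors, unsupported claims, and omissions. "
        "List each problem, then give a corrected version.\n\n"
        f"Draft:\n{draft}"
    )
```

Each builder returns a plain string, so the same prompt can be pasted into a chat interface or sent through whichever API a lab already uses.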

Core claim

By selecting and grounding six techniques—zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition—in life-sciences use cases and by supplying explicit rules for prompt structure plus warnings on multi-turn degradation, hallucinations, and model differences, the paper shows how researchers can move from opportunistic prompting to a low-friction systematic practice that raises output quality and delivers net time savings.

What carries the argument

The six distilled techniques (zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition) that organize prompt construction for repeated life-sciences tasks and include explicit do-and-don't guidance plus model-specific caveats.

If this is right

Structured prompts using these techniques reduce hallucinations during extraction of experimental results from papers.
Self-criticism and ensembling steps improve consistency when editing research drafts or grant sections.
Decomposition breaks complex data-processing jobs into smaller LLM calls that fit within context limits.
Awareness of reasoning versus non-reasoning model differences prevents wasted effort on unsuitable tasks.
Use of the techniques augments rather than replaces existing data-processing and editing habits.
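
Ensembling and decomposition are orchestration steps around the model call rather than prompt wording. The sketch below shows one simple form of each, assuming answers are compared as exact strings and that a word count is an acceptable stand-in for a token count. Both helper names are hypothetical, and the paper may describe richer variants.

```python
from collections import Counter


def majority_vote(answers: list[str]) -> tuple[str, float]:
    """Return the most common answer and the share of runs that agree with it."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(a.strip() for a in answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


def split_into_chunks(paragraphs: list[str], max_words: int) -> list[list[str]]:
    """Greedily group paragraphs so each group stays within a word budget.

    Words are only a rough proxy for tokens (an assumption of this sketch).
    A single paragraph longer than the budget still becomes its own chunk.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for para in paragraphs:
        n = len(para.split())
        if current and used + n > max_words:
            chunks.append(current)
            current, used = [], 0
        current.append(para)
        used += n
    if current:
        chunks.append(current)
    return chunks
```

A low agreement share from `majority_vote` is a useful signal in itself: it marks an extraction that should be checked against the source by hand rather than accepted on consensus.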

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same six-technique skeleton could be re-grounded in other domains such as materials science or clinical trial reporting with only minor example changes.
Systematic prompting of this kind might serve as a lightweight alternative to building custom agents for many routine analysis steps.
If the techniques scale to newer models, they could become a standard training module for graduate students entering data-heavy fields.

Load-bearing premise

That these six techniques will be sufficient for most life-sciences workflows and that following the structure and pitfall advice will reliably improve LLM output quality across different models and tasks.

What would settle it

A controlled test in which researchers perform the same literature summarization or data-extraction task with and without the six techniques and measure both total time spent and an independent quality score; if the structured prompts show no clear time or quality advantage, the efficiency claim does not hold.
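
One way to summarize such a test, assuming each participant performs the task once with their usual prompts and once with the structured ones, is a paired difference per participant. The function name and the example numbers below are hypothetical; nothing here is a result from the paper.

```python
from statistics import mean


def paired_summary(
    baseline: list[float],
    structured: list[float],
    higher_is_better: bool = True,
) -> dict[str, float]:
    """Summarize per-participant differences (structured minus baseline)."""
    if len(baseline) != len(structured) or not baseline:
        raise ValueError("need equal-length, non-empty lists")
    diffs = [s - b for b, s in zip(baseline, structured)]
    sign = 1 if higher_is_better else -1
    return {
        "mean_diff": mean(diffs),
        # Share of participants who did better with the structured prompts.
        "share_improved": sum(sign * d > 0 for d in diffs) / len(diffs),
    }


# Hypothetical minutes-per-task for four participants (lower is better).
minutes = paired_summary(
    [30.0, 40.0, 50.0, 20.0],
    [20.0, 35.0, 55.0, 10.0],
    higher_is_better=False,
)
```

The same function can be run on the independent quality scores with `higher_is_better=True`, so that time and quality are reported side by side rather than folded into a single number.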

Original abstract

Developing effective prompts demands significant cognitive investment to generate reliable, high-quality responses from Large Language Models (LLMs). By deploying case-specific prompt engineering techniques that streamline frequently performed life sciences workflows, researchers could achieve substantial efficiency gains that far exceed the initial time investment required to master these techniques. The Prompt Report published in 2025 outlined 58 different text-based prompt engineering techniques, highlighting the numerous ways prompts could be constructed. To provide actionable guidelines and reduce the friction of navigating these various approaches, we distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition. We break down the significance of each approach and ground it in use cases relevant to life sciences, from literature summarization and data extraction to editorial tasks. We provide detailed recommendations for how prompts should and shouldn't be structured, addressing common pitfalls including multi-turn conversation degradation, hallucinations, and distinctions between reasoning and non-reasoning models. We examine context window limitations, agentic tools like Claude Code, while analyzing the effectiveness of Deep Research tools across OpenAI, Google, Anthropic and Perplexity platforms, discussing current limitations. We demonstrate how prompt engineering can augment rather than replace existing established individual practices around data processing and document editing. Our aim is to provide actionable guidance on core prompt engineering principles, and to facilitate the transition from opportunistic prompting to an effective, low-friction systematic practice that contributes to higher quality research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a practical distillation of six prompt techniques for life sciences tasks, but the efficiency claims rest on reasoning alone with no measurements or tests.

The main point is that this paper narrows a 2025 report on 58 prompt methods down to six core ones and shows how they might fit life sciences work like paper summarization and data extraction. It gives structured examples, platform notes, and warnings on hallucinations or multi-turn issues. That organization is the useful part: it turns a long list into something quicker to apply without forcing readers to hunt through everything themselves. The life-sciences examples feel relevant and the advice to treat these tools as supplements to normal practices is straightforward and fair. Credit for keeping the scope narrow and focused on common researcher pain points rather than claiming broad new theory.

The soft spot is the lack of any data. The abstract and description promise substantial efficiency gains that exceed the upfront cost, yet there are no time logs, accuracy checks, user trials, or before-and-after comparisons. The recommendations on prompt structure and pitfalls are reasonable on general grounds, but the size of the claimed benefit stays untested. This leaves the central practical pitch as an extrapolation rather than an observed result.

The paper is aimed at life scientists who already use LLMs for writing and analysis and want a short reference instead of trial-and-error. A reader in that position could pick up usable templates and cautions quickly. It is less interesting for anyone looking for new techniques or rigorous evaluation.

I would not push for full peer review in its current form because the work is mainly a domain application without fresh evidence. A lighter check for clarity on the examples might help if the venue wants applied notes, but otherwise it reads as ready for preprint sharing.

Referee Report

2 major / 2 minor

Summary. The paper distills a 2025 report containing 58 prompt engineering techniques into six core approaches (zero-shot, few-shot, thought generation, ensembling, self-criticism, and decomposition). It supplies structured templates, use-case examples for life-sciences tasks such as literature summarization and data extraction, pitfall warnings (multi-turn degradation, hallucinations), platform comparisons (Deep Research tools, context windows), and guidance on reasoning versus non-reasoning models, claiming that systematic adoption of these techniques will produce efficiency gains that substantially exceed the initial mastery cost.

Significance. If the guidance proves effective in practice, the manuscript supplies a concise, actionable quick-start resource that could lower the barrier for life-sciences researchers to move from ad-hoc to systematic LLM prompting. The explicit treatment of pitfalls, model distinctions, and tool limitations is a practical strength; however, the absence of any quantitative validation or controlled measurements restricts the work to the status of an advisory tutorial rather than a validated methodology.

major comments (2)

Abstract: The central claim that 'researchers could achieve substantial efficiency gains that far exceed the initial time investment' is unsupported by any empirical data. The manuscript contains no time logs, accuracy benchmarks, before/after comparisons, or user studies on life-sciences tasks; the qualifiers 'substantial' and 'far exceed' therefore rest on untested extrapolation from descriptive examples.
Use-cases and recommendations sections: The assertion that the six selected techniques are broadly sufficient for life-sciences workflows (literature summarization, data extraction, editorial tasks) is presented without explicit justification for why these six were chosen over the remaining 52 techniques from the source report or without coverage analysis for common tasks such as statistical analysis scripting or experimental design.

minor comments (2)

The manuscript would benefit from a short table summarizing the six techniques, their typical prompt structures, and the specific life-sciences use cases to which each is applied.
Platform comparisons (OpenAI, Google, Anthropic, Perplexity) would be clearer if accompanied by one or two concrete prompt-output examples illustrating differences in handling the same life-sciences query.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We respond to each major point below, agreeing that the claims require qualification and that selection criteria should be made explicit. We will implement revisions accordingly while maintaining the manuscript's focus as an advisory guide.

Referee: Abstract: The central claim that 'researchers could achieve substantial efficiency gains that far exceed the initial time investment' is unsupported by any empirical data. The manuscript contains no time logs, accuracy benchmarks, before/after comparisons, or user studies on life-sciences tasks; the qualifiers 'substantial' and 'far exceed' therefore rest on untested extrapolation from descriptive examples.

Authors: We agree the manuscript presents no new empirical measurements or controlled studies, as its scope is to distill the 2025 report and illustrate application to life-sciences tasks. The efficiency statement was intended as a forward-looking observation drawn from the use-case examples rather than a validated result. In revision we will replace the phrasing with 'may achieve efficiency gains that exceed the initial investment, as illustrated by the structured examples' and add a sentence in the introduction noting the advisory character of the work and absence of quantitative benchmarks.

revision: yes

Referee: Use-cases and recommendations sections: The assertion that the six selected techniques are broadly sufficient for life-sciences workflows (literature summarization, data extraction, editorial tasks) is presented without explicit justification for why these six were chosen over the remaining 52 techniques from the source report or without coverage analysis for common tasks such as statistical analysis scripting or experimental design.

Authors: The six techniques were chosen because they map to the principal categories in the source report (direct prompting, example-based, reasoning augmentation, and iterative improvement) and together address the majority of routine LLM interactions. We will add a short subsection explaining this selection rationale and the coverage it provides. We will also extend the use-cases section with brief illustrations for statistical scripting (via decomposition) and experimental design (via thought generation plus self-criticism) to demonstrate applicability beyond the original examples.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The paper distills six prompt engineering techniques from an external 2025 report and offers practical templates, pitfall guidance, and platform comparisons for life sciences tasks such as literature summarization. Its central claim about efficiency gains is presented as a qualitative expectation rather than a derived prediction from fitted parameters or equations. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear; the manuscript references standard LLM behaviors and the external report without reducing any result to its own inputs by construction. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on the domain assumption that structured prompting reliably improves LLM performance for text-processing tasks in life sciences; no free parameters, invented entities, or additional ad-hoc axioms are introduced.

axioms (1)

domain assumption: Structured prompt engineering techniques can reliably improve the quality and reliability of LLM responses for life sciences workflows such as summarization and data extraction.
Invoked when the abstract claims substantial efficiency gains and addresses pitfalls like hallucinations and multi-turn degradation.

pith-pipeline@v0.9.0 · 5792 in / 1331 out tokens · 81821 ms · 2026-05-18T16:34:34.082266+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

We distil this report to focus on 6 core techniques: zero-shot, few-shot approaches, thought generation, ensembling, self-criticism, and decomposition.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[60]

Targeted Password Guessing Using Neural Language Models,

‘Chain-of-Thought Prompting for Speech Translation’. ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), April, 1–5. https://doi.org/10.1109/ICASSP49660.2025.10890560. Huang, Jiaxin, Shixiang Shane Gu, Le Hou, et al

work page doi:10.1109/icassp49660.2025.10890560 2025
[61]

Large Language Models Can Self-Improve

‘Large Language Models Can Self-Improve’. arXiv:2210.11610. Preprint, arXiv, October

work page internal anchor Pith review arXiv
[62]

Large Language Models Can Self-Improve

https://doi.org/10.48550/arXiv.2210.11610. Hwang, Taesoon, Nishant Aggarwal, Pir Zarak Khan, et al

work page internal anchor Pith review doi:10.48550/arxiv.2210.11610
[63]

PLOS ONE 19 (2): e0297701

‘Can ChatGPT Assist Authors with Abstract Writing in Medical Journals? Evaluating the Quality of Scientific Abstracts Generated by ChatGPT and Original Abstracts’. PLOS ONE 19 (2): e0297701. https://doi.org/10.1371/journal.pone.0297701. Imani, Shima, Liang Du, and Harsh Shrivastava

work page doi:10.1371/journal.pone.0297701
[64]

arXiv:2303.05398

‘MathPrompter: Mathematical Reasoning Using Large Language Models’. arXiv:2303.05398. Preprint, arXiv, March

work page arXiv
[65]

arXiv:2303.05398

https://doi.org/10.48550/arXiv.2303.05398. ‘Introducing Deep Research’

work page doi:10.48550/arxiv.2303.05398
[66]

arXiv:2408.02479

‘From LLMs to LLM-Based Agents for Software Engineering: A Survey of Current, Challenges and Future’. arXiv:2408.02479. Preprint, arXiv, April

work page arXiv
[67]

arXiv:2408.02479

https://doi.org/10.48550/arXiv.2408.02479. Jin, Hongye, Xiaotian Han, Jingfeng Yang, et al

work page doi:10.48550/arxiv.2408.02479
[68]

arXiv:2401.01325

‘LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning’. arXiv:2401.01325. Preprint, arXiv, July

work page arXiv
[69]

arXiv:2401.01325

https://doi.org/10.48550/arXiv.2401.01325. Jin, Mingyu, Haochen Xue, Zhenting Wang, et al

work page doi:10.48550/arxiv.2401.01325
[70]

arXiv:2405.06649

‘ProLLM: Protein Chain-of-Thoughts Enhanced LLM for Protein-Protein Interaction Prediction’. arXiv:2405.06649. Preprint, arXiv, July

work page arXiv
[71]

arXiv:2405.06649

https://doi.org/10.48550/arXiv.2405.06649. Kalai, Adam Tauman, Ofir Nachum, Santosh S. Vempala, and Edwin Zhang

work page doi:10.48550/arxiv.2405.06649
[72]

Why Language Models Hallucinate

‘Why Language Models Hallucinate’. arXiv:2509.04664. Preprint, arXiv, September

work page internal anchor Pith review Pith/arXiv arXiv
[73]

Why Language Models Hallucinate

https://doi.org/10.48550/arXiv.2509.04664. Kobak, Dmitry, Rita González-Márquez, Emőke-Ágnes Horvát, and Jan Lause

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.04664
[74]

Science Advances 11 (27): eadt3813

‘Delving into LLM-Assisted Writing in Biomedical Publications through Excess Vocabulary’. Science Advances 11 (27): eadt3813. https://doi.org/10.1126/sciadv.adt3813. Laban, Philippe, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville

work page doi:10.1126/sciadv.adt3813
[75]

LLMs Get Lost In Multi-Turn Conversation

‘LLMs Get Lost In Multi-Turn Conversation’. arXiv:2505.06120. Preprint, arXiv, May

work page internal anchor Pith review Pith/arXiv arXiv
[76]

LLMs Get Lost In Multi-Turn Conversation

https://doi.org/10.48550/arXiv.2505.06120. Li, Ang, Haozhe Chen, Hongseok Namkoong, and Tianyi Peng

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.06120
[77]

, Chen, H

‘LLM Generated Persona Is a Promise with a Catch’. arXiv:2503.16527. Preprint, arXiv, March

work page arXiv
[78]

, Chen, H

https://doi.org/10.48550/arXiv.2503.16527. Li, Guohao, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem

work page doi:10.48550/arxiv.2503.16527
[79]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

‘CAMEL: Communicative Agents for “Mind” Exploration of Large Language Model Society’. arXiv:2303.17760. Preprint, arXiv, November

work page internal anchor Pith review Pith/arXiv arXiv
[80]

CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society

https://doi.org/10.48550/arXiv.2303.17760. Li, Jia, Ge Li, Yongmin Li, and Zhi Jin

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.17760

Showing first 80 references.