arxiv: 2402.07927 · v2 · submitted 2024-02-05 · 💻 cs.AI · cs.CL · cs.HC

Recognition: 2 theorem links

A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

Pranab Sahoo, Ayush Kumar Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, Aman Chadha


Pith reviewed 2026-05-12 21:47 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.HC

keywords prompt engineering · large language models · survey · taxonomy · vision-language models · applications · techniques · limitations


The pith

This survey organizes prompt engineering techniques for large language models into categories by application area.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys advancements in prompt engineering for large language models and vision-language models. It organizes these techniques into categories based on their application areas such as question answering and commonsense reasoning. For each category the authors describe the specific prompting methodology along with the models and datasets used. They also evaluate the strengths and limitations of the approaches. The survey includes a taxonomy diagram and summary table to support better understanding of the field.

Core claim

By compiling recent literature, the survey establishes a structured overview of prompt engineering methods grouped by application. It details for each approach the methodology, applications, involved models, utilized datasets, and critical strengths and limitations. This is accompanied by a taxonomy diagram and a comprehensive table of key elements across all reviewed techniques.

What carries the argument

An application-area taxonomy of prompting techniques that groups methods to enable systematic review and comparison of their use across tasks.

If this is right

Clarifies how prompts can adapt pre-trained models to downstream tasks without updating parameters.
Highlights open challenges and opportunities for future prompt engineering research.
Provides practitioners with summaries to compare and select methods for specific applications.
Documents the range from natural language instructions to learned vector representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could apply the taxonomy to match prompting strategies to new tasks more efficiently.
The survey's structure will require periodic updates to track the field's fast growth.
Inclusion of both language and vision-language models points toward potential value in cross-modal prompt designs.

Load-bearing premise

The papers selected for review form a sufficiently complete and unbiased sample of the prompt engineering literature.

What would settle it

Identification of a major prompt engineering paper or technique from the covered period that is omitted from the survey or placed in an incorrect category.

Original abstract

Prompt engineering has emerged as an indispensable technique for extending the capabilities of large language models (LLMs) and vision-language models (VLMs). This approach leverages task-specific instructions, known as prompts, to enhance model efficacy without modifying the core model parameters. Rather than updating the model parameters, prompts allow seamless integration of pre-trained models into downstream tasks by eliciting desired model behaviors solely based on the given prompt. Prompts can be natural language instructions that provide context to guide the model or learned vector representations that activate relevant knowledge. This burgeoning field has enabled success across various applications, from question-answering to commonsense reasoning. However, there remains a lack of systematic organization and understanding of the diverse prompt engineering methods and techniques. This survey paper addresses the gap by providing a structured overview of recent advancements in prompt engineering, categorized by application area. For each prompting approach, we provide a summary detailing the prompting methodology, its applications, the models involved, and the datasets utilized. We also delve into the strengths and limitations of each approach and include a taxonomy diagram and table summarizing datasets, models, and critical points of each prompting technique. This systematic analysis enables a better understanding of this rapidly developing field and facilitates future research by illuminating open challenges and opportunities for prompt engineering.
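The abstract's distinction between natural-language prompts and learned vector prompts can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy embedding table, the function names, and the vector values are hypothetical and do not come from the paper, and no real model or training loop is involved. The point is only that both prompt styles change the model's input while the stored parameters stay untouched.

```python
# Toy stand-in for frozen model parameters: a tiny embedding table.
# All values are made up for illustration.
FROZEN_EMBEDDINGS = {
    "is": [0.1, 0.0],
    "this": [0.0, 0.1],
    "review": [0.2, 0.2],
    "positive": [0.3, 0.1],
}


def hard_prompt(instruction: str, text: str) -> str:
    """Natural-language prompt: the task is stated in words around the input."""
    return f"{instruction}\n\nInput: {text}\nAnswer:"


def embed(tokens):
    """Look up frozen embeddings; copies are returned so the table is never mutated."""
    return [list(FROZEN_EMBEDDINGS[t]) for t in tokens]


def soft_prompt_input(soft_vectors, tokens):
    """Learned-vector prompt: prepend prompt vectors to the frozen token embeddings."""
    return [list(v) for v in soft_vectors] + embed(tokens)


text_prompt = hard_prompt(
    "Classify the sentiment as positive or negative.",
    "Great battery life.",
)

# Hypothetical soft prompt of length 2. In practice these values would be
# learned by gradient descent while the model weights stay frozen.
soft_vectors = [[0.5, -0.5], [0.25, 0.75]]
model_input = soft_prompt_input(soft_vectors, ["is", "this", "review", "positive"])
```

In this sketch `text_prompt` is what a text-only interface would receive, while `model_input` is the sequence of vectors a soft-prompted model would consume: two prompt vectors followed by four token embeddings.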

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

A practical taxonomy and summaries for prompt engineering techniques, but the systematic claim is undercut by missing search and selection details.


The main takeaway from this paper is that it provides a categorized overview of prompt engineering techniques, complete with summaries of methods, applications, models, datasets, strengths, and limitations, plus a taxonomy diagram and summary table. This organization is the primary contribution.

It does well in pulling together information from many sources into a single reference that highlights practical details. The consistent structure across techniques makes it easier to compare them, which is useful for anyone trying to choose or adapt a prompting strategy for a specific task. The inclusion of both strengths and limitations for each approach adds balance that some surveys miss.

The soft spots center on the review methodology. There is no section explaining the literature search process, such as the databases queried, the keywords or Boolean searches used, the date range, the number of papers screened or included, or how duplicates and irrelevant works were handled. This makes it hard to evaluate whether the selected papers represent a comprehensive sample or whether the taxonomy accurately reflects the full diversity of the field. For a paper positioned as systematic, this omission is noticeable and could be addressed in revision.

Overall, this paper is aimed at applied researchers, engineers, and students who want a structured introduction to prompt engineering without diving into every original publication. It offers value as a starting point for understanding current techniques and open challenges. I would recommend it for peer review. The synthesis effort is worthwhile, and with some additions to the methods description, it could become a solid reference for the community.

Referee Report

1 major / 2 minor

Summary. The paper claims to address the lack of systematic organization in prompt engineering by delivering a structured survey of recent advancements in LLMs and VLMs. It categorizes techniques by application area, supplying for each a summary of the prompting methodology, applications, models used, datasets, strengths, and limitations, along with a taxonomy diagram and a table summarizing datasets, models, and critical points.

Significance. If the categorization is shown to be comprehensive, the survey would provide a practical reference consolidating knowledge across techniques, models, and datasets while highlighting open challenges and opportunities. The explicit taxonomy and summary table are strengths that could aid researchers in navigating this fast-moving area.

major comments (1)

Abstract and §1: The manuscript repeatedly describes its contribution as a 'systematic' overview and 'systematic analysis,' yet provides no methods section or appendix detailing the literature search protocol (databases, Boolean strings, date range), inclusion/exclusion criteria, number of papers screened versus included, or any PRISMA-style flow diagram. Without these elements the representativeness of the selected papers and the accuracy of the application-area taxonomy cannot be verified, which is load-bearing for the central claim.

minor comments (2)

Taxonomy diagram: Consider adding a legend or explicit category labels to improve readability and ensure the diagram clearly maps to the textual sections.
Summary table: Verify that every model and dataset entry is accompanied by a citation in the main text or references section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We agree that documenting the literature search process is necessary to support the claim of a systematic survey and will revise the manuscript accordingly.


Referee: Abstract and §1: The manuscript repeatedly describes its contribution as a 'systematic' overview and 'systematic analysis,' yet provides no methods section or appendix detailing the literature search protocol (databases, Boolean strings, date range), inclusion/exclusion criteria, number of papers screened versus included, or any PRISMA-style flow diagram. Without these elements the representativeness of the selected papers and the accuracy of the application-area taxonomy cannot be verified, which is load-bearing for the central claim.

Authors: We acknowledge that the absence of an explicit methods section weakens the 'systematic' claim. In the revised manuscript we will add a new subsection (or appendix) titled 'Literature Search and Selection Methodology'. It will specify the databases and repositories searched (arXiv, Google Scholar, ACL Anthology, NeurIPS, ICML, CVPR, and EMNLP proceedings), the Boolean search strings used (e.g., ('prompt engineering' OR 'prompting technique' OR 'prompt design') AND ('large language model' OR LLM OR 'vision-language model' OR VLM)), the date range (primarily 2020–early 2024), inclusion criteria (papers presenting novel prompting methods with empirical results on LLMs or VLMs), exclusion criteria (non-technical surveys, duplicates, non-English works), and approximate screening statistics (initial hits, duplicates removed, papers retained after title/abstract and full-text review). A PRISMA-style flow diagram will also be included. This addition will make the taxonomy's coverage verifiable while preserving the existing categorization and analysis.

revision: yes
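The search protocol promised in the rebuttal could be made reproducible with something like the following minimal sketch. It assembles the example Boolean string quoted above and applies the stated date range and exclusion criteria to a list of records, reporting counts at each stage. The record fields (`title`, `year`, `language`, `novel_method`), the helper names, and the toy records are all hypothetical; the authors' actual tooling and numbers are not given in the source.

```python
def or_group(terms):
    """Join terms into a parenthesised OR group, quoting multi-word phrases."""
    quoted = [f"'{t}'" if " " in t else t for t in terms]
    return "(" + " OR ".join(quoted) + ")"


def build_query(*groups):
    """AND together several OR groups."""
    return " AND ".join(or_group(g) for g in groups)


def screen(records, first_year=2020, last_year=2024):
    """Deduplicate by normalised title, then apply the stated criteria.

    Returns the retained records and per-stage counts.
    """
    counts = {"initial": len(records)}
    seen, unique = set(), []
    for r in records:
        key = r["title"].strip().lower()
        if key not in seen:
            seen.add(key)
            unique.append(r)
    counts["after_dedup"] = len(unique)
    kept = [
        r for r in unique
        if r["language"] == "en"
        and first_year <= r["year"] <= last_year
        and r["novel_method"]
    ]
    counts["retained"] = len(kept)
    return kept, counts


TOPIC_TERMS = ["prompt engineering", "prompting technique", "prompt design"]
MODEL_TERMS = ["large language model", "LLM", "vision-language model", "VLM"]
query = build_query(TOPIC_TERMS, MODEL_TERMS)

# Toy records, invented purely to exercise the function.
records = [
    {"title": "Paper A", "year": 2023, "language": "en", "novel_method": True},
    {"title": "paper a ", "year": 2023, "language": "en", "novel_method": True},
    {"title": "Paper B", "year": 2019, "language": "en", "novel_method": True},
    {"title": "Paper C", "year": 2022, "language": "de", "novel_method": True},
]
kept, counts = screen(records)
```

The per-stage `counts` dictionary is the kind of data a PRISMA-style flow diagram would report; a real protocol would also separate title/abstract screening from full-text review, which this sketch collapses into one step.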

Circularity Check

0 steps flagged

No circularity: survey compiles external literature without derivations or self-referential claims


This is a survey paper that organizes and summarizes existing prompt engineering literature by application area, providing summaries of methodologies, models, datasets, strengths, and limitations from cited works. It contains no original equations, predictions, fitted parameters, or derivation chains that could reduce to the paper's own inputs by construction. The central claim of filling a gap via a structured overview relies on external sources rather than on self-definition or load-bearing self-citation. The lack of an explicit search protocol is a methodological limitation for representativeness but does not create circularity under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the central claim rests entirely on the selection, reading, and categorization of previously published work on prompt engineering; no new free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5543 in / 1055 out tokens · 43647 ms · 2026-05-12T21:47:07.613128+00:00 · methodology


Forward citations

Cited by 29 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
cs.CR 2026-05 unverdicted novelty 7.0

PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
cs.AI 2026-04 unverdicted novelty 7.0

TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design matter...
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
cs.DC 2026-04 unverdicted novelty 7.0

Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
Dynamic Cyber Ranges
cs.CR 2026-04 unverdicted novelty 7.0

Dynamic Cyber Ranges with LLM defender agents reduce attacker success to 0-55% and preserve evaluation headroom as models advance by using comparable capabilities on both sides.
Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
cs.SE 2026-04 unverdicted novelty 7.0

Atropos uses GCN on inference graphs for early failure prediction and hotswaps to larger LLMs, achieving 74% of large-model performance at 24% cost.
Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
cs.AI 2026-04 unverdicted novelty 7.0

GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on ...
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
cs.HC 2026-04 unverdicted novelty 7.0

LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits
cs.LG 2026-05 unverdicted novelty 6.0

Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
cs.CL 2026-05 unverdicted novelty 6.0

LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
cs.SE 2026-05 unverdicted novelty 6.0

VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
Black-box model classification under the discriminative factorization
cs.LG 2026-05 unverdicted novelty 6.0

Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on ...
GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
cs.CL 2026-05 unverdicted novelty 6.0

GRaSp optimizes in-context examples for LLMs via synthetic generation, clustering, dimensionality reduction, and genetic algorithms with diversity-adaptive mutation, reaching 45.84% micro-F1 on financial NER with real...
Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
cs.CR 2026-05 unverdicted novelty 6.0

An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
cs.SE 2026-04 unverdicted novelty 6.0

Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
Understanding the Mechanism of Altruism in Large Language Models
econ.GN 2026-04 unverdicted novelty 6.0

A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers
cs.CR 2026-04 unverdicted novelty 6.0

Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
cs.SE 2026-04 unverdicted novelty 6.0

LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
Beyond Single Reports: Evaluating Automated ATT&CK Technique Extraction in Multi-Report Campaign Settings
cs.SE 2026-04 unverdicted novelty 6.0

Aggregating multiple CTI reports improves ATT&CK technique extraction F1 by about 26 percent over single-report baselines, with saturation after 5-15 reports and maximum F1 scores of 78.6 percent and 54.9 percent acro...
Context-Value-Action Architecture for Value-Driven Large Language Model Agents
cs.AI 2026-04 unverdicted novelty 6.0

The Context-Value-Action architecture decouples reasoning from action in LLM agents via a human-data-trained Value Verifier, mitigating polarization and outperforming prompt-based methods on a large real-world benchmark.
VIP-COP: Context Optimization for Tabular Foundation Models
cs.LG 2026-05 unverdicted novelty 5.0

VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...
User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
cs.SE 2026-05 conditional novelty 5.0

LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.
Jailbreaking Large Language Models with Morality Attacks
cs.CL 2026-04 unverdicted novelty 5.0

Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition
cs.CL 2026-04 unverdicted novelty 5.0

ADAM uses personality-guided LLM augmentation and cross-lingual attention distillation to raise balanced accuracy on multilingual personality recognition to 0.6332 on Essays and 0.7448 on Kaggle, outperforming standar...
From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems
cs.CR 2026-04 unverdicted novelty 5.0

ASTRAL applies multimodal LLMs with prompt chaining and few-shot learning to synthesize CPS architectures from disparate sources, enabling adaptive threat identification and quantitative risk estimation, as supported ...
The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
cs.CL 2026-04 accept novelty 5.0

PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt ...
Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks
cs.SE 2026-04 unverdicted novelty 4.0

CoT prompting improves LLM performance on control-flow deobfuscation of C benchmarks, yielding ~16% better CFG reconstruction and ~20.5% better semantic preservation for GPT5 versus zero-shot prompting.
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
cs.SE 2026-04 conditional novelty 4.0

Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
cs.CL 2024-12 accept novelty 3.0

A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA
cs.CL 2026-05 unverdicted novelty 2.0

Prompt-based LLM evaluation without training data secured top rankings in the ArchEHR-QA 2026 shared task on clinical QA.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 29 Pith papers · 6 internal anchors

[1]

Exploring visual prompts for adapting large-scale models

Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, and Phillip Isola. Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274,

work page arXiv
[2]

Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks

Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. arXiv preprint arXiv:2211.12588,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Unleashing the potential of prompt engineering in large language models: a comprehensive review

Banghao Chen, Zhaofeng Zhang, Nicolas Langrené, and Shengxin Zhu. Unleashing the potential of prompt engineering in large language models: a comprehensive review. arXiv preprint arXiv:2310.14735,

work page arXiv
[4]

Contrastive chain-of-thought prompting

Yew Ken Chia, Guizhen Chen, Luu Anh Tuan, Soujanya Poria, and Lidong Bing. Contrastive chain-of-thought prompting. arXiv preprint arXiv:2311.09277,

work page arXiv
[5]

Rephrase and respond: Let large language models ask better questions for themselves

Yihe Deng, Weitong Zhang, Zixiang Chen, and Quanquan Gu. Rephrase and respond: Let large language models ask better questions for themselves. arXiv preprint arXiv:2311.04205,

work page arXiv
[6]

Chain-of-verification reduces hallucination in large language models

Shehzaad Dhuliawala, Mojtaba Komeili, Jing Xu, Roberta Raileanu, Xian Li, Asli Celikyilmaz, and Jason Weston. Chain-of-verification reduces hallucination in large language models. arXiv preprint arXiv:2309.11495,

work page arXiv
[7]

Active prompting with chain-of-thought for large language models

Shizhe Diao, Pengcheng Wang, Yong Lin, and Tong Zhang. Active prompting with chain-of-thought for large language models. arXiv preprint arXiv:2302.12246,

work page arXiv
[8]

Large language models understand and can be enhanced by emotional stimuli

Cheng Li, Jindong Wang, Yixuan Zhang, Kaijie Zhu, Wenxin Hou, Jianxun Lian, Fang Luo, Qiang Yang, and Xing Xie. Large language models understand and can be enhanced by emotional stimuli. arXiv preprint arXiv:2307.11760,

work page arXiv
[9]

Chain of code: Reasoning with a language model-augmented code emulator

Chengshu Li, Jacky Liang, Andy Zeng, Xinyun Chen, Karol Hausman, Dorsa Sadigh, Sergey Levine, Li Fei-Fei, Fei Xia, and Brian Ichter. Chain of code: Reasoning with a language model-augmented code emulator. arXiv preprint arXiv:2312.04474,

work page arXiv
[10]

Structured chain-of-thought prompting for code generation

Jia Li, Ge Li, Yongmin Li, and Zhi Jin. Structured chain-of-thought prompting for code generation. arXiv preprint arXiv:2305.06599,

work page arXiv
[11]

Large language model guided tree-of-thought

Jieyi Long. Large language model guided tree-of-thought. arXiv preprint arXiv:2305.08291,

work page arXiv
[12]

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Maxwell Nye, Anders Johan Andreassen, Guy Gur-Ari, Henryk Michalewski, Jacob Austin, David Bieber, David Dohan, Aitor Lewkowycz, Maarten Bosma, David Luan, et al. Show your work: Scratchpads for intermediate computation with language models. arXiv preprint arXiv:2112.00114,

work page internal anchor Pith review arXiv
[13]

ART: Automatic multi-step reasoning and tool-use for large language models

Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014,

work page arXiv
[14]

A comprehensive survey of hallucination in large language, image, video and audio foundation models

Pranab Sahoo, Prabhash Meharia, Akash Ghosh, Sriparna Saha, Vinija Jain, and Aman Chadha. A comprehensive survey of hallucination in large language, image, video and audio foundation models. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 11709–11724,

work page 2024
[15]

A comprehensive survey of hallucination mitigation techniques in large language models

SM Tonmoy, SM Zaman, Vinija Jain, Anku Rani, Vipula Rawte, Aman Chadha, and Amitava Das. A comprehensive survey of hallucination mitigation techniques in large language models. arXiv preprint arXiv:2401.01313,

work page arXiv
[16]

Self-Consistency Improves Chain of Thought Reasoning in Language Models

Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

System 2 attention (is something you might need too)

Jason Weston and Sainbayar Sukhbaatar. System 2 attention (is something you might need too). arXiv preprint arXiv:2311.11829,

work page arXiv
[18]

Large Language Models as Optimizers

Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V Le, Denny Zhou, and Xinyun Chen. Large language models as optimizers. arXiv preprint arXiv:2309.03409,

work page internal anchor Pith review arXiv
[19]

ReAct: Synergizing Reasoning and Acting in Language Models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Beyond Chain-of-Thought, Effective Graph-of-Thought Reasoning in Language Models

Yao Yao, Zuchao Li, and Hai Zhao. Beyond chain-of-thought, effective graph-of-thought reasoning in large language models. arXiv preprint arXiv:2305.16582,

work page arXiv
[22]

Automatic chain of thought prompting in large language models

Zhuosheng Zhang, Aston Zhang, Mu Li, and Alex Smola. Automatic chain of thought prompting in large language models. arXiv preprint arXiv:2210.03493,

work page arXiv
[23]

Take a step back: Evoking reasoning via abstraction in large language models

Huaixiu Steven Zheng, Swaroop Mishra, Xinyun Chen, Heng-Tze Cheng, Ed H Chi, Quoc V Le, and Denny Zhou. Take a step back: evoking reasoning via abstraction in large language models. arXiv preprint arXiv:2310.06117,

work page arXiv
[24]

Large language models are human-level prompt engineers

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan, and Jimmy Ba. Large language models are human-level prompt engineers. arXiv preprint arXiv:2211.01910,

work page arXiv
[25]

Thread of thought unraveling chaotic contexts

Yucheng Zhou, Xiubo Geng, Tao Shen, Chongyang Tao, Guodong Long, Jian-Guang Lou, and Jianbing Shen. Thread of thought unraveling chaotic contexts. arXiv preprint arXiv:2311.08734,

work page arXiv
[26]

Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales?

Zhanke Zhou, Rong Tao, Jianing Zhu, Yiwen Luo, Zengmao Wang, and Bo Han. Can language models perform robust reasoning in chain-of-thought prompting with noisy rationales? In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems, volume 37, pages 123846–123910. Curran...

work page 2024