A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
Pith reviewed 2026-05-12 21:47 UTC · model grok-4.3
The pith
This survey organizes prompt engineering techniques for large language models into categories by application area.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By compiling recent literature, the survey gives a structured overview of prompt engineering methods grouped by application area. For each approach it details the methodology, applications, models involved, datasets used, and main strengths and limitations. A taxonomy diagram and a summary table of key elements across all reviewed techniques accompany the text.
What carries the argument
An application-area taxonomy of prompting techniques that groups methods to enable systematic review and comparison of their use across tasks.
If this is right
- Clarifies how prompts can adapt pre-trained models to downstream tasks without updating parameters.
- Highlights open challenges and opportunities for future prompt engineering research.
- Provides practitioners with summaries to compare and select methods for specific applications.
- Documents the range from natural language instructions to learned vector representations.
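The first point above, adapting a pre-trained model to a downstream task through the prompt alone, can be sketched in a few lines. This is a minimal illustration, not code from the survey: the function name `build_few_shot_prompt`, the sentiment task, and the example texts are all hypothetical. No model is called; the sketch only shows that the task specification lives entirely in the input string, so switching tasks means changing text rather than weights.

```python
from typing import Sequence, Tuple


def build_few_shot_prompt(
    instruction: str,
    examples: Sequence[Tuple[str, str]],
    query: str,
) -> str:
    """Assemble a natural-language prompt: instruction, worked examples, query.

    Hypothetical helper for illustration; the Input/Output layout is an
    assumption, not a format prescribed by the survey.
    """
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    # The final slot is left open for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


INSTRUCTION = "Classify the sentiment as positive or negative."

# Zero-shot: instruction plus query, no demonstrations.
zero_shot = build_few_shot_prompt(INSTRUCTION, [], "The plot dragged.")

# Few-shot: the same instruction with two made-up demonstrations.
few_shot = build_few_shot_prompt(
    INSTRUCTION,
    [("I loved it.", "positive"), ("Terrible acting.", "negative")],
    "The plot dragged.",
)

print(few_shot)
```

Learned vector prompts, the other end of the range mentioned above, replace this string with trainable embeddings and are not covered by the sketch.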
Where Pith is reading between the lines
- Developers could apply the taxonomy to match prompting strategies to new tasks more efficiently.
- The survey's structure will require periodic updates to track the field's fast growth.
- Inclusion of both language and vision-language models points toward potential value in cross-modal prompt designs.
Load-bearing premise
The papers selected for review form a sufficiently complete and unbiased sample of the prompt engineering literature.
What would settle it
Identification of a major prompt engineering paper or technique from the covered period that is omitted from the survey or placed in an incorrect category.
Original abstract
Prompt engineering has emerged as an indispensable technique for extending the capabilities of large language models (LLMs) and vision-language models (VLMs). This approach leverages task-specific instructions, known as prompts, to enhance model efficacy without modifying the core model parameters. Rather than updating the model parameters, prompts allow seamless integration of pre-trained models into downstream tasks by eliciting desired model behaviors solely based on the given prompt. Prompts can be natural language instructions that provide context to guide the model or learned vector representations that activate relevant knowledge. This burgeoning field has enabled success across various applications, from question-answering to commonsense reasoning. However, there remains a lack of systematic organization and understanding of the diverse prompt engineering methods and techniques. This survey paper addresses the gap by providing a structured overview of recent advancements in prompt engineering, categorized by application area. For each prompting approach, we provide a summary detailing the prompting methodology, its applications, the models involved, and the datasets utilized. We also delve into the strengths and limitations of each approach and include a taxonomy diagram and table summarizing datasets, models, and critical points of each prompting technique. This systematic analysis enables a better understanding of this rapidly developing field and facilitates future research by illuminating open challenges and opportunities for prompt engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to address the lack of systematic organization in prompt engineering by delivering a structured survey of recent advancements in LLMs and VLMs. It categorizes techniques by application area, supplying for each a summary of the prompting methodology, applications, models used, datasets, strengths, and limitations, along with a taxonomy diagram and a table summarizing datasets, models, and critical points.
Significance. If the categorization is shown to be comprehensive, the survey would provide a practical reference consolidating knowledge across techniques, models, and datasets while highlighting open challenges and opportunities. The explicit taxonomy and summary table are strengths that could aid researchers in navigating this fast-moving area.
major comments (1)
- Abstract and §1: The manuscript repeatedly describes its contribution as a 'systematic' overview and 'systematic analysis,' yet provides no methods section or appendix detailing the literature search protocol (databases, Boolean strings, date range), inclusion/exclusion criteria, number of papers screened versus included, or any PRISMA-style flow diagram. Without these elements the representativeness of the selected papers and the accuracy of the application-area taxonomy cannot be verified, which is load-bearing for the central claim.
minor comments (2)
- Taxonomy diagram: Consider adding a legend or explicit category labels to improve readability and ensure the diagram clearly maps to the textual sections.
- Summary table: Verify that every model and dataset entry is accompanied by a citation in the main text or references section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We agree that documenting the literature search process is necessary to support the claim of a systematic survey and will revise the manuscript accordingly.
Point-by-point responses
- Referee: Abstract and §1: The manuscript repeatedly describes its contribution as a 'systematic' overview and 'systematic analysis,' yet provides no methods section or appendix detailing the literature search protocol (databases, Boolean strings, date range), inclusion/exclusion criteria, number of papers screened versus included, or any PRISMA-style flow diagram. Without these elements the representativeness of the selected papers and the accuracy of the application-area taxonomy cannot be verified, which is load-bearing for the central claim.
  Authors: We acknowledge that the absence of an explicit methods section weakens the 'systematic' claim. In the revised manuscript we will add a new subsection (or appendix) titled 'Literature Search and Selection Methodology'. It will specify the databases and repositories searched (arXiv, Google Scholar, ACL Anthology, NeurIPS, ICML, CVPR, and EMNLP proceedings), the Boolean search strings used (e.g., ('prompt engineering' OR 'prompting technique' OR 'prompt design') AND ('large language model' OR LLM OR 'vision-language model' OR VLM)), the date range (primarily 2020–early 2024), inclusion criteria (papers presenting novel prompting methods with empirical results on LLMs or VLMs), exclusion criteria (non-technical surveys, duplicates, non-English works), and approximate screening statistics (initial hits, duplicates removed, papers retained after title/abstract and full-text review). A PRISMA-style flow diagram will also be included. This addition will make the taxonomy's coverage verifiable while preserving the existing categorization and analysis.
  Revision: yes
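The search-and-screen procedure described in the response can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' pipeline: the `Record` type, the `screen` function, the stage names, and the toy records are hypothetical, and only the two term groups come from the Boolean string quoted above. Full-text review and the exclusion of non-technical surveys need human judgment and are not modelled.

```python
import re
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple

# Term groups taken from the Boolean string quoted in the response:
# (topic term) AND (model term).
TOPIC_TERMS = ("prompt engineering", "prompting technique", "prompt design")
MODEL_TERMS = ("large language model", "LLM", "vision-language model", "VLM")


def _has_any(text: str, terms: Sequence[str]) -> bool:
    """True if any term occurs as a whole phrase (optional plural 's')."""
    return any(
        re.search(rf"\b{re.escape(t)}s?\b", text, flags=re.IGNORECASE)
        for t in terms
    )


def matches_query(text: str) -> bool:
    """Evaluate the AND of the two OR-groups against a title or abstract."""
    return _has_any(text, TOPIC_TERMS) and _has_any(text, MODEL_TERMS)


@dataclass
class Record:
    # Hypothetical minimal bibliographic record.
    title: str
    abstract: str
    language: str = "en"


def screen(records: List[Record]) -> Tuple[List[Record], Dict[str, int]]:
    """Apply duplicate, language, and query filters, counting each stage."""
    counts = {"identified": len(records)}
    seen, unique = set(), []
    for r in records:
        key = r.title.strip().lower()  # naive duplicate key (assumption)
        if key not in seen:
            seen.add(key)
            unique.append(r)
    counts["after_deduplication"] = len(unique)
    english = [r for r in unique if r.language == "en"]
    counts["after_language_filter"] = len(english)
    retained = [r for r in english if matches_query(f"{r.title} {r.abstract}")]
    counts["after_title_abstract_screen"] = len(retained)
    return retained, counts


# Toy records, invented purely to exercise the filters.
toy = [
    Record("Toy paper A", "We propose a prompting technique for large language models."),
    Record("Toy paper A", "Duplicate entry."),
    Record("Toy paper B", "Prompt design for a VLM captioner.", language="de"),
    Record("Toy paper C", "A new optimizer for LLM pre-training."),
    Record("Toy paper D", "Prompt engineering with a vision-language model."),
]
retained, counts = screen(toy)
print(counts)
```

The per-stage counts are the numbers a PRISMA-style flow diagram would report, which is why the sketch tallies them at each filter rather than only returning the retained set.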
Circularity Check
No circularity: survey compiles external literature without derivations or self-referential claims
Full rationale
This is a survey paper that organizes and summarizes existing prompt engineering literature by application area, providing summaries of methodologies, models, datasets, strengths, and limitations from cited works. It contains no original equations, predictions, fitted parameters, or derivation chains that could reduce to the paper's own inputs by construction. The central claim of filling a gap via a structured overview relies on external sources rather than on self-definition or load-bearing self-citation. The lack of an explicit search protocol is a methodological limitation for representativeness but does not create circularity under the enumerated patterns.
Forward citations
Cited by 29 Pith papers
- PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
  PragLocker protects agent prompts as IP by building non-portable obfuscated versions that function only on the intended LLM through code-symbol semantic anchoring followed by target-model feedback noise injection.
- TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
  TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design matter...
- Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
  Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
- Dynamic Cyber Ranges
  Dynamic Cyber Ranges with LLM defender agents reduce attacker success to 0-55% and preserve evaluation headroom as models advance by using comparable capabilities on both sides.
- Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
  Atropos uses GCN on inference graphs for early failure prediction and hotswaps to larger LLMs, achieving 74% of large-model performance at 24% cost.
- Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
  GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on ...
- Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
  LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
- Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits
  Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via inte...
- VISOR: A Vision-Language Model-based Test Oracle for Testing Robot
  VISOR applies VLMs to automate robot test oracles for correctness and quality assessment while reporting uncertainty, with evaluation on GPT and Gemini showing trade-offs in precision and recall but poor uncertainty c...
- Black-box model classification under the discriminative factorization
  Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on ...
- GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
  GRaSp optimizes in-context examples for LLMs via synthetic generation, clustering, dimensionality reduction, and genetic algorithms with diversity-adaptive mutation, reaching 45.84% micro-F1 on financial NER with real...
- Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
  An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.
- Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
  Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
- Understanding the Mechanism of Altruism in Large Language Models
  A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
- From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers
  Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...
- When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
  LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
- Beyond Single Reports: Evaluating Automated ATT&CK Technique Extraction in Multi-Report Campaign Settings
  Aggregating multiple CTI reports improves ATT&CK technique extraction F1 by about 26 percent over single-report baselines, with saturation after 5-15 reports and maximum F1 scores of 78.6 percent and 54.9 percent acro...
- Context-Value-Action Architecture for Value-Driven Large Language Model Agents
  The Context-Value-Action architecture decouples reasoning from action in LLM agents via a human-data-trained Value Verifier, mitigating polarization and outperforming prompt-based methods on a large real-world benchmark.
- VIP-COP: Context Optimization for Tabular Foundation Models
  VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimen...
- User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
  LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.
- Jailbreaking Large Language Models with Morality Attacks
  Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
- Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition
  ADAM uses personality-guided LLM augmentation and cross-lingual attention distillation to raise balanced accuracy on multilingual personality recognition to 0.6332 on Essays and 0.7448 on Kaggle, outperforming standar...
- From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems
  ASTRAL applies multimodal LLMs with prompt chaining and few-shot learning to synthesize CPS architectures from disparate sources, enabling adaptive threat identification and quantitative risk estimation, as supported ...
- The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
  PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt ...
- Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks
  CoT prompting improves LLM performance on control-flow deobfuscation of C benchmarks, yielding ~16% better CFG reconstruction and ~20.5% better semantic preservation for GPT5 versus zero-shot prompting.
- Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
  Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
- LLMs-as-Judges: A Comprehensive Survey on LLM-based Evaluation Methods
  A survey that organizes LLMs-as-judges research into functionality, methodology, applications, meta-evaluation, and limitations.
- BIT.UA-AAUBS at ArchEHR-QA 2026: Evaluating Open-Source and Proprietary LLMs via Prompting in Low-Resource QA
  Prompt-based LLM evaluation without training data secured top rankings in the ArchEHR-QA 2026 shared task on clinical QA.