A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications
Canonical reference. 82% of citing Pith papers cite this work as background.
abstract
Prompt engineering has emerged as an indispensable technique for extending the capabilities of large language models (LLMs) and vision-language models (VLMs). This approach leverages task-specific instructions, known as prompts, to enhance model efficacy without modifying the core model parameters. Rather than updating the model parameters, prompts allow seamless integration of pre-trained models into downstream tasks by eliciting desired model behaviors solely based on the given prompt. Prompts can be natural language instructions that provide context to guide the model or learned vector representations that activate relevant knowledge. This burgeoning field has enabled success across various applications, from question-answering to commonsense reasoning. However, there remains a lack of systematic organization and understanding of the diverse prompt engineering methods and techniques. This survey paper addresses the gap by providing a structured overview of recent advancements in prompt engineering, categorized by application area. For each prompting approach, we provide a summary detailing the prompting methodology, its applications, the models involved, and the datasets utilized. We also delve into the strengths and limitations of each approach and include a taxonomy diagram and table summarizing datasets, models, and critical points of each prompting technique. This systematic analysis enables a better understanding of this rapidly developing field and facilitates future research by illuminating open challenges and opportunities for prompt engineering.
citing papers explorer
-
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constrained SkyPilot baseline.
-
Dynamic Cyber Ranges
Dynamic Cyber Ranges with LLM defender agents reduce attacker success to 0-55% and preserve evaluation headroom as models advance by using comparable capabilities on both sides.
-
Atropos: Improving Cost-Benefit Trade-off of LLM-based Agents under Self-Consistency with Early Termination and Model Hotswap
Atropos uses GCN on inference graphs for early failure prediction and hotswaps to larger LLMs, achieving 74% of large-model performance at 24% cost.
-
Human-Centric Topic Modeling with Goal-Prompted Contrastive Learning and Optimal Transport
GCTM-OT extracts goal candidates with an LLM, then uses goal-prompted contrastive learning and optimal transport to discover topics that are more coherent, diverse, and aligned with human intent than prior methods on subreddit data.
-
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
-
RubberDuckBench: A Benchmark for AI Coding Assistants
RubberDuckBench shows top AI models score around 68% on real GitHub coding questions, rarely answer completely correctly, and hallucinate in 58% of responses on average.
-
PIAST: Rapid Prompting with In-context Augmentation for Scarce Training data
PIAST iteratively optimizes few-shot examples in prompts via Monte Carlo Shapley value estimation, outperforming prior automatic prompting methods and setting new SOTA on classification, simplification, and GSM8K with modest compute.
-
From Task to Tutorial: An Automated GUI Framework for Excel Tutorial Document and Video Creation
An AI framework automates Excel tutorial and video creation from task descriptions via an Execution Agent, achieving 8.5% higher task success and 1/20th the authoring time of experts.
-
The PIMMUR Principles: Ensuring Validity in Collective Behavior of LLM Societies
A systematic audit of LLM-based AI societies finds that 89.7% of 39 studies violate at least one of six PIMMUR validity principles, with reproductions showing that many claimed collective behaviors disappear when controls are tightened.
-
Reflective Prompt Tuning through Language Model Function-Calling
Reflective Prompt Tuning uses LLM function calling and diagnostic reports to iteratively optimize prompts, yielding up to 12.9 point gains on reasoning tasks while improving calibration.
-
TCARD: Nearly Balanced Two-Level Designs with Treatment Cardinality Constraints with an Application to LLM Prompt Engineering
Proposes nearly balanced TCARDs that minimize the first two generalized word-length pattern components, defines Φ_BCD criterion linked to classical optimality, and constructs designs via coordinate exchange with simulation-calibrated weights for LLM prompt engineering.
-
Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits
Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.
-
Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
-
VISOR: A Vision-Language Model-based Test Oracle for Testing Robots
VISOR is a VLM-based automated test oracle that evaluates robot task correctness and quality from videos while reporting its own uncertainty, tested on GPT and Gemini across four tasks and over 1000 videos with Gemini showing higher recall and GPT higher precision but low uncertainty-correctness tie
-
Black-box model classification under the discriminative factorization
Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on auditing tasks.
-
GRaSp: Automatic Example Optimization for In-Context Learning in Low-Data Tasks
GRaSp optimizes in-context examples for LLMs via synthetic generation, clustering, dimensionality reduction, and genetic algorithms with diversity-adaptive mutation, reaching 45.84% micro-F1 on financial NER with real data and outperforming zero-shot and random few-shot baselines.
-
PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts
PragLocker generates function-preserving but non-portable prompts for LLM agents via code-symbol semantic anchoring followed by target-model feedback noise injection.
-
Tailored Prompts, Targeted Protection: Vulnerability-Specific LLM Analysis for Smart Contracts
An LLM framework with tailored prompts and a new dataset of 31,165 annotated instances achieves 0.92 positive recall and 0.85 negative recall for detecting 13 smart contract vulnerability categories.
-
Leveraging LLMs for Multi-File DSL Code Generation: An Industrial Case Study
Fine-tuning 7B code LLMs on a custom multi-file DSL dataset achieves structural fidelity of 1.00, high exact-match accuracy, and practical utility validated by expert survey and execution checks.
-
Understanding the Mechanism of Altruism in Large Language Models
A small set of sparse autoencoder features in LLMs drives shifts between generous and selfish allocations in dictator games, with causal patching and steering confirming their role and generalization to other social games.
-
From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers
Arbiter-K is a governance-first architecture that turns probabilistic agent reasoning into discrete instructions with runtime taint propagation to block unsafe actions, reporting 76-95% interception rates and a 92.79% gain over baseline policies on two test systems.
-
ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation
ClusterRAG applies density-based clustering to user profiles for collaborative retrieval in personalized RAG and reports best performance on LaMP tasks by combining target and similar-user profiles.
-
When LLMs Lag Behind: Knowledge Conflicts from Evolving APIs in Code Generation
LLMs produce executable code only 42.55% of the time under API evolution without full documentation, improving to 66.36% with structured docs and by 11% more with reasoning strategies, yet outdated patterns persist.
-
Beyond Single Reports: Evaluating Automated ATT&CK Technique Extraction in Multi-Report Campaign Settings
Aggregating multiple CTI reports improves ATT&CK technique extraction F1 by about 26 percent over single-report baselines, with saturation after 5-15 reports and maximum F1 scores of 78.6 percent and 54.9 percent across the tested campaigns.
-
Context-Value-Action Architecture for Value-Driven Large Language Model Agents
The Context-Value-Action architecture decouples reasoning from action in LLM agents via a human-data-trained Value Verifier, mitigating polarization and outperforming prompt-based methods on a large real-world benchmark.
-
Configuring Agentic AI Coding Tools: An Exploratory Study
Developers overwhelmingly rely on simple static context files such as AGENTS.md to configure agentic AI coding tools, while advanced mechanisms like skills and subagents see very low adoption.
-
Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task
LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.
-
Illusions of Confidence? Diagnosing LLM Truthfulness via Neighborhood Consistency
Neighbor-Consistency Belief (NCB) measures LLM belief robustness across conceptual neighborhoods, revealing that high-NCB facts resist contextual interference better, and Structure-Aware Training reduces brittleness by about 30%.
-
Can GPT-4o Evaluate Usability Like Human Experts? A Comparative Study on Issue Identification in Heuristic Evaluation
GPT-4o identified only 21.2% of the usability issues found by human experts in heuristic evaluation, while discovering 27 additional issues, struggling with certain heuristics, and generating false positives.
-
EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair
ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.
-
From Concept to Practice: an Automated LLM-aided UVM Machine for RTL Verification
UVM^2 is an LLM-driven system that generates and refines UVM testbenches for RTL verification, reporting substantial time savings and average code/function coverage of 87.44%/89.58% on designs up to 1.6K lines, outperforming prior methods.
-
General Hazard Detection
Introduces the CompliVision dataset and an active learning framework for rule-based hazard compliance assessment using vision-language models grounded in safety standards.
-
Less Back-and-Forth: A Comparative Study of Structured Prompting
Checklist-improved prompts achieve the highest mean rubric score (7.50/8) and best quality-effort tradeoff compared to raw prompts (5.67) and clarifying-question prompts (6.67) across four task types and three LLMs.
-
VIP-COP: Context Optimization for Tabular Foundation Models
VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimensional data.
-
User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models
LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.
-
Jailbreaking Large Language Models with Morality Attacks
Morality-specific jailbreak attacks expose critical vulnerabilities in both large language models and guardrail systems when handling pluralistic values.
-
Cross-Lingual Attention Distillation with Personality-Informed Generative Augmentation for Multilingual Personality Recognition
ADAM uses personality-guided LLM augmentation and cross-lingual attention distillation to raise balanced accuracy on multilingual personality recognition to 0.6332 on Essays and 0.7448 on Kaggle, outperforming standard BCE loss.
-
From Incomplete Architecture to Quantified Risk: Multimodal LLM-Driven Security Assessment for Cyber-Physical Systems
ASTRAL applies multimodal LLMs with prompt chaining and few-shot learning to synthesize CPS architectures from disparate sources, enabling adaptive threat identification and quantitative risk estimation, as supported by ablation studies and feedback from 14 cybersecurity practitioners.
-
The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure
PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt concepts.
-
Enhancing Large Language Model-Based Systems for End-to-End Circuit Analysis Problem Solving
Hybrid pipeline using YOLO vision and ngspice verification raises circuit analysis accuracy from Gemini's 79.52% baseline to 97.59%, with similar gains on hand-drawn diagrams.
-
Large Language Models as Virtual Survey Respondents: Evaluating Sociodemographic Response Generation
Introduces PAS and FAS task abstractions plus the LLM-S^3 benchmark to evaluate LLMs on generating sociodemographic survey responses across 11 real datasets and multiple models.
-
PRL: Prompts from Reinforcement Learning
PRL is a reinforcement learning method that generates novel prompts and achieves state-of-the-art results on text classification, simplification, and summarization benchmarks, outperforming APE and EvoPrompt.
-
Artificial Intelligence in Number Theory: LLMs for Algorithm Generation and Ensemble Methods for Conjecture Verification
An LLM reaches >=0.95 accuracy on 60 number theory problems with optimal hints; a LightGBM classifier empirically supports the Dirichlet conductor conjecture via zero features at 93.9% test accuracy for small q.
-
Improving Language Models with Intentional Analysis
Intentional Analysis improves language model task performance by explicitly adding intent-aware analysis and reasoning, outperforming Chain-of-Thought and working synergistically with it even on frontier models.
-
Analyzing Chain of Thought (CoT) Approaches in Control Flow Code Deobfuscation Tasks
CoT prompting improves LLM performance on control-flow deobfuscation of C benchmarks, yielding ~16% better CFG reconstruction and ~20.5% better semantic preservation for GPT5 versus zero-shot prompting.
-
Combining Static Code Analysis and Large Language Models Improves Correctness and Performance of Algorithm Recognition
Hybrid LLM plus static analysis for algorithm recognition in code cuts required model calls by 72-97% and lifts F1-scores by as much as 12 points.
-
Toward a Safe Internet of Agents
The paper proposes a bottom-up framework for safe agentic AI systems that treats each component as a dual-use interface where added capabilities also expand attack surfaces across single agents, multi-agent systems, and interoperable ecosystems.
-
Foundational Design Principles and Patterns for Building Robust and Adaptive GenAI-Native Systems
Proposes five foundational pillars and architectural patterns for building robust GenAI-native systems by combining AI with software engineering principles.
-
AI, Meet Human: Learning Paradigms for Hybrid Decision Making Systems
Proposes a taxonomy of Hybrid Decision Making Systems as a conceptual and technical framework for modeling human-machine interaction in machine learning literature.