pith. machine review for the scientific record.

arxiv: 2406.06608 · v6 · submitted 2024-06-06 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

The Prompt Report: A Systematic Survey of Prompt Engineering Techniques

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: prompt engineering · taxonomy · large language models · generative AI · prompting techniques · best practices · meta-analysis · vocabulary

The pith

A survey organizes prompt engineering into a taxonomy of 58 LLM techniques and 40 for other modalities, supported by a 33-term vocabulary and best practices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to resolve conflicting terminology and fragmented understanding in prompt engineering for generative AI by building a shared taxonomy and vocabulary. It gathers and categorizes techniques that developers use to guide large language models and other systems. This structure clarifies how different prompting approaches affect outputs across research and industry uses. The authors review applications, supply guidelines for models such as ChatGPT, and include a meta-analysis of prefix-prompting studies. If the taxonomy holds, it gives practitioners a clearer map for choosing and refining prompts.
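The "prefix-prompting" the survey meta-analyzes means prepending natural-language instructions or worked exemplars to the model input. A minimal illustration of two widely surveyed techniques, few-shot prompting and zero-shot chain-of-thought (prompt construction only, no model call; all example content is invented):

```python
# Two common prefix-prompting techniques: few-shot prompting
# (prepend input/output exemplars) and zero-shot chain-of-thought
# (append a reasoning trigger). All text here is illustrative.

def few_shot_prompt(examples, query):
    """Prepend worked input/output exemplars before the real query."""
    prefix = "\n".join(f"Input: {x}\nOutput: {y}" for x, y in examples)
    return f"{prefix}\nInput: {query}\nOutput:"

def zero_shot_cot_prompt(query):
    """Zero-shot chain-of-thought: append a reasoning trigger phrase."""
    return f"Q: {query}\nA: Let's think step by step."

prompt = few_shot_prompt([("2+2", "4"), ("3+5", "8")], "7+6")
```

The resulting string would be sent verbatim to any LLM completion endpoint; the taxonomy's value is in naming and separating such variations so they can be compared.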

Core claim

The authors establish a structured understanding of prompt engineering by assembling a taxonomy of 58 LLM prompting techniques and 40 techniques for other modalities, a vocabulary of 33 terms, best practices and guidelines for prompting state-of-the-art models, and a meta-analysis of the literature on natural language prefix-prompting, presenting this collection as the most comprehensive survey to date.

What carries the argument

The taxonomy of prompting techniques, which classifies 58 LLM methods and 40 others to clarify what constitutes an effective prompt.

If this is right

  • A shared vocabulary reduces conflicting descriptions of the same prompting approach across papers and tools.
  • The taxonomy helps developers select suitable techniques for specific tasks instead of relying on trial and error.
  • Guidelines for state-of-the-art models improve output quality and consistency when applied to systems like ChatGPT.
  • The meta-analysis highlights trends in how prefix-prompting research has evolved over time.
  • Clear categories support more systematic testing of which techniques work best for given domains.
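The last bullet, systematic testing of which techniques work best, amounts to running the same dataset through each taxonomy category and comparing scores. A hedged sketch of such a harness (`call_model` is a stub standing in for any LLM API; the techniques and data are placeholders):

```python
# Sketch of a per-technique evaluation harness. `call_model` is a
# stub so the loop runs; swap in a real LLM API call in practice.
def call_model(prompt):
    return "4" if "2+2" in prompt else ""

TECHNIQUES = {
    "zero_shot": lambda q: q,
    "cot":       lambda q: f"{q}\nLet's think step by step.",
}

def score(technique_fn, dataset):
    """Fraction of items where the model output matches the gold answer."""
    hits = sum(call_model(technique_fn(q)).strip() == gold
               for q, gold in dataset)
    return hits / len(dataset)

dataset = [("What is 2+2?", "4")]
results = {name: score(fn, dataset) for name, fn in TECHNIQUES.items()}
```

With a real model and a larger dataset, `results` becomes exactly the kind of per-category comparison the bullet envisions.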

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could serve as a base layer for automated tools that suggest or optimize prompts for new tasks.
  • Researchers might test whether the same categories apply to emerging modalities such as video generation or code interpreters.
  • Standard terms could enable consistent benchmarks that compare prompting performance across different models.
  • Extending the survey periodically would track how new techniques fit into or expand the existing structure.
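The first extension above, using the taxonomy as a base layer for prompt-suggestion tools, presupposes a machine-readable encoding of the categories. A minimal sketch (category and technique names are illustrative, not the paper's actual 58-entry taxonomy):

```python
# A machine-readable slice of a prompting-technique taxonomy.
# Categories and entries are illustrative, not the paper's own.
TAXONOMY = {
    "in_context_learning": ["few_shot", "exemplar_selection"],
    "thought_generation":  ["chain_of_thought", "zero_shot_cot"],
    "decomposition":       ["least_to_most", "plan_and_solve"],
}

def suggest(task_tags):
    """Return candidate techniques for a task described by category tags."""
    return [t for tag in task_tags for t in TAXONOMY.get(tag, [])]

print(suggest(["thought_generation", "decomposition"]))
# ['chain_of_thought', 'zero_shot_cot', 'least_to_most', 'plan_and_solve']
```

A suggestion tool built on the real taxonomy would map task features to categories the same way, then rank the returned techniques empirically.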

Load-bearing premise

The literature search and categorization process captured a representative and exhaustive set of prompting techniques without significant omissions or misclassifications.

What would settle it

An independent literature search that identifies a substantial number of distinct prompting techniques absent from the taxonomy or placed in incorrect categories would falsify the claim of comprehensiveness.

read the original abstract

Generative Artificial Intelligence (GenAI) systems are increasingly being deployed across diverse industries and research domains. Developers and end-users interact with these systems through the use of prompting and prompt engineering. Although prompt engineering is a widely adopted and extensively researched area, it suffers from conflicting terminology and a fragmented ontological understanding of what constitutes an effective prompt due to its relatively recent emergence. We establish a structured understanding of prompt engineering by assembling a taxonomy of prompting techniques and analyzing their applications. We present a detailed vocabulary of 33 vocabulary terms, a taxonomy of 58 LLM prompting techniques, and 40 techniques for other modalities. Additionally, we provide best practices and guidelines for prompt engineering, including advice for prompting state-of-the-art (SOTA) LLMs such as ChatGPT. We further present a meta-analysis of the entire literature on natural language prefix-prompting. As a culmination of these efforts, this paper presents the most comprehensive survey on prompt engineering to date.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper surveys prompt engineering for generative AI systems. It assembles a 33-term vocabulary, a taxonomy of 58 LLM prompting techniques plus 40 multimodal techniques, best-practice guidelines (including for SOTA models such as ChatGPT), and a meta-analysis of the natural-language prefix-prompting literature, claiming to deliver the most comprehensive survey to date.

Significance. If the taxonomy and counts are reproducible and exhaustive, the work would provide a much-needed organizing framework for a rapidly growing but terminologically fragmented subfield, reducing duplication of effort and supplying practitioners with a consolidated reference.

major comments (2)
  1. [Methods / Taxonomy Construction] The manuscript supplies no PRISMA-style flow diagram, search strings, database list (arXiv, ACL Anthology, etc.), date cutoffs, inclusion/exclusion criteria, or inter-annotator agreement statistics for the categorization that produced the specific counts of 58 LLM and 40 multimodal techniques. Without these details the central claim of exhaustiveness cannot be evaluated.
  2. [Meta-analysis] The meta-analysis of prefix-prompting literature is asserted but no quantitative aggregation method, effect-size extraction protocol, or study-selection criteria are described, making it impossible to assess whether reported trends rest on a representative sample or on the authors' sampling frame.
minor comments (3)
  1. [Abstract] The abstract states that a meta-analysis was performed yet reports none of its quantitative findings or key trends.
  2. [Taxonomy] The boundary between the 58 LLM techniques and the 40 multimodal techniques is not explicitly justified; several techniques (e.g., certain chain-of-thought variants) appear to straddle both categories.
  3. [Best Practices] Concrete prompting examples for ChatGPT and other SOTA models are given but are not cross-referenced to the numbered taxonomy entries, reducing traceability.
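The inter-annotator agreement statistics requested in major comment 1 are conventionally reported as Cohen's kappa over the two annotators' category assignments. A self-contained sketch of the computation on hypothetical labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)

# Hypothetical category labels from two annotators for six techniques.
a = ["cot", "cot", "few_shot", "decomp", "cot", "few_shot"]
b = ["cot", "few_shot", "few_shot", "decomp", "cot", "few_shot"]
kappa = cohens_kappa(a, b)  # ≈ 0.739
```

Reporting a value like this for the taxonomy assignments would directly answer the referee's reproducibility concern.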

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review. The comments highlight important areas for improving methodological transparency in our survey. We address each point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Methods / Taxonomy Construction] The manuscript supplies no PRISMA-style flow diagram, search strings, database list (arXiv, ACL Anthology, etc.), date cutoffs, inclusion/exclusion criteria, or inter-annotator agreement statistics for the categorization that produced the specific counts of 58 LLM and 40 multimodal techniques. Without these details the central claim of exhaustiveness cannot be evaluated.

    Authors: We acknowledge that the current manuscript does not include a dedicated methods section with these details. This omission limits the ability to fully assess reproducibility and exhaustiveness. In the revised version, we will add a new 'Systematic Review Methodology' section that provides: (1) a PRISMA-style flow diagram, (2) the exact search strings used, (3) the list of databases queried (arXiv, ACL Anthology, Google Scholar, and others), (4) date cutoffs (literature collected through May 2024), (5) explicit inclusion/exclusion criteria, and (6) inter-annotator agreement statistics for the taxonomy categorization process. These additions will directly support the claims of comprehensiveness. revision: yes

  2. Referee: [Meta-analysis] The meta-analysis of prefix-prompting literature is asserted but no quantitative aggregation method, effect-size extraction protocol, or study-selection criteria are described, making it impossible to assess whether reported trends rest on a representative sample or on the authors' sampling frame.

    Authors: We agree that the meta-analysis section lacks sufficient methodological specification. The analysis was based on a systematic collection of papers focused on natural-language prefix-prompting, with trends summarized through counts and qualitative synthesis of reported performance improvements. In revision, we will expand this section to explicitly state the study-selection criteria, the protocol for extracting trends and any quantitative metrics (such as reported accuracy deltas), and clarify that the aggregation is a narrative meta-summary with frequency counts rather than a formal statistical meta-analysis with effect sizes. This will make the sampling frame and methods transparent. revision: yes
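The "narrative meta-summary with frequency counts" the authors describe could, in its simplest form, look like the following (the paper records and accuracy deltas are invented for illustration):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical extracted records: (technique, reported accuracy delta
# vs. baseline, in percentage points). Invented for illustration.
records = [
    ("chain_of_thought", 12.0), ("chain_of_thought", 8.5),
    ("few_shot", 4.0), ("few_shot", -1.0), ("self_consistency", 6.2),
]

by_technique = defaultdict(list)
for tech, delta in records:
    by_technique[tech].append(delta)

summary = {
    tech: {"n_studies": len(ds), "mean_delta": round(mean(ds), 2)}
    for tech, ds in by_technique.items()
}
# e.g. summary["chain_of_thought"] == {"n_studies": 2, "mean_delta": 10.25}
```

This is frequency counting plus descriptive averaging, not a formal effect-size meta-analysis, which is exactly the distinction the revised section would need to make explicit.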

Circularity Check

0 steps flagged

No significant circularity in survey aggregation and taxonomy construction

full rationale

The paper is a systematic literature survey that assembles a taxonomy and vocabulary from external sources without presenting new quantitative derivations, fitted parameters, or predictions. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the derivation chain. The central claim of comprehensiveness rests on literature search and categorization rather than any equation or definition that reduces to its own inputs by construction. This is the expected outcome for an honest survey paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the collected papers adequately represent the prompt-engineering literature and that the proposed taxonomy categories are both exhaustive and mutually exclusive.

axioms (1)
  • domain assumption: The literature search methodology captured a representative sample of prompt-engineering research.
    Survey validity depends on comprehensive coverage of the field.

pith-pipeline@v0.9.0 · 5593 in / 1057 out tokens · 44412 ms · 2026-05-15T02:11:47.649500+00:00 · methodology

discussion (0)


Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data

    cs.AI 2026-04 unverdicted novelty 7.0

    TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design matter...

  2. Can Vision Language Models Judge Action Quality? An Empirical Evaluation

    cs.CV 2026-04 conditional novelty 7.0

    Vision-language models perform only marginally above random on action quality assessment and retain systematic biases even after targeted prompting and contrastive reformulation.

  3. Automated Design of Agentic Systems

    cs.AI 2024-08 conditional novelty 7.0

    Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...

  4. Efficient Multi-objective Prompt Optimization via Pure-exploration Bandits

    cs.LG 2026-05 unverdicted novelty 6.0

    Adapting multi-objective pure-exploration bandits enables efficient Pareto prompt set recovery and best feasible prompt identification for LLMs, with linear-case guarantees and empirical gains over baselines.

  5. Alignment has a Fantasia Problem

    cs.AI 2026-04 unverdicted novelty 6.0

    AI alignment must move beyond assuming users have fully formed goals and instead provide active cognitive support to help form and refine intent over time.

  6. From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

    cs.CR 2026-04 unverdicted novelty 6.0

    Arbiter-K is a new execution architecture that treats LLMs as probabilistic processors inside a neuro-symbolic kernel with a semantic ISA to enable deterministic security enforcement and unsafe trajectory interdiction...

  7. LLMs for Qualitative Data Analysis Fail on Security-specific Comments in Human Experiments

    cs.SE 2026-04 unverdicted novelty 6.0

    LLMs improve with detailed code descriptions but remain insufficient to replace human annotators for security-specific qualitative coding.

  8. User Reviews as a Source for Usability Requirements: A Precursor Study on Using Large Language Models

    cs.SE 2026-05 conditional novelty 5.0

    LLMs can detect usability content in user reviews with F-scores comparable to humans, though performance depends strongly on prompt design.

  9. LLARS: Enabling Domain Expert & Developer Collaboration for LLM Prompting, Generation and Evaluation

    cs.AI 2026-05 unverdicted novelty 5.0

    LLARS is a new integrated platform that combines collaborative prompt authoring, cost-controlled batch generation, and hybrid evaluation to help domain experts and developers jointly build and assess LLM systems.

  10. U-Define: Designing User Workflows for Hard and Soft Constraints in LLM-Based Planning

    cs.AI 2026-05 unverdicted novelty 5.0

    U-Define improves user control in LLM planning by letting people define hard rules and soft preferences in natural language with matching verification methods, raising usefulness and satisfaction scores.

  11. Looking Into the Past: Eye Movements Characterize Elements of Autobiographical Recall in Interviews with Holocaust Survivors

    cs.MM 2026-04 unverdicted novelty 5.0

    Eye movements during Holocaust survivor interviews vary by episodic, semantic, affective and temporal memory dimensions, with pre-onset gaze sufficient to predict sentence temporal context.

  12. OOPrompt: Reifying Intents into Structured Artifacts for Modular and Iterative Prompting

    cs.HC 2026-04 unverdicted novelty 5.0

    OOPrompt reifies user intents into structured manipulable artifacts to enable modular and iterative prompting in LLM-based interactive systems.

  13. Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

    cs.AI 2026-04 unverdicted novelty 5.0

    Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.

  14. Confidence Without Competence in AI-Assisted Knowledge Work

    cs.HC 2026-04 unverdicted novelty 5.0

    Standard LLM chats produce high perceived understanding but low objective learning in students, while future-self explanations best align confidence with actual gains and guided hints maximize learning with moderate workload.

  15. The PICCO Framework for Large Language Model Prompting: A Taxonomy and Reference Architecture for Prompt Structure

    cs.CL 2026-04 accept novelty 5.0

    PICCO is a five-element reference architecture (Persona, Instructions, Context, Constraints, Output) for structuring LLM prompts, derived from synthesizing prior frameworks along with a taxonomy distinguishing prompt ...

  16. Self-Describing Structured Data with Dual-Layer Guidance: A Lightweight Alternative to RAG for Precision Retrieval in Large-Scale LLM Knowledge Navigation

    cs.CL 2026-03 unverdicted novelty 5.0

    SDSR places human metadata at file primacy and combines it with prompt routing rules to reach 100% primary category accuracy on a 119-category benchmark, far above the 65% no-guidance baseline.

  17. Characterizing Students' LLM Usage Behaviors and Their Association with Learning in Critical Thinking Tasks

    cs.HC 2026-05 unverdicted novelty 4.0

    Refined bottom-up categories of LLM usage in critical thinking homework, labeled by student initiative, are examined for associations with midterm performance across two course offerings.

  18. Hint-Writing with Deferred AI Assistance: Fostering Critical Engagement in Data Science Education

    cs.HC 2026-04 unverdicted novelty 4.0

    In a randomized experiment with 97 graduate students, deferred AI assistance produced the highest-quality hints and helped students spot more code mistakes than independent writing or immediate AI help.

  19. Prompt Engineering Strategies for LLM-based Qualitative Coding of Psychological Safety in Software Engineering Communities: A Controlled Empirical Study

    cs.SE 2026-05 unverdicted novelty 3.0

    Multi-shot prompting raises agreement with humans for Claude Haiku but not DeepSeek-Chat or Gemini 2.5 Flash, with models showing different stability and a consistent bias toward over-labeling negative feedback.

  20. CLaC at SemEval-2026 Task 6: Response Clarity Detection in Political Discourse

    cs.CL 2026-05 unverdicted novelty 3.0

    An LLM ensemble reached 80 macro-F1 on 3-class clarity detection and 59 on 9-class evasion detection, with partial layer unfreezing and multilingual ensembles improving encoder results while enriched context helped only LLMs.

  21. A Reproducibility Study of Metacognitive Retrieval-Augmented Generation

    cs.IR 2026-04 unverdicted novelty 3.0

    MetaRAG is only partially reproducible with lower absolute scores than originally reported, gains substantially from reranking, and shows greater robustness than SIM-RAG under extended retrieval features.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 21 Pith papers · 3 internal anchors

  1. [1]

    arXiv preprint arXiv:2404.11018

    Many-shot in-context learning. arXiv preprint arXiv:2404.11018. Sweta Agrawal, Chunting Zhou, Mike Lewis, Luke Zettlemoyer, and Marjan Ghazvininejad. 2023. In- context examples selection for machine translation. In Findings of the Association for Computational Linguistics: ACL 2023, pages 8857–8873, Toronto, Canada. Association for Computational Linguisti...

  2. [2]

    Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shah- baz Khan

    BUFFET: Benchmarking Large Language Models for Few-shot Cross-lingual Transfer. Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, and Fahad Shah- baz Khan. 2023. Foundational models defining a new era in vision: A survey and outlook. Abhijeet Awasthi, Nitish Gupta, Bidisha Samanta, Shachi Da...

  3. [3]

    In Proceedings of the 17th Conference of the European Chapter of the As- sociation for Computational Linguistics, pages 2455– 2467, Dubrovnik, Croatia

    Bootstrapping multilingual semantic parsers using large language models. In Proceedings of the 17th Conference of the European Chapter of the As- sociation for Computational Linguistics, pages 2455– 2467, Dubrovnik, Croatia. Association for Computa- tional Linguistics. Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao ...

  4. [4]

    Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale. In ACL. Omer Bar-Tal, Dolev Ofri-Amar, Rafail Fridman, Yoni Kasten, and Tali Dekel. 2022. Text2live: Text-driven layered image and video editing. Amanda Bertsch, Maor Ivgi, Uri Alon, Jonathan Berant, Matthew R Gormley, and Graham Neubig. 2024. I...

  5. [5]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    Language models are few-shot learners. Sébastien Bubeck, Varun Chandrasekaran, Ronen El- dan, John A. Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuan-Fang Li, Scott M. Lundberg, Harsha Nori, Hamid Palangi, Marco Tulio Ribeiro, and Yi Zhang. 2023. Sparks of artificial general intelligence: Early experiments with gpt-4. ArXiv, abs/2303.12712. ...

  6. [6]

    How is chatgpt’s behav- ior changing over time? arXiv preprint arXiv:2307.09009, 2023

    Chateval: Towards better LLM-based eval- uators through multi-agent debate. In The Twelfth International Conference on Learning Representa- tions. Ernie Chang, Pin-Jie Lin, Yang Li, Sidd Srinivasan, Gael Le Lan, David Kant, Yangyang Shi, Forrest Iandola, and Vikas Chandra. 2023. In-context prompt editing for conditional audio generation. Harrison Chase. 2...

  7. [7]

    GPTScore: Evaluate as You Desire , publisher =

    Template-based named entity recognition us- ing bart. Findings of the Association for Computa- tional Linguistics: ACL-IJCNLP 2021. Hai Dang, Lukas Mecke, Florian Lehmann, Sven Goller, and Daniel Buschek. 2022. How to prompt? opportu- nities and challenges of zero- and few-shot learning for human-ai interaction in creative applications of generative model...

  8. [8]

    In NeurIPS

    ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings. In NeurIPS. Hangfeng He, Hongming Zhang, and Dan Roth. 2023a. Socreval: Large language models with the so- cratic method for reference-free reasoning evaluation. arXiv preprint arXiv:2310.00074. Zhiwei He, Tian Liang, Wenxiang Jiao, Zhuosheng Zhang, Yujiu Yang, Rui Wang,...

  9. [9]

    Measuring Massive Multitask Language Un- derstanding. In ICLR. Amr Hendy, Mohamed Gomaa Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. ArXiv, abs/2302.09210. Amir Hertz, Ron Mokady, Jay Tenenba...

  10. [10]

    DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

    A comprehensive study of vision transformers in image classification tasks. Omar Khattab, Keshav Santhanam, Xiang Lisa Li, David Hall, Percy Liang, Christopher Potts, and Matei Zaharia. 2022. Demonstrate-search-predict: Composing retrieval and language models for knowledge-intensive nlp. Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Kesh...

  11. [11]

    Natalie Kiesler and Daniel Schiffner

    Decomposed prompting: A modular approach for solving complex tasks. Natalie Kiesler and Daniel Schiffner. 2023. Large lan- guage models in introductory programming educa- tion: Chatgpt’s performance and implications for assessments. arXiv preprint arXiv:2308.08572. Hwichan Kim and Mamoru Komachi. 2023. Enhancing few-shot cross-lingual transfer with target...

  12. [12]

    51 Soochan Lee and Gunhee Kim

    Euclidreamer: Fast and high-quality texturing for 3d models with stable diffusion depth. 51 Soochan Lee and Gunhee Kim. 2023. Recursion of thought: A divide-and-conquer approach to multi- context reasoning with language models. Alina Leidinger, Robert van Rooij, and Ekaterina Shutova. 2023. The language of prompting: What linguistic properties make a prom...

  13. [13]

    Yaoyiran Li, Anna Korhonen, and Ivan Vuli ´c

    Oscar: Object-semantics aligned pre-training for vision-language tasks. Yaoyiran Li, Anna Korhonen, and Ivan Vuli ´c. 2023h. On bilingual lexicon induction with large language models. Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. 2023i. Making language models better reasoners with step-aware verifier. In Proceedin...

  14. [14]

    Albert Lu, Hongxin Zhang, Yanzhe Zhang, Xuezhi Wang, and Diyi Yang

    Att3d: Amortized text-to-3d object synthesis. Albert Lu, Hongxin Zhang, Yanzhe Zhang, Xuezhi Wang, and Diyi Yang. 2023a. Bounding the capabili- ties of large language models in open text generation with prompt constraints. Hongyuan Lu, Haoyang Huang, Dongdong Zhang, Hao- ran Yang, Wai Lam, and Furu Wei. 2023b. Chain- of-dictionary prompting elicits transl...

  15. [15]

    arXiv preprint arXiv:2303.15621 , year=

    Chatgpt as a factual inconsistency evaluator for abstractive text summarization. arXiv preprint arXiv:2303.15621. Jiaxi Lv, Yi Huang, Mingfu Yan, Jiancheng Huang, Jianzhuang Liu, Yifan Liu, Yafei Wen, Xiaoxin Chen, and Shifeng Chen. 2023. Gpt4motion: Script- ing physical motions in text-to-video generation via blender-oriented gpt planning. Qing Lyu, Shre...

  16. [16]

    gradient descent

    Suicide crisis syndrome: A systematic review. Suicide and Life-Threatening Behavior. February 27, online ahead of print. Fanxu Meng, Haotong Yang, Yiding Wang, and Muhan Zhang. 2023. Chain of images for intuitively reason- ing. B. Meskó. 2023. Prompt engineering as an impor- tant emerging skill for medical professionals: Tuto- rial. Journal of Medical Int...

  17. [17]

    Conversation style transfer using few-shot learning. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Vol- ume 1: Long Papers) , pages 119–143, Nusa Dua, Bali. Association for Computational Linguistics. Ohad Rubin, J...

  18. [18]

    In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies

    Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Tech- nologies. Association for Computational Linguistics. Runway. 2023. Gen-2 prompt tips. https: //help.runwayml.com/hc/en-us/articles/ 17329337959699-Gen-2-Prompt-Tips...

  19. [19]

    Shubhra Kanti Karmaker Santu and Dongji Feng

    Lost at c: A user study on the security implica- tions of large language model code assistants. Shubhra Kanti Karmaker Santu and Dongji Feng. 2023. Teler: A general taxonomy of llm prompts for bench- marking complex tasks. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 20...

  20. [20]

    Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, and He He

    Reflexion: Language agents with verbal rein- forcement learning. Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, and He He. 2023a. Measuring induc- tive biases of in-context learning with underspecified demonstrations. In Association for Computational Linguistics (ACL). Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang, Jor...

  21. [21]

    In Proceed- ings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 819–862, Dublin, Ireland

    An information-theoretic approach to prompt engineering without ground truth labels. In Proceed- ings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Pa- pers), pages 819–862, Dublin, Ireland. Association for Computational Linguistics. Andrea Sottana, Bin Liang, Kai Zou, and Zheng Yuan

  22. [22]

    arXiv preprint arXiv:2310.13800

    Evaluation metrics in the era of gpt-4: Reli- ably evaluating large language models on sequence to sequence tasks. arXiv preprint arXiv:2310.13800. Michal Štefánik and Marek Kadl ˇcík. 2023. Can in- context learners learn a reasoning concept from demonstrations? In Proceedings of the 1st Work- shop on Natural Language Reasoning and Structured Explanations...

  23. [23]

    Eshaan Tanwar, Subhabrata Dutta, Manish Borthakur, and Tanmoy Chakraborty

    Towards training-free open-world segmenta- tion via image prompting foundation models. Eshaan Tanwar, Subhabrata Dutta, Manish Borthakur, and Tanmoy Chakraborty. 2023. Multilingual LLMs are better cross-lingual in-context learners with align- ment. In Proceedings of the 61st Annual Meeting of 57 the Association for Computational Linguistics (Vol- ume 1: L...

  24. [24]

    Jason Weston and Sainbayar Sukhbaatar

    Large language models are better reasoners with self-verification. Jason Weston and Sainbayar Sukhbaatar. 2023. System 2 attention (is something you might need too). Jules White, Quchen Fu, Sam Hays, Michael Sandborn, Carlos Olea, Henry Gilbert, Ashraf Elnashar, Jesse Spencer-Smith, and Douglas C. Schmidt. 2023. A prompt pattern catalog to enhance prompt ...

  25. [25]

    Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs

    Ai chains: Transparent and controllable human-ai interaction by chaining large language model prompts. CHI Conference on Human Factors in Computing Systems. Xiaodong Wu, Ran Duan, and Jianbing Ni. 2023c. Un- veiling security, privacy, and ethical concerns of chat- gpt. Journal of Information and Intelligence. 59 Congying Xia, Chen Xing, Jiangshu Du, Xinyi...

  26. [26]

    The dawn of lmms: Preliminary explorations with gpt-4v (ision)

    Re-reading improves reasoning in language models. Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji. 2023. Rcot: Detecting and rectifying factual inconsistency in reasoning by reversing chain-of-thought. Chengrun Yang, Xuezhi Wang, Yifeng Lu, Hanxiao Liu, Quoc V . Le, Denny Zhou, and Xinyun Chen. 2023a. Large language models as opt...

  27. [27]

    slots to fill

    Thread of thought unraveling chaotic contexts. Xizhou Zhu, Yuntao Chen, Hao Tian, Chenxin Tao, Wei- jie Su, Chenyu Yang, Gao Huang, Bin Li, Lewei Lu, Xiaogang Wang, Yu Qiao, Zhaoxiang Zhang, and Jifeng Dai. 2023. Ghost in the minecraft: Gener- ally capable agents for open-world environments via large language models with text-based knowledge and memory. Z...