Large Language Models Are Human-Level Prompt Engineers

Yongchao Zhou, Andrei Ioan Muresanu, Ziwen Han, Keiran Paster, Silviu Pitis, Harris Chan · 2022 · arXiv 2211.01910

23 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization

cs.AI · 2026-05-20 · unverdicted · novelty 7.0

A multi-agent pipeline iteratively refines topology optimization outputs to match natural language preferences for branched structures, achieving a 60% success rate across replicates in cantilever and phone-stand tasks.

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.

PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

PRISM automates continuous prompt creation, simulation-based testing, diagnosis, and repair for enterprise LLM agents, cutting authoring time to under 30 minutes while reaching 99% reliability and catching drift within 24 hours.

Learning, Fast and Slow: Towards LLMs That Adapt Continually

cs.LG · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

Fast-Slow Training uses context optimization as fast weights alongside parameter updates as slow weights to achieve up to 3x better sample efficiency, higher performance, and less catastrophic forgetting than standard RL in continual LLM learning.

TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

TSCG compiles JSON tool schemas into token-efficient structured text, raising tool-use accuracy for small LLMs from 0% to 84.4% on benchmarks while cutting tokens by 52-57%.

Unlocking Prompt Infilling Capability for Diffusion Language Models

cs.CL · 2026-04-04 · unverdicted · novelty 7.0

Full-sequence masking in SFT unlocks prompt infilling for masked diffusion language models, producing templates that match or surpass hand-designed ones and transfer across models.

Agile Deliberation: Concept Deliberation for Subjective Visual Classification

cs.AI · 2025-12-11 · conditional · novelty 7.0

Agile Deliberation improves F1 scores by 7.5% over automated baselines and 3% over manual deliberation in 18 user sessions by supporting iterative refinement of subjective visual concepts.

Reflective Prompt Tuning through Language Model Function-Calling

cs.CL · 2026-05-20 · unverdicted · novelty 6.0

Reflective Prompt Tuning uses LLM function calling and diagnostic reports to iteratively optimize prompts, yielding up to 12.9 point gains on reasoning tasks while improving calibration.

optimize_anything: A Universal API for Optimizing any Text Parameter

cs.CL · 2026-05-19 · unverdicted · novelty 6.0

A universal LLM optimizer for text artifacts achieves SOTA results on six tasks including tripling ARC-AGI accuracy and cutting cloud costs by 40% via cross-task transfer and side information.

Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering

cs.CL · 2026-05-15 · conditional · novelty 6.0

NCCE reframes context engineering as instance-level recommendation via bootstrapped anchor contexts and a co-evolving neural collaborative filtering router that assigns specialized contexts per input.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

Prompt optimization in compound AI systems is statistically indistinguishable from random chance except when tasks have exploitable output structure; a two-stage diagnostic predicts success.

LLM-Guided Prompt Evolution for Password Guessing

cs.CR · 2026-04-14 · unverdicted · novelty 6.0

LLM-guided evolutionary prompt optimization using MAP-Elites and island models raises password cracking rates from 2.02% to 8.48% on a RockYou-derived test set across local, cloud, and ensemble LLM setups.

Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees

cs.AI · 2026-04-13 · unverdicted · novelty 6.0

POES frames prompt evaluation as online adaptive testing and uses a provably submodular objective to pick informative examples, delivering 6.2% higher average accuracy and 35-60% token savings versus naive full-set scoring.

ART: Automatic multi-step reasoning and tool-use for large language models

cs.CL · 2023-03-16 · unverdicted · novelty 6.0

ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.

Less Back-and-Forth: A Comparative Study of Structured Prompting

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

Checklist-improved prompts achieve the highest mean rubric score (7.50/8) and best quality-effort tradeoff compared to raw prompts (5.67) and clarifying-question prompts (6.67) across four task types and three LLMs.

Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis

cs.AI · 2026-04-12 · unverdicted · novelty 5.0

Agent Mentor analyzes semantic trajectories in agent logs to identify undesired behaviors and derives corrective prompt instructions, yielding measurable accuracy gains on benchmark tasks across three agent setups.

Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM

cs.CL · 2026-04-10 · unverdicted · novelty 5.0

AIR excels on label-remapping classification tasks while KNN retrieval leads on closed-book QA and fine-tuning leads on structured extraction and event-order reasoning, showing task-dependent adaptation performance.

Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

cs.AI · 2026-05-17 · unverdicted · novelty 4.0

TIDE integrates trial and debate mechanisms to improve criteria-based prompt optimization for argumentative essay tasks including automated scoring, component detection, and relation identification.

Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models

cs.CL · 2024-08-25 · unverdicted · novelty 4.0

GPT-4o and Claude 3.5 Sonnet reach 73.7-74% accuracy on gastroenterology questions; VLMs gain nothing from images and lose accuracy with LLM-generated captions.

A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

cs.AI · 2024-02-05 · unverdicted · novelty 3.0

A systematic survey categorizes prompt engineering methods for LLMs and VLMs by application area, summarizing methodologies, applications, models, datasets, strengths, and limitations for each technique along with a taxonomy and summary table.

Natural Language Processing in the Legal Domain

cs.CL · 2023-02-23 · unverdicted · novelty 3.0

A survey of nearly 1000 NLP & Law papers from 2013-2024 documenting increases in publication volume, scope, methodological sophistication, and data/code availability.

Bridging Language Models and Financial Analysis

q-fin.ST · 2025-03-14 · unverdicted · novelty 2.0

A survey synthesizing recent LLM research and assessing its applicability to financial data analysis.

citing papers explorer

Showing 23 of 23 citing papers.

TO-Agents: A Multi-Agent AI Pipeline for Preference-Guided Topology Optimization · cs.AI · 2026-05-20 · unverdicted · none · ref 20
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains · cs.AI · 2026-05-19 · unverdicted · none · ref 54
PRISM: Prompt Reliability via Iterative Simulation and Monitoring for Enterprise Conversational AI · cs.AI · 2026-05-15 · unverdicted · none · ref 3
Learning, Fast and Slow: Towards LLMs That Adapt Continually · cs.LG · 2026-05-12 · unverdicted · none · ref 70 · 2 links
TSCG: Deterministic Tool-Schema Compilation for Agentic LLM Deployments · cs.SE · 2026-05-04 · unverdicted · none · ref 31
Unlocking Prompt Infilling Capability for Diffusion Language Models · cs.CL · 2026-04-04 · unverdicted · none · ref 27
Agile Deliberation: Concept Deliberation for Subjective Visual Classification · cs.AI · 2025-12-11 · conditional · none · ref 46
Reflective Prompt Tuning through Language Model Function-Calling · cs.CL · 2026-05-20 · unverdicted · none · ref 2
optimize_anything: A Universal API for Optimizing any Text Parameter · cs.CL · 2026-05-19 · unverdicted · none · ref 34
Contexting as Recommendation: Evolutionary Collaborative Filtering for Context Engineering · cs.CL · 2026-05-15 · conditional · none · ref 41
How Far Are Video Models from True Multimodal Reasoning? · cs.CV · 2026-04-21 · unverdicted · none · ref 96
Prompt Optimization Is a Coin Flip: Diagnosing When It Helps in Compound AI Systems · cs.AI · 2026-04-16 · unverdicted · none · ref 14
LLM-Guided Prompt Evolution for Password Guessing · cs.CR · 2026-04-14 · unverdicted · none · ref 28
Select Smarter, Not More: Prompt-Aware Evaluation Scheduling with Submodular Guarantees · cs.AI · 2026-04-13 · unverdicted · none · ref 1
ART: Automatic multi-step reasoning and tool-use for large language models · cs.CL · 2023-03-16 · unverdicted · none · ref 140
Less Back-and-Forth: A Comparative Study of Structured Prompting · cs.CL · 2026-05-19 · unverdicted · none · ref 7
Agent Mentor: Framing Agent Knowledge through Semantic Trajectory Analysis · cs.AI · 2026-04-12 · unverdicted · none · ref 34
Automated Instruction Revision (AIR): A Structured Comparison of Task Adaptation Strategies for LLM · cs.CL · 2026-04-10 · unverdicted · none · ref 3
Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate · cs.AI · 2026-05-17 · unverdicted · none · ref 102
Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models · cs.CL · 2024-08-25 · unverdicted · none · ref 44
A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications · cs.AI · 2024-02-05 · unverdicted · none · ref 24
Natural Language Processing in the Legal Domain · cs.CL · 2023-02-23 · unverdicted · none · ref 43
Bridging Language Models and Financial Analysis · q-fin.ST · 2025-03-14 · unverdicted · none · ref 121
