Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
arXiv preprint arXiv:2303.17071 , year=
7 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
citing papers explorer
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
-
Voyager: An Open-Ended Embodied Agent with Large Language Models
Voyager achieves superior lifelong learning in Minecraft by combining an automatic exploration curriculum, a library of executable skills, and iterative LLM prompting with environment feedback, yielding 3.3x more unique items and 15.3x faster milestone unlocks than prior methods while generalizing技能
-
Reflexion: Language Agents with Verbal Reinforcement Learning
Reflexion lets LLM agents improve via stored verbal reflections on task feedback, reaching 91% pass@1 on HumanEval and outperforming prior GPT-4 results.
-
Towards an AI co-scientist
A multi-agent AI system generates novel biomedical hypotheses that show promising experimental validation in drug repurposing for leukemia, new targets for liver fibrosis, and a bacterial gene transfer mechanism.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Towards Expert-Level Medical Question Answering with Large Language Models
Med-PaLM 2 achieves 86.5% accuracy on MedQA and approaches or exceeds prior state-of-the-art on other medical QA benchmarks while receiving higher physician preference ratings than human answers on consumer questions.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.