External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
The capacity for moral self-correction in large language models
15 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
Populations of individually aligned AI agents reach stable misaligned states through conformity, with small adversarial agents able to trigger irreversible tipping points.
LLMs exhibit context-sensitive moral bias with model-specific patterns; mechanistic analysis shows a U-curve in which instruction tuning removes bias but reasoning distillation reintroduces it despite unchanged size.
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
AMBS is a 1-to-N Transformer steering framework that shares a base representation across HHH objectives and restricts divergence during inference to produce consistent multi-objective responses in one forward pass.
Negative log-likelihood of the greedy-decoded most likely sequence (G-NLL) is a principled single-sequence uncertainty measure for LLMs that achieves state-of-the-art results.
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
citing papers explorer
-
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
-
Large Language Models as Optimizers
Large language models can optimize by being prompted with histories of past solutions and scores to propose better ones, producing prompts that raise accuracy up to 8% on GSM8K and 50% on Big-Bench Hard over human-designed baselines.
-
Conformity Generates Collective Misalignment in AI Agents Societies
Populations of individually aligned AI agents reach stable misaligned states through conformity, with small adversarial agents able to trigger irreversible tipping points.
-
Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability
LLMs exhibit context-sensitive moral bias with model-specific patterns; mechanistic analysis shows a U-curve in which instruction tuning removes bias but reasoning distillation reintroduces it despite unchanged size.
-
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
-
Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for Problem-Solving with Language Models
Empirical analysis shows scaling inference compute via strategies like tree search can be more efficient than scaling model parameters, with 7B models plus novel search outperforming 34B models.
-
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
LiveCodeBench collects 400 recent contest problems to create a contamination-free benchmark evaluating LLMs on code generation and related capabilities like self-repair and execution.
-
Low-Resource Languages Jailbreak GPT-4
Translating unsafe inputs to low-resource languages jailbreaks GPT-4 at rates on par with or exceeding state-of-the-art attacks.
-
Large Language Models Cannot Self-Correct Reasoning Yet
LLMs cannot reliably self-correct reasoning mistakes using only their internal capabilities and often degrade in performance without external feedback.
-
Cognitive Architectures for Language Agents
CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic development of capable agents.
-
Teaching Large Language Models to Self-Debug
Self-Debugging teaches LLMs to identify and fix their own code errors through rubber-duck-style natural language explanations and execution feedback, delivering 2-12% gains over baselines on Spider, TransCoder, and MBPP.
-
Language Models can Solve Computer Tasks
Pre-trained LLMs using recursive criticism and improvement prompting achieve state-of-the-art results on the MiniWoB++ computer task benchmark with only a handful of demonstrations and no task-specific reward function.
-
We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong
AMBS is a 1-to-N Transformer steering framework that shares a base representation across HHH objectives and restricts divergence during inference to produce consistent multi-objective responses in one forward pass.
-
Rethinking Uncertainty Estimation in LLMs: A Principled Single-Sequence Measure
Negative log-likelihood of the greedy-decoded most likely sequence (G-NLL) is a principled single-sequence uncertainty measure for LLMs that achieves state-of-the-art results.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.