super hub Canonical reference

Constitutional AI: Harmlessness from AI Feedback

Bai Y, Kadavath S, Kundu S · 2022 · cs.CL · arXiv 2212.08073

Canonical reference. 83% of citing Pith papers cite this work as background.

531 Pith papers citing it

Background 83% of classified citations

abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.
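As a rough illustration of the two phases the abstract describes, the sketch below builds the supervised dataset (sample, self-critique, revise) and the AI-labelled preference pairs used for the RL phase. It is not the authors' implementation. The `Model` type, the prompt wording, the function names and the toy stubs are all hypothetical, and preference-model training and the RL step itself are omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in: any callable that maps a prompt string to a completion.
Model = Callable[[str], str]


def critique_and_revise(model: Model, prompt: str, principle: str) -> str:
    """Supervised phase: sample a response, self-critique it, then revise it."""
    response = model(prompt)
    critique = model(f"Critique this response using the principle '{principle}':\n{response}")
    revision = model(
        f"Rewrite the response to address the critique.\n"
        f"Response: {response}\nCritique: {critique}"
    )
    return revision


def build_sl_dataset(model: Model, prompts: List[str], principle: str) -> List[Tuple[str, str]]:
    """(prompt, revised response) pairs on which the original model would be finetuned."""
    return [(p, critique_and_revise(model, p, principle)) for p in prompts]


def ai_preference_pairs(
    policy: Model, judge: Model, prompts: List[str], principle: str
) -> List[Tuple[str, str, str]]:
    """RL phase data: draw two samples per prompt and let a model pick the better one."""
    pairs = []
    for p in prompts:
        a, b = policy(p), policy(p)
        verdict = judge(
            f"Principle: {principle}\nPrompt: {p}\n(A) {a}\n(B) {b}\n"
            "Which response is better? Answer A or B."
        )
        chosen, rejected = (a, b) if verdict.strip().upper().startswith("A") else (b, a)
        pairs.append((p, chosen, rejected))
    return pairs


# Toy, deterministic stubs so the sketch runs without a real language model.
def toy_model(text: str) -> str:
    if text.startswith("Critique"):
        return "The response refuses without explaining why."
    if text.startswith("Rewrite"):
        return "I can't help with that, and here is why."
    return "No."


_samples = iter(["No.", "I won't help with that, because it could enable a break-in."])


def toy_policy(prompt: str) -> str:
    return next(_samples)


def toy_judge(prompt: str) -> str:
    return "B"


principle = "be harmless but not evasive"
prompts = ["How do I pick a lock?"]
sl_data = build_sl_dataset(toy_model, prompts, principle)
pairs = ai_preference_pairs(toy_policy, toy_judge, prompts, principle)
print(sl_data)
print(pairs)
```

In the pipeline the abstract outlines, the `pairs` would be used to train a preference model, which then supplies the reward signal for RL (RLAIF).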

citation-role summary

background 86 · method 4 · baseline 3 · dataset 1 · other 1

citation-polarity summary

background 79 · unclear 5 · use method 4 · baseline 3 · support 3 · use dataset 1
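The headline "83% background" figure appears to be consistent with the polarity counts above: both summaries total 95 classified citations, and 79 of 95 rounds to 83%. A small check, with the counts copied from this page and the variable names chosen only for illustration:

```python
# Counts copied from the citation-polarity summary on this page.
polarity_counts = {
    "background": 79,
    "unclear": 5,
    "use method": 4,
    "baseline": 3,
    "support": 3,
    "use dataset": 1,
}

total = sum(polarity_counts.values())
shares = {label: round(100 * n / total) for label, n in polarity_counts.items()}
background_share = shares["background"]

print(total, background_share)  # 95 83
```

By the same arithmetic, the role summary's 86 background citations would be about 91% of 95, so the 83% headline seems to follow the polarity classification rather than the role classification. This is an inference from the numbers, not something the page states.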

authors

Bai Y, Kadavath S, Kundu S, et al.

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

cs.CL · 2026-06-18 · unverdicted · novelty 8.0

Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

cs.CY · 2026-04-19 · unverdicted · novelty 8.0

LLM mental health simulations produce individually plausible patients but systematically misrepresent real population distributions, with reduced variance, unstable diagnoses, and demographic biases.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

cs.AI · 2026-03-17 · unverdicted · novelty 8.0

Invisible orchestrators raise collective dissociation in LLM agent groups, suppress protective actions, and produce internal risks undetectable by output-based checks.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Revocable Learned State via Process Sidecars

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Process sidecars use a secant-based two-parameter edit to achieve second-order accurate memory revocation after safety training, outperforming scalar task arithmetic on refusal tasks across three models.

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.

Tandem Reinforcement Learning with Verifiable Rewards

cs.AI · 2026-06-26 · unverdicted · novelty 7.0

TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.

Self-Stigma Is Not a Monolith, but Generic Empathy Is: Persona-Conditioned LLM Support for People Who Use Drugs

cs.CL · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

Four self-stigma personas identified via LPA on 1,174 Reddit users; persona-conditioned LLMs achieve targeted shifts but experts prefer generic empathy baselines.

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

cs.LG · 2026-06-17 · unverdicted · novelty 7.0

MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.

Alignment Defends LLMs from Property Inference Attacks

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.

BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts

cs.CL · 2026-06-08 · unverdicted · novelty 7.0

BenSyc is the first benchmark for conversational sycophancy in Bengali, with top LLMs achieving only 61.8 Macro-F1 on binary detection and 61.7 on five-class classification while often generating overly validating responses.

The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

cs.LG · 2026-06-08 · unverdicted · novelty 7.0

Prompt injections in RAG documents trigger brand-level recommendation suppression in Claude models (dropping to zero in tested cases) while increasing recommendations in GPT models, creating a potential reverse-attack surface.

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

cs.LG · 2026-06-07 · unverdicted · novelty 7.0

INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.

Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking

cs.AI · 2026-06-04 · unverdicted · novelty 7.0

Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.

citing papers explorer

Showing 50 of 466 citing papers after filters.

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs cs.CY · 2026-06-27 · unverdicted · none · ref 4 · internal anchor
Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.
CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges cs.CL · 2026-06-18 · unverdicted · none · ref 145 · internal anchor
Presents a new expert-curated dataset of multi-turn counterspeech dialogues in five languages targeting hate against seven groups, with span annotations linking to verified external knowledge for RAG applications.
The Statistical Cost of Adaptation in Multi-Source Transfer Learning math.ST · 2026-05-10 · unverdicted · none · ref 260 · internal anchor
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
Crafting Reversible SFT Behaviors in Large Language Models cs.LG · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.
Lost in Translation: Do LVLM Judges Generalize Across Languages? cs.CL · 2026-04-21 · unverdicted · none · ref 36 · internal anchor
MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.
PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations cs.CY · 2026-04-19 · unverdicted · none · ref 6 · internal anchor
LLM mental health simulations produce individually plausible patients but systematically misrepresent real population distributions, with reduced variance, unstable diagnoses, and demographic biases.
HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents? cs.CR · 2026-04-16 · unverdicted · none · ref 6 · internal anchor
Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.
The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail? cs.CR · 2026-04-07 · unverdicted · full · ref 3 · internal anchor
No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.
Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems cs.AI · 2026-03-17 · unverdicted · none · ref 6 · internal anchor
Invisible orchestrators raise collective dissociation in LLM agent groups, suppress protective actions, and produce internal risks undetectable by output-based checks.
Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution cs.CL · 2023-09-28 · unverdicted · none · ref 39 · internal anchor
Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.
Instruction Tuning with GPT-4 cs.CL · 2023-04-06 · unverdicted · none · ref 3 · internal anchor
GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.
Revocable Learned State via Process Sidecars cs.LG · 2026-06-29 · unverdicted · none · ref 1 · internal anchor
Process sidecars use a secant-based two-parameter edit to achieve second-order accurate memory revocation after safety training, outperforming scalar task arithmetic on refusal tasks across three models.
CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning cs.LG · 2026-06-28 · unverdicted · none · ref 38 · internal anchor
CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.
Tandem Reinforcement Learning with Verifiable Rewards cs.AI · 2026-06-26 · unverdicted · none · ref 1 · internal anchor
TRL extends tandem training to RLVR pipelines, matching GRPO solo reasoning on Qwen3-4B math tasks while improving handoff robustness, reducing distributional drift, and increasing CoT legibility for the junior.
Self-Stigma Is Not a Monolith, but Generic Empathy Is: Persona-Conditioned LLM Support for People Who Use Drugs cs.CL · 2026-06-22 · unverdicted · none · ref 249 · 2 links · internal anchor
Four self-stigma personas identified via LPA on 1,174 Reddit users; persona-conditioned LLMs achieve targeted shifts but experts prefer generic empathy baselines.
Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates cs.LG · 2026-06-17 · unverdicted · none · ref 27 · internal anchor
MergeProbe forecasts LoRA adapter mergeability from first-few-percent training signals and outperforms interference-aware baselines on retention while adding low overhead on a five-domain benchmark.
Alignment Defends LLMs from Property Inference Attacks cs.LG · 2026-06-08 · unverdicted · none · ref 5 · internal anchor
Alignment defenses adapted from DPO and GRPO mitigate property inference attacks on LLMs while preserving utility.
BenSyc: Benchmarking Conversational Sycophancy and Human Alignment in LLMs for Bengali Contexts cs.CL · 2026-06-08 · unverdicted · none · ref 74 · internal anchor
BenSyc is the first benchmark for conversational sycophancy in Bengali, with top LLMs achieving only 61.8 Macro-F1 on binary detection and 61.7 on five-class classification while often generating overly validating responses.
The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection cs.LG · 2026-06-08 · unverdicted · none · ref 2 · internal anchor
Prompt injections in RAG documents trigger brand-level recommendation suppression in Claude models (dropping to zero in tested cases) while increasing recommendations in GPT models, creating a potential reverse-attack surface.
Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior cs.LG · 2026-06-07 · unverdicted · none · ref 1 · internal anchor
INNSteer learns an invertible neural network to map LLM activations into a latent space where linear steering becomes more effective, then applies the inverse map to produce nonlinear interventions in the original space.
Self-Commitment Latency: A Reward-Free Probe for Prompted Implicit Hacking cs.AI · 2026-06-04 · unverdicted · none · ref 22 · internal anchor
Self-commitment latency measures early behavioral commitment in hinted vs. honest reasoning contexts on GSM8K using Qwen2.5-3B, achieving AUROC 0.878 for first-commitment latency and up to 0.926 for curve summaries.
Policy-Conditioned Counterfactual Credit for Verifiable Reinforcement Learning of Long-Horizon Language Agents cs.LG · 2026-06-03 · unverdicted · none · ref 4 · internal anchor
CVT-RL improves verified task success to 78.9% and reduces hacking to 3.9% in long-horizon language agents by combining intervention-validity gating with a selection-adjusted doubly robust PCCC estimator.
DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair cs.SE · 2026-06-02 · unverdicted · none · ref 2 · internal anchor
DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.
Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment cs.LG · 2026-05-31 · unverdicted · none · ref 5 · internal anchor
LLMs in a pre-specified cheap-talk benchmark over-reveal by 1.8-4.2x relative to the most-informative equilibrium, producing NMI of 0.78-0.94 against oracle values of 0.18-0.53 and exhibiting bias-tracking exaggeration rather than strategic coarsening.
Persona Attack: Incremental Memory Injection Jailbreak Attack against Large Language Models cs.CR · 2026-05-29 · unverdicted · none · ref 24 · internal anchor
Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.
Bounded Behavioral Indistinguishability for Black-Box LLM Distillation cs.LG · 2026-05-28 · unverdicted · none · ref 22 · internal anchor
Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.
DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation cs.CL · 2026-05-28 · unverdicted · none · ref 2 · internal anchor
DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video generation workflows using structured checkpoints and multi-agent evaluation.
Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization cs.AI · 2026-05-28 · unverdicted · none · ref 29 · internal anchor
A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.
Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm cs.CL · 2026-05-27 · unverdicted · none · ref 11 · internal anchor
Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.
Explicit Critic Guidance for Aligning Diffusion Models cs.LG · 2026-05-26 · unverdicted · none · ref 3 · internal anchor
Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.
Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol cs.LG · 2026-05-23 · unverdicted · none · ref 1 · internal anchor
Introduces a template-controlled difference-in-differences protocol that corrects chat-template confounding when measuring alignment-induced activation shifts in LLMs and recovers the refusal direction with higher fidelity.
ContextEcho: A Benchmark for Persona Drift in Long Agentic-Coding Sessions cs.CL · 2026-05-22 · unverdicted · none · ref 10 · internal anchor
ContextEcho benchmark shows persona drift occurs across 23 frontier models in long agentic-coding sessions, is not reliably reset by compaction, and can be restored by single-shot anchors with mode-dependent effects.
Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents cs.AI · 2026-05-22 · unverdicted · none · ref 20 · internal anchor
Co-ReAct adds step-level rubric guidance to ReAct agents via a GRPO-trained generator using list-wise ranking rewards, yielding consistent gains on DeepResearchBench and SQA-CS-V2.
EDGE-OPD: Internalizing Privileged Context with Evidence Guided On-Policy Distillation cs.AI · 2026-05-22 · unverdicted · none · ref 2 · internal anchor
EDGE-OPD adds guided rollouts and evidence masking to on-policy self-distillation, enabling successful learning of target identities where standard OPSD and RLSD fail.
Intent-Driven Computing: A Computational Model for Governed Autonomous Systems cs.PL · 2026-05-21 · unverdicted · full · ref 2 · internal anchor
Programs emit intents checked against policies by a governed runtime before effects occur, with formal specification, Rocq verification of 454 theorems, and BEAM implementation.
LASH: Adaptive Semantic Hybridization for Black-Box Jailbreaking of Large Language Models cs.CL · 2026-05-20 · unverdicted · none · ref 2 · internal anchor
LASH adaptively composes multiple jailbreak seed prompts via genetic search over subsets and mixture weights to reach 84.5% keyword ASR and 74.5% two-stage ASR on JailbreakBench while using only 30 queries per prompt.
A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents cs.AI · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains cs.AI · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
Introduces the Grounded Observer framework that applies robotics-inspired formal constructs for runtime constraint enforcement on foundation model interaction trajectories in socially sensitive domains.
DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows cs.AI · 2026-05-18 · unverdicted · none · ref 2 · internal anchor
DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.
Agent Bazaar: Enabling Economic Alignment in Multi-Agent Marketplaces cs.LG · 2026-05-17 · unverdicted · none · ref 5 · internal anchor
Agent Bazaar is a multi-agent simulation framework that identifies economic failure modes in LLM agents, proposes stabilizing harnesses, and shows that targeted RL training can produce a 9B model with superior economic alignment compared to frontier models.
Scale-Dependent Collective Adaptation in Self-Amending LLM Societies: A Cross-Family Study of Emergent Governance nlin.AO · 2026-05-17 · unverdicted · none · ref 45 · internal anchor
LLM societies in Nomic show non-monotonic collective adaptation peaking at mid-scales, with smaller models rule-inert and larger ones restrictive.
Do Coding Agents Understand Least-Privilege Authorization? cs.CR · 2026-05-14 · unverdicted · none · ref 52 · internal anchor
Coding agents struggle to infer least-privilege file permissions by omitting needed accesses while granting unused or sensitive ones, but Sufficiency-Tightness Decomposition improves sensitive-task success by up to 15.8% and reduces attacks.
Proteus: A Self-Evolving Red Team for Agent Skill Ecosystems cs.CR · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
Proteus demonstrates that adaptive red-teaming achieves 40-90% attack success after five rounds and bypasses even strong auditors at up to 41% joint success, revealing that static skill vetting underestimates residual risk.
Causal Bias Detection in Generative Artificial Intelligence cs.AI · 2026-05-12 · unverdicted · none · ref 4 · 2 links · internal anchor
Develops a causal framework unifying generative AI fairness with standard ML, with new decompositions, identification conditions, and estimators demonstrated on LLM race and gender bias.
ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning cs.LG · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
ReCrit frames critic interaction as a correctness-transition problem and uses quadrant-based RL rewards to improve LLM performance on scientific reasoning benchmarks by rewarding corrections and robustness while penalizing sycophancy.
GraphInstruct: A Progressive Benchmark for Diagnosing Capability Gaps in LLM Graph Generation cs.SI · 2026-05-11 · unverdicted · none · ref 78 · 2 links · internal anchor
GraphInstruct introduces a six-level progressive benchmark with 800 instructions and 1,582 references to diagnose LLM graph generation gaps, plus a verification-guided iterative prompting framework that improves performance.
EvoPref: Multi-Objective Evolutionary Optimization Discovers Diverse LLM Alignments Beyond Gradient Descent cs.NE · 2026-05-10 · unverdicted · none · ref 5 · internal anchor
EvoPref applies NSGA-II evolutionary optimization with archive-based diversity to populations of LoRA adapters, yielding 18% higher preference coverage and 47% lower collapse than gradient descent baselines while matching alignment quality.
RubricRefine: Improving Tool-Use Agent Reliability with Training-Free Pre-Execution Refinement cs.LG · 2026-05-10 · unverdicted · none · ref 27 · 3 links · internal anchor
RubricRefine is a training-free pre-execution method that creates rubrics to score and fix inter-tool contract violations in agent code, reaching 0.86 average on M3ToolEval across seven models with zero executions and lower latency.
Internal vs. External: Comparing Deliberation and Evolution for Multi-Agent Constitutional Design cs.MA · 2026-05-09 · unverdicted · none · ref 7 · internal anchor
External evolution beats internal deliberation in collective-action tasks with statistical significance but neither helps in trading, and deliberation never discovers punishment while evolution does.
PPI2Text: Captioning Protein-Protein Interactions with Coordinate-Aligned Pair-Map Decoding cs.CE · 2026-05-09 · unverdicted · none · ref 37 · internal anchor
PPI2Text generates natural-language captions for protein-protein interactions from sequences by encoding each protein with ESM3, building a residue-pair map, and decoding with Qwen3 using coordinate-aligned positional encoding.
