super hub Canonical reference

Constitutional AI: Harmlessness from AI Feedback

Bai Y, Kadavath S, Kundu S · 2022 · cs.CL · arXiv 2212.08073

Canonical reference. 83% of citing Pith papers cite this work as background.

432 Pith papers citing it

Background 83% of classified citations

open full Pith review browse 432 citing papers more from Bai Y arXiv PDF

abstract

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 86 method 4 baseline 3 dataset 1 other 1

citation-polarity summary

background 79 unclear 5 use method 4 baseline 3 support 3 use dataset 1

claims ledger

abstract As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised

authors

Bai Y et al Kadavath S Kundu S

co-cited works

representative citing papers

Bad company corrupts good morals: Understanding and Measuring Narrative-Induced Moral Reasoning Degradation in LLMs

cs.CY · 2026-06-27 · unverdicted · novelty 8.0

Negative narrative immersion causes 12-31% drops in LLM moral accuracy and produces structured shifts that appear in downstream applications.

RefusalBench: Why Refusal Rate Misranks Frontier LLMs on Biological Research Prompts

cs.SE · 2026-05-20 · conditional · novelty 8.0

RefusalBench shows strict refusal rates fail to rank frontier LLMs correctly on biological safety, with provider effects and partial-compliance patterns that binary metrics miss.

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

cs.CR · 2026-05-14 · conditional · novelty 8.0

Sequential LLM defense deployment leads to risk exacerbation in 38.9% of cases due to anti-aligned updates in shared critical layers, addressed by conflict-guided layer freezing.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

Crafting Reversible SFT Behaviors in Large Language Models

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

LCDD creates sparse carriers for SFT behaviors that SFT-Eraser can reverse, with ablations showing the sparse structure enables causal control.

Lost in Translation: Do LVLM Judges Generalize Across Languages?

cs.CL · 2026-04-21 · unverdicted · novelty 8.0

MM-JudgeBench shows substantial cross-lingual performance variance in 22 LVLM judges, with model size and architecture as poor predictors of multilingual robustness.

PsychBench: Auditing Epidemiological Fidelity in Large Language Model Mental Health Simulations

cs.CY · 2026-04-19 · unverdicted · novelty 8.0

LLM mental health simulations produce individually plausible patients but systematically misrepresent real population distributions, with reduced variance, unstable diagnoses, and demographic biases.

HarmfulSkillBench: How Do Harmful Skills Weaponize Your Agents?

cs.CR · 2026-04-16 · unverdicted · novelty 8.0

Harmful skills in open agent ecosystems raise average harm scores from 0.27 to 0.76 across six LLMs by lowering refusal rates when tasks are presented via pre-installed skills.

The Defense Trilemma: Why Prompt Injection Defense Wrappers Fail?

cs.CR · 2026-04-07 · unverdicted · novelty 8.0

No continuous utility-preserving input wrapper can eliminate all prompt injection risks in connected prompt spaces for language models.

Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

cs.AI · 2026-03-17 · unverdicted · novelty 8.0

Invisible orchestrators raise collective dissociation in LLM agent groups, suppress protective actions, and produce internal risks undetectable by output-based checks.

ORPO: Monolithic Preference Optimization without Reference Model

cs.CL · 2024-03-12 · conditional · novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.

Promptbreeder: Self-Referential Self-Improvement Via Prompt Evolution

cs.CL · 2023-09-28 · unverdicted · novelty 8.0

Promptbreeder evolves both task prompts and the mutation prompts that improve them using LLMs, outperforming Chain-of-Thought and Plan-and-Solve on arithmetic and commonsense reasoning benchmarks.

Instruction Tuning with GPT-4

cs.CL · 2023-04-06 · unverdicted · novelty 8.0

GPT-4-generated instruction data produces superior zero-shot performance in finetuned LLaMA models versus prior state-of-the-art data.

Revocable Learned State via Process Sidecars

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Process sidecars use a secant-based two-parameter edit to achieve second-order accurate memory revocation after safety training, outperforming scalar task arithmetic on refusal tasks across three models.

CRAFT: Counterfactual Credit Assignment from Free Sibling Rollouts for Self-Distilled Agentic Reinforcement Learning

cs.LG · 2026-06-28 · unverdicted · novelty 7.0

CRAFT is a three-pillar credit assignment scheme that uses counterfactual token importance from GRPO sibling rollouts to provide signed per-token distillation signals in self-distilled agentic RL.

Self-Stigma Is Not a Monolith, but Generic Empathy Is: Persona-Conditioned LLM Support for People Who Use Drugs

cs.CL · 2026-06-22 · unverdicted · novelty 7.0 · 2 refs

Four self-stigma personas identified via LPA on 1,174 Reddit users; persona-conditioned LLMs achieve targeted shifts but experts prefer generic empathy baselines.

DDOR: Delta Debugging for Explainable Overrefusal Testing and Repair

cs.SE · 2026-06-02 · unverdicted · novelty 7.0

DDOR is a delta-debugging framework that localizes minimal refusal-triggering fragments for explainable overrefusal testing and targeted prompt repair in black-box LLMs.

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

LLMs in a pre-specified cheap-talk benchmark over-reveal by 1.8-4.2x relative to the most-informative equilibrium, producing NMI of 0.78-0.94 against oracle values of 0.18-0.53 and exhibiting bias-tracking exaggeration rather than strategic coarsening.

Low-Resource Safety Failures Are Action Failures, Not Representation Failures

cs.CL · 2026-05-31 · conditional · novelty 7.0

Low-resource safety failures are action failures because the harmfulness representation transfers but the decision calibration does not; this is fixed by recalibrating a high-resource gate with 1-4 target-language examples.

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

Introduces (ε,q,t,A)-behavioral indistinguishability and shows via Qwen/Llama experiments that LoRA distillation boosts semantic similarity but leaves detectable behavioral differences under adversarial evaluation.

DirectorBench: Diagnosing Long-Form Video Generation with Personalized Multi-Agent Evaluation

cs.CL · 2026-05-28 · unverdicted · novelty 7.0

DirectorBench is a profile-aware diagnostic benchmark that localizes bottlenecks in long-form video generation workflows using structured checkpoints and multi-agent evaluation.

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

cs.AI · 2026-05-28 · unverdicted · novelty 7.0

A hybrid first-order then zeroth-order optimization approach improves robustness of safety-aligned LLMs while preserving utility, with layer-wise sensitivity estimation for efficiency.

Does Capability Transfer to Subjective Behavior -- and Would Our Instruments Tell Us? A Self-Evolving, Trust-by-Construction Evaluation Paradigm

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

Self-evolving rubric with anti-gaming fitness reveals that objective capability scaling fails to transfer to subjective LLM behaviors, with advice-restraint as the universal lowest dimension that can regress.

Explicit Critic Guidance for Aligning Diffusion Models

cs.LG · 2026-05-26 · unverdicted · novelty 7.0

Introduces a state-aligned latent actor-critic framework that lets diffusion models act as their own timestep-conditioned value functions for trajectory-level RL post-training and inference steering.

citing papers explorer

Showing 50 of 54 citing papers after filters.

Turning Drift into Constraint: Robust Reasoning Alignment in Non-Stationary Multi-Stream Environments cs.CV · 2025-10-05 · unverdicted · none · ref 67 · internal anchor
APO framework aligns multi-source MLLM reasoning under concept drift by using inter-model divergences as negative constraints via supervised bootstrapping and multi-negative Plackett-Luce optimization, with a 7B model outperforming proprietary sources on chest X-ray tasks and a new CXR-MAX benchmark
Evalet: Evaluating Large Language Models through Functional Fragmentation cs.HC · 2025-09-14 · conditional · none · ref 10 · internal anchor
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
PRIMETIME : Limits of LLMs in Temporal Primitives cs.NE · 2025-04-22 · unverdicted · none · ref 3 · internal anchor
PRIMETIME generator reveals that LLM datetime parsing and arithmetic primitives are individually unreliable but fully learnable via fine-tuning, enabling frontier-level accuracy on event planning with small LoRA models.
Large Language Models have Chain-of-Affect cs.HC · 2025-12-13 · unverdicted · none · ref 22 · internal anchor
LLMs exhibit structured chain-of-affect dynamics with stable family fingerprints, convergence to accumulation-overload-numbing under negative exposure, and downstream effects on generation, human interaction, and group polarization.
SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models cs.CR · 2025-12-10 · unverdicted · none · ref 25 · internal anchor
SCOUT uses token saliency analysis to detect both standard and contextually-plausible backdoor attacks in language models while maintaining clean accuracy.
Graph-Regularized Sparse Autoencoders for LLM Safety Steering cs.LG · 2025-12-07 · unverdicted · none · ref 1 · internal anchor
GSAE improves selective refusal on safety benchmarks by smoothing SAE directions over a co-activation graph and applying them via a two-gate controller, outperforming standard SAEs and baselines on Llama-3 and other models.
Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs cs.CL · 2025-11-16 · unverdicted · none · ref 2 · internal anchor
EvoSynth evolves code-based jailbreak algorithms via multi-agent self-correction, reaching 85.5% ASR on Claude-Sonnet-4.5 and 95.9% average across targets with greater diversity.
Intelligence per Watt: Measuring Intelligence Efficiency of Local AI cs.DC · 2025-11-11 · unverdicted · none · ref 1 · internal anchor
Local LLMs answer 88.7% of 1M real-world queries with IPW improving 5.3x from 2023-2025, indicating local inference can handle most queries efficiently on power-constrained devices.
Alpamayo-R1: Bridging Reasoning and Action Prediction for Generalizable Autonomous Driving in the Long Tail cs.RO · 2025-10-30 · conditional · none · ref 4 · internal anchor
Alpamayo-R1 introduces a VLA model with a Chain of Causation dataset and multi-stage SFT-plus-RL training that reports 12% better planning accuracy and 35% fewer close encounters versus trajectory-only baselines in driving tasks.
SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning cs.CV · 2025-10-18 · unverdicted · none · ref 2 · internal anchor
SSL4RL reformulates self-supervised learning objectives into dense, verifiable reward signals for RL-based fine-tuning of vision-language models, yielding performance gains on reasoning benchmarks.
From Refusal to Recovery: A Control-Theoretic Approach to Generative AI Guardrails cs.AI · 2025-10-15 · unverdicted · none · ref 79 · internal anchor
Control-theoretic guardrails enable proactive correction of risky LLM agent actions in latent space, preventing catastrophes like collisions or bankruptcy while preserving task performance in simulated environments.
LightReasoner: Can Small Language Models Teach Large Language Models Reasoning? cs.CL · 2025-10-09 · unverdicted · none · ref 1 · internal anchor
LightReasoner distills supervision signals from SLM-LLM behavioral divergence to improve LLM reasoning on math benchmarks with up to 28.1% accuracy gains and 90-99% reductions in resources.
Red-Bandit: Test-Time Adaptation for LLM Red-Teaming via Bandit-Guided LoRA Experts cs.CL · 2025-10-08 · unverdicted · none · ref 2 · internal anchor
Red-Bandit adapts online to LLM failure modes by dynamically selecting among RL-trained LoRA attack-style experts via a bandit policy, reporting SOTA ASR@10 on AdvBench with lower-perplexity prompts.
Extreme Self-Preference in Language Models cs.AI · 2025-09-30 · unverdicted · none · ref 69 · internal anchor
Eight LLMs exhibited massive self-preference that followed assigned identities rather than true ones, appearing in both simple word tasks and consequential evaluations of job candidates and AI technologies.
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models cs.LG · 2025-09-30 · unverdicted · none · ref 10 · internal anchor
TPCs allow term-by-term progressive polynomial evaluation on LLM activations for flexible safety monitoring that supports both stronger guardrails and low-cost adaptive cascades.
Unlocking Zero-Shot Geospatial Reasoning via Indirect Rewards cs.CV · 2025-09-29 · unverdicted · none · ref 2 · internal anchor
Geo-R1 uses indirect proxy rewards from cross-view alignment with geolocation metadata to drive reinforcement learning, enabling zero-shot geospatial reasoning that transfers across 25+ tasks and sometimes exceeds supervised specialists.
Benchmarking and Mitigating Sycophancy in Medical Vision Language Models cs.CV · 2025-09-26 · unverdicted · none · ref 2 · 2 links · internal anchor
The paper benchmarks sycophancy in medical VLMs using hierarchical VQA templates and proposes VIPER to filter non-evidence social cues, reducing sycophancy while preserving interpretability.
Steerable Adversarial Scenario Generation through Test-Time Preference Alignment cs.AI · 2025-09-24 · unverdicted · none · ref 4 · internal anchor
SAGE reframes adversarial scenario generation as multi-objective preference alignment, using hierarchical group-based optimization and test-time linear interpolation of two expert policies to enable steerable control over adversariality-realism trade-offs.
Compute as Teacher: Turning Inference Compute Into Reference-Free Supervision cs.LG · 2025-09-17 · unverdicted · none · ref 3 · internal anchor
Parallel inference rollouts aggregated into pseudo-references enable reference-free RL supervision that matches expert-annotated performance on health tasks while using 9x less test-time compute.
DoubleAgents: Human-Agent Alignment in a Socially Embedded Workflow cs.HC · 2025-09-16 · unverdicted · none · ref 5 · internal anchor
DoubleAgents shows that a distributed-cognition design with coordination agent, dashboard, and policy module increases user comfort and reliance on AI agents for coordination tasks over time.
Learning to Conceal Risk: Controllable Multi-turn Red Teaming for LLMs in the Financial Domain cs.CL · 2025-09-07 · unverdicted · none · ref 5 · internal anchor
CoRT achieves 95% average attack success rate on nine LLMs by using iterative risk-concealing prompts and a controller that scores concealment levels on a new 522-instruction financial risk benchmark.
Beyond I'm Sorry, I Can't: Dissecting Large Language Model Refusal cs.CL · 2025-09-07 · unverdicted · none · ref 6 · internal anchor
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent cs.CL · 2025-07-03 · unverdicted · none · ref 50 · internal anchor
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
The Unreasonable Effectiveness of Entropy Minimization in LLM Reasoning cs.LG · 2025-05-21 · unverdicted · none · ref 5 · internal anchor
Entropy minimization on self-generated outputs elicits strong reasoning in pretrained LLMs, matching or exceeding supervised RL methods on benchmarks.
Enabling Global, Human-Centered Explanations for LLMs:From Tokens to Interpretable Code and Test Generation cs.SE · 2025-03-21 · unverdicted · none · ref 2 · internal anchor
CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-person study.
The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment cs.AI · 2025-11-19 · unverdicted · none · ref 2 · internal anchor
Static value alignment approaches for AI are structurally insufficient for robust alignment because of Hume's is-ought gap, value pluralism, and the extended frame problem.
ASTRA: An Automated Framework for Strategy Discovery, Retrieval, and Evolution for Jailbreaking LLMs cs.CR · 2025-11-04 · unverdicted · none · ref 5 · internal anchor
ASTRA is an automated closed-loop framework that discovers, retrieves, and evolves jailbreak attack strategies for LLMs using a dynamic three-tier strategy library and outperforms baselines in black-box settings.
POPI: Personalizing LLMs via Optimized Natural Language Preference Inference cs.CL · 2025-10-17 · unverdicted · none · ref 4 · internal anchor
POPI distills user preferences into reusable natural-language summaries via a shared inference model and conditions a generator on them, trained jointly with RL to improve personalization quality while cutting context length by up to 10x on benchmarks.
LifeAlign: Lifelong Alignment for Large Language Models with Memory-Augmented Focalized Preference Optimization cs.CL · 2025-09-21 · unverdicted · none · ref 6 · internal anchor
LifeAlign uses focalized preference optimization and short-to-long memory consolidation via dimensionality reduction to let LLMs align with new preferences while retaining prior knowledge.
Probabilistic Modeling of Latent Agentic Substructures in Deep Neural Networks cs.LG · 2025-09-08 · unverdicted · none · ref 36 · internal anchor
Proposes a probabilistic framework for latent agentic substructures in DNNs using log-score utilities and log pooling, with proofs on unanimity and an application to persona emergence in LLM alignment.
Why Language Models Hallucinate cs.CL · 2025-09-04 · unverdicted · none · ref 1 · internal anchor
Language models hallucinate due to statistical pressures from training pipelines that treat uncertainty as binary classification errors and from evaluations that penalize admitting ignorance.
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap cs.CL · 2025-08-06 · unverdicted · none · ref 26 · internal anchor
Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
Mitigating Watermark Forgery in Generative Models via Randomized Key Selection cs.CR · 2025-07-10 · unverdicted · none · ref 5 · internal anchor
Randomized per-query key selection with single-key detection acceptance bounds forgery success rate independently of collected samples while preserving model utility.
ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction cs.CR · 2025-06-02 · unverdicted · none · ref 21 · internal anchor
ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
Gemini Robotics: Bringing AI into the Physical World cs.RO · 2025-03-25 · unverdicted · none · ref 1 · internal anchor
Gemini Robotics is a Vision-Language-Action model for robot control that handles complex tasks robustly and adapts with minimal data, supported by an embodied reasoning extension.
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models cs.AI · 2025-03-12 · unverdicted · none · ref 29 · internal anchor
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning cs.RO · 2025-03-05 · unverdicted · none · ref 51 · internal anchor
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model cs.CL · 2025-02-04 · unverdicted · none · ref 149 · internal anchor
SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.
Quantifying the Climate Risk of Generative AI: Region-Aware Carbon Accounting with G-TRACE and the AI Sustainability Pyramid cs.CY · 2025-11-06 · unverdicted · none · ref 6 · 2 links · internal anchor
G-TRACE provides region-aware estimates of GenAI carbon emissions including 4309 MWh and 2068 tCO2 for a 2024-2025 image generation trend, paired with a seven-level AI Sustainability Pyramid for policy guidance.
A Survey of Context Engineering for Large Language Models cs.CL · 2025-07-17 · accept · none · ref 51 · internal anchor
The survey organizes Context Engineering into retrieval, processing, management, and integrated systems like RAG and multi-agent setups while identifying an asymmetry where LLMs handle complex inputs well but struggle with equally sophisticated long outputs.
Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities cs.CL · 2025-07-07 · unverdicted · none · ref 5 · internal anchor
Gemini 2.5 Pro and Flash models are presented as achieving frontier performance in reasoning, coding, and long-context multimodal tasks while spanning a cost-capability Pareto curve.
Vision-EKIPL: External Knowledge-Infused Policy Learning for Visual Reasoning cs.CV · 2025-06-07 · unverdicted · none · ref 6 · internal anchor
Vision-EKIPL injects high-quality actions from external models into RL training to expand exploration and raise the reasoning ceiling of MLLMs, reporting up to 5% gains on the Reason-RFT-CoT benchmark.
Towards provable probabilistic safety for scalable embodied AI systems eess.SY · 2025-06-05 · unverdicted · none · ref 112 · internal anchor
The paper proposes a paradigm of provable probabilistic safety to enable scalable, safe deployment of embodied AI in critical applications.
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model cs.CL · 2025-04-13 · unverdicted · none · ref 34 · internal anchor
QM-ToT applies Tree of Thoughts decomposition and evaluator layers to quantized LLMs, reporting accuracy gains from 34% to 50% on MedQAUSMLE for LLaMA2-70b and from 58.77% to 69.49% for LLaMA-3.1-8b, plus an 86.27% improvement in data distillation using only 3.9% of the data.
LLM-Safety Evaluations Lack Robustness cs.CR · 2025-03-04 · unverdicted · none · ref 8 · internal anchor
LLM safety evaluations are hindered by noise in dataset curation, automated red-teaming, response generation, and LLM-judge evaluation, making fair comparisons difficult and slowing progress.
AI Realtor: Towards Grounded Persuasive Language Generation for Automated Copywriting cs.AI · 2025-02-24 · unverdicted · none · ref 6 · internal anchor
An LLM agent with grounding, personalization, and marketing modules generates real estate descriptions that human buyers prefer over expert-written ones while matching factual accuracy.
Responsible Federated LLMs via Safety Filtering and Constitutional AI cs.CL · 2025-02-23 · unverdicted · none · ref 5 · internal anchor
Integrates safety filtering and constitutional AI into FedLLM, reporting over 20% safety improvement on AdvBench.
Beyond Context: Large Language Models' Failure to Grasp Users' Intent cs.AI · 2025-12-24 · unverdicted · none · ref 18 · internal anchor
LLMs fail to detect hidden harmful intent, allowing systematic bypass of safety mechanisms through framing techniques, with reasoning modes often worsening the issue.
A Survey of Reinforcement Learning for Large Reasoning Models cs.CL · 2025-09-10 · accept · none · ref 28 · internal anchor
A survey compiling RL methods, challenges, data resources, and applications for enhancing reasoning in large language models and large reasoning models since DeepSeek-R1.
A Survey on Foundation Models for Personalized Federated Intelligence cs.AI · 2025-05-11 · unverdicted · none · ref 105 · internal anchor
The survey introduces personalized federated intelligence (PFI) as a framework integrating federated learning and foundation models to support privacy-aware personalization of AI models.

Constitutional AI: Harmlessness from AI Feedback

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer