In: Findings of the Association for Computational Linguistics: EMNLP 2024
4 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Auditing Stealth Sycophancy in Mental-Health Dialogue: Structured Clinical-State Diagnostics and Clean Matched Benchmarks
  Introduces a clean matched benchmark and Dynamic Emotional Signature Graphs (DESG) framework that detects implicit sycophancy via clinical-state transitions and reports a 0.0488 macro-F1 gain over baselines on harmful-risk detection.
- Debiasing Reward Models via Causally Motivated Inference-Time Intervention
  Neuron-level inference-time intervention reduces multiple biases in reward models, enabling 2B and 7B models to match 70B performance on LLM alignment benchmarks without trade-offs.
- AURA: Adaptive Uncertainty-aware Refinement for LLM-as-a-Judge Auditing
  AURA is an adaptive uncertainty-aware refinement method for auditing LLM-as-a-judge pairwise decisions that learns human-consistency signals through selective human verification on uncertain cases.
- Safety is Contextual, LLM-Judges Are Not: Navigating the Rigid Priors of Evaluators
  LLM safety judges resist adjusting evaluations when given contradictory context or new safety definitions, despite some ability to learn from new information.