SURF derives weight sampling rules from the arc-length CDF of the scalarization path to uniformly traverse the Pareto front in multi-objective optimization.
Routledge, 2021
9 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 3polarities
background 3representative citing papers
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.
PROCO generates synthetic unsafe samples via model-based rollouts and LLM-grounded costs to enable safer policy learning from offline datasets containing few or no violations.
Cat-DPO applies per-category adaptive safety margins during direct preference optimization to reduce variance in safety across harm categories.
Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
citing papers explorer
-
SURF: Steering the Scalarization Weight to Uniformly Traverse the Pareto Front
SURF derives weight sampling rules from the arc-length CDF of the scalarization path to uniformly traverse the Pareto front in multi-objective optimization.
-
Shaping Zero-Shot Coordination via State Blocking
SBC generates virtual environments via state blocking to expose agents to diverse suboptimal partner policies, yielding superior zero-shot coordination performance including with humans.
-
A Single Deep Preference-Conditioned Policy for Learning Pareto Coverage Sets
A single preference-conditioned policy achieves unique and Lipschitz-continuous Pareto coverage in multi-objective MDPs via a new mirror-descent policy iteration algorithm with O(1/k) convergence.
-
Why Does Agentic Safety Fail to Generalize Across Tasks?
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
-
Bridging the Gap Between Average and Discounted TD Learning
A new two-trajectory sampling algorithm for average-reward TD learning guarantees convergence with quadratic sample complexity and no explicit dimension dependence in both tabular and linear approximation settings.
-
Model-Based Proactive Cost Generation for Learning Safe Policies Offline with Limited Violation Data
PROCO generates synthetic unsafe samples via model-based rollouts and LLM-grounded costs to enable safer policy learning from offline datasets containing few or no violations.
-
Cat-DPO: Category-Adaptive Safety Alignment
Cat-DPO applies per-category adaptive safety margins during direct preference optimization to reduce variance in safety across harm categories.
-
When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited
Robust minimax task inference in BFMs achieves dynamics-shift robustness from nominal offline data alone and outperforms standard baselines.
-
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.