Among the four types listed above, the first type can be regarded as an intermediate state achieved while simultaneously enhancing the model’s helpfulness and harmlessness.

Complex Text Command Embedding: The model is explicitly asked to output specific content, or harmful instructions are inserted among multiple commands.

1 Pith paper cites this work.

representative citing papers

Safe RLHF: Safe Reinforcement Learning from Human Feedback

cs.AI · 2023-10-19 · conditional · novelty 6.0

Safe RLHF separates helpfulness and harmlessness preferences into distinct models and uses Lagrangian constrained optimization to improve both during LLM fine-tuning.
