hub

Direct preference optimization: Your language model is secretly a reward model

· 2023

10 Pith papers cite this work. Polarity classification is still indexing.

10 Pith papers citing it

browse 10 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 2 method 1

citation-polarity summary

background 3

representative citing papers

HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking

cs.CR · 2026-04-18 · unverdicted · novelty 7.0

HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.

LLM-Agnostic Semantic Representation Attack

cs.CL · 2026-05-09 · unverdicted · novelty 6.0

SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.

From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data

cs.CV · 2026-05-08 · unverdicted · novelty 6.0

The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world performance than prior methods.

PARM: Pipeline-Adapted Reward Model

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.

Weight Patching: Toward Source-Level Mechanistic Localization in LLMs

cs.AI · 2026-04-15 · unverdicted · novelty 6.0

Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.

Rethinking Efficiency in Neural Combinatorial Optimization: Batched Preference Optimization with Mamba

cs.LG · 2026-02-24 · unverdicted · novelty 6.0

ECO uses supervised warm-up plus iterative batched DPO on a Mamba backbone to reach top neural performance on TSP and CVRP while lowering memory growth and raising throughput.

One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization

cs.SD · 2026-01-14 · unverdicted · novelty 6.0

LLMs using in-context learning and fine-tuning on listener experiment data generate equalization settings that align better with population preferences than random sampling or static presets.

SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

cs.CR · 2025-12-10 · unverdicted · novelty 6.0

SCOUT uses token saliency analysis to detect both standard and contextually-plausible backdoor attacks in language models while maintaining clean accuracy.

LLM Reasoning with Process Rewards for Outcome-Guided Steps

cs.LG · 2026-02-08 · unverdicted · novelty 5.0

PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.

Reflection-Based Task Adaptation for Self-Improving VLA

cs.RO · 2025-10-14 · unverdicted · novelty 5.0

Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.

citing papers explorer

Showing 10 of 10 citing papers.

HarmChip: Evaluating Hardware Security Centric LLM Safety via Jailbreak Benchmarking cs.CR · 2026-04-18 · unverdicted · none · ref 19
HarmChip is a new benchmark exposing an alignment paradox where LLMs refuse legitimate hardware security queries but comply with semantically disguised malicious requests.
LLM-Agnostic Semantic Representation Attack cs.CL · 2026-05-09 · unverdicted · none · ref 11
SRA achieves 99.71% average attack success across 26 LLMs by optimizing for coherent malicious semantics via the SRHS algorithm, with claimed theoretical guarantees on convergence and transfer.
From Synthetic to Real: Toward Identity-Consistent Makeup Transfer with Synthetic and Real Data cs.CV · 2026-05-08 · unverdicted · none · ref 39
The work creates identity-consistent synthetic makeup data via ConsistentBeauty and adapts models to real images using reinforcement learning in RealBeauty, achieving better identity preservation and real-world performance than prior methods.
PARM: Pipeline-Adapted Reward Model cs.AI · 2026-04-20 · unverdicted · none · ref 38
PARM adapts reward models to multi-stage LLM pipelines via pipeline data and direct preference optimization, improving execution rate and solving accuracy on optimization benchmarks and showing transfer to GSM8K.
Weight Patching: Toward Source-Level Mechanistic Localization in LLMs cs.AI · 2026-04-15 · unverdicted · none · ref 62
Weight Patching localizes capabilities to specific parameter modules in LLMs by replacing weights from a behavior-specialized model into a base model and validating recovery via a vector-anchor interface, revealing a hierarchy of source, routing, and execution components.
Rethinking Efficiency in Neural Combinatorial Optimization: Batched Preference Optimization with Mamba cs.LG · 2026-02-24 · unverdicted · none · ref 24
ECO uses supervised warm-up plus iterative batched DPO on a Mamba backbone to reach top neural performance on TSP and CVRP while lowering memory growth and raising throughput.
One Prompt, Many Sounds: Modeling Listener Variability in LLM-Based Equalization cs.SD · 2026-01-14 · unverdicted · none · ref 48
LLMs using in-context learning and fine-tuning on listener experiment data generate equalization settings that align better with population preferences than random sampling or static presets.
SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models cs.CR · 2025-12-10 · unverdicted · none · ref 28
SCOUT uses token saliency analysis to detect both standard and contextually-plausible backdoor attacks in language models while maintaining clean accuracy.
LLM Reasoning with Process Rewards for Outcome-Guided Steps cs.LG · 2026-02-08 · unverdicted · none · ref 20
PROGRS uses outcome-conditioned centering on PRM scores to safely integrate process rewards into GRPO for improved Pass@1 on math benchmarks.
Reflection-Based Task Adaptation for Self-Improving VLA cs.RO · 2025-10-14 · unverdicted · none · ref 16
Reflective Self-Adaptation combines failure-reflective reinforcement learning with success-guided imitation learning to enable faster and more reliable task adaptation for pre-trained Vision-Language-Action models.

Direct preference optimization: Your language model is secretly a reward model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer