LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.
Contrastive Preference Learning: Learning from Human Feedback without RL
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
representative citing papers
UniIntervene uses future-conditioned action-value estimation and a temporal value-risk critic to trigger memory-based recovery interventions, reporting 8.6% higher success rates and 57% fewer human interventions than prior HiL-RL methods on real manipulation tasks.
RePO reframes RLHF through regret minimization by modeling preferences as behavior-conditioned relative suboptimality assessments and reports performance gains on reasoning and preference benchmarks.
DIPPER uses bi-level optimization and DPO to train the higher-level policy from stationary preference comparisons and value regularization, claiming up to 40% gains on robotic navigation and manipulation tasks while introducing metrics for non-stationarity and infeasible subgoals.
citing papers explorer
- UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning
UniIntervene uses future-conditioned action-value estimation and a temporal value-risk critic to trigger memory-based recovery interventions, reporting 8.6% higher success rates and 57% fewer human interventions than prior HiL-RL methods on real manipulation tasks.