LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.
hub
Terry , title =
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
method 3polarities
use method 3representative citing papers
Indirect elicitation via triplet comparisons recovers meaningful association structures from LLMs and supports conservative causal candidate links across prompted subpopulations.
STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.
DPO derives the optimal policy directly from human preferences via a reparameterized reward model, solving the RLHF objective with only a binary classification loss and no sampling or separate reward model.
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
DPOP is a new loss function that prevents DPO from lowering preferred response likelihoods and outperforms standard DPO on diverse datasets, MT-Bench, and enables Smaug-72B to exceed 80% on the Open LLM Leaderboard.
LLM ranking reliability for prioritization tasks can be assessed via coefficient of consistency ζ (intra-run circular triads) and Kendall's τ (inter-run distance), with three leading models showing distinct consistency profiles on homelessness allocation and ED triage.
DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.
QUIVER adaptively mixes objective evaluations with two types of preference queries in surrogate-assisted evolutionary multi-objective optimization to reduce final utility regret, reporting 25% gains on hard WFG benchmarks.
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.
citing papers explorer
-
Contextualizing Biological Language Models across Modalities via Logit-Space Contrastive Alignment
LOGICA adds context to pretrained biological LMs via logit-space contrastive alignment with gated adapters, improving AUC on held-out drug-resistance mutation ranking from ~0.55 to ~0.65 while preserving token likelihoods.
-
Eliciting associations between clinical variables from LLMs via comparison questions across populations
Indirect elicitation via triplet comparisons recovers meaningful association structures from LLMs and supports conservative causal candidate links across prompted subpopulations.
-
Soft Tournament Equilibrium
STE is a differentiable method to compute continuous analogues of the Top Cycle and Uncovered Set from pairwise comparison data for stable set-valued evaluation of cyclic agent interactions.
-
Consistency Training while Mitigating Obfuscation via Rate Matching
RMCT matches the rate of target behaviors like bias-following across input perturbations to reduce sycophancy in LLMs while preserving verbalization of bias cues.
-
Understanding Goal Generalisation in Sequential Reinforcement Learning
Empirical analysis of over 100 sequential RL training pipelines across 250+ OOD environments finds salient features drive generalization and early goals persist, with latent policy gradients simulating latent variable evolution to predict OOD behavior from training history.
-
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
-
Can LLMs Rank? A Tale of Triads and Triage
LLM ranking reliability for prioritization tasks can be assessed via coefficient of consistency ζ (intra-run circular triads) and Kendall's τ (inter-run distance), with three leading models showing distinct consistency profiles on homelessness allocation and ED triage.
-
DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity
DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.
-
QUIVER: Cost-Aware Adaptive Preference Querying in Surrogate-Assisted Evolutionary Multi-Objective Optimization
QUIVER adaptively mixes objective evaluations with two types of preference queries in surrogate-assisted evolutionary multi-objective optimization to reduce final utility regret, reporting 25% gains on hard WFG benchmarks.
-
Aligning Modalities in Vision Large Language Models via Preference Fine-tuning
POVID generates AI-created preference data to fine-tune vision-language models with DPO, reducing hallucinations and improving benchmark scores.