CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
Controlled decoding from language models
13 Pith papers cite this work.
representative citing papers
Introduces TBPO, which derives a Bregman-divergence density-ratio matching objective for token-level preference optimization that generalizes DPO while preserving the induced optimal policy.
UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.
TRAM is a test-time mixture method that scores and composes risk-neutral source policies using reward and occupancy-based risk to achieve new reward-risk tradeoffs without parameter updates.
A parallel-tempering evolutionary framework for LLM hypothesis search improves both quality and diversity of candidates in molecular, equation, and algorithm discovery under fixed validation budgets.
Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.
Value-filtered decoding steers LLM outputs for safety at decoding time using a value criterion with an explicit bound on false interventions controlled by one threshold hyperparameter.
DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, without any weight updates.
LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail at length scaling due to recursive instability, with data coverage setting hard limits.
Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.
Hyformer jointly models molecule generation and property prediction via alternating attention and joint pre-training, showing synergistic gains in conditional sampling, OOD prediction, and a drug design case for antimicrobial peptides.
MI-EPO maximizes joint conditional mutual information among responses, feedback, and preference vectors, using probabilistic routing to improve alignment and controllability in multi-objective LLM optimization.