Controlled decoding from language models

Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, Yanping Huang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, et al. · 2023 · arXiv 2310.17022

10 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

cs.CL · 2026-05-12 · unverdicted · novelty 7.0 · 2 refs

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.

Universal Reasoner: A Single, Composable Plug-and-Play Reasoner for Frozen LLMs

cs.AI · 2025-05-25 · unverdicted · novelty 7.0

UniR is a composable reasoning module trained with verifiable rewards and added to frozen LLMs via logit summation, enabling modular composition and weak-to-strong generalization across tasks and model sizes.

TRAM: Test-Time Risk Adaptation with Mixture of Agents

cs.LG · 2024-08-16 · unverdicted · novelty 7.0

TRAM is a test-time mixture method that scores and composes risk-neutral source policies using reward and occupancy-based risk to achieve new reward-risk tradeoffs without parameter updates.

Spectral Souping: A Unified Framework for Online Preference Alignment

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

cs.CL · 2026-05-11 · conditional · novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP across 20 countries, for models with 3.8B parameters or more, without any weight updates.

Generalization in LLM Problem Solving: The Case of the Shortest Path

cs.AI · 2026-04-16 · unverdicted · novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail to scale to longer path lengths because of recursive instability, with data coverage setting hard limits.

Reward Models Can Improve Themselves: Reward-Guided Adversarial Failure Mode Discovery for Robust Reward Modeling

cs.CL · 2025-07-08 · unverdicted · novelty 6.0

REFORM uses reward-guided controlled decoding to generate adversarial failures and augments training data to improve reward model robustness on preference datasets.

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

cs.AI · 2024-08-13 · unverdicted · novelty 6.0

Agent Q integrates MCTS-guided search, self-critique, and off-policy DPO to train LLM agents that outperform behavior cloning and reinforced fine-tuning baselines in WebShop and achieve up to 95.4% success in real-world booking scenarios.

Synergistic Benefits of Joint Molecule Generation and Property Prediction

cs.LG · 2025-04-23 · unverdicted · novelty 5.0

Hyformer jointly models molecule generation and property prediction via alternating attention and joint pre-training, showing synergistic gains in conditional sampling, OOD prediction, and a drug design case for antimicrobial peptides.

citing papers explorer

Showing 4 of 4 citing papers after filters.

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

cs.LG · 2026-05-22 · unverdicted · none · ref 51

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.

TRAM: Test-Time Risk Adaptation with Mixture of Agents

cs.LG · 2024-08-16 · unverdicted · none · ref 27

TRAM is a test-time mixture method that scores and composes risk-neutral source policies using reward and occupancy-based risk to achieve new reward-risk tradeoffs without parameter updates.

Spectral Souping: A Unified Framework for Online Preference Alignment

cs.LG · 2026-05-19 · unverdicted · none · ref 18

Spectral Souping learns offline specialized policies for fine-grained preferences and merges them online using a discovered universal spectral representation for efficient LLM alignment.

Synergistic Benefits of Joint Molecule Generation and Property Prediction

cs.LG · 2025-04-23 · unverdicted · none · ref 50

Hyformer jointly models molecule generation and property prediction via alternating attention and joint pre-training, showing synergistic gains in conditional sampling, OOD prediction, and a drug design case for antimicrobial peptides.
