LambdaPO introduces pairwise preference-based advantage estimation and a semantic density reward to extract more optimization signal from trajectory groups than GRPO's monolithic baseline.
Title resolution pending
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.CL 2verdicts
UNVERDICTED 2representative citing papers
LLM agents exhibit temporal blindness, achieving no better than 65% normalized alignment with human preferences on tool-use decisions across time-sensitive scenarios in the new TicToc dataset.
citing papers explorer
-
LambdaPO: A Lambda Style Policy Optimization for Reasoning Language Models
LambdaPO introduces pairwise preference-based advantage estimation and a semantic density reward to extract more optimization signal from trajectory groups than GRPO's monolithic baseline.
-
Your LLM Agents are Temporally Blind: The Misalignment Between Tool Use Decisions and Human Time Perception
LLM agents exhibit temporal blindness, achieving no better than 65% normalized alignment with human preferences on tool-use decisions across time-sensitive scenarios in the new TicToc dataset.