DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
The Innovation
9 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (9)
Verdicts: UNVERDICTED (9)
citing papers explorer
- DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain
  DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.
- ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
  Personalized LLM-generated plain-language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
- Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
  The paper formulates LLM-as-judge evaluation as a two-stage missing-data problem and derives sample-size formulas via doubly robust estimators to achieve desired power while allocating more human reviews where LLM predictability is low.
- WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation
  WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.
- Optimal Transport for LLM Reward Modeling from Noisy Preference
  SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.
- When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance
  LLMs show reproducible asymmetries in advice on faith transitions, favoring Catholic, Bahá'í, and Sikh religions while disfavoring Atheism, Agnosticism, and Jehovah's Witnesses across 20 models and 182 pairings.
- Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering
  Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes
  An LLM-powered triaging agent for banking fraud reports uses multi-turn conversations and synthetic customer simulations to achieve a 30.6% increase in classification accuracy over prior methods.