A survey on LLM-as-a-judge

The Innovation

9 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

DRIP-R is a new benchmark showing that frontier LLMs systematically disagree on how to resolve identical ambiguous retail policy scenarios, highlighting ambiguity as a core challenge for agent decision-making.

ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

The paper formulates LLM-as-judge evaluation as a two-stage missing-data problem and derives sample-size formulas via doubly robust estimators to achieve desired power while allocating more human reviews where LLM predictability is low.

WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation

cs.CL · 2026-05-08 · unverdicted · novelty 6.0

WeatherSyn is the first instruction-tuned MLLM for weather forecasting report generation, outperforming closed-source models on a new dataset of 31 US cities across 8 weather aspects.

Optimal Transport for LLM Reward Modeling from Noisy Preference

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy preference samples.

When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance

cs.CL · 2026-05-21 · unverdicted · novelty 5.0

LLMs show reproducible asymmetries in advice on faith transitions, favoring Catholic, Bahá'í, and Sikh religions while disfavoring Atheism, Agnosticism, and Jehovah's Witnesses across 20 models and 182 pairings.

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

cs.CL · 2026-05-19 · unverdicted · novelty 5.0

Mainstream UQ for LLMs reduces to unsupervised clustering of internal generation consistency and therefore cannot detect confident hallucinations or provide reliable safety signals.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

cs.CL · 2026-05-07 · unverdicted · novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes

cs.HC · 2026-03-31 · unverdicted · novelty 4.0

An LLM-powered triaging agent for banking fraud reports uses multi-turn conversations and synthetic customer simulations to achieve a 30.6% increase in classification accuracy over prior methods.

citing papers explorer

Showing 9 of 9 citing papers.

DRIP-R: A Benchmark for Decision-Making and Reasoning Under Real-World Policy Ambiguity in the Retail Domain · cs.CL · 2026-05-08 · unverdicted · none · ref 40
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost? · cs.CL · 2026-05-01 · unverdicted · none · ref 15
Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need? · cs.LG · 2026-05-08 · unverdicted · none · ref 12
WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation · cs.CL · 2026-05-08 · unverdicted · none · ref 98
Optimal Transport for LLM Reward Modeling from Noisy Preference · cs.LG · 2026-05-07 · unverdicted · none · ref 215
When AI Takes Sides on Questions of Faith: Persistent Asymmetries in AI-Mediated Faith Guidance · cs.CL · 2026-05-21 · unverdicted · none · ref 60
Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering · cs.CL · 2026-05-19 · unverdicted · none · ref 83
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction · cs.CL · 2026-05-07 · unverdicted · none · ref 64
Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes · cs.HC · 2026-03-31 · unverdicted · none · ref 7
