Infobench: Evaluating instruction following ability in large language models. arXiv preprint arXiv:2401.03601

11 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1

years

2026: 10 · 2025: 1

representative citing papers

SAGE: A Service Agent Graph-guided Evaluation Benchmark

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

Token-Level LLM Collaboration via FusionRoute

cs.AI · 2026-01-08 · unverdicted · novelty 6.0

FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

ComplexConstraints and Beyond: Expert Rubrics for RLVR

cs.AI · 2026-06-08 · unverdicted · novelty 5.0

Expert-curated rubrics in the new ComplexConstraints dataset improve LLM instruction following by 12-15% when used as RL training signals, with gains transferring to out-of-distribution agentic benchmarks.
