Infobench: Evaluating instruction following ability in large language models. arXiv preprint arXiv:2401.03601

Yiwei Qin, Kaiqiang Song, Yebowen Hu, Wenlin Yao, Sangwoo Cho, Xiaoyang Wang, Xuansheng Wu, Fei Liu, Pengfei Liu, Dong Yu · 2025 · arXiv 2401.03601

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background: 1 · baseline: 1

citation-polarity summary

background: 1 · baseline: 1

representative citing papers

SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.

SAGE: A Service Agent Graph-guided Evaluation Benchmark

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

cs.CL · 2026-03-17 · conditional · novelty 7.0

Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · conditional · novelty 6.0

Experiments reveal that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.

Token-Level LLM Collaboration via FusionRoute

cs.AI · 2026-01-08 · unverdicted · novelty 6.0

FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Intelligent Drill-Down: Large Language Model-Driven Drill-Down Technique for Human-AI Collaborative Visual Exploration

cs.HC · 2026-04-18 · unverdicted · novelty 5.0

An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07

citing papers explorer

Showing 8 of 8 citing papers.

SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation
cs.AI · 2026-04-30 · unverdicted · none · ref 21
SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.

SAGE: A Service Agent Graph-guided Evaluation Benchmark
cs.AI · 2026-04-10 · unverdicted · none · ref 39
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users
cs.CL · 2026-03-17 · conditional · none · ref 3
Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs
cs.CL · 2026-05-19 · conditional · none · ref 16
Experiments reveal that LLMs follow instructions at rates from 1% to 99% when opposed by hardcoded conflicting patterns, with robustness tied to output diversity and alignment with model priors rather than general capability.

Token-Level LLM Collaboration via FusionRoute
cs.AI · 2026-01-08 · unverdicted · none · ref 19
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.

Process Reinforcement through Implicit Rewards
cs.LG · 2025-02-03 · conditional · none · ref 126
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

Intelligent Drill-Down: Large Language Model-Driven Drill-Down Technique for Human-AI Collaborative Visual Exploration
cs.HC · 2026-04-18 · unverdicted · none · ref 41
An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models
cs.AI · 2026-05-07 · unreviewed · ref 16
