InFoBench: Evaluating Instruction Following Ability in Large Language Models. arXiv preprint arXiv:2401.03601

InFoBench: Evaluating Instruction Following Ability in Large Language Models · 2025 · arXiv 2401.03601

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · baseline 1

citation-polarity summary

background 1 · baseline 1

representative citing papers

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following

cs.CL · 2026-05-27 · unverdicted · novelty 7.0

IFMTBench is a new benchmark for multilingual translation instruction following that tests models on single and multi-constraint scenarios using deterministic checkers and LLM judges.

SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation

cs.AI · 2026-04-30 · unverdicted · novelty 7.0

SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.

SAGE: A Service Agent Graph-guided Evaluation Benchmark

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.

Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users

cs.CL · 2026-03-17 · conditional · novelty 7.0

Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.

Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs

cs.CL · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

cs.AI · 2026-05-07 · unverdicted · novelty 6.0

Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.

Token-Level LLM Collaboration via FusionRoute

cs.AI · 2026-01-08 · unverdicted · novelty 6.0

FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.

Process Reinforcement through Implicit Rewards

cs.LG · 2025-02-03 · conditional · novelty 6.0

PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.

SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning

cs.CV · 2026-06-22 · unverdicted · novelty 5.0

SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.

Intelligent Drill-Down: Large Language Model-Driven Drill-Down Technique for Human-AI Collaborative Visual Exploration

cs.HC · 2026-04-18 · unverdicted · novelty 5.0

An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.

citing papers explorer

Showing 10 of 10 citing papers.

IFMTBench: A Comprehensive Benchmark for Multilingual Translation Instruction Following · cs.CL · 2026-05-27 · unverdicted · none · ref 6
IFMTBench is a new benchmark for multilingual translation instruction following that tests models on single and multi-constraint scenarios using deterministic checkers and LLM judges.
SpatialGrammar: A Domain-Specific Language for LLM-Based 3D Indoor Scene Generation · cs.AI · 2026-04-30 · unverdicted · none · ref 21
SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.
SAGE: A Service Agent Graph-guided Evaluation Benchmark · cs.AI · 2026-04-10 · unverdicted · none · ref 39
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
Language Models Don't Know What You Want: Evaluating Personalization in Deep Research Needs Real Users · cs.CL · 2026-03-17 · conditional · none · ref 3
Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.
Do as I Say, Not as I Do: Instruction-Induction Conflict in LLMs · cs.CL · 2026-05-19 · unverdicted · none · ref 16 · 2 links
LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.
Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models · cs.AI · 2026-05-07 · unverdicted · none · ref 16
Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.
Token-Level LLM Collaboration via FusionRoute · cs.AI · 2026-01-08 · unverdicted · none · ref 19
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.
Process Reinforcement through Implicit Rewards · cs.LG · 2025-02-03 · conditional · none · ref 126
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning · cs.CV · 2026-06-22 · unverdicted · none · ref 201
SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.
Intelligent Drill-Down: Large Language Model-Driven Drill-Down Technique for Human-AI Collaborative Visual Exploration · cs.HC · 2026-04-18 · unverdicted · none · ref 41
An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.
