IFMTBench is a new benchmark for multilingual translation instruction following that tests models on single and multi-constraint scenarios using deterministic checkers and LLM judges.
Infobench: Evaluating instruction following ability in large language models.arXiv preprint arXiv:2401.03601
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
SpatialGrammar provides a grid-based DSL and compiler that lets LLMs generate collision-free 3D indoor scenes more reliably than raw-coordinate or code-based approaches.
SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 models in 6 scenarios.
Personalized deep research systems need evaluation with real users because LLM judges overlook nuanced errors that matter to researchers.
LLMs show instruction-following rates from 1% to 99% when instructions conflict with hardcoded pattern demonstrations, with output diversity as the main predictor of resistance.
Dynamic Boundary Evaluation locates each LLM's performance boundary at ~50% pass probability via a calibrated item bank and Skill-Guided Boundary Search algorithm to enable unified, adaptive evaluations across safety, capability, and truthfulness.
FusionRoute augments token-level expert routing with a trainable complementary logit generator to expand the policy class and recover optimal decoding under mild conditions, outperforming prior collaboration and merging methods on reasoning and generation benchmarks.
PRIME enables online process reward model updates in LLM RL using implicit rewards from rollouts and outcome labels, yielding 15.1% average gains on reasoning benchmarks and surpassing a stronger instruct model with 10% of the data.
SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.
An LLM-based framework recommends drill-down paths in visual analytics by approximating a greedy algorithm, interpreting user intent, and managing exploration branches to reduce cognitive load.
citing papers explorer
-
SingGuard: A Policy-Adaptive Multimodal LLM Guardrail with Dynamic Reasoning
SingGuard introduces a policy-adaptive multimodal LLM guardrail with dynamic reasoning regimes and SingGuard-Bench, reporting SOTA F1 scores across 35 datasets and improved policy-following accuracy under runtime shifts.