Weighted kappa: Nominal scale agreement with provision for scaled disagreement or partial credit
6 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (6)
Verdicts: unverdicted (6)
Roles: method (1)
Polarities: use method (1)
Citing papers

- ProactBench: Beyond What The User Asked For
  ProactBench measures LLM conversational proactivity in three phases using 198 multi-agent dialogues and finds recovery behavior hard to predict from existing benchmarks.
- MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval
  MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
- From Kellgren-Lawrence to Calcium Pyrophosphate Crystal Deposition: A Soft-Labelling Framework for Knee Osteoarthritis Assessment
  Soft-labelling ordinal deep learning with binomial, beta, triangular, and exponential distributions improves KL and CPPD grading over one-hot baselines on knee X-rays.
- Reward-Free Code Alignment from Pretrained or Fine-Tuned LLM: Unpacking the Trade-offs for Code Generation
  Empirical study on five LLMs finds pretrained-to-aligned paths yield bigger gains over baseline than finetuned-to-aligned paths, though absolute accuracy remains lower for pretrained starts.
- Prompt Quality and Pull Request Outcomes: A Stage-Based Empirical Study of LLM-Assisted Development
  Specificity and Context predict actionable code generation while Verification predicts adoption and Context predicts integration depth in LLM-assisted PR workflows.
- ArguAgent: AI-Supported Real-Time Grouping for Productive Argumentation in STEM Classrooms
  ArguAgent scores arguments via AI, clusters stances, and forms groups with stance variety but argumentation quality within one level, validated at expert alpha 0.817 and 95.4% success in simulations.