Tree-of-Writing achieves a 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HoWToBench.
5 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: other 1
citation-polarity summary: unclear 1
representative citing papers
FlexStructRAG jointly constructs knowledge graphs, hypergraphs, and semantic clusters with dynamic partitioning to enable query-adaptive multi-granular retrieval that improves semantic scores over standard RAG baselines on UltraDomain.
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
LongBench v2 benchmark shows current LLMs underperform humans on deep long-context reasoning tasks, but extended inference-time reasoning enables surpassing the human baseline.
MRRG elicits evaluation criteria from multiple complementary roles to build rubrics that outperform single-role baselines for validating LLM preferences and providing rewards in RLVR.
citing papers explorer
- HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
  Tree-of-Writing achieves a 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HoWToBench.
- FlexStructRAG: Flexible Structure-Aware Multi-Granular Relational Retrieval for RAG
  FlexStructRAG jointly constructs knowledge graphs, hypergraphs, and semantic clusters with dynamic partitioning to enable query-adaptive multi-granular retrieval that improves semantic scores over standard RAG baselines on UltraDomain.
- Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling
  MRRG elicits evaluation criteria from multiple complementary roles to build rubrics that outperform single-role baselines for validating LLM preferences and providing rewards in RLVR.