Title resolution pending
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: other 1
citation-polarity summary: unclear 1
citing papers explorer
- HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
  Tree-of-Writing achieves a Pearson correlation of 0.93 with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HoWToBench.
- FlexStructRAG: Flexible Structure-Aware Multi-Granular Relational Retrieval for RAG
  FlexStructRAG jointly constructs knowledge graphs, hypergraphs, and semantic clusters with dynamic partitioning to enable query-adaptive multi-granular retrieval that improves semantic scores over standard RAG baselines on UltraDomain.
- DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
  DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
- LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
  The LongBench v2 benchmark shows that current LLMs underperform humans on deep long-context reasoning tasks, but that extended inference-time reasoning lets them surpass the human baseline.