Title resolution pending

Hellobench: Evaluating long text generation capabilities of large language models · 2024 · arXiv 2409.16191

4 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Tree-of-Writing achieves 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HoWToBench.

FlexStructRAG: Flexible Structure-Aware Multi-Granular Relational Retrieval for RAG

cs.IR · 2026-02-01 · unverdicted · novelty 6.0

FlexStructRAG jointly constructs knowledge graphs, hypergraphs, and semantic clusters with dynamic partitioning to enable query-adaptive multi-granular retrieval that improves semantic scores over standard RAG baselines on UltraDomain.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

cs.CL · 2025-06-13 · conditional · novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

cs.CL · 2024-12-19 · accept · novelty 6.0

The LongBench v2 benchmark shows that current LLMs underperform humans on deep long-context reasoning tasks, but that extended inference-time reasoning lets them surpass the human baseline.

citing papers explorer

Showing 4 of 4 citing papers.

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing
cs.CL · 2026-04-21 · unverdicted · none · ref 1
Tree-of-Writing achieves 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HoWToBench.

FlexStructRAG: Flexible Structure-Aware Multi-Granular Relational Retrieval for RAG
cs.IR · 2026-02-01 · unverdicted · none · ref 17
FlexStructRAG jointly constructs knowledge graphs, hypergraphs, and semantic clusters with dynamic partitioning to enable query-adaptive multi-granular retrieval that improves semantic scores over standard RAG baselines on UltraDomain.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
cs.CL · 2025-06-13 · conditional · none · ref 23
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks
cs.CL · 2024-12-19 · accept · none · ref 5
The LongBench v2 benchmark shows that current LLMs underperform humans on deep long-context reasoning tasks, but that extended inference-time reasoning lets them surpass the human baseline.
