glucagon

LLM evaluators recognize and favor their own generations · 2024 · arXiv 2409.16191

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

other 1

citation-polarity summary

unclear 1

representative citing papers

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

Tree-of-Writing achieves a 0.93 Pearson correlation with human judgments by using a tree-structured workflow to aggregate sub-feature scores, outperforming standard LLM-as-a-judge and overlap metrics on the new HoWToBench.

FlexStructRAG: Flexible Structure-Aware Multi-Granular Relational Retrieval for RAG

cs.IR · 2026-02-01 · unverdicted · novelty 6.0

FlexStructRAG jointly constructs knowledge graphs, hypergraphs, and semantic clusters with dynamic partitioning to enable query-adaptive multi-granular retrieval that improves semantic scores over standard RAG baselines on UltraDomain.

DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents

cs.CL · 2025-06-13 · conditional · novelty 6.0

DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.

LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks

cs.CL · 2024-12-19 · accept · novelty 6.0

The LongBench v2 benchmark shows that current LLMs underperform humans on deep long-context reasoning tasks, but that extended inference-time reasoning lets models surpass the human baseline.

Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling

cs.LG · 2026-07-02 · unverdicted · novelty 5.0

MRRG elicits evaluation criteria from multiple complementary roles to build rubrics that outperform single-role baselines for validating LLM preferences and providing rewards in RLVR.

citing papers explorer

HoWToBench: Holistic Evaluation for LLM's Capability in Human-level Writing using Tree of Writing · cs.CL · 2026-04-21 · unverdicted · none · ref 1
FlexStructRAG: Flexible Structure-Aware Multi-Granular Relational Retrieval for RAG · cs.IR · 2026-02-01 · unverdicted · none · ref 17
Many Voices, One Reward: Multi-Role Rubric Generation for LLM Judging and Reward Modeling · cs.LG · 2026-07-02 · unverdicted · none · ref 4
