Rad-bench: Evaluating large language models capabilities in retrieval augmented dialogues
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: 2 unverdicted
citing papers explorer
- Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
  Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
- H-RAG at SemEval-2026 Task 8: Hierarchical Parent-Child Retrieval for Multi-Turn RAG Conversations
  H-RAG uses hierarchical parent-child document segmentation with hybrid retrieval and parent-level aggregation to achieve 0.4271 nDCG@5 on retrieval and 0.3241 harmonic mean on generation in a multi-turn RAG shared task.