Discounted Cumulated Gain Based Evaluation of Multiple-Query IR Sessions
12 Pith papers cite this work. Polarity classification is still indexing.
citing papers explorer
- Evaluating LLMs on Large-Scale Graph Property Estimation via Random Walks
  EstGraph benchmark evaluates LLMs on estimating properties of very large graphs from random-walk samples that fit in context limits.
- TubiFM: Unified Item, Carousel, and Search Ranking for Streaming Discovery
  A Llama-based model trained on serialized user stories unifies item, carousel, and search ranking and outperforms specialist baselines offline while improving some online metrics and reducing latency.
- Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix
  A q-log odds variant of BM25 raises NDCG@10 by 89% relative on CodeSearchNet Go under fixed generic tokenization while recovering standard BM25 at q=1.
- MIRA: An LLM-Assisted Benchmark for Multi-Category Integrated Retrieval
  MIRA is a new benchmark for multi-category integrated retrieval built from real queries on a social science platform, with LLM assistance for topic descriptions and relevance labeling across four item categories.
- JUÁ -- A Benchmark for Information Retrieval in Brazilian Legal Text Collections
  JUÁ is a new heterogeneous benchmark for Brazilian legal IR that distinguishes retrieval methods and shows domain-adapted models excel on aligned subsets while BM25 stays competitive elsewhere.
- Predicting New Concept-Object Associations in Astronomy by Mining the Literature
  Matrix factorization on a literature-mined concept-object graph predicts future associations in astronomy better than neighborhood similarity or recency heuristics.
- ECPO: Evidence-Coupled Policy Optimization for Evidence-Certified Candidate Ranking
  ECPO is a listwise policy optimization method that couples ranking utility with span-level evidence certificate validity and a deterministic verifier reward on MAVEN-ERE and RAMS datasets.
- Unsupervised Domain Shift Detection with Interpretable Subspace Attribution
  An unsupervised method detects domain shifts via localized density anomaly search in feature space, attributes the shift to a minimal subspace, and extracts balanced subsets from two unlabeled datasets.
- Governing AI-Assisted Security Operations: A Design Science Framework for Operational Decision Support
  The paper develops a design science framework for governing AI-assisted operational decision support in security operations centers by specifying a query-broker artifact that separates AI planning from execution through approved templates, policy validation, and engineering review gates.
- Beyond Relevance: On the Relationship Between Retrieval and RAG Information Coverage
  Coverage-focused retrieval metrics correlate strongly with nugget coverage in RAG responses across text and multimodal benchmarks, supporting their use as performance proxies when retrieval and generation goals align.
- User Simulation for Evaluating Information Access Systems
  A systematic review of user simulation frameworks, models, and applications for evaluating information access systems.
- IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research