Title resolution pending

Satanjeev Banerjee, Alon Lavie · 2005

10 Pith papers cite this work. Polarity classification is still indexing.

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

CommonWhy is a new dataset of 15,000 why-questions for evaluating LLMs on entity-based causal commonsense reasoning grounded in Wikidata.

CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

CommitSuite is a large benchmark for commit classification and message generation that includes AST-level changes and LLM annotations, together with a reference-free evaluation framework achieving 0.849 Cohen's Kappa with humans.

Evaluating Remote Sensing Image Captions Beyond Metric Biases

cs.CV · 2026-04-22 · unverdicted · novelty 7.0

Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.

Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

SWD-Bench evaluates repository-level documentation through functionality detection, localization, and completion QA tasks on 4170 entries drawn from PRs, showing that the best documentation raises the SWE-Agent issue-solving rate by 20%.

Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

cs.IR · 2026-04-04 · unverdicted · novelty 7.0

The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

cs.CV · 2026-04-29 · unverdicted · novelty 6.0

A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioning, step grounding, and cross-modal retrieval.

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

cs.HC · 2026-04-21 · unverdicted · novelty 6.0

VB-Score shows three major LLMs have severe failures in medical entity recognition and factual consistency, with 13.8% lower performance on chronic conditions affecting older and minority groups, indicating condition-based algorithmic discrimination.

Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery

cs.IR · 2026-05-11 · conditional · novelty 5.0

PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.

A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

cs.CV · 2026-04-08 · unverdicted · novelty 5.0

The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity leakage and improved cross-hospital performance.

citing papers explorer

Showing 10 of 10 citing papers.

CommonWhy: A Dataset for Evaluating Entity-Based Causal Commonsense Reasoning in Large Language Models

cs.CL · 2026-05-13 · unverdicted · none · ref 1

CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation

cs.SE · 2026-05-04 · unverdicted · none · ref 1

Evaluating Remote Sensing Image Captions Beyond Metric Biases

cs.CV · 2026-04-22 · unverdicted · none · ref 5

Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development

cs.SE · 2026-04-08 · unverdicted · none · ref 3

Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation

cs.IR · 2026-04-04 · unverdicted · none · ref 4

DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning

cs.CV · 2026-05-06 · unverdicted · none · ref 2

DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

cs.CV · 2026-04-29 · unverdicted · none · ref 12

Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications

cs.HC · 2026-04-21 · unverdicted · none · ref 4

Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery

cs.IR · 2026-05-11 · conditional · none · ref 3

A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing

cs.CV · 2026-04-08 · unverdicted · none · ref 3
