hub Mixed citations

ROUGE : A Package for Automatic Evaluation of Summaries

Lin, Chin-Yew · 2004

Mixed citation behavior. Most common role is background (67%).

29 Pith papers citing it

Background 67% of classified citations

browse 29 citing papers

hub tools

JSON dossier citing papers JSON

citation-role summary

background 4 method 1 other 1

citation-polarity summary

background 4 unclear 1 use method 1

representative citing papers

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays

cs.CV · 2026-05-15 · conditional · novelty 8.0

MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.

Evaluating Very Long-Term Conversational Memory of LLM Agents

cs.CL · 2024-02-27 · unverdicted · novelty 8.0

Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.

Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.

Evaluating Non-English Developer Support in Machine Learning for Software Engineering

cs.SE · 2026-05-07 · unverdicted · novelty 7.0

Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.

How English Print Media Frames Human-Elephant Conflicts in India

cs.AI · 2026-04-23 · unverdicted · novelty 7.0

English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.

Self-Rewarding Language Models

cs.CL · 2024-01-18 · conditional · novelty 7.0

Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.

Prefix-Tuning: Optimizing Continuous Prompts for Generation

cs.CL · 2021-01-01 · conditional · novelty 7.0

Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.

CodeBLEU: a Method for Automatic Evaluation of Code Synthesis

cs.SE · 2020-09-22 · conditional · novelty 7.0

CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.

Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

Decision theory shows that LLM cascades are structurally limited by always incurring the cheap model's cost before deciding to escalate, with the best performance given by the envelope of pairwise cascades rather than fixed chains or many stages.

LLM Output Detectability and Task Performance Can be Jointly Optimized

cs.CL · 2026-05-02 · unverdicted · novelty 6.0

PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.

SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.

Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens

cs.CL · 2026-04-20 · unverdicted · novelty 6.0

ETW uses predictive entropy as a proxy for token informativeness to improve selective unlearning in LLMs, achieving better forgetting with less utility loss than prior token-level methods.

ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation

cs.IR · 2026-04-14 · unverdicted · novelty 6.0

ClusterRAG applies density-based clustering to user profiles for collaborative retrieval in personalized RAG and reports best performance on LaMP tasks by combining target and similar-user profiles.

VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation

cs.CV · 2024-12-30 · unverdicted · novelty 6.0

VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.

Large Language Models are not Fair Evaluators

cs.CL · 2023-05-29 · conditional · novelty 6.0

LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.

Enhancing Chat Language Models by Scaling High-quality Instructional Conversations

cs.CL · 2023-05-23 · conditional · novelty 6.0

UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model

cs.CL · 2022-11-09 · unverdicted · novelty 6.0

BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.

Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs

cs.LG · 2026-05-21 · unverdicted · novelty 5.0

RGoT uses RL to adaptively generate task-specific graphs of operations for GoT-style LLM prompting from a human-provided set, with results suggesting feasibility under constraints.

COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection

cs.CV · 2026-05-14 · unverdicted · novelty 5.0

COPRA introduces conditional parameter adaptation via RL to dynamically tune frozen VLMs for video anomaly detection, outperforming static methods in in-domain and cross-domain settings while generalizing to other video tasks.

Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering

cs.CL · 2026-05-12 · unverdicted · novelty 5.0

IAP uses RL to train LLMs to explicitly infer and apply implicit user intent in single-turn personalized QA, achieving ~7.5% average macro-score gains over baselines on LaMP-QA.

UserGPT Technical Report

cs.IR · 2026-05-09 · unverdicted · novelty 5.0

UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while compressing records by up to 97.9%.

PRISM: Perception Reasoning Interleaved for Sequential Decision Making

cs.AI · 2026-05-06 · unverdicted · novelty 5.0

PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.

Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets

cs.CL · 2026-05-06 · unverdicted · novelty 5.0

An evidence-based model generates queries from query-free datasets, yielding summaries with competitive ROUGE scores to those using original queries.

Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization

cs.CL · 2026-04-19 · unverdicted · novelty 5.0

A reasoning-distillation plus dual-reward GRPO method for multi-role dialogue summarization matches ROUGE and BERTScore baselines while improving factual faithfulness and preference alignment on CSDS and SAMSum.

citing papers explorer

Showing 29 of 29 citing papers.

MI-CXR: A Benchmark for Longitudinal Reasoning over Multi-Interval Chest X-rays cs.CV · 2026-05-15 · conditional · none · ref 66
MI-CXR is a new benchmark that shows state-of-the-art vision-language models achieve only 29.3% accuracy on longitudinal reasoning tasks across multi-visit chest X-ray sequences.
Evaluating Very Long-Term Conversational Memory of LLM Agents cs.CL · 2024-02-27 · unverdicted · none · ref 69
Creates LoCoMo benchmark dataset for very long-term LLM conversational memory and shows current models struggle with lengthy dialogues and long-range temporal dynamics.
Knowledge Beyond Language: Bridging the Gap in Multilingual Machine Unlearning Evaluation cs.CL · 2026-05-14 · unverdicted · none · ref 29
New metrics KSS and KPS are introduced to evaluate multilingual machine unlearning quality and cross-language consistency in LLMs, addressing limitations of single-language evaluation protocols.
Evaluating Non-English Developer Support in Machine Learning for Software Engineering cs.SE · 2026-05-07 · unverdicted · none · ref 16
Code LLMs generate substantially worse comments outside English, and no tested automatic metric or LLM judge reliably matches human assessment of those outputs.
How English Print Media Frames Human-Elephant Conflicts in India cs.AI · 2026-04-23 · unverdicted · none · ref 18
English print media coverage of human-elephant conflicts in India is dominated by fear-inducing and aggression-related language.
Self-Rewarding Language Models cs.CL · 2024-01-18 · conditional · none · ref 10
Iterative self-rewarding via LLM-as-Judge in DPO training on Llama 2 70B improves instruction following and self-evaluation, outperforming GPT-4 on AlpacaEval 2.0.
Prefix-Tuning: Optimizing Continuous Prompts for Generation cs.CL · 2021-01-01 · conditional · none · ref 79
Prefix-tuning matches or exceeds fine-tuning on NLG tasks by optimizing a continuous prefix using 0.1% of parameters while keeping the LM frozen.
CodeBLEU: a Method for Automatic Evaluation of Code Synthesis cs.SE · 2020-09-22 · conditional · none · ref 57
CodeBLEU improves correlation with human programmer scores on code synthesis tasks by adding syntactic AST matching and semantic data-flow matching to the standard BLEU n-gram approach.
Is Escalation Worth It? A Decision-Theoretic Characterization of LLM Cascades cs.LG · 2026-05-07 · unverdicted · none · ref 75
Decision theory shows that LLM cascades are structurally limited by always incurring the cheap model's cost before deciding to escalate, with the best performance given by the envelope of pairwise cascades rather than fixed chains or many stages.
LLM Output Detectability and Task Performance Can be Jointly Optimized cs.CL · 2026-05-02 · unverdicted · none · ref 60
PUPPET jointly optimizes LLM outputs for high detectability and task performance via RL rewards from a detector and a task evaluator, outperforming watermarking on tasks while matching detectability.
SCURank: Ranking Multiple Candidate Summaries with Summary Content Units for Enhanced Summarization cs.CL · 2026-04-21 · unverdicted · none · ref 16
SCURank ranks multiple summary candidates with Summary Content Units to outperform ROUGE and LLM-based methods in summarization distillation.
Forget What Matters, Keep the Rest: Selective Unlearning of Informative Tokens cs.CL · 2026-04-20 · unverdicted · none · ref 24
ETW uses predictive entropy as a proxy for token informativeness to improve selective unlearning in LLMs, achieving better forgetting with less utility loss than prior token-level methods.
ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation cs.IR · 2026-04-14 · unverdicted · none · ref 57
ClusterRAG applies density-based clustering to user profiles for collaborative retrieval in personalized RAG and reports best performance on LaMP tasks by combining target and similar-user profiles.
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation cs.CV · 2024-12-30 · unverdicted · none · ref 51
VisionReward learns multi-dimensional human preferences for image and video generation via hierarchical assessment and linear weighting, outperforming VideoScore by 17.2% in prediction accuracy and yielding 31.6% higher win rates in text-to-video models.
Large Language Models are not Fair Evaluators cs.CL · 2023-05-29 · conditional · none · ref 50
LLMs show strong position bias when scoring model outputs, allowing easy manipulation of rankings, but calibration with multiple evidence, position balancing, and selective human input reduces this bias to better match human judgments.
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations cs.CL · 2023-05-23 · conditional · none · ref 43
UltraChat supplies 1.5 million high-quality multi-turn dialogues that, when used to fine-tune LLaMA, produce UltraLLaMA, which outperforms prior open-source chat models including Vicuna.
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model cs.CL · 2022-11-09 · unverdicted · none · ref 90
BLOOM is a 176B-parameter open-access multilingual language model trained on the ROOTS corpus that achieves competitive performance on benchmarks, with improved results after multitask prompted finetuning.
Reinforced Graph of Thoughts: RL-Driven Adaptive Prompting for LLMs cs.LG · 2026-05-21 · unverdicted · none · ref 21
RGoT uses RL to adaptively generate task-specific graphs of operations for GoT-style LLM prompting from a human-provided set, with results suggesting feasibility under constraints.
COPRA: Conditional Parameter Adaptation with Reinforcement Learning for Video Anomaly Detection cs.CV · 2026-05-14 · unverdicted · none · ref 52
COPRA introduces conditional parameter adaptation via RL to dynamically tune frozen VLMs for video anomaly detection, outperforming static methods in in-domain and cross-domain settings while generalizing to other video tasks.
Training LLMs with Reinforcement Learning for Intent-Aware Personalized Question Answering cs.CL · 2026-05-12 · unverdicted · none · ref 71
IAP uses RL to train LLMs to explicitly infer and apply implicit user intent in single-turn personalized QA, achieving ~7.5% average macro-score gains over baselines on LaMP-QA.
UserGPT Technical Report cs.IR · 2026-05-09 · unverdicted · none · ref 18
UserGPT introduces a generative LLM framework with a behavior simulation engine, semantization module, and DF-GRPO post-training that scores 0.7325 on tag prediction and 0.7528 on summary generation on HPR-Bench while compressing records by up to 97.9%.
PRISM: Perception Reasoning Interleaved for Sequential Decision Making cs.AI · 2026-05-06 · unverdicted · none · ref 6
PRISM interleaves VLM perception and LLM reasoning via a dynamic goal-oriented question-answer pipeline to produce sharper scene descriptions, outperforming prior image-based models on ALFWorld and Room-to-Room.
Generating Query-Focused Summarization Datasets from Query-Free Summarization Datasets cs.CL · 2026-05-06 · unverdicted · none · ref 8
An evidence-based model generates queries from query-free datasets, yielding summaries with competitive ROUGE scores to those using original queries.
Beyond Overlap Metrics: Rewarding Reasoning and Preferences for Faithful Multi-Role Dialogue Summarization cs.CL · 2026-04-19 · unverdicted · none · ref 7
A reasoning-distillation plus dual-reward GRPO method for multi-role dialogue summarization matches ROUGE and BERTScore baselines while improving factual faithfulness and preference alignment on CSDS and SAMSum.
Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications cs.CL · 2026-05-10 · unverdicted · none · ref 30
RAG is more effective and cost-efficient than fine-tuning for industrial QA adaptation on automotive datasets.
Benchmarked Yet Not Measured -- Generative AI Should be Evaluated Against Real-World Utility cs.LG · 2026-05-07 · unverdicted · none · ref 11 · 2 links
Generative AI evaluation must shift from static benchmark scores to measuring sustained improvements in human capabilities within specific deployment contexts.
AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems cs.SE · 2026-05-22 · unverdicted · none · ref 3
Proposes an AI Failure Taxonomy, a five-layer AI Assurance Pyramid, and operational guidance for RAG testing and model lifecycle management in enterprise settings.
A Survey on Knowledge Distillation of Large Language Models cs.CL · 2024-02-20 · accept · none · ref 193
A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.
Divide-Prompt-Refine: a Training-Free, Structure-Aware Framework for Biomedical Abstract Generation cs.CL · 2026-05-20 · unreviewed · ref 47

ROUGE : A Package for Automatic Evaluation of Summaries

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer