EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
Title resolution pending
20 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
CommitSuite is a large benchmark for commit classification and message generation that includes AST-level changes and LLM annotations, together with a reference-free evaluation framework achieving 0.849 Cohen's Kappa with humans.
FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
SWD-Bench evaluates repo-level docs through functionality detection, localization, and completion QA tasks on 4170 entries from PRs, showing best docs raise SWE-Agent issue-solving rate by 20%.
The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.
InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
ClarifySTL uses LLM agents to interactively detect and resolve vagueness and ambiguity in natural language requirements via clarification queries before generating STL formulas, with evaluations on existing and new benchmarks showing effectiveness.
VB-Score shows three major LLMs have severe failures in medical entity recognition and factual consistency, with 13.8% lower performance on chronic conditions affecting older and minority groups, indicating condition-based algorithmic discrimination.
MMP-Refer augments LLMs with multimodal retrieval paths and a trainable collaborative adapter to produce more accurate and explainable recommendations.
CHORUS multi-agent system reduced professional translation time by 33.8% while lowering cognitive effort and raising BLEU/COMET scores in a 30-participant within-subject study.
An LLM pipeline generates knowledge components for coding problems, enabling KCGen-KT to outperform existing KT methods and human-written KCs on student response prediction across two datasets.
DPPMG learns discrete modal-specific preferences via a dedicated GNN from multimodal user data, quantizes them into tokens, and feeds them into generators with a consistency reward to produce personalized text and images.
ToxiShield delivers a real-time GitHub extension with a BERT toxicity detector at 98% accuracy, a Claude-based coach, and a fine-tuned Llama reframer at 95% style transfer accuracy, validated by a 10-person TAM study.
The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity leakage and improved cross-hospital performance.
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.
LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.
citing papers explorer
-
EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations
EVE enables verifiable self-evolution of MLLMs by using a Challenger-Solver architecture to generate dynamic executable visual transformations that produce VQA problems with absolute execution-verified ground truth.
-
CommitSuite: A Comprehensive Benchmark for Commit Classification and Message Generation
CommitSuite is a large benchmark for commit classification and message generation that includes AST-level changes and LLM annotations, together with a reference-free evaluation framework achieving 0.849 Cohen's Kappa with humans.
-
Purifying Multimodal Retrieval: Fragment-Level Evidence Selection for RAG
FES-RAG reframes multimodal RAG as fragment-level selection using Fragment Information Gain to outperform document-level methods with up to 27% relative CIDEr gains on M2RAG while shortening context.
-
EmoTrans: A Benchmark for Understanding, Reasoning, and Predicting Emotion Transitions in Multimodal LLMs
EmoTrans is a new video benchmark with four progressive tasks that measures how well current multimodal LLMs handle dynamic emotion transitions rather than static recognition.
-
Evaluating Remote Sensing Image Captions Beyond Metric Biases
Unfine-tuned MLLMs outperform fine-tuned models on remote sensing image captioning when captions are scored by their ability to reconstruct the source image, and a training-free self-correction method achieves SOTA performance.
-
PinpointQA: A Dataset and Benchmark for Small Object-Centric Spatial Understanding in Indoor Videos
PinpointQA is the first benchmark dataset for small object-centric spatial understanding in indoor videos, with four progressive tasks built from ScanNet data.
-
Evaluating Repository-level Software Documentation via Question Answering and Feature-Driven Development
SWD-Bench evaluates repo-level docs through functionality detection, localization, and completion QA tasks on 4170 entries from PRs, showing best docs raise SWE-Agent issue-solving rate by 20%.
-
Rank, Don't Generate: Statement-level Ranking for Explainable Recommendation
The work reframes explainable recommendation as statement-level ranking, introduces the StaR benchmark from Amazon reviews, and finds popularity baselines outperforming SOTA models in item-level personalized ranking.
-
In Line with Context: Repository-Level Code Generation via Context Inlining
InlineCoder reframes repository-level code generation as function-level coding by using a draft anchor to inline the target function into its call graph for upstream usage and downstream dependency context.
-
DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
-
ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification
ClarifySTL uses LLM agents to interactively detect and resolve vagueness and ambiguity in natural language requirements via clarification queries before generating STL formulas, with evaluations on existing and new benchmarks showing effectiveness.
-
Beyond Semantic Similarity: A Component-Wise Evaluation Framework for Medical Question Answering Systems with Health Equity Implications
VB-Score shows three major LLMs have severe failures in medical entity recognition and factual consistency, with 13.8% lower performance on chronic conditions affecting older and minority groups, indicating condition-based algorithmic discrimination.
-
MMP-Refer: Multimodal Path Retrieval-augmented LLMs For Explainable Recommendation
MMP-Refer augments LLMs with multimodal retrieval paths and a trainable collaborative adapter to produce more accurate and explainable recommendations.
-
CHORUS: Effort-Aware Multi-Agent Human-AI Collaboration for Professional Translation
CHORUS multi-agent system reduced professional translation time by 33.8% while lowering cognitive effort and raising BLEU/COMET scores in a 30-participant within-subject study.
-
Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems
An LLM pipeline generates knowledge components for coding problems, enabling KCGen-KT to outperform existing KT methods and human-written KCs on student response prediction across two datasets.
-
Discrete Preference Learning for Personalized Multimodal Generation
DPPMG learns discrete modal-specific preferences via a dedicated GNN from multimodal user data, quantizes them into tokens, and feeds them into generators with a consistency reward to produce personalized text and images.
-
ToxiShield: Promoting Inclusive Developer Communication through Real-Time Toxicity Filtering
ToxiShield delivers a real-time GitHub extension with a BERT toxicity detector at 98% accuracy, a Claude-based coach, and a fine-tuned Llama reframer at 95% style transfer accuracy, validated by a 10-person TAM study.
-
A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing
The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity leakage and improved cross-hospital performance.
-
Context-Guided Decompilation: A Step Towards Re-executability
ICL4Decomp applies in-context learning to guide LLMs in generating re-executable decompiled code from binaries, reporting roughly 40% higher re-executability than prior methods across datasets and optimization levels.
-
Evaluation of LLM-Based Software Engineering Tools: Practices, Challenges, and Future Directions
LLM-based SE tools lack stable ground truth and deterministic outputs, making standard evaluation assumptions invalid and requiring new approaches for reliable assessment.