Title resolution pending
7 Pith papers cite this work. Polarity classification is still indexing.
Citation summary
Years: 2026 (7)
Roles: method (1)
Polarities: use method (1)
Representative citing papers
UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
OmniTQA integrates LLM semantic reasoning as a first-class query operator with classical relational operators in a cost-aware planner for hybrid structured and semi-structured data.
A framework transforms aggressive social media text into neutral styles while preserving semantics, measured by a new Emotion Drift Index to reduce online harm.
citing papers explorer
- A-MAR: Agent-based Multimodal Art Retrieval for Fine-Grained Artwork Understanding
  A-MAR decomposes art queries into reasoning plans to condition retrieval, leading to improved explanation quality and multi-step reasoning on art benchmarks compared to baselines.
- UniEditBench: A Unified and Cost-Effective Benchmark for Image and Video Editing via Distilled MLLMs
  UniEditBench unifies image and video editing evaluation with a nine-plus-eight operation taxonomy and cost-effective 4B/8B distilled MLLM evaluators that align with human judgments.
- Evaluating LLM-Based 0-to-1 Software Generation in End-to-End CLI Tool Scenarios
  A new benchmark for 0-to-1 CLI tool generation shows state-of-the-art LLMs achieve under 43% success rate with black-box equivalence testing against real oracles.
- DiffCap-Bench: A Comprehensive, Challenging, Robust Benchmark for Image Difference Captioning
  DiffCap-Bench supplies a diverse IDC benchmark with ten categories and LLM judging grounded in human difference lists to evaluate MLLMs more robustly than prior lexical metrics.
- OmniTQA: A Cost-Aware System for Hybrid Query Processing over Semi-Structured Data
  OmniTQA integrates LLM semantic reasoning as a first-class query operator with classical relational operators in a cost-aware planner for hybrid structured and semi-structured data.
- From Notepad AI to Social Media: How Can Text Style Transformation Mitigate Social Harm?
  A framework transforms aggressive social media text into neutral styles while preserving semantics, measured by a new Emotion Drift Index to reduce online harm.
- Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning