PrepBench is a benchmark showing that state-of-the-art LLMs still struggle with natural-language-driven data preparation involving disambiguation, code generation, and workflow translation.
Konstantin Fedorov, Boris Zarubin, and Vladimir Ivanov
6 Pith papers cite this work.
citing papers explorer
- PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?
  PrepBench is a benchmark showing that state-of-the-art LLMs still struggle with natural-language-driven data preparation involving disambiguation, code generation, and workflow translation.
- Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning
  SGKR uses function-call dependency graphs to retrieve structured code knowledge, improving LLM correctness on multi-step data reasoning benchmarks over similarity baselines.
- KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving
  KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
- Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows
  Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.
- Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media
  Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.
- DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis