DABstep: Data Agent Benchmark for Multi-step Reasoning

Alex Egg, Martin Iglesias Goyanes, Friso Kingma, Andreu Mora, Leandro von Werra, Thomas Wolf · 2025 · arXiv 2506.23719

6 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · method 1

citation-polarity summary

background 1 · use method 1

representative citing papers

PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

cs.DB · 2026-05-09 · unverdicted · novelty 7.0

PrepBench is a benchmark showing that state-of-the-art LLMs still struggle with natural-language-driven data preparation involving disambiguation, code generation, and workflow translation.

Structure-Grounded Knowledge Retrieval via Code Dependencies for Multi-Step Data Reasoning

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

SGKR uses function-call dependency graphs to retrieve structured code knowledge, improving LLM correctness on multi-step data reasoning benchmarks over similarity baselines.

KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

cs.DC · 2026-04-17 · unverdicted · novelty 6.0

KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows

cs.AI · 2025-12-15 · unverdicted · novelty 6.0

Finch is a new benchmark with 172 composite workflows and 384 tasks from real enterprise data that shows top AI models like GPT-5.1 Pro pass only 38.4% of workflows under human evaluation.

Text Analytics Evaluation Framework: A Case Study on LLMs and Social Media

cs.CL · 2026-05-20 · unverdicted · novelty 5.0

Presents a new question-based evaluation framework for LLMs on aggregated social media text and reports that performance declines with input scale, task complexity, and numerical operations beyond 500 instances.

DataClawBench: An Agent Benchmark for Exploratory Real-World Financial Data Analysis

cs.AI · 2026-05-04 · 2 refs
