TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
Canonical reference
Dataflow: An llm-driven framework for unified data preparation and workflow automation in the era of data-centric ai
Canonical reference. 80% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
years
2026 8representative citing papers
PopPy combines an ahead-of-time compiler and runtime to extract parallelism from Python compound AI applications, delivering up to 6.4x end-to-end speedups while preserving sequential semantics.
K12-KGraph is a textbook-derived knowledge graph that powers a new benchmark revealing LLMs' poor curriculum cognition and a small training corpus that outperforms general instruction data on educational tasks.
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
AAFLOW is a unified distributed runtime that models agentic workflows as operators with a zero-copy data plane using Apache Arrow and Cylon, achieving up to 4.64x pipeline speedup through improved data flow and batching.
DataArc-SynData-Toolkit is an open-source, configuration-driven framework that unifies synthetic data generation for multimodal, multilingual, and multi-task LLM training with improved usability and quality control.
citing papers explorer
-
Programming with Data: Test-Driven Data Engineering for Self-Improving LLMs from Raw Corpora
Structured knowledge extracted from corpora enables test-driven data engineering for LLMs by mapping training data to source code, model training to compilation, benchmarking to unit testing, and failures to targeted data repairs, demonstrated across 16 disciplines.