ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
Title resolution pending
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized methods and 23.4% over GPT-4o on T-IC13.
MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.
An LLM pipeline generates knowledge components for coding problems, enabling KCGen-KT to outperform existing KT methods and human-written KCs on student response prediction across two datasets.
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
citing papers explorer
-
ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
ONOTE is a multi-format benchmark that applies a deterministic pipeline to expose a disconnect between perceptual accuracy and music-theoretic comprehension in leading omnimodal AI models.
-
LASER: A Data-Centric Method for Low-Cost and Efficient SQL Rewriting based on SQL-GRPO
LASER generates complex slow-query training data with MCTS and aligns small models via SQL-GRPO to deliver efficient, low-cost SQL rewriting that outperforms rules and large models.
-
An Iterative Test-and-Repair Framework for Competitive Code Generation
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
-
DocShield: Towards AI Document Safety via Evidence-Grounded Agentic Reasoning
DocShield presents a new agentic reasoning framework using Cross-Cues-aware Chain of Thought to detect, localize, and explain text-centric forgeries in documents, with reported F1 gains of 41.4% over specialized methods and 23.4% over GPT-4o on T-IC13.
-
Do MLLMs Really Understand Space? A Mathematical Reasoning Evaluation
MLLMs show a large gap in spatial mathematical reasoning compared to humans, and a new 10,000-problem dataset helps narrow it through training.
-
Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems
An LLM pipeline generates knowledge components for coding problems, enabling KCGen-KT to outperform existing KT methods and human-written KCs on student response prediction across two datasets.
-
Agentless: Demystifying LLM-based Software Engineering Agents
Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.