arXiv preprint arXiv:2212.06094
2 Pith papers cite this work. Polarity classification is still indexing.
Verdicts: UNVERDICTED 2
Representative citing papers
- The Constraint Tax: Measuring Validity-Correctness Tradeoffs in Structured Outputs for Small Language Models
  Enforcing hard schemas on sub-3B models raises schema validity to 100% but drops answer accuracy from 19.7% to 11.0% and executable accuracy from 91.5% to 48.0% on tool-call tasks.
- ART: Automatic multi-step reasoning and tool-use for large language models
  ART automatically generates multi-step reasoning programs with tool integration for LLMs, yielding substantial gains over few-shot and auto-CoT prompting on BigBench and MMLU while matching hand-crafted CoT on most tasks.