BIG-bench is a 204-task benchmark that measures scaling trends, calibration, and absolute limitations of language models across knowledge, reasoning, and social domains.
Automating string processing in spreadsheets using input-output examples. SIGPLAN Not., 46(1):317–330, January 2011.
3 Pith papers cite this work.
representative citing papers
Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.
Children and LLM agents show parallel adaptations to evidence reliability in a Bayesian program induction task but differ in information-seeking costs and compliance.