Pith Number

pith:WIBKBDV4

pith:2024:WIBKBDV4KRKOFYE3MOVUGKOGTE
not attested · not anchored · not stored · refs resolved

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Alex Gu, Armel Randy Zebaze, Binyuan Hui, Chen Gong, Daniel Fried, David Lo, Han Hu, Haolan Zhan, Harm de Vries, Imam Nur Bani Yusuf, Indraneil Paul, Jean Kaddour, Jenny Chim, Jiawei Liu, Junda He, Leandro Von Werra, Ming Xu, Minh Chien Vu, Naman Jain, Niklas Muennighoff, Prateek Yadav, Qian Liu, Ratnadira Widyasari, Simon Brunner, Terry Yue Zhuo, Thong Hoang, Wen-Ding Li, Wenhao Yu, Xiaoheng Hong, Xiaoning Du, Zhihan Zhang, Zhoujun Cheng, Zijian Wang

Large language models reach only up to 60 percent success on tasks requiring precise use of diverse function calls from many libraries, far below the 97 percent human level.

arxiv:2406.15877 v4 · 2024-06-22 · cs.SE · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{WIBKBDV4KRKOFYE3MOVUGKOGTE}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
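
The property described above (any mirror recomputes the same state from the same signed events) can be illustrated with a minimal sketch. This is not Pith's actual merge algorithm, which the page does not specify. The `Event` type, its fields, and the last-writer-wins rule are all hypothetical; the only point is that replaying events in a canonical order makes the result independent of the order in which a mirror received them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # All fields are hypothetical; the real event schema is not shown on this page.
    event_id: str   # unique identifier of the event
    timestamp: str  # ISO 8601 UTC timestamp, so string order equals time order
    key: str        # which part of the record state the event touches
    value: str      # new value for that part


def merge(events):
    """Fold events into a state dict using a canonical replay order."""
    state = {}
    # Drop exact duplicates, then sort on every field so ties cannot
    # depend on arrival order.
    ordered = sorted(
        set(events),
        key=lambda e: (e.timestamp, e.event_id, e.key, e.value),
    )
    for event in ordered:
        state[event.key] = event.value  # assumed rule: last writer wins
    return state


a = Event("e1", "2026-05-17T23:39:22Z", "author_claim", "open")
b = Event("e2", "2026-05-18T08:00:00Z", "author_claim", "claimed")
c = Event("e3", "2026-05-18T09:00:00Z", "citations", "open")

# Two mirrors that received the events in different orders (one with a duplicate).
print(merge([a, b, c]) == merge([c, a, b, a]))
```

Sorting on the full tuple rather than the timestamp alone is the design choice that matters here: two events with the same timestamp would otherwise be applied in arrival order, and mirrors could diverge.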

Claims

C1 strongest claim

Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%.

C2 weakest assumption

The 1,140 tasks and their test cases accurately represent the challenges of real-world practical coding that requires diverse function calls from many libraries.

C3 one line summary

BigCodeBench shows LLMs achieve at most 60% on 1,140 tasks needing diverse function calls and complex instructions, compared to 97% human performance.

References

19 extracted · 19 resolved · 0 Pith anchors


Formal links

2 machine-checked theorem links

Cited by

50 papers in Pith

Receipt and verification
First computed 2026-05-17T23:39:22.050812Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

b202a08ebc5454e2e09b63ab4329c6991b6ad92dac721646508438456ba6a097

Aliases

arxiv: 2406.15877 · arxiv_version: 2406.15877v4 · doi: 10.48550/arxiv.2406.15877 · pith_short_12: WIBKBDV4KRKO · pith_short_16: WIBKBDV4KRKOFYE3 · pith_short_8: WIBKBDV4
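
The short aliases above are prefixes of the full Pith Number. For this record, the full number also coincides with the unpadded Base32 encoding of the first 16 bytes of the canonical hash. The sketch below checks both relationships. That this is the general derivation rule is an assumption drawn from this one record, not something the page states; the function names are illustrative.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "b202a08ebc5454e2e09b63ab4329c6991b6ad92dac721646508438456ba6a097"
PITH_NUMBER = "WIBKBDV4KRKOFYE3MOVUGKOGTE"


def base32_prefix(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest, without '=' padding."""
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_number: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the full number."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)
print(short_aliases(PITH_NUMBER))
```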
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/WIBKBDV4KRKOFYE3MOVUGKOGTE \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: b202a08ebc5454e2e09b63ab4329c6991b6ad92dac721646508438456ba6a097
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "e27ee9ec1d97a8e6fe8dc014c14fe2b3f77ac5e8a07139bc07cc8139bcea59fa",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.SE",
    "submitted_at": "2024-06-22T15:52:04Z",
    "title_canon_sha256": "224b55f50f07d308debafc307feffe7fc059e5057e530715828242735aa4cb43"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2406.15877",
    "kind": "arxiv",
    "version": 4
  }
}
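
The hashing step of the verification command can also be run offline against the record shown above. The sketch below applies the same serialization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8) and prints the SHA-256 digest for comparison with the canonical hash. It assumes the JSON displayed on this page is identical to what the endpoint returns under `canonical_record`; the function name `canonical_sha256` is illustrative.

```python
import hashlib
import json

EXPECTED = "b202a08ebc5454e2e09b63ab4329c6991b6ad92dac721646508438456ba6a097"

# Canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "e27ee9ec1d97a8e6fe8dc014c14fe2b3f77ac5e8a07139bc07cc8139bcea59fa",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.SE",
        "submitted_at": "2024-06-22T15:52:04Z",
        "title_canon_sha256": "224b55f50f07d308debafc307feffe7fc059e5057e530715828242735aa4cb43",
    },
    "schema_version": "1.0",
    "source": {"id": "2406.15877", "kind": "arxiv", "version": 4},
}


def canonical_sha256(obj) -> str:
    """Hash the compact, key-sorted JSON form used by the verification command."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches canonical hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the JSON as displayed, only on its content.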