pith. machine review for the scientific record.

arxiv: 2406.15877 · v4 · submitted 2024-06-22 · 💻 cs.SE · cs.AI · cs.CL

Recognition: 2 theorem links · Lean Theorem

BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3

classification: 💻 cs.SE · cs.AI · cs.CL
keywords: code generation · large language models · benchmarks · function calls · instruction following · software engineering · tool usage · practical coding tasks

The pith

Large language models reach at most 60 percent success on tasks requiring precise use of diverse function calls from many libraries, far below the 97 percent human level.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces BigCodeBench to test how well large language models generate code for practical tasks that demand multiple function calls drawn from 139 different libraries across seven domains. The benchmark contains 1,140 tasks, each checked by an average of 5.6 test cases that achieve 99 percent branch coverage. A parallel version called BigCodeBench-Instruct converts the original detailed docstrings into short natural-language instructions. Evaluation of 60 models shows they struggle to follow the complex instructions needed for accurate tool use. A sympathetic reader would care because everyday programming relies on composing calls from varied sources rather than solving self-contained problems.

Core claim

BigCodeBench challenges large language models to solve 1,140 fine-grained tasks by invoking multiple function calls as tools from 139 libraries across seven domains. Each task supplies an average of 5.6 test cases with 99 percent branch coverage. The authors also release BigCodeBench-Instruct, which automatically converts docstrings into short instructions containing only essential information. Testing 60 models shows that current systems are not yet capable of following complex instructions to use function calls precisely.
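
To ground the task format, here is a hypothetical example in the style the paper describes; it is invented for illustration, not an actual benchmark item. A docstring-specified function composes calls from pandas and matplotlib, and a small unittest suite of the kind the benchmark uses checks functional correctness.

    # Hypothetical BigCodeBench-style task (invented, not from the benchmark).
    import unittest

    import matplotlib
    matplotlib.use("Agg")  # headless backend so the tests run without a display
    import pandas as pd


    def task_func(data):
        """Aggregate (category, value) pairs into per-category sums and
        plot them as a bar chart.

        Returns:
            tuple: (pandas.Series of per-category sums, matplotlib Axes)
        """
        sums = pd.DataFrame(data, columns=["category", "value"]) \
            .groupby("category")["value"].sum()
        ax = sums.plot(kind="bar")
        return sums, ax


    class TestTaskFunc(unittest.TestCase):
        def test_sums_and_plot(self):
            sums, ax = task_func([("a", 1), ("b", 2), ("a", 3)])
            self.assertEqual(sums["a"], 4)
            self.assertEqual(sums["b"], 2)
            self.assertEqual(len(ax.patches), 2)  # one bar per category


    if __name__ == "__main__":
        unittest.main()

Even this toy task touches two libraries and several distinct calls; the benchmark's tasks compose calls drawn from 139 libraries.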

What carries the argument

BigCodeBench benchmark of 1,140 tasks that require invoking multiple function calls from 139 libraries in seven domains, together with its BigCodeBench-Instruct variant that uses simplified natural-language instructions.
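
For concreteness, an invented illustration (not drawn from the benchmark) of how the same hypothetical task could appear in the two variants: the Complete form prompts with the full structured docstring, while the Instruct form keeps only the essential instruction.

    # Illustrative prompts only; both texts are invented for this example.
    # BigCodeBench-Complete: the full structured docstring is the prompt.
    COMPLETE_PROMPT = '''\
    import pandas as pd

    def task_func(data):
        """Aggregate (category, value) pairs into per-category sums and
        plot them as a bar chart.

        Parameters:
            data (list of tuple): (category, value) pairs.

        Returns:
            tuple: (pandas.Series of per-category sums, matplotlib Axes)
        """
    '''

    # BigCodeBench-Instruct: the same task condensed to the essentials.
    INSTRUCT_PROMPT = (
        "Write a function task_func(data) that sums the values of "
        "(category, value) pairs per category with pandas, plots the sums "
        "as a bar chart, and returns the Series and the matplotlib Axes."
    )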

If this is right

  • Current models must improve compositional reasoning to combine multiple function calls accurately under complex instructions.
  • Many existing code-generation benchmarks underestimate the demands of practical tasks that draw from numerous libraries.
  • The large gap to human performance indicates a need for targeted advances in instruction following for code that uses external tools.
  • Future model development should prioritize precise and reliable use of diverse library functions in realistic settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the tasks mirror everyday coding demands, then closing the gap will likely require training methods that explicitly practice multi-library composition rather than relying on scale alone.
  • The benchmark could be expanded to include live execution environments to test whether models handle runtime errors and library version differences as well as static instruction following.
  • Providing few-shot examples of correct function-call sequences in prompts might narrow the performance difference enough to make the tasks solvable by current architectures (see the sketch after this list).
  • The results suggest that improvements in general reasoning may not transfer directly to code tool use without additional fine-tuning on library-specific patterns.
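
A minimal sketch of the few-shot idea from the third point above; the prompt text and worked example are invented here, and whether this narrows the gap is an open question rather than a result from the paper.

    # Editorial sketch: prepend one worked function-call sequence before the
    # target task. All prompt text is invented, not from the paper.
    FEW_SHOT_EXAMPLE = '''\
    # Task: read a CSV file and return the mean of column "x".
    import pandas as pd

    def task_func(path):
        return pd.read_csv(path)["x"].mean()
    '''

    def build_few_shot_prompt(target_instruction: str) -> str:
        """Compose a prompt: one solved example, then the new instruction."""
        return ("Solve the coding task. A solved example follows.\n\n"
                + FEW_SHOT_EXAMPLE
                + "\n# Task: " + target_instruction + "\n")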

Load-bearing premise

The 1,140 tasks and their test cases accurately represent the challenges of real-world practical coding that requires diverse function calls from many libraries.

What would settle it

A model that scores above 80 percent across the full set of 1,140 tasks on the provided test cases would show that large language models can follow such complex instructions for precise function use.
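
As a hedged sketch of the arithmetic behind that threshold, assume each candidate solution and its task's unit tests can be combined into one runnable file; the runner below is illustrative, not the authors' harness.

    # Hedged sketch of the pass-rate computation behind the 80 percent bar.
    import subprocess
    import sys
    import tempfile

    def run_task_tests(solution_and_tests: str) -> bool:
        """Return True if the combined solution-plus-unittest file passes."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(solution_and_tests)
            path = f.name
        result = subprocess.run([sys.executable, path], capture_output=True,
                                timeout=60)
        return result.returncode == 0

    def pass_rate(task_files: list[str]) -> float:
        """Fraction of tasks whose test suites pass."""
        return sum(run_task_tests(src) for src in task_files) / len(task_files)

    # The bar proposed above: pass_rate over all 1,140 tasks exceeds 0.80.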

Original abstract

Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces BigCodeBench, a benchmark of 1,140 fine-grained Python code-generation tasks that require invoking multiple function calls drawn from 139 libraries across 7 domains. Each task supplies an average of 5.6 test cases achieving 99 % branch coverage. A natural-language variant (BigCodeBench-Instruct) is also presented that condenses docstrings into short instructions. Evaluation of 60 LLMs yields a maximum score of 60 %, compared with 97 % human performance, leading the authors to conclude that current models cannot yet follow complex instructions to use function calls precisely.

Significance. If the 1,140 tasks genuinely demand compositional, multi-library function use under realistic instruction complexity, the reported performance gap supplies a concrete, falsifiable signal that existing LLMs remain limited in practical tool-use reasoning. The benchmark’s scale, library coverage, and high-coverage test suites would then constitute a useful addition to the code-generation evaluation suite.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (task construction): the central claim that tasks require 'compositional reasoning' and 'diverse function calls' is load-bearing for the headline 60 % vs. 97 % gap, yet the manuscript supplies no statistics on the minimum or average number of distinct function calls per task, the distribution of call counts, or explicit curation criteria that guarantee multi-library composition. Without these quantities it is impossible to verify that the observed gap reflects inability to handle complex real-world instructions rather than simpler single- or dual-call patterns.
  2. [Evaluation section] Evaluation section (results on 60 LLMs): the human baseline of 97 % is reported without stating the number of human participants, their selection criteria, time limits, or whether they had access to the same test harness and documentation as the models. This detail is required to interpret the magnitude of the LLM–human gap.
minor comments (1)
  1. [Abstract] Abstract, first sentence: the clause 'where the tasks ranging from …' is grammatically incomplete and should be rephrased for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and statistics where needed to strengthen the presentation of task complexity and evaluation methodology.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (task construction): the central claim that tasks require 'compositional reasoning' and 'diverse function calls' is load-bearing for the headline 60 % vs. 97 % gap, yet the manuscript supplies no statistics on the minimum or average number of distinct function calls per task, the distribution of call counts, or explicit curation criteria that guarantee multi-library composition. Without these quantities it is impossible to verify that the observed gap reflects inability to handle complex real-world instructions rather than simpler single- or dual-call patterns.

    Authors: We agree that quantitative details on function-call composition are needed to support the claims. In the revision we will add to §3 a table and accompanying text reporting the minimum, average, and distribution of distinct function calls per task, as well as the fraction of tasks that draw functions from multiple libraries. The curation criteria (selection of tasks requiring compositional use of APIs drawn from real-world scenarios) will also be stated explicitly with supporting counts from the construction pipeline. revision: yes

  2. Referee: [Evaluation section] Evaluation section (results on 60 LLMs): the human baseline of 97 % is reported without stating the number of human participants, their selection criteria, time limits, or whether they had access to the same test harness and documentation as the models. This detail is required to interpret the magnitude of the LLM–human gap.

    Authors: We acknowledge the omission. The revised evaluation section will include the number of human participants, their selection criteria (experienced Python developers), time limits applied, and explicit confirmation that they received identical documentation and test harness access as the models. These clarifications will be added without altering the reported 97 % figure. revision: yes
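
To make the first exchange concrete: distinct-call statistics of the kind the rebuttal promises can be derived from reference solutions with Python's ast module. A minimal editorial sketch, assuming the solutions are available as source strings:

    # Editorial sketch (not from the paper): count distinct called names in
    # each task's reference solution and summarize the distribution.
    import ast
    from statistics import mean

    def distinct_calls(source: str) -> int:
        """Count distinct called names (foo(...) and mod.attr(...))."""
        names = set()
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.Call):
                if isinstance(node.func, ast.Name):
                    names.add(node.func.id)
                elif isinstance(node.func, ast.Attribute):
                    names.add(node.func.attr)
        return len(names)

    def call_count_stats(solutions: list[str]) -> dict:
        """Minimum, mean, and maximum distinct calls across solutions."""
        counts = [distinct_calls(src) for src in solutions]
        return {"min": min(counts), "mean": mean(counts), "max": max(counts)}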

Circularity Check

0 steps flagged

No circularity in empirical benchmark construction

Full rationale

The paper introduces BigCodeBench as a new collection of 1,140 tasks drawn from 139 libraries and evaluates 60 LLMs plus humans directly on them. No derivation, equation, or prediction is claimed; performance numbers are obtained by executing the released tasks and test suites. The representativeness concern raised in the load-bearing premise above is an external-validity question, not a reduction of any result to its own inputs by construction. No self-citation chain, fitted parameter renamed as prediction, or ansatz is load-bearing for the headline claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the curated tasks and test cases represent practical coding challenges; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption: The selected tasks and test cases represent real-world practical coding challenges involving diverse function calls.
    Invoked when claiming the benchmark measures LLM capability on challenging and practical tasks.

pith-pipeline@v0.9.0 · 5705 in / 1245 out tokens · 30475 ms · 2026-05-14T19:24:25.315333+00:00 · methodology


Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 unverdicted novelty 8.0

    SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.

  2. SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning

    cs.AI 2026-05 accept novelty 8.0

    SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.

  3. SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

    cs.SE 2026-03 unverdicted novelty 8.0

    SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.

  4. CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing

    cs.SE 2026-05 unverdicted novelty 7.0

    CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.

  5. AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation

    cs.SE 2026-05 conditional novelty 7.0

    10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.

  6. When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.

  7. Skill Retrieval Augmentation for Agentic AI

    cs.CL 2026-04 unverdicted novelty 7.0

    Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.

  8. Incisor: Ex Ante Cloud Instance Selection for HPC Jobs

    cs.DC 2026-04 unverdicted novelty 7.0

    Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...

  9. Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation

    cs.SE 2026-04 conditional novelty 7.0

    Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.

  10. Neurosymbolic Repo-level Code Localization

    cs.SE 2026-04 unverdicted novelty 7.0

    LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.

  11. Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code

    cs.SE 2026-04 unverdicted novelty 7.0

    CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...

  12. DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode

    cs.SE 2026-04 unverdicted novelty 7.0

    DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.

  13. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    cs.SE 2025-02 unverdicted novelty 7.0

    SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

  14. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

    cs.LG 2025-02 unverdicted novelty 7.0

    A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.

  15. Exploiting LLM Agent Supply Chains via Payload-less Skills

    cs.CR 2026-05 conditional novelty 6.0

    Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...

  16. Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...

  17. Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis

    cs.SE 2026-04 conditional novelty 6.0

    SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...

  18. Learned or Memorized? Quantifying Memorization Advantage in Code LLMs

    cs.SE 2026-04 unverdicted novelty 6.0

    A perturbation method shows memorization advantage in code LLMs varies widely by model and task, remaining low on CVEFixes and Defects4J benchmarks.

  19. InCoder-32B-Thinking: Industrial Code World Model for Thinking

    cs.AR 2026-04 unverdicted novelty 6.0

    InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.

  20. LLaDA2.0: Scaling Up Diffusion Language Models to 100B

    cs.LG 2025-12 conditional novelty 6.0

    LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.

  21. Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.

  22. An End-to-End Framework for Building Large Language Models for Software Operations

    cs.LG 2026-04 unverdicted novelty 5.0

    OpsLLM outperforms general LLMs on software operations QA and RCA tasks through human-in-the-loop data curation, supervised fine-tuning, and domain-specific reinforcement learning.

  23. MiMo-V2-Flash Technical Report

    cs.CL 2026-01 unverdicted novelty 5.0

    MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...

  24. Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs

    cs.CL 2025-03 unverdicted novelty 5.0

    Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.

  25. How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks

    cs.SE 2026-04 unverdicted novelty 4.0

    Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.

  26. An End-to-End Framework for Building Large Language Models for Software Operations

    cs.LG 2026-04 unverdicted novelty 4.0

    OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.

  27. Qwen2.5-Coder Technical Report

    cs.CL 2024-09 unverdicted novelty 4.0

    Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.
