BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Pith reviewed 2026-05-14 19:24 UTC · model grok-4.3
The pith
Large language models reach at most 60 percent success on tasks requiring precise use of diverse function calls from many libraries, far below the 97 percent achieved by humans.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BigCodeBench challenges large language models to solve 1,140 fine-grained tasks by invoking multiple function calls as tools from 139 libraries across seven domains. Each task supplies an average of 5.6 test cases with 99 percent branch coverage. The authors also release BigCodeBench-Instruct, which automatically converts docstrings into short instructions containing only essential information. Testing sixty models establishes that current systems are not yet capable of following complex instructions to use function calls precisely.
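To make the task format concrete, here is a minimal sketch of what a task of this kind might look like: one function that composes calls from several libraries, paired with a small unit-test suite. The task itself, the name `task_func`, the test class, and the `run_suite` helper are illustrative assumptions and are not taken from the benchmark. Only the standard library is used so that the sketch runs anywhere.

```python
import json
import re
import unittest
from collections import Counter


def task_func(text, top_n=3):
    """Return the `top_n` most frequent words in `text` as a JSON object string.

    Hypothetical task: it combines `re`, `collections`, and `json` in one function.
    Raises ValueError if `top_n` is smaller than 1.
    """
    if top_n < 1:
        raise ValueError("top_n must be at least 1")
    # Tokenize case-insensitively; apostrophes stay inside words.
    words = re.findall(r"[a-z']+", text.lower())
    # most_common keeps first-seen order among words with equal counts.
    counts = Counter(words).most_common(top_n)
    return json.dumps(dict(counts))


class TestTaskFunc(unittest.TestCase):
    """Each test exercises a different branch or behavior of task_func."""

    def test_basic_counts(self):
        self.assertEqual(task_func("a b a c b a", 2), '{"a": 3, "b": 2}')

    def test_case_insensitive(self):
        self.assertEqual(task_func("The cat and the hat", 1), '{"the": 2}')

    def test_empty_text(self):
        self.assertEqual(task_func("", 3), "{}")

    def test_invalid_top_n(self):
        with self.assertRaises(ValueError):
            task_func("a b", 0)


def run_suite():
    """Run the tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestTaskFunc)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    print("all tests passed:", run_suite().wasSuccessful())
```

The four tests cover both the normal path and the error branch, which is the kind of branch coverage the summary describes, although the real suites are reported to average 5.6 test cases per task.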
What carries the argument
The BigCodeBench benchmark of 1,140 tasks, each requiring multiple function calls drawn from 139 libraries across seven domains, together with its BigCodeBench-Instruct variant, which replaces the docstrings with short natural-language instructions.
If this is right
- Current models must improve compositional reasoning to combine multiple function calls accurately under complex instructions.
- Many existing code-generation benchmarks underestimate the demands of practical tasks that draw from numerous libraries.
- The large gap to human performance indicates a need for targeted advances in instruction following for code that uses external tools.
- Future model development should prioritize precise and reliable use of diverse library functions in realistic settings.
Where Pith is reading between the lines
- If the tasks mirror everyday coding demands, then closing the gap will likely require training methods that explicitly practice multi-library composition rather than relying on scale alone.
- The benchmark could be expanded to include live execution environments to test whether models handle runtime errors and library version differences as well as static instruction following.
- Providing few-shot examples of correct function-call sequences in prompts might narrow the performance difference enough to make the tasks solvable by current architectures.
- The results suggest that improvements in general reasoning may not transfer directly to code tool use without additional fine-tuning on library-specific patterns.
Load-bearing premise
The 1,140 tasks and their test cases accurately represent the challenges of real-world practical coding that requires diverse function calls from many libraries.
What would settle it
A model that scores above 80 percent across the full set of 1,140 tasks on the provided test cases would show that large language models can follow such complex instructions for precise function use.
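The settling criterion can be written as a small check. The sketch below assumes that a model's score is the fraction of tasks on which every provided test case passes; this page does not spell out the scoring rule, so that definition, the function names, and the shape of the `outcomes` mapping are all assumptions.

```python
from typing import Mapping

TOTAL_TASKS = 1140  # task count stated on this page
THRESHOLD = 0.80    # the "above 80 percent" bar stated above


def solved_fraction(outcomes: Mapping[str, bool]) -> float:
    """Fraction of tasks marked as solved (True) in `outcomes`."""
    if not outcomes:
        raise ValueError("no task outcomes supplied")
    return sum(outcomes.values()) / len(outcomes)


def clears_threshold(
    outcomes: Mapping[str, bool],
    threshold: float = THRESHOLD,
    expected_tasks: int = TOTAL_TASKS,
) -> bool:
    """True only if the full task set was run and the score is strictly above the bar."""
    if len(outcomes) != expected_tasks:
        raise ValueError(
            f"expected {expected_tasks} task outcomes, got {len(outcomes)}"
        )
    return solved_fraction(outcomes) > threshold


if __name__ == "__main__":
    # Synthetic outcomes for illustration only: 684 of 1,140 tasks solved.
    demo = {f"task_{i}": i < 684 for i in range(TOTAL_TASKS)}
    print(f"score = {solved_fraction(demo):.3f}")
    print("clears threshold:", clears_threshold(demo))
```

Requiring the complete task set guards against a high score on a convenient subset being read as settling the question.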
Original abstract
Task automation has been greatly empowered by the recent advances in Large Language Models (LLMs) via Python code, where the tasks ranging from software engineering development to general-purpose reasoning. While current benchmarks have shown that LLMs can solve tasks using programs like human developers, the majority of their evaluations are limited to short and self-contained algorithmic tasks or standalone function calls. Solving challenging and practical tasks requires the capability of utilizing diverse function calls as tools to efficiently implement functionalities like data analysis and web development. In addition, using multiple tools to solve a task needs compositional reasoning by accurately understanding complex instructions. Fulfilling both of these characteristics can pose a great challenge for LLMs. To assess how well LLMs can solve challenging and practical tasks via programs, we introduce BigCodeBench, a benchmark that challenges LLMs to invoke multiple function calls as tools from 139 libraries and 7 domains for 1,140 fine-grained tasks. To evaluate LLMs rigorously, each task encompasses 5.6 test cases with an average branch coverage of 99%. In addition, we propose a natural-language-oriented variant of BigCodeBench, BigCodeBench-Instruct, that automatically transforms the original docstrings into short instructions only with essential information. Our extensive evaluation of 60 LLMs shows that LLMs are not yet capable of following complex instructions to use function calls precisely, with scores up to 60%, significantly lower than the human performance of 97%. The results underscore the need for further advancements in this area.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces BigCodeBench, a benchmark of 1,140 fine-grained Python code-generation tasks that require invoking multiple function calls drawn from 139 libraries across 7 domains. Each task supplies an average of 5.6 test cases achieving 99 % branch coverage. A natural-language variant (BigCodeBench-Instruct) is also presented that condenses docstrings into short instructions. Evaluation of 60 LLMs yields a maximum score of 60 %, compared with 97 % human performance, leading the authors to conclude that current models cannot yet follow complex instructions to use function calls precisely.
Significance. If the 1,140 tasks genuinely demand compositional, multi-library function use under realistic instruction complexity, the reported performance gap supplies a concrete, falsifiable signal that existing LLMs remain limited in practical tool-use reasoning. The benchmark’s scale, library coverage, and high-coverage test suites would then constitute a useful addition to the code-generation evaluation suite.
major comments (2)
- [Abstract and §3] Abstract and §3 (task construction): the central claim that tasks require 'compositional reasoning' and 'diverse function calls' is load-bearing for the headline 60 % vs. 97 % gap, yet the manuscript supplies no statistics on the minimum or average number of distinct function calls per task, the distribution of call counts, or explicit curation criteria that guarantee multi-library composition. Without these quantities it is impossible to verify that the observed gap reflects inability to handle complex real-world instructions rather than simpler single- or dual-call patterns.
- [Evaluation section] Evaluation section (results on 60 LLMs): the human baseline of 97 % is reported without stating the number of human participants, their selection criteria, time limits, or whether they had access to the same test harness and documentation as the models. This detail is required to interpret the magnitude of the LLM–human gap.
minor comments (1)
- [Abstract] Abstract, first sentence: the clause 'where the tasks ranging from …' is grammatically incomplete and should be rephrased for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to incorporate additional details and statistics where needed to strengthen the presentation of task complexity and evaluation methodology.
-
Referee: [Abstract and §3] Abstract and §3 (task construction): the central claim that tasks require 'compositional reasoning' and 'diverse function calls' is load-bearing for the headline 60 % vs. 97 % gap, yet the manuscript supplies no statistics on the minimum or average number of distinct function calls per task, the distribution of call counts, or explicit curation criteria that guarantee multi-library composition. Without these quantities it is impossible to verify that the observed gap reflects inability to handle complex real-world instructions rather than simpler single- or dual-call patterns.
Authors: We agree that quantitative details on function-call composition are needed to support the claims. In the revision we will add to §3 a table and accompanying text reporting the minimum, average, and distribution of distinct function calls per task, as well as the fraction of tasks that draw functions from multiple libraries. The curation criteria (selection of tasks requiring compositional use of APIs drawn from real-world scenarios) will also be stated explicitly with supporting counts from the construction pipeline. revision: yes
-
Referee: [Evaluation section] Evaluation section (results on 60 LLMs): the human baseline of 97 % is reported without stating the number of human participants, their selection criteria, time limits, or whether they had access to the same test harness and documentation as the models. This detail is required to interpret the magnitude of the LLM–human gap.
Authors: We acknowledge the omission. The revised evaluation section will include the number of human participants, their selection criteria (experienced Python developers), time limits applied, and explicit confirmation that they received identical documentation and test harness access as the models. These clarifications will be added without altering the reported 97 % figure. revision: yes
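The call-count statistics discussed in the first exchange could be approximated with a static pass over the reference solutions. The following is a minimal sketch of one way to do that with Python's `ast` module (it needs Python 3.9 or later for `ast.unparse`). The function names, the sample solution, and the choice to count every distinct call expression, including method calls on local values, are assumptions made for illustration and do not describe the authors' pipeline.

```python
import ast
import statistics


def call_stats(source: str) -> dict:
    """Return the imported top-level libraries and distinct call targets in `source`."""
    tree = ast.parse(source)
    libraries = set()
    calls = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                libraries.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # Skip relative imports, which have no external library name.
            if node.module and node.level == 0:
                libraries.add(node.module.split(".")[0])
        elif isinstance(node, ast.Call):
            # Crude proxy: the textual form of whatever is being called.
            calls.add(ast.unparse(node.func))
    return {"libraries": sorted(libraries), "distinct_calls": sorted(calls)}


def summarize(sources: list) -> dict:
    """Aggregate per-task statistics over a list of solution sources."""
    stats = [call_stats(src) for src in sources]
    counts = [len(s["distinct_calls"]) for s in stats]
    multi = sum(len(s["libraries"]) > 1 for s in stats)
    return {
        "min_calls": min(counts),
        "mean_calls": statistics.mean(counts),
        "multi_library_fraction": multi / len(stats),
    }


# Hypothetical solution used only to exercise the functions above.
SAMPLE = '''
import json
import re
from collections import Counter

def task_func(text):
    words = re.findall(r"[a-z]+", text.lower())
    return json.dumps(Counter(words).most_common(2))
'''

if __name__ == "__main__":
    print(call_stats(SAMPLE))
    print(summarize([SAMPLE, "print(1)"]))
```

Because the count includes calls such as `text.lower`, it overstates library usage; a stricter variant would resolve each call against the imported names before counting it.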
Circularity Check
No circularity in empirical benchmark construction
The paper introduces BigCodeBench as a new collection of 1,140 tasks drawn from 139 libraries and evaluates 60 LLMs plus humans directly on them. No derivation, equation, or prediction is claimed; performance numbers are obtained by executing the released tasks and test suites. The representativeness concern raised in the skeptic note is an external-validity question, not a reduction of any result to its own inputs by construction. No self-citation chain, fitted parameter renamed as prediction, or ansatz is load-bearing for the headline claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The selected tasks and test cases represent real-world practical coding challenges involving diverse function calls.
Forward citations
Cited by 50 Pith papers
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio deploys an evolving coding agent to create adaptive 3D environments that co-evolve with embodied learners, delivering 18-point success-rate gains over fixed environments in navigation benchmarks.
-
SimWorld Studio: Automatic Environment Generation with Evolving Coding Agent for Embodied Agent Learning
SimWorld Studio uses a self-evolving coding agent to generate adaptive 3D environments that improve embodied agent performance, with reported gains of 18 points over fixed environments in navigation tasks.
-
SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
SlopCodeBench shows coding agents degrade in structural quality and verbosity across iterative extensions, with no agent solving any problem completely and agent code 2x more eroded than human code.
-
BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems
BOHM extracts multi-resolution attribution trees from existing routing weights in hierarchical AI systems, providing zero-cost explanations that correlate with SHAP when routing is near-optimal.
-
Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents
SkillTTA synthesizes temporary task-specific skills from retrieved training trajectories to boost LLM agent Pass@1 scores on SpreadsheetBench and BigCodeBench without parameter updates.
-
CRANE: Constrained Reasoning Injection for Code Agents via Nullspace Editing
CRANE merges Instruct and Thinking model checkpoints via constrained nullspace editing to improve code agent reasoning and benchmark performance without retraining.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
When Prompt Under-Specification Improves Code Correctness: An Exploratory Study of Prompt Wording and Structure Effects on LLM-Based Code Generation
Structurally rich task descriptions make LLMs robust to prompt under-specification, and under-specification can enhance code correctness by disrupting misleading lexical or structural cues.
-
Skill Retrieval Augmentation for Agentic AI
Introduces the SRA paradigm and SRA-Bench benchmark showing retrieval-based skill augmentation improves agent performance but skill incorporation remains a bottleneck regardless of retrieval quality.
-
Skill Retrieval Augmentation for Agentic AI
Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
-
Assessing the Impact of Requirement Ambiguity on LLM-based Function-Level Code Generation
Orchid benchmark shows requirement ambiguity degrades LLM code generation performance across all models, with advanced models hit hardest, and LLMs rarely detect or resolve the ambiguity themselves.
-
Neurosymbolic Repo-level Code Localization
LogicLoc combines LLMs with Datalog to achieve accurate repo-level code localization without relying on keyword shortcuts in benchmarks.
-
Structural Anchors and Reasoning Fragility: Understanding CoT Robustness in LLM4Code
CoT prompting in LLM4Code shows mixed robustness that depends on model family, task structure, and perturbations destabilizing structural anchors, leading to trajectory deformations like lengthening, branching, and si...
-
DuET: Dual Execution for Test Output Prediction with Generated Code and Pseudocode
DuET uses dual execution of generated code and pseudocode with majority voting to achieve state-of-the-art test output prediction, boosting Pass@1 by 13.6 percentage points on LiveCodeBench.
-
Pimp My LLM: Leveraging Variability Modeling to Tune Inference Hyperparameters
Variability modeling from software engineering enables systematic sampling, measurement, and prediction of LLM inference configurations for energy, latency, and accuracy trade-offs.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
PerfCoder: Large Language Models for Interpretable Code Performance Optimization
PerfCoder is a family of LLMs trained on optimization trajectories with human annotations and runtime-based preference alignment that achieves higher runtime speedups and optimization rates on the PIE benchmark than p...
-
Library Hallucinations in LLM-Generated Code: A Risk Analysis Grounded in Developer Queries
A study of seven LLMs finds that realistic prompt variations such as one-character misspellings trigger library hallucinations in up to 26% of cases, fabricated names in up to 99%, and time-based prompts in up to 85%,...
-
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
The paper delivers a taxonomy of seven LLM study types in software engineering along with eight guidelines that separate mandatory requirements from recommended practices to address reproducibility challenges.
-
Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games
Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.
-
SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
-
Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
A recurrent-depth architecture enables language models to improve reasoning performance by iterating computation in latent space, achieving gains equivalent to much larger models on benchmarks.
-
Design and Report Benchmarks for Knowledge Work
Proposes a three-step benchmark design method (define work activity, specify tested setting, score work product) derived from work studies and O*NET, demonstrated via three case analyses.
-
Harnessing LLM Agents with Skill Programs
HASP upgrades textual skills into executable Program Functions that intervene in LLM agent loops at inference, post-training, or self-evolution, delivering 25% gains over ReAct and 30.4% over Search-R1 on reasoning be...
-
HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up...
-
Exploiting LLM Agent Supply Chains via Payload-less Skills
Semantic Compliance Hijacking lets attackers hijack LLM agents by disguising malicious instructions as compliance rules in skills, reaching up to 77.67% success on confidentiality breaches and 67.33% on RCE while evad...
-
Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning
METIS internalizes curriculum judgment in LLM reinforcement fine-tuning by predicting within-prompt reward variance via in-context learning and jointly optimizing with a self-judgment reward, yielding superior perform...
-
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
-
Defective Task Descriptions in LLM-Based Code Generation: Detection and Analysis
SpecValidator detects lexical vagueness, under-specification, and syntax-formatting defects in LLM code-generation prompts with F1 0.804, outperforming GPT-5-mini and Claude Sonnet 4, and shows that under-specificatio...
-
Learned or Memorized ? Quantifying Memorization Advantage in Code LLMs
A perturbation method shows memorization advantage in code LLMs varies widely by model and task, remaining low on CVEFixes and Defects4J benchmarks.
-
InCoder-32B-Thinking: Industrial Code World Model for Thinking
InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
-
ContextCov: Deriving and Enforcing Executable Constraints from Agent Instruction Files
ContextCov compiles agent instruction files into static, runtime, and architectural guardrails, raising constraint compliance to 88.3% on SWE-bench Lite tasks versus 67% and 50.3% for prompt and reflection baselines.
-
ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.
-
The Geometric Reasoner: Manifold-Informed Latent Foresight Search for Long-Context Reasoning
TGR performs manifold-informed latent foresight search to boost trajectory coverage in long-context reasoning tasks by up to 13 AUC points with minimal overhead.
-
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
-
TRINITY: An Evolved LLM Coordinator
A compact 0.6B-parameter coordinator with a 10K-parameter head uses evolutionary strategy to dynamically delegate roles to LLMs, achieving SOTA results such as 86.2% on LiveCodeBench.
-
Guidelines for Empirical Studies in Software Engineering involving Large Language Models
A group of 22 researchers proposes seven study types and eight guidelines for empirical software engineering studies involving LLMs to enhance reproducibility and replicability.
-
Seed Diffusion: A Large-Scale Diffusion Language Model with High-Speed Inference
Seed Diffusion Preview is a discrete diffusion language model that reaches 2146 tokens per second inference on H20 GPUs with competitive code benchmark performance, establishing a new speed-quality Pareto frontier.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
-
A Study of LLMs' Preferences for Libraries and Programming Languages
Empirical study of eight LLMs finds overuse of popular libraries like NumPy in up to 45% of unnecessary cases and strong default preference for Python even when suboptimal.
-
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Muon-OGD integrates Muon-style spectral-norm geometry with orthogonal gradient constraints to improve the stability-plasticity trade-off during sequential LLM adaptation.
-
An End-to-End Framework for Building Large Language Models for Software Operations
OpsLLM outperforms general LLMs on software operations QA and RCA tasks through human-in-the-loop data curation, supervised fine-tuning, and domain-specific reinforcement learning.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
-
AdaDec: A Uncertainty-Guided Lookahead Decoding Framework for LLM-Based Code Generation
AdaDec improves Pass@1 accuracy of LLM code generation by up to 20.9% over greedy decoding by triggering lookahead reranking only at high-uncertainty steps on HumanEval+, MBPP+, and DevEval.
-
Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
Phi-4-Mini achieves strong math and coding performance with only 3.8B parameters via high-quality synthetic data, while Phi-4-Multimodal uses Mixture-of-LoRAs to integrate modalities and top speech recognition leaderboards.
-
OpenCompass: A Universal Evaluation Platform for Large Language Models
OpenCompass is a modular, high-concurrency platform for unified LLM evaluation across knowledge, reasoning, code, and other domains with support for rule-based, LLM-as-judge, and cascaded evaluators.
-
How Many Tries Does It Take? Iterative Self-Repair in LLM Code Generation Across Model Scales and Benchmarks
Iterative self-repair improves LLM code pass rates by 4.9-17.1 pp on HumanEval and 16-30 pp on MBPP across seven models, with gains concentrated early and syntax errors easier to fix than logical ones.
-
An End-to-End Framework for Building Large Language Models for Software Operations
OpsLLM is a domain-specific LLM for software ops QA and RCA built with human-curated data, SFT, and RL using a domain process reward model, showing accuracy gains of 0.2-5.7% on QA and 2.7-70.3% on RCA over general LLMs.
-
Qwen2.5-Coder Technical Report
Qwen2.5-Coder models claim state-of-the-art results on over 10 code benchmarks, outperforming larger models of similar size.