FullStack Bench: Evaluating LLMs as Full Stack Coders

Aoyan Li; Bo Li; Bowen Li; Boyi Liu; Bytedance-Seed-Foundation-Code-Team: Yao Cheng; Chenguang Xi; Guanghan Ning; Guoyin Wang; He Zhu; Hongxia Yang

arxiv: 2412.00535 · v6 · pith:4YORXJVVnew · submitted 2024-11-30 · 💻 cs.AI · cs.SE

Bytedance-Seed-Foundation-Code-Team: Yao Cheng, Jianfeng Chen, Jie Chen, Li Chen, Liyu Chen, Wentao Chen, Zhengyu Chen, Shijie Geng, Aoyan Li, Bo Li, Bowen Li, Linyi Li, Boyi Liu, Jiaheng Liu, Kaibo Liu, Qi Liu, Shukai Liu, Siyao Liu, Tianyi Liu, Tingkai Liu, Yongfei Liu, Rui Long, Jing Mai, Guanghan Ning, Z.Y. Peng, Kai Shen, Jiahao Su, Jing Su, Tao Sun, Yifan Sun, Yunzhe Tao, Guoyin Wang, Siwei Wang, Xuwu Wang, Yite Wang, Zihan Wang, Jinxiang Xia, Liang Xiang, Xia Xiao, Yongsheng Xiao, Chenguang Xi, Shulin Xin, Jingjing Xu, Shikun Xu, Hongxia Yang, Jack Yang, Yingxiang Yang, Jianbo Yuan, Jun Zhang, Yufeng Zhang, Yuyu Zhang, Shen Zheng, He Zhu, Ming Zhu

keywords: bench, fullstack, programming, code, domains, application, capabilities, comprehensive

Abstract

As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets only evaluate limited application domains. To address this gap, we have developed a comprehensive code evaluation dataset FullStack Bench focusing on full-stack programming, which encompasses a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). Besides, to assess multilingual programming capabilities, in FullStack Bench, we design real-world instructions and corresponding unit test cases from 16 widely-used programming languages to reflect real-world usage scenarios rather than simple translations. Moreover, we also release an effective code sandbox execution tool (i.e., SandboxFusion) supporting various programming languages and packages to evaluate the performance of our FullStack Bench efficiently. Comprehensive experimental results on our FullStack Bench demonstrate the necessity and effectiveness of our FullStack Bench and SandboxFusion.

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Multi-Objective and Mixed-Reward Reinforcement Learning via Reward-Decorrelated Policy Optimization
cs.LG 2026-05 unverdicted novelty 7.0

RDPO applies magnitude-aware quantile normalization and Mahalanobis whitening to decorrelate heterogeneous rewards in multi-objective RL, improving instruction following and writing quality on LongCat-Flash post-train...
SWE-WebDevBench: Evaluating Coding Agent Application Platforms as Virtual Software Agencies
cs.MA 2026-05 conditional novelty 7.0

SWE-WebDevBench finds that AI app builders commonly fail at translating business needs into complete, secure, production-ready software due to specification bottlenecks, frontend-backend decoupling, low engineering qu...
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
cs.CL 2026-04 unverdicted novelty 7.0

GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
cs.LG 2026-05 unverdicted novelty 6.0

MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning
cs.AI 2026-05 unverdicted novelty 6.0

Spreadsheet-RL applies RL fine-tuning and a custom Gym environment to raise LLM agent Pass@1 scores on spreadsheet benchmarks from roughly 8-12% to 17-23%.
Efficient Agentic Reasoning Through Self-Regulated Simulative Planning
cs.AI 2026-05 unverdicted novelty 6.0

SR²AM achieves competitive Pass@1 accuracy on diverse tasks with 25.8-95.3% fewer reasoning tokens than much larger models by using self-regulated simulative planning trained via supervised learning and RL.
InCoder-32B-Thinking: Industrial Code World Model for Thinking
cs.AR 2026-04 unverdicted novelty 6.0

InCoder-32B-Thinking uses error-feedback synthesized thinking traces and a code world model to reach top open-source scores on general and industrial code benchmarks including 81.3% on LiveCodeBench and 84.0% on CAD-Coder.
CharTool: Tool-Integrated Visual Reasoning for Chart Understanding
cs.AI 2026-04 unverdicted novelty 6.0

CharTool equips MLLMs with cropping and code tools plus agentic RL on DuoChart data to raise chart-reasoning accuracy by up to 9.78 percent on benchmarks.
SCALER: Synthetic Scalable Adaptive Learning Environment for Reasoning
cs.AI 2026-01 unverdicted novelty 6.0

SCALER creates adaptive synthetic environments for RL-based LLM reasoning training that outperforms fixed-dataset baselines with more stable long-term progress.
WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent
cs.IR 2025-08 unverdicted novelty 6.0

WebWatcher introduces a vision-language deep research agent trained on synthetic multimodal trajectories and RL that outperforms baselines on VQA benchmarks, along with a new BrowseComp-VL evaluation.
Ling and Ring 2.6 Technical Report: Efficient and Instant Agentic Intelligence at Trillion-Parameter Scale
cs.CL 2026-06 unverdicted novelty 4.0

Technical report announcing Ling-2.6 and Ring-2.6 models with hybrid linear attention, evolutionary CoT, and KPop RL for efficient agentic intelligence at scale.