
arxiv: 2607.01883 · v1 · pith:2WOBA535new · submitted 2026-07-02 · 💻 cs.CL

PairCoder++: Pair Programming as a Universal Paradigm for Verified Code-Driven Multimodal and Structured-Artifact Generation

Pith reviewed 2026-07-03 15:13 UTC · model grok-4.3

classification 💻 cs.CL
keywords pair programming · code generation · LLM agents · structured artifacts · verification feedback · multimodal generation · Blender scenes · TikZ compilation

The pith

Pair programming between Driver and Navigator agents, grounded in toolchain verification evidence, raises success rates on code-driven generation of structured artifacts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PairCoder as a two-agent system in which a Driver writes programs to produce artifacts such as charts, figures, CAD models, and 3D scenes while a Navigator reviews them against diagnostics, execution results, and renderings supplied by the compiler or renderer. When errors persist the agents switch roles and continue until the artifact passes verification. Tested on 17 public benchmarks with seven models from three vendors, the approach improves full official metrics on tasks that supply informative oracles, for instance lifting Blender scene executability from 0.20 to 0.78 and raising TikZ compilation rates by 10 to 30 points on every model. Gains appear at a cost of 2.9 to 9.2 times single-model inference, roughly seven times overall, and the method is presented as a general recipe for verified code-driven generation wherever the toolchain can supply usable feedback.

Core claim

PairCoder realizes review through two-agent pair programming in which a Driver agent generates the program and a Navigator agent critiques it against verification evidence consisting of diagnostics, execution results, and renderings of the current artifact beside the target; the agents switch roles when errors persist. Across 17 benchmarks and seven models this yields consistent gains on every benchmark whose artifact is verifiable, measured on full official metric suites, at 2.9 to 9.2 times single-model cost, with the largest improvements where the baseline leaves headroom and the oracle is informative.

What carries the argument

The Driver-Navigator pair programming loop that uses toolchain-provided verification evidence (diagnostics, execution results, renderings) to drive corrections and triggers role switches on persistent errors.
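
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent`, `Evidence`, and `pair_loop` names, the callback signatures, the `max_rounds` cap, and the `switch_after` threshold are all hypothetical. The summary says only that roles switch "when errors persist", so the threshold is left as a parameter.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Evidence:
    """Verification evidence returned by the toolchain (hypothetical shape)."""
    ok: bool          # True when the artifact passes verification
    diagnostics: str  # compiler or renderer messages, empty on success


@dataclass
class Agent:
    """One model in the pair; either callback would wrap an LLM call in practice."""
    name: str
    # (task, previous program, reviewer feedback) -> new program
    write: Callable[[str, str, str], str]
    # (task, program, evidence) -> feedback for the next revision
    review: Callable[[str, str, Evidence], str]


def pair_loop(
    task: str,
    first: Agent,
    second: Agent,
    verify: Callable[[str], Evidence],
    max_rounds: int = 6,    # assumed cap on rounds
    switch_after: int = 1,  # assumed: consecutive failures before swapping roles
) -> Tuple[str, bool, List[str]]:
    """Run a Driver/Navigator loop until the artifact verifies or rounds run out.

    Returns the last program, whether it verified, and who drove each round.
    """
    driver, navigator = first, second
    program, feedback = "", ""
    failures = 0
    drivers: List[str] = []
    for _ in range(max_rounds):
        program = driver.write(task, program, feedback)
        drivers.append(driver.name)
        evidence = verify(program)  # the toolchain acts as the oracle
        if evidence.ok:
            return program, True, drivers
        # The Navigator critiques the program against the toolchain evidence.
        feedback = navigator.review(task, program, evidence)
        failures += 1
        if failures >= switch_after:
            driver, navigator = navigator, driver  # role switch
            failures = 0
    return program, False, drivers
```

With `switch_after=1` the roles swap after every failed verification; a larger value keeps the same Driver for several attempts before swapping. Which setting matches the paper's default policy is not stated in the summary above.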

If this is right

  • Improvements concentrate on benchmarks that supply informative oracles and where single-model baselines leave headroom.
  • The method ties or mildly regresses on tasks where the oracle is weak.
  • Gains are measured on full official metric suites rather than execution success alone.
  • Pair programming serves as a reliable recipe for verified code-driven generation of charts, figures, vector graphics, CAD models, 3D scenes, and hardware designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If toolchain oracles were strengthened with richer feedback signals, the same agent roles might achieve further gains without additional model calls.
  • The approach could be tested on code-generation tasks outside artifact creation, such as algorithmic problem solving, to see whether the oracle-and-role-switch pattern transfers.
  • Cost multipliers might be reduced by limiting the number of role switches or by using smaller models for the Navigator role once an informative oracle is available.

Load-bearing premise

The toolchain must supply an informative oracle that the Navigator agent can reliably interpret and that role switching will resolve persistent errors when the oracle is present.

What would settle it

A benchmark in which the oracle supplies no usable diagnostics or renderings, or in which repeated role switches still leave errors uncorrected, would produce no improvement or a regression relative to single-model baselines.

Figures

Figures reproduced from arXiv: 2607.01883 by Boran Zhang, Fangsheng Weng, Fei Ma, Hao Zhao, Henghaofan Zhang, Junhao Chen, Mingjin Chen, Qi Tian, Ruqi Huang, Xiang Li, Yibin Xu, Yuehan Cui.

Figure 1: PairCoder improves code generation across every artifact domain, not only Python program synthesis.
Figure 2: The PairCoder framework: the Driver generates and revises code, the Navigator reviews it against…
Figure 3: PairCoder workflow: the Driver proposes and revises, the Navigator reviews against toolchain evidence,…
Figure 4: Agent brain architecture: LLM-based decision…
Figure 5: Collaboration mechanism with iterative feed…
Figure 6: Cost versus benefit at gpt-5.4-mini: per-benchmark token multiplier against primary-metric gain; colours follow the families of…
Figure 7: Role-switching policy at gpt-5.4-mini: official metric (top) and PairCoder token cost (bottom) for four policies; see Sec. 4.6. Switching on each error is the best or tied-best policy on all four benchmarks, and never the worst. The effect is small where the loop is short (DaTikZ saturates near a perfect compile rate under every policy) but clear where iteration matters: on HumanEval-X C++, switching im…
Figure 8: Qualitative comparison across six artifact domains: target/reference (left), single model (middle),…
Figure 9: Case 1 rendering progression: target (left), the…
Figure 10: Case 2 rendering progression: target, rounds…
Figure 12: Plot2Code qualitative gallery (27 distinct examples); both arms render and PairCoder is quantitatively…
Figure 13: PandasPlotBench qualitative gallery (27 distinct examples); both arms render and PairCoder is quantita…
Figure 14: ChartMimic qualitative gallery (23 distinct examples); both arms render and PairCoder is quantitatively…
Figure 15: StarVector qualitative gallery (12 distinct examples); both arms render and PairCoder is quantitatively…
Figure 16: GenCAD-Code qualitative gallery (12 distinct examples) demonstrating PairCoder’s advantage; gray…
Figure 17: DaTikZ qualitative gallery (12 distinct examples) demonstrating PairCoder’s advantage; gray panels…
Figure 18: 3DCodeBench qualitative gallery (12 distinct examples) demonstrating PairCoder’s advantage; gray…
Figure 19: ChartMimic across four models and three shared tasks (single vs. PairCoder per cell).
Figure 20: Plot2Code across four models and three shared tasks.
Figure 21: PandasPlotBench across four models and three shared tasks.
Figure 22: StarVector across four models and three shared tasks.
Figure 23: 3DCodeBench across four models and three shared tasks; gray marks baselines that fail to render.
Figure 24: P3D-Bench geometry-refinement gallery (gpt-5.4): ground truth (green), single model (pink; valid but geometrically poor), PairCoder (blue); per-case F-score@0.05 annotated.
Original abstract

Code is the medium through which large language models generate structured artifacts: charts, scientific figures, vector graphics, CAD models, 3D scenes, and hardware designs are all produced by writing programs. In this regime single-pass inference is brittle, because the compiler, renderer, or simulator that decides whether the artifact exists is invisible to the model. We present PairCoder, which grounds review in the toolchain and realizes it as two-agent pair programming: a Driver agent writes the program, a Navigator agent reviews it against verification evidence (diagnostics, execution results, and renderings of the current artifact beside the target), and the two switch roles when errors persist. Across 17 public benchmarks and seven models from three vendors, PairCoder improves essentially every benchmark whose artifact is verifiable, on full official metric suites rather than execution alone (for example, Blender scene executability 0.20 to 0.78; TikZ compile rate up 10 to 30 points on every model), at 2.9 to 9.2 times single-model cost (about 7 times overall). The improvements concentrate where the toolchain provides an informative oracle and the baseline leaves headroom, and the method ties or mildly regresses where the oracle is weak; we frame pair programming as a reliable recipe for verified code-driven generation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript presents PairCoder, a two-agent pair-programming system for code-driven generation of structured artifacts (charts, TikZ, Blender scenes, CAD, etc.). A Driver agent emits code while a Navigator agent reviews it against toolchain oracles (diagnostics, execution output, renderings); the agents switch roles on persistent errors. Across 17 public benchmarks and seven models the method is reported to raise official metrics on every verifiable-artifact task (e.g., Blender executability 0.20 to 0.78, TikZ compile rate +10–30 points) at 2.9–9.2× single-model cost, with gains concentrated where the oracle is informative.

Significance. If the causal contribution of the pair-programming structure can be isolated, the work supplies a concrete, broadly applicable recipe for grounding LLM generation in external verifiers rather than internal model confidence. The scale of the evaluation (17 benchmarks, multiple vendors) would make the result a useful reference point for agentic code-generation research.

major comments (2)
  1. [Experimental Results] Experimental Results section: the headline gains (Blender 0.20 to 0.78, TikZ +10–30 points) are presented without ablations that hold total inference budget fixed while removing either the explicit Driver/Navigator division or the role-switching rule. Consequently it remains unclear whether the reported improvements require the pair-programming architecture or would be obtained by simply allocating the same extra model calls to a single agent.
  2. [Method] Method section: the central assumption that the Navigator can reliably translate toolchain renderings and diagnostics into actionable edits is stated but not quantified. No success-rate statistics on oracle interpretation, no failure-case analysis, and no comparison of continued single-agent turns versus forced role switches are supplied, leaving the load-bearing mechanism untested.
minor comments (3)
  1. [Abstract] Abstract: the title contains “PairCoder++” while the text consistently refers to “PairCoder”; the relationship between the two should be clarified.
  2. [Experimental Results] The cost multiplier is given only as a range (2.9–9.2×); a per-benchmark or per-model table would allow readers to assess whether the overhead is uniform or driven by a few hard cases.
  3. [Experimental Results] No error bars, statistical significance tests, or dataset-split details are mentioned; these should be added to the experimental protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. We address each of the major comments below.

  1. Referee: [Experimental Results] Experimental Results section: the headline gains (Blender 0.20 to 0.78, TikZ +10–30 points) are presented without ablations that hold total inference budget fixed while removing either the explicit Driver/Navigator division or the role-switching rule. Consequently it remains unclear whether the reported improvements require the pair-programming architecture or would be obtained by simply allocating the same extra model calls to a single agent.

    Authors: We agree that an ablation holding the inference budget fixed would help isolate the contribution of the pair-programming structure. The current results report performance at increased cost (2.9–9.2 times), but do not directly compare to a single agent with equivalent calls. In the revised manuscript, we will add such an ablation study, comparing PairCoder to a single-agent setup with the same total number of model invocations, to determine if the structured roles and switching provide benefits beyond additional computation. revision: yes

  2. Referee: [Method] Method section: the central assumption that the Navigator can reliably translate toolchain renderings and diagnostics into actionable edits is stated but not quantified. No success-rate statistics on oracle interpretation, no failure-case analysis, and no comparison of continued single-agent turns versus forced role switches are supplied, leaving the load-bearing mechanism untested.

    Authors: The manuscript indeed does not include quantitative metrics on the Navigator's interpretation success or detailed failure analysis of role switches. We will revise the Method and Experimental Results sections to include: (1) statistics on how often Navigator edits lead to error reduction, (2) analysis of role-switch frequency and its correlation with progress, and (3) a comparison of performance when role-switching is disabled (i.e., continued single-agent turns). This will provide evidence on the effectiveness of the oracle-driven review mechanism. revision: yes
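
A budget-matched baseline of the kind discussed in the first response could be set up roughly as follows. This is a sketch under assumptions, not the protocol the authors commit to: `single_agent_budgeted`, its callback signatures, and the choice to count budget in model calls (rather than tokens) are hypothetical. The idea is to give one agent the same number of model invocations the pair consumed, with the same toolchain diagnostics fed back on each retry, so that any remaining gap can be attributed to the role structure rather than to extra compute.

```python
from typing import Callable, Tuple


def single_agent_budgeted(
    task: str,
    # (task, previous program, last diagnostics) -> new program; one model call
    generate: Callable[[str, str, str], str],
    # program -> (passed, diagnostics); the same toolchain oracle the pair uses
    verify: Callable[[str], Tuple[bool, str]],
    call_budget: int,  # assumed: set to the model-call count of the pair run
) -> Tuple[str, bool, int]:
    """Retry a single agent against the oracle until success or budget exhaustion.

    Returns the last program, whether it verified, and the calls actually used.
    """
    program, diagnostics = "", ""
    calls = 0
    while calls < call_budget:
        program = generate(task, program, diagnostics)
        calls += 1
        ok, diagnostics = verify(program)
        if ok:
            return program, True, calls
    return program, False, calls
```

Matching on tokens instead of calls would be a stricter comparison, since a Navigator review and a Driver revision need not cost the same; either choice would have to be reported alongside the results.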

Circularity Check

0 steps flagged

No circularity; empirical method grounded in external toolchain oracles


The paper presents PairCoder as an agent-based pair-programming method (Driver writes code; Navigator reviews against diagnostics/execution/renderings) and reports empirical gains on 17 benchmarks across models. No equations, fitted parameters, predictions derived from inputs, or self-citation chains appear in the provided text. The central claim is validated against external verifiable artifacts and official metrics rather than reducing to internal definitions or self-referential logic by construction. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities can be extracted.

pith-pipeline@v0.9.1-grok · 5806 in / 1228 out tokens · 25080 ms · 2026-07-03T15:13:06.163532+00:00 · methodology

