
arxiv: 2512.24120 · v2 · pith:WT2B2WJPnew · submitted 2025-12-30 · 💻 cs.CV · cs.AI

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Pith reviewed 2026-05-16 19:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords few-shot prompting · LLM architecture generation · neural architecture search · computer vision · deduplication · validation · automated design

The pith

Using three examples in a few-shot prompt gives LLM-generated vision architectures the best balance of diversity and task focus, while a simple hash check deduplicates generated code about 100 times faster than AST parsing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how many examples to include in prompts when asking large language models to design neural networks for computer vision problems. It finds that three examples give the best mix of new architecture ideas and task relevance across standard benchmarks. The work also shows that a lightweight whitespace-normalized hash can detect duplicate network code in under a millisecond, avoiding repeated training runs that waste compute. Experiments produced 1900 unique architectures on seven datasets and introduced a balanced way to compare results across different vision tasks. These steps make LLM-driven architecture search practical for groups with limited hardware.

Core claim

Few-Shot Architecture Prompting with exactly three supporting examples gives the best balance of architectural diversity and task-specific focus for vision networks. Whitespace-Normalized Hash Validation deduplicates generated code 100 times faster than AST parsing and prevents redundant training. Together they enable efficient large-scale generation of 1,900 unique architectures across seven heterogeneous benchmarks, compared with a dataset-balanced evaluation method.

What carries the argument

Few-Shot Architecture Prompting (FSAP) with variable shot counts combined with Whitespace-Normalized Hash Validation for fast deduplication
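
The deduplication half of this machinery can be illustrated with a minimal sketch. It assumes that "whitespace normalization" means deleting every whitespace character before hashing; the paper may normalize differently. The names `whitespace_normalized_hash` and `ArchitectureRegistry` are hypothetical and do not come from the paper.

```python
import hashlib


def whitespace_normalized_hash(code: str) -> str:
    """Hash source code after removing every whitespace character.

    Assumption: normalization means deleting all whitespace;
    the paper's exact rule is not given in this summary.
    """
    normalized = "".join(code.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class ArchitectureRegistry:
    """Hypothetical helper that remembers hashes of accepted architectures."""

    def __init__(self):
        self._seen = set()

    def is_new(self, code: str) -> bool:
        """Return True and record the hash if this code was not seen before."""
        digest = whitespace_normalized_hash(code)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True


# a and b differ only in spacing and blank lines; c changes the computation.
a = "class Net(nn.Module):\n    def forward(self, x):\n        return x\n"
b = "class Net(nn.Module):\n\n    def forward(self,  x):\n        return x"
c = "class Net(nn.Module):\n    def forward(self, x):\n        return x * 2\n"

registry = ArchitectureRegistry()
results = [registry.is_new(src) for src in (a, b, c)]
print(results)  # [True, False, True]
```

Under this assumed rule, two Python programs that differ only in indentation would also collide, so a production check might normalize more conservatively. Whether the paper handles that case is not stated here.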

Load-bearing premise

Observed differences in generated architecture quality across different numbers of prompt examples are caused by the example count itself rather than random variation in LLM sampling, training runs, or data splits.

What would settle it

Re-running the full set of experiments with fixed random seeds, identical training hyperparameters, and deterministic LLM sampling to check whether performance gaps between one-shot through six-shot regimes disappear.
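
Once repeated, seeded runs exist, the gap between two shot counts could be checked with a simple permutation test. The sketch below is one possible analysis, not the authors' procedure. The function name `permutation_p_value` is hypothetical, and the accuracy lists are placeholders for illustration only, not results from the paper.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)  # fixed seed so the test itself is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (hits + 1) / (n_permutations + 1)


# Placeholder accuracies for illustration only; NOT results from the paper.
runs_n3 = [0.71, 0.73, 0.72, 0.74, 0.70]
runs_n5 = [0.70, 0.72, 0.69, 0.73, 0.71]

p = permutation_p_value(runs_n3, runs_n5)
print(f"mean gap = {mean(runs_n3) - mean(runs_n5):.3f}, p = {p:.3f}")
```

A large p-value on real repeated runs would suggest that the ordering of shot counts is within sampling noise, which is exactly the outcome this premise needs to rule out.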

Figures

Figures reproduced from arXiv: 2512.24120 by Avi Goyal, Chandini Vysyaraju, Dmitry Ignatov, Radu Timofte, Raghuvir Duvvuri.

Figure 1: Complete architecture generation pipeline showing integration of Few-Shot Architecture Prompting (FSAP) and Whitespace...
Figure 2: Context overflow phenomenon. Balanced mean accu...

Original abstract

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.
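
To make the shot-count variable concrete, the following sketch shows one way a prompt with n supporting examples could be assembled. The prompt wording, the `ExampleArchitecture` structure, and `build_few_shot_prompt` are all assumptions made for illustration; the actual NNGPT/LEMUR prompt template is not reproduced in this summary.

```python
from dataclasses import dataclass


@dataclass
class ExampleArchitecture:
    """One supporting example shown to the LLM (hypothetical structure)."""
    dataset: str
    code: str


def build_few_shot_prompt(task, examples, n_shots=3):
    """Assemble a prompt from the first n_shots examples plus the request."""
    if not 1 <= n_shots <= len(examples):
        raise ValueError("n_shots must be between 1 and the number of examples")
    parts = [f"You design PyTorch models for the task: {task}."]
    for i, ex in enumerate(examples[:n_shots], start=1):
        parts.append(f"### Example {i} (dataset: {ex.dataset})\n{ex.code}")
    parts.append("### New architecture\nWrite a new, different model for the same task.")
    return "\n\n".join(parts)


# Placeholder example pool; real examples would hold full model source code.
pool = [ExampleArchitecture("CIFAR-10", f"# model code {i}") for i in range(1, 7)]

prompt = build_few_shot_prompt("image classification", pool, n_shots=3)
print(prompt.count("### Example"))  # 3
```

Sweeping `n_shots` from 1 to 6 over the same pool would reproduce the shape of the study, with prompt length growing as more examples are included.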

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Few-Shot Architecture Prompting (FSAP) as a systematic study of the number of in-context examples (n=1 to 6) for LLM-based generation of computer-vision architectures, concluding that n=3 optimally balances diversity and task focus. It also proposes Whitespace-Normalized Hash Validation for sub-millisecond deduplication that yields a claimed 100x speedup over AST parsing. Large-scale experiments generate 1,900 unique architectures and evaluate them on seven heterogeneous vision benchmarks (MNIST through Places365) using a dataset-balanced protocol.

Significance. If the central empirical ranking survives statistical controls, the work supplies concrete, actionable prompting guidelines for LLM-driven architecture search in vision and a lightweight deduplication primitive that materially reduces wasted training cycles. The scale of the reported generation (1,900 architectures) and the emphasis on cross-task comparability are strengths that could influence practical NAS pipelines with limited compute.

major comments (2)
  1. [Abstract and experimental results] The claim that n=3 'best balances architectural diversity and context focus' rests on a single generation run per prompting regime. No error bars, repeated LLM sampling seeds, temperature sweeps, or hypothesis tests (t-test/ANOVA) are reported, despite the stochasticity of both LLM decoding and network training. The observed accuracy ordering may therefore reflect uncontrolled random variation rather than the number of shots.
  2. [§3 and §4] Training protocols, optimizer settings, data splits, and the number of independent training runs per architecture are not specified in §3 (methodology) or §4 (experiments). Without these details, the performance differences across n values cannot be attributed to prompting strategy rather than to hyperparameter or split noise.
minor comments (2)
  1. [§4] The description of the dataset-balanced evaluation methodology is too brief; a concrete formula or pseudocode would clarify how accuracies are aggregated across tasks with different class counts and image resolutions.
  2. [Figures and tables] Table or figure captions should explicitly state the number of independent trials and whether error bars represent standard deviation or standard error.
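
As an illustration of the kind of pseudocode the first minor comment asks for, one plausible reading of "dataset-balanced" is a macro-average: compute the mean accuracy within each dataset, then average those means so that every dataset carries equal weight. This is an assumption, not the paper's stated formula, and `dataset_balanced_mean` is a hypothetical name. The accuracy values are placeholders, not results from the paper.

```python
from collections import defaultdict
from statistics import mean


def dataset_balanced_mean(records):
    """Average per-dataset mean accuracies so every dataset has equal weight.

    records: iterable of (dataset_name, accuracy) pairs.
    """
    per_dataset = defaultdict(list)
    for dataset, accuracy in records:
        per_dataset[dataset].append(accuracy)
    if not per_dataset:
        raise ValueError("no records given")
    return mean(mean(values) for values in per_dataset.values())


# Placeholder values for illustration only; NOT results from the paper.
records = [
    ("MNIST", 0.99), ("MNIST", 0.98), ("MNIST", 0.97), ("MNIST", 0.98),
    ("CIFAR-100", 0.40),
]
pooled = mean(acc for _, acc in records)
balanced = dataset_balanced_mean(records)
print(f"pooled mean = {pooled:.3f}, balanced mean = {balanced:.3f}")
```

In this toy input the pooled mean is dominated by the four easy MNIST entries, while the balanced mean gives the single CIFAR-100 entry the same weight as the whole MNIST group.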

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve reproducibility and statistical transparency.

Point-by-point responses
  1. Referee: [Abstract and experimental results] The claim that n=3 'best balances architectural diversity and context focus' rests on a single generation run per prompting regime. No error bars, repeated LLM sampling seeds, temperature sweeps, or hypothesis tests (t-test/ANOVA) are reported, despite the stochasticity of both LLM decoding and network training. The observed accuracy ordering may therefore reflect uncontrolled random variation rather than the number of shots.

    Authors: We acknowledge that each prompting regime (n=1 to 6) was evaluated using a single LLM generation run to keep the overall experiment tractable at the reported scale of 1,900 architectures. The n=3 result is supported by its consistent ranking across all seven heterogeneous benchmarks under a dataset-balanced protocol. In the revision we will (i) explicitly state the single-run limitation, (ii) add training-run error bars for the final reported accuracies where additional compute permits, and (iii) include a brief discussion of why full multi-seed LLM sampling and formal hypothesis testing were not performed. We do not claim statistical significance beyond the observed cross-benchmark pattern. revision: partial

  2. Referee: [§3 and §4] Training protocols, optimizer settings, data splits, and the number of independent training runs per architecture are not specified in §3 (methodology) or §4 (experiments). Without these details, the performance differences across n values cannot be attributed to prompting strategy rather than to hyperparameter or split noise.

    Authors: We agree that these implementation details are necessary for reproducibility. The revised manuscript will expand §3 and §4 to specify: Adam optimizer with learning rate 0.001 and standard weight decay, the exact train/validation/test splits used for each of the seven benchmarks, and that each generated architecture was trained once (to prioritize breadth of 1,900 unique models). These additions will make clear that performance differences are measured under identical training conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from direct architecture generation and benchmarking

Full rationale

The paper reports an experimental study generating 1,900 unique architectures via LLM prompting with varying shot counts (n=1..6) and evaluating them on seven vision benchmarks. The central claim that n=3 best balances diversity and context focus is obtained by comparing measured accuracies and diversity metrics across regimes, not by any equation, fitted parameter, or self-citation that reduces the output to the input by construction. The Whitespace-Normalized Hash Validation is introduced as a lightweight implementation with a reported 100x speedup over AST parsing; its correctness is verified by direct timing and deduplication counts rather than derived from prior results. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renaming of known patterns appear in the described methodology. The evaluation methodology is self-contained against the stated benchmarks and does not invoke external derivations.
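
The "direct timing" check mentioned above could look roughly like the sketch below, which compares a whitespace-normalized hash against an AST-based key on a small stand-in module. The normalization rule (drop all whitespace), the function names, and the sample source are assumptions. Measured timings depend on the machine and on source length, so no speedup figure is asserted here.

```python
import ast
import hashlib
import timeit


def hash_key(code: str) -> str:
    """Whitespace-normalized hash (assumed rule: drop all whitespace)."""
    return hashlib.sha256("".join(code.split()).encode("utf-8")).hexdigest()


def ast_key(code: str) -> str:
    """Structural key from the parsed syntax tree, used as the baseline."""
    return ast.dump(ast.parse(code))


# Small stand-in module; real generated architectures would be much longer.
SOURCE = (
    "class Net:\n"
    "    def __init__(self, width):\n"
    "        self.width = width\n"
    "\n"
    "    def forward(self, x):\n"
    "        return x * self.width\n"
)
# Same program with an extra blank line and extra spaces in one expression.
REFORMATTED = SOURCE.replace("\n\n", "\n\n\n").replace("x * self", "x  *  self")

same_by_hash = hash_key(SOURCE) == hash_key(REFORMATTED)
same_by_ast = ast_key(SOURCE) == ast_key(REFORMATTED)

runs = 2000
hash_seconds = timeit.timeit(lambda: hash_key(SOURCE), number=runs) / runs
ast_seconds = timeit.timeit(lambda: ast_key(SOURCE), number=runs) / runs
print(f"hash: {hash_seconds * 1e6:.1f} us/call, AST: {ast_seconds * 1e6:.1f} us/call")
```

Both keys treat the reformatted copy as a duplicate in this example. The AST key additionally ignores comments and requires the code to parse, which the hash does not.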

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The optimal n=3 is selected after an empirical sweep; the framework assumes LLMs can produce executable architectures from natural-language prompts.

free parameters (1)
  • number of prompt examples n
    Experimentally chosen optimum after testing n=1 to 6; value 3 is reported as best balance for vision tasks.
axioms (1)
  • domain assumption: LLMs can generate syntactically valid and trainable neural network code from few-shot prompts
    Inherited from the NNGPT/LEMUR framework and required for the prompting experiments to be meaningful.



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs

    cs.LG 2026-05 unverdicted novelty 7.0

    Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...

  2. Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models

    cs.CV 2026-01 unverdicted novelty 6.0

    Closed-loop LLM search with AST-generated examples discovers non-standard channel widths that improve vision model performance over initial architectures on CIFAR-100.
