Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Avi Goyal; Chandini Vysyaraju; Dmitry Ignatov; Radu Timofte; Raghuvir Duvvuri

arxiv: 2512.24120 · v2 · pith:WT2B2WJPnew · submitted 2025-12-30 · 💻 cs.CV · cs.AI

Enhancing LLM-Based Neural Network Generation: Few-Shot Prompting and Efficient Validation for Automated Architecture Design

Raghuvir Duvvuri , Chandini Vysyaraju , Avi Goyal , Dmitry Ignatov , Radu Timofte This is my paper

Pith reviewed 2026-05-16 19:04 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords few-shot promptingLLM architecture generationneural architecture searchcomputer visiondeduplicationvalidationautomated design

0 comments

The pith

Three examples in few-shot prompts let LLMs generate the most balanced neural architectures for vision tasks while a simple hash check speeds validation by 100 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests how many examples to include in prompts when asking large language models to design neural networks for computer vision problems. It finds that three examples give the best mix of new architecture ideas and task relevance across standard benchmarks. The work also shows that a lightweight whitespace-normalized hash can detect duplicate network code in under a millisecond, avoiding repeated training runs that waste compute. Experiments produced 1900 unique architectures on seven datasets and introduced a balanced way to compare results across different vision tasks. These steps make LLM-driven architecture search practical for groups with limited hardware.

Core claim

Few-Shot Architecture Prompting with exactly three supporting examples optimally balances architectural diversity and task-specific focus for vision networks, while Whitespace-Normalized Hash Validation deduplicates generated code 100 times faster than AST parsing and prevents redundant training, enabling efficient large-scale generation of 1900 unique architectures across seven heterogeneous benchmarks with a dataset-balanced evaluation method.

What carries the argument

Few-Shot Architecture Prompting (FSAP) with variable shot counts combined with Whitespace-Normalized Hash Validation for fast deduplication

Load-bearing premise

Observed differences in generated architecture quality across different numbers of prompt examples are caused by the example count itself rather than random variation in LLM sampling, training runs, or data splits.

What would settle it

Re-running the full set of experiments with fixed random seeds, identical training hyperparameters, and deterministic LLM sampling to check whether performance gaps between one-shot through six-shot regimes disappear.

Figures

Figures reproduced from arXiv: 2512.24120 by Avi Goyal, Chandini Vysyaraju, Dmitry Ignatov, Radu Timofte, Raghuvir Duvvuri.

**Figure 1.** Figure 1: Complete architecture generation pipeline showing integration of Few-Shot Architecture Prompting (FSAP) and Whitespace [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗

**Figure 2.** Figure 2: Context overflow phenomenon. Balanced mean accu [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Automated neural network architecture design remains a significant challenge in computer vision. Task diversity and computational constraints require both effective architectures and efficient search methods. Large Language Models (LLMs) present a promising alternative to computationally intensive Neural Architecture Search (NAS), but their application to architecture generation in computer vision has not been systematically studied, particularly regarding prompt engineering and validation strategies. Building on the task-agnostic NNGPT/LEMUR framework, this work introduces and validates two key contributions for computer vision. First, we present Few-Shot Architecture Prompting (FSAP), the first systematic study of the number of supporting examples (n = 1, 2, 3, 4, 5, 6) for LLM-based architecture generation. We find that using n = 3 examples best balances architectural diversity and context focus for vision tasks. Second, we introduce Whitespace-Normalized Hash Validation, a lightweight deduplication method (less than 1 ms) that provides a 100x speedup over AST parsing and prevents redundant training of duplicate computer vision architectures. In large-scale experiments across seven computer vision benchmarks (MNIST, CIFAR-10, CIFAR-100, CelebA, ImageNette, SVHN, Places365), we generated 1,900 unique architectures. We also introduce a dataset-balanced evaluation methodology to address the challenge of comparing architectures across heterogeneous vision tasks. These contributions provide actionable guidelines for LLM-based architecture search in computer vision and establish rigorous evaluation practices, making automated design more accessible to researchers with limited computational resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

n=3 prompting gives a usable heuristic in their runs but the optimality claim lacks controls for sampling noise and needs variance checks.

read the letter

The paper's main takeaway is that three few-shot examples strike the best balance for getting LLMs to generate useful vision architectures, paired with a whitespace-normalized hash that deduplicates models in under a millisecond. They run this on top of the NNGPT/LEMUR setup by sweeping example counts from one to six and producing 1900 architectures across seven benchmarks, plus a dataset-balanced evaluation method to compare results fairly across tasks like MNIST and Places365. The hash method is the cleaner piece of engineering here because it is fast, simple to implement, and directly cuts wasted training time compared with full AST checks. The scale of the generation effort also shows they put real work into testing the prompting idea at volume rather than just describing it. What holds up less well is the central claim that n=3 is optimal. The reported results give no error bars, no repeated LLM seeds, no temperature sweeps, and no statistical tests, even though both architecture generation and downstream training are stochastic. Without those controls the ranking could easily shift with different random draws, so the practical guideline rests on thinner evidence than the abstract suggests. This is the sort of paper that helps groups already experimenting with LLM-based architecture search who want concrete prompting numbers and a cheap validation shortcut rather than a new theoretical framework. A reader working on low-resource NAS would get immediate value from the hash trick and the prompting sweep. It deserves a serious referee because the experimental volume is large enough to be worth checking and the contributions are scoped tightly enough to be actionable, even if the statistical side needs tightening.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Few-Shot Architecture Prompting (FSAP) as a systematic study of the number of in-context examples (n=1 to 6) for LLM-based generation of computer-vision architectures, concluding that n=3 optimally balances diversity and task focus. It also proposes Whitespace-Normalized Hash Validation for sub-millisecond deduplication that yields a claimed 100x speedup over AST parsing. Large-scale experiments generate 1,900 unique architectures and evaluate them on seven heterogeneous vision benchmarks (MNIST through Places365) using a dataset-balanced protocol.

Significance. If the central empirical ranking survives statistical controls, the work supplies concrete, actionable prompting guidelines for LLM-driven architecture search in vision and a lightweight deduplication primitive that materially reduces wasted training cycles. The scale of the reported generation (1,900 architectures) and the emphasis on cross-task comparability are strengths that could influence practical NAS pipelines with limited compute.

major comments (2)

[Abstract and experimental results] Abstract and experimental results section: the claim that n=3 'best balances architectural diversity and context focus' rests on single-generation runs per prompting regime. No error bars, repeated LLM sampling seeds, temperature sweeps, or hypothesis testing (t-test/ANOVA) are reported, despite the stochasticity of both LLM decoding and network training. This leaves the observed accuracy ordering vulnerable to uncontrolled random variation rather than to the number of shots.
[§3 and §4] §3 (methodology) and §4 (experiments): training protocols, optimizer settings, data splits, and number of independent training runs per architecture are not specified. Without these details the performance differences across n values cannot be attributed to prompting strategy rather than to hyper-parameter or split noise.

minor comments (2)

[§4] The description of the dataset-balanced evaluation methodology is too brief; a concrete formula or pseudocode would clarify how accuracies are aggregated across tasks with different class counts and image resolutions.
[Figures and tables] Table or figure captions should explicitly state the number of independent trials and whether error bars represent standard deviation or standard error.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve reproducibility and statistical transparency.

read point-by-point responses

Referee: [Abstract and experimental results] Abstract and experimental results section: the claim that n=3 'best balances architectural diversity and context focus' rests on single-generation runs per prompting regime. No error bars, repeated LLM sampling seeds, temperature sweeps, or hypothesis testing (t-test/ANOVA) are reported, despite the stochasticity of both LLM decoding and network training. This leaves the observed accuracy ordering vulnerable to uncontrolled random variation rather than to the number of shots.

Authors: We acknowledge that each prompting regime (n=1 to 6) was evaluated from single LLM generation runs to keep the overall experiment tractable at the reported scale of 1,900 architectures. The n=3 result is supported by its consistent ranking across all seven heterogeneous benchmarks under a dataset-balanced protocol. In the revision we will (i) explicitly state the single-run limitation, (ii) add training-run error bars for the final reported accuracies where additional compute permits, and (iii) include a brief discussion of why full multi-seed LLM sampling and formal hypothesis testing were not performed. We do not claim statistical significance beyond the observed cross-benchmark pattern. revision: partial
Referee: [§3 and §4] §3 (methodology) and §4 (experiments): training protocols, optimizer settings, data splits, and number of independent training runs per architecture are not specified. Without these details the performance differences across n values cannot be attributed to prompting strategy rather than to hyper-parameter or split noise.

Authors: We agree that these implementation details are necessary for reproducibility. The revised manuscript will expand §3 and §4 to specify: Adam optimizer with learning rate 0.001 and standard weight decay, the exact train/validation/test splits used for each of the seven benchmarks, and that each generated architecture was trained once (to prioritize breadth of 1,900 unique models). These additions will make clear that performance differences are measured under identical training conditions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results from direct architecture generation and benchmarking

full rationale

The paper reports an experimental study generating 1,900 unique architectures via LLM prompting with varying shot counts (n=1..6) and evaluating them on seven vision benchmarks. The central claim that n=3 best balances diversity and context focus is obtained by comparing measured accuracies and diversity metrics across regimes, not by any equation, fitted parameter, or self-citation that reduces the output to the input by construction. The Whitespace-Normalized Hash Validation is introduced as a lightweight implementation with a reported 100x speedup over AST parsing; its correctness is verified by direct timing and deduplication counts rather than derived from prior results. No load-bearing uniqueness theorems, ansatzes smuggled via citation, or renaming of known patterns appear in the described methodology. The evaluation methodology is self-contained against the stated benchmarks and does not invoke external derivations.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The optimal n=3 is selected after empirical sweep; the framework assumes LLMs can produce executable architectures from natural-language prompts.

free parameters (1)

number of prompt examples n
Experimentally chosen optimum after testing n=1 to 6; value 3 is reported as best balance for vision tasks.

axioms (1)

domain assumption LLMs can generate syntactically valid and trainable neural network code from few-shot prompts
Inherited from the NNGPT/LEMUR framework and required for the prompting experiments to be meaningful.

pith-pipeline@v0.9.0 · 5606 in / 1282 out tokens · 47989 ms · 2026-05-16T19:04:48.401714+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Delta-Based Neural Architecture Search: LLM Fine-Tuning via Code Diffs
cs.LG 2026-05 unverdicted novelty 7.0

Fine-tuned 7B LLMs generating unified diffs for neural architecture refinement achieve 66-75% valid rates and 64-66% mean first-epoch accuracy, outperforming full-generation baselines by large margins while cutting ou...
Closed-Loop LLM Discovery of Non-Standard Channel Priors in Vision Models
cs.CV 2026-01 unverdicted novelty 6.0

Closed-loop LLM search with AST-generated examples discovers non-standard channel widths that improve vision model performance over initial architectures on CIFAR-100.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · cited by 2 Pith papers · 10 internal anchors

[1]

Augmentgest: Can random data cropping augmentation boost gesture recognition performance?arXiv preprint arXiv:2506.07216, 2025

Nada Aboudeshish, Dmitry Ignatov, and Radu Timofte. Augmentgest: Can random data cropping augmentation boost gesture recognition performance?arXiv preprint arXiv:2506.07216, 2025. 3

work page arXiv 2025
[2]

Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, et al. Language mod- els are few-shot learners.Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020. 2, 7

work page 1901
[3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen- rique Ponde de Oliveira Pinto, Jared Kaplan, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Program Synthesis with Large Language Models

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2022. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2022
[5]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

DeepSeek-AI. DeepSeek-Coder: When the large lan- guage model meets programming.arXiv preprint arXiv:2401.14196, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Ai on the edge: An automated pipeline for pytorch-to-android deployment and benchmarking.Preprints, 2025

Saif U Din, Muhammad Ahsan Hussain, Mohsin Ikram, Dmitry Ignatov, and Radu Timofte. Ai on the edge: An automated pipeline for pytorch-to-android deployment and benchmarking.Preprints, 2025. 2

work page 2025
[7]

Vist-gpt: Ush- ering in the era of visual storytelling with llms?arXiv preprint arXiv:2504.19267, 2025

Mohamed Gado, Towhid Taliee, Muhammad Danish Memon, Dmitry Ignatov, and Radu Timofte. Vist-gpt: Ush- ering in the era of visual storytelling with llms?arXiv preprint arXiv:2504.19267, 2025. 2

work page arXiv 2025
[8]

Lemur neural net- work dataset: Towards seamless automl.arXiv preprint arXiv:2504.10552, 2025

Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Ben- tyn, Dmitry Ignatov, and Radu Timofte. Lemur neural net- work dataset: Towards seamless automl.arXiv preprint arXiv:2504.10552, 2025. 1, 2

work page internal anchor Pith review arXiv 2025
[9]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations (ICLR),

work page
[10]

Llm as a neural architect: Controlled generation of image cap- tioning models under strict api contracts.arXiv preprint arXiv:2512.14706, 2025

Krunal Jesani, Dmitry Ignatov, and Radu Timofte. Llm as a neural architect: Controlled generation of image cap- tioning models under strict api contracts.arXiv preprint arXiv:2512.14706, 2025. 2

work page arXiv 2025
[11]

A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Waleed Khalid, Dmitry Ignatov, and Radu Timofte. A retrieval-augmented generation approach to extracting al- gorithmic logic from neural networks.arXiv preprint arXiv:2512.04329, 2025. 2

work page internal anchor Pith review arXiv 2025
[12]

Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, and Radu Timofte. Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparame- ter Tuning? InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 5664–5674, 2025. 2

work page 2025
[13]

Nngpt: Rethinking automl with large language models.arXiv preprint arXiv:2511.20333, 2025

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chan- dini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Igna- tov, and Radu Timofte. Nngpt: Rethinking automl with large language models.arXiv preprint arXiv:2511.20333, 2025. 1, 2

work page internal anchor Pith review arXiv 2025
[14]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. 3

work page 2009
[15]

Gradient-based learning applied to document recog- nition.Proceedings of the IEEE, 86(11):2278–2324, 1998

Yann LeCun, L ´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recog- nition.Proceedings of the IEEE, 86(11):2278–2324, 1998. 3

work page 1998
[16]

StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muen- nighoff, Denis Kocetkov, Chenghao Mou, and Leandro von Werra. StarCoder: May the source be with you!arXiv preprint arXiv:2305.06161, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

DARTS: Differentiable architecture search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. InInternational Confer- ence on Learning Representations (ICLR), 2019. 2

work page 2019
[18]

Deep learning face attributes in the wild

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3730–3738, 2015. 3

work page 2015
[19]

Preparation of Fractal-Inspired Computational Architectures for Automated Neural Design Exploration

Yash Mittal, Dmitry Ignatov, and Radu Timofte. Prepara- tion of fractal-inspired computational architectures for ad- vanced large language model analysis.arXiv preprint arXiv:2511.07329, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bis- sacco, Bo Wu, and Andrew Y . Ng. Reading digits in nat- ural images with unsupervised feature learning.NIPS Work- shop on Deep Learning and Unsupervised Feature Learning,

work page
[21]

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, and Caiming Xiong. CodeGen: An open large language model for code with multi-turn program synthesis.arXiv preprint arXiv:2203.13474, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Le, and Jeff Dean

Hieu Pham, Melody Guan, Barret Zoph, Quoc V . Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. InInternational Conference on Machine Learning (ICML), pages 4095–4104, 2018. 2

work page 2018
[23]

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V . Le. Regularized evolution for image classifier architecture search. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4780–4789, 2019. 1, 2

work page 2019
[24]

Code Llama: Open Foundation Models for Code

Baptiste Rozi `ere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Tan, and Gabriel Synnaeve. Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[25]

Explor- ing the collaboration between vision models and llms for en- hanced image classification.Preprints, 2025

Bhavya Rupani, Dmitry Ignatov, and Radu Timofte. Explor- ing the collaboration between vision models and llms for en- hanced image classification.Preprints, 2025. 2

work page 2025
[26]

Wilkerson, and Alex Aiken

Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Win- nowing: Local algorithms for document fingerprinting. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM, 2003. 1, 2

work page 2003
[27]

Lemur 2: Unlocking neural network diversity for ai.arXiv preprint, 2025

Tolgay Atincand Uzun, Waleed Khalid, Saif U Din, Sai Re- vanth Mulukuledu, Akashdeep Singh, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Yashkumar Rajeshbhai Lukhi, Ahsan Hussain, Krunal Jesani, Usha Shrestha, Yash Mittal, Roman Kochnev, Pritam Kadam, Mohsin Ikram, 9 Harsh Rameshbhai Moradiya, Alice Arslanian, Dmitry Igna- tov, and Radu Timofte. Lemur...

work page 2025
[28]

Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Process- ing Systems (NeurIPS), 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Process- ing Systems (NeurIPS), 35:24824–24837, 2022. 1, 2

work page 2022
[29]

Places: A 10 million image database for scene recognition

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. InIEEE Transactions on Pattern Anal- ysis and Machine Intelligence (PAMI), pages 1452–1464,

work page
[30]

Barret Zoph and Quoc V . Le. Neural architecture search with reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2017. 1, 2 10

work page 2017

[1] [1]

Augmentgest: Can random data cropping augmentation boost gesture recognition performance?arXiv preprint arXiv:2506.07216, 2025

Nada Aboudeshish, Dmitry Ignatov, and Radu Timofte. Augmentgest: Can random data cropping augmentation boost gesture recognition performance?arXiv preprint arXiv:2506.07216, 2025. 3

work page arXiv 2025

[2] [2]

Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, et al

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Sub- biah, Jared Kaplan, Prafulla Dhariwal, et al. Language mod- els are few-shot learners.Advances in Neural Information Processing Systems (NeurIPS), 33:1877–1901, 2020. 2, 7

work page 1901

[3] [3]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Hen- rique Ponde de Oliveira Pinto, Jared Kaplan, and Wojciech Zaremba. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2021

[4] [4]

Program Synthesis with Large Language Models

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732, 2022. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2022

[5] [5]

DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence

DeepSeek-AI. DeepSeek-Coder: When the large lan- guage model meets programming.arXiv preprint arXiv:2401.14196, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[6] [6]

Ai on the edge: An automated pipeline for pytorch-to-android deployment and benchmarking.Preprints, 2025

Saif U Din, Muhammad Ahsan Hussain, Mohsin Ikram, Dmitry Ignatov, and Radu Timofte. Ai on the edge: An automated pipeline for pytorch-to-android deployment and benchmarking.Preprints, 2025. 2

work page 2025

[7] [7]

Vist-gpt: Ush- ering in the era of visual storytelling with llms?arXiv preprint arXiv:2504.19267, 2025

Mohamed Gado, Towhid Taliee, Muhammad Danish Memon, Dmitry Ignatov, and Radu Timofte. Vist-gpt: Ush- ering in the era of visual storytelling with llms?arXiv preprint arXiv:2504.19267, 2025. 2

work page arXiv 2025

[8] [8]

Lemur neural net- work dataset: Towards seamless automl.arXiv preprint arXiv:2504.10552, 2025

Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Ben- tyn, Dmitry Ignatov, and Radu Timofte. Lemur neural net- work dataset: Towards seamless automl.arXiv preprint arXiv:2504.10552, 2025. 1, 2

work page internal anchor Pith review arXiv 2025

[9] [9]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen- Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. InIn- ternational Conference on Learning Representations (ICLR),

work page

[10] [10]

Llm as a neural architect: Controlled generation of image cap- tioning models under strict api contracts.arXiv preprint arXiv:2512.14706, 2025

Krunal Jesani, Dmitry Ignatov, and Radu Timofte. Llm as a neural architect: Controlled generation of image cap- tioning models under strict api contracts.arXiv preprint arXiv:2512.14706, 2025. 2

work page arXiv 2025

[11] [11]

A Retrieval-Augmented Generation Approach to Extracting Algorithmic Logic from Neural Networks

Waleed Khalid, Dmitry Ignatov, and Radu Timofte. A retrieval-augmented generation approach to extracting al- gorithmic logic from neural networks.arXiv preprint arXiv:2512.04329, 2025. 2

work page internal anchor Pith review arXiv 2025

[12] [12]

Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, and Radu Timofte. Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparame- ter Tuning? InProceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 5664–5674, 2025. 2

work page 2025

[13] [13]

Nngpt: Rethinking automl with large language models.arXiv preprint arXiv:2511.20333, 2025

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chan- dini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Igna- tov, and Radu Timofte. Nngpt: Rethinking automl with large language models.arXiv preprint arXiv:2511.20333, 2025. 1, 2

work page internal anchor Pith review arXiv 2025

[14] [14]

Learning multiple layers of features from tiny images

Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. 3

work page 2009

[15] [15]

Gradient-based learning applied to document recog- nition.Proceedings of the IEEE, 86(11):2278–2324, 1998

Yann LeCun, L ´eon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recog- nition.Proceedings of the IEEE, 86(11):2278–2324, 1998. 3

work page 1998

[16] [16]

StarCoder: may the source be with you!

Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muen- nighoff, Denis Kocetkov, Chenghao Mou, and Leandro von Werra. StarCoder: May the source be with you!arXiv preprint arXiv:2305.06161, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[17] [17]

DARTS: Differentiable architecture search

Hanxiao Liu, Karen Simonyan, and Yiming Yang. DARTS: Differentiable architecture search. InInternational Confer- ence on Learning Representations (ICLR), 2019. 2

work page 2019

[18] [18]

Deep learning face attributes in the wild

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. InProceedings of the IEEE International Conference on Computer Vision (ICCV), pages 3730–3738, 2015. 3

work page 2015

[19] [19]

Preparation of Fractal-Inspired Computational Architectures for Automated Neural Design Exploration

Yash Mittal, Dmitry Ignatov, and Radu Timofte. Prepara- tion of fractal-inspired computational architectures for ad- vanced large language model analysis.arXiv preprint arXiv:2511.07329, 2025. 2

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bis- sacco, Bo Wu, and Andrew Y . Ng. Reading digits in nat- ural images with unsupervised feature learning.NIPS Work- shop on Deep Learning and Unsupervised Feature Learning,

work page

[21] [21]

CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang, Hiroaki Hayashi, Lifu Tu, Huan Wang, Yingbo Zhou, and Caiming Xiong. CodeGen: An open large language model for code with multi-turn program synthesis.arXiv preprint arXiv:2203.13474, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[22] [22]

Le, and Jeff Dean

Hieu Pham, Melody Guan, Barret Zoph, Quoc V . Le, and Jeff Dean. Efficient neural architecture search via parameter sharing. InInternational Conference on Machine Learning (ICML), pages 4095–4104, 2018. 2

work page 2018

[23] [23]

Esteban Real, Alok Aggarwal, Yanping Huang, and Quoc V . Le. Regularized evolution for image classifier architecture search. InProceedings of the AAAI Conference on Artificial Intelligence, pages 4780–4789, 2019. 1, 2

work page 2019

[24] [24]

Code Llama: Open Foundation Models for Code

Baptiste Rozi `ere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Tan, and Gabriel Synnaeve. Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950, 2023. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[25] [25]

Explor- ing the collaboration between vision models and llms for en- hanced image classification.Preprints, 2025

Bhavya Rupani, Dmitry Ignatov, and Radu Timofte. Explor- ing the collaboration between vision models and llms for en- hanced image classification.Preprints, 2025. 2

work page 2025

[26] [26]

Wilkerson, and Alex Aiken

Saul Schleimer, Daniel S. Wilkerson, and Alex Aiken. Win- nowing: Local algorithms for document fingerprinting. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 76–85. ACM, 2003. 1, 2

work page 2003

[27] [27]

Lemur 2: Unlocking neural network diversity for ai.arXiv preprint, 2025

Tolgay Atincand Uzun, Waleed Khalid, Saif U Din, Sai Re- vanth Mulukuledu, Akashdeep Singh, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Yashkumar Rajeshbhai Lukhi, Ahsan Hussain, Krunal Jesani, Usha Shrestha, Yash Mittal, Roman Kochnev, Pritam Kadam, Mohsin Ikram, 9 Harsh Rameshbhai Moradiya, Alice Arslanian, Dmitry Igna- tov, and Radu Timofte. Lemur...

work page 2025

[28] [28]

Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Process- ing Systems (NeurIPS), 35:24824–24837, 2022

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances in Neural Information Process- ing Systems (NeurIPS), 35:24824–24837, 2022. 1, 2

work page 2022

[29] [29]

Places: A 10 million image database for scene recognition

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. InIEEE Transactions on Pattern Anal- ysis and Machine Intelligence (PAMI), pages 1452–1464,

work page

[30] [30]

Barret Zoph and Quoc V . Le. Neural architecture search with reinforcement learning. InInternational Conference on Learning Representations (ICLR), 2017. 1, 2 10

work page 2017