pith. sign in

arxiv: 2606.30109 · v1 · pith:HCCNVM4Znew · submitted 2026-06-29 · 💻 cs.RO

TacEvo: Self-Evolving Architecture Discovery for Robotic Tactile Perception via LLM-Driven Quality-Diversity Search

Pith reviewed 2026-06-30 05:44 UTC · model grok-4.3

classification 💻 cs.RO
keywords tactile sensingneural architecture searchLLMquality-diversityMAP-Elitesrobotic perceptionforce regressiongrating classification
0
0 comments X

The pith

TacEvo uses an LLM to generate code mutations inside a quality-diversity loop that produces trainable tactile networks matching or beating an expert baseline on force and texture tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-based tactile sensors turn surface deformation into images whose interpretation requires networks tuned to specific sensor physics, yet conventional design still relies on repeated manual iteration by experts. TacEvo replaces that iteration with an automated cycle in which an LLM proposes code-level mutations and crossovers while a MAP-Elites archive retains diverse high-fitness architectures according to two behavioral descriptors. The reported outcome is that 94 to 96 percent of generated networks remain trainable and that the best validation fitness rises 56 to 96 percent across twenty generations. In a separate high-fidelity test the evolved networks equal the expert baseline on force regression and exceed it on fine-grained grating classification. The framework therefore treats architecture discovery itself as an optimizable, feedback-driven process rather than a fixed human-designed search space.

Core claim

TacEvo shows that LLM-generated code mutations and crossovers, filtered only by Architectural Diversity and Efficiency Ratio inside a MAP-Elites loop, reliably yield valid networks whose downstream fitness on ViTacTip force regression and grating classification improves substantially over generations and reaches or surpasses a hand-crafted expert baseline in post-search evaluation.

What carries the argument

MAP-Elites quality-diversity archive driven by LLM code-level mutations and crossovers, with Architectural Diversity and Efficiency Ratio as the two behavioral descriptors that shape the preserved population.

If this is right

  • The same search loop can be applied to new tactile sensor hardware without redesigning a discrete search space.
  • The Efficiency Ratio descriptor produces networks whose compute-size trade-offs remain usable for onboard robot inference.
  • Prompt reuse across generations allows the system to favor mutation styles that historically produced trainable improvements.
  • High autonomous generation reliability indicates that the two-descriptor filter removes most invalid code before training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same LLM-QD pattern could be tested on other sensor modalities whose images are likewise physics-specific.
  • If the LLM version changes, the fraction of trainable outputs may shift and require re-tuning of the diversity descriptors.
  • The method implicitly treats architecture code as an evolvable program, which may extend to other code-level design tasks beyond neural nets.

Load-bearing premise

Fitness gains observed after twenty generations arise from the LLM-driven search process rather than from differences in training code, hyper-parameters, or baseline implementation details.

What would settle it

Run an otherwise identical twenty-generation MAP-Elites loop that replaces every LLM mutation with random but syntactically valid architecture edits and measure whether the fitness trajectory and final performance remain statistically indistinguishable from the reported TacEvo results.

Figures

Figures reproduced from arXiv: 2606.30109 by Dandan Zhang, Lan Wei, Mohammed AbuSadeh.

Figure 1
Figure 1. Figure 1: Overview of TacEvo. (a) Closed-loop self-evolving architecture discovery with LLM-based code generation, candidate validation, low-fidelity evaluation, [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 5
Figure 5. Figure 5: Four CVT maps showing the final state of the network and prompt [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
Figure 4
Figure 4. Figure 4: Number of valid, trainable networks per generation. Candidates must [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Archive growth over 20 generations, measured by the number of filled [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Force prediction test-set MSE over 20 training seeds for the expert [PITH_FULL_IMAGE:figures/full_fig_p007_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Grating classification test accuracy over 20 training seeds for the [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Test-set confusion matrices for grating classification for the baseline and four TacEvo variants on the same seed, showing per-class performance across [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗
read the original abstract

Vision-based tactile sensing converts contact-induced surface deformation into images, enabling robots to infer contact forces and fine surface textures that are not accessible through conventional vision alone. However, tactile images are sensor- and physics-specific, so effective architectures often require expert intuition and extensive manual iteration. Existing neural architecture search (NAS) pipelines can reduce this burden, but they are often computationally expensive and restricted to hand-designed search spaces, which limits architectural novelty and diversity. We introduce TacEvo, a self-evolving architecture discovery framework that improves network designs from downstream feedback. TacEvo uses an LLM to generate code-level mutations and crossovers, and a MAP-Elites quality-diversity loop that preserves diverse elite architectures while preferentially reusing prompts that consistently yield improvements. Exploration is guided by two behavioural descriptors, Architectural Diversity and Efficiency Ratio, which encourage coverage across structural variations and compute-size trade-offs. On ViTacTip force regression and grating classification, TacEvo achieves high autonomous generation reliability (96.0%/94.5% trainable) and improves best validation fitness over 20 generations by 56.1%/96.1%. In a 20-seed post-search high-fidelity evaluation, TacEvo matches the expert baseline on force prediction and outperforms it on fine-grained grating classification. These results suggest that LLM-driven self-evolving search constitutes a practical paradigm for AI-assisted scientific discovery in specialised robotic sensing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TacEvo, a self-evolving architecture discovery framework for vision-based tactile sensing that employs an LLM to generate code-level mutations and crossovers within a MAP-Elites quality-diversity loop. Exploration is guided by two behavioral descriptors (Architectural Diversity and Efficiency Ratio), with preferential reuse of successful prompts. On ViTacTip force regression and grating classification tasks, it reports 96.0%/94.5% trainable generation rates and best-validation-fitness gains of 56.1%/96.1% over 20 generations; a 20-seed high-fidelity evaluation shows the evolved networks matching an expert baseline on force prediction and outperforming it on fine-grained grating classification.

Significance. If the fitness gains and task improvements can be robustly attributed to the LLM-driven QD search rather than unstated training details or generator variance, the work would demonstrate a practical route to automated architecture design in physics-specific robotic sensing, reducing dependence on manual expert iteration.

major comments (2)
  1. [§4 and §5] §4 (Experimental results) and §5 (post-search evaluation): the central performance claims (56.1%/96.1% fitness lifts and grating-classification superiority) rest on comparisons whose attribution to the MAP-Elites archive, LLM mutations, and prompt-reuse mechanism is not isolated. No ablations are described that hold the LLM generator, training protocol, data splits, and optimizer fixed while disabling the quality-diversity archive or preferential prompt reuse; without these controls the reported gains cannot be confidently ascribed to the proposed search process rather than variance in LLM outputs or baseline choices.
  2. [Abstract and §4] Abstract and §4: the quantitative claims supply concrete percentages but contain no mention of statistical tests, error bars, or variance across the 20 generations or 20-seed evaluations, leaving the reliability of the reported improvements unevaluable.
minor comments (2)
  1. [Abstract and §4] The abstract and results sections should explicitly state the exact expert baseline architectures, data splits, and hyper-parameter settings used for comparison.
  2. [§3] Notation for the two behavioral descriptors (Architectural Diversity and Efficiency Ratio) should be defined with equations or pseudocode in the methods section for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify gaps in component isolation and statistical reporting. We address each major comment below and commit to revisions that strengthen attribution and reliability assessment.

read point-by-point responses
  1. Referee: [§4 and §5] §4 (Experimental results) and §5 (post-search evaluation): the central performance claims (56.1%/96.1% fitness lifts and grating-classification superiority) rest on comparisons whose attribution to the MAP-Elites archive, LLM mutations, and prompt-reuse mechanism is not isolated. No ablations are described that hold the LLM generator, training protocol, data splits, and optimizer fixed while disabling the quality-diversity archive or preferential prompt reuse; without these controls the reported gains cannot be confidently ascribed to the proposed search process rather than variance in LLM outputs or baseline choices.

    Authors: We agree that the manuscript does not contain ablations that disable the MAP-Elites archive or preferential prompt reuse while holding the LLM generator, training protocol, data splits, and optimizer fixed. This limits confident attribution of the fitness gains specifically to the quality-diversity and prompt-reuse components. In the revised version we will add these controls: (i) a non-QD evolutionary baseline that retains LLM mutations/crossovers but removes the archive and behavioral descriptors, and (ii) a version without preferential prompt reuse. Results will be reported alongside the original TacEvo curves to isolate the contribution of each mechanism. revision: yes

  2. Referee: [Abstract and §4] Abstract and §4: the quantitative claims supply concrete percentages but contain no mention of statistical tests, error bars, or variance across the 20 generations or 20-seed evaluations, leaving the reliability of the reported improvements unevaluable.

    Authors: We acknowledge that the current manuscript reports point estimates without error bars, variance across generations or seeds, or statistical tests. In the revision we will augment §4 with standard-deviation bands on the fitness-evolution plots (across the 20 generations) and on the 20-seed post-search evaluation. We will also add the results of paired statistical tests (Wilcoxon signed-rank or t-tests, as appropriate) between TacEvo and the expert baseline for both tasks, together with p-values, to allow readers to assess the reliability of the reported improvements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical search results independent of inputs

full rationale

The paper describes an empirical MAP-Elites + LLM mutation procedure evaluated on tactile sensing tasks. Reported fitness lifts (56.1%/96.1%) and post-search comparisons are measured outcomes of running the loop on held-out validation data, not quantities defined by the same parameters or reduced to self-citations. No equations, uniqueness theorems, or ansatzes appear that would make any result equivalent to its inputs by construction. The method is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete free parameters, axioms, or invented entities; the central claim rests on the unstated premise that LLM code generation plus the chosen behavioral descriptors constitute a sufficient search mechanism.

pith-pipeline@v0.9.1-grok · 5788 in / 1206 out tokens · 35062 ms · 2026-06-30T05:44:39.105114+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 9 canonical work pages · 5 internal anchors

  1. [1]

    Visuo- haptic object perception for robots: an overview,

    N. Navarro-Guerrero, S. Toprak, J. Josifovski, and L. Jamone, “Visuo- haptic object perception for robots: an overview,”Autonomous Robots, vol. 47, no. 4, pp. 377–403, 2023

  2. [2]

    Gelsight: High-resolution robot tactile sensors for estimating geometry and force,

    W. Yuan, S. Dong, and E. H. Adelson, “Gelsight: High-resolution robot tactile sensors for estimating geometry and force,”Sensors, vol. 17, no. 12, p. 2762, 2017

  3. [3]

    Taceva: A performance evaluation framework for vision-based tactile sensors,

    Q. Cong, S. Oh, W. Fan, S. Luo, K. Althoefer, and D. Zhang, “Taceva: A performance evaluation framework for vision-based tactile sensors,” Advanced Intelligent Systems, p. e202501179, 2025

  4. [4]

    Higuera, A

    C. Higuera, A. Sharma, C. K. Bodduluri, T. Fan, P. Lancaster, M. Kalakr- ishnan, M. Kaess, B. Boots, M. Lambeta, T. Wuet al., “Sparsh: Self- supervised touch representations for vision-based tactile sensing,”arXiv preprint arXiv:2410.24090, 2024

  5. [5]

    A comprehensive survey of neural architecture search: Challenges and solutions,

    P. Ren, Y . Xiao, X. Chang, P.-Y . Huang, Z. Li, X. Chen, and X. Wang, “A comprehensive survey of neural architecture search: Challenges and solutions,”ACM Computing Surveys (CSUR), vol. 54, no. 4, pp. 1–34, 2021

  6. [6]

    A review of neural architecture search methods for super-resolution imaging,

    J. Guo, X. Wang, and Y . Guo, “A review of neural architecture search methods for super-resolution imaging,”Artificial Intelligence Review, 2026

  7. [7]

    Weight-sharing neural architecture search: A battle to shrink the optimization gap,

    L. Xie, X. Chen, K. Bi, L. Wei, Y . Xu, L. Wang, Z. Chen, A. Xiao, J. Chang, X. Zhanget al., “Weight-sharing neural architecture search: A battle to shrink the optimization gap,”ACM Computing Surveys (CSUR), vol. 54, no. 9, pp. 1–37, 2021

  8. [8]

    A review on code generation with llms: Application and evaluation,

    J. Wang and Y . Chen, “A review on code generation with llms: Application and evaluation,” in2023 IEEE International Conference on Medical Artificial Intelligence (MedAI). IEEE, 2023, pp. 284–289

  9. [9]

    A Survey on Code Generation with LLM-based Agents

    Y . Dong, X. Jiang, J. Qian, T. Wang, K. Zhang, Z. Jin, and G. Li, “A survey on code generation with llm-based agents,”arXiv preprint arXiv:2508.00083, 2025

  10. [10]

    Evaluating Large Language Models Trained on Code

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockmanet al., “Evaluating large language models trained on code,”arXiv preprint arXiv:2107.03374, 2021

  11. [11]

    Mathematical discoveries from program search with large language models,

    B. Romera-Paredes, M. Barekatain, A. Novikov, M. Balog, M. P. Kumar, E. Dupont, F. J. Ruiz, J. S. Ellenberg, P. Wang, O. Fawziet al., “Mathematical discoveries from program search with large language models,”Nature, vol. 625, no. 7995, pp. 468–475, 2024

  12. [12]

    Nas-bench-101: Towards reproducible neural architecture search,

    C. Ying, A. Klein, E. Christiansen, E. Real, K. Murphy, and F. Hutter, “Nas-bench-101: Towards reproducible neural architecture search,” in International conference on machine learning. PMLR, 2019, pp. 7105– 7114

  13. [13]

    Using Centroidal Voronoi Tessellations to Scale Up the Multi-dimensional Archive of Phenotypic Elites Algorithm

    V . Vassiliades, K. Chatzilygeroudis, and J.-B. Mouret, “Using centroidal vorono¨ı tessellations to scale up the multi-dimensional archive of phe- notypic elites algorithm,”arXiv preprint arXiv:1610.05729, 2016

  14. [14]

    Vitactip: Design and verification of a novel biomimetic physical vision-tactile fusion sensor,

    W. Fan, H. Li, W. Si, S. Luo, N. Lepora, and D. Zhang, “Vitactip: Design and verification of a novel biomimetic physical vision-tactile fusion sensor,” in2024 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2024, pp. 1056–1062

  15. [15]

    Design and benchmarking of a multimodality sensor for robotic manipulation with gan-based cross-modality interpretation,

    D. Zhang, W. Fan, J. Lin, H. Li, Q. Cong, W. Liu, N. F. Lepora, and S. Luo, “Design and benchmarking of a multimodality sensor for robotic manipulation with gan-based cross-modality interpretation,” IEEE Transactions on Robotics, vol. 41, pp. 1278–1295, 2025

  16. [16]

    Automl: A survey of the state-of-the-art,

    X. He, K. Zhao, and X. Chu, “Automl: A survey of the state-of-the-art,” Knowledge-based systems, vol. 212, p. 106622, 2021

  17. [17]

    Eight years of automl: categorisation, review and trends,

    R. Barbudo, S. Ventura, and J. R. Romero, “Eight years of automl: categorisation, review and trends,”Knowledge and Information Systems, vol. 65, no. 12, pp. 5097–5149, 2023

  18. [18]

    Neural architecture search: A survey,

    T. Elsken, J. H. Metzen, and F. Hutter, “Neural architecture search: A survey,”Journal of Machine Learning Research, vol. 20, no. 55, pp. 1–21, 2019

  19. [19]

    Design principle transfer in neural architecture search via large language models,

    X. Zhou, X. Wu, L. Feng, Z. Lu, and K. C. Tan, “Design principle transfer in neural architecture search via large language models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 21, 2025, pp. 23 000–23 008

  20. [20]

    Software architecture-based self-adaptation in robotics,

    E. Alberts, I. Gerostathopoulos, I. Malavolta, C. H. Corbato, and P. Lago, “Software architecture-based self-adaptation in robotics,”Journal of Systems and Software, vol. 219, p. 112258, 2025

  21. [21]

    Optimization of forcemyography sensor placement for arm movement recognition,

    X. Xu, Z. Du, H. Zhang, R. Zhang, Z. Hong, Q. Huang, and B. Han, “Optimization of forcemyography sensor placement for arm movement recognition,” in2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, pp. 9845–9850

  22. [22]

    Optimal placement of passive sensors for robot localisation,

    F. Zenatti, D. Fontanelli, L. Palopoli, D. Macii, and P. Nazemzadeh, “Optimal placement of passive sensors for robot localisation,” in2016 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4586–4593

  23. [23]

    AI agentic programming: A survey of techniques, challenges, and opportunities.arXiv preprint arXiv:2508.11126,

    H. Wang, J. Gong, H. Zhang, J. Xu, and Z. Wang, “Ai agentic programming: A survey of techniques, challenges, and opportunities,” arXiv preprint arXiv:2508.11126, 2025

  24. [24]

    EvoPrompting: Language models for code-level neural architecture search,

    A. Chen, D. M. Dohan, and D. R. So, “EvoPrompting: Language models for code-level neural architecture search,”arXiv preprint arXiv:2302.14838, 2023. [Online]. Available: https://arxiv.org/abs/2302. 14838

  25. [25]

    LL- Matic: Neural architecture search via large language models and quality diversity optimization,

    M. U. Nasir, S. Earle, J. Togelius, S. James, and C. Cleghorn, “LL- Matic: Neural architecture search via large language models and quality diversity optimization,”arXiv preprint arXiv:2306.01102, 2024

  26. [26]

    Illuminating search spaces by mapping elites

    J.-B. Mouret and J. Clune, “Illuminating search spaces by mapping elites,”arXiv preprint arXiv:1504.04909, 2015

  27. [27]

    Using cen- troidal voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm,

    V . Vassiliades, K. Chatzilygeroudis, and J.-B. Mouret, “Using cen- troidal voronoi tessellations to scale up the multidimensional archive of phenotypic elites algorithm,”IEEE Transactions on Evolutionary Computation, vol. 22, no. 4, pp. 623–630, 2017

  28. [28]

    Robots that can adapt like animals,

    A. Cully, J. Clune, D. Tarapore, and J.-B. Mouret, “Robots that can adapt like animals,”Nature, vol. 521, no. 7553, pp. 503–507, 2015

  29. [29]

    AlphaEvolve: A coding agent for scientific and algorithmic discovery

    A. Novikov, N. V ˜u, M. Eisenberger, E. Dupont, P.-S. Huang, A. Z. Wagner, S. Shirobokov, B. Kozlovskii, F. J. Ruiz, A. Mehrabianet al., “Alphaevolve: A coding agent for scientific and algorithmic discovery,” arXiv preprint arXiv:2506.13131, 2025

  30. [30]

    The claude 3 model family: Opus, sonnet, haiku,

    Anthropic, “The claude 3 model family: Opus, sonnet, haiku,” 2024, accessed: 2026-03-05. [Online]. Available: https://www.anthropic.com/ news/claude-3-family