arxiv: 2605.20940 · v1 · pith:5D6P2U5Knew · submitted 2026-05-20 · 💻 cs.CV

3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat

Pith reviewed 2026-05-21 05:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords wheat spike volume · knowledge distillation · 3D reconstruction · multi-view images · regulated Transformer · field phenotyping · volume estimation · point cloud features

The pith

Distilling geometric knowledge from 3D reconstructions into 2D image Transformers improves wheat spike volume estimation accuracy while cutting inference time from 160 ms to 1.4 ms per spike.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to demonstrate that a multi-view image model can acquire useful 3D geometric understanding by distilling from a point cloud teacher, allowing accurate spike volume estimates from fast 2D images alone. A reader would care because current 3D methods are too slow or sensitive for routine field work on wheat, while pure image models miss explicit shape cues needed for yield and stress analysis. The approach first builds a rigid-invariant 3D network on histogram features, ensembles it with a regulated Transformer, then transfers the knowledge via feature or label distillation. If successful, this yields both higher accuracy and dramatically lower compute cost for high-throughput phenotyping.

Core claim

We train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations, combine it with a multi-view image-based regulated Transformer in an ensemble, and distill the ensemble into a purely image-based student model using either feature-based or label-based distillation. The two distilled models reduce mean absolute error from 654.31 mm³ (non-distilled model) to 639.93 mm³ and 644.62 mm³ and raise correlation from 0.76 to 0.77 and 0.82, respectively. Inference time drops from 160 ms to 1.4 ms per spike, and distillation mitigates volume-dependent bias and reshapes the image model's latent space toward geometry-aware shape.
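
To make "rigid-invariant distance-based histogram features" concrete, here is a minimal sketch of one such feature: a normalized histogram of pairwise point distances. It is an illustration only. The function names, the bin count, and the fixed `max_dist` range are assumptions of this sketch, and the paper's actual feature definition may differ.

```python
import math
import random


def distance_histogram(points, bins, max_dist):
    """Normalized histogram of all pairwise Euclidean distances.

    max_dist is a fixed range (same units as the points), so absolute
    size is preserved; this is an assumption of the sketch.
    """
    dists = [
        math.dist(points[i], points[j])
        for i in range(len(points))
        for j in range(i + 1, len(points))
    ]
    hist = [0] * bins
    for d in dists:
        # Clamp distances at or beyond max_dist into the last bin.
        idx = min(int(d / max_dist * bins), bins - 1)
        hist[idx] += 1
    return [count / len(dists) for count in hist]


def rotate_z_and_shift(points, angle, shift):
    """Apply a rigid motion: rotation about the z axis, then translation."""
    c, s = math.cos(angle), math.sin(angle)
    return [
        (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
        for x, y, z in points
    ]


# Synthetic elongated point cloud standing in for a reconstructed spike.
random.seed(0)
cloud = [(random.random(), random.random(), 3 * random.random()) for _ in range(50)]
moved = rotate_z_and_shift(cloud, angle=1.1, shift=(5.0, -2.0, 0.5))

h_original = distance_histogram(cloud, bins=8, max_dist=4.0)
h_moved = distance_histogram(moved, bins=8, max_dist=4.0)
max_change = max(abs(a - b) for a, b in zip(h_original, h_moved))
print(f"largest per-bin change after rigid motion: {max_change:.6f}")
```

Pairwise distances are unchanged by rotation and translation, so the two histograms agree up to floating-point effects at bin edges. A fixed range is used here rather than scaling by the largest distance, since the latter would also discard absolute size, which a volume estimator presumably needs.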

What carries the argument

Knowledge distillation from an ensemble of a rigid-invariant 3D point cloud network and a multi-view regulated Transformer into a purely image-based student model.
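
The two distillation routes can be sketched for a scalar regression target as follows. This is a hedged illustration, not the paper's loss: the squared-error form, the weights `alpha` and `beta`, and the function names are all assumptions.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def label_distillation_loss(student_pred, teacher_pred, target, alpha=0.5):
    """Blend the ground-truth loss with a loss toward the teacher's prediction."""
    hard = (student_pred - target) ** 2        # error against measured volume
    soft = (student_pred - teacher_pred) ** 2  # error against the teacher's output
    return (1 - alpha) * hard + alpha * soft


def feature_distillation_loss(student_pred, target, student_feat, teacher_feat, beta=0.5):
    """Ground-truth loss plus a penalty pulling student features toward the teacher's."""
    hard = (student_pred - target) ** 2
    return hard + beta * mse(student_feat, teacher_feat)


# Hypothetical values in mm^3 and an arbitrary 2-D feature space.
label_loss = label_distillation_loss(student_pred=1000.0, teacher_pred=1100.0, target=1200.0)
feature_loss = feature_distillation_loss(
    student_pred=1000.0, target=1200.0,
    student_feat=[0.0, 1.0], teacher_feat=[1.0, 1.0],
)
print(label_loss, feature_loss)
```

In the label-based variant the teacher only influences the student through its predicted volume; in the feature-based variant it shapes an intermediate representation, which is the route that would plausibly explain a reshaped latent space.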

If this is right

  • The final model performs volume estimation from images only, with no 3D data or reconstruction required at inference time.
  • Distillation reduces volume-dependent bias in the estimates compared with the non-distilled image model.
  • The method produces a geometry-aware latent representation inside an otherwise standard image Transformer.
  • Inference becomes fast enough for routine high-throughput field phenotyping of many spikes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation pattern could be tested on other plant organs or crops where 3D shape informs important traits but only 2D capture is practical at scale.
  • If the 3D teacher is replaced by a different geometric representation, such as voxel or mesh features, the student might retain similar speed gains with altered accuracy.
  • Extending the regulated Transformer to additional views or temporal sequences might further strengthen the geometric signal transferred during distillation.

Load-bearing premise

The 3D point cloud reconstructions and their rigid-invariant histogram features supply accurate geometric knowledge that transfers usefully to the 2D student without adding reconstruction biases or errors.

What would settle it

Running the distilled image model on a fresh collection of wheat spikes imaged outdoors under motion or lighting conditions different from the training set and finding equal or higher volume estimation error than the non-distilled baseline would show the transferred geometric knowledge does not generalize.
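
Such a comparison could be scored with the two metrics the abstract reports. The sketch below assumes "correlation" means Pearson's r, which the abstract does not state, and uses made-up placeholder volumes that are not from the paper.

```python
import math


def mean_absolute_error(pred, truth):
    """Average absolute difference between predictions and ground truth."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(truth)


def pearson_r(pred, truth):
    """Pearson correlation coefficient between predictions and ground truth."""
    n = len(truth)
    mp, mt = sum(pred) / n, sum(truth) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, truth))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    st = math.sqrt(sum((t - mt) ** 2 for t in truth))
    return cov / (sp * st)


# Placeholder volumes in mm^3 for a hypothetical fresh test set.
truth = [2000.0, 3000.0, 4000.0, 5000.0]
baseline = [2600.0, 3100.0, 3500.0, 4200.0]   # stand-in for the non-distilled model
distilled = [2300.0, 3200.0, 3800.0, 4700.0]  # stand-in for a distilled model

for name, pred in [("baseline", baseline), ("distilled", distilled)]:
    print(name, mean_absolute_error(pred, truth), round(pearson_r(pred, truth), 3))
```

The distilled model would fail the test if its MAE on the new set were equal to or higher than the baseline's.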

Figures

Figures reproduced from arXiv: 2605.20940 by Andreas Hund, Jannis Widmer, Lukas Roth, Norbert Kirchgessner, Olivia Zumsteg, Paraskevi Nousi, Yann Bourdé.

Figure 1: Overview of our proposed in-field single wheat spike […]
Figure 2: Regulated Transformer image model architecture for one […]
Figure 4: Overview of the training procedure. Images and rigid […]
Figure 5: Correlation between MAE in and the ground truth vol […]
Figure 6: PCA of the embeddings of the rigid-invariant PointNet […]

Original abstract

Accurate estimation of wheat spike volume is important for yield component analysis and stress resilience assessment, yet field-based measurement remains challenging. Active 3D sensing methods such as Light Detection and Ranging (LiDAR) or time-of-flight (ToF) are sensitive to plant motion or poorly suited to outdoor conditions, while 3D reconstructions are computationally expensive. Direct 2D image processing would offer computational advantages, but image-based models lack explicit geometric information. We therefore propose a hybrid 2D-3D approach with knowledge distillation during training while enabling efficient image-only inference. First, we train a rigid-invariant point cloud network using distance-based histogram features to obtain pose-robust geometric representations. We then combine the 3D model with a proposed multi-view image-based regulated Transformer (RT) in an ensemble architecture. Finally, we distill the ensemble knowledge into a purely image-based student model using either feature-based or label-based distillation. The two distilled RTs reduce the mean absolute error (MAE) from 654.31 mm³ of the non-distilled RT to 639.93 mm³ and 644.62 mm³, and increase correlation from 0.76 to 0.77 and 0.82, respectively. At the same time, inference time is reduced from 160 ms to 1.4 ms per spike. Distillation further mitigates volume-dependent bias and reshapes the latent representation of the image model toward a geometry-aware shape. Our results demonstrate that 3D-informed training of a 2D Transformer allows for scalable and efficient spike volume estimation for high-throughput field phenotyping.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a hybrid 2D-3D pipeline for wheat spike volume estimation: a rigid-invariant point cloud network is trained on distance-based histogram features, combined with a multi-view regulated Transformer (RT) in an ensemble, and the ensemble knowledge is distilled (via feature- or label-based methods) into a pure image-based student model. This enables accurate yet fast 2D-only inference. The abstract reports that the two distilled RT variants lower MAE from 654.31 mm³ (non-distilled RT) to 639.93 mm³ and 644.62 mm³, raise correlation from 0.76 to 0.77 and 0.82, cut inference time from 160 ms to 1.4 ms per spike, and mitigate volume-dependent bias.

Significance. If the central claims hold, the work demonstrates a practical route to scalable, high-throughput field phenotyping by transferring geometric knowledge from expensive 3D reconstructions to lightweight 2D models. The concrete numerical gains (specific MAE and correlation deltas) together with the >100× inference speedup constitute a tangible engineering contribution for agricultural computer vision. The rigid-invariant histogram features and dual distillation strategies are presented as reproducible design choices that could be adopted or extended by others.
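
The size of these gains can be checked with simple arithmetic on the abstract's figures. The sketch below only restates the reported numbers; the variable names are ours.

```python
# Figures as reported in the abstract.
slow_ms = 160.0          # reported slower per-spike inference time
fast_ms = 1.4            # per-spike inference time of the distilled image model
mae_baseline = 654.31    # mm^3, non-distilled RT
mae_distilled = [639.93, 644.62]  # mm^3, the two distilled RTs

speedup = slow_ms / fast_ms
spikes_per_second_slow = 1000.0 / slow_ms
spikes_per_second_fast = 1000.0 / fast_ms
relative_mae_reduction = [(mae_baseline - m) / mae_baseline for m in mae_distilled]

print(f"speedup: {speedup:.1f}x")
print(f"throughput: {spikes_per_second_slow:.2f} -> {spikes_per_second_fast:.1f} spikes/s")
for r in relative_mae_reduction:
    print(f"relative MAE reduction: {100 * r:.1f}%")
```

The speedup works out to roughly 114×, while the MAE reductions are about 2.2% and 1.5% of the baseline error. The engineering case therefore rests mainly on speed, with accuracy held roughly level.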

major comments (2)
  1. [Abstract] Abstract: the 3D point cloud teacher’s own accuracy (MAE, correlation, or reconstruction error on the same test spikes) is never reported. Without these numbers it is impossible to determine whether the observed MAE reductions (654.31 mm³ → 639.93/644.62 mm³) and correlation gains are attributable to successful geometric knowledge transfer or to other factors such as ensemble averaging or regularization.
  2. [Abstract] Abstract / Methods description: no quantitative ablation or invariance test is supplied for the rigid-invariant histogram features under realistic field conditions (pose variation, occlusion, lighting). The central claim that these features constitute a “reliable, bias-free teacher” therefore rests on an unverified assumption; if the features themselves degrade or introduce systematic bias, the distillation benefit cannot be credited to geometric knowledge.
minor comments (1)
  1. [Abstract] The abstract states that distillation “mitigates volume-dependent bias” but does not define the metric or show the corresponding plots or statistical test; a brief clarification or supplementary figure would strengthen the claim.
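
One candidate metric for volume-dependent bias, offered as an assumption since the abstract defines none, is the least-squares slope of the residual (prediction minus ground truth) against ground-truth volume. A slope near zero indicates no volume dependence; a negative slope means large spikes are underestimated and small ones overestimated.

```python
def residual_slope(pred, truth):
    """Least-squares slope of (pred - truth) regressed on truth."""
    n = len(truth)
    resid = [p - t for p, t in zip(pred, truth)]
    mt = sum(truth) / n
    mr = sum(resid) / n
    sxx = sum((t - mt) ** 2 for t in truth)
    sxy = sum((t - mt) * (r - mr) for t, r in zip(truth, resid))
    return sxy / sxx


# Hypothetical predictors on placeholder volumes in mm^3.
truth = [2000.0, 3000.0, 4000.0, 5000.0]
shrunk = [0.5 * t + 1750.0 for t in truth]  # pulls every estimate toward the mean
offset = [t + 100.0 for t in truth]         # constant error, no volume dependence

print(residual_slope(shrunk, truth))
print(residual_slope(offset, truth))
```

Reporting this slope, with a confidence interval, for the non-distilled and distilled models would turn the "mitigates volume-dependent bias" claim into a testable number.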

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We have addressed each major comment point by point below, providing clarifications and committing to revisions where the manuscript can be strengthened without misrepresenting our results.

  1. Referee: [Abstract] Abstract: the 3D point cloud teacher’s own accuracy (MAE, correlation, or reconstruction error on the same test spikes) is never reported. Without these numbers it is impossible to determine whether the observed MAE reductions (654.31 mm³ → 639.93/644.62 mm³) and correlation gains are attributable to successful geometric knowledge transfer or to other factors such as ensemble averaging or regularization.

    Authors: We acknowledge that explicitly reporting the standalone performance of the 3D point cloud teacher (MAE, correlation, and reconstruction error) on the identical test spikes would help isolate the contribution of geometric knowledge transfer from ensemble effects. These metrics were computed during our experiments but omitted from the abstract and main text to emphasize the distilled student model's efficiency gains. In the revised manuscript, we will add the teacher's test-set results in a new table or results subsection for direct comparison, enabling clearer attribution of the observed improvements. revision: yes

  2. Referee: [Abstract] Abstract / Methods description: no quantitative ablation or invariance test is supplied for the rigid-invariant histogram features under realistic field conditions (pose variation, occlusion, lighting). The central claim that these features constitute a “reliable, bias-free teacher” therefore rests on an unverified assumption; if the features themselves degrade or introduce systematic bias, the distillation benefit cannot be credited to geometric knowledge.

    Authors: We agree that a dedicated quantitative ablation or invariance test for the distance-based histogram features under field-like variations would provide stronger empirical support for their reliability. The methods section describes their design for pose robustness, but no explicit numerical evaluation (e.g., error under rotations, simulated occlusions, or lighting changes) is included. We will add such an analysis—either in the main text or as supplementary material—reporting quantitative metrics on feature stability to substantiate the assumption and rule out systematic bias. revision: yes

Circularity Check

0 steps flagged

Empirical test-set metrics with no reduction to fitted inputs or self-citation chains

The paper's central claims rest on reported improvements in MAE and correlation for distilled image-only models versus a non-distilled baseline, measured on held-out test spikes. These are standard empirical outcomes from applying knowledge distillation (feature-based or label-based) between independently trained 3D point-cloud and 2D regulated Transformer models. No equations reduce a prediction to a fitted parameter by construction, no uniqueness theorems or ansatzes are smuggled via self-citation, and the histogram features and distillation losses are not defined circularly in terms of the final volume estimates. The approach is self-contained against external benchmarks, with only minor implicit hyperparameter choices for distillation weights that do not force the reported gains.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the unverified accuracy of the 3D teacher reconstructions and on the assumption that distillation successfully injects geometric priors into the image model; no explicit free parameters are named in the abstract.

free parameters (1)
  • distillation loss weights and temperature
    Typical distillation hyperparameters that are chosen or tuned and directly affect how much geometric signal reaches the student.
axioms (1)
  • domain assumption: 3D point-cloud reconstructions provide reliable ground-truth geometry for the teacher model
    Invoked when the 3D network is positioned as the source of pose-robust features.

pith-pipeline@v0.9.0 · 5863 in / 1400 out tokens · 46050 ms · 2026-05-21T05:30:33.180689+00:00 · methodology

Reference graph

Works this paper leans on

78 extracted references · 78 canonical work pages · 8 internal anchors

  1. [1]

    Vision based volume esti- mation method for automatic mango grading system.Biosys- tems Engineering, 198:338–349, 2020

    TheOo Mon and Nay ZarAung. Vision based volume esti- mation method for automatic mango grading system.Biosys- tems Engineering, 198:338–349, 2020. 1

  2. [2]

    Improved voxel-based vol- ume estimation and pruning severity mapping of apple trees during the pruning period.Computers and Electronics in Agriculture, 219:108834, 2024

    Xuhua Dong, Woo-Young Kim, Zheng Yu, Ju-Youl Oh, Reza Ehsani, and Kyeong-Hwan Lee. Improved voxel-based vol- ume estimation and pruning severity mapping of apple trees during the pruning period.Computers and Electronics in Agriculture, 219:108834, 2024

  3. [3]

    Automated Stock V olume Estimation Using UA V-RGB Imagery.Sensors, 24(23):7559, 2024

    Anurupa Goswami, Unmesh Khati, Ishan Goyal, Anam Sabir, and Sakshi Jain. Automated Stock V olume Estimation Using UA V-RGB Imagery.Sensors, 24(23):7559, 2024

  4. [4]

    Potato feature prediction based on machine vision and 3D model rebuilding.Computers and Electronics in Agriculture, 137:41–51, 2017

    Qinghua Su, Naoshi Kondo, Minzan Li, Hong Sun, and Di- mas Firmanda Al Riza. Potato feature prediction based on machine vision and 3D model rebuilding.Computers and Electronics in Agriculture, 137:41–51, 2017

  5. [5]

    Determination of watermelon volume using ellipsoid approximation and image processing.Postharvest Biology and Technology, 45(3):366–371, 2007

    Ali Bulent Koc. Determination of watermelon volume using ellipsoid approximation and image processing.Postharvest Biology and Technology, 45(3):366–371, 2007

  6. [6]

    G. P. Moreda, J. Ortiz-Ca ˜navate, F. J. Garc ´ıa-Ramos, and M. Ruiz-Altisent. Non-destructive technologies for fruit and vegetable size determination – A review.Journal of Food Engineering, 92(2):119–136, 2009

  7. [7]

    Image processing based modeling for Rosa roxburghii fruits mass and volume estimation.Sci- entific Reports, 14(1):15507, 2024

    Zhiping Xie, Junhao Wang, Yufei Yang, Peixuan Mao, Jial- ing Guo, and Manyu Sun. Image processing based modeling for Rosa roxburghii fruits mass and volume estimation.Sci- entific Reports, 14(1):15507, 2024. 1

  8. [8]

    Thermal imaging can reveal variation in stay- green functionality of wheat canopies under temperate con- ditions.Frontiers in Plant Science, 15, 2024

    Jonas Anderegg, Norbert Kirchgessner, Helge Aasen, Olivia Zumsteg, Beat Keller, Radek Zenkl, Achim Walter, and An- dreas Hund. Thermal imaging can reveal variation in stay- green functionality of wheat canopies under temperate con- ditions.Frontiers in Plant Science, 15, 2024. 1

  9. [9]

    Rachakonda, Bala Muralikrishnan, and Daniel S

    Prem K. Rachakonda, Bala Muralikrishnan, and Daniel S. Sawyer. Sources of Errors in Structured Light 3D Scanners. NIST, 2019. 1

  10. [10]

    Time-of-Flight Cameras: Principles, Methods and Applica- tions

    Miles Hansard, Seungkyu Lee, Ouk Choi, and Radu Horaud. Time-of-Flight Cameras: Principles, Methods and Applica- tions. Springer, London, 2013. 2

  11. [11]

    Joan Ramon Rosell Polo, Ricardo Sanz, Jordi Llorens, Jaume Arn ´o, Alexandre Escol `a, Manel Ribes-Dasi, Joan Masip, Ferran Camp, Felip Gr `acia, Francesc Solanelles, Tom`as Pallej `a, Luis Val, Santiago Planas, Emilio Gil, and Jordi Palac´ın. A tractor-mounted scanning LIDAR for the non-destructive measurement of vegetative volume and sur- face area of t...

  12. [12]

    Jimenez-Berni, David M

    Jose A. Jimenez-Berni, David M. Deery, Pablo Rozas- Larraondo, Anthony (Tony) G. Condon, Greg J. Rebetzke, Richard A. James, William D. Bovill, Robert T. Furbank, and Xavier R. R. Sirault. High Throughput Determination of Plant Height, Ground Cover, and Above-Ground Biomass in Wheat with LiDAR.Frontiers in Plant Science, 9, 2018. 2

  13. [13]

    What Can We Learn from Depth Camera Sensor Noise?Sensors, 22(14):5448, 2022

    Azmi Haider and Hagit Hel-Or. What Can We Learn from Depth Camera Sensor Noise?Sensors, 22(14):5448, 2022. 2

  14. [14]

    Mohit Gupta, Qi Yin, and Shree K. Nayar. Structured Light in Sunlight. In2013 IEEE International Conference on Com- puter Vision, pages 545–552, Sydney, Australia, 2013. 2

  15. [15]

    The use of terrestrial LiDAR technology in forest science: Appli- cation fields, benefits and challenges.Annals of Forest Sci- ence, 68(5):959–974, 2011

    Mathieu Dassot, Thi ´ery Constant, and Meriem Fournier. The use of terrestrial LiDAR technology in forest science: Appli- cation fields, benefits and challenges.Annals of Forest Sci- ence, 68(5):959–974, 2011. 2

  16. [16]

    Lukas Roth, Mar ´ıa Xos ´e Rodr ´ıguez-´Alvarez, Fred van Eeuwijk, Hans-Peter Piepho, and Andreas Hund. Phenomics data processing: A plot-level model for repeated measure- ments to extract the timing of key stages and quantities at defined time points.Field Crops Research, 274:1–17, 2021. 2

  17. [17]

    Repeated Multiview Imaging for Estimating Seedling Tiller Counts of Wheat Genotypes Using Drones.Plant Phenomics, 2020: 1–20, 2020

    Lukas Roth, Moritz Camenzind, Helge Aasen, Lukas Kro- nenberg, Christoph Barendregt, Karl-Heinz Camp, Achim Walter, Norbert Kirchgessner, and Andreas Hund. Repeated Multiview Imaging for Estimating Seedling Tiller Counts of Wheat Genotypes Using Drones.Plant Phenomics, 2020: 1–20, 2020

  18. [18]

    Deep learning for wheat ear segmentation and ear den- sity measurement: From heading to maturity.Computers and Electronics in Agriculture, 199:107161, 2022

    S ´ebastien Dandrifosse, Elias Ennadifi, Alexis Carlier, Bernard Gosselin, Benjamin Dumont, and Beno ˆıt Merca- toris. Deep learning for wheat ear segmentation and ear den- sity measurement: From heading to maturity.Computers and Electronics in Agriculture, 199:107161, 2022. 2

  19. [19]

    ImageNet classification with deep convolutional neural net- works

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural net- works. InAdvances in Neural Information Processing Sys- tems. Curran Associates, Inc., 2012. 2

  20. [20]

    Badhon, Curtis Pozniak, Benoit de Solan, Andreas Hund, Scott C

    Etienne David, Simon Madec, Pouria Sadeghi-Tehran, Helge Aasen, Bangyou Zheng, Shouyang Liu, Norbert Kirchgess- ner, Goro Ishikawa, Koichi Nagasawa, Minhajul A. Badhon, Curtis Pozniak, Benoit de Solan, Andreas Hund, Scott C. Chapman, Fr ´ed´eric Baret, Ian Stavness, and Wei Guo. Global Wheat Head Detection (GWHD) Dataset: A Large and Diverse Dataset of Hi...

  21. [21]

    YOLOv12: Attention-Centric Real-Time Object Detectors

    Yunjie Tian, Qixiang Ye, and David Doermann. YOLOv12: Attention-Centric Real-Time Object Detectors.arXiv preprint arXiv:2502.12524, 2025. 2

  22. [22]

    Comparing YOLOv11 and YOLOv8 for instance segmentation of occluded and non-occluded immature green fruits in complex orchard en- vironment.arXiv preprint arXiv:2410.19869, 2025

    Ranjan Sapkota and Manoj Karkee. Comparing YOLOv11 and YOLOv8 for instance segmentation of occluded and non-occluded immature green fruits in complex orchard en- vironment.arXiv preprint arXiv:2410.19869, 2025. 2

  23. [23]

    Wheat3DGS: In-Field 3D Recon- struction, Instance Segmentation and Phenotyping of Wheat Heads with Gaussian Splatting

    Daiwei Zhang, Joaquin Gajardo, Tomislav Medic, Isinsu Katircioglu, Mike Boss, Norbert Kirchgessner, Achim Wal- ter, and Lukas Roth. Wheat3DGS: In-Field 3D Recon- struction, Instance Segmentation and Phenotyping of Wheat Heads with Gaussian Splatting. In2025 IEEE/CVF Con- ference on Computer Vision and Pattern Recognition Work- shops (CVPRW), pages 5360–53...

  24. [24]

    Transfer learning in agriculture: A review.Artificial Intelligence Review, 58(4):97, 2025

    Md Ismail Hossen, Mohammad Awrangjeb, Shirui Pan, and Abdullah Al Mamun. Transfer learning in agriculture: A review.Artificial Intelligence Review, 58(4):97, 2025. 2

  25. [25]

    Performance Comparison of Transfer Learning and Training from Scratch Approaches for Deep Facial Expression Recognition

    Ismail Oztel, Gozde Yolcu, and Cemil Oz. Performance Comparison of Transfer Learning and Training from Scratch Approaches for Deep Facial Expression Recognition. In 9 2019 4th International Conference on Computer Science and Engineering (UBMK), pages 1–6, 2019. 2

  26. [26]

    Paolo Didier Alfano, Vito Paolo Pastore, Lorenzo Rosasco, and Francesca Odone. Top-tuning: A study on transfer learn- ing for an efficient alternative to fine tuning for image clas- sification with fast kernel methods.Image and Vision Com- puting, 142:104894, 2024. 2

  27. [27]

    Deep Residual Learning for Image Recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016. 2

  28. [28]

    Smart farming: Real-time rice yield forecasting on mobile devices using lightweight CNN-LSTM.Smart Agricultural Technol- ogy, 13:101664, 2026

    Sakshi Gandotra, Rita Chhikara, and Anuradha Dhull. Smart farming: Real-time rice yield forecasting on mobile devices using lightweight CNN-LSTM.Smart Agricultural Technol- ogy, 13:101664, 2026. 2

  29. [29]

    Abourabia, S

    I. Abourabia, S. Ounacer, M.Y . Elghoumari, S. Ardchir, and M. Azzouazi. A hybrid machine learning and deep learning framework for agricultural yield prediction using irrigation data.International Journal on Technical and Physical Prob- lems of Engineering, 17(64):436–449, 2025. 2

  30. [30]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, Jakob Uszkoreit, and Neil Houlsby. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.arXiv preprint arXiv:2010.11929, 2021. 2

  31. [31]

    End- to-End Object Detection with Transformers

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End- to-End Object Detection with Transformers. InComputer Vision – ECCV 2020, pages 213–229, Cham, 2020. Springer International Publishing. 2

  32. [32]

    Vi- sion Transformers for Dense Prediction

    Rene Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion Transformers for Dense Prediction. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 12159–12168, Montreal, QC, Canada, 2021. 2

  33. [33]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mah- moud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herv ´e Je- gou, Julien Mairal, ...

  34. [34]

    Oriane Sim ´eoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timoth´ee Darcet, Th´eo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie,...

  35. [35]

    A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Pola- nia Cabrera, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. A tale of two features: Stable diffusion complements DINO for zero-shot semantic correspondence. InAdvances in Neural Information Processing Systems, pages 45533– 45547. Curran Associates, Inc., 2023. 2

  36. [36]

    Neural foundations of mental simula- tion: Future prediction of latent representations on dynamic scenes

    Aran Nayebi, Rishi Rajalingham, Mehrdad Jazayeri, and Guangyu Robert Yang. Neural foundations of mental simula- tion: Future prediction of latent representations on dynamic scenes. InAdvances in Neural Information Processing Sys- tems, pages 70548–70561. Curran Associates, Inc., 2023. 2

  37. [37]

    FoMo4Wheat: Toward reliable crop vision founda- tion models with globally curated data.arXiv preprint arXiv:2509.06907, 2025

    Bing Han, Chen Zhu, Dong Han, Rui Yu, Songliang Cao, Jianhui Wu, Scott Chapman, Zijian Wang, Bangyou Zheng, Wei Guo, Marie Weiss, Benoit de Solan, Andreas Hund, Lukas Roth, Kirchgessner Norbert, Andrea Visioni, Yufeng Ge, Wenjuan Li, Alexis Comar, Dong Jiang, De- jun Han, Fred Baret, Yanfeng Ding, Hao Lu, and Shouyang Liu. FoMo4Wheat: Toward reliable crop...

  38. [38]

    Long Short-Term Memory.Neural Computation, 9(8):1735–1780, 1997

    Sepp Hochreiter and J ¨urgen Schmidhuber. Long Short-Term Memory.Neural Computation, 9(8):1735–1780, 1997. 2

  39. [39]

    An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

    Shaojie Bai, J. Zico Kolter, and Vladlen Koltun. An Empirical Evaluation of Generic Convolutional and Recur- rent Networks for Sequence Modeling.arXiv preprint arXiv:1803.01271, 2018. 2

  40. [40]

    Yuanhang Su and C.-C. Jay Kuo. On Extended Long Short- term Memory and Dependent Bidirectional Recurrent Neural Network.Neurocomputing, 356:151–161, 2019. 2

  41. [41]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. InAdvances in Neu- ral Information Processing Systems. Curran Associates, Inc.,

  42. [42]

    Yue Wang and Yuanyuan Zha. Comparison of transformer, LSTM and coupled algorithms for soil moisture prediction in shallow-groundwater-level areas with interpretability analy- sis.Agricultural Water Management, 305:109120, 2024. 2

  43. [43]

    Transformer neural networks for interpretable flood forecasting.Environmental Modelling & Software, 160:105581, 2023

    Marco Castangia, Lina Maria Medina Grajales, Alessandro Aliberti, Claudio Rossi, Alberto Macii, Enrico Macii, and Edoardo Patti. Transformer neural networks for interpretable flood forecasting.Environmental Modelling & Software, 160:105581, 2023. 2

  44. [44]

    3DPhenoMVS: A Low-Cost 3D Tomato Phenotyping Pipeline Using 3D Reconstruction Point Cloud Based on Multiview Images.Agronomy, 12(8):1865, 2022

    Yinghua Wang, Songtao Hu, He Ren, Wanneng Yang, and Ruifang Zhai. 3DPhenoMVS: A Low-Cost 3D Tomato Phenotyping Pipeline Using 3D Reconstruction Point Cloud Based on Multiview Images.Agronomy, 12(8):1865, 2022. 2

  45. [45]

    PanicleNeRF: Low-Cost, High-Precision In-Field Phenotyp- ing of Rice Panicles with Smartphone.Plant Phenomics, 6: 0279, 2024

    Xin Yang, Xuqi Lu, Pengyao Xie, Ziyue Guo, Hui Fang, Haowei Fu, Xiaochun Hu, Zhenbiao Sun, and Haiyan Cen. PanicleNeRF: Low-Cost, High-Precision In-Field Phenotyp- ing of Rice Panicles with Smartphone.Plant Phenomics, 6: 0279, 2024

  46. [46]

    NeRF-based 3D reconstruction pipeline for acquisition and analysis of tomato crop morphology.Fron- tiers in Plant Science, 15, 2024

    Hong-Beom Choi, Jae-Kun Park, Soo Hyun Park, and Taek Sung Lee. NeRF-based 3D reconstruction pipeline for acquisition and analysis of tomato crop morphology.Fron- tiers in Plant Science, 15, 2024

  47. [47]

    PlantGaussian: Exploring 3D Gaussian splat- ting for cross-time, cross-scene, and realistic 3D plant vi- 10 sualization and beyond.The Crop Journal, 13(2):607–618,

    Peng Shen, Xueyao Jing, Wenzhe Deng, Hanyue Jia, and Tingting Wu. PlantGaussian: Exploring 3D Gaussian splat- ting for cross-time, cross-scene, and realistic 3D plant vi- 10 sualization and beyond.The Crop Journal, 13(2):607–618,

  48. [48]

    NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review (Updated Post-Gaussian Splatting)

    Kyle Gao, Yina Gao, Hongjie He, Dening Lu, Linlin Xu, and Jonathan Li. NeRF: Neural Radiance Field in 3D Vision: A Comprehensive Review (Updated Post-Gaussian Splatting). arXiv preprint arXiv:2210.00379, 2026. 2

  49. [49]

    A miniaturized phenotyping platform for in- dividual plants using multi-view stereo 3D reconstruction

    Sheng Wu, Weiliang Wen, Wenbo Gou, Xianju Lu, Wenqi Zhang, Chenxi Zheng, Zhiwei Xiang, Liping Chen, and Xinyu Guo. A miniaturized phenotyping platform for in- dividual plants using multi-view stereo 3D reconstruction. Frontiers in Plant Science, 13, 2022

  50. [50]

    Object-centric 3D Gaussian splatting for strawberry plant reconstruction and phenotyping.Smart Agricultural Technology, 13:101810, 2026

    Jiajia Li, Keyi Zhu, Qianwen Zhang, Dong Chen, Qi Sun, and Zhaojian Li. Object-centric 3D Gaussian splatting for strawberry plant reconstruction and phenotyping.Smart Agricultural Technology, 13:101810, 2026. 2

  51. [51]

    Model compression

    Cristian Bucilu ˇa, Rich Caruana, and Alexandru Niculescu- Mizil. Model compression. InProceedings of the 12th ACM SIGKDD International Conference on Knowledge Discov- ery and Data Mining, pages 535–541, New York, NY , USA,

  52. [52]

    Association for Computing Machinery. 2

  53. [53]

    Improved Knowledge Distillation via Teacher Assistant

    Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved Knowledge Distillation via Teacher Assistant. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):5191–5198, 2020.

  54. [54]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015.

  55. [55]

    An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation

    Makoto Takamoto, Yusuke Morishita, and Hitoshi Imaoka. An Efficient Method of Training Small Models for Regression Problems with Knowledge Distillation. In 2020 IEEE Conference on Multimedia Information Processing and Retrieval (MIPR), pages 67–72, 2020.

  56. [56]

    Cross Modal Distillation for Supervision Transfer

    Saurabh Gupta, Judy Hoffman, and Jitendra Malik. Cross Modal Distillation for Supervision Transfer. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2827–2836, Las Vegas, NV, USA, 2016.

  57. [57]

    3D-to-2D Distillation for Indoor Scene Parsing

    Zhengzhe Liu, Xiaojuan Qi, and Chi-Wing Fu. 3D-to-2D Distillation for Indoor Scene Parsing. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4462–4472, 2021.

  58. [58]

    An Adversarial Approach to Discriminative Modality Distillation for Remote Sensing Image Classification

    Shivam Pande, Avinandan Banerjee, Saurabh Kumar, Biplab Banerjee, and Subhasis Chaudhuri. An Adversarial Approach to Discriminative Modality Distillation for Remote Sensing Image Classification. In 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), pages 4571–4580, 2019.

  59. [59]

    DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation

    Zeyu Wang, Dingwen Li, Chenxu Luo, Cihang Xie, and Xiaodong Yang. DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 8603–8612, Paris, France, 2023.

  60. [60]

    RadOcc: Learning cross-modality occupancy knowledge through rendering assisted distillation

    Haiming Zhang, Xu Yan, Dongfeng Bai, Jiantao Gao, Pan Wang, Bingbing Liu, Shuguang Cui, and Zhen Li. RadOcc: Learning cross-modality occupancy knowledge through rendering assisted distillation. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence a...

  61. [61]

    Masked Generative Distillation

    Zhendong Yang, Zhe Li, Mingqi Shao, Dachuan Shi, Zehuan Yuan, and Chun Yuan. Masked Generative Distillation. In Computer Vision – ECCV 2022, pages 53–69, Cham, 2022. Springer Nature Switzerland.

  62. [62]

    A Simple and Generic Framework for Feature Distillation via Channel-wise Transformation

    Ziwei Liu, Yongtao Wang, Xiaojie Chu, Nan Dong, Shengxiang Qi, and Haibin Ling. A Simple and Generic Framework for Feature Distillation via Channel-wise Transformation. In 2023 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pages 1121–1130, Paris, France,

  63. [63]

    The ETH field phenotyping platform FIP: A cable-suspended multi-sensor system

    Norbert Kirchgessner, Frank Liebisch, Kang Yu, Johannes Pfeifer, Michael Friedli, Andreas Hund, and Achim Walter. The ETH field phenotyping platform FIP: A cable-suspended multi-sensor system. Functional Plant Biology: FPB, 44(1):154–168, 2016.

  64. [64]

    Fine-Tuned Vision Transformers Capture Complex Wheat Spike Morphology for Volume Estimation from RGB Images

    Olivia Zumsteg, Nico Graf, Aaron Haeusler, Norbert Kirchgessner, Nicola Storni, Lukas Roth, and Andreas Hund. Fine-Tuned Vision Transformers Capture Complex Wheat Spike Morphology for Volume Estimation from RGB Images. arXiv preprint arXiv:2506.18060, 2025.

  65. [65]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment Anything in Images and Videos. arXiv preprint arXiv:...

  66. [66]

    Descriptor-Free Multi-view Region Matching for Instance-Wise 3D Reconstruction

    Takuma Doi, Fumio Okura, Toshiki Nagahara, Yasuyuki Matsushita, and Yasushi Yagi. Descriptor-Free Multi-view Region Matching for Instance-Wise 3D Reconstruction. In Computer Vision – ACCV 2020, pages 581–599, Cham, Springer International Publishing.

  68. [68]

    Near linear time algorithm to detect community structures in large-scale networks

    Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106,

  69. [69]

    OpenMVS: Multi-view stereo reconstruction library

    Dan Cernea. OpenMVS: Multi-view stereo reconstruction library. 2020.

  70. [70]

    Point Transformer V3: Simpler, Faster, Stronger

    Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xihui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point Transformer V3: Simpler, Faster, Stronger. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4840–4851, 2024.

  71. [71]

    PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation

    Charles R. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85, Honolulu, HI, 2017.

  72. [72]

    Panoptic Mapping with Fruit Completion and Pose Estimation for Horticultural Robots

    Yue Pan, Federico Magistri, Thomas Läbe, Elias Marks, Claus Smitt, Chris McCool, Jens Behley, and Cyrill Stachniss. Panoptic Mapping with Fruit Completion and Pose Estimation for Horticultural Robots. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 4226–4233, Detroit, MI, USA, 2023.

  73. [73]

    DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation

    Jeong Joon Park, Peter Florence, Julian Straub, Richard Newcombe, and Steven Lovegrove. DeepSDF: Learning Continuous Signed Distance Functions for Shape Representation. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 165–174, Long Beach, CA, USA, 2019.

  74. [74]

    MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection

    Donghyeon Kwon, Youngseok Yoon, Hyeongseok Son, and Suha Kwak. MemDistill: Distilling LiDAR Knowledge into Memory for Camera-Only 3D Object Detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6828–6838, 2025.

  75. [75]

    INVITE Consortium. INnovations in plant VarIety Testing in Europe to foster the introduction of new varieties better adapted to varying biotic and abiotic conditions and to more sustainable crop management practices | INVITE | Project | Fact Sheet | H2020. https://cordis.europa.eu/project/id/817970, n.d.

  76. [76]

    PHENET Consortium. Tools and methods for extended plant PHENotyping and EnviroTyping services of European Research Infrastructures | PHENET | Project | Fact Sheet | HORIZON. https://cordis.europa.eu/project/id/101094587, n.d.

  77. [77]

    Pyrender

    Matthew Matl. Pyrender. https://github.com/mmatl/pyrender, 2019.

  78. [78]

    OpenMVG: Open Multiple View Geometry

    Pierre Moulon, Pascal Monasse, Romuald Perrot, and Renaud Marlet. OpenMVG: Open Multiple View Geometry. In Reproducible Research in Pattern Recognition, pages 60–74, Cham, 2017. Springer International Publishing.

    3D Reconstruction and Knowledge Distillation to Improve Multi-View Image Models to Explore Spike Volume Estimation in Wheat Supplementa...