Pith · machine review for the scientific record

arxiv: 2604.27128 · v1 · submitted 2026-04-29 · 💻 cs.CV · cs.AI

Recognition: unknown

Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords model distillation · edge computing · livestock monitoring · SAM 3 · DINOv3 · pig tracking · behavior classification · re-identification

The pith

Distilling SAM 3 and DINOv3 produces a compact pipeline that tracks individual pigs on edge devices with under 2-point accuracy loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large foundation models for open-vocabulary detection, video segmentation, and self-supervised embeddings can be compressed into a system that fits inside commodity edge hardware while retaining most of their livestock-monitoring performance. It does this by replacing the heavy SAM 3 backbone with a smaller multi-scale student network, applying a four-term distillation objective, and adding memory-bounding tricks during streaming inference. On the Edinburgh Pig dataset the resulting pipeline delivers 92.29 percent MOTA and 96.15 percent IDF1 for tracking plus 97.34 percent top-1 accuracy for nine-class behavior classification, all inside a 16 GB Jetson Orin NX. If the compression generalizes, farms could generate year-long per-animal visual records locally and later link those records to health and productivity outcomes without needing server-grade GPUs.
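
The abstract names the memory-bounding trick only as "backbone-substitution inference with sliding-window session pruning" without spelling out the mechanism. The sketch below is one plausible reading, assuming a SAM-2/3-style streaming memory whose per-frame entries are evicted beyond a fixed window; the class and parameter names are illustrative, not the paper's.

    from collections import deque

    class SlidingWindowMemory:
        """Bounded per-session memory for streaming video segmentation.

        Illustrative only: each frame contributes one feature entry, and
        anything older than `window` frames is dropped, so peak memory is
        O(window) rather than O(session length).
        """

        def __init__(self, window: int = 16):
            self.window = window
            self.entries = deque(maxlen=window)  # deque evicts the oldest entry automatically

        def add(self, frame_idx: int, features) -> None:
            self.entries.append((frame_idx, features))

        def context(self):
            """Entries the tracker may attend over when segmenting the next frame."""
            return list(self.entries)

    # Memory stays bounded no matter how long the barn camera runs.
    bank = SlidingWindowMemory(window=16)
    for t in range(10_000):
        bank.add(t, features=f"feat[{t}]")  # stand-in for per-frame encoder features
    assert len(bank.context()) == 16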

Core claim

By distilling the 446M-parameter Perception Encoder of SAM 3 into a 40.66M-parameter multi-scale student via a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale loss, and backbone-substitution inference with sliding-window session pruning, and by adopting the 21M-parameter ViT-S/16 variant of DINOv3 as the embedder, the pipeline reaches 92.29 percent MOTA and 96.15 percent IDF1 against the SAM 3 teacher (1.68- and 0.84-percentage-point losses), achieves a 7.77-fold reduction in system-level parameters and a 3.01-fold reduction in peak VRAM (19.52 GB to 6.49 GB), reaches 97.34 percent top-1 accuracy with 91.67 percent macro-F1 on nine-class pig behaviour classification, and fits inside an NVIDIA Jetson Orin NX 16 GB envelope with 4.9 GB of headroom.
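
The spread between 97.34 percent top-1 accuracy and 91.67 percent macro-F1 is what one would expect if some of the nine behaviour classes are rare, since macro-F1 weights every class equally. A toy illustration with invented labels (not the dataset's class counts), assuming scikit-learn is available:

    from sklearn.metrics import accuracy_score, f1_score

    # Invented, imbalanced ground truth: one dominant behaviour, one rare one.
    y_true = ["lying"] * 95 + ["mounting"] * 5
    # A classifier that nails the dominant class but catches only 1 of 5 rare events.
    y_pred = ["lying"] * 95 + ["mounting", "lying", "lying", "lying", "lying"]

    print("top-1 accuracy:", accuracy_score(y_true, y_pred))       # 0.96
    print("macro-F1:", f1_score(y_true, y_pred, average="macro"))  # ~0.66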

What carries the argument

The four-term direction-then-scale distillation loss together with the Feature Pyramid Network student encoder and sliding-window session pruning that transfers SAM 3 open-vocabulary segmentation and DINOv3 embeddings to a lightweight, memory-bounded pipeline.
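
The abstract names the "four-term direction-then-scale" objective but gives neither the individual terms nor their weights (a gap the referee report below also flags). A hedged PyTorch sketch of one possible reading, assuming the four terms are a direction (cosine) term and a scale (feature-norm) term at each of two pyramid levels; the decomposition and the weights are placeholders, not the paper's values.

    import torch
    import torch.nn.functional as F

    def direction_then_scale_loss(student_feats, teacher_feats, weights=(1.0, 1.0, 1.0, 1.0)):
        """Sketch of a four-term direction-then-scale feature-distillation loss.

        student_feats / teacher_feats: lists of two (B, C, H, W) tensors with
        matching channels, standing in for two FPN levels. The split into one
        direction and one scale term per level is an assumption.
        """
        terms = []
        for s, t in zip(student_feats, teacher_feats):
            s_flat, t_flat = s.flatten(2), t.flatten(2)  # (B, C, H*W)
            # Direction: align the orientation of each spatial feature vector first.
            direction = 1.0 - F.cosine_similarity(s_flat, t_flat, dim=1).mean()
            # Scale: then match the magnitude of those vectors.
            scale = F.l1_loss(s_flat.norm(dim=1), t_flat.norm(dim=1))
            terms.extend([direction, scale])
        return sum(w * term for w, term in zip(weights, terms))

    # Toy usage with random tensors standing in for student/teacher FPN outputs.
    student = [torch.randn(2, 256, 32, 32), torch.randn(2, 256, 16, 16)]
    teacher = [torch.randn(2, 256, 32, 32), torch.randn(2, 256, 16, 16)]
    loss = direction_then_scale_loss(student, teacher)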

Load-bearing premise

The distillation techniques and pruning will transfer to new farm environments and animal species with only the small reported accuracy degradation, and the unvalidated embedding-pool re-identification will work without drift.
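
Because the re-identification mechanism is only proposed, never specified or validated, the following is a minimal sketch of how an on-device embedding pool could work at all, assuming L2-normalised DINOv3 ViT-S/16 crop embeddings matched by cosine similarity against per-animal galleries; the threshold and the gallery layout are hypothetical.

    import numpy as np

    def reidentify(query: np.ndarray, galleries: dict, threshold: float = 0.6):
        """Hypothetical embedding-pool re-identification (not the paper's method).

        query: one L2-normalised embedding, e.g. a 384-d DINOv3 ViT-S/16 vector.
        galleries: maps an animal ID to an array of stored normalised embeddings.
        Returns the best-matching ID, or None when nothing clears the threshold,
        which is where identity drift would surface over long deployments.
        """
        best_id, best_score = None, threshold
        for animal_id, pool in galleries.items():
            score = float(np.max(pool @ query))  # best cosine similarity within the pool
            if score > best_score:
                best_id, best_score = animal_id, score
        return best_id, best_score

    # Toy usage with a random, normalised gallery for one animal.
    galleries = {"pig_07": np.random.randn(50, 384)}
    galleries = {k: v / np.linalg.norm(v, axis=1, keepdims=True) for k, v in galleries.items()}
    query = galleries["pig_07"][0]
    print(reidentify(query, galleries))  # ('pig_07', 1.0) for an exact gallery hit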

What would settle it

Run the compressed pipeline on video from a second pig farm or on another livestock species and measure whether MOTA stays above 90 percent and behavior top-1 accuracy stays above 95 percent.
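
MOTA here is the standard CLEAR MOT quantity (reference [1] in the graph below): one minus the sum of misses, false positives, and identity switches over the number of ground-truth objects. A small sketch of the pass/fail check such a replication would apply, with invented error counts:

    def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
        """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / ground-truth objects."""
        return 1.0 - (false_negatives + false_positives + id_switches) / num_gt

    # Invented counts from a hypothetical second-farm run; the criterion proposed
    # above is MOTA above 0.90 together with behaviour top-1 above 0.95.
    score = mota(false_negatives=420, false_positives=310, id_switches=25, num_gt=10_000)
    print(f"MOTA = {score:.4f}, passes = {score > 0.90}")  # MOTA = 0.9245, passes = True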

Original abstract

Foundation-model pipelines for individual-level livestock monitoring -- combining open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings -- have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 is distilled into a 40.66M-parameter multi-scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre-distilled ViT-S/16 variant (21.6M parameters) released alongside a 6716M-parameter ViT-7B teacher; the ViT-S (21M) variant is adopted as the per-individual embedder. On the Edinburgh Pig dataset, the compressed pipeline reaches 92.29% MOTA and 96.15% IDF1 against the SAM 3 teacher (1.68- and 0.84-percentage-point losses), achieves a 7.77-fold reduction in system-level parameters and a 3.01-fold reduction in peak VRAM (19.52GB -> 6.49GB), and reaches 97.34% top-1 accuracy with 91.67% macro-F1 on nine-class pig behaviour classification. The pipeline fits inside an NVIDIA Jetson Orin NX 16GB envelope with 4.9GB of headroom, supporting a proposed -- but not yet empirically validated -- on-device embedding-pool re-identification mechanism whose per-individual footprint of approximately 94MB per animal per year produces a longitudinal visual record amenable to retrospective association with disease, lameness, reproductive, and growth outcome labels.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript describes a distillation pipeline that compresses the 446M-parameter SAM 3 Perception Encoder into a 40.66M-parameter TinyViT-21M-512 FPN student via a four-term direction-then-scale loss and sliding-window session pruning, while adopting the 21M-parameter DINOv3 ViT-S/16 as the per-individual embedder. On the Edinburgh Pig dataset the resulting system reports 92.29% MOTA and 96.15% IDF1 (1.68- and 0.84-point drops from the SAM 3 teacher), 97.34% top-1 accuracy on nine-class behavior classification, a 7.77-fold parameter reduction, and a drop in peak VRAM from 19.52 GB to 6.49 GB, allowing deployment on an NVIDIA Jetson Orin NX 16 GB with 4.9 GB headroom. The abstract further proposes—but does not empirically validate—an on-device embedding-pool re-identification scheme whose ~94 MB per-animal annual footprint is intended to support longitudinal visual analytics.

Significance. If the reported performance retention and memory reductions hold under broader testing, the work would provide a concrete route to bring open-vocabulary detection and self-supervised embeddings to commodity edge hardware for precision livestock farming. The concrete efficiency numbers and the explicit statement that the re-identification component remains unvalidated are both useful for readers assessing readiness for field deployment.

major comments (4)
  1. [Abstract] Abstract: the central claim that the pipeline 'supports' longitudinal visual analytics rests on an explicitly unvalidated on-device embedding-pool re-identification mechanism. Because this component is presented as enabling retrospective disease/lameness association, its lack of empirical validation is load-bearing for the manuscript's broader contribution.
  2. [Methods] Methods (distillation procedure): the four-term direction-then-scale loss is introduced without numerical values for the term weights, without an ablation table, and without a description of the training/validation split or optimizer schedule. These omissions prevent reproduction and make it impossible to determine whether the reported 1.68-point MOTA drop is robust or sensitive to hyper-parameter choices.
  3. [Experiments] Experiments/Results: all quantitative metrics (MOTA, IDF1, behavior-classification accuracy) are reported on a single dataset (Edinburgh Pig) with no cross-farm, cross-species, or cross-lighting transfer experiments, no statistical significance tests, and no failure-case analysis. This directly limits the generalizability assertions made for 'new farm environments and animal species'.
  4. [Results] Results (re-identification): the per-individual 94 MB/year embedding-pool footprint is given as a concrete figure, yet the text states the mechanism is 'proposed—but not yet empirically validated.' The manuscript therefore presents a quantitative claim whose supporting evidence is absent.
minor comments (2)
  1. [Abstract] Abstract: the student is described as '40.66M-parameter' while the encoder is 'TinyViT-21M-512'; a short clarification of which additional modules contribute the remaining parameters would remove ambiguity.
  2. [Methods] Notation: the phrase 'direction-then-scale distillation loss' is used without a forward reference to its exact formulation or to any equation number; adding an equation label would improve readability.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments have identified important opportunities to improve reproducibility, clarify the scope of claims, and strengthen the discussion of limitations. We address each major comment below and indicate the revisions we will make in the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that the pipeline 'supports' longitudinal visual analytics rests on an explicitly unvalidated on-device embedding-pool re-identification mechanism. Because this component is presented as enabling retrospective disease/lameness association, its lack of empirical validation is load-bearing for the manuscript's broader contribution.

    Authors: We agree that the language in the abstract requires clarification. The current text already qualifies the re-identification component as 'proposed—but not yet empirically validated,' and the primary contribution remains the distillation pipeline and its measured efficiency and accuracy on the Edinburgh Pig dataset. To prevent any overstatement, we will revise the abstract to replace 'supports' with 'enables the potential for' and explicitly frame the longitudinal analytics as a prospective use case enabled by the reduced memory footprint rather than a validated outcome. revision: partial

  2. Referee: [Methods] Methods (distillation procedure): the four-term direction-then-scale loss is introduced without numerical values for the term weights, without an ablation table, and without a description of the training/validation split or optimizer schedule. These omissions prevent reproduction and make it impossible to determine whether the reported 1.68-point MOTA drop is robust or sensitive to hyper-parameter choices.

    Authors: We acknowledge these omissions limit reproducibility. In the revised manuscript we will add: (i) the exact numerical weights for each term in the four-term loss, (ii) a dedicated ablation table isolating the contribution of direction and scale components, (iii) the precise training/validation split used on the Edinburgh Pig sessions, and (iv) the full optimizer schedule including learning rate, decay strategy, and number of epochs. These additions will allow readers to evaluate the sensitivity of the observed performance retention. revision: yes

  3. Referee: [Experiments] Experiments/Results: all quantitative metrics (MOTA, IDF1, behavior-classification accuracy) are reported on a single dataset (Edinburgh Pig) with no cross-farm, cross-species, or cross-lighting transfer experiments, no statistical significance tests, and no failure-case analysis. This directly limits the generalizability assertions made for 'new farm environments and animal species'.

    Authors: We accept that evaluation on a single dataset constrains strong generalizability claims. The Edinburgh Pig dataset already contains substantial real-world variation in lighting, density, and camera angles. In the revision we will (a) insert a Limitations section that explicitly discusses the single-dataset constraint and the need for future cross-farm and cross-species validation, (b) report statistical significance (paired t-tests across multiple random seeds) for the reported metrics, and (c) add a qualitative failure-case analysis highlighting common error modes. We will tone down the language regarding immediate applicability to arbitrary new environments while preserving the concrete efficiency results. revision: partial

  4. Referee: [Results] Results (re-identification): the per-individual 94 MB/year embedding-pool footprint is given as a concrete figure, yet the text states the mechanism is 'proposed—but not yet empirically validated.' The manuscript therefore presents a quantitative claim whose supporting evidence is absent.

    Authors: The 94 MB figure is a back-of-the-envelope projection derived from the DINOv3 ViT-S embedding dimension, assumed frame rate, session duration, and storage format; it is not an empirical measurement from a running system. We will revise the relevant section to present the number explicitly as a calculated estimate, include the arithmetic used to obtain it, and reiterate that the full on-device re-identification pipeline remains unvalidated. This change removes any implication that the footprint has been measured in practice. revision: yes
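
One way to reproduce a figure of roughly 94 MB per animal per year from the quantities the response names (embedding dimension, storage format, and sampling rate) is sketched below; the specific numbers are assumptions chosen to land near the quoted footprint, not the authors' arithmetic.

    dim = 384                   # DINOv3 ViT-S/16 embedding dimension
    bytes_per_value = 2         # assuming float16 storage
    embeddings_per_hour = 14    # assumed sampling rate, roughly one crop every ~4 minutes

    bytes_per_year = dim * bytes_per_value * embeddings_per_hour * 24 * 365
    print(f"{bytes_per_year / 1e6:.1f} MB per animal per year")  # ~94.2 MB under these assumptions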

Circularity Check

0 steps flagged

No circularity: empirical metrics are measured outcomes, not derived by construction

Full rationale

The paper describes an empirical distillation pipeline (TinyViT-21M-512 FPN student, four-term loss, sliding-window pruning) and reports directly measured performance numbers (92.29% MOTA, 96.15% IDF1, 97.34% top-1 accuracy) on a held-out Edinburgh Pig dataset against the SAM 3 teacher. These quantities are experimental results from evaluation, not quantities that reduce to fitted parameters or self-defined inputs via any equation in the manuscript. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the central claims; the longitudinal re-identification component is explicitly labeled unvalidated, but that is a validation gap rather than circularity. The derivation chain is self-contained as standard empirical ML reporting.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Only the abstract is available, preventing exhaustive audit. The work rests on standard knowledge-distillation assumptions rather than new theoretical derivations. Free parameters are the balancing coefficients of the four-term loss and the architectural hyperparameters of the student encoder, both chosen to match reported performance. No new entities are postulated beyond the proposed (unvalidated) re-identification mechanism.

free parameters (2)
  • four-term distillation loss weights
    Balancing coefficients for direction and scale terms are selected to achieve the reported transfer performance; exact values not stated in abstract.
  • student encoder scale and pruning thresholds
    TinyViT-21M-512 configuration and sliding-window session pruning parameters are chosen by hand to fit the Jetson memory envelope.
axioms (2)
  • domain assumption: A student network trained with feature-matching distillation can approximate the representational power of a much larger teacher model for downstream tracking and classification tasks.
    Central premise of the entire compression pipeline.
  • domain assumption: The Edinburgh Pig dataset is sufficiently representative of real-world farm video conditions for the reported metrics to generalize.
    Implicit in claiming practical edge deployability.

pith-pipeline@v0.9.0 · 5662 in / 1727 out tokens · 163348 ms · 2026-05-07T08:46:38.820477+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 14 canonical work pages · 7 internal anchors

  1. [1]

    Evaluating multiple object tracking performance: the CLEAR MOT metrics

    Keni Bernardin and Rainer Stiefelhagen. Evaluating multiple object tracking performance: the CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008:246309.

  2. [2]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Baishan Guo, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Triantafyllos Afouras, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane ...

  3. [3]

    Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

    Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677.

  4. [4]

    Gaussian Error Linear Units (GELUs)

    Dan Hendrycks and Kevin Gimpel. Gaussian Error Linear Units (GELUs). arXiv preprint arXiv:1606.08415.

  5. [5]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: a method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations (ICLR).

  6. [6]

    MOT16: A Benchmark for Multi-Object Tracking

    Anton Milan, Laura Leal-Taixé, Ian Reid, Stefan Roth, and Konrad Schindler. MOT16: a benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831.

  7. [7]

    DINOv2: Learning Robust Visual Features without Supervision

    DINOv2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193.

    Georgios I. Papakonstantinou, Nikolaos Voulgarakis, Georgia Terzidou, Lampros Fotos, Elisavet Giamouri, and Vasileios G. Papatsiros. Precision livestock farming technology: applications and challenges of animal welfare and climate change. Agriculture, 14(4):620.

  8. [8]

    SAM 2: Segment Anything in Images and Videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, and Christoph Feichtenhofer. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714.

  9. [9]

    Performance measures and a data set for multi-target, multi-camera tracking

    Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. Performance measures and a data set for multi-target, multi-camera tracking. In Computer Vision – ECCV 2016 Workshops, volume 9914 of Lecture Notes in Computer Science, pages 17–35. Springer.

  10. [10]

    DINOv3

    Oriane Siméoni, Huy V. Vo, Vasil Khalidov, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien ...

  11. [11]

    Dropout: a simple way to prevent neural networks from overfitting

    Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15:1929–1958.

  12. [12]

    PyTorch Image Models (timm)

    Ross Wightman. PyTorch Image Models (timm). GitHub repository, https://github.com/huggingface/pytorch-image-models.

  13. [13]

    Transformers: State-of-the-Art Natural Language Processing

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: state-of-the-art...

  14. [14]

    TinyViT: fast pretraining distillation for small vision transformers

    Kan Wu, Jinnian Zhang, Houwen Peng, Mengchen Liu, Bin Xiao, Jianlong Fu, and Lu Yuan. TinyViT: fast pretraining distillation for small vision transformers. In Computer Vision – ECCV 2022, volume 13681 of Lecture Notes in Computer Science, pages 68–85. Springer.

  15. [15]

    Group normalization

    Yuxin Wu and Kaiming He. Group normalization. In Computer Vision – ECCV 2018, volume 11217 of Lecture Notes in Computer Science, pages 3–19. Springer.

  16. [16]

    EfficientSAM: leveraged masked image pretraining for efficient segment anything

    Yunyang Xiong, Bala Varadarajan, Lemeng Wu, Xiaoyu Xiang, Fanyi Xiao, Chenchen Zhu, Xiaoliang Dai, Dilin Wang, Fei Sun, Forrest Iandola, Raghuraman Krishnamoorthi, and Vikas Chandra. EfficientSAM: leveraged masked image pretraining for efficient segment anything. arXiv preprint arXiv:2312.00863.

  17. [17]

    A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset

    Haiyu Yang, Enhong Liu, Jennifer Sun, Sumit Sharma, Meike van Leerdam, Sebastien Franceschini, Puchun Niu, and Miel Hostens. A computer vision pipeline for individual-level behavior analysis: benchmarking on the Edinburgh Pig Dataset. arXiv preprint arXiv:2509.12047.

  18. [18]

    EfficientSAM3: progressive hierarchical distillation for video concept segmentation from SAM1, SAM2, and SAM3

    Chengxi Zeng, Yuxuan Jiang, and Aaron Zhang. EfficientSAM3: progressive hierarchical distillation for video concept segmentation from SAM1, SAM2, and SAM3. arXiv preprint arXiv:2511.15833.

  19. [19]

    Faster segment anything: towards lightweight SAM for mobile applications

    Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster segment anything: towards lightweight SAM for mobile applications. arXiv preprint arXiv:2306.14289.

  20. [20]

    EdgeSAM: prompt-in-the-loop distillation for on-device deployment of SAM

    Chong Zhou, Xiangtai Li, Chen Change Loy, and Bo Dai. EdgeSAM: prompt-in-the-loop distillation for on-device deployment of SAM. arXiv preprint arXiv:2312.06660.