Automated Segmentation and Tracking of Group Housed Pigs Using Foundation Models

Bimala Acharya; David Rosero; Juan Steibel; Ye Bi

arxiv: 2604.03426 · v1 · submitted 2026-04-03 · 💻 cs.CV

Pith reviewed 2026-05-13 19:57 UTC · model grok-4.3

keywords foundation models · pig tracking · segmentation · precision livestock farming · computer vision · Grounding-DINO · SAM2 · video monitoring

The pith

Pretrained foundation models with modular post-processing enable scalable long-term tracking of group-housed pigs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows that foundation models can serve as general visual backbones for detecting and segmenting nursery pigs in group housing, with only lightweight task-specific logic added afterward to correct errors from night vision or occlusion. The workflow starts with Grounding-DINO detections on still images, moves to Grounded-SAM2 for short video clips, and finishes with a custom long-term pipeline that keeps pig identities stable across hours of continuous footage. A sympathetic reader cares because the method avoids the usual need for large farm-specific labeled datasets and repeated model retraining, which currently limits automated monitoring in livestock production. If the approach holds, continuous video analysis becomes practical at commercial scale without constant new data collection.

Core claim

The paper claims that an FM-centered workflow, in which pretrained vision-language models provide the core visual representations and modular post-processing supplies the farm-specific adaptation, produces reliable automated segmentation and tracking: over 80 percent of short-term tracks are fully correct after refinement, and a 132-minute evaluation yields 0.83 mean region similarity, 0.92 contour accuracy, 0.87 J&F, 0.99 MOTA, 90.7 percent MOTP, and zero identity switches on ground-truth frames.
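
To make the segmentation metrics concrete: under the usual video-segmentation convention, region similarity J is the intersection-over-union of a predicted and a ground-truth mask, and J&F is the mean of J and F. The sketch below assumes binary masks stored as nested lists; the function names and the toy masks are illustrative and do not come from the paper.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks given as nested lists of 0/1."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += 1 if (p and g) else 0
            union += 1 if (p or g) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j_mean, f_mean):
    """J&F is taken here as the arithmetic mean of J and F."""
    return (j_mean + f_mean) / 2.0


# Toy 3x3 masks (made up for illustration only).
pred = [[0, 1, 1],
        [0, 1, 1],
        [0, 0, 0]]
gt = [[0, 1, 1],
      [0, 1, 0],
      [0, 1, 0]]

j = region_similarity(pred, gt)      # 3 shared pixels out of 5 in the union
combined = j_and_f(0.83, 0.92)       # the J and F values reported above
```

With the reported J = 0.83 and F = 0.92 this mean is 0.875, which is consistent with the reported J&F of 0.87 if that figure was computed from unrounded values.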

What carries the argument

The central mechanism is the FM-centered workflow that applies Grounding-DINO for initial detection and Grounded-SAM2 for segmentation, then chains lightweight modules for initialization, short-term tracking, mask refinement, re-identification, and post-hoc quality control to maintain identity consistency.
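
One way to picture the modular design is as a chain of stages that each take the current set of tracks and return an updated set. This is only a structural sketch: the stage names follow the list above, but `make_stage`, `run_pipeline`, and the `history` field are hypothetical placeholders, and each real stage would carry the actual detection, propagation, or re-identification logic.

```python
STAGE_NAMES = [
    "initialization",
    "short_term_tracking",
    "mask_refinement",
    "re_identification",
    "quality_control",
]


def make_stage(name):
    """Build a placeholder stage that only records that it ran (hypothetical)."""
    def stage(tracks):
        # Return a new dict so earlier states are never mutated in place.
        return {
            pig_id: {**data, "history": data.get("history", []) + [name]}
            for pig_id, data in tracks.items()
        }
    return stage


def run_pipeline(tracks, stages):
    """Apply each stage in order, feeding the output of one into the next."""
    for stage in stages:
        tracks = stage(tracks)
    return tracks


stages = [make_stage(name) for name in STAGE_NAMES]
result = run_pipeline({1: {}, 2: {}}, stages)   # two toy pig IDs
```

Keeping every stage behind the same interface is what makes it cheap to swap, reorder, or disable a module without touching the foundation models themselves.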

Load-bearing premise

The general visual features learned by the foundation models transfer well enough to pig images under night-vision and heavy occlusion that the modular post-processing can fix remaining errors without needing extensive new labels or retraining for each farm.

What would settle it

Running the same pipeline on a new set of farm videos with different lighting, camera angles, or pig densities and observing frequent identity switches or mask failures on a majority of tracks would show the claim does not hold.

Figures

Figures reproduced from arXiv: 2604.03426 by Bimala Acharya, David Rosero, Juan Steibel, Ye Bi.

**Figure 2.** Experimental setting on nursery pig farm. (a-b) Schematic and real views of the ...

**Figure 3.** Bidirectional SAM2 propagation diagram. After identifying the first erroneous ...

**Figure 4.** Long-term video segmentation workflow. The pipeline integrates six modules ...

**Figure 5.** Grounding-DINO performance for pig detection. (a) F1 score, recall, and precision ...

**Figure 6.** Pig detection examples by raw GroundingDINO under different conditions. In ...

**Figure 7.** Short-term video segmentation examples by raw Grounded-SAM2. (a-b) Pigs ...

**Figure 8.** Short-term video segmentation results. (a) Number of stacked and unstacked pigs ...

**Figure 9.** Quantitative results for long-term video segmentation. Modified foundation model ...

**Figure 10.** Qualitative results for long-term video segmentation. (a) The ...

**Figure 11.** Qualitative results for long-term video segmentation. The ...

Original abstract

Foundation models (FM) are reshaping computer vision by reducing reliance on task-specific supervised learning and leveraging general visual representations learned at scale. In precision livestock farming, most pipelines remain dominated by supervised learning models that require extensive labeled data, repeated retraining, and farm-specific tuning. This study presents an FM-centered workflow for automated monitoring of group-housed nursery pigs, in which pretrained vision-language FM serve as general visual backbones and farm-specific adaptation is achieved through modular post-processing. Grounding-DINO was first applied to 1,418 annotated images to establish a baseline detection performance. While detection accuracy was high under daytime conditions, performance degraded under night-vision and heavy occlusion, motivating the integration of temporal tracking logic. Building on these detections, short-term video segmentation with Grounded-SAM2 was evaluated on 550 one-minute video clips; after post-processing, over 80% of 4,927 active tracks were fully correct, with most remaining errors arising from inaccurate masks or duplicated labels. To support identity consistency over an extended time, we further developed a long-term tracking pipeline integrating initialization, tracking, matching, mask refinement, re-identification, and post-hoc quality control. This system was evaluated on a continuous 132-minute video and maintained stable identities throughout. On 132 uniformly sampled ground-truth frames, the system achieved a mean region similarity (J) of 0.83, contour accuracy (F) of 0.92, J&F of 0.87, MOTA of 0.99, and MOTP of 90.7%, with no identity switches. Overall, this work demonstrates how FM prior knowledge can be combined with lightweight, task-specific logic to enable scalable, label-efficient, and long-duration monitoring in pig production.
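
For readers less familiar with the tracking metrics, MOTA in the standard CLEAR MOT formulation is one minus the ratio of false negatives, false positives, and identity switches to the number of ground-truth objects. MOTP is sketched here as the mean overlap of matched pairs, which is an assumption suggested by the percentage form of the reported value; the paper's exact definition is not given in this summary. All counts below are invented for illustration and are not the paper's data.

```python
def mota(false_negatives, false_positives, id_switches, num_gt):
    """CLEAR MOT accuracy: 1 - (FN + FP + IDSW) / total ground-truth objects."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def motp(matched_overlaps):
    """Mean overlap (e.g. IoU) over all matched prediction/ground-truth pairs.

    Assumption: overlap-based MOTP; some papers use a distance instead.
    """
    if not matched_overlaps:
        raise ValueError("need at least one matched pair")
    return sum(matched_overlaps) / len(matched_overlaps)


# Hypothetical counts: 1000 ground-truth objects, 10 errors, no ID switches.
example_mota = mota(false_negatives=6, false_positives=4, id_switches=0, num_gt=1000)

# Hypothetical per-match overlaps.
example_motp = motp([0.90, 0.92, 0.90])
```

Because identity switches enter MOTA only as one term of the error count, a MOTA of 0.99 with zero switches still leaves room for a small number of missed or spurious detections.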

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Applies foundation models to pig tracking with good results on test data but needs more checks on whether the post-processing generalizes beyond one video.

The punchline is that this is a solid applied paper showing how to adapt recent foundation models for pig segmentation and tracking in group housing, but the impressive long-term tracking numbers come from a single video and unablated post-processing steps. They take Grounding-DINO for detection and Grounded-SAM2 for segmentation, then add a modular pipeline with initialization, tracking, matching, re-identification, and quality control. On their 132-minute test video, sampled frames give J=0.83, F=0.92, MOTA=0.99 with zero switches. That's useful for precision livestock work where labeling is expensive. The approach reduces the need for farm-specific training by relying on the models' general features plus rule-based fixes.

What stands out is the focus on real-world issues like occlusion and night vision, with the post-processing handling most errors. Over 80% of tracks in the short clips become fully correct after these steps. It's pragmatic and directly addresses the labeling burden in commercial settings.

The main limitation is the lack of detailed validation for the post-processing. There's no ablation showing what each component contributes, no error bars on the metrics, and no tests on hold-out farms or different lighting conditions beyond noting the detection drop at night. The high performance on one continuous video doesn't yet prove the pipeline is robust without per-farm tuning. The abstract mentions degradation but doesn't quantify it for the full system.

This paper is aimed at researchers and practitioners in agricultural AI and animal monitoring. Someone building similar systems would find the workflow description and metrics helpful as a starting point. I would recommend sending it for peer review. The concrete results and the problem it tackles make it worth referee input on strengthening the evaluation.

Referee Report

4 major / 2 minor

Summary. The paper proposes a pipeline for automated detection, segmentation, and long-term tracking of group-housed nursery pigs that uses pretrained foundation models (Grounding-DINO for detection and Grounded-SAM2 for video segmentation) as backbones and relies on modular, lightweight post-processing (initialization, tracking, matching, mask refinement, re-identification, and quality control) for farm-specific adaptation. On 1,418 annotated images it reports strong daytime detection that degrades under night-vision and occlusion; on 550 one-minute clips >80 % of 4,927 tracks become fully correct after post-processing; and on a single 132-minute video it obtains J = 0.83, F = 0.92, J&F = 0.87, MOTA = 0.99, MOTP = 90.7 % with zero identity switches on 132 sampled frames.

Significance. If the central claim holds, the work shows that general visual features from foundation models can be combined with task-specific logic to achieve label-efficient, long-duration monitoring in precision livestock farming, substantially reducing the need for repeated supervised retraining and per-farm data collection.

major comments (4)

[Abstract and long-term tracking evaluation] The claim that modular post-processing reliably corrects residual errors under night-vision and heavy occlusion is not supported by any per-condition breakdown or ablation that isolates the contribution of each logic module (initialization, matching, mask refinement, re-ID, QC) to the final MOTA = 0.99 and zero ID switches.
[Abstract] Daytime-to-night degradation is acknowledged on the 1,418-image detection set, yet no quantitative results are given for the same images after the full post-processing pipeline, leaving the weakest assumption (sufficient feature transfer without per-farm retraining) untested.
[Long-term tracking pipeline] No details are supplied on how thresholds or decision rules inside the post-processing modules were chosen or whether they were tuned on the 132-minute evaluation video rather than held-out data.
[Experiments] The single 132-minute video yielding MOTA = 0.99 cannot establish farm-agnostic generality; no cross-farm, cross-lighting, or cross-condition hold-out results are reported.
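
The concern about thresholds can be illustrated with a generic association step. The sketch below greedily matches existing tracks to new detections by bounding-box IoU and shows how the outcome changes with the cut-off. It is not the paper's matching module: the greedy strategy, the `iou_threshold` values, and the toy boxes are all assumptions chosen for illustration.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(tracks, detections, iou_threshold=0.5):
    """Match track IDs to detection indices, best IoU first (hypothetical rule)."""
    candidates = sorted(
        ((box_iou(box, det), track_id, det_idx)
         for track_id, box in tracks.items()
         for det_idx, det in enumerate(detections)),
        reverse=True,
    )
    matches = {}
    used_detections = set()
    for iou, track_id, det_idx in candidates:
        if iou < iou_threshold:
            break  # everything after this point overlaps even less
        if track_id in matches or det_idx in used_detections:
            continue
        matches[track_id] = det_idx
        used_detections.add(det_idx)
    return matches


# Toy boxes (made up): track 1 overlaps detection 1 strongly, track 2 overlaps
# detection 0 moderately, detection 2 overlaps nothing.
tracks = {1: (0, 0, 10, 10), 2: (20, 20, 30, 30)}
detections = [(21, 21, 31, 31), (1, 0, 11, 10), (50, 50, 60, 60)]

loose = greedy_match(tracks, detections, iou_threshold=0.5)
strict = greedy_match(tracks, detections, iou_threshold=0.7)
```

In this toy case the looser threshold keeps both tracks while the stricter one drops track 2, which is the kind of sensitivity that makes the choice of tuning data worth reporting.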

minor comments (2)

[Abstract] The reported metrics (J = 0.83, MOTA = 0.99, etc.) are given without error bars, confidence intervals, or standard deviations across the 132 sampled frames.
[Long-term tracking evaluation] The procedure used to select the 132 uniformly sampled ground-truth frames should be described explicitly.
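
Both minor comments could be addressed with a few lines of analysis code. The sketch below shows one plausible reading of "uniformly sampled" (one frame at the midpoint of each equal-length segment) and a percentile bootstrap for a confidence interval on a per-frame metric. The frame rate of 30 fps, the function names, and the per-frame scores are assumptions made for illustration; none of them come from the paper.

```python
import random


def uniform_frame_indices(total_frames, num_samples):
    """Pick the midpoint frame of each of num_samples equal-length segments."""
    step = total_frames / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]


def bootstrap_ci(values, num_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)          # fixed seed so the result is repeatable
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(num_resamples)
    )
    lower = means[int((alpha / 2) * num_resamples)]
    upper = means[int((1 - alpha / 2) * num_resamples) - 1]
    return lower, upper


# 132 minutes at an assumed 30 frames per second.
total_frames = 132 * 60 * 30
indices = uniform_frame_indices(total_frames, 132)

# Invented per-frame J scores, standing in for the 132 real values.
scores = [0.80, 0.85, 0.83, 0.86, 0.81, 0.84, 0.79, 0.88, 0.82, 0.83]
ci_low, ci_high = bootstrap_ci(scores)
```

Under the assumed frame rate this yields one frame per minute, and reporting the resulting interval next to each mean would answer the error-bar request directly.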

Simulated Author's Rebuttal

4 responses · 1 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and will revise the manuscript to incorporate clarifications, additional analyses, and explicit limitations where appropriate.

Referee: [Abstract and long-term tracking evaluation] the claim that modular post-processing reliably corrects residual errors under night-vision and heavy occlusion is not supported by any per-condition breakdown or ablation that isolates the contribution of each logic module (initialization, tracking, matching, mask refinement, re-identification, and quality control) to the final MOTA = 0.99 and zero ID switches.

Authors: We agree that the current presentation does not isolate module contributions or provide per-condition breakdowns. In the revised manuscript we will add (i) daytime vs. night-vision subset evaluations on the 550-clip set and (ii) ablation results that successively disable each post-processing module while reporting MOTA, ID switches, and J&F on the 132-minute video.
revision: yes

Referee: [Abstract] daytime-to-night degradation is acknowledged on the 1,418-image detection set, yet no quantitative results are given for the same images after the full post-processing pipeline, leaving the weakest assumption (sufficient feature transfer without per-farm retraining) untested.

Authors: We acknowledge the omission. We will run the complete post-processing pipeline on the 1,418-image set, report the resulting precision, recall, and mask quality metrics separately for daytime and night-vision subsets, and include these numbers in both the abstract and results section.
revision: yes

Referee: [Long-term tracking pipeline] no details are supplied on how thresholds or decision rules inside the post-processing modules were chosen or whether they were tuned on the 132-minute evaluation video rather than held-out data.

Authors: We will expand the Methods section with explicit descriptions of each threshold (e.g., IoU, similarity, quality-control cut-offs), the data used for their selection, and any preliminary tuning performed on separate one-minute clips. If any parameter was adjusted using the 132-minute video, we will state this clearly and discuss its implications.
revision: yes

Referee: [Experiments] the single 132-minute video yielding MOTA = 0.99 cannot establish farm-agnostic generality; no cross-farm, cross-lighting, or cross-condition hold-out results are reported.

Authors: We recognize that evaluation on one continuous video limits claims of generality. The revised manuscript will contain a dedicated Limitations paragraph that explicitly states this constraint and outlines planned multi-farm validation. No additional annotated cross-farm videos are currently available, so quantitative cross-farm results cannot be added at this time.
revision: partial
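
The ablation promised in the first response could be organized as a leave-one-out sweep over the six modules. This is a sketch of one possible reading of "successively disable each module": the configuration names, `ablation_configs`, `run_ablation`, and the stand-in `evaluate` callback are hypothetical, and a real run would compute MOTA, ID switches, and J&F instead of the placeholder score.

```python
MODULES = [
    "initialization",
    "tracking",
    "matching",
    "mask_refinement",
    "re_identification",
    "quality_control",
]


def ablation_configs(modules):
    """Return the full configuration plus one variant per disabled module."""
    configs = {"full": list(modules)}
    for disabled in modules:
        configs[f"no_{disabled}"] = [m for m in modules if m != disabled]
    return configs


def run_ablation(modules, evaluate):
    """Evaluate every configuration with a user-supplied scoring function."""
    return {
        name: evaluate(enabled)
        for name, enabled in ablation_configs(modules).items()
    }


def placeholder_evaluate(enabled_modules):
    # Stand-in only: a real evaluator would run the pipeline on annotated
    # frames and return the tracking and segmentation metrics.
    return {"num_enabled": len(enabled_modules)}


configs = ablation_configs(MODULES)
results = run_ablation(MODULES, placeholder_evaluate)
```

Comparing each `no_*` row against `full` attributes a metric drop to a single module, although it does not capture interactions between modules.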

standing simulated objections not resolved

We do not possess additional annotated videos from other farms or lighting conditions, so we cannot supply quantitative cross-farm or cross-condition hold-out results in the revision.

Circularity Check

0 steps flagged

No circularity; results obtained via direct evaluation against annotated ground-truth data

full rationale

The paper describes an applied workflow that combines off-the-shelf pretrained foundation models (Grounding-DINO, Grounded-SAM2) with hand-crafted modular post-processing steps for detection, short-term segmentation, and long-term tracking. All reported metrics (detection accuracy on 1,418 images, track correctness on 4,927 tracks from 550 clips, J/F/MOTA on 132 sampled frames from a 132-minute video) are computed by direct comparison against independently annotated ground-truth labels. Whether the post-processing thresholds were tuned on the same 132-minute video is not stated (see the referee's third major comment), so the evaluation data cannot be confirmed as strictly held-out. No equations, fitted parameters, self-definitional loops, or load-bearing self-citations appear in the derivation chain. The central claim therefore reduces to empirical measurement rather than to any quantity defined by the same inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of general foundation-model features to pig imagery and the sufficiency of lightweight post-processing to handle domain-specific failures; no new entities or heavily fitted parameters are introduced.

axioms (1)

Domain assumption: Pre-trained vision-language foundation models provide sufficiently general representations that transfer to pig detection and segmentation without domain-specific fine-tuning.
Invoked when the authors apply Grounding-DINO and Grounded-SAM2 directly to farm images and rely on post-processing rather than retraining.
