A Computer Vision Pipeline for Individual-Level Behavior Analysis: Benchmarking on the Edinburgh Pig Dataset

arxiv: 2509.12047 · v2 · submitted 2025-09-15 · 💻 cs.CV · cs.AI

Haiyu Yang, Enhong Liu, Jennifer Sun, Sumit Sharma, Meike van Leerdam, Sebastien Franceschini, Puchun Niu, Miel Hostens

Pith reviewed 2026-05-18 16:43 UTC · model grok-4.3

Classification: cs.CV, cs.AI

Keywords: computer vision, pig behavior analysis, animal welfare monitoring, object detection, video tracking, behavior recognition, group housing, precision farming

The pith

A computer vision pipeline achieves 94.2% accuracy in individual pig behavior recognition from group-housing videos.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a modular pipeline that applies current computer vision models to automate behavior analysis for pigs kept in indoor group housing. It chains zero-shot object detection to find animals, motion-aware segmentation and tracking to follow them through overlaps, and vision transformers to pull out features for classifying actions. On the Edinburgh Pig Behavior Video Dataset the temporal model reaches 94.2 percent overall accuracy, a 21.2-point gain over earlier methods, while posting 93.3 percent identity preservation in tracking and 89.3 percent average precision in detection. The design replaces slow, subjective manual watching with objective, continuous records that can support welfare checks and productivity measures in commercial farms.
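
As a rough illustration of the modular chaining described above, the following Python sketch wires four interchangeable stages together. Every name in it (`Pipeline`, the stub stages, the toy frame format) is hypothetical and is not taken from the authors' code. Real stages would wrap a zero-shot detector, a motion-aware tracker, a vision transformer encoder, and a temporal classifier. Seeding the tracker from the first frame only is an assumption made for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2) in pixels
Frame = List[List[int]]           # toy stand-in for an image array


@dataclass
class Pipeline:
    """Chains four swappable stages; each stage is a plain callable."""
    detect: Callable[[Frame], List[Box]]
    track: Callable[[Sequence[Frame], List[Box]], Dict[int, List[Box]]]
    embed: Callable[[Frame, Box], List[float]]
    classify: Callable[[List[List[float]]], str]

    def run(self, frames: Sequence[Frame]) -> Dict[int, str]:
        # Assumption: detection only seeds the tracker on the first frame.
        seeds = self.detect(frames[0])
        tracks = self.track(frames, seeds)
        labels = {}
        for animal_id, boxes in tracks.items():
            # One embedding per frame, then one label per animal sequence.
            feats = [self.embed(f, b) for f, b in zip(frames, boxes)]
            labels[animal_id] = self.classify(feats)
        return labels


# Hypothetical stub stages so the sketch runs without any model weights.
def stub_detect(frame):
    return [(0, 0, 2, 2), (2, 2, 4, 4)]

def stub_track(frames, seeds):
    # Pretend every animal stays where it was first detected.
    return {i: [box] * len(frames) for i, box in enumerate(seeds)}

def stub_embed(frame, box):
    x1, y1, x2, y2 = box
    patch = [v for row in frame[y1:y2] for v in row[x1:x2]]
    return [sum(patch) / len(patch)]

def stub_classify(feats):
    mean = sum(f[0] for f in feats) / len(feats)
    return "standing" if mean > 0.5 else "lying"


frames = [[[1, 1, 0, 0]] * 2 + [[0, 0, 0, 0]] * 2 for _ in range(3)]
pipeline = Pipeline(stub_detect, stub_track, stub_embed, stub_classify)
result = pipeline.run(frames)
```

Because each stage is only a callable with a fixed signature, swapping one detector for another changes a single constructor argument and leaves the rest of the chain untouched.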

Core claim

Combining off-the-shelf zero-shot object detection, motion-aware segmentation and tracking, and vision transformer feature extraction into a modular pipeline enables accurate individual-level behavior recognition in occluded group-housed pigs, demonstrated by 94.2 percent accuracy on the Edinburgh Pig Behavior Video Dataset.

What carries the argument

Modular pipeline that links zero-shot object detection, motion-aware segmentation and tracking, and vision transformer feature extraction for behavior classification.

If this is right

Supplies continuous objective records for assessing pig welfare, health, and productivity without human observers.
Supports scalable monitoring across commercial indoor pig facilities.
Delivers more than twenty percentage points higher accuracy than previous automated methods on the same benchmark.
Modular structure allows reuse on other livestock species after targeted validation.
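
The size of the accuracy gain can be unpacked with simple arithmetic on the two reported figures. The implied baseline accuracy and the relative error reduction below are derived here from 94.2% and 21.2 percentage points; they are not quoted from the paper.

```latex
\begin{align*}
\text{implied baseline accuracy} &= 94.2\% - 21.2\ \text{pp} = 73.0\% \\
\text{error rate} &: \; 100\% - 73.0\% = 27.0\% \;\longrightarrow\; 100\% - 94.2\% = 5.8\% \\
\text{relative error reduction} &= \frac{27.0 - 5.8}{27.0} \approx 78.5\%
\end{align*}
```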

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-time use could feed automated alerts for early signs of illness or stress in farm management software.
The same component chain may extend to behavior monitoring of other group-housed animals such as poultry or cattle.
Public release of the code creates a base that other groups can adapt for new camera setups or species.

Load-bearing premise

Off-the-shelf zero-shot detection and motion-aware segmentation models transfer to pig group-housing videos with only minimal domain shift or occlusion failures.

What would settle it

Apply the same pipeline to a new set of pig videos recorded under different lighting, camera angles, breeds, or higher crowding and measure whether accuracy falls substantially below 94.2 percent.

Figures

Figures reproduced from arXiv: 2509.12047 by Enhong Liu, Haiyu Yang, Jennifer Sun, Meike van Leerdam, Miel Hostens, Puchun Niu, Sebastien Franceschini, Sumit Sharma.

**Figure 6.** Bounding boxes overlay with predictions from YOLOv12 (right) showed the successful ...

**Figure 7.** Detection results on overhead pig housing footage showing YOLOv12's underdetection and ...

**Figure 8.** Comparative segmentation performance on occluded cattle showing SAMURAI's precise instance ...

Original abstract

Animal behavior analysis plays a crucial role in understanding animal welfare, health status, and productivity in agricultural settings. However, traditional manual observation methods are time-consuming, subjective, and limited in scalability. We present a modular pipeline that leverages open-sourced state-of-the-art computer vision techniques to automate animal behavior analysis in a group housing environment. Our approach combines state-of-the-art models for zero-shot object detection, motion-aware segmentation and tracking, and advanced feature extraction using vision transformers for robust behavior recognition. The pipeline addresses challenges including animal occlusions and group housing scenarios, as demonstrated in indoor pig monitoring. We validated our system on the Edinburgh Pig Behavior Video Dataset for multiple behavioral tasks. Our temporal model achieved 94.2% overall accuracy, representing a 21.2 percentage point improvement over existing methods. The pipeline demonstrated robust tracking capabilities with a 93.3% identity preservation (IDF1) score and an 89.3% average precision (AP) for object detection. The modular design suggests potential for adaptation to other contexts, though further validation across species would be required. The open-source implementation provides a scalable solution for behavior monitoring, contributing to precision pig farming and welfare assessment through automated, objective, and continuous analysis.
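
For readers unfamiliar with the tracking metric, IDF1 is conventionally defined from identity-level true positives (IDTP), false positives (IDFP), and false negatives (IDFN). The formula below is the standard definition, given as background. The abstract does not describe the exact evaluation protocol, so this should not be read as the authors' stated computation.

```latex
\mathrm{IDF1} = \frac{2\,\mathrm{IDTP}}{2\,\mathrm{IDTP} + \mathrm{IDFP} + \mathrm{IDFN}}
```

Equivalently, IDF1 is the harmonic mean of identity precision, IDTP / (IDTP + IDFP), and identity recall, IDTP / (IDTP + IDFN).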

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper assembles a working modular CV pipeline for pig behavior tracking and classification on the Edinburgh dataset, hitting solid reported numbers, but the 21-point accuracy gain rests on an unclear baseline comparison.

Hi colleague, this paper puts together an end-to-end pipeline that chains zero-shot detection, motion-aware tracking, and vision transformers to monitor individual pig behaviors in group housing. It reports 94.2% accuracy on the Edinburgh Pig Dataset, plus 93.3% IDF1 and 89.3% AP, and claims a 21-point lift over existing methods. The modular design and open-source release are the useful parts; anyone working on farm monitoring can take the pieces and adapt them without reinventing the wheel. The numbers look decent for an applied setting that has to deal with occlusions and multiple animals in one frame.

The main soft spot is the comparison itself. The abstract does not say whether the baselines were re-implemented and tested on the exact same train/test splits and labeling scheme, or whether the numbers were simply taken from prior papers. If the conditions do not line up, the reported improvement could be smaller or absent. The scope is also narrow (one species, indoor pens only), so claims about broader transfer need more evidence.

This work is for people doing applied computer vision in livestock or precision farming. A reader who needs a concrete starting point with public data and code will find it worth reading. The empirical results on a named dataset are concrete enough to justify sending it to referees rather than rejecting it outright. I would recommend peer review, with the main request being a clear description of the baseline protocol and splits.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a modular computer vision pipeline for individual-level behavior analysis in group-housed pigs on the Edinburgh Pig Dataset. It combines zero-shot object detection, motion-aware segmentation and tracking, and vision transformer feature extraction for behavior recognition. The temporal model reports 94.2% overall accuracy (21.2 pp improvement over existing methods), 93.3% IDF1, and 89.3% AP, with claims of robustness to occlusions and group interactions; the implementation is open-sourced.

Significance. If the performance deltas are shown to arise from fair, identical-condition comparisons on the same held-out data, the work could provide a practical, reproducible tool for automated welfare monitoring in precision pig farming. The modular open-source design and use of a public benchmark dataset are strengths that support potential adaptation to other species.

major comments (2)

Abstract: the headline claim of a 21.2 percentage point accuracy improvement over existing methods is load-bearing for the central contribution, yet the text provides no indication whether the cited baselines were re-implemented and re-evaluated on the authors' exact train/test splits, behavior taxonomy, and IDF1/AP preprocessing pipeline; any mismatch in video selection or labeling granularity would inflate the reported delta without demonstrating pipeline superiority.
Methods/Experiments (pipeline description): the assumption that off-the-shelf zero-shot detectors and motion-aware segmenters transfer directly to pig group-housing footage with minimal domain shift is central to the robustness narrative, but no quantitative failure analysis (e.g., occlusion rates, identity switches under crowding) is supplied to substantiate that the reported 93.3% IDF1 and 89.3% AP reflect genuine generalization rather than dataset-specific luck.
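
One way to produce the kind of failure analysis requested above is to count identity switches per ground-truth animal. The sketch below assumes a hypothetical input format: a list of per-frame dictionaries mapping each ground-truth pig ID to the predicted track ID matched in that frame. It is not the authors' evaluation code, and it uses a deliberately simple definition in which a switch is any change of matched track ID between consecutive matched observations of the same animal.

```python
from typing import Dict, List, Optional


def count_id_switches(matches: List[Dict[int, Optional[int]]]) -> Dict[int, int]:
    """Count identity switches per ground-truth animal.

    matches[t][gt_id] is the predicted track ID matched to gt_id in
    frame t, or None when the animal is unmatched (e.g. fully occluded).
    """
    last_seen: Dict[int, int] = {}
    switches: Dict[int, int] = {}
    for frame in matches:
        for gt_id, track_id in frame.items():
            switches.setdefault(gt_id, 0)
            if track_id is None:
                continue  # a missed frame is not itself counted as a switch
            if gt_id in last_seen and last_seen[gt_id] != track_id:
                switches[gt_id] += 1
            last_seen[gt_id] = track_id
    return switches


# Toy example: pigs 1 and 2 swap track IDs after an occlusion in frame 2.
toy = [
    {1: 10, 2: 20},
    {1: 10, 2: 20},
    {1: None, 2: 20},
    {1: 20, 2: 10},
]
switch_counts = count_id_switches(toy)
```

Breaking these counts down by crowding level or occlusion rate would require per-frame annotations of those conditions, which this sketch does not model.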

minor comments (2)

Abstract: the behavior taxonomy and number of classes are not enumerated; adding this would clarify the scope of the 94.2% accuracy figure.
Overall: a pipeline diagram showing the flow from detection through tracking to temporal classification would improve readability without altering technical content.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important aspects of clarity and substantiation that we address point-by-point below. We have prepared revisions to strengthen the manuscript accordingly.

Referee: Abstract: the headline claim of a 21.2 percentage point accuracy improvement over existing methods is load-bearing for the central contribution, yet the text provides no indication whether the cited baselines were re-implemented and re-evaluated on the authors' exact train/test splits, behavior taxonomy, and IDF1/AP preprocessing pipeline; any mismatch in video selection or labeling granularity would inflate the reported delta without demonstrating pipeline superiority.

Authors: We agree that the abstract should explicitly address the fairness of the comparison. The full experimental section describes re-implementation of the cited baselines using the identical train/test splits, behavior taxonomy, and preprocessing pipeline from the Edinburgh Pig Dataset. To eliminate any ambiguity, we will revise the abstract to include a concise statement confirming identical-condition evaluation and add a short paragraph in the Experiments section with a comparison table of setup details (splits, taxonomy, metrics).
Revision: yes

Referee: Methods/Experiments (pipeline description): the assumption that off-the-shelf zero-shot detectors and motion-aware segmenters transfer directly to pig group-housing footage with minimal domain shift is central to the robustness narrative, but no quantitative failure analysis (e.g., occlusion rates, identity switches under crowding) is supplied to substantiate that the reported 93.3% IDF1 and 89.3% AP reflect genuine generalization rather than dataset-specific luck.

Authors: We acknowledge that aggregate IDF1 and AP scores alone do not fully quantify robustness to specific failure modes. We will add a dedicated quantitative failure analysis subsection that reports occlusion rates (derived from frame-level annotations), identity switch frequencies under varying crowding levels, and per-scenario breakdowns using the dataset's existing ground-truth. This analysis will be supported by our tracking logs and will directly link the reported metrics to generalization performance.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmarking on external dataset with off-the-shelf components

The paper presents a modular computer vision pipeline that applies existing zero-shot detection, motion-aware segmentation, tracking, and vision transformer feature extraction to the Edinburgh Pig Behavior Video Dataset. Reported metrics (94.2% accuracy, 93.3% IDF1, 89.3% AP) are direct empirical outcomes of running these models on the dataset and comparing against prior published methods. No equations, fitted parameters, or self-citations are used to derive the accuracy figure from internal definitions; the central claim remains an external validation result rather than a self-referential reduction. The pipeline is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of pre-trained vision models to pig videos and on the representativeness of the Edinburgh dataset for real farm conditions.

axioms (1)

Domain assumption: Zero-shot object detection and motion-aware segmentation models generalize to occluded group-housed pigs without domain-specific fine-tuning.
Invoked in the description of the detection and tracking stages of the pipeline.

pith-pipeline@v0.9.0 · 5777 in / 1351 out tokens · 39219 ms · 2026-05-18T16:43:26.529608+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Our temporal model achieved 94.2% overall accuracy... zero-shot object detection, motion-aware segmentation and tracking, and advanced feature extraction using vision transformers"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "modular pipeline... DINOv2... LSTM classifier"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics
cs.CV 2026-04 unverdicted novelty 5.0

Distilled SAM 3 and DINOv3 models deliver near-teacher accuracy in pig tracking (92.29% MOTA, 96.15% IDF1) and behavior classification while achieving 7.77x parameter reduction and fitting on Jetson Orin NX with headroom.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

(2024) and 2) The Edinburgh Pig Behavior Video Dataset from Bergamini et al

Materials and Methods 2.1 Datasets Description We deployed our pipeline on two open-sourced datasets: 1) CBVD-5 dataset from Li et al. (2024) and 2) The Edinburgh Pig Behavior Video Dataset from Bergamini et al. (2021). As we only validated the feature extraction ability of the model on the CBVD-5 dataset, the details of that experiment will be shared i...

[2]

mostly tracked

Results 3.3 Validation on The Edinburgh Pig Behavior Video 3.3.1 Dataset Preparation We decoded the 12 annotated video sequences using the stride specified on the official website of the dataset, following the procedure of Section 2.2.2. Each sequence yielded 600 frames, for a total of 7,200 labeled frames, each with 8 labeled pigs,...

[3]

MLP Classification Results We first evaluated a Multi-Layer Perceptron (MLP) classifier on the extracted DINOv2 features. After filtering to nine well-represented behaviors (standing, lying, eating, drinking, sitting, sleeping, running, playing with toy, and nose-to-nose interactions), we obtained 28,698 examples with a 70/15/15 train/validation/test sp...

[4]

Using a sliding window approach with majority-based filtering, we generated 14,255 temporal windows from the original dataset

LSTM Classification Results To better capture temporal dependencies in behavioral sequences, we implemented an LSTM-based classifier that processes sequences of DINOv2 embeddings. Using a sliding window approach with majority-based filtering, we generated 14,255 temporal windows from the original dataset. The LSTM classifier achieved a test accuracy of 9...

[5]

(2021) using a fine-tuned YOLOv3 model

Discussion: 4.1 Comparison with Former Research: 4.1.1 Benchmarking Results Comparison: Our object detection results using OWLv2 achieved an AP of 89.28%, which is 5.93 percentage points lower than the 95.21% AP reported by Bergamini et al. (2021) using a fine-tuned YOLOv3 model. However, direct comparison is challenging as the details about which seque...

[6]

Computational efficiency: Processing-intensive operations can be optimized independently. For example, the decoding and tracking modules include specialized memory management to handle long video sequences, while feature extraction employs parallel processing to maximize throughput

[7]

If tracking temporarily fails due to occlusion, the system can recover in subsequent frames without cascading errors through the entire system

Error isolation: Problems in one module do not necessarily compromise the entire pipeline. If tracking temporarily fails due to occlusion, the system can recover in subsequent frames without cascading errors through the entire system

[8]

Low-light conditions might benefit from enhanced detection models, while crowded scenes may require specialized tracking approaches

Adaptation to environmental conditions: Different farm environments may require different configurations. Low-light conditions might benefit from enhanced detection models, while crowded scenes may require specialized tracking approaches. Our modular design allows these adaptations - as demonstrated when we switched from YOLOv12 to OWLv2 for pig detectio...

[9]

Potential for adaptation: The modular design theoretically facilitates adaptation to other contexts. By replacing individual modules while maintaining the core pipeline architecture, the system could potentially be adapted for other applications, though this would require validation for each new use case. Our experience switching from YOLOv12 to OWLv2 dem...

[10]

zero-shot

Incremental deployment: Resource-constrained environments can implement a subset of the pipeline. For example, if real-time processing is not required, users can deploy only the feature extraction and classification modules on pre-recorded video. 4.2 Limitations 4.2.1 High-quality Start Point and Delicate Dataflow The pipeline requires that animals in th...

[11]

zero-shot

Conclusions and Perspectives We have presented a modular pipeline for automated behavior analysis validated on pig monitoring in group housing environments. By integrating state-of-the-art deep learning techniques including OWLv2 for detection, SAMURAI for tracking, and DINOv2 for feature extraction, our pipeline achieved 94.2% accuracy on nine-class pig...

[12]

a photo of a [class]

Supplementary 6.1 Considerations and Trials for Deciding Model In order to identify the best model, we conducted some evaluations of several open-source models. Two distinct video datasets were used: a proprietary recording of our own dairy cows and the publicly available Edinburgh Pig Behavior Video (Bergamini et al., 2021). 6.1.1 Object Detection and L...

[13]

Glossary A2: Area Attention AI: Artificial Intelligence AO: Average Overlap API: Application Programming Interface AUC: Area Under the Curve CBVD-5: Cow Behavior Video Dataset (5 categories) CLS: Class token (in transformers) CLIP: Contrastive Language-Image Pre-training CNN: Convolutional Neural Network CPU: Central Processing Unit DINOv2: Self-Distillat...

[14]

Segment Anything

Reference Alvarenga, A. B., Oliveira, H. R., Chen, S. Y., Miller, S. P., Marchant-Forde, J. N., Grigoletto, L., & Brito, L. F. (2021). A systematic review of genomic regions and candidate genes underlying behavioral traits in farmed mammals and their link with human disorders. Animals, 11(3), 715. Antanaitis, R., Džermeikaitė, K., Bespalovaitė, A., Ribel...
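
Entry [4] above mentions a sliding window approach with majority-based filtering for building the LSTM inputs. The excerpt does not give the window length, stride, or majority threshold, so the values below are placeholders. The function is one plausible reading of the technique, not the authors' implementation: a window is kept only if its most common label covers at least a chosen fraction of its frames, and that label becomes the window's target.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def majority_windows(
    embeddings: Sequence[Sequence[float]],
    labels: Sequence[str],
    window: int = 4,             # placeholder window length in frames
    stride: int = 2,             # placeholder step between window starts
    min_fraction: float = 0.75,  # placeholder majority threshold
) -> List[Tuple[List[Sequence[float]], str]]:
    """Slice a per-frame sequence into (embedding window, label) pairs."""
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels must have equal length")
    out = []
    for start in range(0, len(labels) - window + 1, stride):
        chunk = labels[start:start + window]
        label, count = Counter(chunk).most_common(1)[0]
        # Drop windows that straddle a behavior change too evenly.
        if count / window >= min_fraction:
            out.append((list(embeddings[start:start + window]), label))
    return out


# Toy sequence: 8 frames of 1-D "embeddings" with a behavior change midway.
toy_embeddings = [[float(i)] for i in range(8)]
toy_labels = ["lying"] * 4 + ["standing"] * 4
windows = majority_windows(toy_embeddings, toy_labels)
```

In the toy sequence the middle window is half "lying" and half "standing", so it falls below the threshold and is discarded, leaving one clean window per behavior.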

