Precision livestock farming (PLF) promises continuous, individual-level animal monitoring, but the computational demands of state-of-the-art foundation models have kept them in the cloud, not on the edge. A new distillation approach from researchers Haiyu Yang and Miel Hostens, detailed in a preprint on arXiv, closes the gap by compressing the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 into a 40.66M-parameter student that fits an NVIDIA Jetson Orin NX 16GB edge accelerator with room to spare.
The core problem: foundation-model pipelines for individual-level livestock monitoring, which combine open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings, have raised accuracy ceilings but require GPU memory budgets that commodity edge accelerators cannot meet. According to the paper, the SAM 3 teacher model needs 19.52 GB of peak VRAM. The student pipeline reduces this to 6.49 GB, a 3.01-fold reduction, and cuts system-level parameters 7.77-fold.
How the Distillation Works
The student encoder pairs a TinyViT-21M-512 backbone with a Feature Pyramid Network and is trained with a four-term direction-then-scale distillation loss. At inference time, the student backbone is substituted for the teacher's, and sliding-window session pruning bounds the growth of GPU memory during streaming. For per-individual embeddings, the pipeline adopts a pre-distilled ViT-S/16 variant from the DINOv3 family (21.6M parameters), whose teacher is a 6,716M-parameter ViT-7B model.
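
The idea behind sliding-window session pruning can be illustrated with a minimal sketch. This is not the authors' implementation or the SAM 3 API: the class name `SlidingWindowSession`, the `window` parameter, and the shape of the per-frame state are all hypothetical, chosen only to show how evicting old frames keeps memory bounded during streaming.

```python
from collections import OrderedDict


class SlidingWindowSession:
    """Keeps per-frame state for only the most recent `window` frames.

    Hypothetical illustration: names and stored payload are assumptions,
    not taken from the paper or from SAM 3.
    """

    def __init__(self, window: int):
        if window < 1:
            raise ValueError("window must be at least 1")
        self.window = window
        self.frames = OrderedDict()  # frame_idx -> per-frame state

    def add_frame(self, frame_idx: int, state) -> list:
        """Store state for a new frame and evict frames outside the window.

        Returns the list of evicted frame indices, oldest first.
        """
        self.frames[frame_idx] = state
        evicted = []
        while len(self.frames) > self.window:
            old_idx, _ = self.frames.popitem(last=False)  # drop the oldest frame
            evicted.append(old_idx)
        return evicted


# Stream ten frames through a session that retains only three.
session = SlidingWindowSession(window=3)
for idx in range(10):
    session.add_frame(idx, state={"features": [0.0] * 4})

retained = list(session.frames)  # frame indices still held in memory
```

Because the number of retained frames never exceeds `window`, the memory held by the session stays constant however long the video stream runs. The trade-off is that anything relying on frames older than the window is no longer available.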
Performance on Livestock Data
On the Edinburgh Pig dataset, the compressed pipeline achieved 92.29% MOTA and 96.15% IDF1, trailing the SAM 3 teacher by only 1.68 and 0.84 percentage points, respectively. For nine-class pig behaviour classification, top-1 accuracy reached 97.34% with a macro-F1 of 91.67%.
| Metric | Teacher (SAM 3) | Student (Distilled) | Change |
|---|---|---|---|
| MOTA | ≈93.97% | 92.29% | -1.68 pp |
| IDF1 | ≈96.99% | 96.15% | -0.84 pp |
| Encoder parameters | 446M | 40.66M | ≈10.97× reduction (7.77× at system level) |
| Peak VRAM | 19.52 GB | 6.49 GB | 3.01× reduction |
| Behaviour top-1 acc | — | 97.34% | — |
| Behaviour macro-F1 | — | 91.67% | — |
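
For readers less familiar with the two tracking metrics in the table, the sketch below applies their standard definitions: MOTA penalises misses, false positives, and identity switches relative to the number of ground-truth objects, while IDF1 is the F1 score over identity-consistent matches. The counts in the example are invented purely for illustration and are not taken from the paper.

```python
def mota(false_negatives: int, false_positives: int,
         id_switches: int, num_gt: int) -> float:
    """Multiple Object Tracking Accuracy: 1 - (FN + FP + IDSW) / GT."""
    if num_gt <= 0:
        raise ValueError("num_gt must be positive")
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    if denom == 0:
        raise ValueError("at least one count must be non-zero")
    return 2 * idtp / denom


# Hypothetical counts, chosen only to show the arithmetic.
example_mota = mota(false_negatives=50, false_positives=25,
                    id_switches=2, num_gt=1000)
example_idf1 = idf1(idtp=960, idfp=40, idfn=40)

print(f"MOTA = {example_mota:.2%}, IDF1 = {example_idf1:.2%}")
```

MOTA is dominated by detection errors, whereas IDF1 rewards keeping the same identity on the same animal over time, which is why both are reported for individual-level monitoring.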
The pipeline fits inside the NVIDIA Jetson Orin NX 16GB envelope with 4.9 GB of headroom, enabling on-device operation without cloud connectivity.
Longitudinal Visual Analytics
The authors propose an on-device embedding-pool re-identification mechanism that stores per-individual data at approximately 94 MB per animal per year. This creates a longitudinal visual record that can be retrospectively associated with disease, lameness, reproductive, and growth outcome labels. While the mechanism has not yet been empirically validated, it points toward a future where edge-deployed cameras continuously monitor individual animals and link visual behaviour changes to health events.
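
To put the storage figure in context, here is a rough back-of-the-envelope sketch. Only the 94 MB per animal per year comes from the article; the herd size, the retention period, and the assumption that each stored embedding is a 384-dimensional float32 vector are hypothetical, and the paper may store embeddings differently.

```python
MB_PER_ANIMAL_PER_YEAR = 94.0  # approximate figure reported in the article


def pool_storage_gb(num_animals: int, years: float,
                    mb_per_animal_year: float = MB_PER_ANIMAL_PER_YEAR) -> float:
    """Total embedding-pool size in decimal GB (1 GB = 1000 MB)."""
    return num_animals * years * mb_per_animal_year / 1000.0


def implied_embeddings_per_day(bytes_per_embedding: int,
                               mb_per_animal_year: float = MB_PER_ANIMAL_PER_YEAR) -> float:
    """Embeddings per animal per day implied by the yearly budget.

    ASSUMPTION: the budget is spent entirely on raw embedding vectors,
    with no metadata or compression.
    """
    bytes_per_year = mb_per_animal_year * 1_000_000
    return bytes_per_year / bytes_per_embedding / 365.0


# Hypothetical herd: 500 animals tracked for 3 years.
herd_gb = pool_storage_gb(num_animals=500, years=3)

# ASSUMPTION: 384-dimensional float32 vectors (384 * 4 bytes each).
per_day = implied_embeddings_per_day(bytes_per_embedding=384 * 4)

print(f"{herd_gb:.1f} GB for the herd, ~{per_day:.0f} embeddings/animal/day")
```

Under these assumptions the pool for a mid-sized herd stays within the range of a commodity SSD attached to the edge device.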
Implications for Enterprise Adoption
For technology leaders in agriculture and livestock supply chains, the distillation approach demonstrates that foundation-model accuracy can be largely preserved (within 1.68 percentage points of MOTA on the reported benchmark) while shrinking resource requirements to fit off-the-shelf edge hardware. Running continuous monitoring locally reduces cloud costs, bandwidth demands, and latency, all of which matter for remote farms. The 4.9 GB of headroom on the Jetson Orin NX 16GB means additional application logic, such as alerting or local data storage, can be co-located on the same device.
Future work could extend the pipeline to other species and integrate with existing farm management systems via standard APIs. The arXiv paper provides the technical blueprint without releasing code, but the detailed methodology allows replication by enterprise teams with access to annotated livestock video datasets.