Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Livestock Monitoring

Researchers distilled SAM 3's 446M-parameter backbone into a 40.66M-parameter student, achieving 92.29% MOTA and 96.15% IDF1 on the Edinburgh Pig dataset. The pipeline runs on an NVIDIA Jetson Orin NX 16GB with 4.9 GB of headroom, enabling on-device, individual-level livestock monitoring and longitudinal visual analytics.

iGEN Editorial

June 16, 2026


Precision livestock farming (PLF) promises continuous, individual-level animal monitoring, but the computational demands of state-of-the-art foundation models have kept them in the cloud, not on the edge. A new distillation approach from researchers Haiyu Yang and Miel Hostens, detailed in a preprint on arXiv, closes the gap by compressing the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 into a 40.66M-parameter student that fits an NVIDIA Jetson Orin NX 16GB edge accelerator with room to spare.

The core problem: foundation-model pipelines for individual-level livestock monitoring, which combine open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings, have raised accuracy ceilings but need more GPU memory than commodity edge accelerators provide. According to the paper, the SAM 3 teacher requires 19.52 GB of peak VRAM. The student pipeline reduces this to 6.49 GB, a 3.01-fold reduction, while cutting system-level parameters 7.77-fold.
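A quick arithmetic check of the headline ratios, using only figures quoted in this article. It confirms that 19.52 GB to 6.49 GB is about a 3.01-fold cut, and shows that the two encoder sizes alone (446M vs 40.66M) give roughly 10.97-fold, larger than the 7.77-fold system-level figure. The gap presumably reflects components outside the distilled encoder (such as the DINOv3 embedder), but the article does not itemize the system totals. The constant and function names below are illustrative, not from the paper.

```python
# Reported figures from the article. The system-level parameter totals
# behind the 7.77x figure are not given, so only the encoder pair is used.
TEACHER_PEAK_VRAM_GB = 19.52
STUDENT_PEAK_VRAM_GB = 6.49
TEACHER_ENCODER_PARAMS_M = 446.0  # PE-ViT-L+ backbone of SAM 3
STUDENT_ENCODER_PARAMS_M = 40.66  # distilled TinyViT-21M-512 + FPN student


def fold_reduction(before: float, after: float) -> float:
    """Return how many times smaller `after` is than `before`."""
    return before / after


vram_fold = fold_reduction(TEACHER_PEAK_VRAM_GB, STUDENT_PEAK_VRAM_GB)
encoder_fold = fold_reduction(TEACHER_ENCODER_PARAMS_M, STUDENT_ENCODER_PARAMS_M)

print(f"Peak VRAM reduction:     {vram_fold:.2f}x")     # 3.01x
print(f"Encoder param reduction: {encoder_fold:.2f}x")  # 10.97x
```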

How the Distillation Works

The student encoder pairs a TinyViT-21M-512 backbone with a Feature Pyramid Network (FPN) and is trained with a four-term, direction-then-scale distillation loss. At inference time, the student is substituted for the teacher backbone, and sliding-window session pruning keeps GPU memory from growing without bound during streaming. For per-individual embeddings, the pipeline adopts the pre-distilled DINOv3 ViT-S/16 variant (21.6M parameters), which was itself distilled from the 6,716M-parameter DINOv3 ViT-7B teacher.
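The article does not spell out the four loss terms or the pruning policy, so the sketch below is only one plausible reading of two ideas in the paragraph above, written in plain Python. The loss assumes "direction-then-scale" means aligning feature directions first (a cosine term) and only later ramping in a magnitude term; it has two terms, not the paper's four. The session class assumes pruning means keeping a fixed-size window of per-frame states. All names (`direction_then_scale_loss`, `warmup_steps`, `SlidingWindowSession`) are hypothetical.

```python
import math
from collections import deque
from typing import Any, Deque, List, Sequence, Tuple


def _norm(v: Sequence[float]) -> float:
    return math.sqrt(sum(x * x for x in v))


def direction_then_scale_loss(
    student: Sequence[float],
    teacher: Sequence[float],
    step: int,
    warmup_steps: int = 1000,
) -> float:
    """Hypothetical two-term feature-distillation loss (NOT the paper's four-term loss).

    Direction term: 1 - cosine similarity, active from the first step.
    Scale term: squared gap between feature norms, ramped in linearly
    once the direction-only warm-up phase is over.
    """
    s_norm, t_norm = _norm(student), _norm(teacher)
    cosine = sum(s * t for s, t in zip(student, teacher)) / (s_norm * t_norm + 1e-12)
    direction = 1.0 - cosine
    scale = (s_norm - t_norm) ** 2
    # Weight is 0 until warmup_steps, then rises linearly to 1 at 2 * warmup_steps.
    ramp = min(1.0, max(0.0, (step - warmup_steps) / warmup_steps))
    return direction + ramp * scale


class SlidingWindowSession:
    """Hypothetical streaming session that keeps only the newest `window` frame states.

    Older states are evicted automatically, so memory is bounded by the
    window size rather than growing with video length.
    """

    def __init__(self, window: int) -> None:
        self.frames: Deque[Tuple[int, Any]] = deque(maxlen=window)

    def add_frame(self, frame_idx: int, state: Any) -> None:
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.frames.append((frame_idx, state))

    def live_frame_indices(self) -> List[int]:
        return [idx for idx, _ in self.frames]


# Usage: stream 10 frames through a 3-frame window.
session = SlidingWindowSession(window=3)
for i in range(10):
    session.add_frame(i, {"mask": None})
print(session.live_frame_indices())  # [7, 8, 9]
```

Separating the phases this way lets the student first match the teacher's feature geometry without being penalized for magnitude mismatch; whether the paper's schedule works like this is not stated in the article.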

Performance on Livestock Data

On the Edinburgh Pig dataset, the compressed pipeline achieved 92.29% MOTA and 96.15% IDF1, only 1.68 and 0.84 percentage points behind the SAM 3 teacher, respectively. For nine-class pig behaviour classification, top-1 accuracy reached 97.34% with a macro-F1 of 91.67%.

Metric	Teacher (SAM 3)	Student (distilled)	Change
MOTA	≈93.97%	92.29%	-1.68 pp
IDF1	≈96.99%	96.15%	-0.84 pp
Encoder parameters	446M	40.66M	≈10.97× reduction
System-level parameters	—	—	7.77× reduction
Peak VRAM	19.52 GB	6.49 GB	3.01× reduction
Behaviour top-1 accuracy	—	97.34%	—
Behaviour macro-F1	—	91.67%	—
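For readers less familiar with the tracking and classification metrics in the table, the sketch below implements their standard definitions: MOTA (from the CLEAR MOT metrics) penalizes misses, false positives and identity switches relative to the number of ground-truth objects, while IDF1 is the harmonic mean of identity precision and identity recall. All counts and the two behaviour labels are invented toy values, not data from the paper. The toy classification case shows how macro-F1 can trail top-1 accuracy when a minority class is misclassified; the article does not say whether class imbalance explains the 97.34% vs 91.67% gap.

```python
from typing import Hashable, Sequence, Tuple


def mota(false_negatives: int, false_positives: int, id_switches: int, num_gt: int) -> float:
    """CLEAR MOT accuracy: 1 - (misses + false positives + ID switches) / ground-truth objects."""
    return 1.0 - (false_negatives + false_positives + id_switches) / num_gt


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """Identity F1: harmonic mean of ID precision and ID recall."""
    return 2 * idtp / (2 * idtp + idfp + idfn)


def top1_and_macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> Tuple[float, float]:
    """Top-1 accuracy and the unweighted mean of per-class F1 scores."""
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    f1_scores = []
    for c in set(y_true) | set(y_pred):
        tp = sum(t == c and p == c for t, p in pairs)
        fp = sum(t != c and p == c for t, p in pairs)
        fn = sum(t == c and p != c for t, p in pairs)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return accuracy, sum(f1_scores) / len(f1_scores)


# Toy counts and labels, invented for illustration only.
print(round(mota(30, 20, 5, 1000), 4))  # 0.945
print(round(idf1(950, 40, 50), 4))      # 0.9548
y_true = ["lying"] * 8 + ["walking"] * 2
y_pred = ["lying"] * 8 + ["lying", "walking"]
acc, macro = top1_and_macro_f1(y_true, y_pred)
print(round(acc, 4), round(macro, 4))   # 0.9 0.8039
```

One misclassified "walking" sample costs only 10 points of accuracy but drags that class's F1 down to 0.667, which pulls the macro average well below accuracy.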

The pipeline fits inside the NVIDIA Jetson Orin NX 16GB envelope with 4.9 GB of headroom, enabling on-device operation without cloud connectivity.

Longitudinal Visual Analytics

The authors propose an on-device embedding-pool re-identification mechanism that stores per-individual data at approximately 94 MB per animal per year. This creates a longitudinal visual record that can be retrospectively associated with disease, lameness, reproductive, and growth outcome labels. While the mechanism has not yet been empirically validated, it points toward a future where edge-deployed cameras continuously monitor individual animals and link visual behaviour changes to health events.
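The article gives no implementation details for the embedding-pool mechanism, so the following is only a minimal sketch under stated assumptions: each tracked animal accumulates dated embedding vectors (in the described pipeline these would come from the DINOv3 ViT-S/16 embedder), a new crop is re-identified by its highest cosine similarity to any stored embedding, and a date-range query returns an animal's history for later joining with health or growth labels. The class name, the 0.8 acceptance threshold, the animal IDs and the 3-dimensional toy vectors are all hypothetical.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-12)


class EmbeddingPool:
    """Hypothetical on-device gallery of dated per-animal embeddings."""

    def __init__(self, threshold: float = 0.8) -> None:
        self.records: Dict[str, List[Tuple[int, List[float]]]] = {}
        self.threshold = threshold  # minimum cosine similarity to accept a match

    def add(self, animal_id: str, day: int, embedding: Sequence[float]) -> None:
        self.records.setdefault(animal_id, []).append((day, list(embedding)))

    def identify(self, query: Sequence[float]) -> Tuple[Optional[str], float]:
        """Best-matching animal ID, or None if no match clears the threshold."""
        best_id, best_score = None, -1.0
        for animal_id, recs in self.records.items():
            score = max(cosine(query, emb) for _, emb in recs)
            if score > best_score:
                best_id, best_score = animal_id, score
        if best_score < self.threshold:
            return None, best_score
        return best_id, best_score

    def history(self, animal_id: str, start_day: int, end_day: int) -> List[List[float]]:
        """Embeddings for one animal within [start_day, end_day], e.g. to join with later health labels."""
        return [emb for day, emb in self.records.get(animal_id, []) if start_day <= day <= end_day]


pool = EmbeddingPool(threshold=0.8)
pool.add("pig_07", day=1, embedding=[1.0, 0.0, 0.0])
pool.add("pig_12", day=1, embedding=[0.0, 1.0, 0.0])
print(pool.identify([0.9, 0.1, 0.0]))  # matches pig_07
print(pool.identify([0.0, 0.0, 1.0]))  # (None, 0.0): no confident match
```

Returning None below the threshold, rather than forcing the nearest match, is one way to avoid silently merging a new or occluded animal into an existing record.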

Implications for Enterprise Adoption

For technology leaders in agriculture and livestock supply chains, the distillation approach shows that foundation-model accuracy can be largely preserved while shrinking resource requirements to fit off-the-shelf edge hardware. Running continuous monitoring locally reduces cloud costs, bandwidth demands, and latency, all of which matter on remote farms. The 4.9 GB of headroom on the Jetson Orin NX 16GB leaves room to co-locate additional application logic, such as alerting or local data storage, on the same device.

Future work could extend the pipeline to other species and integrate it with existing farm management systems. The arXiv paper provides a technical blueprint but does not release code; its detailed methodology should nonetheless let enterprise teams with access to annotated livestock video datasets attempt a replication.

Sources:

Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Livestock Monitoring

