Enterprise security teams face a persistent challenge: obtaining Android malware source code that directly reflects attackers' original intent. Unlike binaries or decompiled code, source code provides clear insight into malicious logic, but its scarcity and the high cost of manual review make such datasets difficult to build. Researchers have now introduced MASCOT-Android, a curated dataset and automated collection pipeline that uses repository-level documentation to discover Android malware source code on GitHub at scale.
The Problem: Scarcity of Malware Source Code
Source code exposes an attacker's logic more directly than a binary or decompiled output, yet it is rarely available and costly to curate. The paper, published on arXiv, names these two obstacles, scarcity and the cost of manual review, as the reason such datasets are difficult to build and maintain. The researchers propose MASCOT-Android to address this gap.
The MASCOT-Android Solution
MASCOT-Android is both a curated dataset of Android malware source code and an automated collection framework designed for scalable discovery on GitHub. The researchers' key finding is that repository-level documentation alone provides a strong signal for malware source code collection. They extracted character-level TF-IDF features from 8,772 malware and 25,747 benign README documents.
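To make the feature type concrete, here is a minimal sketch of character-level TF-IDF in plain Python. It is illustrative only and is not the researchers' implementation. The n-gram range (2 to 4), the smoothed IDF formula, the helper names, and the toy README snippets are all assumptions; the article does not state these details.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=4):
    """Yield every character n-gram of length n_min..n_max from lower-cased text."""
    text = text.lower()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            yield text[i:i + n]


def tfidf_vectors(docs, n_min=2, n_max=4):
    """Return one L2-normalised {ngram: weight} dict per document."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    # Document frequency: the number of documents each n-gram occurs in.
    df = Counter(gram for c in counts for gram in c)
    n_docs = len(docs)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumed variant): n-grams found in every
        # document still keep a small positive weight.
        vec = {
            gram: tf * (math.log((1 + n_docs) / (1 + df[gram])) + 1)
            for gram, tf in c.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({gram: w / norm for gram, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalised sparse vectors."""
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


# Toy README snippets, invented for illustration only.
readmes = [
    "Android RAT with keylogger and SMS stealer",
    "Android remote access trojan: keylogging, SMS stealing",
    "A simple to-do list app built with Jetpack Compose",
]
vecs = tfidf_vectors(readmes)
sim_malware_pair = cosine(vecs[0], vecs[1])  # two malware-style READMEs
sim_mixed_pair = cosine(vecs[0], vecs[2])    # malware-style vs. benign-style
```

Because the features are character n-grams rather than whole words, variants such as "keylogger" and "keylogging" or "stealer" and "stealing" still share most of their features, which is one plausible reason this representation suits informally written READMEs.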
| Item | Value |
|---|---|
| Malware READMEs | 8,772 |
| Benign READMEs | 25,747 |
| Classifier | LinearSVC |
| Accuracy | 96.28% |
| False Positive Rate | 1.06% |
This README-only model achieves an accuracy of 96.28% and a false positive rate of 1.06% in local evaluation. Additionally, the model outputs confidence scores, allowing users to adjust the decision threshold to balance false positive rate and coverage. This flexibility is practical in real-world malware source code collection.
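The trade-off between false positive rate and coverage can be sketched as follows. This is a hedged illustration, not the paper's evaluation code: the function name, the scores, and the labels are made up, and "coverage" is assumed here to mean the share of true malware repositories that get flagged.

```python
def rates_at_threshold(scores, labels, threshold):
    """Flag a repository as malware when its score is >= threshold.

    labels: 1 = malware, 0 = benign.
    Returns (accuracy, false_positive_rate, coverage).
    """
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        flagged = score >= threshold
        if flagged and label == 1:
            tp += 1
        elif flagged and label == 0:
            fp += 1
        elif not flagged and label == 0:
            tn += 1
        else:
            fn += 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    coverage = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, false_positive_rate, coverage


# Invented confidence scores for ten repositories (not data from the paper).
scores = [2.1, 1.4, 0.6, 0.2, -0.1, 0.4, -0.3, -0.9, -1.5, -2.0]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

default_rates = rates_at_threshold(scores, labels, 0.0)
strict_rates = rates_at_threshold(scores, labels, 0.5)
```

On this toy data, raising the threshold from 0.0 to 0.5 removes the one false positive but also drops one true malware repository, so the false positive rate falls while coverage falls with it.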
Technical Approach
The pipeline uses a LinearSVC classifier trained on character-level TF-IDF features from README documents. The automated collection framework is designed to be scalable, enabling continuous discovery of new malware repositories on GitHub. The researchers emphasize that the confidence score output lets users tune the decision threshold, which is critical for operational use where a low false positive rate is often necessary to avoid overwhelming analysts.
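A classifier of this shape can be sketched with scikit-learn, whose `LinearSVC` class shares the name used in the article. Everything else below is an assumption rather than the researchers' configuration: the n-gram range, the regularisation constant `C`, the threshold value, and the toy README strings, which were invented for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy training data, invented for illustration (1 = malware, 0 = benign).
train_readmes = [
    "Android RAT with keylogger and SMS stealer",
    "Android banking trojan source code with overlay injection",
    "Android spyware: records calls, steals contacts, hides icon",
    "Ransomware for Android that encrypts files and locks the screen",
    "A simple to-do list app built with Jetpack Compose",
    "Open source weather app for Android using Retrofit",
    "Kotlin library for image loading and caching",
    "Sample project demonstrating Room database migrations",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Character-level TF-IDF feeding a linear SVM.
# ngram_range and C are assumed values, not taken from the paper.
model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 5)),
    LinearSVC(C=1.0),
)
model.fit(train_readmes, train_labels)

candidates = [
    "Android trojan with SMS stealer and keylogger",
    "Weather widget library built with Kotlin and Retrofit",
]

# decision_function returns a signed distance from the separating
# hyperplane, which can serve as a confidence score.
scores = model.decision_function(candidates)

# Hypothetical operating point: raise it to cut false positives,
# lower it to widen coverage.
THRESHOLD = 0.0
flagged = [text for text, s in zip(candidates, scores) if s >= THRESHOLD]
```

Using `decision_function` instead of `predict` keeps the raw score available, so the operating point can be moved later without retraining.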
Implications for Enterprise Security
For enterprise security teams responsible for mobile app supply chain risk, MASCOT-Android offers a way to automate the discovery of Android malware source code. The high accuracy and low false positive rate reported in local evaluation suggest that the model can flag likely threats without excessive noise. The confidence scores let teams set their own risk tolerance, for example a higher threshold for initial screening and a lower one for deep analysis.
This research directly aids threat intelligence and malware analysis workflows. By automating the collection of source code specimens, organizations can track emerging malware families with less manual effort. The GitHub-focused approach taps into a rich source of publicly available code, and the methodology could extend to other code repositories.
Competitive Context
While other malware datasets exist, most are based on binaries or decompiled code. MASCOT-Android's focus on source code and its use of README documentation as a signal provide a low-overhead, scalable collection method. The use of a simple LinearSVC model with character-level TF-IDF makes the approach lightweight and reproducible.
The study was conducted by researchers including Li, Bojing; Zhong, Duo; Bhandary, Prajna; S, Raguvir; Maxa, Charles; Joyce, Robert J; and Nicholas. The full paper is available on arXiv.