MASCOT-Android: Automated Pipeline and Curated Dataset for Android Malware Source Code Discovery

Researchers introduce MASCOT-Android, a curated dataset of Android malware source code and an automated collection framework. The key finding is that repository-level documentation alone provides a strong signal, enabling a LinearSVC classifier to achieve 96.28% accuracy with a 1.06% false positive rate. The model outputs confidence scores for threshold adjustment, making it practical for real-world malware source code collection.

iGEN Editorial

June 16, 2026

Enterprise security teams face a persistent challenge: obtaining Android malware source code that directly reflects attackers' original intent. Unlike binaries or decompiled code, source code provides clear insight into malicious logic, but its scarcity and the high cost of manual review make building such datasets difficult. Researchers have now introduced MASCOT-Android, a curated dataset and automated collection pipeline that leverages repository-level documentation to scalably discover Android malware source code on GitHub.

The Problem: Scarcity of Malware Source Code

Source code reveals attacker intent more directly than binaries do, yet it is rarely available and costly to curate. The paper, published on arXiv, attributes the lack of such datasets to that scarcity and to the cost of manual review, and proposes MASCOT-Android to close the gap.

The MASCOT-Android Solution

MASCOT-Android is both a curated dataset of Android malware source code and an automated collection framework designed for scalable discovery on GitHub. The researchers' key finding is that repository-level documentation alone provides a strong signal for malware source code collection. They extracted character-level TF-IDF features from 8,772 malware and 25,747 benign README documents.

Metric	Value
Malware READMEs	8,772
Benign READMEs	25,747
Classifier	LinearSVC
Accuracy	96.28%
False Positive Rate	1.06%

In local evaluation, this README-only model reaches 96.28% accuracy with a 1.06% false positive rate. The model also outputs confidence scores, so users can move the decision threshold to trade false positive rate against coverage, which makes it practical for real-world malware source code collection.
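To make the threshold trade-off concrete, here is a minimal Python sketch. It is not taken from the paper: the function names, the toy scores, and the labels are all hypothetical, and "coverage" is assumed here to mean the fraction of malware READMEs that get flagged.

```python
from typing import List, Tuple


def rates_at_threshold(
    scores: List[float], labels: List[int], threshold: float
) -> Tuple[float, float]:
    """Return (false_positive_rate, coverage) when flagging scores >= threshold.

    labels use 1 for malware and 0 for benign (an assumed encoding).
    """
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    n_benign = labels.count(0)
    n_malware = labels.count(1)
    fpr = fp / n_benign if n_benign else 0.0
    coverage = tp / n_malware if n_malware else 0.0
    return fpr, coverage


def lowest_threshold_within_fpr(
    scores: List[float], labels: List[int], max_fpr: float
) -> float:
    """Pick the lowest observed score whose false positive rate is <= max_fpr.

    The false positive rate can only fall as the threshold rises, so the
    first qualifying score in ascending order keeps coverage as high as
    the FPR budget allows.
    """
    for candidate in sorted(set(scores)):
        fpr, _ = rates_at_threshold(scores, labels, candidate)
        if fpr <= max_fpr:
            return candidate
    return float("inf")  # no threshold meets the budget: flag nothing


# Hypothetical confidence scores for eight held-out READMEs.
toy_scores = [-1.5, -0.9, -0.4, 0.2, 0.3, 0.8, 1.1, 1.9]
toy_labels = [0, 0, 0, 0, 1, 1, 0, 1]

for budget in (0.2, 0.0):
    t = lowest_threshold_within_fpr(toy_scores, toy_labels, budget)
    fpr, coverage = rates_at_threshold(toy_scores, toy_labels, t)
    print(f"max FPR {budget}: threshold={t}, FPR={fpr:.2f}, coverage={coverage:.2f}")
```

Tightening the false positive budget pushes the threshold up and lets fewer malware READMEs through, which is the balance the confidence scores are meant to expose.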

Technical Approach

The pipeline uses a LinearSVC classifier trained on character-level TF-IDF features from README documents. The automated collection framework is designed to be scalable, enabling continuous discovery of new malware repositories on GitHub. The researchers emphasize that the confidence score output lets operators tune the decision threshold, which matters in operational use, where a low false positive rate is often necessary to avoid overwhelming analysts.
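A README classifier of this kind might look roughly like the following scikit-learn sketch. The article confirms only the ingredients (character-level TF-IDF features, LinearSVC, a confidence score); the analyzer choice, n-gram range, regularization constant, the toy READMEs, and the `flag_readmes` helper are assumptions made for illustration, not the authors' settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy README snippets invented for this sketch (1 = malware, 0 = benign).
readmes = [
    "Android RAT with remote shell, SMS stealer and keylogger payload builder",
    "Banking trojan for Android: overlay injection and C2 panel included",
    "Android ransomware sample that encrypts files and locks the screen",
    "A simple Android to-do list app built with Kotlin and Jetpack Compose",
    "Weather widget for Android using the OpenWeather API",
    "Open-source Android podcast player with offline downloads",
]
labels = [1, 1, 1, 0, 0, 0]

# Character n-gram TF-IDF feeding a linear SVM.
# analyzer, ngram_range, sublinear_tf and C are assumed values.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5), sublinear_tf=True),
    LinearSVC(C=1.0),
)
model.fit(readmes, labels)


def flag_readmes(texts, threshold=0.0):
    """Return (confidence_score, flagged) per README.

    The score is the signed distance to the SVM decision boundary;
    raising `threshold` flags fewer repositories.
    """
    scores = model.decision_function(texts)
    return [(float(s), bool(s >= threshold)) for s in scores]


for text, (score, flagged) in zip(readmes, flag_readmes(readmes)):
    print(f"{score:+.3f}  flagged={flagged}  {text[:40]}")
```

Character n-grams need no tokenizer or language-specific preprocessing, which is one plausible reason such a model stays lightweight; a real deployment would be trained on a labeled README corpus rather than six toy strings.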

Implications for Enterprise Security

For enterprise security teams responsible for mobile app supply chain risk, MASCOT-Android offers a way to automate the discovery of Android malware source code. The accuracy and false positive rate reported in the paper's local evaluation suggest that the model can flag likely malware repositories without excessive noise. The confidence scores let teams set their own risk tolerance: raising the threshold cuts false positives at the cost of coverage, and lowering it does the reverse. A team might, for example, use a higher threshold for initial screening and a lower one for deep analysis.

The research is directly relevant to threat intelligence and malware analysis workflows. Automating the collection of source code specimens can help organizations keep pace with emerging malware families. The pipeline targets GitHub, a rich source of publicly available code, and the README-based methodology could in principle extend to other code repositories.

Competitive Context

While other malware datasets exist, most are based on binaries or decompiled code. MASCOT-Android's focus on source code, together with its use of README documentation as a signal, provides a low-overhead, scalable method. The use of a simple LinearSVC model with character-level TF-IDF features makes the approach lightweight and reproducible.

The study was conducted by researchers including Li, Bojing; Zhong, Duo; Bhandary, Prajna; S, Raguvir; Maxa, Charles; Joyce, Robert J; and Nicholas. The full paper is available on arXiv.

Sources:

MASCOT-Android: Automated Pipeline and Curated Dataset for Android Malware Source Code Discovery
