
MASCOT-Android: Automated Pipeline and Curated Dataset for Android Malware Source Code Discovery

Researchers introduce MASCOT-Android, a curated dataset of Android malware source code and an automated collection framework. The key finding is that repository-level documentation alone provides a strong signal, enabling a LinearSVC classifier to achieve 96.28% accuracy with a 1.06% false positive rate. The model outputs confidence scores for threshold adjustment, making it practical for real-world malware source code collection.

iGEN Editorial
June 16, 2026

Enterprise security teams face a persistent challenge: obtaining Android malware source code that directly reflects attackers' original intent. Unlike binaries or decompiled code, source code provides clear insight into malicious logic, but its scarcity and the high cost of manual review make building such datasets difficult. Researchers have now introduced MASCOT-Android, a curated dataset and automated collection pipeline that uses repository-level documentation to discover Android malware source code on GitHub at scale.

The Problem: Scarcity of Malware Source Code

Malware source code reveals attacker intent more directly than binaries do, yet it is rarely available and costly to curate. According to the paper, published on arXiv, this scarcity and the high cost of manual review make such datasets difficult to build and maintain. The researchers propose MASCOT-Android to address this gap.

The MASCOT-Android Solution

MASCOT-Android is both a curated dataset of Android malware source code and an automated collection framework designed for scalable discovery on GitHub. The researchers' key finding is that repository-level documentation alone provides a strong signal for malware source code collection. They extracted character-level TF-IDF features from 8,772 malware and 25,747 benign README documents.
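To make "character-level TF-IDF" concrete, here is a minimal standard-library sketch. It is not the researchers' code: the n-gram range (2 to 3), the lowercasing, the smoothed IDF formula, the L2 normalisation, and the two toy READMEs are all assumptions chosen for illustration, since the article does not state the feature settings.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=3):
    """Return all character n-grams of length n_min..n_max from lowercased text."""
    text = text.lower()
    return [text[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n_min=2, n_max=3):
    """Map each document to a dict {ngram: weight}, L2-normalised."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    n_docs = len(docs)
    # Document frequency: the number of documents each n-gram appears in.
    df = Counter(g for c in counts for g in c)
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumed variant): ln((1 + N) / (1 + df)) + 1.
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({g: w / norm for g, w in vec.items()})
    return vectors


# Hypothetical README snippets, invented for this example.
readmes = [
    "Android RAT with keylogger and SMS stealer",
    "Weather widget for Android home screens",
]
vectors = tfidf_vectors(readmes)
```

Working on character n-grams rather than whole words means misspellings, mixed languages, and obfuscated terms in a README still produce overlapping features, which may be one reason such a simple representation is reported to work well.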

| Metric | Value |
|---|---|
| Malware READMEs | 8,772 |
| Benign READMEs | 25,747 |
| Classifier | LinearSVC |
| Accuracy | 96.28% |
| False Positive Rate | 1.06% |

This README-only model achieves an accuracy of 96.28% and a false positive rate of 1.06% in local evaluation. Additionally, the model outputs confidence scores, allowing users to adjust the decision threshold to balance false positive rate and coverage. This flexibility is practical in real-world malware source code collection.
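The trade-off between false positive rate and coverage can be illustrated with a small sketch. The scores and labels below are made up, and "coverage" is interpreted here as the share of true malware repositories that get flagged, which is an assumption because the article does not define the term.

```python
def rates_at_threshold(scores, labels, threshold):
    """Return (false_positive_rate, coverage) when flagging scores >= threshold.

    labels: 1 = malware, 0 = benign.
    """
    flagged = [s >= threshold for s in scores]
    false_pos = sum(f and y == 0 for f, y in zip(flagged, labels))
    true_pos = sum(f and y == 1 for f, y in zip(flagged, labels))
    n_benign = labels.count(0)
    n_malware = labels.count(1)
    fpr = false_pos / n_benign if n_benign else 0.0
    coverage = true_pos / n_malware if n_malware else 0.0
    return fpr, coverage


# Hypothetical confidence scores for six repositories (not from the paper).
scores = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5]
labels = [0, 0, 0, 1, 1, 1]

for threshold in (0.0, 0.5):
    fpr, coverage = rates_at_threshold(scores, labels, threshold)
    print(f"threshold={threshold:.1f}  FPR={fpr:.2f}  coverage={coverage:.2f}")
```

On this toy data, raising the threshold from 0.0 to 0.5 removes the single false positive but also misses one of the three malware repositories, which is the kind of balance a user would tune.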

Technical Approach

The pipeline uses a LinearSVC classifier trained on character-level TF-IDF features from README documents. The automated collection framework is designed to be scalable, enabling continuous discovery of new malware repositories on GitHub. The researchers emphasize that the confidence score output allows fine-tuning, which is critical for operational use where a low false positive rate is often necessary to avoid overwhelming analysts.
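A classifier of this shape can be sketched in a few lines with scikit-learn, whose `LinearSVC` class shares the name used in the article. This is an illustrative setup, not the researchers' pipeline: the n-gram range, the toy READMEs, the `flag` helper, and the use of `decision_function` as the confidence score are all assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy READMEs; the reported corpus has 8,772 malware
# and 25,747 benign documents.
readmes = [
    "Android RAT with keylogger and SMS stealer",
    "Banking trojan overlay injector for Android",
    "Ransomware locker demo that encrypts the sdcard",
    "Weather widget for Android home screens",
    "Simple to-do list app built with Kotlin",
    "Open source podcast player for Android",
]
labels = [1, 1, 1, 0, 0, 0]  # 1 = malware, 0 = benign

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # assumed n-gram range
    LinearSVC(random_state=0),
)
model.fit(readmes, labels)


def flag(texts, threshold=0.0):
    """Return (text, score) pairs whose score meets the threshold.

    The score is the signed distance to the separating hyperplane,
    not a calibrated probability.
    """
    scores = model.decision_function(texts)
    return [(t, float(s)) for t, s in zip(texts, scores) if s >= threshold]
```

Calling `flag(new_readmes, threshold=0.5)` instead of the default 0.0 would flag fewer repositories, trading coverage for fewer false positives.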

Implications for Enterprise Security

For enterprise security teams responsible for mobile app supply chain risk, MASCOT-Android offers a way to automate the discovery of Android malware source code. The reported accuracy and false positive rate, measured in local evaluation, suggest that the model can flag likely malware repositories without excessive noise. The confidence scores let teams set their own risk tolerance: for example, a higher threshold for initial screening and a lower one for deep analysis.

This research directly aids threat intelligence and malware analysis workflows. By automating the collection of source code specimens, organizations can keep pace with emerging malware families. The pipeline targets GitHub, a rich source of publicly available code, and the methodology could extend to other code repositories.

Competitive Context

While other malware datasets exist, most are based on binaries or decompiled code. MASCOT-Android's focus on source code and its use of README documentation as a signal provide a low-overhead, scalable method. The use of a simple LinearSVC model with character-level TF-IDF makes the approach lightweight and reproducible.

The study was conducted by researchers including Li, Bojing; Zhong, Duo; Bhandary, Prajna; S, Raguvir; Maxa, Charles; Joyce, Robert J; and Nicholas. The full paper is available on arXiv.

