JDroid: Android malware detection using hybrid opcode feature vector

ARSLAN, RECEP

doi:10.7717/peerj-cs.3051

ARSLAN R. S.

PeerJ Computer Science, vol. 11, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume Number: 11
Publication Date: 2025
DOI Number: 10.7717/peerj-cs.3051
Journal Name: PeerJ Computer Science
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, Directory of Open Access Journals
Keywords: Algorithms and Analysis of Algorithms, Data Mining and Machine Learning, Hybrid feature vector, Malware detection, Mobile and Ubiquitous Computing, Opcode sequences, Optimization Theory and Computation, Security and Privacy, Stacked generalized ensemble classifier
Kayseri University Affiliated: Yes

Abstract

The rapid proliferation of devices running the Android operating system has made them a primary target for malware developers. Researchers are investigating different techniques to protect end users from these attackers. Although many of these techniques detect malware successfully, they also have limitations. Many applications today use advanced obfuscation, disguise, and variant generation techniques to bypass detection tools, which creates difficulties for security experts. However, the rich semantic information hidden in opcodes offers a promising way to distinguish benign applications from malicious ones. In this study, we propose a tool called JDroid that uses opcodes (Dalvik opcodes and Java bytecode) as features obtained through static analysis. The tool aims to detect malicious applications with a stacked generalised ensemble model that combines different opcode sequences into a hybrid feature vector: a model is first trained separately for each feature, and the results are then combined in an ensemble decision. For this purpose, opcodes are extracted from APK files by code analysis and converted directly into binary vectors, with 1 or 0 indicating whether an opcode is used. Filtering and feature selection then reduce these vectors to a subset of 461 features. Using fewer features increases efficiency and performance, avoids overfitting, and reduces computational cost. The approach is tested on a pool of 14 thousand applications drawn from the Drebin, Genome, MalDroid2020, CICInvesAndMal2019, and Omer datasets, and its classification performance is compared with different machine learning methods. Experimental results show that the proposed approach achieves an accuracy of 98.6% and an area under the curve (AUC) of 99.6% in malware detection, without being affected by obfuscation.
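
The binary encoding described in the abstract can be illustrated with a minimal sketch. It assumes two tiny hypothetical opcode vocabularies and a simple constant-column filter standing in for the paper's filtering and feature selection, whose actual criteria are not given here. The names `DALVIK_VOCAB`, `JAVA_VOCAB`, `presence_vector`, `hybrid_vector`, and `varying_columns` are illustrative assumptions, not part of JDroid.

```python
from typing import List, Sequence

# Hypothetical vocabularies, chosen only for illustration.
# The paper reports 461 selected features; these lists are not that set.
DALVIK_VOCAB = ["invoke-virtual", "const-string", "move-result", "goto"]
JAVA_VOCAB = ["aload_0", "invokespecial", "getfield", "areturn"]


def presence_vector(opcodes: Sequence[str], vocab: Sequence[str]) -> List[int]:
    """Return 1 for each vocabulary opcode that occurs in the app, else 0."""
    seen = set(opcodes)
    return [1 if op in seen else 0 for op in vocab]


def hybrid_vector(dalvik_ops: Sequence[str], java_ops: Sequence[str]) -> List[int]:
    """Concatenate the Dalvik and Java presence vectors into one feature vector."""
    return presence_vector(dalvik_ops, DALVIK_VOCAB) + presence_vector(java_ops, JAVA_VOCAB)


def varying_columns(rows: List[List[int]]) -> List[int]:
    """Return indices of columns that take more than one value across rows.

    Assumption: a stand-in filter. Columns that are constant across all
    apps carry no information for separating benign from malicious ones.
    """
    return [j for j in range(len(rows[0])) if len({row[j] for row in rows}) > 1]


# Two made-up apps, each given as (Dalvik opcodes, Java bytecode opcodes).
apps = [
    (["invoke-virtual", "const-string", "invoke-virtual"], ["aload_0", "getfield"]),
    (["invoke-virtual", "goto"], ["aload_0", "invokespecial"]),
]

vectors = [hybrid_vector(d, j) for d, j in apps]
kept = varying_columns(vectors)
reduced = [[row[j] for j in kept] for row in vectors]

print(vectors)
print(kept)
print(reduced)
```

Repeated opcodes do not change the encoding, since only presence is recorded. In a full pipeline the reduced vectors would be passed to the base classifiers of the stacked ensemble.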