JDroid: Android malware detection using hybrid opcode feature vector


ARSLAN R. S.

PeerJ Computer Science, cilt.11, 2025 (SCI-Expanded, Scopus) identifier identifier identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 11
  • Basım Tarihi: 2025
  • Doi Numarası: 10.7717/peerj-cs.3051
  • Dergi Adı: PeerJ Computer Science
  • Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, Directory of Open Access Journals
  • Anahtar Kelimeler: Algorithms and Analysis of Algorithms, Data Mining and Machine Learning, Hybrid feature vector, Malware detection, Mobile and Ubiquitous Computing, Opcode sequences, Optimization Theory and Computation, Security and Privacy, Stacked generalized ensemble classifier
  • Kayseri Üniversitesi Adresli: Evet

Özet

The rapid proliferation of devices using the Android operating system makes these devices the primary target for malware developers. Researchers are investigating different techniques to protect end users from these attackers. While many of these techniques are successful in detecting malware, they also have some limitations. Because many applications today use advanced obfuscation techniques, advanced disguise, and variant generation techniques to bypass detection tools, this creates difficulties for security experts. However, the rich semantic information hidden in opcodes offers a promising way to distinguish benign applications from malicious ones. In this study, we propose a tool called JDroid that treats opcodes (Dalvik Opcode and Java ByteCode) as features based on static analysis. The proposed tool aims to detect malicious applications with a unique ensemble model in a stacked generalised structure that uses different opcode sequences as a hybrid, and where each feature is first trained separately and then used by an ensemble decision. For this purpose, opcodes are extracted from APK files by code analysis and directly converted into vectors as 0 and 1 according to their usage cases. A subset of 461 features, obtained through filtering and feature selection processes, is then created using fewer features. This increases efficiency and performance, avoids overfitting, and reduces computational cost. The datasets Drebin, Genome, MalDroid2020, CICInvesAndMal2019, and Omer are tested with an application pool consisting of 14 thousand applications, and the classification performance is compared with different machine learning methods. Experimental results show that the proposed approach has an accuracy value of 98.6% and an area under the curve (AUC) value of 99.6% in malware detection without being affected by the obfuscation process.