Intelligent cyber-phishing detection for online

https://doi.org/10.1016/j.cose.2020.102123

Highlights

  • A combination of blacklist-based, web content-based and heuristic approaches, using multiple algorithms with features, detects fraudulent sites with higher accuracy in real time.

  • The methodology can reduce fraudulent attacks and protect online users.

  • The integrated method and four categories of dataset are the core framework from which we extract comprehensive features.

Abstract

Phishing attacks are on the increase, resulting in financial loss and theft of sensitive information for online services and users. Anti-phishing approaches have concentrated on blacklist-based approaches that use manually verified Uniform Resource Locators (URLs), or content-based methods that utilise heuristics-based machine learning (ML) classifiers. However, online deception is still on the rise. In this study, we introduce a novel methodology combining blacklist-based, web content-based and heuristic-based approaches, using ML algorithms with comprehensive features to allow more accurate phishing attack detection. Extensive evaluation was carried out using an adaptive neuro-fuzzy inference system (ANFIS), Naïve Bayes (NB), PART, J48 and JRip with features, using evaluation metrics to measure the performance of the proposed method. All the classifiers achieved accuracy in the range of 99% to 99.33%. PART attained 99.33% accuracy with a speed of 0.006 seconds, which is the best performance. We experimentally demonstrate that the proposed methodology can detect phishing websites with high accuracy in real time and generalise well to new phishing attacks. The proposed approach has the best performance compared to related approaches in the field.

Introduction

Cyber phishing attacks have become a growing threat and are amongst the most challenging issues affecting governments, businesses and individuals on the World Wide Web. Other major types of cyber-attack are brute force, social engineering/cyber fraud, distributed denial-of-service attacks, pharming and malware/spyware/ransomware. In this study, we focus on cyber phishing attacks, which have been evolving rapidly, putting more pressure on the existing detection methods (Xiang et al., 2009:1:3; Xiang et al., 2010:3; Xiang et al., 2011:8:26). Phishing is a form of social engineering in which attackers attempt to fraudulently obtain users' confidential information through communication media that direct users to fake websites claiming to be from known banks or organizations (Jackobson & Mayor, 2007; Xiang et al., 2011). According to statistics reported by the U.S. Federal Bureau of Investigation (FBI), over $2.7 billion was lost to phishing through compromised business and personal email accounts in 2018 alone (U.S. FBI Internet Crime Complaint Center (IC3), 2018). This issue is expected to grow with time as online transactions become a part of everyday life. Also, the Anti-Phishing Working Group (APWG) (2018) found that the number of phishing sites detected from July to September 2019 was 266,387. This was an increase of 46% from 182,465 in the second quarter of 2019 and almost double the 138,328 seen in the fourth quarter of 2018 (APWG, 2019). More than 55% of all phishing attacks detected in the second quarter of 2019 used Secure Sockets Layer (SSL) encryption on Hypertext Transfer Protocol Secure (HTTPS) protected domains to fool victims into thinking HTTPS sites are safe (APWG, 2019).

Much past academic research in the anti-phishing detection field has concentrated on only one of three methods at a time: blacklist-based approaches (Sahingoz et al., 2019; Chiew et al., 2019a; Rao and Pais, 2018); visual similarity-based approaches (Mao et al., 2017); or content-based approaches (Adebowale et al., 2018) to detect phishing attacks.

In general, blacklist methods are widely employed in industry due to their low false positives, but a blacklist alone cannot generalise well to unseen phishing instances (Xiang et al., 2011; Prakash et al., 2010; Sheng et al., 2009). For example, Xiang et al. (2011) reported that a blacklist feature-based method alone cannot detect new phishing attacks accurately because blacklist features are properties of the URL itself (not the content of the web page it references). Similarly, URL content captures properties of the hyperlinks in the source code itself, but not the content of the web page each hyperlink references (Rao and Pais, 2018). As a result, there is a scarcity of effective features, and false positive rates are high (Xiang et al., 2011). These challenges require careful consideration of how effective features can be identified.
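To illustrate what it means for a feature to be a property of the URL itself, the following minimal Python sketch computes a few lexical features from the URL string alone, without fetching the page. The feature names and the choice of features are hypothetical and are not the feature set used in this study.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url):
    """Compute a few illustrative lexical features from the URL string alone.

    The features below are assumptions chosen for illustration only.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # A host given as a raw IP address is a commonly used phishing heuristic.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "has_at_symbol": "@" in url,
        "host_is_ip": host_is_ip,
        "uses_https": parsed.scheme == "https",
    }
```

None of these values depends on the content of the referenced page, which is why such features alone say nothing about what a previously unseen page actually displays to the user.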

The aim of this study is to introduce a novel methodology combining blacklist-based features, web content-based and heuristic feature-based approaches, using ML algorithms to allow more accurate phishing attack detection.

Specifically, the objectives of the study are to:

  • Identify a large set of data from different sources.

  • Extract comprehensive features from the dataset: phishing, suspicious, spoofed web and legitimate sites.

  • Model the proposed approach, using ML classification algorithms including ANFIS, NB, PART, J48 and JRip with features.

  • Train and test the models utilising the feature set.

  • Evaluate the methodology performance using multiple evaluation methods.

  • Compare the proposed method with previous work.
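The evaluation objective can be sketched as follows. This is a minimal Python illustration, assuming string class labels and treating "phishing" as the positive class. Accuracy is the metric reported in this study; precision and recall are included here only as common companion metrics and are an assumption, as are the function and parameter names.

```python
def evaluate(y_true, y_pred, positive="phishing"):
    """Return accuracy, precision and recall for one positive class."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # phishing site correctly flagged
        elif pred == positive:
            fp += 1  # non-phishing site wrongly flagged
        elif truth == positive:
            fn += 1  # phishing site missed
        else:
            tn += 1  # non-phishing site correctly passed

    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }
```

Reporting precision and recall alongside accuracy helps when the classes are imbalanced, since accuracy alone can hide a high rate of missed phishing sites.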

Our study contributes a new methodology integrating blacklist-based, web content-based and heuristic feature-based approaches, using ML classifiers with features to distinguish phishing attacks. We identified four categories of dataset from different sources: phishing sites, suspicious sites, spoofed websites and legitimate sites. These enable the identification and extraction of features, which are used to generate, train and test the models. We experimentally demonstrated that the proposed method can detect phishing attacks more accurately and can generalise well to address new attacks.

The main advantages of the proposed method are as follows. First, the outcome will significantly increase users' confidence in online transactions. Second, the approach could improve the accuracy of anti-phishing detection solutions. Spoofed websites that are difficult to trace could be detected, reducing the growing phishing threat. Our feature model will be utilised in the implementation of a plugin toolbar that can be used on end users' computers to keep users secure online.

The disadvantage is that attackers’ techniques evolve constantly (Xiang et al., 2009:1:3; Xiang et al., 2010:3; Xiang et al., 2011:8:26). Therefore, keeping ahead of attackers with regular feature updates is important.

Overall, our method using the PART classifier with features from the data (phishing, suspicious and spoofed websites) shows the best performance in time taken to build the model. Features are randomly split equally into a 300 × 10 training set and a 300 × 10 testing set. Once training is complete, we test the model with the independent, unseen 300 × 10 testing set. This obtained a significant result, which suggests that our method can generalise well to address new phishing attacks.
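A random equal split of this kind can be sketched in Python as follows. This is an illustrative sketch only: the function name, the fixed seed and the toy input are assumptions, and the set sizes used in the study are not reproduced here.

```python
import random


def split_equally(rows, seed=0):
    """Shuffle the rows and split them into two halves (train, test).

    A fixed seed (an assumption made here) keeps the split reproducible.
    If the number of rows is odd, the test half receives the extra row.
    """
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


# Toy usage with ten placeholder feature rows.
train_rows, test_rows = split_equally(range(10))
```

Because the two halves are disjoint, the test half plays the role of the independent unseen data on which the trained model is assessed.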

The remainder of the paper is organised as follows. Section 2 reviews the existing related work. Section 3 describes the proposed methods, covering data extraction, feature selection and the multiple classification algorithms. Section 4 describes the experimental procedures, experimental results and discussions. Experiments based on 4 classifiers and results are also provided. Section 5 provides discussions, comparisons and evaluation. Section 6 concludes the paper, presenting contributions and further work.

Related work

In general, anti-phishing approaches exist to prevent phishing attacks. In this study, we explore several existing anti-phishing related works, focusing on feature-based, content-based or heuristic and blacklist-based approaches, visual similarities and web browser plug-in toolbars for phishing attack detection systems.

Research methods and materials

The proposed approach is based on multiple ML classification algorithms, using ANFIS implemented in the MATLAB toolbox with features, and NB, PART, J48 and JRip implemented in the Waikato Environment for Knowledge Analysis (WEKA). Our methodology, combining blacklist-based features with web content-based and heuristic-based approaches, enables a set of data (phishing websites, suspicious websites, spoofed websites and legitimate websites) to be extracted from diverse sources: PhishTank, Millersmiles and Relbanks.
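The idea of combining a blacklist lookup with content-based and heuristic features can be sketched as below. This is a hedged Python illustration, not the MATLAB or WEKA implementation used in this study: the two content features, the regular expressions and the `toy_model` rule are all hypothetical stand-ins for a real feature set and a trained classifier.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern for a password input; real pages need a proper HTML parser.
PASSWORD_INPUT = re.compile(r"<input[^>]*type\s*=\s*[\"']?password", re.IGNORECASE)


def content_features(html):
    """Extract two illustrative features from raw page source."""
    return {
        "has_password_field": bool(PASSWORD_INPUT.search(html)),
        "num_forms": len(re.findall(r"<form\b", html, flags=re.IGNORECASE)),
    }


def toy_model(features):
    """Placeholder for a trained classifier: flags pages that request a password."""
    return "suspicious" if features["has_password_field"] else "legitimate"


def classify(url, html, blacklist, model=toy_model):
    """Check the blacklist first, then fall back to the feature-based model."""
    host = (urlparse(url).hostname or "").lower()
    if host in blacklist:
        return "phishing"  # known bad host, no further analysis needed
    return model(content_features(html))
```

In such a design the blacklist handles known sites cheaply, while the feature-based model is what allows a decision on pages that no blacklist has seen yet.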

Experimental procedures

In this section, we present five experiments whose primary aim is to assess the performance of the proposed methodology with features and to find the best classification model, using ANFIS, NB, PART, J48 and JRip. All the experiments were conducted with the same 3000 features extracted from the dataset of 30500 websites (10000 phishing sites + 1000 suspicious sites + 10000 legitimate sites + 500 spoofed webs). Features are randomly split equally into a training set and a test set. Three evaluation methods

Discussion

The goal of this study was to introduce a novel methodology that combines blacklist-based, web content-based and heuristic feature-based approaches, using ML classifiers with a feature set, to allow more accurate phishing attack detection and protect online users in real time. All the sets of data were utilized, including phishing sites, suspicious sites, spoofed websites and legitimate sites. These are used to identify features that allow new phishing attacks to be detected accurately.

Our study

Conclusion and future work

Various methods exist to detect phishing websites, but phishing attacks persist (Rao and Pais, 2018). In this study, we presented a novel methodology combining blacklist-based, web content-based and heuristic-based approaches with features, using multiple ML classifiers to detect phishing attacks more accurately. The main contribution of this study is a novel combined methodology that draws on blacklist-based, website content-based and heuristic-based feature approaches. We

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (37)

  • M. Moghimi et al. New rule-based phishing detection method. Expert Syst. Appl. (2016)

  • G. Ramesh et al. An efficacious method for detecting phishing webpages through target domain identification. Decis. Support Syst. (2014)

  • O.K. Sahingoz et al. Machine learning based phishing detection from URLs. Expert Syst. Appl. (2019)

  • M. Sugeno et al. Structure identification of fuzzy model. Fuzzy Sets Syst. (1988)

  • M.A. Adebowale et al. Intelligent web-phishing detection and protection scheme using integrated features of Images, frames and text. Expert Syst. Appl. (2018)

  • M. Babagoli et al. Heuristic nonlinear regression strategy for detecting phishing websites. Soft Comput. (2018)

  • T.-C. Chen et al. Detecting visually similar webpages: application to phishing detection. ACM Trans. Internet Technol. (2010)

  • Chiew K. L., Tan, C. L., Wong, K. S., Yong, K. S. C., Tiong, W. K. (2019a). A new hybrid ensemble feature selection...

  • K.L. Chiew et al. A survey of phishing attacks: their types, vectors and technical approaches. Expert Syst. Appl. (2019)

  • F. Feng et al. The application of a novel neural network in the detection of phishing websites. J. Ambient Intell. Humanized Comput. (2018)

  • L.M. Form et al. Phishing email detection technique by using hybrid features.

  • E. Frank et al. Generating accurate rule sets without global optimization.

  • A. Gupta et al. Content based approach for detection of phishing sites. Int. Res. J. Eng. Technol. (IRJET) (2015)

  • A.K. Jain et al. Towards detection of phishing websites on client-side using machine learning based approach. Telecommun. Syst. (2018)

  • B. Kumar et al. DC scanner: detecting phishing attack.

  • J. Ma et al. Learning to detect malicious URLs. ACM Trans. Intell. Syst. Technol. (2011)

  • R.S. Rao et al. Detection of phishing websites using an efficient feature-based machine learning framework. Neural Comput. Appl. (2018)

  • J. Mao et al. Detecting phishing websites via aggregation analysis of page layouts.


    Phoebe Barraclough is an Associate Lecturer at Northumbria University in the Department of Computer and Information Sciences. She completed her Ph.D. at Northumbria University. She did her viva under the supervision of Professor John Woodward. She has presented 4 papers at international conferences and is a reviewer for a high-ranking journal. Her research interests are in Network/Internet Security, Botnets, Pentesting, Cybersecurity, Machine Learning and Artificial Intelligence. At present she is building a network with 3 universities in East Africa for Global Challenge research in developing countries.

    Gerhard Fehringer is a Principal Lecturer in the Department of Computer and Information Sciences, Northumbria University. He has extensive academic leadership and management experience, including academic roles as Director of Placements and Employability (Engineering and Environment), Enterprise Director (Computing and BIS) and Programme Director (Engineering). He is currently the Director of Enterprise & Engagement for the Department of Computer and Information Sciences. He is also the Head of the Cisco Networking Academy at Northumbria University. He joined the University in 2002 after working for Newcastle University, SIEMENS and Procter & Gamble. His research focuses on Computer Networks, Cyber Security and Internet of Things (IoT) technology and includes a recent Knowledge Transfer Partnership project with Adlink Technology Ltd. on IoT communication protocols.

    John Woodward is Faculty Pro Vice-Chancellor for Engineering and Environment at Northumbria University. He is a Professor in Physical Geography, specialising in glaciology and geophysics. He graduated from Leeds University in 1993, with an Erasmus year at Liege University, Belgium, before studying for an MSc at the University of Alberta, Canada. He then returned to Leeds to complete a PhD followed by a Postdoctoral position. After a lecturing position at Brunel University, London and a research role at the British Antarctic Survey, Cambridge, he joined Northumbria University as a Lecturer in 2003. He was promoted to Reader in 2008, Professor in 2011, Faculty Associate Pro Vice Chancellor in 2013 and Pro Vice-Chancellor in 2018.
