Intelligent cyber-phishing detection for online

doi:10.1016/j.cose.2020.102123

Computers & Security

Volume 104, May 2021, 102123

https://doi.org/10.1016/j.cose.2020.102123

Highlights

• A combined blacklist-based, web content-based and heuristic approach, using multiple algorithms with features, detects fraudulent sites with higher accuracy in real time.
• The methodology can reduce fraudulent attacks and protect online users.
• The integrated method and four categories of dataset form the core framework from which we extract comprehensive features.

Abstract

Phishing attacks are on the increase, resulting in financial loss and theft of sensitive information from online services and users. Anti-phishing approaches have concentrated on blacklist-based approaches that use manually verified Uniform Resource Locators (URLs), or on content-based methods that utilise heuristics-based machine learning (ML) classifiers. However, online deception is still on the rise. In this study, we introduce a novel methodology combining blacklist-based, web content-based and heuristic-based approaches, using ML algorithms with comprehensive features to allow more accurate phishing attack detection. Extensive evaluation was carried out based on the adaptive neuro-fuzzy inference system (ANFIS), Naïve Bayes (NB), PART, J48 and JRip with features, using evaluation methods (metrics) to measure the performance of the proposed method. All the classifiers achieved accuracy of 99% to 99.33%. PART attained 99.33% accuracy in 0.006 seconds (secs), which is the best performance. We experimentally demonstrate that the proposed methodology can detect phishing websites with high accuracy in real time and generalise well to new phishing attacks. The proposed approach has the best performance compared to related approaches in the field.

Introduction

Cyber phishing attacks have become a growing threat and are amongst the most challenging issues affecting governments, businesses and individuals on the World Wide Web. Other major types of cyber-attack are brute force, social engineering/cyber fraud, distributed denial-of-service attacks, pharming and malware/spyware/ransomware. Initially, we focus on cyber phishing attacks, which have been evolving rapidly, putting more pressure on existing detection methods (Xiang et al., 2009:1:3; Xiang et al., 2010:3; Xiang et al., 2011:8:26). Phishing is a form of social engineering in which attackers attempt to fraudulently obtain users' confidential information through communication media that direct users to fake websites claiming to be from known banks or organizations (Jackobson & Mayor, 2007; Xiang et al., 2011). According to statistics reported by the U.S. Federal Bureau of Investigation (USFBI), over $2.7 billion was lost to phishing through compromised business and personal email accounts in 2018 alone (the U.S. FBI Internet Crime Complaint Center (IC3), 2018). This problem is expected to grow as online transactions become part of everyday life. Also, the Anti-Phishing Working Group (APWG) (2018) found that the number of phishing sites detected from July to September 2019 was 266,387. This was an increase of 46% from 182,465 in the second quarter of 2019 and almost double the 138,328 seen in the fourth quarter of 2018 (APWG, 2019). More than 55% of all phishing attacks detected in the second quarter of 2019 used Secure Sockets Layer (SSL) encryption on Hypertext Transfer Protocol Secure (HTTPS) protected domains to fool victims into thinking the sites are safe (APWG, 2019).

Much past academic research in the anti-phishing detection field has concentrated on only one of three single methods to detect phishing attacks: blacklist-based approaches (Sahingoz et al., 2019; Chiew et al., 2019a; Rao and Pais, 2018); visual similarity-based approaches (Mao et al., 2017); or content-based approaches (Adebowale et al., 2018).

In general, blacklist methods are widely employed in industry due to their low false positives, but a blacklist alone cannot generalise well to unseen phishing instances (Xiang et al., 2011; Prakash et al., 2010; Sheng et al., 2009). For example, Xiang et al. (2011) reported that a blacklist feature-based method alone cannot detect new phishing attacks accurately because blacklist features are properties of the URL itself, not of the content of the web page it references. URL-based features capture properties of the hyperlinks in the source code, but not the content of the web page they reference (Rao and Pais, 2018). As a result, there is a scarcity of effective features and false positive rates are high (Xiang et al., 2011). These challenges require careful consideration of how effective features can be identified.
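The distinction between URL properties and page content can be illustrated with a minimal Python sketch. This is not the feature set used in this study: the feature names (`url_length`, `num_dots_in_host`, `host_is_ip`, `has_at_symbol`, `uses_https`) and the exact-match `in_blacklist` helper are hypothetical, chosen only to show that such features are computed from the URL string alone and that an exact-match list says nothing about a URL it has not already seen.

```python
import ipaddress
from urllib.parse import urlparse


def url_features(url: str) -> dict:
    """Compute a few illustrative lexical features from the URL string only."""
    parsed = urlparse(url)
    host = parsed.hostname or ""
    try:
        # A raw IP address in place of a domain name is a common heuristic signal.
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "url_length": len(url),
        "num_dots_in_host": host.count("."),
        "host_is_ip": host_is_ip,
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
    }


def in_blacklist(url: str, blacklist: set) -> bool:
    """Exact-match lookup: only URLs already on the list are flagged."""
    return url in blacklist
```

For instance, with a blacklist containing only `http://phish.example/login`, the near-identical `http://phish.example/login2` is not flagged, which is the generalisation gap described above.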

The aim of this study is to introduce a novel methodology combining blacklist-based features, web content-based and heuristic feature-based approaches, using ML algorithms to allow more accurate phishing attack detection.

Specifically, the objectives of the study are to:

• Identify a large set of data from different sources.
• Extract comprehensive features from the dataset: phishing, suspicious, spoofed web and legitimate sites.
• Model the proposed approach, using ML classification algorithms including ANFIS, NB, PART, J48 and JRip with features.
• Train and test the models utilising the feature set.
• Evaluate the performance of the methodology using multiple evaluation methods.
• Compare the proposed method with previous works.
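The evaluation objective above can be sketched in Python as follows. The specific metrics used in this study are not listed in this excerpt, so the choice here (accuracy, precision, recall and false positive rate, all derived from a binary confusion matrix) is an assumption, as are the function names `confusion_counts` and `evaluate` and the convention that label 1 means phishing.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=1):
    """Return common metrics; a ratio with a zero denominator is reported as 0.0."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

Reporting the false positive rate alongside accuracy matters in this setting, because a legitimate site wrongly flagged as phishing has a direct cost to users.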

Our proposed study contributes a new methodology integrating blacklist-based methods, web content-based and heuristic feature-based approaches, using ML classifiers with features to distinguish phishing attacks. We identified four categories of dataset from different sources: phishing sites, suspicious sites, spoofed webs and legitimate sites. These enable the identification and extraction of features, which are used to generate, train and test the models. We experimentally demonstrated that the proposed method can detect phishing attacks more accurately and can generalise well to address new attacks.

The main advantages of the proposed method are as follows. First, the outcome will significantly increase users' confidence in online transactions. Second, the approach could improve the accuracy of anti-phishing detection solutions. Spoofed webs that are difficult to trace could be detected, reducing the growing phishing threat. Our feature model will be utilised in the implementation of a plugin toolbar that can be used on end users' computers to keep users secure online.

The disadvantage is that attackers’ techniques evolve constantly (Xiang et al., 2009:1:3; Xiang et al., 2010:3; Xiang et al., 2011:8:26). Therefore, keeping ahead of attackers with regular feature updates is important.

Overall, our method using the PART classifier with features from the data (phishing, suspicious and spoofed websites) shows the best performance in time taken to build the model. Features are randomly split equally into a 300 × 10 training set and a 300 × 10 testing set. Once training is complete, we test the model with the independent, unseen 300 × 10 testing set. This obtained a significant result for this study, which suggests that our method can generalise well to address new phishing attacks.
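A random equal split of this kind can be sketched in Python as below. This is an illustration only: the helper name `split_equally` and the fixed seed are hypothetical, and reading "300 × 10" as 300 rows of 10 feature values each is an assumption about the notation, not something stated in the text.

```python
import random


def split_equally(rows, seed=0):
    """Shuffle the rows reproducibly and return two equal-sized halves.

    If the number of rows is odd, the last shuffled row is left out so that
    both halves have the same size.
    """
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    half = len(indices) // 2
    train = [rows[i] for i in indices[:half]]
    test = [rows[i] for i in indices[half:2 * half]]
    return train, test
```

Because the two halves are drawn from disjoint index ranges of one shuffle, no row can appear in both sets, which is what keeps the test set unseen during training.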

The remainder of the paper is organised as follows. Section 2 reviews the existing related work. Section 3 describes the proposed methods, covering data extraction, feature selection and the multiple classification algorithms. Section 4 describes the experimental procedures, experimental results and discussions. Experiments based on 4 classifiers and results are also provided. Section 5 provides discussions, comparisons and evaluation. Section 6 concludes the paper, presenting contributions and further work.

Section snippets

Related work

In general, anti-phishing approaches exist to prevent phishing attacks. In this study, we explore several existing anti-phishing related works, focusing on feature-based, content-based or heuristic and blacklist-based approaches, visual similarities and web browser plug-in toolbars for phishing attack detection systems.

Research methods and materials

The proposed approach is based on multiple ML classification algorithms, using ANFIS implemented in the MATLAB toolbox with features, and NB, PART, J48 and JRip implemented in the Waikato Environment for Knowledge Analysis (WEKA). Our methodology, combining blacklist-based features, web content-based and heuristic-based approaches, enables a set of data (phishing websites, suspicious websites, spoofed web and legitimate websites) to be extracted from diverse sources: PhishTank, Millersmiles and Relbanks.
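As a rough illustration of how a classifier such as NB can consume features drawn from the three approaches together, the following is a minimal pure-Python Bernoulli Naïve Bayes sketch with Laplace smoothing. It is not WEKA's implementation and not the model built in this study. The three binary features assumed in the usage note (a blacklist hit, a login form found in the page content, an IP address used as host) are hypothetical, as are the function names.

```python
import math


def train_bernoulli_nb(X, y, alpha=1.0):
    """Fit per-class log priors and smoothed P(feature = 1 | class)."""
    n_features = len(X[0])
    model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior = math.log(len(rows) / len(X))
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        probs = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
        model[c] = (log_prior, probs)
    return model


def predict(model, x):
    """Return the class with the highest posterior log score for binary vector x."""
    best_class, best_score = None, -math.inf
    for c, (log_prior, probs) in model.items():
        score = log_prior
        for xj, pj in zip(x, probs):
            score += math.log(pj if xj else 1.0 - pj)
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```

With rows of the form `[in_blacklist, has_login_form, host_is_ip]` and labels 1 (phishing) or 0 (legitimate), a model trained on a handful of examples can score a URL that is absent from the blacklist, because the content and heuristic indicators still contribute evidence.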

Experimental procedures

In this section, we present five experiments with the primary aim of assessing the performance of the proposed methodology with features to find the best classification model, using ANFIS, NB, PART, J48 and JRip. All the experiments were conducted with the same 3000 features extracted from the dataset of 30500 websites (10000 phishing sites + 1000 suspicious sites + 10000 legitimate sites + 500 spoofed webs). Features are randomly split equally into a training set and a test set. Three evaluation methods

Discussion

The goal of this study was to introduce a novel methodology that combines blacklist-based, web content-based and heuristic feature-based approaches, using ML classifiers with a feature set to allow more accurate phishing attack detection and to protect online users in real time. All the sets of data were utilized, including phishing sites, suspicious sites, spoofed web and legitimate sites. These are used to identify features that in turn are used to detect new phishing attacks accurately.

Our study

Conclusion and future work

Various methods exist to detect phishing websites, but phishing attacks persist (Rao and Pais, 2018). In this study, we presented a novel methodology combining blacklist-based features, web content-based and heuristic-based approaches with features, using multiple ML classifiers for detecting phishing attacks more accurately. The main contribution of this study is a novel combined methodology that brings together blacklist-based, website content-based and heuristic-based feature approaches. We

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Phoebe Barraclough is an Associate Lecturer at Northumbria University in the Department of Computer and Information Sciences. She completed her Ph.D. at Northumbria University under the supervision of Professor John Woodward. She has presented 4 papers at international conferences and is a reviewer for a high-ranking journal. Her research interests are in Network/Internet Security, Botnet, Pentesting, Cybersecurity, Machine Learning and Artificial Intelligence. At present she is

References (37)

M. Moghimi et al.
New rule-based phishing detection method
Expert Syst. Appl.
(2016)
G. Ramesh et al.
An efficacious method for detecting phishing webpages through target domain identification
Decis. Support Syst.
(2014)
O.K. Sahingoz et al.
Machine learning based phishing detection from URLs
Expert Syst. Appl.
(2019)
M. Sugeno et al.
Structure identification of fuzzy model
Fuzzy Sets Syst.
(1988)
M.A. Adebowale et al.
Intelligent web-phishing detection and protection scheme using integrated features of Images, frames and text
Expert Syst. Appl.
(2018)
M. Babagoli et al.
Heuristic nonlinear regression strategy for detecting phishing websites
Soft Comput.
(2018)
T.-C. Chen et al.
Detecting visually similar webpages: application to phishing detection
ACM Trans. Internet Technol.
(2010)
Chiew K. L., Tan, C. L., Wong, K. S., Yong, K. S. C., Tiong, W. K. (2019a). A new hybrid ensemble feature selection...
K.L. Chiew et al.
A survey of phishing attacks: their types, vectors and technical approaches
Expert Syst. Appl.
(2019)
F. Feng et al.
The application of a novel neural network in the detection of phishing websites
J. Ambient Intell. Humanized Comput.
(2018)

L.M. Form et al.
Phishing email detection technique by using hybrid features
E. Frank et al.
Generating accurate rule sets without global optimization
A. Gupta et al.
Content based approach for detection of phishing sites
Int. Res. J. Eng. Technol. (IRJET)
(2015)
A.K. Jain et al.
Towards detection of phishing websites on client-side using machine learning based approach
Telecommun. Syst.
(2018)
B. Kumar et al.
DC scanner: detecting phishing attack
J. Ma et al.
Learning to detect malicious URLs
ACM Trans. Intell. Syst. Technol.
(2011)
R.S. Rao et al.
Detection of phishing websites using an efficient feature-based machine learning framework
Neural Comput. Appl.
(2018)
J. Mao et al.
Detecting phishing websites via aggregation analysis of page layouts


Gerhard Fehringer is a Principal Lecturer in the Department of Computer and Information Sciences, Northumbria University. He has extensive academic leadership and management experience, including academic roles as Director of Placements and Employability (Engineering and Environment), Enterprise Director (Computing and BIS) and Programme Director (Engineering). He is currently the Director of Enterprise & Engagement for the Department of Computer and Information Sciences. He is also the Head of the Cisco Networking Academy at Northumbria University. He joined the University in 2002 after working for Newcastle University, SIEMENS and Procter & Gamble. His research focuses on Computer Networks, Cyber Security and Internet of Things (IoT) technology and includes a recent Knowledge Transfer Partnership project with Adlink Technology Ltd. on IoT communication protocols.

John Woodward is Faculty Pro Vice-Chancellor for Engineering and Environment at Northumbria University. He is a Professor in Physical Geography, specialising in glaciology and geophysics. He graduated from Leeds University in 1993, with an Erasmus year at Liege University, Belgium, before studying for an MSc at the University of Alberta, Canada. He then returned to Leeds to complete a PhD followed by a Postdoctoral position. After a lecturing position at Brunel University, London and a research role at the British Antarctic Survey, Cambridge, he joined Northumbria University as a Lecturer in 2003. He was promoted to Reader in 2008, Professor in 2011, Faculty Associate Pro Vice Chancellor in 2013 and Pro Vice-Chancellor in 2018.
