An empirical study of behavioral characteristics of spammers: Findings and implications☆
Introduction
In recent years, content-independent (CI) anti-spam techniques, which estimate the likelihood that an incoming message is spam from the properties of the message sender rather than the message content, have attracted considerable attention [4], [15], [18], [21]. However, for CI anti-spam schemes to be effective, we must have a clear understanding of the behavioral characteristics that distinguish spammers from the senders of legitimate messages. Behavioral characteristics of spammers such as the distributions of spam and non-spam messages by spam ratios, the statistics of spam messages from different spammers, the spam arrival patterns across the IP address space, the number of mail servers in different (spam) networks, and the active duration of spammers can significantly affect both the feasibility and the effectiveness of CI anti-spam mechanisms. Moreover, a clear understanding of these characteristics can also facilitate the design of new anti-spam mechanisms and new email delivery architectures that are inherently spam-resistant.
In this paper we perform a detailed study of the behavioral characteristics of spammers at both the mail server and the network levels by analyzing a two-month trace of more than 25 million emails received at a large US university campus network, of which more than 18 million are spam. We also correlate the arrivals of spam with Border Gateway Protocol (BGP) route updates to investigate the network reachability properties of spammers [16]. Our study confirms the informal observation that the spam arrivals from some spammers are often closely correlated in time with the BGP announcement of the corresponding network prefixes [19]. These network prefixes are short-lived in that they are withdrawn quickly after the spamming activity is over. This sophisticated spamming technique can make it hard to track and identify the responsible spammers. In this paper we formally study the prevalence of such behavior.
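The correlation between spam arrivals and BGP announcements described above can be illustrated with a minimal sketch. It assumes a simplified, time-ordered list of `(timestamp, prefix, kind)` update tuples seen from a single vantage point, where `kind` is `"A"` for an announcement and `"W"` for a withdrawal. This tuple layout, the function names, and the sample values are hypothetical and are not the trace format or the analysis code used in the study.

```python
from datetime import datetime, timedelta
from ipaddress import ip_address, ip_network


def visible_intervals(updates, prefix, trace_end):
    """Return (start, end) intervals during which `prefix` was announced.

    `updates` must be ordered by time. The (timestamp, prefix, kind) layout
    is an assumption made for this illustration.
    """
    intervals, start = [], None
    for ts, pfx, kind in updates:
        if pfx != prefix:
            continue
        if kind == "A" and start is None:
            start = ts                      # prefix becomes visible
        elif kind == "W" and start is not None:
            intervals.append((start, ts))   # prefix is withdrawn
            start = None
    if start is not None:                   # still announced at end of trace
        intervals.append((start, trace_end))
    return intervals


def life_span(intervals):
    """Total time the prefix was visible."""
    return sum((end - begin for begin, end in intervals), timedelta())


def arrivals_within(intervals, arrivals):
    """Number of message arrival times that fall inside a visible interval."""
    return sum(any(b <= t <= e for b, e in intervals) for t in arrivals)


# Hypothetical sample data (documentation address range, made-up timestamps).
prefix = "203.0.113.0/24"
updates = [
    (datetime(2005, 9, 1, 0, 0), prefix, "A"),
    (datetime(2005, 9, 3, 0, 0), prefix, "W"),
]
spam_arrivals = [datetime(2005, 9, 1, 6, 0), datetime(2005, 9, 2, 12, 0)]

sender_in_prefix = ip_address("203.0.113.8") in ip_network(prefix)
intervals = visible_intervals(updates, prefix, datetime(2005, 10, 24))
span = life_span(intervals)
hits = arrivals_within(intervals, spam_arrivals)
print(sender_in_prefix, span, hits)
```

In this toy case the prefix is visible for two days and both spam arrivals fall inside that window, which is the pattern the study looks for. Real BGP data would need to account for multiple peers and for re-announcements without an intervening withdrawal.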
We use the following terms in the exposition of our findings. The spam ratio of a message sender is the fraction of the messages sent by the sender that are spam. A spam only mail server sends only spam messages, and a non-spam only mail server sends only legitimate messages. A sender mail server that sends both spam and legitimate messages is referred to as a mixed mail server. The term spam mail servers refers to the set of both spam only and mixed mail servers; that is, a spam mail server sends at least one spam message in the trace. The term non-spam mail servers refers to the set of non-spam only and mixed mail servers; that is, a non-spam mail server sends at least one legitimate message in the trace. Sender networks are classified similarly. The major findings from our study are as follows:
- Most mail servers send either mostly spam or mostly non-spam messages (Section 4.1). For example, about 5% of mail servers have a spam ratio of at most 1%, and about 90% of mail servers have a spam ratio of at least 99%. Fewer than 5% of mail servers have a spam ratio between 1% and 99%. The vast majority of spam messages come from mail servers with a high spam ratio, and a large portion of non-spam messages come from mail servers with a low spam ratio (Section 4.2). For example, more than 91% of spam messages come from mail servers with a spam ratio of at least 90%, and about 76% of non-spam messages come from mail servers with a spam ratio of at most 10%.
- The majority of spammers send only a small number of spam messages (Section 4.3). For example, 93% of spam only mail servers and 58% of spam only networks send no more than 10 messages each during the two-month trace collection period. In contrast, about 0.04% of spam only mail servers send more than 1000 messages each and are responsible for 16% of all spam messages. About 0.5% of spam only networks send more than 1000 messages each and are responsible for 2% of all spam messages. A large portion of spammers send spam only within a short period of time (Section 4.6). For example, 81% of spam only mail servers and 27% of spam only networks send spam only within one day out of the two-month email collection period.
- The vast majority of both spam messages and spam only mail servers are from mixed networks (Sections 3 and 4.3). For example, about 91.7% of spam messages and 91% of spam only mail servers are from mixed networks. Moreover, only 6.5% of mixed networks send more than 1000 messages each, yet they are responsible for 75% of all spam messages.
- The majority of both spam messages and spam mail servers are from a few concentrated regions of the IP address space (Sections 4.4 and 4.5). For example, 68% of spam messages and 74% of spam mail servers are from the top 20 “/8” IP address spaces. The top “/8” address spaces of spam messages and spam mail servers largely overlap with each other. In addition, spam networks tend to have more mail servers than non-spam only networks. For example, less than 1% of non-spam only networks have more than 10 mail servers. In contrast, about 14% of spam only networks have more than 10 mail servers. Alarmingly, about 10% of mixed networks have more than 100 mail servers, and about 1% have more than 1000 mail servers. It is likely that a large portion of mail servers in the mixed networks are infected machines (popularly called bots).
- Network prefixes for a small portion of spam only networks are only visible within a short period of time, and the short life span of these network prefixes coincides with the delivery of spam from the corresponding networks (Section 5). For example, during the two-month trace collection period, the network prefixes of about 6% of spam only networks are visible for no longer than one week. In contrast, only about 2% of non-spam only networks and 2% of mixed networks have a life span of less than one week.
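The terminology used in these findings (spam ratio; spam only, non-spam only, and mixed senders; “/8” address spaces) can be made concrete with a short sketch. The function names and the per-server counts below are hypothetical illustrations, not data or code from the study, and the example assumes IPv4 sender addresses.

```python
from collections import Counter
from ipaddress import IPv4Address


def spam_ratio(spam: int, legit: int) -> float:
    """Fraction of a sender's messages that are spam."""
    total = spam + legit
    if total == 0:
        raise ValueError("sender has no messages in the trace")
    return spam / total


def classify_sender(spam: int, legit: int) -> str:
    """Label a sender as 'spam only', 'non-spam only', or 'mixed'."""
    if spam + legit == 0:
        raise ValueError("sender has no messages in the trace")
    if legit == 0:
        return "spam only"
    if spam == 0:
        return "non-spam only"
    return "mixed"


def slash8_counts(sender_ips):
    """Count senders per "/8" block, keyed by the first octet of the address."""
    return Counter(int(IPv4Address(ip)) >> 24 for ip in sender_ips)


# Hypothetical (spam, legitimate) message counts per mail server.
servers = {
    "192.0.2.10": (12, 0),
    "192.0.2.77": (3, 9),
    "198.51.100.5": (0, 40),
    "203.0.113.8": (1, 0),
}

labels = {ip: classify_sender(s, h) for ip, (s, h) in servers.items()}

# Spam mail servers are the spam only and mixed servers together.
spam_servers = [ip for ip, label in labels.items() if label != "non-spam only"]
top_blocks = slash8_counts(spam_servers).most_common(2)
print(labels, top_blocks)
```

Networks can be labeled the same way by summing the counts of all mail servers that fall under one network prefix before calling `classify_sender`.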
In the remainder of the paper, we present the details of the findings and discuss their implications for the design and development of (content-independent) anti-spam schemes. In Section 2 we describe the collection of the email and BGP traces, the analysis methodology, and the terminology used in the paper. We present an overview of the email trace in Section 3. We study the behavioral characteristics of spammers and their network reachability properties in Sections 4 and 5, respectively. We discuss the implications of the findings for spam control in Section 6. We describe the related work in Section 7 and conclude the paper in Section 8.
Data sources
The email trace was collected at a mail relay server deployed in the Florida State University (FSU) campus network between 8/25/2005 and 10/24/2005 (excluding 9/11/2005). During the course of the email trace collection, the mail server relayed messages destined for 53 sub-domains in the FSU campus network. The mail relay server ran SpamAssassin [14] to detect spam messages. The email trace contains the following information for each incoming message: the local arrival time, the IP address of
Overview of the email trace
The email trace was collected between 8/25/2005 and 10/24/2005 (excluding 9/11/2005). The trace contains more than 25 M emails, of which more than 18 M, or about 73% of the messages, are spam (see Table 1). During the course of the trace collection, we observed more than 2 M mail servers, of which more than 95% sent at least one spam message and only about 10% sent at least one non-spam message. The messages come from 68,732 networks, of which more than 90% sent at least one spam message.
Behavioral characteristics of spammers
In this section we present a detailed study on the behavioral characteristics of spammers. In particular, we study the distribution of mail senders by spam ratios, the distribution of spam and non-spam messages by spam ratio of senders, the distributions of spam messages from different spammers, the spam arrival patterns across the IP address space, the number of mail servers in different spam networks, and the active duration of spammers, among others. We also briefly discuss the important
Network reachability properties of spammers
An important objective of this section is to verify an informal observation by Paul Vixie that the spam arrivals from some spammers are often closely correlated in time with the BGP announcement of the corresponding network prefixes [19]. These network prefixes are short-lived in that they are withdrawn after the spamming activity is finished. This technique makes it hard to identify the spammers that are responsible for spamming. In this section we formally confirm this behavior and
Implications and discussion
In this section we discuss the implications of our findings for the (content-independent) anti-spam efforts. In particular, we examine the implications of the findings on email sender authentication, sender reputation based spam filtering techniques, and new email delivery architectures.
Related work
The work [2] studied the characteristics of spam traffic aiming to identify the features that can distinguish spam from legitimate messages. They found that key email workload aspects including the email arrival process, email size distribution, and distributions of popularity and temporal locality of email recipients can distinguish spam from legitimate messages. They also discussed the inherently different natures of spammers and legitimate email users that contribute to the distinct features
Conclusion
In this paper we studied the behavioral characteristics of spammers at both the mail server and network levels using a two-month email trace collected on the FSU campus network. We focused on the behavioral characteristics that have important implications for spam control, including the distributions of mail servers, spam and non-spam messages by spam ratios; the statistics of spam messages from different spammers; the spam arrival patterns across the IP address space; and the active duration
Acknowledgments
We thank Paul Vixie for the informative discussion that motivated this study. We thank Mary Stephenson and Arthur Houle for helping collect the email trace and the University of Oregon Route Views Project for making the BGP trace publicly available. Zhenhai Duan was supported in part by NSF Grants CCF-0541096 and CNS-1041677. Kartik Gopalan was supported in part by NSF Grant CCF-0541096. Xin Yuan was supported in part by NSF Grants ANI-0106706, CCR-0208892, CCF-0342540, and CCF-0541096. Any
References (21)
- et al., DMTP: Controlling spam through message delivery differentiation, Computer Networks (2007)
- L. Gomes, C. Cazita, J. Almeida, V. Almeida, W. Meira, Characterizing a spam traffic, in: Proceedings of the IMC’04,...
- et al., Internet Routing Architectures (2000)
- S. Hao, N.A. Syed, N. Feamster, E.G. Gray, S. Krasser, Detecting spammers with SNARE: spatio-temporal network-level...
- IANA, Internet assigned numbers authority....
- et al., Secure border gateway protocol (S-BGP), IEEE Journal on Selected Areas in Communications (2000)
- M. Kokkodis, M. Faloutsos, Spamming botnets: Are we losing the war?, in: Proceedings of the 6th Conference on Email and...
- Merit Network Inc. Merit network routing assets database....
- U. of Oregon. Route Views project....
- A. Ramachandran, N. Feamster, Understanding the network-level behavior of spammers, in: Proceedings of the ACM SIGCOMM,...
☆ A preliminary version of this paper appeared in the Proceedings of IEEE ICC 2007 with the title “Behavioral characteristics of spammers and their network reachability properties”.