ABSTRACT
Detecting unsolicited content and the spammers who create it is a long-standing challenge that affects all of us on a daily basis. The recent growth of richly structured social networks has provided new challenges and opportunities in the spam detection landscape. Motivated by the Tagged.com social network, we develop methods to identify spammers in evolving multi-relational social networks. We model a social network as a time-stamped multi-relational graph where vertices represent users and edges represent different activities between them. To identify spammer accounts, our approach makes use of structural features, sequence modelling, and collective reasoning. We leverage relational sequence information using k-gram features and probabilistic modelling with a mixture of Markov models. Furthermore, to perform collective reasoning and improve the predictive power of a noisy abuse reporting system, we develop a statistical relational model using hinge-loss Markov random fields (HL-MRFs), a highly scalable class of probabilistic graphical models. We use GraphLab Create and Probabilistic Soft Logic (PSL) to prototype and experimentally evaluate our solutions on internet-scale data from Tagged.com. Our experiments demonstrate the effectiveness of our approach and show that models which incorporate the multi-relational nature of the social network achieve significantly better predictive performance than those that do not.
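To make the k-gram idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each edge is a hypothetical `(timestamp, source_user, target_user, relation)` tuple and that the relation names (`message`, `view_profile`, `friend_request`) are invented for illustration. It orders each user's outgoing actions by time and counts consecutive runs of k relations, which could then serve as per-user features for a classifier.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical edge format: (timestamp, source_user, target_user, relation)
Edge = Tuple[int, str, str, str]


def action_sequences(edges: Iterable[Edge]) -> Dict[str, List[str]]:
    """Build one time-ordered sequence of relation types per acting user."""
    sequences: Dict[str, List[str]] = defaultdict(list)
    for _, src, _, relation in sorted(edges, key=lambda e: e[0]):
        sequences[src].append(relation)
    return dict(sequences)


def kgram_counts(sequence: List[str], k: int = 2) -> Counter:
    """Count every run of k consecutive relations in a sequence."""
    return Counter(
        tuple(sequence[i:i + k]) for i in range(len(sequence) - k + 1)
    )


# Toy data; the relation names are assumptions, not taken from the paper.
edges = [
    (1, "u1", "u2", "message"),
    (2, "u1", "u3", "message"),
    (3, "u2", "u1", "view_profile"),
    (4, "u1", "u4", "friend_request"),
    (5, "u1", "u5", "message"),
]

sequences = action_sequences(edges)
features = {user: kgram_counts(seq, k=2) for user, seq in sequences.items()}
```

A user with fewer than k actions simply gets an empty feature set. In practice the counts would likely be normalized or combined with the structural features mentioned in the abstract.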