ABSTRACT
Detecting illegal products, such as unclassified pornographic products, is an important and urgent research problem for decentralized eCommerce services such as eBay, eBid, and Taobao. The task is challenging because some sellers use, and keep changing, camouflaged text to deceive current detection algorithms. In this study, we propose a novel task: dynamically locating pornographic products in very large product collections. Unlike prior product classification efforts, which focus on textual information, the proposed model, BerryPIcking TRee MoDel (BIRD), utilizes both product textual content and buyers' seeking behavior information, represented as berrypicking trees. In particular, BIRD encodes both the semantic information of all branch sequences and the overall latent buyer intent across the whole seeking process. An extensive set of experiments has been conducted to demonstrate the advantage of the proposed model over alternative solutions. To facilitate further research on this practical and important problem, the code and buyers' seeking behavior data have been made publicly available.
Index Terms
- Finding Camouflaged Needle in a Haystack?: Pornographic Products Detection via Berrypicking Tree Model