research-article

Finding Camouflaged Needle in a Haystack?: Pornographic Products Detection via Berrypicking Tree Model

Authors:
Guoxiu He, Wuhan University, Wuhan, China
Yangyang Kang, Alibaba Group, Hangzhou, China
Zhe Gao, Alibaba Group, Hangzhou, China
Zhuoren Jiang, Sun Yat-sen University, Guangzhou, China
Changlong Sun, Alibaba Group, Hangzhou, China
Xiaozhong Liu, Indiana University Bloomington, Bloomington, IN, USA
Wei Lu, Wuhan University, Wuhan, China
Qiong Zhang, Alibaba Group, Sunnyvale, CA, USA
Luo Si, Alibaba Group, Seattle, WA, USA

SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2019, Pages 365–374. https://doi.org/10.1145/3331184.3331197

Published: 18 July 2019

ABSTRACT

Detecting illegal products, e.g., unclassified pornographic products, is an important and urgent research problem for decentralized eCommerce services such as eBay, eBid, and Taobao. The task is challenging because some sellers use, and keep changing, camouflaged text to deceive current detection algorithms. In this study, we propose a novel task: dynamically locating pornographic products in very large product collections. Unlike prior product classification efforts, which focus on textual information, the proposed model, BerryPIcking TRee MoDel (BIRD), utilizes both product textual content and buyers' seeking behavior information, represented as berrypicking trees. In particular, BIRD encodes both the semantic information of all branch sequences and the overall latent buyer intent across the whole seeking process. An extensive set of experiments has been conducted to demonstrate the advantage of the proposed model over alternative solutions. To facilitate further research on this practical and important problem, the code and buyers' seeking behavior data have been made publicly available¹.

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Machine learning approaches
      1. Neural networks
2. Information systems
  1. Information retrieval
    1. Information retrieval query processing
      1. Query log analysis
  2. World Wide Web
    1. Web searching and information discovery
      1. Web search engines
        1. Spam detection

Published in
SIGIR'19: Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2019
1512 pages
ISBN:9781450361729
DOI:10.1145/3331184
General Chairs: Benjamin Piwowarski (CNRS - Sorbonne Universite, France), Max Chevalier (Universite de Toulouse, CNRS, France), Eric Gaussier (Universite Grenoble Alpes, CNRS, France)
Program Chairs: Yoelle Maarek (Amazon Research, Israel), Jian-Yun Nie (University of Montreal, Canada), Falk Scholer (RMIT University, Australia)
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
deep neural network
information seeking
query log analysis
spam detection
user behavior
Acceptance Rates
SIGIR'19 Paper Acceptance Rate: 84 of 426 submissions, 20%
Overall Acceptance Rate: 792 of 3,983 submissions, 20%