ABSTRACT
Searching and retrieving information from the Web is a primary activity needed to monitor the development and usage of Web resources. Possible benefits include improving user experience (e.g., by optimizing query results) and enforcing data and user security (e.g., by identifying harmful websites). Motivated by the lack of ready-to-use solutions, in this paper we present a flexible and accessible toolkit for structure and content mining, able to crawl, download, extract and index resources from the Web. While easily configurable to work on the "surface" Web, our suite is specifically tailored to explore the Tor dark Web, i.e., the ensemble of Web servers composing the world's most famous darknet. Notably, the toolkit is not just a Web scraper: it includes two mining modules, one that prepares content to be fed to an (external) semantic engine and one that reconstructs the graph structure of the explored portion of the Web. Besides discussing in detail the design, features and performance of our toolkit, we report the findings of a preliminary run over Tor, which illustrate the potential of our solution.
Index Terms
- Design, implementation and test of a flexible tor-oriented web mining toolkit