DOI: 10.1145/3102254.3102266
research-article

Design, implementation and test of a flexible tor-oriented web mining toolkit

Published: 19 June 2017

ABSTRACT

Searching and retrieving information from the Web is a primary activity needed to monitor the development and usage of Web resources. Possible benefits include improving user experience (e.g., by optimizing query results) and enforcing data and user security (e.g., by identifying harmful websites). Motivated by the lack of ready-to-use solutions, in this paper we present a flexible and accessible toolkit for structure and content mining, able to crawl, download, extract and index resources from the Web. Although easily configurable to work in the "surface" Web, our suite is specifically tailored to explore the Tor dark Web, i.e., the ensemble of Web servers composing the world's most famous darknet. Notably, the toolkit is more than a Web scraper: it includes two mining modules, one that prepares content to be fed to an (external) semantic engine, and one that reconstructs the graph structure of the explored portion of the Web. Besides discussing in detail the design, features and performance of our toolkit, we report the findings of a preliminary run over Tor, which illustrate the potential of our solution.
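To make the structure-mining idea concrete, the following is a minimal sketch of how a link graph could be reconstructed from already-downloaded pages. It is not the toolkit's implementation, which the abstract does not describe at this level. The input format (a dict from page URL to HTML text), the names `LinkExtractor` and `build_host_graph`, the choice of hosts as graph nodes, and the example `.onion` addresses are all assumptions made for illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags found in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_host_graph(pages):
    """Map each crawled host to the set of other hosts it links to.

    `pages` is a dict {page_url: html_text}; this input format is a
    hypothetical choice for the sketch, not something the paper specifies.
    """
    graph = {}
    for url, html in pages.items():
        src = urlsplit(url).hostname
        if src is None:
            continue
        targets = graph.setdefault(src, set())
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links against the page URL before comparing hosts.
            dst = urlsplit(urljoin(url, href)).hostname
            if dst and dst != src:
                targets.add(dst)
    return graph


# Made-up example pages; the addresses are placeholders, not real services.
pages = {
    "http://aaa.onion/index.html": '<a href="http://bbb.onion/">b</a> <a href="/about">self</a>',
    "http://bbb.onion/": '<a href="http://aaa.onion/x">a</a> <a href="http://ccc.onion/">c</a>',
}
graph = build_host_graph(pages)
```

Here the graph is aggregated at host level and self-links are dropped, which is one common design choice; a page-level graph would keep full URLs as nodes instead.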

Published in

WIMS '17: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics
June 2017
268 pages
ISBN: 9781450352253
DOI: 10.1145/3102254

Copyright © 2017 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 140 of 278 submissions, 50%
