Design, implementation and test of a flexible tor-oriented web mining toolkit

Authors:
Alessandro Celestini, Institute for Applied Computing (IAC-CNR), Rome, Italy
Stefano Guarino, Institute for Applied Computing (IAC-CNR), Rome, Italy

WIMS '17: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics, June 2017, Article No.: 19, Pages 1–10. https://doi.org/10.1145/3102254.3102266

Published: 19 June 2017

ABSTRACT

Searching and retrieving information from the Web is a primary activity needed to monitor the development and usage of Web resources. Possible benefits include improving user experience (e.g. by optimizing query results) and enforcing data/user security (e.g. by identifying harmful websites). Motivated by the lack of ready-to-use solutions, in this paper we present a flexible and accessible toolkit for structure and content mining, able to crawl, download, extract and index resources from the Web. While easily configurable to work in the "surface" Web, our suite is specifically tailored to explore the Tor dark Web, i.e. the ensemble of Web servers composing the world's most famous darknet. Notably, the toolkit is not just a Web scraper: it also includes two mining modules, one that prepares content to be fed to an (external) semantic engine, and one that reconstructs the graph structure of the explored portion of the Web. Besides discussing in detail the design, features and performance of our toolkit, we report the findings of a preliminary run over Tor, which illustrate the potential of our solution.
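
The abstract does not describe how the graph-reconstruction module works internally, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes pages have already been downloaded into a hypothetical `pages` dictionary (page URL to HTML text), and it builds a host-level directed graph by extracting anchor links with Python's standard library. The names `LinkExtractor` and `build_host_graph`, and the sample `.onion` addresses, are illustrative assumptions.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def build_host_graph(pages):
    """Map each source host to the set of other .onion hosts it links to.

    `pages` is a hypothetical dict {page URL: HTML text} that a crawler
    is assumed to have downloaded beforehand.
    """
    graph = defaultdict(set)
    for url, html in pages.items():
        source = urlparse(url).hostname
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.hrefs:
            # Resolve relative links against the page URL before parsing.
            target = urlparse(urljoin(url, href)).hostname
            # Keep only hidden-service targets and skip self-links.
            if target and target.endswith(".onion") and target != source:
                graph[source].add(target)
    return graph


# Made-up sample input standing in for crawled pages.
pages = {
    "http://aaa.onion/index.html":
        '<a href="http://bbb.onion/">b</a> <a href="/about">about</a>',
    "http://bbb.onion/":
        '<a href="http://aaa.onion/x">a</a> <a href="https://example.com/">clear</a>',
}
graph = build_host_graph(pages)
```

Aggregating at host level rather than page level is one possible design choice: it keeps the graph small and makes each node correspond to one hidden service, at the cost of discarding intra-site link structure.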

Index Terms

1. Information systems
  1. World Wide Web
    1. Web mining
      1. Data extraction and integration
        Deep web
    2. Web searching and information discovery
      1. Web search engines
        Web crawling
2. Networks
  1. Network properties
    1. Network structure
      1. Topology analysis and generation

Published in
WIMS '17: Proceedings of the 7th International Conference on Web Intelligence, Mining and Semantics
June 2017, 268 pages
ISBN: 9781450352253
DOI: 10.1145/3102254
Conference Chair: Rajendra Akerkar, Western Norway Research Institute, Norway
General Chair: Alfredo Cuzzocrea, University of Trieste and ICAR-CNR, Italy
Program Chairs: Jannong Cao, Hong Kong Polytechnic University, Hong Kong; Mohand-Said Hacid, University of Claude Bernard Lyon 1, France
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
dark web
tor web graph
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 140 of 278 submissions, 50%