ABSTRACT
Many libraries and databases are closed to general-purpose Web crawlers and expose their content only through their own search engines. At the same time, many researchers attempt to locate technical papers through general-purpose Web search engines. DP9 is an open source gateway service that allows general search engines (e.g., Google, Inktomi) to index OAI-compliant archives. DP9 does this by providing consistent URLs for repository records and converting them to OAI queries against the appropriate repository when the URL is requested. This allows search engines that do not support the OAI protocol to index the "deep Web" contained within OAI-compliant repositories.
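The URL-to-query conversion described above can be illustrated with a minimal Python sketch. It is not DP9's actual implementation: the gateway URL layout (`archive` and `id` query parameters), the `ARCHIVES` registry, and the example host names are all assumptions made for illustration. Only the `GetRecord` verb and its `identifier` and `metadataPrefix` arguments come from the OAI protocol itself.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Hypothetical registry mapping short archive names to OAI base URLs.
# A real gateway would maintain its own list of known repositories.
ARCHIVES = {
    "example": "http://archive.example.org/oai",
}


def gateway_url_to_oai_request(gateway_url: str) -> str:
    """Translate a stable gateway URL into an OAI GetRecord request URL.

    Assumes (hypothetically) that the gateway URL carries the archive name
    and record identifier as the query parameters `archive` and `id`.
    """
    query = parse_qs(urlparse(gateway_url).query)
    archive = query["archive"][0]
    identifier = query["id"][0]

    # Look up the repository that should answer this request.
    base_url = ARCHIVES[archive]

    # Build the OAI request; oai_dc is the Dublin Core metadata format
    # that OAI-compliant repositories are required to support.
    params = {
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": "oai_dc",
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(gateway_url_to_oai_request(
        "http://gateway.example.org/getrecord?archive=example&id=oai:example:1234"
    ))
```

Because the gateway URL stays the same for a given record, a crawler that knows nothing about OAI can follow and index it like any other link, while the translation to an OAI query happens only when the URL is requested.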
Index Terms
- DP9: an OAI gateway service for web crawlers