Data & Knowledge Engineering
Volume 59, Issue 2, November 2006, Pages 270-291
Including: Sixth ACM International Workshop on Web Information and Data Management
 
doi:10.1016/j.datak.2006.01.012
Copyright © 2006 Elsevier B.V. All rights reserved.

Using HMM to learn user browsing patterns for focused Web crawling

Hongyu Liu (a, corresponding author), Jeannette Janssen (b) and Evangelos Milios (a)

(a) Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5
(b) Department of Mathematics and Statistics, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5

Received 18 January 2006; 
revised 18 January 2006. 
Available online 10 March 2006.


Abstract

A focused crawler is designed to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. To estimate the relevance of a newly seen URL, it must use information gleaned from previously crawled page sequences.

In this paper, we present a new approach, based on a Hidden Markov Model (HMM), for predicting which links lead to relevant pages. The system consists of three stages: user data collection, user modelling via sequential pattern learning, and focused crawling. We first collect the Web pages visited during a user browsing session. These pages are clustered, and the link structure among pages from different clusters is then used to learn page sequences that are likely to lead to target pages. The learning is performed using an HMM. During crawling, the priority of links to follow is based on a learned estimate of how likely the page is to lead to a target page. In our experiments, this approach performs better than both Context-Graph crawling and Best-First crawling.
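The crawling stage described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes, for illustration only, that hidden states encode how close a page is to a target (state 0 is the target), that observations are cluster ids, and that a link's priority is the estimated probability that the next page is a target. All names and probability values are hypothetical.

```python
import heapq

def forward_step(belief, transition, emission, observation):
    """One HMM forward update: predict with the transition matrix,
    weight by the emission probability of the observed cluster, normalize."""
    n = len(belief)
    predicted = [sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)]
    weighted = [predicted[j] * emission[j][observation] for j in range(n)]
    total = sum(weighted)
    if total == 0:
        return [1.0 / n] * n  # observation impossible under the model: fall back to uniform
    return [w / total for w in weighted]

def link_priority(belief, transition, target_state=0):
    """Estimated probability that the page reached next is in the target state."""
    return sum(belief[i] * transition[i][target_state] for i in range(len(belief)))

class Frontier:
    """Priority queue of URLs; the highest priority is popped first."""
    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, priority):
        if url not in self._seen:  # enqueue each URL at most once
            self._seen.add(url)
            heapq.heappush(self._heap, (-priority, url))  # heapq is a min-heap

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self):
        return len(self._heap)

# Hypothetical two-state model: state 0 = target page, state 1 = one link away.
TRANSITION = [[0.6, 0.4], [0.7, 0.3]]  # TRANSITION[i][j] = P(next state j | state i)
EMISSION = [[0.9, 0.1], [0.2, 0.8]]    # EMISSION[state][cluster id]
PRIOR = [0.5, 0.5]

frontier = Frontier()
# Two crawled pages whose content fell into cluster 0 and cluster 1 respectively.
for url, cluster in [("http://example.org/a", 0), ("http://example.org/b", 1)]:
    belief = forward_step(PRIOR, TRANSITION, EMISSION, cluster)
    frontier.push(url, link_priority(belief, TRANSITION))
```

With these made-up numbers, the page in cluster 1 is judged more likely to be one link away from a target, so its URL is popped before the other. A real crawler would learn the transition and emission tables from the clustered browsing sessions rather than hard-coding them.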

Keywords: Focused crawling; Web searching; Relevance modelling; User modelling; Pattern learning; Hidden Markov models; World Wide Web; Web Graph

Article Outline

1. Introduction
1.1. Focused crawling
1.2. Literature review
1.3. Contribution and outline of the paper
2. System overview
2.1. User Data Collection
2.2. Concept graph
2.2.1. LSI—identification of semantic content
2.2.2. Clustering
3. User modelling
3.1. Structure of the HMM for focused crawling
3.2. Parameter estimation
3.3. Efficient inference
4. Focused crawling
4.1. Calculation of the priority
4.2. Priority queue data structure
4.3. The algorithm
5. Evaluation
5.1. Performance metrics
5.2. Experiments
5.2.1. Algorithms for comparison
5.2.2. Training data
5.3. Results
6. Conclusions and discussion
Acknowledgements
References
Vitae


















