doi:10.1016/j.datak.2006.01.012
Copyright © 2006 Elsevier B.V. All rights reserved.
Using HMM to learn user browsing patterns for focused Web crawling
a Faculty of Computer Science, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5
b Department of Mathematics and Statistics, Dalhousie University, 6050 University Avenue, Halifax, NS, Canada B3H 1W5
Received 18 January 2006;
revised 18 January 2006.
Available online 10 March 2006.
Abstract
A focused crawler is designed to traverse the Web to gather documents on a specific topic. It can be used to build domain-specific Web search portals and online personalized search tools. To estimate the relevance of a newly seen URL, it must use information gleaned from previously crawled page sequences.
In this paper, we present a new approach for predicting the links that lead to relevant pages, based on a Hidden Markov Model (HMM). The system consists of three stages: user data collection, user modelling via sequential pattern learning, and focused crawling. In particular, we first collect the Web pages visited during a user browsing session. These pages are clustered, and the link structure among pages from different clusters is then used to learn page sequences that are likely to lead to target pages. The learning is performed with an HMM. During crawling, the priority of the links to follow is based on a learned estimate of how likely the page is to lead to a target page. We compare the performance of this approach with Context-Graph crawling and Best-First crawling; in our experiments it performs better than both.
Keywords: Focused crawling; Web searching; Relevance modelling; User modelling; Pattern learning; Hidden Markov models; World Wide Web; Web Graph
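The crawling stage described in the abstract can be illustrated with a small sketch. It is not the authors' algorithm (their pseudocode is in Fig. 7); it only shows the general idea of a frontier ordered by a learned estimate. The names `focused_crawl`, `get_links`, and `estimate` are hypothetical, and `estimate` stands in for whatever score the trained HMM would provide. The sketch assumes that the outlinks of a fetched page inherit that page's estimate as their priority, and that seeds start with the highest priority.

```python
import heapq
from typing import Callable, Iterable, List


def focused_crawl(
    seeds: Iterable[str],
    get_links: Callable[[str], List[str]],   # hypothetical: returns the outlinks of a page
    estimate: Callable[[str], float],        # hypothetical: stand-in for the learned HMM score
    max_pages: int,
) -> List[str]:
    """Visit pages in priority order and return the order of visits."""
    frontier = []   # min-heap of (-priority, insertion counter, url)
    counter = 0     # tie-breaker: equal priorities are served first-in, first-out
    seen = set()

    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-1.0, counter, url))  # assumption: seeds get top priority
            counter += 1

    visited = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)
        visited.append(url)
        # Assumption: every outlink inherits the estimate of the page it was found on.
        priority = estimate(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority, counter, link))
                counter += 1
    return visited


if __name__ == "__main__":
    # Toy in-memory Web graph and made-up estimates, for illustration only.
    graph = {"seed": ["a", "b"], "a": ["a1"], "b": ["b1"], "a1": [], "b1": []}
    scores = {"seed": 0.5, "a": 0.2, "b": 0.9, "a1": 0.0, "b1": 0.0}
    print(focused_crawl(["seed"], lambda u: graph.get(u, []), lambda u: scores[u], 5))
```

In this toy run, the link found on the high-scoring page `b` is fetched before the link found on the low-scoring page `a`, which is the behaviour a focused crawler aims for.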
Fig. 1. System architecture: User Data Collection, User Modelling via Pattern Learning, and Focused Crawling.
Fig. 2. User data collection. Double-bordered white nodes represent target pages: (a) Useful and Submit buttons, (b) user-visited pages form a Web graph.
Fig. 3. User modelling via sequential pattern learning. Ci is the label of cluster i, Tj is the estimated hidden state.
Fig. 4. The structure of a Hidden Markov Model with four hidden states.
Fig. 5. Parameter estimation of HMM.
Fig. 6. Flow chart of focused crawling.
Fig. 7. Pseudocode of crawling algorithm.
Fig. 8. Build context graph. Each circle represents one layer. Targets form layer 0, and layer i contains all the parents of the nodes in layer i − 1.
Fig. 9. Topic Linux: (a) the number of relevant pages within the set of downloaded pages with threshold 0.7, (b) average maximal similarities to the set of target pages of all downloaded pages.
Fig. 10. Topic Call for Papers: (a) the number of relevant pages within the set of downloaded pages with threshold 0.7, (b) average maximal similarities to the set of target pages of all downloaded pages.
Fig. 11. Topic Biking: (a) the number of relevant pages within the set of downloaded pages with threshold 0.6, (b) average maximal similarities to the set of target pages of all downloaded pages.
Fig. 12. Topic Hockey: (a) the number of relevant pages within the set of downloaded pages with threshold 0.7, (b) average maximal similarities to the set of target pages of all downloaded pages.
Fig. 13. The effect of different relevance threshold values on topic Linux: (a) the number of relevant pages within the set of downloaded pages with threshold 0.8, (b) the number of relevant pages within the set of downloaded pages with threshold 0.6.
Fig. 14. The effect of different relevance threshold values on topic Call for Papers: (a) the number of relevant pages within the set of downloaded pages with threshold 0.8, (b) the number of relevant pages within the set of downloaded pages with threshold 0.6.
Fig. 15. The effect of different relevance threshold values on topic Hockey: (a) the number of relevant pages within the set of downloaded pages with threshold 0.8, (b) the number of relevant pages within the set of downloaded pages with threshold 0.6.
Fig. 16. The effect of different relevance threshold values on topic Biking: (a) the number of relevant pages within the set of downloaded pages with threshold 0.7, (b) the number of relevant pages within the set of downloaded pages with threshold 0.5.
Fig. 17. Comparison of HMM crawler using different training data on topic Linux: (a) the number of relevant pages within the set of downloaded pages with threshold 0.7, (b) average maximal similarities to the set of target pages of all downloaded pages. HMM: HMM-learning with the Web graph using user-visited pages (training data no. 1); HMM_alldata: HMM-learning with the Web graph using all nodes from Context Graph (training data no. 3).
Fig. 18. Comparison of HMM crawler using different training data on topic Hockey: (a) the number of relevant pages within the set of downloaded pages with threshold 0.7, (b) average maximal similarities to the set of target pages of all downloaded pages. HMM: HMM-learning with the Web graph using user-visited pages (training data no. 1); HMM_alldata: HMM-learning with the Web graph using all nodes from Context Graph (training data no. 3).
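The two quantities plotted in panels (a) and (b) of Figs. 9 to 18 can be sketched as follows. The captions do not state which similarity measure is used, so this example assumes cosine similarity over raw term-frequency vectors; it also assumes a page counts as relevant when its maximal similarity to any target page is at least the threshold. The function names `cosine` and `evaluate` are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors (assumed measure)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def evaluate(downloaded: List[str], targets: List[str], threshold: float) -> Tuple[int, float]:
    """Return (number of relevant pages, average maximal similarity to the targets)."""
    target_vecs = [Counter(t.lower().split()) for t in targets]
    # For each downloaded page, keep its best similarity to any target page.
    maxima = [
        max((cosine(Counter(d.lower().split()), tv) for tv in target_vecs), default=0.0)
        for d in downloaded
    ]
    relevant = sum(1 for m in maxima if m >= threshold)  # assumption: ">=" rather than ">"
    average = sum(maxima) / len(maxima) if maxima else 0.0
    return relevant, average


if __name__ == "__main__":
    # Made-up page texts, for illustration only.
    pages = ["linux kernel", "cooking recipes"]
    print(evaluate(pages, ["linux kernel"], threshold=0.7))
```

Raising the threshold, as in Figs. 13 to 16, can only lower or keep the relevant-page count, while the average maximal similarity does not depend on the threshold.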
Table 1.
Recall evaluation
