Abstract
With the rapid growth of Web databases, it has become necessary to automatically extract and integrate the large-scale data available in the Deep Web. However, current Web search engines rank results at the page level, which is becoming inadequate for entity-oriented vertical search. In this paper, we present LG-ERM, an entity-level ranking mechanism for Deep Web queries based on local scoring and global aggregation. Unlike traditional approaches, LG-ERM considers additional rank-influencing factors: the uncertainty of entity extraction, the style information of the entities, the importance of the Web sources, and the relationships between entities. By combining local scoring with global aggregation, LG-ERM returns query results that are more accurate and better matched to users' needs. Experiments demonstrate the feasibility and effectiveness of the key techniques of LG-ERM.
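The two-stage idea named in the abstract can be illustrated with a minimal sketch. It is not the LG-ERM algorithm itself: the abstract gives no formulas, so the weighted-sum local score, the importance-weighted sum used for aggregation, the weights, and every name below (`Extraction`, `local_score`, `rank_entities`) are assumptions made for illustration. Entity relationships, which LG-ERM also considers, are left out.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Extraction:
    """One occurrence of an entity extracted from one Web source (hypothetical model)."""
    entity_id: str                # identifier of the underlying entity
    source: str                   # Web source the occurrence came from
    extraction_confidence: float  # in [0, 1]; lower means more extraction uncertainty
    style_score: float            # in [0, 1]; stands in for style information


def local_score(x, w_conf=0.6, w_style=0.4):
    # Assumed local scoring: a weighted sum of the per-occurrence factors.
    return w_conf * x.extraction_confidence + w_style * x.style_score


def rank_entities(extractions, source_importance):
    # Assumed global aggregation: sum local scores per entity,
    # each weighted by the importance of the source it came from.
    totals = defaultdict(float)
    for x in extractions:
        totals[x.entity_id] += source_importance.get(x.source, 0.0) * local_score(x)
    # Highest aggregated score first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


extractions = [
    Extraction("e1", "A", 1.0, 1.0),
    Extraction("e2", "A", 0.5, 0.5),
    Extraction("e2", "B", 0.5, 0.5),
]
ranked = rank_entities(extractions, {"A": 0.8, "B": 0.2})
```

In this toy input, `e1` appears once in an important source with a confident extraction, while `e2` appears twice with weaker local scores; the aggregation step is what lets evidence from several sources accumulate for the same entity.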
Additional information
Supported by the National Natural Science Foundation of China under Grant No. 60673139 and the National High Technology Development and Research 863 Program of China under Grant No. 2008AA01Z146.
Cite this article
Kou, Y., Shen, DR., Yu, G. et al. Combining Local Scoring and Global Aggregation to Rank Entities for Deep Web Queries. J. Comput. Sci. Technol. 24, 626–637 (2009). https://doi.org/10.1007/s11390-009-9263-y
DOI: https://doi.org/10.1007/s11390-009-9263-y