Web Content Mining Focused on Named Objects

Snášel, Václav; Kudelka, Milos

doi:10.1007/978-81-8489-203-1_3

Conference paper

Abstract

In this chapter we work within the field of Web content mining. Building on the way a user describes a Web page, we define a new term: the named object. Named objects serve as the basis for a new classification of selected methods for mining information from Web pages; the classification is derived from a survey of published methods. Our approach perceives a Web page through its intention, which matters both to the users and to the authors of the page. Named objects are closely related to Web design patterns, which became the basis for our own mining method, Pattrio. We introduce the Pattrio method together with a few experiments.

Author information

Authors and Affiliations

Faculty of Electrical Engineering and Computer Science, VSB Technical University of Ostrava, 708 33, Ostrava-Poruba, Czech Republic
Václav Snášel & Milos Kudelka

Editor information

Editors and Affiliations

Indian Institute of Information Technology, Allahabad, India
U. S. Tiwary (Professor), Tanveer J. Siddiqui (Assistant Professor), M. Radhakrishna (Professor) & M. D. Tiwari (Director)

About this paper

Cite this paper

Snášel, V., Kudelka, M. (2009). Web Content Mining Focused on Named Objects. In: Tiwary, U.S., Siddiqui, T.J., Radhakrishna, M., Tiwari, M.D. (eds) Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, New Delhi. https://doi.org/10.1007/978-81-8489-203-1_3

DOI: https://doi.org/10.1007/978-81-8489-203-1_3
Publisher Name: Springer, New Delhi
Print ISBN: 978-81-8489-404-2
Online ISBN: 978-81-8489-203-1
eBook Packages: Computer Science, Computer Science (R0)
