Web Content Mining Focused on Named Objects

  • Conference paper

Abstract

In this chapter we work within the field of Web content mining. In relation to the user’s description of a Web page, we define a new term: the Named object. Named objects serve as the basis for a new classification of selected methods for mining information from Web pages. This classification is derived from a survey of published methods. Our approach is based on perceiving a Web page through an intention, which is important both for the users and for the authors of a Web page. Named objects are closely related to Web design patterns, which became the basis for our own mining method, Pattrio. The Pattrio method is introduced in this work together with a few experiments.

Copyright information

© 2009 Indian Institute of Information Technology, India

About this paper

Cite this paper

Snášel, V., Kudelka, M. (2009). Web Content Mining Focused on Named Objects. In: Tiwary, U.S., Siddiqui, T.J., Radhakrishna, M., Tiwari, M.D. (eds) Proceedings of the First International Conference on Intelligent Human Computer Interaction. Springer, New Delhi. https://doi.org/10.1007/978-81-8489-203-1_3

  • DOI: https://doi.org/10.1007/978-81-8489-203-1_3

  • Publisher Name: Springer, New Delhi

  • Print ISBN: 978-81-8489-404-2

  • Online ISBN: 978-81-8489-203-1

  • eBook Packages: Computer Science, Computer Science (R0)
