A New Webpage Classification Model Based on Visual Information Using Gestalt Laws of Grouping

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9419)

Abstract

Traditional text-based webpage classification cannot handle modern webpages, which embed rich information. Current approaches treat webpages as either trees or images. However, the former focuses only on webpage structure, and the latter ignores the internal connections among different webpage features, so neither is well suited to classifying modern webpages. Semantic-block trees are therefore introduced as a new representation for webpages. They are constructed by extracting visual information from webpages, integrating that information into render-blocks, and merging render-blocks using the Gestalt laws of grouping. The block tree edit distance is then defined to evaluate both the structural and the visual similarity of pages. Using this distance as a metric, a classification framework is proposed that classifies webpages by their similarity.
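
The idea of classifying pages by a tree distance that mixes structure and appearance can be illustrated with a minimal sketch. It is not the paper's block tree edit distance: the `Block` class, the use of a single RGB colour as the only visual attribute, the Euclidean colour cost, the unit cost per inserted or deleted block, and the nearest-neighbour rule in `classify` are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Block:
    """Hypothetical semantic block: one RGB colour plus ordered child blocks."""
    color: Tuple[int, int, int]
    children: List["Block"] = field(default_factory=list)


def size(b: Block) -> int:
    """Number of blocks in the subtree rooted at b."""
    return 1 + sum(size(c) for c in b.children)


def visual_cost(a: Block, b: Block) -> float:
    """Euclidean RGB distance scaled to [0, 1] (an assumed stand-in metric)."""
    d = sum((x - y) ** 2 for x, y in zip(a.color, b.color)) ** 0.5
    return d / (3 * 255 ** 2) ** 0.5


def distance(a: Block, b: Block) -> float:
    """Simplified top-down tree edit distance.

    Roots are always matched (cost = visual difference). Child sequences are
    aligned with a standard edit-distance table in which deleting or inserting
    a child costs the size of its whole subtree.
    """
    m, n = len(a.children), len(b.children)
    dp = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = dp[i - 1][0] + size(a.children[i - 1])
    for j in range(1, n + 1):
        dp[0][j] = dp[0][j - 1] + size(b.children[j - 1])
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = min(
                dp[i - 1][j] + size(a.children[i - 1]),   # delete subtree
                dp[i][j - 1] + size(b.children[j - 1]),   # insert subtree
                dp[i - 1][j - 1]
                + distance(a.children[i - 1], b.children[j - 1]),  # match
            )
    return visual_cost(a, b) + dp[m][n]


def classify(page: Block, labelled: List[Tuple[Block, str]]) -> str:
    """Return the label of the labelled page closest to `page` (1-NN)."""
    return min(labelled, key=lambda item: distance(page, item[0]))[1]


# Toy pages; the colours and the labels "news" and "shop" are made up.
WHITE, BLACK, BLUE = (255, 255, 255), (0, 0, 0), (0, 0, 255)
news = Block(WHITE, [Block(BLUE), Block(WHITE), Block(WHITE)])
shop = Block(BLACK, [Block(BLUE, [Block(WHITE), Block(WHITE)])])
query = Block(WHITE, [Block(BLUE), Block(WHITE)])

print(classify(query, [(news, "news"), (shop, "shop")]))
```

Here `query` differs from `news` by one missing leaf block, whereas matching it to `shop` requires both a root colour change and several structural edits, so the nearest-neighbour rule picks `news`. A fuller treatment would replace the colour cost with a richer comparison of visual attributes and would build the trees from rendered pages rather than by hand.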

Acknowledgment

The authors thank the China Scholarship Council (CSC) for its financial support.

Author information

Corresponding author

Correspondence to Zhen Xu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Xu, Z., Miller, J. (2015). A New Webpage Classification Model Based on Visual Information Using Gestalt Laws of Grouping. In: Wang, J., et al. Web Information Systems Engineering – WISE 2015. WISE 2015. Lecture Notes in Computer Science, vol 9419. Springer, Cham. https://doi.org/10.1007/978-3-319-26187-4_18

  • DOI: https://doi.org/10.1007/978-3-319-26187-4_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-26186-7

  • Online ISBN: 978-3-319-26187-4

  • eBook Packages: Computer Science, Computer Science (R0)
