Towards Understanding the Functions of Web Element

  • Conference paper

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 3411))

Abstract

A web page is a collection of basic elements, and each element plays a different role in the page. For example, an image element can be part of the main content, an advertisement, or the site banner. This paper describes ongoing work that uses a machine learning approach to classify each element in a web page into one of six functional categories: Content (C), Related Link (R), Navigation (N), Advertisement (A), Form (F) and Other (O). Such a classification allows only certain categories of content in a web page to be extracted and delivered to a mobile device to fit the user’s specific needs, or to facilitate web information processes such as web mining or mobile search. We manually labeled 18,864 elements from 150 websites. For each element we extracted both local features (such as the text length, URL, and tag name) and global features (such as the text match with other elements) to construct a feature vector. We trained the decision tree learning algorithm J48 on a training set of 10,650 elements; it achieved 82% accuracy under stratified cross-validation and an average F value of 0.78 over the six categories. Testing on 3,043 elements from pages not included in the training set gives an accuracy of 58%. Although this is not satisfactory overall, the F value for the Content category reaches 0.795, indicating that the method could be useful for less demanding applications. We are working on improving the results in order to make automatic functional classification of web elements feasible and to provide new opportunities to push the state of the art in the mobile internet and mobile search.
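The accuracy and per-category F values quoted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names are hypothetical, the F value is assumed to be the balanced F1 (harmonic mean of precision and recall), and the "average F value" is assumed to be an unweighted mean over the six categories, which the abstract does not specify.

```python
from collections import Counter

# The six functional categories named in the abstract.
CATEGORIES = ["C", "R", "N", "A", "F", "O"]


def per_category_f(gold, predicted, categories=CATEGORIES):
    """Return {category: F1} for parallel lists of gold and predicted labels.

    Assumption: "F value" means balanced F1 = 2PR / (P + R).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1          # correctly labeled element
        else:
            fp[p] += 1          # predicted category gains a false positive
            fn[g] += 1          # true category gains a false negative
    scores = {}
    for c in categories:
        precision = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recall = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        denom = precision + recall
        scores[c] = 2 * precision * recall / denom if denom else 0.0
    return scores


def accuracy(gold, predicted):
    """Fraction of elements whose predicted category matches the label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def average_f(scores):
    """Unweighted mean of the per-category F values (an assumption)."""
    return sum(scores.values()) / len(scores)
```

Reporting the score per category matters here because a classifier can be weak overall yet still usable for one category, which is the situation the abstract describes for Content on the held-out pages.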

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Yin, X., Lee, W.S. (2005). Towards Understanding the Functions of Web Element. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_27

  • DOI: https://doi.org/10.1007/978-3-540-31871-2_27

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25065-4

  • Online ISBN: 978-3-540-31871-2

  • eBook Packages: Computer Science, Computer Science (R0)
