DOI: 10.1145/1284420.1284465

Authors vs. readers: a comparative study of document metadata and content in the WWW

Published: 28 August 2007

ABSTRACT

Collaborative tagging describes the process by which many users add metadata in the form of unstructured keywords to shared content. The recent practical success of web services with a tagging component, such as Flickr or del.icio.us, has provided a plethora of user-supplied metadata about web content for everyone to leverage.

In this paper, we conduct a quantitative and qualitative analysis of metadata and information provided by the authors and publishers of web documents compared with metadata supplied by end users for the same content. Our study is based on a random sample of 100,000 web documents from the Open Directory, for which we examined the original documents from the World Wide Web in addition to data retrieved from the social bookmarking service del.icio.us, the content rating system ICRA, and the search engine Google. To the best of our knowledge, this is the first study to compare user tags with the metadata and actual content of documents in the WWW on a larger scale and to integrate document popularity information in the observations. The data set of our experiments is freely available for research.

          Published in

            DocEng '07: Proceedings of the 2007 ACM symposium on Document engineering
            August 2007
            236 pages
            ISBN: 9781595937766
            DOI: 10.1145/1284420

            Copyright © 2007 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery, New York, NY, United States

            Qualifiers

            • Article

            Acceptance Rates

            Overall Acceptance Rate: 178 of 537 submissions, 33%
