Article

Authors vs. readers: a comparative study of document metadata and content in the www

Authors: Michael G. Noll (University of Potsdam), Christoph Meinel (University of Potsdam)

DocEng '07: Proceedings of the 2007 ACM symposium on Document engineering, August 2007, Pages 177–186. https://doi.org/10.1145/1284420.1284465

Published: 28 August 2007

ABSTRACT

Collaborative tagging describes the process by which many users add metadata in the form of unstructured keywords to shared content. The recent practical success of web services with a tagging component, such as Flickr or del.icio.us, has provided a plethora of user-supplied metadata about web content for everyone to leverage.

In this paper, we conduct a quantitative and qualitative analysis of metadata and information provided by the authors and publishers of web documents compared with metadata supplied by end users for the same content. Our study is based on a random sample of 100,000 web documents from the Open Directory, for which we examined the original documents from the World Wide Web in addition to data retrieved from the social bookmarking service del.icio.us, the content rating system ICRA, and the search engine Google. To the best of our knowledge, this is the first study to compare user tags with the metadata and actual content of documents in the WWW on a larger scale and to integrate document popularity information in the observations. The data set of our experiments is freely available for research.
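
The kind of author-versus-reader comparison described above can be illustrated with a minimal sketch. It is not the authors' method: the abstract does not say how the comparison was computed, so the Jaccard overlap used here is an assumption, as are the function names, the sample page, and the sample tags. The sketch extracts author-supplied terms from an HTML `keywords` meta tag and measures how much they overlap with a set of user tags for the same document.

```python
from html.parser import HTMLParser


class MetaKeywordParser(HTMLParser):
    """Collects terms from <meta name="keywords" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.keywords = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {k.lower(): (v or "") for k, v in attrs}
        if attr.get("name", "").lower() == "keywords":
            # Author keywords are conventionally comma-separated.
            for term in attr.get("content", "").split(","):
                term = term.strip().lower()
                if term:
                    self.keywords.add(term)


def author_keywords(html):
    """Return the set of normalized author-supplied keywords in a page."""
    parser = MetaKeywordParser()
    parser.feed(html)
    return parser.keywords


def jaccard(a, b):
    """Jaccard overlap of two term sets (an assumed measure, see lead-in)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical inputs, invented for illustration only.
HYPOTHETICAL_PAGE = """<html><head>
<meta name="keywords" content="Python, programming, tutorial">
</head><body>...</body></html>"""
HYPOTHETICAL_USER_TAGS = {"python", "programming", "reference", "howto"}

author_terms = author_keywords(HYPOTHETICAL_PAGE)
overlap = jaccard(author_terms, HYPOTHETICAL_USER_TAGS)
print(sorted(author_terms), round(overlap, 2))
```

A set-based measure is used here only because tags and meta keywords are both unordered term lists. A real analysis would also need to decide how to handle multi-word tags, spelling variants, and documents that carry no author keywords at all.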

Index Terms

1. Applied computing
  1. Computers in other domains
    1. Publishing
  2. Document management and text processing
    1. Document management
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Document filtering
      2. Information extraction

Published in
DocEng '07: Proceedings of the 2007 ACM symposium on Document engineering
August 2007
236 pages
ISBN: 9781595937766
DOI: 10.1145/1284420
General Chair: Peter King (University of Manitoba, Winnipeg, Canada)
Program Chair: Steven Simske (Hewlett-Packard, USA)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
authoring
del.icio.us
dmoz
dmoz100k06
document engineering
google
icra
metadata
pagerank
social bookmarking
tagging
www
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 178 of 537 submissions, 33%