research-article

Aggregate documents: making sense of a patchwork of topical documents

Author:
Michael Shilman

Wize.com, San Mateo, CA, USA


DocEng '08: Proceedings of the eighth ACM symposium on Document engineering, September 2008, Pages 3–7. https://doi.org/10.1145/1410140.1410143

Published: 16 September 2008


ABSTRACT

With the dramatic increase in the quantity and diversity of online content, particularly in the form of user-generated content, we now have access to unprecedented amounts of information. Whether you are researching the purchase of a new cell phone, planning a vacation, or trying to assess a political candidate, there are now countless resources at your fingertips. However, finding and making sense of all this information is laborious, and it is difficult to assess high-level trends in what is said. Web sites like Wikipedia and Digg democratize the process of organizing the information from countless documents into a single source where it is somewhat easier to understand what is important and interesting. In this talk, I describe a complementary set of automated alternatives to these approaches, demonstrate these approaches with a working example, the commercial web site Wize.com, and derive some basic principles for aggregating a diverse set of documents into a coherent and useful summary.


Published in
DocEng '08: Proceedings of the eighth ACM symposium on Document engineering
September 2008
312 pages
ISBN: 9781605580814
DOI: 10.1145/1410140
General Chair:
Maria de Graça Pimentel, Universidade de São Paulo, Brazil
Program Chairs:
Dick C.A. Bulterman, CWI and VU, Netherlands
Luis Fernando Gomes Soares, PUC-Rio, Brazil
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
aggregation
analysis
information retrieval
summarization
Acceptance Rates
DocEng '08 Paper Acceptance Rate: 21 of 62 submissions, 34%
Overall Acceptance Rate: 178 of 537 submissions, 33%