ABSTRACT
With the dramatic increase in the quantity and diversity of online content, particularly user-generated content, we now have access to unprecedented amounts of information. Whether you are researching the purchase of a new cell phone, planning a vacation, or trying to assess a political candidate, there are now countless resources at your fingertips. However, finding and making sense of all this information is laborious, and it is difficult to assess high-level trends in what is said. Web sites like Wikipedia and Digg democratize the process of organizing the information from countless documents into a single source where it is somewhat easier to understand what is important and interesting. In this talk, I describe a complementary set of automated alternatives to these approaches, demonstrate them with a working example, the commercial web site Wize.com, and derive some basic principles for aggregating a diverse set of documents into a coherent and useful summary.
Index Terms
- Aggregate documents: making sense of a patchwork of topical documents