A Brief Tutorial on How to Extract Information from User-Generated Content (UGC)

Egger, Marc; Lang, André

doi:10.1007/s13218-012-0224-1

A Brief Tutorial on How to Extract Information from User-Generated Content (UGC)

Research Project
Published: 22 December 2012

Volume 27, pages 53–60, (2013)
Cite this article

KI - Künstliche Intelligenz Aims and scope Submit manuscript

Marc Egger¹ &
André Lang²

911 Accesses
4 Citations
Explore all metrics

Abstract

In this brief tutorial, we provide an overview of investigating text-based user-generated content for information that is relevant in the corporate context. We structure the overall process along three stages: collection, analysis, and visualization. Corresponding to the stages we outline challenges and basic techniques to extract information of different levels of granularity.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

The interactive Leipzig Corpus Miner: An extensible and adaptable text analysis tool for content analysis

Article Open access 29 August 2023

Andreas Niekler, Christian Kahmann, … Gerhard Heyer

DataGorri: a tool for automated data collection of tabular web content

Article 01 October 2018

Julian Hackinger

Text and Web Content Mining: A Systematic Review

Notes

http://yacy.net/.
It is to note that each document might consist of multiple topics.
This means, the opinion “nice display” would consist of aspect “display” and sentiment “nice”.
nlp.stanford.edu/software/lex-parser.shtml.

References

Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
MATH Google Scholar
Brants T, Chen F, Tsochantaridis I (2002) Topic-based document segmentation with probabilistic latent semantic analysis. In: CIKM ’02. ACM, New York, pp 211–218
Chapter Google Scholar
Cai D, Yu S, Wen J, Ma W (2004) Block-based web search. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. doi:10.1145/1008992.1009070
Google Scholar
Card SK, Mackinlay JD, Shneiderman B (1999) Readings in information visualization: using vision to think. Morgan Kaufmann, San Mateo
Google Scholar
Chen C, Wu K, Srinivasan V, Zhang X (2011) Battling the internet water army: detection of hidden paid posters. arXiv:1111.4297 [cs.SI]
Collins M, Singer Y (1999) Unsupervised models for named entity classification, pp 100–110
Covington MA (2001) A fundamental algorithm for dependency parsing, pp 95–102
Decker R, Trusov M (2010) Estimating aggregate consumer preferences from online product reviews. Int J Res Mark 27(4):293–307
Article Google Scholar
Deng C, Shipeng Y, Ji-Rong W, Wei-Ying M (2003) VIPS: a vision-based page segmentation algorithm
Evert S (2007) StupidOS: a high-precision approach to boilerplate removal. In: Building and exploring web corpora: proceedings of the 3rd web as corpus workshop, incorporating cleaneval, vol 123, p 123
Google Scholar
Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge
Book Google Scholar
González-Ibáñez R, Muresan S, Wacholder N (2011) Identifying sarcasm in Twitter: a closer look, pp 581–586
Gupta S, Kaiser G, Neistadt D, Grimm P (2003) DOM-based content extraction of HTML documents
Hatzivassiloglou V, McKeown KR, Hatzivassiloglou V, McKeown KR (1997) Predicting the semantic orientation of adjectives. In: The 35th annual meeting, pp 174–181
Google Scholar
Jindal N, Liu B (2007) Analyzing and detecting review spam, pp 547–552
Kamps J, Marx MJ, Mokken RJ, De Rijke M (2004) Using wordnet to measure semantic orientations of adjectives
Karlgren J, Cutting D (1994) Recognizing text genres with simple metrics using discriminant analysis. Association for Computational Linguistics, pp 1071–1075
Kaser O, Lemire D (2007) Tag-cloud drawing: algorithms for cloud visualization. Preprint, arXiv:cs/0703109 [cs.DS]
Keim D, Andrienko G, Fekete JD, Görg C, Kohlhammer J, Melançon G (2008) Visual analytics: definition, process, and challenges. Inf Vis 4950:154–175
Google Scholar
Kessler B, Numberg G, Schütze H (1997) Automatic detection of text genre. Association for Computational Linguistics, pp 32–38
Lim EP, Nguyen VA, Jindal N, Liu B, Lauw HW (2010) Detecting product review spammers using rating behaviors, pp 939–948
Liu B, Hu M, Cheng J (2005) Opinion observer: analysing and comparing opinions on the web. Chiba, Japan, pp 342–351
Liu B (2010) Sentiment analysis and subjectivity. Handbook of natural language processing. CRC Press, Boca Raton, pp 627–666
Google Scholar
Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of EMNLP, pp 79–86
Google Scholar
Porter MF (1980) An algorithm for suffix stripping. In: Proc ACM SIGIR conference on conference on research and development in information retrieval
Google Scholar
Qiu G, Liu B, Bu J, Chen C (2006) Opinion word expansion and target extraction through double propagation. Association for Computational Linguistics
Qiu G, Liu B, Bu J, Chen C (2009) Expanding domain sentiment lexicon through double propagation. In: Proceedings of the 21st international jont conference on artifical intelligence, pp 1199–1204
Google Scholar
Qiu G, Liu B, Bu J, Chen C (2011) Opinion word expansion and target extraction through double propagation. Comput Linguist 37(1):9–27
Article Google Scholar
Robertson SE (1981) The methodology of information retrieval experiment. In: Information retrieval experiment, pp 9–31
Google Scholar
Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill, New York
Google Scholar
Schiller A (1996) Multilingual finite-state noun phrase extraction
Song R, Liu H, Wen J (2004) Learning block importance models for web pages
Steinberger J, Ebrahim M, Ehrmann M, Hurriyetoglu A, Kabadjov M, Lenkova P, Steinberger R, Tanev H, Zquez SV, Zavarella V (2012) Creating sentiment dictionaries via triangulation. Decis Support Syst 53(4):689–694
Article Google Scholar
Stoyanov V, Cardie C (2006) Partially supervised coreference resolution for opinion summarization through structured rule learning
Theobald M, Siddharth J, Paepcke A (2008) SpotSigs: robust and efficient near duplicate detection in large web collections. ACM, New York, pp 563–570
Tufte ER (1990) Envisioning information. Graphics Press, Cheshire
Google Scholar
Vickery G, Wunsch-Vincent S (2007) Participative web and user-created content: web 2.0 wikis and social networking. OECD, Paris
Google Scholar
Wang Y, Fang B, Cheng X, Guo L, Xu H (2008) Incremental web page template detection by text segments, pp 174–180
Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment analysis. In: CIKM ’05. ACM, New York, pp 625–631
Chapter Google Scholar
Yu S, Cai D, Wen J, Ma W (2003) Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Proceedings of the 12th international conference on world wide web
Google Scholar

Download references

Author information

Authors and Affiliations

Seminar für Wirtschaftsinformatik und Informationsmanagement, Universität zu Köln, Pohligstraße 1, 50969, Köln, Germany
Marc Egger
Insius UG (haftungsbeschränkt), TechnologiePark Köln, Eupener Straße 165, 50933, Köln, Germany
André Lang

Authors

Marc Egger
View author publications
You can also search for this author in PubMed Google Scholar
André Lang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Marc Egger.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Egger, M., Lang, A. A Brief Tutorial on How to Extract Information from User-Generated Content (UGC). Künstl Intell 27, 53–60 (2013). https://doi.org/10.1007/s13218-012-0224-1

Download citation

Published: 22 December 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s13218-012-0224-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A Brief Tutorial on How to Extract Information from User-Generated Content (UGC)

Abstract

Access this article

Similar content being viewed by others

The interactive Leipzig Corpus Miner: An extensible and adaptable text analysis tool for content analysis

DataGorri: a tool for automated data collection of tabular web content

Text and Web Content Mining: A Systematic Review

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

A Brief Tutorial on How to Extract Information from User-Generated Content (UGC)

Abstract

Access this article

Similar content being viewed by others

The interactive Leipzig Corpus Miner: An extensible and adaptable text analysis tool for content analysis

DataGorri: a tool for automated data collection of tabular web content

Text and Web Content Mining: A Systematic Review

Notes

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation