Skip to main content
Log in

A Brief Tutorial on How to Extract Information from User-Generated Content (UGC)

  • Research Project
  • Published:
KI - Künstliche Intelligenz Aims and scope Submit manuscript

Abstract

In this brief tutorial, we provide an overview of investigating text-based user-generated content for information that is relevant in the corporate context. We structure the overall process along three stages: collection, analysis, and visualization. Corresponding to the stages we outline challenges and basic techniques to extract information of different levels of granularity.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Fig. 1

Similar content being viewed by others

Notes

  1. http://yacy.net/.

  2. It is to note that each document might consist of multiple topics.

  3. This means, the opinion “nice display” would consist of aspect “display” and sentiment “nice”.

  4. nlp.stanford.edu/software/lex-parser.shtml.

References

  1. Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

    MATH  Google Scholar 

  2. Brants T, Chen F, Tsochantaridis I (2002) Topic-based document segmentation with probabilistic latent semantic analysis. In: CIKM ’02. ACM, New York, pp 211–218

    Chapter  Google Scholar 

  3. Cai D, Yu S, Wen J, Ma W (2004) Block-based web search. In: Proceedings of the 27th annual international ACM SIGIR conference on research and development in information retrieval. doi:10.1145/1008992.1009070

    Google Scholar 

  4. Card SK, Mackinlay JD, Shneiderman B (1999) Readings in information visualization: using vision to think. Morgan Kaufmann, San Mateo

    Google Scholar 

  5. Chen C, Wu K, Srinivasan V, Zhang X (2011) Battling the internet water army: detection of hidden paid posters. arXiv:1111.4297 [cs.SI]

  6. Collins M, Singer Y (1999) Unsupervised models for named entity classification, pp 100–110

  7. Covington MA (2001) A fundamental algorithm for dependency parsing, pp 95–102

  8. Decker R, Trusov M (2010) Estimating aggregate consumer preferences from online product reviews. Int J Res Mark 27(4):293–307

    Article  Google Scholar 

  9. Deng C, Shipeng Y, Ji-Rong W, Wei-Ying M (2003) VIPS: a vision-based page segmentation algorithm

  10. Evert S (2007) StupidOS: a high-precision approach to boilerplate removal. In: Building and exploring web corpora: proceedings of the 3rd web as corpus workshop, incorporating cleaneval, vol 123, p 123

    Google Scholar 

  11. Feldman R, Sanger J (2006) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge

    Book  Google Scholar 

  12. González-Ibáñez R, Muresan S, Wacholder N (2011) Identifying sarcasm in Twitter: a closer look, pp 581–586

  13. Gupta S, Kaiser G, Neistadt D, Grimm P (2003) DOM-based content extraction of HTML documents

  14. Hatzivassiloglou V, McKeown KR, Hatzivassiloglou V, McKeown KR (1997) Predicting the semantic orientation of adjectives. In: The 35th annual meeting, pp 174–181

    Google Scholar 

  15. Jindal N, Liu B (2007) Analyzing and detecting review spam, pp 547–552

  16. Kamps J, Marx MJ, Mokken RJ, De Rijke M (2004) Using wordnet to measure semantic orientations of adjectives

  17. Karlgren J, Cutting D (1994) Recognizing text genres with simple metrics using discriminant analysis. Association for Computational Linguistics, pp 1071–1075

  18. Kaser O, Lemire D (2007) Tag-cloud drawing: algorithms for cloud visualization. Preprint, arXiv:cs/0703109 [cs.DS]

  19. Keim D, Andrienko G, Fekete JD, Görg C, Kohlhammer J, Melançon G (2008) Visual analytics: definition, process, and challenges. Inf Vis 4950:154–175

    Google Scholar 

  20. Kessler B, Numberg G, Schütze H (1997) Automatic detection of text genre. Association for Computational Linguistics, pp 32–38

  21. Lim EP, Nguyen VA, Jindal N, Liu B, Lauw HW (2010) Detecting product review spammers using rating behaviors, pp 939–948

  22. Liu B, Hu M, Cheng J (2005) Opinion observer: analysing and comparing opinions on the web. Chiba, Japan, pp 342–351

  23. Liu B (2010) Sentiment analysis and subjectivity. Handbook of natural language processing. CRC Press, Boca Raton, pp 627–666

    Google Scholar 

  24. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using machine learning techniques. In: Proceedings of EMNLP, pp 79–86

    Google Scholar 

  25. Porter MF (1980) An algorithm for suffix stripping. In: Proc ACM SIGIR conference on conference on research and development in information retrieval

    Google Scholar 

  26. Qiu G, Liu B, Bu J, Chen C (2006) Opinion word expansion and target extraction through double propagation. Association for Computational Linguistics

  27. Qiu G, Liu B, Bu J, Chen C (2009) Expanding domain sentiment lexicon through double propagation. In: Proceedings of the 21st international jont conference on artifical intelligence, pp 1199–1204

    Google Scholar 

  28. Qiu G, Liu B, Bu J, Chen C (2011) Opinion word expansion and target extraction through double propagation. Comput Linguist 37(1):9–27

    Article  Google Scholar 

  29. Robertson SE (1981) The methodology of information retrieval experiment. In: Information retrieval experiment, pp 9–31

    Google Scholar 

  30. Salton G, McGill MJ (1986) Introduction to modern information retrieval. McGraw-Hill, New York

    Google Scholar 

  31. Schiller A (1996) Multilingual finite-state noun phrase extraction

  32. Song R, Liu H, Wen J (2004) Learning block importance models for web pages

  33. Steinberger J, Ebrahim M, Ehrmann M, Hurriyetoglu A, Kabadjov M, Lenkova P, Steinberger R, Tanev H, Zquez SV, Zavarella V (2012) Creating sentiment dictionaries via triangulation. Decis Support Syst 53(4):689–694

    Article  Google Scholar 

  34. Stoyanov V, Cardie C (2006) Partially supervised coreference resolution for opinion summarization through structured rule learning

  35. Theobald M, Siddharth J, Paepcke A (2008) SpotSigs: robust and efficient near duplicate detection in large web collections. ACM, New York, pp 563–570

  36. Tufte ER (1990) Envisioning information. Graphics Press, Cheshire

    Google Scholar 

  37. Vickery G, Wunsch-Vincent S (2007) Participative web and user-created content: web 2.0 wikis and social networking. OECD, Paris

    Google Scholar 

  38. Wang Y, Fang B, Cheng X, Guo L, Xu H (2008) Incremental web page template detection by text segments, pp 174–180

  39. Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment analysis. In: CIKM ’05. ACM, New York, pp 625–631

    Chapter  Google Scholar 

  40. Yu S, Cai D, Wen J, Ma W (2003) Improving pseudo-relevance feedback in web information retrieval using web page segmentation. In: Proceedings of the 12th international conference on world wide web

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Marc Egger.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Egger, M., Lang, A. A Brief Tutorial on How to Extract Information from User-Generated Content (UGC). Künstl Intell 27, 53–60 (2013). https://doi.org/10.1007/s13218-012-0224-1

Download citation

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s13218-012-0224-1

Keywords

Navigation