Large Scale Text Mining Approaches for Information Retrieval and Extraction

  • Chapter
Innovations in Intelligent Machines-4

Part of the book series: Studies in Computational Intelligence (SCI, volume 514)

Abstract

The problems of Natural Language Processing and Information Retrieval have been studied for a long time, but the recent availability of very large resources (Web pages, digital documents…) and the development of statistical machine learning methods that exploit annotated texts (manual encoding by crowdsourcing is a major new source of such annotations) have transformed these fields. These approaches are therefore no longer limited to highly specialized domains, and they cost less to implement. In this chapter, our aim is to present some popular statistical text-mining approaches for information retrieval and information extraction and to discuss the practical limits of current systems, which raise challenges for the future.

Notes

  1. Alternatives to the use of probability and to Bayesian networks or other probabilistic graphical models for dealing with uncertainty have been proposed. Among them are fuzzy logic and Dempster-Shafer theory.

  2. Precision is the fraction of retrieved items that are relevant or well classified, while recall is the fraction of relevant items that are retrieved and provided as results. F-score is the harmonic mean of precision and recall.

  3. Stemming reduces the morphological variants of a word to a common stem or root. See for example Snowball, which makes light stemming available for many languages (http://snowball.tartarus.org). Lemmatization can be seen as an advanced form of stemming.

  4. Google Books Ngram (http://books.google.com/ngrams) and n-grams from the Corpus of Contemporary American English COCA (http://www.ngrams.info/) are two popular and freely downloadable word n-gram sets for English.

  5. http://trec.nist.gov/tracks.html

  6. http://ilps.science.uva.nl/trec-entity/

  7. During CoNLL 2003 (Conference on Computational Natural Language Learning), a challenge on language-independent named entity recognition was organized. Many other tasks related to Natural Language Processing have been organized in the context of CoNLL conferences: grammatical error correction, multilingual parsing, analysis of dependencies… (http://www.clips.ua.ac.be/conll/).

  8. As of June 2013, Freebase (https://developers.google.com/freebase/) contained more than 37 million entities, 1,998 types and 30,000 properties.

  9. http://nlp.stanford.edu/software/CRF-NER.shtml

  10. http://trec.nist.gov/tracks.html

  11. LDC catalog number LDC2002T31 (http://www.ldc.upenn.edu).

  12. http://www.nist.gov/tac/.

  13. http://ir.dcs.gla.ac.uk/test_collections/blog06info.html (about 40 GB of data for feeds only).

  14. http://www.inex.otago.ac.nz/tracks/qa/qa.asp.

  15. https://inex.mmci.uni-saarland.de/tracks/qa/.

  16. http://wordnet.princeton.edu.

  17. http://lucene.apache.org.

  18. http://sentiwordnet.isti.cnr.it.

  19. http://www.cs.york.ac.uk/semeval-2013/.

  20. DBpedia is a large knowledge base (more than 3.77 million things are classified in an ontology), localized in 111 languages and built by extracting structured information from Wikipedia (http://dbpedia.org); figures as of June 2013.

  21. http://trec.nist.gov/data/kba.html.

  22. http://www.nist.gov/tac/publications/index.html.

  23. http://lab.hypotheses.org

  24. http://openedition.org

  25. Text Encoding Initiative (http://www.tei-c.org/Guidelines/).

  26. This project was supported by the 6th Framework Research Programme of the European Union (EU), Project LUNA, IST contract no 33549 (www.ist-luna.eu).

  27. https://framenet.icsi.berkeley.edu/fndrupal/.
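
Note 2 above defines precision, recall and F-score in words. The following is a minimal Python sketch of those definitions, assuming that retrieved and relevant items are given as collections of identifiers. The function name `precision_recall_f1` and the document identifiers are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-score) for two collections of item ids.

    Illustrative helper (hypothetical name), following the definitions in note 2.
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were actually retrieved

    # Precision: fraction of retrieved items that are relevant.
    precision = hits / len(retrieved) if retrieved else 0.0
    # Recall: fraction of relevant items that were retrieved.
    recall = hits / len(relevant) if relevant else 0.0
    # F-score: harmonic mean of precision and recall (taken as 0 when there are no hits).
    f_score = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_score


# Toy example: 4 retrieved documents, 3 relevant ones, 2 in common.
p, r, f = precision_recall_f1(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d3", "d5"],
)
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

With two of the four retrieved documents being relevant and two of the three relevant documents retrieved, precision is 1/2, recall is 2/3 and the F-score is 4/7. Returning 0 for empty inputs is a convention adopted here to avoid division by zero, not something the note specifies.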

Author information

Corresponding author

Correspondence to Patrice Bellot.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Bellot, P., Bonnefoy, L., Bouvier, V., Duvert, F., Kim, YM. (2014). Large Scale Text Mining Approaches for Information Retrieval and Extraction. In: Faucher, C., Jain, L. (eds) Innovations in Intelligent Machines-4. Studies in Computational Intelligence, vol 514. Springer, Cham. https://doi.org/10.1007/978-3-319-01866-9_1

  • DOI: https://doi.org/10.1007/978-3-319-01866-9_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-01865-2

  • Online ISBN: 978-3-319-01866-9

  • eBook Packages: Engineering, Engineering (R0)
