Large Scale Text Mining Approaches for Information Retrieval and Extraction

Bellot, Patrice; Bonnefoy, Ludovic; Bouvier, Vincent; Duvert, Frédéric; Kim, Young-Min

doi:10.1007/978-3-319-01866-9_1

Part of the book series: Studies in Computational Intelligence (SCI, volume 514)

Abstract

Natural Language Processing and Information Retrieval have been studied for a long time, but the recent availability of very large resources (Web pages, digital documents…) and the development of statistical machine learning methods that exploit annotated texts (with manual annotation by crowdsourcing as a major new source) have transformed these fields. As a result, these approaches are no longer limited to highly specialized domains, and the cost of implementing them is reduced. In this chapter, our aim is to present some popular statistical text-mining approaches for information retrieval and information extraction, and to discuss the practical limits of current systems, which raise challenges for the future.

Notes

1. Alternatives to probability, Bayesian networks and other probabilistic graphical models have been proposed for dealing with uncertainty, among them fuzzy logic and Dempster-Shafer theory.
2. Precision is the fraction of retrieved items that are relevant (or correctly classified), while recall is the fraction of relevant items that are retrieved and returned as results. The F-score is the harmonic mean of precision and recall.
3. Stemming reduces the morphological variants of a word to a common root. See, for example, Snowball, which provides light stemming for many languages (http://snowball.tartarus.org). Lemmatization can be seen as an advanced form of stemming.
4. Google Books Ngram (http://books.google.com/ngrams) and the n-grams from the Corpus of Contemporary American English, COCA (http://www.ngrams.info/), are two popular and freely downloadable sets of word n-grams for English.
5. http://trec.nist.gov/tracks.html
6. http://ilps.science.uva.nl/trec-entity/
7. During CoNLL 2003 (Conference on Computational Natural Language Learning), a challenge on language-independent named entity recognition was organized. Many other Natural Language Processing tasks have been organized in the context of the CoNLL conferences: grammatical error correction, multilingual parsing, dependency analysis… (http://www.clips.ua.ac.be/conll/).
8. As of June 2013, Freebase (https://developers.google.com/freebase/) contained more than 37 million entities, 1,998 types and 30,000 properties.
9. http://nlp.stanford.edu/software/CRF-NER.shtml
10. http://trec.nist.gov/tracks.html
11. LDC catalog number LDC2002T31 (http://www.ldc.upenn.edu).
12. http://www.nist.gov/tac/
13. http://ir.dcs.gla.ac.uk/test_collections/blog06info.html (about 40 GB of data for feeds only).
14. http://www.inex.otago.ac.nz/tracks/qa/qa.asp
15. https://inex.mmci.uni-saarland.de/tracks/qa/
16. http://wordnet.princeton.edu
17. http://lucene.apache.org
18. http://sentiwordnet.isti.cnr.it
19. http://www.cs.york.ac.uk/semeval-2013/
20. DBpedia is a large knowledge base built by extracting structured information from Wikipedia (http://dbpedia.org). As of June 2013, it was localized in 111 languages and classified more than 3.77 million things in an ontology.
21. http://trec.nist.gov/data/kba.html
22. http://www.nist.gov/tac/publications/index.html
23. http://lab.hypotheses.org
24. http://openedition.org
25. Text Encoding Initiative (http://www.tei-c.org/Guidelines/).
26. This project was supported by the 6th Framework Research Programme of the European Union (EU), Project LUNA, IST contract no. 33549 (www.ist-luna.eu).
27. https://framenet.icsi.berkeley.edu/fndrupal/
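
The definitions in note 2 can be made concrete with a small worked example. The following is a minimal sketch, not code from the chapter: the function name `precision_recall_f1` and the document identifiers are hypothetical, and the F-score is computed in its balanced form (the harmonic mean of precision and recall), with empty sets mapped to 0.0 by assumption.

```python
def precision_recall_f1(retrieved, relevant):
    """Return (precision, recall, F-score) for one query.

    retrieved: items returned by the system
    relevant:  items judged relevant in the ground truth
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant items that were retrieved

    # Assumption: an empty retrieved or relevant set yields 0.0 rather than an error.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0

    # Harmonic mean of precision and recall.
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical run: 4 documents retrieved, 3 relevant, 2 in common.
p, r, f = precision_recall_f1(["d1", "d2", "d3", "d4"], ["d1", "d2", "d5"])
print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

With two of the four retrieved documents relevant and two of the three relevant documents found, precision is 1/2, recall is 2/3, and their harmonic mean is 4/7.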

Author information

Authors and Affiliations

CNRS, Aix-Marseille Université, LSIS UMR 7296, Av. Esc. Normandie-Niemen, 13397, Marseille cedex 20, France
Patrice Bellot, Vincent Bouvier & Young-Min Kim
iSmart, 565 rue M. Berthelot, 13851, Aix-en-Provence cedex 3, France
Ludovic Bonnefoy, Vincent Bouvier & Frédéric Duvert
LIA, Université d’Avignon et des Pays de Vaucluse, Agroparc, 84911, Avignon cedex 9, France
Ludovic Bonnefoy

Corresponding author

Correspondence to Patrice Bellot.

Editor information

Editors and Affiliations

Ecole Polytechnique Universitaire de Marseille, Aix-Marseille University (AMU), Marseille, France
Colette Faucher
Faculty of Education, Science, Technology & Mathematics, University of Canberra, Canberra, Australia
Lakhmi C. Jain

About this chapter

Cite this chapter

Bellot, P., Bonnefoy, L., Bouvier, V., Duvert, F., Kim, YM. (2014). Large Scale Text Mining Approaches for Information Retrieval and Extraction. In: Faucher, C., Jain, L. (eds) Innovations in Intelligent Machines-4. Studies in Computational Intelligence, vol 514. Springer, Cham. https://doi.org/10.1007/978-3-319-01866-9_1

DOI: https://doi.org/10.1007/978-3-319-01866-9_1
Published: 15 November 2013
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-01865-2
Online ISBN: 978-3-319-01866-9
eBook Packages: Engineering, Engineering (R0)
