
Rediscovering 15 + 2 years of discoveries in language resources and evaluation

  • Original Paper
Language Resources and Evaluation

Abstract

This paper analyzes the content of the proceedings of the Language Resources and Evaluation Conference (LREC) over the past 17 years (1998–2014), with the goal of gaining a picture of the LREC community and the topics that are most relevant to the field. We follow the methodology used in similar studies, including the survey of the IEEE ICASSP conference proceedings from 1976 to 1990, the survey of the Association for Computational Linguistics conference proceedings over 50 years, and the survey of the proceedings of the conferences contained in the ISCA Archive over 25 years (1987–2012). We expand on results originally presented at LREC 2014, but include the proceedings of LREC 2014 itself in the study together with an analysis of various citation graphs. We show the evolution over time of the number of papers and authors, including their distribution by gender and affiliation, as well as collaborations and citation patterns among authors and papers, funding sources for reported research, and plagiarism and reuse in LREC papers; results for LREC are compared with similar results for major conferences in related fields. We also consider the evolution of research topics over time and identify the authors who introduced key terms. Finally, we propose and apply a measure of a researcher’s notability and provide the results for LREC authors. The study uses NLP methods that have been published in the corpus considered in the study. In addition to providing a revealing characterization of the LRE community, the study also demonstrates the need for establishing a system for unique identification of authors, papers and other sources to facilitate this type of analysis.




Notes

  1. http://snap.stanford.edu/data/.

  2. http://aclweb.org/anthology/.

  3. Results of these analyses together with corresponding data and tools are available on-line at the University of Michigan http://clair.eecs.umich.edu/aan/index.php.

  4. http://www.isca-speech.org/iscaweb/index.php/archive/online-archive.

  5. Available online at www.atala.org/-Conference-TALN-RECITAL.

  6. http://saffron.deri.ie.

  7. http://www.lrec-conf.org/.

  8. https://code.google.com/p/tesseract-ocr/.

  9. http://swish-e.org/.

  10. We verified that the frequency of the conferences (annual vs. biennial) introduces no bias.

  11. “Epicene” means that the given name is gender ambiguous.

  12. http://en.wikipedia.org/wiki/Collaboration_graph.

  13. Note that if author A published a paper with author B, and author B published a paper with author C, authors A, B and C belong to the same connected component.

  14. See Table 1.

  15. http://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_computationallinguistics.

  16. www.tagmatica.com.

  17. www.natcorp.ox.ac.uk.

  18. www.americannationalcorpus.org.

  19. www.grsampson.net/Resources.html.

  20. www.statmt.org/europarl.

  21. www.tagcrowd.com. Our thanks to Daniel Steinbock for providing access to this web service.

  22. “Van Eynde, F.; Zavrel, J. and Daelemans W. (2000), Part of Speech Tagging and Lemmatisation for the Spoken Dutch Corpus” and “Džeroski, S.; Erjavec, T. and Zavrel J. (2000), Morphosyntactic Tagging of Slovene: Evaluating Taggers and Tagsets”.

  23. We also consider papers dated the following year, to allow for possible publication delays.
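The connected-component definition in Note 13 can be illustrated with a minimal sketch. It assumes each paper is given as a list of author-name strings and uses a union-find structure; the function name `connected_components` and the sample data are hypothetical and are not taken from the study's own tooling.

```python
from collections import defaultdict


def connected_components(papers):
    """Group authors into connected components of a collaboration graph.

    `papers` is a list of author lists (an assumed input format).
    Two authors are linked if they co-signed at least one paper.
    """
    parent = {}

    def find(author):
        # Register unseen authors as their own root, then walk to the root
        # while shortening the path (path halving).
        parent.setdefault(author, author)
        while parent[author] != author:
            parent[author] = parent[parent[author]]
            author = parent[author]
        return author

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for authors in papers:
        for author in authors:
            find(author)  # make sure single-author papers are registered
        for other in authors[1:]:
            union(authors[0], other)  # link every co-author to the first one

    groups = defaultdict(set)
    for author in list(parent):
        groups[find(author)].add(author)
    return list(groups.values())


# Hypothetical data mirroring Note 13: A wrote with B, and B wrote with C.
papers = [["A", "B"], ["B", "C"], ["D"]]
components = connected_components(papers)
```

With this sample input, authors A, B and C end up in one component even though A and C never co-authored a paper, while D forms a component alone.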

References

  • ACL. (2012). Proceedings of the ACL-2012 special workshop on rediscovering 50 years of discoveries, ACL 2012, Jeju, July 10, 2012. ISBN 978-1-937284-29-9.

  • Bavelas, A. (1948). A mathematical model for small group structures. Human Organization, 7, 16–30.


  • Bavelas, A. (1950). Communication patterns in task oriented groups. Journal of the Acoustical Society of America, 22, 271–282.


  • Boudin, F. (2013). TALN archives: une archive numérique francophone des articles de recherche en traitement automatique de la langue. TALN-RÉCITAL 2013, Les Sables d’Olonne, Juin 17–21, 2013.

  • Bravo, E., Calzolari, A., De Castro, P., Mabile, L., Napolitani, F., Rossi, A. M., & Cambon-Thomsen, A. (2015). Developing a guideline to standardize the citation of bioresources in journal articles (CoBRA). BMC Medicine, 13, 33.


  • Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., et al. (2012). The LRE map. Harmonising community descriptions of resources. In Proceedings of the language resources and evaluation conference (LREC 2012), Istanbul, Turkey, May 23–25, 2012.

  • Councill, I. G., Giles, C. L., & Kan, M.-Y. (2008). ParsCit: An open-source CRF reference string parsing package. In Proceedings of the language resources and evaluation conference (LREC 2008), Marrakesh, Morocco, May 2008.

  • Csárdi, G., & Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems, 1695, 1–9.


  • Drouin, P. (2004). Detection of domain specific terminology using corpora comparison. In Proceedings of the language resources and evaluation conference (LREC 2004), Lisbon, Portugal, May 2004.

  • Dunne, C., Shneiderman, B., Gove, R., Klavans, J., & Dorr, B. (2012). Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12), 2351–2369.


  • Francopoulo, G. (2007). TagParser: Well on the way to ISO-TC37 conformance. In ICGL (International conference on global interoperability for language resources), Hong Kong.

  • Francopoulo, G., Marcoul, F., Causse, D., & Piparo, G. (2013). Global atlas: Proper nouns, from Wikipedia to LMF. In G. Francopoulo (Ed.), LMF—Lexical Markup Framework. London: ISTE/Wiley.


  • Francopoulo, G., Mariani, J., & Paroubek, P. (2015a). NLP4NLP: The cobbler’s children won’t go unshod. In 4th international workshop on mining scientific publications (WOSP2015), joint conference on digital libraries 2015 (JCDL 2015), Knoxville (USA), June 24, 2015.

  • Francopoulo, G., Mariani, J., & Paroubek, P. (2015b). NLP4NLP: Applying NLP to written and spoken scientific NLP corpora. In Workshop on mining scientific papers: Computational linguistics and bibliometrics, 15th international society of scientometrics and informetrics conference (ISSI 2015), Istanbul (Turkey), June 29, 2015.

  • Francopoulo, G., Mariani, J., & Paroubek, P. (2016). A study of reuse and plagiarism in LREC papers. In Proceedings of LREC 2016, Portorož, Slovenia, May 23–28, 2016.

  • Freeman, L. C. (1978). Centrality in social networks, conceptual clarifications. Social Networks, 1(1978/79), 215–239.


  • Fu, Y., Xu, F., & Uszkoreit, H. (2010). Determining the origin and structure of person names. In Proceedings of the seventh conference on international language resources and evaluation (LREC’10) (pp. 3417–3422), Valletta, Malta. European Language Resources Association (ELRA), May 2010. ISBN 2-9517408-6-7.

  • Hall, D. L. W., Jurafsky, D., & Manning, C. (2008). Studying the history of ideas using topic models. In Proceedings of the conference on empirical methods in natural language processing (EMNLP’08) (pp. 363–371).

  • Joerg, B., Höllrigl, T., & Sicilia, M.-A. (2012). Entities and identities in research information systems. In 11th international conference on current research information systems (CRIS2012): “e-Infrastructures for research and innovation: Linking information systems to improve scientific knowledge production”, Prague, Czech Republic, June 6–9, 2012.

  • Li, H., Councill, I., Lee, W. C., & Giles, C. L. (2006). CiteSeerx: An architecture and web service design for an academic document search engine. In Proceedings of the 15th international conference on the World Wide Web.

  • Litchfield, B. (2005). Making PDFs portable: Integrating PDF and Java technology. Java Developers Journal, March 24, 2005. http://java.sys-con.com/node/48543 (PDFBox is available at http://pdfbox.apache.org/).

  • Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to information retrieval. Cambridge: Cambridge University Press. ISBN 0521865719.


  • Mariani, J. (1990). La Conférence IEEE-ICASSP de 1976 à 1990: 15 ans de recherches en Traitement Automatique de la Parole. Notes et Documents LIMSI 90-8, Septembre 1990.

  • Mariani, J., Cieri, C., Francopoulo, G., Paroubek, P., & Delaborde, M. (2014b). Facing the identification problem in language-related scientific data analysis. In Proceedings of LREC 2014, Reykjavik, Iceland, May 26–31, 2014.

  • Mariani, J., Paroubek, P., Francopoulo, G., & Delaborde, M. (2013). Rediscovering 25 years of discoveries in spoken language processing: A preliminary ISCA archive analysis. In Proceedings of Interspeech 2013, Lyon, France, August 26–29, 2013.

  • Mariani, J., Paroubek, P., Francopoulo, G., & Hamon, O. (2014a). Rediscovering 15 years of discoveries in language resources and evaluation: The LREC anthology analysis. In Proceedings of LREC 2014, Reykjavik, Iceland, May 26–31, 2014.

  • Osborne, F., Motta, E., & Mulholland, P. (2013). Exploring scholarly data with Rexplore. In International semantic web conference, Sydney, Australia.

  • Paul, M., & Girju, R. (2009). Topic modeling of research fields: An interdisciplinary perspective. In Recent advances in natural language processing (RANLP 2009), Borovets, Bulgaria.

  • Radev, D. R., Muthukrishnan, P., Qazvinian, V., & Abu-Jbara, A. (2013). The ACL Anthology Network corpus. Language Resources and Evaluation, 47, 919–944.


  • Rochat, Y. (2009). Closeness centrality extended to unconnected graphs: The harmonic centrality index. In Applications of Social Network Analysis (ASNA), 2009, Zurich, Switzerland.

  • Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z. (2008). ArnetMiner: Extraction and mining of academic social networks. In Proceeding of the 14th international conference on knowledge discovery and data mining.

  • The British National Corpus. (2007). Version 3 (BNC XML edition). Distributed by Oxford University Computing Services on behalf of the BNC Consortium. http://www.natcorp.ox.ac.uk/.

  • The R Journal. (2012). 4(2), 5–12. ISSN 2073-4859, http://journal.r-project.org/.


Acknowledgments

The authors wish to thank the ACL colleagues Ken Church, Sanjeev Khudanpur, Amjad Abu-Jbara, Dragomir Radev and Simone Teufel, who helped them in the starting phase; Isabel Trancoso, who shared her ISCA Archive analysis on the use of assessment and corpora; Wolfgang Hess, who produced and provided a 14-GByte ISCA Archive; Emmanuelle Foxonet, who provided a list of authors' given names with gender; Florian Boudin, who made available the TALN Anthology; Helen van der Stelt and Jolanda Voogd (Springer), who provided the LRE data; Douglas O’Shaughnessy, Denise Hurley, Rebecca Wollman and Casey Schwartz (IEEE), who provided the IEEE ICASSP and TASLP data; and Nancy Ide and Christopher Cieri, who greatly improved the readability of the paper. They also thank Khalid Choukri, Alexandre Sicard and Nicoletta Calzolari, who provided information about the past LREC conferences; Victoria Arranz, Ioanna Giannopoulou, Johann Gorlier, Jérémy Leixa, Valérie Mapelli and Hélène Mazo, who helped in recovering the metadata for LREC 1998; and all the organizers, reviewers and authors over the 17 years of conferences, without whom this analysis could not have been conducted!

Author information


Corresponding author

Correspondence to Joseph Mariani.

Additional information

Olivier Hamon was at ELDA when he contributed to this paper.

This survey was conducted on textual data covering a 17-year period, including scanned content for LREC proceedings. The analysis uses tools that automatically process the content of the scientific papers and may produce errors. Therefore, the results should be regarded as subject to a large margin of error. The authors apologize for any errors the reader may detect, and they will gladly rectify them in future releases of the survey results.


About this article


Cite this article

Mariani, J., Paroubek, P., Francopoulo, G. et al. Rediscovering 15 + 2 years of discoveries in language resources and evaluation. Lang Resources & Evaluation 50, 165–220 (2016). https://doi.org/10.1007/s10579-016-9352-9


  • DOI: https://doi.org/10.1007/s10579-016-9352-9
