Multilingual collocation extraction with a syntactic parser

Language Resources and Evaluation

Abstract

A considerable amount of work has been devoted to collocation extraction over the past few decades. The state of the art shows a sustained interest in the morphosyntactic preprocessing of texts in order to better identify candidate expressions; in most cases, however, the preprocessing is limited to lemmatization, POS tagging, or shallow parsing. This article presents a collocation extraction system based on the full parsing of source corpora, which supports four languages: English, French, Spanish, and Italian. The performance of the system is compared against that of the standard mobile-window method. The evaluation experiment investigates several levels of the significance lists, uses a fine-grained annotation schema, and covers all the languages supported. Consistent results were obtained for these languages: parsing, even if imperfect, leads to a significant improvement in the quality of results, in terms of collocational precision (between 16.4 and 29.7%, depending on the language; 20.1% overall), MWE precision (between 19.9 and 35.8%; 26.1% overall), and grammatical precision (between 47.3 and 67.4%; 55.6% overall). This positive result is highly important, especially in view of the subsequent integration of extraction results into other NLP applications.
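
The window-based baseline mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `window_candidates`, the default window size, and the sample sentence are assumptions made for illustration only. The sketch counts ordered token pairs that fall within a fixed number of consecutive tokens, with no syntactic analysis.

```python
from collections import Counter


def window_candidates(tokens, window=5):
    """Count ordered token pairs that co-occur within `window` consecutive tokens.

    Hypothetical helper: the window size and the pairing rule are assumptions,
    not the settings used in the article.
    """
    counts = Counter()
    for i, first in enumerate(tokens):
        # Pair the token with every later token that still fits in the window.
        for second in tokens[i + 1 : i + window]:
            counts[(first, second)] += 1
    return counts


# Illustrative sentence (invented for this sketch, not taken from the corpora).
sentence = "the decision was finally taken by the committee".split()
narrow = window_candidates(sentence, window=3)
wide = window_candidates(sentence, window=5)
```

With these inputs, the pair (decision, taken) is absent from `narrow` but present in `wide`. A wider window recovers more distant pairs but also admits more unrelated ones, which is the trade-off that parsing-based candidate identification aims to avoid.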

Notes

  1. All the sample sentences provided in this paper actually occurred in our corpora.

  2. The following abbreviations are used in this paper: N—noun, V—verb, A—adjective, Adv—adverb, C—conjunction, P—preposition, Inter—interjection.

  3. Jacquemin et al. (1997, p. 27) argue that a 5-word window is insufficient for French due to the “longer syntactic structures”. In fact, Goldman et al. (2001, p. 62) identified some instances of verb–object collocations whose component items were separated by as many as 30 intervening words.

  4. Evert and Krenn (2005) indicate that this choice is also dependent on the specific extraction setting (e.g., domain and size of corpora, frequency threshold applied, type of preprocessing performed).

  5. The lexical categories are N, A, V, P, Adv, C, Inter, to which we add the two functional categories T (tense) and F (functional).

  6. In this case, however, the instances missed for candidate pairs alter the frequency profile of these pairs (the values in the contingency table), on which their ranking in the significance list and, ultimately, the quality of results depend.

  7. These percentages are not as small as they might seem, since the data processed is fairly large and no frequency threshold was applied on the candidate pairs.

  8. The kappa values indicate different degrees of agreement, as follows: 0 to 0.2—slight; 0.2 to 0.4—fair; 0.4 to 0.6—moderate; 0.6 to 0.8—substantial; 0.8 to 0.99—almost perfect, and 1—perfect. The scores we obtained are higher than expected, given the difficulty of the task.

  9. The study considered the top 100 subject–verb pairs extracted with the Sketch Engine from the BNC for the noun “preference”, without a frequency cutoff. We found that as many as 23.8% of the 63 corresponding pair types were derived from ungrammatical instances, e.g., preference-result: “to give effect to the preference would result in ...”, or preference-lead: “the existence of these preferences would clearly lead ...”.
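
The agreement scale given in note 8 can be illustrated with a short sketch. It assumes pairwise Cohen's kappa (Cohen 1960) between two annotators; the notes do not say which kappa variant was computed, so this is an assumption. The function names, the category labels, and the treatment of band boundaries (lower bound inclusive) are likewise hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, estimated from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_level(kappa):
    """Map a kappa value to the bands of note 8 (boundary handling is assumed)."""
    if kappa >= 1.0:
        return "perfect"
    for lower, name in [(0.8, "almost perfect"), (0.6, "substantial"),
                        (0.4, "moderate"), (0.2, "fair"), (0.0, "slight")]:
        if kappa >= lower:
            return name
    return "below chance (not covered by the scale in note 8)"


# Hypothetical annotations: "c" = collocation, "n" = not a collocation.
annotator_1 = ["c", "c", "n", "n"]
annotator_2 = ["c", "c", "n", "c"]
kappa = cohen_kappa(annotator_1, annotator_2)
level = agreement_level(kappa)
```

In this toy case the annotators agree on three of four items while chance agreement is 0.5, so kappa is 0.5 and falls in the “moderate” band.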

References

  • Barnbrook, G. (1996). Language and computers: A practical introduction to the computer analysis of language. Edinburgh: Edinburgh University Press.

  • Basili, R., Pazienza, M. T., & Velardi, P. (1994). A “not-so-shallow” parser for collocational analysis. In Proceedings of the 15th Conference on Computational Linguistics (pp. 447–453). Association for Computational Linguistics: Kyoto, Japan.

  • Bourigault, D. (1992). Surface grammatical analysis for the extraction of terminological noun phrases. In Proceedings of the 15th International Conference on Computational Linguistics (pp. 977–981). Nantes, France.

  • Breidt, E. (1993). Extraction of V–N-Collocations from text corpora: A feasibility study for German. In Proceedings of the Workshop on Very Large Corpora: Academic and Industrial Perspectives. Columbus, USA.

  • Calzolari, N., & Bindi, R. (1990). Acquisition of lexical information from a large textual Italian corpus. In Proceedings of the 13th International Conference on Computational Linguistics (pp. 54–59). Helsinki, Finland.

  • Choueka, Y. (1988). Looking for needles in a haystack, or locating interesting collocational expressions in large textual databases. In Proceedings of the International Conference on User-oriented Content-based Text and Image Handling (pp. 609–623). Cambridge, USA.

  • Church, K., Gale, W., Hanks, P., & Hindle, D. (1989). Parsing, word associations and typical predicate-argument relations. In Proceedings of the International Workshop on Parsing Technologies (pp. 103–112). Carnegie Mellon University: Pittsburgh.

  • Church, K. W., & Hanks, P. (1989). Word association norms, mutual information, and lexicography. In Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics (pp. 76–83). Vancouver, B.C.: Association for Computational Linguistics.

  • Church, K., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16(1), 22–29.

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.

  • Daille, B. (1994). Approche mixte pour l’extraction automatique de terminologie: statistiques lexicales et filtres linguistiques. Ph.D. thesis, Université Paris 7.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1), 61–74.

  • Evert, S. (2004). The statistics of word cooccurrences: Word pairs and collocations. Ph.D. thesis, University of Stuttgart.

  • Evert, S., & Kermes, H. (2003). Experiments on candidate data for collocation extraction. In Companion Volume to the Proceedings of the 10th Conference of The European Chapter of the Association for Computational Linguistics (pp. 83–86). Budapest, Hungary.

  • Evert, S., & Krenn, B. (2001). Methods for the qualitative evaluation of lexical association measures. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (pp. 188–195). Toulouse, France.

  • Evert, S., & Krenn, B. (2005). Using small random samples for the manual evaluation of statistical association measures. Computer Speech & Language, 19(4), 450–466.

  • Fontenelle, T. (1992). Collocation acquisition from a corpus or from a dictionary: A comparison. In Proceedings I–II: Papers submitted to the 5th EURALEX International Congress on Lexicography in Tampere (pp. 221–228).

  • Goldman, J.-P., Nerima, L., & Wehrli, E. (2001). Collocation extraction using a syntactic parser. In Proceedings of the ACL Workshop on Collocations (pp. 61–66). Toulouse, France.

  • Gross, M. (1984). Lexicon-grammar and the syntactic analysis of French. In Proceedings of the 22nd conference on Association for Computational Linguistics (pp. 275–282). Morristown, NJ, USA.

  • Huang, C.-R., Kilgarriff, A., Wu, Y., Chiu, C.-M., Smith, S., Rychly, P., Bai, M.-H., & Chen, K.-J. (2005). Chinese Sketch Engine and the extraction of grammatical collocations. In Proceedings of the Fourth SIGHAN Workshop on Chinese Language Processing (pp. 48–55). Jeju Island, Republic of Korea.

  • Ikehara, S., Shirai, S., & Kawaoka, T. (1995). Automatic extraction of uninterrupted collocations by n-gram statistics. In Proceedings of First Annual Meeting of the Association for Natural Language Processing (pp. 313–316).

  • Jacquemin, C., Klavans, J. L., & Tzoukermann, E. (1997). Expansion of multi-word terms for indexing and retrieval using morphology and syntax. In Proceedings of the 35th Annual Meeting on Association for Computational Linguistics (pp. 24–31). Association for Computational Linguistics: Morristown, NJ, USA.

  • Justeson, J. S., & Katz, S. M. (1995). Technical terminology: Some linguistic properties and an algorithm for identification in text. Natural Language Engineering, 1(1), 9–27.

  • Kilgarriff, A. (1996). Which words are particularly characteristic of a text? A survey of statistical approaches. In Proceedings of AISB Workshop on Language Engineering for Document Analysis and Recognition (pp. 33–40). Sussex, UK.

  • Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. In Proceedings of the Eleventh EURALEX International Congress (pp. 105–116). Lorient, France.

  • Kim, S., Yang, Z., Song, M., & Ahn, J.-H. (1999). Retrieving collocations from Korean text. In Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (pp. 71–81). Maryland, USA.

  • Kjellmer, G. (1994). A dictionary of English collocations. Oxford: Clarendon Press.

  • Koehn, P. (2005). Europarl: A parallel corpus for statistical machine translation. In Proceedings of The Tenth Machine Translation Summit (MT Summit X) (pp. 79–86). Phuket, Thailand.

  • Krenn, B. (2000). The usual suspects: Data-oriented models for identification and representation of lexical collocations, Vol. 7. Saarbrücken, Germany: German Research Center for Artificial Intelligence and Saarland University Dissertations in Computational Linguistics and Language Technology.

  • Krenn, B., & Evert, S. (2001). Can we do better than frequency? A case study on extracting PP-verb collocations. In Proceedings of the ACL Workshop on Collocations (pp. 39–46). Toulouse, France.

  • Lafon, P. (1984). Dépouillements et statistiques en lexicométrie. Genève Paris: Slatkine Champion.

  • Lin, D. (1998). Extracting collocations from text corpora. In First Workshop on Computational Terminology (pp. 57–63). Montreal.

  • Lin, D. (1999). Automatic identification of non-compositional phrases. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics (pp. 317–324). Association for Computational Linguistics: Morristown, NJ, USA.

  • Lu, Q., Li, Y., & Xu, R. (2004). Improving Xtract for Chinese collocation extraction. In Proceedings of IEEE International Conference on Natural Language Processing and Knowledge Engineering (pp. 333–338).

  • Manning, C. D., & Schütze, H. (1999). Foundations of statistical natural language processing. Cambridge, MA: MIT Press.

  • McKeown, K. R., & Radev, D. R. (2000). Collocations. In R. Dale, H. Moisl, & H. Somers (Eds.), A Handbook of natural language processing (pp. 507–523). New York, USA: Marcel Dekker.

  • Mel’čuk, I. (1998). Collocations and lexical functions. In A. P. Cowie (Ed.), Phraseology. Theory, analysis, and applications (pp. 23–53). Oxford: Clarendon Press.

  • Mel’čuk, I. (2003). Collocations: Définition, rôle et utilité. In F. Grossmann & A. Tutin (Eds.), Les collocations: Analyse et traitement (pp. 23–32). Amsterdam: Editions “De Werelt”.

  • Pearce, D. (2001). Synonymy in collocation extraction. In WordNet and Other Lexical Resources: Applications, Extensions and Customizations (NAACL 2001 Workshop) (pp. 41–46). Pittsburgh, USA.

  • Pearce, D. (2002). A comparative evaluation of collocation extraction techniques. In Third International Conference on Language Resources and Evaluation. Las Palmas, Spain.

  • Sag, I. A., Baldwin, T., Bond, F., Copestake, A., & Flickinger, D. (2002). Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLING 2002) (pp. 1–15). Mexico City.

  • Seretan, V., Nerima, L., & Wehrli, E. (2004). A tool for multi-word collocation extraction and visualization in multilingual corpora. In Proceedings of the Eleventh EURALEX International Congress, EURALEX 2004 (pp. 755–766). Lorient, France.

  • Seretan, V., & Wehrli, E. (2006). Accurate collocation extraction using a multilingual parser. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics (pp. 953–960). Sydney, Australia.

  • Shimohata, S., Sugio, T., & Nagata, J. (1997). Retrieving collocations by co-occurrences and word order constraints. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp. 476–481). Madrid, Spain.

  • Silberztein, M. (1993). Dictionnaires électroniques et analyse automatique de textes. Le système INTEX. Paris: Masson.

  • Smadja, F. (1993). Retrieving collocations from text: Xtract. Computational Linguistics, 19(1), 143–177.

  • Tutin, A. (2004). Pour une modélisation dynamique des collocations dans les textes. In Proceedings of the Eleventh EURALEX International Congress (pp. 207–219). Lorient, France.

  • Villada Moirón, M. B. (2005). Data-driven identification of fixed expressions and their modifiability. Ph.D. thesis, University of Groningen.

  • Wehrli, E. (2007). Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing (pp. 120–127). Prague, Czech Republic: Association for Computational Linguistics.

  • Wermter, J., & Hahn, U. (2004). Collocation extraction based on modifiability statistics. In Proceedings of the 20th International Conference on Computational Linguistics (COLING 2004) (pp. 980–986). Geneva, Switzerland.

  • Zajac, R., Lange, E., & Yang, J. (2003). Customizing complex lexical entries for high-quality MT. In Proceedings of the Ninth Machine Translation Summit (pp. 433–438). New Orleans, USA.

  • Zinsmeister, H., & Heid, U. (2003). Significant triples: Adjective+Noun+Verb combinations. In Proceedings of the 7th Conference on Computational Lexicography and Text Research (Complex 2003), Budapest.

Acknowledgements

This work was supported in part by Swiss National Science Foundation grant no. 101412-103999. We wish to thank Jorge Antonio Leoni de León, Yves Scherrer and Vincenzo Pallotta for participating in the annotation task, as well as Stephanie Durrleman-Tame for proofreading the article. We are very grateful to the anonymous reviewers, whose comments and suggestions helped us to improve this paper.

Correspondence to Violeta Seretan.

About this article

Cite this article

Seretan, V., Wehrli, E. Multilingual collocation extraction with a syntactic parser. Lang Resources & Evaluation 43, 71–85 (2009). https://doi.org/10.1007/s10579-008-9075-7

  • DOI: https://doi.org/10.1007/s10579-008-9075-7
