Abstract
This paper describes the results of the First Russian Paraphrase Detection Shared Task, held in St. Petersburg, Russia, in October 2016. Research on paraphrase extraction, detection and generation has a long and successful history, yet the Russian computational linguistics community has taken an interest in the problem only recently. We try to close this gap by introducing ParaPhraser.ru, a project dedicated to collecting a Russian paraphrase corpus, and by organizing a Paraphrase Detection Shared Task that uses the corpus as training data. The participants applied a wide variety of techniques to paraphrase detection, from rule-based approaches to deep learning. The results reflect the following tendencies: the best scores are obtained by traditional classifiers combined with fine-grained linguistic features; however, complex neural networks, shallow methods and purely technical methods also demonstrate competitive results.
Notes
- 1.
- 2.
- 3. In some cases we do not know what method was used.
- 4. These are observations made during the shared task workshop at the AINL 2016 conference. Unfortunately, not all participants submitted a paper, though some presentations are available on the conference webpage: http://ainlconf.ru/2016/materials.
- 5.
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Pivovarova, L., Pronoza, E., Yagunova, E., Pronoza, A. (2018). ParaPhraser: Russian Paraphrase Corpus and Shared Task. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, vol 789. Springer, Cham. https://doi.org/10.1007/978-3-319-71746-3_18
DOI: https://doi.org/10.1007/978-3-319-71746-3_18
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71745-6
Online ISBN: 978-3-319-71746-3
eBook Packages: Computer Science (R0)