ParaPhraser: Russian Paraphrase Corpus and Shared Task

Pivovarova, Lidia; Pronoza, Ekaterina; Yagunova, Elena; Pronoza, Anton

doi:10.1007/978-3-319-71746-3_18

Conference paper
First Online: 28 November 2017

Part of the book series: Communications in Computer and Information Science (CCIS, volume 789)

Abstract

This paper describes the results of the First Russian Paraphrase Detection Shared Task, held in St. Petersburg, Russia, in October 2016. Research on paraphrase extraction, detection, and generation has a long and successful history, yet the Russian computational linguistics community has taken an interest in the problem only recently. We try to close this gap by introducing ParaPhraser.ru, a project dedicated to collecting a Russian paraphrase corpus, and by organizing a Paraphrase Detection Shared Task that uses the corpus as training data. The participants applied a wide variety of techniques to paraphrase detection, from rule-based approaches to deep learning. The results reflect the following tendencies: the best scores are obtained by traditional classifiers combined with fine-grained linguistic features; however, complex neural networks, shallow methods, and purely technical methods also demonstrate competitive results.

Notes

1. http://paraphraser.ru/scorer
2. http://www.paraphraser.ru/download/
3. In some cases we do not know what method was used.
4. These observations were made during the shared task workshop at the AINL 2016 conference. Unfortunately, not all participants submitted a paper, though some presentations are available on the conference webpage: http://ainlconf.ru/2016/materials
5. https://aclweb.org/aclwiki/index.php?title=Paraphrase_Identification_(State_of_the_art)

Author information

Authors and Affiliations

University of Helsinki, Helsinki, Finland
Lidia Pivovarova
St.-Petersburg State University, St.-Petersburg, Russian Federation
Ekaterina Pronoza & Elena Yagunova
Institute for Informatics and Automation of the Russian Academy of Sciences, St.-Petersburg, Russian Federation
Anton Pronoza

Corresponding author

Correspondence to Lidia Pivovarova.

Editor information

Editors and Affiliations

ITMO University, St. Petersburg, Russia
Andrey Filchenkov
University of Helsinki, Helsinki, Finland
Lidia Pivovarova
Mendel University, Brno, Czech Republic
Jan Žižka

About this paper

Cite this paper

Pivovarova, L., Pronoza, E., Yagunova, E., Pronoza, A. (2018). ParaPhraser: Russian Paraphrase Corpus and Shared Task. In: Filchenkov, A., Pivovarova, L., Žižka, J. (eds) Artificial Intelligence and Natural Language. AINL 2017. Communications in Computer and Information Science, vol 789. Springer, Cham. https://doi.org/10.1007/978-3-319-71746-3_18

DOI: https://doi.org/10.1007/978-3-319-71746-3_18
Published: 28 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-71745-6
Online ISBN: 978-3-319-71746-3
eBook Packages: Computer Science, Computer Science (R0)
