Hostname: page-component-7c8c6479df-fqc5m Total loading time: 0 Render date: 2024-03-29T08:31:45.271Z Has data issue: false hasContentIssue false

Ad astra or astray: Exploring linguistic knowledge of multilingual BERT through NLI task

Published online by Cambridge University Press:  09 June 2022

Maria Tikhonova*
Affiliation:
HSE University, Moscow, Russia
Vladislav Mikhailov
Affiliation:
HSE University, Moscow, Russia
Dina Pisarevskaya
Affiliation:
Independent Resercher, London, UK
Valentin Malykh
Affiliation:
Huawei Noah’s Ark Lab, Moscow, Russia
Tatiana Shavrina*
Affiliation:
HSE University, Moscow, Russia AI Research Institute (AIRI), Moscow, Russia
*
*Corresponding author: E-mail: m_tikhonova94@mail.ru; rybolos@gmail.com
*Corresponding author: E-mail: m_tikhonova94@mail.ru; rybolos@gmail.com

Abstract

Recent research has reported that standard fine-tuning approaches can be unstable due to being prone to various sources of randomness, including but not limited to weight initialization, training data order, and hardware. Such brittleness can lead to different evaluation results, prediction confidences, and generalization inconsistency of the same models independently fine-tuned under the same experimental setup. Our paper explores this problem in natural language inference, a common task in benchmarking practices, and extends the ongoing research to the multilingual setting. We propose six novel textual entailment and broad-coverage diagnostic datasets for French, German, and Swedish. Our key findings are that the mBERT model demonstrates fine-tuning instability for categories that involve lexical semantics, logic, and predicate-argument structure and struggles to learn monotonicity, negation, numeracy, and symmetry. We also observe that using extra training data only in English can enhance the generalization performance and fine-tuning stability, which we attribute to the cross-lingual transfer capabilities. However, the ratio of particular features in the additional training data might rather hurt the performance for model instances. We are publicly releasing the datasets, hoping to foster the diagnostic investigation of language models (LMs) in a cross-lingual scenario, particularly in terms of benchmarking, which might promote a more holistic understanding of multilingualism in LMs and cross-lingual knowledge transfer.

Type
Article
Copyright
© The Author(s), 2022. Published by Cambridge University Press

Access options

Get access to the full version of this content by using one of the access options below. (Log in options will check for institutional or personal access. Content may require purchase if you do not have access.)

References

Al-Shabab, O. (1996). Interpretation and the language of translation: creativity and conventions in translation.Google Scholar
Belinkov, Y., Poliak, A., Shieber, S., Van Durme, B. and Rush, A. (2019). Don’t take the premise for granted: Mitigating artifacts in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 877891.CrossRefGoogle Scholar
Bengio, Y. (2012). Practical recommendations for gradient-based training of deep architectures.CrossRefGoogle Scholar
Bentivogli, L., Clark, P., Dagan, I. and Giampiccolo, D. (2009). The fifth pascal recognizing textual entailment challenge. In TAC.Google Scholar
Bhojanapalli, S., Wilber, K., Veit, A., Rawat, A.S., Kim, S., Menon, A. and Kumar, S. (2021). On the reproducibility of neural network predictions. arXiv preprint arXiv:2102.03349.Google Scholar
Bowman, S.R., Angeli, G., Potts, C. and Manning, C.D. (2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics, pp. 632642.CrossRefGoogle Scholar
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., GuzmÁn, F., Grave, E., Ott, M., Zettlemoyer, L. and Stoyanov, V. (2020). Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 84408451.CrossRefGoogle Scholar
Conneau, A., Kruszewski, G., Lample, G., Barrault, L. and Baroni, M. (2018a). What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 21262136.CrossRefGoogle Scholar
Conneau, A., Rinott, R., Lample, G., Williams, A., Bowman, S., Schwenk, H. and Stoyanov, V. (2018b). XNLI: Evaluating cross-lingual sentence representations. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Brussels, Belgium: Association for Computational Linguistics, pp. 24752485.CrossRefGoogle Scholar
Cui, L., Cheng, S., Wu, Y. and Zhang, Y. (2020). Does bert solve commonsense task via commonsense knowledge?Google Scholar
Dagan, I., Glickman, O. and Magnini, B. (2005). The pascal recognising textual entailment challenge. In Machine Learning Challenges Workshop. Springer, pp. 177190.Google Scholar
Dehghani, M., Tay, Y., Gritsenko, A.A., Zhao, Z., Houlsby, N., Diaz, F., Metzler, D. and Vinyals, O. (2021). The benchmark lottery.Google Scholar
Devlin, J., Chang, M.-W., Lee, K. and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 41714186.Google Scholar
Dodge, J., Ilharco, G., Schwartz, R., Farhadi, A., Hajishirzi, H. and Smith, N. (2020). Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping. arXiv preprint arXiv:2002.06305.Google Scholar
Ettinger, A. (2020). What BERT is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics, 8, 3448.CrossRefGoogle Scholar
Giampiccolo, D., Magnini, B., Dagan, I. and Dolan, B. (2007). The third PASCAL recognizing textual entailment challenge. In Proceedings of the ACL-PASCAL Workshop on Textual Entailment and Paraphrasing. Prague: Association for Computational Linguistics, pp. 19.CrossRefGoogle Scholar
Glockner, M., Shwartz, V. and Goldberg, Y. (2018). Breaking NLI systems with sentences that require simple lexical inferences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Melbourne, Australia: Association for Computational Linguistics, pp. 650655.CrossRefGoogle Scholar
Goldberg, Y. (2019). Assessing BERT’s syntactic abilities.Google Scholar
Gorodkin, J. (2004). Comparing two k-category assignments by a k-category correlation coefficient. Computational Biology and Chemistry 28(5–6), 367374.CrossRefGoogle ScholarPubMed
Haim, R.B., Dagan, I., Dolan, B., Ferro, L., Giampiccolo, D., Magnini, B. and Szpektor, I. (2006). The second pascal recognising textual entailment challenge. In Proceedings of the Second PASCAL Challenges Workshop on Recognising Textual Entailment.Google Scholar
He, P., Liu, X., Gao, J. and Chen, W. (2021). Deberta: Decoding-enhanced bert with disentangled attention.Google Scholar
Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D. and Meger, D. (2018). Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32.CrossRefGoogle Scholar
Hossain, M.M., Kovatchev, V., Dutta, P., Kao, T., Wei, E. and Blanco, E. (2020). An analysis of natural language inference benchmarks through the lens of negation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 91069118.CrossRefGoogle Scholar
Hosseini, A., Reddy, S., Bahdanau, D., Hjelm, R.D., Sordoni, A. and Courville, A. (2021). Understanding by understanding not: Modeling negation in language models. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 13011312.CrossRefGoogle Scholar
Hu, H., Richardson, K., Xu, L., Li, L., Kübler, S. and Moss, L.S. (2020a). Ocnli: Original chinese natural language inference. In Findings of the Association for Computational Linguistics: EMNLP 2020, pp. 35123526.CrossRefGoogle Scholar
Hu, H., Zhou, H., Tian, Z., Zhang, Y., Patterson, Y., Li, Y., Nie, Y. and Richardson, K. (2021). Investigating transfer learning in multilingual pre-trained language models through Chinese natural language inference. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021. Online: Association for Computational Linguistics, pp. 37703785.CrossRefGoogle Scholar
Hu, J., Ruder, S., Siddhant, A., Neubig, G., Firat, O. and Johnson, M. (2020b). Xtreme: A massively multilingual multi-task benchmark for evaluating cross-lingual generalization.Google Scholar
Hua, H., Li, X., Dou, D., Xu, C. and Luo, J. (2021). Noise stability regularization for improving BERT fine-tuning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 32293241.CrossRefGoogle Scholar
Huang, W., Liu, H. and Bowman, S.R. (2020). Counterfactually-augmented SNLI training data does not yield better generalization than unaugmented data. In Proceedings of the First Workshop on Insights from Negative Results in NLP. Online: Association for Computational Linguistics, pp. 8287.CrossRefGoogle Scholar
Jawahar, G., Sagot, B. and Seddah, D. (2019). What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 36513657.Google Scholar
Kassner, N. and Schütze, H. (2020). Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 78117818.CrossRefGoogle Scholar
Khashabi, D., Chaturvedi, S., Roth, M., Upadhyay, S. and Roth, D. (2018). Looking beyond the surface: A challenge set for reading comprehension over multiple sentences. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 252262.CrossRefGoogle Scholar
Kingma, D.P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.Google Scholar
Korobov, M. (2015). Morphological analyzer and generator for Russian and Ukrainian languages. In International Conference on Analysis of Images, Social Networks and Texts AIST 2015: Analysis of Images, Social Networks and Texts, vol. 542, pp. 320332.CrossRefGoogle Scholar
Kovaleva, O., Romanov, A., Rogers, A. and Rumshisky, A. (2019). Revealing the dark secrets of BERT. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 43654374.CrossRefGoogle Scholar
Le, H., Vial, L., Frej, J., Segonne, V., Coavoux, M., Lecouteux, B., Allauzen, A., Crabbé, B., Besacier, L. and Schwab, D. (2020). FlauBERT: Unsupervised language model pre-training for French. In Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, pp. 24792490.Google Scholar
Lee, C., Cho, K. and Kang, W. (2019). Mixout: Effective regularization to finetune large-scale pretrained language models. arXiv preprint arXiv:1909.11299.Google Scholar
Liang, Y., Duan, N., Gong, Y., Wu, N., Guo, F., Qi, W., Gong, M., Shou, L., Jiang, D., Cao, G., Fan, X., Zhang, R., Agrawal, R., Cui, E., Wei, S., Bharti, T., Qiao, Y., Chen, J.-H., Wu, W., Liu, S., Yang, F., Campos, D., Majumder, R. and Zhou, M. (2020). XGLUE: A new benchmark datasetfor cross-lingual pre-training, understanding and generation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 60086018.CrossRefGoogle Scholar
Liška, A., Kruszewski, G. and Baroni, M. (2018). Memorize or generalize? searching for a compositional rnn in a haystack. arXiv preprint arXiv:1802.06467.Google Scholar
Liu, Y., Gu, J., Goyal, N., Li, X., Edunov, S., Ghazvininejad, M., Lewis, M. and Zettlemoyer, L. (2020). Multilingual denoising pre-training for neural machine translation. Transactions of the Association for Computational Linguistics 8, 726742.CrossRefGoogle Scholar
Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.Google Scholar
Madhyastha, P. and Jain, R. (2019). On model stability as a function of random seed. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL). Hong Kong, China: Association for Computational Linguistics, pp. 929939.CrossRefGoogle Scholar
Marelli, M., Menini, S., Baroni, M., Bentivogli, L., Bernardi, R. and Zamparelli, R. (2014). A SICK cure for the evaluation of compositional distributional semantic models. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14). Reykjavik, Iceland: European Language Resources Association (ELRA), pp. 216223.Google Scholar
McCoy, R.T., Frank, R, and Linzen, T. (2018). Revisiting the poverty of the stimulus: hierarchical generalization without a hierarchical bias in recurrent neural networks.Google Scholar
McCoy, R.T., Min, J. and Linzen, T. (2020). BERTs of a feather do not generalize together: Large variability in generalization across models with similar test set performance. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 217227.CrossRefGoogle Scholar
McCoy, T., Pavlick, E. and Linzen, T. (2019). Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence, Italy: Association for Computational Linguistics, pp. 34283448.CrossRefGoogle Scholar
Merchant, A., Rahimtoroghi, E., Pavlick, E. and Tenney, I. (2020). What happens to BERT embeddings during fine-tuning? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 3344.Google Scholar
Miaschi, A., Brunato, D., Dell’Orletta, F. and Venturi, G. (2020). Linguistic profiling of a neural language model. In Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, pp. 745756.CrossRefGoogle Scholar
Min, J., McCoy, R.T., Das, D., Pitler, E. and Linzen, T. (2020). Syntactic data augmentation increases robustness to inference heuristics. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 23392352.CrossRefGoogle Scholar
Mosbach, M., Andriushchenko, M. and Klakow, D. (2020a). On the stability of fine-tuning bert: Misconceptions, explanations, and strong baselines. In International Conference on Learning Representations.Google Scholar
Mosbach, M., Khokhlova, A., Hedderich, M.A. and Klakow, D. (2020b). On the interplay between fine-tuning and sentence-level probing for linguistic knowledge in pre-trained transformers. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 6882.CrossRefGoogle Scholar
Naik, A., Ravichander, A., Sadeh, N., Rose, C. and Neubig, G. (2018). Stress test evaluation for natural language inference. In Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, pp. 23402353.Google Scholar
Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J. and Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 48854901.CrossRefGoogle Scholar
Nisioi, S., Rabinovich, E., Dinu, L.P. and Wintner, S. (2016). A corpus of native, non-native and translated texts. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Portorož, Slovenia: European Language Resources Association (ELRA), pp. 41974201.Google Scholar
Phang, J., Févry, T. and Bowman, S.R. (2018). Sentence encoders on stilts: Supplementary training on intermediate labeled-data tasks. arXiv preprint arXiv:1811.01088.Google Scholar
Pruksachatkun, Y., Phang, J., Liu, H., Htut, P.M., Zhang, X., Pang, R.Y., Vania, C., Kann, K. and Bowman, S.R. (2020a). Intermediate-task transfer learning with pretrained language models: When and why does it work? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 52315247.CrossRefGoogle Scholar
Pruksachatkun, Y., Yeres, P., Liu, H., Phang, J., Htut, P. M., Wang, A., Tenney, I. and Bowman, S.R. (2020b). jiant: A software toolkit for research on general-purpose text understanding models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations. Online: Association for Computational Linguistics, pp. 109117.CrossRefGoogle Scholar
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W. and Liu, P.J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer.Google Scholar
Richardson, K., Hu, H., Moss, L. and Sabharwal, A. (2020). Probing natural language inference models through semantic fragments. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 87138721.CrossRefGoogle Scholar
Rogers, A. (2019). How the transformers broke nlp leaderboards.Google Scholar
Rogers, A. (2021). Changing the world by changing the data. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 21822194.CrossRefGoogle Scholar
Rogers, A., Kovaleva, O. and Rumshisky, A. (2020). A primer in BERTology: What we know about how BERT works. Transactions of the Association for Computational Linguistics 8, 842866.CrossRefGoogle Scholar
Rybak, P., Mroczkowski, R., Tracz, J. and Gawlik, I. (2020). KLEJ: Comprehensive benchmark for Polish language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 11911201.CrossRefGoogle Scholar
Sanchez, I., Mitchell, J. and Riedel, S. (2018). Behavior analysis of NLI models: Uncovering the influence of three factors on robustness. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 19751985.CrossRefGoogle Scholar
Shavrina, T., Fenogenova, A., Anton, E., Shevelev, D., Artemova, E., Malykh, V., Mikhailov, V., Tikhonova, M., Chertok, A. and Evlampiev, A. (2020). RussianSuperGLUE: A Russian language understanding evaluation benchmark. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Online: Association for Computational Linguistics, pp. 47174726.CrossRefGoogle Scholar
Shavrina, T. and Shapovalova, O. (2017). To the methodology of corpus construction for machine learning:“taiga”. syntax tree corpus and parser. Corpus Linguistics 2017, p. 78.Google Scholar
Singh, J., Wallat, J. and Anand, A. (2020). BERTnesia: Investigating the capture and forgetting of knowledge in BERT. In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP. Online: Association for Computational Linguistics, pp. 174183.CrossRefGoogle Scholar
Storks, S., Gao, Q. and Chai, J.Y. (2019). Recent advances in natural language inference: A survey of benchmarks, resources, and approaches. arXiv preprint arXiv:1904.01172.Google Scholar
Tanchip, C., Yu, L., Xu, A. and Xu, Y. (2020). Inferring symmetry in natural language. In Findings of the Association for Computational Linguistics: EMNLP 2020. Online: Association for Computational Linguistics, pp. 28772886.CrossRefGoogle Scholar
Thawani, A., Pujara, J., Ilievski, F. and Szekely, P. (2021). Representing numbers in NLP: a survey and a vision. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 644656.CrossRefGoogle Scholar
Tsuchiya, M. (2018). Performance impact caused by hidden bias of training data for recognizing textual entailment. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) Miyazaki, Japan: European Language Resources Association (ELRA).Google Scholar
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł. and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems (pp. 59986008).Google Scholar
Venhuizen, N.J., Hendriks, P., Crocker, M.W. and Brouwer, H. (2021). Distributional formal semantics. Information and Computation, p. 104763.Google Scholar
Wallace, E., Wang, Y., Li, S., Singh, S. and Gardner, M. (2019). Do NLP models know numbers? probing numeracy in embeddings. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). Hong Kong, China: Association for Computational Linguistics, pp. 53075315.CrossRefGoogle Scholar
Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S. (2019). Superglue: A stickier benchmark for general-purpose language understanding systems. In Advances in Neural Information Processing Systems, pp. 32663280.Google Scholar
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O. and Bowman, S. (2018). GLUE: A multi-task benchmark and analysis platform for natural language understanding. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Brussels, Belgium: Association for Computational Linguistics, pp. 353355.CrossRefGoogle Scholar
Warstadt, A. and Bowman, S.R. (2019). Linguistic analysis of pretrained sentence encoders with acceptability judgments. arXiv preprint arXiv:1901.03438.Google Scholar
Weber, N., Shekhar, L. and Balasubramanian, N. (2018). The fine line between linguistic generalization and failure in Seq2Seq-attention models. In Proceedings of the Workshop on Generalization in the Age of Deep Learning. New Orleans, Louisiana: Association for Computational Linguistics, pp. 2427.CrossRefGoogle Scholar
Williams, A., Nangia, N. and Bowman, S. (2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans, Louisiana: Association for Computational Linguistics, pp. 11121122.CrossRefGoogle Scholar
Wu, J.M., Belinkov, Y., Sajjad, H., Durrani, N., Dalvi, F. and Glass, J. (2020). Similarity analysis of contextual word representation models. arXiv preprint arXiv:2005.01172.Google Scholar
Xu, L., Hu, H., Zhang, X., Li, L., Cao, C., Li, Y., Xu, Y., Sun, K., Yu, D., Yu, C., Tian, Y., Dong, Q., Liu, W., Shi, B., Cui, Y., Li, J., Zeng, J., Wang, R., Xie, W., Li, Y., Patterson, Y., Tian, Z., Zhang, Y., Zhou, H., Liu, S., Zhao, Z., Zhao, Q., Yue, C., Zhang, X., Yang, Z., Richardson, K. and Lan, Z. (2020). CLUE: A Chinese language understanding evaluation benchmark. In Proceedings of the 28th International Conference on Computational Linguistics. Barcelona, Spain (Online): International Committee on Computational Linguistics, pp. 47624772.CrossRefGoogle Scholar
Xue, L., Constant, N., Roberts, A., Kale, M., Al-Rfou, R., Siddhant, A., Barua, A. and Raffel, C. (2021). mT5: A massively multilingual pre-trained text-to-text transformer. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Online: Association for Computational Linguistics, pp. 483498.CrossRefGoogle Scholar
Yanaka, H., Mineshima, K., Bekki, D. and Inui, K. (2020). Do neural models learn systematicity of monotonicity inference in natural language? In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 61056117.Google Scholar
Yanaka, H., Mineshima, K., Bekki, D., Inui, K., Sekine, S., Abzianidze, L. and Bos, J. (2019a). Can neural networks understand monotonicity reasoning? In Proceedings of the 2019 ACL Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. Florence, Italy: Association for Computational Linguistics, pp. 3140.CrossRefGoogle Scholar
Yanaka, H., Mineshima, K., Bekki, D., Inui, K., Sekine, S., Abzianidze, L. and Bos, J. (2019b). HELP: A dataset for identifying shortcomings of neural models in monotonicity reasoning. In Proceedings of the Eighth Joint Conference on Lexical and Computational Semantics (*SEM 2019). Minneapolis, Minnesota: Association for Computational Linguistics, pp. 250255.CrossRefGoogle Scholar
Zhang, S., Liu, X., Liu, J., Gao, J., Duh, K. and Durme, B.V. (2018). Record: Bridging the gap between human and machine commonsense reading comprehension.Google Scholar
Zhang, Y., Warstadt, A., Li, X., and Bowman, S.R. (2021). When do you need billions of words of pretraining data? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). Online: Association for Computational Linguistics, pp. 11121125.Google Scholar
Zhao, Y. and Bethard, S. (2020). How does BERT’s attention change when you fine-tune? an analysis methodology and a case study in negation scope. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, pp. 47294747.CrossRefGoogle Scholar
Zhu, C., Cheng, Y., Gan, Z., Sun, S., Goldstein, T. and Liu, J. (2019). Freelb: Enhanced adversarial training for natural language understanding. In International Conference on Learning Representations.Google Scholar
Zhuang, D., Zhang, X., Song, S.L. and Hooker, S. (2021). Randomness in neural network training: Characterizing the impact of tooling. arXiv preprint arXiv:2106.11872.Google Scholar