DOI: 10.1145/3462757.3466088
Research Article · Open Access

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

Published: 27 July 2021

ABSTRACT

While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains from domain pretraining, despite the fact that legal language is widely seen to be unique. We hypothesize that these results stem from existing legal NLP tasks being too easy and failing to meet the conditions under which domain pretraining can help. To address this, we first present CaseHOLD (Case Holdings On Legal Decisions), a new dataset of over 53,000 multiple choice questions that ask the reader to identify the relevant holding of a cited case. This task is fundamental to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second, we assess performance gains on CaseHOLD and existing legal NLP datasets. While a Transformer architecture (BERT) pretrained on a general corpus (Google Books and Wikipedia) improves performance, domain pretraining (on a corpus of ≈3.5M decisions across all courts in the U.S., larger than BERT's pretraining corpus) with a custom legal vocabulary exhibits the most substantial performance gains on CaseHOLD (a gain of 7.2% on F1, representing a 12% improvement over BERT) and consistent performance gains across two other legal tasks. Third, we show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the performance increase across the three legal tasks was directly tied to each task's domain specificity. Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.
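The abstract describes CaseHOLD as a five-way multiple choice task (a citing context plus candidate holdings, one correct) evaluated with F1. The sketch below illustrates that task format and a macro-averaged F1 over the candidate indices; the example record, field names, and the macro-F1 choice are illustrative assumptions, not actual dataset entries or the paper's exact evaluation code.

```python
# Sketch of the CaseHOLD task shape (hypothetical record, not a real entry).
# Each example pairs a citing context with five candidate holdings; exactly
# one candidate is the holding of the cited case.
from dataclasses import dataclass, field

@dataclass
class CaseHoldExample:
    citing_context: str            # prompt text ending at the citation
    candidates: list = field(default_factory=list)  # five holding statements
    label: int = 0                 # index (0-4) of the correct holding

example = CaseHoldExample(
    citing_context=(
        "Because the statute is ambiguous, we defer to the agency's "
        "reasonable interpretation. See Smith v. Jones (<HOLDING>)."
    ),
    candidates=[
        "holding that deference applies to reasonable agency readings",
        "holding that the statute of limitations had run",
        "holding that venue was improper",
        "holding that the contract was unconscionable",
        "holding that the appeal was moot",
    ],
    label=0,
)

def macro_f1(y_true, y_pred, n_classes=5):
    """Macro-averaged F1 over the five candidate-index classes: per-class
    F1 from precision/recall, then an unweighted mean across classes."""
    f1s = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n_classes

gold = [0, 1, 3]   # hypothetical gold candidate indices
preds = [0, 1, 2]  # hypothetical model predictions
score = macro_f1(gold, preds)  # → 0.4
```

A fine-tuned model would replace the hypothetical `preds` with argmax scores over the five candidates; the F1 computation over the predicted indices is unchanged.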


Published in
ICAIL '21: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law
June 2021, 319 pages
ISBN: 9781450385268
DOI: 10.1145/3462757

Copyright © 2021 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 69 of 169 submissions, 41%
