DOI: 10.1145/3462757.3466088
Research Article · Open Access

When does pretraining help?: assessing self-supervised learning for law and the CaseHOLD dataset of 53,000+ legal holdings

Published: 27 July 2021

ABSTRACT

While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains from domain pretraining, despite the fact that legal language is widely seen to be unique. We hypothesize that these results stem from existing legal NLP tasks being too easy and failing to meet the conditions under which domain pretraining can help. To address this, we first present CaseHOLD (Case Holdings On Legal Decisions), a new dataset of over 53,000 multiple choice questions that ask the reader to identify the relevant holding of a cited case. This task is fundamental to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second, we assess performance gains on CaseHOLD and existing legal NLP datasets. While a Transformer architecture (BERT) pretrained on a general corpus (Google Books and Wikipedia) improves performance, domain pretraining (on a corpus of ≈3.5M decisions across all courts in the U.S., larger than BERT's pretraining corpus) with a custom legal vocabulary exhibits the most substantial performance gains on CaseHOLD (a gain of 7.2% on F1, representing a 12% improvement over BERT) and consistent performance gains across two other legal tasks. Third, we show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the performance increase across the three legal tasks was directly tied to each task's domain specificity. Our findings inform when researchers should engage in resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.
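The abstract describes CaseHOLD as a five-way multiple choice task (a citing context plus candidate holdings, one correct) evaluated with F1. The sketch below illustrates that task format and a macro-averaged F1 over the candidate indices; the example record, field names, and the macro-F1 choice are illustrative assumptions, not actual dataset entries or the paper's exact evaluation code.

```python
# Sketch of the CaseHOLD task shape (hypothetical record, not a real entry).
# Each example pairs a citing context with five candidate holdings; exactly
# one candidate is the holding of the cited case.
from dataclasses import dataclass, field

@dataclass
class CaseHoldExample:
    citing_context: str            # prompt text ending at the citation
    candidates: list = field(default_factory=list)  # five holding statements
    label: int = 0                 # index (0-4) of the correct holding

example = CaseHoldExample(
    citing_context=(
        "Because the statute is ambiguous, we defer to the agency's "
        "reasonable interpretation. See Smith v. Jones (<HOLDING>)."
    ),
    candidates=[
        "holding that deference applies to reasonable agency readings",
        "holding that the statute of limitations had run",
        "holding that venue was improper",
        "holding that the contract was unconscionable",
        "holding that the appeal was moot",
    ],
    label=0,
)

def macro_f1(y_true, y_pred, n_classes=5):
    """Macro-averaged F1 over the five candidate-index classes: per-class
    F1 from precision/recall, then an unweighted mean across classes."""
    f1s = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / n_classes

gold = [0, 1, 3]   # hypothetical gold candidate indices
preds = [0, 1, 2]  # hypothetical model predictions
score = macro_f1(gold, preds)  # → 0.4
```

A fine-tuned model would replace the hypothetical `preds` with argmax scores over the five candidates; the F1 computation over the predicted indices is unchanged.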


Published in
ICAIL '21: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Law
June 2021, 319 pages
ISBN: 9781450385268
DOI: 10.1145/3462757

Copyright © 2021 Owner/Author. This work is licensed under a Creative Commons Attribution 4.0 International License.

Publisher: Association for Computing Machinery, New York, NY, United States


Overall acceptance rate: 69 of 169 submissions, 41%
