LSG Attention: Extrapolation of Pretrained Transformers to Long Sequences

  • Conference paper
  • In: Advances in Knowledge Discovery and Data Mining (PAKDD 2023)

Abstract

Transformer models achieve state-of-the-art performance on a wide range of NLP tasks. However, they suffer from a prohibitive limitation due to the self-attention mechanism, which induces \(O(n^2)\) complexity with respect to sequence length. To address this limitation, we introduce the LSG architecture, which relies on Local, Sparse and Global attention. We show that LSG attention is fast, efficient and competitive on classification and summarization tasks over long documents. Interestingly, it can also be used to adapt existing pretrained models so that they efficiently extrapolate to longer sequences with no additional training. Along with the LSG attention mechanism, we provide a PyPI package to train new models and to adapt existing ones based on this mechanism.
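
As an illustration of this adaptation use case, the sketch below loads one of the converted checkpoints published at https://huggingface.co/ccdv (see the notes below) with the standard Hugging Face transformers API and encodes a document well beyond the usual 512-token limit. This is a minimal sketch, not the authors' reference code: the checkpoint identifier and the trust_remote_code flag are assumptions about how these community checkpoints are typically distributed; the actual model names and loading instructions are given on the hub pages.

    # Minimal sketch, assuming an LSG checkpoint published under https://huggingface.co/ccdv.
    # The identifier "ccdv/lsg-bart-base-4096" is illustrative; trust_remote_code=True is
    # commonly required for checkpoints that ship custom attention code with the model.
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    model_name = "ccdv/lsg-bart-base-4096"  # illustrative identifier, verify on the hub

    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name, trust_remote_code=True)

    long_document = " ".join(["word"] * 8000)  # stand-in for a document far beyond 512 tokens

    # LSG attention keeps cost roughly linear in length, so the input can be encoded
    # at 4096 tokens instead of the 512/1024 limit of the original pretrained model.
    inputs = tokenizer(long_document, return_tensors="pt", truncation=True, max_length=4096)
    summary_ids = model.generate(**inputs, max_length=256)
    print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))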

Notes

  1. Checkpoints and datasets are available at https://huggingface.co/ccdv.

  2. https://github.com/ccdv-ai/convert_checkpoint_to_lsg.

  3. See https://github.com/ccdv-ai/convert_checkpoint_to_lsg for a benchmark.

  4. All summarization experiments are run using an 8e-5 learning rate, a 10% warmup, a length penalty of 2.0, and a beam size of 5 for beam search (these settings are sketched in code after these notes).

  5. This work has benefited from the LAWBOT (ANR-20-CE38-0013) grant and HPC resources from GENCI-IDRIS (Grant 2023-AD011011309R3).
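
For concreteness, here is how the settings quoted in note 4 map onto the Hugging Face transformers API. Only the four hyperparameter values come from the note; the output directory is a placeholder, and the surrounding setup (model, data, trainer) is deliberately left out of this minimal sketch, which is not the authors' actual experiment script.

    # Sketch of the hyperparameters from note 4, expressed with the transformers API.
    # Everything except the four quoted values (learning rate, warmup, length penalty,
    # beam size) is a placeholder.
    from transformers import GenerationConfig, Seq2SeqTrainingArguments

    # Optimisation side of note 4: 8e-5 learning rate with a 10% warmup.
    training_args = Seq2SeqTrainingArguments(
        output_dir="lsg-summarization",  # placeholder, not from the paper
        learning_rate=8e-5,
        warmup_ratio=0.1,
        predict_with_generate=True,
    )

    # Decoding side of note 4: beam search with 5 beams and a length penalty of 2.0.
    generation_config = GenerationConfig(num_beams=5, length_penalty=2.0)

    # These objects would then be passed to a Seq2SeqTrainer / model.generate() call
    # together with an LSG summarization model and a tokenized long-document dataset.
    print(training_args.learning_rate, training_args.warmup_ratio,
          generation_config.num_beams, generation_config.length_penalty)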

Author information

Correspondence to Charles Condevaux.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Condevaux, C., Harispe, S. (2023). LSG Attention: Extrapolation of Pretrained Transformers to Long Sequences. In: Kashima, H., Ide, T., Peng, W.-C. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2023. Lecture Notes in Computer Science, vol. 13935. Springer, Cham. https://doi.org/10.1007/978-3-031-33374-3_35

  • DOI: https://doi.org/10.1007/978-3-031-33374-3_35

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-33373-6

  • Online ISBN: 978-3-031-33374-3
