Deep learning for patent landscaping using transformer and graph embedding

https://doi.org/10.1016/j.techfore.2021.121413

Highlights

  • This paper suggests a deep patent landscaping model that utilizes transformer and Diff2Vec structures to handle the classification problem in patent landscaping.

  • This paper proposes a new benchmarking dataset for automated patent landscaping and conducts a practical investigation on the subject.

  • The suggested approach outperforms previous methods in terms of overall classification performance in patent landscaping.

  • This paper conducts an experimental investigation into the effect of technical codes and text data on patent categorization models.

Abstract

Patent landscaping is used to search for related patents during research and development (R & D) projects. It is a crucial task in the early stages of an R & D project, both to avoid the risk of patent infringement and to follow current trends in technology. The first task of patent landscaping is to extract the target patents for analysis from a patent database. Because patent classification for patent landscaping requires advanced human resources and can be tedious, the demand for automated patent classification has gradually increased. However, a shortage of well-defined benchmark datasets and comparable models makes it difficult to find and compare related research studies. This paper proposes an automated patent classification model for patent landscaping based on transformer and graph embedding, both of which are drawn from deep learning. The proposed model uses a transformer architecture to derive text embeddings from patent abstracts, uses a graph embedding technique to derive graph embeddings from classification code co-occurrence information, and concatenates the two. Furthermore, we introduce four benchmark datasets so that related studies on patent landscaping can be compared. The results show strong performance on our datasets, comparable to that of a model using BERT, which has recently shown the best performance.

Introduction

A patent is a significant deliverable in research and development (R & D) projects. In addition to protecting an assignee’s legal rights, patents represent current trends in technology. To study technological trends and identify potential patent infringements, most R & D projects include patent landscaping, which involves collecting and analyzing patent documents related to a specific project (Abood, Feltenberger, 2018, Bubela, Gold, Graff, Cahoy, Nicol, Castle, 2013, Wittenburg, Pekhteryev, 2015). The first step in patent landscaping is patent classification, which involves extracting patent documents relating to the target technology from a patent database for analysis. In general, patent classification for patent landscaping is a human-centric, tedious, and expensive process (Abood, Feltenberger, 2018, Trippe, 2015). Researchers and patent attorneys query related patents in large patent databases (by creating keyword candidates), eliminate unrelated patent documents, and extract only the target patent documents related to their project (Wittenburg, Pekhteryev, 2015, Yang, Akers, Yang, Klose, Pavlek). However, because the participants in the process must be familiar with the scientific and technical domains, such procedures are costly. Furthermore, the patent classification task must be repeated regularly (weekly or monthly) during a project to search for newly published patents.

In this paper, we propose a supervised deep learning model for patent classification in patent landscaping. The proposed model aims to eliminate repetitive and inefficient tasks by employing classification models based on deep learning. The proposed model incorporates a modified transformer structure (Vaswani et al., 2017a) and a graph embedding method using a diffusion graph (Rozemberczki and Sarkar, 2018). Because a patent document can contain several textual features and bibliometric data, the modified transformer structure is applied to process the textual data, and the diffusion-graph-based Diff2Vec is applied to process the graph-structured bibliometric data fields. Patents contain extremely long texts, making it difficult to process the full text in the proposed transformer model. We therefore designed the model to use only the abstract, whose length the proposed transformer model can process. In addition, we apply graph embedding models by using the co-occurrence information of the technology classification codes, which are the most informative bibliometric features for patent landscaping.
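To make the code co-occurrence step concrete, the following minimal sketch (not the paper’s exact pipeline) builds a graph in which nodes are classification codes and edges connect codes assigned to the same patent, and then embeds the nodes with the karateclub implementation of Diff2Vec; the sample codes, graph size, and hyperparameters are illustrative assumptions.

```python
# Minimal sketch (assumed details): build a classification-code co-occurrence
# graph and embed its nodes with Diff2Vec (karateclub implementation).
from itertools import combinations

import networkx as nx
from karateclub import Diff2Vec

# Illustrative input: each patent is represented by its list of CPC codes.
patent_cpc_codes = [
    ["G06F16/35", "G06N3/08"],
    ["G06N3/08", "G06N20/00", "G06F40/30"],
    ["G06F16/35", "G06F40/30"],
]

# Nodes are codes; an edge links two codes that co-occur on the same patent.
graph = nx.Graph()
for codes in patent_cpc_codes:
    for a, b in combinations(sorted(set(codes)), 2):
        graph.add_edge(a, b)

# karateclub expects consecutive integer node labels starting at 0.
code_to_idx = {code: i for i, code in enumerate(graph.nodes())}
indexed_graph = nx.relabel_nodes(graph, code_to_idx)

# diffusion_cover must not exceed the number of nodes in this toy graph;
# dimensions=128 is an assumed hyperparameter, not the paper's setting.
model = Diff2Vec(diffusion_cover=indexed_graph.number_of_nodes(), dimensions=128)
model.fit(indexed_graph)
embeddings = model.get_embedding()  # shape: (num_codes, 128)
```

A patent can then be represented, for example, by averaging the embeddings of its assigned codes before the vector is passed on to the classifier.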

Because our aim is also to contribute resources to machine-learning-based patent classification research, we additionally propose benchmark datasets for classification in patent landscaping. Owing to issues such as high cost and data security, benchmark datasets for patent landscaping are usually closed or unavailable. The proposed benchmark datasets are based on the patent trend reports of the Korea Intellectual Property Strategy Agency (KISTA), which were written by human experts, such as patent attorneys. We built benchmark datasets from Google BigQuery by using keyword queries and target patents from the KISTA patent trend reports, as filtered by experts. The experimental results indicate that the proposed model (using the proposed benchmark datasets) outperforms other traditional classification models, and the average classification accuracy for each dataset can be improved by approximately 15%.

In building the patent classification model, we started with two major research objectives. First, we wanted to explore the appropriate features of a patent for achieving high performance in patent classification. There are various types of features in patent documents. Examples include text data, such as the patent abstract and title, and bibliographic information, such as the author of a patent document. We assume that, among these data, the abstract contains the most basic information regarding the patent and that the predefined patent technology classification codes are the most appropriate features. This approach has been used in many existing studies on machine-learning-based patent classification, and such features are generally those most frequently used by patent attorneys in patent landscaping tasks (Park, Yoon, 2017, Wu, Chang, Tsao, Fan, 2016, Zhang, Shang, Huang, Porter, Zhang, Lu, Zhu, 2016).

Second, our research investigates how features should be represented and aligned with the deep learning model. A characteristic of deep learning is that, given a large amount of data, the model itself identifies the patterns in the data and produces the most appropriate results. For text, there are various data representation techniques for deep learning, such as Word2Vec, Doc2Vec, and transformers. For technical classification codes, international patent classification (IPC), cooperative patent classification (CPC), and United States Patent Classification (USPC) codes have been developed, and it is necessary to review which of these codes is most appropriate for patent classification. We take an experimental approach to finding the optimal text representation technique and technology classification code for patent landscaping.

The remainder of this paper is organized as follows. In Section 2, we describe relevant studies and patent landscaping. Section 3 presents the process of establishing a dataset for patent landscaping, which is one of the key contributions of this study. In Section 4, we propose a deep patent landscaping model, and in Section 5, we describe the results of the experiments based on this model. Finally, Section 6 provides some concluding remarks regarding this research and possible future areas of study.

Section snippets

Deep learning for patent document classification

In this study, we leverage deep-learning-based models for patent document classification. In general, deep-learning-based document classification models differ in their methods depending on the type of document. The methods used in this study largely utilize technology codes, which are representative features of patents, together with text and metadata. Our approach is inspired by existing studies on machine-learning-based patent classification.

The most traditional method of machine-learning-based

KISTA datasets for patent landscaping

First, we built benchmarking datasets using the patent trend reports of the Korea Intellectual Property Strategy Agency (KISTA). KISTA is an affiliated organization of the Korean Intellectual Property Office and is in charge of patent analysis. KISTA’s research reports are prepared with the participation of many technical experts and are shared and used nationwide. A detailed flowchart is shown in Figure 2.
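As an illustration of this construction step, the sketch below pulls candidate patents for one topic from the Google Patents public dataset on BigQuery; the keyword filter stands in for a KISTA search formula, and the table and field names should be checked against the current public schema, so every identifier here is an assumption rather than the exact query used for the reports.

```python
# Illustrative sketch (assumed details): retrieve candidate patents from the
# public Google Patents dataset on BigQuery using a keyword-based filter.
from google.cloud import bigquery

client = bigquery.Client()

query = """
SELECT
  publication_number,
  (SELECT text FROM UNNEST(abstract_localized)
   WHERE language = 'en' LIMIT 1) AS abstract,
  ARRAY(SELECT code FROM UNNEST(cpc)) AS cpc_codes
FROM `patents-public-data.patents.publications`
WHERE EXISTS (
  SELECT 1 FROM UNNEST(abstract_localized) a
  WHERE a.language = 'en'
    -- hypothetical keywords standing in for a KISTA search formula
    AND REGEXP_CONTAINS(LOWER(a.text), r'millimeter wave|mmwave')
)
LIMIT 10000
"""

candidates = client.query(query).to_dataframe()
# Candidates are then labeled as target / non-target patents by matching them
# against the patent numbers listed in the corresponding KISTA report.
```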

Model overview

Our proposed deep patent landscaping model is composed of two parts, as shown in Figure 3, i.e., a transformer encoder (Vaswani et al., 2017a) and a graph embedding process using a diffusion graph called Diff2Vec (Rozemberczki and Sarkar, 2018). The model contains a concatenation layer of embedding vectors and stacked neural network layers to classify the target patents. In this respect, a patent is a scientific document that contains textual data and metadata (i.e., fields with bibliometric
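A minimal PyTorch sketch of this two-branch design, written under our own assumptions about dimensions, pooling, and layer counts (the paper’s exact configuration may differ), is shown below: a transformer encoder embeds the abstract tokens, the precomputed Diff2Vec vector of the patent’s classification codes enters as a fixed-length input, and the concatenated representation passes through stacked dense layers to score a patent as target or non-target.

```python
# Minimal sketch (assumed dimensions and pooling): transformer encoder over
# abstract tokens + precomputed graph embedding, concatenated and classified.
import torch
import torch.nn as nn


class DeepPatentLandscaper(nn.Module):
    def __init__(self, vocab_size, d_model=256, graph_dim=128, n_heads=4, n_layers=2):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model, padding_idx=0)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.classifier = nn.Sequential(
            nn.Linear(d_model + graph_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 1),  # one logit: target vs. non-target patent
        )

    def forward(self, abstract_tokens, graph_embedding):
        # abstract_tokens: (batch, seq_len) token ids of the patent abstract
        # graph_embedding: (batch, graph_dim) Diff2Vec vector of the patent's codes
        pad_mask = abstract_tokens.eq(0)
        hidden = self.encoder(self.token_emb(abstract_tokens),
                              src_key_padding_mask=pad_mask)
        # Mean-pool over non-padding positions (pooling choice is an assumption).
        hidden = hidden.masked_fill(pad_mask.unsqueeze(-1), 0.0).sum(dim=1)
        hidden = hidden / (~pad_mask).sum(dim=1, keepdim=True).clamp(min=1)
        return self.classifier(torch.cat([hidden, graph_embedding], dim=-1))


# Training would pair the logits with BCEWithLogitsLoss against 0/1 target labels.
```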

Dataset

We measured the performance of the proposed model for classifying target patents within the four KISTA datasets described in Table 6. More than half of the datasets had over one million documents. These large datasets may contain not only search formula keywords but also noisy patents (which are outside the domain). Moreover, extracting embeddings from these datasets and using them for model training requires significant computing resources. Thus, we used high-frequency CPC codes for heuristic
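Although the snippet above is cut off, the filtering idea can be sketched as follows: count the CPC codes attached to the expert-labeled target patents and keep only candidate documents that carry at least one of the most frequent codes, shrinking the noisy candidate pool before embedding and training. The data layout and threshold below are assumptions for illustration, not the exact heuristic used on the KISTA datasets.

```python
# Illustrative heuristic (assumed details): keep only candidate patents that
# share at least one high-frequency CPC code with the labeled target patents.
from collections import Counter


def filter_by_frequent_cpc(target_patents, candidate_patents, top_k=50):
    """Each patent is a dict with a 'cpc_codes' list; top_k is an assumption."""
    counts = Counter(code for p in target_patents for code in p["cpc_codes"])
    frequent = {code for code, _ in counts.most_common(top_k)}
    return [p for p in candidate_patents if frequent.intersection(p["cpc_codes"])]
```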

Overall results

For each patent, our model considers two sets of features: metadata for the Diff2Vec architecture and text data for the transformer architecture. We experimented with our proposed model to determine the effect of each feature on the classification performance. For the metadata, we identified how CPC, IPC, and USPC codes affect the performance. IPC is an internationally unified patent classification system with five hierarchical levels and approximately 70,000 codes. USPC is a US patent classification
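To make the hierarchy concrete, the helper below decomposes a CPC/IPC symbol into its five levels (section, class, subclass, main group, subgroup); this is a simplified sketch of the official symbol format, not a complete parser.

```python
# Simplified sketch: split a CPC/IPC symbol such as "G06F 16/35" into its
# hierarchical levels (section, class, subclass, main group, subgroup).
import re


def parse_classification_code(code):
    match = re.match(r"([A-HY])(\d{2})([A-Z])\s*(\d+)/(\d+)", code)
    if not match:
        raise ValueError(f"Unrecognized code format: {code}")
    section, cls, subclass, main_group, subgroup = match.groups()
    return {
        "section": section,                                              # e.g. "G"
        "class": f"{section}{cls}",                                      # e.g. "G06"
        "subclass": f"{section}{cls}{subclass}",                         # e.g. "G06F"
        "main_group": f"{section}{cls}{subclass} {main_group}",          # "G06F 16"
        "subgroup": f"{section}{cls}{subclass} {main_group}/{subgroup}", # "G06F 16/35"
    }
```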

Discussion

The following lessons were learned from the experimental results of the patent classification model.

  • Patent documents comprise large amounts of scholarly data that contain metadata and text data. It was found that classifying patent documents using both sets of features is important for providing better classification performance than when using an individual feature alone.

  • Technology codes play a vital role in patent document classification for patent landscaping. This may be because technology

Conclusion

In this paper, we proposed a deep patent landscaping model that addresses the classification problem in patent landscaping using transformer and Diff2Vec structures. Our study contributes to research on patent landscaping in three respects. First, we introduce a new benchmarking dataset for patent landscaping with deep learning models and provide a practical study of deep patent landscaping. Second, our model shows high overall classification performance in patent landscaping compared with existing models. Finally,

Acknowledgement

This work was supported by National Research Foundation of Korea (NRF) grants funded by the Korean government (No. NRF-2015R1C1A1A01056185 and No. NRF-2018R1D1A1B07045825). We are very grateful to Dr. Min and Dr. Kim of southern Gyeonggi-do, Korea, who provided us with a great deal of inspiration and encouragement in writing this paper.


References (59)

  • A. Abood et al., Automated patent landscaping, Artif. Intell. Law (2018)
  • D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, 3rd International Conference on Learning Representations, ICLR 2015 (2015)
  • I. Beltagy et al., Longformer: The long-document transformer, arXiv:2004.05150 (2020)
  • C.L. Benson et al., Technology structural implications from the extension of a patent search method, Scientometrics (2015)
  • T. Bubela et al., Patent landscaping for life sciences innovation: toward consistent and transparent practices, Nature Biotechnol. (2013)
  • Y.-L. Chen et al., A three-phase method for patent classification, Inf. Process. Manag. (2012)
  • Y.-L. Chen et al., An IPC-based vector space model for patent retrieval, Inf. Process. Manag. (2011)
  • J. Devlin et al., BERT: Pre-training of deep bidirectional transformers for language understanding, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) (2019)
  • J.C. Gomez, Analysis of the effect of data properties in automated patent classification, Scientometrics (2019)
  • M.F. Grawe et al., Automated patent classification using word embedding, 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA) (2017)
  • A. Grover et al., Node2vec: Scalable feature learning for networks, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2016)
  • G.E. Hinton et al., Reducing the dimensionality of data with neural networks, Science (2006)
  • S. Hochreiter et al., Long short-term memory, Neural Comput. (1997)
  • C. Jeong et al., A context-aware citation recommendation model with BERT and graph convolutional networks, Scientometrics (2020)
  • S. Jiang et al., Deriving design feature vectors for patent images using convolutional neural networks, J. Mech. Des. (2021)
  • J. Kim et al., Patent document clustering with deep embeddings, Scientometrics (2020)
  • J.-M. Kim et al., Patent data analysis using functional count data model, Soft Comput. (2019)
  • D.P. Kingma et al., Adam: A method for stochastic optimization, 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings (2015)
  • R. Krestel et al., A survey on deep learning for patent analysis, World Patent Inf. (2021)

    Seokkyu Choi was born in Seoul, South Korea. He received his B.S. degree and master’s degree in industrial management engineering at Gachon University. Also, he was a researcher at TEAMLAB. He is interested in classification and regression problems with semi-structured data and natural language processing. He is currently interested in multimodal learning and semi-supervised learning with real-world data. He is an AI research scientist at GenesisLab, which develops interactive AI services.

Hyeonju Lee received her B.S. and master’s degrees in industrial engineering from Gachon University. She studied natural language processing and knowledge base construction. She is working at Daewoo Shipbuilding & Marine Engineering as an AI research engineer. She is interested in time-series analysis and shipbuilding AI.

Eunjeong Park received her M.S. and Ph.D. degrees in data mining from Seoul National University in 2011 and 2016, respectively, where she pursued various studies on text mining in the fields of manufacturing, politics, multimedia, and marketing. After her studies, she joined Naver, a South Korea-based search company, in 2016, and is currently working on machine translation. Her research interests include multilingual text mining and evaluation.

    Sungchul Choi received his B.S. degree in management from Handong University and his Ph.D. in industrial management engineering at POSTECH, Korea, in 2006 and 2012, respectively. Currently, he is an Associate Professor in Industrial Data Science and Engineering, Department of Industrial and Data Engineering at Pukyong National University, Busan, and a director of TEAMLAB. After his Ph.D., he worked in the CTO office at Samsung Advanced Technology as a technology strategy manager. He has researched patent analysis, technology roadmapping, and planning using text mining approaches. Recently, he has been working on applying machine learning and deep learning approaches to scholarly big data and has been conducting various studies related to NLP and artificial intelligence.
