Multitask Recalibrated Aggregation Network for Medical Code Prediction

Sun, Wei; Ji, Shaoxiong; Cambria, Erik; Marttinen, Pekka

doi:10.1007/978-3-030-86514-6_23

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12978)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

Medical coding translates professionally written medical reports into standardized codes, which is an essential part of medical information systems and health insurance reimbursement. Manual coding by trained human coders is time-consuming and error-prone. Thus, automated coding algorithms have been developed, building especially on the recent advances in machine learning and deep neural networks. To solve the challenges of encoding lengthy and noisy clinical documents and capturing code associations, we propose a multitask recalibrated aggregation network. In particular, multitask learning shares information across different coding schemes and captures the dependencies between different medical codes. Feature recalibration and aggregation in shared modules enhance representation learning for lengthy notes. Experiments with a real-world MIMIC-III dataset show significantly improved predictive performance.

1 Introduction

Clinical notes generated by clinicians contain rich information about patients’ diagnoses and treatment procedures. Healthcare institutions digitize these clinical texts into Electronic Health Records (EHRs), together with other structured medical and treatment histories of patients, for clinical data management, health condition tracking and automation. To facilitate information management, clinical notes are usually annotated with standardized statistical codes. Different diagnosis classification systems utilize various medical coding systems. One of the most widely used coding systems is the International Classification of Diseases (ICD) maintained by the World Health Organization. The ICD system is used to transform diseases, symptoms, signs, and treatment procedures into standard medical codes and has been widely used for clinical data analysis, automated medical decision support [8], and medical insurance reimbursement [24]. The latest ICD version is ICD-11, which will become effective in 2022, while older versions such as ICD-9, ICD-9-CM, and ICD-10 are also concurrently used. Other popular medical condition classification tools include the Clinical Classifications Software (CCS) and Hierarchical Condition Category (HCC) coding.

This paper primarily studies the ICD and CCS coding systems because of their respective popularity and simplicity. CCS codes, maintained by the Healthcare Cost and Utilization Project (HCUP), provide medical workers, insurance companies, and researchers with an easy-to-understand coding scheme of diagnoses and processes. On the other hand, the ICD coding system provides a comprehensive classification tool for diseases and related health problems. Nonetheless, the CCS and ICD codes have a one-to-many relationship that enables the CCS software to convert ICD codes into CCS codes with a smaller label space at different levels. For instance, in Fig. 1, the ICD-CCS mapping scheme converts “921.3” (“Contusion of eyeball”) and “918.1” (“Superficial injury of cornea”) to the same CCS code “239”, which represents “Superficial injury; contusion”. The CCS code “239” thus establishes a connection between two different ICD codes.
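The many-to-one direction of the ICD-CCS mapping can be illustrated with a small lookup table. This is only a sketch: the two entries are the ones quoted above, while the names `ICD_TO_CCS` and `icd_to_ccs_labels` are hypothetical and do not come from the HCUP software.

```python
# Toy ICD-9 -> CCS lookup; only the two entries quoted in the text are included.
ICD_TO_CCS = {
    "921.3": "239",  # Contusion of eyeball -> Superficial injury; contusion
    "918.1": "239",  # Superficial injury of cornea -> Superficial injury; contusion
}


def icd_to_ccs_labels(icd_codes):
    """Map a document's ICD codes to the sorted set of CCS codes they fall under.

    Codes missing from the toy table are skipped (an assumption of this sketch).
    """
    return sorted({ICD_TO_CCS[c] for c in icd_codes if c in ICD_TO_CCS})


# Two distinct ICD codes collapse into a single CCS label.
print(icd_to_ccs_labels(["921.3", "918.1"]))
```

Because several ICD codes share one CCS code, the CCS label space is smaller than the ICD label space.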

Medical codes concisely summarize useful information from vast amounts of inpatient discharge summaries, and have high medical and commercial value. They are consequently of interest for both medical institutions and health insurance companies. For example, major insurance companies use standard medical codes in their insurance claim business [4]. Professional coders do the medical coding task by annotating clinical texts with corresponding medical codes. Since manual coding is error-prone and labor-consuming [23], automated coding is needed. Taking the ICD coding as an example, many publications have proposed automated coding approaches, including feature engineering-based machine learning methods [15, 25] and deep learning methods [5, 17, 22].

However, the automated medical coding task is still challenging, as reflected in the following two aspects. Clinical notes contain noisy information, such as spelling errors, irrelevant information, and incorrect wording, which may have an adverse impact on representation learning, increasing the difficulty of medical coding. Also, it is a challenge to benefit from the relationship between different medical codes, especially when the label space is high-dimensional. Existing automatic ICD coding models, such as CAML [22] and MultiResCNN [17], have limited performance because they do not consider the relationship between ICD codes. In the medical ontology, there exist certain connections between different concepts. For example, in the ICD coding system, “921.3” and “918.1”, representing “Contusion of eyeball” and “Superficial injury of cornea”, respectively, belong to “Superficial injury; contusion”. Medical coding models may underperform if they cannot effectively capture the relationships between medical codes. For example, in Fig. 1, the highlighted area in a clinical document is converted into corresponding medical codes, including ICD codes and CCS codes.

In this paper, we propose a novel framework called MT-RAM, which combines MultiTask (MT) learning with a Recalibrated Aggregation Module (RAM) for medical code prediction. In particular, the RAM improves the quality of representation learning of clinical documents, by injecting rich contextual information and performing nested convolutions, thereby solving the challenge of encoding noisy and lengthy clinical notes. In multitask training, we consider the joint training on two tasks, ICD and CCS code prediction. MultiTask Learning (MTL) is inspired by human learning, where people often apply the knowledge from previous tasks to help with a new task [33]. It makes full use of the information contained in each task, shares information between related tasks through common parameters, and enhances training efficiency [6, 30]. In addition, MTL reduces over-fitting to specific tasks by regularizing the learned representation to be generalizable across tasks [18]. In the context of the two medical coding systems, CCS coding can promote the training on the ICD codes; further, the CCS codes can inform about the relationship between the ICD codes, thereby improving model performance.

Our contributions fall into the following four aspects.

To the best of our knowledge, this paper is the first to adopt multitask learning for medical code prediction and demonstrate the benefits of leveraging multiple coding schemes.
We design a recalibrated aggregation module (RAM) to generate clinical document features with better quality and less noise.
We propose a novel framework called MT-RAM, which combines multitask learning, bidirectional GRU, RAM and label-aware attention mechanism.
Experimental results show competitive performance of our framework across different evaluation criteria on the standard real-world MIMIC-III database when compared with several strong baselines.

Our paper is organized as follows: Sect. 2 introduces related work; Sect. 3 describes the proposed model; Sect. 4 performs a series of comparison experiments, an ablation study and a detailed analysis of the properties of the RAM; finally, Sect. 5 provides concluding remarks.

2 Related Work

Automated Medical Coding. Automated medical coding is an essential and challenging task in medical information systems [25]. Healthcare institutes use different medical coding systems such as ICD, one of the most widely used coding schemes. The majority of early automated medical coding works use machine learning algorithms. Larkey and Croft [16] proposed an ICD code classifier with multiple models, including K-nearest neighbor, relevance feedback, and Bayesian independence classifiers. Perotte et al. [25] presented two ICD coding approaches: a flat and a hierarchy-based SVM classifier. The experiments showed that the hierarchical SVM model outperforms the flat SVM because it captures the hierarchical structure of ICD codes.

Neural networks have gained popularity for medical coding with the recent advances of deep learning techniques. Recurrent neural networks capture the sequential nature of medical text and have been applied by several studies such as the attention LSTM [26], the Hierarchical Attention Gated Recurrent Unit (HA-GRU) [2], and the multilayer attention-based bidirectional RNN [31]. Convolutional networks also play an important role in this field. Mullenbach et al. [22] proposed the Convolutional Attention network for Multi-Label classification (CAML). Li and Yu [17] utilized a Multi-Filter Residual Convolutional Neural Network (MultiResCNN), and Ji et al. [11] developed a dilated convolutional network. Fine-tuning pretrained language models, an emerging trend for NLP applications, has been reported to have limits in medical coding by several initial studies [11, 17] and by a comprehensive analysis of the pretraining domain and fine-tuning architectures [12].

Multitask Learning. Multitask learning is a machine learning paradigm that jointly trains multiple related tasks to improve the performance of each task and the generalization of the model. Multitask learning is widely used in various medical applications such as drug action extraction [34], biological image analysis [32] and clinical information extraction [3, 28]. In recent years, researchers have studied leveraging multitask learning strategies to better process medical notes. Malakouti et al. [20] jointly trained different diagnostic models to improve performance of each diagnostic task. This work implemented the parameter sharing between tasks by utilizing the bottom-up and top-down steps. This multitask learning framework improved the performance and the generalization ability of independently learned models. Si and Roberts [27] presented a CNN-based multitask learning network for inpatient mortality prediction task, which comprises some related tasks such as 0-day, 30-day, 1-year patient death prediction.

3 Method

This section describes the proposed Multi-Task Recalibrated Aggregation Network, referred to as MT-RAM, which combines the Multi-Task learning scheme and a Recalibrated Aggregation Module. The overall architecture of our MT-RAM network has five parts, as shown in Fig. 2. Firstly, we use word embeddings pretrained by word2vec [21] as the input. Secondly, we use a bidirectional gated recurrent unit (BiGRU) [7] layer to extract document representation features capturing sequential dependencies in clinical notes. Thirdly, a RAM is used to improve the quality of the feature matrix and the efficiency of training for the multitask objective. Fourthly, the attention classification layers, with two branches for ICD and CCS codes, are composed of a label-wise attention mechanism and linear classification layers. The last part combines the respective losses of the two classification heads and performs multitask training.

3.1 Input Layer

Denote a clinical document with n tokens as w = {$w_1, w_2, \dots , w_n$}. We utilize word2vec [21] to pretrain word embeddings on the clinical documents. The word embedding matrix, denoted as $\mathbf {X} = [\mathbf {x}_1, \mathbf {x}_2, \dots , \mathbf {x}_n]^{{\text {T}}}$, stacks the word vectors $\mathbf {x}_i\, \in \, \mathbb {R}^{d_e}$, where $d_e$ is the embedding dimension. Next, we feed the word embedding matrix $\mathbf {X}\, \in \, \mathbb {R}^{n \times d_e}$ into the BiGRU layer to extract document representation features.

3.2 Bidirectional GRU Layer

We use a bidirectional GRU layer to extract the contextual information from the word embeddings $\mathbf {X}$ of the input documents. We calculate the latent states of the GRUs on the i-th token $\mathbf {x}_i$:

$$\begin{aligned} \overrightarrow{\mathbf {h}_i}&= \overrightarrow{{\text {GRU}}}(\mathbf {x}_i, \overrightarrow{\mathbf {h}_{i-1}})\end{aligned}$$

(1)

$$\begin{aligned} \overleftarrow{\mathbf {h}_i}&= \overleftarrow{{\text {GRU}}}(\mathbf {x}_i, \overleftarrow{\mathbf {h}_{i+1}}) \end{aligned}$$

(2)

where $\overrightarrow{{\text {GRU}}}$ and $\overleftarrow{{\text {GRU}}}$ represent forward and backward GRUs, respectively. The final operation concatenates $\overrightarrow{\mathbf {h}_i}$ and $\overleftarrow{\mathbf {h}_i}$ into the hidden vector $\mathbf {h}_i$:

$$\begin{aligned} \mathbf {h}_i = {\text {Concat}}(\overrightarrow{\mathbf {h}_i}, \overleftarrow{\mathbf {h}_i}) \end{aligned}$$

(3)

The dimension of the forward or backward GRU is set to $d_r$. The bidirectional hidden vectors $\mathbf {h}_i\in \mathbb {R}^{2d_r}$ are stacked into the hidden representation matrix $\mathbf {H} = [\mathbf {h}_1, \mathbf {h}_2, \dots , \mathbf {h}_n]^{{\text {T}}} \in \mathbb {R}^{n \times 2d_r}$.
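The bidirectional pattern of Eqs. 1-3 can be sketched without any deep learning library. The code below is illustrative only: `toy_cell` is a hypothetical stand-in for a real GRU step (it has no gates or learned weights, and it assumes the input width equals $d_r$), and `bidirectional_encode` is a made-up helper name. Only the two scan directions and the per-token concatenation follow the equations.

```python
import math


def toy_cell(x, h_prev):
    """Stand-in for a GRU step (hypothetical): blends the input with the previous state."""
    return [math.tanh(0.5 * xi + 0.5 * hi) for xi, hi in zip(x, h_prev)]


def bidirectional_encode(X, cell, d_r):
    """Run `cell` left-to-right and right-to-left, then concatenate per token."""
    n = len(X)

    # Forward direction: the state at token i depends on the state at i - 1 (Eq. 1).
    fwd, h = [], [0.0] * d_r
    for i in range(n):
        h = cell(X[i], h)
        fwd.append(h)

    # Backward direction: the state at token i depends on the state at i + 1 (Eq. 2).
    bwd, h = [None] * n, [0.0] * d_r
    for i in reversed(range(n)):
        h = cell(X[i], h)
        bwd[i] = h

    # Concatenate both directions; every row of H has length 2 * d_r (Eq. 3).
    return [f + b for f, b in zip(fwd, bwd)]


X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 tokens, toy width 2
H = bidirectional_encode(X, toy_cell, d_r=2)
print(len(H), len(H[0]))
```

In practice the stand-in cell would be replaced by a trained GRU from a framework of choice; the shape of the result, $n \times 2d_r$, is what the later modules rely on.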

3.3 Recalibrated Aggregation Module

We propose a Recalibrated Aggregation Module (RAM) that abstracts features learned by the BiGRU, recalibrates the abstraction, aggregates the abstraction and the recalibrated features, and eventually combines the new representation with the original one. This way, the RAM can reduce the effect of noise in the clinical notes and lead to improved representations for medical code classification. In detail, the RAM leverages a nested convolution structure to extract and aggregate contextual information, which is used to recalibrate the noisy input features. In addition, through the convolutions, the RAM attains global receptive fields during feature extraction, which is complementary to the GRU-based recurrent structure described in Sect. 3.2. With these two characteristics, our RAM can improve the encoding of noisy and lengthy clinical notes. The RAM consists of feature aggregation and recalibration. The calculation flow of the RAM is shown in Fig. 3 and described below.

Firstly, the hidden representation $\mathbf {H}$ from the BiGRU layer passes through two down nodes to obtain matrices $\mathbf {A}$ and $\mathbf {A^\prime }$. This downsampling process can be denoted as:

$$\begin{aligned} \mathbf {A} = \bigwedge _{n=1}^{d_r} \Bigg \{ \mathbf {K}_2^{d_1} \left[ {\text {tanh}}\left( \bigwedge _{m=1}^{d_r} \left( \mathbf {K}_1^{d_1} \mathbf {H} \right) _m \right) \right] _n \Bigg \} \in \mathbb {R}^{n \times d_r}, \end{aligned}$$

(4)

where $\bigwedge _{m=1}^{d_r}$ represents dislocation addition, i.e., the second matrix is shifted by one unit to the right relative to the position of the first matrix, and the overlapped area is summed up. We repeat this operation until the last matrix and cut off unit vectors on both sides of the concatenated matrix. In Eq. 4, $\mathbf {K}_1^{d_1} \in \mathbb {R}^{2d_r \times k \times d_r}$ and $\mathbf {K}_2^{d_1} \in \mathbb {R}^{d_r \times k \times d_r}$ represent two convolutional kernel groups in the first down node and k is the kernel size. The second downsampled matrix $\mathbf {A^\prime } \in \mathbb {R}^{n \times \frac{d_r}{2}}$ can be obtained in a similar way with different convolutional kernel groups $\mathbf {K}_1^{d_2} \in \mathbb {R}^{d_r \times k \times \frac{d_r}{2}}$ and $\mathbf {K}_2^{d_2} \in \mathbb {R}^{\frac{d_r}{2} \times k \times \frac{d_r}{2}}$. Next, we use a lateral node with another two convolutional kernel groups $\mathbf {K}_1^{l} \in \mathbb {R}^{\frac{d_r}{2} \times k \times \frac{d_r}{2}}$ and $\mathbf {K}_2^{l} \in \mathbb {R}^{\frac{d_r}{2} \times k \times \frac{d_r}{2}}$, which have consistent in and out channel dimensions, to transform $\mathbf {A^\prime }$ into the lateral feature matrix $\mathbf {L} \in \mathbb {R}^{n \times \frac{d_r}{2}}$. We recover $\mathbf {L}$ with an up node and add the recovered signal element-wise to the first downsampled feature matrix $\mathbf {A}$ to obtain the primarily aggregated matrix $\mathbf {B} \in \mathbb {R}^{n \times d_r}$ as denoted in Eq. 5, where $\mathbf {K}_1^{u_1} \in \mathbb {R}^{\frac{d_r}{2} \times k \times d_r}$ and $\mathbf {K}_2^{u_1} \in \mathbb {R}^{d_r \times k \times d_r}$ represent deconvolutional kernel groups in the first up node.

$$\begin{aligned} \mathbf {B} = \mathbf {A} + \bigwedge _{n=1}^{d_r} \Bigg \{ \mathbf {K}_2^{u_1} \left[ {\text {tanh}}\left( \bigwedge _{m=1}^{d_r} \left( \mathbf {K}_1^{u_1} \mathbf {L} \right) _m \right) \right] _n \Bigg \} \end{aligned}$$

(5)
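One possible reading of the dislocation addition $\bigwedge$ is an overlap-add: each block is shifted one step to the right of the previous one, overlapping entries are summed, and the borders are trimmed so that the output length matches the number of blocks. The sketch below works on scalars for a single channel and assumes an odd kernel size k; the function name `dislocation_add` is hypothetical, and the operation in the paper acts on matrices rather than on lists of numbers.

```python
def dislocation_add(blocks, k):
    """Overlap-add n blocks of k values, each shifted one step right of the previous,
    then trim (k - 1) // 2 entries from each side so the output has length n."""
    n = len(blocks)
    full = [0.0] * (n + k - 1)
    for i, block in enumerate(blocks):
        for j in range(k):
            # Block i starts at offset i; overlapping positions accumulate.
            full[i + j] += block[j]
    pad = (k - 1) // 2
    return full[pad:pad + n]


# Three blocks with kernel size 3: the untrimmed sum has 5 entries,
# and one entry is cut from each side.
blocks = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]
print(dislocation_add(blocks, k=3))
```

Under this reading the trimming plays the role of "same" padding, which is consistent with all intermediate matrices keeping n rows.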

Secondly, we perform upsampling operations on the aggregated feature matrix $\mathbf {B}$ to obtain the weight matrix $\mathbf {O} \in \mathbb {R}^{n \times 2d_r}$, as illustrated in Fig. 4. Specifically, we leverage a deconvolution kernel group $\mathbf {K}_1^{u_2} \in \mathbb {R}^{d_r \times k \times 2d_r}$ to obtain the intermediate representation $\mathbf {T} \in \mathbb {R}^{n \times 2d_r}$. The different colors of $\mathbf {K}_1^{u_2}$ in Fig. 4 correspond to how the different parts of the matrix $\mathbf {T^\prime } \in \mathbb {R}^{n \times k \times 2d_r}$ are calculated. This process is denoted as:

$$\begin{aligned} \mathbf {T} = \bigwedge _{m=1}^{2d_r} \mathbf {T}_m^\prime = \bigwedge _{m=1}^{2d_r} (\mathbf {B} \mathbf {K}_1^{u_2})_m. \end{aligned}$$

(6)

We adopt a deconvolution operation on the intermediate representation $\mathbf {T}$ to get the weight matrix $\mathbf {O} \in \mathbb {R}^{n \times 2d_r}$, denoted as:

$$\begin{aligned} \mathbf {O} = \bigwedge _{n=1}^{2d_r} \mathbf {O}_n^\prime = \bigwedge _{n=1}^{2d_r} \left( {\text {tanh}} \left( \mathbf {T}\right) \mathbf {K}_2^{u_2}\right) _n, \end{aligned}$$

(7)

where $\mathbf {K}_2^{u_2}$ represents the deconvolution kernel group.

Finally, we employ the feature recalibration in a way similar to the attention mechanism, where the “attention” score is learned by an iterative procedure with convolutional feature abstraction (Eq. 4) and de-convolutional feature excitation (Eq. 7). Specifically, we multiply the input feature matrix $\mathbf {H}$ by the weight matrix $\mathbf {O}$ to obtain the recalibrated feature matrix $\mathbf {H^\prime } \in \mathbb {R}^{n \times 2d_r}$, denoted as:

$$\begin{aligned} \mathbf {H^\prime } = {\text {tanh}} \left( \mathbf {O} \odot \mathbf {H}\right) , \end{aligned}$$

(8)

where “$\odot $” represents element-wise multiplication. The recalibration operation enhances the original features with contextual information injection through the weight matrix $\mathbf {O}$, which comprises rich semantic information that is consequently less sensitive to errors. It enables the RAM module to have improved generalization ability and, in the end, improved performance in medical coding.
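Eq. 8 is a plain element-wise gate and can be written out directly. The sketch below uses nested Python lists in place of tensors; the helper name `recalibrate` is hypothetical, and the weight matrix is passed in as given rather than computed by the convolutional nodes.

```python
import math


def recalibrate(H, O):
    """Element-wise gate: H'[i][j] = tanh(O[i][j] * H[i][j]) (Eq. 8)."""
    # Both matrices must share the same n x 2*d_r shape.
    assert len(H) == len(O) and all(len(h) == len(o) for h, o in zip(H, O))
    return [[math.tanh(o * h) for h, o in zip(h_row, o_row)]
            for h_row, o_row in zip(H, O)]


H = [[1.0, -2.0], [0.5, 0.0]]
O = [[0.0, 1.0], [2.0, 3.0]]   # a zero weight suppresses the feature entirely
print(recalibrate(H, O))
```

A weight of zero removes a feature, while larger weights push the output towards the saturation values of tanh, which is the sense in which $\mathbf {O}$ rescales the original features.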

3.4 Attention Classification Layers

Features extracted by lower layers in shared modules are label-agnostic. The Recalibrated Aggregation Module inherits the capacity of learning label-specific features from the Squeeze-and-Excitation block [10] to some extent. In order to make different positions of clinical notes correspond to different medical codes, we develop the label attention for classification layers to reorganize the characteristic information related to medical codes and enhance label specifications. Working together with the RAM module and label attention mechanism, our model can achieve label-aware representation learning, which is helpful for multitask heads as described in the next section (Sect. 3.5).

The attention classification layers are described as follows. We take a subscript d to denote a type of medical code. It can be generalized into different coding systems. Specifically, d represents the ICD code in our paper. For simplicity, the bias term is omitted. The attention scores for the medical codes $\mathbf {A}_d \in \mathbb {R}^{n \times m_d}$ can be calculated as:

$$\begin{aligned} \mathbf {A}_d = {\text {Softmax}}(\mathbf {H^\prime }\mathbf {U}_d) \end{aligned}$$

(9)

where $\mathbf {H^\prime }$ is the document features extracted by the RAM block, $\mathbf {U}_d \in \mathbb {R}^{d_r \times m_d}$ represents the parameter matrix of query in the attention mechanism, and $m_d$ denotes the number of target medical codes. The attentive document features $\mathbf {V}_d \in \mathbb {R}^{d_r \times m_d}$ can be obtained by:

$$\begin{aligned} \mathbf {V}_d = \mathbf {A}_d^{{\text {T}}} \mathbf {H^\prime } \end{aligned}$$

(10)

The label-wise attention mechanism captures the selective information contained in the document encoding $\mathbf {H^\prime }$ and the query matrix $\mathbf {U}_d$ determines what information in the encoding matrix to prioritize.
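Eqs. 9 and 10 can be sketched in a few lines of plain Python. Two assumptions are made here that the text does not state: the softmax is normalised over the n tokens separately for each label, and the feature width is a generic value d. The product $\mathbf {A}_d^{{\text {T}}} \mathbf {H^\prime }$ is returned exactly as written in Eq. 10, i.e. with one row per label. The function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(H, U):
    """H: n x d token features; U: d x m label queries.

    Returns (A, V) with A: n x m attention scores and
    V: m x d label-specific document features (A^T H).
    """
    n, d, m = len(H), len(U), len(U[0])
    scores = [[sum(H[i][c] * U[c][j] for c in range(d)) for j in range(m)]
              for i in range(n)]
    # Assumption: normalise over the n tokens separately for each label.
    cols = [softmax([scores[i][j] for i in range(n)]) for j in range(m)]
    A = [[cols[j][i] for j in range(m)] for i in range(n)]
    V = [[sum(A[i][j] * H[i][c] for i in range(n)) for c in range(d)]
         for j in range(m)]
    return A, V


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 tokens, d = 2
U = [[2.0, 0.0], [0.0, 2.0]]               # m = 2 labels
A, V = label_attention(H, U)
print(A)
print(V)
```

Each label obtains its own weighted average of the token features, so different codes can attend to different positions of the same note.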

Then, we use a fully-connected layer followed by max pooling as a classifier, which applies an affine transformation to the attentive features to obtain the score vector $\mathbf {Y}_d \in \mathbb {R}^{m_d \times 1}$, denoted as:

$$\begin{aligned} \mathbf {Y}_d = {\text {Pooling}}(\mathbf {W}_d\mathbf {V}_d^{{\text {T}}}) \end{aligned}$$

(11)

where $\mathbf {W}_d \in \mathbb {R}^{m_d \times m_d}$ represents the linear weight of the score vector. We use the Sigmoid activation function to produce the probabilities $\mathbf {\bar{y}}_d$ for final prediction.

3.5 Multitask Training

We introduce two self-contained tasks for multitask learning, i.e., ICD and CCS code prediction. The two medical coding branch tasks enter different coding processes and back-propagate the ICD code loss and CCS code loss, respectively. The structure of the two coding processing branches is similar. By passing the encoded features of clinical notes through the label attention module, we can get the weighted document features of the ICD code $\mathbf {V}_d \in \mathbb {R}^{d_r \times m_d}$ and the CCS code $\mathbf {V}_s \in \mathbb {R}^{d_r \times m_s}$, where $m_d$ and $m_s$ are the numbers of ICD and CCS codes, respectively. With the linear classifier layer, the predicted probabilities of ICD and CCS codes are generated as $\mathbf {\bar{y}}_d$ and $\mathbf {\bar{y}}_s $.

The medical code assignment is a typical multi-label classification task. We use the binary cross entropy loss as the loss function of each sub-task in the multitask setting. The ICD coding loss and CCS coding loss are denoted as:

$$\begin{aligned} \mathcal {L}_d = \sum _{i=1}^{m_d} \Big [-y_{d_i}\log (\bar{y}_{d_i}) - (1 - y_{d_i})\log (1 - \bar{y}_{d_i}) \Big ]\end{aligned}$$

(12)

$$\begin{aligned} \mathcal {L}_s = \sum _{i=1}^{m_s} \Big [-y_{s_i}\log (\bar{y}_{s_i}) - (1 - y_{s_i})\log (1 - \bar{y}_{s_i}) \Big ] \end{aligned}$$

(13)

where $y_{d_i}, y_{s_i} \in \{0,1\}$ are the target medical code labels, $\bar{y}_{d_i}$ and $\bar{y}_{s_i}$ represent the predicted probabilities of ICD and CCS codes, and the numbers of ICD and CCS codes are denoted as $m_d$ and $m_s$, respectively. We adopt joint training for the two medical coding losses to facilitate multitask learning. The joint training loss is defined as

$$\begin{aligned} \mathcal {L}_M = \lambda _d \mathcal {L}_d + \lambda _s \mathcal {L}_s, \end{aligned}$$

(14)

where $\lambda _d$ and $\lambda _s$ are scaling factors of ICD and CCS codes.
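Eqs. 12-14 translate directly into code. The sketch below computes the two binary cross-entropy terms and their weighted sum for a single document. The function names are hypothetical, the clamping constant `eps` is an addition for numerical safety that is not part of the equations, and the default weights 0.7 and 0.3 are the values reported later in Sect. 4.2.

```python
import math


def bce(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy summed over labels (Eqs. 12-13)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)   # clamp to avoid log(0); not part of the equations
        total += -y * math.log(p) - (1 - y) * math.log(1 - p)
    return total


def joint_loss(y_icd, p_icd, y_ccs, p_ccs, lam_d=0.7, lam_s=0.3):
    """Weighted sum of the ICD and CCS losses (Eq. 14)."""
    return lam_d * bce(y_icd, p_icd) + lam_s * bce(y_ccs, p_ccs)


# One document with three ICD labels and two CCS labels (toy values).
loss = joint_loss([1, 0, 1], [0.9, 0.2, 0.6], [1, 0], [0.8, 0.1])
print(round(loss, 4))
```

Since both heads read the same shared encoder output, the gradient of this single scalar updates the shared layers with information from both coding schemes.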

4 Experiments

We perform a series of experiments to validate the effectiveness of our proposed model on public real-world datasets. Source code is available at https://github.com/VRCMF/MT-RAM.

4.1 Datasets

MIMIC-III (ICD). The third version of Medical Information Mart for Intensive Care (MIMIC-III) is a large, open-access dataset consisting of clinical data associated with more than 40,000 inpatients in critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012 [13]. Following Mullenbach et al. [22] and Li and Yu [17], we segment all discharge summary documents based on the patient IDs and select the 50 most frequent ICD codes for experiments. We refer to the MIMIC-III dataset with the top 50 ICD codes as the MIMIC-III ICD dataset. There are 8,067 discharge summaries for training, and 1,574 and 1,730 documents for validation and testing, respectively.

MIMIC-III (CCS). We utilize the ICD-CCS mapping scheme, provided by the HCUP, to convert the ICD codes and obtain the dataset with CCS codes. The converted CCS dataset is denoted as MIMIC-III CCS, which contains 38 frequent CCS labels. Because the MIMIC-III ICD dataset shares the discharge summary documents with the CCS dataset, the documents used for CCS code training, validation and testing are consistent with the ICD code documents. We change several conflicting mapping items so that ICD and CCS codes can achieve one-to-one matching. The converted CCS codes are then used as the labels of discharge summaries.

4.2 Settings

Data Preprocessing. Following the processing flow of CAML [22], non-alphabetic tokens, such as punctuation and numbers, are removed from the clinical text. All tokens are transformed into lowercase, and we replace low-frequency tokens (appearing in fewer than three documents) with the ‘UNK’ token. We train word2vec [21] on all discharge summaries to obtain the word embeddings. The maximum length of each document is limited to 2,500, i.e., documents longer than this length are truncated. The kernel size of the convolution layers in the RAM module is 3.
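The preprocessing steps can be sketched with the Python standard library. This is an approximation written for illustration, not the CAML script itself: the regular expression used for tokenisation and the helper names are assumptions, while the document-frequency threshold of three and the length limit of 2,500 follow the description above.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase, split on non-word characters, and drop tokens without any letter."""
    return [t for t in re.findall(r"\w+", text.lower())
            if any(ch.isalpha() for ch in t)]


def build_vocab(docs_tokens, min_doc_freq=3):
    """Keep tokens that occur in at least `min_doc_freq` documents."""
    df = Counter()
    for toks in docs_tokens:
        df.update(set(toks))          # count each token once per document
    return {t for t, c in df.items() if c >= min_doc_freq}


def preprocess(texts, min_doc_freq=3, max_len=2500):
    """Tokenise, replace rare tokens with 'UNK', and truncate long documents."""
    docs = [tokenize(t) for t in texts]
    vocab = build_vocab(docs, min_doc_freq)
    return [[t if t in vocab else "UNK" for t in toks][:max_len] for toks in docs]


notes = ["Pain, pain 12 mg", "pain relief", "PAIN noted"]
print(preprocess(notes))
```

In this toy run only "pain" reaches the document-frequency threshold, so every other surviving token is mapped to 'UNK' and the number "12" is dropped.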

Evaluation Metrics. To evaluate the performance of models on the CCS and ICD code (collectively called medical code) datasets, we follow the evaluation protocols of previous works [17, 22]. We utilize micro-averaged and macro-averaged F1, micro-averaged and macro-averaged AUC (area under the receiver operating characteristic curve), and precision at k as the evaluation metrics. Precision at k (‘P@k’ in shorthand) is the proportion of the k highest-scored labels that appear in the ground truth labels. When calculating micro-averaged scores, each pair of clinical text and medical code is treated as a separate prediction. When computing macro-averaged metrics, we calculate the scores for each medical code and take their average. We run our model ten times and report the mean and standard deviation of all the metrics.
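The difference between micro- and macro-averaging, and the definition of P@k, can be made concrete with a short sketch. The function names are hypothetical. The macro score here is the plain average of per-code F1 values, following the description above, so reference evaluation scripts that average precision and recall first may give slightly different numbers.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of 0/1 rows, one row per document, one column per code."""
    m = len(y_true[0])
    counts = []
    for j in range(m):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1(*c) for c in counts) / m                      # average per-code F1
    micro = f1(*(sum(c[i] for c in counts) for i in range(3)))   # pool all decisions
    return micro, macro


def precision_at_k(y_true_row, scores_row, k):
    """Fraction of the k highest-scored codes that are true codes of the document."""
    top = sorted(range(len(scores_row)), key=lambda j: scores_row[j], reverse=True)[:k]
    return sum(y_true_row[j] for j in top) / k


micro, macro = micro_macro_f1([[1, 0], [1, 1]], [[1, 1], [0, 1]])
print(micro, macro)
print(precision_at_k([1, 0, 1, 0], [0.9, 0.8, 0.1, 0.7], k=2))
```

Micro-averaging weights every document-code decision equally and is therefore dominated by frequent codes, whereas macro-averaging gives each code the same weight regardless of its frequency.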

Hyper-parameter Tuning. We refer to the previous works [17, 22] and apply some common hyper-parameter settings. Specifically, we set the word embedding dimension to 100, the maximum document length to 2,500, the dropout rate to 0.2, the batch size to 16, and the dimension of hidden units to 300. We choose a learning rate of 0.008, which achieves good model performance with a moderate convergence time. We set the scaling factors $\lambda _d$ and $\lambda _s$ to 0.7 and 0.3, respectively. We use different optimizers to train our model, including Adam [14], AdamW [19] and SGD+momentum [29]. Although the AdamW optimizer can shorten the training time, its predictive performance is not as good as that of Adam. The performance of SGD+momentum and Adam is close, while Adam converges faster.
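For quick reference, the reported settings can be collected in a single configuration object. The values are those stated in this section; the dictionary name and key names are hypothetical and need not match the released source code.

```python
# Settings reported in Sect. 4.2; key names are illustrative only.
MT_RAM_CONFIG = {
    "embedding_dim": 100,
    "max_doc_length": 2500,
    "dropout": 0.2,
    "batch_size": 16,
    "hidden_dim": 300,
    "learning_rate": 0.008,
    "lambda_icd": 0.7,    # scaling factor for the ICD loss
    "lambda_ccs": 0.3,    # scaling factor for the CCS loss
    "ram_kernel_size": 3,
    "optimizer": "adam",
}

for name, value in MT_RAM_CONFIG.items():
    print(f"{name}: {value}")
```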

4.3 Baselines

CAML [22] comprises a single convolutional backbone and a label-wise attention mechanism, achieving high performance for ICD code prediction.

DR-CAML [22], i.e., the Description Regularized CAML, is an extension of CAML that incorporates the ICD description to regularize the CAML model.

HyperCore [5] uses the hyperbolic representation space to leverage the code hierarchy and utilizes a graph convolutional network to capture the ICD code co-occurrence correlation.

MultiResCNN [17] adopts a multi-filter convolutional layer to capture various text patterns and a residual connection to enlarge the receptive field.

4.4 Results

MIMIC-III (ICD Codes). Table 1 shows that our MT-RAM model performs better than all baseline models on all evaluation metrics. Compared with the state-of-the-art MultiResCNN [17], our model improves the macro-AUC, micro-AUC, macro-F1, micro-F1 and P@5 scores by 2.2%, 1.5%, 4.5%, 3.6% and 2.3%, respectively. Our model outperforms CAML [22], the classical automated ICD coding model, by 4.6%, 3.4%, 11.9%, 9.2% and 5.5%. Compared with HyperCore [5], the improvement in macro-F1 and micro-F1 is more pronounced than in other metrics, at 4.2% and 4.3%, respectively, while the other scores see moderate improvements of 1% $\sim $ 3%. Recent pretrained language models such as BERT [9] and its domain-specific variants like ClinicalBERT [1] are omitted from the comparison because these models are limited to processing text of at most 512 tokens and have been reported to perform poorly by two recent studies [12, 17].

Table 1. MIMIC-III results (ICD code). Results are shown in %. We run our model 10 times with different random seeds for initialization. Results of MT-RAM are reported as mean ± standard deviation


MIMIC-III (CCS Code). We evaluate CAML, DR-CAML and MultiResCNN on the MIMIC-III CCS dataset and record the results in Table 2. Since HyperCore [5] does not provide source code, we omit it from the comparison. Following the practice described in the paragraph on hyper-parameter tuning, we set all the parameters of CAML and MultiResCNN to be consistent with the hyper-parameters of the original works except for the learning rate.

As shown in Table 2, our model obtains better results in macro AUC, micro AUC, macro F1, micro F1 and P@5 than the strong MultiResCNN baseline. The improvement of our model is 1.6% in both macro AUC and micro AUC, 4.2% in macro F1, 3.4% in micro F1, and 2.7% in P@5. DR-CAML uses the description of ICD codes to improve the performance of CAML. On the MIMIC-III (CCS) dataset, however, this description interferes with CAML, so the result of DR-CAML is worse. Compared with the CAML model, our model improves the macro F1 metric by 5.5%.

Table 2. MIMIC-III results (CCS code). We run each model 10 times with different random seeds for initialization. Results of all models are reported as mean ± standard deviation


4.5 Ablation Study

We examine the general usefulness of the two main components - multitask training (MTL) and RAM module, by conducting an ablation study, where we consider the performance of three representative ICD coding models: CAML, MultiResCNN, and the GRU-based model (our method), with and without the specific components.

Multitask Learning. We first investigate the effectiveness of the multitask learning (MTL) scheme. From Table 3, we can observe that CAML and BiGRU are improved by a relatively large margin across all evaluation metrics with multitask training. CAML with MTL achieves 7.6% and 5.2% improvement in macro and micro F1, respectively, and obtains increases of about 2% to 3% in other scores. Similarly, BiGRU with MTL achieves a good improvement in macro and micro F1, which increase by 4.2% and 3.3%, respectively. For the MultiResCNN model, multitask learning also contributes to relatively good results, with a 2.3% improvement in the macro F1 score. Multitask learning improves the performance of the model because of the information exchange between the two tasks. Intuitively, there exists a correlation between the ICD and CCS coding systems. This leads to complementary benefits for both ICD and CCS code prediction tasks. CAML and MultiResCNN achieve significant gains by incorporating the multitask learning aggregation framework as a whole, i.e., the multitask learning scheme and the RAM together. Therefore, the gain of the multitask learning aggregation framework is not limited to particular network structures, and it has strong generalization ability.

Table 3. Ablation study

Recalibrated Aggregation Module. The second part of the ablation study examines whether the proposed Recalibrated Aggregation Module (RAM) can learn useful features and consequently lead to better performance. In Table 3, the performance of all three models improves after adding the RAM to the multitask architecture. The micro F1 scores of CAML, MultiResCNN and MT-RAM improve by 2.1%, 1.2% and 0.8%, respectively. The RAM helps the GRU-based model achieve greater improvement than convolution-based models.

4.6 A Detailed Analysis of the Properties of the RAM

We conduct an exploratory study to investigate the effectiveness of element-wise multiplication in the final feature weighting stage. We denote the models that apply multitask learning and the multiplicative RAM to CAML and MultiResCNN as MT-CAML + RAM (Mult) and MT-MultiResCNN + RAM (Mult), respectively. RAM (Add) denotes the variant in which the multiplication operation in the RAM is replaced with an addition operation. From Table 4, we can observe that models with RAM (Mult) outperform models with RAM (Add) on most evaluation metrics. Although MT-RAM (Add) is slightly better than MT-RAM (Mult) in macro F1, micro F1 and P@5, the gap is marginal. Considering both generalization ability and performance improvement, the RAM with multiplication is preferable to the RAM with addition.
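The difference between the two weighting operations can be illustrated with a simplified gate, sketched below. This is not the paper's exact RAM: it assumes one sigmoid weight per feature channel, in the spirit of squeeze-and-excitation gating, and the function name `recalibrate` and the toy features are hypothetical.

```python
import math
from typing import List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def recalibrate(features: List[List[float]], gate_logits: List[float],
                mode: str = "mult") -> List[List[float]]:
    """Apply one weight per channel to features indexed as [channel][position].

    mode="mult": scale each channel by its weight in (0, 1), which can
                 suppress an uninformative channel towards zero.
    mode="add":  shift each value by the weight, which cannot remove a
                 channel's signal, only offset it.
    """
    weights = [sigmoid(g) for g in gate_logits]
    if mode == "mult":
        return [[w * v for v in row] for w, row in zip(weights, features)]
    if mode == "add":
        return [[w + v for v in row] for w, row in zip(weights, features)]
    raise ValueError(f"unknown mode: {mode}")


# Toy features (hypothetical): 2 channels, 2 positions; logits of 0 give weight 0.5.
feats = [[2.0, -4.0], [1.0, 3.0]]
mult_out = recalibrate(feats, [0.0, 0.0], mode="mult")
add_out = recalibrate(feats, [0.0, 0.0], mode="add")
```

Under this simplification, multiplication acts as a soft feature selector, whereas addition behaves like a learned bias.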

Table 4. Analysis of RAM: multiplicative versus additive

Regarding the position of the RAM in the multitask learning framework, we found that it is best to embed the RAM in the shared layers. Compared with placing a RAM in each of the two branches of the framework, a RAM embedded in the shared layers helps the two sub-tasks share more information. If the RAM is embedded in the two sub-branch networks, the depth of each sub-network increases and the shared part shrinks. Deeper sub-networks interfere with convergence and make training more difficult. At the same time, a smaller shared part reduces the information exchanged between the sub-tasks, which limits the benefit of the multitask learning scheme.

5 Conclusion

In this paper, we proposed a novel multitask framework for the automated medical coding task, which improved feature learning for clinical documents and accounted for the dependencies between different medical coding systems. We designed a Recalibrated Aggregation Module (RAM) to enrich document features and reduce noisy information. Furthermore, we leveraged multitask learning to share information across different medical codes. We demonstrated that the combination of multitask learning and the RAM improved automatic medical coding considerably. In addition, these components are generalizable and can be successfully integrated into other architectures. The experimental results on the real-world clinical MIMIC-III database showed that our framework outperformed previous strong baselines. Finally, we believe our framework can be beneficial not only in medical coding tasks but also in other text label prediction tasks.

References

Alsentzer, E., et al.: Publicly available clinical BERT embeddings. In: Proceedings of the 2nd Clinical Natural Language Processing Workshop, pp. 72–78 (2019)
Baumel, T., Nassour-Kassis, J., Cohen, R., Elhadad, M., Elhadad, N.: Multi-label classification of patient notes a case study on ICD code assignment. arXiv preprint arXiv:1709.09587 (2017)
Bi, J., Xiong, T., Yu, S., Dundar, M., Rao, R.B.: An improved multi-task learning approach with applications in medical diagnosis. In: Daelemans, W., Goethals, B., Morik, K. (eds.) ECML PKDD 2008. LNCS (LNAI), vol. 5211, pp. 117–132. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-87479-9_26
Bottle, A., Aylin, P.: Intelligent information: a national system for monitoring clinical performance. Health Serv. Res. 43(1p1), 10–31 (2008)
Cao, P., Chen, Y., Liu, K., Zhao, J., Liu, S., Chong, W.: HyperCore: hyperbolic and co-graph representation for automatic ICD coding. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 3105–3114 (2020)
Chandra, R., Gupta, A., Ong, Y.-S., Goh, C.-K.: Evolutionary multi-task learning for modular training of feedforward neural networks. In: Hirose, A., Ozawa, S., Doya, K., Ikeda, K., Lee, M., Liu, D. (eds.) ICONIP 2016. LNCS, vol. 9948, pp. 37–46. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46672-9_5
Cho, K., et al.: Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
Choi, E., Bahadori, M.T., Schuetz, A., Stewart, W.F., Sun, J.: Doctor AI: predicting clinical events via recurrent neural networks. In: Machine Learning for Healthcare Conference, pp. 301–318. PMLR (2016)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.L.: BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL-HLT (2019)
Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE Conference on Computer Vision And Pattern Recognition, pp. 7132–7141 (2018)
Ji, S., Cambria, E., Marttinen, P.: Dilated convolutional attention network for medical code assignment from clinical text. In: Proceedings of the 3rd Clinical Natural Language Processing Workshop at EMNLP, pp. 73–78 (2020)
Ji, S., Hölttä, M., Marttinen, P.: Does the magic of BERT apply to medical code assignment? A quantitative study. arXiv preprint arXiv:2103.06511 (2021)
Johnson, A.E.W., et al.: MIMIC-III, a freely accessible critical care database. Sci. Data 3(1), 1–9 (2016)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Koopman, B., Zuccon, G., Nguyen, A., Bergheim, A., Grayson, N.: Automatic ICD-10 classification of cancers from free-text death certificates. Int. J. Med. Inform. 84(11), 956–965 (2015)
Larkey, L.S., Croft, W.B.: Combining classifiers in text categorization. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 289–297 (1996)
Li, F., Hong, Y.: ICD coding from clinical text using multi-filter residual convolutional neural network. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 8180–8187 (2020)
Liu, X., He, P., Chen, W., Gao, J.: Multi-task deep neural networks for natural language understanding. arXiv preprint arXiv:1901.11504 (2019)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Malakouti, S., Hauskrecht, M.: Hierarchical adaptive multi-task learning framework for patient diagnoses and diagnostic category classification. In: 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 701–706. IEEE (2019)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. arXiv preprint arXiv:1310.4546 (2013)
Mullenbach, J., Wiegreffe, S., Duke, J., Sun, J., Eisenstein, J.: Explainable prediction of medical codes from clinical text. arXiv preprint arXiv:1802.05695 (2018)
O’Malley, K.J., Cook, K.F., Price, M.D., Wildes, K.R., Hurdle, J.F., Ashton, C.M.: Measuring diagnoses: ICD code accuracy. Health Serv. Res. 40(5p2), 1620–1639 (2005)
Park, J.-K., et al.: The accuracy of ICD codes for cerebrovascular diseases in medical insurance claims. J. Prev. Med. Public Health 33(1), 76–82 (2000)
Perotte, A., Pivovarov, R., Natarajan, K., Weiskopf, N., Wood, F., Elhadad, N.: Diagnosis code assignment: models and evaluation metrics. J. Am. Med. Inform. Assoc. 21(2), 231–237 (2014)
Shi, H., Xie, P., Hu, Z., Zhang, M., Xing, E.P.: Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075 (2017)
Si, Y., Roberts, K.: Deep patient representation of clinical notes via multi-task learning for mortality prediction. In: AMIA Summits on Translational Science Proceedings 2019, p. 779 (2019)
Suk, H.-I., Lee, S.-W., Shen, D.: Deep sparse multi-task learning for feature selection in Alzheimer’s disease diagnosis. Brain Struct. Funct. 221(5), 2569–2587 (2016). https://doi.org/10.1007/s00429-015-1059-y
Sutskever, I., Martens, J., Dahl, G., Hinton, G.: On the importance of initialization and momentum in deep learning. In: International Conference on Machine Learning, pp. 1139–1147. PMLR (2013)
Yosinski, J., Clune, J., Bengio, Y., Lipson, H.: How transferable are features in deep neural networks? arXiv preprint arXiv:1411.1792 (2014)
Yu, Y., Li, M., Liu, L., Fei, Z., Wu, F.X., Wang, J.: Automatic ICD code assignment of Chinese clinical notes based on multilayer attention BiRNN. J. Biomed. Inform. 91, 103114 (2019)
Zhang, W., Li, R., Zeng, T., Sun, Q., Kumar, S., Ye, J., Ji, S.: Deep model based transfer and multi-task learning for biological image analysis. IEEE Trans. Big Data 6(2), 322–333 (2016)
Zhang, Y., Yang, Q.: A survey on multi-task learning. arXiv preprint arXiv:1707.08114 (2017)
Zhou, D., Miao, L., He, Y.: Position-aware deep multi-task learning for drug-drug interaction extraction. Artif. Intell. Med. 87, 1–8 (2018)

Acknowledgments

This work was supported by the Academy of Finland (grant 336033) and EU H2020 (grant 101016775). We acknowledge the computational resources provided by the Aalto Science-IT project. The authors wish to acknowledge CSC - IT Center for Science, Finland, for computational resources.

Author information

Authors and Affiliations

Aalto University, 02150, Espoo, Finland
Wei Sun, Shaoxiong Ji & Pekka Marttinen
Nanyang Technological University, Singapore, 639798, Singapore
Erik Cambria

Authors

Wei Sun
Shaoxiong Ji
Erik Cambria
Pekka Marttinen

Corresponding author

Correspondence to Shaoxiong Ji.

Editor information

Editors and Affiliations

Facebook AI, Seattle, WA, USA
Yuxiao Dong
Torre Telefonica, Barcelona, Spain
Nicolas Kourtellis
Bielefeld University, CITEC, Bielefeld, Germany
Barbara Hammer
Basque Center for Applied Mathematics, Bilbao, Spain
Jose A. Lozano

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

About this paper

Cite this paper

Sun, W., Ji, S., Cambria, E., Marttinen, P. (2021). Multitask Recalibrated Aggregation Network for Medical Code Prediction. In: Dong, Y., Kourtellis, N., Hammer, B., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Applied Data Science Track. ECML PKDD 2021. Lecture Notes in Computer Science (LNAI), vol 12978. Springer, Cham. https://doi.org/10.1007/978-3-030-86514-6_23

DOI: https://doi.org/10.1007/978-3-030-86514-6_23
Published: 10 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86513-9
Online ISBN: 978-3-030-86514-6
eBook Packages: Computer Science, Computer Science (R0)
