ALL Metrics
-
Views
-
Downloads
Get PDF
Get XML
Cite
Export
Track
Method Article

Revealing HIV viral load patterns using unsupervised machine learning and cluster summarization

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 27 Jul 2018
Author details Author details
OPEN PEER REVIEW
REVIEWER STATUS

This article is included in the Artificial Intelligence and Machine Learning gateway.

Abstract

HIV RNA viral load (VL) is an important outcome variable in studies of HIV infected persons. There exists only a handful of methods which classify patients by VL patterns.  Most methods place limits on the use of viral load measurements, are often specific to a particular study design, and do not account for complex, temporal variation. To address this issue, we propose a set of four unambiguous computable characteristics (features) of time-varying HIV viral load patterns, along with a novel centroid-based classification algorithm, which we use to classify a population of 1,576 HIV positive clinic patients into one of five different viral load patterns (clusters) often found in the literature: durably suppressed viral load (DSVL), sustained low viral load (SLVL), sustained high viral load (SHVL), high viral load suppression (HVLS), and rebounding viral load (RVL). The centroid algorithm summarizes these clusters in terms of their centroids and radii. We show that this allows new VL patterns to be assigned pattern membership based on the distance from the centroid relative to its radius, which we term radial normalization classification. This method has the benefit of providing an objective and quantitative method to assign VL pattern membership with a concise and interpretable model that aids clinical decision making. This method also facilitates meta-analyses by providing computably distinct HIV categories. Finally we propose that this novel centroid algorithm could also be useful in the areas of cluster comparison for outcomes research and data reduction in machine learning.

Keywords

Machine learning, HIV, viral load, feature extraction, HIV categories, centroid, cluster summarization, clinical interpretability

Introduction

The primary clinical goal of HIV treatment and patient engagement is suppression of the HIV viral load (VL), as measured by low or undetectable circulating HIV RNA levels. However, VL most often fluctuates over repeated measurements, with a range that spans 8 orders of magnitude from 0 (undetectable) - 107 copies/mL. VL is regularly monitored for signs of progression of HIV infection. Standard HIV treatment protocols are based on VL measurements1, especially when monitoring responses to antiretroviral therapy (ART). Monitoring of VL helps to determine whether ART therapy was able to successfully suppress patient VL2. Individuals with sustained high viral loads (SHVL) are at greater risk of secondary transmission, clinical progression to AIDS, and death36. In contrast, significant reduction in VL or high viral load suppression (HVLS) both lead to immune recovery, as measured by CD4 T cell levels7, and can reduce or eliminate the risks of SHVL. Furthermore, patients sustaining low-level viral load (SLVL), or with a rising VL after previous suppression, have a high incidence of treatment failure8. Thus, developing an objective measure of VL status, and categorization of patients by time varying patterns of VL, is critical for standardizing both therapy and comparing research protocol efficacy.

Reports in the current literature differ in the definition “high viral load"912, and their findings of how long it takes a patient on highly active anti-retroviral therapy (HAART) to suppress their VL2,9,10,13. We summarize some of the published approaches here (for greater detail see Supplementary File 1). With respect to VL levels, Terzian et al. defined SHVL as two consecutive viral load measurements (VLM) 100,000 copies/mL9. Durably suppressed viral load (DSVL) was defined as all VLM <400 copies/mL. In contrast, Greub et al. focused on detecting low level viral rebound (LLVR) by first considering patients with an initial consecutive VLM pair <50 copies/mL, and classified LLVR as having subsequent maximum VLM between 51–5008. Alternatively, Rose et al. investigated the use of five different frameworks to categorize suppressed versus not-suppressed VL10. Their approach excluded patients with VLM<200 at baseline, and stratified the remainder with regard to VL suppression using an 8 month window centered around 24 months after the start of VLM (18–30 months). Another approach was used by Phillips et al., and characterized VLM responses to ART13, utilizing a 24–40 week window and a rule-based method to identify two populations of HIV patients (Viral Failure and Viral Rebound). Despite these studies, no formal standard has been adopted by the field to classify a patient as having DSVL, SHVL, HVLS, SLVL, or rebounding viral load (RVL) patterns.

Classifying patient VL states outside of research studies is further complicated in that real-world VL measurements are taken intermittently over time, and missing data is common due to a variety of factors (e.g. travel, social circumstances, non-adherence). This leads to incomplete and irregularly spaced data points. In addition, differences in the sensitivity of the multiple VL clinical assays available results in multiple cut off points for “undetectable" viral loads analyzed at different facilities, further complicating analyses. Thus, there is a need for analytic techniques that can adjust for these details and classify VL states, both across research studies using different methodologies and to consistently classify patients in clinical practice.

Machine learning methods can provide objective, unsupervised classification of patient clinical status14. These methods begin by collecting a set of features from patient data (e.g. demographics, laboratory measurements, therapies) and then performing computational clustering to identify similar patient classes. Some groups have applied machine learning methods to HIV research studies15 to predict HIV VL responses16 or CD4 T cell counts17, to distinguish between suppressed and viremic patients18, and to select therapeutic regimens19. None, however, have used machine learning to create a standard classification for VL status with irregularly sampled VL measurements across a cohort of patients.

To address these issues, we propose a set of unambiguous features which, when combined as a feature vector, capture the distinct dynamic patterns present in VL measurements over time. In addition, we have developed a novel centroid algorithm to cluster HIV positive subjects based on these patterns. Here we present the derivation of this method, and demonstrate its application to clustering 1,576 HIV patients with repeated VL measurements over a 5 year period. We found that patient VL measurements can be clustered into five time-varying patterns that correspond well to clinically relevant states. We note that the method and resulting categories can be used to standardize definitions of VL patterns across research studies, and potentially for clinical classification.

Methods

Human subjects protection

This proposal was reviewed and approved by the University of Rochester Human Subjects Review Board (protocol number RSRB00068884). Consent was waived by the review board due to de-identification of the data set. The analysis in this paper is presented in compliance with Center for Medicare Services (CMS) current cell size suppression policy20. Data were coded such that patients could not be identified directly in compliance with the Department of Health and Human Services Regulations for the Protection of Human Subjects (45 CFR 46.101(b)(4)).

Study data

We obtained medical encounter data from all patients with an HIV diagnosis in the University of Rochester Medical Center’s electronic medical record system (EMR) between 2011–2016, including age, gender, race, ethnicity, and VL measurements. There were a total of 1,892 patients with at least one VL measured, with 1,576 of these patients having at least three VL measurements.

Measurements 48 copies/mL, present as categorical values “NEG", “POS < 20", or “POS < 48" were transformed into numerical values of 0, 20, and 48 respectively. The deidentified study data containing only viral load measurements and relative time are available at https://doi.org/10.5281/zenodo.131324521.

Hardware and software specifications

Analyses were performed on a Windows 8 server with Intel(R) Xeon(R) CPUs E5-2620 v2 @ 2.10GHz and 256GB of RAM. Python 2.7 was used for most data mining and machine learning under Spyder v.3 installed from Anaconda2 (64-bit). The default packages available in Anaconda were used for analysis, including, but not limited to: NumPy, scikit-learn, SciPy, datetime, csv, math, Matplotlib, pip, operator, copy, random, and time. Using pip we installed the webcolors and pydotplus packages for rendering a decision tree. SQLite was used to store, query, and clean ~the data. Analytic code is available for download at https://github.com/SamirRCHI/Viral_Load_Data_Categorization.

Viral load analysis methods

Since VL data is asynchronous and noisy, with variable numbers of data points for each subject, we excluded patients with 2 VL measurements as too few to accurately assess VL patterns. Based on temporal patterns of VL described in the literature, the VL pair distribution of our data (Figure S1), and a further extensive investigation into the data, we hypothesized six potential temporal VL patterns, defined in Table 1 and illustrated in Figure 1.

Table 1. Viral load state definitions.

Abbrv.NameDefinition
DSVLDurably Suppressed Viral LoadHaving consistently suppressed their viral loads at or near the undetectable range
SLVLSustained Low Viral LoadViral load counts which are constantly slightly higher than the undetectable range.
SHVLSustained High Viral LoadViral load counts which are constantly in a range considered high risk for HIV complications (e.g. opportunistic infections, malignancy).
HVLSHigh Viral Load SuppressionA viral load pattern in which the terminal portion of the curve has a negative slope and the terminal data point is in the low or suppressed range. This could have a few different speeds or styles of suppression - rapid, gradual, or slow.
RVLRebounding Viral LoadA viral load pattern in which viral loads are unstable, with the measurement at one time step seemingly being independent of the next measurement.
EVLEmerging Viral LoadHaving a steady, or rapid, emergence of high viral load while the first few measurements of viral load were suppressed. While we have found no mention of this type of pattern in the literature, and found that this pattern did not occur in our data set, VL data sets could contain this pattern.

*Colors are used throughout the manuscript to identify clusters

4f79706a-94d2-4b0d-b43b-23416a31d185_figure1.gif

Figure 1. Possible HIV viral load patterns.

Examples of each type of viral load pattern. Note that actual viral load patterns are noisier and may often be more difficult to distinguish. The magnitudes of viral load values reflect those found in the dataset.

It is important to note that these definitions are pattern based, and do not explicitly select absolute VL cutoff levels or a specific temporal window, as other reports have done810,13. This has the advantage of allowing the absolute VL levels and critical time windows to emerge from the analysis. It also does not preclude incorporation of absolute levels (e.g. VLM>400) at a later stage into the pattern specification.

Feature vector definition

Mathematical notations for this work are described in Table 2. We next designed a feature vector to capture characteristics that would allow us to distinguish between VL patterns. VL values at the lower limit of detection are a function of the specific assay used, and appear in our data set as 0, 20, and 48 copies/mL (Figure S1). Thus, plots of the log10 transformed data have discretely spaced values at the lower level of detection, capturing the undetectable range of viral load. Additionally, we adjusted the data by log10[V L + 10] to avoid log10[0]. The addition of 10 to VL (instead of 1) is used to minimize the distance between the undetectable values: 0, 20, and 48 (copies/mL). Thus, in our notation, all the values related to viral load are assumed to have been adjusted to this measure. For example, minV L = log10[0 + 10] = 1 and maxV L = log10[107 + 10] 7.

Table 2. Mathematic notation.

SymbolDescription
NThe number of usable patients in the data. In our case 1576 patients.*
pRefers to a single patient.
V LMpThe total number of viral load measurements patient p has taken.
VLpAll viral load counts of patient p in order of time.
VLp,i Refers to the ith viral load count of patient p in VLp, where 1 ≤ iV LMp.
tpAll temporal instances corresponding to VLp.
tp,iTemporal instance of viral load VLp,i, where 1 ≤ iV LMp.
maxV LThe maximum viral load for all patients, (107).**
minV LThe minimum viral load for all patients, (0).**
Hadamard Product - elemental-wise multiplication of arrays.

*This is after selecting for patients with 3 measurements.

**This value changes after transformation of the data.

Using the transformed VL data, we next extract several relevant features of the VL measurements over time. These features are used for machine learning classification of individual patient VL time series, and designed to distinguish patterns in VL change while minimizing the effects of noise. We do not limit feature extraction based on the total elapsed time of viral load measurements because the optimal time-point for determining viral load class is not well established. The attributes for feature extraction are: relative area of viral exposure, weighted recency reliability, adjusted maximal difference, and interquartile range. The definitions include:

  • 1. Relative area of viral exposure (Area) - the area under the viral load curve relative to the total viral load area possible, which has a range between [0,1]. We choose a normalized, relative score, as the total time span between the first and last viral load measurement, which differs between patients. This feature is similar to finding the mean and median, except it is sensitive to the dimension of time, hence yielding more information. The feature is calculated by summing the area of each trapezoid created by each pair of viral load values, followed by dividing by the total possible area (Equation 1).

A˙p=i=2VLMpVLp,i+VLp,i12(tp,itp,i1)(maxVLminVL)(tp,VLMptp,1)(1)

  • 2. Weighted recency reliability (wRR) - Due to viral load noise, the last measurement may not be an accurate reflection of a patient’s viral load trend. For example, a patient may have a VL whose average slope is negative, indicating high viral load suppression over time (HVLS). If, however, the last measurement is slightly higher than the trend, heavily weighting this last measurement could lead to mis-classifying the patient as rebounding viral load (RVL). To account for this, we calculate a weighted mean where the weight of the VL measurement increases with time. More specifically, the weight function follows an inverse square root function (f(x)=1x) rather than an inverse function (g(x)=1x). This has the advantage of avoiding rapid convergence of g(x) to zero when time is measured in units of days (Equation 2). Weighted recency is then calculated as the dot product of the viral loads and weights divided by the sum of the weights (Equation 3).

weightp,i=1tp,VLMptp,i+1(2)

wRp=weightpVLpi=1VLMpweightp,i(3)

  • We were also interested in how reliable wR is as a representation of the patient’s viral load trend. To this end, we calculated the absolute deviations from the viral load measurements to wR (Equation 4). Rather than averaging the deviations, we take the median to reduce the effects of outliers and call this our weighted recency reliability measure (Equation 5). We take the inverse to force the range of the result to be between [0,1]; a property made to use in our next proposed feature, adjusted maximal difference.

devp,i=|wRpVLp,i|(4)

wRRp=1median(devp)+1(5)

  • 3. Adjusted maximal difference (Adj MD) - this is time-independent the difference between the “peak” and last VL measurements. To distinguish between viral load suppression or emergence, we calculate the “peak” as the maximum of the absolute deviations (Equation 4) and retain the sign of the result. We expected the positive scores to effectively isolate the EVL group, however, we instead found that retaining the positive (emergent) scores lead to mis-categorization of SHVL and RVL groups without clearly identifying EVL patterns. This, along with other investigation into the data, led us to conclude that the EVL pattern may not exist in our data, but we refrain to make generalizations to all healthcare facilities. With this consideration, we force (ground) the positive scores down to zero for proper labeling of SHVL and RVL (Equation 6).

    Due to the varying nature in viral load measurements, we are hesitant to use the final viral load measurement as a means of judging suppression. Thus we propose to use wR instead. To reduce the effects of rebounding patients being falsely labeled as suppressed patients, we multiply our result by wRR - as rebounding patients are expected to have a low score in the range [0,1]. The maximal difference is necessary in order to ensure that the suppression type of viral load patterns are classified appropriately (Equation 7).

grnd(x)={10x<0x0(6)

D˅p=grnd(wRpVLp,argmax(devp))max(devp)wRRp(7)

  • 4. Interquartile range (IQR) - This feature is added to further segregate the rebounding patients and follows the standard interquartile range calculation (Equation 8).

IQRp=Q3(VLp)Q1(VLp)(8)

Statistical analysis

Machine learning methods for cluster classification were compared by calculating F1 scores, the harmonic mean of precision and recall22, defined by Equation 9Equation 11.

precision=TruePositiveTruePositive+FalsePositive(9)

recall=TruePositivePositive(10)

F1=2precisionrecallprecision+recall(11)

Analytic terminology

Here we formally define keywords appearing in the analysis: Let Feature extraction be the process of determining the values Ȧ, wRR, D˅, and IQR from a set of patients (using their viral load patterns) with the formulations given above. Then a feature vector (Fp) contains the values Ȧp, wRRp, D˅p, and IQRp extracted from patient p’s viral load pattern. The words sample or point are also used here RVL (black; n= 237) and HVLS (purple; n=316) clusters. interchangeably. The term feature (F) can be thought of as a column vector for all patients in the dataset consisting of the four attributes: FȦ, FwRR, FD˅, and FIQR. Finally, the terms label assignment, VL pattern membership assignment, patient categorization, and prediction, all refer to the same principle: To assign the most appropriate label which characterizes the viral load pattern of a patient. However, while the principle is the same, the method of assigning such an appropriate label differs depending on the categorization or the learning method used.

Results

Feature extraction and normalization

We began by transforming viral load data by min-max normalization22 to equally weight the temporal features of the VL series (Equation 12). That is, we normalize the features, F, to a range between [0, 1] using Equation 12 where F* = f(F).

F=f(F)=FminFmaxFminF(12)

Next, we examined each of the four features for all patients with 3 viral load measurements (N = 1,576 patients), and did not find distinct bi-variate clustering (Figure S2). A feature correlation coefficient analysis (Supplementary Table S1) revealed that the Adj MD feature is linearly independent of Area and wRR. In contrast, there is modest linear dependence between IQR and Adj MD, and between Area and both wRR and IQR. As expected, the largest linear dependency is between wRR and IQR. These results suggest the separation between viral load patterns will be most noticeable between the Area and the Adj MD features - as we designed them to be. Also, although Adj MD is dependent upon wRR, we find that their correlation coefficient is very low (0.033).

Hierarchical clustering

We then performed hierarchical clustering of the individual subject VL patterns using a Euclidean distance metric and Ward’s linkage criterion23 to minimize the total within-cluster variance. Patients showed a clear separation into 5 distinct groups, which had clinical significance (Figure 2 and Figure 3). The cluster with the lowest viral loads and the highest weighted recency reliability (n=442) corresponds to the DSVL patient group. The patients corresponding to the SHVL group (orange; n=46) exhibited the highest relative Area and very low IQR. Compared to the DSVL cluster, the blue cluster (n=535) has slightly greater area and IQR with a significant difference in the weighted recency reliability. Using this information, along with the general patterns shown by Figure 3, we identify this as the SLVL group. The algorithm also identifies the RVL (black; n=237) and HVLS (purple; n=316) clusters. The RVL cluster has a low weighted recency reliability and high IQR. In contrast, the HVLS cluster has a lower area, higher weighted recency reliability, indicating little variation in the terminal portion of the VL time series, and most importantly very low adjusted maximal differences (Figure 4).

4f79706a-94d2-4b0d-b43b-23416a31d185_figure2.gif

Figure 2. Dendrogram of hierarchically clustered patients.

Clustered using the Euclidean distance along with Ward’s method. Numbers on the bottom axis show number of patients in each cluster. The corresponding viral load pattern plots can be found in Figure 3. DSVL = Durably Suppressed Viral Load, SHVL = Sustained High Viral Load, SLVL = Sustained Low Viral Load, RVL = Rebounding Viral Load, HVLS = High Viral Load Suppression

4f79706a-94d2-4b0d-b43b-23416a31d185_figure3.gif

Figure 3. Extracted patient viral load patterns.

For each cluster categorization of the patient from Figure 2, the days since first viral load measurement are plotted against the viral load counts. The points on the plots indicate the last viral load measurement.

4f79706a-94d2-4b0d-b43b-23416a31d185_figure4.gif

Figure 4. Feature segregation from hierarchical clustering.

Each patient is colored corresponding to the results from the hierarchical clustering in Figure 2. The artificial line of points is a result of the grounding function used in Adj MD. Area = relative area of viral load exposure, wRR= weighted recency reliability, IQR = interquartile range, AdjMD = adjusted maximal viral load difference.

VL patterns are similar within clusters, and dissimilar between clusters Figure 3. Interestingly, there are patients within each cluster whose last VL measurement occurs near 1,827 days. This is equivalent to the full span of five years of VL monitoring data set. This suggests that these clusters don’t disappear after some elapsed time, but rather each type of pattern can be found at virtually any time point.

We found large VL spikes within the time series of the HVLS group. We hypothesize that this may be due to the asynchronous timing of measurements between subjects, the natural variation in biological responses, or patient variability in adherence to therapy. This observation also reflects one limitation of asynchronous outcomes data sampling, which lacks a “completion" endpoint characteristic of most prospective, randomized clinical trials. If measurements ended at a spike, the adjusted maximal difference feature may be weighted in the favor of the patient being classified as RVL. This may indicate that some patients classified as suppressing their viral loads should have been classified as having rebounding viral loads. Alternatively, may indicate that these features do not restrict a patient to forever to one category, but allow for dynamic classification as a function of biological or therapeutic responses.

Comparison of categorization methods

Using the same data set, we next compared our VL pattern categorization method to those previously published in the literature. Visually, we find that the SLVL group detected by our method is very similar to the LLVR group defined by Greub et al. (Figure 5). Furthermore, it appears that the methods trying to capture SHVL, viral rebound, and viral failure patients did not succeed as well as the identification of SHVL and RVL patients in our method. RMVL repeat continuous visually appears to have performed very well in identifying patients whom have suppressed their viral loads. However, the results suggest that our analysis performs slightly better in identifying the suppression group (HVLS), as we find that the last VL measurements (black dots in Figure 5) are consistently low using our method.

4f79706a-94d2-4b0d-b43b-23416a31d185_figure5.gif

Figure 5. Comparison of patient categories with existing methods.

A 2D binning of VLM counts for every patient category. Each row uses a different categorization method, and the method name is located to the right of the row, and the title of each subplot is the category assigned by the indicated method. The columns of each 2D bin are normalized based on the maximum number of logged viral load measurement (VLM) counts in the column: log10[1 + V L M ]. Bin color for a count of 0 is copper, and other bin colors range from white to teal (the maximum of the log10[V L M counts] in the column of the bin). The black dots represent the last viral load measurement for the patient (opacity 0.3; 2D bins have variable opacity for the dots). The bottom row is our analysis is the same as Figure 3, but represented as a 2D bin. DSVL = durably suppressed viral load; LLVR = low level viral rebound; SLVL = sustained low viral load; SHVL = sustained high viral load; HVLS = high viral load suppression, RVL = rebounding viral load.

The other methods may not have performed as well as they rely on a window or a consecutive pair measure, which may be too subjective for assigning VL pattern membership. Furthermore, notice that patients with baseline VL<200 (Figure 5) contain VL patterns which can reach as high as 106 copies/ml, which is in contrast to Rose et al.’s assumption that these patients have consistently low viral variation. Lastly we wish to emphasize that while some of these categorization methods are successful in identifying a specific group of patients, our method is unique as it attempts to associate each VL pattern to a specific category, without using categories such as “Not Suppressed”, “Unspecified”, or “Omitted”.

Supervised learning of VL patterns

We next used the classes identified by hierarchical clustering to compare several machine learning models, with the goal of identifying methods that could be trained to prospectively assign HIV patients to VL categories (i.e. SHVL, SVL, SLVL, DSVL, and RVL). Unsupervised learning (e.g. hierarchical clustering) is useful for establishing the data structure of VL categories and their locations in the feature vector space. Once the model is established (e.g. cluster boundaries), supervised learning methods are better suited for prospective cluster assignment, given a robust "ground truth" for model training, as they do not depend on re-analysis of the entire population.

To this end, we compared the predictive power of several supervised learning methods for HIV cluster assignment, including: k-nearest neighbors (kNN), decision tree, support vector machine (SVM), Adaboost, and random forests. Models were trained on the original data set, and we then ranked their prediction power by their average F1 score derived by leave-one-out cross-validation (LOOCV) on the clustered results (Table 3). We compared the ability of these methods to reconstruct the originally identified clusters, even when allowing for variability in cluster numbers (e.g. kNN with k={7, 9}, or DT without a maximum depth specification). All methods performed comparably well, with the notable exception of Adaboost. This generally high performance was expected because the VL pattern categories are well-separated as a result of the clustering. k-Nearest Neighbors and k=5, was computationally efficient and yielded the best results in Table 3.

Table 3. F1 prediction scores using LOOCV.

Group:DSVLSLVLSHVLHVLSRVLAverage
Patients:44253546316237F1 Score
kNN,k=50.99660.99250.96770.98890.98100.9853
kNN,k=90.99430.99070.95830.98410.97250.9800
kNN,k=70.99430.98970.93620.98730.97460.9764
Random Forest0.99090.98410.95560.96850.96450.9727
Decision Tree (DT)0.98980.97950.96700.95120.94320.9661
SVM0.99550.98330.91110.96660.93870.9590
DT,max depth=50.97570.96590.90320.93750.91680.9398
Polyhedron0.97270.94740.90110.89850.91090.9261
Bounding Box0.98650.96300.87640.90380.86140.9182
Push and Pull0.97370.93470.88420.90270.87670.9144
Best Rep.0.95890.92800.90110.85980.89230.9080
Mean0.94010.90040.90720.82120.89680.8931
Smallest Disk0.96270.90170.88890.82460.87170.8899
Median0.92710.88820.91670.79670.89530.8848
AdaBoost0.92270.82480.57970.50330.64750.6956

LOOCV = leave-one-out cross validation, kNN = k nearest neighbor, DT = decision tree

SVM = support vector machine, DSVL = Durably Suppressed Viral Load

SHVL = Sustained High Viral Load, SLVL = Sustained Low Viral Load

RVL = Rebounding Viral Load, HVLS = High Viral Load Suppression

We next considered the trade-off of predictive precision versus model interpretability. Critical clinical evaluation of machine learning results is important to protect against mis-categorization and clinical error. For this reason, many have advocated using models that are more clinically interpretable. kNN is dependent upon the entire training set for prediction, as it does not inherently “learn” patterns24, hence it does not meet our interpretability criteria. In comparison, SVM offers a simpler model, but it’s results could be non-intuitive for clinicians. And although Decision Trees offer the best interpretability, overly complex trees may be generated, as occurred in our study (Figure S3).

We found that pruned decision tree rules, with a maximum depth of 5 levels, met this interpretable criteria, however at a slight cost to the predictive power (Table 3). The extracted decision rules are shown in Figure 6. Each category has a rule with a high proportion of true positive samples following the rule relative to all samples for the category (support). Similarly, a high proportion of the predicted class was found in the rule (precision), indicating that the rules can be summarized into a majority rule. Note that the sum of the support does not necessarily add up to one for each class because some samples belonging to that class may have been otherwise placed into a different rule, making the precision of that rule weaker.

4f79706a-94d2-4b0d-b43b-23416a31d185_figure6.gif

Figure 6. Extracted rules from pruned decision tree and polyhedral CM rule region.

Support is the fraction of true positives satisfying the rule relative to all samples of the class. Precision is the proportion of true positives versus all positives in the rule. Rules are sorted in order of application, first by the level of the decision tree depth (Depth), and then by descending precision. The colored regions represent the the values for which the rule holds (rule feature space). For the centroid method (CM; shaded gray) bounds were calculated by the polyhedron method, where the rectangular bar is the center and the radius is the area inside the parentheses. Area = relative area of viral load exposure, wRR= weighted recency reliability, IQR = interquartile range, AdjMD = adjusted maximal viral load difference.

As an alternative interpretable model we explored the use of centroid cluster summarization, which is often used in clustering algorithms, and is flexible enough to accommodate different centroid determination methods22,25. To determine the effects of different centroid determination algorithms, we compared seven different methods: multidimensional mean, multidimensional median, best representative center, bounding box method, smallest disk method, polyhedral center, and a novel “push and pull" (PnP) method inspired by force-directed graph drawing such as the Fruchterman-Reingold’s algorithm26,27 (see Supplementary File 1). Force directed clustering methods maximize inter-cluster center distances, while minimizing intra-cluster distance, and are the basis for modularity clustering in graph theory28.

We then combined the centroid cluster summarization approach with a radius-based classification prediction algorithm. Let ci be the ith cluster center with corresponding radius ri, where ri is calculated as the distance to the farthest intra-cluster sample from ci, then for a new sample s choose its predicted cluster membership j such that scj2rj is a minimum. We refer to this method as radial normalization classification.

Comparing the representative F1 power of the centroid radial normalization methods (italicized in Table 3) to common machine learning algorithms, we find that the centroid interpretation loses some predictive power. However, the centroid summary is highly interpretable because the entire model can be expressed concisely (Supplementary Table S2), and understood clearly. For example, a clinician classifying a patient by VL time series values would compare observed feature values with the ranges given in Figure 6, and find which classification the patient’s data fits best within. In the case of the centroid method, if an observed value appears to fall in multiple categories, then they should be assigned to the one closest to the center (this allows a clinician to cross-check model predictions).

Temporal state variation

HIV patient viral load states are often fluid, with class changes (e.g. SHVL HVLS) occurring due to therapy, viral genetics, social and other factors. To examine this aspect of classes, we use the k-Nearest Neighbors (k=5) model, fit to the original clusters, to predict the class state of each patient with 3 VL measurements using only partially retained VL data. For example if a patient has 6 viral load measurements, then we predict the class state at 3, 4, 5, and 6 VLM, which may yield SHVL SHVL RVL HVLS as its prediction. We then constructed a state-transfer network using the trace-route method29, revealing several interesting relationships:

  • 1. Patients on therapy appear to suppress their viral loads at a positive linear rate throughout the entire 900 day span. This is quite different from the literature which suggests that if a patient is going to suppress their VL, it will be within 32 weeks, or 224 days13 (Figure 7A).

4f79706a-94d2-4b0d-b43b-23416a31d185_figure7.gif

Figure 7. Class state variation.

A) Classification using kNN, with k=5, trained on the original five clusters to predict on partially retained viral load for patients 900 days of data. The number of patients in one class between 0–900 days are shown relative to the first state classification (i.e. third viral load measurement). B) A trace-route map of class state transfers (class1class2) as a function of partially retained viral load derived from model. Nodes represent viral load classification and arrows reflect the volume of state transitions between successive VL measurements (e.g. SHVLDSVL). Self-loops (e.g. RVLRVL) indicate no change in state reflecting stable classification.

  • 2. DSVL classification appears unstable for the first 400 days, suggesting that patients in this class should be monitored carefully during this initial period (Figure 7A).

  • 3. The number of patients classified as SHVL drops considerably until 500 days after first classification. After this point, those patients who have not yet left the SHVL category, may not do so (Figure 7A).

  • 4. The two sets of classes {DSVL, SLVL} and {SHVL, RVL, HVLS} are well separated (i.e. without much transfer between sets; Figure 7B). This appears to suggest that patients whose viral load is consistently low or durably suppressed tend not to transfer into a high viral load state (i.e. RVL or SHVL), at least in this data cohort.

  • 5. SHVL patients in this cohort tended to transfer out of the class at a much higher rate than the transfer in, suggesting positive patient care (Figure 7B). This observation is consistent with reports in the literature that entry into treatment, with adherence to a HAART regimen, generally results in viral load suppression.

  • 6. The state transfer diagram illustrates that the most frequent state transition over time is remaining within the same cluster (Figure 7B) assignment.

Discussion

Researchers have previously performed HIV population case studies using differing schema to classify VL patterns2,9,10,13. We have developed a unique method for standardizing the algorithmic classification of VL patterns using a set of optimally segregating features. These features have been specifically engineered to optimize unsupervised clustering of temporal sequences of VL data that are asynchronous and noisy. Our findings demonstrate their success in identifying five viral load patterns often reported in the literature712. It is possible that additional viral load patterns may emerge in the future, for example due to new HIV variants that are resistant to current therapies. The method reported here is flexible enough to recognize such new temporal patterns of VL responses. It is also general enough that models could be trained on other viral infections that have patterns of natural or treatment related patient responses (e.g. hepatitis B and C, parvovirus B19), although this may require defining new features that capture disease specific pattern variants.

A common practice in data analytics is to calculate the centroid as the average of the points22,30. However, Table 3 suggests that the mean is not necessarily the best centroid for HIV viral load data. We note two advantages of the centroid algorithm: First, we can choose the centroid best corresponding to the shape of the data, and second, we can use it to mathematically determine the amount of over-lap between n-dimensional cluster spheres (i.e. viral load categories). This method may facilitate cross-comparison of HIV research studies by providing a standard for VL pattern classification. Such standardization would be immensely useful in meta-analyses3135, potentially revealing the influence of different patient care strategies or new relationships between different patient populations.

Our work also explored the trade-off with respect to predictive accuracy between model interpretability and more complex, "black box" approaches to classification. The interpretability versus predictability problem is well known in the deep learning literature3638. Interpretability is a desirable attribute in clinical classification systems, allowing clinicians to integrate causal physiology and diagnostic information with data features in a way promotes clearer bedside clinical reasoning. Using an interpretable model for assigning viral load pattern membership may be advantageous when a clinician wishes to use the assigned pattern membership to aid in making a critical clinical decision (e.g. choosing between treatment options), or when examining features that may be linked to a mechanism (e.g. slope of VL decline and viral genotype). A "black box" or more complex model may make such decisions or interpretations more difficult39, and can favor the use of simpler models at the expense of some predictive power.

Along these lines, we have also proposed a novel centroid-based algorithm for summarization of clustering results. This algorithm is not meant to supplant other well defined supervised learning algorithms, but rather to aid in interpretable assignment of VL patterns from other data sets into one of the five categories. The algorithm results are concise, allowing investigators to build the model in their preferred programming language. Hence this method may improve and standardize HIV population research by giving precise definitions to the varying temporal VL patterns, and potentially improving patient care.

Several caveats apply to our work. As noted, this is a single center study, and thus our method should be tested with a much larger data set to cross-validate the categories represented by the clusters. In addition, our feature vector was designed specifically based on observed VL patterns previously reported in the literature rather than objectively clustering the data using a standard time-series based clustering method4042. This may limit generalizability to other VL analyses. In addition, some of our features are slightly collinear - with the greatest correlation coefficient being between IQR and wRR (-0.717). However, while HVLS and RVL both have a varied range of IQR, it is clear that the HVLS class has greater wRR than the RVL class due to HVLS patients having a long consistent viral load tail. Furthermore IQR helps distinguish the HVLS and the SLVL or SHVL class, hence both IQR and wRR are necessary despite the slight correlation. Finally, because our method normalizes time into number of days since first VL measurement, we lose the ability to look for seasonal or yearly patterns in the data.

Our data set did not have patients in whom VL was initially suppressed, and then rebounded (EVL). We originally hypothesized the existence of six distinct VL patterns, we found that the emergent VL group was not a pattern identified in our data. Perhaps this is a consequence of a high rate of local patient engagement in therapy in this cohort study, access to care, or the effectiveness of highly active anti-retroviral therapy regimens. We hypothesize that these conditions may not always exist (e.g. in areas where HAART is expensive, when people may lose the ability to pay for therapy), and that in such cases the EVL pattern may indeed be present and significant. Based on the formulation of the adj MD and wRR features, we hypothesize that a consequence of the grounding function is that any EVL pattern, if exists, will be grouped under RVL. This grouping may be appropriate as one can argue that going from a suppressed state to a high VL state is a form of rebounding. Clinical treatment of these patterns is likely to be similar. Further work with data sets that contain RVL patterns will need to be done to test these hypotheses. Unfortunately, we are not aware of any such data currently in the public domain.

Our method used hierarchical clustering to define groups, with a cutoff for group specification at a high level in the branching tree (i.e. level 5). Such thresholds or tuning parameters are characteristic of most unsupervised clustering algorithms22,43. However, identification of important sub-clusters by using a lower threshold is also possible. Clustering results may change depending on the parameter chosen, revealing finer between-cluster differences as the number of clusters increase. The hierarchical clustering algorithm has the advantage that a proper cut-off can be easily visualized. For example, choosing a lower cut-off may reveal that the suppression group splits itself into categories with different rates of HIV viral load suppression during treatment. Researchers wishing to engineer a new feature vector for VL pattern segregation may find useful the Supplementary material on features we considered but subsequently removed due to poor performance.

Conclusions

We have proposed a set of four unambiguous features which have been successfully used in segregating five different types of temporal viral load patterns: durably suppressed viral load (DSVL), sustained low viral load (SLVL), sustained high viral load (SHVL), high viral load suppression (HVLS), and rebounding viral load (RVL). We have also proposed a novel centroid-based cluster summary algorithm. The use of this algorithm may improve meta-analyses or population studies of viral load patterns by standardizing the classification of HIV patient categories. Furthermore, the segregation process used in this paper (i.e. identifying domain specific features, performing unsupervised clustering, interpreting the results with a cluster summary) can be used to model other viral infections and the response of VL levels over time to treatment or natural disease progression. We also found that using a temporal state variation method is important when considering patient viral load classifications, as changes in patient response can continue to occur beyond previously estimated time frames.

Abbreviations

AdjMD = adjusted maximal viral load difference, Area = relative area of viral load exposure, ART = anti-retroviral therapy, DT = decision tree, EVL = emerging viral load, DSVL = Durably Suppressed Viral Load, HAART = highly active retroviral therapy, HIV = human immunodeficiency virus, IQR = interquartile range, kNN = k nearest neighbor, LLVL = low level viral load, LOOCV = leave-one-out cross validation, SHVL = Sustained High Viral Load, SLVL = Sustained Low Viral Load, RVL = Rebounding Viral Load, HVLS = High Viral Load Suppression, SVM = support vector machine, VL = viral load, VLM = viral load measurement, wRR= weighted recency reliability.

Data availability

Full access to the data is available on GitHub (Data S1): https://doi.org/10.5281/zenodo.131324521

Data S1: Viral load data. The data set used for this study is provided in a completely deidentified format, CSV format where the first column represents a unique subject, with a random identifier. The subsequent values are as ti,j, V Li,j, where ti,j is the time from a universal T0 for the VL measurement j for patient i, and V Li,j is the corresponding VL measurement. Each record (row) is of a unique length, depending on the number of VL measurements present for that subject. The study data, and code used for analysis, can be found at https://doi.org/10.5281/zenodo.131324521.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 27 Jul 2018
Comment
Author details Author details
Competing interests
Grant information
Copyright
Download
 
Export To
metrics
Views Downloads
F1000Research - -
PubMed Central
Data from PMC are received and updated monthly.
- -
Citations
CITE
how to cite this article
Farooq SA, Weisenthal SJ, Trayhan M et al. Revealing HIV viral load patterns using unsupervised machine learning and cluster summarization [version 1; peer review: 1 approved, 1 approved with reservations] F1000Research 2018, 7:1144 (https://doi.org/10.12688/f1000research.15591.1)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
track
receive updates on this article
Track an article to receive email alerts on any updates to this article.

Open Peer Review

Current Reviewer Status: ?
Key to Reviewer Statuses VIEW
ApprovedThe paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approvedFundamental flaws in the paper seriously undermine the findings and conclusions
Version 1
VERSION 1
PUBLISHED 27 Jul 2018
Views
8
Cite
Reviewer Report 29 Oct 2018
Amalio Telenti, The Scripps Research Institute, La Jolla, CA, USA 
Approved with Reservations
VIEWS 8
This report brings machine learning approaches to the classification of patterns of viral control in HIV infected individuals. This is welcome because, although this is a mature field in HIV, it signals the opportunity for new models in this and ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Telenti A. Reviewer Report For: Revealing HIV viral load patterns using unsupervised machine learning and cluster summarization [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2018, 7:1144 (https://doi.org/10.5256/f1000research.17007.r39645)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.
Views
5
Cite
Reviewer Report 28 Sep 2018
Sally Blower, Center for Biomedical Modeling, Semel Institute of Neuroscience and Human Behavior, David Geffen School of Medicine, University of California, Los Angeles (UCLA), Los Angeles, CA, USA 
Approved
VIEWS 5
This is an extremely interesting study that proposes a novel quantitative methodology for classifying HIV patients by viral load patterns. The authors propose four computable characteristics of time-varying viral load patterns and a novel classification algorithm. They demonstrate their approach ... Continue reading
CITE
CITE
HOW TO CITE THIS REPORT
Blower S. Reviewer Report For: Revealing HIV viral load patterns using unsupervised machine learning and cluster summarization [version 1; peer review: 1 approved, 1 approved with reservations]. F1000Research 2018, 7:1144 (https://doi.org/10.5256/f1000research.17007.r37937)
NOTE: it is important to ensure the information in square brackets after the title is included in all citations of this article.

Comments on this article Comments (0)

Version 1
VERSION 1 PUBLISHED 27 Jul 2018
Comment
Alongside their report, reviewers assign a status to the article:
Approved - the paper is scientifically sound in its current form and only minor, if any, improvements are suggested
Approved with reservations - A number of small changes, sometimes more significant revisions are required to address specific details and improve the papers academic merit.
Not approved - fundamental flaws in the paper seriously undermine the findings and conclusions
Sign In
If you've forgotten your password, please enter your email address below and we'll send you instructions on how to reset your password.

The email address should be the one you originally registered with F1000.

Email address not valid, please try again

You registered with F1000 via Google, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Google account password, please click here.

You registered with F1000 via Facebook, so we cannot reset your password.

To sign in, please click here.

If you still need help with your Facebook account password, please click here.

Code not correct, please try again
Email us for further assistance.
Server error, please try again.