Keywords
Machine learning, HIV, viral load, feature extraction, HIV categories, centroid, cluster summarization, clinical interpretability
This article is included in the Artificial Intelligence and Machine Learning gateway.
The primary clinical goal of HIV treatment and patient engagement is suppression of the HIV viral load (VL), as measured by low or undetectable circulating HIV RNA levels. However, VL most often fluctuates over repeated measurements, with a range that spans 8 orders of magnitude, from 0 (undetectable) to 10⁷ copies/mL. VL is regularly monitored for signs of progression of HIV infection. Standard HIV treatment protocols are based on VL measurements1, especially when monitoring responses to antiretroviral therapy (ART). Monitoring of VL helps to determine whether ART was able to successfully suppress patient VL2. Individuals with sustained high viral loads (SHVL) are at greater risk of secondary transmission, clinical progression to AIDS, and death3–6. In contrast, significant reduction in VL or high viral load suppression (HVLS) leads to immune recovery, as measured by CD4 T cell levels7, and can reduce or eliminate the risks of SHVL. Furthermore, patients sustaining low-level viral load (SLVL), or with a rising VL after previous suppression, have a high incidence of treatment failure8. Thus, developing an objective measure of VL status, and categorization of patients by time-varying patterns of VL, is critical both for standardizing therapy and for comparing research protocol efficacy.
Reports in the current literature differ in their definition of “high viral load”9–12, and in their findings of how long it takes a patient on highly active anti-retroviral therapy (HAART) to suppress their VL2,9,10,13. We summarize some of the published approaches here (for greater detail see Supplementary File 1). With respect to VL levels, Terzian et al. defined SHVL as two consecutive viral load measurements (VLM) ≥100,000 copies/mL9. Durably suppressed viral load (DSVL) was defined as all VLM <400 copies/mL. In contrast, Greub et al. focused on detecting low level viral rebound (LLVR) by first considering patients with an initial consecutive VLM pair <50 copies/mL, and classified LLVR as having a subsequent maximum VLM between 51–500 copies/mL8. Alternatively, Rose et al. investigated the use of five different frameworks to categorize suppressed versus not-suppressed VL10. Their approach excluded patients with VLM<200 at baseline, and stratified the remainder with regard to VL suppression using an 8-month window centered around 24 months after the start of VLM (18–30 months). Another approach, used by Phillips et al., characterized VLM responses to ART13, utilizing a 24–40 week window and a rule-based method to identify two populations of HIV patients (Viral Failure and Viral Rebound). Despite these studies, no formal standard has been adopted by the field to classify a patient as having DSVL, SHVL, HVLS, SLVL, or rebounding viral load (RVL) patterns.
Classifying patient VL states outside of research studies is further complicated in that real-world VL measurements are taken intermittently over time, and missing data is common due to a variety of factors (e.g. travel, social circumstances, non-adherence). This leads to incomplete and irregularly spaced data points. In addition, differences in the sensitivity of the multiple VL clinical assays available result in multiple cut-off points for “undetectable” viral loads analyzed at different facilities, further complicating analyses. Thus, there is a need for analytic techniques that can adjust for these details and classify VL states, both across research studies using different methodologies and to consistently classify patients in clinical practice.
Machine learning methods can provide objective, unsupervised classification of patient clinical status14. These methods begin by collecting a set of features from patient data (e.g. demographics, laboratory measurements, therapies) and then performing computational clustering to identify similar patient classes. Some groups have applied machine learning methods to HIV research studies15 to predict HIV VL responses16 or CD4 T cell counts17, to distinguish between suppressed and viremic patients18, and to select therapeutic regimens19. None, however, have used machine learning to create a standard classification for VL status with irregularly sampled VL measurements across a cohort of patients.
To address these issues, we propose a set of unambiguous features which, when combined as a feature vector, capture the distinct dynamic patterns present in VL measurements over time. In addition, we have developed a novel centroid algorithm to cluster HIV positive subjects based on these patterns. Here we present the derivation of this method, and demonstrate its application to clustering 1,576 HIV patients with repeated VL measurements over a 5 year period. We found that patient VL measurements can be clustered into five time-varying patterns that correspond well to clinically relevant states. We note that the method and resulting categories can be used to standardize definitions of VL patterns across research studies, and potentially for clinical classification.
This proposal was reviewed and approved by the University of Rochester Human Subjects Review Board (protocol number RSRB00068884). Consent was waived by the review board due to de-identification of the data set. The analysis in this paper is presented in compliance with Center for Medicare Services (CMS) current cell size suppression policy20. Data were coded such that patients could not be identified directly in compliance with the Department of Health and Human Services Regulations for the Protection of Human Subjects (45 CFR 46.101(b)(4)).
We obtained medical encounter data from all patients with an HIV diagnosis in the University of Rochester Medical Center’s electronic medical record system (EMR) between 2011–2016, including age, gender, race, ethnicity, and VL measurements. There were a total of 1,892 patients with at least one VL measured, with 1,576 of these patients having at least three VL measurements.
Measurements ≤48 copies/mL, present as the categorical values “NEG”, “POS < 20”, or “POS < 48”, were transformed into numerical values of 0, 20, and 48 respectively. The deidentified study data containing only viral load measurements and relative time are available at https://doi.org/10.5281/zenodo.131324521.
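The categorical transformation above can be sketched as a small helper; the function name and the handling of already-numeric entries are our assumptions, but the category-to-number mapping is taken directly from the text:

```python
def to_numeric(vl):
    """Convert a raw VL entry (categorical string or number) to copies/mL.

    The three low-level categorical assay results are mapped to the
    numeric values given in the text; anything else is assumed to be
    an ordinary numeric measurement.
    """
    categorical = {"NEG": 0, "POS < 20": 20, "POS < 48": 48}
    if isinstance(vl, str):
        return categorical[vl.strip()]
    return float(vl)
```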
Analyses were performed on a Windows 8 server with Intel(R) Xeon(R) CPUs E5-2620 v2 @ 2.10GHz and 256GB of RAM. Python 2.7 was used for most data mining and machine learning under Spyder v.3 installed from Anaconda2 (64-bit). The default packages available in Anaconda were used for analysis, including, but not limited to: NumPy, scikit-learn, SciPy, datetime, csv, math, Matplotlib, pip, operator, copy, random, and time. Using pip we installed the webcolors and pydotplus packages for rendering a decision tree. SQLite was used to store, query, and clean the data. Analytic code is available for download at https://github.com/SamirRCHI/Viral_Load_Data_Categorization.
Since VL data is asynchronous and noisy, with variable numbers of data points for each subject, we excluded patients with ≤ 2 VL measurements as too few to accurately assess VL patterns. Based on temporal patterns of VL described in the literature, the VL pair distribution of our data (Figure S1), and a further extensive investigation into the data, we hypothesized six potential temporal VL patterns, defined in Table 1 and illustrated in Figure 1.
It is important to note that these definitions are pattern based, and do not explicitly select absolute VL cutoff levels or a specific temporal window, as other reports have done8–10,13. This has the advantage of allowing the absolute VL levels and critical time windows to emerge from the analysis. It also does not preclude incorporation of absolute levels (e.g. VLM>400) at a later stage into the pattern specification.
Mathematical notations for this work are described in Table 2. We next designed a feature vector to capture characteristics that would allow us to distinguish between VL patterns. VL values at the lower limit of detection are a function of the specific assay used, and appear in our data set as 0, 20, and 48 copies/mL (Figure S1). Thus, plots of the log10 transformed data have discretely spaced values at the lower level of detection, capturing the undetectable range of viral load. Additionally, we adjusted the data by log10[VL + 10] to avoid log10[0]. The addition of 10 to VL (instead of 1) is used to minimize the distance between the undetectable values: 0, 20, and 48 (copies/mL). Thus, in our notation, all the values related to viral load are assumed to have been adjusted to this measure. For example, minVL = log10[0 + 10] = 1 and maxVL = log10[10⁷ + 10] ≈ 7.
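The adjustment described above can be sketched directly from its definition; this is a minimal illustration, with the function name being our own:

```python
import math

MIN_VL = math.log10(0 + 10)        # = 1.0, the adjusted floor (undetectable)
MAX_VL = math.log10(10**7 + 10)    # ≈ 7.0, the adjusted ceiling

def adjust_vl(copies_per_ml):
    """log10[VL + 10] adjustment: adding 10 avoids log10(0) and keeps the
    undetectable assay values (0, 20, 48 copies/mL) close together on the
    log scale."""
    return math.log10(copies_per_ml + 10)
```

All VL values in the feature definitions that follow are assumed to have been passed through this adjustment.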
Symbol | Description |
---|---|
N | The number of usable patients in the data; in our case, 1,576 patients.* |
p | Refers to a single patient. |
VLMp | The total number of viral load measurements patient p has taken. |
VLp | All viral load counts of patient p in order of time. |
VLp,i | The ith viral load count of patient p in VLp, where 1 ≤ i ≤ VLMp. |
Tp | All temporal instances corresponding to VLp. |
Tp,i | The temporal instance of viral load VLp,i, where 1 ≤ i ≤ VLMp. |
maxVL | The maximum viral load for all patients (10⁷).** |
minVL | The minimum viral load for all patients (0).** |
○ | Hadamard product: element-wise multiplication of arrays. |
Using the transformed VL data, we next extract several relevant features of the VL measurements over time. These features are used for machine learning classification of individual patient VL time series, and are designed to distinguish patterns in VL change while minimizing the effects of noise. We do not limit feature extraction based on the total elapsed time of viral load measurements because the optimal time-point for determining viral load class is not well established. The attributes for feature extraction are: relative area of viral exposure, weighted recency reliability, adjusted maximal difference, and interquartile range. The definitions include:
1. Relative area of viral exposure (Area) - the area under the viral load curve relative to the total viral load area possible, which has a range between [0,1]. We chose a normalized, relative score because the total time span between the first and last viral load measurement differs between patients. This feature is similar to finding the mean and median, except that it is sensitive to the dimension of time, hence yielding more information. The feature is calculated by summing the area of each trapezoid created by each pair of viral load values, followed by dividing by the total possible area (Equation 1).
2. Weighted recency reliability (wRR) - Due to viral load noise, the last measurement may not be an accurate reflection of a patient’s viral load trend. For example, a patient may have a VL whose average slope is negative, indicating high viral load suppression over time (HVLS). If, however, the last measurement is slightly higher than the trend, heavily weighting this last measurement could lead to mis-classifying the patient as rebounding viral load (RVL). To account for this, we calculate a weighted mean where the weight of the VL measurement increases with time. More specifically, the weight function follows an inverse square root function rather than an inverse function. This has the advantage of avoiding rapid convergence of g(x) to zero when time is measured in units of days (Equation 2). Weighted recency is then calculated as the dot product of the viral loads and weights divided by the sum of the weights (Equation 3).
We were also interested in how reliable wR is as a representation of the patient’s viral load trend. To this end, we calculated the absolute deviations from the viral load measurements to wR (Equation 4). Rather than averaging the deviations, we take the median to reduce the effects of outliers and call this our weighted recency reliability measure (Equation 5). We take the inverse to force the range of the result to be between [0,1]; a property put to use in our next proposed feature, adjusted maximal difference.
3. Adjusted maximal difference (Adj MD) - this is the time-independent difference between the “peak” and last VL measurements. To distinguish between viral load suppression and emergence, we calculate the “peak” as the maximum of the absolute deviations (Equation 4) and retain the sign of the result. We expected the positive scores to effectively isolate the EVL group; however, we instead found that retaining the positive (emergent) scores led to mis-categorization of the SHVL and RVL groups without clearly identifying EVL patterns. This, along with other investigation into the data, led us to conclude that the EVL pattern may not exist in our data, but we refrain from generalizing to all healthcare facilities. With this consideration, we force (ground) the positive scores down to zero for proper labeling of SHVL and RVL (Equation 6).
Due to the variable nature of viral load measurements, we are hesitant to use the final viral load measurement as a means of judging suppression. Thus we propose to use wR instead. To reduce the effects of rebounding patients being falsely labeled as suppressed patients, we multiply our result by wRR, as rebounding patients are expected to have a low score in the range [0,1]. The maximal difference is necessary in order to ensure that the suppression type of viral load patterns is classified appropriately (Equation 7).
4. Interquartile range (IQR) - This feature is added to further segregate the rebounding patients and follows the standard interquartile range calculation (Equation 8).
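The four feature computations above can be sketched in one function. This is a minimal reading of Equations 1–8: the normalization constant for Area, the exact weight function, and the sign conventions for Adj MD are our assumptions where the text leaves room for interpretation.

```python
import math
import statistics

def vl_features(times, adj_vls, max_vl=7.0):
    """Sketch of the four features for one patient, given times (days since
    first measurement) and log10[VL + 10]-adjusted viral loads."""
    # 1. Relative area of viral exposure (Eq. 1): trapezoidal area under the
    #    adjusted VL curve, divided by the maximum possible area over the
    #    same span (normalization constant assumed to be max_vl * span).
    area = sum((adj_vls[i] + adj_vls[i + 1]) / 2.0 * (times[i + 1] - times[i])
               for i in range(len(times) - 1))
    rel_area = area / (max_vl * (times[-1] - times[0]))

    # 2. Weighted recency (Eqs. 2-3): inverse-square-root weights so that
    #    later measurements weigh more, without rapid decay in day units.
    last = times[-1]
    w = [1.0 / math.sqrt(last - t + 1) for t in times]
    wr = sum(v * wi for v, wi in zip(adj_vls, w)) / sum(w)

    #    Reliability (Eqs. 4-5): inverse of (1 + median absolute deviation
    #    from wR), forcing the result into (0, 1].
    mad = statistics.median([abs(v - wr) for v in adj_vls])
    wrr = 1.0 / (1.0 + mad)

    # 3. Adjusted maximal difference (Eqs. 6-7): signed deviation of largest
    #    magnitude from wR, positive (emergent) scores grounded to zero,
    #    scaled by wRR. The sign convention here is an assumption.
    peak = max((v - wr for v in adj_vls), key=abs)
    adj_md = min(peak, 0.0) * wrr

    # 4. Interquartile range (Eq. 8): standard Q3 - Q1 with linear
    #    interpolation between order statistics.
    svals = sorted(adj_vls)
    n = len(svals)
    def quantile(q):
        pos = q * (n - 1)
        lo, hi = int(math.floor(pos)), int(math.ceil(pos))
        return svals[lo] + (svals[hi] - svals[lo]) * (pos - lo)
    iqr = quantile(0.75) - quantile(0.25)

    return rel_area, wrr, adj_md, iqr
```

A perfectly flat VL series, for example, yields wRR = 1 (maximally reliable), Adj MD = 0, and IQR = 0, which matches the intuition behind each definition.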
Machine learning methods for cluster classification were compared by calculating F1 scores, the harmonic mean of precision and recall22, defined by Equation 9–Equation 11.
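From the definition above, the per-class F1 computation can be written directly from counts of true positives, false positives, and false negatives; the function below is an illustrative sketch of Equations 9–11:

```python
def f1_score(tp, fp, fn):
    """F1: harmonic mean of precision (tp / (tp + fp)) and
    recall (tp / (tp + fn))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```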
Here we formally define keywords appearing in the analysis: Let Feature extraction be the process of determining the values Ȧ, wRR, , and IQR from a set of patients (using their viral load patterns) with the formulations given above. Then a feature vector (p) contains the values Ȧp, wRRp, , and IQRp extracted from patient p’s viral load pattern. The words sample or point are also used here RVL (black; n= 237) and HVLS (purple; n=316) clusters. interchangeably. The term feature (F) can be thought of as a column vector for all patients in the dataset consisting of the four attributes: FȦ, FwRR, , and FIQR. Finally, the terms label assignment, VL pattern membership assignment, patient categorization, and prediction, all refer to the same principle: To assign the most appropriate label which characterizes the viral load pattern of a patient. However, while the principle is the same, the method of assigning such an appropriate label differs depending on the categorization or the learning method used.
We began by transforming viral load data by min-max normalization22 to equally weight the temporal features of the VL series (Equation 12). That is, we normalize the features, F, to a range between [0, 1] using Equation 12 where F* = f(F).
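Min-max normalization of a feature column, as in Equation 12, can be sketched as:

```python
def min_max_normalize(feature):
    """Rescale a feature column F to [0, 1]: F* = (F - min F) / (max F - min F)."""
    lo, hi = min(feature), max(feature)
    return [(x - lo) / (hi - lo) for x in feature]
```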
Next, we examined each of the four features for all patients with ≥ 3 viral load measurements (N = 1,576 patients), and did not find distinct bi-variate clustering (Figure S2). A feature correlation coefficient analysis (Supplementary Table S1) revealed that the Adj MD feature is linearly independent of Area and wRR. In contrast, there is modest linear dependence between IQR and Adj MD, and between Area and both wRR and IQR. As expected, the largest linear dependency is between wRR and IQR. These results suggest the separation between viral load patterns will be most noticeable between the Area and the Adj MD features - as we designed them to be. Also, although Adj MD is dependent upon wRR, we find that their correlation coefficient is very low (0.033).
We then performed hierarchical clustering of the individual subject VL patterns using a Euclidean distance metric and Ward’s linkage criterion23 to minimize the total within-cluster variance. Patients showed a clear separation into 5 distinct groups, which had clinical significance (Figure 2 and Figure 3). The cluster with the lowest viral loads and the highest weighted recency reliability (n=442) corresponds to the DSVL patient group. The patients corresponding to the SHVL group (orange; n=46) exhibited the highest relative Area and very low IQR. Compared to the DSVL cluster, the blue cluster (n=535) has slightly greater area and IQR with a significant difference in the weighted recency reliability. Using this information, along with the general patterns shown by Figure 3, we identify this as the SLVL group. The algorithm also identifies the RVL (black; n=237) and HVLS (purple; n=316) clusters. The RVL cluster has a low weighted recency reliability and high IQR. In contrast, the HVLS cluster has a lower area, higher weighted recency reliability, indicating little variation in the terminal portion of the VL time series, and most importantly very low adjusted maximal differences (Figure 4).
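The clustering step described above (Euclidean distance, Ward's linkage, cut into five groups) can be sketched with SciPy; the feature matrix here is random stand-in data, not the study's patient features:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-in feature matrix: rows are patients, columns are the
# four normalized features (Area, wRR, Adj MD, IQR).
rng = np.random.default_rng(0)
X = rng.random((40, 4))

# Ward's linkage on Euclidean distances, then cut the tree into at most
# five clusters, mirroring the five groups found in the study.
Z = linkage(X, method="ward", metric="euclidean")
labels = fcluster(Z, t=5, criterion="maxclust")
```

One advantage noted later in the text is that the resulting dendrogram makes the choice of cut-off easy to visualize.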
VL patterns are similar within clusters, and dissimilar between clusters (Figure 3). Interestingly, there are patients within each cluster whose last VL measurement occurs near 1,827 days, equivalent to the full five-year span of the VL monitoring data set. This suggests that these clusters do not disappear after some elapsed time, but rather that each type of pattern can be found at virtually any time point.
We found large VL spikes within the time series of the HVLS group. We hypothesize that this may be due to the asynchronous timing of measurements between subjects, the natural variation in biological responses, or patient variability in adherence to therapy. This observation also reflects one limitation of asynchronous outcomes data sampling, which lacks a “completion” endpoint characteristic of most prospective, randomized clinical trials. If measurements ended at a spike, the adjusted maximal difference feature may be weighted in favor of the patient being classified as RVL. This may indicate that some patients classified as suppressing their viral loads should have been classified as having rebounding viral loads. Alternatively, it may indicate that these features do not restrict a patient to one category forever, but allow for dynamic classification as a function of biological or therapeutic responses.
Using the same data set, we next compared our VL pattern categorization method to those previously published in the literature. Visually, we find that the SLVL group detected by our method is very similar to the LLVR group defined by Greub et al. (Figure 5). Furthermore, it appears that the methods trying to capture SHVL, viral rebound, and viral failure patients did not succeed as well as the identification of SHVL and RVL patients in our method. The “RMVL repeat continuous” framework visually appears to have performed very well in identifying patients who have suppressed their viral loads. However, the results suggest that our analysis performs slightly better in identifying the suppression group (HVLS), as we find that the last VL measurements (black dots in Figure 5) are consistently low using our method.
The other methods may not have performed as well because they rely on a window or a consecutive pair measure, which may be too subjective for assigning VL pattern membership. Furthermore, notice that patients with baseline VL<200 (Figure 5) contain VL patterns which can reach as high as 10⁶ copies/mL, in contrast to Rose et al.’s assumption that these patients have consistently low viral variation. Lastly, we wish to emphasize that while some of these categorization methods are successful in identifying a specific group of patients, our method is unique in that it attempts to associate each VL pattern with a specific category, without using categories such as “Not Suppressed”, “Unspecified”, or “Omitted”.
We next used the classes identified by hierarchical clustering to compare several machine learning models, with the goal of identifying methods that could be trained to prospectively assign HIV patients to VL categories (i.e. SHVL, HVLS, SLVL, DSVL, and RVL). Unsupervised learning (e.g. hierarchical clustering) is useful for establishing the data structure of VL categories and their locations in the feature vector space. Once the model is established (e.g. cluster boundaries), supervised learning methods are better suited for prospective cluster assignment, given a robust “ground truth” for model training, as they do not depend on re-analysis of the entire population.
To this end, we compared the predictive power of several supervised learning methods for HIV cluster assignment, including: k-nearest neighbors (kNN), decision tree, support vector machine (SVM), Adaboost, and random forests. Models were trained on the original data set, and we then ranked their prediction power by their average F1 score derived by leave-one-out cross-validation (LOOCV) on the clustered results (Table 3). We compared the ability of these methods to reconstruct the originally identified clusters, even when allowing for variability in cluster numbers (e.g. kNN with k={7, 9}, or DT without a maximum depth specification). All methods performed comparably well, with the notable exception of Adaboost. This generally high performance was expected because the VL pattern categories are well-separated as a result of the clustering. k-Nearest Neighbors with k=5 was computationally efficient and yielded the best results in Table 3.
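The LOOCV evaluation described above can be sketched with scikit-learn; the data here is synthetic stand-in data with well-separated classes (mimicking the well-separated clusters in the study), not the actual patient features:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import f1_score

# Hypothetical stand-in data: 60 "patients" x 4 features, 3 class labels.
rng = np.random.default_rng(1)
X = rng.random((60, 4))
y = np.repeat([0, 1, 2], 20)
X[y == 1] += 2.0   # shift classes apart so the toy example is well-separated
X[y == 2] += 4.0

# Leave-one-out predictions from kNN (k=5), scored with an averaged F1.
pred = cross_val_predict(KNeighborsClassifier(n_neighbors=5), X, y,
                         cv=LeaveOneOut())
score = f1_score(y, pred, average="macro")
```

With well-separated classes, as here, the LOOCV F1 score is high, consistent with the study's observation that most methods performed comparably well on well-separated clusters.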
We next considered the trade-off of predictive precision versus model interpretability. Critical clinical evaluation of machine learning results is important to protect against mis-categorization and clinical error. For this reason, many have advocated using models that are more clinically interpretable. kNN is dependent upon the entire training set for prediction, as it does not inherently “learn” patterns24; hence it does not meet our interpretability criteria. In comparison, SVM offers a simpler model, but its results could be non-intuitive for clinicians. And although decision trees offer the best interpretability, overly complex trees may be generated, as occurred in our study (Figure S3).
We found that pruned decision tree rules, with a maximum depth of 5 levels, met this interpretability criterion, albeit at a slight cost to predictive power (Table 3). The extracted decision rules are shown in Figure 6. Each category has a rule with a high proportion of true positive samples following the rule relative to all samples for the category (support). Similarly, a high proportion of the predicted class was found in the rule (precision), indicating that the rules can be summarized into a majority rule. Note that the sum of the support does not necessarily add up to one for each class because some samples belonging to that class may have been otherwise placed into a different rule, making the precision of that rule weaker.
As an alternative interpretable model we explored the use of centroid cluster summarization, which is often used in clustering algorithms, and is flexible enough to accommodate different centroid determination methods22,25. To determine the effects of different centroid determination algorithms, we compared seven different methods: multidimensional mean, multidimensional median, best representative center, bounding box method, smallest disk method, polyhedral center, and a novel “push and pull” (PnP) method inspired by force-directed graph drawing such as the Fruchterman-Reingold algorithm26,27 (see Supplementary File 1). Force-directed clustering methods maximize inter-cluster center distances, while minimizing intra-cluster distance, and are the basis for modularity clustering in graph theory28.
We then combined the centroid cluster summarization approach with a radius-based classification prediction algorithm. Let ci be the ith cluster center with corresponding radius ri, where ri is calculated as the distance to the farthest intra-cluster sample from ci. Then, for a new sample s, choose its predicted cluster membership j such that ‖s − cj‖/rj is a minimum. We refer to this method as radial normalization classification.
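The radius-based rule described above can be sketched in a few lines; the function name is ours, and we assume Euclidean distance, consistent with the clustering metric used earlier:

```python
import math

def radial_normalization_predict(sample, centers, radii):
    """Assign a new sample to the cluster j that minimizes the
    radius-normalized distance ||s - c_j|| / r_j, where r_j is the
    distance from center c_j to its farthest intra-cluster sample."""
    def score(j):
        return math.dist(sample, centers[j]) / radii[j]
    return min(range(len(centers)), key=score)
```

Normalizing by the radius means a sample near the edge of a tight cluster can still be assigned to a wider, more distant cluster whose spread better accommodates it.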
Comparing the representative F1 power of the centroid radial normalization methods (italicized in Table 3) to common machine learning algorithms, we find that the centroid interpretation loses some predictive power. However, the centroid summary is highly interpretable because the entire model can be expressed concisely (Supplementary Table S2), and understood clearly. For example, a clinician classifying a patient by VL time series values would compare observed feature values with the ranges given in Figure 6, and find which classification the patient’s data fits best within. In the case of the centroid method, if an observed value appears to fall in multiple categories, then it should be assigned to the category whose center is closest (this allows a clinician to cross-check model predictions).
HIV patient viral load states are often fluid, with class changes (e.g. SHVL → HVLS) occurring due to therapy, viral genetics, social and other factors. To examine this aspect of classes, we used the k-Nearest Neighbors (k=5) model, fit to the original clusters, to predict the class state of each patient with ≥ 3 VL measurements using only partially retained VL data. For example, if a patient has 6 viral load measurements, then we predict the class state at 3, 4, 5, and 6 VLM, which may yield SHVL → SHVL → RVL → HVLS as its prediction. We then constructed a state-transfer network using the trace-route method29, revealing several interesting relationships:
2. DSVL classification appears unstable for the first 400 days, suggesting that patients in this class should be monitored carefully during this initial period (Figure 7A).
3. The number of patients classified as SHVL drops considerably until ∼500 days after first classification. After this point, those patients who have not yet left the SHVL category may not do so (Figure 7A).
4. The two sets of classes {DSVL, SLVL} and {SHVL, RVL, HVLS} are well separated (i.e. without much transfer between sets; Figure 7B). This appears to suggest that patients whose viral load is consistently low or durably suppressed tend not to transfer into a high viral load state (i.e. RVL or SHVL), at least in this data cohort.
5. SHVL patients in this cohort tended to transfer out of the class at a much higher rate than they transferred in, suggesting positive patient care (Figure 7B). This observation is consistent with reports in the literature that entry into treatment, with adherence to a HAART regimen, generally results in viral load suppression.
6. The state transfer diagram illustrates that the most frequent state transition over time is remaining within the same cluster assignment (Figure 7B).
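The partial-retention prediction underlying this analysis can be sketched generically; `extract_features` and `predict_class` are stand-ins for the feature extraction and the fitted kNN (k=5) model, and the minimum of three measurements follows the study's inclusion criterion:

```python
def state_trajectory(times, vls, extract_features, predict_class, min_vlm=3):
    """Predict a class state at every prefix of >= min_vlm measurements,
    yielding the sequence of class states over time for one patient."""
    states = []
    for k in range(min_vlm, len(vls) + 1):
        fv = extract_features(times[:k], vls[:k])  # features from the prefix
        states.append(predict_class(fv))           # class state at k VLM
    return states
```

A patient with 6 measurements thus receives four state predictions (at 3, 4, 5, and 6 VLM), from which the state-transfer network is built.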
Researchers have previously performed HIV population case studies using differing schema to classify VL patterns2,9,10,13. We have developed a unique method for standardizing the algorithmic classification of VL patterns using a set of optimally segregating features. These features have been specifically engineered to optimize unsupervised clustering of temporal sequences of VL data that are asynchronous and noisy. Our findings demonstrate their success in identifying five viral load patterns often reported in the literature7–12. It is possible that additional viral load patterns may emerge in the future, for example due to new HIV variants that are resistant to current therapies. The method reported here is flexible enough to recognize such new temporal patterns of VL responses. It is also general enough that models could be trained on other viral infections that have patterns of natural or treatment related patient responses (e.g. hepatitis B and C, parvovirus B19), although this may require defining new features that capture disease specific pattern variants.
A common practice in data analytics is to calculate the centroid as the average of the points22,30. However, Table 3 suggests that the mean is not necessarily the best centroid for HIV viral load data. We note two advantages of the centroid algorithm: First, we can choose the centroid best corresponding to the shape of the data, and second, we can use it to mathematically determine the amount of overlap between n-dimensional cluster spheres (i.e. viral load categories). This method may facilitate cross-comparison of HIV research studies by providing a standard for VL pattern classification. Such standardization would be immensely useful in meta-analyses31–35, potentially revealing the influence of different patient care strategies or new relationships between different patient populations.
Our work also explored the trade-off with respect to predictive accuracy between model interpretability and more complex, “black box” approaches to classification. The interpretability versus predictability problem is well known in the deep learning literature36–38. Interpretability is a desirable attribute in clinical classification systems, allowing clinicians to integrate causal physiology and diagnostic information with data features in a way that promotes clearer bedside clinical reasoning. Using an interpretable model for assigning viral load pattern membership may be advantageous when a clinician wishes to use the assigned pattern membership to aid in making a critical clinical decision (e.g. choosing between treatment options), or when examining features that may be linked to a mechanism (e.g. slope of VL decline and viral genotype). A “black box” or more complex model may make such decisions or interpretations more difficult39, which can favor the use of simpler models at the expense of some predictive power.
Along these lines, we have also proposed a novel centroid-based algorithm for summarization of clustering results. This algorithm is not meant to supplant other well defined supervised learning algorithms, but rather to aid in interpretable assignment of VL patterns from other data sets into one of the five categories. The algorithm results are concise, allowing investigators to build the model in their preferred programming language. Hence this method may improve and standardize HIV population research by giving precise definitions to the varying temporal VL patterns, and potentially improving patient care.
Several caveats apply to our work. As noted, this is a single center study, and thus our method should be tested with a much larger data set to cross-validate the categories represented by the clusters. In addition, our feature vector was designed specifically based on observed VL patterns previously reported in the literature, rather than on objective clustering of the data using a standard time-series clustering method40–42. This may limit generalizability to other VL analyses. In addition, some of our features are slightly collinear, with the greatest correlation coefficient being between IQR and wRR (-0.717). However, while HVLS and RVL both have a varied range of IQR, it is clear that the HVLS class has greater wRR than the RVL class due to HVLS patients having a long consistent viral load tail. Furthermore, IQR helps distinguish HVLS from the SLVL and SHVL classes; hence both IQR and wRR are necessary despite the slight correlation. Finally, because our method normalizes time into number of days since first VL measurement, we lose the ability to look for seasonal or yearly patterns in the data.
Our data set did not have patients in whom VL was initially suppressed, and then rebounded (EVL). Although we originally hypothesized the existence of six distinct VL patterns, we found that the emergent VL group was not a pattern identified in our data. Perhaps this is a consequence of a high rate of local patient engagement in therapy in this cohort study, access to care, or the effectiveness of highly active anti-retroviral therapy regimens. We hypothesize that these conditions may not always exist (e.g. in areas where HAART is expensive, when people may lose the ability to pay for therapy), and that in such cases the EVL pattern may indeed be present and significant. Based on the formulation of the Adj MD and wRR features, we hypothesize that a consequence of the grounding function is that any EVL pattern, if it exists, will be grouped under RVL. This grouping may be appropriate, as one can argue that going from a suppressed state to a high VL state is a form of rebounding. Clinical treatment of these patterns is likely to be similar. Further work with data sets that contain RVL patterns will need to be done to test these hypotheses. Unfortunately, we are not aware of any such data currently in the public domain.
Our method used hierarchical clustering to define groups, with a cutoff for group specification at a high level in the branching tree (i.e. level 5). Such thresholds or tuning parameters are characteristic of most unsupervised clustering algorithms22,43. However, identification of important sub-clusters by using a lower threshold is also possible. Clustering results may change depending on the parameter chosen, revealing finer between-cluster differences as the number of clusters increases. The hierarchical clustering algorithm has the advantage that a proper cut-off can be easily visualized. For example, choosing a lower cut-off may reveal that the suppression group splits into categories with different rates of HIV viral load suppression during treatment. Researchers wishing to engineer a new feature vector for VL pattern segregation may find useful the Supplementary material on features we considered but subsequently removed due to poor performance.
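The cut-off behavior discussed above can be illustrated with SciPy's hierarchical clustering: the full merge tree is built once, and the threshold (here expressed as a target cluster count) decides where the tree is cut. The data, the Ward linkage choice, and the cluster counts below are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Random stand-in for the normalized 4-D patient feature vectors.
rng = np.random.default_rng(1)
X = rng.random((200, 4))

# Build the full merge tree once (Ward linkage is an assumption here).
Z = linkage(X, method="ward")

# Cutting high in the tree yields the five coarse VL categories ...
labels = fcluster(Z, t=5, criterion="maxclust")

# ... while a lower cut-off exposes finer sub-clusters without refitting.
sub_labels = fcluster(Z, t=8, criterion="maxclust")
print(len(set(labels)), len(set(sub_labels)))
```

Because the tree is computed once, exploring different cut-offs (and visualizing them with a dendrogram) is cheap, which is the advantage noted in the text.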
We have proposed a set of four unambiguous features which have been successfully used in segregating five different types of temporal viral load patterns: durably suppressed viral load (DSVL), sustained low viral load (SLVL), sustained high viral load (SHVL), high viral load suppression (HVLS), and rebounding viral load (RVL). We have also proposed a novel centroid-based cluster summary algorithm. The use of this algorithm may improve meta-analyses or population studies of viral load patterns by standardizing the classification of HIV patient categories. Furthermore, the segregation process used in this paper (i.e. identifying domain specific features, performing unsupervised clustering, interpreting the results with a cluster summary) can be used to model other viral infections and the response of VL levels over time to treatment or natural disease progression. We also found that using a temporal state variation method is important when considering patient viral load classifications, as changes in patient response can continue to occur beyond previously estimated time frames.
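The segregation process summarized above (features, then unsupervised clustering, then a cluster summary) ends with one representative vector per cluster. The sketch below uses a component-wise mean purely as a stand-in for the paper's centroid-based summary algorithm, which is more elaborate; it only shows the shape of the interface.

```python
import numpy as np

def cluster_summary(X, labels):
    """Return one representative feature vector per cluster label.

    Stand-in for the paper's centroid-based cluster summary: the
    component-wise mean is used here only to illustrate the interface.
    """
    return {k: X[labels == k].mean(axis=0) for k in np.unique(labels)}

# Hypothetical clustered feature data: 100 patients, 4 features,
# labels 1..5 mirroring the five VL categories.
rng = np.random.default_rng(2)
X = rng.random((100, 4))
labels = rng.integers(1, 6, size=100)

summary = cluster_summary(X, labels)
for k, centroid in sorted(summary.items()):
    print(k, np.round(centroid, 2))
```

Each representative vector can then be inspected against the feature definitions (IQR, wRR, adjMD, Area) to attach a clinical interpretation to its cluster.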
AdjMD = adjusted maximal viral load difference, Area = relative area of viral load exposure, ART = anti-retroviral therapy, DT = decision tree, EVL = emerging viral load, DSVL = Durably Suppressed Viral Load, HAART = highly active anti-retroviral therapy, HIV = human immunodeficiency virus, IQR = interquartile range, kNN = k-nearest neighbor, LLVL = low level viral load, LOOCV = leave-one-out cross validation, SHVL = Sustained High Viral Load, SLVL = Sustained Low Viral Load, RVL = Rebounding Viral Load, HVLS = High Viral Load Suppression, SVM = support vector machine, VL = viral load, VLM = viral load measurement, wRR = weighted recency reliability.
Full access to the data is available on GitHub (Data S1): https://doi.org/10.5281/zenodo.131324521
Data S1: Viral load data. The data set used for this study is provided in a completely de-identified CSV format, where the first column of each row is a unique random subject identifier. The subsequent values are pairs (t_{i,j}, VL_{i,j}), where t_{i,j} is the time from a universal T0 of VL measurement j for patient i, and VL_{i,j} is the corresponding VL measurement. Each record (row) has a unique length, depending on the number of VL measurements present for that subject. The study data, and the code used for analysis, can be found at https://doi.org/10.5281/zenodo.131324521.
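A ragged CSV of this shape (one subject per row, followed by alternating time/VL values) can be read with the standard library alone. The subject identifiers and values in the demo are invented; consult Data S1 for the authoritative layout.

```python
import csv
import io

def read_vl_records(lines):
    """Parse rows of the form: subject_id, t1, VL1, t2, VL2, ...

    Returns a dict mapping each subject id to a list of
    (time_in_days, viral_load) pairs. Rows may have different lengths.
    """
    records = {}
    for row in csv.reader(lines):
        subject, values = row[0], [float(v) for v in row[1:]]
        # Interleaved times (even positions) and VLs (odd positions).
        records[subject] = list(zip(values[0::2], values[1::2]))
    return records

# Hypothetical two-subject sample in the described format.
demo = io.StringIO("p001,0,50000,30,400,90,20\np002,0,200,60,48\n")
records = read_vl_records(demo)
print(records["p001"])  # [(0.0, 50000.0), (30.0, 400.0), (90.0, 20.0)]
```

For a real file, pass an opened file object (`open(path, newline="")`) in place of the `StringIO` buffer.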
This work was partially funded by the University of Rochester Clinical and Translational Science Institute grants UL1 TR002001, and TL1 TR002000 from the National Center for Advancing Translational Sciences (NCATS), a component of the National Institutes of Health (NIH). This publication was also made possible through core services and support from the University of Rochester Center for AIDS Research (CFAR), an NIH-funded program (P30 AI078498).
The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health (NIH).
We would like to thank Yusuf Bilgic (State University of New York at Geneseo) and James Java (University of Rochester), for discussions regarding the statistical analyses.
Supplementary Figures:
Figure S1: Viral load distribution. For each pair of consecutive viral load measurements, we calculate the change in days and the change in viral load counts for all patients and plot them as a scatter. The horizontal line of dots that appears between 0 and 2 is an artifact of substituting the values 20 and 48 for the “Pos <20" and “Pos <48" entries in our data. The sequential range of viral load measurements shows that VL measurements taken within 10 days of each other may vary by ±10^5 copies/mL.
Figure S2: Patient feature extraction. Features extracted from 1576 patients, displayed as 2D slices of the 4-dimensional feature space. Each slice plots one feature dimension against another as a scatter plot.
Figure S3: Decision Tree. While some useful rules may be pruned, the tree is otherwise complicated and difficult to draw useful conclusions from.
Figure S4: Seven centroid calculations on clustered viral load data. For each cluster, the seven methods of calculating a globular cluster center are shown in comparison to each other (calculated on the normalized and clustered viral load data). Since the PnP method can produce a center outside the range [0,1], an indicator is shown whenever the center falls beyond this range.
Figure S5: Centroid methods. Illustrates how the seven methods behave on an example point set. The green target marks the exact center found according to the different methods in our algorithm.
Supplementary File 1: A review of currently published viral load categorization methods, together with the features and centroid detection methodologies that were considered but not used.
Supplementary Tables:
Reviewer report 1 (Expertise: Host and pathogen genomics; competing interests: none disclosed):
Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound? Yes
Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Partly

Reviewer report 2 (Expertise: mathematical modeling of HIV; competing interests: none disclosed):
Is the rationale for developing the new method (or application) clearly explained? Yes
Is the description of the method technically sound? Yes
Are sufficient details provided to allow replication of the method development and its use by others? Yes
If any results are presented, are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions about the method and its performance adequately supported by the findings presented in the article? Yes