1 Introduction

Colorectal cancer is a subtype of cancer that affects the lower portion of the gastrointestinal tract and develops in the cells lining the colon and rectum [24]. It can be further divided according to the site where the pathology develops. Colon and rectal cancers are, in fact, different pathologies, with different associated genetic causes and different progressions according to distinct molecular pathways [29]. Statistics show that colorectal cancer is the most common form of cancer in the digestive system, and the third most common and fourth deadliest cancer overall [15].

The use of machine learning (ML) techniques has been growing in cancer research [17]. The accurate prediction of survivability in patients with cancer remains a challenge, mainly due to the heterogeneity and complexity of the disease. However, accurate survivability prediction is important for patients with cancer so that they can make the most of the rest of their lives. It is also important to help clinicians make the best decisions when palliative care is an essential component of the process. Given that colon and rectal cancers are the most common cancers of the digestive system, one would expect the existence of numerous tools for ascertaining the likelihood of a patient surviving this disease. Although there are some tools for this task, few provide predictions for both colon and rectal cancer, and none of them apply ML techniques in order to build evolving predictive models. Furthermore, the digital medium through which they are delivered may hinder their consultation at the point of care.

The objective of this work is to present an easy-to-use tool that provides survivability predictions for colon and rectal cancer patients for 1, 2, 3, 4 and 5 years after diagnosis and treatment. Due to the ubiquitous presence of mobile devices in everyday life and the ease with which one is able to consult these devices and use their applications, we chose to develop this tool as a mobile application. The underlying model for survivability prediction was obtained through ML techniques applied to the data from the Surveillance, Epidemiology, and End Results (SEER) program [18], a large cancer registry in the United States, and arguably the most complete cancer database in the world. The dataset includes records of patients diagnosed with different types of cancer from 1973 to 2012. The focus of this paper is the mobile solution developed for survivability prediction, but a part of the paper is dedicated to briefly describing its underlying ML model so as to provide a better comprehension of the work as a whole.

The paper is structured as follows. Section 2 presents related work featuring survivability prediction tools for colon and rectal cancer, with an analysis of their main strengths and limitations. Section 3 describes the selected requirements for the tool, its underlying ML-based predictive model, architecture and a comprehensive use case. Section 4 provides a reflection about the strengths and limitations of our approach. Finally, Sect. 5 presents the conclusions drawn so far and future work considerations.

2 Related Work

Existing tools for colon or rectal cancer survivability prediction are mostly available as web applications. Table 1 shows a summary of their main features, namely: (1) whether the application is used for colon or rectal cancer; (2) the number of features necessary to get a prediction; (3) the data set that its underlying model is based on; (4) the technique used to construct the predictive model; (5) the type of target prediction it produces; and (6) a measure of performance in the form of a concordance index (C-index). The C-index corresponds to the probability of giving a correct response in a binary prediction problem. It is considered to be numerically equivalent to the area under the ROC curve (AUC) [16].

Table 1. Characteristics of applications for colon and rectal cancer survivability prediction.

There is a disparity in the number of features used in each tool. However, twelve [22] or even nine [10, 25] features may be too much information for a physician to input on-the-fly. Furthermore, there are cases in which the increased number of features does not necessarily translate into a better performance, as can be seen in the direct comparison between the works in [28] and in [22].

All the underlying models are based on statistical modelling, most notably on Cox regression analysis [8]. This is the dominant multivariate approach used in survivability prediction and corresponds to a multiple linear regression of the logarithm of the hazard on a set of variables. This indicates that the use of soft computing techniques, namely ML, in survivability prediction, especially in colon and rectal cancer, has yet to be fully explored. Since one of the advantages of ML is having more discriminative power in identifying patterns in data and finding nuances that may escape statistical modelling, its usage for survivability prediction may result in models with better performance [17]. As such, ML was chosen as the modelling approach for this work.
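For reference, the Cox proportional hazards model mentioned above is commonly written as follows. The notation is introduced here for illustration and does not come from the tools reviewed: h_0(t) denotes an unspecified baseline hazard, x_1, ..., x_p the patient covariates, and β_1, ..., β_p the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta_1 x_1 + \dots + \beta_p x_p\right)
\quad\Longleftrightarrow\quad
\log\frac{h(t \mid x)}{h_0(t)} = \beta_1 x_1 + \dots + \beta_p x_p
```

The right-hand form makes the linear structure explicit: the covariates act additively on the logarithm of the hazard ratio.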

Most of the target predictions, either for colon or for rectal cancer, cover a 5-year span [22, 27, 28]. Even though there are models that cover a wider time span [7, 10, 11, 25], the five-year barrier is an important goal for a colorectal cancer patient to overcome, and is used throughout clinical practice guidelines [4, 5] as a turning point for follow-up procedures, at which surveillance of the patient is relaxed, and for the assessment of the recurrence risk. For this reason, the present work also has a target prediction of five years. Another noteworthy observation is that only two of the tools feature conditional survivability predictions.

To determine whether the tools are suitable for mobile devices, the applications were analysed using the mobile-friendly test tool from Google (Footnote 1). The results showed that, except for the tools reported in [22, 28], none is suitable for mobile access. The test revealed that the text was too small to read, the mobile viewport was not set, links were too close to each other and, usually, the content was wider than the screen. Therefore, few of these applications have a mobile-friendly design. Another goal of this work is to address this limitation by developing a cross-platform tool that is available to users in a practical and intuitive way, through a smartphone or tablet.

3 CRCPredictor: An Application for Survivability Prediction

Throughout the last decade, mobile phones have gone from being simple phones to being handheld, pocket-sized computers. Their capabilities, namely their processing and on-board computing capacity, encourage the development of applications [6]. According to data from the International Data Corporation (IDC) Worldwide Quarterly Mobile Phone Tracker, Google's Android and Apple's iOS are the two most popular smartphone operating systems [12].

For the health care industry, mobile applications have opened new frontiers in providing better care and services to patients. Moreover, they are revolutionizing the way information is managed and made available [23]. The portability of mobile applications can increase the productivity of health care professionals. It grants rapid access to information and multimedia resources, allowing health care professionals to make decisions more quickly and with a lower error rate, while increasing the quality of patient documentation and improving workflow patterns [26]. This work presents an assistive tool to help physicians improve their practice. The problem it addresses is predicting the survivability of colorectal cancer patients in an individualized manner.

3.1 Requirements for the Survivability Prediction Tool

Several functionalities were delineated to achieve a solution that covers the limitations mentioned in Sect. 2 and, at the same time, is able to help physicians to improve their practice. These functionalities are summarized in the following functional requirements for the prediction tool: allow the user to select the cancer type (either colon or rectal) for which a prediction is sought; allow the user to provide inputs for a set of selected features, based on which the underlying models generate survivability predictions; allow the user to choose the value of an input for a feature from a set of pre-determined values; provide a survivability prediction, according to the inputs, for 1, 2, 3, 4 and 5 years after the diagnosis and treatment; provide a likelihood value for the prediction of each year; allow the visualization of the predictions and likelihood values in a chart; and allow the insertion of new patient records into the case database, thus increasing the number of cases for the periodic recalculation of the prediction models.

Additional requirements for the tool are that it should be made available on the two main mobile platforms (iOS and Android) and that it should be able to recalculate the prediction models upon the addition of a significant number of new patient records. This makes the prediction models dynamic and should ensure their evolution over time.

3.2 Colon and Rectal Cancer Survivability Prediction Models

Survivability prediction was approached as a binary classification problem. The goal was to produce predictions for 1, 2, 3, 4 and 5 years after treatment of colon or rectal cancer. Each of the five classification labels (one for each of years 1, 2, 3, 4 and 5) could take only two values: survived or not survived. As such, it was necessary to build five survivability prediction models (one per year) for each type of cancer. The models were built from the SEER dataset. The criteria for selecting patient records were the same for both colon and rectal cancer. Only patients aged 18 years or older were selected. Patients who were alive at the end of the data collection but whose survival time had not yet reached 60 months (five years), and those who died of causes other than colon or rectal cancer, were excluded. After preprocessing, 38,592 cases were isolated for colon cancer and 12,818 cases for rectal cancer. From the isolated cases for each pathology, 10 % were selected for the testing sets. After filtering out cases with “unknown” values, the colon cancer testing set had 2,221 cases and the training set had 20,061 cases. The testing set for rectal cancer had 551 cases and the training set had 4,962 cases. In total, the training set had 61 attributes representing possible classification features.
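The labelling and exclusion rules described above can be illustrated with a minimal Python sketch. It is not the preprocessing actually performed in RapidMiner: the function name, its parameters, and the rule that "survived year k" means a survival time of at least 12·k months are assumptions made for illustration only.

```python
from typing import List, Optional


def yearly_labels(survival_months: int,
                  alive_at_cutoff: bool,
                  died_of_cancer: bool) -> Optional[List[bool]]:
    """Return [survived year 1, ..., survived year 5], or None if excluded.

    Hypothetical helper: parameter names are illustrative and do not
    correspond to actual SEER attribute names.
    """
    # Excluded: alive at the end of data collection with < 60 months follow-up.
    if alive_at_cutoff and survival_months < 60:
        return None
    # Excluded: died of a cause other than colon or rectal cancer.
    if not alive_at_cutoff and not died_of_cancer:
        return None
    # Assumption: surviving year k means surviving at least 12 * k months.
    return [survival_months >= 12 * year for year in range(1, 6)]


# A patient who died of the cancer after 30 months survived years 1 and 2 only.
print(yearly_labels(30, alive_at_cutoff=False, died_of_cancer=True))
```

Each position in the returned list would serve as the target of one of the five per-year models.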

All the phases, from preprocessing to evaluation, were executed using RapidMiner (Footnote 2), an open source data mining software chosen for its workflow-based interface and intuitive application programming interface (API).

Using the Optimize Selection [21] operator for feature selection with the classification labels as targets, six features were obtained for each cancer type. Their names and descriptions are shown in Tables 2 and 3. The training sets for colon and rectal cancer, with their respective selected features, were used to learn multiple prediction models with different ML ensemble methods, such as bagging, AdaBoost, Bayesian boosting, stacking, and voting. The accuracy, the AUC and the F-measure were used as performance measures to evaluate the models developed for colon and rectal cancer. The accuracy is the percentage of correct responses among the examined cases [9]. The F-measure is a combination of precision (a form of accuracy, also known as positive predictive value) and recall (also known as sensitivity) [20]. The AUC can be interpreted as the percentage of randomly drawn pairs of individuals, one from each of the two populations, that are correctly ranked by the classifier [16]. These measures were calculated using the training data set and 10-fold cross validation. By applying the models to the testing sets, we calculated the percentage of incorrectly classified cases. Stacking (Footnote 3) [14], using k-NN, decision tree, and random forest classifiers as base learners and a naive Bayes classifier as the stacking model learner, produced the best performing model for both colon and rectal cancer. Upon prediction, the model is capable of providing a confidence value that represents the likelihood of the prediction. Table 4 shows the performance values of the best model developed, for both cancer types.
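To make the three performance measures concrete, the following pure-Python sketch computes them on a toy example. It is only an illustration and not the RapidMiner evaluation used in this work. It assumes the balanced F-measure (harmonic mean of precision and recall) and computes the AUC by direct pair counting, with ties counted as one half; all function names are hypothetical.

```python
from typing import Sequence


def accuracy(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Fraction of cases whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f_measure(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """Balanced F-measure: harmonic mean of precision and recall (assumption)."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def auc(y_true: Sequence[bool], scores: Sequence[float]) -> float:
    """Share of (positive, negative) pairs where the positive scores higher."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: True means "survived"; scores are confidence values for survival.
y_true = [True, True, False, False]
y_pred = [True, False, False, False]
scores = [0.9, 0.4, 0.5, 0.1]
print(accuracy(y_true, y_pred), f_measure(y_true, y_pred), auc(y_true, scores))
```

Pair counting is quadratic in the number of cases, which is acceptable for an illustration but would normally be replaced by a rank-based computation on data sets of the size used here.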

Since the intent of this paper is to present the features of the developed tool and describe its inner workings, an exhaustive description of the ML process was considered out of scope.

Table 2. Features obtained by feature selection and used for colon cancer models.
Table 3. Features obtained by feature selection and used for rectal cancer models.
Table 4. Performance measures for 10-fold cross validation and the incorrectly classified cases from the test data set of the stacking model.

3.3 Architecture

The CRCPredictor is a hybrid mobile application targeting smartphones and tablets. The back-end of this tool includes two web services: one to give the survivability prediction responses for colon or rectal cancer to the user and another to recalculate the survivability prediction models. Figure 1 shows the architecture of the CRCPredictor system.

Fig. 1. Architecture of the CRCPredictor system.

The Survival Prediction App was developed using a hybrid approach, between a web and a native methodology. This allows an abstraction from the native language of the target operating system while retaining the core features of a native app. A hybrid application is developed by applying web technologies (mainly HTML5, CSS and JavaScript) and is executed inside a native container on the mobile device. It is suitable for multiple platforms and is distributable through an application store, just like native applications. This type of approach can perform worse than native applications. However, current mobile devices are powerful enough that the performance gap is hardly noticeable. The application was developed using AngularJS, Ionic Framework, and Cordova. Cordova wraps the HTML/JavaScript app into a native container which can access the device functions of several platforms [1]. These functions are exposed via a unified JavaScript API, providing easy access to the native functionalities.

The Survival Prediction Model Server Application was developed to cover the need for an individualized system, able to respond according to a particular set of patient characteristics. It exposes a set of RESTful web services. This service architecture was chosen for being lightweight, easy to access and scalable [30]. The web services were developed in Java with the Java API for RESTful Web Services (JAX-RS) [2]. The data are sent using the HTTP POST method when the health care professional submits the values for the prediction features on the Survival Prediction App. The RESTful web service, using the RapidMiner API, receives the values and feeds them to the corresponding models, encoded in XML files. The response with the survivability predictions for the five years is returned in JSON format.
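The exchange between the app and the prediction service can be pictured with the short Python sketch below. The actual message schema is not given in this paper, so every field name (`cancerType`, `features`, `predictions`) and both helper functions are hypothetical; the sketch only shows a JSON body being built for a POST request and a JSON response with one confidence value per year being parsed.

```python
import json
from typing import Dict


def build_request(cancer_type: str, features: Dict[str, object]) -> str:
    """Serialize a prediction request body (hypothetical schema)."""
    if cancer_type not in ("colon", "rectal"):
        raise ValueError("cancer_type must be 'colon' or 'rectal'")
    return json.dumps({"cancerType": cancer_type, "features": features},
                      sort_keys=True)


def parse_response(body: str) -> Dict[int, float]:
    """Map each year (1 to 5) to the confidence that the patient survives it."""
    data = json.loads(body)
    return {int(year): float(conf)
            for year, conf in data["predictions"].items()}


# Illustrative request for a 55-year-old patient; feature keys are made up.
request_body = build_request("colon", {"age": 55, "nodesExamined": 5})
print(request_body)

# Illustrative response in the assumed format.
response_body = '{"predictions": {"1": 1.0, "2": 1.0, "3": 1.0, "4": 0.0, "5": 0.0}}'
print(parse_response(response_body))
```

Keeping the year keys explicit in the response, rather than relying on array order, is one simple way to keep the client robust if the set of prediction horizons changes.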

The Online Learning Server Application also follows a REST architecture. It handles newly submitted patient data. The outcomes are added to a database for the later recalculation of all the models, which keeps them up to date. The data are inserted into a NoSQL database and, for every 1,000 new records, the models for the five years are recalculated, generating five new XML files for the type of cancer that has just received its thousandth new case. The threshold of 1,000 was arbitrarily defined and can be adjusted.
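The recalculation rule can be sketched as a per-cancer-type counter, as in the Python fragment below. This is an illustrative assumption about the control flow, not the server implementation: the class and method names are hypothetical, and the database insertion and the model retraining themselves are left out.

```python
class RetrainTrigger:
    """Counts new cases per cancer type and signals when to retrain.

    Hypothetical helper: the default threshold of 1000 mirrors the value
    described in the text and is adjustable.
    """

    def __init__(self, threshold: int = 1000) -> None:
        self.threshold = threshold
        self.pending = {"colon": 0, "rectal": 0}

    def add_case(self, cancer_type: str) -> bool:
        """Register one new case; return True if retraining is now due."""
        self.pending[cancer_type] += 1
        if self.pending[cancer_type] >= self.threshold:
            # Reset so that the next batch of cases starts a new cycle.
            self.pending[cancer_type] = 0
            return True
        return False


# With a small threshold for demonstration, the third colon case triggers.
trigger = RetrainTrigger(threshold=3)
print([trigger.add_case("colon") for _ in range(3)])
```

Counting each cancer type separately matches the description that only the five models of the type that reached the threshold are regenerated.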

3.4 Use Case

Figure 2a shows the first screen that appears when the Survival Prediction App of the CRCPredictor is initiated. By clicking on the menu (Fig. 2b), all options available in this application become visible.

Fig. 2. Home screen and menu of the Survival Prediction App of the CRCPredictor.

A typical use case is getting a prediction for colon cancer survivability. Supposing a physician is treating a patient diagnosed with colon cancer, once the type of cancer is set in the home screen (as shown in Fig. 2b), the health care professional enters the values for the selected features (Fig. 3a). All features, except for the age of the patient, are filled in by choosing a value from a list of available options. For the stage of the patient, the physician can choose between the TNM system and the grouped stage, known as the American Joint Committee on Cancer (AJCC) stage. Consider the case of a 55-year-old patient with a positive/elevated carcinoembryonic antigen value, regional lymph nodes assessed as not clinically evident, the primary site of the cancer in the sigmoid colon, stage 0, and 5 regional nodes examined. Upon submission, the values are sent to the Survival Prediction Model Server Application and the outcome is calculated. The prediction is always provided in the form of confidence values for a positive prediction, i.e., the confidence that the patient will survive. This is displayed in a new screen in the form of a bar chart (Fig. 3b). The results show that, while the model predicted with 100 % confidence that the patient will survive the first three years, the confidence that the patient will survive the fourth and fifth years is 0 %. To predict the survivability of a patient diagnosed with rectal cancer, the procedure is similar to the one used for colon cancer.

Fig. 3. Colon cancer survivability calculator.

4 Analysis and Discussion

In terms of inputs, the constructed prediction models, for both colon and rectal cancer, require only six selected features. Compared with the related tools in Sect. 2, this number is lower than the number of features used in the underlying models of the tools described in [7, 10, 22, 25], and is closer to the number of inputs of the remaining prediction tools. The number of input features may be crucial to the adoption or the rejection of a tool, as it may become difficult to use it on-the-fly if too much information is needed. Another aspect to note is that, apart from the age at diagnosis and the AJCC stage, the feature selection produced two very different sets for colon and for rectal cancer, which is in line with the notion that the two, although having aspects in common, are different pathologies. Regarding the colon cancer features, the age at diagnosis and the AJCC stage are present in most colon cancer prediction tools [11, 22, 28]. The other selected features are not usually present, but they are closely related to the ones that are. For instance, the clinical assessment of regional lymph nodes is a product of a medical evaluation of a feature widely used in the existing tools [22, 28], namely the number of lymph nodes found to have cancer out of the lymph nodes isolated during surgery. The same can be said about the selected features for rectal cancer, i.e., they are, at the very least, closely related to the ones used in other prediction tools.

In [3], the use of ML ensemble models to develop survival prediction models for colon cancer is described. The modelling component of our work is similar to that approach; therefore, it is possible to compare the performance of our selected model with theirs. The classification accuracies reported in [3] for years 1, 2, and 5 were 90.38 %, 88.01 %, and 85.13 %. The reported AUCs were 0.96, 0.95, and 0.92, respectively. As shown in Table 4, we were able to improve the classification results with our models. This direct comparison is not possible for the rectal cancer model as it was not possible to find such a closely related work in the literature. However, it is possible to verify that the rectal cancer model performs worse than the colon model in every metric, possibly due to the smaller size of the training set used in the learning process. At the same time, both the colon and rectal cancer models showed low classification errors on the randomly selected test data sets. Additionally, when comparing the AUCs of the generated models in Table 4 with the C-indexes in Table 1, it is possible to conclude that the generated models show a better discriminative power than the currently available models.

Regarding the CRCPredictor system, it fulfils the requirements defined at the beginning of the work. The distinguishing features of the system’s architecture are its flexibility and scalability, which make the addition of new features (services) simple and easy. The Survival Prediction App was developed as a mobile-friendly application, enabling the easy access of health care professionals to its functionalities on their mobile devices. Another component that distinguishes this system from established tools is the Online Learning Server Application which ensures the continuous evolution of the prediction models. However, the system does not provide conditional survivability predictions, which makes it less appealing when compared with the works in [7, 11, 25], as this is a type of information that health care professionals generally like to know.

5 Conclusions and Future Work

The main contribution of this work is a survivability prediction tool for colon and rectal cancer. Its distinguishing features are a balance between the number of necessary inputs and prediction performance, being mobile-friendly, and featuring an online learning component that enables the automatic recalculation and evolution of the prediction models upon the addition of new cases. The goal of this tool is to facilitate the access of health care professionals to instruments capable of enriching their practice and improving their results. Future work on the tool includes the development of conditional survivability models that allow the user to get a prediction knowing that the patient has already survived a number of years after diagnosis and treatment. Additionally, we intend to conduct experiments to assess how well the tool fulfils the needs of health care professionals and to identify aspects to improve. The models presented herein will also be considered for inclusion in a guideline-based decision support system, described in [19], as a dynamic knowledge complement to the static recommendations of clinical practice guidelines. Since colon and rectal cancer affect mostly the elderly, this survivability prediction application can be used within a technological environment, such as the one disclosed in [13], to provide better support to this population group.