Collection of HSCT outcome data and its utilization

Globally, the collection and analysis of information on the diseases and post-transplant courses of allogeneic hematopoietic stem cell transplant recipients have played an important role in improving the therapeutic outcomes of hematopoietic stem cell transplantation (HSCT) [1, 2]. Patient registries, typically referred to as outcome registries, are organized systems that use observational study methods to collect uniform data to evaluate specified outcomes for a particular disease, condition, or exposure [3, 4]. In Japan, the Japan Society for Hematopoietic Cell Transplantation (JSHCT), acting as a hub in collaboration with the Japanese Society of Pediatric Hematology and Oncology (JSPO), the Japan Marrow Donor Program (JMDP), cord blood banks, and more than 300 transplant centers throughout the country, has worked to collect and analyze information on disease types and post-transplant courses of recipients for more than 20 years [5, 6]. The Japanese Data Center for Hematopoietic Cell Transplantation (JDCHCT) was founded and began operations in 2014. Since 2014, with government support and in collaboration with the JSHCT, the JDCHCT has been responsible for collecting and analyzing HSCT recipient and donor information under the “Act for Appropriate Provision of Hematopoietic Stem Cells to be Used in Transplantations”.

Introduction of a second-generation Transplant Registry Unified Management Program

HSCT clinical outcome data are collected through the Transplant Registry Unified Management Program (TRUMP), as described previously [5]. The first-generation TRUMP (TRUMP1) was a software suite developed to manage HSCT outcome data offline within transplant centers. Anonymized and encrypted submission datasets were stored locally and sent to the data center by postal mail or through a web submission page. A logical check program was implemented within TRUMP1 to detect missing or inconsistent data. The introduction of TRUMP1 was a success, with an adoption rate of >99 % among approximately 250 adult and 90 pediatric transplant centers in Japan. However, TRUMP1 had a number of limitations. Because it was an offline program, data collection required a complex and time-consuming process. For HSCT from unrelated donors, donor baseline data were sent from the JMDP or cord blood banks to the transplant center on a reporting fax sheet and were then entered into TRUMP1 at the transplant center.

The second-generation TRUMP (TRUMP2) was released in 2015. TRUMP2 is a web-based HSCT registry database that can also be used offline by transplant centers that are unable to report online. The program includes 1251 items for outcome data collection and is aligned with internationally harmonized essential data, i.e., the Transplant Essential Data of the Center for International Blood and Marrow Transplant Research (CIBMTR) or the Medical Essential Data A of the European Group for Blood and Marrow Transplantation. A data linkage system with the JMDP and cord blood banks is implemented in TRUMP2: for HSCT from unrelated donors, recipient and donor baseline information, HLA information, and cell dose at harvest are provided by the JMDP or cord blood banks. An anonymized identification number is assigned to each recipient upon entry into TRUMP2. A recipient who undergoes multiple HSCTs keeps the same anonymized identification number, and individual HSCT cases are identified using the date of HSCT as the key variable.
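The keying scheme described above (one anonymized identification number per recipient, with the date of HSCT distinguishing individual transplants) can be sketched as follows. This is a minimal Python illustration, not the TRUMP2 implementation; the class name, field names, and identifier format are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class HsctCase:
    recipient_id: str  # anonymized identification number (hypothetical format)
    hsct_date: date    # date of HSCT, used as the key variable
    donor_source: str  # illustrative field only


def index_cases(cases: List[HsctCase]) -> Dict[Tuple[str, date], HsctCase]:
    """Index cases by (recipient_id, hsct_date) and reject duplicate keys."""
    index: Dict[Tuple[str, date], HsctCase] = {}
    for case in cases:
        key = (case.recipient_id, case.hsct_date)
        if key in index:
            raise ValueError(f"duplicate HSCT record for key {key}")
        index[key] = case
    return index


def transplant_numbers(cases: List[HsctCase]) -> Dict[Tuple[str, date], int]:
    """Number each recipient's transplants chronologically (1 = first HSCT)."""
    numbers: Dict[Tuple[str, date], int] = {}
    count: Dict[str, int] = {}
    for case in sorted(cases, key=lambda c: (c.recipient_id, c.hsct_date)):
        count[case.recipient_id] = count.get(case.recipient_id, 0) + 1
        numbers[(case.recipient_id, case.hsct_date)] = count[case.recipient_id]
    return numbers
```

With such a composite key, a second transplant for the same recipient adds a new case without creating a second recipient, which is what allows analyses restricted to, for example, first transplants only.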

The outcome data collection consists of two types of data submission: “Basic Registration” within the first 2 months of the year, which reports 10 items for each HSCT case performed in the previous year, and “Main Registration” about 6 months later, which reports all required items. Information on survival, underlying disease status, and long-term complications, including chronic GVHD and subsequent malignancies, is updated annually. A logical check program that detects missing and inconsistent data is implemented for data quality management, and an error list is generated for each HSCT case. For missing data, for example, the reason the data are missing must be specified before the submission is considered complete. The completeness of data entry and follow-up for each center is included in the annual report of the JDCHCT/JSHCT, which is published openly on the JDCHCT web page. Reporting of outcome data for all HSCT cases is required for center approval by the JMDP and cord blood banks.
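The kind of logical check described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the item names, the list of required items, and the specific consistency rules are hypothetical and are not taken from the TRUMP check program.

```python
from datetime import date
from typing import Dict, List, Optional

# Hypothetical subset of required items; the real forms contain many more.
REQUIRED_ITEMS = ["hsct_date", "recipient_sex", "last_followup_date", "alive"]


def check_case(record: Dict[str, object],
               missing_reasons: Optional[Dict[str, str]] = None) -> List[str]:
    """Return the error list for one HSCT case (an empty list means complete)."""
    missing_reasons = missing_reasons or {}
    errors: List[str] = []

    # Missing data: acceptable only when a reason has been documented.
    for item in REQUIRED_ITEMS:
        if record.get(item) is None and not missing_reasons.get(item):
            errors.append(f"{item}: missing without a stated reason")

    # Inconsistent data: follow-up cannot precede the transplant.
    hsct = record.get("hsct_date")
    last = record.get("last_followup_date")
    if isinstance(hsct, date) and isinstance(last, date) and last < hsct:
        errors.append("last_followup_date: earlier than hsct_date")

    # Inconsistent data: a death date contradicts a recipient recorded as alive.
    if record.get("alive") is True and record.get("death_date") is not None:
        errors.append("death_date: present although recipient is recorded as alive")

    return errors
```

A submission would be treated as complete only when `check_case` returns an empty list, either because every required item is filled in or because each gap carries a documented reason.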

TRUMP forms are revised periodically. Revisions are proposed by Working Groups (described in the following section), researchers, or data center staff. Revision proposals must undergo careful review and approval before they are applied to TRUMP. Additional data collection for specific studies is also performed. The number of studies requiring additional data collection is limited by data center resources and by consideration of the burden on transplant centers.

Utilization of data and publication

These outcome data can in turn serve as the basis for future scientific and clinical trials or policymaking. Findings from observational research have led to significant improvement in the field of HSCT and to better clinical trial planning [7]. Collaboration and data sharing to maximize utilization and optimize patient outcomes are the key to an effective registry. To promote data utilization for research, it is necessary to ensure high-quality data and to construct a data utilization system that is accessible to researchers. The JSHCT formed Working Groups for HSCT research, which are similar to the Working Committees of the Center for International Blood and Marrow Transplant Research and the Working Parties of the European Group for Blood and Marrow Transplantation. The JDCHCT supports Working Group activities and promotes collaborative studies with other research groups. These activities have contributed to an increase in publications using TRUMP data in this field. Registry studies using TRUMP data can address questions such as HSCT results in specific patient groups, analysis of prognostic factors, evaluation of conditioning regimens, comparison of donor/stem cell sources, and characterization of rare late effects. Linking clinical data with immunological and genetic information can provide important insights into transplant biology. The study proposal and approval process, the study progress management process, and authorship guidelines are organized and managed accordingly.

Fixed TRUMP dataset

The TRUMP dataset is fixed annually and is used to generate the annual report and summary slides on HSCT activities and outcomes, as well as for other research purposes. The fixed TRUMP dataset is structured according to a data collection item list containing 1251 variables with various types of values. Outcome data collection by the JSHCT was initiated in 1993, and the forms have undergone various revisions. In addition, four registries (JSHCT, JSPO, JMDP, and cord blood banks coordinated by the Japan Cord Blood Bank Network) used different forms with different items and values until the forms were unified and harmonized, as described elsewhere [5]. Data collected before the introduction of TRUMP1 were converted to the TRUMP structure and imported into TRUMP1 at each transplant center. All submitted HSCT outcome data are thus formatted in the TRUMP structure, and follow-up information on surviving recipients can be entered by using TRUMP2. However, for data collected before the introduction of TRUMP1, only limited data fields were required at the time of data submission, and no logical check programs were in place.

Registry studies are observational studies that are retrospective in nature. Great care should be taken to identify and address possible biases. Misunderstanding of data definitions or mishandling of variables and values, including missing values, may lead to wrong conclusions. Careful handling is required in the selection of subjects for analyses, and inclusion and exclusion criteria should be defined clearly. Completeness of follow-up is also an important issue: differences in completeness of follow-up between the groups being compared may lead to inappropriate comparisons. Users of TRUMP data for research should thus understand both the current form and past forms, including the revision process.

Shared script for defining variables

Certain baseline, disease, transplant-related, or outcome variables are used repeatedly in HSCT outcome registry studies. The JDCHCT and JSHCT defined and introduced shared scripts that generate these variables according to unified definitions, both for quality control and to improve the efficiency of registry studies using TRUMP data. Shared scripts are developed using Stata (Stata Corporation, College Station, TX, USA) and R (R Foundation for Statistical Computing, Vienna, Austria) or EZR. EZR is a graphical user interface for R; it is a modified version of R Commander designed to add statistical functions that are frequently used in biostatistics [8].

The data management process for observational research includes generating and defining variables, defining study subjects and characteristics, and performing statistical analyses as designed. Great care should be taken in the quality management of statistical analyses. Shared scripts can play a role in guaranteeing the reproducibility of statistical analyses. The script used for the final analyses and its log should be recorded along with the statistical results. Educational seminars on the shared scripts and the fixed TRUMP dataset are held periodically.

Baseline, disease, and transplant variables

Variables generated with shared scripts are shown in Table 1. HLA variables and HLA matching variables are described in a different paper in the same series.

Table 1 Baseline, disease, and transplant variables defined in shared scripts

Baseline variables include recipient and donor age, age groups, sex, and ABO type. Recipient–donor sex and ABO matching variables are also generated. Disease risk for leukemia is defined as standard vs. advanced or low risk vs. high risk, according to the disease status at transplant. Three frequently used disease-risk groupings for leukemia are provided. In the first, first or second complete remission (CR) of AML, first CR of ALL, first chronic phase of CML, and refractory anemia or refractory anemia with ringed sideroblasts are standard-risk diseases, and all others are advanced. In the second, first or second CR of AML and ALL, first or second chronic phase of CML, and myelodysplastic syndromes (MDS) are low risk, and all others are high risk. In the third, disease statuses with a probability of 5-year overall survival greater than 45 % in the TRUMP 2010 fixed dataset are considered standard risk; the 5-year overall survival of leukemia as a whole (AML, ALL, CML, and MDS), which was 43.3 % (95 % confidence interval 42.8–44.1 %), was taken as the reference. Under this third grouping, first or second CR of AML and ALL, first or second chronic phase and accelerated phase of CML, and refractory anemia or refractory anemia with ringed sideroblasts of MDS are standard-risk diseases, and all others are advanced.
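The first two disease-risk groupings described above can be written as simple lookups. The sketch below is a Python illustration only (the shared scripts themselves are written in Stata and R); the disease and status codes such as "CR1", "CP1", and "RARS" and the function names are hypothetical and do not reflect the actual TRUMP variable coding.

```python
from typing import Optional

LEUKEMIA = {"AML", "ALL", "CML", "MDS"}

# First grouping: standard vs. advanced.
STANDARD_RISK = {
    ("AML", "CR1"), ("AML", "CR2"),
    ("ALL", "CR1"),
    ("CML", "CP1"),
    ("MDS", "RA"), ("MDS", "RARS"),
}

# Second grouping: low vs. high risk (all MDS counts as low risk).
LOW_RISK = {
    ("AML", "CR1"), ("AML", "CR2"),
    ("ALL", "CR1"), ("ALL", "CR2"),
    ("CML", "CP1"), ("CML", "CP2"),
}


def risk_standard_vs_advanced(disease: str, status: Optional[str]) -> Optional[str]:
    """First grouping; returns None for non-leukemia or unknown status."""
    if disease not in LEUKEMIA or status is None:
        return None
    return "standard" if (disease, status) in STANDARD_RISK else "advanced"


def risk_low_vs_high(disease: str, status: Optional[str]) -> Optional[str]:
    """Second grouping; MDS is low risk regardless of status."""
    if disease not in LEUKEMIA:
        return None
    if disease == "MDS":
        return "low"
    if status is None:
        return None
    return "low" if (disease, status) in LOW_RISK else "high"
```

The same patient can fall into different categories depending on the grouping: ALL in second CR, for instance, is advanced under the first grouping but low risk under the second, which is why a study should state which grouping it used.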

Conditioning regimens are classified as myeloablative if total body irradiation (TBI) >8 Gy, oral busulfan ≥9 mg/kg, intravenous busulfan ≥7.2 mg/kg, or melphalan >140 mg/m² was used, based on the report from the CIBMTR [9, 10]. Regimens with insufficient information on the doses of agents or radiation are classified according to the conditioning intensity reported by the treating clinicians (whether or not the regimen was intended to be myeloablative). Under this logic, approximately 15 % of the conditioning intensity information reported by the treating clinicians was re-classified. Conditioning regimens are also categorized by the agents used in typical regimens. For myeloablative conditioning, the categories are cyclophosphamide- and TBI-containing regimens, other TBI-containing regimens, busulfan- and cyclophosphamide-containing non-TBI regimens, and other non-TBI regimens. For reduced-intensity conditioning, the categories are fludarabine- and busulfan-containing regimens, fludarabine- and cyclophosphamide-containing regimens, fludarabine- and melphalan-containing regimens, and other regimens. The GVHD prophylaxis variable is categorized as cyclosporine based or tacrolimus based. In vivo T cell depletion is defined as the use of Campath (alemtuzumab), ATG, or ALG for conditioning or GVHD prophylaxis.
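The dose-based rule for conditioning intensity, with the fallback to the clinician-reported intensity, can be sketched as follows. This is a Python illustration under stated assumptions: the function and parameter names are hypothetical, and the convention that `None` means "dose unknown" while `0` means "agent not used" is an assumption made here to represent "insufficient information", not something specified in the source.

```python
from typing import Optional


def conditioning_intensity(tbi_gy: Optional[float],
                           oral_bu_mg_kg: Optional[float],
                           iv_bu_mg_kg: Optional[float],
                           mel_mg_m2: Optional[float],
                           reported_myeloablative: Optional[bool]) -> Optional[str]:
    """Classify a regimen as 'myeloablative' or 'reduced-intensity'.

    Assumed convention: None = dose unknown, 0 = agent not used.
    """
    meets_threshold = (
        (tbi_gy is not None and tbi_gy > 8)
        or (oral_bu_mg_kg is not None and oral_bu_mg_kg >= 9)
        or (iv_bu_mg_kg is not None and iv_bu_mg_kg >= 7.2)
        or (mel_mg_m2 is not None and mel_mg_m2 > 140)
    )
    if meets_threshold:
        return "myeloablative"

    doses = (tbi_gy, oral_bu_mg_kg, iv_bu_mg_kg, mel_mg_m2)
    if all(d is not None for d in doses):
        # Every dose is known and none reaches a myeloablative threshold.
        return "reduced-intensity"

    # Insufficient dose information: fall back on the clinician's report.
    if reported_myeloablative is None:
        return None
    return "myeloablative" if reported_myeloablative else "reduced-intensity"
```

For example, `conditioning_intensity(12, 0, 0, 0, False)` returns "myeloablative" even though the clinician reported otherwise, which is the kind of re-classification the text refers to. Note that the TBI and melphalan thresholds are strict (>) while the busulfan thresholds are inclusive (≥).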

A variable list with definitions and values is provided with the shared scripts for Working Group members of the JSHCT. Other variables needed for each study can also be efficiently generated by using the shared script.

Outcome variables

Event and time variables for outcomes after HSCT are defined and generated as described in Table 2. Overall survival is defined as the time from transplant to death from any cause. Neutrophil recovery is defined as an absolute neutrophil count of at least 500 cells per cubic millimeter at three consecutive measurement points; platelet recovery is defined as a count of at least 20,000 or 50,000 platelets per cubic millimeter without transfusion support. Diagnosis and clinical grading of acute GVHD are performed according to the established criteria [11, 12]. Relapse is defined as the recurrence of the underlying hematological malignancy. In the shared scripts, clinical/hematological relapse is counted as relapse, and cytogenetic or molecular relapse is counted as clinical relapse if therapy was given for the disease after the cytogenetic/molecular relapse. The relapse variable is designed for use in acute leukemia. For lymphoma and multiple myeloma, information on the date of disease progression is collected, and disease progression event variables need to be generated differently for these diseases. Transplant-related death is defined as death during continuous remission.
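As an illustration of how time and event variables of this kind are derived, the Python sketch below generates the overall survival pair and a neutrophil recovery day. It is not the shared script; the function names and the input layout are hypothetical, and taking the first of the three consecutive measurements as the recovery day is a common convention assumed here rather than stated in the text.

```python
from datetime import date
from typing import Optional, Sequence, Tuple


def overall_survival(hsct_date: date,
                     last_followup_date: date,
                     death_date: Optional[date] = None) -> Tuple[int, int]:
    """Return (time_in_days, event); event is 1 for death from any cause,
    0 if the recipient is censored at the last follow-up."""
    if death_date is not None:
        return (death_date - hsct_date).days, 1
    return (last_followup_date - hsct_date).days, 0


def neutrophil_recovery_day(measurements: Sequence[Tuple[int, float]],
                            threshold: float = 500,
                            run: int = 3) -> Optional[int]:
    """Return the day of the first of `run` consecutive measurements with an
    absolute neutrophil count >= threshold, or None if never reached.

    `measurements` holds (day_after_hsct, count_per_cubic_mm) pairs.
    """
    streak = 0
    streak_start: Optional[int] = None
    for day, count in sorted(measurements):
        if count >= threshold:
            if streak == 0:
                streak_start = day
            streak += 1
            if streak == run:
                return streak_start
        else:
            streak = 0  # a count below threshold breaks the run
    return None
```

A single count above the threshold followed by a drop does not qualify: the run must be unbroken, so a dip resets the counter and the recovery day moves to the start of the next qualifying run.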

Table 2 Outcome variables defined in shared scripts

Competing risk events are also defined. For neutrophil and platelet recovery, death before neutrophil or platelet recovery was the competing event; for GVHD, death without GVHD and relapse were competing events; for relapse, death without relapse was the competing event; and for transplant-related mortality (TRM), relapse was the competing event [13, 14].
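To make the competing-risk setup concrete, the Python sketch below codes a status variable for TRM and computes a nonparametric cumulative incidence curve, in which each increment is the all-cause event-free survival just before an event time multiplied by the cause-specific hazard at that time. This is a hedged illustration rather than the shared script: the status coding (0 = censored, 1 = event of interest, 2 = competing event) and the function names are assumptions, and in practice an established statistical package would be used.

```python
from typing import List, Sequence, Tuple


def trm_status(relapsed: bool, died: bool) -> int:
    """Status for TRM: 1 = death in continuous remission,
    2 = relapse (competing event), 0 = censored."""
    if relapsed:
        return 2
    return 1 if died else 0


def cumulative_incidence(times: Sequence[float],
                         statuses: Sequence[int],
                         cause: int = 1) -> List[Tuple[float, float]]:
    """Cumulative incidence of `cause` in the presence of competing events.

    Returns (time, cumulative incidence) at each time with any event.
    """
    data = sorted(zip(times, statuses))
    n_at_risk = len(data)
    surv = 1.0  # all-cause event-free survival just before the current time
    cif = 0.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_here = d_any = d_cause = 0
        # Gather everyone whose time equals t (events and censorings).
        while i < len(data) and data[i][0] == t:
            status = data[i][1]
            n_here += 1
            if status != 0:
                d_any += 1
                if status == cause:
                    d_cause += 1
            i += 1
        if d_any > 0:
            cif += surv * d_cause / n_at_risk
            surv *= 1 - d_any / n_at_risk
            curve.append((t, cif))
        n_at_risk -= n_here
    return curve
```

Treating the competing event as censoring and using one minus the Kaplan-Meier estimate instead would overestimate the incidence, because recipients who relapse can no longer experience death in continuous remission.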

Conclusion

Efficient and high-quality data collection systems are essential and play important roles in HSCT outcome registries. The introduction of TRUMP2 will lead to better data quality and more efficient data management. TRUMP2 is also expected to expand the possibilities for data usage because it is capable of building richer relational databases. For adequate data utilization, the construction of a data utilization system that is accessible to researchers would promote research activity. The study approval and management processes and authorship guidelines need to be organized together with the data utilization system. Quality control of the data manipulation and analysis processes is also considered to affect study outcomes. Shared scripts have been introduced to define variables according to unified definitions, both for quality control and to improve the efficiency of registry studies using TRUMP data.