Background

The use of patient-reported outcomes (PROs) in clinical care has gained increasing attention, driven by patient advocacy and a growing appreciation of the central role that patients’ symptoms, emotions, and goals play in understanding disease [1,2,3]. PROs can describe specific symptoms, treatment preferences or aspects of overall health and provide insights into a patient’s well-being that cannot be captured by laboratory data alone [4,5,6,7]. PROs are particularly relevant to the care and health of patients with chronic kidney disease (CKD): these patients have poorer functional status than those with other chronic conditions, yet providers are largely unaware of the presence and severity of their symptoms [8, 9]. PROs are increasingly recognized as a key component of patient-centered kidney disease care.

Chronic kidney disease (CKD) contributes to the global health burden with a high prevalence, poor outcomes, and high cost [10] and is currently the 16th leading cause of years of life lost [11]. It is expected to become the fifth leading cause of death worldwide in the future [12]. CKD progresses to end-stage kidney disease (ESKD). At this point, patients receive renal replacement therapy (RRT), including hemodialysis, peritoneal dialysis, and kidney transplantation [13]. Hypertension, diabetes, and cardiovascular diseases are common comorbidities in patients with ESKD [14]. CKD progression, accompanied by heart failure, fatigue, itching, restless legs, waist muscle soreness, sleep disorders, anxiety, depression, and a series of problems, aggravates the economic, social, physiological, and psychological burden of patients [15]. Historically, the management of patients with CKD was evaluated mainly by clinical results and other hard indicators, such as biological indicators, recurrence rate and mortality, but some symptoms and treatment effects, such as pain, pruritus, and sleep, can only be felt by patients. Capturing and accurately quantifying the subjective feelings of patients is helpful for medical staff to obtain CKD management information on patients and promote clinical decision-making.

Patient-reported outcomes (PROs) are reports of health status collected directly from the patient. A report generally covers key domains such as symptoms, functional limitations, and physical, mental, and social health, and it offers the patient’s perspective on the effectiveness of treatment. Patients have become the only source of health outcome endpoint data for many diseases [16, 17]. PROs are often regarded as a vital complement to traditional clinical evidence for studying the impact of treatment on patient function and well-being. PROs are useful for discriminating between patients and can predict health outcomes such as hospital admissions and health-related quality of life (HRQOL). In recent years, PROs have been increasingly recognized as valuable instruments for evaluating the effect of medical interventions on the HRQOL of CKD patients in many clinical trials [18].

Generic QOL measures make it possible to compare disease burden between CKD and other conditions, whereas disease-specific measures have better validity, for instance, better responsiveness in some specific conditions [19]. The most commonly used generic QOL tools in CKD include the Short Form-36 Health Survey [20] and its 12-item subset, the Short Form-12 Health Survey [21]. As a disease-specific measurement tool, the Kidney Disease Quality of Life 36-Item Short Form Survey (KDQOL-36) covers both CKD-specific and generic QOL domains [22, 23]. The KDQOL-36 augments the Short Form-12 generic core with 24 items and scores 3 kidney-specific scales: Burden of Kidney Disease (4 items), Symptoms/Problems of Kidney Disease (12 items), and Effects of Kidney Disease (8 items). At present, self-rating scales for CKD patients mainly focus on symptoms and quality of life without evaluating treatment and social support. PRO is not another word for QOL; it covers a wider range of measures than QOL and HRQOL.

In summary, using both classical test theory and item response theory, this study aims to describe the development and validation of the CKD-PROs and to provide guidance for PROs measures. It may also provide a scientific and effective PROs approach to CKD clinical evaluation.

Methods

Design and setting

In this study, a cross-sectional design was used to test psychometric properties. The development and evaluation process was performed in four phases between May 2020 and March 2022. (1) Creation of an item bank: a literature review and patient interviews were performed, and relevant scales were referenced. (2) Formation of the initial scale: the Delphi method was used. (3) Selection of items: classical test theory (CTT) methods and an item response theory (IRT) method were used to select items and adjust dimensions to form the final scale. (4) Scale validation: the CKD-PROs was evaluated from the perspectives of reliability, content validity, construct validity, responsiveness, and feasibility. Figure 1 shows a flowchart of the developmental process.

Fig. 1 CKD-PROs research roadmap

Participants

The study recruited adults with CKD from Southwest China. The scale was developed in Chinese. The sample size was set at 5 to 10 times the number of items in the scale [24]. A total of 365 paper questionnaires were distributed and all 365 were returned, of which 360 were deemed valid for data analysis. Most participants completed the scales independently; when a participant was unable to do so without help, a trained investigator read the questions aloud. The inclusion criteria were as follows: (1) meeting the diagnostic criteria for chronic kidney disease, with a glomerular filtration rate (GFR) ≤ 30 ml/(min·1.73 m²); (2) age 18–70 years; (3) ability to understand and complete the scale; and (4) provision of informed consent. The exclusion criteria were as follows: (1) acute renal failure of any cause; (2) inability to complete the questionnaire due to severe dysfunction of the heart, liver, or brain; and (3) inability to cooperate with the study due to mental or cognitive disorders. The study was reviewed by the hospital’s Ethics Commission, and all patients signed an informed consent form.

Procedure for psychometric property testing

Phase I—creation of an item bank

Modifying the conceptual framework

Following the principles and process for developing PROs scales stipulated by the US Food and Drug Administration (FDA) [25], the existing PROs scales and the qualitative research literature for CKD patients were systematically reviewed, a theoretical framework was formed on the theoretical basis of chronic kidney disease, and the connotation and elements of PROs were classified and analyzed (Fig. 2).

Fig. 2 Conceptual framework of the CKD-PROs. PHD physiological domain, PSD psychological domain, SOD social domain, THD therapeutic domain, SOM somatization, GEN general symptoms, IND independence, ANX anxiety, DEP depression, LOH level of hope, SPB self-perceived burden, SUP social support, SOC social adaptation, EFF effectiveness, SAT satisfaction, COM compliance

Generating items

Guided by the scale framework, literature on self-reported outcomes, symptoms, psychology, quality of life, compliance, and satisfaction of patients with CKD was searched and analyzed. Purposive sampling was used to select 10 patients with CKD for semistructured interviews, and the results were analyzed by thematic analysis. The purpose was to understand the patients’ discomfort symptoms, the impact of the disease, and their expectations for treatment, in order to further enrich the item pool.

Phase II—formation of the initial scale

In this study, items were preliminarily screened by the Delphi method. Experts were consulted anonymously through several rounds of correspondence until the panel reached consensus, and the age structure, specialty, and knowledge structure of the experts were fully considered in their selection [26]. Eleven experts who had worked for 10 years or more at the General Hospital of Nanjing Military Region, West China Hospital of Sichuan University, Xijing Hospital of Air Force Military Medical University, Xiangya Hospital of Central South University, and another grade-A hospital were selected as the survey subjects (including medical staff at the level of deputy senior medical officer or above). Two rounds of questionnaires were administered. Each questionnaire covered the expert survey and the experts’ self-evaluation of their basis for judgment, item familiarity, item relevance, and item importance, and included a column for modification suggestions. A 5-point Likert scale was used to rate the relevance of items. SPSS 22.0 software was used to analyze the consultation results: the arithmetic mean and coefficient of variation were calculated, and items were selected by the boundary value method. Items were further modified, deleted, added, or merged after group discussion based on expert opinions. Expert reliability was analyzed from three aspects: the expert positive coefficient, the degree of expert authority, and the degree of coordination of expert opinions [27].
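The per-item Delphi statistics described above can be sketched as follows. This is a minimal illustration, not the SPSS procedure used in the study. The cut-off rule shown (flag an item when its mean falls below the mean of item means minus one SD, or when its coefficient of variation exceeds the mean of item CVs plus one SD) is one common form of the boundary value method and is an assumption here, since the text does not state the exact thresholds. The item names and ratings are hypothetical.

```python
from statistics import mean, stdev


def delphi_item_stats(ratings):
    """ratings: dict mapping item name -> list of expert scores (1-5).

    Returns a dict mapping item name -> (mean score, coefficient of variation).
    """
    stats = {}
    for item, scores in ratings.items():
        m = mean(scores)
        cv = stdev(scores) / m  # coefficient of variation = SD / mean
        stats[item] = (m, cv)
    return stats


def boundary_flags(stats):
    """Flag items under an assumed boundary value rule.

    An item is flagged when its mean is below (mean of means - SD of means)
    or its CV is above (mean of CVs + SD of CVs).
    """
    means = [m for m, _ in stats.values()]
    cvs = [cv for _, cv in stats.values()]
    mean_cut = mean(means) - stdev(means)
    cv_cut = mean(cvs) + stdev(cvs)
    return {item: (m < mean_cut or cv > cv_cut) for item, (m, cv) in stats.items()}


# Hypothetical ratings from five experts for four items
ratings = {
    "item_a": [5, 5, 4, 5, 4],
    "item_b": [4, 5, 4, 4, 5],
    "item_c": [5, 4, 5, 4, 4],
    "item_d": [2, 4, 1, 3, 5],
}
flags = boundary_flags(delphi_item_stats(ratings))
print(flags)
```

In this toy data only item_d is flagged, because experts rate it low on average and disagree widely. Flagged items would then go to group discussion rather than being dropped automatically.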

Thirty patients were selected for a pilot survey to establish whether patients could understand the items, how they answered them, whether their understanding of the items matched the intended content of the scale, and whether any items that were difficult to understand needed to be modified or deleted. The form and content of the scale were also culturally adapted. In the end, no items were modified or deleted.

Phase III—selection of items

Classical test theory (CTT)

According to the methods and principles of item screening stipulated by the WHO [25], the following four statistical screening methods were adopted in this project: (1) discrete trend: the standard deviation (SD) of each item’s score is used to measure dispersion, and it is recommended to delete items with an SD < 0.7 [28]; (2) correlation coefficient: according to the expected theoretical structure, the correlation coefficient between each item and the total score is calculated, and items with r < 0.4 are deleted [29]; (3) factor analysis: factor analysis with orthogonal (varimax) rotation is performed, and items with a loading < 0.4 on every factor, or with similar loadings on ≥ 2 factors and therefore no specificity, are deleted; (4) Cronbach’s α coefficient: if Cronbach’s α increases greatly after a single item is removed, the item reduces the internal consistency of its domain and should be deleted.
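Three of the four CTT criteria above (SD, item-total correlation, and Cronbach’s α after item removal) can be sketched in a few lines. This is an illustration under stated assumptions, not the study’s SPSS analysis: the item-total correlation is computed against the uncorrected total, any increase in α counts as "rises" (the text says "increases greatly" without giving a threshold), and the score matrix is hypothetical. Factor analysis is omitted.

```python
from statistics import mean, stdev, variance


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def cronbach_alpha(items):
    """items: list of columns, one list of respondent scores per item."""
    k = len(items)
    totals = [sum(row) for row in zip(*items)]
    item_var = sum(variance(col) for col in items)
    return k / (k - 1) * (1 - item_var / variance(totals))


def ctt_screen(items, sd_min=0.7, r_min=0.4):
    """Return one dict of screening flags per item (True means a problem)."""
    totals = [sum(row) for row in zip(*items)]
    alpha_all = cronbach_alpha(items)
    report = []
    for i, col in enumerate(items):
        rest = items[:i] + items[i + 1:]
        report.append({
            "low_sd": stdev(col) < sd_min,             # criterion (1)
            "low_r": pearson(col, totals) < r_min,     # criterion (2)
            "alpha_rises": cronbach_alpha(rest) > alpha_all,  # criterion (4)
        })
    return report


# Hypothetical scores: 4 items (columns) answered by 6 respondents
items = [
    [1, 2, 3, 4, 5, 5],
    [2, 2, 3, 4, 4, 5],
    [1, 3, 3, 5, 4, 5],
    [3, 3, 3, 3, 4, 3],  # nearly constant item
]
report = ctt_screen(items)
print(report)
```

The nearly constant fourth item fails the SD criterion, and removing it raises α, which is the pattern the screening rules are designed to catch.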

Item response theory (IRT)

This method uses probability to explain the relationship between subjects’ responses to items and their latent traits [30]. Because a 5-point Likert scale was adopted in this study, the Samejima graded response model was used to estimate the discrimination parameter (a) and difficulty parameters (b) of each item. An item with a discrimination of a < 0.4 should be excluded. Parameters b1, b2, b3, and b4 correspond to four difficulty levels, where b1 is the category threshold between option 1 and option 2, and so on, with b1 < b2 < b3 < b4. The difficulty parameters generally range from −3 to 3 [31].
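To make the roles of a and b1 to b4 concrete, the sketch below computes the category probabilities of a single 5-option item under the graded response model in its usual logistic form. It only evaluates the model for given parameters and does not estimate them (the study used MULTILOG for estimation). The parameter values are hypothetical.

```python
import math


def grm_category_probs(theta, a, b):
    """Graded response model probabilities for one item.

    theta: latent trait level of the respondent
    a:     discrimination parameter
    b:     ordered thresholds [b1, b2, ...]; len(b) + 1 response options
    """
    # Cumulative probability of responding above each threshold
    cum = [1.0 / (1.0 + math.exp(-a * (theta - bk))) for bk in b]
    # Pad with P(response >= lowest option) = 1 and P(above highest) = 0
    bounds = [1.0] + cum + [0.0]
    # Category probability = difference of adjacent cumulative probabilities
    return [bounds[k] - bounds[k + 1] for k in range(len(b) + 1)]


# Hypothetical item: a = 1.2, thresholds within the usual -3 to 3 range
probs = grm_category_probs(theta=0.0, a=1.2, b=[-2.0, -0.5, 0.8, 2.1])
print([round(p, 3) for p in probs])
```

Because b1 < b2 < b3 < b4, every category probability is positive and the five values sum to 1. A larger a makes the probabilities change more sharply with theta, which is why items with very low a carry little information.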

Phase IV—scale validation

Reliability analysis

Reliability refers to the consistency of measurement results. (1) Split-half reliability: the split-half method divides all items into two halves and calculates the correlation between the two parts. In this study, items were arranged by domain and split into odd- and even-numbered halves; a coefficient ≥ 0.7 is usually considered acceptable. (2) Cronbach’s α coefficient reflects the average correlation between items and estimates the internal consistency of the scale and of each domain; values above 0.7 are considered acceptable [32].
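A minimal sketch of the odd-even split-half calculation follows. The Spearman-Brown step-up correction is included as an assumption: it is the conventional way to adjust the half-test correlation to full scale length, but the text does not say whether it was applied. The response data are hypothetical.

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def split_half_reliability(rows):
    """rows: one list of item scores per respondent, in scale order."""
    odd = [sum(r[0::2]) for r in rows]   # items 1, 3, 5, ...
    even = [sum(r[1::2]) for r in rows]  # items 2, 4, 6, ...
    r = pearson(odd, even)
    return 2 * r / (1 + r)  # Spearman-Brown correction (assumed)


# Hypothetical responses: 5 respondents, 6 items
rows = [
    [1, 2, 1, 2, 2, 1],
    [3, 3, 2, 3, 3, 3],
    [4, 5, 4, 4, 5, 4],
    [2, 2, 3, 2, 2, 3],
    [5, 4, 5, 5, 4, 5],
]
print(round(split_half_reliability(rows), 3))
```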

Validity analysis

Validity concerns bias, that is, the proportion of systematic error included in the measurement results. (1) Structural validity: confirmatory factor analysis (CFA) was used to build a measurement model between indicator items and their dimensions. Relatively reliable indicators include the nonnormed fit index (NNFI), comparative fit index (CFI), adjusted goodness-of-fit index (AGFI) and root mean square error of approximation (RMSEA) [33]. (2) Content validity refers to the extent to which a particular item reflects a content category. In this study, the content validity index (CVI) was used for quantitative analysis. An item was retained if its CVI was more than 80%.
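As an illustration of the 80% retention rule, an item-level CVI can be computed as the proportion of experts who rate the item as relevant. The cut-off for "relevant" used below (a rating of 4 or 5 on the 5-point scale) is an assumption, since the text does not define it, and the ratings are hypothetical.

```python
def item_cvi(ratings, relevant_min=4):
    """Proportion of experts whose rating is at least relevant_min."""
    return sum(r >= relevant_min for r in ratings) / len(ratings)


# Hypothetical relevance ratings from 11 experts for one item
ratings = [5, 4, 4, 3, 5, 5, 4, 4, 5, 4, 2]
cvi = item_cvi(ratings)
print(round(cvi, 3), "retain" if cvi > 0.8 else "review")
```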

Dimensional correlation

Item correlation refers to the degree of relevance between an item and its domain. When the correlation coefficient r is > 0.4, the dimensional correlation is considered acceptable.

Response analysis

Responsiveness refers to the ability of a scale to detect minimal changes in patients’ quality of life. The domain scores and total scores measured before and after treatment were calculated and statistically analyzed. The two‐sample t test was used as the statistical method, and p < 0.05 was taken to indicate that the scale can discriminate the control group from the CRF group.

Feasibility evaluation

Feasibility evaluation mainly reflects the acceptability of the questionnaire. Common indicators include the scale recovery rate, efficiency rate, and time to complete each scale.

Data analysis software

Confirmatory factor analysis (CFA) was performed with LISREL version 8.8, and IRT analysis was performed with MULTILOG version 7.0. All other data analysis was performed with SPSS version 22.0. If data for individual items were missing, the item scores were replaced based on the average data. An item was selected for the final scale if it passed the filter of at least three methods.
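The two data-handling rules in this paragraph can be sketched as follows. Two assumptions are made: "average data" is read as the mean of the observed responses to the same item, and the screening methods are represented as a dictionary of pass/fail results whose keys are illustrative names only.

```python
def impute_item_mean(column):
    """Replace missing responses (None) with the mean of observed responses.

    Assumes item-level mean imputation; the text does not specify the exact
    form of averaging.
    """
    observed = [v for v in column if v is not None]
    item_mean = sum(observed) / len(observed)
    return [item_mean if v is None else v for v in column]


def retain_item(passed, min_pass=3):
    """Keep an item when at least min_pass screening methods retain it."""
    return sum(passed.values()) >= min_pass


# Hypothetical example: one item column with a missing response
print(impute_item_mean([1, None, 3]))

# Hypothetical screening results for one item (True = method retains it)
passed = {
    "discrete_trend": True,
    "correlation": True,
    "factor_analysis": False,
    "cronbach_alpha": True,
    "irt": False,
}
print(retain_item(passed))
```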

Results

Participants’ characteristics

During the item-screening process, we conducted a clinical survey of 365 patients, and 360 valid samples were collected. The patients’ average age was 54 ± 12.56 years, and there were 187 males (51.9%) and 173 females (48.1%). To examine the reliability and construct validity of the scale, 272 patients were surveyed with the final scale. The mean age of these patients was 51 ± 13.48 years; 188 were males (69.1%), and 84 were females (30.9%). Among the 360 patients in the item-screening sample, 47 (13.1%) had primary school education or below, 236 (65.5%) had junior high or senior high school education, and 77 (21.4%) had undergraduate education or above; 271 (75.3%) were married, and the remaining 89 (24.7%) were single, including those who were divorced; 238 (66.1%) were engaged in paid work, and the remaining 122 (33.9%) were unemployed; 21 (5.8%) had high income (annual household income > 150,000 CNY), 192 (53.3%) had moderate income (annual household income between 50,000 and 150,000 CNY), and 147 (40.9%) had low income (annual household income < 50,000 CNY); 58 (16.1%) had concurrent diabetes, 125 (34.7%) had concurrent hypertension, 43 (11.9%) had concurrent cardiovascular disease, and 66 (18.3%) were infected with hepatitis B. The distribution of personal and clinical characteristics of the study patients is shown in Table 1.

Table 1 Demographic and disease characteristics of patients

Psychometric properties of the level of CKD-PRO

Item generation and selection

In total, 79 entries were generated through literature analysis and patient interviews. In addition, in the physiological, psychological, social, and therapeutic domains, there were 27, 20, 12 and 20 items, respectively.

Subsequently, a total of 22 questionnaires were distributed in the 2 rounds of this study. The recovery rates of expert consultation questionnaires in the first and second rounds were 100% and 90.9%, and the positive coefficients of experts were 100% and 90.9%, respectively, indicating a high degree of participation and importance in this study. The Kendall coordination coefficient W of the second round of consultation was 0.254, which was statistically significant by the χ2 test (χ2 = 175.500, p < 0.001). The coordination coefficient of each dimension was between 0.201 and 0.273 (p < 0.05), indicating that the expert scores were consistent. The coefficient of variation for the importance of each item was 0–0.34, indicating that the experts agreed on the content of the index. The coefficient Cr value of expert authority degree was 0.92, indicating high reliability of expert scoring and authoritative and reliable research results.

In the first round of expert consultation, 9 items—dry mouth, constipation, leg discomfort, tinnitus, slow reaction, bad emotional control, stable blood pressure, protein intake control and water intake control—were deleted due to weak correlation, repeated content, and inconsistent fields. Four items—foam urine, skin damage, folk prescription purchase, blood pressure and blood sugar monitoring—were added, and some items were revised and improved. In the second round of expert consultation, six items were deleted: soreness and pain in the back, memory loss, confidence in the future, financial burden, social status, and impact on daily work. After 2 rounds of expert consultation and discussion and modification by the research group, a preliminary scale containing 64 items in 12 dimensions was formed.

Finally, researchers analyzed the data from 360 patients with CKD. The discrete trend method, correlation coefficient method, factor analysis method, Cronbach’s α coefficient method and item response theory were used to screen the scale items, and the items were removed with strict standards. The items that were recommended to be retained by at least three methods were selected, that is, the items that did not meet the standards by more than two methods were deleted. The final scale consists of 54 items, which belong to 12 dimensions and 4 domains. Among them, 16 are in the field of physiology, 14 in the field of psychology, 9 in the field of society and 15 in the field of therapy. The results are shown in Table 2.

Table 2 Results of the item-selection phase using CTT and IRT

Validation of the CKD-PRO

There were 272 issued copies of the CKD-PROs in all, and 270 of them were retrieved for analysis.

Reliability analysis

Cronbach’s α coefficients were calculated in four domains internally: 0.916 physiological, 0.893 psychological, 0.811 social domain, and 0.888 therapeutic. The coefficient for the entire scale was 0.939. The split-half reliability coefficient of the CKD-PROs was 0.945, and in the physiological, psychological, social, and therapeutic domains, it was 0.922, 0.904, 0.821 and 0.912, respectively. Thus, the scale showed excellent reliability.

Content validity

The CVIs of all items were higher than 80%, indicating that there was acceptable content validity. In addition, in the preparation of the CKD-PROs scale, many relevant studies and domestic and foreign scales were consulted. Methods such as expert consultation and patient interviews were used to conduct in-depth and repeated argumentations on the optimization of the scale items to ensure that the scale had high content validity.

Construct validity

The results show that the standard load solutions of each factor are all greater than 0.3. The results in Table 3 show that the values were all less than 8 except for the SOD field. The AGFI value of SOD was less than 0.8, but the AGFI value of other fields was greater than 0.8. Except for PSD, SOD RMSEA is greater than 0.1, SOD RMR is greater than 0.1, and all other fields are less than 0.1. The CFI value of the SOD field is 0.860, and the CFI value of other fields is greater than 0.9. The overall fitting of the model agrees with all the expressions, suggesting that the model has good structural validity (Fig. 3).

Table 3 Goodness-of-fit statistics of the CKD-PROs
Fig. 3 Confirmatory factor analysis model

Dimensional correlation

Each item correlates with its domain at an acceptable level: the correlation coefficient r of each item ranges from 0.413 to 0.669, above the 0.4 threshold.

Response analysis

In this survey, two measurements of 147 subjects, taken before and after treatment, were compared with the matched-sample t test. According to the results in Table 4, the scores before and after treatment differed significantly in all dimensions except GEN and DOS (all p < 0.05). The differences were all within a reasonable range, indicating that the scale can effectively distinguish patients before and after treatment and has good responsiveness.
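The matched-sample comparison can be illustrated with the paired t statistic, which is computed from the within-subject differences. The sketch below returns only the statistic and its degrees of freedom. Obtaining the p value requires the t distribution, which a statistics package would supply. The scores are hypothetical and are not taken from Table 4.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Paired t statistic and degrees of freedom for matched measurements."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Hypothetical dimension scores for four subjects
before = [10, 12, 9, 11]
after = [12, 15, 10, 14]
t, df = paired_t(before, after)
print(round(t, 2), df)
```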

Table 4 The scores of all aspects of the scale were compared before and after treatment

Feasibility analysis

A total of 636 questionnaires were issued in the 2 clinical investigations, and 630 questionnaires were finally collected with a recovery rate of 99.1%, among which 630 were effective for an effective rate of 100%. The completion time for each questionnaire was approximately 13 min. The above results show that this scale has good feasibility.

Discussion

CKD progresses slowly and is irreversible. During the early stages of CKD, patients may not experience any obvious symptoms, making it difficult to monitor their condition. Self-report outcome measures can aid medical staff in monitoring the progression of the disease and determining if more frequent check-ups or treatment are necessary. As the patient enters stages 4–5 of CKD, the complexity of their condition increases and personalized treatment becomes necessary. Self-report outcome measures can help medical staff understand the patient’s experiences and problems, identify factors that may induce or worsen symptoms, assess the quality of life and progression of CKD, and develop targeted personalized treatment and management plans. In addition, the PROs can provide insight into the patient's needs and preferences when choosing kidney replacement therapy. For instance, by considering patient feedback and reports, the best treatment method for kidney transplant or dialysis can be determined. Therefore, the development of a new CKD-PROs measure can help medical staff better understand the patient’s condition and needs, thereby allowing them to develop better treatment plans and improve the patient’s quality of life.

In this research, we have described the development of a new measure of CKD-specific patient-reported outcomes and made initial assessments of its reliability and validity. We hypothesized that the CKD-PROs can measure the symptoms and psychosocial impact of CKD. Following the FDA guidance on the development of patient-reported outcomes, the process solicited and documented opinions from a broad base of patients who met current consensus diagnostic guidelines for CKD, including the four main clinical phenotypes CKDsPHD/CKDsPSD/CKDsSOD/CKDsTRE [25, 34, 35]. Doing so helps ensure that their experience is accurately understood. We also tested a draft instrument made up of 79 items that were regarded as critical by patients in focus groups, which was later refined to 54 items based on an evaluation of their measurement properties in a CKD patient cohort.

With an increasing focus on patient-centered care, patient-reported outcome measures (PROMs) allow clinicians and researchers to deliver healthcare services to patients more accurately. Generic health-related QoL measures (e.g., the SF-36) allow clinicians to compare the disease burden between chronic diseases. Disease-specific instruments can capture CKD-specific parameters that are clinically significant and essential to clinical trials, and they provide medical professionals with patients’ perspectives. According to the FDA guidance on PROMs for clinical trial endpoints, PROMs must be developed with patients who have the disease under study, and the problems they identify must reflect patients’ experience and be revised based on their input. A specific and appropriate recall period for the disease must be established, and data must demonstrate validity, reliability and responsiveness [25]. The PROMIS-57 and PROMIS-29 have been widely recognized as highly reliable and effective generic tools for assessing patients’ disease experience in the field of CKD [36]. However, to our knowledge, there is no PROM in CKD that fully meets the FDA criteria. Although FDA acceptance remains the standard that the CKD-PROs would need to meet, the measure was developed according to the methodology prescribed in the guidance statement.

Recently, the importance of obtaining information about patients’ disease and treatment experience has been increasingly recognized in regulatory and clinical fields [25, 37, 38]. CKD-specific PROMs have some distinctive characteristics in their development, such as variability and patient involvement. Many PROs measures have demonstrated limited psychometric validity in CKD, and generic PROs measures assessing HRQOL or concept-specific outcomes have demonstrated limited content validity in CKD. The generic HRQOL questionnaire SF-36 shows good coverage of life impacts, but it may have limited value in the clinical trial setting for other reasons. HRQOL outcomes may also be influenced by intervening factors over time. Disease-specific HRQOL measures, such as the KDQOL-36 (which shows good conceptual coverage), encounter very similar challenges to generic HRQOL measures: outcomes closer to the disease or treatment, such as symptoms, are more likely to show a meaningful treatment effect than downstream consequences such as HRQOL, which are influenced by a range of factors.

The FDA issued guidelines on the application of PROs in research, clinical drug development and efficacy evaluation, which define a PRO as any report of health status and treatment efficacy that comes directly from patients. PROs emphasize the importance of patients’ subjective feelings and are a key indicator of disease treatment and treatment effects from the perspective of patients. A PROs scale can be used to measure a variety of patient factors (e.g., symptoms, psychological status, social participation, ability to perform daily living activities, and health-related quality of life) [18]. To ensure the rationality and objectivity of PROs evaluation, medical experts began to introduce psychometric evaluation methods into PROs evaluation and developed many well-known scales. One related resource, developed by the MAPI Institute in Lyon, is the Patient-Reported Outcome and Quality of Life Instruments Database (PROQOLID), which presents scale information in a structured form and provides extensive and in-depth online information about PROs and quality of life for researchers engaged in health assessment. The development of PROs integrates the latest research methods in measurement, investigation, health information technology, clinical research and qualitative research and establishes a set of health outcomes that can be used to measure the self-reported feelings, function, and status of multiple groups of people. The PROs development process typically involves the following steps. First, the domains are established and the concepts defined, which involves reviewing the literature, collecting input from patients, their family members, care providers, clinical professionals and experts, and initially establishing domains such as physiological function, fatigue, pain, emotional distress, social health, and overall health. Then, the item pool is formed and reviewed. The developer creates a list of potential items relevant to the concept of interest. These items can be general (e.g., “In general, would you say your health is: excellent, very good, good, fair, poor?”) or specific to a condition (e.g., “My kidney disease interferes too much with my life: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree”). Using qualitative and quantitative research methods and considering the existing filters, the items are classified, selected, evaluated, and modified according to whether each item is consistent with the defined domain, the results of the item response theory analysis, and the wording of the item itself. These qualitative and quantitative considerations determine whether an item is retained. Responses from focus groups and cognitive interviews are used to revise the retained items, and finally, a pool of items is formed for each domain.

The CKD-PROs is concise and accessible. It has good reliability and validity and can help medical staff assess specific symptoms, treatment preferences, and many aspects of overall health. The CKD-PROs can incorporate patients’ opinions and suggestions into clinical care, clinical trials, and health care policies and ultimately support customized, high-quality care for patients with CKD. PROs assessment is highly recommended at every outpatient clinic follow-up. Future studies should recruit more patients, from multiple centers and ideally from different cultures and countries, to examine the psychometric properties of the CKD-PROs. If you intend to use the scale developed in this study, please contact our team by email to obtain authorization for its use.

Conclusion

In this study, the CTT and IRT methods were combined so that their advantages complemented each other [39]. The results show that item selection is of great significance in instrument development. First, we selected items for their importance, suitability, and certainty through patient interviews and the Delphi method. Then, we applied the CTT methods (the discrete trend method, correlation coefficient method, Cronbach’s alpha method, and factor analysis) to assess the items from perspectives such as sensitivity, representativeness, independence, and internal consistency. The discrimination parameter (a), difficulty parameter (b), and item information were analyzed by the IRT method. The comprehensive application of these methods laid a good foundation for the screening of high-quality items. It may also offer a scientific and effective approach for evaluating the clinical efficacy of CKD treatment through PROs measures.

Limitations

The current study has several limitations. First, the sample size of the survey is limited. Second, the questionnaire developed contains 54 items, which may be too large. In the next phase of our work, we plan to expand the sample size to enhance content validation and calibrate the item banks. In addition, we will employ CAT (computerized adaptive testing) technology to tailor the specificity of each item to a continuous range of a specific feature, such as the degree of skin itching. This approach promises to save assessment time and improve the efficiency of questionnaire completion.