Abstract

Background. Pain is considered “the 5th vital sign” that should be regularly assessed in the neonatal intensive care setting. Although over 40 pain assessment tools have been developed for neonates, their implementation in everyday practice is challenging. Epidemiological studies demonstrate that pain is still underassessed and undertreated in European NICUs. Purpose. To evaluate the interrater and intrarater reliability of the NIPS and COMFORT-B scales among the staff of a tertiary NICU 4 years after the scales were introduced into local pain guidelines with no prior dedicated training. Methods. Physicians and nurses were invited to evaluate 5 video recordings of infants hospitalized in intensive care settings, using the NIPS and COMFORT-B scales. The assessment took place twice, at a 3-month interval. Interrater reliability was calculated for both scales using Kendall’s W coefficient of concordance and Krippendorff’s alpha coefficient. Cohen’s kappa was used to assess intrarater reliability. Results. Seventeen physicians and 19 nurses took part in the study. Interrater agreement for the COMFORT-B scale was above 0.8 for Kendall’s W coefficient (p < .01) and above 0.667 for Krippendorff’s alpha coefficient. Kendall’s W coefficient for the NIPS scores ranged between 0.7 and 0.8 (p < .01). Krippendorff’s alpha was above 0.667. Intrarater agreement for the COMFORT-B and NIPS scales was 0.693 and 0.724, respectively. Conclusions. Overall, the agreement between our staff members was moderately good for both scales. This is not enough to avoid inadequate pain assessment. More training is needed to improve the competence of NICU staff in using pain scales.

1. Introduction

Over 30 years ago, a study published by Anand et al. [1] demonstrated that inadequate analgesia during surgery in preterm babies resulted in a more pronounced metabolic stress response and an unstable clinical course in the postoperative period. Hence, the myth that the immature central nervous system precludes neonates from experiencing pain was rejected. Since then, neonatal pain research has made considerable progress in understanding the developmental aspects of postnatal nociception [2]. In the 1990s, distinct behavioural and physiological responses to painful stimuli were characterized by Craig et al. [3], which led to the development of numerous neonatal pain assessment tools. To date, over 40 scales to assess pain and/or sedation in neonates have been created, yet there is still no gold standard instrument [4]. Clinical guidelines on neonatal pain prevention and management [5–7] recommend using pain scales with proven validity and reliability, such as the Neonatal Facial Coding System (NFCS), Premature Infant Pain Profile (PIPP), Neonatal Pain, Agitation and Sedation Scale (N-PASS), Behavioural Indicators of Infant Pain (BIIP), Douleur Aiguë du Nouveau-né (DAN), COMFORT scale, and Face, Legs, Activity, Cry, Consolability (FLACC) scale. The assessment method should be adapted to the type of pain a neonate is experiencing, namely, acute, prolonged, or postoperative pain. Pain should be evaluated and documented every 4 to 6 hours and after each potentially painful procedure [7].

It has been demonstrated that implementing guidelines in everyday practice is challenging. In the EUROPAIN (EUROpean Pain Audit In Neonates) prospective observational study performed in 243 NICUs from 18 European countries [8], 31.8% of enrolled neonates received an assessment of continuous pain at least once during their NICU stay. Daily pain assessments occurred in only 10.4% of patients. Notably, practices varied among countries, with pain assessment being common in French (100%), Dutch (80%), and Belgian (75%) NICUs. As for Polish NICUs, 2 (25%) out of 8 hospitals participating in the study reported performing continuous pain assessment. The presence of local NICU pain guidelines and of nurses who specialized in pain management increased the odds of pain assessment.

In the Children’s Memorial Health Institute in Warsaw, whose NICU also participated in the EUROPAIN study, there is a local guideline document regarding pain management. Medical charts are occasionally audited by the pain management services to verify that pain assessment is performed. Their staff also provides support in pharmacotherapy, if needed. In our standard of care, neonatal pain assessment is performed with the use of the Neonatal Infant Pain Scale (NIPS) in nonventilated patients and the COMFORT Behaviour (COMFORT-B) scale in ventilated patients. Nurses are provided with cards describing each scale at their workstations.

Unlike many reports on the implementation of pain assessment in hospital settings [9, 10], pain scales were introduced in our department without prior extensive training or calculation of interrater reliability. To our knowledge, their Polish translations did not undergo cross-cultural adaptation and validation. Yet, since their introduction in 2017, pain scores have been meticulously documented in medical records. To improve our pain awareness and pain measurement, we conducted a study to evaluate the agreement between observers using both scales.

2. Materials and Methods

This was a prospective study conducted from January to April 2021 in the level 3 NICU of the Children’s Memorial Health Institute in Warsaw. The study was approved by the local Institutional Review Board (study ID number: 21/KBE/2018).

2.1. Population and Design

At the time, 77 nurses and 28 doctors were employed at the NICU. They were informed about the aim of the study and their role in it during a staff meeting. Those who did not attend the meeting were personally approached by the authors. Participation in the study was voluntary. The study procedure involved the evaluation of 5 video recordings of infants hospitalized in intensive care settings, using the NIPS and COMFORT-B scales. Participants had 2 minutes to assess each video. The assessment took place twice, at a 3-month interval. On each occasion, the assessment took place after the morning staff meeting in our department’s conference room.

The minimum required sample size was approximated using Krippendorff’s estimations [11]. We assumed that, for the NIPS, each of its 8 values (from 0 to 7) is equally likely to occur. To achieve the smallest acceptable reliability value of alpha 0.667 at the 0.05 level of statistical significance, the minimum reliability sample size was 71 units. This means that, with a fixed number of 5 videos to evaluate, we had to enroll at least 13 raters. Additionally, we decided to increase the minimum number of observations made in this study to at least 100, based on the recommendations of the COSMIN Checklist [12], which rates a sample of over 100 as “excellent”. This means that at least 20 raters had to be enrolled in the study. This sample should be sufficient to detect a Kendall’s W coefficient of 0.8 (ρs1) with 80% power at the 0.05 level of significance, assuming that the null value (ρs0) equals 0.6 [13].

2.2. Video Recordings

A convenience sample of 5 videos was selected to be evaluated by the study participants. Our aim was to ensure that participation in the study would not interfere with the staff’s everyday duties, hence the small number of videos to assess.

Four of the videos were retrieved from the COMFORT Behaviour Scale instructional website (https://comfortassessment.nl/) [14]. The website provides video guides on how to evaluate each of the scale’s items, as well as training videos for the full assessment. The videos selected for the study included: video 1: COMFORT score of 19/20; video 2: extreme scores (5) for “Calmness,” “Alertness,” “Respiratory response,” and “Physical movement”; video 3: score 5 for “Crying”; video 4: score 3 for “Respiratory response”.

The 5th video was recorded in our department and showed a full-term neonate undergoing a venepuncture procedure, which is classified as moderately painful [15]. Written parental consent was obtained before the recording, and the venepuncture was clinically necessary.

2.3. Pain Assessment

The Neonatal Infant Pain Scale (NIPS) is a tool developed in the early 1990s [16] to assess six behavioural reactions to painful procedures in preterm and full-term newborns. The scale was demonstrated to have high interrater reliability and internal consistency. It was validated for construct and concurrent validity. Its recommended use is for acute and postoperative pain, although its psychometric properties were mainly studied for acute pain [5]. It contains six items defined in Table 1. To provide the total NIPS score, participants in the study had to evaluate all of the items.

The COMFORT scale was developed to assess the levels of distress in PICU patients, as well as postoperative pain in children under 3 years of age. It consists of six behavioural items and two physiologic items: heart rate and mean arterial pressure. As physiological variables were demonstrated to have a weak correlation with pain behaviour, their exclusion from the scale led to creating the COMFORT-B scale containing only behavioural items. The scale is illustrated in Table 2. It is possible to omit one of the scale’s items in the pain assessment. The total score is then computed by multiplying the total score for the other items by 6/5 [14]. The scale was validated for concurrent validity, internal consistency, and interrater reliability [17, 19, 20].
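The rule for an omitted item can be illustrated with a minimal Python sketch. The function name `comfort_b_total`, the use of `None` to mark the omitted item, and the assumption that six items are each scored from 1 to 5 are illustrative choices, not part of the scale’s official materials.

```python
from typing import Optional, Sequence


def comfort_b_total(item_scores: Sequence[Optional[int]]) -> float:
    """Total COMFORT-B score from six item scores; at most one may be None.

    Assumption for this sketch: each item is scored from 1 to 5.
    """
    if len(item_scores) != 6:
        raise ValueError("expected six item scores")
    observed = [s for s in item_scores if s is not None]
    if any(not 1 <= s <= 5 for s in observed):
        raise ValueError("each item is scored from 1 to 5")
    if len(observed) == 6:
        return float(sum(observed))
    if len(observed) == 5:
        # One item omitted: rescale the sum of the other five by 6/5.
        return sum(observed) * 6 / 5
    raise ValueError("at most one item may be omitted")
```

For example, five observed items each scored 3 sum to 15, which is rescaled to 15 × 6/5 = 18.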

2.4. Statistical Analyses

The data were analysed using IBM SPSS Statistics v. 27. Descriptive statistics were used to calculate median scores and interquartile ranges. Kendall’s W and Krippendorff's alpha coefficients were calculated to evaluate interrater reliability (IRR) for COMFORT-B and NIPS total scores, as well as for items of each scale. Both coefficients are suitable for ordinal ratings with more than 2 raters [21]. For interpretation of coefficients, we assumed the labels suggested by Landis and Koch for the use of kappa: values between 0 and 0.20 indicate a slight IRR; values between 0.21 and 0.40 indicate a fair IRR; values between 0.41 and 0.60 indicate a moderate IRR; values between 0.61 and 0.80 indicate a substantial IRR; and values between 0.81 and 1.00 indicate an almost perfect IRR [22]. Additionally, for Krippendorff’s alpha, it is accepted that its lowest conceivable limit is 0.667 [11]. Intrarater reliability was assessed using Cohen’s kappa coefficient. Where applicable, tests were performed at 0.05 significance level. Missing values were omitted from the analyses.
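The analyses in this study were run in SPSS. Purely as an illustration of what Kendall’s W measures, a minimal Python sketch with the usual correction for tied ranks could look as follows. The function names and the toy ratings are hypothetical and are not study data.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based, ranks 1-based
        for pos in range(start, end + 1):
            ranks[order[pos]] = mean_rank
        start = end + 1
    return ranks


def kendalls_w(ratings: Sequence[Sequence[float]]) -> float:
    """Kendall's W for a raters x subjects table (one row per rater)."""
    m = len(ratings)       # number of raters
    n = len(ratings[0])    # number of rated subjects (e.g. videos)
    rank_rows = [average_ranks(row) for row in ratings]
    rank_sums = [sum(row[j] for row in rank_rows) for j in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((r - mean_sum) ** 2 for r in rank_sums)
    # Tie correction: sum of (t^3 - t) over each rater's groups of tied scores.
    ties = 0
    for row in ratings:
        for value in set(row):
            t = sum(1 for x in row if x == value)
            ties += t ** 3 - t
    return 12 * s / (m ** 2 * (n ** 3 - n) - m * ties)


# Hypothetical total scores from 3 raters for 5 videos (not study data).
toy_ratings = [
    [19, 25, 14, 12, 22],
    [20, 24, 15, 12, 21],
    [18, 26, 13, 14, 22],
]
print(round(kendalls_w(toy_ratings), 3))
```

Because W is computed from ranks, it reflects whether raters order the videos in the same way, not whether they assign identical scores.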

3. Results

Thirty-six members of our NICU staff took part in our study. The group included 5 doctors and 9 nurses with less than 5 years’ experience in a neonatal intensive care unit. The remaining 12 doctors and 10 nurses had more than 5 years of NICU experience.

We obtained 180 and 170 total NIPS scores at the 1st and 2nd measurements, respectively. Total COMFORT-B scores amounted to 175 at both measurements. Total scores for all assessments are displayed as box and whisker plots (Figures 1 and 2). The percentage of observers whose assessment of the 4 videos matched the reference scores from the COMFORT training website exactly is shown in Table 3. As for the 5th video, which showed a procedure considered to be moderately painful, the total scores displayed in Figures 1 and 2 fall within the range of severe pain.

3.1. Interrater Reliability: Total Scores

Interobserver agreement for the COMFORT-B and NIPS scales is presented in Tables 4 and 5, respectively. Kendall’s W coefficient values (from 0.736 to 0.906) indicate substantial to almost perfect agreement between observers. Krippendorff’s alpha coefficients are above the smallest acceptable value of 0.667, but below 0.8, which implies moderate interrater reliability. All reliability coefficients achieved higher values for the COMFORT-B scale and for the 2nd measurement in both scales.
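The verbal labels applied to Kendall’s W here follow the Landis and Koch bands quoted in Section 2.4. A minimal Python sketch of that mapping is given below. The function name `landis_koch_label` is hypothetical, and treating each band’s upper bound as inclusive is an assumption made so that every value receives a label. The “poor” label for negative values comes from Landis and Koch’s original scheme and is not listed in the Methods above.

```python
def landis_koch_label(value: float) -> str:
    """Map an agreement coefficient to the Landis and Koch verbal label."""
    if value > 1:
        raise ValueError("agreement coefficients cannot exceed 1")
    if value < 0:
        return "poor"  # from the original scheme; not listed in the Methods
    # Assumption: each band's upper bound is inclusive.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if value <= upper:
            return label
    return "almost perfect"


# The lowest and highest Kendall's W values reported for total scores.
for w in (0.736, 0.906):
    print(w, landis_koch_label(w))
```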

3.2. Interrater Reliability: Scales’ Items

Interobserver agreement for the items of the COMFORT-B and NIPS scales is presented in Tables 6 and 7. Overall, the values of reliability coefficients seem to be more consistent for the NIPS scores. The items that did not reach the minimum desired level of interrater reliability include “Breathing pattern” (both coefficients) and “Legs” (alpha). The observers showed almost perfect agreement (Kendall’s W and Krippendorff’s alpha >0.8) while assessing the following items of the NIPS: “Facial expression,” “Cry,” and “State of Arousal”.

As for the COMFORT-B scale, Kendall’s W coefficients were shown to be above the substantial agreement threshold for all items, but they rarely reached a value greater than 0.8. However, Krippendorff’s alpha coefficients were below the acceptable agreement level for the following items of the COMFORT-B scale: “Alertness,” “Respiratory response,” “Crying,” “Muscle tone,” and “Facial tension”.

3.3. Intrarater Reliability: Total Scores

Intrarater reliability, calculated as weighted Cohen’s kappa coefficients, was substantial for both COMFORT-B and NIPS: 0.693 (CI: 0.637–0.750, p < .01) and 0.724 (CI: 0.658–0.791, p < .01), respectively.
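For readers unfamiliar with the statistic, a weighted kappa comparing one rater’s first and second scores can be sketched in a few lines of Python. The weighting scheme is not stated in the text, so linear weights are an assumption here. The function name and the toy scores are hypothetical and are not study data.

```python
from typing import Sequence


def linear_weighted_kappa(
    first: Sequence[int], second: Sequence[int], categories: Sequence[int]
) -> float:
    """Linearly weighted Cohen's kappa for two sets of paired ordinal scores."""
    index = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    n = len(first)
    # Observed proportions for each (first score, second score) combination.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(first, second):
        observed[index[a]][index[b]] += 1 / n
    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    # Linear disagreement weights |i - j|; the 1/(k - 1) factor cancels out.
    observed_dis = sum(
        abs(i - j) * observed[i][j] for i in range(k) for j in range(k)
    )
    expected_dis = sum(
        abs(i - j) * row[i] * col[j] for i in range(k) for j in range(k)
    )
    return 1 - observed_dis / expected_dis


# Hypothetical NIPS totals (0 to 7) given by one rater on two occasions.
first_session = [0, 2, 5, 7, 3]
second_session = [1, 2, 5, 6, 3]
print(round(linear_weighted_kappa(first_session, second_session, range(8)), 3))
```

With linear weights, a one-point difference between sessions is penalised less than a larger difference, which suits ordinal totals better than unweighted kappa does.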

4. Discussion

In this study, we evaluated agreement between a sample of our staff members in pain assessment using the NIPS and COMFORT-B scales 4 years after their introduction in our department. Unlike the staff in other studies [9, 10, 14, 23, 24], we did not undergo intensive training before implementing these tools in our everyday practice. We only received cards describing each scale, which are available at nurses’ workstations. Our lack of training could explain why such a low percentage of our group assessed the videos in accordance with the COMFORT training website. Nevertheless, our group showed moderately good inter- and intrarater agreement for both scales, which indicates that, within our department, we can communicate with each other about patients’ pain levels. Calculated values of reliability coefficients were slightly lower for nurses than for doctors. That could be explained by the higher proportion of professionals with more work experience in neonatal intensive care in the doctors’ group.

The results of the scales’ items analysis are more conflicting. If we take into consideration only Krippendorff’s alpha coefficients, we failed to demonstrate interrater agreement in 5 out of 7 items of the COMFORT scale and in 2 out of 5 items of the NIPS. It can be speculated that these results are due to our lack of training and also technical difficulties related to applying some of these items to a video recording (e.g., evaluation of muscle tone or respiratory response). It is worth noting that results were more consistent for the NIPS scores where we collected the same number of observations for all items compared to the COMFORT-B where items differed in the number of observations. It is likely that our sample size in the COMFORT-B scale’s item analysis was inadequate for the estimation of Krippendorff’s alpha coefficients [25].

We used two different reliability coefficients that are suitable for ordinal ratings with more than 2 raters. They are based on different mathematical assumptions, which leads to different numerical values for the same datasets [26]. Krippendorff’s alpha is considered to be a conservative measure of reliability favouring more even distribution inferred as the pattern by which cases fall into categories [26]. Kendall’s W coefficient measures the associations between ratings with no assumptions regarding the nature of the probability distribution [27]. It is worth noting that all of Kendall’s W statistics reached a significance of p < .01. Moreover, for the interpretation of Kendall’s W coefficient, we employed Landis and Koch benchmarks [22] that were originally designed for Cohen’s kappa and are the most widely used in research. However, it is not certain whether they should be applied to coefficients based on different assumptions than kappa [28]. In studies of pain assessment scales in neonates, Cohen’s kappa, linearly weighted Cohen’s kappa, and the intraclass correlation coefficient were the most widely used [4]. We are convinced that the reliability measures we chose to apply in our study are suitable for the dataset we had to analyse [28]. However, we are aware that selecting them instead of kappa statistics precludes comparison of our results with other studies involving pain assessment in neonates [12].
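To make the contrast with Kendall’s W concrete, the following Python sketch computes Krippendorff’s alpha from a coincidence matrix. The ordinal difference function is an assumption, since the text does not state which metric was used. Units are taken to be the rated videos, with one list of scores per video. The function name and layout are hypothetical; the study’s own analysis was run in SPSS.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, Optional, Sequence, Tuple


def krippendorff_alpha_ordinal(units: Sequence[Sequence[Optional[int]]]) -> float:
    """Krippendorff's alpha with the ordinal metric; None marks a missing score."""
    # Coincidence matrix: each ordered pair of scores within a unit
    # contributes 1 / (m - 1), where m is the number of scores in that unit.
    coincidences: Dict[Tuple[int, int], float] = defaultdict(float)
    for scores in units:
        present = [s for s in scores if s is not None]
        m = len(present)
        if m < 2:
            continue  # a unit with a single score carries no agreement information
        for a, b in permutations(present, 2):
            coincidences[(a, b)] += 1 / (m - 1)

    categories = sorted({c for pair in coincidences for c in pair})
    totals = {
        c: sum(coincidences.get((c, k), 0.0) for k in categories)
        for c in categories
    }
    n = sum(totals.values())

    def delta2(c: int, k: int) -> float:
        """Squared ordinal distance between categories c and k."""
        lo, hi = min(c, k), max(c, k)
        between = sum(totals[g] for g in categories if lo <= g <= hi)
        return (between - (totals[lo] + totals[hi]) / 2) ** 2

    observed = sum(w * delta2(c, k) for (c, k), w in coincidences.items())
    expected = sum(
        totals[c] * totals[k] * delta2(c, k) for c in categories for k in categories
    ) / (n - 1)
    return 1 - observed / expected
```

Unlike W, alpha compares observed disagreement with the disagreement expected by chance given how often each score was used, so two raters who order the videos identically but assign systematically different scores can reach a high W and a low alpha.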

To our knowledge, there has been only one study comparing the NIPS and COMFORT-B scales [29]. It demonstrated that while evaluating painful procedures, the NIPS has a significantly higher coefficient of variation (CV, 188% ± 99%) compared to the COMFORT scale (33% ± 8%). We did not identify any studies comparing the interrater reliability of both scales. However, they were used together as endpoints in several randomised controlled trials [30–33].

The main limitation of our study is that nurses were proportionally much less well represented in our group than doctors. Only 19 of the 77 nurses employed at that time took part in our study, whereas the group of doctors included 17 of the 28 employed physicians. In our department, pain assessment is part of nurses’ responsibilities. In case of elevated pain scores, the adjustments of pharmacological treatment are discussed with doctors. Therefore, it is essential for physicians to be familiar with the pain scales used in the NICU. In our study, Kendall’s W coefficients indicated almost perfect interrater agreement among doctors and substantial agreement among nurses. Given the importance of pain assessment, it should be our aim to achieve agreement above 0.8 between nurses. The results of our study imply there is a need for more training in using pain assessment tools.

The strength of our study is that it shows the real-life experience of a tertiary NICU, where the strain of everyday duties and work overload at times leads to the omission of training in matters that seem intuitive and less vital than life-saving procedures. There is growing evidence that early life exposure to painful stimuli leads to long-term consequences such as altered pain sensitivity [34–37], impaired cognitive, behavioural, and motor development [38], and structural changes in the central nervous system detected in MRI studies [39–41]. Although there is no doubt that pain prevention and management are crucial in neonatal care, the introduction of pain assessment tools in everyday practice is a challenge. Newborns hospitalized in NICUs are affected by different types of pain, namely, acute, postoperative, and prolonged pain. Most of the available pain scales were validated for acute pain, while tools for the evaluation of prolonged pain are scarce. Moreover, it is known that the severity of illness may affect the pain expression in neonates. Given that most behavioural pain scales are based on pain expression indices, it has not been established yet whether the cutoff values used for pain assessment should be different for more severely ill patients [42]. It is evident that the “one-size-fits-all” approach to pain assessment in neonates is unsatisfactory. Staff members should be trained to recognise different types of pain in a given clinical context and apply assessment tools accordingly. However, some scales require the evaluation of so many parameters that it is difficult for a single caregiver to measure them accurately. In other cases, the intensive care setting, with tubes and tapes covering patients’ faces, precludes appropriate assessment of facial expressions. Furthermore, the main goal of pain assessment is to intervene with pain-alleviating treatment when needed. A study conducted in New York showed that pain scores documented in medical charts did not influence analgesic medication practices [43].

Some argue that there is no evidence that using standardized pain assessment tools improves patient outcomes [44]. Thus, efforts should be more focused on pain detection in everyday practice, while validated tools should be reserved for research purposes [45]. It is also advisable to engage parents in pain assessment, as they might be more motivated to detect pain than healthcare workers [46, 47]. Until better pain assessment tools are available, it is in the best interest of NICU patients that healthcare providers focus on pain detection combined with improvement of their competence in using validated pain scales. The latter may be achieved by regular training with evaluation of interrater reliability among staff members. We believe this study to be a starting point for us to improve our pain assessment with the use of both scales.

5. Conclusions

The results of our study demonstrate that implementing pain scales without prior training may lead to moderately good interrater agreement among staff members. The reliability values estimated here are not high enough to avoid inadequate pain assessment. Therefore, the development of a dedicated training programme is essential to improve our daily practice. Education should focus on the items of both scales that we identified as yielding the most inconsistent scores among our staff members.

Data Availability

The data used to support the findings of this study are available from the corresponding author ([email protected]) upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this study.

Acknowledgments

Research and publication were funded by the Children’s Memorial Health Institute grant for Young Researchers (grant number: M33/18).