Introduction
Hypoxia is a leading cause of neonatal morbidity and mortality. It can have consequences such as hypoxic-ischemic encephalopathy (HIE), organ dysfunction, developmental delays or cognitive impairments impacting the overall development of the baby [1]. Prompt medical intervention is crucial to minimize the potential long-term effects and improve outcomes for affected infants. Cardiotocography (CTG) is a non-invasive device that records fetal heart rate (FHR) and uterine contractions (UC). It is widely used as a screening tool in obstetric practice to determine fetal wellbeing. Specifically, obstetricians and midwives employ it during labor to identify fetal hypoxia, enabling them to intervene promptly in case of a pathological signal.
CTG analysis and interpretation is performed visually by obstetricians and midwives following guidelines [2]. There are different classifications with varying characteristics [2‐6], without clear international consensus among them [7‐10]. Although the guidelines are constantly being challenged and reviewed [11, 12], the overall process of interpreting CTG during delivery is known to be subjective and to induce a significant interobserver and intraobserver variability [13, 14].
The primary constraints of studies examining interobserver and intraobserver variation are the limited numbers of both assessors and annotated cases [15]. To address these limitations, we developed a tool, available at www.fhr-annotator.com, that allows practitioners to annotate 100 cases sourced from the CTU-UHB open database [16]. The objectives were to evaluate the accuracy of fetal hypoxia prediction and the interobserver agreement and reliability among a wide range of practitioners, using an open-source database with CTG signals, clinical data and fetal outcomes.
Discussion
With 2950 annotated cases and 120 participants from different professions and experience levels, our study is the largest evaluating the accuracy and the interobserver variability of CTG interpretation during labor. Over the whole set of annotations, we found a moderate mean success rate (0.58) in predicting fetal hypoxia, and the sensitivity and specificity were 0.58 and 0.63 respectively. The global interobserver agreement and reliability were moderate to good (PA=0.82, K=0.63). We did not find a significant difference in the success rate between the different professions or according to the number of years of experience. In contrast, we found a much lower success rate on cases with moderate hypoxia (pH between 7.05 and 7.20). These ambiguous cases are often associated with non-reassuring CTG patterns.
The main strength of the study is the large size and diversity of our sample, which reflects the composition of a labor ward team. Also, the annotation tool developed for this study was appreciated by the participants and enabled us to evaluate consistently how practitioners interpret CTG signals and the main clinical variables during delivery. The data on important characteristics of the participants (their profession, place of work and number of years of experience) enabled us to analyze how these characteristics impacted the success rate. Finally, our choice to include equal numbers of normal and pathological cases (pH lower than 7.15) was important to ensure that the participants annotated a sufficient number of cases with fetal hypoxia, which helped in estimating sensitivity and specificity with high precision. Participants were not informed of the study design, in which the cases were presented in batches of 10 CTGs, in random order within each batch, with a 50/50 ratio of pathological and normal cases. It is very unlikely that participants identified this pattern and modified their answers accordingly.
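As an illustration of how the success rate, sensitivity and specificity discussed above can be derived from binary annotations, the following minimal Python sketch labels a case as hypoxic when its pH is lower than 7.15, as in the study. The function name, the input layout and the example values are assumptions made for illustration; this is not the analysis code used by the authors.

```python
from math import nan
from typing import Dict, Sequence

PH_THRESHOLD = 7.15  # pH below this value is treated as fetal hypoxia


def binary_metrics(
    ph_values: Sequence[float],
    predicted_hypoxia: Sequence[bool],
    threshold: float = PH_THRESHOLD,
) -> Dict[str, float]:
    """Compare annotator predictions with the pH-based outcome."""
    if len(ph_values) != len(predicted_hypoxia):
        raise ValueError("ph_values and predicted_hypoxia must have equal length")

    tp = fp = tn = fn = 0
    for ph, predicted in zip(ph_values, predicted_hypoxia):
        actual = ph < threshold  # ground truth derived from the pH
        if actual and predicted:
            tp += 1
        elif actual and not predicted:
            fn += 1
        elif not actual and predicted:
            fp += 1
        else:
            tn += 1

    total = tp + fp + tn + fn
    return {
        # Share of annotations that match the outcome
        "success_rate": (tp + tn) / total if total else nan,
        # Share of hypoxic cases correctly flagged
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Share of normal cases correctly cleared
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


# Hypothetical annotations: three hypoxic cases and two normal cases
example = binary_metrics(
    [7.00, 7.10, 7.14, 7.20, 7.30],
    [True, True, False, False, True],
)
print(example)
```

With a balanced set of normal and pathological cases, as in the study design, both denominators are large enough for sensitivity and specificity to be estimated with comparable precision.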
Our study adheres to the GRRAS guidelines [17], which are not followed by many similar studies according to Engelhart et al. [15]. We assessed agreement and reliability with PA and kappa respectively, as recommended by both the GRRAS guidelines [17] and the work by Costa Santos et al. [25] reviewing how agreement and reliability studies in obstetrics and gynecology should be conducted. Nevertheless, we identified some limitations. First, there is probably a selection bias in the participants included in the study. They are practitioners who voluntarily dedicated a substantial amount of time to annotating the cases. They may also be individuals who spend more time in the delivery room and are thus interested in taking part in studies evaluating CTG interpretation. Additionally, most participants were working in a university hospital, which may not be fully representative of the current demographics of maternity wards. These factors could contribute to the high level of agreement within this particular cohort, and lead to an overestimation of the accuracy compared to a general population of practitioners. Second, we chose to set the pH threshold corresponding to fetal hypoxia at 7.15. This enabled us to compare our results as accurately as possible with the existing literature, in particular with Hruban et al. [14], who used the same threshold as well as cases extracted from the CTU-UHB database. The pH 7.15 threshold corresponds to moderate fetal hypoxia: in clinical practice, detecting it before it progresses to severe hypoxia gives practitioners the ability to intervene in a timely manner and ultimately leads to better outcomes. A more realistic setting would have been to define three CTG tracing categories (pathological, suspicious, normal) or even more, in accordance with the CTG interpretation guidelines [29]. However, this choice would have made comparison with existing literature more difficult. Finally, the CTG signals in the CTU-UHB dataset contain a large share of missing data points compared to other existing datasets: for example, on average 19% of the points in the FHR signal are missing, compared to 7% in the SPaM dataset [30]. Also, it is known that the FHR signal can be contaminated by the maternal heart rate [31]. These factors make CTG interpretation harder for practitioners [30, 32, 33].
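To make the distinction between agreement (PA) and reliability (kappa) concrete, here is a minimal Python sketch for the simplest setting: two raters giving binary labels to the same cases. It uses Cohen's kappa, which corrects the observed agreement for the agreement expected by chance. This is an assumption chosen for illustration; the study involves many raters and a group-level consensus, so its estimators may differ. Function names and example labels are hypothetical.

```python
from math import nan
from typing import Sequence


def proportion_of_agreement(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Share of cases on which the two raters give the same label."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of cases")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def cohen_kappa(rater_a: Sequence[int], rater_b: Sequence[int]) -> float:
    """Chance-corrected agreement between two raters (Cohen's kappa)."""
    n = len(rater_a)
    observed = proportion_of_agreement(rater_a, rater_b)
    labels = set(rater_a) | set(rater_b)
    # Agreement expected if both raters labelled independently,
    # each with their own label frequencies
    expected = sum(
        (list(rater_a).count(label) / n) * (list(rater_b).count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return nan  # kappa is undefined when chance agreement is total
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: 1 = hypoxia predicted, 0 = normal predicted
rater_a = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(proportion_of_agreement(rater_a, rater_b))
print(cohen_kappa(rater_a, rater_b))
```

In this example the raters agree on 8 of 10 cases, but half of that agreement would be expected by chance, so kappa is noticeably lower than PA. This is why the two measures are reported together.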
The results obtained in our large study confirm the limitations of visual interpretation of CTG signals, with a low success rate, sensitivity and specificity. Comparing our results with the literature evaluating CTG interpretation is challenging because existing studies differ in several respects, including the classification system employed, the number of professionals involved, the expertise or experience of the participating professionals, the multicenter design of the study, the specific pH threshold selected for defining hypoxia, and the statistical methods used to compute agreement and reliability. While our choice of using the group-level consensus rating per case offered simplicity in analyzing interprofessional agreement and reliability, it came with the risk of overestimating these measures, especially when compared to individual-level assessments. The existing study with the most similar protocol was Hruban et al. [14], and we were able to compare the annotations provided by the nine experts included in their study with our results on the same set of 100 cases. The success rate is comparable, but the experts have a higher specificity and lower sensitivity. Generally, experts have a better sensitivity than the general population [13, 34, 35], which may be consistent with their role, ultimately being a second line that assists in making decisions regarding a suspicious case. The design of our study, which includes as many normal cases as cases of fetal hypoxia, may explain the differences observed in the experts’ sensitivity. This highlights the challenging task of defining an expert in CTG interpretation.
We did not find a significant impact of the level of experience or the profession. Even if the difference is not significant, midwives had a better mean success rate in our cohort. This may be because all midwives who participated in the study practice daily in the labor ward, which may not be the case for some obstetrician-gynecologists (for example, those specialized in surgery). Also, as the midwives labelled on average more cases than the other professions, they may have improved their annotation skills with experience [36] using the feedback provided after each annotation. This trend may also be partly explained by them becoming more accustomed to the tool. Past studies involving both midwives and obstetricians are based on smaller or less diverse databases including only a few practitioners [10, 28, 34, 35, 37‐40]. All of them found a poor interobserver reliability, with a kappa coefficient ranging between 0.18 and 0.38. However, these studies included only a very small number of practitioners (fewer than ten), evaluated different outcomes, or had different inclusion criteria. For example, Blix et al. studied the assessment of CTG signals at admission [35], Figueras et al. included antepartum CTGs [40], Kundu et al. tried to predict the pH outcome from CTG signals [39] and Devoae et al. asked practitioners to annotate baselines, accelerations and decelerations but not to predict the fetal outcome [10]. Recently, a review by Engelhart et al. [15] did not find any clear association between the level of experience or profession and the accuracy of the annotations.
Finally, we found a higher success rate and stronger agreement for cases with a pH lower than 7.05 and for cases with a pH higher than 7.20. Conversely, cases with a pH between 7.05 and 7.20 were more challenging to annotate for our participants, with a success rate below 0.50 in this category. This conclusion is consistent with past studies [14, 35, 41, 42] and with a recent review highlighting the high reliability for CTG signals classified as normal [15]. In practice, when interpretation is difficult, some professionals use invasive second-line analyses to improve their ability to predict hypoxia, such as fetal scalp blood sampling (FBS) and ST analysis. While the value of FBS remains a topic of debate [43], STAN (ST Analysis) has demonstrated its value in aiding clinical decision-making in retrospective cohorts [44]. Our study showed that for ambiguous cases the practitioners’ success rate was indeed very low, confirming the need for specific tools to assist them. Beyond invasive analyses, computerized systems hold promising potential for improving the interpretation of CTG signals [45] and represent an interesting way to increase the accuracy while reducing interobserver variability [38, 46‐48], especially within the critical pH range between 7.05 and 7.20.
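The stratified comparison described above can be sketched as follows: each annotation is assigned to one of three pH bands (lower than 7.05, between 7.05 and 7.20, higher than 7.20), and the success rate is computed within each band against the 7.15 hypoxia threshold. This is a minimal Python sketch under stated assumptions: the band names, the inclusive handling of the band edges and the example annotations are hypothetical and not taken from the study.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

PH_THRESHOLD = 7.15  # pH below this value is treated as fetal hypoxia


def ph_band(ph: float) -> str:
    """Assign a case to a pH band (band names are illustrative)."""
    if ph < 7.05:
        return "below_7.05"
    if ph <= 7.20:
        return "7.05_to_7.20"
    return "above_7.20"


def success_rate_by_band(
    annotations: Iterable[Tuple[float, bool]],
) -> Dict[str, float]:
    """Success rate per band for (pH, predicted_hypoxia) annotations."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for ph, predicted in annotations:
        band = ph_band(ph)
        total[band] += 1
        # An annotation is correct when it matches the pH-based outcome
        if predicted == (ph < PH_THRESHOLD):
            correct[band] += 1
    return {band: correct[band] / total[band] for band in total}


# Hypothetical annotations as (pH, predicted_hypoxia) pairs
annotations = [
    (7.00, True), (7.02, False),                 # lower than 7.05
    (7.10, False), (7.18, True), (7.16, False),  # between 7.05 and 7.20
    (7.25, False), (7.30, False),                # higher than 7.20
]
print(success_rate_by_band(annotations))
```

Because the 7.15 threshold falls inside the middle band, that band mixes cases just below and just above the cut-off, which is one reason it is the hardest to annotate correctly.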
Conclusion
While the effectiveness of cardiotocography in reducing neonatal morbidity is still debated [49], it remains the primary method for assessing fetal well-being during labor. Several past studies have highlighted the poor accuracy of practitioners and the high interobserver variability in the interpretation of CTG signals. The use of an online annotation tool enabled us to gather the largest and most comprehensive database to evaluate the interobserver agreement and reliability in the interpretation of CTG signals.
We have shown that there is no significant difference in success rate between the different professions or levels of experience. Additionally, the cases with moderate hypoxia (pH between 7.05 and 7.20) were much harder to annotate, with a mean success rate below that of a random guess. The possible selection biases among the participants may even have led to an overestimation of the success rates and agreement in our cohort, and these results should be considered keeping in mind the complexity and pitfalls of agreement and reliability studies. As described in previous studies, we think that computerized systems helping practitioners in the interpretation of CTG signals [45] are a promising way to increase the accuracy while reducing interobserver variability in the future [38].
Also, the annotation tool developed as part of this research will lead to future studies. First, the continuous growth in the number of participants and annotations will make the results more robust and could yield new insights. Second, the tool can be used to investigate specific questions, for example comparing the success rate of practitioners in different countries using different classifications, deepening our understanding of the cases that are hard to annotate for practitioners, or evaluating how the information provided by a computerized CTG system may assist them. Finally, it can also be used by practitioners as a training tool.