Introduction
Positron Emission Tomography (PET) has established itself as a valuable tool in cancer diagnosis, prognosis, and clinical treatment decision-making [1–3]. With the upcoming shift from qualitative to quantitative imaging, hopes have been raised that the precise quantification of image-derived biomarkers will extend the current capabilities of PET and thus improve patient outcomes. Several studies on quantitative PET show promising results for tumor texture analysis, indicating the possibility of capturing tumor heterogeneity using radiomic features and subsequently utilizing these features to predict clinical outcomes [4, 5]. Radiomics refers to an approach that extracts and analyzes a large set of quantitative image-derived features (e.g., intensity, shape, texture) to reveal associations between medical imaging data and patient outcomes. Moreover, deep learning approaches are rapidly gaining relevance in PET imaging, enabling automated lesion segmentation [6, 7] as well as classification and detection of disease patterns [8, 9]. However, the clinical translation of these methods is currently lacking, which can be partly attributed to the low robustness and poor generalizability of these approaches. It has been shown that most radiomic feature values are sensitive to different scanners [10], acquisition protocols [11], and reconstruction settings [12]. The same applies to deep learning-based methods, where such acquisition shifts cause poor generalization to new, unseen data [13, 14]. Yet, this reflects current clinical practice, since ever-evolving imaging aspects such as scanners and protocols cannot be entirely standardized. The diversity of image acquisition leads to scans with different styles (e.g., caused by different scanners) and textures (e.g., induced by different reconstruction algorithms) among sites. We define style as the global visual appearance of an image, influenced by factors such as image contrast or brightness, and texture as a local characteristic related to the spatial distribution of voxel intensities.
In response, we propose a deep learning-based PET image harmonization method (GAN-harmonization) that aims to harmonize PET scans acquired at different centers and on different scanners. Our objective is to improve the reproducibility and predictive performance of quantitative image biomarkers. We utilize a cycle-consistent generative adversarial network (CycleGAN) that performs image style and texture translation between unpaired PET scans from different centers and scanners. In contrast to existing feature-based PET harmonization methods [15, 16], the approach is purely image-based, giving physicians and researchers access to the images after harmonization. This enables the potential use of the harmonized images in subsequent downstream tasks such as deep learning-based image segmentation and classification. We evaluate GAN-harmonization on two different datasets and tasks. First, we perform image harmonization on a dual-center whole-body lung cancer (LC) dataset, where we investigate the reproducibility of radiomic features in healthy liver tissue before and after harmonization. Second, we apply GAN-harmonization to a head and neck (HN) cancer dataset acquired at three centers, where we analyze the clinical impact by predicting patient outcome. No harmonization at all and the widely used feature-based harmonization technique ComBat served as benchmarks.
Discussion
Quantitative PET imaging has become a promising method that provides additional information regarding prognosis and treatment response monitoring in cancer patients, going beyond traditional qualitative imaging. However, the sensitivity of quantitative imaging markers to different scanners, acquisition protocols, and reconstruction algorithms is a limiting factor in large-scale multi-institutional studies. To bridge this gap, we developed a deep learning-based image harmonization method relying on CycleGANs that normalizes PET scans to remove site-specific image characteristics while retaining the clinically relevant biological information for the prediction of distant metastases in HN cancer patients. This was evidenced by high image similarity measures after harmonization and the preservation of predictive performance in a classification downstream task. We demonstrated the ability of CycleGANs to generate and model high-quality (whole-body) PET scans drawn from a given reference data distribution. Moreover, we observed an increase in the reproducibility of radiomic features after applying GAN image harmonization and showed that harmonized data enables building higher-performing models (based on cross-validation) from multi-center data compared to models built from non-harmonized data.
Substantial harmonization efforts have been made towards prospective studies, eventually resulting in the EARL guidelines [18, 31]. However, these guidelines rely on phantom data and hence may not be applicable in the retrospective studies that are needed to accelerate clinical translation. Post-reconstruction, feature-based harmonization methods relying on ComBat [32] have been proposed and successfully used in radiomics studies [15, 29, 33]. However, ComBat cannot be used in deep learning applications that operate on the image level (e.g., image segmentation), underscoring the need for an image-based solution. In contrast to simple image smoothing techniques that typically rely on predefined filters or heuristics, CycleGANs utilize deep neural networks and are hence capable of learning a mapping function between a source and a target center. They can capture complex relationships that may exist between the imaging data. Moreover, they aim to preserve the diagnostic information present in the original images through the cycle-consistency loss. This constraint is missing for simple image filters, whose application may smooth out important features and thus potentially interfere with the underlying biological signal.
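The cycle-consistency constraint can be illustrated with a minimal sketch. The toy generators below are hypothetical, invertible intensity transforms standing in for the trained CNN generators; only the loss structure (translate, translate back, compare) mirrors the CycleGAN objective.

```python
import numpy as np

# Toy "generators": in the real model these are deep CNNs (g_ab maps
# center-A scans towards center-B appearance and g_ba the reverse).
# Here they are simple invertible transforms, purely for illustration.
def g_ab(x):
    return 1.2 * x + 0.1

def g_ba(x):
    return (x - 0.1) / 1.2

def cycle_consistency_loss(x_a, x_b):
    """Mean absolute error between each image and its double translation."""
    loss_a = np.mean(np.abs(g_ba(g_ab(x_a)) - x_a))  # A -> B -> A
    loss_b = np.mean(np.abs(g_ab(g_ba(x_b)) - x_b))  # B -> A -> B
    return loss_a + loss_b

x_a = np.random.rand(8, 8)  # stand-in for a PET slice from center A
x_b = np.random.rand(8, 8)  # stand-in for a PET slice from center B
print(cycle_consistency_loss(x_a, x_b))  # approximately zero here
```

Because the toy generators are exact inverses, the loss is numerically zero; during training, minimizing this term pushes the learned generators towards the same property, discouraging translations that discard diagnostic content.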
The capabilities of deep generative models to perform image harmonization have been studied for different imaging modalities [34–39]. For brain magnetic resonance imaging (MRI), Hognon et al. proposed a contrastive deep image adaptor network, showing a positive impact of their method on a downstream segmentation task [34]. In their approach, the authors used a combination of several different loss functions to train the network. Tixier et al. conducted a comparative study between conventional histogram matching and generative adversarial networks used for radiomics data harmonization in outcome prediction modeling [35]. They found that the predictive value of certain radiomic features could be recovered after applying multi-institutional harmonization and showed that GAN-harmonization outperformed histogram matching. The influence of image harmonization on the generalizability of a radiomics model for grading meningiomas under external validation has been studied by Park et al. [36]. The variability of radiomic features in chest radiography acquired from two different vendors was evaluated previously [37]. Both studies showed that a CycleGAN can reduce image variability while improving the predictive performance of radiomic features, which is in line with our study. Similar to this study, Marcadent et al. [37] reported high structural similarity measures between the unharmonized and harmonized images and an increase in feature reproducibility after performing GAN-based texture translation. Choe et al. presented a deep learning-based image conversion approach that effectively reduced radiomic feature differences caused by different reconstruction kernels in chest CT imaging of pulmonary nodules or masses [38]. However, besides the varying imaging modalities and clinical tasks, all of these studies used a 2D input, which is different from our work. While the loss of spatial information along the z-axis is not a constraint for chest radiography, which is inherently a planar imaging technique, it can be limiting for tomographic imaging such as PET. The utilization of 3D convolutions for volumetric images and objects is not only more intuitive, but it also provides additional contextual information to the network. Especially in whole-body imaging, where many structures cannot be recognized from a single slice, information transfer between adjacent slices helps avoid inter-slice artifacts that may adversely affect performance, as shown in previous studies [40]. By successfully applying GAN-harmonization to a whole-body lung cancer PET dataset and producing high-quality images, we have shown evidence for the extended use of GANs in whole-body imaging, enabling the potential application to other modalities (CT or MRI), as called for by others [41].
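The added inter-slice context of 3D operations can be made concrete with a naive 3×3×3 mean filter: each output voxel also pools intensities from the two neighboring slices, which a slice-wise 2D filter never sees. This is purely an illustrative sketch, not the 3D convolution layers of our network.

```python
import numpy as np

def mean_filter_3d(volume):
    """Naive 3x3x3 mean filter (valid padding). Each output voxel pools
    a 27-voxel neighborhood spanning three adjacent slices, i.e., it
    uses inter-slice context that a per-slice 2D filter cannot access."""
    d, h, w = volume.shape
    out = np.zeros((d - 2, h - 2, w - 2))
    for z in range(d - 2):
        for y in range(h - 2):
            for x in range(w - 2):
                out[z, y, x] = volume[z:z + 3, y:y + 3, x:x + 3].mean()
    return out

vol = np.arange(27, dtype=float).reshape(3, 3, 3)  # tiny stand-in volume
print(mean_filter_3d(vol))  # one output voxel: the mean of all 27 intensities
```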
It is important to note that our study had limitations that should be taken into consideration. Although we observed overall high global image similarity, we identified potential failure modes of the GAN in regions with varying fields of view within the datasets, as typically present in the head and brain regions. GAN predictions for those regions exhibited higher uncertainties but may be regularized by enlarging the training dataset with a diverse set of samples. GAN-induced image artifacts require visual inspection by a physician and may confound the uptake values in the corresponding regions. Moreover, larger datasets from different centers and scanners are needed to further investigate the ability of GAN-harmonization to improve the generalizability of radiomics and deep learning models for different applications and diseases. This is particularly important in a more clinically realistic scenario and with large external holdout cohorts. Moreover, due to the relatively small and imbalanced HN dataset, predictive performance measurements were performed by mixing all three centers in a 100-fold Monte Carlo cross-validation rather than training the radiomics model on one center and deploying it on the others and vice versa. We chose this evaluation strategy in favor of the higher statistical power afforded by the cross-validation scheme. Furthermore, no feature selection was used, since we wanted to truly assess the contribution of all individual features after harmonization. In the HN dataset, tumor delineations were taken over from [21] and were therefore based on the unharmonized images. The same delineations were used for the harmonized images. Even though this procedure avoids bias from inter- and intra-operator variability, it does not reflect an ideal clinical scenario, in which tumor delineations should have been drawn on the harmonized images. Radiomic features were extracted from the largest lesion only. This was done because there is currently no clear consensus on how to aggregate features from multiple lesions [42] and previous studies focused on the largest lesion [43, 44]. Even though this study did not aim to build the best possible prediction model for HNSCC, this choice is suboptimal and may weaken the results. All these factors may make the results of the HN outcome prediction overoptimistic, and it cannot be ruled out that GAN-harmonization made overfitting of the model easier. It is also important to note that the experimental setup for the HN outcome prediction did not meet the optimal conditions for using ComBat, as ComBat typically requires each batch to include a sufficient number of patients (around 20–30) acquired with a single imaging protocol on the same scanner. Moreover, the varying numbers of patients who developed DM at each center (CHUM: n = 3, HGJ: n = 16, HMR: n = 11) indicate clinical differences within the HN subcohorts, violating the assumption of ComBat that the different samples come from the same population and are affected by technical differences only. Additionally, a covariate accounting for the different voxel and matrix sizes within each center (Table 1) could have been introduced to both ComBat and the GAN for improved harmonization results. This was not done in our experiments because it is not recommended to use covariates with ComBat when there are fewer than 20–30 patients per covariate in each batch [45]. This might explain the results obtained with ComBat in the context of this study. Finally, all harmonization methods considered here are inherently and by design static, meaning that they require access to the data from both sites at once. Similar to ComBat, which needs to be performed for each new dataset, GAN-harmonization must therefore be explicitly retrained on a dataset-by-dataset basis if acquisition shifts occur. This process is computationally more expensive than the less complex ComBat method. Once the GAN is trained and deployed, it cannot cope with situations where acquisition shifts occur at unknown timepoints, e.g., when an additional new scanner is introduced or protocols change. While this assumption can hold true in well-defined retrospective (and prospective) studies, it does not reflect real-world clinical environments, which change dynamically. Nevertheless, current standards require re-approval of AI software each time the model is adapted during deployment, thereby potentially allowing adjustment for such scenarios. While continual machine learning may be another promising solution to changes at unknown timepoints, it is currently unclear when and to what extent continual learning strategies will be implemented for applications in medicine [46]. Hence, to facilitate studies with large-scale and technically heterogeneous cohorts, as well as to accelerate clinical translation, harmonization approaches will play an important role in the near future.
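For readers unfamiliar with ComBat, its location-scale core (without the empirical Bayes shrinkage of the full method) amounts to standardizing each feature per batch and rescaling to the pooled statistics. The function below is a hypothetical, simplified sketch, not the ComBat implementation used in this study.

```python
import numpy as np

def location_scale_harmonize(features, batches):
    """Align each feature's per-batch mean/std to the pooled statistics.
    features: (n_samples, n_features) radiomic feature matrix.
    batches:  (n_samples,) batch (e.g., center/scanner) labels."""
    features = np.asarray(features, dtype=float)
    batches = np.asarray(batches)
    out = np.empty_like(features)
    grand_mean = features.mean(axis=0)
    grand_std = features.std(axis=0)
    for b in np.unique(batches):
        idx = batches == b
        mu = features[idx].mean(axis=0)
        sd = features[idx].std(axis=0)
        sd[sd == 0] = 1.0  # guard against constant features within a batch
        out[idx] = (features[idx] - mu) / sd * grand_std + grand_mean
    return out
```

Note that this simplified form removes any between-batch mean difference entirely, including real biological differences, which is exactly the assumption violated by the clinically heterogeneous HN subcohorts discussed above.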
In summary, we present here a GAN-harmonization method that has the potential to improve the reproducibility and predictive performance of quantitative PET imaging. We demonstrated the ability of GAN-harmonization to enhance predictive performance by directly linking PET image harmonization to an improved clinical outcome prediction for HN cancer patients.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.