Introduction
Gastric cancer (GC) is a major global health concern, ranking as the fifth most commonly diagnosed malignant tumor and the fourth leading cause of cancer-related deaths worldwide [1]. More than 95% of GC are adenocarcinomas [2], of which intestinal adenocarcinoma is the most common type [3].
Several histological changes seen in chronic atrophic gastritis (CAG), including atrophy, intestinal metaplasia (IM), and dysplasia, are key steps in the progression from normal mucosa to intestinal-type adenocarcinoma (the “Correa cascade”) [4]. The GC risk in CAG patients increases gradually as the Correa cascade progresses. Recent studies [5, 6] have shown that the annual incidence rates of gastric cancer for atrophy, intestinal metaplasia, and dysplasia are 0.1%, 0.12–0.25%, and 0.6–6%, respectively. The operative link on gastric intestinal metaplasia assessment (OLGIM) [7] and the operative link on gastritis assessment (OLGA) [8], which integrate the IM or atrophy score with the biopsy topography [9], are advocated by international guidelines [3, 10, 11] for risk stratification of individuals diagnosed with gastric precancerous conditions. However, accurately assessing pathological IM and atrophy and stratifying OLGIM/OLGA risk have been difficult for pathologists.
Since the Sydney system was updated in 1994 [12], pathologists have repeatedly pointed out that the diagnostic system is challenging to apply in clinical practice [5]. Diagnosing atrophic gastritis requires grading the severity of gland loss, which is difficult to quantify accurately, resulting in poor precision [11]. To overcome this difficulty, pathologists have tried various approaches, including holding meetings to unify conceptual terminology [13], circulating a set of pathological images for repeated training [14], and proposing more intuitive measurement methods [15]. However, these efforts either failed to improve accuracy effectively or proved difficult to apply widely. So far, the consistency and accuracy of histological diagnosis and GC risk stratification in patients with CAG remain limited [3, 6, 10, 11].
Deep learning has shown potential in medical image analysis. Deep learning-based automatic recognition has also achieved outstanding results on digital pathology images of breast, lung, colorectal, and prostate cancer [16–21]. Under certain conditions, the diagnostic performance of these artificial intelligence models is not inferior to that of human experts. However, these studies often use fully supervised learning, which requires pathologists to manually label lesions for pixel-level training, and such labeling is easily affected by various factors. Weakly supervised deep learning can automatically mine suspicious lesions given only accurate category labels. It suits situations in which labeling is strongly affected by subjective factors or manual annotations are difficult to obtain, and it is expected to further improve the accuracy and consistency of diagnosis when applied to computational pathology [22–24].
In this study, 2725 whole-slide images (WSIs) were collected consecutively from 545 patients with endoscopically suspected CAG in a multi-center trial to establish a deep neural network-based diagnostic model named GasMIL. A randomized observer study was conducted to verify the accuracy and consistency of the diagnoses made by pathologists assisted by GasMIL. We aimed to establish and validate a convolutional neural network algorithm for the diagnosis and risk stratification of individuals with precancerous gastric mucosal changes.
Methods
Study design
The data sets of this study were obtained from a multicentre, prospective trial registered at https://clinicaltrials.gov/ (registration number: NCT02955134). All authors had access to the study data and reviewed and approved the final manuscript. From December 22, 2017, to September 25, 2020, 545 patients suspected of having atrophic gastritis during endoscopy were consecutively enrolled at 13 tertiary hospitals. Patients were randomly divided into a model construction set and a test set at a 4:1 ratio. The model construction set was further randomly divided into a training set and a validation set at a 4:1 ratio (Fig. 1).
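The nested 4:1 split described above can be sketched as follows. This is only an illustration: the function name `split_patients`, the fixed seed, and the use of integer division for rounding are assumptions, and the actual group sizes are those reported in Fig. 1.

```python
import random

def split_patients(patient_ids, seed=0):
    """Split patients 4:1 into construction/test, then 4:1 into train/validation.

    Splitting is done per patient so that all slides of one patient
    stay in the same subset. The rounding rule (floor) is an assumption.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)
    n_test = len(ids) // 5                  # one fifth held out as the test set
    test, construction = ids[:n_test], ids[n_test:]
    n_val = len(construction) // 5          # one fifth of the rest for validation
    val, train = construction[:n_val], construction[n_val:]
    return train, val, test

train, val, test = split_patients(range(545))
print(len(train), len(val), len(test))
```

Splitting on patient identifiers rather than on slides keeps the five WSIs of a patient together, which avoids leakage between the training and test sets.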
The ethics committees of the 13 tertiary hospitals approved the trial protocol, and all participants signed informed consent forms. An independent data safety monitoring committee was responsible for monitoring the progress and safety of the trial.
Participants and biopsy assessment
The inclusion criteria were patients aged 40–65 years suspected of having chronic atrophic gastritis during endoscopy. The exclusion criteria included autoimmune gastritis, gastric or duodenal ulcers, upper gastrointestinal bleeding, high-grade intraepithelial neoplasia in the gastric mucosa, or suspected malignant transformation based on histological diagnosis.
For patients meeting the inclusion and exclusion criteria, specimens were taken from five sites in the stomach: the lesser and the greater curvature of the antrum (both within 2–3 cm of the pylorus), the lesser curvature of the corpus (about 4 cm proximal to the angulus), the middle portion of the greater curvature of the corpus (approximately 8 cm from the cardia), and the incisura angularis. The tissues were sectioned, scanned using MoticEasyScan Pro, and then uploaded to the online diagnostic system.
WSIs were reviewed by two pathologists, and superficial slices were excluded from the subsequent analysis. Three experienced pathologists independently graded each slide according to the updated Sydney system [12] (see Supplementary Figure S1 for diagnostic criteria), and the final diagnosis was reached after discussing any inconsistencies.
Model development
We proposed a deep neural network named GasMIL (Fig. 2) to predict the degree of atrophy/IM in gastric tissue slide images. A patient-level grade was then derived from the slide images of the five biopsy sites of each patient. Our algorithmic framework applied weakly supervised learning, specifically multiple instance learning (MIL). Unlike traditional fully supervised deep convolutional neural networks, MIL requires only a coarse-grained label (the pathological diagnosis) for each image, avoiding the need for complicated manual annotation by doctors. Additionally, we constructed MIL features for pathological images at different resolutions and performed multi-scale aggregation. The rationale for this design is that both shallow low-level features (such as local edges and textures) and deep high-level features (such as severe disease appearance) contain useful information for grade prediction, so extracting and integrating multi-scale image features helps in making the best grade decision.
Self-supervised learning for learning patch embedding
We cropped each WSI into non-overlapping patches of 224 × 224 pixels at resolutions of 0.5 µm per pixel (MPP) and 2.0 MPP. Before multiple instance learning, we pre-trained an embedding for each patch using SimCLR [25], proposed by Chen et al. This simple framework for contrastive learning was employed to learn robust image representations without manual labeling. For each original patch, we applied several data augmentation operations (including random rotation and random color distortion, among others) to generate sub-images and extracted features with a ResNet18-based encoder. We then constructed a contrastive loss that minimizes the feature-space distance between sub-images derived from the same original image. The output of the trained encoder was used for downstream MIL tasks.
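As a rough illustration of the contrastive objective, the following dependency-free sketch computes the normalized temperature-scaled cross-entropy (NT-Xent) loss used by SimCLR for a small batch of embeddings. The function names and the temperature of 0.5 are assumptions for illustration; the training code used in the study is not given in the text and would operate on encoder outputs in PyTorch.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nt_xent(views_a, views_b, temperature=0.5):
    """NT-Xent loss; views_a[i] and views_b[i] are two augmentations of patch i."""
    z = list(views_a) + list(views_b)
    n = len(views_a)
    total = 0.0
    for i in range(2 * n):
        pos = (i + n) % (2 * n)  # index of the other view of the same patch
        # Similarities to every other embedding in the batch (self excluded).
        logits = [cosine(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        pos_logit = cosine(z[i], z[pos]) / temperature
        total += math.log(sum(math.exp(l) for l in logits)) - pos_logit
    return total / (2 * n)

a = [[1.0, 0.0], [0.0, 1.0]]
b_matched = [[0.9, 0.1], [0.1, 0.9]]   # views close to their own originals
b_swapped = [[0.1, 0.9], [0.9, 0.1]]   # views close to the wrong originals
print(nt_xent(a, b_matched) < nt_xent(a, b_swapped))
```

The loss is low when the two views of the same patch are close in feature space and views of different patches are far apart, which is the behavior the encoder is trained toward.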
Construction of multi-scale patch embedding
For both the 0.5 MPP and 2.0 MPP resolutions, we constructed single-scale WSI classifiers. The patch embedding at 2.0 MPP was concatenated with the 0.5 MPP embedding at the corresponding physical position to obtain a comprehensive multi-scale patch embedding.
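One way to pair embeddings across scales is sketched below. It assumes that both patch grids start at the same slide origin, so that a 224-pixel patch at 2.0 MPP covers a 4 × 4 block of 224-pixel patches at 0.5 MPP; the grid-index dictionaries and the function name are hypothetical, not taken from the study.

```python
FINE_MPP, COARSE_MPP = 0.5, 2.0
SCALE = int(COARSE_MPP / FINE_MPP)  # one coarse patch spans SCALE x SCALE fine patches

def multiscale_embeddings(fine, coarse):
    """Concatenate each fine-scale embedding with its covering coarse-scale embedding.

    fine and coarse map a (row, col) grid index to an embedding (list of floats).
    Fine patches without a covering coarse patch are skipped.
    """
    combined = {}
    for (row, col), emb in fine.items():
        parent = (row // SCALE, col // SCALE)  # coarse patch covering this location
        if parent in coarse:
            combined[(row, col)] = list(emb) + list(coarse[parent])
    return combined

fine = {(0, 0): [1.0], (5, 2): [2.0]}
coarse = {(0, 0): [10.0], (1, 0): [20.0]}
print(multiscale_embeddings(fine, coarse))
```

With this pairing, every multi-scale vector carries both the local texture seen at high magnification and the surrounding tissue context seen at low magnification.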
Using WSI classifier to select key patches
Under the MIL assumption, when a WSI is labeled positive (label > 0), at least one patch contains the target lesion; when it is labeled negative, all patches are negative. Based on this assumption, we fed the multi-scale patch embeddings to a multilayer perceptron (MLP) classifier. After training this patch-level classifier, we obtained the probability of each patch in the current WSI being a lesion area and ranked the patches to identify those deserving the most attention. We uniformly selected the 100 top-ranked patches of each WSI as input to the downstream aggregator.
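The ranking step can be sketched in a few lines. The function name and the small `k` in the usage example are illustrative assumptions; in the pipeline, `k` is the fixed number of patches passed to the aggregator, and the probabilities come from the trained patch-level classifier.

```python
def select_key_patches(patch_probs, k):
    """Return the indices of the k patches with the highest lesion probability."""
    ranked = sorted(range(len(patch_probs)), key=lambda i: patch_probs[i], reverse=True)
    return ranked[:k]

probs = [0.10, 0.90, 0.40, 0.70]
print(select_key_patches(probs, k=2))
```

Because Python's sort is stable, patches with equal probability keep their original order, so the selection is deterministic.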
Traditional MIL uses pooling algorithms to combine the predicted probabilities of the top patches into a WSI-level prediction. However, these CV-based pooling algorithms ignore the correlations between patches. To exploit the potential gains from these correlations, we introduced a transformer into the aggregation stage. The 100 most critical patch vectors (those most likely to be lesion areas) obtained from each WSI in the first step were passed as a sequence to the transformer classifier to predict the grade probabilities of the entire WSI.
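To illustrate why attention can capture relationships that pooling ignores, the sketch below applies one round of single-head scaled dot-product self-attention to a set of patch vectors and then mean-pools the result. It is a deliberately stripped-down assumption: the query, key, and value projections are the identity, whereas a real transformer classifier has learned projections, multiple heads, feed-forward layers, and a classification head. It is not the architecture used in the study.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(patches):
    """Single-head scaled dot-product self-attention with identity projections."""
    d = len(patches[0])
    attended = []
    for q in patches:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in patches]
        weights = softmax(scores)
        # Each output is a weighted mix of all patch vectors in the slide.
        attended.append([sum(w * v[j] for w, v in zip(weights, patches)) for j in range(d)])
    return attended

def aggregate(patches):
    """Mean-pool the attended patch vectors into one slide-level vector."""
    attended = self_attention(patches)
    n = len(attended)
    return [sum(vec[j] for vec in attended) / n for j in range(len(attended[0]))]

patches = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(aggregate(patches))
```

Each output vector depends on every other patch in the slide, so similar patches reinforce one another before pooling; plain max or mean pooling of per-patch probabilities has no such interaction.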
Patient-level prediction model construction
For the five WSIs from the five gastric sites of the same patient, we obtained five predicted grades from the WSI-level prediction model. We then derived the final patient-level grade according to OLGA and OLGIM.
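The staging step can be sketched with the published OLGA staging frame, which OLGIM reuses with IM grades in place of atrophy grades. Two points are assumptions rather than details given in the text: the site key names are hypothetical, and each compartment score is taken here as the maximum grade among its biopsies, whereas the study may have combined biopsy grades differently.

```python
# Rows: antral compartment score 0-3 (incisura angularis included).
# Columns: corpus compartment score 0-3. Values: stage 0 to IV as 0-4.
STAGE_TABLE = [
    [0, 1, 2, 2],
    [1, 1, 2, 3],
    [2, 2, 3, 4],
    [3, 3, 4, 4],
]
ANTRUM_SITES = ("antrum_lesser", "antrum_greater", "incisura")  # hypothetical keys
CORPUS_SITES = ("corpus_lesser", "corpus_greater")              # hypothetical keys

def stage(grades, combine=max):
    """Map five per-site grades (0-3) to a patient-level stage (0-4).

    Pass atrophy grades for OLGA or IM grades for OLGIM.
    The combine rule for each compartment is an assumption (default: max).
    """
    antrum = combine(grades[s] for s in ANTRUM_SITES)
    corpus = combine(grades[s] for s in CORPUS_SITES)
    return STAGE_TABLE[antrum][corpus]

grades = {"antrum_lesser": 1, "antrum_greater": 2, "incisura": 0,
          "corpus_lesser": 0, "corpus_greater": 1}
print(stage(grades))
```

Stages III and IV are the ones generally treated as high risk, so a stratifier built on this table would typically threshold the returned value at 3.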
Observer study
Sixty patients from the test set were randomly selected for the observer study. All clinical information was concealed, and the patients were randomly divided into two groups: one read with the aid of GasMIL diagnostic results and the other without. The digital WSIs were distributed to 10 pathologists for diagnosis, and their diagnoses were recorded. The AUC, sensitivity, specificity, and consistency of the 10 pathologists' diagnoses were calculated for each group of slides.
Statistical analysis
We used the t-test and the Wilcoxon signed-rank test to examine age at baseline, and the chi-square test, Fisher’s exact test, the continuity-corrected chi-square test, and the signed-rank test to analyze gender. The Wilcoxon signed-rank test was also used to compare baseline histological data.
Receiver operating characteristic (ROC) curves, the area under the curve (AUC), accuracy, sensitivity, and specificity were computed with the Python machine learning package scikit-learn to quantify diagnostic classifier performance. The classification cutoff on the predicted probability was set at 0.5. Cohen’s kappa coefficient was used to assess interobserver agreement between diagnostic models and human pathologists. Python and PyTorch were used to build the WSI algorithms.
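For readers who want to see what these metrics compute, the dependency-free sketch below reproduces them for a binary task. It is illustrative only: the study used scikit-learn, the function names here are hypothetical, the kappa shown is unweighted, and treating a probability equal to the cutoff as positive is an assumption.

```python
def sensitivity_specificity(y_true, y_prob, cutoff=0.5):
    """Sensitivity and specificity after thresholding probabilities at cutoff."""
    y_pred = [int(p >= cutoff) for p in y_prob]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

def auc(y_true, y_prob):
    """AUC as the probability that a positive case outranks a negative one."""
    pos = [p for t, p in zip(y_true, y_prob) if t == 1]
    neg = [p for t, p in zip(y_true, y_prob) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa; undefined when chance agreement equals 1."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

y_true = [1, 1, 0, 0]
y_prob = [0.9, 0.4, 0.6, 0.1]
print(sensitivity_specificity(y_true, y_prob), auc(y_true, y_prob))
```

Note that the AUC does not depend on the 0.5 cutoff, whereas sensitivity and specificity do, which is why both kinds of metric are reported.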
Discussion
Recent studies have shown that deep learning-based algorithms are promising for classifying and grading pathological lesions in digitized H&E slides [26, 27]. The poor diagnostic accuracy and increasing diagnostic workload associated with endoscopic biopsy specimens call for a high-performance algorithm with high sensitivity and specificity [28].
In this study, we developed an algorithm named GasMIL to diagnose inflammation, activity, atrophy, and IM in gastric biopsy specimens, and it outperformed all ten pathologists. Atrophy in particular is defined by the discrepancy between the expected glands and those actually observed on histologic examination [29], a judgment that can be subjective and on which pathologists are most likely to disagree [30, 31]. Accurately diagnosing atrophy is considered crucial for the prevention of gastric cancer, as one study found that 37.2% of patients who developed gastric cancer had previously been diagnosed with indefinite atrophy [32]. In the present study, GasMIL showed 80% sensitivity, 85% specificity, and a weighted kappa of 0.61 in the observer study, and this kappa was higher than that of pathologists trained through four rounds of reading (kappa = 0.46) [14]. Therefore, at the slide level, GasMIL has the potential to serve as a tool that helps pathologists process the sheer number of samples within limited clinical working hours.
To assess whether the GasMIL model can also accurately stratify GC risk in CAG patients, OLGIM and OLGA stages were obtained by combining the results of the five WSIs of each patient. In the observer study, GasMIL ranked second in diagnostic accuracy for OLGA and fifth for OLGIM relative to the ten pathologists, with AUCs of 0.72 and 0.79, respectively. In contrast to OLGA, OLGIM has been reported to have high interobserver concordance, which is consistent with our observer study [7, 33]. However, OLGIM staging is considered less sensitive than OLGA staging because it downgrades high-risk patients to low-risk groups [34]. Obtaining both OLGA and OLGIM information from the same pathological slide is beneficial for the secondary prevention of GC [9, 35]. At the patient level, GasMIL showed high overall accuracy in GC risk stratification, suggesting that it can assist clinicians in individual GC risk stratification.
In addition, we conducted an observer study to investigate whether GasMIL can help pathologists improve diagnostic accuracy. With the assistance of GasMIL, the accuracy of pathologists in diagnosing IM improved significantly, as did their specificity in diagnosing atrophy. However, their accuracy in staging OLGIM and OLGA did not improve significantly. Because the sample size calculation for the observer study was based on slides [36], the sample of 30 patients may be too small to detect a statistically significant difference for OLGIM and OLGA. Nevertheless, Fig. 4 shows an increasing trend, with higher medians under GasMIL assistance, and a follow-up observer study with a larger sample size is needed to evaluate the role of GasMIL in OLGIM and OLGA staging.
To the best of our knowledge, this is the first study that aimed to establish and validate a convolutional neural network algorithm to diagnose and grade OLGA and OLGIM based on the updated Sydney protocol. The AGA [3] recommends that gastric biopsies according to the updated Sydney system should become standard in the diagnostic workup for dyspepsia and gastritis, a step that has been shown to increase the detection rate of H. pylori and IM [37]. After a standard biopsy, it is essential that the pathologist provides histologic scoring of the gastric biopsies (for OLGA/OLGIM staging), which avoids the need for a second staging examination to determine the risk level. However, many pathologists worldwide do not routinely report severity scores when grading IM and atrophy in biopsies [38]. A possible reason is that severity scoring relies on a rather subjective estimate of the affected proportion, making it difficult for pathologists to give an accurate and consistent result [13, 39, 40]. Therefore, we built a classification model for atrophy and IM using artificial intelligence techniques that can help quantify severity scores. High-quality multi-center data with strict quality control of all image acquisitions and histological analysis for every individual were used in this study, and a reliable gold standard diagnosis was jointly made by three experienced pathologists. Furthermore, GasMIL was fully validated through test set validation, comparison with ten pathologists, and comparison of GasMIL-assisted diagnosis with pathologists alone, to investigate its robustness and reliability. The results indicated that applying GasMIL for the quantitative analysis of WSIs offered valuable benefits for diagnosing atrophy and IM as well as for assessing GC risk in patients with CAG. Once the GasMIL model is established, pathologists only need to review and confirm the model's classification results in their daily workflow, which makes clinical application straightforward.
Some studies have applied convolutional neural networks to diagnose gastritis on biopsy H&E images. Panagiotis et al. reported a digital pathology framework for gastric gland segmentation and classification that achieved object Dice scores of 0.908 and 0.967, respectively, in a dataset of 85 WSIs from 20 patients covering normal mucosa, gastric atrophy, and IM [41]. Georg et al. reported a convolutional neural network-based algorithm to classify gastritis into autoimmune, bacterial, and chemical subtypes, achieving an overall accuracy of 84% in a data set of 135 patients [42]. However, their digital pathology framework and study design were fundamentally different from ours. We focused on the difficulty pathologists have in reaching a consensus when making diagnoses according to the updated Sydney system and established an algorithm that can precisely grade the severity of atrophy and IM and calculate the GC risk accordingly, which also helps determine follow-up intervals.
Our study has some limitations. First, our models were developed and validated independently for each gastric pathology. This approach increases memory overhead and does not consider the relationships between the image representations of different pathologies. Second, as a black box model, the deep learning model remains poorly interpretable. We tried to observe the model’s attention to different areas in the form of a heat map, but it is still difficult to explain how the model learns. Third, the sample size of the observer study was insufficient to detect a statistical difference between GasMIL-assisted diagnosis and pathologists alone in OLGIM and OLGA staging; a study with a larger sample size is needed.
In conclusion, GasMIL showed the best overall performance in diagnosing inflammation, activity, IM, and atrophy, and ranked fifth for OLGIM and second for OLGA when compared with ten pathologists. GasMIL assistance significantly improved the performance of pathologists in diagnosing IM and atrophy. These findings suggest that GasMIL has clinical application potential for accurate pathological grading and GC risk stratification in patients with atrophic gastritis.