Here, a new model for thyroid nodule detection in ultrasound images is proposed that uses the Swin Transformer as the backbone to perform long-range context modeling. Experiments show that it performs well in terms of sensitivity and accuracy.
In recent years, the incidence of thyroid cancer has been increasing. Thyroid nodule detection is critical for both the detection and treatment of thyroid cancer. Convolutional neural networks (CNNs) have achieved good results in thyroid ultrasound image analysis tasks. However, due to the limited effective receptive field of convolutional layers, CNNs fail to capture long-range contextual dependencies, which are important for identifying thyroid nodules in ultrasound images. Transformer networks are effective at capturing long-range contextual information. Inspired by this, we propose a novel thyroid nodule detection method that combines a Swin Transformer backbone with Faster R-CNN. Specifically, an ultrasound image is first projected into a 1D sequence of embeddings, which is then fed into a hierarchical Swin Transformer.
The Swin Transformer backbone extracts features at five different scales by utilizing shifted windows for the computation of self-attention. Subsequently, a feature pyramid network (FPN) is used to fuse the features from the different scales. Finally, a detection head is used to predict bounding boxes and the corresponding confidence scores. Data collected from 2,680 patients were used to conduct the experiments, and the results showed that this method achieved the best mAP score of 44.8%, outperforming CNN-based baselines. The method also achieved higher sensitivity (90.5%) than the competing methods. This indicates that the context modeling in this model is effective for thyroid nodule detection.
The incidence of thyroid cancer has increased rapidly since 1970, especially among middle-aged women1. Thyroid nodules may predict the emergence of thyroid cancer, and most thyroid nodules are asymptomatic2. The early detection of thyroid nodules is very helpful in curing thyroid cancer. Therefore, according to current practice guidelines, all patients with suspected nodular goiter on physical examination or with abnormal imaging findings should undergo further examination3,4.
Thyroid ultrasound (US) is a common method used to detect and characterize thyroid lesions5,6. US is a convenient, inexpensive, and radiation-free technology. However, the application of US is easily affected by the operator7,8. Features such as the shape, size, echogenicity, and texture of thyroid nodules can be assessed on US images. Although certain US features (calcifications, echogenicity, and irregular borders) are often considered criteria for identifying thyroid nodules, interobserver variability is unavoidable8,9. Diagnostic results vary among radiologists with different levels of experience, and inexperienced radiologists are more likely to misdiagnose than experienced ones. Moreover, some characteristics of US, such as reflections, shadows, and echoes, can degrade image quality. This degradation, which is inherent to US imaging, makes it difficult for even experienced physicians to locate nodules accurately.
Computer-aided diagnosis (CAD) for thyroid nodules has developed rapidly in recent years and can effectively reduce errors caused by different physicians and help radiologists diagnose nodules quickly and accurately10,11. Various CNN-based CAD systems have been proposed for thyroid US nodule analysis, including segmentation12,13, detection14,15, and classification16,17. A CNN is a multilayer, supervised learning model18 whose core modules are the convolution and pooling layers. The convolution layers are used for feature extraction, and the pooling layers are used for downsampling. The shallow convolutional layers extract primary features such as texture, edges, and contours, while the deep convolutional layers learn high-level semantic features.
CNNs have had great success in computer vision19,20,21. However, CNNs fail to capture long-range contextual dependencies due to the limited effective receptive field of the convolutional layers. In the past, backbone architectures for image classification mostly used CNNs. With the advent of Vision Transformer (ViT)22,23, this trend has changed, and many state-of-the-art models now use transformers as backbones. Based on non-overlapping image patches, ViT uses a standard transformer encoder25 to globally model spatial relationships. The Swin Transformer24 further introduces shifted windows to learn features. Computing self-attention within local windows greatly reduces the sequence length and improves efficiency, while shifting the window partition between consecutive layers allows adjacent windows to interact. The successful application of the Swin Transformer in computer vision has led to the investigation of transformer-based architectures for ultrasound image analysis26.
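The window mechanism described above can be illustrated with a minimal NumPy sketch. This is not the actual Swin Transformer implementation; it only shows how a feature map is partitioned into non-overlapping windows (inside which self-attention would be computed) and how a cyclic shift lets adjacent windows exchange information in the next layer:

```python
import numpy as np

def window_partition(feature_map, window_size):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns an array of shape (num_windows, window_size, window_size, C);
    self-attention is then computed independently inside each window.
    """
    H, W, C = feature_map.shape
    x = feature_map.reshape(H // window_size, window_size,
                            W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def shift_windows(feature_map, shift):
    """Cyclically shift the feature map so that the next round of
    windowed attention mixes information across window borders."""
    return np.roll(feature_map, shift=(-shift, -shift), axis=(0, 1))

# A toy 8 x 8 single-channel feature map with 4 x 4 windows
fmap = np.arange(64, dtype=np.float32).reshape(8, 8, 1)
windows = window_partition(fmap, 4)            # 4 windows of 4 x 4
shifted = shift_windows(fmap, 2)               # shift by window_size // 2
shifted_windows = window_partition(shifted, 4)
```

Because attention is restricted to each window, its cost grows linearly with image size rather than quadratically, which is what makes the Swin Transformer practical as a dense-prediction backbone.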
Recently, Li et al. proposed a deep learning approach28 for thyroid papillary cancer detection inspired by Faster R-CNN27. Faster R-CNN is a classic CNN-based object detection architecture. The original Faster R-CNN has four modules: the CNN backbone, the region proposal network (RPN), the ROI pooling layer, and the detection head. The CNN backbone uses a set of basic conv+bn+relu+pooling layers to extract feature maps from the input image. These feature maps are then fed into the RPN and the ROI pooling layer. The role of the RPN is to generate region proposals: it uses softmax to classify anchors as positive or negative and refines the positive anchors by bounding box regression. The ROI pooling layer extracts a fixed-size feature map for each proposal from the input feature maps and feeds these proposal feature maps into the subsequent detection head. The detection head uses the proposal feature maps to classify objects and obtains accurate detection boxes by bounding box regression.
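The anchors that the RPN classifies are generated at every feature-map location from a small set of scales and aspect ratios. The following is a simplified NumPy sketch of this anchor generation (the `base_size`, ratios, and scales shown are the common Faster R-CNN defaults, used here only for illustration):

```python
import numpy as np

def generate_anchors(base_size, ratios, scales):
    """Generate (x1, y1, x2, y2) anchors centered at the origin for a
    single feature-map location, as in the Faster R-CNN RPN."""
    anchors = []
    for ratio in ratios:
        for scale in scales:
            # Keep the anchor area fixed for a given scale while
            # varying the height/width ratio
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)

# 3 aspect ratios x 3 scales = 9 anchors per feature-map location
anchors = generate_anchors(base_size=16, ratios=[0.5, 1.0, 2.0],
                           scales=[8, 16, 32])
```

At inference time, these anchors are tiled over the whole feature map, scored by the RPN, and the highest-scoring refined anchors become the region proposals passed to ROI pooling.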
This paper presents a new thyroid nodule detection network called Swin Faster R-CNN formed by replacing the CNN backbone in Faster R-CNN with the Swin Transformer, which results in the better extraction of features for nodule detection from ultrasound images. In addition, the feature pyramid network (FPN)29 is used to improve the detection performance of the model for nodules of different sizes by aggregating features of different scales.
This retrospective study was approved by the institutional review board of the West China Hospital, Sichuan University, Sichuan, China, and the requirement to obtain informed consent was waived.
1. Environment setup
2. Data preparation
3. Swin Faster R-CNN configuration
4. Training the Swin Faster R-CNN
5. Performing thyroid nodule detection on new images
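Step 3 above assembles the model from its components (backbone, neck, and heads). As an illustration of what such a configuration looks like, the following is an MMDetection-style config fragment; the field names follow MMDetection conventions, but the concrete values here are placeholders for illustration, not the exact settings used in this protocol:

```python
# Illustrative MMDetection-style config fragment for Swin Faster R-CNN.
# Values are placeholders; the actual configuration is in Supplemental File 6.
model = dict(
    type='FasterRCNN',
    backbone=dict(
        type='SwinTransformer',
        embed_dims=96,
        depths=[2, 2, 6, 2],          # number of blocks per stage
        num_heads=[3, 6, 12, 24],
        window_size=7,
        out_indices=(0, 1, 2, 3)),    # export all four stage outputs
    neck=dict(
        type='FPN',                   # fuse multi-scale backbone features
        in_channels=[96, 192, 384, 768],
        out_channels=256,
        num_outs=5),
    rpn_head=dict(type='RPNHead', in_channels=256),
    roi_head=dict(type='StandardRoIHead'))
```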
The thyroid US images were collected from two hospitals in China from September 2008 to February 2018. The eligibility criteria for including the US images in this study were conventional US examination before biopsy and surgical treatment, diagnosis with biopsy or postsurgical pathology, and age ≥ 18 years. The exclusion criteria were images without thyroid tissues.
The 3,000 ultrasound images included 1,384 malignant and 1,616 benign nodules. The majority (90%) of the malignant nodules were papillary carcinomas, and 66% of the benign nodules were nodular goiters. Of the nodules, 25% were smaller than 5 mm, 38% were between 5 mm and 10 mm, and 37% were larger than 10 mm.
All the US images were collected using Philips IU22 and DC-80 scanners in their default thyroid examination mode. Both instruments were equipped with 5-13 MHz linear probes. For good exposure of the lower thyroid margins, all the patients were examined in the supine position with their necks extended. Both thyroid lobes and the isthmus were scanned in the longitudinal and transverse planes according to the American College of Radiology accreditation standards. All the examinations were carried out by two senior thyroid radiologists with ≥10 years of clinical experience. The thyroid diagnoses were based on the histopathological findings from fine needle aspiration biopsy or thyroid surgery.
In practice, US images are corrupted by noise, so proper preprocessing is important; common approaches include image denoising based on the wavelet transform30, compressive sensing31, and histogram equalization32. In this work, we used histogram equalization to preprocess the US images in order to enhance image quality and alleviate the degradation caused by noise.
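The histogram equalization step can be sketched in a few lines of NumPy. This is a minimal illustration of the standard algorithm (map each gray level through the normalized cumulative histogram), not the exact preprocessing code used in the protocol:

```python
import numpy as np

def equalize_histogram(image):
    """Histogram-equalize an 8-bit grayscale ultrasound image.

    Spreads the intensity histogram over the full [0, 255] range,
    which increases contrast in low-contrast US images.
    """
    hist = np.bincount(image.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0].min()
    # Map each gray level through the normalized cumulative histogram
    lut = np.clip(np.round((cdf - cdf_min) / (cdf[-1] - cdf_min) * 255),
                  0, 255).astype(np.uint8)
    return lut[image]

# Toy low-contrast image occupying only gray levels 100-109
img = np.arange(100, 110, dtype=np.uint8).repeat(10).reshape(10, 10)
eq = equalize_histogram(img)
```

After equalization, the narrow band of input intensities is stretched across the full dynamic range, which is the contrast enhancement referred to above.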
In what follows, true positive, false positive, true negative, and false negative are referred to as TP, FP, TN, and FN, respectively. We used mAP, sensitivity, and specificity to evaluate the model's nodule detection performance. mAP is a common metric in object detection. Sensitivity and specificity were calculated using equation (1) and equation (2):
Sensitivity = TP / (TP + FN)     (1)
Specificity = TN / (TN + FP)     (2)
In this paper, a TP is defined as a correctly detected nodule, i.e., one for which the intersection over union (IoU) between the prediction box and the ground truth box is >0.3 and the confidence score is >0.6. The IoU is computed using equation (3):
IoU = Area(B_pred ∩ B_gt) / Area(B_pred ∪ B_gt)     (3)
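Equations (1)-(3) and the TP criterion above can be expressed directly in code. The following is a minimal Python sketch of these metrics (the function and variable names are ours, chosen for illustration):

```python
def iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes, as in equation (3)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def sensitivity(tp, fn):
    return tp / (tp + fn)          # equation (1)

def specificity(tn, fp):
    return tn / (tn + fp)          # equation (2)

def is_true_positive(pred_box, gt_box, score,
                     iou_thr=0.3, score_thr=0.6):
    """TP criterion used in this paper: IoU > 0.3 and confidence > 0.6."""
    return iou(pred_box, gt_box) > iou_thr and score > score_thr
```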
We compared several classic object detection networks, including SSD33, YOLO-v334, CNN backbone-based Faster R-CNN27, RetinaNet35, and DETR36. YOLO-v3, SSD, and RetinaNet are single-stage detection networks, DETR is a transformer-based detection network, and Faster R-CNN is a two-stage detection network. Table 1 shows that the performance of Swin Faster R-CNN is superior to that of the other methods, reaching 0.448 mAP, which is 0.028 higher than CNN backbone-based Faster R-CNN and 0.037 higher than YOLO-v3. With Swin Faster R-CNN, 90.5% of thyroid nodules were detected automatically, which is ~3% higher than CNN backbone-based Faster R-CNN (87.1%). As shown in Figure 2, using the Swin Transformer as the backbone makes boundary localization more accurate.
Figure 1: Diagram of the Swin Faster R-CNN network architecture.
Figure 2: Detection results. Each row shows the detection results for the same image. The columns are, from left to right, the results of Swin Faster R-CNN, Faster R-CNN, YOLO-v3, SSD, RetinaNet, and DETR. The ground truth regions are marked with green rectangular boxes, and the detection results are framed by red rectangular boxes.
| Method | Backbone | mAP | Sensitivity | Specificity |
|---|---|---|---|---|
| YOLO-v3 | DarkNet | 0.411 | 0.869 | 0.877 |
| SSD | VGG16 | 0.425 | 0.841 | 0.849 |
| RetinaNet | ResNet50 | 0.382 | 0.845 | 0.841 |
| Faster R-CNN | ResNet50 | 0.420 | 0.871 | 0.864 |
| DETR | ResNet50 | 0.416 | 0.882 | 0.860 |
| Swin Faster R-CNN without FPN | Swin Transformer | 0.431 | 0.897 | 0.905 |
| Swin Faster R-CNN with FPN | Swin Transformer | 0.448 | 0.905 | 0.909 |
Table 1: Performance comparison with state-of-the-art object detection methods.
Supplemental File 1: Operating instructions for the data annotation and the software used.
Supplemental File 2: Python script used to divide the dataset into the training set and validation set, as mentioned in step 2.4.1.
Supplemental File 3: Python script used to convert the annotation files into masks, as mentioned in step 2.5.1.
Supplemental File 4: Python script used to convert the data into a COCO-format dataset, as mentioned in step 2.5.2.
Supplemental File 5: The modified Swin Transformer model file mentioned in step 3.1.
Supplemental File 6: The Swin Faster R-CNN configuration file mentioned in step 3.2.
This paper describes in detail how to perform the environment setup, data preparation, model configuration, and network training. In the environment setup phase, care must be taken to ensure that the dependent libraries are compatible with one another. Data processing is a very important step; time and effort must be spent to ensure the accuracy of the annotations. When training the model, a "ModuleNotFoundError" may be encountered; in this case, the "pip install" command should be used to install the missing library. If the loss on the validation set does not decrease or oscillates greatly, one should check the annotation file and try adjusting the learning rate and batch size to make the loss converge.
Thyroid nodule detection is very important for the treatment of thyroid cancer. The CAD system can assist doctors in the detection of nodules, avoid differences in diagnosis results caused by subjective factors, and reduce the missed detection of nodules. Compared with existing CNN-based CAD systems, the network proposed in this paper introduces the Swin Transformer to extract ultrasound image features. By capturing long-distance dependencies, Swin Faster R-CNN can extract the nodule features from ultrasound images more efficiently. The experimental results show that Swin Faster R-CNN improves the sensitivity of nodule detection by ~3% compared to CNN backbone-based Faster R-CNN. The application of this technology can greatly reduce the burden on doctors, as it can detect thyroid nodules in early ultrasound examination and guide doctors to further treatment. However, due to the large number of parameters of the Swin Transformer, the inference time of Swin Faster R-CNN is ~100 ms per image (tested on NVIDIA TITAN 24G GPU and AMD Epyc 7742 CPU). It can be challenging to meet the requirements of real-time diagnosis with Swin Faster R-CNN. In the future, we will continue to collect cases to verify the effectiveness of this method and conduct further studies on dynamic ultrasound image analysis.
The authors have nothing to disclose.
This study was supported by the National Natural Science Foundation of China (Grant No.32101188) and the General Project of Science and Technology Department of Sichuan Province (Grant No. 2021YFS0102), China.
| Name | Company/Source | Comments |
|---|---|---|
| GPU RTX3090 | Nvidia | 24 GB GPU |
| mmdetection 2.11.0 | SenseTime | https://github.com/open-mmlab/mmdetection.git |
| Python 3.8 | — | https://www.python.org |
| PyTorch 1.7.1 | — | https://pytorch.org |