Introduction

Functional limitations of the shoulder due to pain are among the most common reasons patients seek medical attention and present to the emergency department [1, 2]. While fractures of the proximal humerus and joint dislocations are the most frequent traumatic causes of shoulder pain [3, 4], calcific tendinitis, characterized by calcium deposits in the rotator cuff, is among the most common atraumatic causes [5, 6].

In addition to clinical examination, conventional radiographs of the shoulder are often obtained to identify the underlying causes of shoulder pain, especially in an emergency setting, as they allow swift decisions on further treatment [7]. However, as it may take some time before a radiologist performs the reading, many shoulder radiographs are first viewed by the requesting physician, and therapy decisions may already be made at this stage. As requesting physicians are often under time pressure, important findings might initially be overlooked, delaying the initiation of adequate therapy [8]. It would therefore be beneficial to have a supportive device, ideally with very high sensitivity, to avoid initially missed findings and subsequent treatment delays.

Convolutional neural networks (CNNs) have achieved impressive performance on computer vision tasks with both nonmedical and medical image data, in some cases even surpassing human experts [9,10,11]. The advantage of CNNs is that, once trained, they allow almost instantaneous image interpretation without being susceptible to fatigue or exhaustion. As such, they have great potential for a wide range of applications in musculoskeletal radiology [12].

The aim of the current study was therefore to train a CNN to detect the most common causes of shoulder pain in plain radiographs taken in clinical routine, which we defined as fracture, dislocation, calcification, and osteoarthritis (Table 1). Since the presence of osteosynthesis material or an endoprosthesis could affect the accuracy of a CNN, as they frequently co-occur with fractures, for example, we also aimed for the model to reliably identify them.

Table 1 Sensitivity, specificity, and accuracy (with bootstrapped 95% confidence intervals) as well as AUC for detection of the respective findings in the shoulder radiograph after a given threshold was applied to the raw predictions obtained from the model

Methods

Data preparation

A total of 3682 plain radiographs were extracted from a PACS (Picture Archiving and Communication System) serving multiple institutions, transferred to a separate server, and converted from DICOM (Digital Imaging and Communications in Medicine) format to Tagged Image File Format (TIFF). Images not containing the shoulder were manually excluded (n = 38); otherwise, no further preselection of the images was performed. The resulting dataset included 3644 radiographs acquired in different views such as anterior-posterior, Y-view, or outlet view, but also non-standardized images taken during surgical procedures or in the emergency room.
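The conversion tooling is not specified in the text; purely as an illustration of this step, a minimal Python sketch of a DICOM-to-TIFF conversion using pydicom and Pillow (directory names are hypothetical) could look like this:

```python
from pathlib import Path

import numpy as np
import pydicom
from PIL import Image

def dicom_to_tiff(dicom_path, tiff_path):
    """Read one DICOM file and write its pixel data as an 8-bit TIFF."""
    ds = pydicom.dcmread(dicom_path)
    pixels = ds.pixel_array.astype(np.float32)
    # Rescale to the full 8-bit range; real-world windowing may differ.
    pixels -= pixels.min()
    if pixels.max() > 0:
        pixels /= pixels.max()
    Image.fromarray((pixels * 255).astype(np.uint8)).save(tiff_path)

# Hypothetical directory layout
for src in Path("dicom_in").glob("*.dcm"):
    dicom_to_tiff(src, Path("tiff_out") / (src.stem + ".tiff"))
```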

Distribution of findings

A total of 3644 radiographs from 2442 patients were included in this analysis. The distribution of findings in the whole dataset was strongly skewed. The most common finding was fracture in 27.1% of cases (n = 989), followed by osteoarthritis (13.1% of cases, n = 479), osteosynthesis (13.1% of cases, n = 477), endoprosthesis (12.7% of cases, n = 463), calcification (7.7% of cases, n = 279), and dislocation (4.0% of cases, n = 147). 26.8% of all radiographs (n = 978) showed no pathological finding. An overview of label distribution and the data selection process is provided in Fig. 1. The distribution of findings is summarized in Table 2.

Fig. 1
figure 1

Flowchart of the data selection process and the composition of the final datasets as well as an overview of the frequencies of different findings

Table 2 Summary of the distribution of findings in the training, validation, and test datasets as well as in the whole dataset. The total number of labels differs from the number of radiographs because multiple labels per image were allowed

Data labeling and splitting

From this extracted collection of images, three datasets were generated for training, validation, and testing of the model.

First, all images were labeled by a radiologist. This allowed us to group the images by label and to draw randomly from the different groups when curating the test dataset. We chose this method to correct for rare labels in the test dataset, as a very low prevalence may negatively affect the calculation of accuracy. The curated test dataset consisted of n = 269 radiographs, with only one radiograph included per patient. All 269 images were labeled again by a second radiologist who was blinded to the previous labels. In cases where the first and second radiologists disagreed, a third radiologist served as a reviewer.

Then, the remaining 3375 radiographs were used to create the training and validation dataset. We used a random split of 80–20, resulting in a training dataset with n = 2700 radiographs and a validation dataset with n = 675 images. Figure 1 shows a flowchart of the data extraction process.

All annotations were performed using the open source program labelImg (https://pypi.org/project/labelImg/). Annotation included the following labels: fracture, dislocation, calcification, osteoarthritis, osteosynthesis, and total endoprosthesis. Multiple labels per image were allowed. Annotations were then exported to XML (Extensible Markup Language) and converted to one-hot encoded annotations in tabular format using the "R" programming language and the "tidyverse" library [13].
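The conversion was performed in R with the tidyverse; as a rough illustration of the same step in Python, a minimal sketch that parses labelImg's Pascal VOC XML files and writes a one-hot label table (folder and file names are hypothetical) might look like this:

```python
import xml.etree.ElementTree as ET
from pathlib import Path

import pandas as pd

LABELS = ["fracture", "dislocation", "calcification",
          "osteoarthritis", "osteosynthesis", "endoprosthesis"]

def labels_from_xml(xml_path):
    """Collect all <object><name> entries from one labelImg (Pascal VOC) file."""
    root = ET.parse(xml_path).getroot()
    return {obj.findtext("name") for obj in root.iter("object")}

rows = []
for xml_file in Path("annotations").glob("*.xml"):        # hypothetical folder
    present = labels_from_xml(xml_file)
    rows.append({"image": xml_file.stem,
                 **{lab: int(lab in present) for lab in LABELS}})

one_hot = pd.DataFrame(rows)                               # one row per radiograph
one_hot.to_csv("labels_one_hot.csv", index=False)
```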

Model training

We employed a convolutional neural network of the ResNet architecture with 50 layers (ResNet-50), pretrained on the ImageNet dataset [14]. For implementation, the Python programming language (https://www.python.org, version 3.8) with the PyTorch (https://pytorch.org) and FastAI (https://fast.ai) libraries was used on a workstation running Ubuntu 18.04 with two Nvidia GeForce RTX 2080 Ti graphics cards (11 GB of memory each).
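The authors' full training code is linked below; as a simplified, self-contained sketch of such a multi-label setup in fastai (file paths, column names, and augmentation parameters are assumptions, not the authors' values):

```python
import pandas as pd
from fastai.vision.all import *

LABELS = ["fracture", "dislocation", "calcification",
          "osteoarthritis", "osteosynthesis", "endoprosthesis"]
df = pd.read_csv("labels_one_hot.csv")           # hypothetical one-hot label table

def get_labels(row):
    # List of findings per radiograph; empty for images without pathology
    return [lab for lab in LABELS if row[lab] == 1]

def get_dls(size, bs):
    """Build dataloaders at a given image size and batch size (used for progressive resizing)."""
    dblock = DataBlock(
        blocks=(ImageBlock, MultiCategoryBlock),                 # multi-label targets
        get_x=ColReader("image", pref="images/", suff=".tiff"),  # hypothetical image paths
        get_y=get_labels,
        splitter=RandomSplitter(valid_pct=0.2, seed=42),
        item_tfms=Resize(size),
        batch_tfms=[*aug_transforms(max_rotate=10.0, max_warp=0.2),   # rotation, flipping, distortion
                    Normalize.from_stats(*imagenet_stats)],
    )
    return dblock.dataloaders(df, bs=bs)

dls = get_dls(64, 256)                           # first stage of progressive resizing
# ImageNet-pretrained ResNet-50; fastai attaches a fresh multi-label (sigmoid) head
learn = cnn_learner(dls, resnet50, metrics=accuracy_multi)
```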

Prior to any training, the images were scaled down to a width of 1024 pixels (px), maintaining aspect ratio. Pixel values were normalized, and the data were augmented through random rotations, mirroring, and distortion of the images. We used a progressive resizing approach, starting at an image size of 64 × 64 px and a batch size of 256 images for a total of ten epochs with discriminative learning rates (meaning the learning rate was lower for the first layers of the network than for the last layers). For the first five epochs, only the classification head of the model was trained while all other weights were kept frozen; for the last five epochs, all weights of the model were updated. After each training stage, image resolution was successively increased to 128 × 128, 256 × 256, 512 × 512, and finally 768 × 768 px. Batch sizes were 256 at 64 px, 128 at 128 px, 64 at 256 px, 32 at 512 px, and 16 at 768 px.
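Continuing the sketch above, one possible way to express this schedule in fastai (learning rates are illustrative; the final 768 px stage with checkpointing is shown in the next sketch below):

```python
# Progressive resizing over the first four resolutions; at each stage the head is
# trained for five epochs with the backbone frozen, then all weights for five more.
# slice() yields discriminative learning rates (lower for early layers than the head).
for size, bs in [(64, 256), (128, 128), (256, 64), (512, 32)]:
    learn.dls = get_dls(size, bs)
    learn.freeze()
    learn.fit_one_cycle(5, lr_max=1e-3)
    learn.unfreeze()
    learn.fit_one_cycle(5, lr_max=slice(1e-5, 1e-3))
```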

For the last ten epochs of training at 768 px image size, checkpoints were saved after each epoch. Finally, each checkpoint was evaluated on the test dataset, and the predictions of the ten checkpoints were pooled to calculate accuracy.
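A sketch of this final stage and of the pooling step, again building on the code above (for brevity, all weights stay unfrozen here; the test label table is hypothetical):

```python
import torch

# Final stage at 768 px, saving a checkpoint after every epoch
learn.dls = get_dls(768, 16)
learn.unfreeze()
learn.fit_one_cycle(10, lr_max=slice(1e-5, 1e-3),
                    cbs=SaveModelCallback(every_epoch=True, fname="ckpt"))

# Pool the sigmoid outputs of the ten checkpoints on the held-out test set
test_df = pd.read_csv("test_labels.csv")             # hypothetical test table
test_dl = learn.dls.test_dl(test_df)
preds = []
for epoch in range(10):
    learn.load(f"ckpt_{epoch}")                      # files written by SaveModelCallback
    p, _ = learn.get_preds(dl=test_dl)
    preds.append(p)
pooled = torch.stack(preds).mean(dim=0)              # per-class probabilities per image
```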

For the purpose of illustrating predictions from the model, exemplary Gradient-weighted Class Activation Mappings (Grad-CAMs) [15] were calculated. Example code for model training is available at https://gist.github.com/kbressem/7e6fb4ec270cbacde07a379562a18a4b.
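A generic PyTorch sketch of the Grad-CAM computation (a forward hook on the last convolutional block plus gradient-weighted channel averaging) is shown below; the fastai-specific indexing in the final comment is an assumption:

```python
import torch.nn.functional as F

def grad_cam(model, x, class_idx, target_layer):
    """Grad-CAM heatmap for one class; x is a preprocessed image tensor of shape (1, 3, H, W)."""
    store = {}

    def fwd_hook(module, inp, out):
        store["acts"] = out.detach()                       # feature maps of the target layer
        out.register_hook(lambda g: store.update({"grads": g.detach()}))

    handle = target_layer.register_forward_hook(fwd_hook)
    try:
        model.eval()
        logits = model(x)
        model.zero_grad()
        logits[0, class_idx].backward()                    # gradient of the chosen class score
    finally:
        handle.remove()

    weights = store["grads"].mean(dim=(2, 3), keepdim=True)           # pooled gradients per channel
    cam = F.relu((weights * store["acts"]).sum(dim=1, keepdim=True))  # weighted channel sum
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()            # normalized heatmap of shape (H, W)

# Example (fastai): the convolutional body of a cnn_learner model is learn.model[0]
# heatmap = grad_cam(learn.model, img_tensor, class_idx=0, target_layer=learn.model[0])
```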

Statistical analysis

Predictions on the test dataset were exported as comma-separated values (CSV) and then further analyzed using the “R” statistical language and the “tidyverse,” “irr,” and “ROCR” libraries [16,17,18,19]. Model performance was evaluated using receiver operating characteristic (ROC) curves and calculating the area under the curve (AUC). Interrater agreement was evaluated using Cohen’s kappa. 95% confidence intervals (95% CI) were calculated through bootstrapping with 1000 iterations.
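The analysis was performed in R; as an equivalent illustration in Python with scikit-learn and NumPy (array names are hypothetical), the AUC, its bootstrapped confidence interval, and Cohen's kappa could be computed as follows:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, cohen_kappa_score

rng = np.random.default_rng(42)

def bootstrap_auc_ci(y_true, y_score, n_boot=1000, alpha=0.05):
    """Bootstrapped (1 - alpha) confidence interval for the AUC."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    aucs, n = [], len(y_true)
    while len(aucs) < n_boot:
        idx = rng.integers(0, n, n)                  # resample with replacement
        if len(np.unique(y_true[idx])) < 2:          # need both classes in the resample
            continue
        aucs.append(roc_auc_score(y_true[idx], y_score[idx]))
    return np.quantile(aucs, [alpha / 2, 1 - alpha / 2])

# Hypothetical arrays for one finding (e.g., fracture):
# auc = roc_auc_score(y_fracture, p_fracture)
# ci_low, ci_high = bootstrap_auc_ci(y_fracture, p_fracture)
# kappa = cohen_kappa_score(reader1_fracture, reader2_fracture)   # interrater agreement
```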

Results

Model accuracy

The model showed the highest accuracy for the detection of osteoarthritis, with an area under the receiver operating characteristic curve (AUC) of 0.945 (95% CI 0.917–0.972). For dislocation, fracture, and calcification, accuracy was lower, with AUCs of 0.896 (95% CI 0.836–0.956), 0.871 (95% CI 0.817–0.926), and 0.80 (95% CI 0.717–0.883), respectively. Receiver operating characteristic curves and AUCs are presented in Fig. 2. The presence of a shoulder endoprosthesis or osteosynthesis material was detected with near-perfect accuracy (AUC of 1.00 (95% CI 0.99–1.00) and 0.998 (95% CI 0.993–1.00), respectively).

Fig. 2
figure 2

Receiver operating characteristic curves with the respective areas under the curve (AUCs) for the detection of the six different findings. For osteoarthritis, the model achieved the best performance with an AUC of 0.945, while detection of calcifications posed a greater challenge for the model, as suggested by an AUC of 0.80

Sensitivity, specificity, and accuracy were calculated by choosing a prediction threshold that maximized the sum of sensitivity and specificity. Sensitivity/specificity/accuracy were 0.90/0.86/0.87 for osteoarthritis, 0.75/0.86/0.84 for fractures, 0.95/0.65/0.67 for joint dislocation, and 0.60/0.89/0.84 for periarticular calcification. Sensitivity, specificity, accuracy, and AUC results are also compiled in Table 1. Examples of Grad-CAMs are shown in Figs. 3 and 4.
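This threshold rule corresponds to maximizing Youden's J statistic (sensitivity + specificity − 1); a minimal sketch with hypothetical arrays:

```python
import numpy as np
from sklearn.metrics import roc_curve

def youden_threshold(y_true, y_score):
    """Return the threshold that maximizes sensitivity + specificity (i.e., tpr - fpr)."""
    fpr, tpr, thresholds = roc_curve(y_true, y_score)
    return thresholds[np.argmax(tpr - fpr)]

# threshold = youden_threshold(y_osteoarthritis, p_osteoarthritis)
```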

Fig. 3
figure 3

Examples of correct predictions highlighted by Gradient-weighted Class Activation Mappings (Grad-CAMs) for radiographs with only one finding. Images A, C, E, G, I, and L are the original images showing a fracture (A), osteoarthritis (C), an endoprosthesis (E), dislocation (G), osteosynthesis material (I), and calcification (L). Images B, D, F, H, K, and M are the corresponding Grad-CAMs. Grad-CAMs highlight image areas that were important to the model’s decision. They can be used to verify that a pathology was correctly identified and that the decision was not based on image noise (e.g., due to overfitting) or a different finding (e.g., a model that should detect a fracture but instead detects only osteosynthesis)

Fig. 4
figure 4

Examples of correct predictions highlighted by Gradient-weighted Class Activation Mappings (Grad-CAMs) for multilabel classifications. A and D are the original images, showing a fracture of the humeral head and shaft with osteosynthesis material (A) and a dislocated shoulder endoprosthesis (D). In B, the model showed strong activation in image regions of the fracture, and in C strong activation in image regions depicting osteosynthesis material, which led to the model's final classification result. E shows activation of the model in the area where the humeral head should be, leading to the recognition of the dislocation, and F shows the activation for detection of the endoprosthesis

Interrater reliability

Interrater reliability between the two radiologists on the test dataset was moderate, and 78 of 269 radiographs required adjudication by a third reader. For fracture, a kappa value of 0.61 (95% CI 0.50–0.73) was achieved, which was the second highest agreement. For calcification, dislocation, and osteoarthritis, agreement was even lower, with kappa values of k = 0.59 (95% CI 0.45–0.74), k = 0.53 (95% CI 0.32–0.74), and k = 0.47 (95% CI 0.33–0.60), respectively.

Discussion

The aim of the present study was to train a CNN to detect the most common causes of acute or chronic shoulder pain in plain radiographs. The proposed CNN shows good overall accuracy, suggesting that such networks have the potential to provide additional support and reassurance for the physician on duty.

Despite a rapid increase in published studies on medical image classification, so far only a few investigators have focused on shoulder radiographs and causes of shoulder pain. To the best of our knowledge, there is only one comparable study by Chung et al., in which the authors proposed a CNN explicitly for the detection of proximal humeral fractures [20]. In their approach, Chung et al. focused on the classification of different fracture types, achieving high accuracies, as measured by the AUC. However, they only included shoulder radiographs acquired in strict anterior-posterior patient position and thus did not cover the full range of variability encountered in clinical practice. We are convinced that the inclusion of images acquired under various non-standardized conditions results in a more robust method that is less susceptible to errors in everyday use (Fig. 5).

Fig. 5
figure 5

Examples of incorrect predictions. The images on the left were falsely classified as showing calcification, the middle image shows an incorrectly predicted osteoarthritis, and the right image an incorrectly predicted fracture. All are atypical images, which may have contributed to the incorrect predictions. For example, in the right image, there is some activation at the sharp edges of the image, which may have led the algorithm to predict a fracture. In the middle image, larger portions of the joint surface are not visible, which is a possible cause of the incorrectly predicted osteoarthritis

Recently, two separate studies have demonstrated the efficiency of CNNs in detecting and even differentiating shoulder implants by the manufacturer [21, 22]. However, these studies relied on frontal view radiographs, whereas our data included a broad variety of views to allow detection of surgical implants independent of image settings used for acquisition. Therefore, our model may prove to be more robust on clinical data, which include radiographs of suboptimal image quality, as pain may preclude optimal patient positioning.

Dislocated shoulder joints were reliably detected by the model. From our point of view, this is essential for a real-life approach since shoulder joint dislocation is the most common type of dislocation in humans [4].

Although still detected with high accuracy, periarticular calcifications posed the greatest challenge for our model. Intuitively, this is not surprising, especially since small calcium deposits can be partially obscured by bone spurs in osteoarthritis or can be difficult to distinguish from small bone fragments. To our knowledge, this is the first study to also focus on periarticular calcifications.

There are several studies in the literature that have trained neural networks to recognize hip and knee joint osteoarthritis [23,24,25]. None of these studies included the shoulder joint. Although osteoarthritis of the shoulder joint is less common in the general population, it is a frequent and therefore relevant trauma sequel in the post-injury patient population [4, 26]. Therefore, detecting this finding with high accuracy, as our model does, is important for any future clinical implementation.

We used a multilabel classification model that could predict all classes for each image instead of using multiple binary classifiers. This is similar to the approach used by von Schacky et al. [27], who developed a multiclass classification model for osteoarthritis detection. Other studies used network architectures originally developed for object detection. Here, classification is combined with highlighting of relevant areas using bounding boxes. This approach was successfully applied by Krogue et al. [28], who used a DenseNet to identify and subclassify hip fractures.

Overall, there are many practical reasons to continuously improve neural networks, which have the potential to become valuable supportive devices for medical staff in the future. The most important help these networks could provide is error reduction, especially in situations of increased workload, and worklist optimization, for example by prioritizing critical findings or serving as a second reader. A different, simplified approach toward this goal was taken by Rajpurkar et al. [29]. In their study, the authors trained a neural network for the binary task of differentiating between normal and abnormal images from the publicly available MURA dataset (a collection of conventional radiographs of the upper extremity). Interestingly, they found that their model performed worse than its human counterparts on shoulder images, while it achieved comparable results on images of the wrist, hand, and fingers. We conclude that plain radiographs of the shoulder may pose a considerable challenge to CNNs, especially when they are expected to detect more subtle findings such as periarticular calcification. Overall, however, our results are promising and strongly encourage further trials ultimately leading to clinical implementation.

Limitations

Our study has several limitations: no additional clinical information about the study population was retrieved, posing a risk of bias through possible imbalances in patient characteristics (e.g., age, sex, medical preconditions). Additionally, only a limited number of labels was assessed, and these did not include any grading of pathological findings such as the severity of osteoarthritis or the specific type of fracture; in its current state, this would likely hinder clinical implementation. Furthermore, the ground truth was based solely on the reading of the radiographs (rather than more robust references such as follow-up cross-sectional imaging), and labeling of the training and validation data was performed by a single radiologist; both factors carry a risk of bias and mislabeling. Also, computational capacity limited the applied image resolution, potentially impairing model performance for subtle findings such as small calcifications or bone fragments.

Conclusion

The rapid evolution of CNNs in medical imaging is paving the way for the development of assistive devices that may eventually help medical personnel by providing a second reading instance. We present a robust CNN with the ability to detect common causes of shoulder pain on plain radiographs, even when faced with impaired image quality as often seen in clinical practice, especially in emergency settings.