High-throughput image labeling and quality control for clinical trials using machine learning

Robert J. Harris, Pangyu Teng, Mahesh Nagarajan, Liza Shrestha, Xiang Lu, Bharath Ramakrishna, Peiyun Lu, Theo Sanford, Heather Clem, Megan McRoberts, Jonathan Goldin, Matt Brown


Background: Manually importing and analyzing image data can be time-consuming, prone to human error, and costly for large clinical trial datasets. This can lead to delays in quality control (QC) feedback to imaging sites and in obtaining data analysis results. Herein we describe the creation and application of a high-throughput review process for import, classification, labeling and QC of large multimodal clinical trial image datasets.

Methods: Automated methods were used to remove patient-identifying information, extract image header data, and filter image data for usability. A convolutional neural network was applied to estimate the anatomy shown in CT images. An internal score was assigned to each image series to identify the optimal series for labeling and reading of each anatomical region. Image QC reports were automatically generated for all patients.
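
The steps described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the `Series` structure, the list of identifying fields, the usability rule, and the scoring formula are all hypothetical assumptions chosen only to show how header filtering and per-region series selection might fit together. The anatomy label is taken as given here, standing in for the classifier's output.

```python
from dataclasses import dataclass

# Hypothetical set of header fields treated as patient-identifying in this sketch.
IDENTIFYING_FIELDS = {"PatientName", "PatientID", "PatientBirthDate"}


@dataclass
class Series:
    series_id: str
    header: dict   # extracted header data (field name -> value)
    anatomy: str   # anatomical region, e.g. as estimated by a classifier
    n_images: int


def deidentify(header: dict) -> dict:
    """Return a copy of the header without the identifying fields."""
    return {k: v for k, v in header.items() if k not in IDENTIFYING_FIELDS}


def is_usable(series: Series, min_images: int = 20) -> bool:
    """Assumed usability rule: enough images and a recognized modality."""
    return (
        series.n_images >= min_images
        and series.header.get("Modality") in {"CT", "MR"}
    )


def score(series: Series) -> float:
    """Assumed internal score: more images and thinner slices score higher."""
    thickness = float(series.header.get("SliceThickness", 10.0))
    return series.n_images / thickness


def select_best_per_region(all_series: list[Series]) -> dict[str, Series]:
    """Keep the highest-scoring usable series for each anatomical region."""
    best: dict[str, Series] = {}
    for s in all_series:
        if not is_usable(s):
            continue
        current = best.get(s.anatomy)
        if current is None or score(s) > score(current):
            best[s.anatomy] = s
    return best
```

Under these assumptions, a thin-slice chest series with many images would be preferred over a thick-slice one, and a series with too few images would be excluded before scoring.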

Results: Across the combined studies, 204,492 series were received, of which 27,841 were identified as usable and 13,415 were labeled. Using this high-throughput method, the total work-hours required per time point were reduced by approximately a factor of ten compared with traditional review and labeling methods. Our anatomic classification system identified 95.7% of image series correctly; the remaining series were manually corrected before labeling and analysis.
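
To put the reported series counts in proportion, the following short calculation derives the fraction of received series that were usable and labeled. Only the three counts come from the text; the variable names are illustrative. The 95.7% classification accuracy is not converted to a count here because the abstract does not state how many series it was measured on.

```python
# Series counts as reported in the Results.
received = 204_492
usable = 27_841
labeled = 13_415

usable_of_received = usable / received
labeled_of_usable = labeled / usable
labeled_of_received = labeled / received

print(f"usable / received:  {usable_of_received:.1%}")   # 13.6%
print(f"labeled / usable:   {labeled_of_usable:.1%}")    # 48.2%
print(f"labeled / received: {labeled_of_received:.1%}")  # 6.6%
```

Roughly one in seven received series passed the usability filter, and about half of those were labeled.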

Conclusions: A high-throughput image analysis pipeline was implemented and applied to a large combined dataset of clinical trial image series. This pipeline can be applied to other studies and modalities for fast image data characterization, labeling and QC.


Keywords: Image intake, High-throughput, Machine learning, DICOM, Data management



