Purpose: Patient privacy rules require removal of Protected Health Information (PHI) before sharing images publicly. Manual de-identification is no longer scalable due to the rapid increase in imaging data volume. Our goal was to configure and test the efficacy of an automated medical image de-identification (MIDI) pipeline using cloud services.
Materials and Methods: Training and test datasets for validating image de-identification were created by The Cancer Imaging Archive (TCIA) by placing synthetic PHI in DICOM headers and image pixel data. The training dataset included 1,836 images from 21 patients, and the test dataset included 14,372 images from 93 patients. Answer keys based on TCIA de-identification conventions were made available for both datasets. The MIDI pipeline was configured using the Google Cloud Platform Healthcare API, which is based on Google’s Data Loss Prevention API for sensitive information detection. In addition to de-identification accuracy, performance was measured by monitoring throughput.
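As a rough illustration of how such a pipeline might be configured, the sketch below builds the JSON body for a Healthcare API dataset de-identification request without sending it. The project, location, and dataset names are hypothetical placeholders, and the tag filter profile and text redaction mode are assumptions chosen for illustration; the configuration actually used in this study is not stated here.

```python
import json


def build_deidentify_request(project, location, destination_dataset):
    """Build the body of a dataset de-identify request (nothing is sent)."""
    destination = (
        f"projects/{project}/locations/{location}/datasets/{destination_dataset}"
    )
    return {
        "destinationDataset": destination,
        "config": {
            # Assumed setting: inspect and transform the contents of DICOM tags.
            "dicom": {"filterProfile": "DEIDENTIFY_TAG_CONTENTS"},
            # Assumed setting: redact only burned-in text judged to be sensitive.
            "image": {"textRedactionMode": "REDACT_SENSITIVE_TEXT"},
        },
    }


# Hypothetical identifiers, for illustration only.
request_body = build_deidentify_request(
    "my-project", "us-central1", "midi-deidentified"
)
print(json.dumps(request_body, indent=2))
```

In practice, a body like this would be posted to the source dataset's de-identify endpoint with appropriate credentials; that step is omitted here.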
Results: For DICOM header data elements, 99.8% of expected actions were performed correctly. The two incorrect actions were one false positive (information removed incorrectly) and one false negative (PHI not removed). For the image pixel data, one false positive was noted and there were no false negatives: all sensitive information was correctly removed from all image pixel data. Throughput averaged 58.4 images per second.
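The kind of scoring reported above can be sketched as a comparison between an answer key and the actions a pipeline actually performed. This is a minimal sketch under simplifying assumptions, not the evaluation code used in the study: the element identifiers are hypothetical, actions are reduced to "keep" or "remove", and the toy data are invented purely to exercise the function.

```python
from collections import Counter


def score_actions(expected, observed):
    """Compare observed de-identification actions against an answer key.

    Both arguments map an element identifier to "keep" or "remove".
    A false positive is an element removed when it should have been kept;
    a false negative is an element kept when it should have been removed.
    """
    counts = Counter()
    for element, want in expected.items():
        got = observed[element]
        if got == want:
            counts["correct"] += 1
        elif want == "keep":
            counts["false_positive"] += 1
        else:
            counts["false_negative"] += 1
    total = len(expected)
    return {
        "accuracy": counts["correct"] / total if total else 0.0,
        "false_positives": counts["false_positive"],
        "false_negatives": counts["false_negative"],
    }


# Toy answer key and toy pipeline output (hypothetical element names).
answer_key = {
    "PatientName": "remove",
    "PatientID": "remove",
    "Modality": "keep",
    "StudyDescription": "keep",
    "InstitutionName": "remove",
}
pipeline_output = {
    "PatientName": "remove",
    "PatientID": "remove",
    "Modality": "keep",
    "StudyDescription": "remove",  # removed although it should be kept
    "InstitutionName": "keep",     # kept although it should be removed
}

result = score_actions(answer_key, pipeline_output)
print(result)
```

Counting false positives and false negatives separately matters here because the two errors carry different costs: a false negative leaks PHI, whereas a false positive only discards useful information.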
Conclusion: The current implementation of the MIDI pipeline holds great promise for automated de-identification at scale. However, verification by a human expert is currently recommended. Optimization of the underlying algorithm could further increase accuracy.