In an article published in the journal Patterns, scientists at the Icahn School of Medicine at Mount Sinai described the development of a new, automated, artificial intelligence-based algorithm that can learn to read patient data from electronic health records. In a side-by-side comparison, they demonstrated that their Phe2vec (FEE-to-vek) method identified patients with certain diseases as accurately as the traditional, “gold-standard” method, which requires much more manual labor to develop and perform.
“The amount and types of data electronically stored in a patient’s medical record continue to explode. Untangling this complex web of data can be extremely time-consuming, stifling progress in clinical research,” said Benjamin S. Glicksberg, Ph.D., Assistant Professor of Genetics and Genomic Sciences, member of the Hasso Plattner Institute for Digital Health at Mount Sinai (HPIMS), and senior author of the study. “In this study, we developed a new machine learning method for mining data from electronic health records that is faster and less labor-intensive than the industry standard. We hope that this will be a valuable tool that will facilitate further, and less biased, research in clinical informatics.”
The study was led by Jessica K. De Freitas, a graduate student in Dr. Glicksberg’s lab.
Currently, scientists mine medical records for new information using a set of established computer programs, or algorithms. The development and storage of these algorithms is managed by a system called the Phenotype Knowledgebase (PheKB). Although the system is extremely effective at correctly identifying a patient’s diagnosis, developing an algorithm can be a time-consuming and inflexible process.
To study a disease, researchers must first sift through reams of medical records in search of data points, such as specific lab tests or prescriptions, that are uniquely associated with the disease. They then write the algorithm that instructs the computer to look for patients who have those disease-specific pieces of data, which together are known as a “phenotype.” Next, researchers must manually double-check the list of patients identified by the computer. Every time researchers want to study a new disease, they must restart the process from the beginning.
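To make the manual process concrete, here is a minimal Python sketch of what a hand-written phenotyping rule might look like. Everything in it is an assumption made for illustration: the concept codes, the patient records, and the rule itself are invented, and none of it is taken from PheKB or from the study.

```python
# Hypothetical patient records: each patient maps to the set of
# concepts (diagnoses, prescriptions, lab tests) found in their chart.
RECORDS = {
    "patient_a": {"dx:sickle_cell_anemia", "rx:hydroxyurea"},
    "patient_b": {"dx:sickle_cell_anemia"},
    "patient_c": {"lab:glucose", "rx:hydroxyurea"},
}

# Hypothetical supporting evidence chosen by an expert for this one disease.
SUPPORTING_EVIDENCE = {"rx:hydroxyurea", "lab:hemoglobin_electrophoresis"}


def matches_rule(concepts):
    """Return True if a chart has the diagnosis code plus supporting evidence."""
    has_diagnosis = "dx:sickle_cell_anemia" in concepts
    has_support = bool(concepts & SUPPORTING_EVIDENCE)
    return has_diagnosis and has_support


def find_candidates(records):
    """Return the sorted IDs of patients whose charts satisfy the rule."""
    return sorted(pid for pid, concepts in records.items() if matches_rule(concepts))


if __name__ == "__main__":
    # These candidates would still need a manual chart review.
    print(find_candidates(RECORDS))
```

The rule is hard-coded for a single disease, which is why studying a different disease means writing and validating a new rule from scratch.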
In this study, the researchers took a different approach, in which the computer learns how to identify disease phenotypes on its own, saving the researchers time and effort. This new Phe2vec method was based on previous research by the team.
“Previously, we demonstrated that unsupervised machine learning could be a highly efficient and effective strategy for mining electronic health records,” said Riccardo Miotto, Ph.D., a former Assistant Professor at the HPIMS and a senior author of the study. “The potential benefit of our approach is that it learns disease representations from the data itself. As a result, the machine does much of the work that experts would normally do to define the best combination of data elements from health records to describe a specific disease.”
Essentially, a computer was programmed to sift through millions of electronic health records and learn how to connect data to diseases. This programming was based on “embedding” algorithms developed by other researchers, such as linguists, to study word networks in different languages. Word2vec, one of the algorithms, was particularly effective. The computer was then instructed to apply what it had learned to identify the diagnoses of nearly 2 million patients whose data was stored in the Mount Sinai Health System.
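The general idea behind an embedding-based approach can be illustrated with a small Python sketch. It is not the Phe2vec implementation. The three-dimensional vectors below are hand-picked toy values rather than embeddings learned from real records, and the concept codes and function names are hypothetical. The sketch only shows how concepts that sit close to a disease in an embedding space could be grouped into a phenotype and then matched against a patient's record.

```python
import math

# Hypothetical toy embeddings. In a real system, vectors like these would be
# learned from millions of records by an algorithm such as Word2vec.
EMBEDDINGS = {
    "dx:multiple_sclerosis": [0.9, 0.1, 0.0],
    "rx:interferon_beta": [0.8, 0.2, 0.1],
    "lab:csf_oligoclonal_bands": [0.7, 0.3, 0.0],
    "dx:sickle_cell_anemia": [0.0, 0.9, 0.2],
    "rx:hydroxyurea": [0.1, 0.8, 0.3],
    "lab:glucose": [0.3, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_concepts(seed, k):
    """Return the k concepts most similar to the seed concept, best first."""
    seed_vec = EMBEDDINGS[seed]
    scored = [(c, cosine(seed_vec, v)) for c, v in EMBEDDINGS.items() if c != seed]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [concept for concept, _ in scored[:k]]


def phenotype(seed, k=2):
    """Define a phenotype as the seed concept plus its k nearest neighbors."""
    return {seed, *nearest_concepts(seed, k)}


def patient_score(record, concepts):
    """Fraction of a patient's recorded concepts that belong to the phenotype."""
    if not record:
        return 0.0
    return sum(1 for c in record if c in concepts) / len(record)


if __name__ == "__main__":
    ms = phenotype("dx:multiple_sclerosis")
    print(sorted(ms))
    print(patient_score(["rx:interferon_beta", "lab:glucose"], ms))
```

In this sketch no expert writes a disease-specific rule: the phenotype falls out of the geometry of the embedding space, so the same code could be pointed at a different seed concept.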
Finally, the researchers compared the effectiveness of the new and old systems. For nine out of ten diseases tested, they found that the new Phe2vec system was as effective as, or slightly better than, the gold-standard phenotyping process at correctly identifying diagnoses from electronic health records. The diseases tested included dementia, multiple sclerosis, and sickle cell anemia.
“Overall, our findings are encouraging and suggest that Phe2vec is a promising technique for large-scale disease phenotyping in electronic health record data,” said Dr. Glicksberg. “With additional testing and refinement, we hope that it can be used to automate many of the initial steps of clinical informatics research, allowing scientists to focus their efforts on downstream analyses such as predictive modeling.”
This study was supported by the Hasso Plattner Foundation and the Alzheimer’s Drug Discovery Foundation, and by a courtesy donation of graphics processing units from NVIDIA Corporation.