Biobanks, which are databases containing genetic and health information, have been instrumental in helping researchers explore diseases and study the interplay between genetics and the environment.
These investigations have yielded valuable insights into questions ranging from the relationship between diet and disease to the link between household size and COVID-19 severity, which can guide researchers, clinicians, and patients.
However, the usefulness of biobanks depends on the quantity and quality of the data they contain. Incomplete patient datasets are a common problem, so researchers may not have access to all the relevant information.
For example, if a patient has been treated for type II diabetes but has never been treated in a hospital setting, the term “type II diabetes” may be missing from their data.
This missing information poses a significant barrier for researchers conducting disease studies and looking for patterns that could lead to breakthroughs.
To address this problem, Lu Yang, a PhD student at Stanford, collaborated with Sheng Wang, a recent postdoctoral student, and Russ Altman, an associate director and professor at Stanford HAI, to create a machine learning framework for disease recognition called POPDx.
The framework can predict a comprehensive set of diagnosis codes, also known as phenotype codes, for all the patients in the UK Biobank, which holds the data of half a million participants from the UK, including patients with rare diseases.
According to Yang, the POPDx model “produces probabilities that a person might have certain diseases or phenotype codes” and outperforms existing models in predicting common and rare diseases, including diseases that aren’t present in the training data.
This is a significant finding, showing that the model can work with sparse data and help patients with uncommon diseases.
Yang used a broad range of patient data, from demographic information and patient questionnaires to medical exams and electronic health records (EHRs), to train the POPDx model. The model extracts information from biological data and lab tests, providing a complete profile of the UK Biobank participants.
The model looks for relationships between a patient’s data and disease information, using natural language processing and the Human Disease Ontology to make probabilistic predictions.
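To make the idea concrete, here is a minimal sketch of how a model could score a patient against diseases that are represented by embeddings rather than by their own training examples. It is an illustration under stated assumptions, not the POPDx implementation: the function `score_diseases`, the random `projection` matrix, the toy `disease_embeddings`, and the disease labels are all hypothetical stand-ins. In a real system the projection would be learned and the embeddings would come from disease text and ontology structure.

```python
import numpy as np


def sigmoid(x):
    """Map real-valued scores to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))


def score_diseases(patient_features, projection, disease_embeddings):
    """Return one probability per disease for a single patient.

    Hypothetical helper: the patient vector is projected into the same
    space as the disease embeddings, then compared with each disease
    by a dot product.
    """
    patient_in_disease_space = patient_features @ projection  # (embed_dim,)
    logits = disease_embeddings @ patient_in_disease_space    # (n_diseases,)
    return sigmoid(logits)


rng = np.random.default_rng(0)
n_features, embed_dim = 6, 4

# Stand-in for a learned projection (random here, for illustration only).
projection = rng.normal(size=(n_features, embed_dim))

# Stand-in for embeddings derived from disease names and ontology links.
# The last disease is treated as absent from training data; it can still
# be scored because it has an embedding.
disease_names = ["common disease", "rare disease", "unseen disease"]
disease_embeddings = rng.normal(size=(len(disease_names), embed_dim))

patient = rng.normal(size=n_features)
probabilities = score_diseases(patient, projection, disease_embeddings)

for name, p in zip(disease_names, probabilities):
    print(f"{name}: {p:.3f}")
```

The design point is that any disease with a text or ontology embedding receives a score, which is one plausible way a model can make predictions for diseases it never saw during training.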
One of the biggest challenges for the model is diseases for which little data is available. However, POPDx has shown solid performance with limited or no data, obviating the need for huge datasets. Yang improved the model’s precision for neglected and rare diseases by 218% and 151%, respectively.
This has a practical payoff for a clinical team that needs to identify patients with a low-prevalence disease. As the team puts it, “our model, on average, will increase the possibility of finding these positive cases. Before, they would have to go through a huge number of patients in the Biobank, but now they can screen a much lower number to find possible cases.”
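A short calculation shows what a precision gain of this size could mean for screening. This is a rough sketch, not a result from the study: the baseline precision of 1% is a hypothetical value chosen for illustration, and the 218% figure is read here as a relative increase, which is an assumption about how the improvement was measured. The helper names are also hypothetical.

```python
def improved_precision(baseline, percent_improvement):
    """Apply a relative improvement (in percent) to a baseline precision."""
    return baseline * (1.0 + percent_improvement / 100.0)


def records_per_true_case(precision):
    """Expected number of flagged records to review per true case found."""
    return 1.0 / precision


baseline = 0.01  # hypothetical: 1 in 100 flagged patients truly has the disease
improved = improved_precision(baseline, 218)  # 218% is the figure quoted in the article

print(f"baseline precision: {baseline:.4f}")
print(f"improved precision: {improved:.4f}")
print(f"records per true case before: {records_per_true_case(baseline):.1f}")
print(f"records per true case after:  {records_per_true_case(improved):.1f}")
```

Under these assumptions, a reviewer would go through roughly 31 flagged records per true case instead of 100, which matches the kind of reduction in screening effort described in the quote.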
In conclusion, the POPDx model is a significant breakthrough in disease recognition and has the potential to help researchers, clinicians, and patients alike.
With its ability to work with limited data, the model can help identify patients with rare diseases and speed up the search for breakthroughs.