Yale School of Public Health researchers have developed a new approach for improving genetic risk prediction by tapping into the vast, often underused information contained in electronic health records (EHRs). The method—called Electronic Health Record Embedding Enhanced Polygenic Risk Scores (EEPRS)—integrates modern embedding techniques with traditional genome-wide association study (GWAS) data to produce more accurate and clinically meaningful predictions of disease risk.
Embedding techniques convert information, such as the contents of electronic health records, into numerical vectors that computers can readily analyze. In EEPRS, these techniques include established methods such as Word2Vec, as well as newer approaches that use large language models, such as GPT, to capture patterns in patients’ health data.
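To make the idea of an embedding concrete, here is a minimal sketch in Python. It is not the EEPRS implementation: as a simple stand-in for Word2Vec or a language model, it factorizes a code co-occurrence matrix with a singular value decomposition. The patient records, the diagnosis codes, and the function names `code_embeddings` and `patient_embedding` are all invented for illustration.

```python
import numpy as np

# Hypothetical toy records: each patient is a list of diagnosis codes.
# These records are invented for illustration and are not real data.
patients = [
    ["I10", "E11", "I25"],
    ["I10", "I25"],
    ["E11", "E78"],
    ["J45", "J30"],
    ["J45", "J30", "I10"],
]

def code_embeddings(records, dim=2):
    """Embed codes by factorizing a code-by-code co-occurrence matrix.

    Assumption: SVD of log-scaled co-occurrence counts is used here as a
    simple stand-in for Word2Vec-style training.
    """
    vocab = sorted({code for record in records for code in record})
    index = {code: i for i, code in enumerate(vocab)}
    cooc = np.zeros((len(vocab), len(vocab)))
    for record in records:
        for a in record:
            for b in record:
                if a != b:
                    cooc[index[a], index[b]] += 1.0
    # log1p dampens very frequent pairs; the leading singular vectors
    # give each code a low-dimensional numerical representation.
    u, s, _ = np.linalg.svd(np.log1p(cooc))
    vectors = u[:, :dim] * np.sqrt(s[:dim])
    return vocab, vectors

def patient_embedding(record, vocab, vectors):
    """Summarize one patient as the mean of their code vectors."""
    rows = [vocab.index(code) for code in record]
    return vectors[rows].mean(axis=0)

vocab, vectors = code_embeddings(patients)
# One row per patient: continuous values summarizing the whole record.
phenotypes = np.array([patient_embedding(r, vocab, vectors) for r in patients])
```

Each row of `phenotypes` is a continuous summary of one patient's record, in contrast to a single case-or-control label.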
Current polygenic risk scores (PRS) rely on simplified, predefined disease categories, usually treating conditions as binary traits (case or control). But this approach overlooks the rich, multidimensional patterns that EHRs capture across thousands of diagnoses, symptoms, and clinical encounters. The new EEPRS framework addresses this gap by applying natural language processing tools, such as Word2Vec and GPT, to generate numerical representations of clinical phenotypes. These embeddings are then incorporated directly into the construction of risk scores, which requires only GWAS summary statistics.
In evaluations across 41 traits in the UK Biobank, EEPRS consistently outperformed single-trait PRS methods, with the largest gains appearing in cardiovascular-related phenotypes. The team also introduced EEPRS-optimal, which uses cross-validation to select the most effective embedding strategy for each trait, and MTAG-EEPRS, a multi-trait extension that further boosts prediction accuracy.
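The cross-validation step behind EEPRS-optimal can be illustrated with a short Python sketch. This is an assumption-laden toy, not the authors' procedure: the data are simulated, the candidate names `strategy_a`, `strategy_b`, and `strategy_c` are placeholders for scores built from different embedding strategies, and held-out R² from a one-variable linear fit is an assumed selection metric.

```python
import numpy as np

def cv_r2(score, trait, n_folds=5, seed=0):
    """Mean held-out R^2 of a one-variable linear fit of trait on score."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(trait)), n_folds)
    r2_values = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Fit on the training folds only, then evaluate on the held-out fold.
        slope, intercept = np.polyfit(score[train], trait[train], 1)
        pred = slope * score[test] + intercept
        ss_res = np.sum((trait[test] - pred) ** 2)
        ss_tot = np.sum((trait[test] - trait[test].mean()) ** 2)
        r2_values.append(1.0 - ss_res / ss_tot)
    return float(np.mean(r2_values))

def select_strategy(candidates, trait):
    """Return the candidate with the highest cross-validated R^2."""
    results = {name: cv_r2(score, trait) for name, score in candidates.items()}
    best = max(results, key=results.get)
    return best, results

# Simulated data (invented): one trait and three candidate scores that
# track the underlying signal with different amounts of noise.
rng = np.random.default_rng(42)
n = 500
signal = rng.normal(size=n)
trait = signal + rng.normal(scale=1.0, size=n)
candidates = {
    "strategy_a": 0.3 * signal + rng.normal(scale=1.0, size=n),
    "strategy_b": signal + rng.normal(scale=0.5, size=n),
    "strategy_c": signal + rng.normal(scale=2.0, size=n),
}

best, results = select_strategy(candidates, trait)
```

Using the same fold assignment for every candidate keeps the comparison paired, so differences in the held-out score reflect the candidates rather than the split. In a per-trait setting, `select_strategy` would be called once for each trait.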
The manuscript was published in The American Journal of Human Genetics. Hongyu Zhao, PhD, the Ira V. Hiscock Professor of Biostatistics and professor of genetics and of statistics and data science, is the corresponding author.
Lead author Leqi Xu, a doctoral candidate in biostatistics, said the work highlights the enormous potential of combining modern embedding approaches with large-scale biobank data. “By capturing the nuanced relationships embedded in electronic health records, EEPRS allows us to build more powerful and more interpretable genetic risk models that reflect the true complexity of human health,” Xu said.
If adopted widely, the EEPRS framework could help accelerate precision medicine by uncovering subtler genetic signals and improving early-risk identification across a broad range of diseases.
Journal Reference: Xu, Leqi et al. (2025). Improving polygenic risk prediction performance by integrating electronic health records through phenotype embedding. The American Journal of Human Genetics. DOI: 10.1016/j.ajhg.2025.11.006