How Do You Conduct A Genetic Study Without Genetic Data?: Leveraging Electronic Health Records

Any patient who’s new to a hospital is asked to fill out an intake form listing the name, number, and relationship of their emergency contact. It’s a protocol all of us are familiar with. But what if that same routine information could be used in an entirely different way, to build a rich collection of genetic data for the large-scale study of disease heritability?

This is the question that intrigued Columbia University researchers Nicholas Tatonetti, PhD; Fernanda Polubriaginof, MD, PhD; David Vawdrey, PhD; Krzysztof Kiryluk, MD, MS; and their collaborators.

They suspected that answering it might revolutionize the way researchers conduct heritability studies. Instead of going through the laborious, time-consuming process of recruiting participants (which typically means finding families with twins) and collecting their data, researchers could tap into information that already exists but has not been used to its fullest potential.

“The idea was to leverage vast electronic health record (EHR) data to identify familial relationships and therefore be able to understand disease risk, heritability, and, in the future, provide better and safer patient care by leveraging this heretofore untapped resource,” says Vawdrey.

A background in working with patients, Polubriaginof says, piqued her interest in discovering new ways to deliver improved care using patients’ own data.

“I was very interested in studying disease risk for populations,” she says. “As a physician by training and a PhD candidate at Columbia, I had done work on high-risk breast cancer patients. I was looking for data, like family history data in electronic health records. But what I found was that this information was often incomplete or even nonexistent.”

This incomplete patient data, the researchers reasoned, could be pieced together with emergency contact data to create broader, inferred information sets.

“This new pedigree and family relationships data, when combined with what’s already available in the electronic health records, can be used to estimate heritability across almost all disease at a fraction of the cost and much faster than previous approaches,” says Tatonetti, the Herbert Irving Assistant Professor of Biomedical Informatics at Columbia.

So he and his colleagues came up with one algorithm to deduce millions of family relationships and another to estimate the heritability of hundreds of traits, then applied those algorithms to the 5.5 million electronic health records of patients and their emergency contacts at NewYork-Presbyterian/Columbia University Irving Medical Center, NewYork-Presbyterian/Weill Cornell Medical Center, and Mount Sinai Health System.
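To make the idea of deducing relationships concrete, here is a minimal sketch in Python. It is not the researchers' algorithm. It assumes that each emergency contact has already been matched to a patient record, and it uses one simple rule: two patients who name the same person as a parent are inferred to be siblings. The record layout, the names `PARENT_TERMS` and `infer_siblings`, and the sample identifiers are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical relationship labels that mark the contact as the patient's parent.
PARENT_TERMS = {"mother", "father", "parent"}


def infer_siblings(contacts):
    """Infer sibling pairs from (patient_id, contact_id, relationship) records.

    Assumption: contact_id already refers to a matched patient record.
    Rule: patients who list the same contact as a parent are siblings
    (this sketch does not distinguish full from half siblings).
    """
    children_of = defaultdict(set)
    for patient_id, contact_id, relationship in contacts:
        if relationship.strip().lower() in PARENT_TERMS:
            children_of[contact_id].add(patient_id)

    siblings = set()
    for children in children_of.values():
        # Every pair of children sharing this parent is an inferred sibling pair.
        siblings.update(combinations(sorted(children), 2))
    return siblings


# Made-up example records: p1 and p2 both list p3 as their mother.
records = [
    ("p1", "p3", "Mother"),
    ("p2", "p3", "mother"),
    ("p4", "p1", "spouse"),
]
inferred = infer_siblings(records)
```

In this toy example, `inferred` contains the single pair `("p1", "p2")`, a relationship that neither patient stated directly on an intake form.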

“We found a way of using existing data and conducting a genetic study without using genetic data,” says Polubriaginof.

The results were published in Cell in May 2018 by Tatonetti and 20 co-authors.

“If you can infer relationships, then you can create family trees on a scale that was before now impossible without extensive time and financial resources,” says Vawdrey. “We can do in minutes or hours what other research groups have tried to painstakingly collect in months and years.”

Impact with a purpose

What does all of this mean for healthcare delivery? How does it contribute to the field of precision medicine?

“From the hospital point of view, for why this is important, it boils down to wanting to give the best and safest patient care. It helps us understand if we’re doing a good job screening for high-risk diseases for high-risk populations,” says Vawdrey. “If I have a history of cancer, the computer could clue in my doctor to ask certain questions or order certain tests.”

It’s also enabling more patients to receive the targeted, data-informed, proactive care they deserve, care that until now had been limited to a fraction of the population.

“Medical research has typically only focused on white males, which biases our knowledge, causing disparities in treatment outcomes,” says Tatonetti.

In a diverse setting like New York, however, the EHR database of patients is much more racially and ethnically inclusive.

“One of the main advantages of using the medical records of a large urban academic medical center is the diversity in our study subjects,” says Tatonetti. “Our study is the first large scale analysis of heritability in a diverse population and can be used to highlight these disparities for future research.”

Protection of patient privacy

Researchers made patient privacy a priority while conducting the study. Access to data was restricted and storage was secured.

“Authorization to access research data is strictly limited to the project personnel, and they have signed the medical center’s confidentiality agreement. Authentication is via user identifiers and passwords,” says Tatonetti. “All access is encrypted using SSH Secure Shell and other services mediated through SSH. Access to the computer is logged and audited.”

In addition, individually identifying information was stripped from the public reports.

“To protect against identifying specific, extreme cases, we only evaluate conditions for which there are at least 1,000 patients diagnosed,” explains Tatonetti. “We mask any counts that we publish when the value is less than or equal to 10. In addition, we break the connections between multiple conditions and individuals by re-assigning random identifiers for each condition. This prohibits identification by a unique set of diagnoses.”
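The three safeguards Tatonetti describes can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the study's actual code: the thresholds come from the quote above, while the function names, data layout, and sample values are hypothetical.

```python
import secrets

MIN_PATIENTS_PER_CONDITION = 1000  # only evaluate conditions with at least this many patients
MASK_THRESHOLD = 10                # published counts at or below this value are masked


def eligible_conditions(diagnosis_counts):
    """Keep only conditions with at least 1,000 diagnosed patients."""
    return {
        condition: count
        for condition, count in diagnosis_counts.items()
        if count >= MIN_PATIENTS_PER_CONDITION
    }


def mask_count(count):
    """Return None (masked) for counts of 10 or fewer, else the count itself."""
    return None if count <= MASK_THRESHOLD else count


def reassign_identifiers(patients_by_condition):
    """Give every patient a fresh random identifier within each condition.

    The same patient receives unrelated identifiers under different
    conditions, so a unique set of diagnoses cannot be linked back together.
    """
    return {
        condition: [secrets.token_hex(8) for _ in patient_ids]
        for condition, patient_ids in patients_by_condition.items()
    }


# Made-up example values.
kept = eligible_conditions({"asthma": 5200, "rare_condition": 400})
relabeled = reassign_identifiers({"asthma": ["p1", "p2"], "diabetes": ["p1"]})
```

Here `kept` retains only the asthma entry, and patient `p1` appears under a different random identifier in each condition's list.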

Strength in collaboration

By working across not only Columbia departments but also health networks, the team was able to collaborate with researchers from a wide range of backgrounds.

“Everyone actively participated in this study and gave us a lot of insight, in terms of improving the method, thinking about how to use these relationships, and validating our research,” says Polubriaginof.

Their study also lays the groundwork for other researchers looking to expand upon their discoveries, applying similar methods and algorithms to different contexts.

“Our study presents a new way to generate hypotheses about the genetics of human disease that wasn’t before available. As these data and methods become more widely used, researchers around the world will have the ability to use their expertise to re-evaluate our data, identifying important questions that may otherwise have been missed,” says Tatonetti.

Opportunities down the road

In the future, these relationships could be used in other research studies, including clinically focused ones, by applying EHR-derived phenotypes to home in on specific high-risk diseases.

“The diseases we tested were based on simple billing codes in EHRs. We have the capability of producing much more reliable phenotypes using special computational algorithms,” says Kiryluk.

One application he’s currently pursuing is kidney disease. “Our algorithm has excellent diagnostic properties, to predict kidney disease with high fidelity. The output of the algorithm can be used to estimate heritability for different forms.”

Kiryluk continues: “If you think about the fact that you have a large fraction of individuals within EHR linked with pedigrees and these individuals also have genetic data … then you could think about testing for co-segregation of EHR-derived phenotypes with specific genetic markers. That would be a really nice application of this method.”

Ultimately, it’s about “opening new avenues for research,” as Polubriaginof puts it.

Vawdrey echoes: “We want to be breaking ground and discovering impact. We’re discovering things that people have never done before and we think they help us drive impact and provide better and safer patient care, and also open the door to future research opportunities.”