OARD Uses Data-Driven Approach To Fill Knowledge Gaps Around Rare Disease Research

Rare diseases may be individually uncommon, but in the aggregate, they affect millions of people annually. Limited clinician knowledge of the vast array of these diseases often leads to misdiagnosis or under-diagnosis, and can create financial and emotional burdens for patients and their families.

A resource developed by researchers at Columbia University Irving Medical Center (CUIMC) presents a new method to fill this knowledge gap for the rare disease community. Open Annotation for Rare Diseases (OARD), a real-world, data-derived resource with annotation for rare disease-related phenotypes, was generated using electronic health records (EHR) of two large academic health institutes and contains novel methods that can empower larger research initiatives in this area.

“OARD: Open annotations for rare diseases and their phenotypes based on real-world data,” a study recently published in The American Journal of Human Genetics, highlights this publicly accessible, data-driven resource, which was developed using more than 10 million deidentified patient records from CUIMC and Children’s Hospital of Philadelphia (CHOP).

“Existing knowledge bases are often manually curated with additional annotations found in published case reports,” said lead author Cong Liu, Associate Research Scientist in the Columbia University Department of Biomedical Informatics (DBMI). “OARD is a sharable resource for rare disease research and diagnosis. It intends to complement the current diagnosis pipeline with novel disease-phenotype associations identified from real-world clinical observations that have not been reported in a manually curated database.”

Previous knowledgebases, like the Human Phenotype Ontology (HPO), are mostly domain-driven resources. Besides being expensive and time-consuming to build, these resources are unlikely to capture the full expanse of rare diseases.

“The unique advantage of OARD is that it is data-driven,” said corresponding author Chunhua Weng, Professor of Biomedical Informatics at Columbia University. “We use a large amount of EHR data from a large, diverse patient population. We derive these symptom-disease associations directly from real world data. This data-driven approach is more generalizable and scalable than expert-driven approach.”

“We used our clinical data warehouse with multiple decades worth of data to generate annotations of rare disease information,” Weng added. “Our team built open annotation datasets by extracting concept relationships between rare disease concepts, and we made it into a computable format. We believe this is a very valuable real world data-derived knowledgebase to help people develop future rare disease diagnosis algorithms.”
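As a rough illustration of what a "computable format" for concept relationships might look like, the sketch below counts how often pairs of concepts appear in the same patient record and compares that with what chance alone would predict. The patient records, the function names, and the observed-to-expected ratio as the association measure are all assumptions made for this example; the article does not say which statistics OARD reports.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_table(patients):
    """Count how many patients have each concept and each pair of concepts."""
    single = Counter()
    pair = Counter()
    for concepts in patients:
        unique = sorted(set(concepts))          # one vote per patient per concept
        single.update(unique)
        pair.update(combinations(unique, 2))    # pairs come out in sorted order
    return single, pair


def observed_expected_ratio(a, b, single, pair, n_patients):
    """Observed co-occurrence divided by the count expected under independence."""
    observed = pair[tuple(sorted((a, b)))]
    expected = single[a] * single[b] / n_patients
    return observed / expected if expected else 0.0


# Hypothetical, hand-written records; real input would be deidentified EHR data.
patients = [
    {"Duchenne muscular dystrophy", "Gowers sign", "Cardiomyopathy"},
    {"Duchenne muscular dystrophy", "Gowers sign"},
    {"Cardiomyopathy"},
    {"Scoliosis"},
]

single, pair = cooccurrence_table(patients)
ratio = observed_expected_ratio(
    "Duchenne muscular dystrophy", "Gowers sign", single, pair, len(patients)
)
print(ratio)
```

A ratio above 1 means the two concepts appear together more often than independent occurrence would suggest. With so few records the number is only illustrative.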

An example of how this vast knowledgebase compares to previous approaches can be seen with Duchenne, a severe form of muscular dystrophy. The original HPO annotation links 16 phenotype concepts to Duchenne, while the OARD dataset identified 211 related phenotype concepts. This is possible for several reasons, including CUIMC's long history of work with EHR data and natural language processing, as well as the large patient counts available in both the CUIMC and CHOP databases.

The overall patient total isn’t the only valuable aspect of this database. The diversity of patients is necessary to do equitable research around rare diseases.

“Since minorities are less likely to be involved in clinical research, their phenotype representation and rare disease knowledge are likely to be under-represented in the current knowledge base,” Liu said. “OARD is a real-world data-driven approach, which can provide an unbiased knowledge presentation reflecting the institution’s patient demographics. Given both CUIMC and CHOP are two institutions serving many under-represented patients, we believe OARD has the potential to enrich the current limited phenotype knowledge in minority rare disease patients.”

Including variables derived from clinical notes is another novel aspect of OARD, as most aggregated data in large network studies do not include them.

“Our study showed, at least for rare diseases, unstructured clinical notes are extremely important to enrich knowledge,” Liu said. “One of the major technical barriers is natural language processing is often required to extract the variables from the unstructured notes, which can be time-consuming. Our pipeline used a context-aware smart keyword search approach, which is a much faster solution to extract variables and aggregate the data.”
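To give a sense of what a context-aware keyword search can mean in practice, here is a minimal sketch that looks for phenotype terms in a note and skips mentions preceded by a negation cue in the same sentence. The cue list, the function name, and the sample note are hypothetical, and this is a deliberately simple stand-in rather than the OARD pipeline, whose implementation the article does not describe.

```python
import re

# Hypothetical negation cues; a real system would use a much richer set.
NEGATION_CUES = ("no", "denies", "without", "negative for")


def find_affirmed_terms(note, keywords):
    """Return keywords mentioned in the note without a preceding negation cue."""
    found = set()
    # Split crudely into sentences so a cue only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", note.lower()):
        for kw in keywords:
            match = re.search(r"\b" + re.escape(kw.lower()) + r"\b", sentence)
            if not match:
                continue
            # Only the first mention in each sentence is checked in this sketch.
            preceding = sentence[: match.start()]
            negated = any(
                re.search(r"\b" + re.escape(cue) + r"\b", preceding)
                for cue in NEGATION_CUES
            )
            if not negated:
                found.add(kw)
    return found


note = (
    "Patient has muscle weakness and scoliosis. "
    "Denies seizures. No evidence of cardiomyopathy."
)
terms = find_affirmed_terms(
    note, ["muscle weakness", "scoliosis", "seizures", "cardiomyopathy"]
)
print(sorted(terms))
```

In this toy note, only the two affirmed findings are kept, while the denied and ruled-out findings are dropped. Handling family history, uncertainty, or abbreviations would need considerably more context rules.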

The research team noted that OARD, which has the potential to support polyphenotypic risk scores for diagnosing rare diseases, similar to genome-wide association studies (GWAS) resources, can improve rare disease research in several areas. First and foremost, it is a pipeline that can be executed by other institutions to continue filling this knowledge gap.

“Other institutions can adopt our pipeline for concept extraction and annotation from clinical notes from the EHR system and generate the counterpart data sets in their institution,” Weng said. “If we get more people adopting this pipeline, we will get larger datasets for rare diseases. We can aggregate evidence across multiple sites about rare disease.”

The team is also considering developing an interface to assist clinicians or patients in diagnosing rare diseases, as well as integrating OARD within the EHR environment.

“The application needs to be evaluated more in a real-world use-case scenario, but this is one of the first data-driven rare-disease annotations using EHR data,” Weng said. “We believe this approach is more generalizable and scalable than expert-driven approach, and it can create more opportunities for future rare disease informatics research.”

OARD is publicly available at https://rare.cohd.io/.