DEI Research Projects & Initiatives
Ongoing Projects
Principal Investigator: Gamze Gürsoy (gg2845@cumc.columbia.edu)
Description
Machine learning or statistical models that are trained on biased genetic data (e.g., an overwhelming number of data points taken from individuals with European ancestry) will perform poorly when used to make inferences for patients with underrepresented features. Due to privacy concerns, we often do not have access to the underlying data to assess the bias. Here we propose to repurpose empirical tools used to measure private information leakage, such as “membership inference attacks”, to quantify the bias in the training data of these models.
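As a rough illustration of what a per-group membership inference measurement could look like, the sketch below assumes a simple loss-threshold attack: a sample is guessed to be a training member when the model's loss on it falls below a threshold. The function names, group labels, and loss values are all hypothetical and are not taken from the project. How the resulting per-group gap relates to bias in the training data is the research question itself, not something this sketch establishes.

```python
def attack_advantage(member_losses, nonmember_losses, threshold):
    """Advantage (TPR - FPR) of a loss-threshold membership inference attack.

    A sample is guessed to be a training member when its loss is below
    `threshold`. This is an assumed, simplified attack for illustration.
    """
    tpr = sum(loss < threshold for loss in member_losses) / len(member_losses)
    fpr = sum(loss < threshold for loss in nonmember_losses) / len(nonmember_losses)
    return tpr - fpr


def advantage_by_group(losses, threshold):
    """Compute the attack advantage separately for each group.

    `losses` maps a group label to (member_losses, nonmember_losses).
    """
    return {
        group: attack_advantage(members, nonmembers, threshold)
        for group, (members, nonmembers) in losses.items()
    }


# Made-up per-sample losses, for illustration only.
toy_losses = {
    "group_1": ([0.1, 0.2, 0.3, 0.6], [0.2, 0.4, 0.7, 0.8]),
    "group_2": ([0.1, 0.1, 0.2, 0.3], [0.6, 0.9, 1.1, 0.4]),
}
adv = advantage_by_group(toy_losses, threshold=0.5)
print(adv)
```

Comparing the advantage across groups gives one number per group that can be computed with query access to the model alone, which is the property that makes this family of tools attractive when the training data itself is not available.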
Opportunities
Student involvement: A master’s or PhD student with a computational background would be appropriate for this project. Familiarity with machine learning and statistics is required. No background in privacy/security is necessary.
Principal Investigator: George Hripcsak (gh13@cumc.columbia.edu)
Description
Many clinical algorithms include race as a predictor. However, the appropriate use and the implications of including a patient’s race in clinical predictive algorithms remain unclear. In this work, we study the impact of race on the performance of predictive algorithms for glomerular filtration rate (GFR). We compare the prediction error of the estimated GFR, with and without race as a variable, between Black patients and White patients. Our results show that the prediction error for patients coded as Black was higher than for those coded as White, regardless of whether race was included as a variable. Using the large amount of information represented in electronic health record variables achieved a more accurate prediction of GFR and the smallest difference in prediction error across racial groups.
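The group-wise comparison described above can be sketched as follows. This is a minimal illustration that assumes mean absolute error as the error metric; the source does not say which metric the study used. The helper names, group labels, and values are hypothetical placeholders, not study data.

```python
from statistics import mean


def group_mae(records):
    """Mean absolute error per group.

    `records` is an iterable of (group, measured_gfr, estimated_gfr) tuples.
    """
    errors = {}
    for group, measured, estimated in records:
        errors.setdefault(group, []).append(abs(measured - estimated))
    return {group: mean(values) for group, values in errors.items()}


def max_gap(mae_by_group):
    """Largest difference in mean absolute error between any two groups."""
    return max(mae_by_group.values()) - min(mae_by_group.values())


# Made-up values for illustration only (units: mL/min/1.73 m^2).
toy_records = [
    ("A", 90.0, 85.0),
    ("A", 60.0, 66.0),
    ("B", 75.0, 64.0),
    ("B", 50.0, 59.0),
]
mae = group_mae(toy_records)
print(mae, max_gap(mae))
```

Running the same comparison once for a model that includes race and once for a model that omits it yields two gaps that can be placed side by side.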
Opportunities
Potential directions include developing a fairness assessment pipeline and applying it to other clinical predictive algorithms. Prerequisites: Master’s or above, basic programming in Python or R, SQL. Alternatively, medical students or fellows with expertise in a clinical domain.
Principal Investigator: George Hripcsak (gh13@cumc.columbia.edu)
Description
Fairness in clinical decision-making is a critical element of health equity, but assessing fairness of clinical decision-making from observational data is challenging. Recently, many fairness notions have been proposed to quantify fairness in decision-making, among which causality-based fairness notions have gained increasing attention. In this work, we explore a causal fairness notion called principal fairness as a potential metric for assessing fairness of treatment allocation. We develop a probabilistic machine learning algorithm for estimating principal fairness, and show how principal fairness can measure fairness in medical decisions using electronic health records (EHR) data.
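For readers unfamiliar with the term, principal fairness is usually stated in the causal fairness literature as a conditional independence condition. The notation below is an assumption introduced here for illustration and does not come from the project description: D is the treatment decision, A the protected attribute, and Y(0), Y(1) the potential outcomes without and with treatment.

```latex
% Principal stratum: the pair of potential outcomes for an individual
R = \bigl(Y(0),\, Y(1)\bigr)

% Principal fairness: the decision is independent of the protected
% attribute within each principal stratum
D \perp\!\!\!\perp A \mid R
\quad\Longleftrightarrow\quad
\Pr(D = d \mid R = r,\, A = a) = \Pr(D = d \mid R = r)
\quad \text{for all } a,\ d,\ r
```

Because only one of Y(0) and Y(1) is observed for each patient, the stratum R is never observed directly. That is why the condition has to be estimated under modeling assumptions rather than read off the data.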
Opportunities
Prerequisites: PhD student or postdoc with a background in probability and statistical inference, programming in Python or R, SQL. Alternatively, medical students or fellows with expertise in a clinical domain.
Principal Investigator: Pierre Elias (pae2115@cumc.columbia.edu)
Description
Importance: Transthyretin Amyloid Cardiomyopathy (ATTR-CM) is a cause of heart failure that disproportionately affects Black patients. Despite this epidemiologic prevalence, systemic racial disparities exist in the diagnosis of ATTR-CM.
Opportunities
Students should have a general comfort with Python programming and ideally prior experience in one of the following domains: image-based analysis, machine learning, deep learning, or EHR phenotyping.
Principal Investigator: Sarah Rossetti (sac2125@cumc.columbia.edu)
Description
CONCERN is a multi-site study (Columbia University Medical Center and MassGeneral Brigham) that is developing and evaluating an early warning score system to predict and provide clinical decision support when patients are at increased risk of deterioration. The early warning system is based on the variability in nurses’ electronic health record (EHR) documentation patterns, which implies nursing surveillance and reflects nurses’ changing degrees of concern. Due to the inherent biases in EHR data, prediction models developed from EHR data are highly likely to be biased. As a part of our ongoing effort to identify and mitigate implicitly embedded biases in our model, we are conducting analyses to monitor the differences in nursing documentation patterns associated with patients’ demographic characteristics. Our recent work focuses on examining the differences in nursing documentation patterns associated with patients’ race, socioeconomic status, and primary language and seeking solutions to mitigate these biases in our predictive model.
Opportunities
Students interested in working with large EHR data sets can assist in cleaning and analysis of those data, including machine learning and logistic regression.
Principal Investigator: Chunhua Weng (cw2384@cumc.columbia.edu)
Opportunities
Research opportunities for students who have skills for natural language processing, statistical modeling, or clinical research informatics.
Principal Investigator: Noémie Elhadad (ne60@cumc.columbia.edu)
Description
The objective of this study is to perform a systematic comparison of condition prevalence by race, gender, age, and data source and their intersections across nine large electronic health record and claims databases. This is a large collaborative study with members of the OHDSI Health Equity Working Group.
Opportunities
Students interested in visualization to communicate complex patterns across a large range of conditions and populations.
Principal Investigator: Rita Kukafka (rk326@cumc.columbia.edu)
Description
This study aims to assess the effectiveness of decision support for patients and healthcare providers in improving informed choice about breast cancer chemoprevention among women with atypical hyperplasia (AH) or lobular carcinoma in situ (LCIS). It will also seek to examine factors that facilitate or impede the implementation of these decision support tools into clinical practice. This study is a cluster randomized controlled trial (RCT) in which clinics will be randomly assigned to either an active intervention or control group. Participating providers at intervention clinics are given access to a decision support tool called BNAV (Breast cancer risk NAVigation). Participating patients at intervention clinics are given access to a patient decision aid called RealRisks, while patients at control clinics are given standard educational materials. The primary endpoint is chemoprevention informed choice at six months. Secondary endpoints include perceived breast cancer risk/worry, chemoprevention knowledge/intention, decision conflict/regret, and shared decision-making. Chemoprevention uptake, adherence, and reasons for discontinuation are assessed annually for up to 5 years.
Opportunities
Current opportunities include observing and participating in qualitative data collection and analysis, assisting with patient recruitment and retention, tracking monthly screening logs, assisting software developers with the maintenance and functionality of RealRisks and BNAV, examining usage log statistics, and working with the FHIR component of the project.
Principal Investigator: Rita Kukafka (rk326@cumc.columbia.edu)
Description
This study aims to examine the use of Fast Healthcare Interoperability Resources (FHIR) and application programming interfaces (APIs) to develop accurate automated breast cancer risk assessments. Electronic health records (EHRs), a common source for populating breast cancer risk assessment models, often include mistakes or missing data, which poses a challenge when attempting to calculate accurate breast cancer risk assessments. Additionally, prior research has found that data tends to be more accurate when provided by patients compared to EHRs. Our previously developed patient decision aid, RealRisks, will be enhanced with FHIR-based technology to harness the strengths of both patient-generated and electronic health record data. This study will involve conducting usability studies to refine the FHIR-enhanced RealRisks and a feasibility pilot study to assess the effect of this technology on patient risk perception. We will also seek to identify the multi-level barriers that affect implementing FHIR-enhanced RealRisks and returning patient-corrected data to the EHR.
Opportunities
Current opportunities for students include assisting software developers/investigators with conducting usability studies to refine the user interface in RealRisks, working with the data from the EHR to present to patients for correction/additions, running prediction models for breast cancer, making the upgrades and changes to RealRisks necessary for integration of FHIR technology, developing study materials, and assisting with the compilation of results.
Principal Investigator: Rita Kukafka (rk326@cumc.columbia.edu)
Description
This study aims to expand genetic testing for hereditary breast and ovarian cancer syndrome to a broader population of high-risk women by prompting appropriate referrals from the primary care setting using an electronic health record-embedded breast cancer risk navigation (BNAV) tool. To address patient-related barriers to genetic testing, we developed a patient decision aid, RealRisks, designed to improve genetic testing knowledge, the accuracy of breast cancer risk perceptions, and self-efficacy to engage in a collaborative dialogue about genetic testing. The study design is a randomized controlled trial (RCT) of patient educational materials and provider EHR notice alone (control arm) or in combination with RealRisks and BNAV (intervention arm). Now that the RCT has been completed, we seek to understand the experiences of these women who underwent genetic testing through semi-structured interviews. This will help inform the next proposal to enhance RealRisks to include the return of results, primarily to address the uncertainty in returning variants of uncertain significance (VUS), which have become common with multigene panel testing.
Opportunities
Current opportunities for students include recruiting patients for one-time interviews with the study team, observing patient interviews, assisting with transcription and analysis of completed interviews, and assisting with writing the new grant proposal.
Principal Investigators: Herbert Chase (hc15@cumc.columbia.edu) and Bruce Forman (formanb@nyp.org)
Description
The Columbia Department of Biomedical Informatics (DBMI) Summer Research Program provides rising seniors in high school and college undergraduate students from a wide range of backgrounds (biology, psychology, engineering, computer science, applied mathematics, statistics, etc.) with fundamental knowledge, hands-on skills, and research experience in biomedical informatics and health data science. Students work with both DBMI faculty and trainees in a wide range of research opportunities.
Opportunities
In terms of DBMI student involvement, this program needs and benefits from our graduate students functioning as mentors to the participants.
Principal Investigator: Bruce Forman (formanb@nyp.org)
Description
This is a collaboration between the NYC Department of Education, CUNY, the Hospital, and Microsoft around an innovative NYC high school/early college in northern Manhattan called Inwood Early College for Health and Information Technologies (IEC). The school is one of the NYC “9-14 schools”: students do the traditional four years of high school (grades 9-12) and then, if they stay for the whole program, do the last two years (“grades” 13-14) at a partner CUNY college, from which they get an associate’s degree for free. Each of these schools has a career theme, and IEC’s is health information technology. The 9-14 schools are open to any NYC high school student, regardless of their academic performance up through 8th grade. Thus, IEC is representative of the diverse NYC student population, with demographics as follows: 78% Latino/Hispanic, 18% African-American, 2% Asian, and 2% White. Students in the program start taking college-level courses as early as the 10th grade. NYP’s role is to be an industry partner to the school, providing work-based learning opportunities to the students, including mentoring, job shadowing, onsite visits/tours, and internships. Dr. Forman is the lead liaison between NYP and the School for this partnership. In the past, this program has placed students in internships with the group that manages GetHealthyHeights.org (Community Engagement Core Resource of the Irving Institute for Clinical and Translational Research; Rita is directly involved with this group). Past IEC students have taken part in the DBMI HIT certification graduation programs and career fairs. DBMI leadership has participated in a career fair for the IEC students, sponsored by our CUNY partner, Bronx Community College.
Opportunities
There are opportunities for DBMI trainees to help out in various ways.
Completed Projects
Principal Investigator: Noémie Elhadad (ne60@cumc.columbia.edu)
Description
Lack of a large-scale survey of the health disparities and minority health (HDMH) literature leaves the field potentially vulnerable to disproportionately focusing on specific populations or emphasizing certain conditions, curtailing our ability to fully advance health equity and improve our understanding of the health of minoritized communities. The goal of this study is to carry out a scoping review of HDMH literature to investigate the following questions: 1) What are the major populations, study methods, conditions, and themes in the HDMH literature? 2) How have dominant themes changed over time? 3) What gaps exist in the literature? Because the HDMH literature is large (205K articles), computational methods are used to index and analyze this corpus.
Principal Investigator: Noémie Elhadad (ne60@cumc.columbia.edu)
Description
Many areas of clinical informatics research, such as estimating disease risk and identifying health disparities, rely on accurate and complete race and ethnicity (RE) patient data. While structured data in the electronic health record (EHR) contains accessible patient-level RE data, it is often missing, inaccurate, or lacking granular details. Natural language processing (NLP) models can be trained to identify RE in clinical text to supplement missing structured RE data. A large corpus was built and annotated in two ways. First, granular information related to RE, such as preferred language and country of origin, was annotated, and second, RE labels were annotated. An NLP tool to extract this information was also trained and validated.