Biomedical Informatics Seminar Series

The DBMI seminar series is a 1-credit course for DBMI students who can benefit from hearing new methods of research from speakers from both academia and industry. It is currently being offered virtually, though it is traditionally held in PH-200.

During the 2020-21 academic year, DBMI announced that the weekly seminar would include a set of talks as part of the new DBMI Special Seminar Series: Toward Diversity, Equity, and Inclusion in Informatics, Health Care, and Society. The first two sessions were held during the 2021 Spring semester, and both upcoming presentations and past recordings will be shared on our Special Seminar Series homepage.


During the 2021 Fall semester, all weekly (Monday 1-2 pm ET) presentations will be provided by faculty from Columbia University. Selected presentations are recorded and posted to the DBMI YouTube page. Those individual links can be found in the speaker sections below, which include all seminars since the 2019-2020 academic year. This page will be updated with future presentations when they become available.

  To join DBMI seminars during the 2021 fall semester, please use this Zoom link.

   Meeting ID: 943 3385 8029
   Passcode: 191130

Upcoming 2021 Fall Seminars

More information on this presentation will be posted when available.

Talk title: Are phenotyping algorithms fair for underrepresented minorities within older adults?

Abstract: The widespread adoption of machine learning (ML) algorithms for risk-stratification has unearthed plenty of cases of racial/ethnic biases within algorithms. When built without careful weightage and bias-proofing, ML algorithms can give wrong recommendations, thereby worsening health disparities faced by communities of color. Biases within electronic phenotyping algorithms are largely unexplored. In this work, we look at probabilistic phenotyping algorithms for clinical conditions common in vulnerable older adults: dementia, frailty, mild cognitive impairment, Alzheimer’s disease, and Parkinson’s disease. We created an experimental framework to explore racial/ethnic biases within a single healthcare system, Stanford Health Care, to fully evaluate the performance of such algorithms under different ethnicity distributions, allowing us to identify which algorithms may be biased and under what conditions. We demonstrate that the performance of these algorithms (precision, recall, accuracy) varies by anywhere from 3% to 30% across ethnic populations, even when ethnicity is not used as an input variable. In over 1,200 model evaluations, we have identified patterns that indicate which phenotype algorithms are more susceptible to exhibiting bias for certain ethnic groups. Lastly, we present recommendations for how to discover and potentially fix these biases in the context of the five phenotypes selected for this assessment.

Bio: Dr. Juan M. Banda, at his GSU lab, Panacea Lab, works on building machine learning and NLP methods that help to generate insights from multi-modal large-scale data sources, with applications to precision medicine, medical informatics, and other domains.

His research interests are not limited to structured data; he is also well-versed in extracting terms and clinical concepts from millions of unstructured electronic health records and using them to build predictive models (electronic phenotyping) and mine for potential multi-drug interactions (drug safety). Dr. Banda has published over 70 peer-reviewed conference and journal papers and serves as an editorial board member of the Journal of the American Medical Informatics Association and Frontiers in Medicine – Translational Medicine, and as a reviewer for JBI, Nature Digital Medicine, Nature Scientific Data, Nature Protocols, PLOS One, and several other leading journals. Prior to being an assistant professor of Computer Science at Georgia State University, Dr. Banda was a postdoctoral scholar, then a research scientist at Stanford’s center of Biomedical Informatics. He is an active collaborator of the Observational Health Data Sciences and Informatics, and his work has been funded by the Department of Veterans Affairs, the National Institute on Aging, NASA, NSF, and NIH. He also serves as a PC member and chair for several conferences and workshops, including ICML, NeurIPS, FLAIRS, and IEEE Big Data, among others.

Previous 2021 Fall Seminars

Title: Machine Learning Applications in Cardiology 

Watch The Full Presentation Here

Abstract: In this talk we will discuss why and how deep learning approaches have the potential to greatly impact cardiac imaging. We will then explore use cases developed here at Columbia that have led to two of the world’s first prospective clinical trials of deep learning in cardiology. Lastly we’ll critique the limitations of current ML approaches preventing mainstream adoption in order to answer the question, “What are the big problems the field needs to be tackling now?” (and maybe even answer, “What’s a really good idea for me to do research on as a grad student?”)

Bio: Pierre Elias, MD is a cardiology fellow at Columbia University Irving Medical Center who recently completed a two-year postdoc in the Perotte Lab at DBMI.

Title: Addressing the challenges of the “fourth paradigm” in biology and medicine

Abstract: Recent advances in biotechnology and medicine allow us to collect an immense amount of physiological, contextual, and biological data at the personalized and population level. This surge in data gives rise to a paradigm shift in biology and medicine towards data-intensive discoveries. While this provides the perfect opportunity to study human biology and disease, it also presents daunting challenges in data analysis, privacy and sharing at scale. In this talk, first, I will discuss the scalable tools I have developed to overcome privacy concerns associated with sharing functional genomics and genomics data. Second, I will review the computational tools I have developed to address the challenge of high-throughput functional genomics data analysis. I will end my talk by describing the vision of my future lab. This will include developing methods to address questions related to (1) biomedical data privacy for sharing data in research and clinical settings and (2) multi-omics data integration to understand the relationship between genotypes and phenotypes.

Title: Towards a unified systems theory of mental disorders

Abstract: Understanding the biology of psychiatric disorders requires analyses on multiple levels of hierarchical organization: on the level of genes, cellular networks, neuron types, brain circuits, and patient phenotypes. Over the last decade, our lab has pioneered advances on all these organizational levels, for disorders such as autism and schizophrenia. We believe that the emerging data now allow us to make an informed generalization about the etiology of major psychiatric disorders. Using examples primarily from autism spectrum disorder (ASD), I will discuss our recent work on understanding brain circuits that are likely perturbed across disorders. We have recently developed an approach to integrate genetic data with high-resolution spatial gene expression and a brain-wide mesoscale connectome. The application of the approach to autism demonstrates that ASD mutations perturb widely distributed sets of brain circuits with interrelated biological functions and structures from the cortex, striatum, amygdala, thalamus and hippocampus. The identified circuits are generally responsible for the integration of sensory and emotional information as well as context-dependent learning and decision-making based on this information. Our preliminary analyses show that similar circuits are also affected in schizophrenia and likely in many other mental disorders. We have also discovered that each ASD gene can be characterized by a parameter, phenotype dosage sensitivity (PDS), which quantifies the relationship between changes in a gene’s dosage and changes in each disorder phenotype. We believe that the relationship characterized by PDS is likely to generalize to other disorders and human phenotypes. Finally, I will discuss how the emerging picture puts us on the path towards explaining the common genetic risk factor underlying multiple psychiatric disorders (p-factor) and how specific phenotypes may arise in each disorder.

2021 Spring Seminars

Speaker: Rafael Irizarry, PhD Professor and Chair of the Department of Data Sciences at the Dana-Farber Cancer Institute; Professor of Biostatistics at Harvard T.H. Chan School of Public Health

Title: Probabilistic Gene Expression Signatures for Single Cell RNA-seq Data 

Watch The Presentation Here

Abstract:  In this talk Prof. Irizarry will describe his general approach to developing statistical solutions to problems in high throughput biology. He will focus on an example related to predicting cell types from single cell RNA-seq data. He will discuss challenges such as batch effects and sparse data and describe statistical solutions for these. Finally, he will show recent results from a collaboration involving spatial transcriptomics.

Biography: Rafael Irizarry received his Bachelor’s in Mathematics in 1993 from the University of Puerto Rico and went on to receive a Ph.D. in Statistics in 1998 from the University of California, Berkeley. His thesis work was on Statistical Models for Music Sound Signals. He joined the faculty of the Johns Hopkins Department of Biostatistics in 1998 and was promoted to Professor in 2007. He is now Professor and Chair of the Department of Data Sciences at the Dana-Farber Cancer Institute and a Professor of Biostatistics at Harvard T.H. Chan School of Public Health.

Professor Irizarry’s work has focused on applications in genomics. In particular, he has worked on the analysis and signal processing of high-throughput data. He has distinguished himself by disseminating his statistical methodology as open source software shared through the Bioconductor Project, a leading open source and open development software project for the analysis of high-throughput genomic data. His widely downloaded software tools have helped him become one of the most highly cited scientists in his field. Although Professor Irizarry’s focus has been in genomics, he is an applied statistician generally interested in real-world problems. During his career he has co-authored papers on a variety of topics including musical sound signals, infectious diseases, circadian patterns in health, fetal health monitoring, and estimating the effects of Hurricane María in Puerto Rico.

Professor Irizarry’s dedication to education is best demonstrated by the success of the numerous trainees he has mentored. He has also developed several HarvardX online courses on data analysis, which have been completed by thousands of students. These courses are divided into three series: Professional Certificate in Data Science, Data Analysis for the Life Sciences, and Genomics Data Analysis. He shares the material for these courses through textbooks that are freely available online and reproducible code through GitHub. Professor Irizarry also dedicates his time to providing service to the profession. Examples of this work include serving as the chair of the National Institutes of Health (NIH) Genomics, Computational Biology and Technology (GCAT) study section, the search committee for the National Library of Medicine director, the National Academy of Sciences Gulf War and Health Committee, and the National Advisory Council for Human Genome Research.

Professor Irizarry has received several awards honoring the work described above. In 2009, the Committee of Presidents of Statistical Societies (COPSS) named him the Presidents’ Award winner. The Presidents’ Award is arguably the most prestigious award in Statistics. That year he was also named a fellow of the American Statistical Association. In 2017 he was chosen as the laureate of the Benjamin Franklin Award in the Life Sciences. In 2020 he became an ISCB Fellow. He has also received the 2019 Research Parasite Award for outstanding contributions to the rigorous secondary analysis of data, the 2009 Mortimer Spiegelman Award, which honors an outstanding public health statistician under age 40, the ASA Youden Award in Interlaboratory Testing, the 2004 American Statistical Association (ASA) Outstanding Statistical Application Award, and the 2001 American Statistical Association Noether Young Scholar Award for a researcher younger than 35 years of age who has significant research accomplishments in nonparametric statistics.

Title: Identifying and Leveraging Public Data Sources with Social Determinants of Health Information for Population Health Informatics Research 

Speaker: Irene Dankwa-Mullan MD MPH, Chief Health Equity Officer, IBM Watson Health, IBM Corporation

Watch The Full Presentation Here

Abstract: Social determinants of health (SDOH) account for many health inequities. Data sources traditionally used in informatics research often lack SDOH, and, when available, SDOH may be difficult to leverage given its lack of specificity and lack of structured information. In this presentation, I will share the initial phases of the work we are doing on leveraging SDOH data for health equity research, addressing some of the informatics challenges of leveraging social determinants of health data to inform population health or health services research. I will discuss a case study using a machine learning clustering algorithm to uncover region-specific sociodemographic features and disease-risk prevalence correlated with COVID-19 mortality during the early accelerated phase of community spread.

Bio: Irene Dankwa-Mullan is a nationally and internationally recognized physician and expert scientist working at the intersection of healthcare, health equity, public health, informatics, data science and applied artificial intelligence, with over 60 peer-reviewed publications. She serves as the Chief Health Equity Officer and Deputy Chief Health Officer for research and evaluation at IBM Watson Health. As Chief Health Equity Officer, she works across business market segments to promote a culture of equity, ethical AI, diversity and inclusion. Her responsibilities as Deputy Chief Health Officer include leadership for evaluation research and implementation science and promoting opportunities to advance the science of AI and advanced analytics. Dr. Dankwa-Mullan attended Barnard College, where she majored in Biochemistry. She received her medical degree from Dartmouth Medical School, and a Master’s degree in Infectious Disease Epidemiology and Biostatistics from the Yale School of Public Health in a joint MD/MPH program. She completed residency training in Internal Medicine at the Johns Hopkins Hospital’s Bayview medical campus.

Speaker: Dr. Aarti Sathyanarayana, PhD – Harvard T.H. Chan School of Public Health

Title: Digital Phenotyping: Quantifying human health with low, medium and high frequency data streams

Watch The Presentation Here

Abstract: Digital health data is notoriously enigmatic. However, smartphones, wearables, and EEGs have the potential to provide enormous insight into human health and wellbeing. Making sense of these complex data streams requires new computational approaches that combine the best of signal processing and machine learning to find pragmatic solutions. Dr. Sathyanarayana will discuss challenges and solutions for translating low, medium and high frequency data into actionable insights for health, wellness, and performance.

Bio: Dr. Aarti Sathyanarayana is a postdoctoral research fellow in the department of biostatistics at the Harvard T.H. Chan School of Public Health. She also holds an appointment in the clinical data animation center at Massachusetts General Hospital and Harvard Medical School. Her research interests are in time variant health data analysis, signal processing, and machine learning. She strives to translate enigmatic health data into actionable insights, with an emphasis on digital phenotyping and digital biomarker discovery. Her recent work has focused on developing new methodologies to better understand smartphone, wearables, and EEG data in the context of human health and wellness. Prior to joining Harvard, Aarti received her PhD in computer science from the University of Minnesota, where her dissertation was selected for the university’s doctoral dissertation award. Since then, her work has won multiple junior investigator awards from the National Center for Women & Information Technology, the American Medical Informatics Association, the American Epilepsy Society, and the American Clinical Neurophysiology Society. Her expertise has also led her to hold positions at Apple, Intel, the Mayo Clinic, and Boston Children’s Hospital.

Speaker: Carlos Bustamante, PhD

Title: Why doing the right thing and diversifying clinical trials can unleash innovation in biopharma pipelines

Watch The Full Presentation Here

Abstract: Clinical Genetics Lacks Standard Definitions and Protocols for the Collection and Use of Diversity Measures.

Short bio: For the past 18 years, I have led a multidisciplinary team working on problems at the interface of computational and biological sciences. Much of our research has focused on genomics technology and its application in medicine, agriculture, and evolutionary biology. My first academic appointment was at Cornell University’s College of Agriculture and Life Sciences. There, much of our work focused on population genetics and agricultural genomics motivated by a desire to improve the foods we eat and the lives of the animals upon which we depend. I moved to Stanford in 2010 to focus on enabling clinical and medical genomics on a global scale. I have been focused on reducing health disparities in genomics by: (1) calling attention to the problem raised by >95% of participants in large scale studies being of European descent; and (2) broadening representation of understudied groups in large NIH funded consortia, particularly minority groups from the U.S., the Americas, and Africa. My work has empowered decision-makers to utilize genomics and data science in the service of improving human health and wellbeing. In the next phase of my career, I will focus on opportunities for bringing these technologies to consumers and patients, directly, where this work can have the greatest impact. I have a strong interest in building new academic units, non-profits, and companies. I was the Inaugural Chair of the Department of Biomedical Data Science—the first new department that Stanford has started in 14 years—and I was Founding Director (with Marc Feldman) of the Center for Computational, Evolutionary, and Human Genomics. I serve as an advisor to the US federal government, private companies, startups, and non-profits in the areas of computational genomics, population and medical genetics, veterinary and plant genomics, and business strategy.

Speaker: Megan Threats, PhD, MSLIS

Title: Toward health justice in informatics: a community-based, intersectional approach to HIV informatics intervention development

Abstract: June 2021 will mark 40 years since the first cases of what would later become known as acquired immunodeficiency syndrome (AIDS) were reported in the United States. Despite groundbreaking biomedical advancements in HIV prevention and treatment, the HIV/AIDS epidemic continues to disproportionately affect sexual and gender minority communities of color. In this talk, I will discuss the development of an HIV informatics intervention aimed at reducing inequities in linkage and retention in HIV prevention and care among sexual minority Black men in the South. I will present strategies for leveraging informatics to achieve health justice in the fight to end AIDS.

Bio: Dr. Megan Threats is an Assistant Professor in the Department of Library and Information Science at the School of Communication and Information at Rutgers University – New Brunswick. She is also Visiting Research Faculty at the Yale School of Public Health.

Speaker: Trevor Cohen, MBChB, PhD, FACMI

Title: Using Neural Language Representations to Detect Linguistic Anomalies in Neurodegenerative and Psychiatric Disease 

Watch The Full Presentation Here

Abstract: Language is uniquely positioned in mental health as both a focus of observation for clinical signs and symptoms, and a medium through which some forms of therapy are delivered.  Alzheimer’s Disease and other forms of dementia can also affect language production, for example by limiting access to more specific terms that describe the world in detail. In both cases, data from speech and text are increasingly available on account of the use of digital devices to mediate research and healthcare delivery. Neural language representations such as word embeddings, recurrent neural network language models, and contemporary transformer architectures have become a predominant point of focus in computational linguistics research. The models from which these representations are derived are typically trained on large amounts of unlabeled text, with training tasks involving predicting held-out terms that occur in proximity to observed ones. During the course of such training, much information about the typical use of language is learned. This information is of potential value for the detection of the atypical usage that may characterize certain clinical conditions. In this talk I will discuss our recent work in this area, with a focus on two areas of application: (1) a study of the responsiveness of deep neural networks that distinguish between responses to cognitive tasks from participants with and without Alzheimer’s Disease to known deficiencies in language production in this condition; and (2) the application of neural word embeddings to model language coherence in order to detect the disorganized thinking characteristic of episodes of psychosis in schizophrenia and other conditions. I will also more briefly touch on a range of related ongoing work involving efforts to model constructs that are of diagnostic or therapeutic importance in mental health.   


Background: Dr. Cohen trained and practiced as a physician in South Africa, before obtaining his PhD in 2007 in Medical Informatics at Columbia University. His doctoral work focused on an approach to enhancing clinical comprehension in the domain of psychiatry, leveraging distributed representations of psychiatric clinical text. Upon graduation, he joined the faculty at Arizona State University’s nascent Department of Biomedical Informatics, where he contributed to the development of curriculum for informatics students, as well as for medical students at the University of Arizona’s Phoenix campus. In 2009 he joined the faculty at the University of Texas School of Biomedical Informatics, where (amongst other things) he developed an NLM-funded research program concerned with leveraging knowledge extracted from the biomedical literature for information retrieval and pharmacovigilance, and contributed toward large-scale national projects such as the Office of the National Coordinator’s SHARP-C initiative, which supported a range of research projects that aimed at improving the usability and comprehensibility of electronic health record interfaces.

Research: Dr. Cohen’s research focuses on the development and application of methods of distributional semantics – methods that learn to represent the meaning of terms and concepts from the ways in which they are distributed in large volumes of electronic text. The resulting distributed representations (concept or word embeddings) can be applied to a broad range of biomedical problems, such as: (1) using literature-derived models to find plausible drug/side-effect relationships; (2) finding new therapeutic applications for known drugs (drug repurposing); (3) modeling the exchanges between users of health-related online social media platforms; and (4) identifying phrases within psychiatric narrative that are pertinent to particular diagnostic constructs (such as psychosis). An area of current interest involves the application of neural language models to detect linguistic manifestations of neurological and psychiatric conditions. More broadly, he is interested in clinical cognition – the thought processes through which physicians interpret clinical findings – and ways to facilitate these processes using automated methods.

Speaker: Tian Kang, MA, MPhil (PhD Student) – Dr. Chunhua Weng’s Lab 

Title: Exploring the Synergy of Neural and Symbolic Methods for Understanding Free-text Medical Evidence

Abstract: Recent state-of-the-art results in NLP have been achieved predominantly by deep neural networks. However, their reasoning capabilities are still rather limited compared to symbolic AI when facing reading comprehension tasks. I propose Medical evidence Dependency (MD)-informed Attention, a Neuro-Symbolic model for understanding free-text medical evidence, such as clinical trial publications. One head in the Multi-Head Self-Attention model is trained to attend to Medical evidence Dependencies (MD) and pass linguistic and domain knowledge onto later layers (MD-informed). We integrated MD-informed Attention into BioBERT and evaluated it on two public machine reading comprehension benchmarks for clinical trial publications. The integration of the MD-informed Attention head improves BioBERT substantially on both benchmarks, with an increase of as much as 30% in the F1 score, and achieves new state-of-the-art performance. MD-informed Attention empowers neural reading comprehension models with interpretability and generalizability via reusable domain knowledge. Its compositionality can benefit any Transformer-based NLP model for reading comprehension of free-text medical evidence.

Speaker: Victor Rodriguez, MA, MPhil (MD/PhD Student) – Dr. Adler Perotte’s Lab

Title: Training Deep Generative Models with Partially Observed Data

Abstract: Most deep generative models (DGMs) require fully observed data to train. Yet, data routinely contain missing values. This incompatibility motivates the development of inference algorithms which assume only partially observed data at training time. In this talk, I will present on-going work developing such algorithms for DGMs (specifically, Variational Autoencoders) and discuss preliminary results using data for which the missingness mechanism is ignorable. I also propose extensions to a) handle non-ignorable missingness mechanisms, which are common in clinical data sets and b) model labels for supervised disease phenotyping tasks.

Speaker: Elliot G. Mitchell, MA, MPhil (PhD Student) – Dr. Lena Mamykina’s Lab

Title: Automated Conversational Health Coaching: Work in Progress

Abstract: There is a need for automated health coaching solutions to support individuals living with chronic conditions in making everyday nutrition decisions. My research explores methods to enable automated health coaching via conversational interactions, like chatbots. In this presentation I describe work in progress towards the necessary components of a health coaching chatbot, including the need to assess users’ goal attainment automatically, to offer feedback to users on goal attainment, and to provide suggestions when users do not meet their goals. I propose a set of computational methods to achieve these aims, including crowdsourcing, active sensing, attention, and clustering. This approach can lead to the development of an automated health coach with the potential to help individuals achieve their health goals over time.

Speaker: Eugene Lucas, MD (Fellow) – Dr. Bruce Forman’s Lab

Title: Life as a Clinical Informatics Fellow

Abstract: Dr. Lucas will present an introduction to the Clinical Informatics fellowship and provide an overview of several projects he has led and worked on including: [1] leading the integration of a 3rd party application with the EHR, [2] identifying and managing Living Status discrepancies in the EHR, and [3] the development/kick off of the “25 By 5: Symposium to Reduce Documentation Burden on U.S. Clinicians by 75% by 2025.”

Speaker: Dr. Manuel Rivas, DPhil – Stanford University

Title: Genomic prediction and inference from population-scale datasets 

Watch The Full Presentation Here

Abstract: Clinical laboratory tests are a critical component of the continuum of care and provide a means for rapid diagnosis and monitoring of chronic disease. In this study, we systematically evaluated the genetic basis of 35 blood and urine laboratory tests measured in 358,072 participants in the UK Biobank and identified 1,857 independent loci associated with at least one laboratory test, including 488 large-effect protein truncating, missense, and copy-number variants. We then causally linked the biomarkers to medically relevant phenotypes through genetic correlation and Mendelian Randomization. Finally, we developed polygenic risk scores (PRS) for each biomarker and built multi-PRS models using all 35 PRSs simultaneously. We assessed sex-specific genetic effects and found striking patterns for testosterone, with marked improvements in prediction when training a sex-specific model. We found substantially improved prediction of incidence in FinnGen (n=135,500) with the multi-PRS relative to single-disease PRSs for renal failure, myocardial infarction, type 2 diabetes, gout, and alcoholic cirrhosis. Together, our results show the genetic basis of these biomarkers, which tissues contribute to the biomarker function, the causal influences of the biomarkers, and how we can use this to predict disease.

Bio:  Dr. Rivas is an Assistant Professor in the Department of Biomedical Data Science at Stanford University in Stanford, California. He has a Bachelor of Science in Mathematics from the Massachusetts Institute of Technology and a Doctor of Philosophy in Human Genetics from the Nuffield Department of Clinical Medicine at Oxford University where he was a Clarendon Scholar.  He also did additional training at the Broad Institute in Cambridge, Massachusetts where he led the Helmsley Inflammatory Bowel Disease Exome Sequencing Program to understand the genetic factors that contribute to ulcerative colitis and Crohn’s disease risk.


Speaker: Dr. Terika McCall, PhD, MPH, MBA – Yale University 

Title: mHealth for Mental Health: User-Centered Design and Usability Testing of a Mental Health Application to Support Management of Anxiety and Depression in African American Women

Abstract: African American women experience rates of mental illness comparable to the general population (20.6% vs. 19.1%); however, they significantly underutilize mental health services compared to their white counterparts (10.2% vs. 27.2%). Past studies exploring the use of smartphone mental health interventions to reduce anxiety or depressive symptoms revealed that participants experienced significant reduction in anxiety or depressive symptoms post-intervention. Since African American women are comfortable with participating in mHealth research and interventions, and 80% of African American women own smartphones, there is great potential to remedy the disparities in mental health service utilization by leveraging use of smartphones for information dissemination, and delivery of mental health services and resources. My talk will focus on user-centered recommendations for content and features that should be included in a smartphone application culturally-tailored to support management of anxiety and depression in African American women. I will also discuss the results of usability testing of an initial prototype of the app.

Bio: Dr. McCall is a National Library of Medicine Biomedical Informatics and Data Science Postdoctoral Fellow at Yale Center for Medical Informatics. Her research focuses on reducing disparities in mental health service utilization through use of technology. Dr. McCall’s research is interdisciplinary and focuses on issues related to the acceptance, design, development, and use of mHealth applications for mental wellness.

2020 Fall Seminars

Speaker: Tony Y. Sun, MA (PhD Student) – Dr. Noémie Elhadad’s Lab

Title: Systematically quantifying and analyzing the impact of time-to-diagnosis disparities on the diagnostic process

Brief Abstract: In recent healthcare literature, a number of studies have illuminated how sex and gender-based healthcare disparities contribute to differences in health outcomes [e.g. ten year mortality for women after the WISE study]. In this talk, I’ll be focusing on how we systematically quantified time-to-diagnosis disparities across phenotypes, and how we analyzed the impact of these disparities on the diagnostic process. Our quantification of time-to-diagnosis disparities showed that, for patients that would go on to enter the same phenotype at CUMC, women are consistently diagnosed later than men for the majority of the same presenting symptoms. To analyze the impact of these disparities on the diagnostic process, we trained gender-agnostic classifiers for each disease using patients’ presenting symptoms. We assessed how the fairness gap changes with incrementally changed amounts of data. Despite our earlier finding that women present with symptoms earlier than men, the majority of these gender-agnostic classifiers paradoxically performed better for men than for women.

Speaker: Linying Zhang, MS, MA (PhD Student) – Dr. George Hripcsak’s Lab

Title: Adjusting for Unobserved Confounding Using Large-Scale Propensity Score

Brief Abstract: Even though observational data can now contain an enormous number of covariates, the existence of unobserved confounders still cannot be excluded and remains a major barrier to drawing causal inferences from observational data. Recently, analyses using large-scale propensity score (LSPS) adjustment have demonstrated examples of adjusting for unobserved confounding by including hundreds of thousands of available covariates. In this paper, we present the conditions under which LSPS can reduce bias due to unobserved confounders. In addition, we show that LSPS does not adjust for various unwanted variables (e.g., M-bias colliders, instruments). We demonstrate the performance of LSPS on bias reduction using both simulations and real medical data.

Speaker 1: Amanda J. Moy, MPH, MA (PhD student) – Dr. Sarah Collins Rossetti’s (OPTACIMM) Lab

Title: Measuring clinical documentation burden among physicians and nurses: a review of the literature 

Abstract: Rapid adoption of electronic health records (EHRs) following the passage of the HITECH Act has led to advances in both individual- and population-level health. Still largely in their infancy, EHRs have also resulted in unintended consequences on clinical practice and healthcare systems, including significant increases in clinician documentation time. Extended work hours, time constraints, clerical workload, and disruptions to the patient-provider encounter have led to a rise in discontent with existing documentation methods in EHR systems. This documentation burden (hereinafter referred to as “burden”) has been linked to increases in medical errors, threats to patient safety, inferior documentation quality, and ultimately, burnout among nurses and physicians. Few empirically based, readily available solutions to reduce burden exist, and to the best of our knowledge, there is no consensus on the best approaches to measure burden. Furthermore, the concept of burden has been ill-defined and poorly operationalized. Achieving the three primary goals (cited in the 21st Century Cures Act) to reduce EHR-related clinician burdens that influence care will necessitate standardized, quantitative measurements to evaluate impact. The purpose of this scoping review is to assess the state of science, identify gaps in knowledge, and synthesize characteristics of burden measurement among physicians and nurses using EHRs.

Speaker 2: James Rogers, MS, MA, MPhil (PhD student) – Dr. Chunhua Weng’s Lab

Title: Comparison of trial participants and non-participants using electronic health record data

Abstract: Clinical trials are medical research studies in which participants are assigned to receive one or more interventions so that researchers can evaluate the interventions’ effects. They are essential for the development of medical evidence, but are susceptible to a variety of challenges. One such challenge is generalizability, which refers to the ability to apply the conclusions of a study to a different set of relevant patients outside the context of that study. Assessing generalizability of clinical trials is important because differences in underlying clinical characteristics can impact the estimated effect of the interventions, ultimately impacting their clinical meaningfulness. However, most contemporary assessments provide minimal granularity on clinical comparisons. In this presentation, I will explore an alternative approach that combines electronic health record (EHR) data with enrollment data from prior clinical trials, while also highlighting potential implications that emerge from the results of this study.

Title: Machine learning for mental healthcare: a human-centered approach

Abstract: Machine learning advances are opening new routes to more precise healthcare, from the discovery of disease subtypes for stratified interventions to the development of personalized interactions supporting self-care between clinic visits. This offers an exciting opportunity for machine learning techniques to impact healthcare in a meaningful way. Within the healthcare domain, machine learning for mental healthcare is an under-investigated area and yet a potentially highly impactful area of research. In this talk, I will present recent work on probabilistic graphical modeling to enable a more personalized approach to mental healthcare, whereby information can be aggregated from multiple sources within a unified modeling framework. We present a human-centered approach to mental healthcare which is aimed at increasing the effectiveness of psychological wellbeing practitioners.

Bio: Dr. Danielle Belgrave is a Principal Research Manager at Microsoft Research in Cambridge (UK), in the Health Intelligence group, where she leads Project Talia. She is particularly interested in integrating medical domain knowledge into probabilistic graphical models to develop personalized treatment strategies in health. Originally from Trinidad and Tobago, she received her BSc in Mathematics and Statistics from London School of Economics, an MSc in Statistics from University College London and her PhD in Machine Learning and Statistics for Healthcare from The University of Manchester, where she was a Microsoft Research PhD scholar. Prior to joining Microsoft Research, she had a tenured faculty position at Imperial College London.

Saba Akbar
Australian Institute of Health Innovation
Macquarie University

Effects of automation on risk identification and nurses’ decision making

Watch The Recording Here 

Abstract: Electronic Decision Support Systems (DSS) can facilitate the five steps of the nursing care process (NCP): assessment, problem identification, planning, intervention, and evaluation. At each of these steps, nurses are required to process information and make complex decisions. DSS also present opportunities to support human information processing, which can be broken down into four distinct functions – information acquisition, information analysis, decision selection and action implementation. For instance, to assess problem risks, nurses need to acquire information about a patient’s history and physical health, analyze risk status, decide on suitable management strategies, and implement them. While current DSS have the capacity to automate information analysis and decision selection, they require nurses to manually perform other tasks. In this project, we reviewed evidence on the effects of automation in DSS on patient outcomes, care delivery and nurses’ decision making. Next, we interviewed nurses to explore their perceptions about existing DSS for risk assessments of falls and pressure injuries, which are among the top hospital-acquired complications in Australia. Finally, we designed a simulated DSS that automates these risk assessments.

Due to the 2020 AMIA Conference, there was no seminar on Nov. 16.

Trey Ideker

Professor, Department of Medicine; Adjunct Professor, Departments of Bioengineering and Computer Science; Co-Director, Bioinformatics and Systems Biology PhD Program

University of California San Diego

Title: Interpreting the cancer genome through physical and functional models of the cancer cell

Abstract: Recently we and other laboratories have launched the Cancer Cell Map Initiative (CCMI) and have been building momentum. The goal of the CCMI is to produce a complete map of the gene and protein wiring diagram of a cancer cell. We and others believe this map, currently missing, will be a critical component of any future system to decode a patient’s cancer genome. I will describe efforts along several lines: 1. Coalition building. We have made notable progress in building a coalition of institutions to generate the data, as well as to develop the computational methodology required to build and use the maps. 2. Development of technology for mapping gene-gene interactions rapidly using the CRISPR system. 3. Causal network maps connecting DNA mutations (somatic and germline, coding and noncoding) to the cancer events they induce downstream. 4. Development of software and database technology to visualize and store cancer cell maps. 5. A machine learning system for integrating the above data to create multi-scale models of cancer cells. In a recent paper by Ma et al., we have shown how a hierarchical map of cell structure can be embedded with a deep neural network, so that the model is able to accurately simulate the effect of mutations in genotype on the cellular phenotype.

Dr. Ideker Bio: Dr. Ideker is a Professor in the Departments of Medicine, Bioengineering and Computer Science at UC San Diego. Additionally, he is the Director or Co-Director of the National Resource for Network Biology (NRNB), the Cancer Cell Map Initiative (CCMI), the Psychiatric Cell Map Initiative (PCMI), and the UCSD Bioinformatics PhD Program, and former Chief of Genetics in the Department of Medicine. He is a pioneer in using genome-scale measurements to construct network models of cellular processes and disease. The Ideker Laboratory seeks to create artificially intelligent models of cancer and other diseases for the translation of patient data to precision diagnosis and treatment. 

Due to Election Day, there was no seminar on Nov. 2.

Daniel Prieto-Alhambra

Prof. of Pharmaco- and Device Epidemiology, University of Oxford

Watch The Recording Here

Title: OHDSI-EHDEN Joint COVID-19 Collaboration: Global Real-World Data to Fight COVID-19 

Due to Columbia’s involvement with the 2020 OHDSI Symposium, there was no seminar on Oct. 19.

DBMI Student Town Hall

Steve Labkoff  

Watch The Recording Here

Title:  Real-world Informatics Challenges in Building a Real-World Oncology Registry: The Multiple Myeloma Research Foundation’s CureCloud Experience

Abstract: One of the biggest impediments to personalized medicine is having enough data about a given disease process in order to explore that disease from multiple perspectives – such as genomics, EHR and immunologics. In 2017, the Multiple Myeloma Research Foundation, building on the previous successes of its CoMMpass Clinical Trial, sought to build a registry with five times the number of participants it had in CoMMpass. It took on a number of tenets that proved exceptionally challenging for this work, including the desire to work directly with patients, return clinical genomic data to patients and their clinicians, and aggregate data from a large array of data sources. In July 2020, the CureCloud Direct-to-Patient Registry opened for patient recruitment. After just 2 months, the registry has over 250 registrants. Getting this registry opened for recruitment demonstrated the numerous challenges in working across the US with “all comers”, dealing with the vast array of EHR vendors, standing up a new CLIA-validated bioinformatics pipeline, and getting the data ultimately returned to patients. This talk will discuss the many real-world challenges and solutions put into place in standing up this program from an informatics, regulatory, legal and clinical perspective.

Vimla Patel 

Watch The Recording Here

Title: Medical Expertise: Why and when is explanation needed?

Abstract: Since medical practice is a human endeavor, rapid technologic advances create a need to bridge disciplines to enable clinicians to benefit from them. In turn, this necessitates a broadening of disciplinary boundaries to consider cognitive and social factors related to the design and use of technology in the medical context. My awareness of these issues began when I started investigating the development of models of medical expertise and the symbolic representation of medical knowledge in the late 1980s. The last 30 years of multidisciplinary research on medical cognition in my laboratory have shown the remarkable importance of cognitive factors that determine how health professionals comprehend information, solve problems, and make decisions. These investigations into the process of medical reasoning have made significant contributions to the design of clinical AI systems. These systems offer great potential for progress to improve people’s health and well-being, but their adoption in clinical practice is still limited. A lack of transparency in these systems is identified as one of the main barriers to their acceptance. My talk will elaborate on what we have learned about how medical practitioners acquire, understand, explain, and utilize expertise, focusing on cognitive-psychological methods and frameworks. It will also discuss how such work elucidates key lessons and challenges for the development of usable, useful, and safe decision-support systems to augment human intelligence in the clinical world.

Bio: Read more about Vimla here. Her web site is here

2020 Spring Seminars

Dr. Melanie Wall

Title: Predicting service use and functioning for people with first episode psychosis in coordinated specialty care (due to technology error, this video isn’t available, though Dr. Wall’s presentation slides are available here)

Abstract: A key initiative in research focused on treatment for first episode psychosis (FEP) is improving the implementation of evidence-based coordinated specialty care (CSC). One area of improvement is expected to come from improved data analytics facilitated by linking different clinical sites through common data elements and a unified informatics approach for aggregating and analyzing patient level data. The present study examines to what extent predictive modeling of patient-level outcomes based on background variables collected at intake and throughout care can be used to differentiate individuals in a way that is useful. Using data from 600 FEP patients from 15 different CSC sites, we will develop and compare several machine learning models for predicting multivariate, correlated outcomes across one year of care. Presentation of results will focus on interpretability of differential prediction across sites and usefulness for facilitating service decisions.

Bio: Melanie Wall is Professor of Biostatistics and Director of Mental Health Data Science (MHDS) in the New York State Psychiatric Institute (NYSPI) and Columbia University psychiatry department. MHDS is made up of a team of 15 biostatisticians collaborating on predominantly NIH (NIMH/NIH/NIAAA/NIDA) funded research projects related to psychiatry. She has worked extensively with modeling complex multilevel and multimodal data on a wide array of psychosocial public health and psychiatric research questions in both clinical studies and large epidemiologic studies (over 300 total journal publications). She is an expert in longitudinal data analysis and latent variable modeling, including structural equation modeling focused on mediating and moderating (interaction) effects, where she has made many methodological contributions. She has a long track record as a biostatistical mentor for Ph.D. students and NIH K awardees and regularly teaches graduate level courses in the Department of Biostatistics in the Mailman School of Public Health attended by clinical Masters students, Ph.D. students, post-docs, and psychiatry fellows. Her current research mission is improving the accessibility and application of state-of-the-art and reproducible statistical methods across different areas of psychiatric research.

Oliver Bear Don’t Walk

TITLE: Comparing the Impact of Transfer Learning Between Clinical Care Institutions on Clinical Note Classification Tasks

ABSTRACT: Performing transfer learning with neural networks such as BERT, ELMo and GPT has led to state-of-the-art results in the clinical domain on many natural language processing applications. Performing transfer learning with these kinds of models often includes task-agnostic pre-training and then fine-tuning on a specific downstream task. However, previous work has found that pre-training at one institution and fine-tuning on a downstream task at another can lead to decreased performance on the downstream task. Differences between clinical institutions (e.g. patient population, documentation practices, clinical specialties, provider roles) can affect clinical corpus qualities and lead to intra-domain variation between institutions. Intra-domain variation could be a contributing factor to downstream task performance degradation when performing transfer learning across institutions. To the best of our knowledge, we present the first experiments focused on performing transfer learning with BERT models between two institutions and compare performance differences on downstream tasks at each institution. We confirm the previous finding that BERT performs better on downstream tasks at the institution where it was most recently pre-trained, which holds true for both institutions in our experiments. We also found that consecutive pre-training on clinical corpora further improves downstream task performance if the most recent pre-training corpus and downstream task corpus are from the same institution. This performance increase is at the expense of decreased performance on the previous institution’s downstream task corpus, a phenomenon known as catastrophic forgetting.

Shreyas Bhave

TITLE: Deep Survival Analysis: Regularization and Missingness with Non Parametric Survival Distributions

ABSTRACT: Survival analysis methods have long been used to effectively model time-to-event data. In the healthcare setting, the Framingham risk score is a salient use case in which 10-year risk of cardiovascular disease is estimated using a narrow set of clinical features. In order to use a more expanded set of clinical features from the EHR for survival analysis, a number of challenges must be addressed: (1) there is a high degree of missingness in EHR data (2) there is no natural event to align all the data (3) many nonlinear relationships likely exist between clinical features. Deep survival analysis (DSA) is an approach for addressing these issues by leveraging a deep conditional model of failure time. However, questions about how different levels and kinds of missingness affect out-of-sample prediction remain largely unexplored. Furthermore, the best approach for regularizing a model with such high capacity is empirically untested. We leverage extensions to this model which relax the distributional assumptions to fit a non-parametric survival distribution. Using this model, we run experiments on different methods of regularization and explore the effects of censorship as well as different types of missingness on model robustness. Initial results show promise with DSA outperforming baseline methods such as Cox regression. In the future, we hope to explore alternative methods of non parametric modeling (e.g. normalizing flows), simulate more clinically realistic scenarios of missingness and apply the model to EHR data from Columbia and NYU.

Dr. Jun Kong

Title: Multi-Dimensional Histopathology Image Analysis for Cancer Research

Abstract: In biomedical research, the availability of an increasing array of high-throughput and high-resolution instruments has given rise to large datasets of imaging data. These datasets provide highly detailed views of tissue structures at the cellular level and present a strong potential to revolutionize biomedical translational research. However, traditional human-based tissue review is not a feasible way to extract this wealth of imaging information due to the overwhelming data scale and unacceptable inter- and intra-observer variability. In this talk, I will first describe how to efficiently process two-dimensional (2D) digital microscopy images for highly discriminating phenotypic information with development of microscopy image analysis algorithms and Computer-Aided Diagnosis (CAD) systems for processing and managing massive in-situ micro-anatomical imaging features with high performance computing (HPC). Additionally, I will present novel algorithms to support three-dimensional (3D), molecular, and time-lapse microscopy image analysis with HPC. Specifically, I will demonstrate an on-demand registration method within a dynamic multi-resolution transformation mapping and an iterative transformation propagation framework. This will allow us to efficiently scrutinize volumes of interest on-demand in a single 3D space. For segmentation, I will present a scalable segmentation framework for histopathological structures with two steps: 1) initialization with joint information drawn from spatial connectivity, edge map, and shape analysis, and 2) variational level-set based contour deformation with data-driven sparse shape priors. For 3D reconstruction, I will present a novel cross section association method leveraging Integer Programming, Markov chain based posterior probability modelling and Bayesian Maximum A Posteriori (MAP) estimation for 3D vessel reconstruction.
I will also present new methods for multi-stain image registration, biomarker detection, and 3D spatial density estimation for molecular imaging data integration. For time-lapse microscopy images, I will present a new 3D cell segmentation method with gradient partitioning and local structure enhancement by eigenvalue analysis of the Hessian matrix. A derived tracking method will also be presented that combines Bayesian filters with a sequential Monte Carlo method with joint use of location, velocity, 3D morphology features, and intensity profile signatures. Our proposed methods for 2D, 3D, molecular, and time-lapse microscopy image analysis will help researchers and clinicians extract accurate histopathology features, integrate spatially mapped pathophysiological biomarkers, and model disease progression dynamics at high cellular resolution. Therefore, they are essential for improving clinical decisions, enhancing prognostic predictions, inspiring new research hypotheses, and realizing personalized medicine.

Bio: Dr. Kong is an Associate Professor in the Department of Mathematics and Statistics and the Department of Computer Science at Georgia State University, and adjunct faculty in the Department of Biomedical Informatics, the Department of Computer Science, and the Winship Cancer Institute at Emory University. Dr. Kong’s research interests focus on big imaging data analytics for modeling cancer diseases, multi-modal biomedical image analysis, computer-aided diagnosis, machine learning, computational biology, and large-scale translational bioinformatics with heterogeneous data integration and mining. His long-term research goal is to establish an interdisciplinary research program engaged with mathematicians, biostatisticians, computer scientists, biologists, pathologists, and oncologists, among other domain experts, for computational disease characterization, accurate modeling analysis, and granular-resolution understanding of diseases with large-scale, multi-modal, and multi-scale biomedical data.

Watch the presentation here

Dr. Olga Troyanskaya

Professor of Computer Science and the Lewis-Sigler Institute for Integrative Genomics, Princeton University

Title: The quest for deep knowledge – decoding the human genome with deep learning models 

Abstract:  A key challenge in medicine and biology is to develop a complete understanding of the genomic architecture of disease. Yet the increasingly wide availability of ‘omics’ and clinical data, including whole genome sequencing, has far outpaced our ability to analyze these datasets. Challenges include interpreting the 98% of the genome that is noncoding to identify variants that are functional and may lead to disease, detangling genomic signals regulating tissue-specific gene expression, mapping the resulting genetic circuits and networks in disease-relevant tissues and cell types, and, finally, integrating the vast body of biological knowledge from model organisms with observations in humans. I will discuss methods that address these challenges, and highlight their applications to neurodevelopment and neurodegenerative diseases.

Lisa Grossman

Title: Interventions to Increase Patient Portal Use in Vulnerable Populations: A Systematic Review

Abstract: Background: More than 100 studies document disparities in patient portal use among vulnerable populations. Developing and testing strategies to reduce disparities in use is essential to ensure portals benefit all populations.

Objective: To systematically review the impact of interventions designed to (1) increase portal use or predictors of use in vulnerable patient populations, or (2) reduce disparities in use.

Methods: A librarian searched Ovid MEDLINE, EMBASE, CINAHL, and Cochrane Reviews for studies published before September 1st, 2018. Two reviewers independently selected English-language research articles that evaluated any interventions designed to impact an eligible outcome. One reviewer extracted data and categorized interventions, and another assessed accuracy. Two reviewers independently assessed risk of bias.

Results: Out of 18 included studies, 15 (83%) assessed an intervention’s impact on portal use, 7 (39%) on predictors of use, and 1 (6%) on disparities in use. Most interventions studied focused on the individual (13 out of 26, 50%), as opposed to facilitating conditions, such as the tool, task, environment, or organization (SEIPS model). Twelve studies (67%) reported a statistically significant increase in portal use or predictors of use, or reduced disparities. Five studies (28%) had high or unclear risk of bias.

Conclusion: Individually-focused interventions have the most evidence for increasing portal use in vulnerable populations. Interventions affecting other system elements (tool, task, environment, organization) have not been sufficiently studied to draw conclusions. Given the well-established evidence for disparities in use and the limited research on effective interventions, research should move beyond identifying disparities to systematically addressing them at multiple levels.

Anna Ostropolets 

Title: The Data Consult Service: an opportunity to bring new evidence to the bedside.

Abstract: Evidence-based medicine facilitates clinical care standardization, reduces medical care misuse and overuse and eventually leads to health care cost reduction and improvement in effectiveness and quality of care. However, current evidence has been reported to be inadequate or missing for specific clinical cases. Randomized clinical trials, which are the gold standard of clinical evidence, are often not generalizable to real-world patients and fail to include patients with multiple co-morbidities, patients who are pregnant, the elderly, and other vulnerable populations. At the same time, a growing body of observational data, along with the continuing accumulation of practice-based evidence, has made new approaches to evidence generation available. We will present our first steps in developing a Data Consult Service – a clinical decision support tool that uses observational data to answer clinicians’ questions in real time. We will discuss our work on discovering potential areas of use and target groups for this tool as well as first answered questions and future work.

2019 Fall Seminars

TITLE: Using Genetics to Address the Challenges of 21st Century Drug Development

BIO: Michael N. Cantor, MD, MA is Executive Director, Clinical Informatics, at the Regeneron Genetics Center. Currently his work focuses on developing and optimizing phenotypes from EHR and cohort data and linking them with genetic data to help discover new drug targets. Prior to Regeneron, he was Director of Clinical Research Informatics at New York University School of Medicine. As Director of Clinical Research Informatics, he was also the clinical director for NYULH’s DataCore, where his work focused on data management for clinical trials, using data from clinical systems for research, and advanced analytics. His research interests include integrating and standardizing social determinants of health-related data into the EHR, optimizing informatics tools for frontline clinicians, and providing self-service data access tools for researchers. During his previous tenure at NYU, Dr. Cantor was the Chief Medical Information Officer for the South Manhattan Healthcare Network of the New York City Health and Hospitals Corporation, based at Bellevue, and saw patients and precepted at the medical clinic there. Dr. Cantor completed his residency in internal medicine and informatics training at Columbia, has an M.D. from Emory University and an A.B. from Princeton, and is an Associate Professor in the Department of Medicine at NYU School of Medicine. He currently sees patients weekly at Bellevue’s medicine clinic.

Speaker:  Jonathan Elias, MD, Clinical Informatics Fellow

Title:  A Day in the Life of a Clinical Informatics Fellow: CI Fellowship, Epic Together’s Mobile Messaging and Provider Team Project and the Epic Together Pre- & Post-Implementation Study

Abstract:  Per AMIA, Clinical Informatics (CI) is the application of informatics and information technology to deliver healthcare services. The CI Fellowship is a two-year ACGME accredited fellowship now being offered to one candidate a year through NYP CUMC, after completion of a medical residency. During this seminar, the fellowship structure and goals with example projects and research will be discussed.

A large area of focus of the fellowship is operational CI projects and academic research. Currently, Columbia University Medical Center (CUMC), NewYork-Presbyterian (NYP) and Weill Cornell Medical Center (WCM) are preparing to implement an enterprise-wide clinical information system, the EpicCare© Electronic Health Record (EHR). With the implementation of the EpicCare© EHR, there is an opportunity to improve, streamline and standardize role delineation, clinical communications and patient assignment across the EHR and secure mobile messaging platforms. The goals and processes associated with this project will be discussed.

Finally, a brief overview and update of the Epic Pre- & Post-Implementation Study will be presented. The overall purpose of this study is to evaluate clinical workflows, process efficiencies, EHR utilization, data quality and overall perceived system usability after implementation of Epic at NYP/CUMC/WCM compared to systems in place prior to Epic implementation. This project comprises three specific aims, outlined below, with associated high-level approach and metrics. Aim 1: Conduct a pre-post time-motion study focused on inpatient and outpatient settings (including the emergency department) to identify documentation workflow and time changes after Epic EHR implementation. Aim 2: Conduct log-file analyses to measure process efficiencies, EHR utilization (e.g., documentation time), and EHR data quality metrics. Aim 3: Administer a survey to measure and compare health professionals’ perceived usability and satisfaction pre- and post-Epic implementation in the context of functionality to enhance the delivery of continuity of care and adaptation to new health information technology (HIT).


Speaker:  Jiayao Wang, PhD Student, Dr. Dennis Vitkup’s Lab

Title:  Contribution of recessive genotypes and common variants to autism spectrum disorder

Abstract: Autism spectrum disorder (ASD) is a genetically heterogeneous condition, caused by a combination of rare de novo and inherited variants as well as common variants in at least several hundred genes. However, significantly larger sample sizes are needed to identify the complete set of genetic risk factors. The contribution from inherited variants also needs to be further investigated. Here we present results from SPARK, a cohort of ~9K families with ASD, all consented online. Whole exome sequencing (WES) and genotyping data were generated for each family using DNA from saliva. With exome sequencing data and a simple statistical framework, we show a weak contribution from recessive genotypes, as well as several significant recessive genes that lead to autism, such as EIF3F and RELN. With genotype array data, we performed GWAS with the transmission disequilibrium test and calculated polygenic risk scores for SPARK families. We show that autism probands have a significantly higher polygenic risk compared to their siblings, and that the risk is spread all over the genome rather than coming only from significant loci. The contribution from recessive genotypes and common variants, together with rare inherited variants and de novo mutations from the SPARK project, will complete our understanding of the genetics of autism.
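For readers unfamiliar with the transmission disequilibrium test mentioned in the abstract, the following is a minimal sketch of the classic single-variant form of the test. It is not the speaker's pipeline; the function name `tdt` and the example counts are hypothetical. The statistic compares how often heterozygous parents transmit an allele to an affected child versus how often they do not, and is referred to a chi-square distribution with one degree of freedom.

```python
import math


def tdt(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Classic transmission disequilibrium test for one allele.

    transmitted / untransmitted: counts of heterozygous parents who did or
    did not pass the allele of interest to an affected child (hypothetical
    inputs for illustration).
    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    total = transmitted + untransmitted
    if total == 0:
        raise ValueError("need at least one informative transmission")
    # McNemar-type statistic: (b - c)^2 / (b + c)
    chi2 = (transmitted - untransmitted) ** 2 / total
    # Survival function of a chi-square with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


if __name__ == "__main__":
    chi2, p = tdt(transmitted=60, untransmitted=40)
    print(f"chi2={chi2:.2f}, p={p:.4f}")
```

Under the null hypothesis of no association, an allele is transmitted about half the time, so a large imbalance between the two counts yields a large statistic and a small p-value.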

There was no seminar on Nov. 25.

No seminar due to the AMIA Symposium.

Video: Watch the presentation here

Title: Oops! I’m on the wrong patient: Evaluating System-Level Interventions for Preventing Wrong-Patient Electronic Orders

Bio: Dr. Adelman’s Patient Safety Research Program began with the development of the Wrong-Patient Retract-and-Reorder (RAR) Measure—a valid and reliable method of quantifying the frequency of wrong-patient orders placed in electronic ordering systems. The Wrong-Patient RAR measure was the first automated measure of medical errors and the first Health IT Safety Measure endorsed by the National Quality Forum. The RAR method identifies thousands of near-miss, wrong-patient errors per year in large health systems, enabling researchers to test interventions to prevent this type of error.

The Wrong-Patient RAR measure has been used to evaluate the effectiveness of patient safety interventions in several studies conducted in different electronic health record systems and clinical settings, including in the neonatal intensive care unit (NICU). The measure is the primary outcome measure for research supported by the Agency for Healthcare Research and Quality (R21HS023704, R01HS024945) and the National Institute of Child Health and Human Development (R01HD094793). Additional research is underway to extend the RAR methodology to other types of errors, such as wrong-drug errors, and develop new health IT safety measures (R01HS024538).

Results of Dr. Adelman’s research led to national patient safety guidance, including a recommendation issued by the Office of the National Coordinator for Health Information Technology that healthcare organizations use the Wrong-Patient RAR measure to monitor the frequency of wrong-patient orders. Effective 2019, The Joint Commission will require that hospitals adopt a distinct newborn naming convention that incorporates the mother’s first name, based on studies by Adelman and colleagues.
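As a rough illustration of how a retract-and-reorder count might be computed from order logs, here is a minimal sketch. It is not Dr. Adelman's implementation. The `Order` record, the function name `count_rar_events`, and the 10-minute windows are assumptions made for this example; the text above does not specify the time thresholds or the log format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Order:
    # Hypothetical log record; the field names are illustrative only.
    clinician: str
    patient: str
    item: str
    placed: datetime
    retracted: Optional[datetime] = None


def count_rar_events(orders: List[Order],
                     window: timedelta = timedelta(minutes=10)) -> int:
    """Count orders that were retracted soon after placement and then
    re-placed by the same clinician for a different patient.

    The 10-minute default window is an assumption for illustration.
    """
    events = 0
    for first in orders:
        if first.retracted is None:
            continue  # never retracted
        if first.retracted - first.placed > window:
            continue  # retracted too late to count as a quick retraction
        for second in orders:
            gap = second.placed - first.retracted
            if (second is not first
                    and second.clinician == first.clinician
                    and second.item == first.item
                    and second.patient != first.patient
                    and timedelta(0) <= gap <= window):
                events += 1
                break  # count each retracted order at most once
    return events
```

Each counted event is a near miss: the clinician caught the error before it reached the patient, which is why such events can be tallied automatically from the logs.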

Due to the Election Day holiday on Tuesday, there is no Seminar today.

This is a DBMI Student Town Hall.

Speaker: Alex Kitaygorodsky, PhD Student, Dr. Yufeng Shen’s Lab

Title: Identification of disease-causing genetic mutations based on machine learning and large genomic data sets

Abstract: More than 3% of young children are born with developmental disorders such as congenital heart disease (CHD), congenital diaphragmatic hernia (CDH), and autism spectrum disorder (ASD). Understanding the genetic causes of these conditions is critical to improve health care for these children and to push forward human developmental biology and neuroscience. Recently, high-throughput sequencing technologies have enabled generation of large-scale genomic data in genetic studies of these conditions. However, translating human data to knowledge is challenging due to an incomplete understanding of biology and a lack of sufficiently powerful analytical methods. My work aims to develop new computational methods based on powerful machine learning techniques to interpret genome sequencing data and identify disease-causing genetic variations. In this talk, I will focus specifically on the role of regulatory non-protein coding mutations in CHD, where we have found a substantial role of variants disrupting RNA binding protein (RBP) binding sites. RBPs oversee normal regulation of gene expression, at both the transcriptional and especially post-transcriptional stages, and so their disruption via mutation represents an important but under-studied noncoding action mechanism. To better understand the observed enrichment in these sites, we first modeled RNA binding protein processes with a robust convolutional neural network. Then, we designed a gradient boosting super-model to integrate predicted RBP binding scores with multimodal genomic data, allowing us to predict pathogenic RBP and gene regulation disruption caused by individual mutations. Finally, we applied our model back to Whole Genome Sequencing data of autism and CHD to find new disease risk genes and improve genetic diagnosis. 
In summary, we leveraged large genomic datasets with a sophisticated machine learning approach to better analyze sequencing data, advance genomic medicine, and aid our understanding of developmental disorder genetics.


Speaker: Sylvia Cho, PhD Candidate, Dr. Karthik Natarajan’s Lab

Title: Identifying data quality dimensions for wearable device data

Abstract: Patient-generated health data (PGHD) is an emerging type of biomedical data that is captured and recorded by patients outside clinical encounters. One of the major factors facilitating the documentation of PGHD is the proliferation of health tracking technologies. Among these technologies, wearable devices are unique in that individuals can continuously and objectively self-track their health in free-living conditions. As a byproduct of using wearable devices for self-tracking, the large volume of accumulated data and the diversity of data types have led to interest in reusing these data for research purposes. However, there are concerns about the quality of device-generated data due to various reasons such as technical and human limitations. Therefore, assessing the quality of wearable data is essential before reusing the data for research. Data quality dimensions are an important feature of data quality assessment, as they provide guidance on which aspects of data quality should be assessed for the research task. While there are abundant studies on data quality dimensions for traditional clinical data such as electronic health record data, there is a lack of understanding of the important data quality dimensions for wearable device data. In this study, we aim to identify the data quality dimensions considered to be important by researchers when analyzing wearable data, and to verify whether an existing data quality framework can be applied to this type of data or needs to be modified. In this talk, I will discuss the methods we used to identify the dimensions and present preliminary results of the study.

Video: Watch the presentation here

Title: Applications of Data Science and Machine Learning in Radiology and Cardiology

Abstract: The overall goal of our group is to leverage data-driven approaches to help improve patient outcomes. This talk will demonstrate examples of how we are working toward this goal by leveraging large clinical datasets, data science and machine learning. Specific examples include: 1) using 46,583 clinically-acquired 3D computed tomography images of the brain to develop and implement a deep learning model to efficiently reprioritize radiology worklists for quicker diagnosis of intracranial hemorrhage; 2) using deep learning to analyze 723,754 echocardiographic videos of the heart to accurately predict patient mortality; 3) analyzing 2 million 12-lead electrocardiographic tracings from the heart to predict clinically relevant future events; and 4) optimizing evidence-based care delivery for a population of >10,000 patients with heart failure using machine learning.

Bio: Dr. Fornwalt attended the University of South Carolina as an undergraduate in mathematics and marine science. He then worked in a free medical clinic for a year before starting an MD/PhD program at Emory and Georgia Tech. After finishing his degrees in 2010, he completed an internship in pediatrics at Boston Children’s Hospital before becoming an Assistant Professor at the University of Kentucky.

After four years on faculty in Kentucky, Dr. Fornwalt moved to Geisinger where he completed his diagnostic radiology residency and founded Geisinger’s Department of Imaging Science and Innovation, which focuses on data-driven approaches to improving patient outcomes. Dr. Fornwalt is also a practicing thoraco-abdominal radiologist and an active member of Geisinger’s Heart Institute.

Video: Watch the presentation here

Title: Integrative Analysis of Multi-view Data for Dimension Reduction and Prediction

Abstract: Multi-view data are data collected on the same set of samples but from different views/sources. They are becoming increasingly common in modern biomedical studies. In this talk, I’ll introduce some recent developments in the integrative analysis of multi-view data, and present a new multivariate predictive model with application to a longitudinal study of aging.

Bio: Dr. Gen Li is devoted to developing new statistical learning methods for analyzing high dimensional biomedical data. He focuses on analyzing complex data with heterogeneous types that are collected from multiple sources. His methodological research interests include dimension reduction, predictive modeling, association analysis, and functional data analysis. He is also interested in genetics and bioinformatics. He is a consortium member of the NIH Common Fund program Genotype-Tissue Expression (GTEx) project, and contributes to the development of statistical methods for expression quantitative trait loci analysis in multiple tissues. He also has research interests in scientific domains including melanoma, microbiome, and urology research.

Video: Watch the presentation here

Title: Machine Learning in Healthcare

Abstract: In March of 2016, the AlphaGo computer program beat world champion (and human) Lee Sedol at the board game Go. The program’s success reflected the significant progress that machine learning research has made in recent years. However, AlphaGo was just one example of what can be achieved with machine learning. This talk will provide an overview of some of the techniques that are being used in machine learning today, as well as some recent and ongoing work by Google’s research teams to advance the applications of machine learning, particularly its role in biomedical research.  The talk will also discuss some of the unique challenges around applications in healthcare.  

Bio: Ming Jack Po, MD, PhD, is a product manager in Google Health, leading a number of its machine learning research projects as well as health care product teams. Prior to joining Google, Jack spent a decade working in different capacities in areas related to medical devices and healthcare delivery. Jack is currently a trustee of the Austen Riggs Center, a board member of El Camino Health Systems, a member of the National Library of Medicine Lister Hill’s Board of Scientific Counselors and a member of the ONC’s Interoperability Standards Priorities Task Force. Jack received his MD and PhD from Columbia University, and his bachelor’s degree in Biomedical Engineering and Master’s degree in Mathematics from Johns Hopkins University.

Speaker: Alexander Hsieh, PhD student

Title: Detection of mosaic single nucleotide variants in exome sequencing data and implications for congenital heart disease

Abstract: The contribution of somatic mosaicism, or genetic mutations arising after oocyte fertilization, to congenital heart disease (CHD) is not well understood. Further, the relationship between mosaicism in blood and cardiovascular tissue has not been determined. We developed a computational method, Expectation-Maximization-based detection of Mosaicism (EM-mosaic), to analyze mosaicism in exome sequences of 2530 CHD proband-parent trios. EM-mosaic detected 326 mosaic mutations in blood and/or cardiac tissue DNA. Of the 309 detected in blood DNA, 85/94 (90%) tested were independently confirmed. Twenty-five mosaic variants altered CHD-risk genes, affecting 1% of our cohort. Of these 25, 22/22 candidates tested were confirmed. Variants predicted as damaging had higher variant allele fraction than benign variants, suggesting a role in CHD. The frequency of mosaic variants above 10% mosaicism was 0.13/person in blood and 0.14/person in cardiac tissue. Analysis of 66 individuals with matched cardiac tissue available revealed both tissue-specific and shared mosaicism, with shared mosaics generally having higher allele fraction. We estimate that ~1% of CHD probands have a mosaic variant detectable in blood that could contribute to cardiac malformations, particularly those damaging variants expressed at higher allele fraction compared to benign variants. Although blood is a readily-available DNA source, cardiac tissues analyzed contributed ~5% of somatic mosaic variants identified, indicating the value of tissue mosaicism analyses.
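To give a sense of how expectation-maximization can separate mosaic from germline variants by allele fraction, here is a minimal sketch of a generic two-component binomial mixture. It is not the EM-mosaic method itself, whose model is not described in the abstract. The function names, the starting values, and the choice to fix the germline heterozygous fraction at 0.5 are all assumptions for illustration.

```python
import math


def binom_logpmf(k: int, n: int, p: float) -> float:
    """Log of the binomial probability of k alternate reads out of n."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))


def em_mosaic_sketch(alt_counts, depths, p_mosaic=0.2, weight=0.5, iters=50):
    """Fit a mixture of 'germline het' (fraction fixed at 0.5, an assumption)
    and 'mosaic' (fraction estimated) components.

    Returns (mixing weight of the mosaic component, estimated mosaic allele
    fraction, per-variant posterior probability of being mosaic).
    """
    p_germline = 0.5
    eps = 1e-6
    resp = []
    for _ in range(iters):
        # E-step: posterior probability that each variant is mosaic
        resp = []
        for k, n in zip(alt_counts, depths):
            log_m = math.log(weight) + binom_logpmf(k, n, p_mosaic)
            log_g = math.log(1 - weight) + binom_logpmf(k, n, p_germline)
            top = max(log_m, log_g)  # subtract the max for numerical stability
            m, g = math.exp(log_m - top), math.exp(log_g - top)
            resp.append(m / (m + g))
        # M-step: update the mixing weight and the mosaic allele fraction
        weight = min(max(sum(resp) / len(resp), eps), 1 - eps)
        reads = sum(r * n for r, n in zip(resp, depths))
        if reads > 0:
            alt = sum(r * k for r, k in zip(resp, alt_counts))
            p_mosaic = min(max(alt / reads, eps), 1 - eps)
    return weight, p_mosaic, resp
```

Variants whose posterior probability is high would be candidate mosaics in this toy model; a real caller would also need to handle sequencing error, depth variation, and false-positive filtering.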


Speaker: Michelle Chau, PhD student

Title: Developing a user-centered, machine learning approach to identify preferences for inspirational social media health-related images for young populations

Abstract: Nutrition interventions for adolescents and young adults (AYAs) increasingly rely on mobile platforms and social media. Most assume nutritional decisions are rational, targeting intentions such as goal setting and self-monitoring. However, in the absence of motivation and time, nutrition choices are often automatic and based on heuristics. The use of images is a simple way to deliver heuristic messaging. My preliminary research, which shows AYAs’ frequent use of social media for inspiration, further suggests that health-related images may be suitable for nutrition interventions with these groups. Previous studies have explored inspirational social media content using qualitative and manual methods. However, there is an active area of research in computational visual analysis that explores preferences and prediction for image retrieval and recommendation tasks. The application of these techniques within health, and specifically how to translate human preferences into the technical requirements needed to identify inspirational images for nutrition and young populations, is underexplored. In this talk, I will discuss a study to identify image features that are relevant for inspiring healthy eating in health-related social media content. Further, I will discuss future directions for exploring how these features may be incorporated into machine learning models.