## BINF G4002: Methods II: Computational Methods

**Course Description:** This course is targeted to biomedical scientists developing a broad understanding of computational methods applicable in biomedicine. This is a fast-paced, technical course covering a broad range of topics, including density estimation, regression, classification, deep learning, probabilistic graphical models, clustering, dimensionality reduction, time series models, statistical NLP, networks, hypothesis testing, causal inference, imputation, and association rule mining. Students are expected to read technical texts carefully, participate actively in lecture discussion, and develop hands-on skills in labs involving real-world biomedical and health datasets.

## Instructor

Adler Perotte, PhD

## Class Schedule

This class meets Tuesdays and Thursdays 9:00 - 10:15 am, and Fridays 10:30 - 11:45 am.

**Readings and Bibliography:** Considering the breadth of this course, readings will come from a variety of sources – all of which except one are freely available online:

● Pattern Recognition and Machine Learning (PRML), Chris Bishop

● Deep Learning (DL) – https://www.deeplearningbook.org/

● Grinstead and Snell’s Introduction to Probability (GS) – https://www.math.dartmouth.edu/~prob/prob/prob.pdf

● Information Theory, Inference, and Learning Algorithms (ITIL) – http://www.inference.org.uk/itprnn/book.pdf

● An Introduction to Information Retrieval (IIR) – https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

● Speech and Language Processing (SLP) – https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf

● Graph Theory and Complex Networks (GTCN) – https://www.distributed-systems.net/index.php/books/gtcn/

● Practical Statistics for Data Scientists: 50 Essential Concepts (PSDS) – https://clio.columbia.edu/catalog/13632351?counter=1

● Causal Inference (CI) – https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/

● Elements of Statistical Learning (ESL) – https://web.stanford.edu/~hastie/ElemStatLearn/

**Academic Integrity:** Columbia’s intellectual community relies on academic integrity and responsibility as the cornerstone of its work. Graduate students are expected to exhibit the highest level of personal and academic honesty as they engage in scholarly discourse and research. In practical terms, you must be responsible for the full and accurate attribution of the ideas of others in all of your research papers and projects; you must be honest when taking your examinations; you must always submit your own work and not that of another student, scholar, or internet source. Graduate students are responsible for knowing and correctly utilizing referencing and bibliographical guidelines. When in doubt, consult your professor. Citation and plagiarism-prevention resources can be found at the GSAS page on Academic Integrity and Responsible Conduct of Research (http://gsas.columbia.edu/academic-integrity).

Failure to observe these rules of conduct will have serious academic consequences, up to and including dismissal from the university. If a faculty member suspects a breach of academic honesty, appropriate investigative and disciplinary action will be taken following Dean’s Discipline procedures (http://gsas.columbia.edu/content/disciplinary-procedures).

**Course Requirements and Grading:** The course consists of three classes every week (lectures on Tuesday and Thursday mornings and a lab on Friday morning). Attendance at all lectures and labs is required unless an absence is pre-approved.

Grading is based on class participation (5%), reading responses (20%), labs (25%), midterm exam (20%), and final project (30%).
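As a rough illustration of how these weights combine, the sketch below computes a weighted average. It assumes each component is scored on a 0-100 scale and combined linearly; the syllabus does not state the exact procedure, and the function name and example scores are hypothetical.

```python
# Component weights as stated above (percent of the final grade).
WEIGHTS = {
    "participation": 5,
    "reading_responses": 20,
    "labs": 25,
    "midterm": 20,
    "final_project": 30,
}


def weighted_grade(scores):
    """Combine per-component scores (assumed 0-100 each) into one grade.

    Assumption: a simple weighted average; the actual grading procedure
    may differ.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / total_weight


# Hypothetical scores, for illustration only.
example_scores = {
    "participation": 100,
    "reading_responses": 90,
    "labs": 80,
    "midterm": 70,
    "final_project": 85,
}
example_grade = weighted_grade(example_scores)
print(example_grade)
```

Under these assumptions, the final project and labs together account for more than half of the grade.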

Attendance is mandatory and you will be expected to discuss concepts covered in class with your peers. Each lecture will be divided into modules and peer discussion will be an integral part of each module. In order for these discussions to be fruitful, doing the readings prior to class is necessary. The readings may be dense, but don’t let that intimidate you! Do your best, and challenging concepts will be clarified in class.

Reading responses should be completed individually, summarize the assigned reading in no more than one page, and touch on all major topics of the readings. The preferred format for these notes is a LaTeX-based PDF, but any electronic format is acceptable (Word, PowerPoint, text file, etc.). Reading responses should be uploaded to CourseWorks prior to the associated lecture. Late reading responses will be given partial credit. Reading responses should be unique and authored by you alone while reading the assignments. Do not use the notes of others or other materials for this assignment – this will be checked.

Each lab write-up must be submitted through CourseWorks by the Monday following the lab.

Final projects can be completed in groups of no more than two students. The final project should integrate concepts from at least 3 separate lectures in a meaningful way. A project proposal should describe the problem being addressed (could be a clinical and/or computational problem), the dataset being analyzed (could be simulated data), methods, baseline methods, the team, and each member’s contribution to the project. The project proposal should be no more than one page in length and is due one week after the midterm exam.

Although you are free to choose any health-related dataset, example datasets that could be used include:

● PhysioNet (requires short ethics courses and approval; speak to me before requesting access)

○ Physiobank

○ MIMIC

○ eICU

● The Cancer Genome Atlas

● Simons Genome Diversity Project

● 1000 Genomes

● NIH X-ray Database

● Dream Challenges

● Grand Challenges

● Health-related challenges on Kaggle

● dbGaP (request for data must go through me and requires several months advance notice)

Final project presentations will be scheduled near the end of the semester and will be 5-10 minutes long, depending on how many groups are presenting. There is no length requirement for the final write-up (typically ~5-10 pages), but it should discuss the following topics:

● What is the clinical/computational problem?

● How is it addressed today, if at all?

● What is limiting about the current approaches?

● What is better about your approach?

● How will we know that you are successful?

● Methods, including one or more baseline methods (baseline methods may be simple)

● Data description

● Results

● Discussion/Conclusions

Date | Lecture/Lab | Titles | Readings |
---|---|---|---|

1/12 | Lecture | Linear algebra refresher | DL Chapter 2 (excluding 2.8, 2.9, and 2.12) Supplemental/Optional: Strang Chapter 11 https://ocw.mit.edu/ans7870/resources/Strang/Edited/Calculus/Calculus.pdf Supplemental/Optional: Hefferon Ch 1-4 http://joshua.smcvt.edu/linearalgebra/book.pdf |

1/14 | Lecture | Calculus refresher | DL Chapter 4 (4.1, 4.3, up to and including equation 4.8) Supplemental/Optional: Strang Ch 2, 4, 5, 13 https://ocw.mit.edu/ans7870/resources/Strang/Edited/Calculus/Calculus.pdf |

1/15 | Lab | Python refresher and autodiff software introduction (PyTorch) | |

1/19 | Lecture | Probabilities refresher | GS 1.2 (Random Variables and Sample Space, Distribution Functions, Theorem 1.1) pg 18-22 GS 2.1 (Probabilities) pg 41-42 GS 2.2 (Spinners, Darts, Sample Space Coordinates, Density Functions of Continuous Random Variables/Definition 2.1, Cumulative Distribution Functions of Continuous Random Variables/Definition 2.2 + Theorem 2.1 (no proof)) pg 55-59 GS 4.1 (Conditional Probability, Bayes Probabilities, Independent Events, Joint Distribution Functions and Independence of Random Variables, Independent Trials Processes, Bayes Formula) pg 133-147 GS 4.2 (Independent Events, Joint Density and Cumulative Distribution Functions, Independent Random Variables, Independent Trials) pg 162-168 GS 6.1 (Average Value, Expected Value, Interpretation of Expected Value, Expectation of a Function of a Random Variable, The Sum of Two Random Variables, Independence, Conditional Expectation) pg 225-232, 233-234, 239 GS 6.2 (Variance, Standard Deviation, Properties of Variance) pg 257-258, 259-261 GS 6.3 (Expected Value, Expectation of a Function of a Random Variable, Expectation of the Product of Two Random Variables, Variance, Independent Trials) pg 268-275 |

1/21 | Lecture | Probabilities refresher | GS 8.1 (Chebyshev Inequality, Law of Large Numbers, Law of Averages) pg 305-307 GS 8.2 (Chebyshev Inequality, Law of Large Numbers) pg 316-317 GS 9.1 (Bernoulli Trials, Standardized Sums (not including Theorem 9.1)) pg 325-328 GS 9.2 (Standardized Sums) pg 340-342 GS 9.3 (Standardized Sums) pg 356-357 GS 5.1 (Discrete Uniform Distribution, Binomial Distribution, Geometric Distribution, Negative Binomial Distribution, Poisson Distribution, Hypergeometric Distribution) pg 183-195 GS 5.2 (Continuous Uniform Density, Exponential and Gamma Densities, Normal Density, Maxwell and Rayleigh Densities, Chi-Squared Density, Cauchy Density) pg 205-209, 212-219 Multidimensional Randomness Notes |

1/22 | Lab | Stochastic optimization & CLT | |

1/26 | Lecture | Machine Learning fundamentals | |

1/28 | Lecture | Machine Learning fundamentals | |

1/29 | Lab | Maximum Likelihood and MAP | |

2/2 | Lecture | Information theory | ITIL 1.1, 2.4, 2.5, 2.6, 2.7, 8.1, 9.1 |

2/4 | Lecture | Density Estimation, Regression & Classification | PRML 2.5, 3.1 (3.1.1, 3.1.4), 4 (up to and including 4.1.1) |

2/5 | Lab | Regression & Classification | |

2/9 | Lecture | Density Estimation, Regression & Classification | PRML 3.3.1, 4.2 (only intro), 4.3 (4.3.1, 4.3.2), 7 (up to equation 7.3, pg. 327), 14.4 |

2/11 | Lecture | Neural Networks and Computational Graphs | DL Ch 6 (up to and including 6.5.3 and excluding 6.4) |

2/12 | Lab | FFNN with PyTorch | |

2/16 | Lecture | Neural Networks and Computational Graphs | DL Ch 9 (up to and including 9.3), 8.1, 8.3, 8.5, 8.7.1 |

2/18 | Lecture | Neural Networks and Computational Graphs | DL Ch 7 (7.1.1 up to equation 7.5, 7.1.2 up to equation 7.2, 7.3, 7.4, 7.5, 7.8 up to pg 246, 7.12) |

2/19 | Lab | CNN with PyTorch | |

2/23 | Lecture | Probabilistic graphical models | PRML 8.1, 8.2 |

2/25 | Lecture | Probabilistic graphical models | PRML 8.3 Blei (Probabilistic Topic Models) pg 77-80 Blei (Build, Compute, Critique, Repeat) pg 203-218 |

2/26 | Midterm | | |

Spring Break | | | |

3/9 | Lecture | Probabilistic graphical models | PRML 11.1.2-11.1.4 pg 523-526, 528-534 PRML 11.2-11.3 pg 537-546 PRML 10.1 pg 461-464 |

3/11 | Lecture | Probabilistic graphical models | |

3/12 | Lab | Probabilistic programming | |

3/16 | Lecture | Clustering | PRML 9.1, 9.2 ESL 14.3 (14.3.1, 14.3.2, 14.3.3, 14.3.12) |

3/18 | Lecture | Dimensionality Reduction | Hinton: https://www.cs.toronto.edu/~hinton/science.pdf PRML 12.1 (up to and including 12.1.2), 12.4.1 DL 14 (14.1, 14.2, and 14.9) |

3/19 | Lab | Hierarchical clustering + Autoencoders | |

3/23 | Lecture | Time Series Models | PRML 13.1, 13.2 (up to and including pg 615 - not section 13.2.1) |

3/25 | Lecture | Time Series Models | PRML 13.3 (up to and including pg 637) DL 10.1, 10.2 up to and including 10.2.1, section 10.7, and 10.10 up to and including 10.10.1 |

3/26 | Lab | LSTM algorithm | |

3/30 | Lecture | Statistical NLP | IIR Ch 1 SLP Ch 3 (up to and including 3.4) |

4/1 | Lecture | Statistical NLP | SLP Ch 6 (excluding only 6.7), 7.5 |

4/2 | Lab | Embeddings | |

4/6 | Lecture | Graph Theory | GTCN (No Theorems or Proofs) 2.1, 2.2, 2.3 GTCN (No Theorems or Proofs) 3.1, 3.2 |

4/8 | Lecture | Graph Theory | GTCN 6 (No Theorems or Proofs) GTCN 9.2 (Centrality and prestige only, No Theorems or Proofs) Community structure in social and biological networks (Detecting Community Structure section) - https://www.ncbi.nlm.nih.gov/pmc/articles/PMC122977/ |

4/9 | Lab | Centrality | |

4/13 | Lecture | Hypothesis testing & Survival Analysis | PSDS 2 (Sampling Distribution of a Statistic, The Bootstrap, Confidence Intervals) PSDS 3 (A/B Testing, Hypothesis Tests, Resampling, Statistical Significance and P-Values, t-Tests, Multiple Testing) https://www.ncbi.nlm.nih.gov/pubmed/12865907 |

4/15 | Lecture | Final Project Presentations | |

4/25 | Final Project Write-Up Due at 11:59 pm | | |