BINF G4002: Methods II: Computational Methods

Course Description: This course is designed for biomedical scientists seeking a broad understanding of computational methods applicable in biomedicine. It is a fast-paced, technical course covering a broad range of topics, including density estimation, regression, classification, deep learning, probabilistic graphical models, clustering, dimensionality reduction, time series models, statistical NLP, networks, hypothesis testing, causal inference, imputation, and association rule mining. Students are expected to read technical texts carefully, participate actively in lecture discussion, and develop hands-on skills in labs involving real-world biomedical and health datasets.

Instructor

Adler Perotte, PhD

Class Schedule

This class meets Tuesdays and Thursdays 9:00 - 10:15 am, and Fridays 10:30 - 11:45 am.

Readings and Bibliography: Given the breadth of this course, readings come from a variety of sources, all but one of which are freely available online:

● Pattern Recognition and Machine Learning (PRML), Chris Bishop
● Deep Learning (DL) – https://www.deeplearningbook.org/
● Grinstead and Snell’s Introduction to Probability (GS) – https://www.math.dartmouth.edu/~prob/prob/prob.pdf
● Information Theory, Inference, and Learning Algorithms (ITIL) – http://www.inference.org.uk/itprnn/book.pdf
● An Introduction to Information Retrieval (IIR) – https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf
● Speech and Language Processing (SLP) – https://web.stanford.edu/~jurafsky/slp3/ed3book.pdf
● Graph Theory and Complex Networks (GTCN) – https://www.distributed-systems.net/index.php/books/gtcn/
● Practical Statistics for Data Scientists: 50 Essential Concepts (PSDS) – https://clio.columbia.edu/catalog/13632351?counter=1
● Causal Inference (CI) – https://www.hsph.harvard.edu/miguel-hernan/causal-inference-book/
● Elements of Statistical Learning (ESL) – https://web.stanford.edu/~hastie/ElemStatLearn/

Academic Integrity: Columbia’s intellectual community relies on academic integrity and responsibility as the cornerstone of its work. Graduate students are expected to exhibit the highest level of personal and academic honesty as they engage in scholarly discourse and research. In practical terms, you must be responsible for the full and accurate attribution of the ideas of others in all of your research papers and projects; you must be honest when taking your examinations; you must always submit your own work and not that of another student, scholar, or internet source. Graduate students are responsible for knowing and correctly utilizing referencing and bibliographical guidelines. When in doubt, consult your professor. Citation and plagiarism-prevention resources can be found at the GSAS page on Academic Integrity and Responsible Conduct of Research (http://gsas.columbia.edu/academic-integrity).

Failure to observe these rules of conduct will have serious academic consequences, up to and including dismissal from the university. If a faculty member suspects a breach of academic honesty, appropriate investigative and disciplinary action will be taken following Dean’s Discipline procedures (http://gsas.columbia.edu/content/disciplinary-procedures).

Course Requirements and Grading: The course consists of three classes each week: lectures on Tuesday and Thursday mornings and a lab on Friday morning. Attendance at all lectures and labs is required unless otherwise pre-approved.

Grading is based on class participation (5%), reading responses (20%), labs (25%), midterm exam (20%), and final project (30%).

Attendance is mandatory, and you will be expected to discuss concepts covered in class with your peers. Each lecture will be divided into modules, and peer discussion will be an integral part of each module. For these discussions to be fruitful, you must do the readings before class. The readings may be dense, but don’t let that intimidate you! Do your best, and challenging concepts will be clarified in class.

Reading responses should be completed individually, contain a summary of the assigned reading no more than one page in length, and touch on all major topics of the readings. The preferred format is a LaTeX-based PDF, but any electronic format is acceptable (Word, PowerPoint, plain text, etc.). Reading responses should be uploaded to CourseWorks before the associated lecture. Late reading responses will be given partial credit. Reading responses must be your own work, written by you alone while doing the readings. Do not use the notes of others or other materials for this assignment; this will be checked.

Each lab write-up must be submitted through CourseWorks by the Monday following the lab.

Final projects may be completed in groups of no more than two students. The final project should integrate concepts from at least three separate lectures in a meaningful way. A project proposal is required, describing the problem being addressed (which may be clinical and/or computational), the dataset being analyzed (which may be simulated), the methods, the baseline methods, the team, and each member’s contribution to the project. The proposal should be no more than one page in length and is due one week after the midterm exam.

Although you are free to choose any health-related dataset, example datasets that could be used include:
● PhysioNet (requires short ethics courses and approval; speak to me before requesting access)
  ○ PhysioBank
  ○ MIMIC
  ○ eICU
● The Cancer Genome Atlas
● Simons Genome Diversity Project
● 1000 Genomes
● NIH X-ray Database
● DREAM Challenges
● Grand Challenges
● Health-related challenges on Kaggle
● dbGaP (requests for data must go through me and require several months’ advance notice)

Final project presentations will be scheduled near the end of the semester and will be 5-10 minutes long depending on how many groups are presenting. There is no length requirement for the final write-up, but it should discuss the following topics (typically ~5-10 pages):
● What is the clinical/computational problem?
● How is it addressed today, if at all?
● What is limiting about the current approaches?
● What is better about your approach?
● How will we know that you are successful?
● Methods, including one or more baseline methods (baseline methods may be simple)
● Data description
● Results
● Discussion/Conclusions

Schedule of Lectures, Labs, and Readings

1/12 Lecture: Linear algebra refresher
    Readings: DL Chapter 2 (excluding 2.8, 2.9, and 2.12)
    Supplemental/Optional: Strang Chapter 11 (https://ocw.mit.edu/ans7870/resources/Strang/Edited/Calculus/Calculus.pdf)
    Supplemental/Optional: Hefferon Ch 1-4 (http://joshua.smcvt.edu/linearalgebra/book.pdf)

1/14 Lecture: Calculus refresher
    Readings: DL Chapter 4 (4.1, 4.3, up to and including equation 4.8)
    Supplemental/Optional: Strang Ch 2, 4, 5, 13 (https://ocw.mit.edu/ans7870/resources/Strang/Edited/Calculus/Calculus.pdf)

1/15 Lab: Python refresher and autodiff software introduction (PyTorch)
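
To preview the autodiff portion of this lab, a minimal sketch using PyTorch's autograd (the tensor values here are arbitrary illustrations, not lab materials):

    import torch

    # Create a tensor that tracks gradients through subsequent operations.
    x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

    # Build a small computational graph: y = sum(x^2 + 3x).
    y = (x ** 2 + 3 * x).sum()

    # Reverse-mode autodiff: populates x.grad with dy/dx = 2x + 3.
    y.backward()
    print(x.grad)  # tensor([5., 7., 9.])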
1/19 Lecture: Probabilities refresher
    Readings:
    GS 1.2 (Random Variables and Sample Space, Distribution Functions, Theorem 1.1) pg 18-22
    GS 2.1 (Probabilities) pg 41-42
    GS 2.2 (Spinners, Darts, Sample Space Coordinates, Density Functions of Continuous Random Variables/Definition 2.1, Cumulative Distribution Functions of Continuous Random Variables/Definition 2.2 + Theorem 2.1 (no proof)) pg 55-59
    GS 4.1 (Conditional Probability, Bayes Probabilities, Independent Events, Joint Distribution Functions and Independence of Random Variables, Independent Trials Processes, Bayes Formula) pg 133-147
    GS 4.2 (Independent Events, Joint Density and Cumulative Distribution Functions, Independent Random Variables, Independent Trials) pg 162-168
    GS 6.1 (Average Value, Expected Value, Interpretation of Expected Value, Expectation of a Function of a Random Variable, The Sum of Two Random Variables, Independence, Conditional Expectation) pg 225-232, 233-234, 239
    GS 6.2 (Variance, Standard Deviation, Properties of Variance) pg 257-258, 259-261
    GS 6.3 (Expected Value, Expectation of a Function of a Random Variable, Expectation of the Product of Two Random Variables, Variance, Independent Trials) pg 268-275

1/21 Lecture: Probabilities refresher
    Readings:
    GS 8.1 (Chebyshev Inequality, Law of Large Numbers, Law of Averages) pg 305-307
    GS 8.2 (Chebyshev Inequality, Law of Large Numbers) pg 316-317
    GS 9.1 (Bernoulli Trials, Standardized Sums (not including Theorem 9.1)) pg 325-328
    GS 9.2 (Standardized Sums) pg 340-342
    GS 9.3 (Standardized Sums) pg 356-357
    GS 5.1 (Discrete Uniform Distribution, Binomial Distribution, Geometric Distribution, Negative Binomial Distribution, Poisson Distribution, Hypergeometric Distribution) pg 183-195
    GS 5.2 (Continuous Uniform Density, Exponential and Gamma Densities, Normal Density, Maxwell and Rayleigh Densities, Chi-Squared Density, Cauchy Density) pg 205-209, 212-219
    Multidimensional Randomness Notes

1/22 Lab: Stochastic optimization & CLT
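
As a preview of the CLT half of this lab, a minimal NumPy simulation (the exponential source distribution and sample sizes are arbitrary choices, not lab materials):

    import numpy as np

    rng = np.random.default_rng(0)

    # Draw many sample means from a skewed (exponential) distribution.
    n, trials = 50, 10_000
    means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

    # CLT: the means should be approximately normal, with mean 1
    # and standard deviation close to 1/sqrt(n).
    print(means.mean(), means.std(), 1 / np.sqrt(n))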
1/26 Lecture: Machine learning fundamentals

1/28 Lecture: Machine learning fundamentals

1/29 Lab: Maximum likelihood and MAP
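
A minimal sketch of the contrast this lab explores, using a Bernoulli likelihood with a Beta prior; the counts and prior hyperparameters below are invented for illustration:

    # Coin-flip data: 7 heads out of 10 tosses (illustrative counts).
    heads, n = 7, 10

    # Maximum likelihood estimate of the head probability.
    mle = heads / n

    # MAP estimate under a Beta(a, b) prior (the posterior mode).
    a, b = 2.0, 2.0
    map_est = (heads + a - 1) / (n + a + b - 2)

    print(mle, map_est)  # 0.7 vs. ~0.667: the prior shrinks the estimate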
2/2 Lecture: Information theory
    Readings: ITIL 1.1, 2.4, 2.5, 2.6, 2.7, 8.1, 9.1

2/4 Lecture: Density estimation, regression & classification
    Readings: PRML 2.5, 3.1 (3.1.1, 3.1.4), 4 (up to and including 4.1.1)

2/5 Lab: Regression & classification
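
For a flavor of this lab, a minimal classification sketch with scikit-learn (the synthetic two-blob data is a stand-in for a real biomedical dataset):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Synthetic binary classification data: two noisy Gaussian blobs.
    X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
    y = np.array([0] * 100 + [1] * 100)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    clf = LogisticRegression().fit(X_tr, y_tr)
    print(clf.score(X_te, y_te))  # held-out accuracy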
2/9 Lecture: Density estimation, regression & classification
    Readings: PRML 3.3.1, 4.2 (intro only), 4.3 (4.3.1, 4.3.2), 7 (up to equation 7.3, pg 327), 14.4

2/11 Lecture: Neural networks and computational graphs
    Readings: DL Ch 6 (up to and including 6.5.3, excluding 6.4)

2/12 Lab: FFNN with PyTorch
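
A minimal sketch of the kind of feedforward network (FFNN) this lab builds, assuming PyTorch's nn and optim APIs; the layer sizes and random batch are illustrative only:

    import torch
    from torch import nn

    # A small feedforward network: 10 inputs -> 32 hidden units -> 2 classes.
    model = nn.Sequential(
        nn.Linear(10, 32),
        nn.ReLU(),
        nn.Linear(32, 2),
    )
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(model.parameters(), lr=0.1)

    # One training step on a random batch (a stand-in for lab data).
    x, y = torch.randn(64, 10), torch.randint(0, 2, (64,))
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()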
2/16 Lecture: Neural networks and computational graphs
    Readings: DL Ch 9 (up to and including 9.3), 8.1, 8.3, 8.5, 8.7.1

2/18 Lecture: Neural networks and computational graphs
    Readings: DL Ch 7 (7.1.1 up to equation 7.5, 7.1.2 up to equation 7.2, 7.3, 7.4, 7.5, 7.8 up to pg 246, 7.12)

2/19 Lab: CNN with PyTorch
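
Similarly, a minimal convolutional sketch in PyTorch; the input shape and layer sizes are illustrative, not the lab's actual architecture:

    import torch
    from torch import nn

    # A small CNN for 1-channel 28x28 images (e.g., MNIST-sized inputs).
    model = nn.Sequential(
        nn.Conv2d(1, 8, kernel_size=3, padding=1),   # 8 feature maps
        nn.ReLU(),
        nn.MaxPool2d(2),                             # 28x28 -> 14x14
        nn.Flatten(),
        nn.Linear(8 * 14 * 14, 10),                  # 10 output classes
    )

    x = torch.randn(4, 1, 28, 28)  # a random batch of 4 images
    print(model(x).shape)          # torch.Size([4, 10])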
2/23 Lecture: Probabilistic graphical models
    Readings: PRML 8.1, 8.2

2/25 Lecture: Probabilistic graphical models
    Readings: PRML 8.3
    Blei, “Probabilistic Topic Models,” pg 77-80
    Blei, “Build, Compute, Critique, Repeat,” pg 203-218

2/26 Midterm

Spring Break

3/9 Lecture: Probabilistic graphical models
    Readings: PRML 11.1.2-11.1.4 pg 523-526, 528-534
    PRML 11.2-11.3 pg 537-546
    PRML 10.1 pg 461-464

3/11 Lecture: Probabilistic graphical models

3/12 Lab: Probabilistic programming
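
To preview the sampling ideas behind the PRML Chapter 11 readings and this lab, a minimal random-walk Metropolis sketch in plain NumPy (the standard-normal target and step size are invented; the lab itself will use a probabilistic programming framework):

    import numpy as np

    rng = np.random.default_rng(0)

    def log_target(x):
        # Unnormalized log-density of a standard normal target.
        return -0.5 * x ** 2

    x, samples = 0.0, []
    for _ in range(10_000):
        prop = x + rng.normal(scale=1.0)          # random-walk proposal
        if np.log(rng.uniform()) < log_target(prop) - log_target(x):
            x = prop                              # accept the proposal
        samples.append(x)                         # keep the current state

    print(np.mean(samples), np.std(samples))      # approx. 0 and 1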
3/16 Lecture: Clustering
    Readings: PRML 9.1, 9.2
    ESL 14.3 (14.3.1, 14.3.2, 14.3.3, 14.3.12)

3/18 Lecture: Dimensionality reduction
    Readings: Hinton, https://www.cs.toronto.edu/~hinton/science.pdf
    PRML 12.1 (up to and including 12.1.2), 12.4.1
    DL 14 (14.1, 14.2, and 14.9)

3/19 Lab: Hierarchical clustering + autoencoders
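
A minimal sketch of the hierarchical-clustering half of this lab, using SciPy's agglomerative clustering; the synthetic data and cluster count are illustrative:

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    rng = np.random.default_rng(0)

    # Two synthetic clusters of 2-D points (stand-ins for lab data).
    X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])

    # Agglomerative clustering with Ward linkage, cut into two clusters.
    Z = linkage(X, method="ward")
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(labels)  # cluster assignments, roughly 20 of each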
3/23 Lecture: Time series models
    Readings: PRML 13.1, 13.2 (up to and including pg 615; not section 13.2.1)

3/25 Lecture: Time series models
    Readings: PRML 13.3 (up to and including pg 637)
    DL 10.1, 10.2 (up to and including 10.2.1), 10.7, 10.10 (up to and including 10.10.1)

3/26 Lab: LSTM algorithm
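
A minimal sketch of running sequences through an LSTM in PyTorch; the input dimensions and batch are arbitrary illustrations:

    import torch
    from torch import nn

    # An LSTM over sequences of 16-dimensional inputs, with 32 hidden units.
    lstm = nn.LSTM(input_size=16, hidden_size=32, batch_first=True)

    # A random batch: 4 sequences of length 10 (a stand-in for lab data).
    x = torch.randn(4, 10, 16)
    output, (h_n, c_n) = lstm(x)

    print(output.shape)  # per-step hidden states: torch.Size([4, 10, 32])
    print(h_n.shape)     # final hidden state: torch.Size([1, 4, 32])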
3/30 Lecture: Statistical NLP
    Readings: IIR Ch 1
    SLP Ch 3 (up to and including 3.4)

4/1 Lecture: Statistical NLP
    Readings: SLP Ch 6 (excluding only 6.7), 7.5

4/2 Lab: Embeddings
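
A minimal sketch of the embedding-table idea this lab covers, assuming PyTorch's nn.Embedding; the vocabulary size, dimensionality, and word indices are invented:

    import torch
    from torch import nn
    import torch.nn.functional as F

    # A trainable embedding table: 1,000 vocabulary items, 50 dimensions.
    emb = nn.Embedding(num_embeddings=1000, embedding_dim=50)

    # Look up vectors for two (hypothetical) word indices and compare them.
    v1 = emb(torch.tensor(3))
    v2 = emb(torch.tensor(17))
    print(F.cosine_similarity(v1, v2, dim=0))  # near 0 before training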
4/6 Lecture: Graph theory
    Readings: GTCN 2.1, 2.2, 2.3 (no theorems or proofs)
    GTCN 3.1, 3.2 (no theorems or proofs)

4/8 Lecture: Graph theory
    Readings: GTCN 6 (no theorems or proofs)
    GTCN 9.2 (centrality and prestige only; no theorems or proofs)
    “Community structure in social and biological networks” (Detecting Community Structure section), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC122977/

4/9 Lab: Centrality
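
A minimal sketch of the centrality measures this lab computes, using NetworkX on a toy graph (the graph is invented, not lab data):

    import networkx as nx

    # A small toy graph (a stand-in for a biological network).
    G = nx.Graph([("a", "b"), ("b", "c"), ("b", "d"), ("c", "d"), ("d", "e")])

    # Two of the centrality measures covered in the GTCN readings.
    print(nx.degree_centrality(G))
    print(nx.betweenness_centrality(G))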
4/13 Lecture: Hypothesis testing & survival analysis
    Readings: PSDS 2 (Sampling Distribution of a Statistic, The Bootstrap, Confidence Intervals)
    PSDS 3 (A/B Testing, Hypothesis Tests, Resampling, Statistical Significance and P-Values, t-Tests, Multiple Testing)
    https://www.ncbi.nlm.nih.gov/pubmed/12865907
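
Since the PSDS readings center on resampling, a minimal bootstrap confidence-interval sketch in NumPy (the lognormal sample is invented illustration data):

    import numpy as np

    rng = np.random.default_rng(0)

    # An observed sample (illustrative data, e.g., lab values from a cohort).
    data = rng.lognormal(mean=0.0, sigma=1.0, size=200)

    # Bootstrap the sampling distribution of the mean.
    boot_means = [rng.choice(data, size=data.size, replace=True).mean()
                  for _ in range(5_000)]

    # 95% percentile confidence interval for the mean.
    print(np.percentile(boot_means, [2.5, 97.5]))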
4/15 Lecture: Final project presentations

4/25 Final project write-up due at 11:59 pm