BINF G4000 Acculturation to Programming and Statistics

Course Description: This course is intended for biomedical scientists who want a working knowledge of programming and statistics. It is a fast-paced, hands-on course covering the following topics: programming basics in Python, probabilities, elements of linear algebra, elements of calculus, and elements of data analytics. Students are expected to learn lecture material outside of the classroom and focus on labs during class. All labs revolve around real-world biomedical and health datasets. Open only to students enrolled in the DBMI MA or PhD program. BINF G4000 must be taken in the fall term of entry. The instructor gives a placement exam on the first day of class. Students may test out of the course based on placement exam results.

Instructor

Karthik Natarajan, PhD

Class Schedule

This class meets twice weekly on Tuesdays and Thursdays.

Class Structure

I. Programming Basics I (2.5 weeks)

• Computing environment for biomedical sciences (Linux operating system, shell commands)
• IDEs (Integrated Development Environments) and text editors for Python (Emacs, Eclipse)
• Python variables, functions, basic data structures, libraries, numpy, files

Class | Topic | Objectives and Competencies | Themes
00 | Introduction to G4000 | Overview of course |
01 | Assessment Test | Assessment Test |
02 | Introduction to Linux | VirtualBox setup, deploying Lubuntu, basic commands | 1. ls; 2. chmod; 3. sudo
03 | Introduction to Regex | grep, sed | 1. backreference; 2. commands; 3. UMLS
04 | Development environments, programming primitives | Emacs, Python, variables, conditionals, loops, lists | 1. keybindings; 2. filtering and lists; 3. early termination
05 | Abstraction of code and data, code reuse | libraries, functions, data structures, files | 1. OS standard library; 2. lists and tuples; 3. recursion
06 | Vectorized code and visualization | Vectorized operations and efficiency considerations, reading and writing files, line plots, histograms |

II. Probabilities (2 weeks)

• Axioms, PMFs, PDFs, CDFs
• Distribution families (e.g., Binomial, Multinomial, Normal)
• Sampling
• Estimation
• Plotting
http://research.cs.tamu.edu/prism/lectures/sp/l10.pdf
http://www.math.uiuc.edu/~kkirkpat/SampleSpace.pdf

Class | Topic | Objectives and Competencies | Themes
07 | Introduction to probability theory | Axioms, conditional probability, law of total probability, sample spaces | 1. Sample spaces and events; 2. Probability axioms; 3. Chain rule of probability
08 | Probability distributions | Probability density functions, probability mass functions, cumulative distribution functions, mean, variance |
09 | Random sampling and estimation | Sampling, expected value, MLE, bootstrapping |
10 | Bayesian probability | Cox's theorem, Bayes' theorem, interpretations of probability, MAP | 1. Cox's theorem; 2. Bayes' theorem; 3. MAP
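
As an illustration of the Bayes' theorem and law of total probability topics in classes 07 and 10, here is a minimal sketch in Python. It is not taken from the course materials: the function name `posterior` and the prevalence, sensitivity, and specificity values are assumptions chosen only for illustration.

```python
def posterior(prior, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' theorem."""
    # Law of total probability:
    # P(+) = P(+ | D) * P(D) + P(+ | not D) * P(not D)
    p_positive = sensitivity * prior + (1.0 - specificity) * (1.0 - prior)
    # Bayes' theorem: P(D | +) = P(+ | D) * P(D) / P(+)
    return sensitivity * prior / p_positive


# Hypothetical numbers, invented for this example
p = posterior(prior=0.01, sensitivity=0.9, specificity=0.95)
print(round(p, 4))
```

With a low prior, the posterior stays well below the test's sensitivity, which is the usual motivating point for working through the calculation by hand.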

III. Programming Basics II (1 week)

• Data structures (dictionaries/hash-maps, sets)
• Persistence (reading and writing delimited and JSON files)

Class | Topic | Objectives and Competencies
11 | Data structures: dictionaries/hash-maps and sets | Data structure performance characteristics and choice
12 | Midterm Review |
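
To make the class 11 topic concrete, the following is a small sketch of when a dictionary or a set is a natural choice. The patient identifiers and diagnosis codes are hypothetical and are not drawn from any course dataset.

```python
# Hypothetical (patient, diagnosis code) records, invented for illustration
records = [("p1", "E11"), ("p2", "I10"), ("p1", "I10"), ("p3", "E11")]

# Dictionary (hash map): group codes by patient; lookups by key take
# constant time on average, unlike scanning a list of pairs.
codes_by_patient = {}
for patient_id, code in records:
    # A set per patient drops duplicate codes automatically
    codes_by_patient.setdefault(patient_id, set()).add(code)

# Set intersection: codes that patients p1 and p2 have in common
shared = codes_by_patient["p1"] & codes_by_patient["p2"]

# Membership test on a dictionary key, also constant time on average
has_p3 = "p3" in codes_by_patient

print(sorted(shared), has_p3)
```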

IV. Elements of Linear Algebra (1 week)

• Scalars, vectors, matrices
• Dot product, matrix multiplication
• Plotting

Class | Topic | Objectives and Competencies
13 | Concepts from Linear Algebra | Vectors, matrices, inner product
14 | Multidimensional randomness | Random vectors, covariance, multivariate normal distribution

V. Programming Basics III (2 weeks)

• Persistence (relational database rationale and basic operations)

Class | Topic | Objectives and Competencies | Themes
15 | Relational databases and basic operations: Create, Read, Update, and Delete (CRUD) | schema, primary keys, group by | 1. LIKE; 2. CONCATENATE; 3. Functions
16 | Database modeling with multiple tables | join, indexes | 1. subqueries; 2. outer join; 3. multi-column index
17 | OHDSI | |
18 | Git | version control, git |
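
The CRUD operations named for class 15 can be sketched with Python's built-in `sqlite3` module. This is only an illustration under stated assumptions: the syllabus does not say which database engine the course uses, and the `visit` table, its columns, and its rows are hypothetical.

```python
import sqlite3

# In-memory database so the example leaves no file behind
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create: a schema with a primary key (hypothetical table and columns)
cur.execute(
    "CREATE TABLE visit ("
    "visit_id INTEGER PRIMARY KEY, patient TEXT NOT NULL, diagnosis TEXT)"
)
cur.executemany(
    "INSERT INTO visit (patient, diagnosis) VALUES (?, ?)",
    [("alice", "asthma"), ("bob", "diabetes"), ("alice", "diabetes")],
)

# Update: refine one diagnosis
cur.execute(
    "UPDATE visit SET diagnosis = ? WHERE patient = ? AND diagnosis = ?",
    ("type 2 diabetes", "bob", "diabetes"),
)

# Delete: remove one row
cur.execute(
    "DELETE FROM visit WHERE patient = ? AND diagnosis = ?",
    ("alice", "asthma"),
)

# Read: count visits per patient with GROUP BY
rows = cur.execute(
    "SELECT patient, COUNT(*) FROM visit GROUP BY patient ORDER BY patient"
).fetchall()

# Read: pattern matching with LIKE
like_rows = cur.execute(
    "SELECT patient FROM visit WHERE diagnosis LIKE ? ORDER BY patient",
    ("%diabetes%",),
).fetchall()

conn.close()
print(rows, like_rows)
```

Parameter placeholders (`?`) are used instead of string formatting so that values are passed to the database safely.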

VI. Programming Basics IV (1 week)

• Object-oriented programming
• Handling large datasets (data frames, pandas)

Class | Topic | Objectives and Competencies | Themes
19 | Object-oriented programming (OOP) | Classes, objects, and inheritance |
20 | Data, Persistence | Data frames, null values, filtering, strengths and weaknesses of file formats | 1. JSON; 2. XML; 3. CSV; 4. Serialization

VII. Elements of Data Analytics (1.5 weeks)

• Hypothesis testing (scipy library, chi-square, t-test, one-way ANOVA, correlation)
• Predicting from large datasets (logistic regression, convex optimization)

Class | Topic | Objectives and Competencies
21 | Hypothesis testing theory | null hypothesis, p-value, confidence interval, credible interval
22 | Hypothesis testing practice | chi-squared, t-test, ANOVA, Pearson correlation, nonparametric tests
23 | Prediction | regression, least squares, ML

VIII. Review and Final Exam