BINF G4013 Biological Sequence Analysis

Course Description: Biological sequence analysis (bioinformatics) is the study of the relationships between biological sequences and the implications of these relationships for macromolecular structure, function, and evolution. Bioinformatics is now a necessary part of the tools and training of all scientists who use molecular biological methods because they must process raw DNA sequences and assign structure and function to gene products. Bioinformatic methods are based on aspects of a variety of disciplines including computer science, probability, and molecular, structural, and evolutionary biology. Biomedical researchers must understand the computational and scientific basis of the programs they use in order to use them correctly.

The purpose of this course is to train biomedical researchers, including those who may not yet be comfortable with computers, in bioinformatic methods. All of these methods center around the identification and classification of gene products and the elucidation of genome structure. The biological and informatic basis of all methods studied is covered, the methods are demonstrated in class, and the students run the programs themselves. The course covers both web-based and desktop-based programs.

Instructor

Richard Friedman, PhD

Teaching Assistant

Andrey Zaznaev

This class will be taught Wednesdays from 9 am to 12 pm in PH20-200.

Prerequisites: There are no formal prerequisites. A basic background in molecular biology is assumed; students who lack it can make it up by reading. If you think that you may not have an adequate background in molecular biology, please let me know and I will send you a PowerPoint presentation and assign reading. No computer background beyond basic desktop usage is necessary.

Requirements: Attendance is required at the scheduled time of the lecture and lab, even when the course is taught remotely. Each student is required to attain the primary objectives of each lab. Homework is assigned. You are required to bring a laptop computer to class. CMBS 4020 is graded Pass/Fail and there will be no tests. BINF 4013 is graded by a letter grade and there will be a final based upon the labs, lectures, reading, and homework.

Required Text

Bioinformatics and Functional Genomics, Third Edition, Jonathan Pevsner, Wiley-Liss. 2015.

This text is available from the publisher in print or in two electronic formats at http://www.wiley.com/WileyCDA/WileyTitle/productCd-1118581725.html

It is also available in print form from various online booksellers.

Course Syllabus

1. Sequences and databases. Jan. 17

Read: Bioinformatics and Functional Genomics Chapters 1 and 2.

File formats: Fasta, GCG, EMBL.
Seqret (EMBOSS).
Querying NCBI databases.
NCBI GQuery Server.
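
As an illustration of the Fasta format listed above, here is a minimal sketch of reading Fasta records in Python. It is not part of the course materials: the function name parse_fasta and the two sample records are hypothetical, and the only assumption about the format is that each record is a header line beginning with ">" followed by one or more sequence lines.

```python
def parse_fasta(text):
    """Return a dict mapping each record ID to its full sequence."""
    records = {}
    current_id = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            records[current_id] = []
        elif current_id is not None:
            # Sequence lines are concatenated until the next header.
            records[current_id].append(line)
    return {rid: "".join(parts) for rid, parts in records.items()}


# Made-up example records, for illustration only.
example = """>seq1 hypothetical example record
ATGGCC
TTAA
>seq2 another made-up record
GGGCCC
"""
sequences = parse_fasta(example)
```

Format-conversion programs such as Seqret handle many more formats and edge cases than this sketch does.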

2. Comparison of Sequences. Jan. 24

Read: Bioinformatics and Functional Genomics: Chapter 3.

Graphical sequence comparison (Dotmatcher – EMBOSS).
Needleman-Wunsch global sequence alignment (Needle – EMBOSS).
Smith-Waterman local sequence alignment (Water – EMBOSS).
Comparing a DNA sequence with a protein sequence – Genewise.
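
The dynamic programming idea behind Needleman-Wunsch global alignment can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the EMBOSS implementation: it uses hypothetical match, mismatch, and linear gap scores, whereas Needle uses a substitution matrix and separate gap opening and extension penalties. The function name needleman_wunsch_score is made up for this example.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Assumes a simple linear gap penalty; the scoring values are
    illustrative defaults, not those of any particular program.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # First column and row: aligning a prefix against nothing costs gaps.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    return score[-1][-1]


identical = needleman_wunsch_score("GATTACA", "GATTACA")
one_gap = needleman_wunsch_score("ACGT", "AGT")
```

Smith-Waterman local alignment differs mainly in that cell scores are never allowed to fall below zero and the best score is taken from anywhere in the table.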

3. Database searching by sequence. Jan. 31

Read: Bioinformatics and Functional Genomics Chapter 4.

Use of the Blast World-Wide-Web interface.
Interpretation of results, Karlin-Altschul theory and statistical significance.
The Blast algorithm.
Filtering of low-complexity and repetitive sequences – Seg and Dust.
The Fasta family of programs.
The Blast family of programs.
Combining Blast and text string searches.
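
In Karlin-Altschul theory, the expected number of chance hits with raw score at least S is E = K m n e^(-lambda S), where m and n are the query and database lengths. The sketch below computes this value and the corresponding bit score. The parameter values lam = 0.267 and k = 0.041 are illustrative assumptions; actual values depend on the scoring system and are reported in Blast output. The sketch also ignores the length corrections that Blast applies.

```python
import math


def bit_score(raw_score, lam, k):
    """Normalized (bit) score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_value(raw_score, query_len, db_len, lam, k):
    """Karlin-Altschul E-value: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative parameters only (assumptions, not taken from the syllabus).
lam, k = 0.267, 0.041
m, n = 300, 1_000_000          # hypothetical query and database lengths
e_low = expect_value(50, m, n, lam, k)
e_high = expect_value(100, m, n, lam, k)
bits = bit_score(100, lam, k)
```

Because E falls exponentially with the score, a modest increase in score makes a hit far less likely to have arisen by chance.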

4. Multiple sequence alignment. Feb. 7

Read: Bioinformatics and Functional Genomics: p. 205-222.

Progressive pairwise alignment.
Evolutionarily-weighted progressive pairwise alignment: ClustalX.
Interactive alignment with ClustalX.
Display of multiple sequence alignments: Boxshade.
Consistent alignments with T-Coffee.
Meme: Probabilistic identification and alignment of short conserved regions.

5. Pattern and profile methods of identifying distant homologs. Feb. 14

Read: Bioinformatics and Functional Genomics: p. 552-559, 171-186, 222-237.

A. Prosite Patterns.
A1. The Prosite database (Web).
A2. Regular grammars.
A3. Combining motif and full-sequence searches: Phiblast.
B. Classical Profiles
B1. Theory of classical profiles.
B2. The Prosite Profile Database.
B3. The NCBI Conserved Domain Database.
B4. Automated iterative profile searching with Psiblast.
B5. Searching with sets of short profiles: Meme and Motifscan.
C. Hidden Markov Models.
C1. Theory of hidden Markov Model profiles.
C2. The Pfam Hidden Markov Model Server.
C3. Other Profile databases.
C3a. Smart.
C3b. Interpro.
C4. The Web implementation of Hmmer.
C4a. hmmsearch- Compares a protein alignment/profile-HMM to a protein sequence database.
C4b. hmmscan- Compares a protein sequence to a profile-HMM database.
C4c. phmmer- Compares a protein sequence to a protein database.
C4d. jackhmmer- Compares a protein sequence to a protein database iteratively.
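
Prosite patterns (topic A above) are regular expressions written in their own syntax, which is why they can be described by regular grammars. As a minimal sketch, the hypothetical function prosite_to_regex below translates the most common pattern elements into a Python regular expression: x for any residue, [..] for allowed residues, {..} for forbidden residues, and (n) or (n,m) for repeat counts. It is an illustration only and does not cover every feature of the Prosite syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a Prosite-style pattern into a Python regex string."""
    pattern = pattern.rstrip(".")          # patterns end with a period
    parts = []
    for element in pattern.split("-"):     # elements are joined by '-'
        element = element.replace("{", "[^").replace("}", "]")  # forbidden set
        element = element.replace("(", "{").replace(")", "}")   # repeat counts
        element = element.replace("x", ".")                     # any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


# A pattern in the style of a C2H2 zinc finger signature (illustrative).
zinc_finger = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")

# A made-up sequence constructed to contain one match.
sequence = "MK" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
hit = re.search(zinc_finger, sequence)
```

Patterns of this kind match or fail outright; the profile and hidden Markov model methods in topics B and C instead assign position-specific scores.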

6. Mapping, Primer design and RNA Secondary Structure. Feb. 21

Read: Bioinformatics and Functional Genomics p. 433-459; http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen_docs.html

A. Mapping

Restriction enzyme selection and depiction, using Nebcutter2.

B. Primer design:
B1. Thermodynamics of DNA melting.
B2. Primer design: Primer-blast.
B3. Degenerate Primers: backtranseq.

C. Vecscreen

D. RNA secondary structure prediction – Mfold
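
As a rough illustration of how base composition affects DNA melting (topic B1 above), the sketch below applies the Wallace rule of thumb, Tm = 2(A+T) + 4(G+C) degrees Celsius, which is sometimes used for short oligonucleotides. This is a swapped-in simplification, not the thermodynamic model covered in the course, and primer design tools generally use more detailed nearest-neighbor calculations. The function names and the example primer are hypothetical.

```python
def wallace_tm(primer):
    """Estimate melting temperature (deg C) with the Wallace rule.

    Rule of thumb for short oligonucleotides only:
    Tm = 2 * (A + T) + 4 * (G + C).
    """
    primer = primer.upper()
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def gc_fraction(primer):
    """Fraction of bases in the primer that are G or C."""
    primer = primer.upper()
    gc = primer.count("G") + primer.count("C")
    return gc / len(primer) if primer else 0.0


example_primer = "ATGCATGCATGCATGC"   # made-up 16-mer
tm = wallace_tm(example_primer)
gc = gc_fraction(example_primer)
```

G-C pairs contribute more to the estimate than A-T pairs, which reflects their greater stability.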

7. Genomic Analysis. Feb. 28

Read: Bioinformatics and Functional Genomics p. 957-979; NCBI dbSNP: p. 988-992; http://www.repeatmasker.org/faq.html; http://www.ncbi.nlm.nih.gov/RefSeq/key.html#accessions

Filtering of repetitive sequences using Repeatmasker.
Exon and gene identification: Genscan.
Promoter identification. Ppnn web-site.
The Transfac databases.
Identifying Transfac Profiles with Match.
Probabilities of TFBSs with RSAT.
cDNA-genomic DNA alignment.
The Santa Cruz Human and Mouse Genome Map web-site.
The NCBI Genomic web-sites.

8. Protein structure prediction. Mar. 6

Read: Bioinformatics and Functional Genomics Chapter 13.

A. Composition and digestion: compseq, pepstats, pepdigest (EMBOSS)
B. Secondary structure and hydrophobicity. Classical methods.
B1. Introduction to protein secondary structure and hydrophobicity.
B2. The Chou-Fasman secondary structure prediction method: pepinfo (EMBOSS).
B3. Kyte-Doolittle hydrophobicity: pepinfo (EMBOSS).

C. Neural network based methods. The Predict-Protein Web server.

C1. Secondary structure prediction of water-soluble proteins.
C2. Secondary structure prediction of membrane proteins.
C3. Surface accessibility.

D. Signal Peptide identification- SignalP web server.

E. Coiled-coil regions – pepcoil

F. Subcellular location: LocTarget (NLS, LocKey, LocHom, LocNet).

G. Globularity analysis with segmasker.

H. Threading: Phyre

9. Molecular Systematics. Mar. 20

Read: Bioinformatics and Functional Genomics Chapter 7.

Molecular species trees: Rooted and Unrooted.
Newick notation.
Mutational basis of molecular systematics.
Protein vs. nucleic acid methods.
Systematic methods in BLAST.
Distance Methods – UPGMA and Neighbor Joining as implemented in ClustalX.
The maximum likelihood method – PROTML.
Statistical significance of systematics trees.
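
Newick notation, listed above, writes a tree as nested parentheses terminated by a semicolon. The sketch below converts a tree held as nested Python tuples into a Newick string. The function name to_newick, the tuple representation, and the three-species example are assumptions made for illustration, and branch lengths are omitted for brevity.

```python
def to_newick(node):
    """Render a nested-tuple tree as Newick text (without the final ';').

    A leaf is a string; an internal node is a tuple of child nodes.
    """
    if isinstance(node, str):
        return node
    return "(" + ",".join(to_newick(child) for child in node) + ")"


# Hypothetical tree: human and chimp grouped together, mouse outside.
tree = (("human", "chimp"), "mouse")
newick = to_newick(tree) + ";"
```

The resulting string groups human and chimp in the inner parentheses, with mouse joining at the outer level.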

10. Functional Genomics I: Microarray Analysis. Mar. 27

Read: Bioinformatics and Functional Genomics p. 460-478, 504-511; the PowerPoint slides for this lesson available on the courseworks site (read notes as well as slides); About AffylmGUI: http://bioinf.wehi.edu.au/affylmGUI/R/library/affylmGUI/doc/about.html; Running the Estrogen Dataset: http://bioinf.wehi.edu.au/affylmGUI/R/library/affylmGUI/doc/estrogen/estrogen.html

A. Experimental Methods

Glass-slide cDNA.
Affymetrix oligonucleotide.

Normalization

Need for normalization.
Example of normalization method: GCRMA.

Statistical Analysis of differential expression.

Statistical Theory

The normal distribution.
P-values.
Small sample size: t-test.
Multiple-tests: False Discovery Rate.
Theory of LIMMA (LInear Models for MicroArrays).
Using AffylmGUI.
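
To illustrate the False Discovery Rate topic listed above, here is a minimal sketch of the Benjamini-Hochberg adjustment, one common FDR procedure. The syllabus does not say which procedure the course uses, so treat the choice as an assumption. The function name and the p-values are made up for the example.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical p-values from four tests.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Tests whose adjusted value falls below the chosen FDR threshold are reported as significant; with thousands of genes on an array, this controls the expected share of false positives among the reported genes.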

11. Functional Genomics II: Functional Databases and Clustering. Apr. 3

Read: Bioinformatics and Functional Genomics Chapter 14; p. 676-678, 682-685, 1036-1046, 511-516.

Functional Genomic Databases
NCBI Gene Database
Online Mendelian Inheritance in Man.
Gene Ontology Database.
KEGG: Kyoto Encyclopedia of Genes and Genomes.

Functions and pathways from differential expression.

The chi-square distribution and overrepresentation analysis.
Webgestalt – Pathway database searching.
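
The idea behind overrepresentation analysis can be sketched with a chi-square statistic for a 2x2 table that cross-classifies genes by differential expression and pathway membership. The counts below are invented for illustration, the function name is hypothetical, and no continuity correction is applied. This is not a description of how any particular tool performs the calculation.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic (1 degree of freedom) for the table

                      in pathway   not in pathway
        DE genes          a              b
        other genes       c              d
    """
    n = a + b + c + d
    numerator = n * (a * d - b * c) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    return numerator / denominator


# Made-up counts: 20 of 100 DE genes are in the pathway,
# compared with 80 of the 900 remaining genes.
statistic = chi_square_2x2(20, 80, 80, 820)
no_enrichment = chi_square_2x2(10, 10, 10, 10)
```

A statistic above 3.84, the 5% critical value of the chi-square distribution with one degree of freedom, suggests that the pathway is represented in the gene list more (or less) often than chance alone would predict.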

Clustering (unsupervised learning)

Theory of hierarchical clustering. Generation of heat maps and dendrograms.
K-means clustering.
Principal component analysis.
Use of Cluster 3.0 and Treeview to run the above algorithms.

12. Functional Genomics III: RNAseq and Quantitative Real-time PCR. Apr. 10

Read: Bioinformatics and Functional Genomics Chapter 9; p. 519-521.

1. RNASeq

Theory:

The Illumina RNASeq platform.
RNASeq file formats: Fastq, SAM, BAM, GFF, GTF.
RNASeq quality control: FastQC.
Burrows-Wheeler alignment to the Genome: Bowtie.
Spliced Alignment of Reads to the Genome: Tophat2.
Alignment by vote-and-count: Subread.
Counting Gene Copies: HtSeq/FeatureCounts.
Normalization of data: TMM.
Theory of comparison of counts: The Poisson and negative binomial distributions.

Differential expression of genes from RNASeq data: DeSeq, EdgeR, Limma-Voom.
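
As a small illustration of the Fastq format listed above, the sketch below decodes a read's quality string into Phred scores and error probabilities. It assumes the Phred+33 encoding, in which each score is the character code minus 33; some older data use a different offset. The function names and the example quality string are hypothetical.

```python
def phred33_qualities(quality_string):
    """Decode a Fastq quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Made-up quality string for a three-base read.
scores = phred33_qualities("I!5")
probabilities = [error_probability(q) for q in scores]
```

Quality control programs summarize these per-base scores across all reads, so a drop in quality toward the end of the reads shows up clearly.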

Practice:

Some of the above methods will be implemented in a hands-on exercise with OneChannelGUI.

2. Real-time quantitative PCR

Theory:

The Real-time PCR experiment.
Analytical pitfalls and their solutions:
Non-normality: Use negative cycles.
Pseudoreplication: Averaging or mixed effects models.

Practice:

Processing actual real-time quantitative PCR data in Excel.

13. Optional Review Session. May 1

14. Final. May 10

The final is required only of students who take the course for a letter grade (BINF 4013). Students who take the course Pass/Fail (CMBS 4020) do not have to take the final.