BINF G4013 Biological Sequence Analysis
Course Description: Biological sequence analysis (bioinformatics) is the study of the relationships between biological sequences and the implications of these relationships for macromolecular structure, function, and evolution. Bioinformatics is now a necessary part of the tools and training of all scientists who use molecular biological methods, because they must process raw DNA sequences and assign structure and function to gene products. Bioinformatic methods draw on a variety of disciplines, including computer science, probability, and molecular, structural, and evolutionary biology. Biomedical researchers must understand the computational and scientific basis of the programs they use in order to apply them correctly.
The purpose of this course is to train biomedical researchers, including those who may not yet be comfortable with computers, in bioinformatic methods. All of these methods center around the identification and classification of gene products and the elucidation of genome structure. The biological and informatic basis of all methods studied is covered, the methods are demonstrated in class, and the students run the programs themselves. The course covers both web-based and desktop-based programs.
Instructor

Richard Friedman, PhD
Teaching Assistant

Yiwei Sun
Class Schedule
Wednesday 9-12 online
via Zoom
(1 hour lecture / 2 hours lab)
- File formats: Fasta, GCG, EMBL.
- Seqret (EMBOSS).
- Querying NCBI databases.
- NCBI GQuery Server.
- Graphical sequence comparison (Dotmatcher – EMBOSS).
- Needleman-Wunsch global sequence alignment (Needle – EMBOSS)
- Smith-Waterman local sequence alignment (Water – EMBOSS).
- Comparing a DNA sequence with a protein sequence – Genewise.
- Use of the Blast World-Wide-Web interface.
- Interpretation of results, Karlin-Altschul theory and statistical significance
- The Blast algorithm.
- Filtering of low-complexity and repetitive sequences – Seg and Dust.
- The Fasta family of programs
- The Blast family of programs.
- Combining Blast and text string searches.
- Progressive pair-wise alignment.
- Evolutionarily-weighted progressive pairwise alignment: ClustalX.
- Interactive alignment with ClustalX.
- Display of multiple sequence alignments: Boxshade.
- Consistent alignments with T-Coffee.
- Meme: Probabilistic identification and alignment of short conserved regions.
- Vecscreen
- RNA secondary structure prediction – Mfold
- Promoter identification. Ppnn web-site.
- The Transfac databases.
- Identifying Transfac Profiles with Match.
- Probabilities of TFBSs with RSAT
- cDNA-genomic DNA alignment.
- The Santa Cruz Human and Mouse Genome Map web-site.
- The NCBI Genomic web-sites.
- Neural network based methods. The Predict-Protein Web server.
- Signal peptide identification – SignalP web server.
- Coiled-coil regions – pepcoil
- Subcellular location: LocTarget (NLS, LocKey, LocHom, LocNet).
- Globularity analysis with segmasker.
- Threading: Phyre
- Molecular species trees: Rooted and Unrooted.
- Newick notation.
- Mutational basis of molecular systematics.
- Protein vs. nucleic acid methods.
- Systematic methods in BLAST.
- Distance Methods – UPGMA and Neighbor Joining as implemented in ClustalX.
- The maximum likelihood method – PROTML.
- Statistical significance of systematics trees.
- Experimental Methods
- Glass-slide cDNA.
- Affymetrix oligonucleotide.
- Normalization
- Need for normalization.
- Example of normalization method: GCRMA.
- Statistical Analysis of differential expression.
- Statistical Theory
- The normal distribution.
- P-values.
- Small sample size: t-test.
- Multiple-tests: False Discovery Rate.
- Using AffylmGUI.
- Functional Genomic Databases
- NCBI Gene Database
- Online Mendelian Inheritance in Man.
- Gene Ontology Database.
- KEGG: Kyoto Encyclopedia of Genes and Genomes.
- Functions and pathways from differential expression.
- The chi square distribution and overrepresentation analysis.
- Webgestalt – Pathway database searching.
- Clustering (unsupervised learning)
- Theory of hierarchical clustering. Generation of heat maps and dendrograms.
- K-means clustering.
- Principal component analysis
- Use of Cluster 3.0 and Treeview to apply the above algorithms.
- RNASeq
- The Illumina RNASeq platform.
- RNASeq file formats: Fastq, SAM, BAM, GFF, GTF.
- RNASeq quality control: FastQC.
- Burrows-Wheeler alignment to the Genome: Bowtie.
- Spliced Alignment of Reads to the Genome: Tophat2.
- Alignment by vote-and-count: Subread.
- Counting Gene Copies: HtSeq/FeatureCounts.
- Normalization of data: TMM.
- Theory of comparison of counts: The Poisson and negative binomial distributions.
- Differential expression of genes from RNASeq data: DESeq, edgeR, limma-voom.
- Some of the above methods will be implemented in a hands-on exercise with OneChannelGUI.
- Real-time quantitative PCR
- The Real-time PCR experiment.
- Analytical pitfalls and their solutions:
- Non-normality: Use negative cycles.
- Pseudoreplication: Averaging or mixed effects models.
- Processing real qRT-PCR data in Excel.
- Compound Covariate.
- Discriminant analysis.
- Logistic regression
- Regularized logistic regression.
- Nearest neighbor classifiers.
- Training set and test set.
- Sample size.
- Cross validation.
- Sensitivity and Specificity.
- Receiver operating characteristic (ROC) curves.
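
As a small illustration of the last few classification topics, the sketch below computes sensitivity and specificity at a single score threshold and then traces the points of a ROC curve. It is only a minimal example, not course material: the function names (`sensitivity_specificity`, `roc_points`, `auc`) and the six-sample data set are hypothetical, and the convention that a score at or above the threshold counts as a positive call is an assumption.

```python
from typing import List, Sequence, Tuple


def sensitivity_specificity(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) when score >= threshold means 'positive'.

    Assumes labels are 1 (positive) or 0 (negative) and both classes are present.
    """
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)  # true positives
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)   # false negatives
    tn = sum(1 for y, s in pairs if y == 0 and s < threshold)   # true negatives
    fp = sum(1 for y, s in pairs if y == 0 and s >= threshold)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


def roc_points(
    labels: Sequence[int], scores: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return ROC points (1 - specificity, sensitivity), one per distinct score."""
    points = [(0.0, 0.0)]  # threshold above every score: nothing called positive
    for threshold in sorted(set(scores), reverse=True):
        sens, spec = sensitivity_specificity(labels, scores, threshold)
        points.append((1.0 - spec, sens))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier output for six samples (1 = case, 0 = control).
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

sens, spec = sensitivity_specificity(labels, scores, threshold=0.5)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
print(f"AUC={auc(roc_points(labels, scores)):.3f}")
```

Lowering the threshold raises sensitivity at the cost of specificity, and the ROC curve summarizes that trade-off over all thresholds. In practice these quantities should be estimated on a held-out test set or by cross validation rather than on the training data.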