Schedule

Lectures and material can be found in the comics-2018 GitLab repository (request access from the instructor). If you want to work with the repository, please fork it and remember to synchronize with upstream from time to time.

You will (eventually) find one notebook for each lecture or group of lectures, together with information for the practicals and tasks. To view a notebook, you can use the included lectures/show.sh script, e.g. run ./show.sh 01-introduction.ipynb in the lectures directory of the repository. The tasks are optional; the project is required.

Date Class Topics
11.10. Lecture 1 Introduction, DNA structure, number of DNA k-mers, DNA sequencing technologies, file formats (FASTA, FASTQ), Phred quality values, Lander-Waterman formula, Poisson approximation.
Practical Introduction to conda and Python; Installation of Jupyter Notebooks.
Tasks Install conda and get the notebook of lecture 1 running.
Watch and understand some of the sequencing technology videos: Illumina, Pacific Biosciences, Oxford Nanopore.
Play with Python, e.g., write or inspect a tool that counts nucleotides in FASTA files.
18.10. Lecture 2 The DNA sequencing error correction problem, k-mer statistics, Bloom filters.
Practical Computing false positive rates of Bloom filters.
Tasks (1) Summarize the Lighter article; (2) solve the problems given in the linked repository.
25.10. Lecture 3 Tool: Lighter (key ideas: Sampling + Bloom filter + smoothing k-mers by positions + greedy error correction). Metrics for classifiers: recall, precision, F-score. Introduction to 3-way bucketed Cuckoo hashing.
Practical Hash functions. Discussion of problem set 1.
Tasks (1) Estimate fill rates by simulation.
01.11. (No courses; All Saints Day)
08.11. (No courses; evaluation meeting)
15.11. Lecture 4 Obtaining the k-mer abundance histogram. Genome assembly.
Practical Hash functions. Discussion of fill rates.
Tasks (1) Apply the minia3 assembler to the SRX016231 dataset. Find out what's happening. You'll need time and space (100 GB download). Minia needs little RAM.
22.11. Lecture 5 Genetic variant calling. Bayesian models.
Practical Distinguishing true variants from technical artefacts ("spalter" approach)
Tasks Read the (somewhat over-the-top) original "DeepVariant" paper.
29.11. Lecture 6 Transcriptomics: DNA microarrays vs. RNA-seq. Normalization of RNA-seq experiments. Scaling normalization: sum, medians, quantiles, quantile range. Full quantile normalization. Parametric normalization.
Practical Normalization on an example dataset.
Tasks Implement a different normalization strategy.
06.12. Lecture 7 Transcriptomics: Model-based analysis of gene expression count matrices (DEseq approach).
Practical Presentation of projects (see below).
Tasks (1) Apply steps of the DEseq approach to the example RNA-seq dataset. (2) Choose a project.
13.12. Lecture 8 Statistical aside: multiple testing and its application to gene expression analysis. Single tests, p-values, multiple tests, Bonferroni correction, Benjamini-Hochberg correction, Beta-Uniform mixtures.
Practical Presentation of project datasets.
Tasks Develop a project plan.
20.12. Lecture 9 (cancelled due to illness) Proteomics: mass spectrometry. Database search for the identification of peptides and metabolites.
Practical Discussion of project plans.
Tasks Work on project.
10.01. Lecture 10 Algorithms for de-novo peptide sequencing from mass spectrometry data.
Practical Simulation of de-novo peptide sequencing.
Tasks (to be announced)
17.01. Lecture 11 Metabolomics: identification of metabolites. CHNOPS alphabet, sum formulas, coin changing problem.
Practical Implementation of algorithms
Tasks Implementation and testing of algorithms
24.01. Lecture 12 Metabolomics: metabolic networks, flux analysis, double description methods.
Practical (to be announced)
Tasks (to be announced)
31.01. Lecture 13 Summary.
Practical (to be announced)
Tasks (to be announced)
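
As a starting point for the Lecture 1 task of counting nucleotides in FASTA files, here is a minimal sketch using only the Python standard library. The function name count_nucleotides and the toy records are made up for illustration; the sketch assumes plain, uncompressed FASTA input where header lines start with ">".

```python
from collections import Counter
import io


def count_nucleotides(lines):
    """Count the characters in all sequence lines of a FASTA stream.

    Header lines (starting with '>') and empty lines are skipped.
    Lower-case letters are counted together with upper-case ones.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        counts.update(line.upper())
    return counts


# Toy input (hypothetical records, not course data).
example = io.StringIO(">seq1 toy record\nACGT\nacgn\n>seq2\nTTGA\n")
counts = count_nucleotides(example)
for nucleotide in sorted(counts):
    print(nucleotide, counts[nucleotide])
```

For a real file, pass an open file object instead of the StringIO, e.g. with open("some.fasta") as f: count_nucleotides(f), where the file name is a placeholder.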

Projects

Completing a project is required to pass the course.

Project suggestions

Feel free to contribute your own ideas as well. Material is in the git repository.

  1. Chimera detection: In the git repository for the course, you will find two FASTA files with (simulated) sequenced genes, some from true species, some from chimeras derived from the species. The task is to classify the sequences into species and chimeras.

  2. Gene expression normalization: In the git repository, you can find two CSV files (the separator is actually just whitespace) with absolute counts of RNA molecules. (There is one header line containing the column headings, "genes" and the sample names. The following lines each contain the counts for one gene whose ID/name/description is given in the first column.) The task is to write a normalization procedure that goes beyond simple scaling, but stays below full quantile normalization. It is recommended to use pandas for reading the file and numpy/scipy for manipulating the numerical data.

  3. Discovering interesting genes: Using the same dataset(s) as in 2. and a simple form of normalization (e.g. constant 75% quantile), find "interesting" genes without known classes. The first task is to formalize what is "interesting". One (non-exclusive) possibility is to find genes where the expression is at least bimodal. Tests for non-unimodality exist in the literature, but there seem to be few implementations. Another approach could be to make distributional assumptions (similar to, but different from the Negative Binomial used by DEseq), but without known classes. It would be interesting to see whether many interesting genes support the same split into two classes.
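
For projects 2 and 3, the following is a minimal sketch of reading such a whitespace-separated count table with pandas and applying a constant 75% quantile scaling. The function names, the toy table, and the choice to rescale every sample to the mean of the per-sample quantiles are assumptions for illustration, not the course's reference solution. It also assumes that the gene column contains no whitespace.

```python
import io

import pandas as pd


def read_counts(source):
    """Read a whitespace-separated count table.

    The first column (header "genes") becomes the row index;
    the remaining columns are the samples.
    """
    return pd.read_csv(source, sep=r"\s+", index_col=0)


def upper_quartile_scale(counts, q=0.75):
    """Scale each sample so that all samples share the same q-quantile.

    Genes with zero counts in every sample are ignored when the
    quantiles are computed, but they are kept in the output.
    """
    expressed = counts[counts.sum(axis=1) > 0]
    quantiles = expressed.quantile(q)   # one value per sample
    target = quantiles.mean()           # common reference level (assumption)
    return counts * (target / quantiles)


# Toy table (made-up numbers, not the project data).
toy = io.StringIO(
    "genes s1 s2\n"
    "g1 10 20\n"
    "g2 20 40\n"
    "g3 30 60\n"
    "g4 40 80\n"
    "g5 0 0\n"
)
counts = read_counts(toy)
scaled = upper_quartile_scale(counts)
print(scaled)
```

In the toy table, sample s2 is exactly twice sample s1, so the two columns coincide after scaling. Real data will not line up this neatly, which is where normalization beyond simple scaling comes in.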

Materials

The following links can be useful in addition to the lectures.

Git Repository with Data and Project Tasks

For read/fork access, please send an email with your gitlab username to the lecturer.

Link: https://gitlab.com/rahmannlab/comics-2018/

DNA/RNA Sequencing Technologies

  • Illumina Sequencing (Illumina): Principle. HiSeq2500 properties.
  • Illumina Sequencing (USD Bioinformatics): Part 1, Part 2.
  • IonTorrent (Life Technologies): Introduction. Claims on accuracy.
  • Long overview talk of NGS technologies and cancer genomics (Elaine Mardis, 2014), can start at 3:30
  • Descriptions of FASTA format and of FASTQ format
  • Description of PHRED quality scores

Genome Assembly

  • Velvet Assembler (classic, EMBL-EBI)
  • Minia Assembler

Duplicate Rate and Diversity Estimation

  • (German) “Wie viele Wörter kannte Shakespeare?” ("How many words did Shakespeare know?"), blog post by Christian Hesse, 26 December 2014; refers to Good & Toulmin and Efron & Thisted
  • I. J. Good and G. H. Toulmin: The Number of New Species, and the Increase in Population Coverage, when a Sample is Increased. Biometrika Vol. 43, No. 1/2 (Jun., 1956), pp. 45-63.
  • Bradley Efron and Ronald Thisted: Estimating the number of unseen species: How many words did Shakespeare know? Biometrika (1976) 63 (3): 435-447.
  • Daley T, Smith AD. Predicting the molecular complexity of sequencing libraries. Nat Methods. 2013 Apr;10(4):325-7.
  • Preseq software by Daley and Smith, and the manual
  • The “preseq challenge” website.
  • Our dupre repository (Schröder & Rahmann), see dupre/occupancy.py (e.g., the downsample function for an interesting way to obtain a subsample from a given occupancy vector using Python standard data types)
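
To illustrate what subsampling from an occupancy vector means, here is a naive sketch. It is not the dupre implementation. It assumes that occupancy[i] is the number of distinct molecules observed exactly i times (index 0 is ignored), and it expands the vector into one label per read, which is only practical for small inputs.

```python
import random
from collections import Counter


def downsample(occupancy, n_reads, seed=None):
    """Draw n_reads reads without replacement; return the new occupancy vector.

    occupancy[i] is taken to be the number of distinct molecules seen
    exactly i times (assumption); entry 0 is ignored on input and is
    always 0 in the result.
    """
    rng = random.Random(seed)

    # Expand the occupancy vector into a list with one label per read.
    reads = []
    label = 0
    for times_seen, n_molecules in enumerate(occupancy):
        if times_seen == 0:
            continue
        for _ in range(n_molecules):
            reads.extend([label] * times_seen)
            label += 1

    # Subsample reads, then count how often each molecule was drawn.
    sample = rng.sample(reads, n_reads)
    per_molecule = Counter(sample)
    histogram = Counter(per_molecule.values())

    max_count = max(histogram, default=0)
    return [histogram.get(c, 0) for c in range(max_count + 1)]


# Toy occupancy vector: 3 singletons, 2 molecules seen twice, 1 seen three times.
occupancy = [0, 3, 2, 1]
subsample = downsample(occupancy, n_reads=4, seed=1)
print(subsample)
```

Note that random.sample raises a ValueError if n_reads exceeds the total number of reads encoded in the occupancy vector.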

Course Summary

Computational Omics is a course about algorithms for computational and statistical analysis of "big data" in the life sciences. Its contents are a mixture of algorithms, statistics, data science and molecular biology. It consists of lectures (2 hours per week) and a practical session (2 hours per week), where participants implement some of the methods discussed in class and apply them to real datasets. The course is not recommended for the casual student!

Some of the topics discussed in this course are

  • Genomics: Genome sequence analysis
  • Transcriptomics: Quantifying gene activity
  • Epigenomics: Chemical DNA changes
  • Proteomics: Identification and quantification of proteins
  • Metabolomics: Chemistry and linear algebra

Course Description (translated from German)

The course gives an overview of current computational methods for data analysis in the "omics" of the life sciences (genomics, transcriptomics, epigenomics, proteomics, metabolomics, interactomics, ...). It consists of several units, each devoted to one of these areas and its associated technologies. The technologies discussed and the resulting data may change from year to year, because technology in this field advances quickly. What matters is that the lectures convey theoretical foundations and principles. These are deepened by studying current publications on technologies and algorithms, which the students work through themselves. In the accompanying practical, students work on their own mini-projects. Examples of topics are:

  • High-throughput DNA sequencing: error correction
  • Genome assembly
  • Variant detection in genomes
  • Mass spectrometry: composition of a molecule from mass information
  • Protein identification from mass fingerprints
  • Ion mobility spectrometry in metabolomics
  • ...

Each unit begins with an introduction to the underlying technologies, with particular attention to the nature and generation of the data. This is followed by typical questions that can be posed and answered using the data obtained. For each of these, the most important data analysis methods are discussed. These often divide into so-called low-level methods for preprocessing, which depend mainly on the type of data, and high-level methods from machine learning, which extract the desired information from the data. Because of the data volumes involved, resource-efficient algorithms receive particular emphasis. From a statistical perspective, a further concern is dealing sensibly with high-dimensional data at small sample sizes (the $n \ll p$ problem).

Imprint and Data Protection

This website is informational and for educational purposes. It contains lecture material for a course offered by the following responsible person, to whom comments and questions about this site should be sent:

Prof. Dr. Sven Rahmann
Institut für Humangenetik / Genominformatik
Medizinische Fakultät der Universität Duisburg-Essen
Universitätsklinikum Essen
Hufelandstr. 55
45147 Essen

The information contained in this website is for general information purposes only. The information is provided by the responsible person and while we endeavor to keep the information up to date and correct, we make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose. Any reliance you place on such information is therefore strictly at your own risk.

In no event will we be liable for any loss or damage including without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this website.

Through this website you are able to link to other websites which are not under the control of the responsible person. We have no control over the nature, content and availability of those sites. The inclusion of any links does not necessarily imply a recommendation or endorse the views expressed within them.

Every effort is made to keep the website up and running smoothly. However, the responsible person takes no responsibility for, and will not be liable for, the website being temporarily unavailable due to technical issues beyond our control.

Data Privacy

Concerning your data privacy (DSGVO), we do not ask you to provide personal information to use this site. This is a read-only website. However, we rely on external scripts and JavaScript libraries to present and render this site. Also, the hosting platform of this site may set cookies to track your usage of this site. We have no control over their tracking mechanisms; to be safe, you may deactivate JavaScript entirely and disallow cookies.