Schedule

Lectures and material can be found in the GitLab repository (request access from the instructor); the link is listed under Materials below.

Date    Class      Topics
11.10.  Lecture    Introduction, DNA structure, number of DNA k-mers, DNA sequencing technologies, file formats (FASTA, FASTQ), Phred quality values, Lander-Waterman formula, Poisson approximation.
        Practical  Introduction to conda and Python; installation of Jupyter Notebooks.
        Tasks      Install conda and get the notebook of lecture 1 running.
                   Watch and understand some of the sequencing technology videos: Illumina, Pacific Biosciences, Oxford Nanopore.
                   Play with Python, e.g., write or inspect a tool that counts nucleotides in FASTA files.
18.10.  Lecture    The DNA sequencing error correction problem, k-mer statistics, Bloom filters.
        Practical  The tool "Lighter" for error correction.
        Tasks      (1) Summarize the Lighter article; (2) solve the problems given in the linked repository.
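
As a starting point for the first week's "play with Python" task, a nucleotide counter for FASTA input could look roughly like the sketch below. It is not part of the course material: the function name count_nucleotides and the in-memory example records are made up for illustration, and the sketch assumes plain FASTA text where header lines start with ">".

```python
from collections import Counter
import io


def count_nucleotides(lines):
    """Count the characters on the sequence lines of FASTA-formatted text.

    `lines` is any iterable of strings, e.g. an open text file.
    Header lines (starting with '>') and blank lines are skipped;
    sequence letters are upper-cased before counting.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue  # skip blank lines and record headers
        counts.update(line.upper())
    return counts


# Tiny made-up example, read from memory instead of a real file.
example = io.StringIO(">read1\nACGTAC\nGGN\n>read2\nttaa\n")
counts = count_nucleotides(example)
print(dict(counts))
```

For a real file, the same function can be called as count_nucleotides(open(path)) with a path of your choice. Ambiguity codes such as N are counted like any other letter.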

More will be added soon.

Materials

The following links can be useful in addition to the lectures.

Git Repository with Data and Project Tasks

For read/fork access, please send an email with your GitLab username to the lecturer.

Link: https://gitlab.com/rahmannlab/comics-2018/

DNA/RNA Sequencing Technologies

  • Illumina Sequencing (Illumina): Principle. HiSeq2500 properties.
  • Illumina Sequencing (USD Bioinformatics): Part 1, Part 2.
  • IonTorrent (Life Technologies): Introduction. Claims on accuracy.
  • Long overview talk on NGS technologies and cancer genomics (Elaine Mardis, 2014); you can start watching at 3:30
  • Descriptions of FASTA format and of FASTQ format
  • Description of PHRED quality scores
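
To make the last two items concrete: a Phred score Q encodes an error probability p via Q = -10 · log10(p), and FASTQ files store each score as a single ASCII character. The sketch below decodes such a quality string. The helper names and the example string are made up for illustration, and the offset of 33 is an assumption (the Sanger / Illumina 1.8+ convention; older Illumina pipelines used 64).

```python
def phred_to_error_prob(q):
    """Error probability p for a Phred score Q, from Q = -10 * log10(p)."""
    return 10 ** (-q / 10)


def decode_quality_string(qual, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores.

    offset=33 is assumed here (Sanger / Illumina 1.8+ encoding).
    """
    return [ord(ch) - offset for ch in qual]


quality_line = "II5+!"  # made-up quality string for illustration
scores = decode_quality_string(quality_line)
probs = [phred_to_error_prob(q) for q in scores]

for ch, q, p in zip(quality_line, scores, probs):
    print(f"{ch!r}: Q={q:2d}  p={p:.4g}")
```

In this example, the character "I" stands for Q = 40 (one expected error in 10,000 base calls), while "!" stands for Q = 0.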

Genome Assembly

  • Velvet Assembler (classic, EMBL-EBI)
  • Minia Assembler

Duplicate Rate and Diversity Estimation

  • (German) “Wie viele Wörter kannte Shakespeare?” (“How many words did Shakespeare know?”), blog post by Christian Hesse, 26 December 2014; refers to Good & Toulmin and Efron & Thisted
  • I. J. Good and G. H. Toulmin: The Number of New Species, and the Increase in Population Coverage, when a Sample is Increased. Biometrika Vol. 43, No. 1/2 (Jun., 1956), pp. 45-63.
  • Bradley Efron and Ronald Thisted: Estimating the number of unseen species: How many words did Shakespeare know? Biometrika (1976) 63 (3): 435-447.
  • Daley T, Smith AD. Predicting the molecular complexity of sequencing libraries. Nat Methods. 2013 Apr;10(4):325-7.
  • Preseq software by Daley and Smith, and the manual
  • The “preseq challenge” website.
  • Our dupre repository (Schröder & Rahmann); see dupre/occupancy.py (e.g., the downsample function, which shows an interesting way to obtain a subsample from a given occupancy vector using Python standard data types)
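
To illustrate what downsampling an occupancy vector means, here is a minimal sketch using only Python standard data types. It is not the dupre implementation, whose details are not reproduced here. It assumes that occupancy[k] is the number of distinct items (e.g. molecules) observed exactly k times, and it draws m reads uniformly without replacement. The function name, its signature, and the convention of reporting lost items in entry 0 are choices made for this example.

```python
import random
from collections import Counter


def downsample(occupancy, m, rng=random):
    """Subsample m reads without replacement; return the new occupancy list.

    occupancy[k] = number of distinct items observed exactly k times
    (occupancy[0] is ignored on input). Illustrative helper only.
    In the result, entry 0 counts items that vanished from the subsample.
    """
    # One entry per distinct item, holding that item's multiplicity.
    multiplicities = [k for k, n_k in enumerate(occupancy)
                      if k > 0 for _ in range(n_k)]
    total_reads = sum(multiplicities)
    if not 0 <= m <= total_reads:
        raise ValueError(f"m must be between 0 and {total_reads}")

    # Expand to one label per read (the index of its item), then sample.
    reads = [i for i, k in enumerate(multiplicities) for _ in range(k)]
    sampled = rng.sample(reads, m)

    per_item = Counter(sampled)              # item index -> times sampled
    new_counts = Counter(per_item.values())  # multiplicity -> number of items

    result = [0] * (max(new_counts, default=0) + 1)
    for k, n_k in new_counts.items():
        result[k] = n_k
    result[0] = len(multiplicities) - len(per_item)  # items no longer seen
    return result


# Made-up example: 5 singletons, 3 items seen twice, 1 item seen three times.
occ = [0, 5, 3, 1]
sub = downsample(occ, 7, rng=random.Random(0))
print(sub)
```

The sketch expands the vector to one list entry per read, so its memory use grows with the total number of reads. That is acceptable for small examples but not for real sequencing libraries.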

Course Summary

Computational Omics is a course about algorithms for computational and statistical analysis of "big data" in the life sciences. Its contents are a mixture of algorithms, statistics, data science and molecular biology. It consists of lectures (2 hours per week) and a practical session (2 hours per week), in which participants implement some of the methods discussed in class and apply them to real datasets. The course is not recommended for the casual student!

Some of the topics discussed in this course are

  • Genomics: Genome sequence analysis
  • Transcriptomics: Quantifying gene activity
  • Epigenomics: Chemical DNA changes
  • Proteomics: Identification and quantification of proteins
  • Metabolomics: Chemistry and linear algebra

Course Commentary (translated from German)

The lecture gives an overview of current computational methods for data analysis in the “omics” of the life sciences (genomics, transcriptomics, epigenomics, proteomics, metabolomics, interactomics, ...). It consists of several units, each devoted to one of these subject areas and its associated technologies. The technologies discussed and the resulting data may change from year to year, because technological development in this field advances rapidly. What matters is that the lecture conveys theoretical foundations and principles. These are deepened through the study of current papers on technologies and algorithms, which the students work through themselves. In the accompanying practical, students work on their own mini-projects. Examples of topics are:

  • High-throughput DNA sequencing: error correction
  • Genome assembly
  • Variant detection in genomes
  • Mass spectrometry: determining the composition of a molecule from mass information
  • Protein identification from mass fingerprints
  • Ion mobility spectrometry in metabolomics
  • ...

Each unit begins with an introduction to the underlying technologies, with particular attention to the nature and generation of the data. This is followed by typical questions that can be posed and answered using the data obtained. For each of these, the most important data analysis methods are discussed. The methods often divide into so-called low-level methods for preprocessing, which depend mainly on the type of data, and high-level methods from machine learning, which extract the desired information from the data. Because of the data volumes involved, the focus is on particularly resource-efficient algorithms. From a statistical point of view, a further concern is dealing sensibly with high-dimensional data at small sample sizes (the $n \ll p$ problem).

Imprint and Data Protection

This website is informational and for educational purposes. It contains lecture material for a course offered by the following responsible person, to whom comments and questions about this site should be sent:

Prof. Dr. Sven Rahmann
Institut für Humangenetik / Genominformatik
Medizinische Fakultät der Universität Duisburg-Essen
Universitätsklinikum Essen
Hufelandstr. 55
45147 Essen

The information contained in this website is for general information purposes only. The information is provided by the responsible person and while we endeavor to keep the information up to date and correct, we make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose. Any reliance you place on such information is therefore strictly at your own risk.

In no event will we be liable for any loss or damage including without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this website.

Through this website you are able to link to other websites which are not under the control of the responsible person. We have no control over the nature, content and availability of those sites. The inclusion of any links does not necessarily imply a recommendation or an endorsement of the views expressed within them.

Every effort is made to keep the website up and running smoothly. However, the responsible person takes no responsibility for, and will not be liable for, the website being temporarily unavailable due to technical issues beyond our control.

Data Privacy

Concerning your data privacy (GDPR; German: DSGVO), we do not ask you to provide personal information to use this site. This is a read-only website. However, we rely on external scripts and JavaScript libraries to present and render this site. Also, the hosting platform of this site may set cookies to track your usage of this site. We have no control over their tracking mechanisms; to be safe, you may deactivate JavaScript entirely and disallow cookies.