Lecture: Thu 08:30-10:00 in OH14/104
Practicals: Thu 12:00-14:00 in OH14/202 [?] or individually
Exams (Module INF-MSc-514 in Computer Science or Applied CS):
Lectures and material can be found in the comics-2018 gitlab repository (request access from the instructor). Please fork the repository if you want to work with it and remember to synchronize with upstream from time to time.
You will (eventually) find one notebook for each lecture or group of lectures, together with information for the practicals and tasks.
To view a notebook, you may use the included lectures/show.sh script, e.g. call ./show.sh 01-introduction.ipynb in the lectures directory of the repository.
The tasks are optional; the project is required.
| Date | Class | Topics |
|---|---|---|
| 11.10. | Lecture 1 | Introduction, DNA structure, number of DNA k-mers, DNA sequencing technologies, file formats (FASTA, FASTQ), Phred quality values, Lander-Waterman formula, Poisson approximation. |
| | Practical | Introduction to conda and Python; installation of Jupyter Notebooks. |
| | Tasks | Install conda and get the notebook of lecture 1 running. Watch and understand some of the sequencing technology videos: Illumina, Pacific Biosciences, Oxford Nanopore. Play with Python, e.g., write or inspect a tool that counts nucleotides in FASTA files. |
| 18.10. | Lecture 2 | The DNA sequencing error correction problem, k-mer statistics, Bloom filters. |
| | Practical | Computing false positive rates of Bloom filters. |
| | Tasks | (1) Summarize the Lighter article; (2) solve the problems given in the linked repository. |
| 25.10. | Lecture 3 | Tool: Lighter (key ideas: Sampling + Bloom filter + smoothing k-mers by positions + greedy error correction). Metrics for classifiers: recall, precision, F-score. Introduction to 3-way bucketed Cuckoo hashing. |
| | Practical | Hash functions. Discussion of problem set 1. |
| | Tasks | (1) Estimate fill rates by simulation. |
| 01.11. | (No courses; All Saints Day) | |
| 08.11. | (No courses; evaluation meeting) | |
| 15.11. | Lecture 4 | Obtaining the k-mer abundance histogram. Genome assembly. |
| | Practical | Hash functions. Discussion of fill rates. |
| | Tasks | (1) Apply the minia3 assembler to the SRX016231 dataset. Find out what's happening. You'll need time and space (100 GB download). Minia needs little RAM. |
| 22.11. | Lecture 5 | Genetic variant calling. Bayesian models. |
| | Practical | Distinguishing true variants from technical artefacts ("spalter" approach). |
| | Tasks | Read the (somewhat over-the-top) original "DeepVariant" paper. |
| 29.11. | Lecture 6 | Transcriptomics: DNA microarrays vs. RNA-seq. Normalization of RNA-seq experiments. Scaling normalization: sum, medians, quantiles, quantile range. Full quantile normalization. Parametric normalization. |
| | Practical | Normalization on an example dataset. |
| | Tasks | Implement a different normalization strategy. |
| 06.12. | Lecture 7 | Transcriptomics: Model-based analysis of gene expression count matrices (DEseq approach). |
| | Practical | Presentation of projects (see below). |
| | Tasks | (1) Apply steps of the DEseq approach to the example RNA-seq dataset. (2) Choose a project. |
| 13.12. | Lecture 8 | Statistical aside: multiple testing and its application to gene expression analysis. Single tests, p-values, multiple tests, Bonferroni correction, Benjamini-Hochberg correction, Beta-Uniform mixtures. |
| | Practical | Presentation of project datasets. |
| | Tasks | Develop a project plan. |
| 20.12. | Lecture 9 | (cancelled due to illness) Proteomics: mass spectrometry. Database search for the identification of peptides and metabolites. |
| | Practical | Discussion of project plans. |
| | Tasks | Work on project. |
| 10.01. | Lecture 10 | Algorithms for de-novo peptide sequencing from mass spectrometry data. |
| | Practical | Simulation of de-novo peptide sequencing. |
| | Tasks | (to be announced) |
| 17.01. | Lecture 11 | Metabolomics: identification of metabolites. CHNOPS alphabet, sum formulas, coin changing problem. |
| | Practical | Implementation of algorithms. |
| | Tasks | Implementation and testing of algorithms. |
| 24.01. | Lecture 12 | Metabolomics: metabolic networks, flux analysis, double description methods. |
| | Practical | (to be announced) |
| | Tasks | (to be announced) |
| 31.01. | Lecture 13 | Summary. |
| | Practical | (to be announced) |
| | Tasks | (to be announced) |
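
As a possible starting point for the Lecture 1 task of writing a tool that counts nucleotides in FASTA files, here is a minimal sketch. It is not taken from the course repository; the function name `count_nucleotides` and the in-memory example sequences are made up for illustration. It assumes plain-text FASTA, where header lines start with `>` and all other non-empty lines contain sequence.

```python
from collections import Counter
import io


def count_nucleotides(lines):
    """Count the characters on all sequence lines of a FASTA stream.

    `lines` is any iterable of text lines, e.g. an open file.
    Header lines (starting with '>') and blank lines are skipped.
    Letters are upper-cased, so 'a' and 'A' are counted together.
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith(">"):
            continue
        counts.update(line.upper())
    return counts


# Hypothetical toy input; for a real file, pass an open file object instead.
example = io.StringIO(">seq1 demo\nACGT\nacgn\n>seq2\nTTTT\n")
counts = count_nucleotides(example)
for nucleotide in sorted(counts):
    print(nucleotide, counts[nucleotide])
```

Because the function only iterates over lines, it never holds the whole file in memory, which matters for large sequencing datasets.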
Required to pass the course
Feel free to contribute your own ideas as well. Material is in the git repository.
1. Chimera detection: In the git repository for the course, you will find two FASTA files with (simulated) sequenced genes, some from true species, some from chimeras derived from the species. The task is to classify the sequences into species and chimeras.
2. Gene expression normalization: In the git repository, you can find two CSV files (the separator is actually just whitespace) with absolute counts of RNA molecules. (There is one header line containing the column headings, "genes" and the sample names. The following lines each contain the counts for one gene whose ID/name/description is given in the first column.) The task is to write a normalization procedure that goes beyond simple scaling, but stays below full quantile normalization. It is recommended to use pandas for reading the file and numpy/scipy for manipulating the numerical data.
3. Discovering interesting genes: Using the same dataset(s) as in 2. and a simple form of normalization (e.g. constant 75% quantile), find "interesting" genes without known classes. The first task is to formalize what is "interesting". One (non-exclusive) possibility is to find genes where the expression is at least bimodal. Tests for non-unimodality exist in the literature, but there seem to be few implementations. Another approach could be to make distributional assumptions (similar to, but different from the Negative Binomial used by DEseq), but without known classes. It would be interesting to see whether many interesting genes support the same split into two classes.
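
For the gene expression projects, the following sketch shows one way to read a whitespace-separated count table with pandas and apply the simple "constant 75% quantile" scaling mentioned above. It is only a baseline, not the required normalization procedure, which should go beyond simple scaling. The toy data, the sample names `s1` and `s2`, and the function name `upper_quartile_scale` are invented for illustration. The sketch also assumes that the first column contains no whitespace, which may not hold for the real files if gene descriptions are included.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical toy data in the described layout: one header line ("genes" plus
# sample names), then one whitespace-separated line of counts per gene.
toy = io.StringIO(
    "genes s1 s2\n"
    "g1 10 20\n"
    "g2 20 40\n"
    "g3 30 60\n"
    "g4 40 80\n"
    "g5 0 0\n"
)

# sep=r"\s+" splits on runs of whitespace; the gene IDs become the row index.
counts = pd.read_csv(toy, sep=r"\s+", index_col="genes")


def upper_quartile_scale(df, q=0.75):
    """Scale each sample (column) so that all samples share the same q-quantile.

    Genes with zero counts in every sample are ignored when computing the
    quantiles, but they are kept in the returned table.
    """
    expressed = df[(df > 0).any(axis=1)]
    quantiles = expressed.quantile(q)        # one quantile per sample
    factors = quantiles / quantiles.mean()   # scaling factors with mean 1
    return df / factors                      # divides column-wise by sample


normalized = upper_quartile_scale(counts)
print(normalized)
```

In the toy data, sample `s2` is exactly twice sample `s1`, so after scaling both columns coincide. Real samples differ in more than sequencing depth, which is why the project asks for a procedure beyond this kind of scaling.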
The following links can be useful in addition to the lectures.
For read/fork access, please send an email with your gitlab username to the lecturer.
Link: https://gitlab.com/rahmannlab/comics-2018/
Computational Omics is a course about algorithms for computational and statistical analysis of "big data" in the life sciences. Its contents are a mixture of algorithms, statistics, data science and molecular biology. It consists of lectures (2 hours per week) and a practical session (2 hours per week), where participants implement some of the methods discussed in class and apply them to real datasets. The course is not recommended for the casual student!
Some of the topics discussed in this course are
The lecture gives an overview of current computational methods for data analysis in the "omics" of the life sciences (genomics, transcriptomics, epigenomics, proteomics, metabolomics, interactomics, ...). It consists of several units, each devoted to one of these subject areas and its associated technologies. The technologies discussed and the resulting data may change from year to year, because technological development in this field advances quickly. What matters is that the lecture conveys theoretical foundations and principles. These are deepened through the study of current papers on technologies and algorithms and are worked out by the students. In the accompanying practical, students work on their own mini-projects. Examples of topics are:
Each unit begins with an introduction to the underlying technologies, with particular attention to the nature and generation of the data. This is followed by typical questions that can be posed and answered using the data obtained. For each of these, the most important data analysis methods are discussed. They are often divided into so-called low-level methods for preprocessing, which depend mainly on the type of data, and high-level methods from machine learning, which extract the desired information from the data. Because of the data volumes involved, the focus is on particularly resource-efficient algorithms. From a statistical point of view, an additional concern is how to deal sensibly with high-dimensional data at small sample sizes (the $n \ll p$ problem).
and Data Protection
This website is informational and for educational purposes. It contains lecture material for a course offered by the following responsible person, to whom comments and questions about this site should be sent:
Prof. Dr. Sven Rahmann
Institut für Humangenetik / Genominformatik
Medizinische Fakultät der Universität Duisburg-Essen
Universitätsklinikum Essen
Hufelandstr. 55
45147 Essen
The information contained in this website is for general information purposes only. The information is provided by the responsible person and while we endeavor to keep the information up to date and correct, we make no representations or warranties of any kind, express or implied, about the completeness, accuracy, reliability, suitability or availability with respect to the website or the information, products, services, or related graphics contained on the website for any purpose. Any reliance you place on such information is therefore strictly at your own risk.
In no event will we be liable for any loss or damage including without limitation, indirect or consequential loss or damage, or any loss or damage whatsoever arising from loss of data or profits arising out of, or in connection with, the use of this website.
Through this website you are able to link to other websites which are not under the control of the responsible person. We have no control over the nature, content and availability of those sites. The inclusion of any links does not necessarily imply a recommendation or endorse the views expressed within them.
Every effort is made to keep the website up and running smoothly. However, the responsible person takes no responsibility for, and will not be liable for, the website being temporarily unavailable due to technical issues beyond our control.
Concerning your data privacy (DSGVO), we do not ask you to provide personal information to use this site. This is a read-only website. However, we rely on external scripts and JavaScript libraries to present and render this site. Also, the hosting platform of this site may set cookies to track your usage of this site. We have no control over their tracking mechanisms, and you may deactivate JavaScript entirely to be safe and disallow cookies.