Discovery in Millions of Genomes: Three key challenges for bioinformatics

Title: Discovery in Millions of Genomes: Three key challenges for bioinformatics

Speaker: Dr. Sebastian Wernicke

Date/Time: December 2, 2015, 13:40

Place: FENS G032

Abstract: Our ability to sequence the human genetic code advances even more rapidly than computation. Rather than doubling performance every 18 months (“Moore’s law”), our ability to read genetic codes quadruples in that same amount of time. It is therefore estimated that by next year, a million individuals will have had their entire genome sequenced and that by 2018, ~2 exabytes of genetic information will have been generated.

While these genomic datasets, often generated through large consortia or government initiatives, hold many promises for the effective future treatment of diseases such as cancer and rare genetic conditions, their very large size and medical sensitivity pose significant challenges for processing this data while at the same time keeping it secure. Solving these challenges requires very specific approaches that cannot simply be transferred from other areas of large-scale computing. This talk will outline three key computational challenges of discovering information in large genomic datasets and present our current research on these topics.

The first challenge: Connect the data and make it available for analysis

Genetic information is only useful if it can be analyzed efficiently by researchers. This is a nontrivial challenge since, on the one hand, the data is inherently sensitive and personal, yet on the other hand, researchers must be able to run tens of thousands of different, customized tools on this data. Additionally, the data usually does not reside in a single place but in multiple locations, and it is far too large to import into a single location. This requires the development of novel computing concepts, for which we will present our current approaches and thinking.

The second challenge: Make algorithms come to the data

Analyzing genetic data usually happens through complex chains of tools (“pipelines”). This poses two main challenges. First, despite this complexity, pipelines need to be easily reproducible and distributable by researchers. Second, as the same algorithms will usually run on distributed datasets, we need to ensure that they behave reliably in very different environments, from small desktop solutions to massively parallel cloud environments. We will present the Common Workflow Language (CWL), an open source project initiated by Seven Bridges to tackle this challenge.

The third challenge: Develop new, massively scalable algorithms

Most of the algorithms and tools used today in generating and exploring genetic information were developed at a time when sequencing 1000 human genomes was considered a very ambitious feat. Consequently, these algorithms and methods were developed with small datasets in mind and do not scale efficiently to larger datasets, so new concepts need to be developed and implemented. We will present so-called graph genomes, an example of the new technologies that we are developing in collaboration with the UK 100k Genomes project.

Bio: Dr. Sebastian Wernicke leads the global strategy and growth of Seven Bridges Genomics, a Cambridge (MA)-based company that builds and deploys platforms for Next Generation Sequencing analysis. He also serves as managing director of Seven Bridges Genomics UK, a research-focused subsidiary in London that works in close collaboration with Genomics England and the 100k Genomes project. Dr. Wernicke joined Seven Bridges in 2012 after spending several years consulting for Fortune 500 pharmaceutical and financial services companies on their strategic initiatives. He received his Ph.D. in Bioinformatics from the University of Jena in Germany, where he developed novel algorithms for the combinatorial analysis of biological networks; today his tools and algorithms for network analytics are used by thousands of researchers worldwide.