Tse Lab

PhD Oral Defense: Govinda Kamath

Topic: Graph Clustering for Computational Genomics

  • Advisor: David N. Tse
  • Date: Monday, May 20, 2018
  • Time: 11:30 am (refreshments at 11:15 pm)
  • Location: Packard 202

Abstract

Humans have 23 pairs of homologous chromosomes. The homologous pairs are identical except on certain documented positions called single nucleotide polymorphisms (SNPs). A haplotype of an individual is the pair of sequences of SNPs on the two homologous chromosomes. In this talk, we study the problem of inferring haplotypes of individuals from mate-pair reads of their genome: that is measurements of pairs of SNPs within some window. We first draw connections with decoding convolutional codes. We derive a formula for the coverage needed to be able to do this informationally, and come up with an algorithm that recover the haplotypes whenever we have enough measurements. While the dependence on the number of SNPs is linear, that on the size of the window here is exponential. To overcome this difficulty, we proceed to modelling this as a community detection problem, where we have a graph with vertices representing SNPs and edges representing noisy pairwise measurements. In most standard community detection problems, algorithms make use of a spectral gap to ensure recovery, but the graphs produced by our model here ends up not having a spectral gap. We discuss an almost linear time spectral algorithm to overcome this. Inspired by this practical application, we broaden the scope of this problem to a graph clustering problem where the graph is adversarially chosen, while the noise process is probabilistic and discuss an algorithm for a class of graphs.