Bioinformatics: Nanopore DNA sequencing
DNA sequencing technology has undergone a major revolution with the cost of sequencing plummeting nearly six orders of magnitude. Much of this improvement was made possible by second generation sequencers, utilizing massive parallelization. But these machines can only read short fragments of DNA, typically a few hundred bases long. These short reads are then stitched together with algorithms exploiting the overlap between reads to assemble them into long DNA sequences. This assembly is unreliable because of repeated regions which commonly occur in genomic DNA. Such repeated regions play an important role in evolution, development and in the genetics of many diseases.
Nanopore sequencing promises to address this problem, by increasing the read lengths by orders of magnitude (up to 100K bases). While nanopore sequencers can acquire long reads, the high error rates (~ 30%) pose a technical challenge. In a nanopore sequencer, a DNA is migrated through a nanopore and current variations are measured. The DNA sequence is inferred from this observed current pattern using an algorithm called a base-caller.
We proposed a mathematical model for the “channel” from the input DNA sequence to the observed current, prove a multi-letter mutual information channel capacity formula, and compute bounds on the information extraction capacity of the nanopore sequencer. This model incorporates impairments like inter-symbol interference, insertions and deletions, as well as random response. The practical application of such information bounds is two-fold:
- Benchmarking present base-calling algorithms, and
- Offering an optimization objective for designing better nanopore sequencers.
Using the insights from the model, we developed new alignment algorithms for the nanopore sequencers which gave performance, on real data on several genomes (including human), that is significantly better than the state-of-the-art. We have further modified the alignment algorithm specifically for detecting structural variants on real nanopore data of human genome.
The mathematical model also inspired us to consider a trace-reconstruction problem, where one wants to reconstruct an input sequence through observations over a set of deletion channels (a dominant error pattern in nanopores where symbols are deleted without receiver knowing where). We solved the optimal trace-reconstruction problem (in terms of error probability), for a fixed number of traces by developing a provably polynomial time algorithm.