DNA sequencing technology has undergone a major revolution, with the cost of sequencing plummeting by nearly six orders of magnitude. Much of this improvement was made possible by second-generation sequencers, which rely on massive parallelization. These machines, however, can only read short fragments of DNA, typically a few hundred bases long. The short reads are then stitched together by algorithms that exploit the overlap between reads to assemble them into long DNA sequences. This assembly is unreliable because of repeated regions, which commonly occur in genomic DNA. Such repeated regions play an important role in evolution, development, and the genetics of many diseases.
Nanopore sequencing promises to address this problem by increasing read lengths by orders of magnitude (up to 100K bases). While nanopore sequencers can acquire long reads, their high error rates (~30%) pose a technical challenge. In a nanopore sequencer, a DNA strand migrates through a nanopore while variations in the current are measured. The DNA sequence is inferred from the observed current pattern by an algorithm called a base-caller. In this project, we propose a mathematical model for the “channel” from the input DNA sequence to the observed current, prove a multi-letter mutual-information formula for the channel capacity, and compute bounds on the information extraction capacity of the nanopore sequencer. The model incorporates impairments such as inter-symbol interference, insertions and deletions, and a random response. The practical application of such information bounds is twofold: (1) benchmarking present base-calling algorithms, and (2) offering an optimization objective for designing better nanopore sequencers.
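The impairments named above can be illustrated with a toy simulation. The sketch below is not the model from the publications; it is a minimal illustration under assumptions of our own. Inter-symbol interference is mimicked by letting the current depend on the k bases in the pore (k = 4 here), deletions by skipping a k-mer with probability `p_del`, insertions by a random dwell time that repeats a sample with probability `p_stay`, and the random response by additive Gaussian noise. The function names, the parameter values, and the uniformly random k-mer level table are all hypothetical.

```python
import itertools
import random

BASES = "ACGT"

def make_level_table(k, rng):
    """Hypothetical table: each k-mer -> mean current level (arbitrary units)."""
    return {"".join(kmer): rng.uniform(0.0, 1.0)
            for kmer in itertools.product(BASES, repeat=k)}

def nanopore_channel(seq, levels, k, p_del, p_stay, noise_sd, rng):
    """Toy channel from a DNA string to a list of noisy current samples."""
    samples = []
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]  # ISI: the level depends on k neighbouring bases
        if rng.random() < p_del:
            continue  # deletion: this k-mer produces no sample at all
        # Random dwell time: one sample, then repeat with probability p_stay
        n_samples = 1
        while rng.random() < p_stay:
            n_samples += 1
        for _ in range(n_samples):
            # Random response: Gaussian noise around the k-mer's mean level
            samples.append(levels[kmer] + rng.gauss(0.0, noise_sd))
    return samples

rng = random.Random(0)  # fixed seed so the run is reproducible
k = 4
levels = make_level_table(k, rng)
dna = "".join(rng.choice(BASES) for _ in range(50))
current = nanopore_channel(dna, levels, k, p_del=0.1, p_stay=0.5,
                           noise_sd=0.05, rng=rng)
print(f"{len(dna)} bases in, {len(current)} current samples out")
```

Because the number of output samples is random and does not line up with the input positions, a base-caller has to resolve the alignment and the noise jointly, which is why a capacity bound for such a channel is a natural benchmark.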
- W. Mao, S. Diggavi, and S. Kannan, "Models and information-theoretic bounds for nanopore sequencing," Preprint.
- W. Mao, S. Diggavi, and S. Kannan, "Models and information-theoretic bounds for nanopore sequencing," in Proc. 2017 IEEE International Symposium on Information Theory (ISIT), 2017.