Nanopore sequencing technology, bioinformatics and applications

Nanopore sequencing determines the order and modifications of DNA/RNA nucleotides by detecting variations in electric current as DNA/RNA oligonucleotides pass through a nanometer-sized hole (nanopore). It is being applied in genome assembly, full-length transcript detection and base modification detection and in more specialized areas, such as rapid clinical diagnoses and outbreak surveillance. These breakthroughs have required extensive development of experimental and bioinformatics methods to fully exploit nanopore long reads for investigations of genomes, transcriptomes, epigenomes and epitranscriptomes.
Although the average accuracy of ONT sequencing is improving, certain subsets of reads or read fragments have very low accuracy, and the error rates of both 1D reads and 2D/1D² reads are still much higher than those of short reads generated by next-generation sequencing technologies.
In 2016, GraphMap, the first aligner designed specifically for ONT reads, was developed⁹⁵. In addition to the genome-polishing software Nanopolish¹⁰⁹, ONT released Medaka, a neural network-based method that aims for better accuracy and speed than Nanopolish (Table 1).
Direct RNA sequencing currently produces about 1,000,000 reads (1–3 Gb) per MinION flow cell, due in part to its relatively low sequencing speed (~70 bases per s). ONT cDNA sequencing has also been tested in individual B cells from mice¹²⁰ and humans¹²²,¹⁶⁶.
This review focuses on ONT technology, as it has been used in most peer-reviewed studies of nanopore sequencing, data, analyses and applications.
Overall, method development for base calling went through four stages³²,⁴⁴,⁵⁸,⁶⁷,⁶⁸: (1) base calling from the segmented current data by a hidden Markov model (HMM) at the early stage and by recurrent neural networks in late 2016, (2) base calling from raw current data in 2017, (3) using a flip-flop model for identifying individual nucleotides in 2018 and (4) training customized base-calling models in 2019. Base-calling tools include Guppy, Metrichor, Nanonet, Albacore, Scrappie, Flappie, Taiyaki and Bonito.
The R2C2 protocol involves the generation and sequencing of multiple copies of target molecules¹²². In direct RNA sequencing, only the RNA strand passes through the nanopore, and therefore direct sequencing of RNA molecules does not generate a consensus sequence (for example, 2D or 1D²).
Only three PromethION flow cells were required to sequence the human genome, and the computational assembly took <6 h¹⁶⁴. In one outbreak surveillance study, only 15–60 min of sequencing per sample was required²²⁰. Some applications require specific experimental protocols (for example, identifying chromatin accessibility by detecting artificial 5mC footprints⁷²,¹⁷⁵,¹⁷⁶) rather than the simple generation of long reads. Application categories include non-reference transposable element detection, transcriptome construction and quantification, DNA methylation and chromatin accessibility, DNA replication (replication fork detection) and epitranscriptomics (RNA secondary structure).
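The benefit of sequencing multiple copies of the same molecule, as in R2C2, can be illustrated with a small calculation. The sketch below is not the R2C2 consensus algorithm; it assumes a deliberately simplified model in which every copy miscalls a given base independently with the same probability and the consensus is wrong only when more than half of the copies are wrong. The function name `majority_error_rate` and the 10% per-read error rate are hypothetical choices for illustration. Real nanopore errors are partly systematic, so actual gains are likely smaller than this model suggests.

```python
from math import comb


def majority_error_rate(per_read_error: float, copies: int) -> float:
    """Probability that more than half of `copies` reads are wrong at one base.

    Assumes independent, identically distributed errors (an idealization).
    """
    if not 0.0 <= per_read_error <= 1.0:
        raise ValueError("per_read_error must be between 0 and 1")
    if copies < 1 or copies % 2 == 0:
        # An odd number of copies guarantees that a strict majority exists.
        raise ValueError("copies must be a positive odd number")
    majority = copies // 2 + 1
    # Binomial tail: sum over every outcome in which the wrong calls win the vote.
    return sum(
        comb(copies, k) * per_read_error**k * (1.0 - per_read_error) ** (copies - k)
        for k in range(majority, copies + 1)
    )


if __name__ == "__main__":
    # Hypothetical 10% per-read error rate, increasing number of copies.
    for n in (1, 3, 5, 9):
        print(n, f"{majority_error_rate(0.10, n):.5f}")
```

Under these assumptions the error probability falls quickly as copies are added, which is the intuition behind protocols that read a molecule more than twice.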
The software Causalcall uses a modified temporal convolutional network combined with a connectionist temporal classification decoder to model long-range sequence features³⁵. Taiyaki can also train models for identifying modified bases (for example, 5-methylcytosine (5mC) or N6-methyladenine (6mA)) by adding a fifth output dimension. Nanoraw (integrated into the Tombo software package) was the first tool to identify the DNA modifications 5mC, 6mA and N4-methylcytosine (4mC) from ONT data⁷⁴.
However, the 2D and 1D² approaches were limited in that each molecule could only be measured twice. There is currently no theoretical estimate of the accuracy limit, but for reference, Helicos managed to reduce error rates to 4%.
We review 11 applications that are the subject of the most publications since 2015. Similar progress has been achieved in other model organisms and closely related species (for example, Escherichia coli¹⁰⁹, Saccharomyces cerevisiae¹³⁷, Arabidopsis thaliana¹³⁸ and 15 Drosophila species¹³⁹) as well as in non-model organisms, including characterizing large tandem repeats in the bread wheat genome¹⁴⁰ and improving the continuity and completeness of the genome of Trypanosoma cruzi (the parasite causing Chagas disease)¹⁴¹. Same-day detection of fusion genes in clinical specimens has also been demonstrated by MinION cDNA sequencing¹⁹⁸.
With the increasing throughput of ONT sequencing, real-time surveillance has been applied to pathogens with larger genomes over the years, ranging from viruses of a few kilobases (for example, Ebola virus²²⁰, 18–19 kb; Zika virus²²², 11 kb; Venezuelan equine encephalitis virus²²⁵, 11.4 kb; Lassa fever virus²²⁶, 10.4 kb and SARS-CoV-2 coronavirus¹⁵¹, 29.8 kb) to bacteria of several megabases (for example, Salmonella²²¹, 5 Mb; N. meningitidis²²⁷, 2 Mb and K. pneumoniae²²⁸, 5.4 Mb) and to human fungal pathogens with genomes of >10 Mb (for example, Candida auris²²⁹, 12 Mb).
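To make the role of a connectionist temporal classification (CTC) decoder more concrete, the following is a minimal sketch of generic greedy CTC decoding: take the best-scoring class in each signal frame, merge consecutive repeats, then drop blanks. It is not the decoder used by Causalcall or any ONT base caller, which may use beam search or other refinements. The function name, the blank symbol `-` and the toy scores are all assumptions made for illustration.

```python
from typing import Sequence

BLANK = "-"  # hypothetical symbol standing for the CTC "no base" class


def ctc_greedy_decode(
    frame_scores: Sequence[Sequence[float]],
    alphabet: str = "ACGT" + BLANK,
) -> str:
    """Collapse per-frame class scores into a base sequence (greedy CTC)."""
    called = []
    previous = None
    for scores in frame_scores:
        if len(scores) != len(alphabet):
            raise ValueError("each frame needs one score per alphabet symbol")
        # Index of the highest-scoring class in this frame.
        best = max(range(len(scores)), key=lambda i: scores[i])
        symbol = alphabet[best]
        # Emit a base only when the class changes and is not the blank.
        if symbol != previous and symbol != BLANK:
            called.append(symbol)
        previous = symbol
    return "".join(called)


def one_hot(index: int, size: int = 5) -> list:
    """Toy score vector with all weight on one class (illustration only)."""
    return [1.0 if i == index else 0.0 for i in range(size)]


if __name__ == "__main__":
    # Frames vote: A, A, blank, A, C, C, blank, T
    frames = [one_hot(i) for i in (0, 0, 4, 0, 1, 1, 4, 3)]
    print(ctc_greedy_decode(frames))
```

The blank between the second and third `A` frames is what lets the decoder report two adjacent identical bases instead of merging them. An additional class for a modified base, in the spirit of the fifth output dimension mentioned above, could be modelled here by passing a longer `alphabet`; that mapping is an assumption of this sketch, not Taiyaki's actual output format.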
Nanopore-based sequencing uses a semiconductor-based electronic detection system to detect the unique electrical signals of different molecules as they pass through the nanopore. Under the control of a motor protein, a double-stranded DNA (dsDNA) molecule (or an RNA–DNA hybrid duplex) is first unwound, and the negatively charged single-stranded DNA or RNA is then ratcheted through the nanopore, driven by the voltage. The phi29 DNA polymerase motor protein provided the last piece of the puzzle: in February 2012, two groups demonstrated processive recordings of ionic currents for single-stranded DNA molecules that could be resolved into signals from individual nucleotides by combining phi29 DNA polymerase with a nanopore (α-hemolysin²⁴ and MspA²⁵).
Repetitive sequencing of the same molecule, for example, using 2D and 1D² reads, was helpful in improving accuracy. The average accuracy of 1D² reads is up to 95% (R9.5 nanopore)⁴³. The R10 and R10.3 nanopores with two sensing regions may result in different signal features compared to previous raw current data, which will likely drive another wave of method development to improve data accuracy and base modification detection.
Figure note: data points shown in b (accuracy), c (read length) and d (yield) are from independent studies. The DNA extraction and purification methods used in these independent studies are summarized in Supplementary Table 1.
GridION, for medium-scale projects, has five parallel MinION flow cells. The expected data output of a flow cell mainly depends on (1) the number of active nanopores, (2) DNA/RNA translocation speed through the nanopore and (3) running time.
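The three factors that determine flow cell output can be combined into a rough upper-bound estimate. The sketch below simply multiplies them together; the function name, the `duty_cycle` parameter (the fraction of time a pore is actually threading a molecule) and the example values of 512 pores and 48 h are assumptions added for illustration, while the ~70 bases per s figure is the direct RNA speed quoted in this article. Real runs fall short of this idealized figure because pores become inactive over time and are not continuously occupied.

```python
def estimate_yield_bases(
    active_pores: int,
    bases_per_second: float,
    run_hours: float,
    duty_cycle: float = 1.0,
) -> float:
    """Idealized flow cell yield in bases: pores x speed x time x duty cycle."""
    if active_pores < 0 or bases_per_second < 0 or run_hours < 0:
        raise ValueError("pores, speed and run time must be non-negative")
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0 and 1")
    seconds = run_hours * 3600
    return active_pores * bases_per_second * seconds * duty_cycle


if __name__ == "__main__":
    # Hypothetical run: 512 pores, 70 bases per s, 48 h, every pore always busy.
    ideal = estimate_yield_bases(512, 70, 48)
    print(f"ideal upper bound: {ideal / 1e9:.2f} Gb")
    # Same run with pores occupied only 30% of the time (assumed value).
    print(f"30% duty cycle: {estimate_yield_bases(512, 70, 48, 0.3) / 1e9:.2f} Gb")
```

Keeping the duty cycle as an explicit parameter makes it easy to see how far an observed yield sits below the theoretical maximum for a given pore count and translocation speed.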
Later, ONT's open-source base caller Scrappie (implemented into both Albacore and Guppy) and the third-party software Chiron⁷⁰ adopted neural networks to directly translate the raw current data into DNA sequence.