By Leo Barolo
What can you do with a few extra kilobases?
Long-read nucleic acid sequencing is taking the genomics field by storm. A growing community at UW-Madison is leveraging the technology for everything from metagenomic genome assembly to detecting transcriptome-wide RNA chemical modifications.
Technologies leading the way
These and other applications are being advanced by major players in the field, including Pacific Biosciences (known as “PacBio” and founded by UW alum Stephen Turner) and Oxford Nanopore Technologies (ONT). Both platforms are offered through the UW-Madison Biotechnology Center (UWBC) and are seeing expanded application on campus.
PacBio instruments sequence single nucleic-acid molecules in real time, using a fluorescence-based readout as a polymerase scans the molecule. By reading each fragment in multiple passes, PacBio produces what are called “HiFi” consensus reads of 10-25 kilobases with 99.9% accuracy. The technology is especially useful for producing “phased” genome assemblies (which identify variants inherited together on the same chromosome), identifying full-length transcripts to detect all splice variants (an approach called Iso-Seq), and studying metagenomes in complex samples.
In contrast, ONT sequencing generates ultra-long reads (10 to >100 kilobases) from single DNA or RNA molecules passed through a nanopore. Changes in electrical current can record not only the sequence but also base modifications such as methylation. This makes the technology extremely powerful for a range of applications, from detecting structural variants to mapping RNA modifications across a transcriptome. Both platforms are advancing a range of research topics in leading labs on campus.
Genome assembly & annotation
Long-read sequencing is extremely powerful for producing complete genome sequences. Long stretches of sequence from large DNA fragments can span the repeat regions that are scattered throughout the genome and are difficult to assemble from short-read sequencing. These regions are often missing from other types of draft genome sequences.
“Long-read sequencing on the ONT platform has allowed us to quickly and cheaply generate telomere-to-telomere, chromosome-level genome assemblies,” says John Crandall, a graduate student in Chris Hittinger’s lab in the Laboratory of Genetics and the Center for Genomic Science Innovation (CGSI). Crandall’s research focuses on the molecular evolution of subtelomeric gene families, which are extremely challenging to resolve in even the most complete short-read assemblies.
“These chromosome-level genome assemblies are offering us unparalleled insight into the structure and regulation of biotechnologically and evolutionarily important genes,” he adds.
These technologies are also accelerating our understanding of how genomes function. Mostafa Zamanian, assistant professor in the Department of Pathobiological Sciences, studies neglected tropical diseases caused by parasitic worms.
“Many of these parasites lack high-quality genomes or gene annotations,” says Zamanian. His team is using long-read transcriptome sequencing to improve gene models in each organism. Using this information, the Zamanian lab is investigating pathogenicity genes to target through drug screening.
Understanding transcript and protein identities
CGSI faculty member Colin Dewey, a professor in the Department of Biostatistics and Medical Informatics who specializes in computational analysis of sequencing data, is analyzing PacBio sequences for several purposes. One collaboration, with the labs of CGSI member Lloyd Smith and Nate Sherer in the Departments of Chemistry and Oncology, respectively, is analyzing paired proteomic and long-read transcriptomic data.
A goal is to study how RNA splicing and processing are coupled in cells infected with SARS-CoV-2. “Long-read RNA sequencing data can span an entire transcript,” says Dewey, “and that makes analyzing potential couplings/dependencies much easier than with short-read RNA-seq data, in which reads span at most one or two splice junctions.”
Another project is investigating how long-read RNA sequencing data can inform the identification of protein isoforms (coined “proteoforms” by Smith) in mass spectrometry data. In collaboration with Smith’s lab and University of Virginia professor Gloria Sheynkman (a Smith lab alum), the team is analyzing parallel sets of transcriptomic and mass spec proteomic data (see also Rachel Miller highlight in this G&T issue).
Engineering gene variant libraries
CGSI member Vatsan Raman’s lab in the Department of Biochemistry is using PacBio technology for large-scale screening of individual gene variants to understand the mutational landscape of the gene. In pooled selection experiments, the competitive fitness of gene variants can be read by quantifying DNA “barcodes” engineered into each cell. But associating a barcode with the gene variant carried by the cell is a challenge.
“Here, long-read sequencing comes to our rescue,” says Raman. The team introduces short, randomized sequences or “barcodes” downstream of the gene’s stop codon, then uses long-read sequencing to associate the barcode sequence with the variant in the gene carried by that cell. “Once the link between barcodes and variants has been clearly cataloged, we can simply sequence the barcodes with the Illumina platform to phenotype the variant libraries.” This approach enables the same library to be used in many applications or screens.
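For readers who think in code, the two-step logic described above can be sketched roughly as follows. This is a minimal illustration and not the Raman lab's actual pipeline: the function names, the `min_support` threshold, and the toy barcode and variant labels are all hypothetical. It assumes each long read has already been reduced to a (barcode, variant) pair and each short read to a barcode string.

```python
from collections import Counter, defaultdict


def build_barcode_map(long_reads, min_support=2):
    """Link each barcode to the variant it is most often observed with.

    long_reads: iterable of (barcode, variant) pairs from long-read data.
    min_support: hypothetical cutoff; barcodes whose top variant is seen
    fewer times than this are left out of the map.
    """
    votes = defaultdict(Counter)
    for barcode, variant in long_reads:
        votes[barcode][variant] += 1

    barcode_map = {}
    for barcode, counts in votes.items():
        variant, support = counts.most_common(1)[0]
        if support >= min_support:
            barcode_map[barcode] = variant
    return barcode_map


def count_variants(short_read_barcodes, barcode_map):
    """Tally variants from barcode-only short reads using the lookup table."""
    counts = Counter()
    for barcode in short_read_barcodes:
        variant = barcode_map.get(barcode)
        if variant is not None:  # skip barcodes never linked to a variant
            counts[variant] += 1
    return counts


# Toy data (made up for illustration only).
long_reads = [
    ("ACGT", "A12V"), ("ACGT", "A12V"),
    ("TTGA", "G45D"), ("TTGA", "G45D"), ("TTGA", "A12V"),
    ("CCCC", "L7P"),
]
barcode_map = build_barcode_map(long_reads)
variant_counts = count_variants(["ACGT", "ACGT", "TTGA", "CCCC", "GGGG"], barcode_map)
```

Because the lookup table is built once, the same table can be reused to interpret barcode counts from any number of later screens, which is the reuse the approach is meant to enable.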
Services offered through UWBC
The UWBC houses multiple PacBio Sequel II and ONT PromethION and GridION instruments. UWBC is also using these technologies to offer two new services: high-throughput plasmid sequencing and assembly on ONT and Illumina platforms, and high-throughput amplicon sequencing using PacBio technology. The latter service provides unambiguous, haplotype-phased sequence for each allele in amplicons up to 8 kilobases.
“Researchers using degenerate primers to sequence families of genes will find this very useful,” says Josh Hyman, director of the UWBC Next-Generation Sequencing Core. Those interested in these technologies should contact the Core at firstname.lastname@example.org.