Notes on Engineering Health, March 2022

Lampposts and Genomics

It is nighttime and a man is on his knees looking for his keys under a lamppost. A passerby stops and asks him if he thinks that’s where he lost them. The man on his knees answers: “No, but that’s the only place where the light is.”

This famous joke is often used to describe the massive influence of technology on the way we see and interrogate the world. Although Watson and Crick famously solved the structure of DNA in 1953 from groundbreaking crystallography work by Rosalind Franklin and Maurice Wilkins, the ability to “read” or sequence DNA was not scalable until the invention of the Sanger sequencing method in 1977. Sanger’s innovation and the ones that followed slowly paved the way to today’s large-scale genome sequencing, with the current state of technology at each step largely determining the problems the scientific community was then able to focus on. 

The Sanger method allowed the sequencing of up to 900 base pairs at a time by replicating fragments of DNA many times over and tagging them with fluorescently labeled chain-terminating nucleotides. One could then assemble all the fragments into the original sequence. The Sanger method was both revolutionary and extremely labor-intensive, which is why the Human Genome Project took 13 years (1990 to April 2003) and cost nearly $3 billion.

The successor to the Sanger method was Next Generation Sequencing (NGS), which, instead of sequencing a single DNA fragment at a time, runs a massively parallel process that sequences millions of fragments simultaneously in each run. The cost and time advantages of NGS (a market that Illumina dominates with a share of about 80%) have led to innumerable scientific discoveries, shedding new light on evolutionary processes and on a great number of diseases caused by single-nucleotide variants, copy number variants, insertions, or deletions.

The fact that NGS, like the Sanger method, is based on short reads (fewer than 300 base pairs in the case of NGS) that have to be assembled has, however, focused the energy of the scientific community on only a certain subset of genomic events, and has likely led to an underrepresentation of other events such as structural variants. Accurately assembling ~300 base-pair fragments by overlapping them onto each other in order to build large sections or even whole chromosomes is a nearly impossible endeavor, especially for regions poor in information (GC-rich and highly repetitive regions). Imagine trying to solve a large jigsaw puzzle of a clear blue sky with tiny pieces, all of them a similar shade of blue. Indeed, reads less than 300 bases long, such as those typically produced by Illumina NGS machines, are too short to detect more than 70% of the structural variation in the human genome (that is, variation affecting sequences longer than 50 base pairs), with intermediate-size structural variation (less than 2 kb) especially under-represented. Being able to identify, analyze, and eventually diagnose these structural variants will unlock whole parts of biology that have been kept in the dark and that preliminary studies show to be important for health and disease. But this will only be possible with a different street lamp to cast light on this search space.
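The jigsaw problem can be made concrete with a toy example. The sketch below is only an illustration under strong simplifying assumptions: the two "genomes" are invented strings, the `reads` helper is hypothetical, and the model ignores read depth, sequencing errors, and paired-end information, all of which real analysis pipelines rely on. It shows that two sequences differing only in the number of copies of a repeat yield exactly the same set of short reads, while reads long enough to span the repeat tell them apart.

```python
def reads(seq, length):
    """Return the set of distinct substrings ("reads") of the given length."""
    return {seq[i:i + length] for i in range(len(seq) - length + 1)}


# Two invented toy sequences that differ only in repeat copy number.
unit = "ACGT"
reference = "TTGCA" + unit * 6 + "GGATC"  # 6 copies of the repeat unit
variant = "TTGCA" + unit * 9 + "GGATC"    # 9 copies: a toy structural variant

SHORT_READ = 8   # shorter than the repeat region
LONG_READ = 30   # long enough to span the reference repeat and both flanks

# Compare the sets of distinct reads each sequence can produce.
short_same = reads(reference, SHORT_READ) == reads(variant, SHORT_READ)
long_same = reads(reference, LONG_READ) == reads(variant, LONG_READ)

print("Short reads identical:", short_same)
print("Long reads identical:", long_same)
```

With short reads, every piece of the puzzle looks the same in both sequences, so the extra copies of the repeat are invisible in this simplified model. Only the longer reads, which anchor the repeat to its unique flanking sequence, expose the difference.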

The search for this new “street lamp,” a cheap and scalable long-read sequencing technology, has been a decade-long effort. Companies such as Pacific Biosciences (on Illumina’s radar for an acquisition in 2018) and Oxford Nanopore can generate continuous sequences ranging from 10,000 bases to several million bases in length directly from native DNA. However, some challenges remain before these technologies can be widely adopted. A longer time to sequence, a need for more DNA per sample, and a prohibitively high relative cost currently make these approaches unfit for broad expansion beyond research and into the clinic. Solutions bridging the scalability of short-read sequencing and the output of long-read sequencing are emerging, and incumbent companies are placing their bets on the best ways forward (a great review of long-read solutions can be found in Nature).

The short history of DNA sequencing is a wonderful example of the interplay of science and technology, with the current state of technology largely dictating what our science can ask and understand. In thinking about what we know, it is important to remember the limits of what we can see.

Jonathan Friedlander, PhD & Geoffrey W. Smith