In the past 10 years, microbiology has undergone a revolution that has been driven by access to cheap high-throughput DNA sequencing. It was not long ago that the cloning and sequencing of a target gene could take months or years, whereas now this entire process has been replaced by a 10 min Internet search of a public genome database. There has been no single innovation that has initiated this rapid technological change; in fact, the core chemistry of DNA sequencing is the same as it was 30 years ago. Instead, progress has been driven by large sequencing centers that have incrementally industrialized the Sanger sequencing method. A side effect of this industrialization is that large-scale sequencing has moved out of small research labs, and the vast majority of sequence data is now generated by large genome centers. Recently, there have been advances in technology that will enable high-throughput genome sequencing to be established in research labs using bench-top instrumentation. These new technologies are already being used to explore the vast microbial diversity in the natural environment and the untapped genetic variation that can occur in bacterial species. It is expected that these powerful new methods will open up new questions to genomic investigation and will also allow high-throughput sequencing to be not just a discovery exercise but also a routine assay for hypothesis testing. While this review will concentrate on microorganisms, many of the important arguments about the need to measure and understand variation at the species, population and ecosystem level will hold true for many other biological systems.
Is there anything left to sequence?
The first bacterial genome, that of Haemophilus influenzae, was published in 1995 (Fleischmann et al., 1995). This was the first sequence of a free-living species to be completely decoded. The genome was sequenced at The Institute for Genomic Research using the Whole Genome Shotgun (WGS) method. The data from this project, which included 1 830 137 bp of DNA and 1743 predicted genes, laid out, for the first time, the full genetic complement of a bacterial organism. Within 5 years of this publication, numerous other bacteria were sequenced, including Mycobacterium tuberculosis (Cole et al., 1998), one of the most important human bacterial pathogens, Escherichia coli (Blattner et al., 1997) and the first archaeon, Archaeoglobus fulgidus (Klenk et al., 1997). Since then, eukaryotic microbes have been sequenced, such as the malaria parasite Plasmodium falciparum (Gardner et al., 2002a; Gardner et al., 2002b; Hall et al., 2002; Hyman et al., 2002) and yeast (Goffeau et al., 1997). These sequences, along with the large genomes of mammals such as human (Lander et al., 2001), mouse (Waterston et al., 2002) and chimpanzee (Mikkelsen et al., 2005), have led to the massive expansion of sequence data available today.
It is clear that genome sequencing has spearheaded a revolution in the biological sciences by allowing the study of molecular processes in the context of complete cellular systems, thus leading to the concept of 'systems biology'. Genome sequence is also the foundation of the 'omics' technologies such as proteomics and transcriptomics (for example, microarrays). Despite its success, a casual observer of the genomics field might easily believe that there was no requirement for more genome sequencing, as almost all of the major model organisms and important human and animal pathogens have been sequenced. I will argue in this review that sequencing has yet to reach its full potential as a tool for discovery and hypothesis testing. I will draw upon three examples where the potential of new technologies has been, or soon will be, demonstrated: comparative genomics, mutation screening and metagenomics. I will start by describing briefly what the technologies are.
Old and new sequencing technologies
The Sanger sequencing method (Sanger et al., 1977) has been the workhorse technology for DNA sequencing for almost 30 years. This method relies on synthesizing DNA on a single-stranded template while randomly incorporating chain terminators. This generates a range of different fragment sizes that correspond to the positions of the terminators. The older methods would require four reactions per template (one for each base: G, A, T and C), each reaction having a different base as a terminator. The reactions are then run on a gel to identify the size of each fragment. Improvements were made in the 1980s with the use of different colored fluorescent dyes to label terminators (Prober et al., 1987; Smith et al., 1986), so that all of the terminators can be incorporated in a single reaction. The first sequencing machines used this technology in combination with devices to automatically read fragments as they were separated on a polyacrylamide gel. Later, the gels were replaced by capillaries, which simplified the separation step and increased the length of reads (Madabhushi, 1998). In the past 10 years, the average length of a sequencing read has increased from around 450 bp to 850 bp. Despite these technological advances in the Sanger method, whole-genome sequencing is predominantly carried out at large dedicated genome centers that can each house up to 100 sequencing machines and that have the capacity to run them >10 times per day and 365 days per year, using highly automated template preparation pipelines. Without such an infrastructure in place, the cost and workload of generating enough sequence to decode even a relatively small genome are prohibitive.
Recent developments in enzymology, imaging and microfluidics may offer a new approach to sequencing that could yield a massive increase in capacity while removing the need for the huge infrastructure required today. In this review, I will not give an exhaustive list of new technologies but I will describe a few of the published techniques that appear most promising. These can be separated into two approaches: sequencing with amplification and single-molecule sequencing. Fig. 1 gives an overview of some of the different sequencing strategies.
New technologies for sequencing with amplification
The first step in most sequencing processes is to amplify the DNA. This is necessary because measuring biochemical processes at a single-molecule resolution is so technically challenging. In the Sanger method, this is usually done by cloning the DNA into a plasmid and growing clones; however, this has its pitfalls as DNA is a biologically active molecule, hence there are inherent biases against certain stretches of DNA that have physical properties that do not replicate well in E. coli or that code for toxic compounds. The two methods I will discuss here are the Margulies et al. method (Margulies et al., 2005), also known as 454 sequencing after 454 Life Sciences (Branford, CT, USA), which has commercialized it, and the Shendure et al. method (Shendure et al., 2005), also known as polony sequencing (Fig. 2). Both have developed high-throughput strategies for in vitro amplification that are very cheap and also get around the inherent biases of in vivo methods.
454 sequencing is, at the time of writing, the only new sequencing technology that has been widely deployed. The 454 method is similar to the polony method in that it involves massively parallel sequencing by synthesis on a solid support. The method allows reads as long as 250 bp (and the maximum read length is expected to increase further in the coming year) and is therefore at least approaching the read lengths obtainable through traditional methods. Margulies et al. have devised a scalable, highly parallel two-step sequencing approach (Margulies et al., 2005). The first step involves shearing the genome and attachment of oligonucleotides, a process that circumvents the need for generating a clone library. Adapters are ligated to the fragments and these are bound to beads and captured in the droplets of an oil-emulsion PCR reaction mixture. PCR amplification in each droplet results in each bead carrying 10 million copies of a unique DNA template. In the second step, a modified pyrosequencing (Ronaghi et al., 1996) protocol is carried out, in which nucleotide incorporation is detected by the release of inorganic pyrophosphate and the generation of photons.
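The light-based readout described above can be illustrated with a minimal Python sketch. It assumes, as is generally the case for pyrosequencing, that nucleotides are flowed over the template one at a time in a repeating order and that the light signal from each flow is roughly proportional to the number of bases incorporated. The function name, the flow order "TACG" and the signal values are illustrative assumptions, not details taken from Margulies et al.

```python
def decode_flowgram(signals, flow_order="TACG"):
    """Turn per-flow light intensities into a base sequence.

    Each signal is rounded to the nearest whole number of incorporated
    bases; a signal near 0 means the flowed nucleotide was not added.
    The flow order is an assumption made for this illustration.
    """
    bases = []
    for i, signal in enumerate(signals):
        nucleotide = flow_order[i % len(flow_order)]
        run_length = int(signal + 0.5)  # nearest integer for signals >= 0
        bases.append(nucleotide * run_length)
    return "".join(bases)


# Hypothetical normalized intensities for two cycles of T, A, C, G flows.
example_signals = [1.02, 0.05, 2.10, 0.00, 0.00, 0.97, 1.00, 3.05]
print(decode_flowgram(example_signals))  # TCCACGGG
```

Because a run of identical bases is read from a single intensity value, a small error in that value changes the called length of the run, which is one reason why the error profile of this kind of read differs from that of Sanger reads.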
Polony sequencing involves an in vitro library construction step that generates two paired genomic tags in a linear molecule separated by a universal linker and a universal tag on either end. Millions of these molecules are circularized using the linker ends and amplified in parallel in a single reaction tube by a process of emulsion PCR using beads containing primers to the universal tags (very similar to the 454 method). The beads are then immobilized on a flow cell for sequencing. An unusual aspect of the polony technique is that it does not use primer extension replication for the sequencing stage but instead relies on the hybridization and ligation of oligonucleotides. First, an anchor primer is hybridized to one of the universal sequences, and then degenerate nonamers, which are labeled using fluorescent dyes, are hybridized to the template and then ligated to the anchor primer. The pools of nonamers are structured so that the base in the degenerate position corresponds to the color of the fluorescent dye labeling it. The nonamers will only ligate if the sequence is complementary to the bases adjacent to the anchor primer, therefore the sequence of the template can be derived. The sequence generated by this technique is very accurate and also benefits from having paired reads. A single run can generate around 30 Mb of sequence, with an estimated cost per kilobase of raw sequence that is 10-fold less than conventional sequencing. The disadvantage of this technique is the short read length, which is currently 26 bp per amplicon (13 bp per tag). The polony method has now been taken on by Applied Biosystems (Foster City, CA, USA). They have adapted the method so it is capable of 50 bp reads and generating >1 Gb of sequence in a single run. The technology (now named SOLiD) is expected to be brought to market in 2007.
Another method for massively parallel sequencing by synthesis from amplified fragments has been recently developed by a company called Solexa (Bennett, 2004; Bennett et al., 2005). Solexa sequencing differs from polony or 454 sequencing as it amplifies the DNA on a solid surface followed by synthesis by incorporation of modified nucleotides linked to colored dyes. Solexa sequencing will not be covered in depth here as (at the time of writing) the methodology has not been published in detail. However, as this review goes to press, Solexa have released their first instrument that is capable of sequencing over 1 Gb in a single run and is likely to have a major impact on the genomics field.
Single-molecule sequencing
Many of the problems, and inherent errors, of DNA sequencing result from the fact that thousands or millions of amplified templates are assessed in a single reaction. It would be far better to read DNA in the same way as cells do: as single molecules. The first published report of single-molecule sequencing was by the lab of Stephen Quake (Braslavsky et al., 2003). This method involves hybridizing target DNA to complementary primers that are streptavidin–biotin bound to a silica surface. The primers are then extended by the addition of Cy3- and Cy5-labeled nucleotides; as each base is added, the incorporation is captured using a camera mounted on a microscope. A limitation of this technology is that it generates short reads, which at the time of publication were only 5 bp long; however, this technology has been taken up by a company (Helicos Biosciences Corporation, Cambridge, MA, USA) who are reporting much longer reads. This method is highly parallel, and on a 25 mm square it would be possible to sequence 12 million templates simultaneously, so, even with 5 bp reads, each 'run' would generate 60 million bases of information.
One other method of single-molecule sequencing that is in the very early stages of development involves 'reading' DNA as it is passed through a nanopore (Kasianowicz et al., 1996; Storm et al., 2005a; Storm et al., 2005b). This would not involve an enzymatic extension reaction of any kind but instead the physical properties of the molecule would be read as the bases wind through a tiny pore. In theory, this method would have no limit on read length and, hence, if the technical hurdles are overcome it could revolutionize how genome sequencing is achieved.
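The yield arithmetic for such a run can be written out as a short Python sketch. The template count and read length are the figures quoted above, and the genome size is that given earlier for H. influenzae; the function names are illustrative only.

```python
def bases_per_run(templates, read_length_bp):
    """Raw bases from one run, assuming every template yields one read."""
    return templates * read_length_bp


def fold_coverage(total_bases, genome_size_bp):
    """Average number of times each genome position would be sampled."""
    return total_bases / genome_size_bp


TEMPLATES = 12_000_000        # templates on a 25 mm square
READ_LENGTH_BP = 5            # read length at the time of publication
H_INFLUENZAE_BP = 1_830_137   # H. influenzae genome size

run_yield = bases_per_run(TEMPLATES, READ_LENGTH_BP)
print(run_yield)  # 60000000
print(fold_coverage(run_yield, H_INFLUENZAE_BP))
```

The coverage figure is nominal only: a 5 bp read matches a genome of this size at many positions, so raw yield does not by itself translate into usable sequence.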
Read length, read quality and read pairs
When considering how a sequencing technology can be used for specific purposes, it is important to consider three quality measures: read length, read quality and read pairing. If reads are very short, then they are of limited use for de novo assembly of complete genomes. Although some simple bacterial genome assemblies have been carried out on reads of less than 50 bp, for the vast majority of genomes, assembly would be impossible. The ability to generate read pairs is also vital for assembly of large genomes as it allows distant regions of the genome to be linked. In Sanger sequencing, this is achieved by cloning large inserts and taking reads from both ends, but this is problematic for most new technologies. Short, single reads are still very useful for comparative studies where the aim is to identify single nucleotide polymorphisms (SNPs) or larger differences between a reference genome and a newly sequenced genome. This type of study requires high-quality reads and hence the error rate for any method used should be low.
Currently, Sanger sequencing outperforms all of the new technologies in these metrics of quality. Hence, efforts are underway to incorporate Sanger sequencing data into 454 sequence assemblies to improve the consensus quality. Because the reads and error distribution for new technologies are very different from Sanger methods, the tools needed to process them and assemble them are different. This means, frustratingly, that it is very difficult to mix Sanger sequencing reads with other types of reads and assemble them together, although some progress has been made in this direction (Goldberg et al., 2006; Wicker et al., 2006).
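The use of short reads for SNP detection against a reference, and the way very short reads become ambiguous, can be illustrated with a deliberately naive Python sketch. It is not the method of any study cited here: the brute-force placement, the one-mismatch threshold, the minimum support of two reads and the toy sequences are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def place_read(read, reference, max_mismatches=1):
    """Return every reference start position where the read fits."""
    hits = []
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        if hamming(read, window) <= max_mismatches:
            hits.append(start)
    return hits


def call_snps(reads, reference, min_support=2):
    """Report (position, reference base, read base) seen in enough reads.

    Reads that fit nowhere, or in more than one place, are ignored.
    """
    votes = Counter()
    for read in reads:
        hits = place_read(read, reference)
        if len(hits) != 1:
            continue
        start = hits[0]
        for offset, base in enumerate(read):
            ref_base = reference[start + offset]
            if base != ref_base:
                votes[(start + offset, ref_base, base)] += 1
    return sorted(key for key, n in votes.items() if n >= min_support)


# Toy reference and three 8 bp reads from a strain carrying a C->T change
# at position 10 (0-based).
reference = "ACGTTGCATGCCGATAGCTA"
reads = ["GCATGTCG", "ATGTCGAT", "GTCGATAG"]
print(call_snps(reads, reference))        # [(10, 'C', 'T')]

# A 2 bp read cannot be placed uniquely even on this tiny reference.
print(place_read("GC", reference, 0))     # [5, 9, 16]
```

Requiring several independent reads to support a difference is one simple way to separate true SNPs from sequencing errors, which is why read quality matters as much as read length for this application.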
Comparative genomics: the need for more de novo genome sequencing
The fact that there are 279 complete bacterial genomes in the public databases (at the time of writing) sounds impressive, but recent estimates suggest that there could be 10^7 distinct bacterial taxa in only 10 g of pristine soil (Curtis and Sloan, 2005; Gans et al., 2005); it therefore follows that for the vast majority of microbes there is no genome sequence data at all. For the few 'lucky' species that have been selected for genomic analysis, there is usually only one reference genome.
For a few pathogenic microbes, multiple strains have been sequenced, and the data from these studies have revealed that a single reference genome, while useful, may only give a snapshot of the genetic makeup of a species. A recent study of group B Streptococcus strains (Tettelin et al., 2005) revealed that, as each new strain was sequenced, new genes were discovered such that, after sequencing eight genomes, approximately 33 novel genes were discovered from each additional genome. This has led to the concept of the 'Pan-genome', which refers to the full gene repertoire contained within a species. The Pan-genome theory predicts that any bacterial species will be made up of a core set of genes that is found in all individuals and a dispensable set of genes that may or may not be present in any particular individual (Medini et al., 2005; Tettelin et al., 2005). This phenomenon seems to be applicable to most other microorganisms examined, and subtractive hybridization studies of E. coli suggest that up to 25% of the genome is specific to individual strains (Fukiya et al., 2004). By sequencing more and more individuals, the scale of the Pan-genome can be estimated. So, for Bacillus anthracis, no more new genes were identified after four strains were sequenced whereas for group B Streptococcus and E. coli it is estimated that the number of strains needed to survey the Pan-genome is at least in the hundreds and effectively may be infinite. An important finding from this work is that for many species, the dispensable gene set may be significantly larger than the core genome. Therefore, a single genome may give a very poor representation of the genetic potential of the species. When predicting the chance of emergence of drug resistance or new virulent forms of pathogens, knowledge of the complete genetic complement of the species is far more important than the genetic complement of an individual.
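The distinction between core and dispensable genes can be made concrete with a small Python sketch based on set operations. The strain and gene names are hypothetical, and real analyses must first decide which genes in different strains count as the same gene, a step that is not shown here.

```python
def pan_genome(strain_genes):
    """Split the genes of several strains into pan, core and dispensable sets.

    strain_genes maps a strain name to an iterable of gene identifiers.
    """
    gene_sets = [set(genes) for genes in strain_genes.values()]
    if not gene_sets:
        raise ValueError("at least one strain is required")
    pan = set().union(*gene_sets)          # found in any strain
    core = set.intersection(*gene_sets)    # found in every strain
    return pan, core, pan - core


def new_genes_per_strain(strain_genes):
    """Count genes not seen before as each strain is added in turn."""
    seen = set()
    counts = []
    for name, genes in strain_genes.items():
        novel = set(genes) - seen
        counts.append((name, len(novel)))
        seen |= novel
    return counts


# Hypothetical gene content for three strains of one species.
strains = {
    "strain_1": ["geneA", "geneB", "geneC", "geneD"],
    "strain_2": ["geneA", "geneB", "geneC", "geneE"],
    "strain_3": ["geneA", "geneB", "geneF", "geneG"],
}
pan, core, dispensable = pan_genome(strains)
print(sorted(core))                   # ['geneA', 'geneB']
print(len(pan), len(dispensable))     # 7 5
print(new_genes_per_strain(strains))
```

If the count of new genes per added strain falls to zero, the Pan-genome has in effect been surveyed; if it levels off above zero, as reported for group B Streptococcus, each further genome keeps adding genes.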
Not only do more genomes allow for the discovery of more genes but they also help us to understand how genes and genomes are evolving, as this can provide clues to gene function. Pathogen genes that are interacting with the host are often subject to positive selection (and therefore appear to be evolving rapidly). Genome-wide molecular evolution studies have been applied to various pathogens such as Plasmodium (Hall et al., 2005), Trypanosoma (El-Sayed et al., 2005), Borrelia (Qiu et al., 2004) and many other species. These studies depend on tracing the pattern of mutations that occur in synonymous and non-synonymous sites by aligning orthologous genes in closely related species. The more genomes that can be aligned, the more accurate this analysis is. The studies to date have used up to four genomes at a time but as sequencing becomes more affordable it will be possible to scale this analysis up to look at tens or hundreds of genomes at a time.
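The classification of differences as synonymous or non-synonymous can be sketched in Python for a pair of aligned coding sequences. This is a raw count under simplifying assumptions (the standard genetic code, gap-free in-frame alignment, codons with more than one difference set aside) and not the normalized rate estimate used in the studies cited; the sequences and function name are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_differences(seq_a, seq_b):
    """Count (synonymous, non_synonymous, skipped) codon differences.

    Only codons that differ at exactly one position are classified;
    codons with several differences or unknown bases are skipped.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned, equal length, in frame")
    syn = nonsyn = skipped = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        if codon_a == codon_b:
            continue
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff > 1 or codon_a not in CODON_TABLE or codon_b not in CODON_TABLE:
            skipped += 1
        elif CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn, skipped


# Hypothetical orthologous fragments: GCT->GCC keeps alanine, AAA->AGA
# changes lysine to arginine, GGA->GTT differs at two positions.
print(classify_differences("ATGGCTAAAGGA", "ATGGCCAGAGTT"))  # (1, 1, 1)
```

An excess of non-synonymous over synonymous change, once both are scaled by the number of sites at which each kind of change could occur, is the usual signature taken to indicate positive selection.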
Genome sequencing is not yet being routinely used as a hypothesis-testing technology. The reason that we are limited in our ability to use genomic data is that a single reference genome does not provide enough data to allow correlations between genotypes and phenotypes. For example, the Haemophilus influenzae genome is only a single data point, so we cannot correlate the sequence to a phenotype. If genomes from, say, 100 strains of H. influenzae were sequenced, one could test hypotheses about which genes were linked to drug resistance, virulence or transmissibility. However, there is a technology gap between the questions we would like to ask and what is feasible with current methods. To sequence 100 Haemophilus genomes (let alone 100 human genomes) would be completely impractical using traditional Sanger-based techniques and there is a requirement for new methods to allow genomics to address complex genetic questions.
Mutation screening
One of the most obvious applications of cheaper, more high-throughput genome sequencing of microbes is for mutation screening. This may be carried out at the population level, to identify associations between phenotypes and genotypes, or in lab-generated strains, to identify SNPs or larger mutations that have given rise to selected phenotypes. Currently, there are a number of platforms that allow SNP screening using microarrays but these require the array to be pre-designed and they will not resolve large genomic changes such as insertions or inversions relative to the reference sequence. Recent work on experimentally evolved species has demonstrated how new sequencing methods can be used to track mutations that have been acquired in the laboratory.
Shendure et al. used polony sequencing to screen an evolved strain of an E. coli auxotroph (Shendure et al., 2005). The sequencing was able to identify a number of SNPs as well as larger deletions and inversions. This work demonstrated that, despite the small amount of data obtained per clone (26 bp), it was possible to identify large-scale rearrangements in the genome and align fragments to identify SNPs. In a similar study of the cooperative bacterium Myxococcus xanthus (Velicer et al., 2006), a laboratory-evolved strain that had been selected for a cheating phenotype and reselected for a cooperative phenotype was shotgun sequenced using 454 sequencing technology. The 454 sequence was able to identify point mutations in the evolved strain compared with the reference strain, which could then be associated with the changes in phenotype (as well as identifying errors in the reference).
While whole-genome sequencing may still be prohibitively expensive for detection of point mutations, we may expect prices to fall for these new technologies, as they have in the past for Sanger sequencing. Due to their small genome size, microbes will be in the first wave of organisms to be studied this way and we can expect direct whole-genome sequencing to replace many other forward genetic techniques for the study of very specific traits.
Metagenomics
Metagenomics, or community genomics, is an approach aimed at analyzing the genomic content of microbial communities living in any particular niche such as the human gut or the soil. The problem of studying the microbial composition of an environmental sample is one that has baffled microbiologists for some time. The challenge is confounded by the sheer diversity of microbes that are present in even the most extreme environments, along with the fact that only a small proportion of the species are actually culturable. Genomic analysis has been used to circumvent these problems as it can allow the analysis of non-culturable organisms, and molecular phylogenetic analysis can be used to study the taxonomic diversity of the organisms present. The added advantage of genomic methods is that the analysis of gene content will also give an indication of the metabolic potential of an environment.
Metagenomic studies have been applied already to human environments such as the human gut (Breitbart et al., 2003; Gill et al., 2006; Manichanh et al., 2006; Zhang et al., 2006), environmental samples such as soil (Bertrand et al., 2005; Lim et al., 2005; Mills et al., 2006) and the ocean (Breitbart et al., 2004; Culley et al., 2006; DeLong et al., 2006; Sogin et al., 2006; Venter et al., 2004). These studies have provided interesting findings in terms of the metabolic capability and taxonomic diversity of the microbes inhabiting these environments. The major goal of these metagenomic studies is not only to find new biological species and systems but also to be able to identify biomarkers that can be used to classify the type of processes that occur in specific environments. For example, what processes and species are more commonly found in a diseased gut compared with a healthy one? Or which species or processes associate with polluted as opposed to pristine environments?
A major problem with this preliminary work is that the diversity is probably not fully sampled because of the complexity of the environments studied. It has been recently estimated that close to 10^7 distinct bacterial species inhabit a 10 g soil sample (Curtis and Sloan, 2005; Curtis et al., 2002; Gans et al., 2005); this is a species diversity two orders of magnitude higher than previous estimates. If each of these species had an average genome size of 3–5 Mb, this would mean that a single sample would contain the equivalent of roughly 10 000 human genomes. Even if the species were present in equal amounts then a large sequencing center would have to dedicate its entire resource for years to sample all of the genomes present. Unfortunately, the problem is still more complex than that; the new higher estimate is based on the finding that there is greater diversity in the low-abundance species that are masked by a less diverse group of high-abundance species. Hence, current studies only scrape the surface of the full diversity and most of the low-abundance species in the environments are not sampled at all. New highly parallel sequencing technologies offer a cost-effective solution to this problem as they can generate much more sequence than traditional methods. However, there are limitations to their utility because non-Sanger methods have shorter read lengths and are therefore more difficult to assemble. Two recent studies using 454 pyrosequencing have demonstrated the power of new sequencing technologies for this type of analysis: one analyzing the massive diversity in the oceans (Sogin et al., 2006) and the other analyzing a low-complexity environment (Edwards et al., 2006).
The first study set out to measure the number of species in the Earth's ocean biosphere by using massively parallel sequencing to sufficiently sample the low-abundance taxa in order to make a more accurate estimate of their diversity (Sogin et al., 2006). Using the 454 pyrosequencing technology, 118 000 amplicons were sequenced that spanned the V6 hypervariable region of the ribosomal RNA (rRNA) from bacteria collected at different depths and locations of the Atlantic and Pacific oceans. The resulting sequences were compared to a database of all known V6 regions in order to place them phylogenetically. Clustering of these sequences defined Operational Taxonomic Units (OTUs). In each sample, over 1000 OTUs were identified, and in the most sampled environment over 3000 OTUs were identified. In no environment did rarefaction analysis suggest that the sampling had reached a plateau, as the number of OTUs identified increased almost linearly with the sequencing of new tags. Although the authors of this study made specific efforts to control for sequencing errors, it is possible that some of the diversity observed was caused by the inherent base calling errors that occur in 454 sequencing reads, and the findings of this study should therefore be verified by other methods. Although the sampling in this study was insufficient to measure the full diversity, it still demonstrated the inadequacy of other methods and will increase estimates of natural diversity further.
In the second study, two water samples from adjacent sites that differed significantly in their chemistry and geology were analyzed (Edwards et al., 2006). 454 sequencing was used to generate random sequence from each sample. Over 35 Mb of sequence was generated from both samples in short reads and therefore the challenge was to be able to analyze these data to identify processes and taxonomic groups that would allow a comparison of the microbial diversity in the two environments. The 16S reads that were present in the sample were used to identify the species present; this demonstrated that the oxygenated environment had a much higher species diversity than the oxygen-poor environment. This result was verified by using Sanger sequencing of an rRNA library from each sample.
In addition to looking at species, Edwards et al. also analyzed the metabolic potential of the different communities by automatically assessing gene function by homology searches of sequence reads against a metabolic database (Edwards et al., 2006). Using this analysis they identified processes that were significantly overrepresented in one sample relative to the other. This study was able to focus on biological processes as well as diversity, as the environments in question were far less complex than the ocean environments studied by Sogin et al. (Sogin et al., 2006). However, as the technologies used become faster and cheaper, it may be possible to deeply sequence complex environments. These studies are not only limited by sequencing, however, and there will need to be improvements in genomic assembly and annotation in order to analyze the data generated.
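The scale of the soil estimate can be checked with simple arithmetic, sketched here in Python. The species count and the genome size range are those quoted above; the human genome size of about 3 Gb is an assumption introduced for the comparison, and the function name is illustrative.

```python
SPECIES_PER_SAMPLE = 10**7            # estimated species in 10 g of soil
HUMAN_GENOME_BP = 3_000_000_000       # assumed haploid human genome, ~3 Gb


def human_genome_equivalents(n_species, mean_genome_bp,
                             human_bp=HUMAN_GENOME_BP):
    """Total unique sequence in the sample, in units of human genomes."""
    return n_species * mean_genome_bp / human_bp


low = human_genome_equivalents(SPECIES_PER_SAMPLE, 3_000_000)
high = human_genome_equivalents(SPECIES_PER_SAMPLE, 5_000_000)
print(low)    # 10000.0
print(high)
```

Under these assumptions a single sample holds somewhere between 10 000 and about 17 000 human genome equivalents of unique sequence, before any allowance is made for the uneven abundance of species.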
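The idea behind rarefaction analysis can be shown with a minimal Python sketch: tags are drawn in random order and the number of distinct OTUs seen so far is recorded. This uses a single random ordering of a hypothetical, already clustered set of tags; published analyses typically average over many orderings or use an analytical formula, and the function name and toy community are assumptions.

```python
import random


def rarefaction_curve(otu_labels, step, seed=0):
    """Return (tags_sampled, distinct_otus) points for one random ordering.

    otu_labels holds one OTU label per sequenced tag.
    """
    rng = random.Random(seed)
    shuffled = list(otu_labels)
    rng.shuffle(shuffled)
    seen = set()
    curve = []
    for i, otu in enumerate(shuffled, start=1):
        seen.add(otu)
        if i % step == 0 or i == len(shuffled):
            curve.append((i, len(seen)))
    return curve


# Hypothetical community: two abundant OTUs and twenty seen only once.
tags = ["otu_A"] * 50 + ["otu_B"] * 30 + [f"rare_{i}" for i in range(20)]
for sampled, distinct in rarefaction_curve(tags, step=25):
    print(sampled, distinct)
```

A curve that flattens indicates that further sequencing finds few new OTUs, whereas a curve that is still climbing at the last point, as in the ocean samples described above, indicates that the community has not been sampled to completion.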
Genome sequencing has provided us with powerful insights into the genetic make-up of the microbial world and has spearheaded a host of revolutionary technologies, such as microarrays and proteomics, that have transformed the field of microbiological research. Yet DNA sequencing has only scratched the surface of the genetic diversity present in the real world. There are a number of new technologies that are now in development that promise to reinvigorate the genomics field as they massively increase throughput while markedly decreasing the cost of DNA sequencing.
Importantly, these technologies will enable researchers to undertake the process of genomic sequencing in a single operation using bench-top instruments. This will democratize a technology that, until now, has largely been the preserve of large genome centers. It is hoped that once this process can be viewed as an assay – in the same way that we view a microarray experiment – whole-genome sequencing will be applied to a host of new questions, such as genotype association studies, mutation screening, evolutionary studies and environmental profiling.
It may be that the term 'post-genomics' has been prematurely inserted into the scientific lexicon and we are in fact on the cusp of a genome sequencing renaissance.
I am grateful to Ian Paulson for his help and advice in writing this manuscript.