Scientists now have huge amounts of information at their fingertips, but what does it all mean? Piero Carninci (p. 1497) begins the review collection with a discussion of the techniques scientists use to understand what the genome encodes. To do this, he writes, they need to catalogue the transcriptome – the parts of the genome that are transcribed into RNA – and then establish which mRNAs are translated into proteins. One approach is to create large libraries of cDNAs – complementary DNA sequences synthesised from RNAs – and use these libraries to identify which transcripts correspond to proteins. This work has shown that the transcriptome is far more complex than originally thought: while many mRNA sequences are ultimately translated into proteins, some RNA transcripts do not code for proteins at all, and Carninci explains that these transcripts are likely to play a role in regulating the transcription of protein-coding genes.
While the Human Genome Project gave us a 'parts list', John Quackenbush explains that researchers currently lack a 'circuit diagram' linking all of the newly described genes into functional groups (p. 1507). He writes that powerful new data analysis techniques are needed to pick out which of the many thousands of genes in the genome are important for particular physiological responses. Computational methods developed as a result of the Human Genome Project – statistical techniques, for example – can be used to group genes that are regulated together and to model these groups of genes as complex networks. Quackenbush hopes that these powerful new analytical techniques will ultimately allow scientists to predict how complex systems composed of many elements will function; for example, how a mouse's genotype will affect its immunity.
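The idea of statistically grouping co-regulated genes can be illustrated with a toy sketch. The synthetic expression matrix, the correlation threshold and the greedy grouping below are purely hypothetical illustrations, not the specific methods described in Quackenbush's review:

```python
import numpy as np

rng = np.random.default_rng(0)
n_conditions = 20

# Hypothetical expression matrix: 6 genes measured across 20 conditions.
# Genes 0-2 follow one underlying profile, genes 3-5 another.
base_a = rng.normal(size=n_conditions)
base_b = rng.normal(size=n_conditions)
expr = np.vstack(
    [base_a + 0.1 * rng.normal(size=n_conditions) for _ in range(3)]
    + [base_b + 0.1 * rng.normal(size=n_conditions) for _ in range(3)]
)

# Pearson correlation between every pair of gene expression profiles.
corr = np.corrcoef(expr)

# Greedy grouping: a gene joins the first cluster whose seed gene it
# correlates with above the threshold, otherwise it starts a new cluster.
threshold = 0.8
clusters = []  # each cluster is a list of gene indices
for gene in range(expr.shape[0]):
    for cluster in clusters:
        if corr[gene, cluster[0]] > threshold:
            cluster.append(gene)
            break
    else:
        clusters.append([gene])

print(clusters)  # genes sharing an underlying profile should group together
```

Real analyses use more robust methods (hierarchical clustering, network inference) on thousands of genes, but the principle is the same: correlated expression across conditions is the signal used to infer co-regulation.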
The technology revolution is not just about advances in data analysis, explains Neil Hall (p. 1518). There is a huge shortage of sequence data for non-model species, and advances in sequencing could be used to explore the 'vast microbial diversity in the natural environment'; for example, there could be as many as 10^7 bacterial species in 10 g of soil. Hall adds that many microbes are human pathogens, such as Plasmodium falciparum and Mycobacterium tuberculosis, so understanding the sequence variation between strains has clear clinical relevance. Sequencing these smaller, but still important, genomes will become easier with the development of desktop sequencing machines, bringing cheap sequencing technology onto university campuses and potentially leading to a 'renaissance' in genome sequencing.
While many researchers are focussed on the coding parts of the genome, only a small fraction of the genome encodes functional proteins. Returning to the question of non-protein-coding DNA, John Mattick focuses on the 98% of the human genome that isn't translated into functional proteins (p. 1526). Organisms are incredibly complex, and the more complicated they are, the larger and more intricate the networks regulating their function must be. The conventional view is that this complexity is controlled by interactions between proteins and signals between tissues; yet as organism complexity increases, the complexity of regulation increases even faster. This increase does not seem to have been matched by a significant rise in the number of protein-coding genes, suggesting that there is an upper limit to how complicated protein-based regulation can be. Mattick suggests that this is where the majority of the genome – so-called 'junk' DNA – comes in. Most of it is transcribed into RNA, apparently in a controlled and regulated way, but these RNA transcripts are not translated into proteins, suggesting that they may form another, largely hidden, layer of genetic regulation controlling cell differentiation and development.