SUMMARY
The principal route to understanding the biological significance of the genome sequence comes from discovery and characterization of that portion of the genome that is transcribed into RNA products. We now know that this `transcriptome' is unexpectedly complex and its precise definition in any one species requires multiple technical approaches and an ability to work on a very large scale. A key step is the development of technologies able to capture snapshots of the complexity of the various kinds of RNA generated by the genome. As the human, mouse and other model genome sequencing projects approach completion, considerable effort has been focused on identifying and annotating the protein-coding genes as the principal output of the genome. In pursuing this aim, several key technologies have been developed to generate large numbers and highly diverse sets of full-length cDNAs and their variants. However, the search has identified another hidden transcriptional universe comprising a wide variety of non-protein coding RNA transcripts. Despite initial scepticism, various experiments and complementary technologies have demonstrated that these RNAs are dynamically transcribed and a subset of them can act as sense–antisense RNAs, which influence the transcriptional output of the genome. Recent experimental evidence suggests that the list of non-protein coding RNAs is still largely incomplete and that transcription is substantially more complex even than currently thought.
Influence of old assumptions
Sequencing the genomes of human and other so-called `model' organisms paves the way for a holistic approach to understand biological phenomena, provided that a genome sequence is properly annotated for the genes it contains. In parallel with the feasibility studies leading up to the genome sequencing projects, a shortcut was developed to the core of the problem: identifying the expressed part of the genome, the messenger RNAs (mRNAs), by sequencing randomly picked complementary DNA (cDNA) clones. Early work focused on characterizing these clones with a single-pass sequencing run from one end of the molecule to give `expressed sequence tags' (ESTs). These could be linked to the corresponding part of the genome sequence, thereby identifying the genes encoded within the genome. In fact, ab initio gene prediction algorithms (see Glossary) have proved to be particularly poor in identifying encoded genes and, despite later improvements, they are still based on the assumption that the relevant output of the genome consists of mRNAs.
Early estimates in the 1970s of the number of such cellular RNA species were 70 000 to 100 000, based on the kinetics of in vitro renaturation of mRNA/cDNA. All of these RNAs, whose number is substantially larger than the number of known protein-coding genes within the genome (∼25 000), were thought to encode proteins. Although the research community has long been aware of the existence of some hundreds of non-protein coding RNAs, these have been generally dismissed as exceptions to the widespread belief that non-structural RNAs would all be protein-coding, and recognized by the incorporation of polyadenosine (polyA) tags at the 3′ end of the mRNA molecule. Some mRNAs found to lack polyA tags were described in the late 1970s, but again were treated as exceptions and not pursued.
Clarifying the number of protein-coding genes, and the identification and meaning of non-protein coding RNAs, has required the development of novel technologies, starting with cloning methods that crucially incorporate the full length of the mRNA molecule, rather than a shorter, artefactual fragment. Despite an earlier report of a full-length cDNA library prepared by selecting full-length cDNA/mRNA hybrids through the modifications made on the 5′ end of the mRNA (Theissen et al., 1986), it was not until the middle of the 1990s that the need to systematically prepare full-length cDNA libraries was recognized. These libraries allowed the systematic discovery of the entire length of the coding mRNA, including its non-protein-coding ends (see Glossary). Conventional cDNA libraries, not enriched for full-length, have an average content of full-length cDNAs of 20–30% (Marra et al., 1999), while in high quality, full-length cDNA libraries, the proportion of full-length clones can exceed 90%. These libraries have thus become very attractive for large-scale sequencing projects, because they yield the full sequence data at a fraction of the sequencing cost of the entire genomic DNA, and because the greater consistency of the full-length sequence greatly aids data analysis, clone management and full-insert sequencing.
Here I discuss the methods used to produce full-length cDNAs in the RIKEN mouse project, and how this has generated a profound understanding of the scale and complexity of the transcribed genome (see also Glossary for the definitions of some elements of the transcriptome, which is the full set of mRNA molecules or transcripts produced). This project, along with others, has substantially contributed to a mature appreciation of even greater diversity in the transcribed genome that does not relate to protein coding, but probably to complex regulatory processes that underpin the generation of biological complexity.
Full-length cDNA cloning for gene discovery
Several full-length cDNA cloning approaches have been described to date, and are extensively reviewed elsewhere (Das et al., 2001); some of these have gained widespread usage (Fig. 1). The breakthrough in full-length cDNA production took advantage of a specific feature present at the 5′ end of RNA molecules produced by the RNA polymerase II complex, namely the cap-site (Miura, 1981; Banerjee, 1980). This comprises an inverted (3′–5′) pppG nucleotide, which is added to the 5′ end of the polymerase II transcripts at very early stages of RNA synthesis.
The widely used SMART method of cDNA production is based on the addition by MMLV reverse transcriptase (RT), corresponding to the cap structure, of a trinucleotide CCC, which is annealed by an oligonucleotide having a GGG-3′ end. Use of the reverse transcriptase to synthesize on this cap-switch primer provides the means of priming the second strand of full-length cDNAs only (Zhu et al., 2001). Due to the relatively low efficiency, the polymerase chain reaction (PCR) is required. However, although these libraries are efficiently enriched for full-length cDNAs, they show a dramatically reduced variety of transcripts (less than half) when used for large-scale EST projects (Sasaki et al., 1998) if compared with non-PCR amplified full-length cDNA libraries prepared from the same tissue (Carninci et al., 2003). The oligo-capping procedure (Maruyama and Sugano, 1994; Kato et al., 1994) is more sophisticated. Uncapped RNA molecules, such as truncated RNAs, ribosomal and other structural RNAs, are dephosphorylated by a phosphatase. Next, the removal of the cap structure by tobacco acid pyrophosphatase leaves a phosphate group at the 5′ end of full-length mRNAs only, to which an oligonucleotide is added by RNA ligase, followed by library preparation by reverse transcription (RT) and PCR. Despite the requirement for PCR, this method has been widely used for the production of various cDNA collections including the full-length Japan (FLJ) human cDNA collection (Ota et al., 2004).
To clone a large variety of mRNAs efficiently without PCR, new full-length cDNA cloning approaches have been developed that separate full-length cDNAs from artefactual truncated cDNAs by selecting full-length cDNA/mRNA hybrids through the cap-site. RNAse digestion cleaves the single-stranded portion of the mRNAs, which is exposed when the RNA is not protected by a full-length cDNA extending to the cap-site (see Fig. 1), and thereby removes the cap-site from these truncated cDNA/RNA hybrids. Full-length cDNA–RNA hybrids can then be physically selected using selection techniques based on retention of the cap structure. This can be achieved by direct binding of the cap with a cap-binding protein (Edery et al., 1995) (see Fig. 1A) which, however, requires tedious coupling of a mammalian cap-binding protein to a matrix and requires a substantial amount of starting mRNAs. Alternatively, the cap can be selected after its chemical modification by the addition of a biotin, followed by selection with streptavidin-coated magnetic beads (Carninci et al., 1996; Carninci et al., 1997; Carninci and Hayashizaki, 1999) (see Fig. 1B). This technology, called `cap-trapper', makes use of commercially available reagents to oxidize the diol group at the cap site with NaIO4, followed by biotinylation with a long-arm biotin hydrazide, which is very efficient and allows further manipulations downstream without using PCR, even if starting with as little as ∼1.5 μg of total RNA (Carninci et al., 2003).
Comprehensive genome annotation requires unbiased cDNA cloning
Development of the full-length cDNA isolation technologies was only the first tool necessary. Although full-length libraries proved satisfactory in terms of full-length rate (∼95%) (Carninci et al., 1996), they were not ideal for efficient isolation of difficult RNAs. In fact, the efficiency of conversion of mRNA to full-length cDNAs, and subsequent cloning, was inversely proportional to the length of the original RNAs, with clear under-representation of cDNA deriving from long mRNAs. This problem can be partially obviated by the use of engineered reverse transcriptases (RT), which have been altered by mutating the RNAseH domain (for instance, Superscript II and III from Invitrogen). Together with the use of these enzymes, we have found that some small molecules, also called osmolytes, which are synthesized by a multitude of organisms including yeast under conditions of stress (De Virgilio et al., 1994; Hottiger et al., 1994), effectively activate RTs at a high temperature (60°C) that would normally be inactivating. This enzyme `thermoactivation' is promoted by the addition of trehalose and sorbitol to the reaction mixtures (Carninci et al., 1998; Carninci et al., 2002), enabling the preparation of cDNAs that exceed 15 kb in length.
Conventional plasmid vectors are strongly biased towards cloning the short cDNAs present in cDNA ligation mixtures. This generates libraries with short inserts on average [1–1.5 kilobases (kb)], even when the input molecules of cDNA are of a larger average size (>2.5 kb). To overcome this problem, cDNA mixtures containing such long cDNAs can be cloned into lambda vectors specifically designed for long cDNA cloning. These lambda FLC (full-length cDNA) vectors were derived by adjusting the size of the vector to just below the nominal cloning capacity (37.5 kb): the lambda phage most efficiently packages DNA of lengths close to the wild-type size (48.5 kb), so large cDNAs that traditionally were unclonable can now be packaged and cloned more efficiently than shorter cDNAs (Carninci et al., 2001). This has enabled the preparation of comprehensive cDNA libraries with an average insert size of 2.5–3 kb. Such libraries yield up to twofold greater diversity of cDNAs by random sequencing compared to libraries of shorter size.
Targeting rare RNAs
The ultimate tool available for maximizing gene discovery by sequencing of randomly selected cDNA clones is to remove undesired cDNA sequences through normalization and subtraction by hybridization (Bonaldo et al., 1996). In mammalian cells and tissues, the RNAs can be divided into classes of expression. Relatively few genes may account for up to 20–30% of the total mass of the mRNAs, whereas intermediately expressed (1000–2000 different RNAs) and rarely expressed (>10 000 different RNAs) gene classes account for the remaining 30–50% and 30–40% of the cellular RNAs, respectively. Although the proportions of these RNA classes vary in different tissues and cell types, in order to avoid prohibitive scaling up of sequencing operations, it is mandatory to reduce the frequency of the highly and intermediately expressed RNAs and increase that of the rarely expressed sequences. Since the cap-trapper protocol is efficient, we developed methods, first, to rebalance the frequency of transcripts representing different genes (normalization) and, second, to remove from the library those cDNAs already collected (subtraction). Indeed, use of cap-trapped, normalized/subtracted cDNA libraries is much more efficient for the discovery of novel cDNAs (Carninci et al., 2000; Hirozane-Kishikawa et al., 2003).
Subtraction and normalization have been widely used to produce diverse EST libraries rich in novel transcripts, and also for gene discovery in many organisms including human (Hillier et al., 1996; Marra et al., 1999) and rat (Scheetz et al., 2004). These libraries have contributed substantially to our current knowledge of gene structure and its many variations in mRNAs, and for full-length cDNA-based ESTs (Carninci et al., 2003).
Significantly, normalization and subtraction protocols tend to select against alternative splicing variants (different mRNAs generated from the same coding sequence by alternative selection of coding modules contained within it), and these have been discovered mainly by accident as hybridization leftovers. Although in the mouse transcriptome we have already identified more than 78 000 different splicing variants out of 44 000 transcriptional units (TUs; a TU groups together all of the mRNA sequences that show transcription overlap, see Glossary) (Carninci et al., 2005), splicing diversity is expected to be much larger. The comprehensive discovery of splicing variants necessitates different approaches, some of which may take advantage of selection of mis-paired nucleic acid hybrids (Watahiki et al., 2004; Thill et al., 2006). Besides displaying alternative exons, however, new methods will have to include full-length cDNA cloning, because it is not possible to reconstruct the structure of full-length mRNA transcripts without full-length cDNAs.
Coverage is far from complete
Subtracted/normalized full-length cDNA libraries have allowed extensive coverage of the transcriptome. While producing approximately 2 million ESTs, we monitored the subtraction rate during production of each library. We removed more than 90% of the abundant or already collected cDNAs in a large proportion of the libraries, and then calculated that 13.9 million EST sequencing passes would have been required to achieve the same coverage and capture of rare RNAs using conventional libraries (Carninci et al., 2003), thus representing a considerable saving in time and money. Although many consider the coverage of the mouse transcriptome to be close to saturation, it is important to note that the continued introduction of new tissues and stimulated cell types has provided a continuing high novel gene discovery rate that shows no sign of levelling out (Fig. 2) (Carninci et al., 2003). For instance, sequencing the 3′ ends of 15 000 macrophage cDNAs from a subtracted library still yielded >20% new clusters. This is remarkably high, considering that these sequences were obtained in the late stage of a large-scale project (after producing ∼90% of the RIKEN ESTs), when one would assume that most genes should have been already discovered (Carninci et al., 2003). Although we have now discovered a very large number of transcripts [>181 000 (Carninci et al., 2005)], which exceeds even the largest estimate (120 000) of the number of genes (Liang et al., 2000), and we have shown that 62% of the genome is transcribed into primary RNA transcripts, we have still not yet isolated all the RNAs that could be discovered using this approach (Carninci et al., 2003). Similarly, the rat EST project shows that the identification of novel genes using subtracted libraries was still yielding a considerable number of novel cDNAs at the moment of its publication (Scheetz et al., 2004).
Tiling arrays identify large RNA complexity
To assess the complexity of the transcriptome without cDNA cloning, whole-genome tiling arrays have been developed. These provide an evenly distributed series of oligonucleotide array probes designed from the genomic regions not covered by repeat elements (reviewed in Mockler et al., 2005; Carninci, 2006). The mRNAs (or non-cloned cDNAs) isolated from tissues are labeled and hybridized to these arrays and the expressed regions of the genome identified from the distribution of positive array probes. These expressed regions are bioinformatically grouped into contiguous expressed regions, which are either called `transfrags' (transcribed fragments) in the Affymetrix platform (Cheng et al., 2005) or TAR (transcriptional active regions) with the Yale platform (Bertone et al., 2004). Regardless of platform differences (Mockler et al., 2005), human whole genome tiling has demonstrated that a large part of the genome is transcribed into stable RNAs (∼25%) and that a large proportion of the transcripts is cell-specific, as almost half of the novel transcripts (and 20% of the known transcripts) are specific for only one cell line out of eleven tested (Kampa et al., 2004). This is in agreement with the full-length derived ESTs (Carninci et al., 2003), suggesting that the number of identified transcripts will rise simply by increasing the number of tissues and cells analyzed, although it is hard to define the plateau using current data. In particular, isolation of RNAs from minor cell populations within large tissues has not yet been properly addressed.
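The grouping of positive probes into contiguous expressed regions can be illustrated with a minimal sketch. It is not the procedure used on either platform: the function name `call_transfrags`, the signal threshold, the maximum gap between neighbouring positive probes and the minimum number of probes per region are all hypothetical parameters chosen for illustration.

```python
from typing import List, Tuple


def call_transfrags(
    probes: List[Tuple[int, float]],
    threshold: float,
    max_gap: int = 50,
    min_probes: int = 2,
) -> List[Tuple[int, int]]:
    """Group above-threshold probes into contiguous expressed regions.

    probes: (genomic position, hybridization signal) pairs for one chromosome.
    threshold, max_gap and min_probes are illustrative assumptions.
    Returns (first_position, last_position) for each region.
    """
    # Keep only the positions of probes scored as positive.
    positive = sorted(pos for pos, signal in probes if signal >= threshold)

    regions: List[Tuple[int, int]] = []
    run: List[int] = []
    for pos in positive:
        # A gap larger than max_gap closes the current run of probes.
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_probes:
                regions.append((run[0], run[-1]))
            run = []
        run.append(pos)
    # Close the last run.
    if len(run) >= min_probes:
        regions.append((run[0], run[-1]))
    return regions


example = [(100, 5.0), (135, 6.0), (170, 0.5), (205, 7.0),
           (400, 8.0), (435, 9.0), (470, 7.5)]
print(call_transfrags(example, threshold=2.0))
```

In this toy input the isolated positive probe at position 205 is discarded, because a single probe is not accepted as a region under the assumed `min_probes` setting.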
Even more surprisingly, the number of mRNAs that lack poly-adenylation is as large as the number of polyadenylated RNAs (Cheng et al., 2005) and more than 41.5% of the RNAs are confined to the nuclear regions. As such RNAs were never considered for gene discovery use and there are no ad hoc technologies for cloning them, we can assume that transcriptome complexity is at least some fourfold larger than our current description based upon full-length cDNAs and ESTs (Table 1), which were derived from polyA-plus RNA isolated from whole RNA enriched for cytosolic RNAs.
Variability type | Minimal estimation | Projection | Reference |
---|---|---|---|
Transcription starting sites (TSS) | 236 000 | >500 000 | (Carninci et al., 2005; Carninci et al., 2006) |
Transcription termination sites (TTS) | 153 000 | >180 000 | (Carninci et al., 2003) |
Tissue specific mRNAs | ∼half the transcripts are cell specific (11 lines) | Unknown for all the cells | (Kampa et al., 2004) |
Large, non-polyA RNA | (Not possible to unambiguously group into individual RNAs) | Double the number of the RNAs above | (Cheng et al., 2005) |
Nuclear specific | (Not possible to unambiguously group into individual RNAs) | Double the number of the RNAs above | (Cheng et al., 2005) |
Short RNA (miRNAs class) | >3000 | 20 000 (mouse); 70 000 (Arabidopsis) | (Mineno et al., 2005); (Lu et al., 2005) |
Short RNA (but longer than 25 nt) | >100 clusters | Thousands (testis) | (Kim, 2006) |
Splicing difference | Including splicing (78 000), more than a million | Not available | (Carninci et al., 2005) |
Final estimation of the transcript number is not possible, but may be derived by combining the data obtained from the various modalities of RNA expression (TSS, TTS, tissue specificity, polyA status, compartmentalization, size and splicing).
CAGE tags suggest a large number of transcripts and their variants
Cap-analysis Gene Expression (CAGE) technology uses cap-trapping as the first step to capture the 5′ ends of the cDNAs, which are then converted into short sequences (tags) of 20 nucleotides (nt) corresponding to the mRNA transcriptional starting sites (TSS) (Kodzius et al., 2006; Shiraki et al., 2003; Harbers and Carninci, 2005). We have produced millions of mouse and human CAGE tags (Carninci et al., 2006). Unpublished CAGE analyses suggest that in the human HepG2 cell line, used to produce close to one million CAGE tags, there are about 66 700 TSSs mapping close to the first exons of known TUs, which can therefore be considered true 5′ end candidates deriving from full-length mRNA transcripts. Of them, about 47 000 appeared only once, while only 7700 were represented by two tags, and 12 000 by three or more tags. This suggests that the majority of the different, rarely expressed transcripts require an analytical technology with enough throughput and sensitivity to detect at least a million different transcripts for each cell type, a number tenfold larger than the current sequencing capacity (Carninci et al., 2005) and larger than current estimates of transcriptome diversity (Jackson et al., 2000). Notably, data derived from analysis of CAGE tags are largely confirmed by whole-genome tiling arrays (Carninci, 2006), while the opposite is not always true. This may possibly be due either to false positive signals with tiling arrays, or because the transfrags may identify uncapped RNAs that are not detected by cap-selection based methods.
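The weight of singly detected TSSs in the HepG2 figures quoted above can be checked with a few lines of arithmetic. The sketch below uses only the three counts given in the text; the variable names are illustrative.

```python
# TSS counts for HepG2 as quoted in the text, grouped by number of supporting CAGE tags.
tss_by_support = {"1 tag": 47_000, "2 tags": 7_700, "3+ tags": 12_000}

total_tss = sum(tss_by_support.values())
fractions = {label: count / total_tss for label, count in tss_by_support.items()}

print(f"total TSSs: {total_tss}")
for label, fraction in fractions.items():
    print(f"{label}: {fraction:.1%}")
```

The three classes add up to the 66 700 TSSs cited, and roughly 70% of them are supported by a single tag, which is the basis for the argument that close to one million tags per cell type is still a shallow sample.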
Tagging technologies (Harbers and Carninci, 2005) have been developed with a sensitivity at least one order of magnitude larger than EST sequencing, in order to identify transcripts exhaustively (Ng et al., 2005), identify their promoters, and correlate them with expression profiling by counting the tags as a digital measure of gene expression (Harbers and Carninci, 2005; Nilsson et al., 2006). Unexpectedly, these technologies have also revealed a surprisingly large degree of fine variability of transcription start and termination sites (Carninci et al., 2005). In the mouse, we have grouped all the transcripts in 44 000 transcriptional units (of which less than 21 000 are protein coding). By taking the conservative approach of requiring independent evidence for both the TSS and the TTS (transcription termination site) of each transcript, more than 181 000 independent transcripts were identified in mouse, whereas there are at least 238 000 independent TSSs and 153 000 TTSs.
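The idea of a transcriptional unit as a group of transcripts showing transcription overlap can be sketched as interval clustering. This is a deliberate simplification and an assumption of this example: transcripts are reduced to their genomic span on one strand, whereas a real grouping would also have to consider exon structure. The tuple layout and the function name `group_into_units` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (transcript id, chromosome, strand, start, end), with start <= end (closed interval)
Transcript = Tuple[str, str, str, int, int]


def group_into_units(transcripts: List[Transcript]) -> List[List[str]]:
    """Cluster transcripts whose genomic spans overlap on the same strand."""
    by_locus: Dict[Tuple[str, str], List[Transcript]] = defaultdict(list)
    for transcript in transcripts:
        by_locus[(transcript[1], transcript[2])].append(transcript)

    units: List[List[str]] = []
    for key in sorted(by_locus):
        current: List[str] = []
        current_end = -1
        ordered = sorted(by_locus[key], key=lambda t: (t[3], t[4]))
        for tid, _chrom, _strand, start, end in ordered:
            # A transcript starting beyond the current unit opens a new unit.
            if current and start > current_end:
                units.append(current)
                current = []
            if not current:
                current_end = end
            else:
                current_end = max(current_end, end)
            current.append(tid)
        if current:
            units.append(current)
    return units


toy = [
    ("a", "chr1", "+", 100, 500),
    ("b", "chr1", "+", 400, 900),
    ("c", "chr1", "+", 1200, 1500),
    ("d", "chr1", "-", 450, 800),
]
print(group_into_units(toy))
```

Transcripts `a` and `b` fall into one unit, while `d` stays separate because it lies on the opposite strand, even though it overlaps both of them.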
This variability in TSSs highlights biologically significant differences between TSSs contained within a single TU, and indicates enormous complexity in the mechanisms mediating their regulated expression. For instance, CAGE analysis has identified promoters in the 3′ UTRs of many genes (Carninci et al., 2005). When two genes map tail-to-tail on the genome (i.e. the 3′ ends of genes mapping in opposing genomic strands are terminating towards each other), the rate of 3′ UTR transcription is higher when two genes map closer to each other (average gap of ∼2 kb) than for tail-to-tail genes having low 3′ UTR transcription (∼5 kb). Other genes, which do not map as tail-to-tail, also show 3′ UTR transcription, but no clear patterns are evident. In all cases, such 3′ UTR transcripts have true, conserved promoters that can activate transcription of a reporter gene (Carninci et al., 2006).
CAGE tags allow the classification of the TSS clusters into two main categories, based on the shape of the TSS. Surprisingly, the largest category of mammalian promoters does not show an accurate TSS, but instead a broad TSS (spread on average over up to 100 bp), generally associated with promoters constituted by CpG islands (see Glossary). Within such CpG islands, transcription starts mostly from pyrimidine/purine dinucleotides, a simplified consensus of the `initiator' element, and these promoters are generally devoid of TATA-boxes (see Glossary). A much smaller fraction of promoters show well-defined, sharp peak TSSs, which are located 29–32 nt downstream of a classic TATA-box. Genes having TATA-box promoters are also preferentially associated with the presence of unusual transcripts, originating from exons (Carninci et al., 2006) (reviewed by Sandelin et al., in press). These exonic transcripts might consist of non-protein-coding regulatory RNAs, which are speculated to influence the chromatin status. TATA-box promoted transcripts tend to be tissue-specific, except in the brain (Gustincich et al., 2006), where broad CpG promoters seem to be involved in tissue-specific transcription, suggesting in turn that epigenetic features are particularly relevant for brain transcriptional control. Elsewhere, CpG promoters generally promote the transcription of housekeeping genes. The promoter shape can be defined only when many CAGE tags are identified (>100 per cluster), which happens in cases of highly and broadly expressed transcripts (8100 mouse and 6900 human promoters); however, all datasets described above have pointed at the existence of RNAs that are rare and specifically expressed, for which such general promoter properties analyses will require larger CAGE datasets.
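A rough sketch of how a TSS cluster might be assigned to the sharp or the broad category from its CAGE tag distribution is given below. It is not the published classification: the rule used here (the share of tags lying within a few bases of the dominant position) and the parameters `window` and `sharp_fraction` are assumptions made for illustration, and only the requirement of about 100 tags per cluster is taken from the text.

```python
from typing import Dict, Optional


def classify_tss_shape(
    tag_counts: Dict[int, int],
    min_tags: int = 100,
    window: int = 2,
    sharp_fraction: float = 0.5,
) -> Optional[str]:
    """Label a TSS cluster 'sharp' or 'broad' from CAGE tag counts per position.

    tag_counts maps a genomic position to the number of tags starting there.
    Returns None when the cluster has too few tags for its shape to be judged.
    """
    total = sum(tag_counts.values())
    if total < min_tags:
        return None
    # Dominant start position of the cluster.
    peak = max(tag_counts, key=lambda pos: tag_counts[pos])
    # Tags starting within `window` bases of the dominant position.
    near_peak = sum(
        count for pos, count in tag_counts.items() if abs(pos - peak) <= window
    )
    return "sharp" if near_peak / total >= sharp_fraction else "broad"


sharp_cluster = {1000: 150, 1001: 20, 1030: 10}
broad_cluster = {pos: 3 for pos in range(2000, 2100, 2)}
print(classify_tss_shape(sharp_cluster), classify_tss_shape(broad_cluster))
```

The second toy cluster spreads 150 tags evenly over about 100 bp and is therefore labelled broad, matching the description of CpG-island promoters in the paragraph above.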
Full-length cDNAs have been instrumental in the discovery of non-coding RNAs
Full-length cDNA clones, once sequenced over the full length of the clone insert, are amenable to individual annotation in order to extrapolate their function (Kawai et al., 2001; Okazaki et al., 2002; Maeda et al., 2006; Carninci et al., 2005; Imanishi et al., 2004). Although initial attempts to annotate cDNAs were based on the assumption that all mRNAs would encode protein (Kawai et al., 2001), the expansion of the mouse cDNA collection to 61 000 cDNAs (Okazaki et al., 2002), and subsequently to 103 000 cDNAs (Maeda et al., 2006), has progressively revealed the existence of a class of generally lowly expressed transcripts apparently lacking coding potential. In fact, the discovery that in mouse there are at least 23 000 non-coding TUs came from the initial struggle to annotate these transcripts that were derived from full-length, cap-selected cDNAs, without any apparent CDS (coding sequence).
Known non-capped RNAs appear to be strongly selected against in the cap-trapped libraries. Enrichment for capped RNAs during the cap-trapping selection was calculated to be at least 330-fold (Carninci et al., 2006). Indeed, although structural RNAs comprise more than 90% of the mammalian RNAs, examination of the raw data obtained from RIKEN 3′ ESTs (1 512 533 sequences) reveals that there are only 758 ribosomal cDNAs and 6516 mitochondrially derived cDNAs (of which 3842 were derived from only 12 problematic libraries out of 249). This proportion of cDNAs deriving from non-capped RNA is much lower than the frequency of these RNAs in cells, suggesting that these novel cDNAs, lacking coding potential, were unlikely to be genomic cDNA contamination. We further analyzed these cDNAs by computation, and identified a set of 4280 cDNAs that mapped far from existing loci, with multiple proof of their existence as bona fide non-coding RNAs (ncRNAs) (Numata et al., 2003). Experimental validation of novel ncRNAs that map in the mouse Gnas locus demonstrated the existence of eight new imprinted transcripts (Holmes et al., 2003). Further large-scale validation was performed, showing that ncRNAs are dynamically regulated in macrophages upon induction with lipopolysaccharides, further confirming that they are real RNA transcripts (Ravasi et al., 2006).
Further insights on the function of the ncRNAs derive from the observation that a large fraction of RNAs are transcribed from both orientations of the genome, thus forming sense–antisense (S/AS) transcript pairs, in which ncRNAs are often involved. These were first identified in the mouse (Okazaki et al., 2002; Kiyosawa et al., 2003) and later in human (Yelin et al., 2003). Further analysis proved that antisense ncRNAs are dynamically regulated and tend to be nuclear (Kiyosawa et al., 2005). CAGE tag data suggested that the extent of the S/AS transcription is much larger than previously estimated, by identification of bidirectional transcription for 72% of the TUs, and in particular for 86% of the TUs that map in genomic imprinted regions (loci containing genes that are expressed either paternally or maternally), suggesting that these transcripts may be involved in regulating entire complex loci (Katayama et al., 2005). The S/AS rate was further supported by an estimate of 50% from mouse Serial Analysis of Gene Expression (SAGE) data (Siddiqui et al., 2005). Further evidence of the regulation logic derives from the identification of over 2000 `chains', or groups of transcriptional units that are overlapping or share a bidirectional promoter. These chains are to some extent conserved between mouse and human and are hypothesized to group genes under the same epigenetic regulation (Engstrom et al., 2006).
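The notion of a sense–antisense pair can be made concrete with a short sketch that reports transcripts overlapping on opposite strands of the same chromosome. It is a simplified illustration under stated assumptions: transcripts are represented only by their genomic span, the tuple layout and the function name `sense_antisense_pairs` are hypothetical, and the pairwise comparison is only suitable for small inputs.

```python
from typing import List, Tuple

# (transcript id, chromosome, strand, start, end), with start <= end (closed interval)
Transcript = Tuple[str, str, str, int, int]


def sense_antisense_pairs(transcripts: List[Transcript]) -> List[Tuple[str, str]]:
    """Return (plus-strand id, minus-strand id) pairs whose spans overlap."""
    plus = [t for t in transcripts if t[2] == "+"]
    minus = [t for t in transcripts if t[2] == "-"]
    pairs: List[Tuple[str, str]] = []
    for p in plus:
        for m in minus:
            same_chromosome = p[1] == m[1]
            # Two closed intervals overlap when each starts before the other ends.
            overlapping = p[3] <= m[4] and m[3] <= p[4]
            if same_chromosome and overlapping:
                pairs.append((p[0], m[0]))
    return pairs


toy_transcripts = [
    ("a", "chr1", "+", 100, 500),
    ("b", "chr1", "+", 400, 900),
    ("c", "chr1", "+", 1200, 1500),
    ("d", "chr1", "-", 450, 800),
]
print(sense_antisense_pairs(toy_transcripts))
```

Here `d` pairs with both `a` and `b`, while `c` has no antisense partner. Extending the overlap test to a window upstream of each start site would be one way to also capture transcripts sharing a bidirectional promoter, but that extension is not shown.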
The enormous transcripts (ENEOR) consist of a group of at least 66 very large (∼92 kb average) non-polyadenylated noncoding RNAs, which have not been clonable with standard techniques due to the size limitation of cloning vectors. These were identified by observing the presence of 3′-truncated cDNA clones primed in A-rich stretches, and reconstructing their structure by multiple RT–PCR. These ENEOR span very large regions, including various TUs, identify imprinted and micro-RNA (miRNA) genes, and may have a regulatory effect on the chromatin, as in the case of the AIR gene (Furuno et al., 2006).
The observation of ENEOR is in line with the initial analysis of 5′–3′ ditags. In fact, a large part of the cDNA population of primary lambda libraries, constituted by cDNAs longer than 6–7 kb (Carninci et al., 2002), usually does not survive large-scale propagation/sequencing operations. To overcome this, we prepared libraries containing only tags from the 5′ and 3′ ends of transcripts (Carninci et al., 2005) that were derived from large insert size cDNAs cloned in lambda FLC vectors (Carninci et al., 2001), which allows cloning of cDNAs without size bias as long as the cDNAs do not exceed 15 kb. Large-scale sequencing of these ditag libraries suggests not only that the total number of independent transcripts is larger than that identified with full-length cDNAs, but also that there are very large transcribed genomic regions called gene forests (see Glossary). Large RNAs identified by ditags span regions as large as 2 Mbp and group the TUs identified by cDNA into very large transcribed forests (Carninci et al., 2005). These 5′–3′ ditags represent borders of a part of the missing transcriptome.
The identification of non-coding RNA was initially met with scepticism, mainly because they are relatively poorly conserved between species (Wang et al., 2004; Pang et al., 2006). Despite this, their putative promoters are well conserved (Carninci et al., 2005), suggesting that their expression rather than their sequence may be biologically more important. As they may be involved in S/AS, or produce shorter RNAs (such as miRNAs), their full-length sequence conservation might indeed not be biologically relevant. For a more dedicated discussion on the function of these non-protein-coding RNAs, see (Mattick, 2003; Mehler and Mattick, 2006; Mattick and Makunin, 2006; Mattick, 2007; Carninci, 2006).
Missing transcriptome
The human transcriptome has also been extensively analyzed in the Mammalian Gene Collection (MGC) by isolating cDNAs from full-length libraries (Strausberg et al., 2002). However, these efforts greatly differ from the RIKEN approach, which is based on serial subtraction using `drivers' deriving from the pools of cDNA already isolated. Instead, the MGC project has been based on outsourcing the preparation of cDNA libraries to various collaborators in at least 11 research groups (Gerhard et al., 2004), which is not compatible with serial subtraction strategies. Another key difference is the purpose of the MGC, which aims to produce at least one full-length cDNA sequence for every protein coding gene. Therefore, after sequencing the 5′ end, the clones that do not show any potential coding region are not further used for full-insert cDNA sequencing. This clearly causes under-representation of non-coding RNAs in public databases. By contrast, the selection regime for isolating full-length cDNAs used by RIKEN has been hypothesis-free: all seemingly new clones have been fully sequenced.
Human and rat transcriptomes have also been extensively sampled using subtracted/normalized ESTs from non-full-length libraries. The main difference from the RIKEN project, is that the other widespread normalization/subtraction technology (Bonaldo et al.,1996) uses double-strand cDNAs drivers. This is likely to remove antisense- as well as sense-cDNAs, thereby rendering comparisons of S/AS across different transcriptome datasets irrelevant.
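The consequence of driver strandedness for S/AS detection can be illustrated with a toy model. This is a deliberately simplified sketch, not a description of the hybridization chemistry: it assumes that a single-stranded driver removes only clones from the same locus and strand, whereas a double-stranded driver removes clones from either strand of that locus. The function `subtract` and the locus names are hypothetical.

```python
from typing import Set, Tuple

Clone = Tuple[str, str]  # (locus, strand of origin), strand is "+" or "-"


def subtract(library: Set[Clone], drivers: Set[Clone],
             double_stranded: bool) -> Set[Clone]:
    """Return the clones left after subtraction with the given drivers."""
    removed = set(drivers)
    if double_stranded:
        # Toy assumption: a double-stranded driver also captures the
        # transcript coming from the opposite strand of the same locus.
        removed |= {(locus, "-" if strand == "+" else "+")
                    for locus, strand in drivers}
    return library - removed


# Hypothetical library: geneA is transcribed on both strands (an S/AS pair).
library = {("geneA", "+"), ("geneA", "-"), ("geneB", "+")}
drivers = {("geneA", "+")}  # the sense cDNA has already been isolated

after_single = subtract(library, drivers, double_stranded=False)
after_double = subtract(library, drivers, double_stranded=True)

print(sorted(after_single))
print(sorted(after_double))
```

Under these assumptions the antisense partner of geneA is still recoverable after single-stranded subtraction but is lost with the double-stranded driver, so the S/AS pair would go undetected in the second library.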
The widespread existence of non-coding human RNA transcription was recently vindicated by work with whole-genome tiling arrays: upon experimental validation, some 60% of S/AS transcription rate was confirmed in the human genome (Cheng et al.,2005).
Different methodologies give rise to very great differences among datasets. In contrast with genome sequencing, where shotgun strategies are well established, it is clear that we have not yet established a universal strategy for analyzing the transcriptome, which differs from the genome in its inherent complexity. Genome sequencing alone is insufficient to compare biological phenomena because (1) comparative analysis cannot interpret a large fraction of conserved but not expressed genomic regions, (2) expressed RNAs and regulatory elements, including promoters, show different levels of conservation, and (3) low or absent conservation may be important for species-specific structural and regulatory functions. For example, the broad,CpG type of human promoters are evolutionarily more plastic, and mutate faster, than the average genomic regions in the recent human lineage, compared to the chimpanzee, in contrast to sharp, TATA-box promoters, which tend generally to be more conserved (Taylor et al., 2006; Carninci et al.,2006). Because there is such a variable degree of conservation of RNAs and regulatory elements, strategies based on genome conservation to identify genes and expressed transcripts are unacceptably hypothesis-bound.
Conversely, transcriptomics datasets are still very far from being comprehensive and comparable, due to lack of sampling, shallow sequencing,subtraction and normalization and diversification of libraries. Transcriptome analysis takes advantage of the specific interest of scientists in particular sets of expressed genes in particular tissues, but data is not systematically collected, and consequently comparison of transcriptome datasets between different organisms is inconclusive.
Even more among short RNA
With the discovery of the RNA-interference (RNAi) phenomenon in C. elegans (Fire et al., 1998) and the discovery that these short siRNAs (∼23 nt) control transcript levels in mammalian cells (Elbashir et al., 2001), the research community embarked on the medium-scale cloning and sequencing of these short (20–25 nt) RNAs. These studies surprisingly revealed for the first time the existence of microRNAs (miRNAs), which are very highly expressed (Lau et al., 2001) and regulate mRNA expression level in a large variety of biological contexts, including development, differentiation and cancer. Although sequencing costs have so far constrained the exploration of this new transcriptional world, very high-throughput sequencing methods are starting to show the full extent of this phenomenon. For instance, in Arabidopsis there are more than 75 000 short RNAs (19–25 nt) (Lu et al., 2005), and a similar approach in the mouse has conservatively identified more than 20 000 sequences, among which 3374 were considered highly reliable (Mineno et al., 2006). More recently, analysis of different sized short RNAs (29–30 nt) has revealed the existence of a novel, yet-uncharacterized class of short RNAs restricted to the testis, constituted by >1000 different short RNA tags that cluster on ∼100 gene-poor regions of the genome (reviewed in Kim, 2006). These short RNAs form a complex with miwi, mili or piwi RNA-binding proteins and are essential for spermatogenesis, although the exact functional mechanisms are not yet understood. Likewise, analysis of short RNAs from other tissues might soon reveal additional classes of short RNAs. These will include tissue-specific CAGE tags that identify RNA transcribed from repeated elements (G. Faulkner, K. Waki, C. O. Daub, T. Lassmann, S. Grimmond, D. Hume, Y. Hayashizaki and P. Carninci, manuscript in preparation), which could function as global genome regulators.
How many RNAs are there in a mammal?
Despite the availability of rapidly growing datasets, the true size of the transcriptome is still difficult to estimate due to the different modalities of RNA expression, their widely varying levels of expression, their compartmentalization, and the cell specificity and plasticity of RNA expression. Considering all aspects that are still underestimated (Table 1), one can envisage the existence of more than 10⁶ distinguishable RNAs. However, considering all the different mammalian cell types still not explored and the considerable number of cell-specific transcripts (Kampa et al., 2004), it would not be surprising if there were at least 10⁷ different mammalian transcripts. Whatever this number becomes, cells seem to produce many more RNAs than were previously recognized. How many of these RNAs are essential, redundant or dispensable, and when? This is not testable with single nucleotide mutagenesis and knock-out experiments in the laboratory, nor is it feasible to measure all of the possible phenotypes, some of which could be extremely mild, redundant or context-specific, and whose display may require different thresholds of RNA inactivation for multiple genes. Alternatively, some RNAs, such as expressed pseudogenes (Frith et al., 2006), might become functional under a new set of conditions, and thus should be considered as potential genes [or `potogenes' (see Hayashizaki and Carninci, 2006)] rather than as non-genes. Finally, some of the non-coding RNAs, or groups of non-coding RNAs, might behave as genes and confer selectable traits only when a given organism is subjected to selection pressure.
The task of identifying all of these different RNAs remains a substantial challenge that requires us to develop novel methodologies beyond the whole-genome tiling arrays (which cannot distinguish different overlapping transcripts and their splicing variants), the tagging technologies and individual cDNA clone analysis. Although sequencing short RNAs would fit the novel generation of sequencers developed for the $1000 genome project perfectly (Bennett et al., 2005; Margulies et al., 2005), this would not lend itself to the discovery of large (m)RNAs, because the physical combination of all splicing variants requires sequence determination of individual full-length cDNAs. Additionally, novel technologies would need to collect full-length cDNA from many more different and rare cell types from mammalian organs, and eventually from the unexplored RNomics regions (polyA-minus and nuclear RNAs). Although the $1000 genome project might become feasible in a few years, a $1000 high-resolution transcriptome is well beyond our cloning technologies due to the elusive nature of different RNA classes.
Despite these difficulties, and because comprehensive transcriptome analysis adds so much value to genome sequencing, I argue for the strategic need to standardize transcript collection methods based on comprehensive cell and condition sampling with multiple types of transcriptome libraries, combined with novel high-throughput sequencing systems. Expanding this in the comparative direction by addressing the transcriptomes of as yet unexplored organisms will surely yield biological surprises and even more novelty.
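Why tiling arrays cannot resolve overlapping transcripts and splicing variants can be shown with a small example. The sketch below assumes an idealized array that reports only whether each probe position is covered by at least one transcript (real arrays report intensities, which this toy model ignores). The exon coordinates and the function `tiling_signal` are hypothetical.

```python
from typing import List, Tuple

Exon = Tuple[int, int]   # (first probe, last probe + 1), half-open
Transcript = List[Exon]  # a transcript is an ordered list of exons


def tiling_signal(transcripts: List[Transcript], n_probes: int) -> List[int]:
    """Return 1 for every probe covered by any exon of any transcript."""
    signal = [0] * n_probes
    for exons in transcripts:
        for start, end in exons:
            for probe in range(start, end):
                signal[probe] = 1
    return signal


# Four hypothetical exons along a 12-probe region.
E1, E2, E3, E4 = (0, 2), (3, 5), (6, 8), (9, 11)

# Two alternatively spliced variants, each skipping one internal exon ...
mixture_a = [[E1, E2, E4], [E1, E3, E4]]
# ... versus a single transcript that contains all four exons.
mixture_b = [[E1, E2, E3, E4]]

signal_a = tiling_signal(mixture_a, n_probes=12)
signal_b = tiling_signal(mixture_b, n_probes=12)

print(signal_a)
print(signal_b)
print(signal_a == signal_b)
```

The two transcript populations differ, yet the probe-level pattern is identical, which is why the exon connectivity of each variant has to come from sequencing individual full-length cDNAs.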
FOOTNOTES
Glossary available online at http://jeb.biologists.org/cgi/content/full/210/9/1497/DC1
Acknowledgements
I thank all of the members of the RIKEN GSC-GREG and GSL and the Fantom-3 consortium members for data production, analysis, advice, discussions and support, and Andrew Cossins for critical reading of the manuscript. This work was supported by a Research Grant for National Project on Protein Structural and Functional Analysis from MEXT, a Research Grant for the RIKEN Genome Exploration Research Project from the Ministry of Education, Culture, Sports, Science and Technology of the Japanese Government and a grant of the Genome Network Project from the Ministry of Education, Culture, Sports, Science and Technology (Japan).