It is usually thought that the development of complex organisms is controlled by protein regulatory factors and morphogenetic signals exchanged between cells and differentiating tissues during ontogeny. However, it is now evident that the majority of all animal genomes is transcribed, apparently in a developmentally regulated manner, suggesting that these genomes largely encode RNA machines and that there may be a vast hidden layer of RNA regulatory transactions in the background. I propose that the epigenetic trajectories of differentiation and development are primarily programmed by feed-forward RNA regulatory networks and that most of the information required for multicellular development is embedded in these networks, with cell–cell signalling required to provide important positional information and to correct stochastic errors in the endogenous RNA-directed program.
The developmental ontogeny of a human from an embryo to a fully formed adult involves the construction of an organism of approximately 100 trillion cells, with an extremely precise architecture and many differentiated tissues. These include intricately sculpted bones, organs and muscles, such as the dozens of fine muscles in the face (Gray, 1918), as well as a brain that evolves in situ in response to experience (Edelman, 1993). This is an extraordinary feat of genetic programming, which in all likelihood requires enormous amounts of information. This information directs not just a human developmental program, or that of another species, but the idiosyncrasies of the particular program that was inherited by the individual from their parents and their ancestors, as exemplified by the shape of our nose, mouth and ears and other identifying familial features.
How is this feat achieved, and where is this information embedded? In the only well-studied case, the nematode worm Caenorhabditis elegans, it is known that developmental ontogeny is precise and invariant, with each cell in the adult being the result of a spatially and temporally ordered progression of cell division, selected apoptosis (programmed cell death) and, ultimately, differentiation into nerve, muscle, gut, germ and other specialized cells (Ambros, 2001; Sternberg and Felix, 1997). Similar processes are observed in the development of insects and mammals (Baehrecke, 2002; McCarthy, 2003), for example in the apoptosis that sculpts the eye ommatidia in the former (Clark et al., 2002) and separates the digits of the fore- and hindlimbs in the latter (Zuzarte-Luis and Hurle, 2005). Thus, it is likely that the ontogeny of higher animals, while vastly more complex and likely to be subject to individual (genomic) variation, is also precisely programmed (Clarke and Tickle, 1999). Indeed, the almost exact identity of monozygotic twins in their physical characteristics and idiosyncrasies, as well as a high degree of concordance in their psychological characteristics (independent of environment), is clear testimony to the precision and reproducibility of the genetic instructions they share.
The genetic programming of development is usually considered to be directed by proteins involved in morphogenetic signalling and various aspects of gene regulation. These include homeodomain-containing proteins, chromatin-modifying proteins, and transcription factors acting on cis-regulatory elements, informed by those involved in cell surface receptor and signal transduction systems. Together they form elaborate modular regulatory networks (Arnone and Davidson, 1997; Bantignies and Cavalli, 2006; Levine and Davidson, 2005; Levine and Tjian, 2003), notwithstanding the recent discovery of microRNAs (see below), which are regarded as an interesting extension of the current paradigm (Davidson, 2006) rather than the vanguard of another entire layer of regulation. This protein-centric perspective underpins most conceptions of the control of development, as exemplified by elegant studies on sea urchin embryogenesis and fruitfly development (Ben-Tabou de-Leon and Davidson, 2006; Davidson, 2006; Levine and Davidson, 2005; Stathopoulos and Levine, 2005). On the other hand, many proteins are shared throughout the metazoa (Duboule and Wilkins, 1998). Moreover, the genomes of C. elegans (Stein et al., 2003), which only has 1000 cells, and sea urchins (Sodergren et al., 2006) have essentially the same number of annotated protein-coding genes as those of vertebrates, including humans (Aparicio et al., 2002; International Human Genome Sequencing Consortium, 2004a; International Human Genome Sequencing Consortium, 2004b; Goodstadt and Ponting, 2006; Taft et al., 2007).
All of these observations suggest that significant amounts of relevant information must lie beyond protein-coding sequences, presumably in expanded regulatory regions that control the expression of these proteins (Kleinjan and van Heyningen, 2005; Taft et al., 2007). It also seems likely, although firm conclusions are limited by the poor cDNA library coverage in many species, that the proteome is expanded in more developmentally complex species by the increased use of alternative splicing (Graveley, 2001; Smith and Valcarcel, 2000; Stamm et al., 2005). This in turn, however, mandates an increase in regulation, assuming that cell- or tissue-specific alternative splicing is not random. Thus evolutionary innovation and phenotypic divergence are achieved not only by variations in the structure and function of proteins but also, and probably more so, by variations in the regulatory circuitry that controls their deployment (Davidson, 2006; Duboule and Wilkins, 1998; Jacob, 1977; Zuckerkandl and Cavalli, 2007).
Analogue components and digital information transfer in complex systems
Proteins are extraordinarily versatile macromolecules that perform the vast bulk of the catalytic, structural and (to a greater or lesser extent; see below) regulatory functions in biology. As such, proteins (and their derived products such as carbohydrates, lipids and infrastructural RNAs) may be thought of as the analogue components of cells, in the same way that windows, chairs, wheels, gears, sensors and signalling systems comprise the analogue components of bicycles and aircraft. Damage to components usually has severe consequences for the function of the system and is therefore likely to be very evident, although there will be exceptions.
In addition to sophisticated operational controls, complex entities (whether aircraft or organisms) require extensive and detailed design plans for their construction, and this information must also be stored in the system, along with the specifications of the components themselves. Random changes to assembly plans may have more subtle effects than those that alter component structure (particularly those that compromise component function), creating design variations that often have less severe consequences, although there will be exceptions in both directions. In biology these changes will therefore often result in minor defects, quantitative trait variation or alterations in disease susceptibility. Altered regulatory information has been shown to underlie such variation in a number of cases where it has been possible to map the causative nucleotide changes to completion in well-structured pedigrees (Clark et al., 2006; Clop et al., 2006; Ishii et al., 2006; Smit et al., 2003; Van Laere et al., 2003).
While it has long been recognized that genetic information is encoded digitally in DNA, it has also been widely assumed that the cellular outputs of this information, expressed via the intermediate of messenger RNA (mRNA), are almost exclusively analogue components. That is, it has been assumed that most genes are synonymous with proteins and that most genetic information is transacted by proteins. This is essentially true for the prokaryotes, whose genomes comprise densely packed protein-coding sequences, although these genomes clearly also encode a limited number of small regulatory RNAs that function in part by sequence-specific interactions with other RNAs and DNA (Gottesman, 2005; Mattick and Makunin, 2006; Vogel and Sharma, 2005; Winkler, 2005). The situation is similar in unicellular eukaryotes such as the yeasts Saccharomyces cerevisiae (David et al., 2006; Olivas et al., 1997) and Schizosaccharomyces pombe (Watanabe et al., 2002). Interestingly, although of similar complexity, the former has more protein-coding sequences than the latter, whereas the latter has many more introns (Goffeau et al., 1996; Wood et al., 2002) and a more elaborate RNA signalling infrastructure, which includes the basic components of the RNA interference (RNAi) pathway (Martienssen et al., 2005). This suggests that there may be some trade-off between protein- and RNA-based forms of gene regulation in simple eukaryotes. In any case, at first approximation it is reasonable to say that micro-organisms, particularly the prokaryotes, are in fact largely analogue devices (the `bicycles' of biology) and that proteins not only comprise the primary structural and catalytic components of these cells but are also the main agents by which they are regulated.
For the past 50 years it has been assumed that the same applies in more complex organisms, i.e. that regulation, particularly developmental regulation, is also largely analogue (protein-based) in multicellular organisms (Davidson, 2006), despite the fact that genome sequence analysis has shown that the numbers of protein-coding genes do not scale strongly or consistently with morphological complexity (Taft et al., 2007) (Fig. 1). This apparently quite reasonable assumption (at least initially) led logically to two subsidiary assumptions: (i) that the increased regulatory sophistication of more complex organisms is achieved through combinatoric interactions of regulatory proteins intersecting with more complex regulatory sequences in promoters and untranslated regions of mRNAs (etc.) (Buchler et al., 2003; Levine and Tjian, 2003); and (ii) that the vast amounts of non-protein-coding sequences in more complex organisms are, apart from a limited amount of cis-acting regulatory sequences, evolutionary debris. The latter view has been reinforced by the fact that many of these non-coding sequences are derived from transposons (DNA sequences that can move within the genome to new positions), themselves widely assumed to be non-functional, selfish DNA (Doolittle and Sapienza, 1980; Orgel and Crick, 1980) and to be evolving `neutrally' (Waterston et al., 2002). These assumptions have remained largely unquestioned for many years and have become articles of faith, but they are not necessarily correct.
Non-linear scaling of regulatory information in integrated systems
In earlier papers it was shown that the requirement for endogenous communication and regulatory information in integrated complex systems, whether cells or computers, scales faster than linearly with function and thus must hit a limit (Gagen and Mattick, 2005; Mattick and Gagen, 2005). This limit can only be relaxed and raised by changing the physical basis and efficiency of the control architecture. In other domains, this limit has been raised by superimposition of digital communication and control systems, using symbolic or sequence-specific strings to store and transmit information within the system. This allows both higher information density and improved transmission accuracy, the latter to overcome the problem of amplified noise (unintended crosstalk) inherent in analogue computation, thereby achieving higher operational sophistication and complexity (see e.g. Collen, 1994). Good examples are the transition from analogue to digital computing (Weinstein and Keim, 1965) and the evolution of aircraft from purely mechanical devices to modern passenger or military jets, wherein a large proportion of the information and cost is entailed in the computing and software systems, including hundreds of kilometers of optical fiber (Csete and Doyle, 2002). Imagine what a bicycle engineer, or even an aeronautical engineer, might have made of the latter when unexpectedly confronted with it, a situation akin to the discovery of introns in the late 1970s (see below).
It should be noted that the power and precision of digital communication and control systems has only been broadly established in the human intellectual and technological experience during the past 20–30 years, well after the central tenets of molecular biology were developed and after introns had been discovered. The latter was undoubtedly the biggest surprise (Williamson, 1977), and its misinterpretation possibly the biggest mistake, in the history of molecular biology. Although introns are transcribed, they do not encode proteins, and because it was inconceivable that so much non-coding RNA could be functional, especially in an unexpected way, it was immediately and almost universally assumed that introns are non-functional and that the intronic RNA is degraded (rather than further processed) after splicing. The presence of introns in eukaryotic genomes was then rationalized as the residue of the early assembly of genes that had not yet been removed and that had utility in the evolution of proteins by facilitating domain shuffling and alternative splicing (Crick, 1979; Gilbert, 1978; Padgett et al., 1986). Interestingly, while it has been widely appreciated for many years that DNA itself is a digital storage medium, it was not generally considered that some of its outputs may themselves be digital signals, communicated via RNA.
On some early occasions it was suggested that RNA may act as a regulatory molecule. The possibility was first mooted briefly by Jacob and Monod in 1961 (Jacob and Monod, 1961) but lapsed when the archetypal gene regulatory factor, the lac repressor, was subsequently shown to be a protein (Gilbert and Muller-Hill, 1966). The existence of RNA regulatory networks was first postulated by Britten and Davidson in 1969 (Britten and Davidson, 1969; Davidson et al., 1977), in an attempt to explain the vastly greater complexity of the RNA in the nucleus (then called `heterogeneous nuclear RNA' or hnRNA) compared to the cytoplasm where mRNA is located. Although this paper is of historical importance for first proposing a major role for regulatory mechanisms in the evolution of higher eukaryotes, the idea of RNA regulation itself was not pursued, even following the discovery of introns, despite the fact that this discovery provided an explanation (at least in part) of the origin of hnRNA and an obvious potential source of the co-production of gene regulatory signals from the excised intronic RNA (Mattick, 1994; Mattick and Gagen, 2001).
Analysis of prokaryotic genomes has shown that, as predicted (Croft et al., 2003), the numbers of genes encoding regulatory proteins scale almost quadratically with gene number or genome size (Croft et al., 2003; Gagen and Mattick, 2005; Mattick, 2004; Mattick and Gagen, 2005; van Nimwegen, 2003). In addition, extrapolation of these relationships shows that the point where the number of new regulatory genes is predicted to exceed the number of new (non-regulatory) functional genes is close to the observed upper size limit of bacterial genomes (Gagen and Mattick, 2004; Gagen and Mattick, 2005). This implies (albeit does not prove) that bacteria have reached a complexity ceiling imposed by the accelerating cost of protein-based regulation, possibly early in evolution. It also implies (i) that the more complex eukaryotes must have solved the problem some other way, most likely by the co-option of RNA as a sequence-specific regulatory molecule [microRNAs (miRNAs) being a good example] and, more subtly, (ii) that the combinatorics of regulatory factors per se cannot be used to enlarge the regulatory space to get past this ceiling, as there is no a priori reason to expect that prokaryotes could not have easily evolved more complex promoters and recruited additional transcription factors, etc. This in turn suggests that the complex gene regulatory regimes in the higher organisms may operate through multiple layers of regulation and regulatory decisions, rather than multiple (combinatoric) inputs at any given point.
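The crossover argument above can be made concrete with a minimal sketch. It assumes an idealized pure quadratic law, R = c·N², for the number of regulatory genes R in a genome of N genes, whereas the text reports only "almost quadratic" scaling. The constant c, the function names and the numerical value used in the usage note are hypothetical illustrations, not values taken from the cited studies.

```python
# Sketch only: assumes an exact quadratic law R = c * N**2.
# The constant c is a hypothetical placeholder, not a fitted value.

def regulatory_genes(n_genes: float, c: float) -> float:
    """Regulatory gene count under the assumed law R = c * N**2."""
    return c * n_genes ** 2


def marginal_regulatory_fraction(n_genes: float, c: float) -> float:
    """Share of each newly added gene that must be regulatory: dR/dN = 2*c*N."""
    return 2.0 * c * n_genes


def crossover_gene_number(c: float) -> float:
    """Gene number at which new regulatory genes begin to outnumber new
    non-regulatory genes, i.e. where dR/dN = 1/2, which gives N = 1/(4*c)."""
    return 1.0 / (4.0 * c)
```

With an illustrative c of 1e-5, `crossover_gene_number(1e-5)` evaluates to approximately 25 000 genes; beyond that point, under these assumptions, more than half of every increment in gene number would have to be spent on regulation.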
In any case, and consistent with the non-linear scaling of regulatory information, there is a strong relationship between the extent of non-protein-coding DNA sequences in the genomes of higher organisms and their relative complexity. Indeed this appears to be the only consistent relationship between genome information content and complexity (Taft et al., 2007) (Fig. 1). These non-protein-coding sequences occupy almost 99% of the human genome (Frith et al., 2005), and it has been inconceivable to many that they might all be functional as cis-acting regulatory elements (although these have clearly expanded in complex organisms). Again this view is implicitly predicated on the assumption that most genetic information is transacted by proteins.
The major output of metazoan genomes is non-coding RNA
In apparent opposition to the above assumption, it is now evident that most of the non-protein-coding sequences in genomes are in fact expressed (i.e. transcribed), either as introns in the primary transcripts of protein-coding genes (which occupy ∼40% of the human genome) or as intergenic or antisense transcripts (Frith et al.,2005; Mattick and Makunin,2006). Indeed it appears that the vast majority of all genomes,from yeast to insects and mammals (wherein most studies have been done), are transcribed, much on both strands(Carninci et al., 2005; Cheng et al., 2005; David et al., 2006; Manak et al., 2006). Both cDNA (Carninci et al., 2005; Katayama et al., 2005; Okazaki et al., 2002) and genome tiling array studies (Cheng et al.,2005; Kampa et al.,2004; Kapranov et al.,2002; Kapranov et al.,2005) of the transcriptome have revealed an extraordinarily complex landscape of interleaved and overlapping transcripts, with distal exons, elaborate splicing patterns and alternative polyadenylation sites, many of which appear to have no protein-coding capacity(Mattick and Makunin, 2006). The most recent data show that at least 85% of the Drosophila genome(Manak et al., 2006), 70% of the mouse genome (Carninci et al.,2005) and 93% of the ENCODE regions of the human genome (The ENCODE Project Consortium, manuscript submitted for publication) have experimentally documented transcripts. Moreover, there also appears to be a large and mostly distinct population of non-polyadenylated transcripts located in the nucleus and the cytoplasm, which (despite indications from some very early studies) it was not appreciated existed, because of the widespread use of oligo dT to purify mRNA and to construct cDNA libraries(Cheng et al., 2005).
There are literally tens of thousands of long non-coding RNAs (ncRNAs) that have been identified in mammals (Carninci et al., 2005; Kampa et al.,2004; Okazaki et al.,2002), including many antisense transcripts(Alfano et al., 2005; Cocquet et al., 2005; Katayama et al., 2005; Korneev and O'Shea, 2005; Pandorf et al., 2006; Reis et al., 2004; Tufarelli et al., 2003; Werner, 2005; Werner and Berdal, 2005) and large numbers of smaller RNAs such as miRNAs(Berezikov et al., 2006a; Berezikov et al., 2006b) and piRNAs (Aravin et al., 2006; Girard et al., 2006; Lau et al., 2006). Many of these ncRNAs are expressed in a cell- or tissue-specific manner, suggesting that they are developmentally regulated. Characterized long ncRNAs include H19 (Barsyte-Lovejoy et al.,2006; Brannan et al.,1990; Wrana,1994), 7H4 (Velleca et al., 1994), bic(Tam et al., 1997), NTT (Liu et al.,1997), BORG (Takeda et al., 1998), Xist(Brockdorff, 1998), Tsix (Lee et al.,1999), DD3(Bussemakers et al., 1999), Msx1 (Blin-Wakkach et al.,2001), Air (Sleutels et al., 2002), MALAT-1(Ji et al., 2003), adapt33 (Wang et al.,2003), SCA8(Mutsuddi et al., 2004), MIAT (Ishii et al.,2006), CTN (Prasanth et al., 2005), NFAT(Willingham et al., 2005), PRINS (Sonkoly et al.,2005), TUG1 (Young et al., 2005), PINC(Ginger et al., 2006), SAF (Yan et al.,2005), Evf-2 (Feng et al., 2006), HSR1(Shamovsky et al., 2006) and HAR1 (Pollard et al.,2006), most of which have been associated with specific cellular or developmental functions and/or disease. However, most of the ncRNAs discovered in genome-wide transcriptomic analyses or expressed from particular genomic regions have not been studied in any detail, although high-throughput cell-based and other screening strategies are beginning to be deployed to ascertain their function (Mattick,2005; Reis et al.,2004; Willingham et al.,2005). 
Moreover, the documented numbers of these RNAs are conservative estimates: more are being regularly discovered as genomic analyses of one sort or another delve deeper into the transcriptome. Recent evidence suggests that deep sequencing has not remotely exhausted the repertoire of either long ncRNAs (Carninci et al., 2005) or short ncRNAs (Berezikov et al., 2006a; Berezikov et al., 2006b; Cummins et al., 2006; Ruby et al., 2006) and that there may be hundreds of thousands of small RNAs expressed in humans (T. R. Gingeras, personal communication; L. Croft, R. J. Taft and J.S.M., unpublished data).
These observations confront and very largely contradict the traditional protein-centric view of genetic information and genome organization(Mattick and Makunin, 2006). Either the bulk of the transcriptional output from the human genome and those of other complex organisms is random `noise' (or, in the case of introns, the residue of evolutionary baggage retained and accumulated within genes, as widely assumed) or this transcription comprises a massive but hitherto hidden layer of expression of systemic genetic information that is transacted by RNA(Mattick, 1994; Mattick, 2001; Mattick, 2003; Mattick, 2004). The former has been described as a rather nihilistic view(Werner, 2005), but is one that is comfortable for the prevailing orthodoxy. On the other hand, the latter is strongly supported by the observations that: (i) all well-studied loci in insects and mammals express a large number of non-protein-coding transcripts (e.g. Ashe et al.,1997; Bae et al.,2002; Holmes et al.,2003; Jones and Flavell,2005; Lemons and McGinnis,2006; Lipshitz et al.,1987; Sanchez-Herrero and Akam, 1989; Sessa et al.,2007); (ii) many of the experimentally detected ncRNAs are differentially expressed (Carninci et al.,2005; Cheng et al.,2005; Ravasi et al.,2006), apparently under the control of common transcription factors (Barsyte-Lovejoy et al.,2006; Cawley et al.,2004); (iii) at least some have specific subcellular locations(Ginger et al., 2006; Prasanth et al., 2005); and(iv) at least some have been shown to be functional(Brannan et al., 1990; Brockdorff, 1998; Feng et al., 2006; Ginger et al., 2006; Prasanth et al., 2005; Velleca et al., 1994; Willingham et al., 2005; Wrana, 1994; Young et al., 2005).
Microarray analyses have shown that large numbers of ncRNAs are dynamically regulated during the differentiation of embryonal stem cells, myoblasts,neuronal cells and the gonadal ridge, as well as during T-cell and macrophage activation (M. E. Dinger, K. C. Pang, I. Qureshi, M. Crowe, A. C. Perkins, S. M. Grimmond, D. A. Hume, P. A. Koopman, G. E. O Muscat, S. Bruce, M. F. Mehler and J.S.M., manuscript in preparation) and in cancer(Lu et al., 2005; Reis et al., 2004). In addition, in situ hybridization analyses are revealing large numbers of ncRNAs that are expressed in particular regions of the brain and in particular subcellular locations (T. R. Mercer, M. E. Dinger, S. Sunkin, M. F. Mehler and J.S.M., in preparation). Many of these ncRNAs are antisense or intronic to genes encoding proteins important in neural development, function and disease. It is also now evident that many of the complex genetic phenomena in complex organisms, including transcriptional and post-transcriptional gene silencing (Cogoni and Macino,2000; Matzke et al.,2001; Zamore and Haley,2005), imprinting (Kelley and Kuroda, 2000; Morison et al.,2005; Nikaido et al.,2003) and probably also transvection(Mattick and Gagen, 2001) and transinduction (Ashe et al.,1997), are linked to RNA signalling(Mattick, 2003; Mattick and Gagen, 2001).
Digital–analogue conversion of RNA signals
A key advantage of RNA is its sequence specificity, in that it can direct a precise interaction with its target by base pairing, over short stretches of nucleotides, far more efficiently than can be achieved by proteins. This allows large numbers of regulatory controls to be encoded compactly in genomes, especially as those genomes come under pressure to contain exponentially greater amounts of regulatory information as complexity increases. These regulatory controls can also be flexibly altered and re-configured by evolution to achieve phenotypic variation without altering the underlying components of the system, a concept that is well established in engineering (Mattick and Gagen, 2001). A good case in point is that of miRNAs, some of which are widely distributed among species and highly conserved while others are species-specific (Berezikov et al., 2006a; Berezikov et al., 2006b), with two documented cases of mutations in miRNA target sites underpinning disease (Abelson et al., 2005) or quantitative trait variation (Clop et al., 2006). RNAs also intrinsically possess much more precise specificity of interactions with other RNAs and DNA than is usually possible by and between proteins, thus potentially improving the precision of the control system and minimizing noise from crosstalk, especially in complex regulatory networks. (The problem of noise was a primary limitation of analogue computers and a primary driving force in the transition to digital computing.) Thus it appears that evolution may have discovered the power of digital communication and control systems a billion years before we did (see below).
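The `digital' character of base pairing can be illustrated with a minimal sketch. It assumes strict Watson-Crick pairing only (G:U wobble pairs, mismatches and secondary structure are ignored), and the function names and example sequences are hypothetical, chosen purely for illustration.

```python
# Sketch only: strict Watson-Crick pairing, no wobble or mismatches.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna: str) -> str:
    """Sequence that pairs antiparallel with the given RNA."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(guide: str, transcript: str) -> list:
    """Start positions in the transcript that are exactly complementary
    to the guide over its full length."""
    site = reverse_complement(guide)
    last_start = len(transcript) - len(site)
    return [i for i in range(last_start + 1)
            if transcript[i:i + len(site)] == site]


def address_space(length: int) -> int:
    """Number of distinct sequences (addresses) of a given length."""
    return 4 ** length
```

For example, `address_space(7)` gives 16 384 distinct addresses from only seven nucleotides, and `find_target_sites("UGAGGUA", "AAUACCUCAGG")` locates the single complementary site in that made-up transcript. A short stretch of sequence can thus specify a target compactly and without ambiguity, which is the sense in which the text calls RNA signalling digital.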
However, the sequence-specific interaction of a regulatory RNA with its target is relatively sterile unless this interaction can be converted into a meaningful analogue action. At its simplest level, this may comprise antisense binding to block another interaction, and this primitive mechanism seems to be a common feature of regulatory RNAs in prokaryotes. However, a more sophisticated strategy is to embed secondary signals either in the RNA itself or in the structure of the resulting RNA:RNA or RNA:DNA complex, to recruit different types of complexes, which then undertake the type of analogue action required upon receipt of the signal. Good examples are (i) the complexes of RNA-modifying enzymes that act at a site adjacent to and determined by the position of the sense:antisense interaction between small nucleolar RNAs (snoRNAs) and their targets (Bachellerie et al., 2002; Meier, 2005), and (ii) the RNA-induced silencing (RISC) complexes that act on RNAs bound to small interfering RNAs (siRNAs) and miRNAs (Tang, 2005). Thus, there are two components to RNA signals: a sequence-specific interaction with the intended target(s) and a secondary or tertiary structural component that acts as a transducer to recruit generic infrastructural proteins to impart different types of actions. Indeed, this two-stage principle also applies to other classes of functional RNAs including snRNAs and tRNAs, which recognize splice junctions in pre-mRNAs or codons in mRNAs and recruit the spliceosome or ribosome, respectively. That is, RNAs function as adaptors, with a target sequence-specific address code and separate structural motifs that specify the type of consequent function and bind the appropriate proteins.
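The adaptor principle described above (an address code plus a separate signal that selects the effector) can be sketched as a small data structure. This is a deliberately crude illustration: the RNA class name stands in for the structural motif, pairing is strict Watson-Crick, and the class, function and sequence names are hypothetical. Only the pairing of RNA classes with recruited complexes is taken from the text.

```python
from dataclasses import dataclass
from typing import Optional

# RNA classes and the complexes they recruit, as listed in the text.
RECRUITED_COMPLEX = {
    "snoRNA": "RNA-modifying enzymes",
    "siRNA": "RISC",
    "miRNA": "RISC",
    "snRNA": "spliceosome",
    "tRNA": "ribosome",
}

PAIRS = {"A": "U", "U": "A", "G": "C", "C": "G"}


@dataclass(frozen=True)
class AdaptorRNA:
    rna_class: str  # stands in for the structural motif (the transducer)
    address: str    # sequence that base-pairs with the target


def signal_outcome(rna: AdaptorRNA, target_site: str) -> Optional[str]:
    """Stage 1: check that the address pairs exactly with the target site.
    Stage 2: if it does, return the complex that this RNA class recruits;
    otherwise return None (no analogue action is triggered)."""
    expected_site = "".join(PAIRS[base] for base in reversed(rna.address))
    if target_site != expected_site:
        return None
    return RECRUITED_COMPLEX[rna.rna_class]
```

In this sketch, changing `address` retargets the same effector to a different site, while changing `rna_class` keeps the target and changes the action, mirroring the separation of address code and function described in the text.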
Such considerations suggest that a receptive infrastructure for RNA signalling must have co-evolved with the RNA signals themselves and become progressively more sophisticated as RNA regulatory and transport networks gained currency during the evolution of the eukaryotes. Examples include the proteins of the argonaute family and others associated with RNA interference (Carmell et al., 2002), and those containing RRM domains, KH domains, SR domains, SET domains, pumilio-homology domains and double-stranded RNA-binding domains, which occur in a wide range of developmental regulators with global functions (Anantharaman et al., 2002; Bernstein and Allis, 2005; Saunders and Barber, 2003; Wang et al., 2002). Indeed many of the so-called nucleic acid binding proteins and chromatin-binding proteins whose target specificity is uncertain or unknown may in fact recognize different types of RNA signals. This possibility is supported by evidence suggesting that regulatory proteins containing C2H2 zinc fingers (Shi and Berg, 1995), Y-boxes (Ladomery, 1997), chromodomains (Akhtar et al., 2000; Bernstein and Allis, 2005), tudor domains (Maurer-Stroh et al., 2003) and SET domains (Krajewski et al., 2005), and others such as DNA methyl transferases (Jeffery and Nakielny, 2004), may recognize such RNA signals in one form or another.
The origin and evolution of RNA-based regulatory networks in complex organisms
I suggest that the transition from a largely analogue protein-based regulatory control to digitally based RNA regulation was a fundamental rate-limiting step in the emergence of complex organisms (Mattick, 1994; Mattick and Gagen, 2001), together with other factors such as the level of atmospheric oxygen (Canfield et al., 2007). It follows that the RNA-based regulatory systems underpinning the ability to control more complex developmental trajectories must have been largely in place prior to the metazoan radiation and have been a critical factor enabling this evolutionary event (Mattick, 1994; Mattick, 2001; Mattick, 2004; Mattick and Gagen, 2001). Following the emergence of all modern animal phyla at that time, often referred to as the Cambrian explosion (Fig. 2), these new dynasties of multicellular organisms settled down to `battle it out' in evolutionary competition. This was achieved, firstly, by refining and introducing new adaptations to body plans to improve their competitiveness for survival and reproduction, and to enable the colonization of new ecological niches and new domains such as the land and the air. The latter presented new physical and physiological challenges, which required significant innovations in proteins as well as in the regulatory architecture controlling developmental ontogeny (Bejerano et al., 2004; Kleinjan and van Heyningen, 2005; Mattick and Gagen, 2001). Recent data indicate that many regulatory RNAs, such as miRNAs, emerged in the ancestors of the Bilateria (Hertel et al., 2006; Prochnik et al., 2007) and in major transitions of metazoan evolution, including the advent of the vertebrates and eutherian mammals (Hertel et al., 2006). Secondly, there would have been considerable evolutionary advantage, and therefore pressure, to enhance sensory and cognitive capacities to recognize and respond to opportunities and threats and to alter the environment in favour of better survival and reproduction. This led to the evolution of learning and memory, an even greater mechanistic challenge that almost certainly involved RNA editing as a means of dynamically intersecting the environment with otherwise hardwired genetic information, ultimately leading to the emergence of higher-order cognition (Mehler and Mattick, 2007).
Although RNA is an ancient molecule and may well have been the progenitor of both DNA and proteins (Gesteland et al., 2006), its evolution as a regulatory molecule with associated infrastructure and networks probably had its genesis in the invasion of eukaryotic protein-coding genes by mobile self-splicing group II introns (Cavalier-Smith, 1991; Cousineau et al., 2000; Lambowitz and Zimmerly, 2004; Mattick, 1994; Palmer and Logsdon, Jr, 1991). These sequences occur in prokaryotes (Ferat and Michel, 1993; Martinez-Abarca and Toro, 2000) but are restricted to non-protein-coding sequences by the intimate coupling between transcription and translation (Cavalier-Smith, 1991; Mattick, 1994), thereby restricting the target area for evolutionary experimentation. While RNA regulation occurs in prokaryotes, it is not well developed, just as there is little need for digital control systems in a bicycle. The need to find solutions to the accelerating problem of increasing regulatory sophistication required to underpin multicellular development (ultimately through the co-option of RNA as a compact signalling molecule and later connecting these signals to different types of actions through the co-evolution of different types of RNA binding and effector proteins) might have been felt by both prokaryotes and eukaryotes, but the latter may have had more opportunity to do so, especially given the compartmentalization of their cells. This latter feature probably arose due to the lifestyle of early eukaryotes as phagocytic cellular predators, such as amoebae or macrophages (Cavalier-Smith, 1991). Importantly, the separation of transcription from translation by the introduction of a nuclear membrane allowed introns to invade protein-coding sequences, as their negative effects could be minimized as long as they were (self) spliced out before export to the cytoplasm. In so doing, it also created the raw material for a new round of molecular evolution of RNA signals produced in parallel with protein-coding sequences (Mattick, 1994) (Fig. 2).
The subsequent evolution of the spliceosome occurred by the devolution of the originally cis-acting catalytic sequences within introns to trans-acting generic co-factors (spliceosomal RNAs) and the recruitment of ancillary proteins. This reduced the internal sequence constraints on the introns, allowing them more freedom to evolve and flexibly explore new functional space (as RNA molecules). It also made their excision from primary transcripts more efficient, perversely providing them with even greater facility to expand and invade other genes(Mattick, 1994). As these RNA networks began to be established, proteins capable of recognizing subsets of signals in these networks would have been selected for, increasing the sophistication of the system. Moreover, it would be expected that increasing numbers of genes would have evolved solely to express RNA as higher-order regulators in this increasingly complex system. This will have occurred at least in part by gene duplication followed by loss of protein-coding capacity,as appears to have happened in Xist (the ncRNA controlling X chromosome inactivation in female mammals)(Duret et al., 2006) and in many of the non-protein-coding genes that encode snoRNAs or miRNAs in their introns (Cavaille et al.,2001; Mattick and Makunin,2005; Rodriguez et al.,2004; Tycowski et al.,1996; Ying and Lin,2005). Interestingly, many ncRNAs are alternatively spliced(Cocquet et al., 2005; Pang et al., 2005),suggesting that there is an operational distinction between RNA sourced from exons and introns. The other major source of functional RNAs has almost certainly been various other types of mobile (transposable) elements, many of which are derived from small RNAs and have been a potent force in genome evolution and genetic innovation (Brosius,1999; Brosius,2005; Waterston et al.,2002).
The extent of the genome under evolutionary selection
This raises the question of the composition, rate of evolution and functionality of the genome as a whole, especially as it is now known that most of the genome is transcribed. A large percentage of the mammalian genome(∼46% in humans) is composed of transposon-derived sequences(Lander et al., 2001; Waterston et al., 2002),often pejoratively referred to as repeats, and assumed to be non-functional and therefore evolving `neutrally'(Waterston et al., 2002). The same assumption has often been made about introns, although it is now evident that there are significant amounts of conserved sequences within them(Dermitzakis et al., 2003; Hare and Palumbi, 2003; Sironi et al., 2005),presumably reflecting either functional RNA products or important cis-acting regulatory sequences. In any case, on the assumption that ancient repeats (ARs) can be used as a yardstick of the background neutral evolutionary rate, it has been estimated that ∼5% of the human genome is under purifying selection in mammals(Waterston et al., 2002), and therefore functional, with the remainder largely considered to comprise genetically inert, neutrally evolving evolutionary debris.
This is in direct contradiction to the suggestion that much of the genome-wide transcription, which is developmentally regulated, is functional. However, it is questionable whether the ARs that are used as yardsticks for these estimations are really evolving neutrally. First, if ARs have no functional relevance to the organism, they would be expected to evolve freely and eventually to either acquire function or be deleted (M. Pheasant and J.S.M., manuscript submitted for publication), as appears to have occurred with a large fraction of ARs (Waterston et al., 2002). That is, the more ancient the extant sequence, the more likely it is to have acquired function. Second, in agreement with this logic, there are increasing numbers of transposon-derived sequences of all classes, both ancient and modern, including lineage-specific repeats such as Alu elements that have been shown to have undergone functional exaptation as gene promoters, regulatory elements, exons and microRNA precursors (Bejerano et al.,2006; Britten,2006; Brosius,1999; Dagan et al.,2004; Ferrigno et al.,2001; Hasler and Strub,2006; Krull et al.,2005; Landry et al.,2001; Lev-Maor et al.,2003; Lippman et al.,2004; Matlik et al.,2006; Nigumann et al.,2002; Smalheiser and Torvik,2005; Smalheiser and Torvik,2006; Volff,2006; Zhou et al.,2002).
These observations throw increasing doubt on the widespread assumption that such sequences are mostly parasitic and remain as inert genomic passengers. Transposable elements have also been found to underlie the birth of new genes and regulatory networks (Brandt et al., 2005; Cordaux et al., 2006; Landry et al., 2001; Zhou et al., 2002) and to influence early development (Peaston et al., 2004) and phenotypic variation (Whitelaw and Martin, 2001). It is also possible to identify AR sequences that are clearly conserved, some of which are very ancient (Nishihara et al., 2006), such as recently discovered classes of ARs in humans sharing common ancestors with those in marsupials (Kamal et al., 2006) and fish (Ogiwara et al., 2002; Xie et al., 2006), including an example of the slowest evolving regions of the human genome (Bejerano et al., 2006). Moreover, some major classes of ARs show variable rates of sequence conservation within them. One example is the class of so-called `mammalian interspersed repeats' (MIRs), of which there are ∼300 000 copies in the human genome (Smit and Riggs, 1995). These MIRs date back ∼130 million years and are tRNA-derived SINEs (short interspersed elements) with a consensus length of ∼260 nt, including a 70 nt central region and a 15–25 nt more highly conserved core (Silva et al., 2003; Smit and Riggs, 1995). The fact that hundreds of thousands of such elements have an internal sequence that is conserved more highly than the rest of the element is prima facie evidence that this class of ARs (or at least the conserved core within them) is not neutrally evolving and is likely under selection, presumably for function and possibly as regulatory RNAs.
It is also clear that there are widely different rates of evolution of different types of genomic sequences, particularly of gene regulatory sequences, some of which are extraordinarily highly conserved blocks(Bejerano et al., 2004), while many others cover extended genomic regions and exhibit rapid turnover(Fisher et al., 2006; Frith et al., 2006; Smith et al., 2004; Taylor et al., 2006). The latter includes the remarkable functional conservation of regulatory sequences controlling ret gene expression in zebrafish and humans, although there is little recognizable primary sequence conservation(Fisher et al., 2006). The cis-regulatory elements of the HoxA cluster have also been shown to undergo accelerated evolution, presumably under positive selection during the origin of amniotes and mammals (Wagner et al., 2004). Moreover, it is evident that phenotypic diversification may be due as much, if not more, to changes in regulatory architecture than to the protein components(Duboule and Wilkins, 1998; Levine and Tjian, 2003; Mattick and Gagen, 2001). Indeed, regulatory sequences often exhibit considerable evolutionary plasticity (depending on the number of their interacting targets; see below)and relatively low conservation (Pang et al., 2006) compared with proteins whose evolutionary flexibility is limited by both analogue structure–function relationships and multitasking, i.e. the differential use of the same components in multiple contexts (Duboule and Wilkins,1998; Mattick and Gagen,2001).
There are also other regions of the genome under evolutionary constraints that are not evident at the primary sequence level, including shuffled cis-regulatory elements (Sanges et al., 2006), gene deserts(Ovcharenko et al., 2005),transposon-free regions (Simons et al.,2006), chromatin domains(Bernstein et al., 2005; Bernstein, B. E. et al.,2006), regions under indel-purifying selection(Lunter et al., 2006), the distances between ultra-conserved elements(Sun et al., 2006) and regions predicted to contain common RNA secondary or tertiary structures(Lescoute et al., 2005; Washietl et al., 2005). Thus,the proportion of functionally meaningful DNA in the human genome is substantially greater than estimated from sequence conservation alone(Smith et al., 2004).
Different rates of evolution also occur within and between different classes of functional gene products, both RNAs and proteins. While most protein-coding sequences are highly constrained and hence highly conserved, some are much more flexible and others have diverged under positive selection (Bustamante et al., 2005). The estimated 5% of the human genome that is conserved with mouse does not include 35% of annotated protein-coding sequences and 17% of RefSeq annotated genes (M. Pheasant and J.S.M., manuscript submitted for publication). Many miRNAs are highly conserved (Pang et al., 2006) but many are not, being lineage- or even species-specific (Berezikov et al., 2006a; Berezikov et al., 2006b). There are also thousands of recently discovered small RNAs (piRNAs) expressed in testis that are not conserved between rodents and humans, although similar RNAs are produced from syntenically orthologous loci (Aravin et al., 2006; Girard et al., 2006; Lau et al., 2006). SnoRNAs have very divergent sequences and many are identifiable only by the loose consensus and positioning of the C/D (RUGAUGA/CUGA) (Shanab and Maxwell, 1992) or H (ANANNA)/ACA boxes (Meier, 2005). It is also clear that many longer functional non-protein-coding RNAs (ncRNAs), such as the Xist and Tsix transcripts involved in X-chromosome dosage compensation, are evolving quickly (Chureau et al., 2002; Migeon et al., 2001; Nesterova et al., 2001; Pang et al., 2006). In other cases, there is evidence of recent positive selection in ncRNAs, such as the HAR1 transcript expressed in particular regions of the brain (Pollard et al., 2006). While functionally validated RNAs do not presently add up to a large fraction of the genome, they do illustrate that lack of conservation does not necessarily equate to lack of function (Pang et al., 2006; Smith et al., 2004). They also point to the likelihood that many functional transcripts, particularly regulatory ncRNAs, are not highly conserved over significant evolutionary distances.
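The loose box consensus quoted above can be turned into a simple motif scan. The sketch below is illustrative only: it expands the consensus strings given in the text, reading R as purine and N as any base (the usual IUPAC convention, assumed here), and reports every position at which each box matches. It ignores the positional constraints that the text also mentions, and the function names and toy sequence are hypothetical.

```python
import re

# Assumed IUPAC reading of the consensus letters: R = purine (A or G), N = any base.
IUPAC = {"A": "A", "C": "C", "G": "G", "U": "U", "R": "[AG]", "N": "[ACGU]"}

# Box consensus strings as quoted in the text.
BOXES = {"C": "RUGAUGA", "D": "CUGA", "H": "ANANNA", "ACA": "ACA"}


def consensus_to_regex(consensus):
    """Expand a consensus string into a regular expression."""
    return "".join(IUPAC[base] for base in consensus)


def find_boxes(rna):
    """Return, for each box, the 0-based start positions of all matches.

    A lookahead is used so that overlapping matches are also reported.
    """
    rna = rna.upper().replace("T", "U")
    hits = {}
    for name, consensus in BOXES.items():
        pattern = re.compile("(?=(" + consensus_to_regex(consensus) + "))")
        hits[name] = [match.start() for match in pattern.finditer(rna)]
    return hits


if __name__ == "__main__":
    toy_sequence = "GGAUGAUGACCCCUGAGG"  # invented sequence, not a real snoRNA
    print(find_boxes(toy_sequence))
```

Because the boxes are so short, a scan of this kind will match many sequences by chance, which is consistent with the point that sequence alone identifies snoRNAs only loosely.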
Most of the mammalian genome appears to be evolving more quickly than protein-coding sequences, and at a (regionally adjusted) rate similar to ancient transposon-derived sequences. However, this is evidence simply that the majority of the genome is under similar average selection pressures (M. Pheasant and J. S. Mattick, manuscript submitted for publication), rather than being non-functional and evolving neutrally, although the latter is the favored explanation (Waterston et al.,2002) being consistent with the orthodox view. Moreover, it has been known for some time that the nucleotide substitution frequency varies across the genome. This has often been interpreted as the result of regional variation in the background mutation or fixation (related to recombination)frequencies, rather than selection, as it was (again) inconceivable that the vast intronic and intergenic sequences could be under selection, since that in turn would impute function. Variation in substitution frequencies beyond that which might be expected from random events is also observed at close range within genomic regions, and the data are more consistent with the genome comprising different types of genetic information that are evolving at different rates under different selection pressures and different structure–function constraints (M. Pheasant and J.S.M., manuscript submitted for publication).
Functional constraints on the evolution of regulatory RNAs
Structure–function constraints are different for different types of molecules. As noted already, proteins are analogue components that have quite strict structural specifications. There are only so many ways to construct a wheel, a catalytic site, or an oxygen-binding pocket that is responsive to O₂ and CO₂ partial pressures, and it is hard to vary a successful design. On the other hand, sequence-specific regulatory signals like miRNAs are purely informational and only need to address the right targets; thus at first glance it seems a mystery why many of the known miRNAs have been so fiercely conserved – more so than most protein-coding sequences (Pang et al., 2006) – over 500 million years of evolution from worms to mammals. The exact sequence of these small RNAs does not seem to matter that much: it is easy to design them artificially against almost any sequence, and such siRNAs are now commonly employed as experimental tools (Chalk et al., 2005; Truss et al., 2005). So why have some been so frozen in evolution? The answer appears to be that those miRNAs that were first cloned are common central regulators that have multiple targets (John et al., 2004; Lewis et al., 2005; Lim et al., 2005), which makes co-variation almost impossible in evolutionary terms. If the odds of a miRNA and a target co-varying by compensatory mutations in the same generation are 10⁻⁵, the odds of co-variation of a miRNA with 20 targets are 10⁻¹⁰⁰. Most miRNAs that have been subsequently identified through bioinformatics means have also invoked evolutionary conservation as a filter (Berezikov et al., 2005; Jones-Rhoades and Bartel, 2004), thereby likely also restricting their discovery to those that have multiple targets.
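The arithmetic behind this estimate can be checked with a short sketch. It takes the per-target odds of 10⁻⁵ from the text and assumes that the 20 targets would have to co-vary independently, so that the probabilities multiply; the function name is illustrative.

```python
from fractions import Fraction


def covariation_odds(per_target_odds, n_targets):
    """Odds that a miRNA and all of its targets co-vary in the same generation.

    Assumes each target co-varies independently of the others (an assumption
    of this sketch), so the per-target odds are multiplied together.
    """
    return per_target_odds ** n_targets


per_target = Fraction(1, 10**5)        # 10^-5, the figure used in the text
one_target = covariation_odds(per_target, 1)
twenty_targets = covariation_odds(per_target, 20)

print(one_target == Fraction(1, 10**5))         # True
print(twenty_targets == Fraction(1, 10**100))   # True: (10^-5)^20 = 10^-100
```

Exact fractions are used rather than floating-point numbers so that the very small result is represented without rounding.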
Clearly, the level of selection pressure on such sequences will be a function of the number of interactions that must be maintained, rather than the precise sequence itself. Those with one or few interacting partners will be able to evolve relatively freely and also explore new connections in regulatory networks, which themselves can evolve to explore new developmental space, which (given a relatively stable proteome) may be the major route to higher complexity and phenotypic variation. Thus, logic would suggest that there may be many miRNAs that are not highly conserved over significant evolutionary distances, for which there is some supporting evidence(Berezikov et al., 2006a; Berezikov et al., 2006b; Lindow and Krogh, 2005). There is also good reason to expect that some, and perhaps many, miRNAs will have very restricted expression, as exemplified by the miRNA lsy-6,which controls left/right neuronal asymmetry in C. elegans and is expressed in only a few neurons (Johnston and Hobert, 2003). Indeed, recent deep sequencing shows that the rate of new miRNA discovery continues unabated, albeit with a logarithmic drop as deeper sequencing finds those that are not so highly expressed or are only expressed in a limited subset of cells. Many of these rarer miRNAs are less conserved, being order- or species-specific(Berezikov et al., 2006a; Berezikov et al., 2006b; Cummins et al., 2006). Moreover, if conservation is dropped as a requirement for the bioinformatics prediction, there are well over 1 million plausible miRNA precursor(stem–loop) structures in the mammalian genome, with a large fraction showing evidence of producing small RNA products in array-based assays (L. Croft, R. Taft and J.S.M., unpublished observations).
Other newly discovered classes of putative small regulatory RNAs, such as the 26–31 nt piRNAs (Aravin et al.,2006; Girard et al.,2006; Lau et al.,2006) and 21 nt 21U-RNAs(Ruby et al., 2006), show little long range evolutionary conservation. Many longer ncRNAs exhibit short-range sequence conservation only in small patches(Pang et al., 2006), as exemplified by the case of Xist in mammals, even though mutational studies have suggested that most of the molecule is functional(Nesterova et al., 2001). Thus, it seems safe to predict that the sequence of many, if not most,regulatory RNAs will not be highly conserved over significant evolutionary distances, even in cases of conserved function, due to more relaxed structure–function constraints (allowing rapid drift) and to selection pressures for adaptive radiation by altering the endogenous regulatory circuitry (network structure) underpinning developmental processes.
Endogenous feed-forward control of development by RNA networks
The simple logic is that if all of the transcribed and processed ncRNAs are functional, these ncRNAs must in the main be regulatory, because catalytic versatility is not the forte of RNA, notwithstanding its central role in splicing and translation and the identification of catalytic RNAs in other contexts (Gesteland et al.,2006). This is not to deny that some RNAs may have interesting(and as yet unappreciated) catalytic functions(Salehi-Ashtiani et al.,2006), or that secondary structural motifs or domains in RNA may be important mediators of interactions with proteins. Nonetheless, if the major function of the massive numbers of ncRNAs transcribed from animal, and particularly mammalian, genomes is regulation, as is likely, the logical extension is that the main (but not exclusive) role of such regulation is to control differentiation and development, rather than (simply) the short-term physiological responses of terminally differentiated cells. In summary, if functional, these RNAs must be mainly regulatory and, if so, their major regulatory function must be to direct development.
This conclusion is well supported by what we currently know or suspect of RNA regulation at many different levels of gene control (see below), but has one very profound implication: that the enormous amount of information required to program development is endogenously embedded in these RNA networks and that most regulatory transactions during development are directed by RNA,albeit mediated by proteins and supplemented by external cues that are conveyed by proteins (see below).
These RNA (and protein) networks, initially laid down by transcription in the female (and also possibly the male) gamete, create an epigenetic state that is asymmetric in the fertilized embryo and that is asymmetrically inherited by daughter cells. Thus each of the daughter cells has a defined subsequent state and is on a pre-programmed pathway of division and differentiation controlled by internal and external cues, the latter of which probably become operative at the time of syncytial formation in insects and morula formation in mammals. Thus, every cell in the developing organism contains an epigenetic memory of what its pathway has been, and where it is headed. In computer science, this is akin to what is termed a dynamical recurrent neural network (Aussem et al., 1995; Sudharsanan and Sundareshan, 1994), in which the current state of the network (in this case the gene regulatory and expression network) is defined as the combination of past history and current (external) inputs.
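The idea that a cell's current state combines its past history with current external inputs can be illustrated with a toy state-update rule. This is a minimal sketch and not a model of any real network: the three on/off signals, the update rules and the cue sequence are all hypothetical and chosen only to show that two cells receiving identical cues follow different trajectories if they inherited different states.

```python
from typing import List, Tuple

# On/off status of three hypothetical regulatory signals (A, B, C).
State = Tuple[int, int, int]


def next_state(state: State, cue: int) -> State:
    """Compute the next state from the previous state and one external cue.

    Toy rules (assumptions of this sketch):
      A is an inherited memory that maintains itself;
      B switches on only when A is on and the external cue is present;
      C switches on once B has been on, and then stays on.
    """
    a, b, c = state
    return (a, a & cue, b | c)


def trajectory(initial: State, cues: List[int]) -> List[State]:
    """Return the list of states visited, starting from the inherited state."""
    states = [initial]
    for cue in cues:
        states.append(next_state(states[-1], cue))
    return states


if __name__ == "__main__":
    cues = [0, 1, 0]                       # same external cues for both cells
    print(trajectory((1, 0, 0), cues))     # daughter cell that inherited A
    print(trajectory((0, 0, 0), cues))     # daughter cell that did not
```

In this sketch the first cell ends in state (1, 0, 1) and the second remains at (0, 0, 0), although both saw the same cue: the cue is interpreted through the state the cell already carries.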
This memory can be quite plastic and can be modulated by contextual cues (cell signalling), set in a new direction by artificial translocation of the cell to a new context, or (in some cases) recapitulated when required, such as during the regeneration of fingertips, tails, limbs or rays in mammals, lizards, axolotls and starfish.
This information about the state of the network (and the embedded trajectories) is enclosed in the structure of the chromatin (almost certainly itself controlled by RNA signalling; see below), the protein repertoire (also directed and regulated by RNA; see below) and, ultimately, the RNA networks that are current in individual cells. These RNA networks have been described as the cellular `soft wiring' or `ribotype'(Herbert and Rich, 1999a; Herbert and Rich, 1999b). Thus, RNA transcription and processing may be thought of as a series of steps,one or more of which have two mutually exclusive outcomes: a default outcome and an alternative outcome that is controlled by appropriate regulatory signals. These outcomes can be used either to regulate cellular responses directly or to control other RNA processing events, the latter forming networks wherein the processing of one RNA (either to produce more regulatory RNAs or alternative splice variants of mRNAs) is sequentially contingent on another (Herbert and Rich,1999a; Herbert and Rich,1999b).
While such networks would clearly be subject to natural selection (Herbert and Rich, 1999a; Herbert and Rich, 1999b; Mattick, 1994), I suggest that they now dominate the genomic programming of complex organisms and are the primary drivers of development in an unfolding cascade of regulatory interactions that gives each cell a unique identity and vectorial place in the developmental trajectory. This therefore constitutes an endogenous feed-forward regulation of differentiation and development, which is largely predetermined by embedded unfolding RNA networks. Thus, the current behaviour and trajectory of each cell are determined by the networks operative in the preceding cell or state, until the terminal state is reached, at which point the cell cycle is suspended and differentiation completed. This also suggests that there are, in fact, ∼10¹⁴ different cells (i.e. cells with a specific history and identity) in humans, leaving aside those that may have clonally expanded during (e.g.) fat storage or immune responses.
Parallel expression of exonic sequences and efferent RNA signals
An important feature of the proposed exaptation of introns as a source of trans-acting RNAs is the potential to produce regulatory signals in parallel with mRNA sequences (and other non-coding RNAs) that may then make contacts to alter settings at multiple loci or targets(Fig. 3). This is akin to what is described by neurobiologists as `efferent signals' (which are essential to motor coordination, cognition and memory)(Andersen et al., 1997; Bridgeman, 1995; Elman, 1998; Plunkett et al., 1997) and would in theory, and possibly practice, permit much more complex communication and control networks to operate in different cells and states during ontogeny(Mattick, 1994; Mattick, 2001; Mattick and Gagen, 2001). Almost all snoRNAs and a large proportion of miRNAs in animals are encoded within introns (Baskerville and Bartel,2005; Cai et al.,2004; Mattick and Makunin,2005; Rodriguez et al.,2004; Ying and Lin,2005). Moreover, many snoRNA and miRNA gene loci appear to be polycistronic (Cavaille et al.,2002; Huang et al.,2004; Lau et al.,2001; Runte et al.,2001; Seitz et al.,2004). Although introns are thought to be degraded after excision from primary transcripts (Padgett et al.,1986), there is good evidence that intronic RNAs may actually be processed to smaller RNAs with significantly long half-lives and specific subcellular locations (Clement et al.,2001; Clement et al.,1999). Recently, it was shown that ectopic expression of intronic sequences derived from the CFTR gene causes specific changes in transcription of various genes in HeLa cells, with different intron sequences resulting in a distinctive pattern of effects on specific subsets of genes(Hill et al., 2006). There is also evidence that coding and noncoding regions contain sequences that match others in the genome in functionally congruent networks(Rigoutsos et al., 2006).
Layers of RNA-directed control of gene expression in development
RNA is known or strongly implicated to be involved in the regulation of gene expression (both protein-coding and non-coding) at all levels in animals,creating extraordinarily complex hierarchies of interacting controls. This includes chromatin modification and associated epigenetic memory,transcription, alternative splicing, RNA modification, RNA editing, mRNA translation, RNA stability, and cellular signal transduction and trafficking pathways.
Chromatin structure and epigenetic memory
The fine control of chromatin structure is one of the major hallmarks of eukaryotes and of gene regulation in multicellular development(Margueron et al., 2005). Chromatin architecture is altered by DNA modification (methylation) and histone modifications of various types (including compound patterns of methylation, acetylation and phosphorylation at various residues)(Lam et al., 2005; Peterson and Laniel, 2004) in different ways at many different loci in different cell lineages. This involves proteins such as the polycomb group and trithorax group, which mediate repressive and permissive effects, respectively (epigenetic memories),on gene expression in development(Bantignies and Cavalli, 2006; Cernilogar and Orlando, 2005; Lund and van Lohuizen, 2004). As there are only a limited number of enzymes (DNA methyltransferases, histone acetylases and deacetylases, etc.) that perform these modifications, there must be some other signal that specifically directs these modifications to the myriad of target loci around the genome. Indeed, in the absence of an army of DNA sequence-specific binding proteins, the only logical alternative is RNA signals.
While the details of this putative RNA signalling are unknown, there is a great deal of evidence to support its existence (Andersen and Panning, 2003; Bernstein and Allis, 2005; Lippman and Martienssen, 2004; Schmitt and Paro, 2006). This includes the observations that (i) DNA methyltransferase and some domains in chromatin remodelling enzymes and binding effector proteins, such as SET, tudor domains and chromodomains, appear to interact with RNA (Bernstein and Allis, 2005; Jeffery and Nakielny, 2004; Sanchez-Elsner et al., 2006), (ii) many regulatory regions affecting chromatin structure and the expression of adjacent protein-coding genes are themselves transcribed in spatially and temporally regulated ways (Bae et al., 2002; Lipshitz et al., 1987; Sanchez-Elsner et al., 2006), and (iii) such non-coding transcripts play important roles in activation of gene expression by targeting global protein regulators such as HP1, Ash1 and the chromatin insulator protein CP190 to the cognate sequences in cis-regulatory response elements, including polycomb- and trithorax-response elements (PREs and TREs) (Grimaud et al., 2006; Lei and Corces, 2006b; Maison et al., 2002; Sanchez-Elsner et al., 2006; Schmitt et al., 2005) (see also below).
It also includes the well-characterized roles of RNAs in DNA methylation and transcriptional silencing in plants(Aufsatz et al., 2002; Mette et al., 2000; Wassenegger, 2000) and in animals (Bayne and Allshire,2005; Imamura et al.,2004; Jeffery and Nakielny,2004; Morris et al.,2004; Ting et al.,2005; Tufarelli et al.,2003; Weiss et al.,1996), imprinting in mammals(Sleutels et al., 2002),heterochromatin formation in Drosophila(Birchler et al., 2004; Pal-Bhadra et al., 2004),global activation or repression of sex chromosomes for dosage compensation in insects and mammals (Andersen and Panning,2003), RNA interference-mediated heterochromatin assembly and chromosome dynamics in fission yeast(Martienssen et al., 2005; Verdel and Moazed, 2005),meiosis (Cho et al., 2005; Watanabe et al., 2001), and programmed DNA elimination in Tetrahymena(Mochizuki and Gorovsky,2004). More recently it has been shown that a specialized set of RNAi components, including members of the argonaute family, are required for DNA methylation in plants (Qi et al.,2006) and yeast (Irvine et al., 2006), as well as transcriptional gene silencing and associated alterations to chromatin structure involving polycomb recruitment in Drosophila (Grimaud et al.,2006) and in human cells (Kim et al., 2006). This indicates that the RNAi machinery may regulate higher-order nuclear organization to orchestrate gene expression during development (Lei and Corces,2006a). The nuclear organization of chromatin insulators is also affected by the RNAi machinery (Lei and Corces, 2006b).
The proteins of the polycomb group (PcG) and trithorax group (TrxG) are important global regulators of transcriptional silencing and activation and mediators of epigenetic memory in development, best characterized in homeotic loci (Boyer et al., 2006; Guenther et al., 2005; Negre et al., 2006; Ringrose and Paro, 2004; Schwartz et al., 2006; Schwartz and Pirrotta, 2007; Squazzo et al., 2006). Both PcG and TrxG are recruited to genomic elements (termed PREs and TREs,respectively) that encompass hundreds of base pairs. These elements have a very weak consensus in Drosophila and none have been identified yet in mammals (Ringrose and Paro,2004; Ringrose and Paro,2007). Although five proteins associated with PcG or TrxG complexes (GATA, PSQ, Zeste, PHO and PHO-like) have DNA-binding properties,they bind to rather degenerate sequences and have not been demonstrated to have a role in target recognition in vivo(Ringrose and Paro, 2004). Moreover many PREs/TREs are transcribed as ncRNAs(Schmitt et al., 2005), and Hox gene loci exhibit complex patterns of non-coding transcripts on both strands (Carninci et al.,2005; Engstrom et al.,2006). The activation of the HoxA genes is also accompanied by intergenic antisense ncRNA transcription(Sessa et al., 2007). These observations, together with recent data suggesting that such transcripts and the RNAi pathway play a central role in PcG- and TrxG-mediated epigenetic regulation (Bernstein, E. et al.,2006; Grimaud et al.,2006; Lei and Corces,2006a; Sanchez-Elsner et al.,2006; Schmitt et al.,2005), suggest that the specificity of this process is controlled by RNA. Thus, the locus- and stage-specific epigenetic modification of chromatin by proteins with global functions may be viewed as the first derivative of a genomically encoded developmental program that is elaborated via unfolding RNA regulatory networks, informed by contextual cues and modulated by environmental inputs.
There is also increasing evidence that transcription itself is influenced, directly or indirectly, by RNA signalling (Goodrich and Kugel, 2006; Kim et al., 2006). Not only do certain classes of transcription factors either bind RNA or have high affinity for nucleic acid structures involving RNA (Ladomery, 1997; Shi and Berg, 1995), but also transcription has been shown to be both inhibited by single-stranded RNA directed at transcription start sites (Janowski et al., 2005) and activated by double-stranded RNAs (dsRNAs) directed at promoter sequences (Li et al., 2006). The latter requires the Argonaute 2 (Ago2) protein and is associated with a loss of lysine-9 methylation on histone 3 at dsRNA-target sites (Li et al., 2006). The β-globin LCR (`locus control region'), which is considered to be the archetypal long-distance transcriptional `enhancer', is itself specifically transcribed in erythroid cells (Ashe et al., 1997). Enhancers controlling expression of homeotic and other genes are also specifically transcribed (Feng et al., 2006; Jones and Flavell, 2005; Ronshaugen and Levine, 2004). It has also been shown that transactivation of the steroid receptor, as well as MyoD (which regulates skeletal myogenesis), requires the ncRNA called SRA (Caretti et al., 2006; Hube et al., 2006; Lanz et al., 1999; Lanz et al., 2002). The ncRNA 7SK is involved in the transcriptional activation of the proto-oncogene c-myc (Krause, 1996), among other examples (Goodrich and Kugel, 2006).
There is an enormous amount of post-transcriptional processing of RNA, both protein-coding and non-coding, much of which involves and is probably regulated by other RNAs. The mechanism of control of alternative splicing, the other major hallmark of developmentally complex organisms, is not known, but a range of circumstantial evidence suggests that this process too is controlled by RNA signals. This evidence includes: (i) that alternative splicing choices are not well explained by what is known about proteins involved in splicing or splicing regulation, despite speculations about combinatorial interactions(Blencowe, 2006; Caceres and Kornblihtt, 2002; Pozzoli and Sironi, 2005; Soller, 2006); (ii) that alternative splice sites are generally more highly conserved than constitutive splice sites (suggesting that a sequence-specific trans-acting sequence is required to address the former)(Sorek and Ast, 2003; Sugnet et al., 2004; Sugnet et al., 2006); and,most convincingly, (iii) the well-established observation that synthetic RNA derivatives directed against splice sites can easily alter splicing patterns both in vivo and in vitro(Garcia-Blanco, 2005; Gendron et al., 2006; Kole and Sazani, 2001; Roberts et al., 2006; Wilton and Fletcher, 2005). If this can easily be achieved by artificial means, then it is not unlikely that nature will employ a similar mechanism. It is not immediately obvious where such regulatory RNAs may be sourced, as conserved splice sites do not have obvious orthologous sequences elsewhere in the genome. However, it is possible that antisense transcripts are the source of these signals(Yan et al., 2005). It is also possible that, given the high affinity of RNA:RNA interactions, the antisense elements in putative trans-acting RNAs are short and difficult to identify, as they are in reverse when trying to identify the possible targets of orphan (non-rRNA directed) small nucleolar RNAs(Cavaille et al., 2000) (see below).
RNA modification and RNA editing
There is also considerable post-transcriptional modification and editing of RNA in eukaryotes, especially complex eukaryotes. SnoRNAs range from 60 to 300 nucleotides in length and guide the site-specific modification of target RNAs via short regions of base pairing. There are two major classes: (i) the box C/D snoRNAs, which guide 2′-O-ribose-methylation, and (ii) the box H/ACA snoRNAs, which guide pseudouridylation of target RNAs (Bachellerie et al., 2002; Henras et al., 2004; Kiss et al., 2004; Meier, 2005). The action of snoRNAs was initially thought to be restricted to rRNA modification in the nucleolus during ribosome biogenesis, but it is now evident that they can target other RNAs, including small nuclear (spliceosomal) RNAs and mRNAs (Bachellerie et al., 2002; Henras et al., 2004; Kishore and Stamm, 2006; Kiss et al., 2004; Meier, 2005). A subset of box H/ACA snoRNAs, sometimes called scaRNAs (small Cajal body RNAs) (Meier, 2005), is located in Cajal bodies (a class of small nuclear organelle), where they modify telomerase RNA in a cell-cycle dependent manner (Jady et al., 2004; Jady et al., 2003). At least some snoRNAs exhibit tissue-specific and developmental regulation and/or imprinting (Cavaille et al., 2000; Cavaille et al., 2002; Cavaille et al., 2001; Rogelj and Giese, 2004), which is indicative of a regulatory function. There are also a number of so-called 'orphan' snoRNAs without known targets (Cavaille et al., 2000; Cavaille et al., 2002; Cavaille et al., 2001; Huttenhofer et al., 2001; Kiss et al., 2004; Vitali et al., 2003), one of which has recently been shown to be involved in the aberrant splicing of the serotonin receptor 5-HT(2C)R gene in Prader–Willi syndrome patients (Cavaille et al., 2000; Kishore and Stamm, 2006).
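To make concrete why targets addressed through short regions of base pairing are hard to pin down, the following is a minimal sketch of a search for sites in a target RNA that pair perfectly with a short antisense element. The function names and both sequences are hypothetical and were invented for illustration; only strict Watson-Crick pairing is considered (G-U wobble pairs and mismatches are ignored), so this is a simplification and not a snoRNA target-prediction method.

```python
# Strict Watson-Crick pairs for RNA; G-U wobble pairing is deliberately ignored.
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA sequence (A, C, G, U only)."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def find_target_sites(antisense_element, target):
    """Return 0-based start positions in `target` that pair perfectly
    with `antisense_element` along its whole length."""
    site = reverse_complement(antisense_element)
    positions = []
    start = target.find(site)
    while start != -1:
        positions.append(start)
        # Step by one so that overlapping sites are also reported.
        start = target.find(site, start + 1)
    return positions


# Hypothetical sequences, made up for this example only.
antisense_element = "UCGGAUCAGCU"        # 11-nt guide region
target_rna = "GGCAAGCUGAUCCGAUUACG"      # toy target transcript

print(find_target_sites(antisense_element, target_rna))
```

The shorter the antisense element, the more often its complement occurs by chance in a large transcriptome, which is one way to see why short guide regions give ambiguous target predictions.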
RNAs may also be edited by enzymes termed ADARs (Adenosine Deaminases Acting on RNAs), which catalyze the deamination of adenosine to inosine to alter coding capacity, splicing patterns or regulatory functions, and also by the APOBEC family of cytidine deaminases, which catalyze C-to-U/C-to-T editing of both RNA and DNA (Navaratnam and Sarwar, 2006). The targets of RNA editing include not only mRNAs but also miRNAs and other ncRNAs whose functions are as yet unknown (Athanasiadis et al., 2004; Blow et al., 2004; Blow et al., 2006; Levanon et al., 2004; Yang et al., 2006). RNA editing appears to be the major mechanism by which environmental signals overwrite encoded genetic information to modify gene function and regulation, particularly in the nervous system, where it is well documented to modify transcripts encoding proteins involved in fast neural transmission. These include ion channels and ligand-gated receptors (Bass, 2002; Valente and Nishikura, 2005) such as the serotonin receptor, which is regulated in the same region by snoRNA-mediated RNA modification (Kishore and Stamm, 2006). In humans, where RNA editing is considerably more prevalent than in mouse (Athanasiadis et al., 2004; Blow et al., 2004; Levanon et al., 2004), RNA editing alters many transcripts from genomic loci encoding proteins involved in neural cell identity, maturation and function. This implies a role for RNA editing not only in the regulation of neural transmission but also in brain development (Mehler and Mattick, 2007).
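The effect of adenosine-to-inosine editing on coding capacity can be illustrated with a small sketch. It relies on a fact not spelled out above, namely that inosine is read as guanosine by the translation machinery. The codon table is a deliberately tiny subset of the standard genetic code, and the transcript and all function names are hypothetical, chosen only for illustration.

```python
# Small subset of the standard genetic code, sufficient for this example.
CODON_TABLE = {"AUG": "Met", "CAG": "Gln", "CGG": "Arg", "AUA": "Ile"}


def edit_a_to_i(rna, position):
    """Deaminate the adenosine at a 0-based position to inosine ('I')."""
    if rna[position] != "A":
        raise ValueError("A-to-I editing requires an adenosine at this position")
    return rna[:position] + "I" + rna[position + 1:]


def as_decoded(rna):
    """Inosine pairs like guanosine, so it is decoded as G."""
    return rna.replace("I", "G")


def translate(rna):
    """Translate complete codons using the minimal table above."""
    usable = len(rna) - len(rna) % 3
    return [CODON_TABLE[rna[i:i + 3]] for i in range(0, usable, 3)]


# Hypothetical three-codon transcript: AUG CAG AUA
transcript = "AUGCAGAUA"
edited = edit_a_to_i(transcript, 4)      # the A in the CAG codon

print(translate(transcript))             # unedited reading
print(translate(as_decoded(edited)))     # reading after editing
```

In this toy case a single edited adenosine recodes a glutamine codon (CAG) as an arginine codon (CGG), which is the kind of change in coding capacity referred to above.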
mRNA translation and stability
It is now well established that mRNA translation and mRNA stability are controlled by miRNAs, primarily directed at sequences in the 3′ untranslated region (UTR). 3′ UTRs have expanded greatly during metazoan evolution and in humans occupy over 1% of the genome, accounting for almost as much of the mRNA sequences as the protein-coding sequences themselves (Frith et al., 2005), and suggesting that extremely complex regulatory controls are embedded within them. miRNAs have been shown to be centrally involved in gene regulation in both plants and animals (Bartel, 2004; Carrington and Ambros, 2003; Mattick and Makunin, 2005; Pasquinelli et al., 2005), including flowering in plants (Chen, 2004) and many aspects of development (Bernstein et al., 2003; Giraldez et al., 2005; Hornstein et al., 2005; Kanellopoulou et al., 2005; Ronshaugen et al., 2005), cell growth and differentiation (Baehrecke, 2003; Brennecke et al., 2003; Chen et al., 2004; Hatfield et al., 2005; Johnston and Hobert, 2003; Kuwabara et al., 2004; Naguibneva et al., 2006; Wienholds et al., 2005) in animals. miRNA regulation has also been shown to be perturbed in developmental abnormalities including cancer (Croce and Calin, 2005; Esquela-Kerscher and Slack, 2006; Hammond, 2006) and possibly other diseases (Abelson et al., 2005), as well as in quantitative trait variation (Clop et al., 2006). Some miRNAs have also been shown to regulate Hox gene expression (Hornstein et al., 2005; Mansfield et al., 2004; Naguibneva et al., 2006; Yekta et al., 2004) and to exhibit expression patterns reminiscent of Hox genes in embryonic development (Mansfield et al., 2004). Moreover, as noted above, while there are ∼10³ known miRNAs, there may be far more expressed in mammals.
It is also worth noting that neither specific nor general biological functions have yet been ascribed to the thousands of piRNAs that are expressed in mammalian testis, although they are known to interact with the Piwi subfamily of Argonaute proteins, which are required for germ cell maintenance and meiosis (Aravin et al., 2006; Girard et al., 2006; Lau et al., 2006). The same is true of the class of 21U-RNAs recently discovered in C. elegans (Ruby et al., 2006).
RNA intersection in signalling cascades and other aspects of cell biology
While it is already clear that various proteins involved in gene regulation have RNA binding domains or domains that intersect with complexes involving RNA, there is also evidence that proteins involved in cellular signal transduction cascades bind RNA. This is exemplified by the RasGAP-binding protein G3BP/rasputin, which contains both an RNA recognition motif (RRM) and SH3 binding domains (Irvine et al., 2004; Pazman et al., 2000; Zekri et al., 2005). There is also evidence that ncRNAs may be involved in regulating nuclear factor trafficking (Willingham et al., 2005), and the large numbers of ncRNAs that appear to have a cytoplasmic location (Cheng et al., 2005) suggest that many other cellular functions are also regulated by such RNAs.
The role of proteins in development
It is clear that proteins, many of which (such as homeotic proteins, signalling proteins and transcription factors) are referred to as regulatory and are differentially expressed in different cells and tissues, are intimately intertwined with regulatory RNAs in the control of development, and that the boundaries between them may often be blurred. However, without putting too fine a point on it, I suggest that there are two general classes of proteins involved in developmental regulation.
The first class encompasses those whose role is to transmit contextual signals from the cell and the external environment (other cells and circulating signals) into the gene regulatory networks of the cell. It includes secreted proteins (as well as other ligands such as steroid hormones) that act locally or systemically and their receptors, for example the patched-hedgehog (Murone et al., 1999) and Wnt-frizzled systems (Gordon and Nusse, 2006), which may be positioned asymmetrically on different parts of the cell surface. This class also includes internal protein kinase-mediated signal transduction cascades. These signals are critical to the fidelity of the developmental process, both as feedback controls to correct (inevitable) stochastic errors in the endogenous RNA-directed program, and as important additional positional information to supplement the endogenously specified developmental program. For example, imagine a robot that has been given full instructions for the specification and assembly of a motor vehicle but is denied any environmental reference information (through vision, touch, etc.). It would be impossible to design a program whose execution would be sufficiently precise as to preclude the necessity for feedback controls or (particularly in the case of self-assembling multicellular systems) remove the enormous advantages of positional information and cell–cell communication during growth and development. However, as noted already, the fact that environmental signalling is critical to the process of ontogeny does not mean that this is where the majority of the relevant information is embedded.
The second class of 'regulatory' proteins important for development encompasses those that effect analogue functions to control gene expression at various levels. These proteins are directed to the appropriate site of action in many, if not most, cases by RNA signals, albeit also influenced (activated or repressed) by intersections with the protein-based signal transduction systems that usually operate via phosphorylation. These effector proteins include homeotic and other types of chromatin-modifying proteins, transcription factors, splicing factors, RNA editing enzymes, RNA modification enzymes, and Argonaute proteins and others in RISC complexes. Many of these proteins are themselves developmentally regulated at the transcriptional and post-transcriptional level, and are contributory variables in the complex matrix of RNA:DNA:protein interactions and the resultant regulatory networks.
Genetic signatures of RNA regulatory networks
If RNA regulatory networks pervade the cell and developmental biology of complex organisms in such a profound manner, why have they not been recognized sooner, especially in genetic screens? Apart from the fact that the sheer complexity of the ncRNA population has only recently been revealed by sophisticated transcriptomic analysis (genome-scale tiling arrays, extensive cDNA libraries and, most recently, deep sequencing of small RNA fractions) and the possibility that regulatory networks may be intrinsically robust, most genetic screens have suffered from a strong expectational, perceptual and technical bias towards mutations in protein-coding sequences. The expectational bias derives from the long-held orthodoxy that most genes encode proteins and their cis-acting regulatory elements. This has been reinforced by the perceptual bias that mutations in proteins (as key analogue components of the system) will, in most cases, produce a strongly impaired and often visibly affected phenotype. In contrast, those in regulatory circuits will, in many if not most cases, produce more subtle effects that may not be noticed at all, except in sensitive genetic screens such as that which identified the miRNA lsy-6 in C. elegans (Johnston and Hobert, 2003). Indeed, the entire world of miRNAs and their central role in regulating differentiation and development lay hidden for many years despite intense genetic scrutiny of fruitflies and mammals. It was only revealed by the characterization of the small RNA products of the let-7 and lin-4 loci, which control developmental timing in C. elegans (Lee et al., 1993; Reinhart et al., 2000), and the intersection of these findings with the characterization of the similar-sized small RNAs produced by RNAi (Hammond et al., 2000; Zamore et al., 2000), also discovered in C. elegans (Fire et al., 1998). This suggested that a similar mechanism may produce (other) short regulatory RNAs (Grishok et al., 2001; Lagos-Quintana et al., 2001; Lau et al., 2001; Lee and Ambros, 2001; Ruvkun, 2001).
There may also be a difference between protein-coding sequences and those encoding regulatory sequences (whether acting in cis or in trans via RNA) in terms of their functional sensitivity to point mutations, which comprise the bulk of the natural and induced mutations in mammals. On the other hand, many non-coding regulatory mutations have been known for a long time in Drosophila, where many mutations have been obtained by deletion or insertion. However, these have almost inevitably been interpreted as affecting cis-regulatory DNA elements (Duncan, 2002), despite the fact that many of the regions concerned, such as bxd in the bithorax complex (Lipshitz et al., 1987; Petruk et al., 2006), which includes PRE/TRE response elements (Tillib et al., 1999), are known to be transcribed into separately regulated ncRNAs (i.e. may represent separate genetic units) and to be involved in the complex and still poorly understood genetic phenomena of transvection and polycomb-mediated developmental memory (Mattick and Gagen, 2001). Apart from a few cases of regulatory mutations affecting quantitative traits that have been mapped to completion in well-structured animal pedigrees, most screens in mammals (especially in humans) progress from positional mapping to mutation screening of exons, with little prospect (due to the enormous technical and statistical difficulties) of identifying mutations that lie outside these limited regions in large intronic or intergenic sequences. However, some such mutations are being identified, including one in a novel ncRNA called MIAT, which appears to increase the risk for myocardial infarction (Ishii et al., 2006). I predict that as re-sequencing of target regions in affected populations becomes feasible with new sequencing technologies, more of these mutations/variations will be discovered and that many will affect regulatory RNAs sourced from such regions. Indeed, apart from revealing the true extent of the involvement of RNA in the developmental programming of humans and other complex organisms, these discoveries will go to the heart of what is perhaps the most interesting aspect of our biology: the genetic factors controlling or influencing our individual physical, physiological and psychological variation, including disease susceptibility.
I suggest that we have fundamentally misunderstood the nature of genetic programming of complex organisms for the past 50 years, because of the presumption (largely true in the prokaryotes but not in the complex eukaryotes) that most genetic information is transacted by proteins. This view was derived from studying simple organisms in an analogue age before the power and use of digital information systems were appreciated. However, it now seems increasingly likely that most of the human genome, and those of other complex organisms, encodes a vast and hitherto hidden layer of regulatory RNAs (Mattick and Makunin, 2005; Mattick and Makunin, 2006). This evolved to breach the operational limits imposed by solely protein-based regulatory systems, in the face of the nonlinear scaling of regulatory requirements as living organisms explored higher organizational and macro-functional complexity (Mattick, 2004). Indeed, it may well be that most of the human genome is functional (M. Pheasant and J.S.M., manuscript submitted for publication), including many sequences such as introns and other mobile element-derived sequences that have been long considered as parasitic evolutionary debris rather than the historic raw material for genetic innovation and the current embodiment of higher levels of regulatory sophistication. Thus it appears that the genome is largely composed of sequences encoding components of RNA regulatory networks that co-evolved with a sophisticated protein infrastructure to interact with RNAs and act on their instructions.
The advantages of RNA over protein as a regulatory molecule are its genomic compactness, its high sequence specificity, and its mutability and associated ease of re-configuration of interaction networks to explore phenotypic and functional diversity. This leads to a new conception of how multicellular development is regulated and where the relevant information is embedded, i.e. that development is primarily driven by endogenous RNA regulatory networks, which are contextually informed and whose instructions are functionally executed by proteins. There is clearly a long way to go to understand and parse these networks, with many surprises yet in store, including the likely discovery of new classes and subclasses of small and large regulatory RNAs, and many biological and mechanistic aspects to decipher. Whatever the details may be, the irony is that what was dismissed as junk because it was not understood may well comprise the majority of the information that underpinned the emergence and now directs the development of complex organisms (Mattick, 1994; Mattick, 2004; Mattick and Gagen, 2001), including ultimately the brain (Mehler and Mattick, 2007). It probably also contains a large fraction of the information that determines the phenotypic differences between individuals and the diversity of species.
This article draws on, and to some extent integrates, ideas elaborated in others that I have co-authored with Michael Gagen, Igor Makunin, Mark Mehler, Michael Pheasant and Ryan Taft, which are cited in the appropriate places in the text. I am grateful to them, as well as all members of my laboratory, for many stimulating discussions and research contributions over many years. I am also grateful to collaborators, particularly Yoshihide Hayashizaki, Piero Carninci and Harukazu Suzuki from RIKEN, and other senior scientists in my own institute and elsewhere. I particularly thank Paulo Amaral, Andy Cossins, Larry Croft, Martin Feder, Michaela Handel, Ian Holmes, Igor Makunin, Michael Pheasant and Cas Simons for comments and suggestions on the manuscript. This work was supported by an Australian Research Council Federation Fellowship.