A set of 956 expressed sequence tags derived from 7-hour (mid-cleavage) sea urchin embryos was analyzed to assess biosynthetic functions and to illuminate the structure of the message population at this stage. About a quarter of the expressed sequence tags represented repetitive sequence transcripts typical of early embryos, or ribosomal and mitochondrial RNAs, while a majority of the remainder contained significant open reading frames. A total of 232 sequences, including 153 different proteins, produced significant matches when compared against GenBank. The majority of these identified sequences represented ‘housekeeping’ proteins, i.e., cytoskeletal proteins, metabolic enzymes, transporters and proteins involved in cell division. The most interesting finds were components of signaling systems and transcription factors not previously reported in early sea urchin embryos, including components of Notch and TGF-β signal transduction pathways. As expected from earlier kinetic analyses of the embryo mRNA populations, no very prevalent protein-coding species were encountered; the most highly represented such sequences were cDNAs encoding cyclins A and B. The frequency of occurrence of all sequences within the database was used to construct a sequence prevalence distribution. The result, confirming earlier mRNA population analyses, indicated that the poly(A) RNA of the early embryo consists mainly of a very complex set of low-copy-number transcripts.
The sea urchin embryo has proved particularly useful for studies of the regulatory mechanisms that underlie the early processes of embryogenesis. Recent studies have focused on spatial control of differential gene expression and on specification functions that depend on signaling between blastomeres. In sea urchin embryos, these processes begin early in cleavage (reviewed by Davidson et al., 1998). Here we describe an initial expressed sequence tag (EST) analysis of cleavage-stage (7-hour postfertilization) mRNA populations from Strongylocentrotus purpuratus embryos. This study was undertaken to provide a qualitative exploration, by a relatively unbiased method, of the nature of the biosynthetic program by then established. This is an interesting moment in early development: at 7 hours, the embryos are in 6th cleavage and all of the major early lineage compartments have segregated from one another. Blastomere specification processes are well underway, as evinced by the onset of regional programs of spatial gene expression. The embryo genomes are already operating at maximal rates of transcription, relative to any later stages. Essentially all maternal mRNA has been loaded onto the polysomes by this point, but newly synthesized zygotic messages are also being translated (reviewed by Davidson, 1986; Davidson et al., 1998).
The mRNA populations of S. purpuratus embryos are relatively well known. Early in development, there are on the order of 10⁴ different polysomal mRNA sequences, as established by mRNA excess hybridization against single-copy DNA tracers (reviewed by Davidson, 1986). Most of the mRNA (i.e., ∼90% by mass) consists of relatively low prevalence transcripts, though there is a rising population of more highly represented early blastula-specific zygotic mRNA species (Reynolds et al., 1992), and also some other relatively prevalent mRNAs encoding proteins required for cell division, such as the cyclins (Evans et al., 1983; Pines and Hunt, 1987; Kelsowine-Miller et al., 1993). As expected, these classes of message are well represented in the EST database that we here report. The only mRNA species that are very highly prevalent in the 7-hour embryo encode zygotically expressed early histones, translation of which accounts for about 8% of total protein synthesis in this species at 7 hours postfertilization (Goustin, 1981). However, the early histone messages are not polyadenylated, and would not be expected to be well represented in the cDNA library sampled in this study. Though the maternal ‘cleavage-stage histone’ mRNAs are polyadenylated, they are much more rare (reviewed by Davidson, 1986). There is a total of about 5×10⁷ molecules of poly(A) RNA per embryo. If 90% of these consist of relatively low prevalence messages (Lasky et al., 1980; Flytzanis et al., 1982) and the complexity of the latter is about 10⁴ species, then each species will be represented by only about 0.01% of the mRNA molecules.
Put another way, when the embryo has been divided up into several hundred cells at the late blastula stage, i.e., as it approaches its final cytoplasm-to-nucleus ratios, there are about 10⁵ mRNA molecules/cell, and the typical low prevalence RNA of the complex class of message will be present in the average population at less than ten copies per cell (see Davidson, 1986 for calculations). In reality there is of course a continuum of prevalence; many species exist at <3 copies per average cell; others at 3-10; others at 10-30, etc. (Lasky et al., 1980). In a 10³ EST database, transcripts present at 10 copies per average cell or less would be expected to be encountered only once if at all; or, if they are found twice or a few times, this would imply the order of 10² copies per cell.
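The arithmetic behind these estimates can be restated as a minimal sketch. The constants are the figures quoted in the text; the cell number of 500 is an assumption standing in for "several hundred cells", and all function names are illustrative rather than taken from the original analysis.

```python
# Illustrative restatement of the prevalence arithmetic; names are hypothetical.
POLY_A_PER_EMBRYO = 5e7    # poly(A) RNA molecules per embryo (from the text)
LOW_PREV_FRACTION = 0.90   # share belonging to low-prevalence messages (from the text)
COMPLEXITY = 1e4           # distinct low-prevalence species (from the text)
CELLS = 500                # ASSUMED late-blastula cell number ("several hundred")


def copies_per_species_per_embryo():
    """Average number of molecules of one low-prevalence species per embryo."""
    return POLY_A_PER_EMBRYO * LOW_PREV_FRACTION / COMPLEXITY


def copies_per_species_per_cell(cells=CELLS):
    """Same quantity expressed per average cell."""
    return copies_per_species_per_embryo() / cells


def expected_est_hits(copies_per_cell, n_ests=1000, molecules_per_cell=1e5):
    """Expected occurrences of a transcript in a random sample of n_ests clones."""
    return n_ests * copies_per_cell / molecules_per_cell
```

Under these assumptions a typical rare species comes out at roughly 4,500 molecules per embryo, or about 9 per average cell, and a transcript at 10 copies per cell would be expected about 0.1 times in a sample of 1,000 ESTs, which is why such sequences are seen once if at all.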
In the following, we consider the poly(A) RNA population of the 7-hour embryo, both qualitatively and quantitatively. Those EST sequences that produce significant matches with known protein-coding sequences in the GenBank database provide a qualitative snapshot of the biosynthetic functions in which the early embryo is engaged. We have also reconstructed the prevalence distribution of sequences in the embryonic poly(A) RNA population from that of the EST sequences. By this means, we confirm that the mRNA population of the early embryo consists mainly of a low prevalence, high complexity sequence set, as deduced earlier from its complexity and hybridization kinetics, its synthesis kinetics, and from measurements of cDNA hybridization to randomly selected cDNA clones (Galau et al., 1976, 1977; Flytzanis et al., 1982; Lasky et al., 1980; Duncan and Humphreys, 1981; Xin et al., 1982; Davidson, 1986).
MATERIALS AND METHODS
cDNA libraries were constructed in the Gibco P-Sport vector, following the manufacturer’s instructions, and using N6 random primers. The libraries were arrayed in 384-well plates in a Genetix Q-Bot robot. The ESTs determined in this work were obtained from eight different plates of a library made from 7-hour early cleavage-stage S. purpuratus embryos. mRNA was extracted from these embryos as described (Lee et al., 1986). The screening experiments described below were carried out on high-density filters prepared from the arrayed library using the 4×4 format described by Maier et al. (1994). On each 22×22 cm filter, 18,432 clones are spotted in pairs oriented by a preassigned program loaded into the Q-Bot (Maier et al., 1994).
Clones were withdrawn from areas of the 384-well plates, the DNA prepared in an Autogen robot and their sequence obtained by conventional procedures in an automated ABI 377 sequencer, using dideoxy chain terminators or a T3 dye primer. The sequences were accessioned in GenBank under the identifiers: AF122056 to AF122818.
Each of the nucleotide sequences was translated in all six reading frames and the resulting six amino acid sequences were assembled into a single sequence which was used to search GenBank DNA sequences with TFASTA and TBLASTN. This is an effective method of identifying weak sequence similarities. We have required that the TFASTA and TBLASTN results agree for a sequence to be listed, except in marginal cases.
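As an illustration of the six-frame translation step, the following is a minimal sketch using the standard genetic code. It is not the software used for the searches; the function names are hypothetical, and the way the six products are joined into one query (here with an "X" spacer) is an assumption, since the text does not specify how the frames were assembled.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI translation table 1).
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence (A, C, G, T, N only)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def translate(seq, frame=0):
    """Translate one reading frame; codons with ambiguous bases become 'X'."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(frame, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Three forward frames followed by three reverse-complement frames."""
    rc = reverse_complement(seq)
    return [translate(seq, f) for f in range(3)] + [translate(rc, f) for f in range(3)]


def joined_query(seq, spacer="X"):
    """ASSUMED assembly of the six products into a single search query."""
    return spacer.join(six_frame_translations(seq))
```

Stop codons are rendered as `*`, so a long stretch without `*` in any one frame marks a candidate open reading frame.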
Screening of high-density filters
High-density colony blots of arrayed cDNA libraries from 7-hour, 20-hour and 40-hour-old sea urchin embryos were utilized for this work. An oligonucleotide Sox probe was generated from the overlapping regions of two Sox EST clones: 5′-GCAAGAGGTTAGGAGCCGAATGGAAGTTGCTTTCTG-3′. The probe was end-labeled with T4 polynucleotide kinase. The membranes bearing the cDNA libraries were pre-wet first with water and then with hybridization solution, and placed in hybridization bottles. Prehybridization (for 2 hours) and hybridization (for 16 hours) were carried out at 37°C in 6× SET, 5× Denhardt’s solution, 50 mM PBS, pH 7.4, 0.25% SDS, 100 μg/ml sonicated denatured salmon sperm DNA. After hybridization, the membranes were washed for 10 minutes in 2× SSC, 0.2% SDS and for 15 minutes in 1× SSC, 0.2% SDS at room temperature and then for 30 minutes at 37°C in TMACl mix, which is 3 M tetramethylammonium chloride, 50 mM Tris (pH 8.0), 2 mM EDTA, 0.1% SDS. They were then washed for 30 minutes in TMACl mix at 45°C. The membranes were wrapped in plastic wrap and exposed to films for autoradiography. Positive spot pairs were identified by reference to the spotting template (cf. Maier et al., 1994). To obtain transcript prevalence estimates, the number of spot pairs reacting with probes for that transcript (N) was counted. Relative prevalence was taken as P=N/(18,432×F), where 18,432 is the number of spot pairs per filter and F is the number of filters used in the analysis. Absolute prevalence is P×T, where T is the number of mRNA molecules per egg or embryo (see text).
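The prevalence calculation is simple enough to state as a short sketch. The function and constant names are illustrative; only the formula and the figure of 18,432 spot pairs per filter come from the text.

```python
SPOT_PAIRS_PER_FILTER = 18_432  # spot pairs per high-density filter (from the text)


def relative_prevalence(n_positive_pairs, n_filters):
    """P = N / (18,432 x F): fraction of screened clones that react with the probe."""
    if n_filters <= 0:
        raise ValueError("at least one filter must be screened")
    return n_positive_pairs / (SPOT_PAIRS_PER_FILTER * n_filters)


def absolute_prevalence(n_positive_pairs, n_filters, total_mrnas):
    """P x T: estimated number of transcript molecules per egg or embryo."""
    return relative_prevalence(n_positive_pairs, n_filters) * total_mrnas
```

For example, `absolute_prevalence(n, f, t)` would be called with the counted positive spot pairs `n`, the number of filters `f`, and whatever value of T is adopted for the stage in question; no particular counts are implied here.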
RESULTS AND DISCUSSION
Overall distribution of sequence categories
The 956 ESTs that are the subject of this report were obtained from a directionally cloned, arrayed cDNA library prepared from 7-hour embryo poly(A) RNA. Insert lengths in this library lay mainly in the range 1.2-2.5 kb, and the average readable EST length on which the following analysis is based was about 500 nucleotides. These sequences were obtained using the 5′ vector primer with respect to insert orientation, and thus they preferentially sample the 5′ ends of the inserts. The library was directionally cloned from cDNA initiated on random primers to avoid a bias towards 3′ trailer sequences, which are generally very long on sea urchin embryonic transcripts. This strategy, plus the use of the 5′ sequencing primers, resulted in a large majority of ESTs containing either identified or putative protein-coding sequences, as we show below. Those ESTs that do not contain open reading frames (ORFs) derive either from 3′ trailer sequences of bona fide messenger RNAs, or from the interspersed repeat sequence class of poly(A) RNAs that constitute about 60% of the total cytoplasmic poly(A) RNA mass of S. purpuratus eggs and early embryos (Costantini et al., 1980; Davidson, 1986). These are long, non-translatable, apparently unprocessed transcripts resembling nuclear pre-mRNA in structure and consisting of covalently linked single copy and interspersed repetitive sequence transcripts (Calzone et al., 1988). Most of the repetitive sequence elements recognized in the EST database are probably of this origin.
The translated sequences (see Materials and Methods) were compared to GenBank and significant matches collected. The initial EST set contained two classes of sequence which were subtracted from the database prior to analysis, namely 12 ESTs consisting of vector sequences and 33 sequences derived from an ETS class transcription factor mRNA. As described in footnote 1 of Table 1, the ETS sequences were artifactually over-represented because they were cloned at a natural NotI site, which vastly increases the probability of recovery relative to other sequences. After removal of the vector sequences and all but one ETS EST from the data set, there remained 956 sequences. Table 1 indicates the general categories, defined by these sequence similarity searches, into which the total set of ESTs is divided. There are three general categories: (1) recognized protein-coding sequences for which there is a high probability that the identifications are meaningful (see footnote 2 of Table 1), plus a few ESTs displaying significant similarities to unidentified ESTs of other organisms, together amounting to about 24% of the 956 ESTs; (2) sequences representing known classes of transcript other than mRNAs encoded in the nuclear genome, namely, mitochondrial RNAs, rRNAs and interspersed repeat-containing poly(A) RNA transcripts, totaling about 25% of the 956 ESTs; and (3) sequences belonging to neither of the above classes, totaling about 51% of the 956 ESTs. These could represent unidentified mRNAs.
EST identification by matches with GenBank
Table 2 lists all of the 215 significant matches discovered by comparing the EST sequences against GenBank (i.e., the initial row of Table 1). These matches are for the most part evidence of membership in the same family of proteins rather than specific identifications of particular proteins. Of course, where the echinoderm mRNA has earlier been sequenced, as with the cyclins, arylsulfatase, the metallothioneins, dynein, kinesin, Spec2A, actins, histones, fibropellin and some other well-studied proteins, the identifications are indeed exact. But in general they merely indicate the probable family of proteins to which the sequence encoded by the EST fragment belongs. Detailed examination of the matches leads to an important caveat. Only when the probability of chance occurrence is less than about 10⁻¹² are the matches likely to identify long sequence overlaps, thus strongly implying membership in a given family of proteins. Probabilities of chance occurrence higher than this, up to our limit of 10⁻⁶, may only signify the presence of recognized protein motifs, which may or may not be shared amongst families of proteins other than that listed for the given entry in the Table.
The classification system that we have employed in Table 2 essentially divides the identified ESTs into four major categories. Category A can loosely be described as structural and enzymatic housekeeping proteins that are required in dividing, metabolizing cells. Category B consists of signaling and intercellular communication proteins. Category C includes transcription factors and other nuclear proteins that affect gene regulation, and category D consists of specialized products. In Table 3 the number of diverse mRNA species represented in each of these categories is listed, irrespective of their prevalence, that is, irrespective of the number of occurrences of each sequence match in Table 2. There are a total of 153 different proteins recognized. The majority of the diversity recovered (∼60%) is in category A, consisting of housekeeping proteins of every variety. About 20% of all the different proteins are putatively involved in intercellular communication and/or signaling, and about 11% are nuclear proteins, probably utilized in the regulation of gene expression directly or indirectly. However, these values cannot be used to estimate the fraction of all the mRNA complexity devoted to these respective functional categories because the recognized sequence class of the EST database is biased towards more prevalent mRNAs compared either to the non-recognized ESTs (Table 1) or to the total mRNA population, as we discuss below. This is because current knowledge is itself biased toward more prominent proteins (i.e., more prevalent mRNAs), and because rare mRNAs, which constitute the vast majority of the total mRNA complexity, are grossly underrepresented in the EST database just because of its small size. However, the probability that a given prevalent mRNA species will appear in the 956 clone database is much higher. Thus only some very general conclusions can be drawn. 
It does seem clear that, at least in the moderate prevalence class of 7-hour embryo mRNAs, a large fraction of species will belong to category A, consisting of messages encoding transporters, cytoskeletal proteins, cell division proteins, enzymatic machinery, etc., and this will probably be true of the great mass of the yet unknown rare messages in the early embryo as well since many of the category A transcripts occur only once in the database and thus are probably of low abundance. Though long suspected, this observation provides the first relevant and direct evidence for the high fraction of early embryo RNAs encoding housekeeping machinery.
Among the housekeeping proteins of category A are some potentially interesting finds. Examples include a considerable number of ion transporters (subcategory AI); RNA splicing factors, RNA polymerase and RNA mobilizing enzymes (AII); a protein strongly related to a spindle protein that regulates cytokinesis (AIII); a nuclear pore complex protein and a centrosomal protein (AIV); and two proteins that in other systems control apoptosis (AIX).
For the present authors, as for most developmental biologists, the most interesting categories are (B) signaling and (C) transcription control, and the most useful aspect of any EST project is the new probes for interesting genes that it affords. Several such discoveries are included in parts B and C of Table 2, i.e., sequences not previously isolated from echinoderm material. In category B, these include sequences related to chordin and lunatic fringe, a secreted activator of the Notch signaling receptor (Johnston et al., 1997) (subcategory BI), and various G proteins and casein kinase. In category C, we found sequences similar to five different transcription factors to our knowledge not previously recovered from these embryos, including factors of the Sox and MAD families. All of these finds have new implications for the functional activities of the 7-hour embryo. For example, though mRNAs encoding two putative ligands of TGF-β family have been reported, namely univin (Stenzel et al., 1994) and an orthologue of human BMP5-8 (Ponce et al., 1999), the presence of chordin and of a MAD class transcription factor suggests that signaling mediated by one or the other of these ligands is occurring or will soon occur in cleavage-stage embryos, and that it could be involved in the early specification functions. Similarly, though the early embryo contains maternal Notch mRNA and protein (Sherwood and McClay, 1997), the presence of a fringe family mRNA at 7 hours postfertilization suggests a current regulation of this pathway, for which there is yet no role assigned in the cleavage-stage embryo. In addition, we note that one of the unsolved problems in the regulatory molecular biology of the early sea urchin embryo is the mechanism by which maternal transcription factors are modified so that they become active in the appropriate embryonic territories (Davidson et al., 1998).
In quantitative terms, this is a very small EST project, and of all the sequences obtained only about a quarter generated significant matches against the GenBank database. Prima facie, it is remarkable that the recognized sequences include such interesting examples and provide such potentially useful probes.
ESTs displaying no significant similarity to known genes
About half the total EST set displays no significant match with any sequence in GenBank (Table 1). In order to determine whether these clones are likely to represent the non-translatable poly(A) RNAs of the early embryo, or in contrast, are bona fide mRNAs too divergent for recognition using the (mainly mammalian) GenBank database, we carried out a statistical search for open reading frames (ORFs). This was done by determining the distribution of lengths between stop codons, compared to the lengths that occur in randomly generated sequences of a matched sequence length distribution. Results are shown in Fig. 1; details can be found in the legend. Briefly, the observed distribution of ORF lengths in the unidentified ESTs was compared with that expected to occur on a chance basis in a set of sequences of the length distribution of these ESTs. Fig. 1 shows clearly that the ORF length distribution observed (histogram) cannot be accounted for on a random basis. Most of the unidentified ESTs in the analysis thus contain protein-coding sequences, i.e., by our estimate, 65-80%. It follows that if we consider all the possible protein-coding sequences, in the identified plus unidentified clones (75.2% of the total in Table 1), 73-84% of these indeed include codogenic mRNA sequence. The remainder consist of 3′ trailer sequences of mRNAs, or are interspersed repeat-containing transcripts, probably mainly the latter. On this basis, the total number of ESTs representing the interspersed repeat transcript class would be the sum of those in which repeats were recognized (65 sequences; Table 1) plus the non-coding sequences inferred from the analyses of Fig. 1. The fraction of the whole library (as sampled in the ESTs) representing interspersed transcripts would thus lie in the range of 25-40%. 
This is consistent with the estimate that, in eggs and early embryos, this class of poly(A) RNA by mass constitutes over 50% of the total (Costantini et al., 1980), if the probability of reverse transcription is about the same per molecule, since the interspersed poly(A) RNAs are on the average at least 5× as long as are the mRNAs.
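The logic of the ORF search can be sketched as follows. This is an illustration only, not the procedure of the Fig. 1 legend: it assumes that only the three forward frames are scanned, that the random comparison sequences use equal base frequencies, and that the statistic of interest is the longest stop-free run per sequence. All names are hypothetical.

```python
import random

STOP_CODONS = {"TAA", "TAG", "TGA"}


def open_run_lengths(seq, frame=0):
    """Lengths, in codons, of the stop-free runs in one forward reading frame."""
    seq = seq.upper()
    runs, current = [], 0
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            runs.append(current)  # a stop codon closes the current run
            current = 0
        else:
            current += 1
    runs.append(current)          # run still open at the end of the read
    return runs


def longest_open_run(seq):
    """Longest stop-free run over the three forward frames (ASSUMED statistic)."""
    return max(max(open_run_lengths(seq, f)) for f in range(3))


def null_longest_runs(lengths, replicates=100, seed=0):
    """Longest runs in random sequences matched to the observed read lengths."""
    rng = random.Random(seed)
    results = []
    for _ in range(replicates):
        for n in lengths:
            rand_seq = "".join(rng.choice("ACGT") for _ in range(n))
            results.append(longest_open_run(rand_seq))
    return results
```

Comparing a histogram of `longest_open_run` over the unidentified ESTs with the values returned by `null_longest_runs` for the same read lengths gives a rough picture of the excess of long open frames over chance expectation.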
As discussed above, the more prevalent mRNAs are expected to be overrepresented in the category of ESTs recognized by sequence comparison with GenBank and, by the same token, they will be underrepresented in the unrecognized category. Thus to obtain a balanced image of sequence prevalence distribution in the early embryo poly(A) RNA from the EST database, all EST sequences excluding ribosomal, mitochondrial and repetitive sequences were compared against one another to detect multiple occurrences, and the results pooled with those shown in Table 2 for the identified ESTs. The summed prevalence distribution is shown in Fig. 2. As expected, no sequence occurred with the frequency of cyclins A and B (i.e., 10 and 9 occurrences, respectively) in the unrecognized EST set and, in fact, the maximum multiplicity was three occurrences of one of the unidentified sequences. Fig. 2 presents the prevalence distribution in both number and mass terms. Respectively, the number representations give the frequency of molecules occurring at each prevalence as percent of total, and the mass representations give the frequency of total mass represented by the sum of molecules at each prevalence. These distributions can be compared with earlier number (e.g., Flytzanis et al., 1982) and mass (e.g., Lasky et al., 1980) distributions for S. purpuratus embryo poly(A) RNA, as deduced from hybridization of large random sets of cDNA clones with labeled cDNA. The present results lead to remarkably similar conclusions. They confirm that the very large majority of embryo poly(A) RNAs are of the rare sequence class, represented here by molecular species appearing only once in the EST database (i.e., ∼80% of all ESTs included in Fig. 2). As noted above, because of the small size of this database we cannot estimate the actual frequency of these rare mRNAs, except that it is significantly <100 per average cell equivalent (i.e., at a 500-cell stage where there are about 10⁵ mRNA molecules/cell).
As the lower abscissa of Fig. 2 shows, for more prevalent mRNAs, the expected frequencies are on the order of several hundred mRNA molecules per average cell. All told, the prevalent transcripts, here represented as those occurring more than once, constitute about 12% of the total sum of poly(A) RNA sequences in the analysis.
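A prevalence distribution of this kind can be tabulated from the occurrence counts alone. The sketch below adopts one plausible reading of the two representations, which is an assumption: the number distribution is taken as the fraction of distinct sequences found at each multiplicity, and the mass distribution as the fraction of all ESTs contributed by sequences at that multiplicity. The function name and input format are hypothetical.

```python
from collections import Counter


def prevalence_distribution(occurrences):
    """Tabulate number and mass fractions for each EST multiplicity.

    occurrences: iterable giving, for every distinct sequence, how many
    times it was found in the EST set.
    Returns {multiplicity: (number_fraction, mass_fraction)}.
    """
    occurrences = list(occurrences)
    species_total = len(occurrences)   # distinct sequences
    est_total = sum(occurrences)       # all ESTs in the analysis
    by_multiplicity = Counter(occurrences)
    return {
        k: (n_species / species_total, k * n_species / est_total)
        for k, n_species in sorted(by_multiplicity.items())
    }
```

With a toy input of eight singletons, one sequence seen twice and one seen ten times, singletons account for 80% of the distinct sequences but only 40% of the ESTs, which illustrates why the two representations can look quite different.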
Tracking prevalence changes
The ESTs considered here represent an early stage of embryonic development and it is often important to examine change in representation of given transcripts as embryogenesis proceeds. Availability of a comparable set of arrayed library grids in a high-density filter format renders this an easy measurement. As one example, we take the Sox class transcription factor mRNAs uncovered in this project. Two Sox mRNAs were identified, namely those in plate 4 position B08 and plate 8 position L16 (Table 2). A probe was designed from the overlapping region of these clones and used to screen (1) several filters each containing 18,432 randomly selected clones from the 7-hour library that was the source of the ESTs, (2) equivalent filters from a 20-hour mesenchyme blastula stage library, and (3) filters from a 40-hour late gastrula stage library. As Table 4 shows, Sox mRNAs are probably present as modestly prevalent maternal mRNAs (3.7×10⁴/egg). A majority of these have disappeared by the mesenchyme blastula stage, but the prevalence again increases by late gastrula, so that the prevalence per average cell increases from about 70 to about 120 molecules, undoubtedly as a result of zygotic transcription. This pattern is fairly typical, as found earlier for a large set of unidentified messages (Flytzanis et al., 1982). Note that these values confirm the inference in the lower abscissa of Fig. 2, i.e., that linear extrapolation of prevalence in the EST data set can be used to provide a thumbnail estimate of prevalence for the whole embryo (or for the average cell), for sequences that occur more than once. This approach to prevalence determination, which does not depend at all on accurate measurement of the amounts of probe hybridized to given clones, is not only quick and easy but is also relatively robust.
That is, since the library is arrayed, and large amounts of plasmid DNA are present in each spot, every spot pair recognized by a given probe will hybridize similarly every time the array is screened, in contrast to λ plaque screening, in which under the usual conditions plaque size for a given recombinant varies greatly at each plating. In our experience, the amount of hybridization to each member of a spot pair almost always varies by less than a factor of 2.5, a level of variability that does not affect detectability. The proportion of spot pairs representing a given sequence is directly related to the prevalence of that sequence in the parental mRNA used to make the library, except for the possibility that it is under- or overtranscribed by the reverse transcriptase employed to generate the cloned cDNA. This occasional inaccuracy, however, equally affects methods based on quantitative measurement of the amount of a cDNA hybridized to a given clone (or a synthetic DNA sequence).
Though it is of relatively small size, analysis of this set of ESTs has yielded several kinds of useful information pertaining to the mid-cleavage-stage sea urchin embryo. Among the main results are the following.
(1) We have obtained an overall, quantitative classification of the various types of transcript represented in the arrayed 7-hour embryo cDNA library and in the embryo itself.
(2) A representation of the population structure of the embryo poly(A) RNA has emerged that strongly supports the earlier conclusions that most transcripts are rare in the egg, while a few species, most of which are already known, occur at a modestly higher prevalence. The major complexity of early embryo RNA, i.e., the greatest diversity of genes represented in its transcript populations, is in the rare sequence class. A practical implication is that since a relatively small fraction of the ESTs consist of prevalent mRNA species (or of mitochondrial, ribosomal and interspersed RNAs), it is not particularly advantageous to normalize libraries of this stage, or to go to great effort to remove or identify very prevalent transcripts prior to other analyses. Furthermore, this result emphasizes the importance of methods (including EST analysis) to which rare mRNAs are accessible: this is of course where most of the expressed genetic information is to be found.
(3) The prevalence distribution also provides a reasonable estimation of the actual frequency of occurrence in embryos of the more highly represented transcripts, those present at a few hundred molecules per average cell. Furthermore, it is easy to track developmental changes in representation using array library prints once the desired sequences have been identified.
(4) Sea urchin embryos express many protein-coding sequences that are too divergent from those at present in GenBank to provide identification; we found at least as many unidentified protein-coding sequences as identified ones. Eventually knowledge of protein-folding motifs will permit educated guesses as to function from most protein-coding sequences, and we may expect the ‘unidentified’ category to shrink.
(5) Finally, we have uncovered a number of recognized molecules previously not known to be expressed in cleavage-stage embryos. Among these are Notch and TGF-β family signaling components, a function for which is now implied in the mid-cleavage embryo.
A Strongylocentrotus purpuratus genome project has been initiated (funded by the Stowers Institute for Medical Research), from which there is now emerging a high-resolution BAC-end sequence map of the whole genome and a large collection of arrayed libraries representing various embryonic and larval stages and cell types. The EST analysis described in this paper illustrates the illuminating informational returns that can accrue when genomic approaches are applied to a developmental system that is relatively well characterized at the molecular level.
We thank Dr Hans Lehrach, Director, Max Planck Institute for Molecular Genetics, Berlin, for his help with the construction of the arrayed library used in this study. This work was supported by the Stowers Institute for Medical Research.