ABSTRACT
We developed an in vitro system to differentiate embryonic stem cells (ESCs) derived from reciprocally crossed F1 hybrid mice into neurons, and used it to investigate poly(A)+ and total RNA transcription at different stages of cell differentiation. By comparing expression profiles of transcripts assembled from 20 RNA sequencing datasets [2 alleles×(2 cell lines×4 time-points+2 mouse brains)], the relative influence of strain, cell and parent specificities to overall expression could be assessed. Divergent expression profiles of ESCs converged tightly at neural progenitor stage. Patterns of temporal variation of monoallelically expressed transcripts and antisense transcripts were quantified. Comparison of sense and antisense transcript pairs within the poly(A)+ sample, within the total RNA sample, and across poly(A)+ and total RNA samples revealed distinct rates of pairs showing anti-correlated expression variation. Unique patterns of sharing of poly(A)+ and poly(A)− transcription were identified in distinct RNA species. Regulation and functionality of monoallelic expression, antisense transcripts and poly(A)− transcription remain elusive. We demonstrated the effectiveness of our approach to capture these transcriptional activities, and provided new resources to elucidate the mammalian developmental transcriptome.
INTRODUCTION
The maternally and paternally derived copies of each gene could be generally expressed at comparable levels (biallelic expression) in diploid eukaryotic organisms. Expression of certain genes is, however, biased towards that of one of the parents (imprinted expression) (Barlow, 2011; Bartolomei and Ferguson-Smith, 2011; Lee and Bartolomei, 2013; Savova et al., 2013) or towards that from a specific strain (strain-specific expression) in mice created by crossing two distinct strains (F1 hybrids), in which one can measure the expression of genes in an allele-specific manner (Gimelbrant et al., 2007; Eckersley-Maslin and Spector, 2014). There are also non-random and random monoallelically expressed genes, which are neither imprinted nor strain specific (Eckersley-Maslin et al., 2014; Gendrel et al., 2014). Despite intense studies of these monoallelically expressed (not meaning an ‘on or off’ situation of expression, but rather an ‘allelic bias’ of expression in specific alleles) genes by using F1 hybrids, the regulation and functionality of them are poorly understood. Although the historical estimate of the imprinted gene number was between 100 and 200 (Lee and Bartolomei, 2013; Crowley et al., 2015), a range of the numbers, from 95 to 1308, have been reported by recent studies (Gregg et al., 2010a,b; DeVeale et al., 2012; Goncalves et al., 2012; Crowley et al., 2015). While the discrepancy is partly due to the tissue-specific imprinting expression of many imprinted genes (Lee and Bartolomei, 2013), different parental strains of the F1 hybrids, and coexistence of two subpopulations of cells [one expressing single and another expressing both alleles of genes within a tissue (Ginart et al., 2016)], it is also caused by concatenation of results reported from unrelated experiments that used different threshold parameterization and analysis methods. Therefore, investigation over a full developmental time-course from a single experiment may help to determine the number of imprinted genes (Eckersley-Maslin and Spector, 2014).
Although random monoallelic expression has been reported for members of the olfactory receptor gene family, protocadherins and immunoglobulins (Cedar and Bergman, 2008; Chen and Maniatis, 2013; Magklara and Lomvardas, 2013; Rodriguez, 2013), whole-genome studies have detected unexpectedly large numbers of strain-specific genes in F1 hybrids (Gimelbrant et al., 2005, 2007; Zwemer et al., 2012; Eckersley-Maslin et al., 2014; Gendrel et al., 2014). In an experiment, for example, of 13,699 expressed genes examined, 1666 genes in embryonic stem cells (ESCs) and 1960 genes in neural progenitor cells (NPCs) differentiated from the ESCs showed random monoallelic expression in six replicates, whereas 547 genes in ESCs and 276 genes in NPCs showed non-random monoallelic expression (consistent monoallelic expression in all the six replicates) (Eckersley-Maslin et al., 2014). In genes displaying random monoallelic expression, one of the two alleles is chosen to be silenced, and this choice is tightly inherited in differentiated cells (Gimelbrant et al., 2005; 2007; Zwemer et al., 2012; Savova et al., 2013). Accurate determination of whether the expression of monoallelic genes is specific to developmental stage or constitutive in a single experiment would improve our understanding of the regulation and function of these genes.
Previous studies based on F1 hybrids identified large-scale, dynamic and complex transcriptional activities that involve imprinting, sense–antisense and remarkably different poly(A)+ and poly(A)− expression profiles at the Ube3a locus and its downstream region extending over more than one megabase (Mb) on chromosome (chr)7 (Numata et al., 2011; Kohama et al., 2012; Meng et al., 2012). In mature neural cells, Ube3a gains tissue-specific and maternal-specific expression. Antisense transcripts to Ube3a (Ube3a-ATS) are paternally expressed and [poly(A)–]-predominant [that is, devoid or depleted from the poly(A)+ RNA fraction], and their expression increases with decreasing expression of maternally expressed Ube3a (Meng et al., 2012). Although the structure and the border of the transcriptional unit are yet to be determined, a 1-Mb transcript, LNCAT (large non-coding antisense transcript), which starts from the Snurf/Snrpn locus and spans the Ube3a locus in the antisense direction, has been anticipated (Landers et al., 2004; Le Meur et al., 2005). There are other clusters of imprinted genes, including non-coding RNAs with large polycistronic transcripts in the mouse genome (Lee and Bartolomei, 2013; Savova et al., 2013; Luo et al., 2016). Comparison of the arrangements of allele specificities of the imprinted genes, features of polycistronic transcripts and binding sites of molecular factors between the imprinted gene clusters will help elucidate the regulation and functionality of these loci. Inspired by the previous findings, we have developed an in vitro system to differentiate ESCs derived from F1 hydrids into neurons and investigated the dynamics of monoallelic expression, antisense transcription and bimorphic status [expression of both poly(A)+ and poly(A)− transcripts from a gene locus] comparing poly(A)+ and total RNA transcriptional activities on a whole-genome scale during a 17-day time-course of neural differentiation.
RESULTS
RNA libraries, sequencing and bioinformatics analysis
F1 hybrid ESCs and mice were obtained in reciprocal crosses of C57BL/6J (B6) and MSM/Ms (MSM) (Takada et al., 2013) strains (maternal B6 and paternal MSM, termed BM, and maternal MSM and paternal B6, termed MB). ESCs were subject to in vitro neuronal differentiation and sampled at four time-points, day 0 (D0, undifferentiated ESCs), day 4 (D4), day 8 (D8, NPCs) and day 17 (N9, neurons) (Numata et al., 2011) (see Materials and Methods). Adult brain (adult, day 70) was included as an in vivo reference to compare in this study; in figures ‘Adult’ was labeled as a time-point. Poly(A)+ and total RNA libraries were strand-specifically prepared from each of these samples and sequenced (RNA-seq) with ILLUMINA HISEQ sequencer to examine poly(A)+ and total RNA transcriptional activities of different alleles (B6 of BM, MSM of BM, B6 of MB and MSM of MB, termed B6-BM, MSM-BM, B6-MB and MSM-MB, respectively) (Fig. 1, summary of abbreviations, threshold parametrization, and terms and definitions given in Table S1). Total RNA, which corresponds more faithfully to the nascent transcriptome, was used to observe poly(A)− fraction within the whole RNA by analyzing in conjunction with poly(A)+ data. Hereinafter, we occasionally refer to poly(A)− RNA, based on the analysis of poly(A)+ and total RNA. Note that our experimental strategy lacks a cloning step or single cell resolution, hence all ‘monoallelic’ events that occur randomly during differentiation were not detected.
We aligned the RNA-seq sequences [2 cell lines×4 time-points+2 mouse brains=10 datasets for each poly(A)+ RNA and total RNA sample] to the mouse genome (mm10), separated them into B6 and MSM alleles based on SNPs (Takada et al., 2013) to obtain 20 sequence datasets for each poly(A)+ RNA and total RNA sample. Reference-guided assembly of the each 20 sequence datasets yielded totals of 42,386 and 62,399 non-overlapping expressed [fragments per 1 kb of transcript per one million mapped reads (FPKM) ≥1] transcripts (medians of 15,390 and 18,254 at a time-point) in poly(A)+ and total RNA samples, respectively (Fig. 2, Table S2). The detail of bioinformatics analyses are given in the Materials and Methods and Tables S2 and S3. We confirmed appropriate expression of the differentiation marker genes at each time-point (Fig. 3).
Divergence and convergence of transcript expression profiles between the alleles and over the time-course of neuronal differentiation
To evaluate the influence of cell, strain and parent specificity on the overall transcript expression profile and the dynamics of the transcript expression over the developmental time-course, we compared the expression profiles between two alleles in F1 hybrids from the same reciprocal crosses (intra-allelic pairs: B6-BM versus MSM-BM, B6-MB versus MSM-MB), between alleles derived from the same inbred strains in different reciprocal crosses (strain-specific pairs: B6-MB versus B6-BM, MSM-MB versus MSM-BM), and between alleles derived from the same parent in different reciprocal crosses (parent-specific pairs: MSM-MB versus B6-BM, B6-MB versus MSM-BM) at each time-point (Fig. 4A–D; Table S4). The overall expression profiles differed notably between the alleles, and the level of divergence (or convergence) changed dynamically over the time-course.
At each time-point, the highest and lowest convergence of expression profiles occurred in the strain-specific and parent-specific comparison, respectively (Fig. 4A–D; Table S4), indicating that the strain of the chromosome origin is the strongest determinant of the gene expression profile in F1 hybrids. Since there was no considerable difference in the convergence in intra-allelic and parent-specific comparisons, the influence of cell or parent specificity on the expression profile appeared to be marginal compared to that of strain, at least in the overall expression profile.
The expression profiles were the most divergent at D0, converged through D4 to D8, and showed the tightest convergence at D8 (when cells were committed to differentiation), likely reflecting a stringently controlled gene expression profile and/or a highly homogenous NPC population. The convergence somewhat relaxed at N9.
We confirmed the convergence and divergence of expression profiles by determining the TPM (transcripts per million) (Wagner et al., 2012) for all the 40 datasets. The correlation between TPM and FPKM, and convergence and divergence reproduced with TPM are shown in supplementary material Fig. S1.
The large-scale, dynamic and complex transcriptional activities of Ube3a and its downstream region
Consistent with the previous reports (Numata et al., 2011; Kohama et al., 2012; Meng et al., 2012), Ube3a became predominantly expressed from the maternal allele at N9 in both the poly(A)+ and total RNA samples (Fig. 5A–C). Based on RNA-seq datasets, we detected four Ube3a-ATS transcripts in the poly(A)+ (p-ATS-1, p-ATS-2, p-ATS-3 and p-ATS-4) and two in the total RNA (t-ATS-1 and t-ATS-2) samples (Fig. 5A–C); all gained paternal-specific expression at N9, and five [three in poly(A)+, and two in total RNA] were upregulated from N9 to Adult, while protein-coding Ube3a was downregulated from N9 to Adult, consistent with direct repression of Ube3a with activation of Ube3a-ATS (Huang et al., 2012; Meng et al., 2012). Whereas the signals of the sense, protein-coding Ube3a transcripts were almost identical between poly(A)+ and total RNA samples, those of Ube3a-ATSs were considerably different (Fig. 5B,C). Although the Ube3a-ATS signals were stronger in exons than in introns in poly(A)+ samples, they spanned the entire Ube3a gene locus, that is exons and introns, in total RNA samples. These results indicate the importance of analyzing not only poly(A)+ RNA but also total RNA to characterize the whole transcriptome, especially for long non-coding RNAs. Paternally expressed antisense transcripts appeared at D8 in the region downstream of Ube3a and increased at N9 and Adult; these transcripts covered a nearly 3 Mb region (59–62 Mb on chr7) (Fig. 5D,E). These transcripts were mostly represented in the total RNA fraction (total RNA-represented), although a substantial proportion was transcribed with poly(A) tails. Although the number and structures of these total RNA-represented transcripts are yet to be determined, transcript reconstruction through Cufflinks transcript assembly (Trapnell et al., 2010) yielded clearly spliced transcripts in a substantial unannotated part of the 3 Mb region at N9 and Adult (Fig. S2).
Monoallelically expressed transcripts in poly(A)+ and total RNA samples over the time-course of neuronal differentiation
We identified imprinted and strain-specific autosomal transcripts at the four time-points and in adult brain for the poly(A)+ and total RNA samples (threshold parameterization in Table S1B) and focused on dynamic gain and loss of monoallelic expression. The expression biases of monoallelically expressed transcripts and time-course variation are graphically illustrated (Gregg et al., 2010a) in Fig. 6A,B. The time-course variation of expression levels of several constitutive or developmental-stage-specific monoallelic transcripts are shown in Fig. 6C,D.
We identified 356 imprinted transcripts in poly(A)+ samples and 631 in total RNA samples (Table S5). The numbers of imprinted transcripts changed dynamically [43–239 in poly(A)+ samples and 12–582 in total RNA samples] over the time-course (Fig. 6E). Most of the imprinted transcripts, in particular in total RNA samples, were unannotated and paternally expressed in the downstream region of Ube3a. When these transcripts were excluded, the numbers of detected imprinted transcripts were 225 in poly(A)+ RNA samples and 158 in total RNA samples. In agreement with previous studies (Gimelbrant et al., 2005, 2007; Zwemer et al., 2012; Eckersley-Maslin et al., 2014; Gendrel et al., 2014), we found large numbers of strain-specific transcripts (Table S5): 18,528 in poly(A)+ samples and 6764 in total RNA samples. The numbers of strain-specific transcripts also changed dynamically [6336–10,264 in poly(A)+ samples and 1079–3321 in total RNA samples] during the time-course.
We determined whether monoallelic expression was developmental-stage-specific or constitutive by counting the time-points at which the transcripts showed monoallelic expression (Fig. 6F; Table S5). Most imprinted transcripts [265 (74%) in poly(A)+ and 515 (82%) in total RNA samples] were imprinted only at a single time-point, and 12 (3%) poly(A)+ and 6 (1%) total RNA transcripts were imprinted at all time-points. Thus, as previously known, imprinted transcripts were highly developmental-stage-specific, although small, but non-negligible, numbers of transcripts showed constitutive monoallelic expression (Lee and Bartolomei, 2013). Though to a lesser extent, a similar trend was observed for strain-specific transcripts; 10,006 (54%) poly(A)+ and 4892 (72%) total RNA transcripts were strain-specific at a single time-point, and 2032 (11%) poly(A)+ and 406 (6%) total RNA transcripts were strain-specific at all time-points.
We detected switching between maternal and paternal alleles for the imprinted genes Grb10 (Charalambous et al., 2003; Garfield et al., 2011; Cowley et al., 2014; Wilkins, 2014; Ubeda and Gardner, 2015; Wolf et al., 2015) and Igf2 (Gregg et al., 2010a), which are known to switch the alleles in different tissues (Fig. 6C,D; Table S5). We also found that 1137 (6.1%) poly(A)+ and 87 (0.2%) total RNA strain-specific transcripts switched the allele between B6 and MSM.
Since it is of interest which of activation (or upregulation) of two active alleles or silencing (or downregulation) of two inactive alleles caused monoallelic expression, we investigated the change of expression levels of the four alleles between two adjacent time-points during neural differentiation in the transcripts which shifted from non-monoallelic to monoallelic expression state. We categorized the patterns of the change of expression levels as follows: (1) all up, all the four alleles increased expression, but to different degrees, that is, active alleles upregulated more than inactive alleles (Fig. S3A); (2) up and same, expression increased in both active alleles and remained the same, mostly at zero in both inactive alleles (Fig. S3B); (3) up and down, expression increased in both active alleles and decreased in both inactive alleles (Fig. S3C); (4) all down, all the four alleles decreased expression, but to different degrees, that is active alleles downregulated less than inactive alleles (Fig. S3D); and (5) others, patterns other than the above types 1–4 (Fig. S3E).
Although one might expect that monoallelic expression states of transcripts are created mostly through upregulation of active alleles and downregulation of inactive alleles (‘up and down’ type), the proportion that were of the ‘up and down’ type was the smallest (Table S6). The majority of the patterns were of the ‘all up’ or ‘up and same’ types. A significant proportion was also of the ‘all down’ type, but they were much less abundant than additions of the ‘all up’ and ‘up and same’ types. Thus, upregulation of the two active alleles rather than downregulation of the two inactive alleles contributed more to the transition to monoallelic expression states in both imprinted and strain-specific transcripts during neural differentiation. Note that this categorization of the five types is an operational one only to capture the core trends of the expression changes leading to monoallelic expression states. Since we distinguished monoallelic expression by a threshold-based method evaluating the degree of bias of expression levels of the four alleles, the monoallelic and non-monoallelic expression states are not discrete. Therefore a small change of expression in an allele from a time-point to another could shift nearly monoallelic but non-monoallelic transcripts to the monoallelic expression state. During the categorization of the five types, we assigned categories even if there was a small change of expression levels. Certain portions of the ‘others’ type may be attributed to these limitations.
Pairs of expressed sense and antisense transcripts within and across poly(A)+ and total RNA datasets
We compared developmental stage specificity, RNA species composition, and positively and negatively correlated variation in expression over the time-course between three distinct combinations of sense and antisense transcript pairs: (1) within the poly(A)+ datasets [poly(A)+-antisense versus poly(A)+-sense, termed P/P pairs], (2) within the total RNA datasets (total RNA-antisense versus total RNA-sense, termed T/T pairs), and (3) between the total RNA and poly(A)+ datasets (total RNA-antisense versus poly(A)+-sense, termed T/P pairs). Definition of sense and antisense transcripts is given in Table S1C and Fig. S4. We considered only pairs of sense and antisense transcripts for which exon(s) of at least one transcript overlapped exon(s) of annotated transcript(s) (Ensembl version 71) on the same strand. As sense transcripts, we considered transcripts whose exon(s) overlapped exon(s) of annotated transcript(s) on the same strand. Thus, unannotated novel transcripts whose exon(s) did not overlap exon of any annotated transcript on the same strand (described in Table S1C and Fig. S4) were considered only as antisense transcripts, whereas transcripts whose exon(s) overlapped exon(s) of annotated transcript(s) on the same strand were considered to be as both sense and antisense transcripts. We found 8928 expressed antisense transcripts in P/P pairs, 7609 in T/T pairs and 8446 in T/P pairs (Table S7). Their numbers varied over the time-course: 5263–6460 in P/P pairs, 4860–5846 in T/T pairs, and 5216–6311 in T/P pairs (Fig. 7A; Table S7). Most of the antisense transcripts identified were unannotated; only 157 (1.8%) of P/P, 149 (2.0%) of T/T, and 173 (2.0%) of T/P pairs overlapped annotated antisense transcripts (Ensembl version 71) (Cunningham et al., 2015). Overlapping and non-overlapping subsets of the antisense transcripts between the three pair sets are listed in Table S7D. Approximately 40% of the antisense transcripts [41.6% (P/P), 42.9% (T/T), and 40.3% (P/T)] were expressed at all the time-points, whereas more than 20% of them [26.2% (P/P), 20.3% (T/T), and 21.7% (T/P)] were expressed only at one time-point (Fig. 7B; Table S8).
We examined the RNA species composition of the expressed antisense transcripts and expressed sense transcripts [8873 (P/P), 6926 (T/T) and 7198 (T/P)] (Fig. 7C; Table S9). The overall demarcations of the sense and antisense transcripts according to the RNA species were similar. Among the antisense transcripts, those corresponding to protein-coding genes were the most abundant in all the three sets (50.2%–54.1%); 11%–12% corresponded to non-coding genes and about 40% were novel transcripts. The highest proportion of novel transcripts (39.9%) was found in the T/P set (37.3% and 35.3% in P/P and T/T sets, respectively). Among the sense transcripts, the proportion of transcripts corresponding to protein-coding genes was markedly higher in the P/P set (67.4%) than in T/T (62.4%) or T/P set (63.8%).
We were interested in whether the proportions of sense and antisense transcript pairs whose variation in expression over the time-course was positively or negatively correlated differed between the three sets. Although pairs with positively correlated variation in expression dominated in all the three sets, a significant proportion of sense and antisense pairs showed highly negatively correlated variation in expression (Fig. 7D; notable examples are shown in Fig. S5). Interestingly, the T/P set had a significantly higher ratio of pairs showing negative correlation than the two other sets (χ-squared test P<2×10−4 versus P/P set and <2.7×10−8 versus T/T set), whereas P/P set had a significantly higher ratio of pairs showing positive correlation than the two other sets (χ-squared test P<2.2×10−16). The different RNA species enrichment and different ratios of positively and negatively correlated expression in the three sets, may reflect certain differences between the regulatory roles of distinct RNA species when they are transcribed as sense and antisense pairs. We also investigated dependence of the correlation of expression variation of sense and antisense transcript pairs on location of the overlapped region in transcripts (gene promoter, gene body and 3′ end regions) and the size of exon and intron overlap. The portion of negatively correlated expression variation decreased when the overlap was the in promoter and increased when it was in the gene body and 3′ end (Fig. S6A). In overall, the portion of negatively correlated expression variation decreased with exon overlap and increased with intron overlap although this trend varied with distinct ranges of overlap size (Fig. S6B,C).
We found some imprinted antisense transcripts in all the three sets of sense and antisense transcript pairs, some of which were found on the antisense strand of imprinted sense genes of distinct parental origins, as in the Ube3a locus. In addition, 529–1135 (P/P), 49–89 (T/T), and 153–308 (T/P) pairs of sense and antisense transcripts showed distinct or same mouse-strain specificities (Table S10).
Poly(A)+-predominant, [poly(A)–]-predominant and bimorphic transcripts in various RNA species
We were interested in the proportion of poly(A)+ transcripts among the total RNA transcripts and how this differed between different RNA species. A total of 38,404 poly(A)+ transcripts overlapped on the same strand with 39,754 total RNA transcripts (43,281 combinations), whereas 12,609 and 23,071 transcripts were expressed exclusively in the poly(A)+ and total RNA datasets, respectively.
Expression levels of the transcripts of the four alleles were compared between the poly(A)+ and total RNA datasets for each of 13 RNA species (Fig. 8A). Clear dominance of total RNA expression over poly(A)+ expression in histone RNAs (Cui et al., 2010) was confirmed and was used as a positive control for [poly(A)–]-predominant expression (Fig. 8A). The 13 RNA species showed unique proportions of poly(A)+-predominant, bimorphic and total RNA-represented expression. We used the term ‘total RNA-represented’, as we used total RNA samples. However, total RNA-represented implies [poly(A)–]-predominant.
We considered the transcript pairs with the ratio of expression level in total RNA sample to that in poly(A)+ sample [total RNA/poly(A)+] <0.05 and >1.0 as poly(A)+-predominant and total RNA-represented, respectively, and compared their proportions between the 13 RNA species (Fig. 8B). As expected, transcripts from loci containing non-coding genes showed a significantly higher proportion of total RNA-represented transcripts than those from protein-coding gene loci (χ-squared P<10−8), except for processed pseudogenes and processed transcripts. Notably, among the non-coding gene categories, the highest proportion of total RNA-represented transcripts was found for novel transcripts. There was a clear peak at ∼0.3 in the value for total RNA/poly(A)+ for protein-coding genes and several other RNA species (Fig. 8B). Although this peak at 0.3 implies existence of a certain average ratio of poly(A)− to poly(A)+ transcription, whether the ratio reflects intrinsic amounts of transcripts having a poly(A)+ tail or not is to be determined. Although this may partially reflect the complexity of transcripts, such as length and the number of introns, which influences the processing time until a potential poly(A) tail is added, influences on the numbers of expressed transcripts, distribution of expression levels and also experimental limitations, including effects of amplification during RNA-seq and PCR products in combination with distinct transcript populations between poly(A)+ and total RNA samples, cannot be ruled out. These limitations will need to be addressed in a more-elaborate experimental setup, such as qPCR designing distinct primers for poly(A)+ and poly(A)− transcripts, in the future.
Transcripts from pseudogene loci showed the highest correlation (Pearson R=0.92) between poly(A)+ and total RNA expression. This may be due to their protein-coding gene origin and may reflect a rather simple regulatory relationship between poly(A)+ and total RNA expression. Long intergenic non-coding RNA (linc-RNA) transcripts showed lower expression than those of other gene categories for both the poly(A)+ and total RNA samples [Mann–Whitney P<10−8 in poly(A)+ and P<10−7 in total RNA compared with protein-coding genes].
Total RNA-represented expression is more dynamic than poly(A)+-predominant expression
We detected genomic regions with total RNA-represented expression resembling that in the Ube3a locus and also regions with poly(A)+-predominant expression by using HOMER software (Heinz et al., 2010) and examined their features over the time-course. Total RNA-represented expression changed dynamically over the time-course both in the number of transcribed regions and developmental specificity (Fig. 8C). The number of regions with total RNA-represented expression remained nearly unchanged from D0 to N9, but was ∼5-fold higher at Adult. In contrast, the number of regions with poly(A)+-predominant expression remained relatively constant over the time-course. Notably, this surge of total RNA-represented regions for the Adult occurred in intronic and intergenic regions rather than in exons (the increase between N9 and Adult was 8.2–13.5-fold in introns, 4.4–8.1-fold in intergenic regions and only 2.1–2.8 fold in exons), suggesting greater roles played by unannotated poly(A)− transcripts in mature cells.
We found 28 total RNA-represented regions showing imprinted expression (7 maternally and 21 paternally imprinted) with 5-kb or larger extensions, including the known clusters of imprinted genes Kcnq1 (Mancini-Dinardo et al., 2006; Lee and Bartolomei, 2013) and Meg3-Rian-Mirg (or Dlk1-Dio3) (Luo et al., 2016) loci with polycistronic transcripts (examples are shown in Fig. S7).
DISCUSSION
We compared the transcripts of the four distinct alleles in F1 hybrids at distinct developmental time-points and between the poly(A)+ and total RNA libraries. By comparing the expression profiles between alleles, we found that strain is the strongest determinant of the expression profile, whereas the impacts of intra-allelic and parent specificities are both much weaker. The overall expression profiles changed dynamically during neuronal development, with the peak of convergence at D8. In this study, we have shown quantitatively the relative influence of the alleles and dynamic developmental constraints on the overall transcript expression profile.
In all the three categories of transcripts (monoallelically expressed, antisense and total RNA-represented transcripts) we investigated, a considerable fraction of transcripts showed highly dynamic expression over the time-course, and a clear demarcation existed between groups with developmentally specific and constitutive expression. Although a great majority of imprinted genes showed developmentally specific expression, a small, but non-negligible, proportion of them were constitutively expressed. This clear demarcation suggests that there are two categories of imprinted genes and may suggest certain differences in their biological significance (Pauler et al., 2012; Lee and Bartolomei, 2013). Strain-specific transcripts showed a similar trend, although to a lesser extent. In both monoallelic transcripts (imprinted or strain-specific), allele switching was very rare. The rarity of allele switching was consistent with a previous study (Eckersley-Maslin et al., 2014), which observed coexistence of biased expression of either allele among six biological replicates in only one of 1666 (0.1%; ESCs) and in 86 of 1960 (4.4%; NPCs) random monoallelically expressed genes. A large fraction (∼40%) of antisense transcripts showed constitutive expression, whereas ∼20% were developmental stage specific. Total RNA-represented transcripts showed more dynamic expression variation than poly(A)+-predominant transcripts, and poly(A)− transcripts surged in introns and intergenic regions at Adult. The timing (developmental stage) of expression and its pattern (developmental-stage-specific or constitutive) provide vital information for identifying the function of specific genes and general functions of the three categories of transcripts. We have set a platform for accurate measurement of the developmental transcriptomics features by using F1 hybrid ESCs and mice.
Functional generalization of antisense transcripts remains elusive, although intense investigation has been carried out since the report of pervasive antisense transcription in the mouse genome and likeliness of their functionality (Kiyosawa et al., 2003, 2005). Anti-correlation of expression between sense and antisense transcripts has been observed in individual gene loci, in some of which disruption of the expression of the antisense or sense transcript affected specific phenotypes (Janowski et al., 2007; Morris et al., 2008; Yu et al., 2008; Watts et al., 2010; Modarresi et al., 2012; Piatek et al., 2017). However, a high-throughput experiment revealed mostly positive correlation between sense and antisense transcript expression (Cawley et al., 2004). In addition, previous studies have reached a consensus that antisense transcripts form several heterogeneous groups with different functions (Faghihi and Wahlestedt, 2009; Piatek et al., 2017). The clear demarcation of developmental-stage-specific and constitutive fractions of antisense transcripts, the biases of positively and negatively correlated variation in expression of sense and antisense transcripts between different combinations of poly(A)+ and total RNA transcripts, and the location and size of the transcript overlap, support this consensus. At present no categorization can segregate the sense and antisense transcript pairs of negative and positive expression variation completely. Additional structural (of the 5′ and 3′ end) (Werner et al., 2009) and functional analyses in conjunction with the proposed mechanisms involving both the 5′ and 3′ end of sense transcripts (summarized in Piatek et al., 2017) is expected to improve the demarcation of the positive and negative regulatory relationship.
Previous studies have reported that 13.1%–23.3% of mammalian transcripts are bimorphic (Yang et al., 2011), and 60%–80% are either poly(A)− or bimorphic (Cheng et al., 2005; Wu et al., 2008). We computed the fractions of poly(A)+-predominant, total RNA-represented and bimorphic transcripts for distinct RNA species and identified certain rules and trends. First, except for histone RNAs, all RNA species had both poly(A)+ and total RNA-represented transcripts, and the relative proportions of poly(A)+-predominant, total RNA-represented and bimorphic transcripts showed unique patterns for distinct RNA species. Second, six RNA species (protein-coding, pseudogene, linc-RNA, antisense, processed pseudogene and processed transcript) had considerable bimorphic fractions, whereas the remaining five non-coding RNA species were [poly(A)−]-predominant. Third, in all RNA species with considerable bimorphic transcription, the ratio of the expression level of total RNA to that of poly(A)+ RNA peaked at ∼0.3. The landscape of poly(A)+, poly(A)− and bimorphic transcription in mammalian genomes is more complex than previously thought (Mattick and Makunin, 2005, 2006). It was previously found that a considerable fraction (24–25%) of transcripts have short or no poly(A) tails (<20–30 nucleotides) (Gu et al., 1999; Meijer et al., 2007), some of which might be poly(A)+ RNAs processed to reduce or totally remove the classical poly(A) tail (Katinakis et al., 1980). Another study (Yang et al., 2011) identified novel stable poly(A)− intron-derived RNAs and reported a considerable variation in the relative abundance of poly(A)+ and poly(A)− transcription among genes for linc-RNAs and protein-coding genes. The non-coding RNAs (or distinct portions of the polycistronic LNCAT) expressed from the Ube3a locus showed distinct poly-adenylation rates and half lives, thus they were differentially and separately processed (Meng et al., 2012). Further characterization of non-coding RNAs in these features may reveal a clue as to the roles of non-coding RNAs in imprinting of gene clusters with polycistronic transcripts (Lee and Bartolomei, 2013). It is of great interest to investigate how the three modes of transcription are regulated in distinct RNA species in relation to their functionality, particularly in imprinted gene clusters.
We have described the dynamic gain and loss of monoallelic expression from the alleles of F1 hybrid ESCs at each time-point during neural differentiation. To the best of our knowledge, this is the first study that investigated the time-course variation of monoallelically expressed transcripts over such a long period as 17 days in a single experiment. We detected extensive paternal expression in Ube3a and its downstream region, particularly in total RNA samples, but the numbers of imprinted transcripts [225 poly(A)+ and 158 total RNA transcripts] in all other regions were consistent with the historical estimates.
The function of genomic imprinting remains elusive, although theories have been proposed to explain this evolutionary puzzle (Spencer and Clark, 2014), involving ‘kinship theory’ (Haig, 1997; Wolf and Hager, 2006). Recently a new theory has been proposed based on co-regulation of imprinted genes. Although imprinted genes appeared to be functionally unrelated, recent studies reported that imprinted genes are often co-expressed and potentially make up an imprinted gene network (IGN) which regulates (common) biological processes including extracellular matrix regulation (Al Adhami et al., 2015; Varrault et al., 2017). They also found that a member of the IGN, paternally imprinted Plagl1 (a zinc finger transcription factor) potentially regulates 22% of 409 genes of the IGN. Plagl1 showed highly dynamic, developmental-specific paternal expression in our time-course datasets (Fig. S8). In addition to Plagl1, we detected four other imprinted transcription factors (TFs), Peg3, Ndn, Zim1 and Tial1 (Crowley et al., 2015) in both poly(A)+ and total RNA samples (Fig. S8). Similar to Plagl1, Peg3, Ndn and Zim1 are also highly connected with other imprinted genes in the IGN, thus likely regulate substantial portions of genes in the IGN. The five imprinted TFs showed highly dynamic and distinct expression profiles (e.g. time span of imprinting) in both our time-course datasets (Fig. S8). A network analysis of all these highly connected imprinted TFs and their target genes based on the expression profiles in both poly(A)+ and total RNA samples will refine the dynamic nature of the proposed IGN.
We detected 28 total RNA-represented regions with large extensions (≥5 kb) and imprinted expression patterns similar to those of the Ube3a locus, some of which overlapped the known clusters of imprinted genes with large polycistronic transcripts (Pauler et al., 2012; Luo et al., 2016). Since non-coding RNAs (products or transcription alone) are an essential element in the regulation of clusters of imprinted genes (Pauler et al., 2012; Lee and Bartolomei, 2013), study of similarities and differences of total RNA expression features between these loci may bring a clue to the regulation and function of the polycistronic transcripts within the imprinted gene clusters (Luo et al., 2016). We have provided a set of 40 expression profiles combining the four alleles, four time-points and adult brain, and two RNA libraries. The arrangements of cis-elements (differentially DNA-methylated regions, enhancers, promoters and insulators) and timing of binding of molecular factors (TFs, DNA and histone methyltransferases, the transcriptional insulator CTCF, the recently found elongation factor AFF3 and yet-to-be-identified factors) to the cis-elements (Hark et al., 2000; Dindot et al., 2009; DeVeale et al., 2012; Luo et al., 2016) responsible for the dynamics and distinctions of the expression profiles need to be identified in future ChIP-seq experiments.
We demonstrated that our approach is remarkably effective in capturing novel features of the developmental transcriptome. We have provided rich resources for mammalian transcriptomics and developmental biology. Similar approaches using our de facto standard for studies on mammalian developmental transcriptome based on the use of F1 hybrids will expand the research on these and related issues of transcriptional regulation in mammalian genomes.
MATERIALS AND METHODS
Animals and mouse ESC lines
F1 hybrid mice were bred by crossing B6 mothers and MSM fathers to give BM mice, and reciprocally crossed to give MB mice. A MB-ESC line, MB4, was obtained as previously described (Kohama et al., 2012) and designated as MB-ESC. An additional reciprocally cross-derived ESC line, BM23, was obtained using the same protocol and designated as BM-ESC. F1 hybrid ESCs used for RNA-seq were all male. The brains used as references were excised from 10-week-old adult male mice. Mouse experiments were performed in accordance with the institutional guidelines of Kochi University and National Institute of Genetics.
Neuronal differentiation of mouse ESCs
Highly efficient neuronal differentiation was performed as previously described (Kohama et al., 2012) with minor modifications as follows. Undifferentiated mouse ESCs were instead maintained in a KnockOut™ Serum Replacement (KSR; Life Technologies)-based medium supplemented with a MEK-inhibitor (PD0325901; 1 μM) and recombinant mouse leukemia inhibitory factor (|ESGRO; 1000 units/ml). GSK3β-inhibitor, usually added in the 2i-format, was again omitted here to raise the neuronal differentiation efficiencies (data not shown). The ensuing neuronal differentiation in the chemically defined medium (CDM) was performed exactly as described previously (Kohama et al., 2012) up to the D8 stage. At the neuronal maturation stage (N0–N9), we instead have supplemented CDM with DAPT (1 μM), CHIR99021 (3 μM) and forskolin (10 μM) to enhance neuronal maturation and coated the culture dishes with poly-L-ornithine, laminin and fibronectin for enhanced attachment. A step-by-step protocol can be obtained upon request.
Library preparation
Total RNA was extracted from cultured cells and mouse brains using TRI Reagent (Molecular Research Center, Inc.) following the manufacturer's protocol. Poly(A)+ and total RNA libraries were prepared using TruSeq Stranded kits (Illumina) following the manufacturer's protocol. Note that both poly(A)+ and total RNA libraries were derived from identical samples.
Mapping of ILLUMINA sequences
The ILLUMINA sequences (paired end, 100 or 101 bp length and 150 bp insert, 2 cell lines or mice×(4 time-points+adult brain)=10 datasets for each poly(A)+ and total RNA sample] were aligned to the mouse genome (mm10) by tophat (Trapnell et al., 2009) allowing indels that were six bases or shorter, and fewer than two and five and mismatches in the first 32 bases and in the whole read-length, respectively.
SNPs on the mouse genome
Aligning the genomic sequences of MSM (Takada et al., 2013) to the B6 genome (mm10), we identified the single-nucleotide polymorphisms (SNPs) on the genomic sequences of mm10. We cut the MSM genomic sequences into 10-kb fragments and the 10-kb fragments were mapped to mm10 genome by using BLAT (Kent, 2002). After selecting only unique maps of the 10-kb fragments on mm10, we determined the positions and types of SNPs on the coordinates of mm10. A total of 21628229 SNPs (20714172 autosomal and 914057 chrX SNPs) were identified.
Separation and dereplication of the ILLUMINA sequences mapped to the two alleles
Based on the SNPs between the B6 and MSM genomes (Takada et al., 2013), we identified those that originated from the B6 and MSM alleles from the mapped read-pairs and separated them into the two distinct alleles. The mapped ILLUMINA sequences were dereplicated based on the mapped positions. That is, of the read-pairs mapped to the same 5′ and 3′ positions, only one read-pair was taken in each dataset. Since one of our target regions, the Ube3a locus and its downstream region were subject to genomic duplications, we took sequences that were equally mapped to fewer than 100 sites on the genome in our downstream analyses. The read-pairs that mapped equally well to two or more sites contributed fractionally (by the inverse of the number of mapped sites) to the mapped read-pairs of transcripts and thus to their expression levels.
We could assign medians of 19.0 and 18.0 million (M) read-pairs to B6 and MSM alleles, respectively, at each of the four time-points and adult brain of poly(A)+ datasets, whereas medians of 3.6 and 4.1 M read-pairs were assigned to B6 and MSM alleles, respectively, in total RNA datasets. The number of read-pairs assigned to the two alleles at each time-point for the poly(A)+ and total RNA datasets are given in Table S2. At this analysis step, we obtained mapped datasets for the distinct alleles at each of the four time-points and adult brain. Thus we had a total of 20 [2 strains×2 alleles×(4 time-points+adult brain)] mapped datasets for each poly(A)+ and total RNA sample, a total of 40 mapped datasets for both samples. The downstream analyses were carried out independently for each of the 40 mapped datasets until merging the results in each of poly(A)+ and total RNA samples, and normalizing the expression levels (FPKM) for the 20 datasets of each poly(A)+ and total RNA sample.
Reference-based assembly of transcripts based on the sequences mapped to the genome
By using Cufflinks (Trapnell et al., 2010; http://cole-trapnell-lab.github.io/cufflinks), we assembled the expressed transcripts based on the coordinates mapped by the ILLUMINA sequences on the genome using the exon coordinates of Ensembl database (version 71) (Cunningham et al., 2015) as a reference gene set for each mapped dataset. This assembly was performed for each of the 40 mapped datasets. Thus we obtained 40 sets of transcript assemblies (gtf files) after this step.
Merging of the transcript assemblies and computation of the number of mapped read-pairs and normalized expression level of each transcript
By using cuffmerge (http://cole-trapnell-lab.github.io/cufflinks), we merged the 20 assemblies and obtained a single merged assembly for each poly(A)+ and total RNA sample. By using cuffdiff (http://cole-trapnell-lab.github.io/cufflinks) [comparing mapping results of the 20 sequence datasets with the single merged set of assembled transcripts in each poly(A)+ and total RNA sample], we computed read-pair counts and expression levels (FPKM) normalized over the 20 mapped datasets for the assembled transcript in each poly(A)+ and total RNA sample. We tested classical (no scaling) and geometrical (FPKMs are scaled via the median of the geometric means of mapped read-pair counts) (Anders and Huber, 2010) normalization and found the later superior to maintain the profile of expression levels between the B6- and MSM-alleles between each paired mapped datasets. Thus, we used the geometrically normalized expression levels for the comparison of expression levels between poly(A)+ and total RNA datasets over the time-course.
At this step, we detected totals of 42,386 and 62,399 non-overlapping Cufflinks-defined genes that showed an expression level with a FPKM≥1.0 in at least one of the 20 datasets (one allele at a time-point) for the poly(A)+ and total RNA samples, respectively. 15,327 (37.0%) and 15,674 (25.1%) of the Cufflinks-defined genes overlapped known genes annotated in the Ensembl database in poly(A)+ and total RNA samples, respectively [see Table S3 for a summary including numbers at a lower (≥0.5) FPKM threshold]. The Cufflinks-assembled genes mapped 16,684 and 16,769 Ensembl genes on the same strand in poly(A)+ and total RNA samples. 16,555 (99.2%) and 16,690 (99.5%) of the Ensembl genes were mapped by a single Cufflinks-defined gene in poly(A)+ and total RNA samples, respectively, whereas in the remaining portions, 129 (0.8%) and 80 (0.5%) were mapped by two to four and two to five Cufflinks-defined genes in poly(A)+ and total RNA samples, respectively. As transcriptional units, we used the Cufflinks-defined genes in downstream analyses. We operationally used the term ‘transcript’ as a Cufflinks-defined gene since there was no reference for gene locus determination in unannotated regions of the genome.
All the figures were plotted by using R (R Core Team, 2014). All the computational work was performed on the DDBJ supercomputer system (Ogasawara et al., 2013).
Footnotes
Author contributions
Conceptualization: H. Kiyosawa; Methodology: H. Kato, H. Kiyosawa; Software: S.K.; Validation: S.K., H. Kiyosawa; Formal analysis: S.K., H. Kato, H. Kiyosawa; Investigation: S.K., H. Kato, Y.S., T.T., M.E., T.S., N.S., S.S., H. Kiyosawa; Resources: H. Kato, Y.S., T.T.; Data curation: S.K., T.S., N.S.; Writing - original draft: S.K., H. Kato, M.E., H. Kiyosawa; Writing - review & editing: S.K., H. Kato, M.E., S.S., H. Kiyosawa; Visualization: S.K.; Supervision: H. Kiyosawa; Project administration: H. Kiyosawa; Funding acquisition: H. Kiyosawa.
Funding
This work was supported in part by Japan Society for the Promotion of Science (JSPS) KAKENHI Grant-in-Aid for Scientific Research (B) (No. 26290063), Ministry of Education, Culture, Sports, Science and Technology Japan (MEXT) KAKENHI Grant-in-Aid for Scientific Research on Innovative Areas (No. 24115703) and Ministry of Education, Culture, Sports, Science and Technology-Japan (MEXT) KAKENHI (No. 221S0002) to H. Kiyosawa.
References
Competing interests
The authors declare no competing or financial interests.