During embryogenesis the heart forms as a linear tube that then undergoes multiple simultaneous morphogenetic events to obtain its mature shape. To understand the gene regulatory networks (GRNs) driving this phase of heart development, during which many congenital heart disease malformations likely arise, we conducted an RNA-seq timecourse in zebrafish from 30 hpf to 72 hpf and identified 5861 genes with altered expression. We clustered the genes by temporal expression pattern, identified transcription factor binding motifs enriched in each cluster, and generated a model GRN for the major gene batteries in heart morphogenesis. This approach predicted hundreds of regulatory interactions and found batteries enriched in specific cell and tissue types, indicating that the approach can be used to narrow the search for novel genetic markers and regulatory interactions. Subsequent analyses confirmed the GRN using two mutants, Tbx5 and nkx2-5, and identified sets of duplicated zebrafish genes that do not show temporal subfunctionalization. This dataset provides an essential resource for future studies on the genetic/epigenetic pathways implicated in congenital heart defects and the mechanisms of cardiac transcriptional regulation.
Embryonic development is driven by the coordinated and precise regulation of thousands of genes through gene regulatory networks (GRNs). Each GRN contains the logic circuits that encode the response of a cell to signaling pathways and other developmental cues to form a specific tissue or organ, and their disruption can cause a wide range of birth defects. Therefore, assembly of GRNs is crucial for identifying genes involved in human disease, increasing our mechanistic understanding of their interactions and providing vital information for improved diagnosis and treatment of various malformations.
One area where this progress is clearly evident is the study of congenital heart disease (CHD). CHDs are the most common class of congenital defects in newborns, and remain a major cause of morbidity and mortality in children and adults. Mutations in at least 25 transcription factors have been identified in patients with heart defects (McCulley and Black, 2012). Several of these have been strongly linked to CHD in model organisms and humans. However, we still do not have a comprehensive view of the heart GRN, and a majority of individuals with a CHD do not have mutations in known genes (Fung et al., 2013). It is also likely that some CHDs are not genetic, but caused by environmental factors or perturbations in the epigenome that disrupt the GRN driving heart morphogenesis (Lage et al., 2012). In order to study these interactions, it is essential that the complete heart GRN be assembled.
The heart is the earliest functioning organ in the developing embryo, and is essential for normal embryonic development and growth. Morphologically, the heart forms as a linear tube located along the ventral midline of the embryo, with the inflow tract and primitive atrium located caudal to the forming ventricle and outflow tract. The heart then undergoes extensive remodeling, whereby it loops upon itself to bring the inflow and outflow tracts together and the primitive atrium into a position rostral/cranial and dorsal to the primitive ventricle (Männer, 2009). During this same time period, proliferation and addition of cardiac neural crest cells cause the chambers to expand, and ventricular trabeculation and wall compaction begin. Transcription factors, such as Tbx2 and Tbx3, repress expansion of the region separating the primitive atrium and ventricle, leading to the appearance of a constriction – the atrioventricular canal (AVC) – between the developing chambers (Ribeiro et al., 2007; Singh et al., 2012). The conduction system and endocardial cushions, which will later form the valves, begin to expand into the canal. Finally, remodeling events of the great vessels and septation of the chambers occur in mammals, birds and partially in most reptiles. These processes, except for septation, are highly conserved across vertebrate species, including lamprey, fish, reptiles, birds and mammals (Jensen et al., 2013).
Largely due to the clinical relevance of CHD, efforts to assemble the cardiac GRN have been a major focus of research for many years. As a result, several transcription factors involved in cardiac differentiation and morphogenesis have been identified, including Nkx2-5 (Chen and Schwartz, 1996) and members of the Tbx (Piotrowski et al., 2003; Plageman and Yutzey, 2005; Ribeiro et al., 2007), Gata (Pehlivan et al., 1999), Klf (Nemer and Horb, 2007) and Fox (Chi et al., 2008; Kume et al., 2001; Wang et al., 2004) families, and others (McCulley and Black, 2012). This research has focused on select interactions between individual factors and their downstream response genes. In order to build the complete GRN, current research is now using high-throughput techniques to understand how these factors interact, which genes they regulate, and how they function. This effort has been greatly aided by the development of high-throughput genomic technologies and systems biology approaches to comprehensively identify gene regulatory interactions. These technologies are greatly accelerating the pace of GRN assembly.
In order to elucidate the GRN driving heart morphogenesis and differentiation, we used a systems biology approach to identify regulatory interactions controlling temporal gene expression patterns in the heart. We conducted an RNA-seq timecourse in zebrafish with samples collected every 6 hours from 30 hpf to 72 hpf, corresponding to heart looping [30 to 72 hpf (Männer, 2009)], cardiomyocyte maturation, initial trabeculation and sino-atrial (SA) node establishment, AVC formation [30-55 hpf (Peal et al., 2011)] and valve formation [beginning at 55 hpf (Pestel et al., 2016)]. We then clustered genes by their temporal expression pattern using self-organizing map (SOM) analysis. We discovered that clusters of genes with similar temporal gene expression patterns contain cohorts of co-regulated genes, as confirmed by analyzing the frequency of known gene regulatory interactions within the clusters. We then searched the proximal cis-regulatory regions to identify transcription factor binding motifs overrepresented in each cluster. Using these data, we constructed a large GRN model to explain the temporal expression patterns in heart morphogenesis. Importantly, this GRN contained several groups of genes with similar temporal expression patterns regulated by shared sets of multiple transcription factors, consistent with the gene battery hypothesis (Britten and Davidson, 1969; Nelander et al., 2005; Peter and Davidson, 2011). This resource provides a framework to study the effects of genetic or environmental factors that affect heart development and the complex interactions governing combinatorial control of gene batteries.
Identification and clustering of dynamically expressed genes in cardiac development
Heart looping and concurrent morphogenetic events begin just before 30 h post fertilization (hpf) and are mostly complete by 72 hpf in the zebrafish (Fig. 1A). In order to identify genes that are dynamically expressed during this critical period, we conducted RNA-seq on isolated heart tissue taken from eight time points between 30 hpf and 72 hpf (three replicates per time point). In order to determine the relative similarity of the replicates and time points, we used multidimensional scaling (MDS). MDS projects multidimensional data, in this case gene counts for each of the genes in the transcriptome, onto a two-dimensional space in a way that preserves the relative distances between points as much as possible. The resulting MDS plot showed strong clustering of the replicates, except for two outlier samples (A30 and B42). Further analysis of these two samples showed that they contained a high amount of E. coli DNA contamination, so they were excluded from subsequent analyses. MDS also showed a clear progression from 30 hpf to 72 hpf along the first axis (Fig. 1B), indicating that gene expression patterns were altered sequentially through the developmental stages analyzed.
The goal of this study was to investigate and identify gene regulatory interactions occurring during heart morphogenesis. Therefore, we identified and segregated genes that were differentially expressed over the developmental timecourse. As timecourse data are not suitable for pairwise hypothesis testing, differential gene expression was assessed using the negative binomial log ratio test in the DESeq2 package, which is conceptually similar to an ANOVA test. A total of 5861 genes exhibited dynamic changes in expression level during the time period covered (Fig. 1C, Table S1). Hierarchical clustering grouped genes into shared patterns of increasing expression, decreasing expression, transient expression and transient repression during the time window studied (Fig. 1D). The fact that a large number of genes have changing expression patterns across our dataset highlights the complexity of the developmental processes active during our timecourse.
Self-organizing map (SOM) analysis identifies clusters of genes with similar temporal expression patterns
In order to better separate the various expression patterns within the set of dynamically expressed genes, we clustered them using SOM analysis, which is an artificial neural network learning algorithm that fits a grid of nodes to high-dimensional data – gene expression patterns for individual genes in this case – and then assigns each pattern to the nearest node. Thus, it can be thought of as a non-linear principal component analysis (PCA), except each gene expression pattern is assigned to a specific cluster instead of merely generating loading values. Another important feature of SOM analysis is that the algorithm lays out each co-clustered gene expression pattern within a box (with relative expression on the y-axis and temporal expression on the x-axis) and places each box on a relational grid, with converse patterns at opposite corners and similar patterns grouped more closely together, allowing the expression patterns of genes in different clusters to be compared visually to determine whether they are largely similar or are largely contrasting.
The number of nodes in an SOM must be chosen a priori and is considered to be largely arbitrary, although SOMs using a low number of nodes provide results conceptually similar to k-means clustering, whereas grids with large numbers of nodes show emergent properties that make them more topographical in nature. For our purposes, we chose to cluster the dynamically expressed genes using an SOM with a small number of nodes to create a set of discrete expression patterns. Thus, we chose a rectangular 5×5 grid (Fig. 2A, a complete table of normalized counts and SOM assignments is available in Table S2) as the optimal number of nodes, since it was the maximum size that did not result in any nodes containing no genes and had nodes that produced similar, but distinct, expression patterns. Larger numbers of nodes resulted in multiple nodes with what appeared to be identical expression patterns and one or more empty nodes (data not shown). Conversely, smaller numbers of nodes merged clearly distinct patterns into a single node (data not shown). The resulting grid had clusters containing 5-1125 dynamically expressed genes. The largest clusters recapitulated the major patterns seen by hierarchical clustering, but intermediate patterns were more clearly represented in the other nodes. Of note, genes expressed at constant levels throughout our timecourse (P>0.05) were not included in the SOM analysis, but were assigned to an outgroup. The resulting outgroup contains many genes essential for heart development that were assigned to this cluster because their expression, and likely their function, does not change during the window covered.
SOM clusters contain cohorts of co-regulated genes
Because the SOM-derived clusters contain genes with similar temporal expression patterns in the heart, we hypothesized that the clusters represent groups of co-regulated genes. If this hypothesis is correct, we would expect there to be more gene regulatory interactions within a cluster than between clusters. Databases with known gene regulatory interactions in zebrafish are lacking. Therefore, we first converted the zebrafish gene identities to their human orthologs using the Orthoretriever program (http://lighthouse.ucsf.edu/orthoretriever/), which is an interface to the Ensembl BioMart version 77 database (Kasprzyk, 2011). We then calculated the number of known gene regulatory interactions within clusters and between clusters using the GEA_CLR database in the UCSC Interaction Browser, a database of known interactions from various sources (Wong et al., 2013). The resulting graph of the interactions for cluster A1 is shown in Fig. 2B, and those for all clusters can be found in Fig. S1. Visual analysis of this graph appears to show a large number of interactions, with a few highly connected nodes and many nodes with a few connections, which are likely to represent regulators and their targets, respectively.
However, this inspection is insufficient to ascertain whether the number of interactions within the graph is higher than expected with a randomly selected group of genes this size. Therefore, we calculated the assortativity coefficient for all clusters in the SOM graph. The assortativity coefficient measures the homophily of the data, i.e. the relative probability of interacting with another member of one's own class versus another class. In this case, a gene's class was taken to be its SOM-assigned cluster. Thus, the assortativity coefficient gives a relative measure of the number of connections within an SOM cluster versus the number of connections between SOM clusters. A positive coefficient indicates that nodes are more likely to connect to members of the same cluster, and vice versa for negative coefficients. Larger assortativity values indicate stronger preferences, but this is difficult to gauge without context. Therefore, to determine the significance of this coefficient, a bootstrap null distribution was calculated using 10,000 replicates containing randomly scrambled cluster assignments (Fig. 2C). The calculated assortativity coefficient for the SOM clustering results was significantly greater than the mean (Fig. 2C, red line), with a bootstrap P-value less than 1×10−5. Therefore, we can reject the null hypothesis that co-expression clusters are independent of co-regulated cohorts, suggesting that the SOM clustering corresponds to groups of genes containing a large number gene regulatory interactions.
Although it is not currently possible to measure the level of conservation between the GRNs in zebrafish and mammalian heart development, this analysis supports the hypothesis that most regulatory interactions are conserved between zebrafish and humans. Otherwise, clustering zebrafish expression patterns would not have enriched known human regulatory interactions. Thus, this dataset will be useful for informing studies of heart development across several classes within the phylum Chordata. Indeed, when unconserved regulatory interactions are identified, they may indicate evolutionary divergence of the network to drive unconserved developmental processes, such as septation. These divergences will also provide important information on the mechanisms and evolution of cardiac GRNs.
Motif enrichment in specific temporal expression clusters identifies gene batteries within GRNs
One limitation of the aforementioned method is that interactions are constrained to known human gene regulatory interactions contained in the GEA-CLR database (Lee et al., 2015; Lowdon et al., 2014; Wong et al., 2013). Therefore, we next sought to identify novel regulatory interactions responsible for the gene expression patterns identified in our SOM analysis. In order to predict novel gene regulatory interactions within the developing heart, we sought to identify transcriptional regulatory interactions driving gene batteries. Gene batteries were first proposed in a classic work that has inspired GRN analysis for the last half century (Britten and Davidson, 1969; Peter and Davidson, 2011). Under this model, it is expected that groups of genes with the same expression pattern would be regulated by a common set of one or more transcription factors, and evidence of gene battery regulation has been reported using cohorts of genes with similar spatial expression patterns (Nelander et al., 2005; Zhang and Horvath, 2005). These gene batteries are the building blocks of GRNs (Peter and Davidson, 2011). Therefore, we hypothesized that gene batteries in the developing heart could be identified from sets of genes with shared temporal expression patterns.
In order to identify novel gene batteries from the timecourse data, we used the HOMER program (Heinz et al., 2010), which contains motifs generated from a number of sources, including ChIP data and in vitro transcription factor binding assays, to identify transcription factor binding sites statistically overrepresented in the proximal regions of the genes assigned to each SOM cluster. Identification of transcription factor binding sites near the transcription start site does not conclusively show regulation of the gene, but statistical enrichment of binding sites within promoters of genes with similar temporal expression patterns (single SOM clusters) provides strong evidence of co-regulation, as transcription factor binding motifs that are not actively involved in temporal regulation should be randomly distributed throughout the SOM, not enriched in particular clusters. This analysis identified 48 transcription factor binding motifs enriched in 17 of the 25 clusters (Fig. 3). Within this dataset were 9 of the 18 transcription factors that have been implicated in human patients with CHD (McCulley and Black, 2012) and are included in the HOMER motif database (Heinz et al., 2010). The number of interactions predicted by HOMER that are also found in the GEA_CLR database was calculated to show the ability of the method to capture known gene regulatory interactions, strengthening the concurrence of the novel gene regulatory interactions that we have identified. The percentage of HOMER-predicted interactions that were in the database (ʻknown') or not (ʻnovel') varied greatly between transcription factors (Fig. 3, Table S3). On average, 36% of the gene regulatory interactions predicted by this analysis were contained within the GEA_CLR database, consistent with previous gene battery prediction results using cohorts of spatially co-expressed genes (Nelander et al., 2005). Thus, on average 64% of the predicted interactions represent putative novel gene regulatory interactions in the developing heart, providing a rich set of predicted interactions for future analysis of heart development.
There were two striking outcomes of this analysis. First, the majority of the identified transcription factor binding motifs were enriched in only a single SOM cluster; for example, TBX, ETS and BMAL in SOM A1, B4, D3, respectively. Second, the majority of clusters contained fewer than three enriched transcription factors (Fig. 3B, Fig. 4A,B). The number of transcription factor binding motifs found in a particular cluster was not correlated with the size of the cluster (Pearson correlation coefficient r=0.09). For clusters with multiple transcription factors, distinct subsets and groups containing multiple binding sites were seen. These groups form the putative gene batteries predicted by this analysis and are often large, with up to 157 genes containing a single transcription factor binding site (Fig. 4C). Evidence for combinatorial control of these gene batteries was also seen, as many batteries of similarly expressed genes contained binding sites for several different transcription factors (Fig. 4C). Transcription factors that were enriched in multiple clusters also tended to be enriched in adjacent clusters, indicating that they regulate genes with similar temporal expression patterns (Fig. 5A), with subtle differences provided by the different combinatorial factors found in adjacent clusters. Although it is possible that their enrichment in multiple adjacent patterns indicates that the SOM cluster subdivided cohorts of genes that should be considered to be of the same cluster, the complement of enriched transcription factor binding motifs differs for every cluster. Therefore, it is more likely that combinatorial control by multiple transcription factors creates small variations in the temporal expression patterns of their target genes.
Many transcription factors can act as both repressors and activators depending on their interaction with various co-factors. Because the SOM clusters are laid out spatially by temporal expression pattern, gene batteries activated by a given transcription factor should regulate genes in their same cluster or in clusters nearby, as activation results in a positive correlation between the expression levels of a transcription factor and its targets. Conversely, gene batteries repressed by a transcription factor should be expressed in clusters arranged at opposite sides of the SOM grid, as repression results in a negative correlation between a transcription factor and its targets. Therefore, analyzing the relationships between the assignment of transcription factors to an SOM cluster based on expression pattern and the clusters enriched with targets for those transcription factors might help determine whether the identified regulatory interactions represent activation or repression by a specific transcription factor. In order to test this hypothesis, we selected two well-characterized transcription factor families in heart development – Sox and Klf – and compared the SOM clusters containing members of that transcription factor family with the SOM clusters that contained predicted targets.
Sox family members, especially the SoxE subfamily, have been shown to be expressed dynamically during heart looping and valve development in the chick (Montero et al., 2002). In our data analysis, Sox was the transcription factor family with the largest battery of gene targets, with 1087 target genes (643 known and 444 novel) across four SOM clusters (A3, A4, A5, B5, Fig. 3A, Fig. 5B). Of note, all of the clusters enriched for Sox binding motifs are located in the lower left corner of the SOM grid, corresponding to genes expressed early during development and inactive at the end of the timecourse. Twelve Sox family members were also expressed in A3 (Sox12), A4 (Sox9a), A5 (Sox1b, Sox2, Sox3, Sox9b, Sox11a, Sox18, Sox19a and Sox19b) and B5 (Sox19b and Sox21a). Of these, ten have known transactivation domains (Lefebvre et al., 2007). Sox6 was the only Sox family member not to follow this pattern, with expression in C1. Interestingly, Sox6 does not have any known activation or repression domains, but only contains several dimerization domains (Lefebvre et al., 2007). Together, the data strongly support the hypothesis that Sox family members play an activating role during early stages of heart development and are largely downregulated by the completion of heart looping.
Another transcription factor family with multiple family members and clusters with enriched binding site motifs is the Klf family. Klf motifs were enriched in four SOM clusters (Fig. 5C), three on the right side and one in the lower left hand corner (Fig. 5C). Klf family member expression is also divided into two groups, with Klf1 (A3), Klf8 (A4), Klf13 (A4) and Klf17 (A5) expression in the lower left clusters and Klf11b (B1), Klf12a (D1), Klf5a (E1), Klf18 (C2), Klf6a (E3), Klf11a (C1), Klf2a/b (E2), Klf9 (E2) and Klf15 (E2) expression in the upper right corner. Further studies are necessary to determine why our analysis breaks Klf family members into two distinct classes, but it might be that distinct sets of Klf factors are involved in early and late processes during heart development. Alternatively, or in addition, Klf factors may be involved in repressing and activating distinct sets of genes. Klf family members with dynamic expression during this timecourse also include those known to be involved in heart development, such as Klf2 (Chiplunkar et al., 2013), Klf3 (Kelsey et al., 2013), Klf5 (Drosatos et al., 2016) and Klf13 (Nemer and Horb, 2007), as well as several novel factors. Thus, it is possible that subsets of Klf factors play at least two distinct roles in the developing heart – one early and one late. These data will be useful in guiding future research that aims to determine the contrasting roles of the individual Klf members identified here.
Inclusion of predicted transcription factor binding motifs expands the gene regulatory interaction graph for each cluster
We next sought to determine the effect on the gene regulatory interaction graphs when transcription factors with enriched motifs in the cluster were added. We began with analysis of cluster A1, which was enriched for a single class of transcription factor binding motif: the Tbx family. Although several Tbx factors have been implicated in heart development (Plageman and Yutzey, 2005), including Tbx1, Tbx2, Tbx3, Tbx5 and Tbx18, none of the genes encoding these transcription factors was in the A1 cluster, as they were generally expressed at constant levels during the timecourse, sorting to the SOM outcluster. Therefore, we chose to add these transcription factors to the gene list for cluster A1 and generate a new graph based on known interactions in the GEA_CLR database (Fig. 6A). Four of the five genes had multiple known gene regulatory interactions with genes in the cluster and became major nodes within the graph. Interestingly, Tbx18, which is expressed exclusively in the epicardium except for a small region of expression in the myocardium at the sinus venosus (Greulich et al., 2011), displayed no known gene regulatory interactions with genes in the graph. The overall graph grew from 58 nodes to 66 nodes and 124 edges to 196 edges. This graph also shows strong evidence for gene batteries under combinatorial control. In order to highlight these groups, we created a hierarchical clustering based on known gene regulatory interactions in cluster A1 (Fig. 6B). The resulting dendrogram shows six putative gene batteries containing two to seven genes each (red branches).
Testing the proposed GRN with cardiac transcription factor mutants in mice and zebrafish
We sought to determine whether our predicted GRN interactions from the SOM and motif enrichment analyses were supported by differential expression analysis in cardiac transcription factor mutants. First, we compared enrichment of differentially expressed genes in the Tbx5 mouse knockout model (Waldron et al., 2016). Mouse data were used owing to the lack of such genomic data for tbx5a/b null fish, although they have a similar phenotype (Parrie et al., 2013). Differentially expressed genes from that study were compared with their zebrafish orthologs in our data to identify enrichment in SOM clusters, specifically in cluster A1, which showed an overrepresentation of Tbx family motifs. Overall, 1110 SOM-assigned genes were differentially expressed in the Tbx5 null mouse heart. In cluster A1, 90 of the 149 genes were differentially expressed in Tbx5 mutant hearts, including 40 of the 52 genes with predicted Tbx binding motifs and identifiable mouse orthologs (Fig. 7A). It should be noted that the published RNA-seq data contained only two biological replicates, represented only a single developmental time point, and were obtained in mouse, thus requiring lift-over into another species. Despite these limitations, we were able to identify strong concurrence of altered expression with our gene assignments to Tbx-enriched SOM clusters, highlighting the predictive power of our zebrafish datasets and the applicability of this analysis for cross-species comparisons.
We next analyzed nkx2-5 mutant zebrafish embryos as a confirmation of a transcription factor binding motif that was not enriched in any SOM cluster, and the nkx2-5 transcript was assigned to the outcluster with other transcripts that were expressed in the heart at steady-state levels. Nkx2-5 is a well-studied transcription factor that is known to be involved in cardiomyocyte differentiation (Benson et al., 1999; McCulley and Black, 2012; Tanaka et al., 1999). The nkx2-5 mutant phenotype in zebrafish includes looping and chamber size defects (Hill et al., 2013; Targoff et al., 2008), indicating that it may play a role in the heart development processes studied here. A recent study showed that Nkx2-5 expression in early cardiomyocyte differentiation was sufficient to maintain chamber identity and size (George et al., 2015). Thus, it is possible that Nkx2-5 acts early in differentiation to establish gene expression patterns for heart development and regulates the continued, steady-state expression of a cohort of genes during the period covered by our timecourse. Consistent with this, nkx2-5 RNA expression was constant in our timecourse, and assigned to the SOM outgroup.
To further explore the hypothesis that Nkx2-5 acts earlier in development, we analyzed genes differentially expressed between nkx2-5 null and sibling embryos at 48 hpf to identify their SOM assignments in our timecourse analysis. In contrast to the large number of SOM-assigned genes that are differentially expressed in the Tbx5 mutant mouse, most genes that were differentially expressed in nkx2-5 null zebrafish embryos were assigned to the SOM outgroup, and only 57 SOM-assigned genes showed differential expression between nkx2-5 mutant and sibling embryos, comprising 38 upregulated and 19 downregulated genes in nkx2-5 mutants (Table S4). Strikingly, the differentially expressed genes were separated into two distinct SOM clusters based on the direction of change in the nkx2-5 mutant (Fig. 7B). Of the 38 upregulated genes, 35 clustered into SOM clusters with increasing expression during heart looping and chamber maturation (clusters B1, C1, D1, D2 and E1-5). Similarly, 14 of the 19 downregulated genes clustered into the lower left corner (clusters A5, B5 and C5), corresponding to genes that show decreasing expression during looping and chamber maturation. However, in neither case would these numbers be sufficient to create a statistically significant overrepresentation in the patterns, consistent with our motif enrichment results. Together, the Tbx5 and nkx2-5 mutant data support the power of the model GRN generated here to predict the gene regulatory interactions driving major expression patterns in the heart during looping morphogenesis.
The findings with Nkx2-5 highlight one of the limitations of this type of study. Although our timecourse encompassed heart looping morphogenesis, it is still unclear how the timing between gene regulatory interactions and morphogenetic events is related. Transcription factors may make epigenetic changes that set the stage for subsequent regulatory events. For example, we have recently shown that Nkx2-5 regulates DNA methylation patterns (Gorsi et al., 2017 preprint). This might also explain the absence of GATA factor binding sites in our analysis, as GATA factors have been shown to interact with Nkx2-5 early in development (Searcy et al., 1998). Both Gata4 and Nkx2-5 expression, as well as that of several other canonical heart transcription factors, was detected by the RNA-seq data presented here, but their expression levels were constant throughout the timecourse, so they were not clustered by SOM and were assigned to the outgroup. The enrichment analysis, by its nature, also misses isolated gene regulation events for small sets of genes, instead focusing on regulatory events that appear to simultaneously affect the expression of many genes. Therefore, the expression patterns identified here should not be taken as an exhaustive list of regulatory events found in the heart, but this approach certainly provides a broad representation of the major temporal gene expression patterns during heart morphogenesis.
Analysis of cell type-specific and tissue-specific gene batteries
Many of the concurrent developmental processes occurring during this timecourse are restricted to subsets of cell types or tissues. Although it is important to recall that a ʻdifferentiation gene battery is not the same thing as a cell type' (Peter and Davidson, 2011), we asked whether a subset of SOM clusters would be confined to specific cell types or tissues in the developing heart. We created 13 hand-curated lists containing 398 known tissue-specific markers (Table S5), described here in three groups.
The first group contained markers for the major structures in the embryonic zebrafish heart: sinus venosus, atrium, AVC, ventricle and bulbus arteriosus (Fig. 8A). The markers for all of these structures appear to be distributed into three distinct groups: early, middle, and late expression.
The second group contained genes expressed in each of the major cell types of the embryonic heart: progenitor, endocardial, myocardial, epicardial, pericardial and neural crest cells (Fig. 8B). Progenitor cell markers were enriched in the lower left corner of the SOM, i.e. decreasing over time, as expected. Neural crest cells also showed enrichment for genes with decreasing expression patterns, although a few genes were increasing. By contrast, the four major tissue types in the heart (endocardium, myocardium, epicardium and pericardium) showed distinct expression patterns, predominantly found in mutually exclusive clusters. For example, the myocardial markers were enriched in the upper middle side of the SOM (clusters C1, C2, D1), and the epicardial markers were enriched on the righthand side of the SOM (clusters E2, E3), indicating that the genes are generally increasing. This is consistent with the timing of epicardial migration to cover the heart, which is not complete until after the period covered by our timecourse (Peralta et al., 2014). Conversely, pericardial markers were largely decreasing. Finally, the other two major cell types, endocardium and myocardium, showed a complex expression pattern, with some genes decreasing and others increasing during the timecourse. Although the clusters containing the decreasing genes were largely overlapping, the clusters with increasing genes differed between cell types, with myocardial markers increasing expression earlier than endocardium.
Finally, the third group contained genes encoding ion channels and structural proteins such as muscle fiber components (e.g. actin and myosin). Strikingly, both categories were enriched in the top middle of the SOM, indicating that expression comes on at ∼48 h and remains on for the remainder of the timecourse (Fig. 8C).
In conclusion, although these tissues are undergoing active and complex gene regulatory events throughout the timecourse, cluster information for a particular gene may provide clues to its likely role within heart development, and in the hunt for functions of novel genes this might allow investigations to focus on particular cell and tissue types according to SOM assignment of that gene. For example, comparison of novel genes co-regulated with known tissue-specific genes in a given SOM cluster would provide insight into the likely spatial localization of the novel gene, creating testable hypotheses for future studies.
Timecourse analysis identifies pairs of duplicated genes with similar expression patterns
Sets of duplicated genes are common in many vertebrates, and it is thought that these duplicated genes create an important source of phenotypic divergence in evolution (Prince and Pickett, 2002). Although there have been several genome duplication events during vertebrate evolution, duplicate genes are not maintained unchanged. There are three major potential fates of duplicated genes: non-functionalization, where one copy becomes a pseudogene; neo-functionalization, where one gene gains a novel function; or subfunctionalization, where both copies lose some of their functions such that both are required to perform the function of the ancestral gene. This last case is thought to have resulted in 10% of the duplicated copies from the teleost-specific genome duplication being retained (Postlethwait et al., 2000). Subfunctionalization is most commonly manifest by divergence in expression patterns between the two copies. Thus, we analyzed all pairs of duplicated genes in the timecourse to measure the amount of divergence in expression patterns, and to identify genes that have retained similar expression patterns. A total of 1272 duplicated gene pairs had at least one gene differentially expressed and assigned to SOM clusters during our timecourse. Of these, 287 (22.6%) had both members of the pair assigned to SOM clusters, supporting the hypothesis that most pairs diverge into vastly different expression patterns. However, when both members of the pair were dynamically expressed over our timecourse, they frequently showed very similar expression patterns. Of the 287 pairs in the SOM, 61 were in the same cluster and 100 were in adjacent clusters (Fig. S2, see Table S6 for complete analysis). Future studies will focus on analysis of these pairs to determine the nature of their retained redundancy and to pursue functional tests to assess whether the evolutionary process of subfunctionalization explains their continued co-expression.
The genomics approach used here identified several major temporal gene expression patterns in the heart during looping morphogenesis, as well as key transcription factors driving these expression patterns. These data were then used to generate a model GRN for heart looping morphogenesis and concurrent processes, providing a valuable resource for future GRN studies in the heart. The model GRN assembled here represents a predictive network that identifies putative regulatory mechanisms for large batteries of genes in heart morphogenesis. Many of these gene batteries do not only share sites for a single transcription factor, but for a number of potentially interacting transcription factors. This characteristic would be missed by traditional analysis methods focusing on knockdown or knockout of a single transcription factor and provides an important dataset for studying the combinatorial control of gene expression.
Although this is the first heart looping morphogenesis RNA-seq timecourse in zebrafish, three timecourses have been conducted in cell cultures or mice. The timecourse conducted by Wamstad et al. (2012) followed gene expression patterns while differentiating mouse ESCs into cardiomyocytes in vitro. Each data point in their study thus represented a distinct cell type (embryonic stem cells, mesoderm, cardiac progenitors, and cardiomyocytes), all of which are likely to represent cell types earlier in development than covered by our data. By contrast, most of the time points in the single-cell timecourse generated by Delaughter et al. (2016) were taken after heart looping, including two postnatal time points. The most similar timecourse to ours was created by Li et al. (2016), which contains three time points during heart looping, but was generated using single-cell sequencing of mouse cardiomyocytes at three time points as compared with the eight time points collected in our study. Thus, both the similarities and differences between these datasets might provide key insights into heart development. However, it should be emphasized that these datasets are complementary, rather than redundant, with the dataset presented here. It is established that single-cell analysis is not well suited for reliably assessing quantitative changes owing to the high number of PCR cycles needed during library preparation, but is able to distinguish between different cell types. Therefore, while single-cell datasets provide high-resolution spatial information, the timecourse presented here will provide more accurate assessment of temporal changes in gene expression. The combined application of these datasets will help define the GRNs driving heart development.
MATERIALS AND METHODS
Heart collection and total RNA collection
Groups (∼20 males and ∼20 females) of fish positive for the cmlc2:GFP transgene (Huang et al., 2003) were mated to generate large synchronized clutches of embryos. After the embryos were collected and cleaned, they were divided into eight groups of ∼200 embryos for heart isolation at 30, 36, 42, 48, 54, 60, 66 or 72 hpf. Embryos were dechorionated as needed and anaesthetized by adding 0.02% tricaine to the embryo dish. Hearts were then isolated as previously described (Burns and MacRae, 2006), except that hearts were manually selected from the medium instead of filtering. Hearts were then placed in fresh medium and manually picked again, to ensure no carryover of non-cardiac tissues, and placed in a 1.5 ml microcentrifuge tube. Tubes containing the isolated hearts were centrifuged briefly (1000 g for 30 s) to pellet the tissue and the supernatant removed under a microscope. Approximately 2 µl Trizol (Life Technologies) per heart was then added to the tube and the hearts homogenized before storage at −80°C.
RNA was isolated by phenol-chloroform extraction followed by ethanol precipitation using standard protocols. In order to obtain the necessary amount of RNA for library preparation, the total RNA from two collections were combined together at each time point. Therefore, the three replicates in this experiment arose from a total of six collections conducted on different days.
Library construction and RNA sequencing (RNA-seq) were performed at the Huntsman Cancer Institute High-Throughput Genomics Core Facility. Libraries were constructed from the total RNA using the NuGEN Ovation RNA-Seq Library System. Sequencing was performed on a HiSeq 2000 sequencer (Illumina) to generate 50 bp single-end reads. Three biological replicates for each of the eight time points (24 samples) were split across four lanes. Post-sequencing quality control was conducted by FastQC (https://www.bioinformatics.babraham.ac.uk/projects/fastqc/). The total number of reads ranged from 22-34 million reads per sample and both per-base and per-sequence quality scores were above 30 for more that 90% of the reads. Sequencing results were then aligned by Novoalign (Novocraft) to a reference genome containing the zv9 build of the zebrafish genome and a set of known and theoretical splice junctions generated from the Ensembl version 75 genome annotation using the MakeTranscriptome program in the USeq package (Nix et al., 2008).
Differential gene expression analysis and clustering
Differential gene expression and clustering analyses were performed in R. First, a count matrix of the data was created using the package Rsubread (Liao et al., 2013). Next, consistency of the replicates was assessed using the plotMDS function in the EdgeR package (Robinson et al., 2009) using default parameters, which include the 500 genes with the largest pairwise distances for analysis. Based on this analysis, we removed replicate A30 and B42 from all future analyses. Because timecourse experimental designs do not involve a clear case versus control or baseline comparison, genes that were differentially expressed over the timecourse were determined by the negative binomial log ratio test in the DESeq2 package (Love et al., 2014). Genes with an adjusted P-value less than 0.05 were selected. The rlog normalized values (generated by DESeq2) of the replicates were then averaged to generate a single expression value for each gene at each time point. These values were converted to z-scores and clustered using the som package (https://CRAN.R-project.org/package=som) and a rectangular 5×5 grid. Grid size was determined by the maximum dimensions not resulting in any empty clusters.
Known gene regulatory interaction analysis
The interconnectedness of each SOM cluster was measured using known gene regulatory interactions in the GLR_CEA database on the UCSC Interaction Browser (Wong et al., 2013). Interaction lookups were limited to those where both the source and the sink were contained within the SOM cluster being analyzed. P-values were calculated by generating a bootstrapped distribution from 10,000 groups of randomly assigned gene clusters and determining the interconnectedness within the random groups. All network graphs were created in R using the igraph package (Csardi and Nepusz, 2006).
Motif enrichment analysis
Overrepresentation of transcription factor binding motifs in the proximal promoter regions of the genes in each cluster was determined using the HOMER program (Heinz et al., 2010) set to search 1000 bp upstream and 100 bp downstream of the transcription start site. All genes were used as the background set for P-value calculation. Results presented here are limited to known motifs.
nkx2-5 and Tbx5 RNA-seq comparisons
Comparisons between our timecourse and RNA-seq data from nkx2-5 mutant zebrafish embryos and Tbx5 mutant embryonic mouse hearts were conducted using data generated in our lab (Gorsi et al., 2017 preprint) and existing data (Waldron et al., 2016), respectively. nkx2-5 data were generated from 48 hpf zebrafish hearts collected from nkx2-5 mutant and wild-type siblings identified phenotypically. Differential expression for both genes was conducted using DESeq2 (Love et al., 2014). For mouse Tbx5, differentially expressed genes were converted to their zebrafish orthologs using Orthoretriever, and then merged with the SOM classification by Gene ID. Genes differentially expressed in the nkx2-5 data were directly compared with the SOM assignments.
Spatial gene expression analysis
SOM assignment of known markers was conducted using a hand-curated list of genes created by an extensive literature search (Table S5). Genes identified in the literature were annotated for function, anatomical location and/or cell type. These lists were then merged with the SOM gene assignments and each annotation visualized in R.
Duplicate gene analysis
Pairs of genes duplicated in the zebrafish genome were downloaded from the Ensembl database (Ensembl release 88) and merged with the SOM data table to identify the cluster locations of each. Relative distances between pairs were calculated using the Euclidean distance on the grid: a distance of 0 indicates both members of the pair are in the same SOM cluster, a distance of 1 indicates members are in adjacent clusters, a distance of 1.4 indicates diagonal clusters, and larger distances indicate more distant clusters.
All animal research was approved by the University of Utah IACUC committee (protocol number 15-06004).
Conceptualization: J.T.H., H.J.Y.; Methodology: J.T.H., B.D., M.S., H.J.Y.; Software: J.T.H., B.D.; Validation: J.T.H., B.D., B.G.; Formal Analysis: J.T.H., B.D., B.G.; Investigation: J.T.H., M.S.; Resources: H.J.Y.; Data Curation: J.T.H., B.D., B.G.; Writing - original draft: J.T.H., B.G., H.J.Y.; Writing - review and editing: J.T.H., B.D., B.G., H.J.Y.; Visualization: J.T.H., B.D.; Supervision: J.T.H., H.J.Y.; Project Administration: H.J.Y.; Funding Acquisition: H.J.Y.
This study was funded by a National Heart, Lung, and Blood Institute (NHLBI) Bench-to-Bassinet Consortium (http://www.benchtobassinet.com) grant to H.J.Y. (2UM1HL098160) and sequencing was provided by a NHLBI core facilities support grant to the New England Research Institute (U01 HL098188). Deposited in PMC for release after 12 months.
RNA-seq data are available at the NCBI Sequence Read Archive under accession SRP117696 or (with Flash and Java enabled) at https://b2b.hci.utah.edu/gnomex/gnomexFlex.jsp?requestNumber=152R.
The authors declare no competing or financial interests.