Uncovering the direct regulatory targets of doublesex (dsx) and fruitless (fru) is crucial for an understanding of how they regulate sexual development, morphogenesis, differentiation and adult functions (including behavior) in Drosophila melanogaster. Using a modified DamID approach, we identified 650 DSX-binding regions in the genome from which we then extracted an optimal palindromic 13 bp DSX-binding sequence. This sequence is functional in vivo, and the base identity at each position is important for DSX binding in vitro. In addition, this sequence is enriched in the genomes of D. melanogaster (58 copies versus approximately the three expected from random) and in the 11 other sequenced Drosophila species, as well as in some other Dipterans. Twenty-three genes are associated with both an in vivo peak in DSX binding and an optimal DSX-binding sequence, and thus are almost certainly direct DSX targets. The association of these 23 genes with optimum DSX binding sites was used to examine the evolutionary changes occurring in DSX and its targets in insects.
INTRODUCTION
Although many of the regulatory hierarchies governing development have been elucidated, progress has been slow in identifying the direct targets of the terminal regulatory genes in these hierarchies. These difficulties arise from: (1) few target genes being unique to one hierarchy; (2) difficulties in distinguishing direct from indirect targets; and (3) in vitro defined binding sites of transcription factors occurring by chance many times in the genome. Technological advances such as ChIP-seq solve some of these difficulties, but challenges remain for transcription factors that lack a good antibody or regulate their targets in restricted tissues (Park, 2009). The sex hierarchy of Drosophila melanogaster exhibits all of these features that make it difficult to identify direct targets of the terminal genes in regulatory hierarchies.
Our interest here is in identifying the direct targets of doublesex (dsx), a terminal gene in the fly sex hierarchy. dsx transcription is highly regulated, as evidenced by its temporally and spatially restricted expression patterns through development (Robinett et al., 2010; Rideout et al., 2010). dsx pre-mRNAs are sex-specifically spliced (Ryner et al., 1996; Burtis and Baker, 1989), resulting in adults in whom some cells express DSXM or DSXF, and are thus either male or female, respectively, and other cells do not express these transcription factors.
DSXM and DSXF share identical DNA-binding domains, but have different C termini (Burtis et al., 1991; Erdman and Burtis, 1993). The DSX proteins specify nearly all somatic sexual differences outside the nervous system, as well as several nervous system sexual dimorphisms (Lee et al., 2002; Sander and Arbeitman, 2008; Robinett et al., 2010; Rideout et al., 2010; Mellert et al., 2010) (for reviews, see Christiansen et al., 2002; Camara et al., 2008).
Although some genes whose activity levels are dependent on dsx have been identified by a candidate gene approach, most have been identified via plus/minus, microarray, SAGE or enhancer trap-based screens for genes expressed sex-differentially in the soma (Shirangi et al., 2009; Lebo et al., 2009; Chatterjee et al., 2011) (for reviews, see Christiansen et al., 2002; Camara et al., 2008). These screens identified dozens of genes whose activity levels are governed by dsx, but did not distinguish direct from indirect targets. Studies of these dsx-regulated genes showed that the DSX proteins largely act by modulating in some tissues the activities of genes that are also used sex-nonspecifically in other tissues. The array of processes regulated by dsx is complex, as different genes are regulated by dsx in all cell and tissue types examined (for reviews, see Christiansen et al., 2002; Camara et al., 2008). Of the dsx-regulated genes that have been sufficiently characterized to distinguish likely direct and indirect targets, the vast majority are indirect targets (Arbeitman et al., 2004; Chapman and Wolfner, 1988; Wolfner, 1988). The few identified direct targets are the three Yolk protein (Yp) genes (Burtis et al., 1991; Coschigano and Wensink, 1993; Hutson and Bownes, 2003), bric à brac 1 (bab1) (Williams et al., 2008) and Fad2 (Shirangi et al., 2009).
The paucity of direct DSX targets limits understanding of important aspects of sexual biology. First, we have little knowledge of the molecular links between dsx and the biological processes it controls. Second, understanding how the functions of the sex hierarchy and other patterning hierarchies are integrated during development requires the identification of direct DSX targets (Christiansen et al., 2002; Williams et al., 2008). Third, not knowing the direct DSX targets seriously constrains studies of the molecular evolution of sex. To open up these topics for more systematic analysis, we carried out a screen for direct DSXF targets.
DSXF and DSXM bind with similar kinetics to identical sequences in the Yp1 and Yp2 common promoter (Burtis et al., 1991; Erdman and Burtis, 1993; Coschigano and Wensink, 1993). SELEX experiments defined a consensus DSX-binding site (Erdman et al., 1996) as a 13 bp palindromic sequence (G/A)nnAC(A/T)A(T/A)GTnn(C/T) composed of two half-sites around a central (A/T) base pair. One such site is expected by chance every 2 kb in the D. melanogaster genome, thus, additional information is needed in vivo to specify biologically relevant DSX-binding sites.
We identified at a genome-wide level potential DSX-binding regions via DamID chromatin profiling. Analysis of these DSX-binding regions identified a new consensus 13 bp palindromic DSX-binding sequence in which the identity of each base is important for optimal DSX binding. This optimal DSX binding sequence is enriched in the D. melanogaster genome (58 copies versus approximately three expected from random). There are 23 genes associated with an in vivo peak in DSX binding and an optimal DSX binding sequence, and therefore almost certainly are direct DSX targets. The abundance of the optimal DSX-binding site in other insect genomes, as well as the maintenance of its association with the above 23 direct DSX target genes in these species, provides insight into the molecular evolution of DSX and its targets in insects.
MATERIALS AND METHODS
Plasmids construction
To construct a FLP-activated DSXF-Dam plasmid, the coding region of the E. coli Dam methyltransferase was cut from pCMycDam (a gift from Drs van Steensel and Henikoff, Fred Hutchinson Cancer Research Center, Seattle, WA) with EcoRI/XbaI, and inserted into pP{UAST} between EcoRI and XbaI sites. The fragment encoding a V5-tagged DSXF was digested from pMTDSXFV5 (Garrett-Engele et al., 2002), and inserted in-frame into the above plasmid between EcoRI and AgeI sites. An FRT-stop-FRT cassette was inserted into the EcoRI site to generate the final FLP-activated DSXF-Dam construct (fsfDSXF-Dam).
To generate a DNA-binding mutant for DSXF-Dam, site-directed mutagenesis was performed using primers 5′-TCTGCAAACGGCCTTGctgcagGCCCAGGCGCAGGAT-3′ and 5′-ATCCTGCGCCTGGGCctgcagCAAGGCCGTTTGCAGA-3′ (the lowercase letters indicate the mutated sequences). This resulted in the substitutions of R90L and R91Q in the DSX protein, and the mutated plasmid was confirmed by sequencing to be free of other mutations. An 820 bp BglII/XhoI fragment containing the mutated sequence was used to replace the wild-type sequence in fsfDSXF-Dam, resulting in the plasmid fsfR91Q-Dam.
To construct a FLP-activated Dam control plasmid, the coding region of Myc-tagged E. coli Dam methyltransferase was digested from pNdamMyc (a gift from Drs van Steensel and Henikoff) with EcoRI/SalI, and cloned into P{UAST} between EcoRI and XhoI sites. An FRT-stop-FRT cassette was inserted at the EcoRI site and the correct orientation selected, generating the final plasmid fsfNdam.
To generate bacterial expression plasmids for DSXF, the coding sequence from the fsfDSXF-Dam was PCR amplified using primers 5′-ATGGTTTCGGAGGAGAACTGGA-3′ and 5′-CGTAGAATCGAGACCGAGGAGA-3′. The PCR products were cloned into pCR-Blunt-TOPO plasmid (Invitrogen, Carlsbad, CA), with the correct orientation confirmed by sequencing.
Eletrophoretic mobility shift assay
Eletrophoretic mobility shift assays were carried out as described previously (Erdman et al., 1996) with modifications. Bacteria expressing the V5-tagged DSXF (in pCR-Blunt-TOPO) were cultured in 5 ml of LB medium overnight at 30°C, then induced with 1 mM IPTG for 3 hours and harvested by centrifugation at 4°C. The cells were re-suspended in 300 μl of Z-50 buffer (Erdman et al., 1996) and briefly sonicated followed by centrifuging at 15,000 g at 4°C for 5 minutes and the supernatant immediately frozen (−80°C) until use. The DNA oligos for the probes (from the branchless gene) were synthesized as 5′ biotin labeled. Binding reactions (20 μl) were set up in Z-50 buffer and include 2 μg of poly[dI-dC], 0.25 pmol of labeled probes, 10 μl of total cell lysate and various amounts of competitors as indicated. After 20 minutes of incubation at 22°C, reactions were electrophoresed (5% PAGE gel, 0.5× TBE buffer, 120 V) then transferred to a Hybond-N+ membrane (GE Life Sciences, Piscataway, NJ) and the biotin detected using a LightShift Chemiluminescent EMSA Kit (Thermo Scientific, Rockford, IL).
Fly crosses and heat shocks
Males carrying either fsfDSXF-Dam (two lines), fsfR91Q-Dam (three lines) or fsfNdam (1 line) were crossed to y w hsFLP122; +; + virgin females at 22°C. Newly eclosed female offspring with pigmented eyes were subjected to heat shock at 37°C for 40 minutes to induce the expression of FLPase that in turn activates the expression of Dam proteins, then allowed to recover at 25°C for 96 hours (24 hours for fsfNdam) before being collected for genomic DNA preparation.
Construction of DpnI sequence-tag libraries for ultra high-throughput sequencing
Genomic DNA was prepared with a DNeasy Blood & Tissue Kit (Qiagen, Valencia, CA). For each sample 8 μg DNA was digested with 40 U of DpnI in 100 μl at 37°C overnight and purified through a QIAquick PCR purification column (Qiagen). A biotin-labeled EcoP15I adaptor was generated by annealing biotin-5′-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTCAGCAG-3′ and 5′-CTGCTGAGATCGGAAGAGCT-3′ (the EcoP15I sites are underlined) and ligated to each DpnI-digested end. DNA polymerase I was used to repair the nicked strand followed by EcoP15I (which cleaves DNA 27 bp 3′ to its recognition site) digestion for 4 hours to generate a 27 bp sequence tag ligated to the biotin-labeled adaptor. The biotin-labeled tags were purified using 30 μl of strepavidin-Dynabeads (Invitrogen) per sample. After end repairing using Klenow polymerase, a second adaptor (generated by annealing 5′-ccgagatctaCACTCTTTCCCTACACGACGCTCTTCCGATCT-3′ and 5′-AGATCGGAAGAGCGTCGTGTA-3′) was ligated to the other end of the sequence tags. After washing, the beads were re-suspended in 25 μl 1× Buffer 2 (New England Biolabs, Ipswich, MA). DpnI tags were PCR amplified with primers 5′-CAAGCAGAAGACGGCATACGAGCTCTTCCGATCTCAGCAGTC-3′ and 5′-AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGAC-3′ (underlined sequences correspond to the Illumina library construction primers) as follows: 1 cycle of 68°C for 5 minutes; 11 cycles of 95°C for 30 seconds, 65°C for 1 minute, 72°C for 20 seconds; one cycle of 72°C for 5 minutes. The products were resolved on a 4% agarose gel. The desired ~125 bp product was recovered using a QIAquick Gel extraction kit (Qiagen) and quantified using a Nanodrop spectrophotometer (Thermo Scientific, Rockford, IL). Ultra high-throughput sequencing was performed on a Genome Analyzer platform (Illumina, San Diego, CA). A more detailed protocol on library preparation is available upon request.
Analysis of sequencing data
Sequencing data (25 bp reads) from the Genome Analyzer were matched to the fly genome using the Illumina software package and reads matching a unique location were binned for further analysis. The unique reads were aligned with GATC sites using Perl scripts, and any reads that were not located at a sequence flanking a GATC site were discarded (~10%). Reads that aligned with each GATC site were counted for each sample. Because methylation of any particular GATC site is independent between cells, and the probability of any particular site being methylated in a cell is small, the read counts at each GATC site were treated as a Poisson count. A Z-score-based approach was used to compare the counts at each GATC site for different samples. The Z score was calculated by the formula: where Ca is the count of reads for sample A, and Cb is the count of reads for sample B. If Ca+Cb is smaller than 4, the Z score at that site was artificially set as 0 (which means that Ca is not different from Cb) to minimize bias created by the small sample. Any Z-score greater than 1.65 (equivalent to P<0.05) was transformed into a P value using the Z-score to P-value table. The transformation was set at a halving of P value for the convenience of the Perl scripts programming, such that a P value between 0.025 and 0.05 was set as P=0.05, a P value between 0.025 and 0.0125 was set as P=0.025, and so on; Z-scores greater than 4.3 were set as P=0.00001. For any Z-score smaller than 1.65, the P value was set as 1. The cumulative P value was then calculated for each peak by grouping the GATC sites whose P values were equal or smaller than 0.05, and are separated from each other by less than 1.5 kb or no more than one GATC site (in the case that the distance between two GATC site is greater than 1.5 kb). The P value for the peak is calculated by multiplying the P values of the sites and the total number of sites within the peak.
Identification of an optimal DSX-binding site by MEME, and its analysis in insects
The peaks (see Table S2 in the supplementary material) used for searching for an enriched motif by MEME (Bailey and Elkan, 1994) were picked based on peak shapes and good GATC-site density such that allow us to locate the center of the peak, to which DSX potentially binds, and to remove extraneous sequences on both sides.
Whole-genome analysis of optimal DSX-binding site matches in the D. melanogaster genome of all chromosome arms (release sequence BDGP R5, April 2006) was carried out with a Perl script. The whole-genome searches for a given 13 bp sequence used Blast (Altschul et al., 1997), with E-value adjusted for genome size.
Orthologous genes were defined by the FlyBase annotation or by Blast using the D. melanogaster sequence if no ortholog had been annotated. The sequence of the gene and its 5′ and 3′ flanking regions were manually exported and subjected to analysis using Perl script. The 5′ or 3′ flanking region is defined as the sequence until the next gene with the following exceptions: if the next gene is less than 10 kb away, the flanking region is extended to 10 kb; if the next gene is greater than 50 kb away, the flanking region is truncated at 50 kb.
Protein alignment
The amino acid sequences of DNA-binding domains for the 12 Drosophila DSX proteins were obtained from NCBI GenBank or FlyBase except for D. mojavensis and D. willistoni, which were derived from sequences of RT-PCR products, as we found errors in the NCBI sequences in the DSX proteins for these two species. The amino acid sequence alignment was performed using Clustalw2 (Larkin et al., 2007).
RESULTS
Whole genome identification of DSX-binding regions
In vivo DSX-binding regions were identified on a whole-genome scale using DamID, a technique in which tethering E. coli DNA adenine methyltransferase (Dam) to a chromatin binding protein leads to specific methylation of GATC sites in the vicinity of that protein's binding/recruitment sites (van Steensel and Henikoff, 2000; van Steensel et al., 2001). To avoid non-specific methylation from overexpression (van Steensel and Henikoff, 2000), Dam-fusion proteins were expressed using the basal expression level of the minimal hsp70 promoter in P{UAST} (Fig. 1A). The GATC methylation profile of a DSXF-Dam fusion construct was compared with those of two control proteins. First, Dam alone was used to measure non-specific methylation (van Steensel and Henikoff, 2000). As the fly genome contains ‘transcription factor colocalization hotspots’, where transcription factors accumulate through protein-protein interactions (Moorman et al., 2006), a second control was Dam fused to a DSX protein with a mutation in its DNA-binding domain (R91Q) that has lost the ability to bind the Yp1 DSX-binding site (Erdman and Burtis, 1993; Zhang et al., 2006). R91Q-Dam should bind to transcription factor hotspots, but not to real DSX-binding sites.
Modified DamID approach to identify DSX-binding regions. (A) Schematic drawing of the flip-on Dam-fusion constructs. The proteins are expressed from the basal level of the UAS sequence after the FRT-stop-FRT cassette is removed by expression of FLPase. (B) Example of an Illumina sequencing library. Lane 1, 100 bp DNA ladder; lanes 2 and 3, PCR of a sample skipping the DpnI digestion step to serve as a negative control; 1 μl (lane 2) and 4 μl (lane 3) of ligation product were added as a template; lanes 4 and 5, PCR of a sample with every step finished, 1 μl (lane 4) and 4 μl (lane 5) of ligation product were added as a template. The library band is indicated by an arrowhead and the primer dimer is indicated by an asterisk. (C) The Z-score peak of a DSX-binding region at the Yp1, Yp2 locus. The Z-score comparing the tag counts between DSXF-Dam and R91Q-Dam for each GATC site was plotted at the Yp1, Yp2 locus. The genomic position is indicated on the x-axis. The scale of the Z-score is indicated on the y-axis. Each vertical line indicates the Z-score for a GATC site at the corresponding genomic position on the x-axis. The height of the vertical line indicates the value of the Z-score. For those GATC sites with a Z-score less than 0, only the position of the GATC site is indicated (by a vertical line with the Z-score value of 0). The genes are indicated under the x-axis, with open boxes indicating genes on the forward strand and shaded boxes indicating genes on the reverse strand. (D) Similar to C, the Z-sore peak of a DSX-binding region near the gene shd is shown.
Modified DamID approach to identify DSX-binding regions. (A) Schematic drawing of the flip-on Dam-fusion constructs. The proteins are expressed from the basal level of the UAS sequence after the FRT-stop-FRT cassette is removed by expression of FLPase. (B) Example of an Illumina sequencing library. Lane 1, 100 bp DNA ladder; lanes 2 and 3, PCR of a sample skipping the DpnI digestion step to serve as a negative control; 1 μl (lane 2) and 4 μl (lane 3) of ligation product were added as a template; lanes 4 and 5, PCR of a sample with every step finished, 1 μl (lane 4) and 4 μl (lane 5) of ligation product were added as a template. The library band is indicated by an arrowhead and the primer dimer is indicated by an asterisk. (C) The Z-score peak of a DSX-binding region at the Yp1, Yp2 locus. The Z-score comparing the tag counts between DSXF-Dam and R91Q-Dam for each GATC site was plotted at the Yp1, Yp2 locus. The genomic position is indicated on the x-axis. The scale of the Z-score is indicated on the y-axis. Each vertical line indicates the Z-score for a GATC site at the corresponding genomic position on the x-axis. The height of the vertical line indicates the value of the Z-score. For those GATC sites with a Z-score less than 0, only the position of the GATC site is indicated (by a vertical line with the Z-score value of 0). The genes are indicated under the x-axis, with open boxes indicating genes on the forward strand and shaded boxes indicating genes on the reverse strand. (D) Similar to C, the Z-sore peak of a DSX-binding region near the gene shd is shown.
Expression of the Dam-fusion proteins was induced by heat shock of newly eclosed females (hspFLP; UAS-FRT-stop-FRT-Dam-[fusion] protein coding sequence) and the flies were allowed to recover at 25°C for 24 hours for Dam alone, and 96 hours for DSXF-Dam and R91Q-Dam (Fig. 1A). These times were chosen empirically as producing comparable levels of methylation in the alpha-tubulin84D region (an indicator of the whole genome methylation levels), because comparable levels of methylation in these samples are important to minimize the biases towards regions with lowest or highest accessibility in the samples when their methylation levels are compared. Using Yp1 as a positive control, we also validated the DamID approach using semi-quantitative PCR.
Ultra high-throughput sequencing was used to profile methylated GATC sites across the whole genome. To quantify the methylation level at each GATC site, the genomic DNA from whole flies was digested with DpnI, which cuts only methylated GATC sites, and processed to generate a library in which each element has two adapters flanking a 27 bp genomic sequence from a DpnI-digested end (Fig. 1B). The final product was amplified by PCR to facilitate visualization and purification. In theory, two 27 bp sequence tags will be generated from each GATC site independently of the methylation status of other GATC sites. In addition, because of the low number of PCR cycles (10-12 cycles) and the uniform size of the sequence tags (expected 27 bp, observed 27±1), little PCR bias should be introduced. Thus, the library can be subjected to ultra high-throughput sequencing to quantify the relative methylation level at each GATC site (see Fig. S1 in the supplementary material).
Using an Illumina Genome Analyzer sequencing platform, 3.1 million, 2.6 million and 3.5 million reads matching a unique position in the fly genome were generated for Dam alone, DSXF-Dam and R91Q-Dam, respectively. Individual reads that matched two or more genomic locations were discarded, as were reads that matched a location distant from a GATC site, which were probably generated from fragments produced by random DNA breaks. In general, over 85% of the unique reads matched a genomic GATC site. GATC sites in heterochromatic regions had significantly fewer unique reads (over 10-fold fewer compared with euchromatic regions, consistent with the expectation that the heterochromatic regions are generally less accessible), and were excluded from further analysis. These filtering steps resulted in read counts of 2.6 million, 2.4 million and 3.1 million for Dam alone, DSXF-Dam and R91Q-Dam, respectively. These translate into 7.4-9.6 times coverage of the GATC sites, and 84-89% of the GATC sites have at least one read mapped to them. A large proportion of the 11-16% unmapped GATC sites were common to the three samples, and in regions of repetitive sequence, so the reads for these sites were discarded due to multiple matches.
The read counts for each GATC site were compared between DSXF-Dam and R91Q-Dam by Z-score testing, which takes into account the difference and the absolute number of reads for the samples. Sites where methylation is significantly higher in DSXF-Dam than in R91Q-Dam often clustered together in peaks (Fig. 1C,D). We calculated the cumulative P values for all such peaks and chose 1.0e-5 as the cut off threshold for defining a DSX-binding region. This strategy yielded 1452 putative direct DSX-binding regions. To discern whether any of these are likely transcription factor colocalization hot spots, we compared the methylation profiles of R91Q-Dam with Dam alone. A total of 3456 peaks passed the 1.0e-5 cut off, and we consider the corresponding genomic regions to be hot spots. This represents approximately one hot spot every 35 kb, similar to the report of one hot spot per 50 kb in a 1 Mb region studied (Moorman et al., 2006). We re-assessed the DSX-binding peaks by the same Z-score-testing, but excluded GATC sites located in hot spots, resulting in 1278 DSX-binding regions in the genome.
We repeated our experiments using different biological samples for DSXF-Dam and R91Q-Dam (different transgenic lines were used for R91Q). The libraries of sequence tags were sequenced by a second generation Illumina Genome Analyzer II platform. About 11 million reads that map to unique GATC sites were generated for each sample, providing about 34 times the coverage of GATC sites. For Z-score comparison of the two samples, we chose a P value of 1.0e-10 as a cut off as that yielded 1460 peaks, which is similar to the number of peaks obtained from the first run (1452). We excluded the hot spots defined in the first run from the second run data, resulting in 1370 peaks. Six-hundred and fifty (~50%) of the peaks were common to both lists, and we consider these regions to be the more reliable in vivo DSX-binding regions (see Table S1 in the supplementary material).
In both runs, all four known direct DSX target genes have significant peaks in the regions where their previously demonstrated DSX-binding sites are located (Table 1), indicating that our approach is robust in identifying real DSX-binding regions.
We also compared the genes near the identified DSX-binding regions (see Table S1 in the supplementary material) with the lists of (1) genes showing sex-biased expression or (2) genes regulated by DSX (Ranz et al., 2003; Parisi et al., 2003; Arbeitman et al., 2004; Gnad and Parsch, 2006; Goldman and Arbeitman, 2007; Lebo et al., 2009). The only significant correlation is with DSX-regulated genes (Goldman and Arbeitman, 2007): 14/54 genes labeled as ‘DSX set’ are located near a DSX-binding region, which is significantly more frequent than the overall probability (0.05) of a random gene being found near a DSX-binding region (P=3.0e–7). There was no significant correlation between the sex-biased genes from the other studies and proximity to a DSX-binding region, suggesting that most of the sex-biased genes identified in these papers are probably indirect DSX targets.
Identification of a new consensus DSX-binding site
As the 13 bp consensus DSX-binding sequence derived from SELEX experiments (Erdman et al., 1996) has insufficient information to function as a full DSX-binding site in vivo, we searched the DSX-binding regions that we identified above to see whether we could refine the DSX-binding sequence. We chose 43 regions where the center GATC sites had higher Z scores and the distribution of GATC sites in the regions was relatively even and dense. We trimmed these regions to an average of 1.5 kb to remove as much extraneous sequence as possible without much risk of omitting the potential DSX-binding site. The trimmed regions were searched for enriched motifs using MEME (Bailey and Elkan, 1994). The sequence identified as most enriched is GCAACAATGTTGC (Fig. 2A), which resembles the SELEX-identified DSX-binding site, (G/A)nnAC(A/T)A(T/A)GTnn(C/T), except that it is a complete palindrome and the identity of the outside six bases is specific, suggesting that the these bases may contribute to interactions with DSX protein.
Identification of a new consensus DSX-binding sequence. (A) The sequence identified by MEME that is enriched in 43 DSX-binding regions (see Table S3 in the supplementary material) (E-value=3.4e-21). The overall height of each stack indicates the sequence consensus at that position (measured in bits), whereas the height of nucleotide symbols within the stack reflects the relative frequency of the nucleotides at that position. (B) Comparing the new consensus DSX-binding sequence with the previously in vitro identified DSX-binding sequence. A 33 bp sequence from the second intron of branchless (bnl) containing a perfect new DSX-binding site was used as a probe in an electrophoresis mobility shift assay. Two palindromic sequences with mutation at the outside 6 bp of the DSX-binding site were created and used as competitors. The sequences and the amount (fold over labeled probe) of competitors are shown in the table under the panel. Lower case letters indicate the mutated nucleotides.
Identification of a new consensus DSX-binding sequence. (A) The sequence identified by MEME that is enriched in 43 DSX-binding regions (see Table S3 in the supplementary material) (E-value=3.4e-21). The overall height of each stack indicates the sequence consensus at that position (measured in bits), whereas the height of nucleotide symbols within the stack reflects the relative frequency of the nucleotides at that position. (B) Comparing the new consensus DSX-binding sequence with the previously in vitro identified DSX-binding sequence. A 33 bp sequence from the second intron of branchless (bnl) containing a perfect new DSX-binding site was used as a probe in an electrophoresis mobility shift assay. Two palindromic sequences with mutation at the outside 6 bp of the DSX-binding site were created and used as competitors. The sequences and the amount (fold over labeled probe) of competitors are shown in the table under the panel. Lower case letters indicate the mutated nucleotides.
We analyzed the affinity of recombinant DSX protein for each of these sequences in vitro by electrophoretic mobility shift assay (EMSA) using a probe containing the new consensus DSX-binding site (GCAACAATGTTGC) and two competitor sequences that differ at the outside six bases (atgACAATGTcat and cgtACAATGTacg). As there are same amounts of protein and probe in all reactions, the decrease in the intensity for the shifted band reflects the relative affinity of the respective competitor to the DSX protein. The SELEX-identified DSX-binding site predicts that these three sequences will have similar affinities, whereas the new consensus DSX-binding site predicts that the first sequence will have higher affinity than the other two. We found that the second sequence (atgACAATGTcat) has about a threefold lower affinity compared with the new DSX consensus site, whereas the third sequence (cgtACAATGTacg) has an even lower affinity (Fig. 2B), suggesting that the new consensus sequence provides more information about the identity of an optimal DSX-binding site.
The Yp3 gene is a direct DSX target, and DSX binds a 177 bp fragment from the Yp3 promoter by EMSA (Hutson and Bownes, 2003). However, the SELEX-identified DSX-binding site could not be found. Alignment with the new consensus DSX-binding sequence identified a good candidate for the DSX-binding site in the Yp3 promoter, with only two mismatches out of the 13 bp sequence (see Fig. S2A in the supplementary material). By EMSA, this Yp3 promoter sequence was bound by DSXF protein (see Fig. S2B in the supplementary material), and competed away by unlabeled oligos of the same Yp3 sequence or oligos for the DSX A site in the Yp1 gene, and to a lesser extent, by oligos for the DSX B site in the Yp1 gene.
To elucidate the importance of each base pair in the new consensus DSX-binding site, we used EMSA to evaluate the change in DSX affinity for sites mutated at different positions. We found that (1) the sequence identity of each of the 13 bp contributes to the affinity of the DSX binding site, with the outside 6 bp sequence contributing to a lesser extent than the 7 bp core sequence; and (2) a single A/T is preferred at the center position (see Fig. S3 in the supplementary material).
The new consensus sequence is enriched in the Drosophila melanogaster genome and significantly associated with DSX-binding regions
Searching the whole genome for perfect matches to the new consensus sequence (hereafter referred to as the optimal DSX-binding site) revealed 58 matches (Table 2), well in excess of the 3 (±2) matches expected by chance assuming the genome sequence is random and adjusting for a 42.5±2.5% GC content (Jabbari and Bernardi, 2004). To test whether the optimal DSX-binding site is significantly enriched in the genome compared with similar sequences, we permuted the outside 6 bp of the consensus sequence in such a way that the percentage of GC remained unchanged and the resulting sequence was palindromic. There are 24 such sequences including the optimal DSX-binding site (Fig. 3A). The copy number for the optimal DSX-binding site was significantly greater than the copy number for each of the other 23 sequences by Grubb's test (P=1.24e-12 for one-tailed test) (Fig. 3B). The largest copy number for the other 23 sequences, which is 12, is not significantly different from the other 22 sequences (P=0.22 for a one-tailed test). Excluding the optimal DSX-binding site, the average copy number for the other 23 sequences is five, with a standard deviation of three; this is close to the expected number assuming that the presence of a sequence in the genome is random.
Determination of whether perfect matches to each of these 24 sequences are located near one of the 650 DSX-binding regions (including the 43 used by MEME) revealed that the optimal DSX-binding site is significantly associated with DSX-binding regions (Fig. 3, P=2.1e-22 compared with a random 13 bp sequence in the genome which has a probability of 0.019 of being within 0.5 kb of a DSX-binding region). By contrast, the other 23 palindromic sequences are not significantly associated with DSX-binding regions. Therefore, the optimal DSX-binding site is both significantly enriched in the genome and significantly associated with the identified DSX-binding regions.
The new DSX consensus sequence is significantly enriched in the genome of Drosophila melanogaster. (A) List of palindromic sequences created by switching the position at the outside 6 bp of the new DSX consensus binding sequence (column 1, altered positions are shown in red), and their respective copy numbers in the whole genome (column 2), as well as the number of the copies that are near a DSX-binding region (column 3). (B) Scatter plot of the copy numbers for sequences listed in A. The copy number for the DSX consensus sequence (in orange) is significantly greater than those numbers for the rest sequences (P=1.24e-12 for one-tailed test).
The new DSX consensus sequence is significantly enriched in the genome of Drosophila melanogaster. (A) List of palindromic sequences created by switching the position at the outside 6 bp of the new DSX consensus binding sequence (column 1, altered positions are shown in red), and their respective copy numbers in the whole genome (column 2), as well as the number of the copies that are near a DSX-binding region (column 3). (B) Scatter plot of the copy numbers for sequences listed in A. The copy number for the DSX consensus sequence (in orange) is significantly greater than those numbers for the rest sequences (P=1.24e-12 for one-tailed test).
Is this optimal DSX-binding sequence sufficient to predict an in vivo DSX-binding site? Twenty-five of the 58 optimal DSX-binding sites in the genome are associated with a DSX-binding region (27 for the first run, 31 for the second run and 25 in common, corresponding to 23 genes; see Table 2). Thus, the optimal DSX-binding sequence has at least a 40% chance of predicting an in vivo DSX-binding site, whereas the other 23 palindromic sequences do so no better than a random sequence. That not all 58 optimal DSX-binding sites are associated with a DSX-binding region may reflect that: (1) required co-factor(s) for DSX are not present at the time point of our experiment; or (2) the DSX-binding site is located in a GATC-sparse region and/or a region without adequate sequencing coverage because of low chromatin accessibility and therefore cannot be detected by DamID. Alternatively, the sequence itself may not be sufficient to define a DSX-binding site in vivo.
Other DSX-binding sites
Although the 13 bp optimal DSX-binding sequence is frequently associated with direct DSX targets, mismatches are allowed in the sequences through which DSX functions in vivo. The identified DSX sites in the bab1 gene (Williams et al., 2008) match well with the optimal DSX-binding site (13/13 and 12/13 bp) and are within the DSX-binding region we identified for bab1. We found several additional DSX-binding regions within the bab1 and bab2 intergenic region, two of which contain an optimal DSX-binding sequence, the functional roles of which are not known. The DSX protein regulates the female-specific expression of Fad2 by binding to a sequence (GCAACAATGTatg; 10/13 bp; matches in capitals) located 272 bp upstream of the Fad2 transcription start site. Finally, in the case of the Yp1 and Yp2 intergenic region, there is a DSX-binding site A (aCtACAATGTTGC) and a weaker DSX-binding site B (cCtACAAaGTgat), which have 11/13 and 7/13 matches to the optimal DSX-binding site, respectively.
Given that particular sequences with 11/13 and 12/13 matches to the optimal DSX-binding sequence function as DSX-binding sites in vivo, we asked how many such sequences were observed in the genome and how many were expected, assuming the genomic sequence was random. For 12/13 matches, the number observed (541) was four times the number expected (135). For 11/13 matches, the number of observed (4532) and expected (2310) were less different. Of the sequences with 12/13 matches to the optimal binding site, 14% were associated with a DSX-binding region, whereas only 1.9% would be expected at random. In addition, the numbers of copies of individual sequences with 12/13 bp matches to the optimal DSX-binding site were nonrandom, with some sequences having significantly more copies than expected. However, we could not discern any correlation between the number of copies of individual sequences and the likelihood that a sequence was associated with a DSX-binding region. Thus, although additional information probably determines whether 12/13 and 11/13 matches function as DSX-binding sites in vivo, the nature of that information is currently unclear.
The enrichment of the optimal DSX-binding sequence is evolutionarily conserved across the 12 Drosophila species
To determine whether the enrichment of the optimal DSX-binding site was conserved in other Drosophila species, across which the DNA-binding domain of DSX is very conserved (see Fig. S4 in the supplementary material), we examined the number of copies of the optimal DSX-binding sequence, as well as the other 23 related palindromic sequences, in the other 11 sequenced Drosophila genomes. The copy number for the optimal DSX-binding sequence is always the highest in each Drosophila genome (Fig. 4 and see Fig. S5 in the supplementary material) and stands out as the sole significant outlier by Grubb's test (all P<8e-6, on a two-tailed test).
We next examined whether the association of an optimal DSX-binding site with a particular gene is conserved for the 23 D. melanogaster genes that we had identified as direct DSX targets based on their being both associated with a DSX-binding region and having an optimal DSX-binding site (Table 2). We determined whether one or more potential DSX-binding site (13/13, 12/13 or 11/13 matches) is associated with each of these 23 genes in the other 11 sequenced Drosophila species (four examples are in Fig. 5; the rest are in Fig. S6 in the supplementary material). The association of genes with an optimal DSX-binding site is seen for 23-35% of the 23 genes, suggesting the possibility that these genes may function in sex-specific differentiation in all these species. Thus, five of these genes are associated with either at least one copy of the optimal DSX-binding sites, or two or more copies of a 12/13 match to the optimal DSX-binding site in all 11 species for which the orthologous gene can be found from FlyBase annotation or a Blast search. For three additional genes, an optimal DSX-binding site is found associated with the gene in 10 species and a 12/13 match to this site found in one species. Similarly, within the four melanogaster subgroup species examined, 65-90% of the 23 genes show an association with an optimal DSX-binding site, suggesting that these genes may function in sex-specific differentiation in all these species. Thus, 15/23 genes were associated with at least one optimal DSX-binding site in all four species, and an additional six genes associated with either an optimal DSX-binding site or a sequence that is a 12/13 match to that binding site.
The enrichment of the optimal DSX-binding sequence is conserved in other insects. The genome-wide copy numbers for the optimal D. melanogaster DSX-binding sequence and the other 23 palindromic sequences (listed in Fig. 4A) were normalized by the copy number expected from randomness, and plotted for each insect species. The copy number for the optimal D. melanogaster DSX-binding sequence is indicated in orange for the first 15 species. The copy number for sequence aCgACAATGTcGt in the Culex pipiens genome is indicated in purple. The copy number for sequence cgtACAATGTacg in the Bombyx mori genome is indicated in red. The mean±s.e.m. for each species is indicated.
The enrichment of the optimal DSX-binding sequence is conserved in other insects. The genome-wide copy numbers for the optimal D. melanogaster DSX-binding sequence and the other 23 palindromic sequences (listed in Fig. 4A) were normalized by the copy number expected from randomness, and plotted for each insect species. The copy number for the optimal D. melanogaster DSX-binding sequence is indicated in orange for the first 15 species. The copy number for sequence aCgACAATGTcGt in the Culex pipiens genome is indicated in purple. The copy number for sequence cgtACAATGTacg in the Bombyx mori genome is indicated in red. The mean±s.e.m. for each species is indicated.
The pattern of enrichment of the optimal DSX-binding sequence in other insects
Extending this analysis to other insect species revealed the following. In two mosquito species, Aedes aegypti and Anopheles gambiae, similar to what is observed in the 12 Drosophila genomes, the D. melanogaster optimal DSX-binding site is found at both a significantly higher copy number than the 23 control sequences and is 26- or 34-fold over expected, respectively (Fig. 4). In another mosquito species, Culex pipiens, the copy number of the optimal DSX-binding site is the second highest and is 10-fold more than random expectation. The highest number of perfect matches in the C. pipiens genome is to the related sequence aCgACAATGTcGt (Fig. 4). In the silkworm Bombyx mori, which is further removed from Drosophila, the D. melanogaster optimal DSX-binding site copy number is only twofold more than random expectation and is not an outlier among the 24 palindromic sequences. Instead, the related sequence cgtACAATGTacg is the sole outlier, and is 16-fold more prevalent than expected. None of these 24 sequences showed up as an outlier (P=0.01 cut off) in the more distantly related insect species examined (Tribolium castaneum, Apis mellifera, Nasonia vitripennis, Acyrthosiphon pisum and Pediculus humanus corporis) (Fig. 4). Taken together, these observations suggest that a global evolution of the optimal DSX-binding sequence has occurred within the insects.
DISCUSSION
A modified DamID approach identified a collection of ~650 potential direct DSX target genes. The subset of 23 of these genes shown to be associated with evolutionarily conserved optimal DSX-binding sites seem extremely likely to be direct DSX targets, substantially increasing the number of identified direct DSX targets. Below, we discuss some salient aspects of our experimental approach and findings.
The advantage of profiling methylated GATC sites using ultra high-throughput sequencing
We have modified the DamID technique to take advantage of next-generation sequencing. Previous whole-genome DamID experiments largely used array-based methods (van Steensel and Henikoff, 2000; van Steensel et al., 2001; Orian et al., 2003; Bianchi-Frias et al., 2004; Song et al., 2004; de Wit et al., 2005; Reddy et al., 2008; Pindyurin et al., 2007). In these approaches, the methylated fragments are isolated by PCR-based methods, and therefore are subjected to PCR bias, although efforts were made to minimize such bias. Still, the recovery of a fragment depended on the GATC sites at both ends being methylated, and large fragments produced from GATC-poor regions would be less efficiently amplified. In our approach, each GATC site is evaluated independently. Two sequence tags are generated from each methylated GATC site, and this does not depend on the methylation status of nearby GATC sites. In addition, there are only minimal PCR cycles (<13) employed during library construction and the sizes of PCR products are very uniform (125±1 bp); hence, there is minimal PCR bias.
In previous reports where cDNA arrays were used, only binding regions close to exons could be identified. As transcription factor-binding sites are often located in the intergenic regions or in introns, using a cDNA array would probably miss many binding regions. Using tilling arrays may avoid the problem of missing target regions distant from genes, but tilling arrays are limited by the fraction of the genome that can be put on a chip. Moreover, the information is harder to compare between different versions of arrays. To avoid these pitfalls, we used ultra high-throughput sequencing to generate digitized actual sequences that can be directly matched to the genome and can be easily compared with results from different experiments. Furthermore, the array-based methods are more easily saturated, thus resulting in false-negatives, whereas the ultra high-throughput sequencing in theory has an infinite linear range.
The evolutionary change of the association of DSX-binding site with genes. (A-D) Four examples of the association of potential DSX-binding sites in the 12 sequenced Drosophila species with the 23 genes that are identified in D. melanogaster as likely direct DSX targets. For each gene, the phylogenetic tree of the 12 Drosophila species is illustrated. The maximum number of base pairs matched to the 13 bp optimal DSX site by any 13 bp sequence in the gene and flanking regions is listed for each species. For the 12/13 matches, the actual sequences (with the mis-matched base pair in red) are listed instead. N.A. indicates that there is no sequence available for analysis.
The evolutionary change of the association of DSX-binding site with genes. (A-D) Four examples of the association of potential DSX-binding sites in the 12 sequenced Drosophila species with the 23 genes that are identified in D. melanogaster as likely direct DSX targets. For each gene, the phylogenetic tree of the 12 Drosophila species is illustrated. The maximum number of base pairs matched to the 13 bp optimal DSX site by any 13 bp sequence in the gene and flanking regions is listed for each species. For the 12/13 matches, the actual sequences (with the mis-matched base pair in red) are listed instead. N.A. indicates that there is no sequence available for analysis.
The DamID approach might lack tissue specificity
We found that for a binding region to be identified, the corresponding target gene does not need to be in a transcriptionally active state when the Dam-fusion protein experiment is carried out. In the case of the Yp1 DSX-binding region, we observed a three- to ninefold higher normalized methylation level for the DSXF-Dam sample versus Dam alone sample in third instar larvae, even though the Yp1 gene is not transcribed until after eclosion, some 4 days later.
There are two possibilities for this lack of temporal specificity. One is that it is an intrinsic feature of the DamID approach. It has been suggested that transcription factors are always in a dynamic mode of on and off their binding sites (e.g. Hager et al., 2009). The Dam-fusion protein might be highly efficient in methylating the GATC sites around its binding sites, so that even transient binding of the protein is sufficient to generate methylation. Alternatively, the DSX-binding sites may be readily recognizable by the DSX protein either because of their sequences per se or because of other factors associated with the sites are already in position, so that when the DSX-Dam protein is expressed at an earlier stage, it binds to these pre-arranged binding sites.
Implications for the evolution of sex
Our findings contribute at several levels to an understanding of the evolution of sex. Considerations of where in developmental regulatory hierarchies the mutations underlying evolutionary changes are most likely has led to the proposal that such changes are more likely to be in the cis-regulatory elements of the targets of the transcription factors encoded by the terminal regulatory genes in a hierarchy, than in the coding sequences of the terminal regulatory genes themselves (Tautz, 2000; Carroll, 2000).
With respect to the sex hierarchy, the limited data previously available are consistent with this proposal. Thus, in the DSX DNA-binding domain (DM domain), the Zn module, as well as the N-terminal region of the tail (amino acids 78-97) immediately following the Zn module [which has been proposed to be the functionally important part of a recognition helix providing the binding specificity (Zhu et al., 2000)], are highly conserved among these Drosophila species (see Fig. S4 in the supplementary material). Of the four previously identified direct DSX targets, only two (bab and Fad2) exhibit changes in their sex-specific expression patterns within the Drosophila genus and in both cases these changes in the sex specificity of their expression are correlated with changes in adjacent DSX-binding sites (Williams et al., 2008; Shirangi et al., 2009).
The 23 genes we identified as very probable direct DSX targets provide a broader data set with which to examine this proposal. Within the four other sequenced members of the melanogaster subgroup species, 21/23 of these genes are also associated with putative DSX-binding sites (13/13 or 12/13 matches) in all four species, whereas 2/23 genes are not associated with close matches to the optimal DSX-binding site in at least one species, suggesting that the latter may be cases in which the sex-specificity of expression of a gene has changed via an alteration of its cis-acting DSX-binding sequence. When the data from these 23 genes were examined across all 12 sequenced Drosophila species, a similar result was seen: in 23-35% of the cases an association between the gene and a DSX-binding site is retained in all species, while for 65-72% of these genes, a recognizable DSX-binding site is not present in one or more of these species. These findings with respect to the latter group of genes are consistent with a potential change in the sex specificity of expression as a consequence of a change in the adjacent DSX-binding site.
A second theme in modern considerations of the evolution of sex is that sex evolves rapidly when compared with other developmental processes (Schütt and Nöthiger, 2000). This view derives in part from the wide diversity in sex determination mechanisms used in animals, and findings that the primary sex determination mechanisms can differ in closely related species (Schütt and Nöthiger, 2000; Bull, 1985; Gempe and Beye, 2011). Furthermore, and probably independently, at a developmental level evidence for the rapid evolution of sex comes directly from the myriad of secondary sexual characteristics that frequently distinguish closely related species, which otherwise have nearly identical body plans (Darwin, 1871). Our data, cited immediately above, provides a relatively broad look at the molecular genetic level at the frequencies with which there are potential changes between sex-specific and sex-nonspecific patterns of expression of downstream effector genes. Consistent with relatively rapid evolutionary changes in the sex-specificity of expression of some of the downstream genes in the Drosophila sex hierarchy, we found that ~70% of the 23 D. melanogaster genes associated with DSX-binding sites are not associated with such a site in one or more of the other 11 sequenced Drosophila species.
It is also important to note that these data, in addition to pinpointing a set of genes whose roles in sexual development may have changed during the evolution of this genus, also identify another subset of these genes, comprising ~30% of the putative direct DSX targets in D. melanogaster, that retain an association with putative DSX-binding sites in all 12 Drosophila species examined. The latter finding suggests that a significant subset of the genes directly regulated by DSX have functions in aspects of sexual development that are evolutionarily relatively stable. Such evolutionarily conserved aspect of maleness or femaleness might encompass, for example, somatic functions that are necessary for sex-specific germline functions or gametogenesis.
Extending our analysis of the optimal DSX-binding site and the related 13 bp palindromic sequences beyond the Drosophila genus provided additional evolutionary insights. In two mosquito species, Aedes aegypti and Anopheles gambiae, enrichment in the optimal DSX-binding site is also observed. However, in the mosquito species Culex pipiens, although the number of optimal DSX-binding sites is still 10-fold more than expected, the related sequence aCgACAATGTcGt had the highest number of perfect matches. Further analysis found that this latter sequence is within a repetitive element of the C. pipiens genome, so we infer that GCAACAATGTTGC is most probably the optimal DSX site in this species. The enriched sequence, cgtACAATGTacg, in the silkworm Bombyx mori is also within a repetitive element. None of the 24 sequences are enriched for the more distantly related insect species (Tribolium castaneum, Apis mellifera, Nasonia vitripennis, Acyrthosiphon pisum, and Pediculus humanus corporis) examined. Thus, in insects outside of the Diptera, the DSX proteins might have diverged sufficiently that they bind to sequences different than the 24 examined.
Extending approaches such as those described here may provide significant insights into how a DNA-binding domain and its targets have evolved to maintain control of a fundamental developmental process across a broad evolutionary timescale.
Acknowledgements
We thank the members of the Baker and Sidow labs for helpful discussions and comments. We thank Dr Bas van Steensel and Dr Steven Henikoff for providing the pNDamMyc and pCMycDam plasmids, Dr Ziming Weng for helping with the high-throughput sequencing, Phil Lacroute for initial sequence mapping, Drs Mike Weiss, Loren Looger and Sean Eddy for helpful discussions, and Guennet Bohm for preparing fly food. This work is supported by NIGMS and the Howard Hughes Medical Institute. This study was also facilitated by information from FlyBase (www.flybase.org). Deposited in PMC for release after 6 months.
References
Competing interests statement
The authors declare no competing financial interests.