The number of long noncoding RNAs (lncRNAs) with characterized developmental and cellular functions continues to increase, but our understanding of the molecular mechanisms underlying lncRNA functions, and how they are dictated by RNA sequences, remains limited. Relatively short, conserved sequence motifs embedded in lncRNA transcripts are often important determinants of lncRNA localization, stability and interactions. Identifying such RNA motifs remains challenging due to the substantial length of lncRNA transcripts and the rapid evolutionary turnover of lncRNA sequences. Nevertheless, the recent discovery of specific RNA elements, together with their experimental interrogation, has enabled the first step in classifying heterogeneous lncRNAs into sub-groups with similar molecular mechanisms and functions. In this Review, we focus on lncRNAs with roles in development, cell differentiation and normal physiology in vertebrates, and we discuss the sequence elements defining their functions. We also summarize progress on the discovery of regulatory RNA sequence elements, as well as their molecular functions and interaction partners.
Thousands of genomic loci are transcribed into so-called long noncoding RNAs (lncRNAs) (Rinn and Chang, 2012). A fraction of these has been demonstrated to regulate a range of important biological processes, such as development and cell differentiation (Mallory and Shkumatava, 2015; Perry and Ulitsky, 2016; Ulitsky and Bartel, 2013). LncRNAs have also been linked to human diseases, including nearly every major cancer type as well as various neurological disorders (Bian and Sun, 2011; Esposito et al., 2019; Gutschner and Diederichs, 2012; Huarte, 2015; Li et al., 2018). Although the annotation of lncRNAs and the identification of their biological functions have progressed over the past decade, lncRNAs are still poorly understood at the molecular level, partly because of the diversity of their modes of action and a lack of suitable molecular tools to dissect them.
Three sub-groups of lncRNAs can be distinguished based on their molecular mechanisms of action: (1) lncRNA loci that contain enhancers regulating gene expression; (2) lncRNA loci for which the act of transcription but not the transcript per se is important for regulating neighbouring genes; and (3) lncRNA loci that carry out their cellular functions via lncRNA transcripts that interact with DNA, other RNAs and proteins (Marchese et al., 2017; Wang and Chang, 2011). Notably, some lncRNAs may belong to more than one of these sub-groups. The lncRNAs in the final group share biogenesis pathways and post-transcriptional features with mRNAs, i.e. they are transcribed by RNA polymerase II and are capped, spliced and polyadenylated (Cabili et al., 2011; Guttman and Rinn, 2012). For these lncRNAs, the linear transcript sequence often defines function. Comparative analyses of lncRNAs in different vertebrate species have shown that lncRNA sequences undergo rapid evolutionary turnover, resulting in their overall poor conservation (Hezroni et al., 2015; Kutter et al., 2012; Necsulea et al., 2014; Ulitsky et al., 2011; Washietl et al., 2014). Nonetheless, signatures of primary sequence conservation have been detected in ∼100 lncRNAs conserved from mammals to fish and in ∼1000 lncRNAs conserved among mammals (Hezroni et al., 2015). For the majority of conserved lncRNAs, only short stretches of sequence are retained (Hezroni et al., 2015; Ulitsky et al., 2011). These short sequences may represent functional elements of lncRNA transcripts that facilitate interactions with other RNAs, proteins or genomic loci.
In addition to their linear RNA sequences, the secondary and tertiary structures of lncRNA transcripts have been proposed to be functional determinants (Diederichs, 2014; Guttman and Rinn, 2012; Wutz et al., 2002). Several functionally conserved RNA structure examples support this idea (Holmes et al., 2020; Uroda et al., 2019; Zhang et al., 2010). Moreover, recently developed technologies for determining RNA structures in vivo have demonstrated that RNA structures can be engaged in important long-range interactions (Lu et al., 2016; Ziv et al., 2018). However, it is unclear how prevalent the impact of RNA structure on lncRNA functionality is (Rivas et al., 2017). Furthermore, the unbiased analysis of sequence and context preferences for multiple human RNA-binding proteins (RBPs), which are important drivers of lncRNA function, has demonstrated that RBPs bind short, often low-complexity, RNA motifs (Dominguez et al., 2018). RBP-binding specificity is further supported by additional features, such as local RNA secondary structure and flanking nucleotides (Dominguez et al., 2018), suggesting that the functional elements controlling lncRNAs reside in their sequence and adjacent secondary structure motifs, rather than in their global tertiary structure.
In this Review, we discuss a set of vertebrate lncRNAs with characterized developmental, cell biological and physiological functions defined by specific lncRNA sequence elements (Table 1). We discuss how such sequence elements and motifs regulate the biogenesis, localization and interactions of lncRNAs. In addition, we survey recent efforts to experimentally identify functional lncRNA motifs and highlight computational algorithms that have been developed to predict these functions based on lncRNA sequences.
The regulation of lncRNA biogenesis and processing by RNA motifs
To generate functional noncoding transcripts, tight regulation of lncRNA biogenesis is required. Although the majority of mature lncRNA transcripts are generated by the same biogenesis pathways as those regulating mRNAs, two particular lncRNA transcripts – MALAT1 (metastasis-associated lung adenocarcinoma transcript 1) and NEAT1 (nuclear enriched abundant transcript 1) – are processed in a manner that involves conserved motifs (Wilusz et al., 2012) (Fig. 1A).
Both MALAT1 and NEAT1_2 (the long isoform of NEAT1) are structural components of nuclear bodies, with MALAT1 localizing to speckles and NEAT1_2 to paraspeckles (Clemson et al., 2009; Hutchinson et al., 2007; Nakagawa et al., 2011; Sasaki et al., 2009; Sunwoo et al., 2009). Whereas NEAT1_2 is essential for paraspeckle formation (Clemson et al., 2009; Mao et al., 2011; Naganuma and Hirose, 2013; Naganuma et al., 2012; Sasaki et al., 2009), depletion of MALAT1 does not noticeably affect speckles (Nakagawa et al., 2012; Zhang et al., 2012). At the organismal level, Neat1 is required for normal mammary gland development, lactation and corpus luteum formation during the establishment of pregnancy (Nakagawa et al., 2014; Standaert et al., 2014). By contrast, Malat1 does not appear to be required for normal development, as mouse and zebrafish null alleles of Malat1 do not exhibit any apparent morphological defects (Eissmann et al., 2012; Lavalou et al., 2019; Nakagawa et al., 2012; Zhang et al., 2012). However, in-depth analysis of one of the mouse null alleles has shown retinal vascularization defects (Michalik et al., 2014). In addition, there is growing evidence that Malat1 plays an essential role in disease, affecting cancer progression and metastasis (Arun and Spector, 2019).
During their biogenesis, MALAT1 and NEAT1 transcripts are processed by an unusual RNase P cleavage step directed by structured tRNA-like motifs (Wilusz et al., 2012). Upon RNase P cleavage, MALAT1, which is conserved from fish to human (Ulitsky et al., 2011), is stabilized by a highly conserved triple helical structure located immediately upstream of the RNase P cleavage site; this event prevents exonucleolytic degradation from the 3′ end of the transcript (Brown et al., 2012; Wilusz et al., 2012) (Fig. 1A). The mammalian-specific NEAT1 locus encodes two lncRNA isoforms generated from the same promoter. Although the short isoform, termed NEAT1_1 (3.7 kb in humans), is produced by canonical cleavage and polyadenylation using an upstream polyadenylation signal, the long isoform, termed NEAT1_2 (23 kb in humans), is processed by RNase P cleavage and stabilized by a triple helical structure, similar to that regulating MALAT1 biogenesis (Fig. 1A). In both cases, the 3′-end stabilizing triple helical structure is generated by conserved motifs of ∼10 nucleotides (nt) in length – two U-rich motifs and an A-rich tract (Brown et al., 2014, 2012; Wilusz et al., 2008, 2012) (Fig. 1A). This biogenesis mechanism remains a unique feature of MALAT1 and NEAT1, as no additional lncRNA transcripts processed by RNase P cleavage and harbouring similar triple-helix forming motifs have been identified so far.
Motifs and sequences that control the subcellular localization of lncRNA transcripts
The subcellular localization of lncRNAs is also tightly regulated and is an important determinant of lncRNA function (Chen, 2016; Guo et al., 2020). In the nucleus, lncRNAs regulate chromatin organization, transcription, RNA maturation and nuclear protein activity (Sun et al., 2018; Wang and Chang, 2011). In the cytoplasm, by contrast, lncRNAs interact with microRNAs (miRNAs) and regulate translation efficiency and cytoplasmic protein activity (Noh et al., 2018; Paraskevopoulou and Hatzigeorgiou, 2016). Notably, lncRNAs tend to be less enriched in the cytoplasm compared with mRNAs (Derrien et al., 2012; Mukherjee et al., 2017). Specifically, for lncRNAs whose functions are linked to activities in the nucleus, repression of their export to the cytoplasm is required. Many nuclear lncRNAs have been studied in detail and specific RNA elements that facilitate their binding to proteins defining nuclear localization have been determined. Below, we highlight examples that illustrate the variety of different RNA elements and molecular mechanisms directing the nuclear enrichment of lncRNAs.
The nuclear localization of MALAT1 can be explained in part by the single-exon structure of its transcript, which does not engage with exon junction complexes promoting nuclear export (Hutchinson et al., 2007). In addition, two distinct sequence elements have been found to promote the specific enrichment of MALAT1 in nuclear speckles (Miyagawa et al., 2012). These nuclear retention sequences of ∼600 nt are bound in vitro by the nuclear speckle protein RNPS1. Depletion of RNPS1 and two other key nuclear speckle proteins, SRM160 and IBP160, leads to diffusion of MALAT1 into the nucleoplasm, suggesting that these proteins may drive the nuclear retention of MALAT1. However, sequence and structure conservation analyses did not detect any common motifs within the two MALAT1 nuclear localization elements (Miyagawa et al., 2012).
In contrast to MALAT1, the lncRNA Firre (functional intergenic repeating RNA element) is a spliced and polyadenylated nuclear transcript, the knockout of which leads to cell-specific defects in hematopoietic populations in mice (Lewandowski et al., 2019). Although Firre is syntenically conserved in almost all eutherians (Hezroni et al., 2015), its sequence, which comprises multiple repetitive elements of ∼150 nt termed repeating RNA domains (RRDs), has diverged between primate and rodent lineages (Hacisuleyman et al., 2016). Despite this divergence, which leaves only limited sequence similarity between primate and rodent RRD elements, orthologous RRDs share the same function and act as strong nuclear retention signals (Hacisuleyman et al., 2014, 2016). RRDs drive nuclear retention of the Firre transcript and are sufficient to localize an otherwise cytoplasmic mRNA to the nucleus when added to the 3′ end of this mRNA (Hacisuleyman et al., 2016). Mechanistically, it has been proposed that RRD elements interact with the nuclear matrix protein HNRNPU, which ensures correct localization of RNAs to the nucleus (Hacisuleyman et al., 2014, 2016).
The nuclear retention of the lncRNA MEG3 is also regulated by a linear sequence motif of 356 nt (Azam et al., 2019). Deletion of this nuclear retention element (NRE), which is bound by several components of U1 snRNP, effectively relocates MEG3 reporter transcripts from the nucleus to the cytoplasm (Azam et al., 2019). By contrast, the function of MEG3, which acts as a tumour suppressor that stimulates the p53 pathway and controls the cell cycle, is mediated by a tertiary structure that is conserved in mammals (Uroda et al., 2019).
The spliced and polyadenylated lncRNA BORG regulates BMP-induced differentiation of C2C12 cells into osteoblastic cells and is exclusively localized to the nucleus (Takeda et al., 1998). The nuclear localization of BORG is mediated by a short AGCCC RNA motif. Although the mechanism of nuclear retention controlled by this RNA motif remains unknown, the motif was found to be present in multiple other nuclear lncRNAs as well as in protein-coding RNAs, indicating that nuclear localization motifs are shared between noncoding and coding transcripts (Zhang et al., 2014).
The 3′UTR of CTN-RNA, a polyadenylated transcript that localizes to paraspeckles under regular cellular conditions, harbours another type of NRE (Prasanth et al., 2005). This NRE consists of a ∼100 nt forward repeat and three inverted repeats originating from SINE retro-elements. Each of the inverted repeats can form a stem loop with the forward repeat. The stem loop serves as a target site for ADAR enzymes to catalyse adenosine-to-inosine (A-to-I) RNA editing (Prasanth et al., 2005). It has been proposed that the paraspeckle protein p54nrb interacts with CTN-RNA at I-edited residues, triggering retention of the transcript in the nucleus. Under cellular stress conditions, CTN-RNA is released from the nucleus by cleavage of its edited 3′UTR, generating an mRNA that contains a 5′UTR and a coding region. This cleaved transcript is then transported to the cytoplasm, where it is translated into the mCAT2 protein that modulates cellular uptake of L-arginine (Prasanth et al., 2005), demonstrating that tight regulation of CTN-RNA subcellular localization is essential during homeostatic and stress conditions.
From the few examples presented here, it is evident that the repertoire of RNA nuclear retention motifs and their associated interactors far exceeds the number of known protein nuclear localization signal motifs, presenting a major challenge in their identification. It remains to be seen whether individual transcripts have adopted distinct nuclear retention mechanisms or whether common sets of lncRNA sequence elements execute their retention in the nucleus. High-throughput approaches (discussed below) have the potential to find common sets of motifs with similar functions and group them into mechanistically defined sub-classes.
Unbiased identification of lncRNA subcellular localization motifs
Identification of the aforementioned subcellular localization elements has been restricted to individual lncRNA and mRNA transcripts, and has been achieved using reporter constructs and classical genetics. However, with recent advances in comparative genomics, data analysis and systematic experimental approaches, several computational and high-throughput experimental strategies have been applied to unbiasedly decipher sequence elements dictating RNA localization.
To identify RNA motifs that drive nuclear localization, several independent studies have used a tiling-based strategy. In this approach, pools of thousands of tiled oligonucleotides covering human lncRNAs and selected 3′UTRs are evaluated for their ability to retain an otherwise cytoplasmic reporter transcript in the nucleus. The Ulitsky laboratory constructed a library of ∼5500 109-mers that tiled exons of 37 human lncRNAs and 13 3′UTRs of mRNAs enriched in the nucleus. The library was cloned into the 5′ and 3′UTRs of a cytoplasmic GFP reporter and transfected into human cells; RNA from cytoplasmic and nuclear fractions was then sequenced and analysed for enrichment of sequence motifs. A 42 nt C-rich sequence element derived from an Alu-repeat named SIRLOIN was found to drive nuclear localization of a set of lncRNAs and mRNAs by interacting with the protein HNRNPK (Lubelsky and Ulitsky, 2018). In parallel, the Rinn laboratory used a similar tiling array-reporter method to identify cis elements driving nuclear retention (Shukla et al., 2018). The authors generated a pool of ∼12,000 110-mers tightly tiled across 38 lncRNAs with different subcellular localization patterns. Similarly, the oligonucleotides were fused to a cytoplasmic reporter transcript and the reporter-oligonucleotide pool was transfected into human cells, which were subjected to subcellular fractionation and RNA-seq (Shukla et al., 2018). This study identified 109 RNA elements, including a 15 nt C-rich motif, that enriched the cytoplasmic reporter in the nucleus. Interestingly, the identified C-rich motif shows sequence similarities to the SIRLOIN motif (Lubelsky and Ulitsky, 2018; Shukla et al., 2018). However, single-molecule in situ hybridization analyses showed that this short C-rich region alone is not sufficient to drive nuclear localization and requires a larger RNA element to retain the reporter transcript in the nucleus (Shukla et al., 2018). 
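The two computational steps implied by the tiling strategy, splitting transcripts into overlapping oligonucleotides and scoring each tile by its nuclear versus cytoplasmic read counts, can be sketched as follows. This is a minimal illustration, not the pipeline used in either study: the function names, the `step` offset and the pseudocount are assumptions introduced here, and library-size normalization and replicate handling are omitted.

```python
import math

def tile_sequence(seq, tile_len=109, step=10):
    """Return (start, tile) pairs covering seq with overlapping windows.

    tile_len follows the 109-mer design mentioned in the text; step is a
    hypothetical offset, because the tiling density is not stated there.
    """
    if len(seq) <= tile_len:
        return [(0, seq)]
    starts = list(range(0, len(seq) - tile_len + 1, step))
    # Add a final tile flush with the 3' end so that no bases are left uncovered.
    if starts[-1] != len(seq) - tile_len:
        starts.append(len(seq) - tile_len)
    return [(s, seq[s:s + tile_len]) for s in starts]

def nuclear_enrichment(nuc_count, cyto_count, pseudocount=1.0):
    """log2 ratio of nuclear to cytoplasmic read counts for one tile.

    The pseudocount (an assumption) avoids division by zero for tiles
    that are absent from one fraction.
    """
    return math.log2((nuc_count + pseudocount) / (cyto_count + pseudocount))
```

Tiles with a consistently positive score across replicates would be candidates for nuclear retention elements; in practice, counts would first be normalized to sequencing depth in each fraction.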
Furthermore, the Shen team used a tiling-based approach to search for lncRNA sequence elements that can mediate chromatin enrichment and identified a short 7 nt motif that is essential for reporter RNA localization to chromatin. It has been proposed that this motif is recognized by U1 snRNP to tether lncRNAs to chromatin (Yin et al., 2020).
In addition to the systematic experimental approaches described above, several computational approaches have been developed to evaluate the subcellular localization of lncRNAs based on their sequences. One approach generated a map of transposable element (TE) insertions in lncRNAs and found that a set of evolutionarily conserved TEs has been functionalized as nuclear retention elements (Carlevaro-Fita et al., 2019). Specifically, this study identified four TEs that significantly correlate with lncRNA nuclear enrichment, as well as a set of GC-rich TEs that correlates with lncRNA cytoplasmic localization (Carlevaro-Fita et al., 2019). In addition, efficient splicing has been found to be one of the main predictive factors for lncRNA cytoplasmic localization, whereas inefficient splicing strongly correlates with nuclear enrichment (Zuckerman and Ulitsky, 2019). Moreover, by training computational algorithms to compare sequences of lncRNAs with known localization and detecting a link between specific sequence motifs and subcellular localization, one can predict the subcellular localization of lncRNA transcripts (Cao et al., 2018; Su et al., 2018). Another approach used nuclear and cytoplasmic fractionated RNA-seq data to develop a deep learning algorithm that predicts lncRNA subcellular localization based on lncRNA sequences (Gudenas and Wang, 2018). Taken together, the ability to computationally predict and experimentally identify functional sequence elements that drive lncRNA subcellular localization is an important step towards understanding lncRNA functions.
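Sequence-based predictors of this kind need a numerical representation of each transcript. One common choice, assumed here for illustration and not taken from the cited tools, is a vector of k-mer frequencies that can be passed to any standard classifier. The function name and defaults below are hypothetical.

```python
from itertools import product

def kmer_frequencies(seq, k=3, alphabet="ACGU"):
    """Return a dict mapping every possible k-mer to its relative frequency.

    DNA-style input is accepted: T is converted to U before counting.
    """
    seq = seq.upper().replace("T", "U")
    counts = {"".join(p): 0 for p in product(alphabet, repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip windows containing N or other symbols
            counts[kmer] += 1
            total += 1
    if total == 0:
        return {kmer: 0.0 for kmer in counts}
    return {kmer: n / total for kmer, n in counts.items()}
```

Because every transcript yields a vector of the same length (4^k entries), lncRNAs of very different sizes can be compared directly, at the cost of discarding positional information.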
Sequences that control the formation of lncRNA-DNA triplexes
Nuclear pyrimidine (U and C)-rich RNAs can interact with purine (A and G)-rich DNA via Hoogsteen-type hydrogen bonding to form RNA:DNA triple helices that may have regulatory functions (Fig. 1B) (Felsenfeld and Rich, 1957). As triplex formation requires pairing between double-stranded DNA and single-stranded RNA, the formation of the triplexes depends on the lncRNA primary sequence (Li et al., 2016). Fendrr is one example of a nuclear lncRNA that forms RNA:DNA triple helices and plays a regulatory role during development. Loss of function of Fendrr, which is exclusively expressed in the lateral plate mesoderm during mouse embryogenesis, leads to embryonic lethality due to heart and body wall developmental defects (Grote et al., 2013; Sauvageau et al., 2013). Fendrr interacts with both PRC2 (polycomb repressive complex 2) and TrxG/MLL complexes in vivo, and its loss results in changes in epigenetic modifications and gene expression of mesoderm differentiation factors (Grote et al., 2013). A 40 nt sequence element of Fendrr RNA was identified in silico and validated experimentally to form triplexes with promoter regions of two targets of Fendrr, Foxf1 and Pitx2, resulting in increased PRC2 occupancy at these promoters and a subsequent inhibition of expression (Grote et al., 2013).
Although reports of DNA:lncRNA triplexes and their functional relevance have been restricted to specific examples, computational methods have been developed to predict the formation of lncRNA-DNA triplex structures based on lncRNA sequence in silico (Buske et al., 2012; Kuo et al., 2019). One algorithm, called TDF (triplex domain finder), enables retrieval and statistical ranking of sequence elements already known to form triplex structures, including characterized domains of the lncRNAs Fendrr, HOTAIR and MEG3 (Kuo et al., 2019). Importantly, the TDF algorithm has also identified and ranked novel sequence elements with triplex-forming potential and detected previously unknown DNA-binding domains in MEG3 and in lncRNAs relevant to human cardiomyocyte differentiation (Kuo et al., 2019).
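Because triplex formation depends on pyrimidine-rich RNA stretches, a crude first pass for candidate triplex-forming regions is to scan a transcript for windows dominated by C and U. The sketch below illustrates only that idea; it is not the TDF algorithm and performs no statistical ranking or matching against DNA target sites. The function name, window size and threshold are assumptions chosen for illustration.

```python
def pyrimidine_rich_windows(rna, window=20, min_fraction=0.9):
    """Return (start, end, fraction) for every window rich in C and U.

    window and min_fraction are illustrative thresholds, not values
    taken from the text or from any published tool.
    """
    rna = rna.upper().replace("T", "U")
    hits = []
    for start in range(len(rna) - window + 1):
        segment = rna[start:start + window]
        fraction = sum(base in "CU" for base in segment) / window
        if fraction >= min_fraction:
            hits.append((start, start + window, fraction))
    return hits
```

Overlapping hits would normally be merged into single candidate regions and then checked against purine-rich tracts in the promoters of putative target genes.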
RNA motifs that mediate miRNA degradation
While RNA sequence elements of nuclear lncRNA transcripts have received a lot of attention, RNA motifs in cytoplasmic lncRNAs that have defined molecular mechanisms linked to cellular or developmental functions are as yet under-represented. Among the most studied interaction partners of cytoplasmic lncRNAs are miRNAs, which are small RNAs of ∼22 nt that regulate virtually every aspect of development and normal physiology by pairing with their targets (Alberti and Cochella, 2017; Bartel, 2018). Because one can easily predict miRNA interactions based on the presence of miRNA-binding sites in their target transcripts, miRNA-lncRNA interactions have been extensively investigated (Ulitsky, 2018). The majority of miRNA-binding sites within vertebrate lncRNA transcripts are canonical miRNA sites with partial miRNA-lncRNA pairing complementarity (Fig. 1C), which can regulate lncRNA expression similarly to miRNA-mediated regulation of mRNA transcripts. It has also been suggested that canonical miRNA target sites within lncRNAs can regulate miRNA activity via competition for binding sites (Ulitsky, 2018).
In addition to the canonical miRNA-binding sites with partial complementarity, several independent studies have found lncRNAs that harbour miRNA-binding sites with extensive, near-perfect complementarity (Fig. 1D), which is rather unusual for animal miRNAs and has not been reported in endogenous transcripts before (Bitetti et al., 2018; Kleaveland et al., 2018; Ulitsky et al., 2011). By using these high-complementarity miRNA sites in artificially engineered targets and natural viral RNAs, it has been demonstrated that the extensive target-miRNA pairing triggers efficient degradation of miRNAs by shortening (‘trimming’) or untemplated lengthening (‘tailing’) of the 3′ end of miRNAs, a phenomenon known as target-directed miRNA degradation, or TDMD (Ameres et al., 2010; de la Mata et al., 2015; Marcinowski et al., 2012).
More recently, an endogenous transcript that harbours an RNA sequence element conserved among all vertebrates that binds and specifically degrades miR-29b was discovered (Bitetti et al., 2018). The conserved RNA sequence element is part of the cytoplasmic lncRNA libra in zebrafish and the 3′UTR of the protein-coding gene Nrep in mammals. Genetic scrambling of the miR-29 site resulted in ectopic accumulation of miR-29b in the brain, leading to abnormal animal behaviour, thus demonstrating that the miRNA sites embedded in genome-encoded lncRNAs have crucial in vivo relevance (Bitetti et al., 2018). A similar activity of the endogenous high-complementarity miR-7 site (Ulitsky et al., 2011) was reported for the lncRNA Cyrano (Kleaveland et al., 2018). The miR-7 site embedded within the Cyrano transcript is conserved among all vertebrates and triggers efficient miR-7 degradation in vivo in mouse tissues and in human cells (Kleaveland et al., 2018). The biological relevance of miR-7 degradation by TDMD remains to be addressed, as genetic perturbations of Cyrano do not lead to obvious morphological defects in mice or zebrafish (Goudarzi et al., 2019; Kleaveland et al., 2018; Lavalou et al., 2019), in contrast to previous morpholino-based knockdown zebrafish studies (Sarangdhar et al., 2018; Ulitsky et al., 2011). Nonetheless, it is possible that additional lncRNAs contain high-complementarity miRNA sites that control miRNA turnover and potentially have key biological functions. This is further supported by another finding of a high-complementarity miRNA site for miR-30 located in the 3′UTR of Serpine1 (Ghini et al., 2018). Similar to libra/NREP and Cyrano, the endogenously encoded miR-30 site reduces levels of miR-30b-5p and miR-30c-5p by TDMD, regulating mitotic rate and apoptosis in mouse fibroblasts (Ghini et al., 2018).
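The distinction between canonical sites (seed pairing with partial complementarity) and the high-complementarity sites that trigger TDMD can be made concrete with a short sketch. It assumes the standard definition of the seed as miRNA nucleotides 2-8, counts only Watson-Crick pairs in an ungapped alignment (so G:U wobble pairs and bulges are ignored), and uses hypothetical function names; real site-prediction tools are considerably more elaborate.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}

def reverse_complement(rna):
    """Reverse complement of an RNA sequence written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))

def seed_sites(mirna, target):
    """Start positions in target that pair with miRNA nucleotides 2-8."""
    seed_match = reverse_complement(mirna[1:8])
    hits = []
    pos = target.find(seed_match)
    while pos != -1:
        hits.append(pos)
        pos = target.find(seed_match, pos + 1)
    return hits

def paired_fraction(mirna, target, site_start):
    """Fraction of miRNA bases that are Watson-Crick paired when the
    miRNA is aligned, without gaps, to the seed match at site_start."""
    paired = 0
    for i in range(1, len(mirna) + 1):      # 1-based miRNA position
        t_idx = site_start + 8 - i          # antiparallel partner in target
        if 0 <= t_idx < len(target) and COMPLEMENT[mirna[i - 1]] == target[t_idx]:
            paired += 1
    return paired / len(mirna)
```

Under these assumptions, a canonical site scores only slightly above the seven seed positions, whereas a TDMD-like site approaches a paired fraction of 1.0.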
LncRNA motifs that sequester cellular proteins and regulate normal physiology
Although multiple lncRNAs have been shown to have regulatory functions, detailed structure-function analyses have been performed only for a few of them. One of the striking examples of a lncRNA for which specific RNA sequences have been assigned a function is NORAD (noncoding RNA activated by DNA damage). NORAD is an abundant cytoplasmic transcript present at 80-1400 copies per cell that shows sequence conservation across mammals and is upregulated upon DNA damage (Lee et al., 2016; Tichon et al., 2016). Remarkably, NORAD contains at least 17 consensus binding sites for PUMILIO proteins (PUM1 and PUM2), which are highly conserved RBPs that bind to consensus motifs typically located in the 3′ untranslated regions (UTRs) of mRNAs, resulting in reduced translation and turnover of mRNA targets (Miller and Olivas, 2011). NORAD, with its multiple PUMILIO binding sites and high copy number, acts as a negative regulator of PUMILIO proteins by limiting their amount in the cell (Fig. 1E). NORAD-mediated limitation of PUMILIO modulates the abundance of PUMILIO targets, which are enriched for mRNAs with mitotic, DNA repair and DNA replication functions (Lee et al., 2016; Tichon et al., 2016). As a result, NORAD loss of function leads to excess repression of PUMILIO targets, resulting in chromosomal instability, aberrant mitosis and aneuploidy in human cells (Lee et al., 2016; Tichon et al., 2016). In mice, Norad deletion leads to genomic instability and mitochondrial dysfunction, with animals showing signs of premature aging; Norad null allele mice were found to have grey and thin fur, to show abnormal spine curvature characteristic of aging and to die earlier than wild-type mice (Kopp et al., 2019). 
While NORAD interacts with additional proteins (Munschauer et al., 2018; Tichon et al., 2018), a series of genetic experiments, including PUMILIO 2 overexpression in mice and NORAD expression in human cells deficient for NORAD, demonstrated that the aforementioned cellular and organismal phenotypes are linked to PUMILIO hyperactivity (Elguindy et al., 2019; Kopp et al., 2019; Lee et al., 2016; Tichon et al., 2016). This example demonstrates the importance of clustered RNA motifs/elements in establishing functional lncRNA-protein interactions that play key roles in cell biology and normal physiology.
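Clustered consensus sites such as those in NORAD can be located with a simple pattern search. The sketch below assumes the commonly cited PUMILIO response element consensus UGUANAUA, which is not spelled out in the text above, and the function name is hypothetical; it reports every match position, including overlapping ones.

```python
import re

# Assumed consensus for the PUMILIO response element: UGUANAUA (N = any base).
# The lookahead lets overlapping matches be reported as well.
PRE_PATTERN = re.compile(r"(?=UGUA[ACGU]AUA)")

def find_pre_sites(rna):
    """Return 0-based start positions of consensus PUMILIO sites in rna.

    DNA-style input is accepted: T is converted to U before searching.
    """
    rna = rna.upper().replace("T", "U")
    return [match.start() for match in PRE_PATTERN.finditer(rna)]
```

Counting sites per transcript, and relating that count to transcript abundance, gives a rough sense of how many PUMILIO molecules a given RNA could titrate.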
LncRNA elements that mediate lncRNA-protein interactions
Interactions between lncRNAs and RBPs can lead to different molecular, cellular and developmental outcomes. Indeed, lncRNAs can interact with proteins to inhibit or enhance transcription (Feng, 2006; Wutz et al., 2002), stabilize proteins (Zhao et al., 2018) or sequester them (Kino et al., 2010; Lee et al., 2016; Tichon et al., 2016), regulate translation efficiency (Gong et al., 2015), and/or regulate scaffold protein complexes (Rinn et al., 2007). Because RBPs tend to bind low-complexity sequences (Dominguez et al., 2018), it is often difficult to identify protein consensus binding motifs within RNA transcripts; however, it is possible to attribute protein binding to specific lncRNA regions. Indeed, it has been shown that, while lncRNA transcripts can interact with multiple RBPs, often only a short RNA region, embedded in the long transcript and interacting with a specific RBP or with a specific set of RBPs, is sufficient and required to drive the biological function of a number of lncRNAs.
One of the best characterized lncRNAs with a crucial developmental function is Xist (X-inactive specific transcript), which orchestrates transcriptional silencing of one of the female X chromosomes in placental mammals (Borsani et al., 1991; Brown, 1991; Moindrot and Brockdorff, 2016), a process known as X-chromosome inactivation (XCI) (Gendrel and Heard, 2014; Marahrens et al., 1997). Xist knockout leads to loss of X-inactivation and female-specific lethality in mice (Marahrens et al., 1997). One of the conserved regions of Xist, named the A-repeat, is composed of repetitive sequence elements that are believed to be derived from TEs (Elisaphenko et al., 2008) and is required for the initiation of XCI (Nesterova et al., 2001; Wutz et al., 2002). The A-repeat interacts with multiple RBPs (Chu et al., 2015b; Graindorge et al., 2019; Lu et al., 2016; McHugh et al., 2015; Minajigi et al., 2015; Moindrot et al., 2015; Monfort et al., 2015), among which SPEN (also known as SHARP in human and MINT in mouse) is essential for initiating XCI in mouse embryonic stem cells (ESCs) and preimplantation embryos (Carter et al., 2020; Dossin et al., 2020). By specifically binding the Xist A-repeat region, SPEN recruits additional silencing factors to initiate XCI, bridging Xist RNA, transcription machinery and chromatin remodellers (Chu et al., 2015b; Dossin et al., 2020; McHugh et al., 2015; Minajigi et al., 2015). Interestingly, in mouse ESCs, SPEN binds and silences endogenous retroviral element (ERV) RNAs that resemble the A-repeat of Xist (Carter et al., 2020). Insertion of an ERV into an A-repeat-deficient Xist transcript rescues Xist-mediated gene silencing, suggesting that Xist might have adopted a functional TE RNA-protein interaction for dosage compensation (Carter et al., 2020).
Braveheart is a mouse ∼590 nt lncRNA that carries out a regulatory function in cardiovascular lineage commitment (Klattenhoff et al., 2013). It contains a short G-rich motif termed AGIL that is necessary for the differentiation of mouse ESCs to cardiomyocytes (Xue et al., 2016). This motif interacts with the zinc-finger transcription factor CNBP/ZNF9 (Xue et al., 2016), a heart-specific transcription factor known to bind single-stranded G-rich DNA and RNA motifs (Calcaterra et al., 2010; Chen, 2003). It has been proposed that Braveheart and CNBP function together to regulate cardiac gene expression (Xue et al., 2016). Thus, Braveheart directs cell fate via a short RNA motif that interacts with a tissue-specific transcription factor (Xue et al., 2016).
The lncRNA growth arrest-specific 5 (GAS5) is one of the first studied lncRNAs (Schneider et al., 1988). GAS5 regulates the survival of female germline stem cells in vitro (Wang et al., 2018) and the self-renewal of human ESCs (Xu et al., 2016). In certain cases, GAS5 acts as a glucocorticoid receptor (GR) decoy (Kino et al., 2010), participating in the negative-feedback loop of activated GRs, directing cell fate and regulating transcription. The region implicated in this interaction is located in the last exon of the 12-exon GAS5 transcript and includes a motif that ‘mimics’ genomic GR-binding sites (Hudson et al., 2014). Interestingly, a single nucleotide substitution is sufficient to abolish the GR-GAS5 interaction. Sequence alignment showed that this recognition site is conserved among the haplorhine lineage (a suborder of primates), suggesting that other RNAs containing this motif may also interact with GRs (Hudson et al., 2014).
Megamind (also known as Tunar) harbours a region of ∼200 nt that is deeply conserved among vertebrates (Lin et al., 2014; Ulitsky et al., 2011). Knockdown of Tunar in mouse ESCs inhibits their differentiation into neural lineages (Lin et al., 2014). It has been proposed that Tunar carries out its regulatory function in pluripotency through an interaction with a protein complex composed of PTBP1, HNRNPK and NCL. Specifically, RNA pulldown experiments have shown that the conserved sequence binds to PTBP1, HNRNPK and NCL proteins with an affinity comparable with that of the full-length Tunar transcript, whereas binding affinity of the proteins to a synthetic transcript without the conserved sequence is decreased. The conserved region of Tunar therefore appears to act as an RBP-binding platform driving cellular differentiation (Lin et al., 2014).
lncRNA-protein interactions that have regulatory functions in development and organ growth can also contribute to human diseases. For example, the ubiquitously expressed lncRNA Pint (p53-induced noncoding transcript), which is under the direct control of p53 (Marín-Béjar et al., 2013), has been implicated in cancer. In mouse cells, Pint controls cell proliferation and survival by interacting with PRC2 and regulating the expression of hundreds of genes (Marín-Béjar et al., 2013). In line with this, it has been demonstrated that Pint−/− mice are noticeably smaller and have reduced body weight compared with wild-type mice (Sauvageau et al., 2013). Similar to mouse Pint, human PINT is regulated by p53 and interacts with PRC2. Of note, PINT inhibits the migration and invasion of cancer cells, and this function is carried out by a short sequence motif conserved in mammals that is required both for the interaction between PINT and PRC2 and for the downregulation of pro-invasion genes (Marín-Béjar et al., 2017).
Identifying RNA-binding proteins associated with lncRNA elements
As highlighted above, a number of lncRNA molecular functions are mediated by interactions with proteins. As such, reliable identification of lncRNA-associated proteins is key for delineating the molecular mechanisms of lncRNA action. Unbiased identification of lncRNA-associated proteins is typically carried out by applying so-called RNA-centric approaches, which identify RBPs that interact with a specific RNA of interest. The most commonly used strategy relies on cross-linking and probe-based affinity capture of a test RNA followed by mass spectrometry-based identification of the co-purified proteins (Fig. 2A) (Chu et al., 2015a). Despite the high potential of this methodology, the identification of lncRNA-protein interactions via this approach remains technically challenging due to the general inefficiency of RNA pull-downs. Recently, an alternative approach termed incPRINT, which enables in-cell identification of protein interactions with any RNA of interest, was developed (Graindorge et al., 2019). This technique is based on screening a library of tagged proteins (including the majority of all known human RBPs, transcription factors and chromatin modifiers) with a test RNA that is tethered to a luciferase detector. Instead of pulling down the test RNA, high-throughput immunoprecipitation of thousands of tagged RBPs is performed, followed by luciferase detection of their interactions with the test RNA (Graindorge et al., 2019; Sabaté-Cadenas and Shkumatava, 2020). incPRINT is suitable for the identification of in-cell RNA-interacting proteomes of full-length transcripts of various endogenous abundances, and for mapping proteins that bind to different regions of longer RNA transcripts (Graindorge et al., 2019).
An alternative method to identify proteins that bind to specific sequence motifs harnesses existing data from crosslinking and immunoprecipitation (CLIP) assays. CLIP methods are based on chemical or UV crosslinking followed by immunoprecipitation of a test protein and sequencing of the co-purified RNA, which identifies all RNA fragments that crosslink to the protein of interest (Darnell, 2012; Jensen and Darnell, 2008) (Fig. 2B). Currently, the eCLIP data from the ENCODE project include ∼150 RBPs (Van Nostrand et al., 2020a, 2020b), the binding sites of which can be mapped and analyzed in specific transcripts of interest. Although the application of CLIP data to an RNA of interest is limited by the number of proteins and cell lines for which the data are available (currently, predominantly K562 or HepG2 cells), it is powerful as an orthogonal approach for validating RNA-centric methods (Graindorge et al., 2019) and for mapping RBP-binding sites identified by eCLIP (Van Nostrand et al., 2017) to RNA sequence elements (Kirk et al., 2018; Lee et al., 2016; Tichon et al., 2018). The identification of RNA motifs and their associated proteins can also enable the prediction of RNA-protein interactions based on lncRNA sequences, although this will require the development of algorithms that are able to detect short sequence homologies. These predictions can then be tested by the aforementioned RNA-protein interaction technologies.
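Mapping CLIP-derived binding sites onto RNA sequence elements comes down to an interval-overlap query in transcript coordinates. The following Python sketch illustrates the idea under stated assumptions: peaks have already been converted to 0-based, half-open transcript coordinates, and all names and coordinates (RBP_A, conserved_element and so on) are hypothetical placeholders rather than real eCLIP data. It is not the procedure used in any of the cited studies.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str   # RBP name for a peak, or element name for a sequence element
    start: int  # 0-based, inclusive, transcript coordinates
    end: int    # 0-based, exclusive


def overlap_length(a: Interval, b: Interval) -> int:
    """Number of nucleotides shared by two half-open intervals."""
    return max(0, min(a.end, b.end) - max(a.start, b.start))


def assign_peaks_to_elements(peaks, elements, min_overlap=1):
    """Map each element name to the sorted list of RBPs with an overlapping peak."""
    hits = {element.name: set() for element in elements}
    for element in elements:
        for peak in peaks:
            if overlap_length(peak, element) >= min_overlap:
                hits[element.name].add(peak.name)
    return {name: sorted(rbps) for name, rbps in hits.items()}


# Hypothetical elements and peaks on one transcript, for illustration only.
elements = [
    Interval("conserved_element", 100, 300),
    Interval("repeat_region", 900, 1200),
]
peaks = [
    Interval("RBP_A", 250, 320),    # overlaps conserved_element by 50 nt
    Interval("RBP_B", 300, 350),    # starts where conserved_element ends: no overlap
    Interval("RBP_C", 1000, 1050),  # lies inside repeat_region
    Interval("RBP_A", 1150, 1400),  # overlaps repeat_region by 50 nt
]

result = assign_peaks_to_elements(peaks, elements)
print(result)
```

Raising min_overlap makes the assignment stricter, which may be useful when peaks are broad relative to the element being tested.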
RNA motif discovery as a starting point for lncRNA functional classification
One of the major bottlenecks in identifying physiologically important lncRNAs is the difficulty in establishing a connection between lncRNA sequence and function. Moreover, because even lncRNAs with well-characterized cellular and organismal functions represent a highly heterogeneous group, it is becoming increasingly important to classify them into functional subgroups. The identification of sequence motifs linked to specific molecular events, such as localization and protein binding, and the subsequent discovery of these motifs in other lncRNAs and/or mRNAs, might help in assigning specific functions to individual lncRNAs, as experimental interrogations are often tedious. Indeed, the comparative analysis of lncRNA sequences across different species (Fig. 2C) (Chen et al., 2016; Hezroni et al., 2015; Necsulea et al., 2014) can be a powerful approach for studying the functions and molecular mechanisms of action of lncRNAs, especially when followed by experimental validations. Such alignment-based comparative approaches have led to the identification of conserved sequence motifs within lncRNA transcripts such as Megamind, Cyrano, NEAT1, MALAT1, NORAD, PINT and many more (Cornelis et al., 2016; Lee et al., 2016; Marín-Béjar et al., 2017; Tichon et al., 2016; Ulitsky et al., 2011; Wilusz et al., 2012). Moreover, uncovering RNA elements with known functions within as-yet uncharacterized lncRNA sequences can facilitate the discovery of novel lncRNA functions. For example, an SRA1-like (steroid receptor RNA activator 1) sequence element that mediates androgen receptor (AR) binding was found in the lncRNA SLNCR1 (Schmidt et al., 2016) and, based on the known function of this motif as a co-activator of hormonal pathways (Lanz et al., 1999; Novikova et al., 2012), it was possible to predict the function of SLNCR1. As predicted, targeting this specific binding site within SLNCR1 prevented its interaction with AR, leading to inhibition of melanogenesis (Schmidt et al., 2020).
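The principle behind alignment-based motif discovery can be illustrated with a sliding-window conservation scan over a pre-computed multiple sequence alignment. The Python sketch below is a deliberately simplified illustration and not the pipeline of any cited study: it assumes the alignment has already been generated by a dedicated aligner, scores each column by the fraction of sequences carrying the most common base, and reports windows whose mean score reaches a threshold. The three aligned sequences, the function names, the window size and the threshold are all hypothetical.

```python
def column_identity(alignment):
    """Per-column fraction of sequences carrying the most common non-gap base."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        bases = [base for base in column if base != "-"]
        if not bases:
            scores.append(0.0)  # all-gap column counts as unconserved
            continue
        top_count = max(bases.count(base) for base in set(bases))
        scores.append(top_count / n_seqs)
    return scores


def conserved_windows(alignment, window=8, threshold=0.9):
    """Return merged (start, end) alignment coordinates (0-based, half-open)
    of windows whose mean column identity is at least `threshold`."""
    scores = column_identity(alignment)
    regions = []
    for start in range(len(scores) - window + 1):
        mean_identity = sum(scores[start:start + window]) / window
        if mean_identity >= threshold:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend the previous region
            else:
                regions.append((start, end))
    return regions


# Hypothetical toy alignment: divergent flanks around one shared 8-nt block.
alignment = [
    "ACGUAC" + "GGAUUCCA" + "UUGACA",
    "UGCAUG" + "GGAUUCCA" + "ACAUGU",
    "CAUGCA" + "GGAUUCCA" + "GAUCAC",
]

regions = conserved_windows(alignment, window=8, threshold=1.0)
for start, end in regions:
    print(start, end, alignment[0][start:end])
```

The window size sets the shortest element that can be detected in this way, which is one reason why such scans struggle with very short motifs embedded in rapidly evolving transcripts.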
Although powerful, alignment-based approaches require relatively long stretches of sequence conservation between homologs, and most lncRNAs lack such stretches. Detection of rather short, conserved sequence motifs within long and rapidly evolving transcripts would be an important step towards assigning sequences to biological and molecular functions. Application of the SEEKR algorithm, developed to detect the presence of similar short sequence motifs of three to eight nucleotides (termed k-mers), recently allowed the functional classification of lncRNAs with related functions (Kirk et al., 2018). Reasoning that lncRNAs with shared functions harbour a shared set of similar k-mers while lacking linear homology, SEEKR groups lncRNAs based on their k-mer profiles. Subgroups of lncRNAs with similar k-mer contents share biological properties, such as protein-binding and subcellular localization, even between groups with no apparent evolutionary relationships. Based on both computational data and experimental validation, it has been proposed that no linear sequence homology is required for lncRNAs to support conserved functions. Instead, lncRNA functions are driven by similar sets of short k-mer motifs that represent RNA-binding protein motifs (Kirk et al., 2018). The classification of lncRNAs into functional groups based on short motifs is very promising, but there is a need for improved algorithms for unbiased identification of short conserved motifs within lncRNA transcripts across more distant species.
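The idea of comparing transcripts by k-mer content rather than by linear alignment can be sketched in a few lines of Python. This is a minimal illustration of the general principle only: it is not the SEEKR implementation, and the published method includes normalization steps that are not reproduced here. The function names and the three toy sequences are hypothetical. Sequences a and b carry the same two short repeats in opposite order, so they share k-mer content without sharing a co-linear arrangement, whereas sequence c has unrelated content.

```python
from itertools import product
from math import sqrt


def kmer_profile(seq, k=3, alphabet="ACGU"):
    """Length-normalized frequency of every possible k-mer, in a fixed order."""
    seq = seq.upper().replace("T", "U")
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip words containing ambiguous bases such as N
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in kmers]


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0


# Hypothetical toy sequences, for illustration only.
seq_a = "AUGGAUGGAUGG" + "CCUCCUCCU"  # repeat 1 followed by repeat 2
seq_b = "CCUCCUCCU" + "AUGGAUGGAUGG"  # same repeats, order swapped
seq_c = "AC" * 10 + "A"               # unrelated k-mer content

profile_a = kmer_profile(seq_a)
profile_b = kmer_profile(seq_b)
profile_c = kmer_profile(seq_c)

r_ab = pearson(profile_a, profile_b)
r_ac = pearson(profile_a, profile_c)
print(f"a vs b: {r_ab:.2f}; a vs c: {r_ac:.2f}")
```

In this toy case the rearranged pair correlates strongly while the unrelated pair does not, which mirrors the reasoning that shared k-mer content can be detected in the absence of linear homology. Grouping real transcripts would additionally require a choice of k, a background set and a clustering step, none of which are addressed in this sketch.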
Conclusions and perspectives
The regulatory functions of the lncRNA sequence motifs discussed here demonstrate the importance of primary RNA sequences and the necessity of careful analyses of lncRNA sequence elements, as many of them are shared with protein-coding RNAs. RNA motifs regulate a range of different functions, including transcript biogenesis, subcellular localization and transcript interactions with proteins, DNA and other RNAs, such as miRNAs and protein-coding mRNAs. The identification and further characterization of regulatory RNA motifs will facilitate the prediction of lncRNA interactions and mechanisms of action. Establishing a database of functional RNA elements identified experimentally as well as RNA motifs determined by computational approaches would enable the classification of lncRNAs into functional subgroups. Because several thousand lncRNAs have been annotated and it is likely that only a minor fraction of them are functional, the identification of conserved RNA motifs would help to prioritize lncRNAs for experimental interrogation to identify transcripts with important cellular functions.
The unbiased identification of regulatory sequence elements within lncRNA transcripts has demonstrated that some coding mRNAs employ the same regulatory RNA motifs. As such, the characterization of functional RNA sequence elements might also help us to understand regulatory sequences that are embedded in mRNA transcripts. Indeed, studying regulatory RNA motifs within mRNA transcripts is often challenging, because the impact of genetic manipulation of sequences located in untranslated regions of mRNAs cannot always be easily uncoupled from their impact on protein production and function. Hence, the characterization of RNA sequence elements in noncoding RNAs could provide a more accessible route and would allow predictions of mRNA regulation by these elements. Furthermore, regulatory RNA sequence motifs represent an untapped source of potential therapeutic targets. Since the realization that RNA molecules are druggable targets (Warner et al., 2018), there has been an effort in the field to find small molecule inhibitors that target specific RNA transcripts. However, it is challenging to identify selective drugs that bind specific RNAs without broad effects. Targeting regulatory RNA motifs may provide this specificity. Furthermore, many proteins have been found to be undruggable (Warner et al., 2018); thus, targeting RNA motifs that regulate the transcription, stability and/or translation of mRNAs, and thereby protein production, could open the door to novel treatment strategies.
We thank all members of the Shkumatava laboratory and Dina Zielinsky for useful discussions.
The authors’ research is supported by LabEx DEEP, France (ANR-11-LABX-0044 and ANR-10-IDEX-0001-02) and by a SDSV n°577, université PSL doctoral fellowship to F.C.
The authors declare no competing or financial interests.