We screened the draft sequence of the human genome for genes that encode intermediate filament (IF) proteins in general, and keratins in particular. The draft covers nearly all previously established IF genes, including recent cDNA and gene additions such as pancreatic keratin 23, synemin and the novel muscle protein syncoilin. In the draft, seven novel type II keratins were identified, presumably expressed in the hair follicle/epidermal appendages. In summary, 65 IF genes were detected, placing the IF family among the 100 largest gene families in humans. All functional keratin genes map to the two known keratin clusters on chromosomes 12 (type II plus keratin 18) and 17 (type I), whereas other IF genes are not clustered. Of the 208 keratin-related DNA sequences, only 49 reflect true keratin genes, whereas the majority describe inactive gene fragments and processed pseudogenes. Surprisingly, nearly 90% of these inactive genes relate specifically to the genes of keratins 8 and 18. Other keratin genes, as well as those that encode non-keratin IF proteins, either lack gene fragments/pseudogenes or have only a few derivatives. As parasitic derivatives of mature mRNAs, the processed pseudogenes of keratins 8 and 18 have invaded most chromosomes, often at several positions. We describe the limits of our analysis and discuss the striking unevenness of pseudogene derivation in the IF multigene family. Finally, we propose to extend the nomenclature of Moll and colleagues to any novel keratin.
INTRODUCTION
The increase in specific cell types represents one hallmark of metazoan evolution. It is paralleled by the acquisition of multigene families, which often encode proteins of similar structure but distinct function. One such family is the intermediate filament (IF) protein family. Its members form part of the cytoskeleton of most metazoan cells. Vertebrate IF proteins are organised into five distinct gene families according to sequence identity and expression patterns (Fuchs and Weber, 1994; Herrmann and Aebi, 2000). These include keratins (K), which represent the type I and II homology groups encoded by more than 20 genes, and a further 15 hair keratin genes (Langbein et al., 1999; Rogers et al., 2000); the type III proteins desmin, vimentin, GFAP and peripherin; and the type IV homology group, which encompasses α-internexin, syncoilin (Newey et al., 2001), nestin, synemin and the neurofilament proteins NF-L, -M and -H. The nuclear lamins A/C, B1 and B2 form the type V IF, whereas the eye lens proteins phakinin and filensin constitute a separate group. All 16 known non-keratin IF proteins, including syncoilin (Newey et al., 2001) and synemin (Becker et al., 1995; M. Titeux et al., unpublished), were identified by biochemical, immunological and cDNA cloning methods. The power of the classical approach is best exemplified by the pioneering work of Moll and Franke, who in 1982 established the 'catalog of human cytokeratins' (Moll et al., 1982). They laid the groundwork for keratin expression profiles and provided a rational nomenclature. Their data were based on the isolation of keratins from microdissected normal and tumor tissues, as separated in high resolution 2D gels. The numbering system ranges from 1 to 8 for type II keratins, with letters for later additions, and from 9 to 21 for type I keratins.
Hair keratins were named in an analogous way with letters Ha and Hb indicating type I and II hair keratins, respectively (Langbein et al., 1999; Rogers et al., 2000). Subsequent work established that all IF proteins, with the exception of a few polymorphic variants (Mischke and Wild, 1987; Korge et al., 1992), are encoded by single copy genes (Fuchs and Weber, 1994). One difficulty of the classical biochemical and genetic approach is that potential minor keratins and other IF proteins, present in only a few cells of a tissue, or expressed transiently during embryonic development, may have escaped detection.
Gene mapping studies revealed that genes coding for non-keratin IF proteins are not clustered (International Human Genome Sequencing Consortium, 2001). All type I keratin genes (except K18; Waseem et al., 1990) are clustered on chromosome 17q21 and type II genes on 12q13 (International Human Genome Sequencing Consortium, 2001). Transcription analysis has demonstrated that the diversity of keratins is not increased further by alternative splicing.
Knowledge of IF genes and expression patterns stimulated the discovery of point mutations in a still growing number of IF genes, which has provided evidence for their pathogenic relevance in human disorders (Bonifas et al., 1991; Coulombe et al., 1991; Lane et al., 1992; reviewed by Irvine and McLean, 1999). Such 'experiments of nature' have demonstrated that mutations in at least 14 epidermal keratin genes cause fragility syndromes of the epidermis and its appendages that seem to result from a collapse of a mutant keratin cytoskeleton. Formally, this was the genetic proof for a true cytoskeletal function of these proteins. Desmin mutations analogous to those in epidermal keratins were connected to myopathies of skeletal and heart muscle (Goldfarb et al., 1998), whereas point mutations in GFAP are now known to cause Alexander's disease (Brenner et al., 2001). At least two reports have linked NF-L mutations to Charcot-Marie-Tooth disease type 2E (Mersiyanova et al., 2000; De Jonghe et al., 2001). Finally, mutations in the gene coding for the nuclear lamins A/C give rise to several tissue-restricted disorders termed laminopathies (for a recent discussion, see Hutchison et al., 2001; Wilson et al., 2001). These data support the view that IF proteins also serve non-cytoskeletal functions (Quinlan, 2001; Wilson et al., 2001).
Additional insight into IF protein function comes from genetically altered mice (H. Herrmann et al., unpublished). One common theme that emerges from such studies is that there are essential and nonessential IF protein functions depending on the tissue context. Ablation of keratins leads to extensive tissue fragility in the basal but not in the suprabasal epidermis (Lloyd et al., 1995; Peters et al., 2001; Reichelt et al., 2001). Moreover, knockout studies have demonstrated that certain IF proteins can compensate for each other (Magin et al., 2000). In addition, the phenotype of some IF gene knockout mice has shed light on new pathologies (Ku et al., 1999; Caulin et al., 2000; Hesse et al., 2000; Tamai et al., 2000).
The analysis of diseases with IF involvement, as well as the understanding of IF function and evolution, will be aided by knowledge of the corresponding genes. Given that only about 40 functional keratin genes had been identified so far, we were surprised by the large number of keratin genes reported in the recently published draft of the human genome. To clarify whether 111 keratin genes exist in the human genome (International Human Genome Sequencing Consortium, 2001), we set out to analyze the data set available in the public domain.
RESULTS
Number and organisation of keratin genes
We searched the NCBI and Celera genome databases and included the most recently published keratins expressed in the inner root sheath (IRS) of hair follicles (Bawden et al., 2001). We found 208 keratin-related sequences in the draft (Fig. 1). Of these, 49 represent single copy genes for type I and II keratins. The type I keratin cluster contains at least 25 functional genes and 2 pseudogenes spread over nearly 1 Mb of DNA; the corresponding type II gene array harbours at least 24 functional genes and 5 pseudogenes distributed along 1.2 to 1.3 Mb.
The gene density in the two keratin clusters appears much higher than estimated for the overall genome and is approximately 35 kb per gene. There are 111 pseudogenes plus 47 gene fragments for all keratins. Intron-containing pseudogenes are mostly contained within the two keratin clusters, whereas those with features of processed pseudogenes have invaded most chromosomes, often at several positions (Fig. 2). A few earlier analyses have identified pseudogenes for keratins 8, 14, 16, 17, 18, 19 and hair keratins (Kulesh and Oshima, 1988; Rosenberg et al., 1988; Waseem et al., 1990; Troyanovsky et al., 1992; Ruud et al., 1999; Smith et al., 1999; Hut et al., 2000; Rogers et al., 2000; Winter et al., 2001). The pseudogenes coding for K14, K16 and K17, which arose by gene duplication, are located outside the type I keratin cluster.
Unexpectedly, processed pseudogenes, which are cDNA derivatives, are distributed strikingly unevenly among their parent genes. By far the highest number of processed pseudogenes relates to keratin genes 8 and 18, which map adjacently on chromosome 12q13 within the type II gene cluster. K8 and K18 are typical of internal epithelia and represent the earliest intermediate filament expression pair in embryogenesis. There are 62 processed pseudogenes plus 15 gene fragments for the keratin 18 gene, and 35 processed pseudogenes plus 26 gene fragments for the keratin 8 gene (for earlier reports of such pseudogenes, see Kulesh and Oshima, 1988; Waseem et al., 1990). These processed pseudogenes are dispersed over all chromosomes (see Fig. 2). None of these pseudogenes contained an intact open reading frame. Other keratin genes are either true single copy genes or are accompanied by one to four pseudogenes (Fig. 1).
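The share of inactive keratin sequences that derive from K8 and K18 follows directly from the counts given in this section. The following is a minimal arithmetic sketch in Python; it uses only the figures reported in the text, and the variable names are illustrative assumptions rather than part of the original analysis.

```python
# Counts as reported in the text; variable names are illustrative only.
k18 = {"processed_pseudogenes": 62, "gene_fragments": 15}
k8 = {"processed_pseudogenes": 35, "gene_fragments": 26}

# Pseudogenes plus gene fragments reported for all keratins.
all_keratin_inactive = 111 + 47

# Inactive sequences (pseudogenes + fragments) derived from K8 and K18.
k8_k18_inactive = sum(k18.values()) + sum(k8.values())
share = k8_k18_inactive / all_keratin_inactive

print(k8_k18_inactive)   # 138
print(f"{share:.1%}")    # 87.3%
```

Under these counts, K8 and K18 account for 138 of the 158 inactive keratin sequences, or roughly 87%.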
In the present draft, no gene was found for keratin 11 (Moll et al., 1982), which may represent a polymorphic variant of K10 (Mischke and Wild, 1987; Korge et al., 1992), or for K6c-f (Takahashi et al., 1995). Clarifying the status of the latter may have to await the completion of the human genome.
Novel keratin genes and nomenclature
We discovered seven new type II keratins. Of these, five displayed homology to K6a, K6b and K5, one was most closely related to K1 and one was highly similar to K6b (Fig. 3). This new member of the K6 family has 99% protein sequence identity to K6b, but at the genomic level it contains a completely different intron 3. The evolutionary relationship of keratins is outlined in Fig. 4. Owing to the incomplete alignment of contigs, a few additional keratin genes and pseudogenes may exist.
The total number of keratin genes amounts to 49. Our survey of the current draft of the human genome conforms well with the view of 22 keratins expressed in various epithelia, 15 trichocyte-specific, 5 inner root sheath and 7 novel keratins described in this report. Together with the 13 genes for the non-keratin IF proteins, the number of genes encoding cytoplasmic IF proteins reaches 62. The three nuclear lamin genes bring the entire IF multigene family to 65.
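The tally in this paragraph can be checked with a few lines of arithmetic. This is a minimal sketch in Python using only the counts stated above; the category labels are illustrative names chosen here.

```python
# Gene counts as stated in the text; dictionary keys are illustrative labels.
keratin_genes = {
    "epithelial": 22,
    "trichocyte_specific": 15,
    "inner_root_sheath": 5,
    "novel_type_II": 7,
}
non_keratin_cytoplasmic = 13
nuclear_lamins = 3

total_keratins = sum(keratin_genes.values())
cytoplasmic_if = total_keratins + non_keratin_cytoplasmic
all_if = cytoplasmic_if + nuclear_lamins

print(total_keratins, cytoplasmic_if, all_if)  # 49 62 65
```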
Based on the numbering system introduced by Moll and colleagues (Moll et al., 1982), we propose to name novel type II keratins according to their sequence relationship with one of the existing eight type II genes, followed by a lowercase letter. The type II keratin genes reported in this study are therefore named K1b, K5b, K5c, K6h, K6i, K6k and K6l. Type I keratins should be named in the same way (see also Fig. 1). Novel genes not related to existing proteins should be given new numbers starting with K24.
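To illustrate the proposed scheme for type II keratins (parent number from 1 to 8, followed by a lowercase letter), the seven novel names can be parsed and grouped by parent keratin. This is a hedged sketch in Python: the regular expression and the grouping are assumptions made here for illustration, not a tool used in this study, and the pattern does not cover hair keratins (Ha/Hb) or type I names.

```python
import re
from collections import defaultdict

# The seven novel type II keratin names listed in the text.
novel = ["K1b", "K5b", "K5c", "K6h", "K6i", "K6k", "K6l"]

# Illustrative pattern: 'K', a parent number 1-8, then one lowercase letter.
pattern = re.compile(r"K([1-8])([a-z])")

by_parent = defaultdict(list)
for name in novel:
    match = pattern.fullmatch(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the proposed scheme")
    by_parent[f"K{match.group(1)}"].append(match.group(2))

print(dict(by_parent))
# {'K1': ['b'], 'K5': ['b', 'c'], 'K6': ['h', 'i', 'k', 'l']}
```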
Non-keratin IF genes
All 13 genes encoding the non-keratin cytoplasmic IF proteins are covered by the draft sequence (Fig. 1). Given the considerable sequence drift among these genes, the chicken sequence of synemin was non-informative for the identification of human synemin. The human orthologue was identified by D. Paulin (M. Titeux et al., unpublished). No additional functional IF gene was recognized in the current draft. Interestingly, pseudogenes are very rare among the non-keratin genes. Only the neurofilament NF-H gene is accompanied by two pseudogenes. Also, the genes for the three nuclear lamins (lamins A/C, B1 and B2) lack pseudogenes. If the completed version of the human genome lacks an additional lamin gene, the oocyte-specific lamin of certain amphibia (Döring and Stick, 1990) has no orthologue in the human genome.
CONCLUSIONS AND PERSPECTIVES
Our analysis is limited by two factors. (1) The alignment of contigs leading to the present draft is still incomplete; therefore, we cannot exclude the existence of a few more keratin genes. In light of the fidelity of the 'Moll catalog' and the concordant phenotypes of keratin-knockout mice (H. Herrmann et al., unpublished), we predict that any keratins yet to be discovered may be restricted to the hair follicle and/or other epidermal appendages. The existence of additional keratins specific for embryonic stages or specialized cells of internal epithelia appears unlikely. (2) Given the strong sequence drift among non-keratin IF genes, novel IF genes with yet unknown properties might exist. The prototype of such proteins could be represented by syncoilin, a constituent IF member of the dystrobrevin complex, which was proposed to link IF proteins to dystrobrevin at the neuromuscular junction (Newey et al., 2001). One task ahead will be to determine whether syncoilin forms copolymers with muscle-specific IF proteins or whether it serves different functions.
In view of the well-conserved structure of IF proteins and the common principles governing their assembly properties, a search for mutations in known and newly discovered IF protein genes is likely to reveal their involvement in additional disorders and to unravel new IF functions (see also Quinlan, 2001).
Most vertebrate gene families have pseudogenes, but these usually represent only a small minority of the total gene number (Mighell et al., 2000). Thus, the large number of pseudogenes for the keratin gene family is startling. Particularly striking is the finding that some 87% of these pseudogenes relate to keratin genes 8 and 18. An uneven distribution also holds for the human actin pseudogenes. There are 23 pseudogenes for β- and 6 for γ-cytoplasmic actin, while the four muscle actin genes lack pseudogenes (Pollard, 2001). The molecular mechanisms resulting in the generation of pseudogenes from some but not other genes are unknown. However, a future analysis of their integration sites may yield further information about the structural properties of human chromatin and the mechanisms of recombination.
Note added in proof
While this manuscript was under review, Mizuno et al. characterized desmuslin, an IF protein that interacts with α-dystrobrevin and desmin (Mizuno et al., 2001). When we compared its sequence with that of human synemin, we found it to be nearly identical to the synemin α splice variant described by M. Titeux et al. (unpublished). Therefore, we propose to use the established name synemin.
Acknowledgements
We are grateful to D. Paulin (Paris) for providing the human synemin gene sequence, and to J. Schweizer and M. Rogers (Heidelberg) for helpful discussion and for providing sequence information on K6i. We also thank D. Siepe (Bonn) for advice on database searches. This work was supported by the DFG (SFB 284, C7) to T.M.M.