Fueled by recent advances in single cell biology, we are moving away from qualitative and undersampled assessments of cell identity, toward building quantitative, high-resolution cell atlases. However, it remains challenging to precisely define cell identity, leading to renewed debate surrounding this concept. Here, I present three pillars that I propose are central to the notion of cell identity: phenotype, lineage and state. I explore emerging technologies that are enabling the systematic and unbiased quantification of these properties, and outline how these efforts will enable the construction of a high-resolution, dynamic landscape of cell identity, potentially revealing its underlying molecular regulation to provide new opportunities for understanding and manipulating cell fate.
For centuries, biologists have sought to deconstruct the complexity of biological systems by breaking them down into their component parts – cells – and cataloging these individual units according to their identity. Establishing such a cellular taxonomy provides a universal scheme to standardize cell biology, yet the notion of cell identity, or cell type, remains poorly defined. Historically, cells have been classified by features such as morphology, location, ontogeny and interactions with other cell types. Over time, new assays were developed to measure the physiological function of cells, and these, accompanied by advances in molecular biology that enable the quantification of gene and protein expression, have allowed for more nuanced cell type classifications.
Fundamentally, though, no general method to accurately define cell identity currently exists. This represents a barrier to cataloging cell types in organisms where the full repertoire of cell identities remains unknown, such as in mouse (Han et al., 2018; Tabula Muris Consortium et al., 2018) and human (Regev et al., 2017). Thus, although several cell atlas construction endeavors are under way, these efforts have reignited debate around how cell identity can be effectively and accurately curated, revealing many differing viewpoints on this subject (Various authors, 2017; see also Xia and Yanai, 2019 in this issue). Here, I draw on new and established notions to synthesize a framework consisting of three pillars (Fig. 1) that I propose are central to the concept of cell identity: (1) phenotype (and function) – representing a central pillar for the definition of cell identity, this defines the broad range of physical, molecular and functional features that can be captured and analyzed to enable systematic and unbiased cell type categorization; (2) lineage – to fully characterize cell identity, it is also valuable to understand the lineage relationships between different cell types and their genesis. Tracing the developmental origins of cell identity may allow a cellular taxonomy to be constructed, enabling similar cell types to be grouped together, potentially helping to characterize new cell species; (3) state – cell identity is stable; however, in response to diverse stimuli, the same cell type can exhibit a range of different phenotypes (states). Curating the cell states associated with a given cell type enables identity to be distinguished from state. Moreover, mapping the landscape of cell states lays the foundation for identifying when a cell travels out of normal physiological bounds into a pathological state.
Together, a consideration of these three pillars can enable the construction of a high-resolution, dynamic cell identity landscape, potentially providing new opportunities for understanding and manipulating cell fate.
Phenotype and function: curating high-resolution snapshots of cellular features to characterize identity
Inferring cell identity from phenotype
The characterization of cell phenotype is central to defining cell identity and represents a longstanding focus of biologists. In the 1600s, aided by light microscopy, Robert Hooke first described the cells that made up a sample of cork (Hooke, 1665). Two hundred years later, the first histological stains using carmine, silver, and Hematoxylin and Eosin emerged, thus allowing relatively detailed cytological observations to be made (Pearse, 1984). It was around this time that Ramón y Cajal used Golgi's silver staining method to describe neurons, providing evidence that the nervous system is not a continuum of fibers but is composed of individual units, neurons (Ramón y Cajal, 1888). Since these early discoveries, cell visualization using increasingly sophisticated microscopy and imaging techniques has remained central to cell type identification; probing key features such as cell shape, size, location and interactions with other cell types facilitates the classification of cells into discrete categories. With advances in molecular biology came the ability to stain cells for specific markers of identity (Coons et al., 1941). Eventually, distinct cell types could be labeled with fluorescent tags such as GFP (Chalfie et al., 1994), enabling the detailed investigation of cell phenotype within whole biological systems.
Imaging-based phenotypic assessment, along with other established techniques, such as flow cytometry, provides high resolution in terms of capturing information on an individual cell basis. Furthermore, these analyses can be deployed in intact cells and organisms, enabling cell function to be probed. However, the information yielded by these assays is comparatively low dimensional, i.e. relatively few phenotypic features are captured from many cells. In addition, the selection of these features tends to be driven by prior knowledge of the biological system under study, limiting and potentially biasing assessment of cell identity. In contrast, methods for genome-wide analysis of RNA and protein abundance allow broader and more objective measurements to be collected. Indeed, increasing the number of molecular features used to define cell types has enabled more systematic and unbiased assessments of cell identity, based on gene expression alone (Cahan et al., 2014; Roost et al., 2015). Nevertheless, these approaches have relied on bulk analysis of mixed cell populations, blending signals from different sub-populations and altogether masking rare cell species, limiting the precision of cell type identification.
Recently developed single cell technologies have served to bridge the gap between detailed studies of individual cells and bulk studies of cell populations. These methods enable the capture of many thousands of features, without the requirement for experimental cell enrichment, thus generating a rigorous and unbiased picture of the range of cell phenotypes that exists within any given tissue. Of the current suite of technologies, which include genetic, epigenetic and proteomic profiling (Stuart and Satija, 2019), single cell RNA-sequencing (scRNA-seq) has seen rapid and wide adoption since its recent emergence (Tang et al., 2009). Although early iterations requiring cell separation in wells were relatively low-throughput and expensive, more recently developed microfluidic-based technologies have brought huge gains in cell capture rate (Klein et al., 2015; Macosko et al., 2015). Presently, pool-and-split cell labeling strategies are yielding even greater cell capture rates and further reductions in cost (Cao et al., 2017; Rosenberg et al., 2018).
scRNA-seq delivers relatively high-dimensional datasets, consisting of thousands of measurements across thousands of individual cells. Computational tools based on dimensionality reduction seek to reduce this complexity, clustering cells based on transcriptional similarity and enabling their visualization within two-dimensional space (Becht et al., 2018; Satija et al., 2015). It is important to note here that cluster-specific gene expression is used to infer cell type, representing an initial prediction of identity that must be orthogonally validated. One key limitation of scRNA-seq is that it requires tissue disruption and cell destruction, resulting in loss of spatial information that is valuable for cell type identification. Maintaining this spatial information has been a recent focus of new single cell techniques (reviewed by Mayr et al., 2019 in this issue). For example, multiplexed in situ hybridization and sequencing technologies have enabled the measurement of gene expression at subcellular spatial resolution within intact tissues (Chen et al., 2015; Lee et al., 2014). Although these approaches initially required the upfront selection of genes for analysis, information on the expression of thousands of transcripts (Eng et al., 2019) and even genome-wide gene expression can now be captured (Rodriques et al., 2019). Overall, these technologies are particularly promising, offering high-resolution visualization of many cellular features in situ, thereby allowing powerful predictions of cell identity to be made, based on phenotype.
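To make the clustering step described above concrete, the following Python sketch log-normalizes a toy count matrix, groups cells by transcriptional similarity and reports one candidate marker gene per cluster. It is an illustration under stated assumptions rather than the procedure of any cited tool: the gene names, counts and helper functions are hypothetical, plain k-means stands in for the clustering used by published packages, and the dimensionality reduction step is omitted for brevity.

```python
import math

# Hypothetical toy data: rows are cells, columns are genes.
GENES = ["GeneA", "GeneB", "GeneC", "GeneD"]
COUNTS = [
    [90, 80, 2, 1],
    [70, 95, 0, 3],
    [85, 60, 1, 0],
    [3, 1, 75, 90],
    [0, 2, 88, 70],
    [2, 0, 60, 95],
]


def log_normalize(counts, scale=1000.0):
    """Scale each cell to a common total, then apply log(1 + x)."""
    normalized = []
    for row in counts:
        total = sum(row)
        normalized.append([math.log1p(scale * c / total) for c in row])
    return normalized


def sq_dist(a, b):
    """Squared Euclidean distance between two expression vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(data, k, n_iter=20):
    """Plain k-means with deterministic farthest-point initialization."""
    centers = [data[0]]
    while len(centers) < k:
        # Add the cell that is farthest from its nearest existing center.
        centers.append(
            max(data, key=lambda p: min(sq_dist(p, c) for c in centers))
        )
    labels = [0] * len(data)
    for _ in range(n_iter):
        # Assign every cell to its closest center.
        labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in data
        ]
        # Move each center to the mean of its assigned cells.
        for j in range(k):
            members = [p for p, lab in zip(data, labels) if lab == j]
            if members:
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def top_marker(data, labels, cluster):
    """Gene with the largest mean difference between a cluster and the rest."""
    inside = [p for p, lab in zip(data, labels) if lab == cluster]
    outside = [p for p, lab in zip(data, labels) if lab != cluster]
    diffs = [
        sum(col_in) / len(inside) - sum(col_out) / len(outside)
        for col_in, col_out in zip(zip(*inside), zip(*outside))
    ]
    return GENES[diffs.index(max(diffs))]


norm = log_normalize(COUNTS)
labels = kmeans(norm, k=2)
markers = {c: top_marker(norm, labels, c) for c in sorted(set(labels))}
print(labels, markers)
```

In this sketch the marker genes only suggest a cell type label for each cluster; as emphasized above, such a prediction would still require orthogonal validation.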
Cell function: a ground truth of cell identity
Ultimately, cell identity is best defined by function. One powerful method for investigating cell function involves the physical elimination of cells, followed by observation of any physiological or behavioral impact on the organism. For example, laser ablation of a specific subset of C. elegans neurons revealed their role in locomotion (Chalfie et al., 1985). Alternatively, where a cell type is exclusively marked by expression of a specific gene, genetic ablation is possible, as illustrated by the targeted expression of a toxin gene to selectively kill pancreatic acinar cells (Palmiter et al., 1987). Although elegant in approach, ablation experiments are limited if cells cannot be physically accessed or are not marked by exclusive gene expression. For example, in the context of assessing cell function in humans, ablation experiments are clearly not feasible. Under these more limited circumstances, cells can be isolated and their function tested in vitro or in xenograft models. These approaches are being facilitated by single cell technologies that can identify new cell surface marker combinations at the proteomic level, for a given transcriptional state, enabling new cell species to be captured by flow cytometry and functionally assessed (Peterson et al., 2017; Stoeckius et al., 2017). However, assigning cell function to a previously undescribed cell type would require an intractable array of assays to be deployed. Moreover, isolated cells often quickly lose their phenotype and function if culture conditions are not optimized, as illustrated by the dedifferentiation of ex vivo cultured hepatocytes (Elaut et al., 2006). Therefore, how do we begin to explore the function of novel cell types?
Where it is impractical to validate cell identity based on functional assays, will it be possible to predict cell function? Gene ontology serves as one commonly implemented method to predict cell function and behavior based on gene expression patterns (Ashburner et al., 2000). However, this approach often returns vague annotations, as gene expression does not directly translate to cell function. Considering that proteins are key effectors of cell function, measurement of protein abundance may be a more accurate predictor. Indeed, machine learning approaches have been deployed to infer cellular function based on tissue-specific protein function (Zitnik and Leskovec, 2017). To improve these predictions, quantifying protein localization in addition to protein abundance, e.g. via spatial proteomics, will undoubtedly prove beneficial. In this context, the recent construction of high-resolution cell atlases of protein expression, based on immunostaining of 12,003 proteins across 56 human cell lines, is extremely valuable (Thul et al., 2017). Also promising are machine learning algorithms that can be used to predict protein expression and localization in cells, based on light microscopy images alone (Christiansen et al., 2018). Indeed, the broader application of machine learning to weave a more comprehensive picture of cell phenotype may serve well to infer cell function and classify cell identity (Smith et al., 2018). However, where possible, these predictive approaches must ultimately be supported by experimental evidence.
Lineage: new tracking technologies reveal cellular origins
So far, I have discussed some of the key tools that can be used to measure cellular phenotype and function, and how they serve to define cell identity. Consider a situation where the composite of these measurements reveals a previously undescribed cell type. It may be possible to assign some function to this new cell species using predictive tools, ideally confirmed by experimental assays. Even so, to fully understand cell identity is to place it within the context of all other cell types, in a cellular taxonomy. Constructing such a classification of cell identities from a snapshot of the adult organism in homeostasis is challenging, especially at present, when datasets are sparse and we are still working to best integrate them in a meaningful way (Stuart and Satija, 2019). Instead, understanding the origins of a cell's identity, its developmental lineage, is a powerful and simple way to position a cell within a much more complex hierarchy. At a minimum, the new cell species can then be connected to its nearest relatives to provide further clues as to its role in the organism. Can, then, developmental origins alone provide sufficient information to define cell identity?
Lineage tracing, the identification of all progeny stemming from an individual cell, originates from Whitman's light microscopy studies of cell cleavage and eventual cell fate in invertebrate embryos (reviewed by Kretzschmar and Watt, 2012). Following on from these early studies, C. elegans has proven to be a particularly powerful model for lineage tracing, given its amenability to imaging, its relatively small number of somatic cells and its invariant cell lineage. Indeed, a complete lineage tree for every cell in the C. elegans embryo has been constructed via non-invasive live imaging, documenting how cells decrease in potential and increase in specialization as development progresses (Sulston et al., 1983).
With sequencing technologies, new methods to diagram the relationships between lineal ancestry and prospective cell fates have emerged. These stemmed from DNA-based barcoding approaches, where cells are labeled with random heritable DNA sequences (Lu et al., 2011), later progressing to transcribed barcodes that allow clonal relationships and cell identity to be read in parallel (Yao et al., 2017). These early approaches enable clonal analysis, i.e. all descendants of an ancestrally marked founder cell can be identified via inheritance of their integrated barcodes. However, lineage relationships between clonal descendants cannot be mapped using these techniques. New single cell tracking approaches are emerging to fill this gap (reviewed by McKenna and Gagnon, 2019 in this issue). For example, sequential rounds of labeling with transcribed barcodes have enabled the construction of lineage trees (Biddy et al., 2018). In an alternative approach, CRISPR/Cas9-based genome editing has been leveraged to introduce mutable genetic labels into individual cells (Alemany et al., 2018; Raj et al., 2018; Spanjaard et al., 2018). Yet another method, transposon-based TracerSeq (Wagner et al., 2018), exploits the Tol2 transposase to randomly integrate unique heritable labels into individual cell genomes; asynchronous insertion over successive cell divisions then permits lineage tree reconstruction. When applied to zebrafish development, TracerSeq revealed evidence of convergent differentiation, where clonally distinct embryonic fields give rise to similar cell types (Wagner et al., 2018). In contrast, some clonally related cells diverged toward distant identities, supporting the case for divergent differentiation. Thus, lineage analyses do not always produce an expected tree structure, i.e. cells from diverse embryonic origins can converge on a similar identity. This is not surprising given classic C. elegans lineage tracing studies showing that similar neuron types can be generated by distinct lineages (Sulston et al., 1983). More recently, the same phenomenon has been suggested in mouse development, where myocytes are produced by two convergent trajectories, and neurons by several trajectories (Cao et al., 2019). However, it is important to note that in this study of mouse development, trajectories were inferred via computational methods and are not based on ground truth data. Nonetheless, taken together, we must bear these examples of convergent differentiation in mind when considering the utility of lineage alone in defining cell identity.
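The logic of barcode-based clonal analysis, and the way it can expose convergent and divergent differentiation, can be sketched in a few lines of Python. This is a simplified illustration, not the analysis pipeline of any cited study: the cell records, barcode names and cell type annotations are hypothetical, and each cell is assumed to carry exactly one cleanly recovered barcode.

```python
from collections import defaultdict

# Hypothetical records: (cell_id, inherited_barcode, annotated_cell_type).
CELLS = [
    ("c1", "BC01", "myocyte"),
    ("c2", "BC01", "myocyte"),
    ("c3", "BC01", "neuron"),
    ("c4", "BC02", "myocyte"),
    ("c5", "BC02", "myocyte"),
    ("c6", "BC03", "epidermis"),
]


def group_clones(cells):
    """Cells sharing a heritable barcode descend from the same founder."""
    clones = defaultdict(list)
    for cell_id, barcode, _ in cells:
        clones[barcode].append(cell_id)
    return dict(clones)


def convergent_types(cells):
    """Cell types that arise in more than one clone."""
    clones_per_type = defaultdict(set)
    for _, barcode, cell_type in cells:
        clones_per_type[cell_type].add(barcode)
    return sorted(t for t, b in clones_per_type.items() if len(b) > 1)


def divergent_clones(cells):
    """Clones whose members adopt more than one cell type."""
    types_per_clone = defaultdict(set)
    for _, barcode, cell_type in cells:
        types_per_clone[barcode].add(cell_type)
    return sorted(b for b, t in types_per_clone.items() if len(t) > 1)


clones = group_clones(CELLS)
convergent = convergent_types(CELLS)
divergent = divergent_clones(CELLS)
print(clones, convergent, divergent)
```

Note that a single static barcode, as assumed here, identifies clones but not the branching order within them; resolving that order requires labels that accumulate over successive divisions, as in the approaches described above.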
Another limitation of relying on lineage to facilitate cell type identification is its deployment in the context of human development. How, in the absence of ground truth data that can be used to map lineage relationships, can we infer a meaningful and accurate cell developmental hierarchy? Representing a relatively simple experimental strategy, retrospective lineage tracing exploits naturally occurring genetic variation to trace clonally related cells (Ludwig et al., 2019), but this is limited in scale and cannot produce detailed lineage trees. As an alternative, computational approaches enable temporal reconstruction of scRNA-seq data (Saelens et al., 2019; Tritschler et al., 2019 in this issue). However, the resulting trajectories are inferred, relying on sufficient capture and sampling of intermediate cell states. This can be problematic, particularly for tracing the origins of human cell identities. Here, in vitro models of mammalian development (Huch and Koo, 2015) could offer valuable insights into human development. Another possibility is to leverage non-human primate models, performing cross-species comparisons to infer lineage (Boroviak et al., 2018).
Altogether, considering the restricted opportunity for ground truth lineage tracing in humans, and the above evidence of convergent differentiation, defining cell identity based on lineage alone may not provide accurate cell type classification. However, combining lineage with phenotypic and anatomical features could be powerful, especially given that spatial transcriptomics is now poised to enable the generation of fate maps by supplementing lineage trees with positional information.
State: same identity, different guise
In the previous sections, I explored how high-resolution snapshots of cell phenotype and function, together with lineage, can serve to define cell identity. A third and essential facet of cell identity is ‘state’, which can be described as the range of cellular phenotypes arising from the interaction of a defined cell type with its environment. T cells serve as a well-characterized example: these cells exist in different activation states, which arise in response to different stimuli, yet they maintain their T-cell identity (Zemmour et al., 2018). Indeed, cell identity is generally stable, maintained by the autoregulation of identity-specifying transcription factors (Holmberg and Perlmann, 2012). In this respect, cell identity can be thought of as ‘hard-wired’, although it is reprogrammable under defined conditions (as exemplified by Takahashi and Yamanaka, 2006). In contrast, cell state can be thought of as ‘soft-wired’, where a given cell type can exist in a range of subtly different states, raising the issue of how cell identity and state can be distinguished for previously uncharacterized cell types. For example, how can we be confident that a novel transcriptional signature represents a new cell type rather than a known cell type in an unrecognized state? As the cell transcriptome adjusts rapidly in response to changes in environmental conditions, reliance on scRNA-seq-based technologies alone is likely insufficient to address these questions. In this respect, probing heritable, epigenetic signatures of cell identity (reviewed by Ludwig and Bintu, 2019 in this issue) may provide a more stable measure of cell type, permitting identity to be distinguished from state. For example, ATAC-seq (assay for transposase-accessible chromatin using sequencing) provides information on chromatin accessibility and can now be applied at single cell resolution (Cusanovich et al., 2018).
Ideally, ‘multi-omic’ measurements will be collected from the same individual cell (Cao et al., 2018), revealing the different transcriptional states that are associated with the same epigenetic signatures. Ultimately, though, these technologies provide only a ‘snapshot’ of cell phenotype within a tissue, with connections between identity and state largely inferred, providing little objective measurement of the states associated with a given identity.
To provide a direct measure of cell state potential, single cell clonal or lineage mapping can be applied to map the emergence of different cell states from a given cell identity. To achieve this, the introduction of perturbations will be essential. For example, several recent methods have employed pooled CRISPR/Cas9 genome editing to introduce a large array of genetic perturbations into a population of cells, followed by measurement of the effects via scRNA-seq or scATAC-seq (Adamson et al., 2016; Dixit et al., 2016; Rubin et al., 2018 preprint). This approach could be modified to expose cells to a range of different environmental perturbations, e.g. exposure to different cytokines, tracking features of clonally related cells under different conditions and pushing given cell types into their full range of potential states. Altogether, this will provide ground truth data that reflect the different cell states that can arise from the same cell identity in response to different environmental cues. Using these approaches, we might also explore more extreme scenarios where cells are pushed over their boundaries, into different identities. In such cases, lineage may prove helpful to distinguish the line between a change in identity versus a dramatic change in state. Overall, for each cell identity, we can attach to it the probability that it will exist in a given state under defined conditions, potentially revealing the molecular regulation underlying hard-wired cell identity and soft-wired cell state.
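The idea of attaching to each identity a probability of occupying a given state under defined conditions can be illustrated with a simple frequency estimate. The sketch below is only one possible reading of that proposal: the observations, condition names and state labels are hypothetical, and it assumes that identity and state have already been assigned to each cell by other means.

```python
from collections import Counter, defaultdict

# Hypothetical observations: (cell_identity, condition, observed_state).
OBSERVATIONS = (
    [("T cell", "unstimulated", "resting")] * 3
    + [("T cell", "unstimulated", "activated")] * 1
    + [("T cell", "cytokine", "activated")] * 3
    + [("T cell", "cytokine", "resting")] * 1
)


def state_probabilities(observations):
    """Estimate P(state | identity, condition) as relative frequencies."""
    counts = defaultdict(Counter)
    for identity, condition, state in observations:
        counts[(identity, condition)][state] += 1
    probabilities = {}
    for key, state_counts in counts.items():
        total = sum(state_counts.values())
        probabilities[key] = {s: n / total for s, n in state_counts.items()}
    return probabilities


probs = state_probabilities(OBSERVATIONS)
for (identity, condition), dist in probs.items():
    print(identity, condition, dist)
```

With real perturbation data, the counts would come from clonally related cells tracked across conditions, and sparse categories would call for a more careful statistical treatment than raw frequencies.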
Here, I have outlined three pillars of cell identity – phenotype (and function), lineage and state – each encompassing a unique and complementary set of measurements that together can serve to define cell identity in a systematic and unbiased manner. This approach will undoubtedly reveal new cell identities that can be placed within a larger cellular taxonomy, providing valuable clues to their physiological role. The full application of this framework in a human context may be somewhat limited at present, due to a reliance on in vitro culture systems that do not fully recapitulate in vivo counterparts. However, continued efforts to improve human tissue culture models will prove beneficial in this context. Altogether, these three pillars of cell identity will support the construction of high-resolution dynamic cell atlases, with the promise of revealing novel facets of the molecular regulation controlling cell identity and of providing new opportunities for understanding and manipulating cell fate. These endeavors raise some interesting questions: first, is there a minimal set of observations that will serve to universally define cell identity across all cell types and organisms? This leads to a second question: what information do we need to capture from cells to be able to predict their past and future from their present state? This is particularly exciting, as the construction of a probabilistic model of cell identity could enable, for example, the future disease state of a cell to be predicted, providing new insight into disease progression and diagnosis. These questions are also relevant for the cell fate reprogramming field where, at a minimum, we will gain a high-resolution template to recapitulate the identity of major functional cell types. Once we have amassed a critical amount of information, will the landscape of cell identity be continuous or discrete?
If cell identity can indeed exist as a continuum, this presents the opportunity to stabilize transient phenotypes and to create new cell identities, endowing known cell types with new functions. Through our continued efforts to define cell identity, we come closer to realizing these possibilities.