ABSTRACT
Analysis of single cell transcriptomics (scRNA-seq) data is typically performed after subsetting to highly variable genes (HVGs). Here, we show that Entropy Sorting provides an alternative mathematical framework for feature selection. On synthetic datasets, continuous Entropy Sort Feature Weighting (cESFW) outperforms HVG selection in distinguishing cell-state-specific genes. We apply cESFW to six merged scRNA-seq datasets spanning human early embryo development. Without smoothing or augmenting the raw counts matrices, cESFW generates a high-resolution embedding displaying coherent developmental progression from eight-cell to post-implantation stages and delineating 15 distinct cell states. The embedding highlights sequential lineage decisions during blastocyst development, while unsupervised clustering identifies branch point populations obscured in previous analyses. The first branching region, where morula cells become specified for inner cell mass or trophectoderm, includes cells previously asserted to lack a developmental trajectory. We quantify the relatedness of different pluripotent stem cell cultures to distinct embryo cell types and identify marker genes of naïve and primed pluripotency. Finally, by revealing genes with dynamic lineage-specific expression, we provide markers for staging progression from morula to blastocyst.
INTRODUCTION
Single cell RNA-sequencing (scRNA-seq) was first described in 2009 (Tang et al., 2009) and has since become a cornerstone of stem cell and developmental biology research. scRNA-seq in principle allows the expression levels of the entire transcriptome of individual cells to be measured in an unbiased manner. The mRNA profiles can then be used as a proxy to define the identity of each cell. A key application is to assess how transcriptomes change as cells transition from one cell state to another and thereby obtain insights into regulatory genes and networks.
Computational tools have been developed to extract useful information from the large and complex datasets that scRNA-seq generates. These bioinformatics pipelines can be broadly broken down into three main steps (Haque et al., 2017): (1) sequence alignment; (2) counts matrix processing; (3) data analysis. Here, we focus on the counts matrix processing stage of scRNA-seq analysis, which is a crucial step for increasing ability to extract biologically relevant insights from the data. Due to the importance of counts matrix processing, several open-source packages have been released. Notably, Seurat, Scanpy and Scran (Heumos et al., 2023) provide easy to use workflows for counts matrix processing, which can be further divided into three stages. Basic quality control removes low quality cells (low counts per cell, low genes per cell, high mitochondrial counts) and genes expressed in a very low number of cells, etc. Feature selection allows the identification of a subset of genes that are believed to be more informative of cell identity. Data denoising applies methods such as principal component analysis (PCA) and data regression/smoothing to mitigate the presence of artefacts such as technical noise and batch effects.
The primary aim of feature selection is to address the curse of dimensionality (Bellman, 1957) by finding a subset of the genes that are more discriminative of cell identity. In conventional workflows, this is typically achieved by selection of highly variable genes (HVGs). However, poor reproducibility between different methodologies has brought into question the robustness of HVG selection (Yip et al., 2019).
Following HVG selection, methods that augment or transform the data, such as feature extraction, data smoothing and batch integration, are employed to maximise the biologically relevant information that can be extracted (Chu et al., 2022). Although there are many examples of this approach successfully aiding the identification of interesting gene expression profiles, any computational technique that changes/augments the values in the counts matrix before downstream analysis has the potential to introduce artefacts. For example, smoothing techniques that try to impute and/or repair spurious gene expression values in a scRNA-seq counts matrix based on cell-cell similarities or gene-gene relationships may introduce computational false-negative or false-positive expression values that obfuscate real biological signals of interest (Andrews and Hemberg, 2018).
We sought to evaluate whether an alternative method of feature selection could increase the biological signal-to-noise ratio in scRNA-seq data in a manner that mitigates or removes the need for further counts matrix processing via feature extraction (e.g. PCA), data smoothing or batch integration. A key benefit of reducing data augmenting processes is that we may be more confident that patterns identified in further downstream analysis are not due to the introduction of computational artefacts.
We recently outlined a mathematical framework termed Entropy Sorting (ES) and incorporated it into two software packages, Functional Feature Amplification Via Entropy Sorting (FFAVES) and Entropy Sort Feature Weighting (ESFW) (Radley et al., 2023). Together, FFAVES and ESFW seek to identify genes that are highly structured within the data and are therefore more likely to be predictive of cell identity. Gene selection using these tools exposed the hitherto ambiguous inner cell mass (ICM) population in human pre-implantation embryo data (Radley et al., 2023; David et al., 2023). However, FFAVES and ESFW operate on binarised data, such that genes are considered active or inactive in individual cells. FFAVES is used to correct sub-optimal gene discretisation before the application of ESFW, which assigns feature importance weights. The ESFW weights are then used to select for a subset of all the genes in the data that are more informative of cell identity, in a similar manner to HVG selection.
Here, we update ES so that it can be applied to continuous data. We formulate this into a new software package, continuous Entropy Sort Feature Weighting (cESFW) and outline a workflow that uses cESFW to perform feature selection on scRNA-seq counts matrices. We validate cESFW on synthetic data and then apply the cESFW workflow to human embryo scRNA-seq datasets, achieving high resolution of cell types and developmental trajectories without augmentation or smoothing of the original data. Our analysis provides clear evidence for the two-step model of lineage segregation (Cockburn and Rossant, 2010) in early human embryo development by elucidating distinct branch points from morula to ICM or trophectoderm (TE), and from ICM to epiblast (Epi) or hypoblast (Hyp).
These findings demonstrate that cESFW can reveal gene expression states and transitions in scRNA-seq data that were previously unobservable through conventional feature selection techniques.
RESULTS
Proposed cESFW feature selection workflow for scRNA-seq data
Data analysis packages such as Seurat, Scanpy and Scran involve multiple steps, some of which alter the values in the original counts matrix. In Fig. 1, HVG selection is highlighted in green because, although it reduces the size of the counts matrix by subsetting down to a smaller set of genes, none of the expression values within the matrix is changed. Conversely, PCA and batch correction methods are highlighted in red because they augment the starting scRNA-seq counts matrix before passing the data onto the next step of analysis, and hence have a potential to introduce computational artefacts.
scRNA-seq counts matrix processing workflows. Left hand column shows a high-level outline of the workflows implemented by widely used scRNA-seq bioinformatics pipelines such as Scran, Seurat and Scanpy. Right hand column outlines the steps of our cESFW workflow. Colours indicate the degree to which the steps alter expression counts values in the original scRNA-seq counts matrix before the data are passed onto downstream analysis steps.
scRNA-seq counts matrix processing workflows. Left hand column shows a high-level outline of the workflows implemented by widely used scRNA-seq bioinformatics pipelines such as Scran, Seurat and Scanpy. Right hand column outlines the steps of our cESFW workflow. Colours indicate the degree to which the steps alter expression counts values in the original scRNA-seq counts matrix before the data are passed onto downstream analysis steps.
As an alternative to conventional workflows, we propose cESFW (Fig. 1). This workflow facilitates unsupervised gene importance weighting based on entropy sorting rather than HVG selection. Below, we demonstrate that the cESFW gene selection methodology leads to improved downstream analysis, without the need to change any of the expression values in the original counts matrix. This reduces the potential for computational artefacts to be introduced into the analysis of an scRNA-seq dataset. A key advance from our previous approach (Radley et al., 2023) is that cESFW can be applied to data with continuous values. For a description of how ESFW was adapted into cESFW see supplementary Materials and Methods, and for a detailed breakdown of each step of our cESFW workflow see Materials and Methods.
cESFW robustly discriminates between significant and non-significant gene correlation signals in synthetic data
The foundation of the cESFW workflow is the use of cESFW to perform feature selection in place of HVG feature selection. Both cESFW and HVG feature selection aim to take a high dimensional scRNA-seq dataset and subset it down to a set of genes that are believed to be more informative of cellular identity than the entire gene set. We used synthetic data to perform a side-by-side analysis of cESFW and the HVG implementations of Scran and Seurat. The Scanpy protocol was excluded because, rather than providing a ranked list of genes, Scanpy provides a true/false index for selected HVGs, making it difficult to quantify performance.
We used Dyngen (Cannoodt et al., 2021) to create four synthetic datasets (SD) (Fig. 2A) with known ground truths as to which genes relate to cell-state-specific transcription factor (TF) networks and which are part of ubiquitous housekeeping (HK) gene networks. Dyngen generates synthetic scRNA-seq data by simulating empirically derived gene regulatory networks (GRNs), while allowing control over the size and shape of the generated data. The column labelled ‘High housekeeping gene expression’ (Fig. 2A) refers to a property of Dyngen such that at different random seeds, the average expression of the simulated genes can vary significantly. We took advantage of this to intentionally find random seeds that generated data where either the HK genes had expression levels comparable with the TFs (High housekeeping gene expression=No) or the HK genes had relatively high expression levels compared with the TFs (High housekeeping gene expression=Yes). For example, the average non-zero expression of SD3 TFs and HKs are 9.75 and 1.26, respectively, whereas for SD4 they are 1.74 and 115, respectively (Fig. 2A). We introduced this layer of analysis because previous evaluations of HVG selection methods have highlighted that the gene selection processes can be biased towards genes that have higher mean expression (Yip et al., 2019). The effect of highly expressed HK genes is shown in Fig. 2B,C. In Fig. 2B, the UMAP generated for SD4 using just the TFs shows the expected branching trajectories, whereas in Fig. 2C, including the HK genes for UMAP generation obscures the branching gene expression dynamics.
Comparison of cESFW and HVG feature selection techniques. (A) Summary of the four synthetic datasets (SD) of varying size and shape created using the Dyngen software. (B) UMAP of SD4 created using only the transcription factor (TF) genes. (C) UMAP of SD4 created using all genes [TFs and housekeeping (HKs)]. (D) AUROC curve generated using the ranked gene lists from application of cESFW, Scran and Seurat gene selection to SD4. (E) Precision-Recall curve using the same inputs as D. (F) AUROC scores after applying cESFW, Scran and Seurat gene selection to all four SDs. (G) Precision-Recall AUCs using the same inputs as F. Green stars indicate that cESFW had the highest score for an SD and red stars indicate when a different method had a higher score than cESFW. (H) Summary of all area under curve (AUC) results. Green boxes indicate that cESFW had the highest score for an SD and red boxes indicate when a different method had a higher score. For UMAPs and AUC curves relating to SD1-SD3, see Fig. S1.
Comparison of cESFW and HVG feature selection techniques. (A) Summary of the four synthetic datasets (SD) of varying size and shape created using the Dyngen software. (B) UMAP of SD4 created using only the transcription factor (TF) genes. (C) UMAP of SD4 created using all genes [TFs and housekeeping (HKs)]. (D) AUROC curve generated using the ranked gene lists from application of cESFW, Scran and Seurat gene selection to SD4. (E) Precision-Recall curve using the same inputs as D. (F) AUROC scores after applying cESFW, Scran and Seurat gene selection to all four SDs. (G) Precision-Recall AUCs using the same inputs as F. Green stars indicate that cESFW had the highest score for an SD and red stars indicate when a different method had a higher score than cESFW. (H) Summary of all area under curve (AUC) results. Green boxes indicate that cESFW had the highest score for an SD and red boxes indicate when a different method had a higher score. For UMAPs and AUC curves relating to SD1-SD3, see Fig. S1.
Dyngen allows us to differentiate between genes informative of cell type (TFs) and ubiquitously expressed genes (HKs). We used this to quantify the performance of the ranked gene lists generated by cESFW, Scran and Seurat using receiver operating characteristic (ROC) and Precision-Recall (PR) curves (Grau et al., 2015). For this analysis, the area under (AU) the ROC and PR curves give scores between 0 and 1, where 1 means that the feature selection method is perfect at discriminating between cell type informative TF genes and non-informative HK genes. We show both the ROC and PR curves (Fig. 2D,E) because ROC is more readily comparable across datasets, whereas PR provides a comparable metric of method performance that is more sensitive to class imbalance (many more HK genes compared with TF genes). The AUROC and PR curves for SD4 are shown in Fig. 2D,E, respectively. In both cases, the area under the blue cESFW curve is largest, indicating that cESFW outperforms Scran or Seurat at ranking TFs as more informative of cell identity than HKs. The results of all the ROC and PR curves are summarised in Fig. 2F-H. The AUROC scores of cESFW are higher than Scran and Seurat on all four SDs (Fig. 2F). High AUROC scores can be thought of as enriching highly ranked genes with genes that are indicative of cell type.
Although cESFW does not have the highest PR-AUCs score for SD1 and SD3 (Fig. 2G; Fig. S1C,D), we note that the drop in performance is less than the gain for SD2 and SD4. As SD2 and SD4 are the datasets in which HK gene expression is relatively high compared with TF expression, this reinforces previous findings that HVG selection can be biased towards highly expressed genes (Yip et al., 2019). Therefore, the results in Fig. 2G suggest that cESFW is less sensitive to the presence of highly expressed genes than HVG selection methods.
These results on four independent SDs show that cESFW can provide more robust feature selection than HVG selection.
cESFW reveals high resolution gene expression dynamics in the early human embryo
We incorporated cESFW into the proposed workflow (Fig. 1, right hand column) and exemplified application to biological scRNA-seq using a reference dataset of peripheral blood mononuclear cells (PBMCs) (Stuart et al., 2019). We saw that cESFW performed comparably to Seurat and Scran in identifying cell types and more stably maintains cell type clusters with increasing number of selected top ranked genes (Fig. S2).
We next applied the cESFW workflow to six merged independent scRNA-seq datasets of early human embryo development (Yan et al., 2013; Fogarty et al., 2017; Petropoulos et al., 2016; Yanagida et al., 2021; Meistermann et al., 2021; Xiang et al., 2020). These datasets have previously been characterised (Petropoulos et al., 2016; Stirparo et al., 2018; Meistermann et al., 2021; Singh et al., 2023; Wei et al., 2023), and there is a foundation of experimental knowledge regarding the expected distinct cell types and stages.
The cESFW workflow identified a set of 3012 genes. For details and code see Materials and Methods. Gene set enrichment analysis found Gene Ontology (GO) terms relating to regulation and development. Using this gene set we obtained a smooth, high resolution uniform manifold approximation and projection (UMAP) embedding (Fig. 3A; Fig. S3A). Importantly, the UMAP embedding was generated by simply subsetting down to the 3012 cESFW workflow genes, without any data augmentation or smoothing. The observed structure should therefore reflect biologically significant gene expression patterns, free of computational artefacts that may be introduced through conventional scRNA-seq counts matrix processing workflows (Fig. 1, left hand column). The UMAP clearly displays continuous progression along the developmental time course. Cells from the five pre-implantation embryo datasets are intermingled across the UMAP space, indicating minimal contribution of batch effects (Fig. 3B; Fig. S3B). Cells from the post-implantation extended culture samples (Xiang et al., 2020) form groups adjacent to distinct regions of pre-implantation cells, indicating that related gene expression patterns have been identified across the pre- and post-implantation datasets.
High-resolution UMAP embedding of early human embryo development. (A-C) UMAP embedding generated using the 3012 genes identified by our cESFW workflow coloured by time point labels from data source papers (A), coloured by dataset sources (B) and coloured by our cell state cluster annotations (C). cTB, cytotrophoblast; Epi, epiblast; ExE-Mes, extra-embryonic mesenchyme; Hyp, hypoblast; ICM, inner cell mass; preIm-Epi, pre-implantation epiblast; sTB, syncytiotrophoblast; TE, trophectoderm. See Figs S3 and S4 for further details regarding the cluster annotations. (D) Heatmap showing similarity of annotated cell states by calculating the correlation distance matrix of pseudobulk samples using the 3012 cESFW selected genes. Low correlation distances (red) indicate similar samples. (E) Known cell type marker expression profiles validate our UMAP annotations. See Fig. S4 for more examples.
High-resolution UMAP embedding of early human embryo development. (A-C) UMAP embedding generated using the 3012 genes identified by our cESFW workflow coloured by time point labels from data source papers (A), coloured by dataset sources (B) and coloured by our cell state cluster annotations (C). cTB, cytotrophoblast; Epi, epiblast; ExE-Mes, extra-embryonic mesenchyme; Hyp, hypoblast; ICM, inner cell mass; preIm-Epi, pre-implantation epiblast; sTB, syncytiotrophoblast; TE, trophectoderm. See Figs S3 and S4 for further details regarding the cluster annotations. (D) Heatmap showing similarity of annotated cell states by calculating the correlation distance matrix of pseudobulk samples using the 3012 cESFW selected genes. Low correlation distances (red) indicate similar samples. (E) Known cell type marker expression profiles validate our UMAP annotations. See Fig. S4 for more examples.
We performed unsupervised agglomerative clustering on the scRNA-seq counts matrix subset down to the 3012 cESFW genes. Overlaying these clusters onto the UMAP embedding (Fig. S3A) provided an unsupervised foundation to label samples with 15 distinct cell type/stage annotations (Fig. 3C; Fig. S3B,C). Comparison with annotations from previous studies (Petropoulos et al., 2016; Stirparo et al., 2018) corroborated our annotations and UMAP embedding (Fig. S3D,E). Notably, the unsupervised clusters were readily annotated, whereas previous annotations were generated through supervised analysis focused on a handful of candidate marker genes. A heatmap of pseudobulk samples generated by taking the mean expression levels of genes for samples in each of our annotated cell states reveals how closely cell populations are related to one another (Fig. 3D). Genes implicated in the literature to correspond with 13 of our cell state annotations are represented in the 3012 cESFW gene set (Cockburn and Rossant, 2010; Taubenschmid-Stowers et al., 2022; Stirparo et al., 2018; Singh et al., 2023; Corujo-Simon et al., 2023; Liu et al., 2022; Zadora et al., 2017; Yue et al., 2020; Yabe et al., 2016; Yang et al., 2021). Such genes show specific gene expression in the UMAP embedding (Fig. 3E; Fig. S4). Mural TE lacks validated specific markers but is known to express HAND1 (Liu et al., 2022). Although HAND1 is not captured in the cESFW gene set, it shows relatively specific day 6/7 mural TE expression on our UMAP embedding (Fig. S4).
Visualisation of chronological embryo progression with gene expression patterns consistent with the literature demonstrates that the cESFW workflow identifies a set of genes that is highly descriptive of the transcriptional landscape of day 3-14 developing human early embryos. We have created an interactive online portal for visualisation of gene expression profiles on our UMAP embedding and generation of plots at https://bioinformatics.crick.ac.uk/shiny/users/boeings/radley2024umap_app/.
Comparison of cESFW with HVG selection techniques on human embryo data
To compare the performance of cESFW with HVG selection using the same input data, we identified the 3012 top ranked genes for Scran and Seurat workflows. Previous studies comparing HVG selection methods have shown that there is often poor concordance between gene sets selected by different methods (Yip et al., 2019). Intersecting the gene sets from the cESFW, Scran and Seurat workflows, we found only 290/3012 (9.63%) are common to all three (Fig. S5A). We then compared UMAP embeddings. For a fair comparison, we omitted prior PCA selection for the Scran and Seurat UMAPs, as PCA reduction is not part of the cESFW workflow. Scran and Seurat UMAPs broadly separated samples by embryo time points but did not provide a smooth embedding with a coherent progression of cell states and developmental trajectories (Fig. S5C,D).
We used silhouette scores to quantify how well each of the 3012 gene sets formed distinct clusters of samples, using our UMAP cell state annotations (Fig. 3C) as cluster labels. Silhouette scores measure how similar samples are to their designated clusters, with positive scores indicating samples are most similar to the cluster they are a member of, and negative scores designating samples that would be better assigned to a different cluster. The average of the silhouette scores for all samples within a cluster then quantifies the quality of the cluster as whole. The cESFW genes produced higher silhouette scores than the Seurat or Scran gene sets for 14 out of the 15 clusters, and similar scores for ICM (Fig. S5B). Notably the Epi/Hyp branch, Hyp and Mid TE score positive with cESFW but have negative silhouette scores using HVG, signifying that the HVG gene sets are unable to identify these cell types as distinct from other cells in the data.
These results show qualitatively and quantitatively that cESFW gene selection outperforms HVG selection workflows on these human embryo scRNA-seq data and help explain why the UMAP generated using the 3012 cESFW gene set is better able to readily separate known cell types from one another.
Lineage branching during blastocyst development
The textbook model of blastocyst formation derived from studies in the mouse embryo entails sequential lineage bifurcations that generate first TE and ICM, and then Hyp and Epi (Cockburn and Rossant, 2010). However, the existence of two branch points in human embryogenesis has been questioned in analyses of scRNA-seq data (Petropoulos et al., 2016; Goedel and Lanner, 2021). More recent studies have proposed two bifurcations, but have not demonstrated separate branch point populations (Yanagida et al., 2021; Radley et al., 2023; Wei et al., 2023).
The cESFW embedding unambiguously reveals two binary branch points in pre-implantation embryogenesis: from morula to ICM or early TE, and from ICM to Epi or Hyp. Unsupervised clustering identifies populations of cells at these intersections which we labelled ICM/TE and Epi/Hyp branch cells, respectively (Fig. 3C). Silhouette scores quantify both as distinct groups of cells according to transcriptional profile (Fig. S5B). Examining the co-occurrence of cell states in individual embryos further substantiates the two step model (Fig. 4A,B; Fig. S6A,B). ICM/TE branch and early TE cells most commonly co-occur with one another, whereas ICM cells rarely co-exist in embryos with Epi or Hyp. In Fig. 4C, we used our UMAP and ranked gene list (Table S1) to highlight three genes that show relatively specific upregulation within the ICM/TE branch population.
REject cells are within the ICM/TE branch point cluster. (A,B) Quantification of cell state co-occurrence in individual embryos. See Materials and Methods for details. Heatmap in A shows cell state co-occurrence in individual embryos, quantified via ESSs. Hierarchical clustering shows that temporally close cell states co-occur with each other most frequently. Colour bar shows ESSs. B shows the same as A, but the colour bar has been truncated so that the maximum value is 0.1, to better appreciate cell co-existence in individual embryos. (C) Proposed ICM/TE branch marker genes. cTB, cytotrophoblast; ExE-Mech, extra-embryonic mesenchyme; Hypo, hypoblast; ICM, inner cell mass; preIm-Epi, pre-implantation epiblast; sTB, syncytiotrophoblast; TE, trophectoderm. (D) REject cells identified by Singh et al. (2023) coloured on human embryo UMAP from Fig. 3.
REject cells are within the ICM/TE branch point cluster. (A,B) Quantification of cell state co-occurrence in individual embryos. See Materials and Methods for details. Heatmap in A shows cell state co-occurrence in individual embryos, quantified via ESSs. Hierarchical clustering shows that temporally close cell states co-occur with each other most frequently. Colour bar shows ESSs. B shows the same as A, but the colour bar has been truncated so that the maximum value is 0.1, to better appreciate cell co-existence in individual embryos. (C) Proposed ICM/TE branch marker genes. cTB, cytotrophoblast; ExE-Mech, extra-embryonic mesenchyme; Hypo, hypoblast; ICM, inner cell mass; preIm-Epi, pre-implantation epiblast; sTB, syncytiotrophoblast; TE, trophectoderm. (D) REject cells identified by Singh et al. (2023) coloured on human embryo UMAP from Fig. 3.
To examine the inferred lineage trajectories, we performed RNA velocity analysis (Bergen et al., 2020) (Fig. S6D) and trajectory inference using the STREAM software package (Chen et al., 2019) (Fig. S6E). The RNA velocity and trajectory inference results support the existence of branch points between TE and ICM fates, and ICM to Epi/Hyp fates. The RNA velocity vectors are aligned well with temporal ordering and lineage divergences across the embedding.
In a recent publication, Singh et al. (2023) re-analysed the Petropoulos et al. (2016) human embryo scRNA-seq dataset and highlighted a previously undescribed group of day 4/5 cells. These authors were unable to place this cluster in a differentiation trajectory and noted upregulation of several proposed apoptosis markers. They therefore speculated that that this group of cells had accumulated DNA damage and was fated to apoptose, denoting them ‘REject’ cells. However, when we inspected the REject cells in the cEFSW embedding we found that they do not constitute a separate cluster but are almost entirely within the ICM/TE branch point (Fig. 4D). We therefore investigated the markers suggested to indicate that REject cells are pre-apoptotic. We found that most of these genes are expressed as highly elsewhere in the developing human embryo, suggesting either that they are not strong discriminators of pre-apoptotic versus non-apoptotic cells, or that apoptosis is distributed across cell types (Fig. S7). For example, although suggested apoptosis markers Bik, Atg2a and Atf3 are upregulated in the ICM/TE branch when compared against the early TE, ICM, pre-implantation Epi (preIm-Epi) and Hyp populations, these three genes are expressed more highly in the eight-cell and/or morula populations (Fig. S7B). In a similar vein, Bak1, Ctsb and Casp6 were suggested to be apoptotic markers specific to the REject population, but we found these genes to be broadly expressed across several cell types of day 3-14 human embryos (Fig. S7C). Thus, we did not find strong indications that the proposed REject cells are specifically fated for apoptosis.
Overall, these findings substantiate the occurrence of an initial lineage bifurcation between TE and ICM on day 4/day 5 that precedes segregation of ICM to Hyp and Epi between days 5 and 6.
Comparison of human pluripotent stem cell cultures to reference human embryo scRNA-seq embedding
Human pluripotent stem cells (PSCs) are generally considered to be analogues of the pluripotent epiblast lineage in the embryo. However, PSCs exist in different states that are propagated in distinct signalling environments (Smith, 2017; Nichols and Smith, 2009). Correlation of PSCs with developmental stages of Epi in the human embryo has been problematic due to the scarcity of relevant data and the limited resolution provided by previous analysis methods. We took the opportunity provided by our high-resolution embedding to re-evaluate the transcriptome relatedness of various cultures of human PSCs to cell types in the day 3-14 embryo, highlighting preIm-Epi and post-implantation Epi (postIm-Epi) (Fig. 5A; Fig. S4).
Characterising the human embryo counterparts of in vitro stem cell cultures. (A) UMAP embedding highlighting the embryo eight-cell, pre-implantation epiblast (preIm-Epi) and embryonic disc populations. FBP1 and SFRP2 shown as previously identified preIm-Epi and Embryonic disc markers, respectively (Messmer et al., 2019). (B) Projection of cultured naïve, primed and expanded potential stem cells (EPSCs) scRNA-seq datasets onto our UMAP embedding. To aid interpretation, sample transparencies are scaled by the density of samples on the embedding from the same condition, such that samples in the densest region are completely opaque. Abbreviations in brackets denote relevant culture media. (C) Quantification of cultured naïve, primed and EPSC scRNA-seq sample identities. Colour coding as in panel A, with grey denoting unassigned. (D) Heatmap of z-score normalised gene expression of eight naïve specific and eight primed specific gene markers. All values are pseudobulk expression values, generated by grouping single cells by their dataset and predicted cell state labels from panel C, and taking the average expression of each group.
Characterising the human embryo counterparts of in vitro stem cell cultures. (A) UMAP embedding highlighting the embryo eight-cell, pre-implantation epiblast (preIm-Epi) and embryonic disc populations. FBP1 and SFRP2 shown as previously identified preIm-Epi and Embryonic disc markers, respectively (Messmer et al., 2019). (B) Projection of cultured naïve, primed and expanded potential stem cells (EPSCs) scRNA-seq datasets onto our UMAP embedding. To aid interpretation, sample transparencies are scaled by the density of samples on the embedding from the same condition, such that samples in the densest region are completely opaque. Abbreviations in brackets denote relevant culture media. (C) Quantification of cultured naïve, primed and EPSC scRNA-seq sample identities. Colour coding as in panel A, with grey denoting unassigned. (D) Heatmap of z-score normalised gene expression of eight naïve specific and eight primed specific gene markers. All values are pseudobulk expression values, generated by grouping single cells by their dataset and predicted cell state labels from panel C, and taking the average expression of each group.
We examined published scRNA-seq datasets (Guo et al., 2021; Messmer et al., 2019) from conventional primed PSCs and cells in two naïve culture conditions, PXGL (Guo et al., 2017) and its predecessor t2iLGö (Takashima et al., 2014). We also assessed data for expanded potential stem cells (EPSCs), originally suggested to be related to the eight-cell stage (Gao et al., 2019). We used the umap.transform function of the Python UMAP package (McInnes et al., 2018) to map the query cells from the PSC cultures onto our reference human embryo UMAP. This method places individual query samples closest to their most transcriptionally similar counterparts in the UMAP in an unsupervised manner based on all 3012 cESFW genes. Fig. 5B shows the projections of individual cells from the query scRNA-seq datasets onto the embedding. Conventional primed PSCs are placed in the vicinity of the day 10-14 embryonic disc Epi. EPSCs are similarly positioned close to postIm-Epi, with <2% of cells adjacent to earlier stages. These findings are consistent with previous comparative analyses with non-human primate embryos (Guo et al., 2021; Nakamura et al., 2017). In contrast, naïve PSCs overlie preIm-Epi (day 6-7). Notably cells in PXGL are more homogeneous than cells in t2iLGö. Quantification using a k-nearest neighbour classifier trained on the embryo cell state annotations (Fig. 5C) assigns 97% of PXGL cells to preIm-Epi, compared with 77% for t2iLGö.
We also evaluated published data for 4CL and e4CL cells that are proposed to be related to day 5 ICM and eight-cell stages, respectively (Mazid et al., 2022). These samples are more heterogeneous than other PSC cultures. They comprise preIm-Epi-like proportions of 48% for 4CL and 60% for e4CL. A substantial fraction (39%) of 4CL cells are postIm-Epi-like, whereas many e4CL cells are positioned between preIm-Epi and postIm-Epi, indicating either an indeterminate mixed identity or transitioning intermediates. e4CL cultures contain only 5% of cells which are related to day 5 ICM and 5% of cells related to the eight-cell embryo.
To validate UMAP label transfer, we calculated the correlation distances between samples grouped by our embryo cell type annotations and each of the PSC scRNA-seq samples (Fig. S8). The results closely match those in Fig. 5C for all samples, demonstrating that label transfer in the low dimensional UMAP space is consistent with cell type prediction in the high dimensional 3012 gene space.
Identifying naïve and primed human PSC markers
Previous studies have identified gene markers for naïve and primed human PSCs by examining differential expression between these two PSC types specifically. We sought to characterise markers in the context of the entire day 3-14 human embryo. We started by taking the top three naïve versus primed PSC markers identified by Messmer et al. (2019) and plotted them on our UMAP embedding (Fig. S9). We saw that, although these six markers were clearly up- or downregulated between the preIm-Epi and embryonic disc samples, for five out of the six genes, expression is not restricted to the Epi lineage. In contrast, the top three ranked genes for our UMAP-annotated preIm-Epi and embryonic disc cell types (Table S1; Fig. 3C) show more restricted expression to the preIm-Epi or embryonic disc regions (Fig. S9).
We then proceeded to use our rank gene lists to identify genes that showed specific upregulation in the preIm-Epi or embryonic disc, while also showing differential expression in naïve versus conventional human PSCs. In Fig. 5D we present a heatmap of eight naïve and eight primed marker genes that have consistent cell-type-specific differential expression across the cell line datasets, and in our reference human embryo embedding. We also note that, in this heatmap, the 4CL cells show upregulation of both naïve and primed markers, consistent with evidence from UMAP label transfer and correlation analysis that this culture condition displays mixed identity (Fig. 5B,C; Fig. S8).
This analysis of naïve, primed and EPSC stem cells from different research groups demonstrates how access to an unbiased view of the early human embryo can facilitate classification of in vitro cell cultures. Furthermore, we show that taking a more holistic approach to cell state marker identification can provide more specific cell type marker genes.
Emergence of Epi, Hyp and TE signatures during blastocyst formation
Between days 3 and 7 post fertilisation, the morula of the human embryo develops into the blastocyst, comprising three distinct lineages: Epi, Hyp and TE. We sought to use the cESFW embedding to identify genes that may be useful markers for staging cell identities during blastocyst formation.
The Entropy Sort Score (ESS), is a correlation metric derived from ES (Radley et al., 2023). A potential benefit of the ESS for gene ranking compared with other typical methods such as the t-test or Wilcoxon test is that the ESS was specifically derived to reduce whenever gene expression is observed outside of a specific population of interest. Therefore, when there are multiple cell types within a dataset, the ESS can be better suited for identifying regions of specific gene expression, as opposed to more general up- or downregulation. Using the ESS, we ranked genes by how well their expression profiles overlap with the Epi, Hyp or TE morula-to-blastocyst trajectories, while also progressively restricting the analysis to the terminal Epi, Hyp or TE populations. For examples and details regarding this progressive restriction, see Fig. S10 and Materials and Methods.
In Fig. 6A-C we present heatmaps highlighting sets of genes that show sequential upregulation as the morula differentiates into Epi (A), Hyp (B) or TE (C). Day 7 TE comprises both mural and polar TE. Fig. 6D-F and Fig. S10 show example genes with specific upregulation along each trajectory. For example, during the emergence of TE cells during blastocyst formation, Slc28a3 (Fig. 6F) is first upregulated in the morula and maintains expression until day 7 TE, while being downregulated in the ICM, Epi and Hyp lineages. Conversely, the TE marker Hapln1 (Fig. S10B) is upregulated later than Slc28a3 at around day 6 and is maintained in day 7 of TE.
Emerging epiblast, hypoblast and trophectoderm signatures during blastocyst development. (A-C) Heatmaps of pseudobulk z-score-normalised human embryo data, displaying genes that show blocks of sequential upregulation between morula and Epi (A), Hyp (B) or TE (C) lineages. Pink boxes highlight most relevant cell types for each lineage. Green highlighted genes are presented in UMAPs within this figure and in Fig. S10. (D-F) UMAP gene expression profiles for example genes showing expression from morula to Epi (D), Hyp (E) or TE (F) lineages. Epi, epiblast; Hypo, hypoblast; ICM, inner cell mass; preIm-Epi, pre-implantation epiblast; TE, trophectoderm.
Emerging epiblast, hypoblast and trophectoderm signatures during blastocyst development. (A-C) Heatmaps of pseudobulk z-score-normalised human embryo data, displaying genes that show blocks of sequential upregulation between morula and Epi (A), Hyp (B) or TE (C) lineages. Pink boxes highlight most relevant cell types for each lineage. Green highlighted genes are presented in UMAPs within this figure and in Fig. S10. (D-F) UMAP gene expression profiles for example genes showing expression from morula to Epi (D), Hyp (E) or TE (F) lineages. Epi, epiblast; Hypo, hypoblast; ICM, inner cell mass; preIm-Epi, pre-implantation epiblast; TE, trophectoderm.
These findings illustrate how ES and our UMAP embedding can be used to identify new gene expression patterns that may aid future experiments and analyses, also exemplified in Dattani et al. (2024).
DISCUSSION
In this work, we introduce cESFW, a feature selection method designed for distinguishing between cell state informative and uninformative genes in high dimensional scRNA-seq data. A key ethos of the cESFW workflow is that the original values in the scRNA-seq counts matrix are not changed in any way. We show that using cESFW to subset down to a set of informative genes is not only sufficient to identify dynamic gene expression states, but can outperform more complex methods that augment the values of the counts matrix. These findings challenge the view that scRNA-seq data are inherently too noisy to identify high resolution gene expression dynamics without repairing, smoothing or transforming the data first.
On synthetic scRNA-seq data we found that cESFW performs comparably with or better than conventional HVG selection. HVG selection has been shown to be sensitive to the presence of highly expressed genes, which can lead to a significant drop in feature selection performance. Previous work has shown that Seurat and Scran HVG selection are more robust to differences in mean expression than other HVG selection methodologies (Yip et al., 2019). However, on our synthetic data we found that both Scran and Seurat HVG selection performance dropped significantly when data contained a large number of highly-expressed HK genes. Conversely, cESFW feature selection is relatively agnostic to the presence of genes with high mean expression, with a minimal drop in performance. Less influence of expression level when ranking gene importance may be a significant contributor to the superior performance of cESFW on biological data. In addition, the multivariate approach of cESFW may be more robust than univariate HVG feature selection. In the univariate approach, genes are considered in isolation, leading to their importance ranking being more sensitive to fluctuations in properties such as mean expression. cESFW, in contrast, considers pairwise the expression profile of an inspected gene with every other gene in the data, and generates a gene importance weight by quantifying the degree to which the inspected gene forms networks of significantly co-expressing genes.
Applying our cESFW workflow to early human embryo scRNA-seq data generates a high resolution UMAP embedding of different cell states, trajectories and branch points in the developing embryo. Our cell state annotations formed quantifiably more distinct clusters when using top rank genes selected by cESFW compared with the top ranked genes from conventional HVG selection. Furthermore, we obtain a coherent merged dataset without having to apply batch correction methods as in previous analyses (Meistermann et al., 2021; Wei et al., 2023). We propose that cESFW is effective for combining datasets from different sources because improved feature selection tips the biological signal-to-noise ratio in favour of the biologically relevant expression profiles in the data. Cell similarity metrics may thus be less affected by confounding gene expression signals.
Our high-resolution map of the early human embryo transcriptome allowed us to substantiate the existence of two distinct branching populations, at the intersections of morula with ICM and TE, and of ICM with Epi and Hyp. Their identification and the general topology of the cESFW embedding indicate that human embryogenesis proceeds via the sequence of lineage bifurcations established for the mouse embryo (Cockburn and Rossant, 2010).
The branching populations reside at crucial junctures in blastocyst formation, the partitioning of extra-embryonic and embryonic lineages. These branch point clusters do not define unitary states. On the contrary, cells in these clusters are heterogeneous and may become specified to alternative fates. For example, PDGFRA, a Hyp marker (Corujo-Simon et al., 2023), and NANOG, an Epi marker (Allègre et al., 2022), are heterogeneously distributed in the Epi/Hyp branching population. Furthermore, branch cluster boundaries extend beyond the bifurcation region, potentially indicating that cells remain plastic and may be redirected. Notably, it has been seen in mouse embryos that cells expressing ICM genes remain capable of generating TE up to the late 32-cell stage (Posfai et al., 2017). Further analytical and experimental studies may focus on heterogeneity and gene expression dynamics in these branch point populations to unravel the process of lineage choice and commitment.
Our analyses call for a reappraisal of the proposal that the human embryo contains a distinct population of cells that are excluded from development and fated for apoptosis (Singh et al., 2023). We found that so-called REject cells are located on the developmental trajectory, residing in the ICM/TE branch population. Those cells do express some markers of a pre-apoptotic state, but so do other cells throughout the data. Quantitative immunostaining analyses are needed to reveal whether there is a consistently higher frequency of apoptosis in this population in the embryo.
We used the cESFW gene set and embedding as a reference point to evaluate the identity of various human PSC cultures. Quantifying proportional relatedness of cell lines to embryo populations confirmed that conventional human PSCs and EPSCs are most related to post-implantation embryonic disc, whereas naïve PSCs are similar to preIm-Epi. This analysis also demonstrated that PXGL naïve PSCs have a very high (97%) representation of preIm-Epi-like cells.
Finally, we used ESSs and our UMAP embedding to suggest sets of genes that may aid cell type staging during days 3-7 of lineage progression during human blastocyst formation. Immunostaining to assess protein expression dynamics for these genes will be of interest.
In summary, this study demonstrates that cESFW, an algorithm derived from the mathematical framework of ES, has the ability to reveal gene expression dynamics in scRNA-seq data that were previously unobserved. The expression profiles of each of 16,719 genes available in the scRNA-seq counts matrix can be visualised and plotted at the online portal https://bioinformatics.crick.ac.uk/shiny/users/boeings/radley2024umap_app/. Our characterisation of early human embryo scRNA-seq data has provided ranked gene lists for multiple cell states and a high resolution UMAP embedding. We provide a proposed workflow for incorporating cESFW into general scRNA-seq analysis. With wider application, we expect to gain greater insights into which scenarios cESFW can outperform or complement conventional HVG selection. We propose that cESFW may be particularly suited to analysis of transcriptome dynamics and trajectories in time course series or during developmental progression. We also envisage that the cESFW technique and workflow will be readily adapted for application to datasets of varying complexity and type, such as multi-omics or spatial transcriptomics studies.
Limitations of the study
As more embryo scRNA-seq datasets become publicly available, re-analysis should yield greater resolution. A higher density of samples from day 4-6 embryos will be valuable for better delineation of ICM progression and lineage branch points. Further data from extended cultures will enable greater resolution of embryonic disc progression and emergence of extra-embryonic mesenchyme and amnion.
A ground truth is lacking for human embryo scRNA-seq. To quantify the performance of Seurat and Scran compared with cESFW feature selection, we took the top 3012 genes from the ranked gene lists and calculated the cluster Silhouette scores based on our cell state annotations. We chose to use our cluster annotations because we believe we have provided substantial evidence of the biological relevance. However, this approach may introduce some bias towards cESFW.
The genes we have highlighted for staging of cells during blastocyst formation were chosen through semi-supervised analysis. The initial gene sets were curated from ranked gene lists output based on the ESS. However, the final sets of selected genes were subject to manual inspection of the top 30 ESS ranked genes, and genes that appeared to be poorly expressed in the cell populations of interest were manually excluded. Therefore, the presented gene sets may be incomplete.
Finally, the benefits of feature selection through cESFW do not preclude the use of other data analysis techniques such as batch integration and data repair. We suggest that more accurate feature selection has the potential to improve results of all downstream steps. We anticipate adaptations of our proposed cESFW workflow that incorporate a range of data analysis methodologies.
MATERIALS AND METHODS
From discrete data Entropy Sorting to continuous data Entropy Sorting
A key distinction between the ESFW algorithm outlined in our previous work (Radley et al., 2023) and the cESFW algorithm presented in this work is that cESFW updates the theory behind ES so that it can be readily applied to continuous data. For a detailed explanation of how ES has been adapted for continuous data, see supplementary Materials and Methods.
A fifth entropy sorting error scenario
In addition to expanding the ES framework so that it may be applied to continuous data, we also introduce a fifth error scenario where ES divergence can be observed through the ES parabola in a manner that indicates that samples may be displaying false negative (FN) expression levels. We present this new error scenario in Fig. S11 (black box), alongside the four maximum entropy error scenarios introduced in our previous manuscript (Radley et al., 2023). The error scenarios are important because they provide implicit logic that determines how we calculate divergence and error potential for any pair of features.
As an example, in our newly proposed fifth error scenario, there are two features being compared against one another. The top feature (blue and red) is denoted the reference feature (RF) and the bottom feature (green and purple) is denoted the query feature (QF) because, according to the maximum entropy principle, the QF should always be the feature with the larger cardinality for its observed minority state (Radley et al., 2023). For this pair of features, we see a strong but imperfect conditional relationship between the co-occurrence of ground truth (absent of X symbols) RF and QF minority states. However, the presence of RF majority states overlapping with QF minority states introduces uncertainty into the system, which manifests as a non-zero conditional entropy between the two features. Hence, we can hypothesise that the samples diverging from the ground truth (indicated by the blue Xs) are either real, or are present due to the introduction of erroneous data points. This allows us to set up a hypothesis test, which we introduce in Radley et al. (2023) as the ES hypothesis test. The ES hypothesis test allows us to quantify whether it is more likely or not than random chance that the overlapping RF majority and QF minority states are present due to sampling/measurement error.
All five of these error scenarios are encoded into our cESFW algorithm, which can be inspected to gain further insights into the algorithmic application of the error scenarios. However, before looking at our cESFW code, we advise reading the relevant error scenario sections in the supplementary material of our Radley et al. (2023) publication for a detailed description.
Continuous Entropy Sort Feature Weighting workflow
In Fig. 1, we outlined the main steps in our proposed cESFW scRNA-seq data processing workflow. In this section we will provide further details for each step. Executable code to apply our cESFW workflow to the synthetic and human embryo scRNA-seq data presented in this paper can be found at https://github.com/aradley/cESFW_Embryo_Topology_Paper. The primary purpose of this section and our workflow is to give readers an insight into how cESFW can be successfully applied to scRNA-seq data. It is expected that users will find other ways to incorporate cESFW into their omics data analysis pipelines for the plethora of data that is now available to the community.
Temporary minority state normalisation
For features to be compared with one another with cESFW, they must each be converted into their minority state feature activity vectors (Fig. S14A). Practically, converting features into their minority state vectors amounts to normalising each feature so that their values are within the same range, in a manner that meaningfully captures how features co-occur with one another. For gene expression data, this could be two genes where when one feature displays expression values close to its maximum expression in the data, the other gene also shows values close to its maximum expression. In this example, to convert these features to their minority state feature activity vectors, we could simply divide all the values of each feature by their respective feature maximums (Fig. S14C). However, now consider that there is a third gene where, when the first two genes have values close to 0, the third gene has values close to its maximum expression, and when the first two features have values close to their maximum, the third feature has values close to 0 (Fig. S14A,B). This is equivalent to the high expression of the third gene being the majority state feature activity (Fig. S14B), which is implicitly defined by the sum of activities in the normalised vector of the third gene being greater than half the number of samples. Hence, to convert the third gene to its minority state activity vector, we simply deduct each value of the vector from 1 (Fig. S14A,B).
This logic is generally applicable to any dataset where we can justifiably convert each feature into active versus inactive states. How this abstraction is done and whether this is meaningful for different datasets is context dependent and ultimately left to the user. In this work, we found the relatively simple normalisation method presented in Table S2 is effective.
In this work, the feature percentile threshold (p) was set as the 97.5th percentile. Sensitivity analysis demonstrated that the results of downstream cESFW analysis were robust to different choices of p (Fig. S12). Having obtained the minority state matrix (Mm), we may now use it as the input for ES calculations to quantify the correlations between features.
Exclude genes significantly enriched within a dataset [optional]
Sometimes when analysing scRNA-seq data we would like to combine datasets from different sequencing experiments. Combining datasets from different experiments can introduce batch effects as a significant confounder in differentiating biological signal from experimental noise. To mitigate the contribution of batch effects on downstream analysis of scRNA-seq data, several methods have been developed (Tran et al., 2020). Many of these methods focus on smoothing or regressing the counts matrices with the aim of making samples with similar cellular states that are derived from different datasets look more similar to one another than samples of different cellular states that are from the same dataset. Although there are scenarios where smoothing/imputation methods can be successfully employed to gain greater insights into scRNA-seq datasets, they always incur the risk of adding computational artefacts or unintentionally removing biological signal of interest.
In this cESFW workflow, we address the issue of batch effects in a different manner. Rather than change the values of the counts matrix, we hypothesised that genes that are significantly enriched in any individual dataset constitute genes that contribute to batch effects and hence should be removed from the data. In doing so, we aim to reduce batch effect signals to the point that biological patterns of interest that are common between datasets have a larger signal than dataset-specific noise, which should in turn lead to more biologically relevant insights in downstream analysis.
To identify which genes are considered significantly enriched in a specific dataset in an unsupervised manner, we turn to the Error Potential (EP). The EP is a metric derived from ES and described in detail by Radley et al. (2023). Given two minority state activity feature vectors, the EP between the two features can be calculated, and if the EP>0, the features can be considered to be co-occurring with each other in a manner that is greater than random chance. In Table S3 we provide pseudocode for how we use the EP for excluding features that are significantly enriched in any of the datasets comprising a scRNA-seq counts matrix of interest.
A limitation with this batch-specific gene exclusion methodology is that, if there are datasets within the combined counts matrix (M) that are comprised of almost entirely one cell type, and that cell type has a distinct gene signature that is not present in any other dataset, these genes may be unwittingly identified as batch effect genes and removed from the data. In such a scenario, it would be up to the researcher to modify the workflow accordingly or seek alternative methodologies.
Calculate ESS and EP feature matrices
Having created the Mm, we use it to calculate the ESS and EP pairwise for each gene in Mm. In doing so we generate the ESSs and EPs matrices, which are j by j matrices in which j is the number of genes present in Mm. The ESS uses ES theory (Fig. S15) to quantify the correlation between two features with a value between −1 and 1. The EP uses a hypothesis test between the prospect of two features being completely independent versus the two features being dependent, to quantify the significance of the correlation between the two features. When EP<0, it is more likely that the observed correlation between two features occurred through a random relationship between two independent features than the features having a dependent relationship. Conversely, when EP>0, the correlation between the two features is more likely to have been observed due to some dependent relationship.
For further details regarding the theory and derivation behind the ESS and EP, see Radley et al. (2023). In the next steps of our cESFW workflow, we use the ESSs and EPs matrices to get a feature importance weight for each of the j genes, and use these weights to decide which genes should be retained in the scRNA-seq counts matrix, and which should be excluded.
Calculate cESFW feature weights
Having generated the ESSs and EPs matrices, we now calculate the cESFW feature weights for each of the j genes. To calculate gene weights, we combine the ESSs and EPs matrices derived from ES theory, with the concept of node centrality from graph theory. In the context of scRNA-seq data, node centrality can be used to try to distinguish between genes (nodes) that form networks with relatively high correlations with other genes in the dataset, from genes that form weak correlative networks. The assumption is that genes that are members of highly correlating networks are more likely to be involved in cellular function/identity.
Because the ESS is a symmetric correlation metric, the ESSs matrix can be thought of as a weighted undirected graph in the same manner as other correlation metrics (Pearson's correlation, mutual information, etc.), where ESSs [2, 4]=ESSs [4, 2]=0.7 would quantify the weight of the edge between genes 2 and 4 as 0.7. However, practical limitations of node centrality often force users to apply supervision to the analysis of complex datasets such as scRNA-seq data. For example, Singh et al. (2020) discuss how node centrality measures that use weighted edges can fail to account for the number of edges that each gene has to other genes. In the following we show that with the ESSsjx j and EPsjx j matrices we can calculate node centralities of weighted networks in a manner that is unsupervised and accounts for the number of edges that each gene has to other genes.
In summary, we have calculated a vector of cESFW gene weights, for each gene in ESSsjx j, such that a larger cESFW gene weight indicates that the expression of a gene in the original scRNA-seq counts matrix are more likely to be part of a small network of highly correlated genes than other genes with lower cESFW gene weights. As such, we can use the cESFW gene weights to form a ranked list of genes for downstream analysis.
Select gene cluster with branching network
The final step of our proposed cESFW workflow is to take the cESFW gene weights and use them to select the top most informative genes for downstream analysis. In a similar manner to HVG analysis, we can simply take a subset of the top highest ranking genes according to the cESFW gene weights. Up until this point, the cESFW workflow has been entirely unsupervised, i.e., no subjective user decisions were applied and no prior knowledge regarding cell type or gene importance was used when deciding which genes to exclude from downstream analysis. As with HVG analysis, the final step of the cESFW workflow where we choose how many genes to select based on the cESFW gene weights is an iterative process where researchers use domain knowledge to settle on the final gene set, based on the assumption that the cESFW ranked gene list provides a meaningful proxy for thresholding. Below, we provide a description of our rationale for using the cESFW gene weights to identify a set of biologically informative genes for the human embryo scRNA-seq data within this manuscript. We present this description as a guide to aid analysis on other datasets.
For the human embryo scRNA-seq data presented in this manuscript, we selected the top 4000 genes by taking the 4000 genes with the highest cESFW gene weights. The selection threshold of 4000 was chosen by looking at UMAPs of the genes in the ESSs matrix at varying cESFW gene weight cutoffs (Fig. S13A). In these UMAPs, we qualitatively look for two main properties. First, we are looking for clusters of genes that have significant branching patterns, rather than smooth ‘blob’ like clusters. In Fig. S13A we see that as we decrease the number of top ranked genes that we include, the darker blue coloured cluster forms increasingly distinct branches. These branches indicate that there are genes that form fully connected networks, with some genes being very specifically connected to each other without being directly connected to many other genes in the network. Biologically we infer that these represent genes that are specifically expressed in a subset of cells present in the entire dataset, and are hence genes of interest when dissecting cellular identity. The top 4000 genes were chosen relatively arbitrarily as the point at which a clear branching pattern appears in the dark blue cluster. Readers will notice that choosing values smaller than 4000 can result in even more distinct branches in the blue cluster. We suggest users of our cESFW workflow aim to find a balance between keeping as many genes as possible while increasing the qualitative property of branching structure. Furthermore, we suggest supervising the selection of gene clusters using a small set of markers known to be important in the system of study. In this work, we found that genes known to be important during early human embryo development (Fig. S4) are enriched in the dark blue cluster of genes, further suggesting that this cluster of genes is more likely to discriminate cell type identities in downstream analysis.
A second feature of interest in these UMAPs is that a separate cluster of genes with significantly higher weighted node centrality scores can appear (yellow; Fig. S13A). Empirically, we find that failure to remove these clusters of genes prevents the emergence of high resolution scRNA-seq embeddings (see the online GitHub workflow for examples). Gene set enrichment analysis of the final 3012 gene set used throughout this manuscript, comprising the blue cluster of the top 4000 genes in Fig. S13A (red square), showed that these genes were enriched for GO terms relating to regulation and development (Fig. S13B). Conversely, the GO terms of the 988 genes comprising the yellow cluster were dominated by terms relating to transcription, translation and metabolism (Fig. S13C). We therefore hypothesise that the genes present in the relatively highly correlated yellow clusters represent expression signals relating to generic cell transcription, translation and metabolism processes which are less discriminating of cellular identity/function than genes that are part of the blue branching cluster. By excluding the genes from the yellow cluster in downstream analysis, we appear to further increase the signal to noise ratio of cell-type-specific expression, which ultimately improves our ability to discriminate different cell identities in the data.
We emphasise that this final step of the cESFW workflow is supervised and subject to user interpretation/dataset variability. We hope that our transparency regarding this step helps guide potential users of our cESFW workflow to improved resolution in their datasets or to develop new cESFW feature weight based selection criteria.
Data and code availability
Instructions to install cESFW can be found at: https://github.com/aradley/cESFW. Computational workflows and data used for the generation of results in this article can be found at: https://github.com/aradley/cESFW_Embryo_Topology_Paper. A copy of the human pre- and post-implantation embryo data used in this work may be found in the following permanent Mendeley Data repository: https://doi.org/10.17632/34td4ds2r9.1.
The data used to create the UMAP embedding are a combination of human embryo pre-implantation and post-implantation embryo (extended culture) scRNA-seq data. The pre-implantation raw counts scRNA-seq data from Yan et al. (2013), Petropoulos et al. (2016), Fogarty et al. (2017) and Meistermann et al. (2021) were compiled into a single gene expression matrix by Meistermann et al. (2021). For information regarding quality control and cell filtering of these four datasets, please refer to Meistermann et al. (2021). The Yanagida et al. (2021) human pre-implantation embryo data are available via GEO accession number GSE171820. The 3D cultured post-implantation human embryo raw counts scRNA-seq data are from Xiang et al. (2020), available at GSE136447. For our analysis, we removed the day 6 and 7 cultured embryo samples from Xiang et al. (2020) to mitigate the contribution of batch effects and uncertainty over timings of embryo stages.
Synthetic data generation
The four SDs used in this study were generated using the Dyngen software (Cannoodt et al., 2021). Synthetic dataset size and shapes were all controlled using out-of-box Dyngen functions.
UMAPs
For consistency, all UMAPs presented in this work were created with the Python umap-learn package, with n_neighbors and min_dist set to 50 and 0.1 respectively. The correlation distance metric was used instead of the default Euclidean distance metric. All UMAPs displaying gene expression profiles show the log2 normalised gene expression. No data smoothing or clipping were applied. An interactive online portal where readers may visualise gene expression profiles on our UMAP embedding and generate their own plots can be found at: https://bioinformatics.crick.ac.uk/shiny/users/boeings/radley2024umap_app/.
Unsupervised clustering
Unsupervised clustering of scRNA-seq embryo data (Fig. S3A) was performed using the Python package sklearn.cluster.AgglomerativeClustering, with k=30 clusters and the correlation distance metric.
Gene marker identification
Identification of markers for UMAP plotting and our ranked gene list (Table S1) was performed by using the ESS correlation metric to rank genes that are enriched to a cell state or cell states of interest. For details see Radley et al. (2023), and for examples see the workflows provided alongside this paper (https://github.com/aradley/cESFW_Embryo_Topology_Paper).
RNA velocity
RNA velocity was carried out using the scVelo (Bergen et al., 2020) with default parameters, except for the cell nearest neighbour matrix, which we manually substituted with the cell nearest neighbour matrix according to the 3012 cESFW genes and the correlation distance metric.
Query cell classification
Query scRNA-seq samples were transformed onto our UMAP embedding using the out-of-box umap.transform function. After query samples are placed on the UMAP embedding, we may predict their cell type in the low dimensional UMAP space.
To predict cell identity, we used the sklearn.neighbors.KNeighborsClassifier Python function. A knn classifier was trained using our UMAP embedding and cell state annotations. The number of neighbours used to train the classifier was 20. The trained classifier was then applied to the query samples to get their predicted cell identities.
Lineage emergence during blastocyst formation
To identify the emergence of Epi, Hyp and TE gene expression profiles during human blastocyst formation, we used the ESS to rank genes by how well their expression profiles overlap with the morula to Epi, Hyp or TE trajectories, and then progressively restricted the late blastocyst Epi, Hyp or TE populations. As an example, to identify Epi staging genes, we started by ranking genes based on their enrichment in the combined morula, ICM/TE branch, ICM and preIm-epi samples from our cell UMAP annotations (Fig. 3C). This was followed by ranking against the combined ICM/TE branch, ICM and preIm-Epi population, and then by the combined ICM and preIm-Epi samples, and finally the preIm-Epi samples in isolation (Fig. S10A). A similar process was repeated for the Hyp and TE trajectories. For each ranked gene list generated, we inspected the top 30 ranked genes and manually confirmed the gene expression profile using the UMAP embedding. Through this process, we were able to identify sets of genes that delineate the emergence of Epi, Hyp and TE lineages from the morula progenitor population.
Individual embryo cell state co-occurrence
As the ESS is a correlation metric that quantifies the co-occurrence of states in a set of samples, we used it to quantify how often our cell state annotations coincided in individual embryos. By treating each embryo as a sample and counting how many times each cell label appears for each embryo, we create a sample by feature matrix. We normalise each sample by dividing all values by the total number of cells in each embryo, and then calculate the ESS from the resulting matrix. This generates the ESS correlation matrix, presented in Fig. 4A,B.
Acknowledgements
We thank Marc Goodfellow, Linus Schumacher and James Briscoe for helpful discussions and insightful comments. We thank Dimitri Meistermann, Cheng Zhao and Fredrik Lanner for helping us obtain the spliced and unspliced scRNA-seq matrices used for RNA velocity analysis. We thank Manvendra Singh for providing cell label IDs for the Petropoulos et al. (2016) scRNA-seq data used in Singh et al. (2023). Thank you to Pengtao Liu and Xi Chen for providing the human EPSC scRNA-seq counts matrix used in Gao et al. (2019).
Footnotes
Author contributions statement
Conceptualization: A.R.; Methodology: A.R.; Software: A.R., S.B.; Validation: A.R., A.S.; Formal analysis: A.R.; Investigation: A.R.; Data curation: A.R.; Writing - original draft: A.R.; Writing - review & editing: A.S.; Visualization: A.R., S.B.; Supervision: A.S.; Project administration: A.S.; Funding acquisition: A.S.
Funding
This research was supported by the Medical Research Council Programme Grant (MR/W025310/1) and European Research Council Advanced Grant (Plastinet, contract 835312) funding to A.S. A.S. is a Medical Research Council Professor (G1100526/2). Open Access funding provided by University of Exeter. Deposited in PMC for immediate release.
Data availability
Instructions to install cESFW can be found at: https://github.com/aradley/cESFW. Computational workflows and data used for the generation of results can be found at: https://github.com/aradley/cESFW_Embryo_Topology_Paper. The human pre- and post-implantation embryo data can be found at: https://doi.org/10.17632/34td4ds2r9.1.
Peer review history
The peer review history is available online at https://journals.biologists.com/dev/lookup/doi/10.1242/dev.202832.reviewer-comments.pdf
References
Competing interests
The authors declare no competing or financial interests.