Branching topology of the human embryo transcriptome revealed by Entropy Sort Feature Weighting

ABSTRACT Analysis of single cell transcriptomics (scRNA-seq) data is typically performed after subsetting to highly variable genes (HVGs). Here, we show that Entropy Sorting provides an alternative mathematical framework for feature selection. On synthetic datasets, continuous Entropy Sort Feature Weighting (cESFW) outperforms HVG selection in distinguishing cell-state-specific genes. We apply cESFW to six merged scRNA-seq datasets spanning human early embryo development. Without smoothing or augmenting the raw counts matrices, cESFW generates a high-resolution embedding displaying coherent developmental progression from eight-cell to post-implantation stages and delineating 15 distinct cell states. The embedding highlights sequential lineage decisions during blastocyst development, while unsupervised clustering identifies branch point populations obscured in previous analyses. The first branching region, where morula cells become specified for inner cell mass or trophectoderm, includes cells previously asserted to lack a developmental trajectory. We quantify the relatedness of different pluripotent stem cell cultures to distinct embryo cell types and identify marker genes of naïve and primed pluripotency. Finally, by revealing genes with dynamic lineage-specific expression, we provide markers for staging progression from morula to blastocyst.

Average silhouette scores calculated at with varying sets of top ranked genes show that all three methods perform comparably.Grey dashed line shows the average silhouette score when using all genes in initial scRNA-seq counts matrix as a negative control.Although the cESFW initially performs slightly worse than Seurat and Scran, as more of the highest ranked genes are included, cESFW appears more stable.This is consistent with our results on synthetic data (Fig 2 ), where cESFW appears more robust to the presence of house keeping genes.D. Silhouette scores for individual clusters when using the top 200, 1000 or 2800 genes.As in (C.), we find that cESFW is initially comparable to Seurat and Scran HVG selection, but more stably maintains cell type clusters as the number of top ranked genes used increases.A.

B.
C.   S1).   A. UMAPs of top ranked genes with increasingly stringent thresholds shows the emergence of a cluster of genes that form a graph with branches.We hypothesise that these branches indicate sets of co-regulating genes.To generate the high resolution human embryo embedding show in the main text of this paper, we used a cESFW selection threshold that took the top 4000 genes according to cESFW weighting (red box in this figure).B. Gene ontology analysis of the genes in blue branching cluster of the top 4000 genes (red box) shows that these genes are enriched for developmental and differentiation terms.C. Gene ontology analysis of the genes in yellow cluster of the top 4000 genes (red box) shows that these genes are enriched for cell transcription, translation and metabolic processes.

D.
Fig. S1.Dyngen generated synthetic datasets (SDs) 1-3 A. UMAPs of SDs generated using just the TFs of each dataset.B. UMAPs of SDs generated using all of the TF and HK genes of each dataset.C. AUROC curves generated by different feature selection methods on each of SDs 1-3.D. PR-AUCs generated by different feature selection methods on each of SDs 1-3.

Fig. S2 .
Fig. S2.Comparison of feature selection methods on Peripheral Blood Mononuclear Cells (PBMCs).We took PMBC scRNA-seq data from the Seurat tutorial workflow and performed feature selection with Seurat HVG selection, Scran HVG selection and our cESFW feature selection workflow.Cell type labels were defined through Seurat clustering and literature based marker genes.Using our cESFW workflow we identified a set of 2898 highly informative genes.A. Resulting UMAPs for each gene selection workflow.B. Venn diagram of the top 2898 genes from the cESFW, Seurat and Scran workflows.C.Average silhouette scores calculated at with varying sets of top ranked genes show that all three methods perform comparably.Grey dashed line shows the average silhouette score when using all genes in initial scRNA-seq counts matrix as a negative control.Although the cESFW initially performs slightly worse than Seurat and Scran, as more of the highest ranked genes are included, cESFW appears more stable.This is consistent with our results on synthetic data (Fig 2), where cESFW appears more robust to the presence of house keeping genes.D. Silhouette scores for individual clusters when using the top 200, 1000 or 2800 genes.As in (C.), we find that cESFW is initially comparable to Seurat and Scran HVG selection, but more stably maintains cell type clusters as the number of top ranked genes used increases.

Fig. S3 .
Fig. S3.Annotation of our UMAP embedding.Unsupervised agglomerative clustering with number of clusters set to 30.Clustering was performed on the raw counts matrix subsetted down to our 3012 cESFW selected genes.A. The resulting sample clusters visualised on our UMAP embedding.By inspecting known markers and previous analyses of these scRNA-seq data, we manually groups our unsupervised clusters into annotated cell types.B. Summary of the annotated cell types and which unsupervised clusters were used to form them. C. Visualisation our our annotated cell types on our UMAP embedding.D. Stirparo et al. 2018's supervised analysis of the Petropoulos et al. 2016 scRNA-seq data support our cell type annotations.E. Petropoulos et al. 2016's supervised analysis for differentiating between mural and polar TE support our cell type annotations.

Fig. S5 .
Fig. S5.Comparison 3012 cESFW workflow genes against top 3012 Scran and Seurat HVGs. A. Venn diagram of overlap between the genes within each of the 3012 gene sets.B. Silhouette scores of each of our annotated clusters when using each of the 3012 gene sets.B, C. UMAP generated when using the same scRNA-seq counts matrix and UMAP parameters used to generate our cESFW UMAP embedding, but instead using the top 3012 Scran HVGs (B.) or top 3012 Seurat HVGs (C.).

Fig. S7 .
Fig. S7.Proposed apoptotic markers of NCC cells display NCC non-specific expression profiles.Visualisation of NCC specific apoptosis markers proposed by Singh et al. 2023 show a variety of NCC specific and non-specific expression patters. A. Proposed apoptosis markers that have relatively specific NCC or ICM/TE branch expression.C. Proposed apoptosis markers that show higher expression in morula and/or 8-cell populations than NCC or ICM/TE cells.C. Proposed apoptosis markers that are upregulated across several cell types of the day 3-14 human embryo.D. Proposed apoptosis markers that show sparse/low expression throughout the day 3-14 human embryo.

Fig. S8 .Figure S9 .
Fig. S8.Human embryo cell type label transfer to pluripotent stem cell culture samples.Related to Fig 5. A-D.Pseudo-bulk for each of our human embryo cell type annotations was generated by taking the average expression of samples within each group, for each gene.The correlation distance metric was then calculated between each of the pseudo-bulk samples and each of the scRNA-seq samples from the (Guo et al. 2021) (A.), (Messmer et al. 2019) (B.), (Mazid et al. 2022) (C.) and (Gao et al. 2019) (D.) PSC cultures conditions.Correlation distances were calculated on the raw expression counts subsetted down to the 3012 genes identified in this manuscript as informative of early human embryo cell identity.For each scRNA-seq sample, the correlation distances were normalised between 0 and 1 to aid visualisation of which human embryo cell type they are most similar to (dark blue).E. Bar plots summarising the proportions of which human embryo cell types the PSC scRNA-seq samples are most similar to shows consistency with cell type label transfer performed in the low dimensional UMAP space (Fig 5C).
Fig. S10.Emerging epiblast, hypoblast and trophectoderm signatures during blastocyst development.UMAP gene expression profiles for example genes of the Epi (A.), Hypo (B.) or TE (C.) lineages.Presented genes are from those that are highlighted in green in the heatmaps shown in Fig 6.
Fig. S11.Entropy sorting error scenarios.A set of simple examples that demonstrate the 5 known error scenarios where expression states between two features can be quantified as potential false positive (FP) or false negative (FN) data points through the ES parabola.SD indicates the split direction for the observed reference feature (RF) and query feature (QF) pair.These examples are set up with discrete states (minority Vs majority) for simplicity, but can be expanded to continuous data, as per SUPPLEMENTARY MATERIALS AND METHODS.The top 4 scenarios were outlined in detail in our previous work (A.Radley et al. 2023).The bottom error scenario (black box) outlines a previously undescribed error scenario where ES divergence indicative of FN expression values can be observed.

Fig. S12 .
Fig. S12.Feature normalisation percentile sensitivity.A, B. Varying the feature percentile threshold (p) for normalisation of features on each of the synthetic datasets shows that the results of cESFW are robust to difference choices of p, as demonstrated by minimal changes to the overall AOC (A.) and AUROC (B.) scores generated by cESFW feature ranking.