NODAL/TGFβ signalling mediates the self-sustained stemness induced by PIK3CAH1047R homozygosity in pluripotent stem cells

ABSTRACT Activating PIK3CA mutations are known ‘drivers’ of human cancer and developmental overgrowth syndromes. We recently demonstrated that the ‘hotspot’ PIK3CAH1047R variant exerts unexpected allele dose-dependent effects on stemness in human pluripotent stem cells (hPSCs). In this study, we combine high-depth transcriptomics, total proteomics and reverse-phase protein arrays to reveal potentially disease-related alterations in heterozygous cells, and to assess the contribution of activated TGFβ signalling to the stemness phenotype of homozygous PIK3CAH1047R cells. We demonstrate signalling rewiring as a function of oncogenic PI3K signalling strength, and provide experimental evidence that self-sustained stemness is causally related to enhanced autocrine NODAL/TGFβ signalling. A significant transcriptomic signature of TGFβ pathway activation in heterozygous PIK3CAH1047R was observed but was modest and was not associated with the stemness phenotype seen in homozygous mutants. Notably, the stemness gene expression in homozygous PIK3CAH1047R hPSCs was reversed by pharmacological inhibition of NODAL/TGFβ signalling, but not by pharmacological PI3Kα pathway inhibition. Altogether, this provides the first in-depth analysis of PI3K signalling in hPSCs and directly links strong PI3K activation to developmental NODAL/TGFβ signalling. This work illustrates the importance of allele dosage and expression when artificial systems are used to model human genetic disease caused by activating PIK3CA mutations. This article has an associated First Person interview with the first author of the paper.


Cell lysate collection for RNA sequencing and total proteomics
For RNA sequencing and total proteomics, subconfluent cells were fed fresh E8/F 3 h prior to snap-freezing on dry ice and subsequent RNA or protein extraction. Relative to the results in Ref. (Madsen et al., 2019), the current transcriptomic data of PIK3CA H1047R were obtained more than 6 months following the first study, with cells at different passages, and were thus independent from one another. Moreover, sample collection for the second transcriptomics experiment was conducted over three days according to a block design, thus allowing us to determine transcriptional differences that are robust to biological variability.

Cell lysate collection for RPPA
For RPPA in growth factor-replete conditions, cells were fed fresh E8/F 3h before collection. To assess variability due to differences in collection timing, clones from each iPSC genotype were collected on each one of three days according to a block design, giving rise to a total of 22 cultures. To test the effect of the PI3Kα-specific inhibitor BYL719, cells were treated with 100 nM drug (or DMSO only as control treatment) for 24 h and exposed to growth factor removal within the last hour before collection. All cells were washed in DPBS prior to collection to rinse off residual proteins and cell debris.

TGFb/NODAL signalling studies
Wild-type or homozygous PIK3CA H1047R iPSCs were seeded in 12-well plates all coated with Geltrex from the same lot (#2052962; diluted in DMEM/F12 lot #RNBH0692). Cells were processed for seeding at a ratio of 1:15 according to the standard maintenance protocol. One day after seeding, individual treatments were applied to triplicate wells. Briefly, cells were first washed twice with 2 and 1 ml of Dulbecco's PBS (DPBS) to remove residual growth factors. The base medium for individual treatments was Essential 6 supplemented with 10 ng ml -1 heat-stable FGF2. This was combined with one of the following reagents or their diluent equivalents: 100 ng ml -1 NODAL (diluent: 4 mM HCl), 250 nM BYL719 (diluent: DMSO), 5 µM SB431542 (diluent: DMSO). Cells were snap-frozen on dry ice after 24, 48 and 72 h following a single DPBS wash. Individual treatments were replenished daily at the same time of day to limit temporal confounders.

RNA sequencing data analyses
Raw read mapping, counting and differential expression Raw reads were mapped to the human genome build GRCh38 (for iPSC RNAseq) or the mouse genome build GRCm38 (for MEF RNAseq), and gene level counts were performed using Spliced Transcripts Alignment to a Reference (STAR) v2.5 (Dobin et al., 2013). Subsequent data processing was performed using open-source R software according to the limma-voom method (Ritchie et al., 2015). Briefly, raw counts were converted to counts per million (cpm) using the cpm() function in the edgeR package (Robinson, McCarthy and Smyth, 2009), followed by normalisation according the trimmed mean of M (TMM) method (Robinson and Oshlack, 2010). The meanvariance relationship was modelled with voom(), followed by linear modelling and computation of moderated t-statistics using the lmFit() and eBayes() functions in the limma package (Ritchie et al., 2015). The associated P-value for assessment of differential gene expression was adjusted for multiple comparisons with the Benjamini-Hochberg method at false-discovery rate (FDR) < 5% (Benjamini and Hochberg, 1995). The function duplicateCorrelation() was applied to correct for the use of replicate iPSC clones.
Correlations between corresponding transcriptomics and/or proteomics data were calculated using Spearman's rank-order correlation test for non-normally distributed data.

Ingenuity® Pathway Analysis (IPA)
The list of differentially expressed total proteins (PIK3CA H1047R/H1047R vs wild-type) was subjected to IPA (build version: 448560M; content version: 36601845) against the Ingenuity Knowledge Base, considering only relationships where confidence was classified as "Experimentally Observed". Following exclusion of chemicals and drugs, the Upstream Regulators list was used for generation of Volcano plots of the respective activation z-scores and overlap pvalues.
IPA was also used to analyse the lists of differentially expressed genes in both heterozygous and homozygous PIK3CA H1047R iPSCs (IPA build version: 484108M; content version: 45868156) and MEFs (IPA build version: 486617M; content version: 46901286), using the Ingenuity Knowledge Base and considering only relationships where confidence was classified as "Experimentally Observed". Chemicals and diseases were excluded from Node Types. For the iPSC datasets, differentially expressed genes were only considered for IPA analysis if having an absolute log2(fold-change) > log2(1.3). The choice of this relatively permissive log fold-change choice was guided by the limma() tutorial for RNAseq (Ritchie et al., 2015), and the assumption that small fold-changes in the expression of genes that act within the same pathway may be sufficient to elicit a functionally important response and thus should not be omitted. This consideration is of particular relevance for transcriptomic data from heterozygous PIK3CA H1047R iPSCs where fold-changes were relatively small. Due to the high number of differentially expressed transcripts in PIK3CA H1047R/H1047R iPSCs, the analysis was conducted using the top 2000 up-and top 2000 downregulated transcripts.
The IPA Upstream Regulator Analysis is based on the proprietary Ingenuity Knowledge Base which is used to compute two scores based on user-specified data: an enrichment score (Fisher's exact test p-value) that measures overlap between observed and predicted regulated gene sets; a z-score that assesses the match between observed and predicted up/down regulation patterns (Krämer et al., 2014). The results of the Upstream Regulators Analysis were extracted for downstream Volcano plotting of overlap p-values and associated activation z-scores. Note that for heterozygous PIK3CA H1047R iPSCs, a bias-corrected activation z-score was used for plotting to take into account any bias arising from a larger number of upregulated vs downregulated genes in these cells.

Weighted Gene Correlation Network Analysis (WGCNA)
RNA sequencing counts from all 12 samples were converted to reads per kilobase million (RPKM). A threshold of 10 RPKM was used to filter out low-expression genes, followed by removing any genes with missing values caused by this filtering. The RPKM values for the remaining 16,823 genes were log2 transformed and taken forward for network analysis using the WGCNA R package (Zhang and Horvath, 2005;Langfelder and Horvath, 2008), with a soft power threshold of 28 (chosen to maximise scale independence and minimise mean connectivity) and a minimum module size of 30 genes.
To identify modules associated with homozygosity for PIK3CA H1047R , we used the correlation between a gene's module membership (eigengene) and significance for differential expression in homozygous PIK3CA H1047R/H1047R iPSCs. The top two most significant modules for the homozygosity trait were selected for functional enrichment analysis. CytoScape plugin ClueGO (version 2.5.4) (Bindea et al., 2009) was used to perform pathway analysis using the Kyoto Encyclopedia of Genes and Genomes (KEGG) ontology (build 27.02.19) (Kanehisa et al., 2017). All settings were kept at default values. Only pathways with p-value < 0.05 were selected, and a custom reference gene set was used as background (the 16,823 genes analysed using WGCNA). Network visualisation was performed using Cytoscape (Shannon et al., 2003).

Disease Models & Mechanisms • Supplementary information
All spectra were acquired on an Orbitrap Fusion Tribrid mass spectrometer (Thermo Fisher Scientific) operated in data-dependent mode coupled to an EASY-nLC 1200 liquid chromatography pump (Thermo Fisher Scientific) and separated on a 50 cm reversed phase column (Thermo Fisher Scientific, PepMap RSLC C18, 2 µM, 100A, 75 µm x 50 cm). Proteome samples (non-enriched) were eluted over a linear gradient ranging from 0-11% acetonitrile over 70 min, 11-20% acetonitrile for 80 min, 21-30% acetonitrile for 50 min, 31-48% acetonitrile for 30 min, followed by 76% acetonitrile for the final 10 min with a flow rate of 250 nl/min. Survey-full scan MS spectra were acquired in the Orbitrap at a resolution of 120,000 from m/z 350-2000, automated gain control (AGC) target of 4x105 ions, and maximum injection time of 20 ms. Precursors were filtered based on charge state (≥2) and monoisotopic peak assignment, and dynamic exclusion was applied for 45s. A decision tree method allowed fragmentation for ion trap MS2 via electron transfer dissociation (ETD) or higher-energy collision dissociation (HCD), depending on charge state and m/z. Precursor ions were isolated with the quadrupole set to an isolation width of 1.6 m/z. MS2 spectra fragmented by ETD and HCD (35% collision energy) were acquired in the ion trap with an AGC target of 1e4. Maximum injection time for HCD and ETD was 80 ms for proteome samples.

Whole-exome sequencing (WES) and FASTA file generation
WES was performed on a single clone per genotype to generate cell-specific databases for downstream mass spectrometry searchers. Genomic DNA was extracted with Qiagen's QIAamp DNA Micro Kit according to the manufacturer's instructions, followed by quantification using the Qubit dsDNA High Sensitivity Assay Kit and by dilution to 5 ng/μl in the supplied TE buffer. The samples were submitted for library preparation and sequencing by the SMCL Next Generation Sequencing Hub (Academic Laboratory of Medical Genetics, Cambridge). Sequencing was performed on an Illumina HiSeq 4000 with 50X coverage across more than 60% of the exome in each sample. Raw reads were filtered with Trimmomatic (Bolger, Lohse and Usadel, 2014) using the following parameters: headcrop = 3, minlen = 30, trailing = 3. The trimmed reads were aligned to the human reference genome (hg19 build) with BWA (Li and Durbin, 2010), followed by application of GATK base quality score recalibration, indel realignment, duplicate removal and SNP/indel discovery with genotyping (McKenna et al., 2010). GATK Best Practices standard hard filtering parameters were used throughout (Depristo et al., 2011).
In order to find non-reference, mutated peptides in the MS data, we increased the search FASTA file with mutations affecting the protein sequence, as detected by WES with a high sensitivity filter: QD < 1.5, FS > 60, MQ > 40, MQRankSum < -12.5, ReadPosRankSum < -8.0, and average DP > 5 per sample. The Ensembl Variant Effect Predictor (VEP) with Ensembl v88 was used to predict the effect of the mutations on the protein sequence (McLaren et al., 2016). For every variant with an effect on the protein sequence we added the predicted mutated tryptic peptide at the end of the protein sequence.

Mass spectrometry searches
Raw files were processed using MaxQuant 1.5.0.2 (Tyanova, Temu and Cox, 2016) with all searches conducted using cell-specific databases (see Whole-exome sequencing and FASTA file generation), where all protein sequence variants were included in addition to the reference (Ensemble v68 human FASTA). Methionine oxidation, protein N-terminal acetylation and serine/threonine/tyrosine phosphorylation were set as variable modifications and cysteine carbamidomethylation was set as a fixed modification. False discovery rates were set to 1% and the "match between runs" functionality was activated. We filtered out peptides that were associated with multiple identifications in the MaxQuant msms.txt file, had a score < 40, were identified in the reverse database or came from known contaminants. Analysis of the observed peptides passing these filters was performed using a Monte Carlo Markov Chain model as described previously (Robin et al., 2019 preprint). Briefly, the model predicted the average ratio (sample versus control) of a peptide as a function of the observed protein concentration (obtained from the MaxQuant evidence.txt file). Combined with a noise model, a distribution of likely values for the parameters was obtained. The mean and standard deviation of this resulting distribution was used to calculate  (1.2)).

RPPA data analyses
Slide images were analysed using Mapix software (Innopsys), with the spot diameter of the grid set to 270 μm. Background signal intensity was determined for each spot individually and subtracted from the sample spot signal. A test for linearity was performed from the four-point antibody dilution series, according to a flag system where R 2 > 0.9 was deemed good, R 2 > 0.8 was deemed acceptable and R 2 < 0.8 was poor (excluded from subsequent analyses). Median values from the four-point dilution series were calculated for each technical replicate and normalised to the corresponding Fast Green value to account for differences in protein loading. For each sample and protein target, a mean expression value was calculated from the remaining technical replicates and normalised to the corresponding mean of the wild-type group. All phosphoprotein signals were also normalised to the corresponding total protein values.

RT-qPCR set-up and data analyses
For SYBR Green-based qPCRs, A 5-fold cDNA dilution series was prepared and used as standard curve for relative quantitation of gene expression. TBP was used as normaliser following confirmation that its gene expression remained unaffected by the tested conditions. Melt curve analyses confirmed amplification of a single product by each primer pair. All primers had amplification efficiencies 95%-105%. Samples were loaded in duplicate in 384-well plates.
The TaqMan hPSC Scorecard was set up according to the manufacturer's instructions with the following modifications. From each cDNA sample diluted to 20 ng/µl, two 50 μl RT reactions were set up, with 500 ng RNA sample in each. Next, the two RT replicates were combined to obtain 1 μg cDNA in a total volume of 100 μl (final concentration: 10 ng/μl). This was subsequently diluted to 0.715 ng/μl and 10 μl loaded into each Scorecard well. All Ct values were mapped to their corresponding genes using the TaqMan hPSC Scorecard analysis software provided by the manufacturer. Genes with Ct values < 15 were excluded from further analyses. To be considered for downstream processing, genes were also required to have Ct values < 30 in at least two out of the eight samples. Next, Ct values were linearised (antilog) under the assumption of 100 % primer amplification efficiency. The geometric expression mean of the control gene assays was used for subsequent normalisation of individual gene expression values.
All qPCR data were acquired on a Quant Studio™ 5 Real-Time PCR System (Thermo Fisher Scientific). The thermocycling conditions (SYBR Green reactions) were as follows (ramp rate 1.6°C/s for all): 50°C for 2 min, 95°C for 10 min, 40 cycles at 95°C for 15 sec and 60°C for 1 min, followed by melt curve analysis (95°C for 15 sec, 60°C for 1 min, and 95°C for 15 min with ramp rate 0.075°C/sec). The TaqMan hPSC Scorecard thermocycling conditions were as specified by the manufacturer in the accompanying template.
All relevant primer sequences are included in Table S4.

Colony size quantitation from light micrographs
Routine cell culture light micrographs were acquired on an EVOS FL digital inverted microscope (AMF4300, Thermo Fisher Scientific) using the 4X or 10X objective (final magnification 40X and 100X, respectively). For quantitation, 4X images were used for colony segmentation with Definiens Developer XD software. Background was detected using a contrast threshold; for this each pixel was compared to those in the surrounding 24 pixels (i.e. a 5x5 pixel box), and pixels with low contrast (between -50 and +50) were classified as background. Remaining pixels were classified as colonies, and any holes (pixels that were not initially classified as being part of the colony due to low contrast) were filled. Edges of the resulting colonies were smoothened by shrinking and then growing the colonies by 2 pixels. Finally, colonies less than 2000 pixels in size were reclassified as background. The area of the resulting colonies could then be measured and averaged over each field of view.

Statistical analyses
In line with recent (ATOMIC) recommendations by the American Statistical Association (Wasserstein, Schirm and Lazar, 2019), we have avoided arbitrary use of "statistical significance" applied to data from small-scale cell culture experiments which violate assumptions of the most widely used statistical tests. Instead, we present all data from multiple orthogonal experiments, alongside complete information on experimental replicates, independent clones and replicate cultures.
For RPPA data in Fig. 2A, a statistical test for differential expression was performed on datasets with more than three samples per group, using the limma package to apply the limmatrend method with lmFit() and eBayes(), specifying collection time as blocking factor (Ritchie et al., 2015). Phosphoprotein and total protein lists were processed separately. The associated p-value for assessment of differential gene expression was adjusted for multiple comparisons with the Benjamini-Hochberg method at FDR < 5% (Benjamini and Hochberg, 1995). The function duplicateCorrelation() was applied to correct for the use of replicate iPSC clones on the same day. Heatmaps were generated using the heatmap.2() function within the gplots package in R, using target-wise correlation for dendrogram construction.  and plotting them to the corresponding protein identified by total proteomics. If differentially expressed (see (a)), the matched proteins are highlighted in red. (C) and (D) As in (A) and (B), respectively, but using the data for PIK3CA WT/H1047R iPSCs. All proteomic data were obtained from 3 independent clones per genotype using cultures at passages P47-P52, corresponding to the cultures used in our previous study (Madsen et al., 2019). The high-depth transcriptomic data were obtained from 4 independent cultures (minimum 2 independent clones) per genotype using cultures at passages P55-P59. Spearman's rho and the corresponding P-values are indicated on all plots.  (HOM) iPSCs relative to wild-type controls, based on RPPA profiling of cells cultured in growth factor-replete conditions. The data are based on a total of 10 wild-type cultures, 5 PIK3CA WT/H1047R cultures and 7 PIK3CA H1047R/H1047R cultures, and all shown targets were differentially expressed at a false-discovery rate (FDR) < 0.05. Following quality checks (see Materials and Methods), the RPPA data included 21 phosphorylated and 21 total proteins. (B) Heatmap of total proteins from RPPA profiling of wild-type (WT), PIK3CA WT/H1047R (HET) and PIK3CA H1047R/H1047R (HOM) iPSCs following short-term growth factor removal (1 h), +/-100 nM BYL719 (PI3Ka inhibitor) for 24 h. Groups of total proteins exhibiting a consistent expression pattern in BYL719-treated PIK3CA H1047R/H1047R iPSCs are specified.   Fig. S5, related to Fig. 5. Light micrographs of PIK3CA H1047R/H1047R iPSC clones exposed to different media formulations +/-specific inhibitors. A. Each row corresponds to images of parallel cultures of one of two independent iPSC clones exposed to different treatments for 48 or 72 h as indicated. E6/F/N: Essential 6 medium supplemented with 10 ng ml -1 FGF2 and 100 ng ml -1 NODAL; E6/F: Essential 6 medium supplemented with 10 ng ml -1 FGF2; E6/F/SB: Essential 6 medium supplemented with 10 ng ml -1 FGF2 and 5 µM SB431542; E6/F/BYL: Essential 6 medium supplemented with 10 ng ml -1 FGF2 and 250 nM BYL719. The images are representative of 3 independent experiments. Scale bar: 1000 µm. The original images at higher resolution are available online on https://osf.io/hbf7x/. B. Quantification of colony size based on the acquired light micrographs; individual points corresponding to separate images and bars represent the overall mean across the three experimental replicates.