A fully automated deep learning pipeline for high-throughput colony segmentation and classification

ABSTRACT Adenine auxotrophy is a commonly used non-selective genetic marker in yeast research. It allows investigators to easily visualize and quantify various genetic and epigenetic events by simply reading out colony color. However, manual counting of large numbers of colonies is extremely time-consuming, difficult to reproduce and possibly inaccurate. Using cutting-edge neural networks, we have developed a fully automated pipeline for colony segmentation and classification, which speeds up white/red colony quantification 100-fold over manual counting by an experienced researcher. Our approach uses readily available training data and can be smoothly integrated into existing protocols, vastly speeding up screening assays and increasing the statistical power of experiments that employ adenine auxotrophy.


INTRODUCTION
Auxotrophy is the inability of an organism to synthesize a particular organic compound required for its growth. For example, yeasts with mutations in the adenine biosynthetic pathway cannot grow on media lacking adenine. When grown under adenine-limiting conditions, adenine auxotrophs grow but accumulate a celllimited red pigment in their vacuoles, whereas wild-type cells grow white. This red/white colony color assay has been widely used over the last few decades for the investigation of many biological processes, such as recombination, copy number, chromosome loss, plasmid stability, prion propagation, or epigenetic gene regulation in both budding and fission yeasts. However, adapting this assay to quantitative high-throughput applications has proven challenging, as it requires extensive scoring of colony phenotypes by eye. In addition to being time-consuming and tedious, manual colony scoring may suffer from inaccuracy and irreproducibility. Nonetheless, up to now manual scoring is a common practice in the yeast community. Modern machine-learning techniques such as deep learning have made huge strides in automated image classification in recent years and are beginning to be applied to previously intractable problems in the biomedical imaging domain. We set out to leverage these recent developments to build a computational pipeline that would enable fully automated highthroughput adenine auxotrophy-based screening and quantification.
Typically, red/white colony color assays start with plating individual yeast cells on limiting adenine indicator plates and allowing them to grow until they form colonies large enough to be inspected by eye. Each plate may represent an independent condition or genotype, the penetrance of which can be assessed by quantifying the percentage of non-white colonies per plate. In order to create a pipeline that would fit into existing protocols as seamlessly as possible, we considered our input to be images of plates and our desired output to be percentages of white and nonwhite colonies per plate. Two major tasks are required to generate this output: separating the colonies from the plate background and classifying them individually as white or non-white.
These two tasks could conceivably be completed either in one step, as with a single-shot detector (Liu et al., 2016) or RetinaNet (Lin et al., 2017), or in two separate steps, such as with a semantic segmentation, where each pixel is assigned a label such as 'foreground' or 'background', followed by classification of cropped images. While a single-step approach may be preferable from the perspective of algorithmic efficiency and speed (Huang et al., 2016), the training data annotations are more complex, requiring both manually assigned labels and matched bounding boxes identifying the location of each colony on a plate. As insufficient training data is a common problem hampering efforts to apply deep learning in many biological domains (Hughes et al., 2018), we opted to use a pragmatic approach, treating the segmentation and classification steps as separate problems (Fig. 1A). This allowed us to use simpler and, when available, preexisting annotations for training data: for the segmentation task, we used masks generated previously using the Ilastik image-processing toolkit (Sommer et al., 2011), while for the classification task, we relied on manual labels assigned by experienced biologists to cropped images of single colonies. All of our training data consisted of Schizosaccharomyces pombe colonies, with red pigment resulting from heterochromatin-mediated silencing of the ade6 + gene.

Implementation of the pipeline
All analyses were performed in Python v. 3.6.5 inside a conda virtual environment (v. 4.5.4). Deep learning models were built and trained using the fast.ai library (https://github.com/fastai/fastai) (Howard and others, 2018). Image processing steps were performed using the scikit-image library (v. 0.13.1) (https://scikit-image.org) (van der Walt, 2014). Model training and prediction was run on a GeForce GTX 1080 GPU with CUDA v. 9.0 and CUDNN v. 7.1.2.

Image segmentation
A modified U-net architecture using a Resnet-34 network pretrained on ImageNet as the encoder was used as the network architecture for the plate image segmentation task (He et al., 2016;Ronneberger et al., 2015). A total of 492 pairs of plate images and corresponding segmentation masks generated by Ilastik were used as training data, with approximately 20% (95 pairs) set aside for validation. Binary cross-entropy with logits ('BCEWithLogitsLoss') was used as a loss function, and the dice coefficient was used as an evaluation metric. Data augmentation was applied to the training images with the following transformations: random rotation up to 4°, random horizontal flip, random adjustment of brightness and contrast. The same transformations, except for adjustment of brightness and contrast, were applied to the training masks.
The network was first trained with center-cropped masks and images resized to 512×512 pixels, with a batch size of 4. A weight decay parameter of 1e-7 was used for all training. First, only the last layer was trained using stochastic gradient descent with restarts (SGDR) for 1 cycle of 8 epochs, using a circular learning rate scheduler with a maximum learning rate of 4e-2, a minimum learning rate of 8e-3, 1 epoch of increasing learning rate, and 7 epochs of decreasing learning rate (Huang et al., 2017;Smith, 2015). Next, all layers were trained using SGDR for 1 cycle of 20 epochs. Differential learning rates were applied across layers, with the first third of layers having a maximum learning rate of 1e-4, the middle third having a maximum learning rate of 1e-3, and the last third having a maximum learning rate of 1e-2. A circular learning rate scheduler was again used, with minimum learning rates of one-twentieth of respective maximum learning rates, 2 epochs of increasing learning rates, and 18 epochs of decreasing learning rates. The resulting weights were then saved and used as a starting point to train the network with larger, 1024×1024 images.
Training images and masks were scaled up to 1024×1024 pixels and the network was further trained. First, only the last layer was trained using SGDR for one cycle of 2 epochs, using a circular learning rate scheduler with a maximum learning rate of 4e-2, a minimum learning rate of 8e-3, 0.5 epochs of increasing learning rate, and 1.5 epochs of decreasing learning rate. Next, all layers were trained using SGDR for 20 cycles of 20 epochs. Differential learning rates were applied across layers, with the first third of layers having a maximum learning rate of 4e-5, the middle third having a maximum learning rate of 2e-4, and the last third having a maximum learning rate of 4e-3. A circular learning rate scheduler was again used, with minimum learning rates of one-twentieth of respective maximum learning rates, 2.5 epochs of increasing learning rates per cycle, and 17.5 epochs of decreasing learning rates per cycle. The resulting weights were saved and used for prediction.

Colony classification
A Resnet-34 network that had been pre-trained on ImageNet was used as the network architecture for the colony classification task. The final output layer was replaced with a layer predicting five classes ('white', 'red', 'pink', 'variegating' and 'bad segmentation'). A total of 1476 manually labeled, cropped images of individual colonies were used as training data, with approximately 20% (295 images) set aside for validation. Out of the total pool of images, 537 were labeled as white, 273 as red, 310 as pink, 318 as variegating, and 38 as bad segmentation. Validation images were chosen so as to have the same proportions among the five classes as in the total pool. For training, images were resized to 50×50 pixels and a batch size of 128 was used. Stochastic gradient descent (SGD) with a Momentum of 0.9 was used as an optimization algorithm. Categorical cross-entropy was used as a loss function, and both log loss and accuracy were used as evaluation metrics. Data augmentation was applied to the training images with the following transformations: random rotation up to 10°, random rotation or reflection of dihedral group 8, random adjustment of brightness and contrast, random zoom up to 1.1×.
The last layer of the network was first trained without data augmentation for 1 epoch using a learning rate of 1e-2. Data augmentation was then added, and it was trained with SGDR for three cycles of 1 epoch each using cosine annealing, with an initial learning rate of 1e-2. Next, all layers were trained for 17 sets of three cycles of increasing length (1 epoch, followed by 2 epochs, followed by 4 epochs), for a total of 51 cycles and 119 epochs. Differential learning rates were applied across layers, with the first third of layers having a maximum learning rate of 1.1e-4, the middle third having a maximum learning rate of 3.3e-4, and the last third having a maximum learning rate of 1e-3. Training was stopped when over-fitting was observed, and the resulting weights were saved and used for prediction.

Mask prediction, post-processing and colony class prediction
The full prediction pipeline is implemented as follows: first, plate images are center-cropped and resized to 1024×1024 pixels. Segmentation masks are predicted using the trained U-net Performance of the pipeline To perform the segmentation task, we chose a U-net-like architecture implemented in the fast.ai library (Howard and others, 2018). U-net was developed specifically for semantic segmentation and has been successfully applied to complex biomedical images such as electron microscopy of neuronal structures and MRI or ultrasound images in breast cancer screening (Ronneberger et al., 2015;Kumar et al., 2018;Dalmısȩ t al., 2017). We trained the U-net network using 492 plate images and corresponding Ilastik-generated masks, with 20% kept aside for validation (see Image segmentation for full training parameters). After training, visual inspection of predicted masks revealed an accurate segmentation of colonies from background, although some errors remained around plate edges (Fig. 1B). This was not unexpected, considering that the Ilastik-generated masks also often contained artefacts at the edges. In order to circumvent this problem, we applied post-processing on the predicted masks, which effectively removed artefacts. The vast majority of resulting cropped regions contained a single colony; however, a few regions still contained multiple small, overlapping colonies. To reduce possible bias that might result from counting multiple colonies as one, we filtered these out during the classification stage. For the classification task, we fine-tuned a Resnet-34 architecture that was pre-trained on ImageNet (http://www.image-net.org/) (He et al., 2016;Deng et al., 2009), also implemented in the fast.ai library (Howard and others, 2018). We trained the network using 1476 images of individually cropped colonies, which were split into five manually-labeled classes: white, red, pink, variegating and multiple colonies. Again, 20% of colony images were kept aside for validation. After training, we achieved a validation accuracy of 91.8% across the five classes (Fig. 1D, Table 1). Further, aggregating the three classes of non-white colonies together (red, pink and variegating) yielded a much higher validation accuracy of 98.6%. This higher pooled accuracy was encouraging, considering our desired output of percentages of white and non-white colonies per plate. It also demonstrates that most classification errors occur within non-white classes rather than between white and non-white classes, an expected result given that the non-white classes often do not have clear distinctions and can be difficult to define even by eye.
While we were encouraged by our high validation accuracy, we wanted to test the pipeline's performance against manual counting in a real-world, experimental context. To this end, we took data from two published experiments testing trans-generational inheritance of ade6 + silencing in Schizosaccharomyces pombe (Duempelmann et al., 2019). In these experiments, ade6 + silencing was first induced by expression of small interfering RNAs (siRNAs) that are complementary to the ade6 + gene in a paf1-Q264Stop nonsensemutant background, leading to red colonies. Paf1 is a subunit of the Paf1 complex (Paf1C), which represses siRNA-induced heterochromatin formation in S. pombe (Kowalik et al., 2015). In the presence of the paf1-Q264Stop allele, the silenced (red) phenotype was inherited through meiosis, even in the absence of the original siRNAs that have triggered ade6 + repression. This was not the case if the progeny inherited a paf1 + wild-type allele, i.e. the red silencing phenotype was lost. However, these white paf1 + cells inherited a marked ade6 + epiallele (ade6 si3 ), which reinstated silencing when cells became mutant for Paf1 again in subsequent generations (Duempelmann et al., 2019). The following experiments were performed to quantify different aspects of this phenomenon.
In the first experiment, paf1-Q264Stop cells that inherited the ade6 si3 allele and re-established the red silencing phenotype were plated on limiting adenine indicator plates to quantify the maintenance of this re-established silencing through mitosis. Out of 59 plates derived from a red progenitor colony, our automated pipeline predicted a mean of 84.7% non-white colonies, indicating a high degree of maintenance of ade6 + silencing. Ten of these plates were manually counted, and the mean difference in the percent of counted and predicted non-white colonies was 2.1 percentage points. In comparison, many red colonies (43%) were falsely classified as white when using the popular open-source software CellProfiler (github.com/CellProfiler/CellProfiler/tree/v3.1.9) ( Fig. 2A). As a control, colonies derived from cells of white paf1-Q264Stop progenitor colonies were also quantified by both methods. Our automated pipeline predicted a mean of 0.57% non-white colonies across 60 plates, while manual counting of colonies on 12 plates detected 0 non-white colonies. Also, here the predictions made with CellProfiler were less accurate and misclassified 14.66% of the white colonies as non-white (Fig. 2B).
The second experiment was performed to assess mitotic stability of the ade6 si3 epiallele. White paf1 + cells with an ade6 si3 epiallele were grown exponentially and a sample was crossed to white paf1-Q264Stop cells every 3 days (30-40 mitotic divisions) over 19 days total. Progeny of these crosses were also observed to re-establish silencing; however, the frequency of re-establishment declined with the number of mitotic divisions the cells had gone through. The percentage of non-white colonies on plates resulting from crosses at each time point was both counted manually (66,000 colonies total) and predicted using the automated pipeline. Both methods showed a near-exponential decrease in non-white colonies over time; the mean difference in the percent of non-white colonies between the two methods ranged from 0.26 to 5.1 percentage points (Fig. 2C).

DISCUSSION
Here we introduce a novel computational pipeline for colony segmentation and classification. It allows us to achieve accuracy comparable to human performance, which was difficult to achieve with existing colony segmentation and classification programs. For example, many red colonies were falsely classified as white when analyzing our data with CellProfiler ( Fig. 2 and data not shown). CellProfiler also cannot distinguish variegating, pink and red colonies.
We observed several factors that contributed to the accuracy of the automated pipeline predictions versus manual counting. In general, the pipeline performed with very high accuracy on plates with a low (<5%) percentage of non-white colonies; however, on some plates that had been grown for over 2 weeks, white colonies formed an irregular ring-like morphology and tended to be mis-classified as pink. The accuracy of the pipeline was also decreased for plates with very dense, small colonies. This may be partly due to the difficulty in segmenting individual colonies when they are touching one another, leading to many colonies being excluded from the analysis. Very small colonies are also more likely to be mistakenly filtered out during post-processing. Since cells with a silenced ade6 + gene tend to grow more slowly, this may lead to bias, as very small colonies are more likely to be non-white. However, plating cells at a controlled density and imaging the plates after an appropriate amount of time can counteract this potential bias. We have posted a full suggested protocol describing plating and imaging for most accurate prediction on Protocol Exchange (https://protocolexchange. researchsquare.com/article/nprot-7305/v1).
Our pipeline generates as output the numbers of predicted colonies per plate for both the aggregated non-white versus white categories as well as for the more granular classes of red, pink and variegating colonies. As detailed in Table 1, we observed a higher overall performance for white versus non-white predictions; however, in some cases researchers may prefer to use the more granular predictions despite their lower accuracy. This decrease in accuracy is largely due to reduced sensitivity for individual class predictions, while a high level of specificity is maintained.
Our fully automated pipeline can be run on a CUDA-enabled CPU or GPU. On a GeForce GTX 1080 GPU, it took approximately 45 minutes to process 245 plate images, representing an 80-100x speedup over manual counting by an experienced biologist. In addition to saving time and manual labor, the pipeline has the potential to increase reproducibility by removing variations between individual researchers or computer monitors. Our pragmatic approach, combining transfer learning with readily available annotations, allows us to achieve accuracy comparable to human performance with relatively little training data. Our work should Class labels were predicted for n=295 colonies and compared against manual annotations. Metrics are shown for individual class labels (white, pink, red and variegating), as well as for the aggregated non-white class, which includes pink, red and variegating colonies. For each class, true positives (TP) are considered to be colonies that were predicted to have a given label and also manually annotated with that label. False positives (FP) are considered to be colonies that were predicted to have a given label, but manually annotated with a different label. True negatives (TN) are considered to be colonies that were predicted to have any label besides a given label and also manually annotated to have any label besides that label. False negatives (FN) are considered to be colonies that were predicted to have any label besides a given label, but manually annotated with that label. Sensitivity was calculated as TP/(TP+FN). Specificity was calculated as TN/(TN+FP). The F1-score represents the harmonic mean of sensitivity and specificity and was calculated as 2*(sensitivity * specificity)/(sensitivity+specificity).
thus enable larger-scale experiments and higher statistical power, unlocking a true quantitative use of the red/white color assay in yeast research. We have made the code and trained network weights freely available to the community at https://github.com/fmi-basel/ buehler-colonyclassification.

Protocol for plating and imaging of yeast
We have posted a full suggested protocol describing plating and imaging for most accurate prediction on Protocol Exchange (https://protocolexchange. researchsquare.com/article/nprot-7305/v1). Pipeline requirements Operating system(s): Platform independent Programming language: Python Other requirements: Python 3.6 or higher, fastai library v. 0.7 License: GNU GPL v3.0 Any restrictions to use by non-academics: None Predictions with CellProfiler made for each image. Because of the high memory usage and memory leaks we reduced the image resolution from 5184×3456 pixels to 1728×1152 pixels. The range of typical diameter of objects was set to 6-33 pixel units and the threshold value which separates white from non-white colonies was optimized to 0.055. Mean predicted or manually counted percentages of non-white colonies per plate across a time course of mitotically dividing white paf1 + ade6 si3 cells crossed to white paf1-Q264Stop cells every 3 days. The time course was repeated with 11 independent biological replicates, and six plates per replicate per timepoint were quantified by automated prediction (n=66 per time point). The mean of all 66 plates is reported for each timepoint. For manual counting, one plate per replicate per timepoint was counted (n=11 per time point); the mean of 11 plates is reported for each timepoint.