Cell imaging has entered the ‘Big Data’ era. New technologies in light microscopy and molecular biology have led to an explosion in high-content, dynamic and multidimensional imaging data. Similar to the ‘omics’ fields two decades ago, our current ability to process, visualize, integrate and mine this new generation of cell imaging data is becoming a critical bottleneck in advancing cell biology. Computation, traditionally used to quantitatively test specific hypotheses, must now also enable iterative hypothesis generation and testing by deciphering hidden biologically meaningful patterns in complex, dynamic or high-dimensional cell image data. Data science is uniquely positioned to aid in this process. In this Perspective, we survey the rapidly expanding new field of data science in cell imaging. Specifically, we highlight how data science tools are used within current image analysis pipelines, propose a computation-first approach to derive new hypotheses from cell image data, identify challenges and describe the next frontiers where we believe data science will make an impact. We also outline steps to ensure broad access to these powerful tools – democratizing infrastructure availability, developing sensitive, robust and usable tools, and promoting interdisciplinary training to both familiarize biologists with data science and expose data scientists to cell imaging.
Microscopy provides visual access to cell appearance, organization and behavior, enabling us to discover new biology by observing cells in their basal and perturbed states. The intricate beauty of microscopy images is often engrossing. However, a digital microscopy image is a sequence of numerical values and can be interpreted not only visually, but also via mathematical analysis. Many techniques have been developed for cell biology that take advantage of the dual nature of microscopy images by using their quantitative representation to test hypotheses articulated after carefully viewing them (Ellenberg et al., 2018).
The approach of first looking at and then quantifying microscopy images is becoming increasingly difficult because microscopy for cell biology now entails more – more automation for high-content image acquisition, more modes of microscopy that generate larger datasets, and more microscopes, enabling greater access to microscopy experiments by more people. Beyond generating ever-larger datasets, these advances allow us to test biological hypotheses requiring complex image data that might extend across wide spatial scales, long time-frames or many channels. Even a single complex image, such as a dense 3D mesh of actin or a spheroid of cells, can be too complicated to interpret visually. Humans have an amazing capacity to spot patterns in visual data, but the increased volume and complexity of modern cell imaging data make visual interpretation infeasible. To draw biological conclusions from ever larger and more-complex imaging datasets, we must change how we interpret cell image data (Ouyang and Zimmer, 2017).
Consider the example of a recent COVID-19 drug screen with 300,000 five-channel immunofluorescence images (Heiser et al., 2020 preprint). It would not be feasible to visually assess and interpret such a large screen. Instead, a deep convolutional neural network, which is a machine-learning technique, was used to automatically extract 1024 properties from each image for statistical analysis, and the results were interpreted and visualized to communicate with other scientists and the general public. This example follows a new paradigm for drawing biological conclusions from complex or high-volume imaging data. Rather than looking and then subsequently quantifying, the order is switched, first computationally analyzing images to develop and test biological hypotheses and only then moving back to the image data to interpret the results and communicate findings (see Fig. 1). In this Perspective, we present the state of data science in cell imaging, which is currently dominated by data science-based tool building for automated quantification of routine bioimage processing. We distinguish these ‘low-level’, signal-driven, tools from ‘high-level’, biology-driven, data science, where hypotheses are raised and biological insights are derived from complex cell image data. Low-level tasks are enabling technologies to address existing questions, whereas high-level tasks, which build upon low-level tasks, open up whole new categories of currently inaccessible questions. Data science has the potential to revolutionize microscopy-based cell biology, but only if infrastructure democratization and cross-disciplinary training are advanced to enable high-level data science in cell imaging.
Data science in cell biology
With the volume and complexity of imaging data increasing, we now need computation to automatically perform tasks across large datasets and to reframe complex data via pattern detection and visualization. Data science, an emerging interdisciplinary field that involves the development and application of computational tools to extract domain-specific insights from large and/or complex datasets, has already begun to supply the needed toolbox. Although the boundaries of data science remain fluid, the field combines domain knowledge with techniques from mathematics, statistics, computer science and information sciences, such as machine learning, to identify patterns hidden in data and perform statistical hypothesis testing on large data sets. The data science toolbox enables the computation-first interpretation of cell images by allowing us to iteratively alternate computational analysis with the generation of biological hypotheses and visualization of the obtained results (Wait et al., 2020).
Data science has been successfully applied to cell imaging data in multiple contexts. One prominent recent theme is the development of deep-learning inference techniques, for example inference of high-resolution images from low-resolution images or inference of cell structure directly from images (Belthangady and Royer, 2019; Eisenstein, 2020). In general, machine-learning algorithms fit generic mathematical models to data. In contrast to traditional machine learning, where models are learned from data features manually engineered by experts, deep learning enables analysis without relying on predetermined features. Instead, a hierarchy of image features is generated directly from the data, simultaneously with the model learning process. This is achieved by using ground truth annotations to train a model to map an input image to a predicted annotation, for example, mapping every pixel in a fluorescence image to its corresponding foreground or background annotation. During training, the model is automatically optimized for a given task by gradually adjusting its internal parameters according to the errors it makes, in a process called back-propagation. Deep learning has already revolutionized machine-learning-driven fields and, in microscopy, has mostly been used to improve the robustness and performance of standard bioimage-analysis tasks, such as segmentation, tracking and classification (Moen et al., 2019; Ouyang et al., 2019a; Ronneberger et al., 2015; Van Valen et al., 2016). It has also provided solutions to other, less-routine, computational tasks. 
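As a minimal sketch of the training loop described above, and not the architecture used in any of the cited works, the following Python example fits a single-layer logistic model that maps each pixel intensity to a foreground probability. The toy intensities, learning rate, epoch count and function names are illustrative assumptions; a real deep network stacks many such layers and back-propagates the error through all of them.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_pixel_classifier(intensities, labels, lr=0.5, epochs=2000):
    """Fit p(foreground) = sigmoid(w * intensity + b) by gradient descent."""
    w, b = 0.0, 0.0
    n = len(intensities)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(intensities, labels):
            # Prediction error; for the cross-entropy loss this is the
            # gradient with respect to the pre-activation (w * x + b).
            error = sigmoid(w * x + b) - y
            grad_w += error * x
            grad_b += error
        # Adjust the parameters against the average gradient.
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b

def predict_mask(intensities, w, b):
    """Label a pixel as foreground (1) when its predicted probability is >= 0.5."""
    return [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in intensities]

# Toy flattened "image" with a human-provided ground truth annotation
# (hypothetical values chosen for illustration only).
pixels = [0.05, 0.10, 0.15, 0.20, 0.80, 0.85, 0.90, 0.95]
truth = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_pixel_classifier(pixels, truth)
mask = predict_mask(pixels, w, b)
```

In this toy setting the only learned 'feature' is the raw intensity; the point of deep learning is that the intermediate layers learn richer features from the data rather than relying on such a hand-picked input.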
For example, image restoration algorithms attempt to enhance image quality by inferring high-quality images from low-quality data (Weigert et al., 2018) using a variety of strategies, such as by taking advantage of structural redundancy in an image to reconstruct high-quality super-resolution images from under-sampled localization microscopy data (Ouyang et al., 2018), or by performing point spread function engineering for single-molecule localization (Nehme et al., 2020). Other applications include the inference of intracellular organelle localization from label-free images and the mapping of different cell microscopy modalities onto one another (Christiansen et al., 2018; Ounkomol et al., 2018), with potential applications including high-content screening (Cheng et al., 2021) and the prediction of the functional cell state, such as stages of the cell cycle or disease progression (Buggenthin et al., 2017; Eulenberg et al., 2017; Yang et al., 2020; Zaritsky et al., 2020 preprint).
A second theme of data science in cell imaging is high-content cell profiling, where the distributions of image-derived single-cell measurements, such as length, area and fluorescence brightness, are used to define fingerprints of cell populations under different experimental conditions (Perlman et al., 2004). By distilling often large image datasets into succinct fingerprints, cell profiling renders datasets accessible to biological interpretation by users. For example, CellProfiler, a popular software tool for high-content image analysis, encourages a ‘measure everything, ask questions later’ approach to image analysis (Caicedo et al., 2017; Carpenter et al., 2006; Chandrasekaran et al., 2020) by enabling users to first quickly extract and visualize a wide variety of quantitative measures before deciding which are biologically important. These image-based cell profiling ideas are now beginning to be applied to more-complex model systems, including the screening of 3D patient-derived organoids (Beck et al., 2021 preprint; Betge et al., 2019 preprint; Serra et al., 2019).
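The fingerprint idea can be illustrated with a short Python sketch. It is not how CellProfiler or any cited pipeline computes profiles; the feature names, the toy measurements and the choice of a control-normalized median as the summary statistic are assumptions made for illustration.

```python
import math
from statistics import median

FEATURES = ("area", "length", "brightness")

def fingerprint(cells, reference_cells, features=FEATURES):
    """Per-feature median of a population, relative to a reference population."""
    return [
        median(cell[f] for cell in cells) / median(cell[f] for cell in reference_cells)
        for f in features
    ]

def fingerprint_distance(p, q):
    """Euclidean distance between two fingerprints."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def make_cells(areas, lengths, brightnesses):
    """Build hypothetical single-cell measurement records."""
    return [
        {"area": a, "length": l, "brightness": b}
        for a, l, b in zip(areas, lengths, brightnesses)
    ]

# Toy single-cell measurements for a control and three treatments.
control = make_cells([100, 110, 90], [12, 10, 11], [1.0, 1.2, 0.8])
drug_a = make_cells([150, 160, 140], [11, 10, 12], [2.0, 2.2, 1.8])
drug_b = make_cells([152, 148, 155], [11, 11, 10], [2.1, 1.9, 2.0])
drug_c = make_cells([100, 95, 105], [22, 20, 24], [1.0, 1.1, 0.9])

fp_a = fingerprint(drug_a, control)
fp_b = fingerprint(drug_b, control)
fp_c = fingerprint(drug_c, control)

# Treatments with similar fingerprints are candidates for a shared phenotype.
a_resembles_b = fingerprint_distance(fp_a, fp_b) < fingerprint_distance(fp_a, fp_c)
```

Real profiling pipelines typically use hundreds of features and richer summaries of the single-cell distributions than a median, but the workflow of measuring broadly, summarizing per condition and then comparing conditions is the same.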
There are many other examples of the application of data science to cell imaging that are specific to particular biological subdomains. These include, for example, quantitative representations of cell shape in 2D (Bagonis et al., 2019; Chan et al., 2020 preprint; Keren et al., 2008; Pincus and Theriot, 2007) and in 3D (Driscoll et al., 2019; Elliott et al., 2015), perturbation-free inference of information flow in signaling pathways via ‘computational multiplexing’-based fluctuation analysis (Lee et al., 2015; Machacek et al., 2009), statistical-based methods for classification and characterization of protein localization patterns and intracellular organization (Boland et al., 1998; Boland and Murphy, 2001; Glory and Murphy, 2007; Ouyang et al., 2019b; Peng and Murphy, 2011), atlases for intracellular organization and their analyses (Cai et al., 2018; Heinrich et al., 2020 preprint; Thul et al., 2017; Viana et al., 2020 preprint), time-series analyses of heterogeneous dynamic molecular events (Aguet et al., 2013; Bhave et al., 2020; Goglia et al., 2020; Jacques et al., 2020 preprint; Wang et al., 2018, 2020), tracking of lineage, tissue structure and dynamics in development, morphogenesis and collective cell migration (Amat et al., 2014; Etournay et al., 2016; Hartmann et al., 2020; Keller, 2013; Zaritsky et al., 2017), graph representations of dynamic cellular processes (Gut et al., 2015), integration of single-cell omics and imaging data (Villoutreix, 2021; Yang et al., 2021), and machine learning for automated microscopy (Royer et al., 2016; Waithe et al., 2020).
The emerging use of data science tools is revolutionizing many fields, including the social sciences and business, and its impact in cell biology will likely grow. Even just a few years ago, advanced programming skills were needed to implement data science pipelines. Recently, however, user interfaces and other tools have been developed (Bannon et al., 2021; Fazeli et al., 2020; Ouyang et al., 2019a; Stringer et al., 2021; Von Chamier et al., 2020 preprint), rendering data science in cell imaging more accessible to a wide range of researchers.
Hierarchies of data processing in microscopy
The robust and versatile construction of computational pipelines for cell imaging is built on two software design concepts – modularity and abstraction. Modularity and abstraction are what make image analysis pipelines broadly useful and were arguably the key conceptual software advances that fueled the development of modern-day computing.
Building a modular pipeline requires decomposing the main image-analysis task into discrete subtasks that are as independent and generalizable as possible. For example, analysis of nuclei movement in an embryo could be decomposed into a nuclei detection problem, followed by generic object tracking, and then track analysis. The power of modularity stems from the ability to construct complex image analysis pipelines from smaller components that can be designed independently, yet function together. This promotes the reuse of successful modules in many pipelines.
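The nuclei example can be sketched in Python as three independent functions that are only connected by their inputs and outputs. This is a toy illustration under strong assumptions (each nucleus is a single bright pixel, and linking is greedy nearest-neighbour); the function names and parameters are hypothetical and do not correspond to any specific tool.

```python
import math

def detect_nuclei(frame, threshold):
    """Detection module: return (row, col) of every pixel brighter than threshold."""
    return [
        (r, c)
        for r, row in enumerate(frame)
        for c, value in enumerate(row)
        if value > threshold
    ]

def track_objects(detections_per_frame, max_distance):
    """Generic tracking module: greedy nearest-neighbour linking between frames."""
    tracks = [[point] for point in detections_per_frame[0]]
    for detections in detections_per_frame[1:]:
        unclaimed = list(detections)
        for track in tracks:
            if not unclaimed:
                break
            nearest = min(unclaimed, key=lambda p: math.dist(p, track[-1]))
            if math.dist(nearest, track[-1]) <= max_distance:
                track.append(nearest)
                unclaimed.remove(nearest)
        # Detections that could not be linked start new tracks.
        tracks.extend([point] for point in unclaimed)
    return tracks

def analyze_tracks(tracks):
    """Track-analysis module: net displacement of each track."""
    return [math.dist(track[0], track[-1]) for track in tracks]

def run_pipeline(frames, threshold=128, max_distance=2.0):
    """Compose the modules; each one can be replaced without touching the others."""
    detections = [detect_nuclei(frame, threshold) for frame in frames]
    return analyze_tracks(track_objects(detections, max_distance))

def make_frame(size, bright_pixels):
    """Build a toy frame with the given pixels set to a bright value."""
    frame = [[0] * size for _ in range(size)]
    for r, c in bright_pixels:
        frame[r][c] = 255
    return frame

# One nucleus moves along row 1, the other stays at (4, 4).
frames = [
    make_frame(5, [(1, 0), (4, 4)]),
    make_frame(5, [(1, 1), (4, 4)]),
    make_frame(5, [(1, 2), (4, 4)]),
]
displacements = run_pipeline(frames)
```

Because the tracker only sees lists of coordinates, it knows nothing about nuclei or embryos and could be reused unchanged for vesicles or whole cells.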
Abstraction is a process that enables modular design by promoting both module reuse and simplicity, hiding algorithmic details within modules, and exposing the inner working of modules to other modules only when necessary. Abstraction enables users and tool developers to focus only on the details that are immediately relevant instead of conceptualizing the algorithm in its full complexity. For example, there exist countless proprietary microscopy file formats, each differently encoding the image and its corresponding metadata. The software Bio-Formats (Linkert et al., 2010), which is executed every time a user reads or writes image data in the image processing program Fiji, provides the abstraction that allows users to access image data without having to be aware of the exact encoding of the different file formats.
Modularity and abstraction go hand-in-hand to enable effective problem solving, with abstraction making modularity possible. For example, Fiji promotes the construction of modular image analysis pipelines via plugins. Plugins are the modular components of these pipelines, each solving a well-defined problem and providing an abstract input–output interface. Such an implementation enables straightforward reuse of the same plugin in different pipelines, switching between different components that share the same interface, and expansion of existing pipelines.
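To make the notion of an abstract input–output interface concrete, here is a minimal Python sketch. It does not reproduce Fiji's actual plugin API (which is Java-based); the class names and the `run` method are hypothetical and serve only to show how two components with the same interface can be swapped without changing the code that uses them.

```python
from abc import ABC, abstractmethod

class SegmentationPlugin(ABC):
    """Hypothetical interface: 2D list of intensities in, 2D list of 0/1 labels out."""

    @abstractmethod
    def run(self, image):
        ...

class FixedThreshold(SegmentationPlugin):
    """Label pixels brighter than a user-supplied threshold."""

    def __init__(self, threshold):
        self.threshold = threshold

    def run(self, image):
        return [[1 if v > self.threshold else 0 for v in row] for row in image]

class MeanThreshold(SegmentationPlugin):
    """Label pixels brighter than the image mean."""

    def run(self, image):
        values = [v for row in image for v in row]
        mean_value = sum(values) / len(values)
        return [[1 if v > mean_value else 0 for v in row] for row in image]

def foreground_fraction(image, plugin):
    """Downstream step that depends only on the interface, not on the algorithm."""
    labels = [v for row in plugin.run(image) for v in row]
    return sum(labels) / len(labels)

image = [[1, 2, 9], [1, 8, 9]]
fraction_mean = foreground_fraction(image, MeanThreshold())
fraction_fixed = foreground_fraction(image, FixedThreshold(8))
```

The caller never inspects how a plugin computes its mask, which is the sense in which algorithmic details stay hidden inside the module.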
The modules that compose image analysis pipelines can be crudely partitioned into two categories, low-level (signal driven) and high-level (biology driven) (see Fig. 2). Low-level tasks are the signal-driven processing steps that take images or image-derived data and transform them into other images or sequences of numbers. Low-level tasks include image preprocessing (e.g. deconvolution, stage drift correction and tiling fields of view), detection and/or segmentation (e.g. identifying cells/intracellular organelles within an image), and tracking. It is the low-level tasks that enable the automated and complete processing of large image datasets (Danuser, 2011). Importantly, devising effective solutions for low-level tasks requires deep algorithmic knowledge, and sometimes deep understanding of the imaging and optical settings. Domain knowledge can be very helpful. For example, knowledge of the bending properties of microtubules could allow preliminarily detected microtubules that have an unrealistic bend to be excluded from further analysis. However, in most cases, deep knowledge of the biological system or question is not necessary to solve low-level tasks.
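The microtubule example can be sketched as a simple post-detection filter in Python. The representation of a filament as a list of (x, y) points, the use of the turning angle between consecutive segments as a proxy for bending, and the 30-degree cutoff are all assumptions chosen for illustration; the cutoff is arbitrary and is not derived from measured microtubule mechanics.

```python
import math

def max_turning_angle(polyline):
    """Largest angle (radians) between consecutive segments of a polyline."""
    largest = 0.0
    for (x0, y0), (x1, y1), (x2, y2) in zip(polyline, polyline[1:], polyline[2:]):
        heading_in = math.atan2(y1 - y0, x1 - x0)
        heading_out = math.atan2(y2 - y1, x2 - x1)
        turn = abs(heading_out - heading_in)
        turn = min(turn, 2 * math.pi - turn)  # wrap into [0, pi]
        largest = max(largest, turn)
    return largest

def filter_plausible_filaments(polylines, max_turn_radians):
    """Keep only detections whose sharpest bend is below the allowed turn."""
    return [p for p in polylines if max_turning_angle(p) <= max_turn_radians]

# Two preliminary detections: a gently curved filament and one with a sharp kink.
gently_curved = [(0, 0), (1, 0), (2, 0.1), (3, 0.2)]
kinked = [(0, 0), (1, 0), (1, 1)]

# Illustrative cutoff only, not a measured biophysical limit.
kept = filter_plausible_filaments([gently_curved, kinked], math.radians(30))
```

The filter itself is purely geometric; the biological knowledge enters only through the choice of cutoff.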
High-level tasks are biology driven, transforming large or otherwise difficult to interpret sets of data, which are generally the outputs of low-level tasks, into information that can be directly understood to draw biological conclusions. High-level tasks include data visualization and exploration, model fitting, and statistical inference and comparisons. In contrast to most low-level tasks, high-level tasks always require deep knowledge of the particular biological domain. In order to formulate testable hypotheses, one must understand the biological process at hand and be aware of the experimental and computational techniques available to extract information hidden within the image data. Admittedly, it is currently difficult to point to specific major breakthrough discoveries in cell biology achieved by applying data science to cell imaging. However, both low- and high-level tasks carry the potential to transform the field. Biological discovery is driven by enabling technologies – data science applied to low-level tasks will open the door to addressing existing questions that were previously inaccessible due to a lack of suitable powerful methods. High-level application of data science may unlock completely new fields driven by new types of questions and new ways to discern cell imaging data.
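As one hedged example of a high-level statistical comparison, the Python sketch below runs a two-sided permutation test on the difference in means between two conditions. The measurements are invented, the function name and defaults are assumptions, and a permutation test is only one of several reasonable choices for this kind of comparison.

```python
import random

def _mean(values):
    return sum(values) / len(values)

def permutation_test(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation p-value for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(_mean(group_a) - _mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign measurements to the two conditions.
        rng.shuffle(pooled)
        diff = abs(_mean(pooled[:n_a]) - _mean(pooled[n_a:]))
        if diff >= observed - 1e-12:  # small tolerance for floating-point ties
            at_least_as_extreme += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)

# Hypothetical per-cell migration speeds (arbitrary units) from a low-level pipeline.
control_speeds = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3]
treated_speeds = [2.0, 2.3, 1.9, 2.2, 2.1, 2.4]

p_value = permutation_test(control_speeds, treated_speeds)
```

Deciding which measurement to compare, and whether cells, fields of view or independent experiments are the appropriate unit of replication, is where knowledge of the biological system becomes indispensable.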
Moving beyond tool building
Data science tools have already been extensively adapted for a variety of low-level tasks, such as image enhancement (Weigert et al., 2018), segmentation (Caicedo et al., 2019; Isensee et al., 2020; Stringer et al., 2021; Van Valen et al., 2016) and tracking (Ulman et al., 2017). Indeed, most efforts in the thriving bioimage informatics community have been invested in these types of automation and tool-building projects (Meijering et al., 2016). Low-level tool building is essential for advancing almost all cell-imaging-based research, but is not sufficient to answer biological questions. For example, even if all the cells in a developing zebrafish embryo are segmented and tracked, this alone does not provide biological insight. Rather, the tracks must be further visualized and analyzed with the underlying biology in mind.
Why has the bioimage analysis community so far focused on low-level analysis tasks at the expense of the high-level tasks that yield exciting biological discoveries? We believe that this focus stems from two main causes. First, low-level tasks are the most common problems encountered by any microscopist and thus draw community attention as obvious important questions worth tackling. Furthermore, they are the initial steps in any quantification. This may seem trivial; however, developing algorithms for high-level tasks is complicated by the need to first deploy an array of low-level tasks, whereas developing low-level algorithms simply requires the raw data.
Second, low-level tasks are simpler for researchers outside the field of biology to tackle, and are particularly well-suited to computer scientists. Low-level tasks are often readily formulated as abstract computational problems and developing algorithms for them does not typically require any specific ‘domain’ knowledge. In addition, a major motivation for researchers from applied computational sciences, such as computer vision, is algorithmic elegance and efficiency. Publishing and career advancement in computer science is driven by novelty in algorithm design, performance, robustness and, for some applications, usability. Utility to other fields, such as biology, is not emphasized. Moreover, the gold standard for evaluating most low-level applications is comparison with human annotation; however, there is often no correspondingly simple way of evaluating high-level algorithms whose utility is understood only in the context of a particular biological domain. Accordingly, application of data science techniques in cell imaging is heavily biased toward low-level tasks.
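Comparison with human annotation is straightforward to express in code, which is part of why low-level tasks are easy to benchmark. The Python sketch below computes the intersection over union (IoU) between a predicted and a manually annotated binary mask; IoU is one commonly used overlap score, and the masks here are toy values.

```python
def intersection_over_union(predicted, annotated):
    """Overlap score between two binary masks given as 2D lists of 0/1."""
    pred = [v for row in predicted for v in row]
    ann = [v for row in annotated for v in row]
    intersection = sum(1 for p, a in zip(pred, ann) if p and a)
    union = sum(1 for p, a in zip(pred, ann) if p or a)
    # Two empty masks agree perfectly by convention.
    return intersection / union if union else 1.0

predicted_mask = [[0, 1, 1],
                  [0, 1, 0]]
human_mask = [[0, 1, 1],
              [0, 0, 0]]

iou = intersection_over_union(predicted_mask, human_mask)
```

No equally compact score exists for a high-level analysis, because whether a hypothesis generated from the data is useful cannot be read off from agreement with an annotation.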
Building robust image analysis pipelines requires shared infrastructure
No one research lab can be expert at the full spectrum of low- and high-level tasks needed to draw robust biological conclusions from imaging data. In fact, few labs currently have the expertise and resources to take a computation-first approach to cell imaging data. To utilize the full power of modern microscopy, we must democratize access to computational analysis tools, data and training.
Although well-designed algorithms that employ modularity and abstraction enable the reuse of tools across labs, good software design alone is not enough. Moving beyond low-level tasks requires shared infrastructure to enable the joint development of algorithms and the open use of data. Such infrastructure promotes the exchange of open-source software and image-analysis toolboxes that enable an effective quantification of low-level tasks and allows developers to focus on one component of interest without the need to build a full analysis pipeline to support it. Image-analysis software, such as Fiji (Schindelin et al., 2012), CellProfiler (Carpenter et al., 2006), Icy (de Chaumont et al., 2012) and ilastik (Berg et al., 2019), as well as open-source software libraries (e.g. scikit-learn; Pedregosa et al., 2011), have so far played this role, with deep-learning-specific platforms, such as ImJoy and ZeroCostDL4Mic, beginning to be released (Haase et al., 2020; Ouyang et al., 2019a; Von Chamier et al., 2020 preprint). Support for these platforms was recently consolidated into a single, highly active online forum (https://forum.image.sc/). The BioImage Informatics Index (BIII, https://biii.eu/) is a search engine that organizes the wealth of available resources by linking bioimage analysis problems to relevant tools to solve them. Another key infrastructure effort is providing open access to published data to enhance reproducibility, enable computational tool development and allow new discoveries to be made from ‘old’ data (Zaritsky, 2018). To this end, image repositories have recently received significant attention, with the planned BioImage Archive as a major example (Ellenberg et al., 2018; Williams et al., 2017). Image repositories will enable analyses of unprecedented scales of data and are critical to attracting computational researchers to the field.
Software engineers are needed to implement and maintain large-scale tools and data repositories, but these positions are expensive and currently rarely supported by governments or other funding agencies. Philanthropic efforts, such as the Chan–Zuckerberg Initiative and the Allen Institute for Cell Science, have identified this gap and now provide external support, or hire software engineers internally to produce open software. These efforts will hopefully inspire more traditional funding mechanisms to support professional engineers in building solid and shared infrastructure.
Training the next generation of data scientists in cell biology
Cell biology is inherently technology-driven and uses many different tools from biochemistry, molecular biology, microscopy and genomics. The tools of data science are in many ways no different. Effective researchers need to be able to selectively deploy technologies from other fields to forward their research, and it is becoming increasingly clear that the ability to extract quantitative information from microscopy data is essential. A modern cell biologist should be able to decompose an image analysis problem into subtasks, use existing computational tools to solve each subtask and then analyze the pipeline output. This requires basic familiarity with common image-analysis procedures for cell imaging, an ability to piece together modules using simple programming and, importantly, basic knowledge of statistics and machine learning to interpret the results of the pipeline and its limitations.
How do we train the next generation of biologists to adapt to the reality of bioimaging as a data-intensive field? With the encouragement of funding agencies, academic institutes are beginning to adjust their training programs for the ‘Big Data’ era (Barone et al., 2017; Ekmekci et al., 2016; Rubinstein and Chor, 2014; Waldrop et al., 2015). Data analysis or programming bootcamps and high-intensity basic training that last several days or weeks have emerged as one of the most popular means to train inexperienced undergraduate or graduate students. However, the effectiveness of these bootcamps is questionable (Feldon et al., 2017), especially when the skills acquired during these short-format interventions are not subsequently practiced and applied. Other initiatives have focused on computational thinking, introducing the basic computer science principles of abstract, algorithmic and logical thinking to life scientists (Rubinstein and Chor, 2014), and/or full courses in developing programming skills (Ekmekci et al., 2016).
We argue that this is not enough. Experimental methods are taught, both directly in laboratory courses and indirectly through the reading of journal articles, with the background knowledge needed to understand these methods spread out among various courses. Similarly, data science and other quantitative methods can be integrated into curricula. New, comprehensive cross-disciplinary training programs must be established to bridge the technical and cultural gaps between the disciplines. Similar to how chemistry is perceived as essential to the biology curriculum, statistics and other data science tools should also be considered a part of the modern biologist's basic training (Markowetz, 2017). These skills should be acquired early and used continuously throughout undergraduate and graduate school, not solely in computationally focused courses (Hoffman et al., 2016). For example, when learning about microscopy, students can analyze images with Fiji and integrate results with simple Python scripting. The importance of early training and continuity was supported by a recent survey (Attwood et al., 2019).
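As an illustration of the kind of simple scripting meant here, the Python sketch below summarizes a table of measurements of the sort a student might save from Fiji as a CSV file. The embedded table, its values and the function name are hypothetical; in a real exercise the text would be read from the exported file, and the column names would depend on the measurements selected.

```python
import csv
import io
from statistics import mean

# Hypothetical contents of a measurement table saved as CSV.
# Embedded as a string so that the sketch is self-contained.
results_csv = """ ,Area,Mean
1,120.5,30.2
2,98.0,45.1
3,143.5,28.7
"""

def summarize_measurements(csv_text, column):
    """Return the number of measured objects and the mean of one column."""
    rows = csv.DictReader(io.StringIO(csv_text))
    values = [float(row[column]) for row in rows]
    return {"n": len(values), "mean": mean(values)}

area_summary = summarize_measurements(results_csv, "Area")
```

Even an exercise this small asks students to think about units, sample size and what a mean does and does not capture.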
Whereas hands-on teaching of laboratory methods can require significant space and equipment, hands-on teaching of data science techniques requires only a laptop. A lack of qualified teachers can, however, be a significant challenge (Williams et al., 2019). Faculty without formal knowledge and hands-on experience in data science are asked to design and teach relevant courses. Further compounding this problem is the lack of suitable training materials and reference textbooks specifically suited for these purposes. This situation is even worse in the domain of cell imaging. Most of the textbooks and courses for quantitative thinking and/or programming aimed at biologists are focused on applications in classic ‘bioinformatics’ (omics) (Attwood et al., 2019; Cvijovic et al., 2016; Madamanchi et al., 2018; Rubinstein and Chor, 2014). Images require a different focus because of the diversity in image acquisition techniques and experiments (Gonzalez-Beltran et al., 2020), as well as their multidimensional spatial and temporal structure.
An exciting way to solve the teacher shortage is joint interdisciplinary graduate-level training that brings together students from experimental and computational sciences and introduces both biological problems and quantitative approaches to tackle them (Saunders et al., 2018; von Arnim and Missra, 2017). Another potential solution is recruiting faculty from a neighboring computational department to jointly develop, with biomedical faculty, a discipline-specific data science curriculum (Marshall and Geier, 2020). Resources to facilitate cross-disciplinary teaching have also begun to sprout. Steve Royle's recent book, The Digital Cell: Cell Biology as a Data Science (Royle, 2019), is a guidebook for cell and molecular biologists on data science in cell biology, with a special focus on cell imaging. The Network of European BioImage Analysts (NEUBIAS) provides on-site and remote training in bioimage analysis for biologists. Two members of NEUBIAS, Kota Miura and Nataša Sladoje, recently published a ‘Bioimage Data Analysis Workflow’ (Miura and Sladoje, 2020), which teaches how to combine multiple image processing components to construct an effective automated image analysis pipeline suited to a specific purpose and image dataset.
We have so far focused on training biologists to do image analysis, but could we instead turn data scientists into biologists? One possible way forward is to engage computational students in the development of low-level tasks with the motivation of outperforming alternative algorithms and making tools usable for biologists. This route does not require deep domain knowledge and is premised on the hope that some students will develop a fascination with biology. Another parallel strategy is to design cross-disciplinary courses that include both biologists and data scientists. In the domain of data science for cell imaging, the curriculum could include a mix of topics, from low-level bioimage analysis to high-level inference. Similar to a course that one of us, Assaf, designed (Table S1), such a class could introduce data scientists to the amazingly complex world of cell imaging and eventually bring highly desired skills to cell biology.
We anticipate that data science applied to cell imaging will propel cell biology forwards through these four themes.
Understanding a biological system requires considering the variability of its components rather than just population averages that mask heterogeneous phenotypes, especially since important phenotypes may be rare.
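A small worked example in Python shows how an average can hide a rare phenotype. The two populations below are invented so that they share the same mean; the values and the threshold are illustrative assumptions.

```python
from statistics import mean

def fraction_above(values, threshold):
    """Fraction of cells whose measurement exceeds the threshold."""
    return sum(1 for v in values if v > threshold) / len(values)

# Two hypothetical populations of 100 cells with the same average signal.
homogeneous = [1.45] * 100
heterogeneous = [1.0] * 95 + [10.0] * 5  # 5% of cells respond strongly

same_average = abs(mean(homogeneous) - mean(heterogeneous)) < 1e-9
rare_fraction = fraction_above(heterogeneous, 5.0)
no_rare_cells = fraction_above(homogeneous, 5.0)
```

Both populations average 1.45, yet only the second contains a strongly responding subpopulation, which becomes visible only when single-cell distributions are examined.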
Cell biological processes cross scales in space and time – molecules organize within cells, and cells organize within tissues to function. Although we have extensively studied cell biology at some specific scales, we still do not understand how information propagates between scales to enable biological function.
Integrating data across modalities
On the one hand, single-cell omics technologies provide rich information of many well-defined per-cell measurements that is missing in microscopy-based approaches. On the other hand, microscopy can provide information at the protein level, as well as the spatial and temporal context that is mostly lacking in omics. Integrating these two forms of complementary information has vast potential to transform the field (Villoutreix, 2021).
Interpretable machine learning
Machine learning, and deep learning in particular, is very effective at identifying hidden patterns in complex cell imaging data, but lacks the ability to explain which biologically relevant properties are important. Developing interpretable data science approaches is essential for mechanistic understanding.
Modern biology is becoming more and more complex, advancing toward studies with ever more physiologically relevant systems. This trend of technology-driven complexity is only expected to grow, and we, as a community, must learn to embrace and celebrate it in order to move biology forwards. The combination of more complex data with increased data volume demands infrastructure advancements. Sensitive, robust and usable tools that enable automated analysis are key to processing vast amounts of data and reproducibly analyzing complex datasets. We must train students in data science techniques that enable them to make sense of these data. Together, we can enter the era of data science in cell imaging!
We would like to thank Philippe Roudot and Dagan Segal for kindly commenting on a draft of this manuscript, as well as Yoav Ram and Natalie Elia for discussions. We would also like to thank The Company of Biologists for funding the 2020 workshop on Data Science in Cell Imaging, and all workshop participants for discussions.
Our work in this area is supported by the Israeli Council for Higher Education (CHE) via the Data Science Research Center at Ben-Gurion University of the Negev, Israel (to A.Z.), the National Institutes of Health, K99GM123221 (to M.K.D.) and a pilot grant from the Lyda Hill Foundation (to M.K.D.). Deposited in PMC for release after 12 months.
The authors declare no competing or financial interests.