## ABSTRACT

Half a century after Lewis Wolpert's seminal conceptual advance on how cellular fates distribute in space, we provide a brief historical perspective on how the concept of positional information emerged and influenced the field of developmental biology and beyond. We focus on a modern interpretation of this concept in terms of information theory, largely centered on its application to cell specification in the early *Drosophila* embryo. We argue that a true physical variable (position) is encoded in local concentrations of patterning molecules, that this mapping is stochastic, and that the processes by which positions and corresponding cell fates are determined based on these concentrations need to take such stochasticity into account. With this approach, we shift the focus from biological mechanisms, molecules, genes and pathways to quantitative systems-level questions: where does positional information reside, how is it transformed and accessed during development, and what fundamental limits is it subject to?

## Introduction

How and when cells in a developing organism know what they are and where they are, are questions that are almost synonymous with the definition of developmental biology (Kirschner and Gerhart, 1997; Lawrence, 1992). In metazoans, different cells have to perform different tasks. They therefore need to interpret cues that steer them towards the correct fates (Ephrussi and St. Johnston, 2004). Evolution had the possibility to act on both the ‘cues’ and the machinery that performs the ‘interpretation’ of these cues. Wolpert's concept of positional information (PI) elegantly touches on both of these aspects.

The idea that cells adopt different fates by ‘sensing’ the presence or absence of chemicals, so-called fate-determining factors or ‘determinants’ (Conklin, 1905; Wilson, 1904), dates back to the early 20th century. Experiments on sea urchin embryos suggested that developmental patterns could be determined by opposing ‘gradients’ (Boveri, 1901a,b), while regeneration experiments on flatworms led to the postulate that ‘formative substances’ influence the developmental plan of the embryo (Morgan, 1904, 1905). The notion of chemical gradients acting at large distances to affect developmental patterning has an even longer history (Lawrence, 2001), but it was not until the middle of the 20th century that Turing postulated that concentrations of specific chemicals, called ‘morphogens’, might instruct cell fates and thus the emergence of shape and form in a developing organism (Turing, 1952).

The next big idea was the inclusion of space and the notion that spatial fields of chemicals could lead to developmental patterning and cellular differentiation (Crick, 1970; Lawrence, 1970; Wolpert, 1969). Key to this idea is a predetermined initial symmetry-breaking event, often triggered by asymmetrically localized factors. For example, morphogens are produced in cells that are located in spatially restricted regions and they diffuse along a central axis of an egg or tissue, thereby establishing a gradient. Wolpert eloquently postulated that cells could determine their fate by interpreting local concentrations of these graded profiles, and he coined the abstract notion that these profiles thus contain ‘positional information’ (Wolpert, 1969, 1971). This was one of the solutions he proposed for the ‘French Flag Problem’ of patterning (Wolpert, 1969), which later became colloquially known as the ‘French Flag’ model (Sharpe, 2019). Here, adjacent groups of cells are delineated by a concentration threshold, which defines a boundary. Fate determination in this model is due to an additional step, in which cells ‘interpret’ the concentration of the morphogen. ‘Information’ is thus contained in the nominal value of the concentration at a given position, and in the molecular apparatus that transforms this value into a cellular response. Thus, morphogen concentrations of two orthogonal gradients could act as positional coordinates, defining a two-dimensional spatial fate map. Individual cells measure and interpret the local morphogen concentration and determine the appropriate fate choice for that position, as manifested experimentally by Spemann's famous grafting experiments (Spemann and Schotté, 1932) and by the arrangement of chick wing digits (Saunders and Gasseling, 1968).

Conceptually, Wolpert's postulate was indeed a big leap forward, as evidenced by the significant gap before its experimental manifestation and its subsequent molecular proof. The framework of PI found immediate popularity and was put to use, e.g. by Postlethwait to interpret his famous Antennapedia *Drosophila* mutant, in which a pair of head antennae is converted into legs. Postlethwait postulated ‘that perhaps all appendages may have the same PI and that what makes one appendage different from another is the response of cells with a different determination to the same set of proximodistal, mediolateral positional cues’ (Postlethwait and Schneiderman, 1971), which turned out to be the case for Hox genes in all animals (Akam, 1989).

In 1974, the existence of cytoplasmic determinants was undoubtedly proven by transplantation experiments in *Drosophila* (Illmensee and Mahowald, 1974). Fifteen years later, the first morphogen molecule was finally discovered, with the anterior determinant Bicoid in the *Drosophila* embryo displaying all the characteristics of Wolpert's concept (Driever and Nüsslein-Volhard, 1988a,b; reviewed by Lawrence, 1988; Wolpert, 1989). This discovery was immediately followed by the demonstration that a frog growth factor determines differential cell fates according to concentration thresholds (Green and Smith, 1990; Green et al., 1990; reviewed by Green and Smith, 1991). Subsequently, many more PI-carrying morphogens were discovered (Neumann and Cohen, 1997), including in vertebrates such as zebrafish (Chen and Schier, 2001) and chick (McMahon et al., 2003).

The concept of PI has since had enormous success in shaping our understanding of spatial patterning in developing organisms (Fasano and Kerridge, 1988; Lacalli and Harrison, 1991; Moses and Rubin, 1991; Reinitz et al., 1995; Tomlinson et al., 1987; Wolpert, 1969, 1971; see review by Wolpert, 1996). Given its intuitively physical nature, the concept of PI also lent itself swiftly to quantitative questions. For example, the number of different thresholds that can be set reliably by a given concentration gradient could be estimated using straightforward calculations (Lewis et al., 1977). Moreover, the idea of PI has been applied to understand precision and reproducibility in development. Specific morphological features during early development have been studied in great detail and have been shown to occur reproducibly and precisely across wild-type embryos (Crauk and Dostatni, 2005; Gregor et al., 2005; Houchmandzadeh et al., 2002; Jaeger and Reinitz, 2006; Jaeger et al., 2008; Lecuit et al., 1996), while perturbation experiments have revealed systematic shifts of these features (Capovilla et al., 1992; Kraut and Levine, 1991; Rivera-Pomar et al., 1995). These findings have thereby established a causal – but not quantitative – link between the PI encoded in morphogens and the resulting body plan.

To sharpen the use of PI and to elevate its usefulness as a quantitative tool, we propose here a mathematical definition that is based on the concepts of Shannon's information theory (Box 1). We first introduce the mathematical framework that allows us to formalize the colloquial concept whereby ‘a cell determines its position from noisy patterning cues in the form of low-concentration molecular gradients’. We next highlight how the combination of precise data and mathematically rigorous PI quantities helped us revisit key biological questions. Finally, we end by formulating several unsolved puzzles to motivate future research.

When a change in random variable, X, leads with some probability to a change in another random variable, Y, we say that X ‘has information’ about Y. This information would allow us to infer (or predict) the value of Y if we knew the value of X, and vice versa. Claude Shannon identified mutual information, I(X;Y), as the unique measure that mathematically captures such a statistical dependence between X and Y, while satisfying various intuitive expectations (e.g. independent bits of information add) and remaining independent of system-specific assumptions (Shannon, 1948).

Mutual information is derived from a more basic quantity, the ‘entropy’ $S(X) = -\sum_X P(X) \log_2 P(X)$, where the summation extends over all values of X that happen with probability P(X). Entropy measures the dynamic range of the distribution, and is conceptually related to its variance. Mutual information is $I(X;Y) = S(X) + S(Y) - S(X,Y)$, or the difference between the entropy of X and Y taken separately (as if they were statistically independent) and jointly (which captures any correlation between them). Mutual information generalizes the linear correlation coefficient (or regression $R^2$) to nonlinear dependence between two random variables: linear correlation can miss statistical dependencies that information will detect. Information will be zero only if X and Y are statistically independent, in which case no inference about one variable is possible from the other. Despite its unusual notation, *I(X;Y)* is not a function but a single non-negative number, the units of which are ‘bits’ (see Box 4). Larger values imply stronger statistical dependence, less noise and higher predictability between the two variables.
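As a minimal sketch of these definitions, the following Python snippet computes entropy and mutual information for small discrete joint distributions. The toy tables and the function names are illustrative assumptions, not data or code from the studies cited here. The last table encodes Y = X² with X uniform on {-1, 0, 1}, a case in which the linear covariance vanishes although the two variables are clearly dependent.

```python
import math

def entropy(probs):
    """Shannon entropy S = -sum p log2 p, in bits (zero-probability terms are skipped)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X;Y) = S(X) + S(Y) - S(X,Y) for a table joint[i][j] = P(X=i, Y=j)."""
    p_x = [sum(row) for row in joint]          # marginal distribution of X (rows)
    p_y = [sum(col) for col in zip(*joint)]    # marginal distribution of Y (columns)
    p_xy = [p for row in joint for p in row]   # flattened joint distribution
    return entropy(p_x) + entropy(p_y) - entropy(p_xy)

def covariance(joint, x_values, y_values):
    """Linear covariance E[XY] - E[X]E[Y] for the same joint table."""
    e_x = sum(x * p for x, row in zip(x_values, joint) for p in row)
    e_y = sum(y * p for row in joint for y, p in zip(y_values, row))
    e_xy = sum(x * y * p for x, row in zip(x_values, joint)
               for y, p in zip(y_values, row))
    return e_xy - e_x * e_y

# Hypothetical toy distributions (assumptions for illustration only).
independent = [[0.25, 0.25], [0.25, 0.25]]   # X and Y unrelated
perfect = [[0.5, 0.0], [0.0, 0.5]]           # Y is a copy of X
third = 1.0 / 3.0
nonlinear = [[0.0, third], [third, 0.0], [0.0, third]]  # rows X=-1,0,1; columns Y=0,1

mi_independent = mutual_information(independent)
mi_perfect = mutual_information(perfect)
mi_nonlinear = mutual_information(nonlinear)
cov_nonlinear = covariance(nonlinear, [-1, 0, 1], [0, 1])

print(f"independent: {mi_independent:.3f} bits")
print(f"perfect copy: {mi_perfect:.3f} bits")
print(f"Y = X^2: {mi_nonlinear:.3f} bits, covariance {cov_nonlinear:.3f}")
```

In this toy setting the independent table yields zero bits and the perfect copy yields one bit, while the nonlinear table carries information that a correlation coefficient would not report.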

## In search of a mathematical framework for PI

Initial efforts towards a quantitative interpretation of PI relied mainly on indirect, system-specific quantities. Some of the measured quantities were based on the necessity for precision and reproducibility in the patterning process (Bollenbach et al., 2008; Desponds et al., 2016; Gregor et al., 2007; He et al., 2010; Morishita and Iwasa, 2009, 2011), whereas others were based on the idea that special shapes of morphogen profiles, ‘sharp’ gene expression boundaries, or a ‘stripe’ of gene expression, are intrinsically favored for successful patterning and are thus selected for by evolution (Briscoe and Small, 2015; Crauk and Dostatni, 2005; Erdmann et al., 2009; Fujioka et al., 1995; Houchmandzadeh et al., 2002; Jaeger et al., 2004; Meinhardt and Gierer, 1980; Sokolowski et al., 2012). Interestingly, both intuitions contain a partial, yet incomplete, characterization of PI. However, a unifying mathematical framework that could consistently merge the two was missing.

Ideally, a mathematical formalization of PI should satisfy the following properties: (1) PI should be independent of specific biological mechanisms that establish or read out primary morphogen gradients or patterns; (2) PI should be a numerical measure that can be experimentally determined; (3) PI should be defined without *a priori* assumptions about pattern shape, and thus should be applicable to any arbitrarily complex spatial gene expression pattern; (4) PI should be applicable and generalizable to multiple concentration fields of patterning molecules; and (5) PI should allow for theoretical first principle derivations, and lend itself to the establishment of a predictive theory for biological patterning.

These five desired properties can all be fulfilled simultaneously when information about the physical position (i.e. the coordinates) of a cell within an organism is encoded in noisy spatiotemporal concentration profiles of morphogen molecules. Here ‘encoding’ signifies the biological processes that establish spatially graded molecular profiles (Fig. 1). The mechanistic implementation of this encoding could be complex, consisting of a variety of biological steps that are only partially known: maternal cues, gene regulatory and signaling networks, cell-cell communication, diffusion, etc. However, PI should only be a function of the resulting spatiotemporal concentration profiles, regardless of the processes that establish them, as these profiles are by definition the sole quantities that determine subsequent morphological events. In addition, PI should be equivalently applicable to both classical graded profiles of signaling molecules (morphogen gradients) and spatiotemporal expression patterns of developmental genes; for simplicity, we therefore use the term ‘morphogen’ broadly to refer to both of these cases.

Importantly, the issue of how PI is read out or decoded is separate from the measure of how much information is present in the pattern. Here ‘decoding’ stands for the biological processes that estimate the physical location of a cell in a tissue or determine its discrete cell fate based on readout or measurement of noisy local morphogen concentration levels (i.e. the processes that ‘interpret the positional cues’). Both encoding and decoding are mechanism dependent (Fig. 1). Building a general mathematical framework relies on the possibility of separating these mechanisms from the actual representation of PI, which depends solely on directly measurable concentration profiles and is thus mechanism independent.

PI is not only ‘established’ (e.g. as a morphogen gradient) and then ‘read out’ (e.g. via thresholds), but it can also be ‘recoded’ (Fig. 1). Recoding means that the information present in the morphogen gradient is reformatted or transformed into another internal cellular representation (e.g. for downstream processing convenience). Gap genes in *Drosophila*, for example, carry PI much like their primary maternal morphogen regulators do. This information originates from the primary morphogens and vanishes if they are removed (Petkova et al., 2019). Typically, the process of ‘reading out’ implies applying an operation on the morphogen gradient that loses PI. Yet gap genes individually (and most likely as a group) encode at least as much information as their primary morphogen inputs, and provide a complete ‘coordinate system’ allowing for precise positional determination. It is thus more pertinent to speak of transforming or recoding of PI that will be read out only at a later stage. Such transformations could happen multiple times, and each successive step should be tracked in a general mathematical framework. Recoding is loosely related to Wolpert's original idea of ‘positional value’ (Wolpert, 1989).

A theoretical framework for PI that maps spatiotemporal concentration profiles to position must also consider stochasticity. Although patterning precision and reproducibility can be achieved over very short developmental time spans, using only a few handfuls of genes (Bentovim et al., 2017; Bollenbach et al., 2008; Briscoe and Small, 2015; Gregor et al., 2007; Houchmandzadeh et al., 2002; Patel and Lall, 2002; Petkova et al., 2014; Reeves et al., 2012), the processes underlying patterning are subject to molecular noise (Arias and Hayward, 2006; England and Cardy, 2005; Houchmandzadeh et al., 2005; Hu et al., 2010; Tkačik et al., 2008a; Tostevin et al., 2007; Tsimring, 2014; van Kampen, 2007). Moreover, there is random variability not only within a specimen, but also between specimens, e.g. in the strength of the morphogen sources (Bollenbach et al., 2008; Howard, 2012).

The necessity for a probabilistic approach is best exemplified when considering an undifferentiated cell in a developing organism. The cell experiences a single random realization of an otherwise variable information-carrying profile. When fluctuations between specimens or between adjacent cells of the same specimen are large, differences between cells can no longer be distinguished and PI is lost. This statement is true irrespective of the biological mechanism that reads out the gradient. It is a theoretical statement about what is possible in principle, which no biological (or engineered) system can evade. Thinking about what individual cells can measure locally – as in Wolpert's original concept – sharply contrasts with the typical approach to data analysis in biology, where one identifies ‘statistically significant differences’ in the mean gradient profile from one cell to the next, or where one disregards stochasticity by looking only at aggregated (averaged) profiles. A theoretical framework appropriate for Wolpert's PI concept therefore must be phrased in terms of probability distributions, not geometrically, as would be appropriate when dealing with shapes and patterns in the absence of noise.

## Establishing a mathematical framework for PI

Information theory is the mathematical treatment of concepts, parameters and rules governing efficient and reliable transmission of messages through communication systems (see Box 1). It has been applied to biological problems (Tkačik and Bialek, 2016) but it was not until the late 2000s that ideas about information transmission appeared for biochemical networks (Bowsher and Swain, 2014; de Ronde et al., 2011; Mugler et al., 2010; Tkačik and Walczak, 2011; Tkačik et al., 2008c; Tostevin and Ten Wolde, 2009; Ziv et al., 2007), specifically for the anterior-posterior (AP) patterning gene network of the early *Drosophila* embryo (Tkačik et al., 2008b). These initial studies focused on computing how well fluctuations in some ‘input’ chemical signal (morphogen, transcription factor or ligand concentration) are encoded in the resulting ‘output’ gene expression levels, given that gene expression is necessarily subject to molecular noise of well-understood biophysical origins (Gregor et al., 2007; Tkačik et al., 2008a). At that time, molecular signals were only starting to be experimentally measurable at a single-cell level (Blake et al., 2003; Elowitz et al., 2002; Golding et al., 2005; Ozbudak et al., 2002; Raser and O'Shea, 2004; Rosenfeld et al., 2005).

To introduce information theory in the context of genetic networks, and as a vehicle for a mathematical framework for PI, we focus here on the example of the early *Drosophila* embryo. The framework we develop generalizes to other systems in a straightforward manner, although the details depend on the specific circumstances and constraints imposed by the different experimental setups. In the case of the *Drosophila* embryo, we postulate that it has evolved to ‘send’ or encode real physical coordinates *x* of cells or nuclei through a noisy biochemical reaction network that at different *x* generates different patterning molecule concentrations **g**. Here, **g** represents morphogen concentrations, either primary gradients or subsequently expressed developmental genes (such as gap or pair-rule genes); the mathematics remain the same. The concentrations **g** are denoted in bold face to indicate that there can be multiple relevant concentrations, and thus, formally, **g** is a vector at every position *x*. Because of noise, **g** is not a deterministic function of *x*, but we have to use a probability distribution *P*(**g**|*x*) that tells us the probability of finding a certain **g** at *x*.

In the language of information theory, the patterning network thus acts as a noisy channel that maps position *x* into concentration levels **g**, probabilistically, as described by *P*(**g**|*x*). Neither the concept of PI nor the channel concept depends on underlying mechanisms, but only on how input signals *x* are mathematically transformed into outputs **g**. Biological mechanisms inside the channel are *de facto* treated as a black box. Information theory then introduces a general and unique measure of how well information can be sent through such noisy channels, the mutual information *I*(**g**;*x*) (Cover and Thomas, 2006):

$$
I(\mathbf{g};x) = S\left[P_g(\mathbf{g})\right] - \left\langle S\left[P(\mathbf{g}|x)\right] \right\rangle_x = \left\langle \int d\mathbf{g}\; P(\mathbf{g}|x)\, \log_2 \frac{P(\mathbf{g}|x)}{P_g(\mathbf{g})} \right\rangle_x . \tag{1}
$$

Angular brackets indicate an average over all locations *x*, assuming that cells or nuclei are uniformly distributed over the coordinate *x* (see Dubuis et al., 2013b and Tkačik et al., 2015 for straightforward generalizations). Similarly, $P_g(\mathbf{g}) = \langle P(\mathbf{g}|x) \rangle_x$ is the average of the distribution of morphogen concentrations across all positions *x*; it represents the probability that a particular combination of concentrations, **g**, can be seen anywhere in the embryo (Fig. 2).

Our key assertion can now be made precise: we claim that the mutual information [a mathematical object of information theory (Cover and Thomas, 2006)] linking position and morphogen concentration, *I*(**g**;*x*), is the proper formalization of PI (a concept of developmental biology). The distribution of morphogen concentrations at a given position, *P*(**g**|*x*), can be estimated from experimental data (see Box 2), giving access to empirical measures of PI, *I*(**g**;*x*), which is mathematically derived from *P*(**g**|*x*) by Eqn 1. Although proper estimation from finite datasets requires care, the technical procedures have been documented elsewhere (Borst and Theunissen, 1999; de Polavieja, 2004; Strong et al., 1998; Tkačik et al., 2015). More pertinent for morphogenesis are the following characteristics of PI (summarized below and expanded in Boxes 3 and 4):

PI is a unique measure of all statistical dependence between morphogen concentrations and position with important theoretical guarantees. It measures how well any variation of morphogen profile with position (linear or not) can be used to determine positional specification (Dubuis et al., 2013b). Thereby, PI satisfies property 1 (Fig. 3).

PI is a single number with interpretable units. Intuitively, *I* bits of information (see Box 4) are necessary and sufficient to distinguish $2^I$ discrete alternatives with zero error (Hillenbrand et al., 2016); if some degree of positional error is allowed, *I* bits suffice to specify more alternatives (Tkačik et al., 2015). Thereby, PI satisfies property 2 (Fig. 4).

PI is applicable to single or multiple morphogen gradients of arbitrary shapes, independently of the biological system and mechanistic detail. The framework does not single out particular profile shapes, positional markers or special positions. Thereby, PI satisfies properties 3 and 4 (Tkačik et al., 2015), also enabling a theoretical search through the space of all possible morphogen profiles to predict ones that maximize PI, thereby satisfying property 5 (Sokolowski and Tkačik, 2015; Tkačik and Walczak, 2011; Tkačik et al., 2009).
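The estimation of *P*(**g**|*x*) from repeated measurements, mentioned above, can be sketched as follows for a single gene. This is a minimal illustration under loudly stated assumptions: the data are simulated rather than measured, the mean profile is taken to be exponential with a hypothetical decay length of 0.2, the noise is Gaussian with a hypothetical standard deviation of 0.1, and *P*(*g*|*x*) is summarized by a per-position mean and standard deviation (a Gaussian approximation). It is not the protocol of the cited studies.

```python
import math
import random
import statistics

def simulate_embryos(n_embryos, n_positions, noise_sd, seed=0):
    """Hypothetical data: exponential mean profile plus Gaussian noise (toy model)."""
    rng = random.Random(seed)
    xs = [(i + 0.5) / n_positions for i in range(n_positions)]  # relative position in (0, 1)
    embryos = [[math.exp(-x / 0.2) + rng.gauss(0.0, noise_sd) for x in xs]
               for _ in range(n_embryos)]
    return xs, embryos

def gaussian_conditional(embryos):
    """Per-position mean and standard deviation across embryos.

    Together these define a Gaussian approximation to P(g|x); keeping only
    the means would discard the variability that a probabilistic treatment needs.
    """
    columns = list(zip(*embryos))                      # one column of samples per position
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]
    return means, sds

xs, embryos = simulate_embryos(n_embryos=200, n_positions=50, noise_sd=0.1)
means, sds = gaussian_conditional(embryos)

for x, m, s in list(zip(xs, means, sds))[::10]:
    print(f"x = {x:.2f}: mean = {m:.3f}, sd = {s:.3f}")
```

With several genes, the same idea would require a mean vector and a covariance matrix at each position; the single-gene case is used here only to keep the sketch short.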

*P*(**g**|*x*) can be estimated experimentally: samples with simultaneously recorded concentrations **g** can be collected at every position *x* from many identical specimens. In biological systems, it is most common to focus on the mean or, in the case of the embryo, the ‘mean spatial profile’. Thus, implicitly, the joint distribution is reduced (i.e. marginalized) to averages, $\bar{\mathbf{g}}(x)$. Yet there is no fundamental reason to focus solely on averages. Crucially, retaining the variability in the profiles [mathematically given by *P*(**g**|*x*)] is in fact necessary for a probabilistic approach. *P*(**g**|*x*) keeps all the information about concentration profiles, their variability and co-variability (for multiple genes), and even their higher-order statistics. Experiments that reliably sample this distribution are significantly more demanding than experiments that solely focus on measuring mean profiles, but this difficulty is technical rather than fundamental, and it can be surmounted (Dubuis et al., 2013a; Petkova et al., 2019; Tkačik et al., 2015). A full protocol for the experimental procedures and the measurement error treatment to quantify PI in fly embryos can be found elsewhere (Dubuis et al., 2013a,b; Gregor et al., 2014; Tkačik et al., 2015). Here, we stress that, in order to test the theoretical formalism applied to PI, precision measurements are necessary. Such measurements are typical for testing theories in the physical sciences, but are still not the norm for biological systems.

Positional information (PI) measures any kind of statistical dependence between position *x* and morphogen concentrations **g**. PI is zero only if there is no systematic variation in morphogen mean profile or any other statistic with position: in this case no mechanism exists to extract knowledge about position from morphogen concentrations (Cover and Thomas, 2006). Otherwise, positional knowledge can be extracted using a properly constructed decoding mechanism (which may, however, be biologically unrealistic). Even though linear gradients are often used as example cases, real gradients are not linear (and sometimes not even monotonic, e.g. for patterning ‘stripes’); their variance typically changes with position (known in statistics as ‘heteroscedasticity’); and their fluctuations may not be Gaussian, requiring a more powerful alternative to linear correlation.

As an example, the figure shows three gene expression profiles *g*(*x*), with variability $\sigma_g$ (shaded area). A step function (A) carries (at most) one bit of PI by perfectly distinguishing between ‘off’ (not induced, posterior) and ‘on’ (induced, anterior) states. A sigmoidal profile (B) has a wider boundary, but PI can be >1 bit because the transition region itself is distinguishable from the on and off domains. A linear gradient (C) has no boundary but increases PI by being equally sensitive at every value of *x*. In the absence of noise, B and C could theoretically reach arbitrarily high PI, as each concentration would correspond to a unique position without ambiguity. In reality, such infinities are avoided because the mapping is noisy and positions are discrete (e.g. columns of nuclei rather than physical coordinates with infinite precision).
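The comparison of profile shapes described above can be explored numerically with a short sketch that evaluates the mutual information between position and the concentration of one gene on discrete grids. Everything specific here is an assumption made for illustration: Gaussian noise with the same standard deviation (0.1, in units of the maximal expression level) at every position, 100 uniformly spaced positions, a sigmoid of width 0.05, and the function name `positional_information`. Real profiles have position-dependent noise, so the numbers only indicate qualitative behavior.

```python
import math

def positional_information(mean_profile, sigma, n_positions=100, n_bins=200,
                           g_min=-0.5, g_max=1.5):
    """I(g;x) in bits for one gene with Gaussian noise, on discrete grids (toy model)."""
    xs = [(i + 0.5) / n_positions for i in range(n_positions)]  # uniformly spaced positions
    dg = (g_max - g_min) / n_bins
    gs = [g_min + (k + 0.5) * dg for k in range(n_bins)]        # concentration bin centers

    # P(g|x): one normalized row of bin probabilities per position.
    cond = []
    for x in xs:
        m = mean_profile(x)
        weights = [math.exp(-((g - m) ** 2) / (2.0 * sigma ** 2)) for g in gs]
        z = sum(weights)
        cond.append([w / z for w in weights])

    # P_g(g) = <P(g|x)>_x, the distribution of g pooled over all positions.
    p_g = [sum(row[k] for row in cond) / n_positions for k in range(n_bins)]

    # I = < sum_g P(g|x) log2[ P(g|x) / P_g(g) ] >_x
    info = 0.0
    for row in cond:
        for p, q in zip(row, p_g):
            if p > 0.0:
                info += p * math.log2(p / q) / n_positions
    return info

def step(x):
    return 1.0 if x < 0.5 else 0.0          # 'on' anterior, 'off' posterior

def sigmoid(x):
    return 1.0 / (1.0 + math.exp((x - 0.5) / 0.05))

def linear(x):
    return 1.0 - x

pi_step = positional_information(step, sigma=0.1)
pi_sigmoid = positional_information(sigmoid, sigma=0.1)
pi_linear = positional_information(linear, sigma=0.1)

print(f"step: {pi_step:.3f} bits")
print(f"sigmoid: {pi_sigmoid:.3f} bits")
print(f"linear: {pi_linear:.3f} bits")
```

Under these assumptions the step profile stays at, or marginally below, 1 bit, because it only separates two domains. The values for the sigmoidal and linear profiles depend on the assumed noise level and steepness, which is why the noise has to be measured rather than guessed.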

Information is measured in bits, which are meaningful units: 1 bit of PI in the morphogen gradient suffices to make a reliable discrimination between two sets of positions that are, in the absence of morphogen readout, equally likely. For example, 1 bit of PI suffices to reliably discriminate the front half of the embryo from the back; or odd columns of cells from even columns. More generally, *I* bits of information are necessary and sufficient to distinguish $2^I$ discrete alternatives with zero error. Thus, the patterning of an embryo with *N* columns of nuclei that need to be uniquely distinguished with no possibility of error requires at least $I_0 = \log_2 N$ bits of PI. If some error in specification can be tolerated, the required amount of PI is smaller than $I_0$. More PI can be provided (usually at a higher metabolic or time cost) to compensate for decoding processes that do not use the information optimally. If the morphogens provide $I < I_0$ bits of PI, a minimal error exists by which cells can determine their positions: they can do worse (perhaps due to biological limitations in their gradient readout) but not better.

PI and the associated bounds to positional error provide a powerful and unbiased tool for asking biologically relevant questions. How much additional PI is provided by each morphogen gradient in systems with multiple gradients? Are their individual PI contributions additive, redundant or synergistic? How much information is there in non-monotonic profiles (such as stripes) and how much information does each profile ‘feature’ contribute, especially when the features can be generated *in silico*, or isolated *in vivo* through appropriate genetic modifications? PI can be computed for various morphogen profiles (e.g. a sharp step, or a linear or exponential ramp) and compared with data, to question whether our expectations about ‘ideal’ shapes align with reality. Ultimately, morphogen profiles can be computationally optimized to find those that maximize PI, thus deriving the best morphogen patterns *ab initio*, and comparing such first-principle theory predictions with data.
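The arithmetic relating bits to distinguishable alternatives can be made concrete with a short worked example. The number of nuclear columns used below (64) is a hypothetical value chosen so that the logarithm is a whole number; the helper names are likewise illustrative.

```python
import math

def bits_required(n_alternatives):
    """I_0 = log2(N): bits needed to distinguish N alternatives with zero error."""
    return math.log2(n_alternatives)

def distinguishable_alternatives(bits):
    """2^I: number of alternatives that I bits can distinguish with zero error."""
    return 2.0 ** bits

n_columns = 64                                   # hypothetical number of nuclear columns
i_zero = bits_required(n_columns)                # bits needed for error-free specification
alternatives = distinguishable_alternatives(4.3) # alternatives supported by 4.3 bits

print(f"{n_columns} columns require {i_zero:.1f} bits")
print(f"4.3 bits distinguish about {alternatives:.1f} alternatives without error")
```

If the available PI falls short of `i_zero`, error-free specification of every column is impossible in this counting argument, whatever the readout mechanism.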

Within this theoretical framework, PI summarizes the fidelity by which position is encoded in any number of morphogen gradients of arbitrary shapes, independent of the system and biological mechanisms. While such a formalism employing a single statistic is undeniably attractive, its benefits come at a price (see also Box 5): a single number might measure the overall limits of patterning, but it cannot explain how and where these limits arise. Specifically, PI cannot answer local questions or make testable predictions about limits to patterning at individual positions within an embryo. To this end, the PI framework must be appropriately extended (see Box 6).

**Patterning dynamics**

Although it is possible to mathematically extend the PI framework to cases where PI is encoded in temporal trajectories of morphogen concentrations, this has not been tried in practice. In the *Drosophila* example considered here, information is stored in a single static snapshot of gene expression patterns, which greatly simplifies the technical analyses and their interpretation.

**Positional coordinate**

The theory is agnostic about how ‘position’ *x* should be represented to compute PI, *I*(**g**;*x*). In the *Drosophila* example considered here, *x* is a relative coordinate along the anterior-posterior axis of the embryo. This choice relies on the demonstrated spatial scaling of the morphogenetic patterns in this system (Gregor et al., 2005; Houchmandzadeh et al., 2002). An absolute coordinate *x* would thus be less appropriate. Nevertheless, a relative coordinate is not the only possible choice: *x* could also be a discrete nuclear column index. In contrast, it is much less clear how to choose a representation for position in a growing or deforming tissue: should position be taken at a particular temporal snapshot or perhaps relative to a constantly co-moving and growing reference frame? Although the theory can be applied in either case, it does not provide us with an answer about the positional coordinate system.

**How much of the information is biologically relevant?**

The information-theoretic definition of PI has many attractive mathematical properties, but it does not tell us how many bits can actually be extracted from single morphogen snapshots with biologically plausible mechanisms. One can imagine gene expression patterns that formally carry a lot of PI, but the interpretation of which would likely require unrealistic computations.

We have introduced the concepts of PI (Eqn 1), positional error (Eqn 4) and the decoding map (Eqn 3) (Fig. 7). PI is entirely agnostic to encoding and decoding mechanisms, and is a single number expressed in bits that characterizes the global performance of the patterning system. Positional error and the decoding map are local constructs that characterize the performance of the patterning system location by location, but assume statistically optimal readout of the morphogen profiles. The positional error can be derived from the average decoding map in Eqn 4 and, under the assumptions of scenario A (Fig. 5A), has a clear biological interpretation.

The precise relationship between PI and the two decoding-related quantities is technically involved, but two generic statements hold universally. First, from the fundamental theorem of information theory known as the Data Processing Inequality (DPI) (Cover and Thomas, 2006), we can assert that, regardless of the chosen decoding algorithm (e.g. Eqn 2), PI is always greater or equal to the mutual information between the true locations and the best estimates of position (Brunel and Nadal, 1998). In other words, PI is an upper bound to the information between true and implied positions.

Second, when each **g** at true location *x* decodes to a single peak in the posterior at *X**, the width of which is given by the positional error, *σ*_{x}(*x*), the approximation in Eqn 5 holds, provided that *σ*_{x}(*x*)<<*L*, where *L* is the range of *x* over which the patterned cells are uniformly distributed. As the DPI has general validity, Eqn 5 will always bound PI from below; but as the positional error shrinks and the posterior approaches a Gaussian distribution (as in scenario A), Eqn 5 will also be a good approximation for PI, *I*(**g**;*x*). Indeed, for the case of *Drosophila* anterior-posterior patterning, the direct estimate of the PI, *I*(**g**;*x*), and the decoding estimate from positional error, *I*(*x*;*x**), differ by only ∼0.1 bit out of 4.3 bits, a discrepancy of ∼2% (Dubuis et al., 2013b). This agreement is a quantitative consistency check that the gap gene system of wild-type *Drosophila* embryos indeed forms a precise, unambiguous positional code in which positional error is small and nearly Gaussian almost everywhere.
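For orientation, the two statements of this box can be written compactly. The display below is a sketch under stated assumptions rather than a reproduction of Eqn 5: it assumes the Markov chain *x* → **g** → *x**, a uniform prior over a range *L*, and a Gaussian posterior of width *σ*_{x}(*x*). The position average ⟨·⟩_{x} is notation introduced here for illustration.

```latex
% Sketch, assuming a uniform prior over a range L and a Gaussian posterior.
% The inequality is the DPI; the approximation uses the entropy of a Gaussian,
% (1/2) log_2(2 \pi e \sigma^2), subtracted from the entropy log_2 L of the prior.
I(\mathbf{g};x) \;\ge\; I(x;x^*) \;\approx\;
\left\langle \log_2 \frac{L}{\sqrt{2\pi e}\,\sigma_x(x)} \right\rangle_x ,
\qquad \sigma_x(x) \ll L .
```

Read this way, each halving of the positional error adds one bit to the decoding estimate, which is why a small and spatially uniform positional error corresponds to a high PI.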

## Decoding PI

An undifferentiated cell in a field of morphogen concentrations needs to determine its location by ‘reading out’ the available PI. It thus needs to perform local concentration measurements and estimate, or infer, its position. Early demonstrations of quantitative limits to this process (Gregor et al., 2007) were followed by the development of a rigorous mathematical framework for optimal decoding (Hironaka and Morishita, 2012; Morishita and Iwasa, 2009, 2011), which has since been applied to data and connected to information-theoretic concepts (Dubuis et al., 2013b; Petkova et al., 2019; Tkačik et al., 2015; Zagorski et al., 2017), as summarized in Box 6.

Suppose that the distribution of morphogen concentrations given position, *P*(**g**|*x*), is known. For example, an image collected in an experiment provides absolute knowledge about position, and multiple images can then deliver the probability of finding a particular concentration at that position across a set of samples. If the cell measures one set of local morphogen concentrations, **g**, to estimate its location, what would that estimate be and how precise could it be? Here, the true location of the cell, *x* (unknown to the cell, but known to the experimenter), needs to be clearly distinguished from the best estimate of the location that the cell might be able to extract from **g**, denoted here as implied position, *x**.

The cell obtains *x** from morphogen concentration measurements by means of a decoding mechanism. Although many such mechanisms and their biological implementations are possible, there is a single decoding algorithm that is statistically optimal, leading to the best positional estimate, given by Bayes' law:

$$P(x^*|\mathbf{g}) = \frac{1}{Z}\, P(\mathbf{g}|x^*)\, P_x(x^*). \tag{2}$$

On the right-hand side, we have the *a priori* distribution of locations (e.g. cell positions) to be decoded, *P*_{x}(*x**), which for spatially uniformly distributed cells is a uniform distribution; *P*(**g**|*x**) is the measured distribution of concentrations introduced earlier; and a normalization factor *Z* enforces that the resulting posterior distribution *P*(*x**|**g**) is correctly normalized.

The posterior distribution summarizes all knowledge about *x** that can possibly be extracted by measuring morphogen concentrations, **g**. It is a distribution over implied locations, and there are multiple qualitative shapes that this distribution may take (Fig. 5). In scenario A, for a particular **g**, the posterior may be sharply localized around a single peak *X**(**g**), typically at the mean of the posterior distribution. Mathematically, this scenario is equivalent to the statistical inference of a ‘parameter’ *x* from noisy data **g** in the regime where the posterior is nearly Gaussian. In this case, the maximum likelihood estimate [assuming a uniform prior *P*_{x}(*x**)], the maximum *a posteriori* (MAP) estimate, and the posterior mean all coincide. Concentrations **g** accurately and unambiguously determine a single location, a hallmark of a good positional code. The decoding error, formally defined as the spread of the posterior around its mean, is low. In scenario B, a single maximum of the posterior exists, but the decoding error is large, implying that the set of morphogen concentrations **g** provides only weak evidence for a particular location and that, at these morphogen concentrations, the precise localization of morphological features is impossible. In scenario C, *P*(*x**|**g**) peaks either around a location *X** that is very far from the true location *x*, or peaks at multiple locations *X**, and is thus not unique. In this case, essential errors or ambiguities in the positional code exist, with the morphogen concentrations **g** likely ‘pointing’ to either wrong or multiple locations.

Applied to a realistic biological scenario, the decoding of cellular location along the AP axis of the early *Drosophila* embryo, one can construct *P*(**g**|*x*) from many samples of wild-type morphogen profiles and their biologically relevant variabilities (Petkova et al., 2019). The measured *P*(**g**|*x*) are used in Eqn 2. Mathematically, any set of concentrations **g** can be inserted to decode the most likely implied position, *X**(**g**). Biologically, however, the focus must be on those concentration combinations that are actually observed. This is a non-trivial point: if multiple morphogens **g** vary along a single positional axis, many combinations of **g** are unlikely ever to happen (at least in the wild-type embryo), and thus their decoded locations are irrelevant.

Decoding the concentrations **g** that are actually observed at each position *x* in a single embryo *α* defines the decoding map for that embryo:

$$P_\alpha(x^*|x) = P\big(x^*|\mathbf{g}_\alpha(x)\big). \tag{3}$$

Eqn 3 represents a fundamental relationship between the real locations *x* in a single specific embryo *α*, and what is implied about these locations by the morphogen profiles, assuming optimal use (‘optimal decoding’) of PI. The decoding map can be visualized as a matrix of implied versus true locations in the embryo (Fig. 6). A precise positional code, corresponding to scenario A discussed above, will result in a decoding map that is tightly localized around the diagonal where *x**=*x*. Here, positions implied by noisy morphogen profiles are almost equal to the true, ideal positions known to the experimenter. Scenario B, with high positional error, corresponds to situations where at some location *x*, the decoding map has a single but broad, or ‘diffuse’, range of locations *x** that are consistent with the measured morphogen profiles. Scenario C typically corresponds to the situations where, at multiple locations, at least two separated peaks of implied positions *x** exist, and where cells cannot unambiguously determine whether they reside in one or the other peak (Fig. 7A).
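To make the decoding procedure concrete, the following is a minimal sketch of Bayes' law (Eqn 2) and of a single-embryo decoding map on a discretized position axis. It assumes independent Gaussian noise in each gene and two hypothetical, opposing linear gradients; the function names, the grid and all numerical values are illustrative assumptions, not the analysis pipeline of the cited studies.

```python
import numpy as np

def posterior_over_positions(g, mean, std, prior=None):
    """Bayes' law on a position grid: P(x*|g) is proportional to P(g|x*) P_x(x*)."""
    g = np.asarray(g, dtype=float)
    if prior is None:
        prior = np.full(mean.shape[0], 1.0 / mean.shape[0])  # uniform P_x(x*)
    # log P(g|x*), assuming independent Gaussian noise in each gene
    log_like = -0.5 * np.sum(((g - mean) / std) ** 2
                             + np.log(2 * np.pi * std ** 2), axis=1)
    log_post = log_like + np.log(prior)
    log_post -= log_post.max()           # guard against numerical underflow
    post = np.exp(log_post)
    return post / post.sum()             # division by the normalization Z

def decoding_map(embryo_profile, mean, std):
    """Column j holds the posterior decoded from the concentrations at true x_j."""
    return np.column_stack([posterior_over_positions(g, mean, std)
                            for g in embryo_profile])

# Toy example: two opposing, hypothetical linear gradients with noise of 0.05.
x = np.linspace(0.0, 1.0, 101)
mean = np.column_stack([1.0 - x, x])     # mean profiles, shape (positions, genes)
std = np.full_like(mean, 0.05)
rng = np.random.default_rng(0)
embryo = mean + std * rng.normal(size=mean.shape)   # one noisy "embryo"
dmap = decoding_map(embryo, mean, std)   # rows: implied x*, columns: true x
x_map = x[np.argmax(dmap, axis=0)]       # MAP estimate X* at each true position
```

In this toy case the map concentrates along the diagonal, as in scenario A. Replacing the Gaussian likelihood with a measured *P*(**g**|*x*) would be the natural next step, at the cost of needing enough samples to estimate it.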

The decoding map describes every position *x* within a particular embryo *α*. By determining how the best estimate of position, i.e. the peak *X** of the map at every position *x*, varies between embryos, it predicts how embryo-to-embryo variability maps into uncertainty in specifying position estimates. By averaging individual embryo decoding maps across all embryos *α* of the same class, one can obtain an average decoding map *P*(*x**|*x*) that, for wild-type embryos in scenario A, defines the positional error, *σ*_{x}(*x*), as a function of real position *x*:

$$\sigma_x^2(x) = \int \mathrm{d}x^*\, \big(x^* - \bar{x}^*(x)\big)^2\, P(x^*|x), \tag{4}$$

where $\bar{x}^*(x) = \int \mathrm{d}x^*\, x^*\, P(x^*|x)$ is the mean implied position at *x*. This positional error quantifies how precisely positional markers can be localized in the embryo (Fig. 7B). For example, if wild-type embryos are known to express a positional marker at some position *x* based on morphogen readout, this framework states that, except for some residual experimental error, the positional accuracy of a marker across embryos is bounded from below by the positional error, *σ*_{x}(*x*), at that position. *σ*_{x}(*x*) thus quantifies the minimal uncertainty about the implied cellular location due to the combined variability and intrinsic noise in the morphogen profiles (Morishita and Iwasa, 2011; Tkačik et al., 2015).
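As an illustration, the positional error can be computed from an average decoding map stored as a matrix, taking the error to be the spread of the map around its mean at each true position. This is a sketch under that assumption; the function name and the Gaussian toy map are hypothetical.

```python
import numpy as np

def positional_error(avg_map, x_grid):
    """Spread of the average decoding map around its mean, per true position.

    avg_map[i, j] approximates P(x*_i | x_j); columns are renormalized here.
    """
    p = avg_map / avg_map.sum(axis=0, keepdims=True)
    mean_implied = x_grid @ p                        # mean x* for each true x
    var = ((x_grid[:, None] - mean_implied[None, :]) ** 2 * p).sum(axis=0)
    return np.sqrt(var)

# Toy check: a Gaussian band of width 0.01 around the diagonal x* = x.
x = np.linspace(0.0, 1.0, 501)
toy_map = np.exp(-0.5 * ((x[:, None] - x[None, :]) / 0.01) ** 2)
sigma_x = positional_error(toy_map, x)   # close to 0.01 away from the edges
```

Near the ends of the axis the band is truncated by the grid, so the computed error is smaller there. With real data, such edge effects would need separate treatment.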

Optimal decoding is particularly relevant in the context of mutations that affect a patterning system. Here, the decoding map *P*(*x**|*x*) becomes a mathematical and quantitative formalization of the classical concept of a fate map (Conklin, 1905; Gilbert, 2000; Schüpbach and Wieschaus, 1986). Often, a mutation has consequences for the entire morphogen system, causing a global shift in the decoding map *P*(*x**|*x*). In this case, the decoding map predicts how physical locations in the mutant (*x*) map to cell fates that are characteristic of the location in the wild type (*x**). But within a probabilistic framework there are other possible outcomes, implying that the decoding map can accommodate a richer set of possibilities than a traditional fate map. For example, there could be multiple peaks in *x** for some fixed position *x* in the mutants, predicting large mutant-to-mutant variability, where the same wild-type positional marker is placed at different, random positions *x** that correspond to the multiple peaks in the mutant.

The decoding map can thus make parameter-free predictions derived solely from wild-type embryos about how patterning mutants behave. Its only assumption is that a very good approximation to optimal decoding of Eqn 2 has evolved in the biological ‘hardware’. This is an information-rich, quantitative and falsifiable prediction that can be viewed as the test of the optimality assumption, which, to date, has been experimentally verified with high fidelity in the *Drosophila* AP patterning system (Petkova et al., 2019) and for the mammalian neural tube (Zagorski et al., 2017).

## Lessons for biology

By combining our mathematical framework for PI with applicable quantitative measurements, we can gain novel biological insights into patterning events, as summarized below.

### Optimal patterning without sharp boundaries

Within the original paradigm for PI, morphogen profiles are ‘read out’ by downstream genes to guide cell fate decisions. Is there a notion of a best profile shape that supports reliable fate determination? Theoretical work typically considers linear profiles; in contrast, maternal morphogens often exhibit exponentially decaying profiles that span a significant fraction of the length of an embryo. Yet other patterning genes may show very sharp gene expression boundaries (Fig. 5). The theory of PI can guide us on what the best profile shape is for encoding a maximum amount of PI. Perhaps surprisingly, the answer depends on how variability (i.e. noise) changes with position. If variability is independent of position and is low compared with the maximum gene expression magnitude, then the optimal profile is linear. In this case, a single profile can encode more than one bit of PI.

In biochemical networks, however, the noise magnitude typically changes with position. Intrinsic noise, e.g. fluctuations in morphogen levels, depends on the mean morphogen concentration, and thus on position. This is true empirically and is expected on biophysical grounds, because, when morphogen concentrations are low, noise at these concentrations is ultimately Poissonian and its variance scales linearly with the mean. In this case, the optimal shape can be computed from the noise profile, and is typically not a linear one. Last, when noise is large, PI drops to below one bit, where even a trivial discrimination of location, such as between the front and back of the positional axis, can no longer be error free. Generally, when noise is low enough, most of PI is encoded in the smooth slopes of a (monotonic) profile; with high noise, slopes cannot be read out precisely and PI is reduced to the binary discrimination of being below or above an expression threshold (Tkačik et al., 2008b,c, 2015). This insight parallels the discussion in neuroscience on the optimal shape of tuning curves of sensory neurons (Butts and Goldman, 2006).
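The dependence of the best profile shape on the noise can be illustrated with a simple error-propagation estimate, in which the positional error of a single profile is approximated as *σ*_{x}(*x*) ≈ *σ*_{g}(*x*)/|d*g*/d*x*|. This small-noise relation and all numbers below are assumptions made for the sketch. It contrasts a linear profile with constant noise against an exponential profile with Poisson-like noise.

```python
import numpy as np

def positional_error_from_profile(x, g_mean, g_std):
    """Small-noise estimate: sigma_x(x) = sigma_g(x) / |dg/dx|."""
    slope = np.gradient(g_mean, x)       # numerical derivative of the mean profile
    return g_std / np.abs(slope)

x = np.linspace(0.0, 1.0, 201)

# Case 1: linear profile with position-independent noise (hypothetical values).
linear = 1.0 - x
err_linear = positional_error_from_profile(x, linear, np.full_like(x, 0.05))

# Case 2: exponential profile with Poisson-like noise (variance equal to mean).
n_max, length = 1000.0, 0.2              # assumed peak count and decay length
counts = n_max * np.exp(-x / length)
err_exp = positional_error_from_profile(x, counts, np.sqrt(counts))
```

In the first case the estimated error is the same everywhere, whereas in the second it grows toward the low-concentration end of the gradient, where few molecules are available to be counted.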

### Patterning genes are more than binary ON/OFF switches

Hunchback (Hb), a gap gene involved in *Drosophila* AP patterning, primarily responds to a gradient of maternal Bicoid, resulting in an expression profile that makes a seemingly sharp transition between high expression (‘ON’ domain) in the anterior half of the embryo and low expression (‘OFF’ domain) in the posterior half (Albert and Othmer, 2003; Alberts et al., 2002; Meinhardt, 1986; Spirov and Holloway, 2003). Hb has been the paradigm of a switch-like gene whose threshold is positioned precisely and reproducibly across embryos, roughly at the half-way point of the axis of the embryo (Crauk and Dostatni, 2005; Gregor et al., 2007; Holloway et al., 2006; Houchmandzadeh et al., 2002). Switches are expected to encode, at most, one bit. Surprisingly, our model-free estimates of PI reveal empirically that Hb encodes almost 2.2 bits of PI, indicating that the switch-like approximation would miss more than half of the available information, vastly underestimating the capacity of this patterning system (Dubuis et al., 2013b; Tkačik et al., 2015). The extra bit comes from the fact that Hb expression, although steep, is not a step function; indeed, about one third of the nuclei experience intermediate levels of expression, clearly distinguishable from the ON or OFF states.

Similar values have been reported for other gap genes in the early *Drosophila* embryo. Together, the four trunk gap genes provide ∼4.2 bits of PI, enough to specify every nucleus in the central 80% of the AP axis of the embryo with only ∼1% positional error. This precision is completely inaccessible if each gap gene provides at most one bit of PI (Fig. 8). Distinguishing between the binary or analog character of these gene expression profiles thus clearly necessitates a quantitative analysis framework.

### The role of spatiotemporal averaging during patterning

What can regulatory circuits do to mitigate the impact of noise intrinsic to chemical reactions taking place at low molecule copy numbers? Cells can reduce the impact of such noise by performing many noisy concentration measurements of morphogen molecules and then averaging across them. This averaging can happen either over time or across space. But although these mechanisms are thought to play an essential role, they are subject to biophysical limits. Temporal averaging is in tradeoff with dynamics: regulatory circuits with long timescales that can average their inputs imply a slowdown in response dynamics (which may be undesirable) and require temporally stable morphogen inputs. Spatial averaging is in tradeoff with sharp spatial gradients: noise can be reduced if morphogen inputs are nearly constant over the spatial averaging window, but if the averaging window is too large, it will ‘flatten out’ information-carrying morphogen gradients.
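The benefit of averaging can be seen in a short simulation. The sketch below assumes that the measurements being averaged are statistically independent and Gaussian, which is an idealization; the concentration, noise level and number of averaged measurements are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
true_conc = 10.0       # hypothetical mean morphogen level (arbitrary units)
noise_std = 2.0        # hypothetical noise of a single measurement
n_cells, n_avg = 20000, 25

# One measurement per cell versus the mean of n_avg independent measurements.
single = true_conc + noise_std * rng.normal(size=n_cells)
averaged = (true_conc
            + noise_std * rng.normal(size=(n_cells, n_avg))).mean(axis=1)

# The spread of the averaged readout is close to noise_std / sqrt(n_avg).
print(single.std(), averaged.std())
```

Correlations between measurements, whether in time or between neighboring nuclei, reduce the number of effectively independent samples, so the square-root improvement should be read as a best case.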

In *Drosophila*, PI carried by the Bicoid gradient [*I*(Bcd;*x*) ∼1.6 bits] is roughly equal to the mutual information between Bicoid and Hunchback [*I*(Bcd;Hb) ∼1.5 bits], yet considerably lower than the PI carried by the spatial profile of Hunchback alone [*I*(Hb;*x*) ∼2.3 bits], even though Hunchback is downstream of Bicoid (Dubuis, 2012). However, according to a naïve application of the Data Processing Inequality (DPI; Box 6), if concentration levels **c** serve to locally regulate the expression of downstream genes **g**, the PI in **g** cannot exceed that in **c**, *I*(**g**;*x*) ≤ *I*(**c**;*x*). How then can empirical observations for Bicoid and Hunchback be reconciled with the DPI?

One possibility is that Hunchback receives additional PI from inputs other than Bicoid, although a strong and precise Hunchback boundary is observed in mutants deficient in AP morphogens aside from Bicoid (Petkova et al., 2019). Another possibility is that PI carried by Hunchback is higher because of the spatiotemporal averaging performed over Bicoid concentration by the Hunchback readout mechanism (Gregor et al., 2007; Little et al., 2013; Zoller et al., 2018). Hence, a local, instantaneous measurement of Hunchback is in fact a function of the temporal and spatial history of Bicoid. DPI applies when **c** and **g** correspond to complete spatiotemporal patterns of Bicoid and Hunchback, but not necessarily when they are local instantaneous values. Thus, the biophysical mechanisms of spatial and temporal averaging increase the local instantaneous PI in the Hunchback profile. Temporal averaging is achieved through gene expression dynamics (Tkačik et al., 2008a) and spatial averaging through diffusion of the regulated gene product (Erdmann et al., 2009; Gregor et al., 2007; Little et al., 2013; Sokolowski and Tkačik, 2015).

### PI quantitatively predicts number of unique cell fates

The values of PI are not only comparative (i.e. between morphogen profiles) but also have absolute meaning (Box 3). If unique identities for *N* cells have to be conferred without error, log_{2}(*N*) bits of PI are required. Typically, biological systems can tolerate some positional error (e.g. cell width sets a limit to positional accuracy), and thus a smaller number of bits of PI is required. For example, during nuclear cycle 14, *Drosophila* embryos have about 60 columns of nuclei in the central 80% of the AP axis, implying that at least log_{2}(60) ∼5.9 bits of PI would be needed for error-free unique specification of each column. However, if the tolerated positional error is ∼1%, then ∼4.3 bits are sufficient (Fig. 4), which corresponds precisely to the physical distance expressed in terms of embryo length between two adjacent cells (Dubuis et al., 2013b). Thus, interpreting absolute values of PI is a simple, yet powerful concept, free from the arbitrariness of normalization procedures, null-model formulations and aesthetic or philosophical decisions about what constitutes ‘precise’ or ‘imprecise’ patterning. The cost of extracting an absolute value, however, comes with the requirement that the measurements themselves are precise, are systematically unbiased and are in a regime in which intrinsic biological noise – and not experimental or statistical noise – is largely the dominant source of the measured variance (Box 5).
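The arithmetic behind these numbers can be checked in a few lines. The first function is the exact count of bits needed for error-free labels. The second converts a value of PI into an equivalent positional error, and it assumes a Gaussian posterior and uniformly distributed cells over a range *L*, here taken to be the central 80% of the axis. Both function names are hypothetical.

```python
import math

def bits_for_unique_labels(n):
    """Bits needed to label n items without error."""
    return math.log2(n)

def relative_error_for_bits(bits):
    """Positional error sigma/L matching a given PI (Gaussian approximation)."""
    return 2.0 ** (-bits) / math.sqrt(2.0 * math.pi * math.e)

bits_60 = bits_for_unique_labels(60)             # about 5.9 bits
sigma_el = 0.8 * relative_error_for_bits(4.3)    # about 0.010 embryo lengths
spacing_el = 0.8 / 60                            # about 0.013 embryo lengths
```

Under these assumptions, 4.3 bits correspond to a positional error of roughly 1% of the embryo length, comparable to the spacing between neighboring columns of nuclei.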

### Threshold-free positional cues from multiple combined patterning systems

Although conceptually simple, a threshold-dependent concentration readout process is problematic (Houchmandzadeh et al., 2002; Jaeger et al., 2004): the concentration of the signaling molecules is often very low, resulting in very high concentration noise levels (Gregor et al., 2007). These concentration fluctuations propagate to downstream genes, and a reliable outcome of the morphogenic process would be questionable if it is implemented via sharp thresholding (Lacalli and Harrison, 1991).

Several hypotheses exist to explain how cells can integrate PI from single morphogen gradients without thresholding or from multiple morphogen gradients. For example, cells could sense the difference or ratio of two opposing morphogen gradients (Houchmandzadeh et al., 2005; McHale et al., 2006), compare concentration values at two nearby spatial locations and thus estimate the local gradient (Mugler et al., 2016), or respond to temporal dynamics of the morphogen (Bergmann et al., 2007; Cepeda-Humerez et al., 2019). However, given the shapes and variabilities of gradients, the only statistically optimal possibility is the maximum *a posteriori* (MAP) decoding rule (Eqn 2). Recent analyses of the *Drosophila* embryo (Petkova et al., 2019) and the vertebrate neural tube (Zagorski et al., 2017) have shown that biological results are indeed consistent with a statistically optimal readout of PI provided by multiple patterning cues. Thus, the mathematical framework for PI generalizes naturally to patterning by multiple gradients.

While Wolpert's prescription of applying thresholds to a single gradient is intuitive, it is unclear how to extend it to multiple gradients. In particular, an obvious generalization whereby cells apply independent thresholds to each gradient suffers from three fundamental problems. First, after applying multiple thresholds, how can a readout decision be computed from the resulting set of thresholded (binary) values (Fig. 8); second, how could such computation be implemented in biophysical circuitry; and third, why would thresholding of multiple gradients be optimal? Statistically optimal decoding of Eqn 2 is free of such *a priori* constraints, and suggests that independent thresholds typically are not optimal. The exact connection between the mathematical structure of optimal decoding and the mechanisms that fulfill it remains to be determined.

### Evolution can drive patterning systems towards theoretically optimal performance

How far did evolution drive a real patterning system towards the mathematically optimal patterns that maximize PI? The same framework that allows us to estimate PI from real data also allows us to formulate an optimization theory to search for optimal patterns and to predict empirically observable signatures of optimal patterning. One of the salient predictions of such a theory has been the constancy of the positional error *σ*_{x}(*x*) across position for uniformly distributed cells/nuclei (Dubuis et al., 2013b; Tkačik et al., 2015). In the *Drosophila* embryo, this prediction is remarkably well matched by the data (Fig. 7B). Similarly, the theory of optimal information transmission quantitatively predicts the distribution of Hb expression levels from Hb expression noise (Tkačik et al., 2008b), which has been confirmed experimentally (Tkačik et al., 2008a). Optimal decoding – but not other schemes that map developmental gene expression levels into estimates of position – correctly predicts developmental consequences in *Drosophila* mutants for maternal morphogens, based solely on wild-type data (Petkova et al., 2019).

Such evidence suggests that evolution can drive patterning systems towards theoretically optimal performance. The question of whether the biological systems are ‘at’ or ‘near’ optimality is an interesting empirical question about the strength of evolutionary pressure to use limited resources in an efficient manner in a given population. Far away from optimality, where PI is small, biological function can simply not be supported irrespective of the resource availability, leading to malformation or death, as in some patterning mutants. What remains to be seen is whether such an optimization principle is powerful enough to quantitatively predict the entire set of spatial patterning gene expression profiles *ab initio*, thus leading to a potential design principle for the observed wild-type system (Tkačik and Walczak, 2011). The success of this approach depends precisely on how close to optimality evolution has driven a particular patterning system, and whether near-optimal solutions can also be explored mathematically.

## Open conceptual puzzles

Abstractions and graphical visualizations of regulatory networks have been fundamental in allowing cross-species studies (Davidson, 2002; Gerstein et al., 2012). In a similar manner, a mathematical framework for PI should allow us to analyze and quantitatively compare different patterning systems to find evolutionary convergence or divergence in their function. How many bits are provided by different patterning systems and how is their decoding precision distributed across space? Does PI depend on the number of specified cell types, on the number of system components (e.g. genes or gradients), or perhaps some (more qualitative) notion of patterning complexity? In parallel to these direct applications, the framework also enables us to revisit several fundamental questions that we highlight below.

### Is PI encoded by temporal dynamics of developmental genes?

During AP patterning of the *Drosophila* embryo, PI can be encoded in a single temporal snapshot of gene expression patterns: it has been empirically shown that a single snapshot of gap gene expression is sufficient to provide the PI required to quantitatively decode the positions of striped patterns of pair-rule genes with the precision that matches natural reproducibility (Petkova et al., 2019). Nevertheless, the temporal dynamics leading up to this snapshot are essential for bringing about this instantaneous state. Moreover, information could be directly encoded in these transient dynamics (Granados et al., 2018), e.g. in temporal trajectories of morphogen concentrations, **g**(*t*), at different spatial locations (Heemskerk et al., 2019; Rushlow and Shvartsman, 2012; Villoutreix et al., 2017). A rise and subsequent fall in morphogen concentration with time could designate a different position than a fall followed by a rise, even though the average initial and final morphogen concentrations were identical. The mathematical framework is readily generalizable for such a case: optimal decoding would be carried out using full temporal trajectories of morphogen concentrations, **g**(*t*). Extension of the framework to intrinsically dynamic processes could be particularly relevant to vertebrate somitogenesis, where system growth and patterning are dynamically highly intertwined (Oates et al., 2012).

Dynamics open up new operating regimes for patterning circuits: while in a static picture reading out more than one bit of PI would imply the ability to precisely respond to graded morphogen concentrations, in a dynamic picture the same amount of information could be extracted by temporally varying morphogens driving a simple binary switch through a sequence of ON/OFF transitions. This picture is attractive as any single temporal snapshot would only carry, at most, a single bit of information per patterning gene, whereas temporal dynamics could encode significantly more. An advantage of such a strategy is that achieving gene expression precision corresponding to a single bit is metabolically cheaper than scaling the information to two or more bits in the static case (Tkačik et al., 2008a). Another possible advantage would be for patterning in growing tissues, as binary expression states can be made persistent and robust against external perturbations using simple bistable genetic circuitry. On the other hand, it is unclear how biological circuits would implement the computations necessary to decode such temporal profiles.

Alternatively, PI could depend not only on the local concentration, but also on some other relevant variable that is set in the history of a cell (or its lineage). Mathematically, this could be implemented by increasing the dimensionality of **g** to incorporate recent history. In practice, however, estimating PI from high-dimensional trajectories or internal cell states is challenging (Cepeda-Humerez et al., 2019), and the number of possibilities of what constitutes an unknown ‘internal state of the cell’ is vast. What constitutes a relevant internal state is also unclear. As such, we are far from fully understanding the range of patterning possibilities that can emerge when cells not only read out local morphogen values but also have memory and can act and interpret morphogens based on their internal state.

### Is PI ‘produced’ during development?

As discussed above, spatiotemporal averaging can increase the amount of PI available from a single snapshot of downstream gene expression patterns relative to a single snapshot of input morphogen profiles, without violating the DPI. But how is the inequality consistent with the establishment of the primary morphogen gradient? Is PI created from nothing during this process? More generally, how should we think about Turing patterning and mechanisms of lateral inhibition (Afek et al., 2011), which establish spatial patterns *de novo*? For all of these cases, PI seemingly emerges. But how?

Turing patterning can be reconciled with the PI framework (Green and Sharpe, 2015). In essence, information about initial and boundary conditions is transformed into PI in the bulk of the organism (Hillenbrand et al., 2016). A key insight here is that establishing a sharp pattern with clear boundaries is insufficient. Such patterns need to be generated reproducibly from specimen to specimen. In the Turing mechanism, which is deterministic, the locations of boundaries depend on the exact geometry and on the initial ‘noise’ in the system that breaks the symmetry. For the same pattern to emerge reproducibly, the initial noise and the geometry need to be controlled precisely. Thus, PI in the final pattern of the dynamic process arises from the bits that carefully specify the geometry and the initial conditions. Nevertheless, much is yet to be understood, both conceptually and mathematically, even in simple toy models of gradient establishment, or models where cells are seen as proceeding algorithmically through sequences of switch-like decisions to set up a spatial pattern. These questions are especially pertinent when self-organized patterning systems based on reaction-diffusion mechanisms interact with global PI (Green and Sharpe, 2015).

### What sources of variability constitute ‘biologically relevant’ variability?

In its information-theoretic formulation, PI fundamentally depends on fluctuations in morphogen patterns and on the variability of morphogen profiles. Although experimental noise must clearly be accounted for before PI can be computed, the other sources of variability that should be considered are less clear. We stress that this is not a mathematical or a technical issue, but a matter that depends on the system and the biological objective or question. Different choices of variability imply different interpretations for the resulting PI. The fundamental question here is what constitutes biologically relevant variability?

Is it simply single-embryo variability due to intrinsic stochasticity of molecular biochemical reactions? This would be appropriate if one assumes that molecular decoding mechanisms within individual embryos can compensate for systematic embryo-to-embryo variability, e.g. due to variation in the finite amount of deposited maternal morphogen molecules. If this is unlikely, then we must include such ‘extrinsic noise’ into the relevant variability in addition to intrinsic noise, which in turn must decrease PI. It is even less clear whether environmental noise should be included in biologically relevant variability. For example, exposure of embryos to temperature or chemical variations certainly occurs under natural conditions (Kuntz and Eisen, 2014); but should such variability be removed under laboratory conditions? The choice would again depend on assumptions about potential compensation mechanisms for such variability. For example, when we subtract variability due to developmental timing, we assume that the system determines its PI according to an internal timing mechanism. Thus, when the internal timing is slowed due to, e.g. lower temperature conditions, PI readout must be slowed accordingly.

### How is PI related to robustness?

The advantage of a quantitative framework is that it circumvents *a priori* choices about the relevance of biological variability. In fact, it also allows concepts such as robustness of developmental networks and canalization to be interrogated. PI can be measured in differently conditioned ensembles of specimens, and its dependence on various sources of variability can be determined. Such an exercise provides a productive way to understand and mathematically formalize the notion of robustness (Barkai and Leibler, 1997; Goldman et al., 2001) under the hypothesis that a patterning system is robust when PI is maintained under parameter variation, whether environmental (temperature, genetics and food) or internal (embryo size) (Cheung et al., 2014; Gregor et al., 2005; Houchmandzadeh et al., 2002; Miles et al., 2011). Selection for robustness thus implies that we should observe small differences in PI between wild-type embryo populations that are perfectly environmentally controlled and those that also include environmental variability. Making this link precise, putting robustness on a firm mathematical footing that is inherited from PI (Hillenbrand et al., 2016) and testing the above hypothesis empirically are exciting future prospects.

### Why is PI transformed and how are the different representations related to developmental networks?

PI present in primary morphogen gradients is transformed, or recoded, in a series of steps before cells commit to discrete fates. Understanding the rationale for the emergence of these transformations is still an unresolved issue. In part, recoding can effectively implement spatiotemporal averaging, as explained above, thereby increasing the amount of PI available at a single point in space and time. This is the case for the transformation of primary morphogens into gap gene expression profiles in *Drosophila*. Alternatively, network interactions among gap genes could increase robustness (Hillenbrand et al., 2016), i.e. stabilize the representation of PI against external sources of variability, or ensure that the representation of position is equally precise along the whole body axis, a hallmark of optimality. In growing tissues, information could also be read out from a primary morphogen gradient at an early developmental timepoint and recoded stably into a new representation with a time delay (Zagorski et al., 2017).

From an information-theoretic perspective, however, the necessity for long developmental cascades is still largely unresolved. The positional code of the gap genes, for example, contains sufficient PI already at a local level (Petkova et al., 2019); why then recode it into expression patterns of pair-rule and segment-polarity genes (Lawrence, 1992)? One hypothesis is that these subsequent transformations, while retaining PI, make it more explicit, allowing cells to ultimately turn on or off individual fate-specifying genes in a switch-like fashion to resolve and then permanently memorize a particular cell fate. PI would thus be transformed from graded, combinatorial representations carried by a small number of genes, into more binary, and possibly less-combinatorial, representations distributed over more genes (McGinnis and Krumlauf, 1992). Such an architecture has analogies to signal processing in natural and artificial neural networks, where inputs are transformed layer by layer into robust, invariant and easily learnable representations, before being acted on by a discrete ‘decision-making’ circuit that minimizes the classification error (Kriegeskorte, 2015; Yamins and DiCarlo, 2016).

### Can PI be related to cell fate and canalization?

The information-theoretical framework for PI describes how information about position is represented biochemically, while decoding prescribes how to extract that information optimally. Cells, however, do not need to estimate a positional coordinate in the embryo, but instead need to decide on a discrete cell fate. Although similar, these two problems are mathematically not identical. First, a coordinate is a continuous variable (making its decoding a regression problem), whereas cell fate is typically thought of as discrete (making its decoding a classification problem). Although the positional coordinate in an organism can typically be discretized by cell diameters, the issue remains of whether the task of the patterning system is indeed to permit cells to learn about their absolute positions. Second, even in a discrete cellular lattice, there is no one-to-one mapping between different cell types and different cell positions; a region of one type can, for example, stretch over more than one position. Third, when making fate decisions, different ‘errors’ that cells can make might not be equally deleterious; some errors, such as mis-specifying one cell in a homogeneous island of other cells, could perhaps be locally corrected.
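The difference between decoding a position and deciding on a fate can be made concrete with a small sketch. It assumes a single exponential mean profile, Gaussian readout noise, a uniform prior over positions and a three-region French Flag fate map; all of these, along with the names `position_posterior`, `decode_position`, `fate_posterior` and `classify_fate`, are illustrative assumptions rather than anything measured.

```python
import numpy as np

positions = np.linspace(0.0, 1.0, 101)      # discretized axis coordinate x
mean_profile = np.exp(-positions / 0.2)     # assumed mean readout g(x)
noise_sd = 0.05                             # assumed Gaussian readout noise

# Many-to-one fate map: 0, 1, 2 label three contiguous regions of the axis.
fate_of_position = np.digitize(positions, [1 / 3, 2 / 3])

def position_posterior(g):
    """P(x | g) on the grid, for a uniform prior and Gaussian noise."""
    log_like = -0.5 * ((g - mean_profile) / noise_sd) ** 2
    post = np.exp(log_like - log_like.max())   # subtract max for stability
    return post / post.sum()

def decode_position(g):
    """Regression-like task: most probable position given the readout."""
    return positions[np.argmax(position_posterior(g))]

def fate_posterior(g):
    """P(fate | g): sum posterior mass over all positions sharing a fate."""
    return np.bincount(fate_of_position, weights=position_posterior(g),
                       minlength=3)

def classify_fate(g):
    """Classification task: most probable fate given the readout."""
    return int(np.argmax(fate_posterior(g)))

readout = np.exp(-0.5)                      # noiseless readout at x = 0.1
x_hat = decode_position(readout)
fate_hat = classify_fate(readout)
```

Because the fate posterior pools probability over every position within a region, a cell can be confident about its fate while remaining uncertain about its exact coordinate, which is one way in which the two problems differ.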

Yet the biggest challenge may be in the definition of ‘cell fate’ itself. What precisely constitutes cell fate or identity? In the French Flag problem, fate is the unambiguous red/white/blue ‘color’ of the cell denoting its discrete type, and this choice is concomitant with applying a threshold on the primary morphogen gradient. But what is the equivalent representation of fate in real cells? In *Drosophila*, local combinations of four genes at 2-3 h of development suffice to identify a specific position for a cell along the AP axis of the embryo. However, specifying the position of a cell and specifying its fate are very different processes. In fact, it is unclear what exactly specifies fate molecularly. Even though there is enough PI to establish a fate, the actual molecular commitment might only happen in subsequent layers of the regulatory network.

To tackle this problem, PI theory needs to be extended to describe how discrete fate decisions are taken optimally. It should be based on the PI encoded in the morphogen profiles, and on minimization of deleterious patterning errors. Bayesian decision making or rate-distortion theory could potentially address this issue formally (Bowsher and Swain, 2014). Recent advances in single-cell sequencing, in particular in conjunction with machine learning and large dataset analyses (Van Der Maaten and Hinton, 2008), now make it possible to connect developmental patterns and fates with the systems biology of gene expression (Baron and van Oudenaarden, 2019). Theoretical frameworks for PI and (putatively) cell-fate determination should thus incorporate single-cell gene expression data, but how to achieve that in a way that is theoretically coherent and computationally tractable remains an unresolved issue.
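One way such an extension might be written down, offered here only as a sketch of standard Bayesian decision theory and not as an established result for patterning systems, is to let a cell choose the fate that minimizes its expected cost. Let $\mathbf{g}$ denote the local expression levels, $x$ the position, $F(x)$ a (hypothetical) map from position to the correct fate, and $C(f, f')$ an assumed cost of adopting fate $f'$ when the correct fate is $f$. Then

$$
P(f \mid \mathbf{g}) = \sum_{x \,:\, F(x) = f} P(x \mid \mathbf{g}), \qquad
\hat{f}(\mathbf{g}) = \arg\min_{f'} \sum_{f} C(f, f')\, P(f \mid \mathbf{g}),
$$

where $P(x \mid \mathbf{g}) \propto P(\mathbf{g} \mid x)\, P(x)$ is the same posterior that underlies positional decoding. If all errors cost the same ($C = 0$ for $f = f'$ and $1$ otherwise), the rule reduces to picking the most probable fate; unequal costs shift the decision boundaries away from the errors that are most deleterious.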

The powerful concept of canalization put forward by Waddington in the 1940s (Waddington, 1942) provides an intuitive explanation of how cells are reliably guided to their final fates through a series of decision events on a ‘genetic landscape’ that resembles a potential energy surface. A major outstanding issue is therefore whether we can elevate canalization (analogously to PI) from a biological concept to a mathematical object within information theory, rate-distortion theory and/or decision-making theory (Cover and Thomas, 2006).

## Conclusions

This Review is a biased historical appraisal of the PI paradigm, written from our perspective on how the concepts of information theory can be incorporated into developmental biology. Time will tell whether this fusion of ideas will be productive and/or whether it will lead to novel insights with predictions that would not be possible without this rigorous formalization. In our view, the act of applying an exact mathematical framework to a biological concept and actual data has already helped sharpen ideas and concepts, and has led to the next generation of precision experiments focusing on testing a theory. One might wonder whether it has been worth the effort. In this context, it is interesting to look back at Shannon's opinion piece ‘The Bandwagon’, which appeared 8 years after he published his seminal work on information theory. Shannon warned of hype and blind over-application of information-theoretical concepts and words across the spectrum of natural and social sciences, calling for restraint and meticulous work (Shannon, 1956). Nevertheless, his vision is optimistic:

…many of the concepts of information theory will prove useful in these other fields but the establishing of such applications is not a trivial matter of translating words to a new domain, but rather the slow tedious process of hypothesis and experimental verification.

Fifty years on from Wolpert's seminal idea of PI, and 70 years since Shannon's work on information theory, we are truly beginning to connect these two ideas, and we encourage more work to strengthen this connection in the future.

## Acknowledgements

We thank J. Briscoe, T. R. Sokolowski and B. Zoller for helpful comments and discussion.

## Footnotes

**Funding**

This work was supported in part by the National Science Foundation, through the Center for the Physics of Biological Function (PHY-1734030), by the National Institutes of Health (R01GM097275) and by the Fonds zur Förderung der wissenschaftlichen Forschung (FWF P28844). Deposited in PMC for release after 12 months.

**Competing interests**

The authors declare no competing or financial interests.