## ABSTRACT

Meta-analysis is a powerful tool used to generate quantitatively informed answers to pressing global challenges. By distilling data from broad sets of research designs and study systems into standardised effect sizes, meta-analyses provide physiologists with opportunities to estimate overall effect sizes and understand the drivers of effect variability. Despite this ambition, research designs in the field of comparative physiology can appear, at the outset, as being vastly different to each other because of ‘nuisance heterogeneity’ (e.g. different temperatures or treatment dosages used across studies). Methodological differences across studies have led many to believe that meta-analysis is an exercise in comparing ‘apples with oranges’. Here, we dispel this myth by showing how standardised effect sizes can be used in conjunction with multilevel meta-regression models to both account for the factors driving differences across studies and make them more comparable. We assess the prevalence of nuisance heterogeneity in the comparative physiology literature – showing it is common and often not accounted for in analyses. We then formalise effect size measures (e.g. the temperature coefficient, *Q*_{10}) that provide comparative physiologists with a means to remove nuisance heterogeneity without the need to resort to more complex statistical models that may be harder to interpret. We also describe more general approaches that can be applied to a variety of different contexts to derive new effect sizes and sampling variances, opening up new possibilities for quantitative synthesis. By using effect sizes that account for components of effect heterogeneity, in combination with existing meta-analytic models, comparative physiologists can explore exciting new questions while making results from large-scale data sets more accessible, comparable and widely interpretable.

## Introduction

Meta-analysis has emerged as the ‘gold standard’ across various disciplines for quantitative synthesis and is becoming increasingly prominent in the field of comparative physiology (Glass, 2015; Gurevitch et al., 2018; Nakagawa et al., 2017a,b). Meta-analyses answer research questions by synthesising research results and identifying sources of variation across studies (Arnqvist and Wooster, 1995; Borenstein, 2019; Cooper et al., 2009; Gurevitch and Hedges, 1999; Gurevitch et al., 2018; Koricheva et al., 2013; Nakagawa et al., 2017a,b). Ideally, meta-analyses are part of a systematic review. As such, methods are carefully reported upon to ensure the review and data collation are transparent and reproducible [see the Preferred Reporting Items for Systematic Reviews and Meta-Analyses in Ecology and Evolutionary Biology (PRISMA-EcoEvo) checklist; O'Dea et al., 2021].

A meta-analysis can have three goals: (1) to provide an overall mean estimate of a treatment effect or relationship, (2) to quantify effect size variance and understand key drivers explaining differences in effects across studies and (3) to attempt to identify research gaps and publication biases (Borenstein, 2019; Cooper et al., 2009; Gurevitch and Hedges, 1999; Gurevitch et al., 2018; Koricheva et al., 2013; Nakagawa et al., 2017a,b). Meta-analysing independent studies provides greater statistical power and precision than what any individual study on its own would be able to provide – particularly given that most empirical studies are already under-powered in many areas of biology (Button et al., 2013; Forstmeier et al., 2017; Jennions and Møller, 2003). By expressing study effects on a common scale (i.e. standardised effect size), we can gain broader insight into the direction and efficacy of a particular treatment effect or the strength of a relationship between two variables of interest (Gurevitch et al., 2018; Koricheva et al., 2013). Meta-analyses have already provided comparative physiologists with powerful insights on pressing global challenges – from testing whether physiological plasticity can buffer organisms against climate change (e.g. Seebacher et al., 2015) to the degree to which endocrine disrupting chemicals, such as bisphenol A (BPA), impact aquatic organisms (e.g. Wu and Seebacher, 2020).

Despite its widespread adoption and well-established methodological procedures, meta-analysis is often criticised for mixing ‘apples and oranges’ – in other words, mixing effects from studies that are not comparable (Arnqvist and Wooster, 1995; Carpenter, 2020; Gallo, 1978; Glass, 2015; Gurevitch et al., 2018; Stewart, 2010). Lack of comparability could result from studies differing in experimental design, species, measurement variables and sampling units (Arnqvist and Wooster, 1995; Stewart, 2010). For example, experimental designs in the field of comparative physiology can vary greatly in the temperatures or dosages of chemicals that they use in experimental treatments (e.g. Wu and Seebacher, 2020). To the uninitiated, this concern might not be unfounded given that heterogeneity in effects is often high in ecological and evolutionary meta-analyses (Senior et al., 2016), telling us that the effect size varies a great deal across studies. However, to many, having highly heterogeneous effects ‘…is the spice of life’ (p. 519, Cooper et al., 2019) because it provides opportunities to explore the reasons for why effects vary within and across studies (Borenstein, 2019; Cooper et al., 2019; Glass, 2015; Gurevitch et al., 2018).

The goal of our Review is to briefly overview the meta-analytic process and argue for the value of heterogeneity in answering fundamental research problems and guiding research directions. We present results from a survey of published meta-analyses in the field of comparative physiology to gauge methodological and meta-analytic practices – determining what types of effect sizes are commonly used, how often effect size estimates are impacted by ‘nuisance heterogeneity’ and whether meta-analytic models (i.e. models that account for sampling variance) are commonly applied. Then, we show how nuisance heterogeneity in comparative physiology can be comfortably dealt with by re-formalising many existing effect size measures and/or by using multilevel meta-regression models. We formalise alternative effect sizes and their associated sampling variance to provide comparative physiologists with opportunities to explicitly incorporate nuisance heterogeneity at the effect size level to ease their interpretation. Finally, we describe how more complex treatment differences, such as non-linear dosage differences, can be accommodated using multilevel meta-regression models. We hope that expanding the meta-analytic toolkit will provide new opportunities for comparative physiologists to address how organisms will cope with rapidly changing environments and anthropogenic stressors in the future.

## The apples and oranges ‘problem’ in comparative physiology

Meta-analysis is always concerned with effect heterogeneity; in other words, the factors that drive differences in the direction and magnitude of effects within and across studies (Borenstein, 2019; Cooper et al., 2019; Gurevitch et al., 2018; Lajeunesse, 2010; Nakagawa et al., 2017a,b). The concept of heterogeneity is vitally important because it tells us how general our findings are likely to be and how much of the variance we see is the result of real biology or methodological differences (after accounting for sampling variance) (Borenstein, 2019; Lajeunesse, 2010; Nakagawa et al., 2017a,b).

There are many phenomena driving effect heterogeneity. First, there is inherent uncertainty in estimating the ‘true’ population effect (e.g. a correlation coefficient) from smaller samples (Borenstein, 2019). A formal meta-analysis weights sample estimates by their precision, and removes this source of heterogeneity (Gurevitch et al., 2018; Koricheva et al., 2013). Removing sampling variance is possible because we know, mathematically, how to calculate it for many effect measures. While unweighted analyses are common, weighting studies makes overall estimates more precise, and less susceptible to publication bias (see below), even if the overall mean itself is unbiased (Morrissey, 2016). Second, there are sociological factors impacting heterogeneity. These include the ease with which novel and significant results are published relative to non-significant results, manifesting as publication bias (Jennions et al., 2013; Nakagawa et al., 2021a; Rothstein et al., 2005). Third, real biological processes can drive effect heterogeneity. This is particularly true in comparative physiology, where we synthesise data from different species with varying life histories, habitats, mating systems and reproductive modes. As comparative physiologists, this is the stuff that gets us up every morning! We can model these biological factors by including relevant predictors (moderator variables) and/or random effects (to account for non-independence of effects). By doing so, we can test predictions from hypotheses about the key players impacting mean effect size and direction. This improves our understanding of the biological world around us and helps inform future research directions.

Finally, methodological factors are also big contributors to effect variability. In some cases, these methodological factors are of direct interest. For example, the methods for measuring an outcome variable can result in different effects, and it is important we know and understand this to determine how experiments can be designed in the future. In other cases, methodological factors might be considered ‘nuisance heterogeneity’ (following from Cooper et al., 2019) – factors we know vary, but are rather unsurprising to us (Fig. 1). These might include differences in temperature, pH, dosages, water potential, etc. Regardless of the relative importance of such variables compared with core questions in a meta-analysis, nuisance variables may complicate the interpretation of effect sizes used if not accounted for properly.

Besides sampling variance, all the sources of heterogeneity discussed above contribute (to varying extents) to the ‘apples and oranges’ problem often touted as compromising the reliability of meta-analyses. Combining standardised effects while ignoring these sources of heterogeneity can indeed weaken the reliability of a meta-analysis. However, many argue that heterogeneity is just what we are interested in synthesising (Borenstein, 2019; Cooper et al., 2019; Glass, 2015; Gurevitch et al., 2018; Lajeunesse, 2010). In the words of Gene Glass, the founder of meta-analysis: ‘Of course it [meta-analysis] mixes apples and oranges; in the study of fruit nothing else is sensible; comparing apples and oranges is the only endeavor worthy of true scientists; comparing apples to apples is trivial’ (p. 224, Glass, 2015). Whether comparing apples and oranges is a good idea simply boils down to the specific question of interest, the population one wishes to make conclusions about and how heterogeneous effects actually are. If we are interested in generalising only to apples, then mixing fruit will not be ideal. However, if we want to understand why different ‘fruit’ vary in their properties, then we need to consider apples and oranges together (Borenstein, 2019).

We tend to agree with Glass (2015), and his sentiment is particularly relevant to comparative physiology. For example, if we want to understand the impact of a pollutant on reproduction in aquatic organisms, including aquatic mammals and fish would provide answers to this general question, even if their reproductive biology and physiology are different. Synthesising diverse study outcomes allows us to also understand critical features of the literature, study systems (e.g. different effects of pollutants on fish and mammal reproduction) and approaches (e.g. short-term versus long-term studies) that can explain the diversity of effect outcomes observed. Doing so provides a rich set of conclusions that can draw attention to important sources of variability that might have otherwise been overlooked. Having said that, it is still very important to understand the limitations and diversity of studies included in a meta-analysis, and how this can affect interpretations made – an important reason why transparency and reproducibility are such prominent features of meta-analysis (or ideally should be) (O'Dea et al., 2021). The validity of general conclusions to certain questions (e.g. what is the overall effect of a pollutant on reproduction?) will inevitably depend on the homogeneity of effects being synthesised. If the estimate of an overall effect coincides with low heterogeneity then the effects are reasonably consistent (Borenstein, 2019). If, however, an overall effect is accompanied by large heterogeneity, then we would shift focus and explore drivers of that heterogeneity. For example, given the very different physiology of aquatic mammals and fish, understanding an overall combined effect on reproduction may not be particularly relevant and may even be misleading. It may, therefore, be more useful to understand overall effects in each of these groups separately, and assess the extent to which they differ.

## Nuisance heterogeneity complicates meta-analytic interpretations

Nuisance heterogeneity can get in the way of understanding real biological patterns that interest comparative physiologists. Take, for example, a meta-analysis that attempts to assess the effects of diet on fish growth across different populations of a widely distributed (cosmopolitan) species. Studies might rear populations of the same fish species under different temperatures, but in the end, they are still all governed by the same thermodynamic effects on growth. In this situation, we are still interested in understanding overall diet effects, but we may question whether such a conclusion is sensible given the diversity of temperatures applied. If temperature differences are of direct interest, then this variable can be formally incorporated in statistical models. If, however, it is not the primary interest, then it may introduce an unnecessary complication to the interpretation of the effect of diet on fish growth. Such heterogeneity is an important contributing factor to the ‘apples and oranges’ problems described above. Nonetheless, there is limited discussion about the ways in which it can be dealt with in the literature.

Here, we focus on meta-analytic solutions to circumvent nuisance heterogeneity, showing how it can easily be overcome using a number of meta-analytic tools. Prior to overviewing the solutions, we conducted a literature survey to better understand the different types of effect sizes being used in comparative physiology, the susceptibility of studies to nuisance heterogeneity, and how these sources are currently (if at all) being controlled for in a meta-analysis. More specifically, we asked: (1) what type of effect size was used within the quantitative synthesis?; (2) based on the methods provided, was the effect size likely susceptible to nuisance heterogeneity?; (3) if so, was such variation accounted for in the effect size or during analysis?; and finally, (4) how many analyses used ‘weighted’ meta-analytic models?

We searched the Scopus database between 16 and 18 May 2021, with a follow-up search on 1 September 2021 (see the supplementary information at https://daniel1noble.github.io/nuisance_heterogeneity/). After a series of pilot searches to refine our search string (see Foo et al., 2021 for an overview of this process), we searched for studies that included: ‘meta-analysis’, ‘meta regression’, ‘comparative analysis’, ‘comprehensive analysis’, ‘global analysis’ or ‘macro physiology’ in their title or abstract. We restricted our search to key comparative physiology journals known to the authors that would likely contain literature-based quantitative syntheses within the last 6 years (detailed in the supplemental tables at https://daniel1noble.github.io/nuisance_heterogeneity/). We expected this to provide a contemporary overview of quantitative syntheses in comparative physiology while making the survey and screening feasible. We acknowledge that this may have missed some important, and influential, quantitative syntheses in the field of comparative physiology, but likely provided an informative random sample.

All title, abstract and full text screening was done by two people (D.W.A.N. and M.L.). Any disagreements were resolved by discussion to reach a consensus. Out of the 426 papers originally identified, 80 full texts were screened for eligibility, and we included a total of 63 for final data extraction. For full details on the search strings used, search dates, the number of papers found and our PRISMA flowchart (along with reasons for exclusion at the full text stage), we refer readers to the supplementary information (https://daniel1noble.github.io/nuisance_heterogeneity/).

To address our questions, we categorised effect sizes as falling into the following categories: response ratio (lnRR), standardised mean differences (SMD; e.g. Hedges *d*, *g* or Cohen's *d*), correlation or Fisher's *z*-transformed correlation (*Z _{r}*), regression slopes (slopes), risk ratios, or whether the paper analysed the raw data or summary statistics directly (i.e. means/proportions) (which we labelled as ‘raw’). All other effects were labelled as ‘other’. To determine whether effect sizes were impacted by nuisance heterogeneity, we assessed whether authors: (1) acknowledged this explicitly in the paper or (2) accounted for it in their models. If it was not clear, two of us (D.W.A.N. and P.P.) discussed the paper, and made a decision about whether the effect measure was likely to be impacted. For example, trait means, such as ectothermic metabolic rate (i.e. raw/mean/proportion-based effect size), will vary depending on the temperature being measured. The magnitude of differences between two treatments (e.g. contrast-based effect sizes – SMD or lnRR) will depend on the pH or dosage used in treatments, and the strength of the relationship between two variables (e.g. correlation between behaviour and metabolism) might depend on the temperature context under which metabolic rate or behaviour were measured. Given the high heterogeneity in effects used in comparative physiology, we expected nuisance heterogeneity to be common to most effect sizes.

We also categorised the types of statistical models fitted in each study as: (1) fixed-effect meta-analysis; (2) random-effects meta-analysis (i.e. intercept-only model, but has random study effects); (3) multilevel meta-analysis (i.e. intercept-only model, but has multiple random effect levels such as study-, species- and observation-level random effects); (4) meta-regression models (i.e. a multilevel or random-effects model with fixed effects – including random effects); (5) weighted regression models (including mixed effects models); and (6) linear regression models (including linear and generalised linear mixed-effect models, and phylogenetic least squares models) (see Nakagawa and Santos, 2012 for discussion of models). Notably, only model types 1–5 are weighted models because they weight effect sizes based on the inverse of their sampling variance or sample size.

We summarise the results of our literature survey below while discussing solutions for dealing with nuisance heterogeneity.

## Opening new opportunities in comparative physiology: expanding the breadth of effect sizes to deal with nuisance heterogeneity

### Common effect sizes and sampling errors that are useful in comparative physiology

The key features of all effect sizes used in meta-analyses are that: (1) they are statistical parameters that have been placed on a common scale so that they can be compared across studies and (2) they have some associated measure of precision (i.e. sampling variance) (Borenstein et al., 2009; Koricheva et al., 2013; Schmid et al., 2021). Effect sizes in comparative physiology are quite diverse; they can be the mean difference between two groups (e.g. difference in mean between a control and treatment group; e.g. Wu and Seebacher, 2020), the slope of a regression (e.g. between body size and metabolic rate; Uyeda et al., 2017), a correlation between two variables (e.g. between hormone levels and behaviour; Holtmann et al., 2016), or even the raw mean (or variance) itself (e.g. comparing CT_{max} or CT_{min} across species; e.g. Sunday et al., 2014).

In many cases, different effect sizes might be used depending on the question. For example, to investigate the effect of temperature on a physiological trait, one can compile studies experimentally manipulating temperature and analyse the mean ‘effect’ by creating contrast-based effect sizes, or alternatively, just model the trait in each temperature treatment. The choice as to which one to use boils down to the types of data available from primary studies, the specific question of interest, and whether studies are observational or experimental in nature. Considering our temperature example above, if we were interested in the magnitude of change between temperature treatments, we may want to use a contrast-based effect size, such as the log response ratio, lnRR. Instead, if we were interested in how the mean of some physiological trait changes with temperature, we may model the mean itself. In the latter scenario, if all the means are measured in the same units (e.g. °C) or can be converted to the same units, then the mean may provide the desired answer to the original question.

The results of our literature survey show that quantitative syntheses in comparative physiology commonly analyse raw summary statistics (28.36% of studies), followed closely by Fisher’s transformed correlation coefficients (*Z _{r}*) (23.88% of studies), standardised mean differences (SMD) (23.88% of studies) and log response ratios (lnRR) (11.94%, of studies) (Fig. 2A). We provide details on different effect measures and associated sample sizes in Table 1. In many cases, studies make use of multiple effect size measures. Approximately 72.58% of effect sizes used within the study were deemed to be impacted by nuisance heterogeneity in some form or another (Fig. 2B), but of the studies where this was the case, only 53.33% of the studies clearly dealt with it (Fig. 2B), often through meta-regression modelling.

### Expanding the scope of effect size metrics in comparative physiology

As can be seen in Table 1, none of the effect measures control for nuisance heterogeneity – a point reinforced by our survey. For example, the magnitude of lnRR or SMD can depend on differences in the temperatures, dosages or pH applied to each treatment. The same applies to modelling the means. For example, standard metabolic rate (SMR) is temperature-dependent. As such, one needs to standardise the temperature measurement to compare across studies (Uyeda et al., 2017; White et al., 2006) (if, of course, temperature is not of interest itself). Although we could model nuisance heterogeneity explicitly, this may not always be desirable. It can often be useful to standardise the effect to both ease its interpretation and simplify modelling. In the Appendix, we provide readers with guidance on how comparative physiologists can apply corrections to the effect measure itself. We highlight two examples in the Appendix that can be easily generalized to alternative effect size measures and show how they are distinct from traditional effect measures (Fig. 3). Below, we focus on one familiar example to elaborate on how a traditional effect size, such as lnRR, can be reformalised to capture nuisance variation. We use a common, well-understood effect measure of interest to comparative physiologists – the temperature coefficient (*Q*_{10}).

### Comparing changes in mean physiological rates, *Q*_{10}

A common experiment in comparative physiology is to manipulate the temperature organisms experience and measure some physiological rate (e.g. metabolic rate). Using the effect size measures from Table 1 would result in the effect sizes varying with the temperatures applied within and across studies contributing to ‘comparability’ issues. For example, an effect size for a temperature manipulation of 10°C would be larger than an effect size for a temperature manipulation of 5°C (see Fig. 1). As such, it is common to standardise effects by temperature. One common effect size measure already used in comparative physiology to compare physiological rates is the temperature coefficient *Q*_{10} (Havird et al., 2020; e.g. Rodgers et al., 2021; Seebacher et al., 2015). This effect size describes the multiplicative change in physiological rates across a 10°C temperature change. Higher *Q*_{10} values indicate larger changes in physiological rates. However, currently there is no formal sampling variance associated with *Q*_{10}, making it challenging to make use of the full power of weighted meta-analytic models. Interestingly, *Q*_{10} can be seen as a variant of lnRR, meaning that we can derive a *Q*_{10}-based effect size and sampling error using the well-known mathematical properties of lnRR. This opens up the meta-analytic toolbox and improves our ability to account for well-known sources of non-independence (Lajeunesse, 2011; Noble et al., 2017).

*Q*

_{10}effect size can be calculated, it is useful to understand its similarities to lnRR. The lnRR described by Hedges et al. (1999) and extended by Lajeunesse (2015) can be calculated as follows (but see also Senior et al., 2020):

In Eqn 1, *M*_{1} is the mean of group 1 (e.g. a control group), whereas *M*_{2} is the mean of group 2 (e.g. a treatment group). The mean for group *i*, *M _{i}*, can be any measurement type (e.g. a physiological rate, mass, etc.) so long as the variable is measured on the ratio scale. Natural log transformation of this ratio makes the effect size normally distributed (as commonly assumed by meta-analytic models). Eqn 2 is the analytical solution for the sampling variance of lnRR, where and are the sample standard deviations and

*N*

_{1}and

*N*

_{2}are the sample sizes for group 1 and 2, respectively.

*Q*

_{10}response ratio, lnRR

_{Q}_{10}. Notably, using this effect rather than the traditional

*Q*

_{10}has two statistical advantages: (1) zero becomes biologically meaningful as zero means two rates are exactly the same, and ‘significantly different from zero’ means that two rates are significantly different to each other; and (2) lnRR

_{Q}_{10}is more likely to satisfy the assumption of residual normality than

*Q*

_{10}(see Hedges et al., 1999). The recognition of this equivalence means that we can calculate the sampling variance for Eqn 4 as follows:

formalising effect size metrics that compare changes in variability across treatments in the presence of nuisance heterogeneity.

*Q*

_{10}analogues (and associated sampling variances) as follows:

where CV is the coefficient of variation defined as SD/*R*. Whether one chooses to use SD- or CV-based effect size depends on the questions at hand (see Nakagawa et al., 2015; Senior et al., 2020).

Our example using *Q*_{10}, a well-known effect measure in comparative physiology, shows that common effect sizes can be re-formalised to account for nuisance heterogeneity (in this case, temperature). In doing so, we are making assumptions about the nature of temperature effects on an effect size. Nonetheless, we can apply similar approaches to other commonly used effect size metrics. We describe more generally how this can be done in the Appendix, applying similar principles to standardise temperature differences for slopes and standardised mean differences.

## Meta-analytic models to control for nuisance heterogeneity and investigate heterogeneous effects within and across studies

It will not always be possible to derive an effect size that completely controls for nuisance heterogeneity at the effect size level. This limitation is partly because treatment effects may not be simple linear or exponential functions of the mean response, making the effect sizes we discuss above unsuitable (e.g. thermal performance curves; Noble et al. 2018). For example, dosages applied to treatments can follow quite complex non-linear relationships in relation to some mean response (e.g. quadratic relationships). When this occurs, we recommend comparative physiologists apply meta-regression approaches, controlling for nuisance heterogeneity as a moderator. In many cases, this will be the easiest, or even preferred, approach because exploring drivers of heterogeneity is the main interest in most meta-analyses anyway. Nonetheless, there are some simple statistical approaches a comparative physiologist can use to make the overall effects more easily interpretable, allowing one to control for nuisance heterogeneity, while also testing other biological moderators of interest (Schielzeth, 2010). Before diving into meta-regression models, we will first overview the meta-analytic models commonly used.

### Multilevel meta-analysis when the overall effect is of interest

A common goal of meta-analysis is to obtain an overall meta-analytic mean effect size estimate and some form of uncertainty around that mean, such as a 95% confidence interval (CI). Including a prediction interval (PI) can also be useful (discussed below) (Nakagawa et al., 2021b). Estimating an overall mean and 95% CI is generally done with a simple random-effects or multilevel meta-analytic model that accounts for effect size sampling error (Nakagawa and Santos, 2012; Nakagawa et al., 2017a,b). Often, meta-analyses in comparative physiology will have complex hierarchical structure. For example, multiple effect sizes may be taken from a single study on traits measured on the same individuals or from many different species that share an evolutionary history (Cinar et al., 2021 preprint; Noble et al., 2017). As such, it is far more likely that comparative physiologists will need to apply a multilevel meta-analytic model to control for the various sources of non-independence (Cinar et al., 2021 preprint; Nakagawa and Santos, 2012; Nakagawa et al., 2021c; Noble et al., 2017; Song et al., 2021). Multilevel models are also extremely useful in describing the various drivers of effect size variance (a feature we describe below). For the purpose of our discussion, we will assume one needs to use more advanced multilevel models. Having said that, in some instances, it may be sufficient for a meta-analysis to make use of traditional random-effects models. We refer readers to excellent reviews and books that overview the distinctions between common (fixed) effect, random-effect and multilevel models and when one might (or might not) want to apply these models (Borenstein et al., 2009; Koricheva et al., 2013; Nakagawa and Santos, 2012; Nakagawa et al., 2017a,b; Schmid et al., 2021).

where *Y _{i}* is the

*i*th effect size (

*i*=1, …,

*N*

_{ES}, the number of effect sizes), β

_{0}is the overall meta-analytic mean, and

*a*

_{k}_{[i]}is the phylogenetic effect (a random effect) for species

*k*applied to effect size

*i*(i.e. species effects because of shared evolutionary history; Cinar et al., 2021 preprint). Phylogenetic effects are assumed to be normally distributed deviates sampled from a distribution with a mean of 0 and variance [i.e.

*N*], where

**A**is a phylogenetic correlation matrix derived from a phylogenetic tree.

*sp*

_{k}_{[i]}is an additional species-specific effect for the

*k*th species applied to effect size

*i*(i.e. these are species-specific effects because of shared ecology and other factors, Cinar et al., 2021 preprint),

*s*

_{j}_{[i]}is the study-specific effect for the

*j*th study applied to effect size

*i*,

*e*is the effect size-specific effect (or within study effect, or residuals) for the

_{i}*i*th effect size, and

*m*is the sampling effect for the

_{i}*i*th effect size, resulting from varying precision for each effect size,

*v*(which is known as the sampling variance for the effect).

_{i}This model is a ‘weighted’ multilevel meta-analysis model because the estimation of the overall meta-analytic mean, β_{0}, is partially weighted by the inverse sampling variance of each effect size. Weighted meta-analytic models are important to use because they: (1) improve the precision on meta-analytic mean estimates by accounting for sampling variance; (2) allow for a formal analysis of heterogeneity; and (3) give precedence to higher quality studies (see below). From our survey, 62.9% of studies used a weighted meta-analytic model; however, of the studies self-described as ‘meta-analyses’, 17.39% (*N*=8 of 46) did not account for sampling variance. Many studies used ‘linear regression’ (including linear and generalised linear mixed effects models) as the main type of model fit to the data (Fig. 2C). Of the studies using weighted meta-analytic models, 35% (*N*=14 meta-analyses) did not report any measure of heterogeneity, which is unfortunate given how important these measures are to interpreting effects.

Importantly, if we have effect sizes that are derived from the same sample of organisms (e.g. because traits are measured on the same sets of individuals), then we need to remove this dependency from the calculation of each *v _{i}*; we can then account for the assumed sampling covariance between effect sizes by modifying

**I**to include off-diagonals that describe this covariance (see the Appendix and Noble et al., 2017). Alternatively, robust variance estimators can also be used to deal with effect size non-independence (Nakagawa et al., 2021c; Song et al., 2021). The fact that we can obtain an overall estimate of sampling variance (i.e. , see below for how it is defined) is very important because it provides a way for us to understand just how much variation in effects are the result of sampling differences across studies versus other, potentially important sources of variation, such as differences in the methods applied, species differences, or even real biological effects that impact the effect size a study is likely to observe (Nakagawa and Santos, 2012; Nakagawa et al., 2017a,b). This is formalised in what we call ‘heterogeneity analysis’, which we describe below.

Importantly, this model provides us with the answer to our question: across all the species and studies manipulating egg testosterone, how much of an overall effect does testosterone have on offspring traits relative to control conditions? Of course, the overall effect does not provide us with a way to interpret its magnitude in the context of how much testosterone has been applied in the treatments used in the sample of studies. The most we can say about it is that the effect is small, moderate or large given the sample of effect sizes extracted from this set of studies. In the next section, we show how to ‘correct’ the overall effect size so that its interpretation is more meaningful.

### Multilevel meta-regression to understand variation, account for nuisance heterogeneity and improve effect interpretation

where is simply the sum of all effects for all moderators (fixed effects; *p*=number of slopes), *x*_{l}. All other notation is the same as described above. Importantly, the variable *x*_{l} could be any moderator collected from studies, including dosage, temperature, salinity or pH. Some moderators will be at the level of the study (e.g. was the study experimental or observational in nature), but others could be taken at the effect size level (i.e. was the effect derived using a temperature of 23°C or 35°C). By including these moderators in a meta-regression model, the interpretation of the overall meta-analytic mean, β_{0}, changes. For example, assume we collected the change in *in ovo* testosterone dosage relative to a control group (i.e. in ng testosterone g^{−1} yolk) for our meta-analysis on egg testosterone effects on bird offspring traits. If the untransformed dosage, *x _{l}*, is included in the model, then the overall meta-analytic mean estimate, β

_{0}, would be the mean effect of testosterone on traits when the dosage difference between the treatment and control group is 0 ng testosterone g

^{−1}yolk (i.e. it would be reflective of unmanipulated eggs). Of course, this interpretation does not make much sense.

We can, however, apply some simple transformations to make the overall mean estimate more intuitive while also improving the interpretation of non-linear parameters and interactions estimated from the model (Gelman and Hill; Schielzeth, 2010). One simple transformation would be to centre the variable *x _{l}* by subtracting each value from the mean to create a new input variable (i.e. , where the subscript

*i*denotes each individual effect size dosage difference). The new variable,

*c*, is now centred on the mean and can replace

_{l}*x*in the meta-regression model. When the model is refitted with

_{l}*c*, the estimated β

_{l}_{0}can now be more intuitively interpreted as the overall mean effect of testosterone on offspring traits at an average dosage difference between treatment and control groups. Importantly, by centring, we retain the original units (ng testosterone g

^{−1}yolk) of the variable, which can be very useful to ease the interpretation of the effect magnitude (e.g. when the dosage difference is 10 ng testosterone g

^{−1}yolk).

Given it is very common to estimate and compare overall meta-analytic means (i.e. β_{0}) in meta-analyses, centring can be particularly useful in making these means more interpretable. Mean centring can even be extended to include other ‘nuisance’ variables that might creep into the dataset; the same centring approach can be done with a second variable, and the overall mean adjusted and interpreted in the context of this new variable along with dosage. Of course, if one is not interested in the magnitude of the effect, we can simply model mean reproductive output as a function of dosage (often what are referred to as ‘arm-based’ meta-analytic models). Here, one can model the mean for each group (i.e. control and treatment) and account for the known sampling variance associated with each group mean. Dosage, or other continuous variables, such as temperature, can then be modelled (accounting for sampling variance) to understand mean trait changes. It is important to recognise that all means need to be comparable and that these models are more complex to fit. While they can be equivalent to ‘contrast-based’ models, they require complex interactions to be estimated (Nakagawa et al., 2015).

Centring moderator variables, like temperature and dosage, not only provides a way to make the overall mean effect more interpretable in the face of treatment heterogeneity, but it also allows more complex relationships to be fitted without compromising interpretations, such as when one includes non-linear parameters (e.g. higher-order polynomials) and interactions with other variables. In other words, linear or main effects can still be interpreted normally in the presence of non-linear effects or interactions fit in the model (Gelman and Hill; Schielzeth, 2010). We provide examples in the online supplementary material (https://daniel1noble.github.io/nuisance_heterogeneity/) along with R code to show readers how the models we describe above can be fitted, and how meta-analytic means can be adjusted for treatment heterogeneity.

### Heterogeneity analysis: how much nuisance heterogeneity actually exists?

where *k* is the number of studies and the weights, *w _{i}*=1/

*v*, can be calculated using the inverse of the sampling variance (

_{i}*v*) for each effect size

_{i}*i*.

Quantifying heterogeneity using provides a formal way to assess the relative amount of variation that is the result of ‘true’ biological or methodological differences as opposed to chance (i.e. sampling variance) (Higgins and Thompson, 2002; Nakagawa and Santos, 2012). In other words, is the percentage of variance between effect sizes after removing the effects of sampling error (Higgins and Thompson, 2002). Note that we can only calculate this metric if we have an estimate of the sampling variance for a given effect measure because only then can we quantify the proportion of variation that is the result of sampling variability. Senior et al. (2016) have shown that total heterogeneity is often extremely high in ecological and evolutionary meta-analysis (∼91%), suggesting that effect measures vary widely within and across studies. Some of the variation is clearly the result of working with different species or strains, or measuring different traits and outcome measures. However, some of this variation also arises because of methodological differences across studies (e.g. different temperature or dosage treatments).

Formally quantifying and presenting heterogeneity estimates from meta-analytic models can provide a way to understand the major drivers of effect size variation by producing various *I*^{2} estimates that can be compared within and across studies. For example, we could fit the multilevel model described above to understand what proportion of variation is the result of, say, ‘study’ effects following Nakagawa and Santos (2012): where is the study-specific variance.

*I*

^{2}and

*R*

^{2}(Nakagawa and Schielzeth, 2010, 2013; Nakagawa et al., 2017a,b). If we want to formally assess how much effect size variation is the result of nuisance heterogeneity (e.g. temperature and dosage differences applied across studies), then we could fit our multilevel meta-regression model including dosage or temperature (linear, quadratic or even more complex fits), and estimate how much variation is explained by these fixed effects following Nakagawa and Schielzeth (2013):

Note that this formula does not include , as sampling error variance is assumed to be known in meta-analysis, as explained above. When including continuous moderators, such as dosage, temperature, salinity and even pH, provides a formal means to assess just how much variation in effects is the result of nuisance heterogeneity. Other biological moderators of interest could also be included as fixed effects and can be calculated for each independently, or all together. Importantly, moderators could be those that explain variation at the effect-size level (i.e. dosage, temperature) or variation at the study or species level (e.g. reproductive mode, endothermy, etc.). , or how we like to think of it in this context, , may be a useful measure to help readers understand just how much effect variation can result from the specific treatment application. If the heterogeneity is high as a result of, say, dosage differences, then it is clear that the choice of dosage will have a critical impact on the effect of interest. Comparative physiologists interested in implementing these calculations can do so using our packages orchaRd (Nakagawa et al., 2021b) or metaAidR (https://github.com/daniel1noble/metaAidR).

Heterogeneity described using *I*^{2} measures can be useful for understanding the relative contributions of different factors to effect size variation; however, prediction intervals might be more appropriate in many cases (Borenstein, 2019; Nakagawa et al., 2021b). Prediction intervals describe the range of plausible effect-size values expected from a future study (Nakagawa et al., 2021b). This is different from a confidence interval that expresses the range of uncertainty around a statistical parameter estimate. Unlike *I*^{2}, prediction intervals (PIs) are probably more meaningful in the context of meta-analysis because they explicitly incorporate measures of dispersion around a mean effect size. They provide information about the likely effect size one can expect if we were to randomly sample a new population. For example, assume that we were interested in the overall impact of salinity stress in freshwater fish. To tackle this question, we might collect experimental studies measuring fish swimming performance under high salinity treatments (∼15–20 ppm NaCl) compared with freshwater (∼0 ppm NaCl) controls, using SMD to quantify the effect salinity had on swim performance. We conducted a meta-analysis and estimated an overall SMD of 0.50 with a 95% prediction interval of 0.05 to 0.85. The PI indicates that effects can vary widely from as low as an SMD of 0.05 to as high as an SMD of 0.85, depending on the population. In addition, if we were to repeat a similar study, we would expect a new effect to fall within the range of 0.05 and 0.85, 95% of the time (Borenstein, 2019; Nakagawa et al., 2021b). This is a very intuitive interpretation of how variable effects are, yet PIs are often not reported in meta-analyses (Nakagawa et al., 2021b). We encourage more meta-analysts to report these (for examples of PIs, see supplementary material at https://daniel1noble.github.io/nuisance_heterogeneity/).

## Conclusions

Comparative physiologists are interested in meta-analysing effects across a diversity of species, experimental designs and environments. Importantly, sampling variances for many effect sizes commonly used by comparative physiologists (e.g. CT_{max}, *Q*_{10}, etc.) can be formalized and powerful meta-analytic models can be easily applied to these effect measures (Fig. 3). Although an overall meta-analytic mean may be of interest, more often than not, understanding the drivers of effect size heterogeneity will be the primary interest. However, factors such as temperature and dosage differences across studies may generate obvious sources of heterogeneity that may not be of prime interest. Instead, comparative physiologists might simply want to focus on biological drivers of effect-size variability and want an effect size or analytical approach that allows them to remove these ‘nuisances’. Surveying the comparative physiology literature shows that nuisance heterogenity is common, and is often not completely dealt with in a meta-analysis. Here, we provide a set of tools to deal with nuisance heterogeneity (Fig. 1) by reinventing standard effect sizes and/or using mean centred nuisance variables in meta-regression models. Estimating complex meta-regression models with a multitude of fixed effects might result, however, in more challenging model interpretation. As such, a combined approach might be more desirable. Nuisance heterogeneity can be dealt with by using an appropriate effect size and associated sampling variance in combination with meta-regression models that contain moderators that test biological hypotheses of interest. Indeed, this may even be a necessity to simplify the model. The set of tools we describe provide clarification around the importance and limitations of heterogeneity. We also hope to provide new ways to help comparative physiologists communicate effect measures more richly to guide our understanding and decisions around pressing global challenges.

### APPENDIX

#### Examples of how to derive sampling variances

Here, we show how to obtain sampling variance for a slope (or a ‘rate of change’) when you have measurements of a physiological trait at two points along an environmental gradient (e.g. temperature, salinity, pH). We then derive sampling variance for the difference between two slopes, and demonstrate how SMD can be corrected to account for differences in both units (e.g. cm or mg) and points on an environmental gradient (e.g. temperature of 24°C and 30°C). Finally, we introduce a useful approximation technique known as ‘the Delta method’. The sampling variances associated with these ‘new’ effect sizes will allow comparative physiologists to take advantage of powerful meta-analytic models.

#### Sampling variance for a slope between two points

where *T* is temperature (°C) and *T*_{2}>*T*_{1}, and *M*_{1} and *M*_{2} are the average physiological responses (e.g. mean CT_{max}) at temperature points *T*_{1} and *T*_{2}, respectively. Many studies might manipulate multiple variables using a fully factorial design (e.g. temperature and pH); however, for simplicity we assume the study only manipulates temperature. In the supplementary information (https://daniel1noble.github.io/nuisance_heterogeneity/), we also show how to derive ARR from fully factorial studies (i.e. main effects and interactions).

*M*

_{1}is a random variable drawn from a distribution that can be characterised by a mean and standard deviation (note that this standard deviation is not the ‘sample’ but the ‘sampling’ standard deviation, which is often referred to as standard error; see fig. 1 in Nakagawa et al., 2021a). Multiplying it by a constant (

*a*) will change the variance by the square of that constant (

*a*

^{2}) while adding or subtracting the constant (

*b*) does not change the variance of

*M*

_{1}. This can be summarized as:Also, when adding two random variables (

*M*

_{1}and

*M*

_{2}), the combined variance is the sum of the variance of

*M*

_{1}and the variance of

*M*

_{2}plus 2 times the covariance between

*M*

_{1}and

*M*

_{2}. This relationship can be written as:where the covariance Cov(

*M*

_{1},

*M*

_{2}) equals the correlation multiplied by the square root of the product of two variances .

Importantly, when *M*_{1} and *M*_{2} are independent of each other, their covariance is 0. In other words, if measurements are taken from two different groups of animals at two different temperatures (*T*_{1} and *T*_{2}), then the covariances between these two sets of measurements are 0.

*M*

_{1}and

*M*

_{2}are independent, we can obtain the sampling variance for ARR as:where SD

_{1}and SD

_{2}and

*N*

_{1}and

*N*

_{2}are standard deviations and sampling sizes at temperatures

*T*

_{1}and

*T*

_{2}, respectively. Readers may find it difficult to see how we obtained this equation. Let us explain further. The (sampling) standard error for

*M*

_{1}is . Given this, the sampling variance for

*M*

_{1}is , and the sampling variance for

*M*

_{2}is . You can see those elements of the sampling variance of ARR in the formula above. The term [1/(

*T*

_{2}–

*T*

_{1})]

^{2}comes from recognizing 1/(

*T*

_{2}–

*T*

_{1}) as a constant.

*M*

_{1}and

*M*

_{2}[Cov(

*M*

_{1},

*M*

_{2})] in the equation. Note that, as above, the covariance equals the correlation multiplied by the square root of two (sampling) variances. Therefore, the sampling variance of ARR can be now written as:

*N*) are the same at the two temperatures, we can slightly simplify this formula:where

*r*is the correlation between a set of measurements

*M*

_{1}and

*M*

_{2}from the same individuals at two points

*T*

_{1}and

*T*

_{2}. As you may notice, we need the raw data to calculate

*r*. Therefore, in reality, we often need to assume a certain value of

*r*. When we do not have an estimate of

*r*we can reasonably assume it to be 0.5 (Noble et al., 2017).

#### Sampling variance for the difference between two slopes

*T*

_{3}and

*T*

_{4}(and

*T*

_{4}>

*T*

_{3}) and males at

*T*

_{5}and

*T*

_{6}(and

*T*

_{6}>

*T*

_{5}), like below:Also, note that the difference between slopes does not need to be that of males and females. This formula can be used to compare any two treatments or biological groups.

#### Controlling for unit differences and temperature across studies

^{−1}, g, s, min). Fortunately, lnRR and SMD (Table 1) already control for unitary differences. As such, we can apply similar logic that we discuss above to correct SMD for both temperature and unitary differences. Table 1 provides the formula for SMD, its pooled standard deviation (SD

_{p}) and sampling variance. Just like with ARR, we can apply our temperature correction to the SMD formula as follows:Here, we can see that the difference between

*M*

_{1}and

*M*

_{2}is standardised by the pooled SD, correcting for differences in measurement units. Additionally, dividing SMD by the temperature difference results in further correcting the effect size by the applied temperature difference.

*J*is a small sample correction (see Table 1). To derive the sampling variance for SMD

_{T}, we can apply the same principle as we did above to derive a sampling variance as:

#### The Delta method

*f*(

*X*) represents the function of the random variable

*X*, and importantly,

*f*′(

*X*) is the first derivative of

*f*(

*X*). Let us demonstrate this with a concrete example by re-deriving the sampling variance for lnRR defined as:Here,

*f*(

*M*

_{1})=ln

*M*

_{1}. By applying

*f*′(

*M*

_{1})=1/

*M*

_{1}(i.e. the first derivative of ln

*M*

_{1}is 1/

*M*

_{1}) to the Delta method and using the variance's basic properties, we have:Note that the first term, is once again the sampling variance for group 1, and we simply multiply this term by the square of the first derivative of

*M*

_{1}as described by the Delta method. Readers may notice that this formula is now equivalent to the sampling variance for lnRR (Table 1). We can extend this sampling variance to include dependency between the groups as:where

*r*is the correlation between ln

*M*

_{1}and ln

*M*

_{2}when organisms are measured multiple times (with

*r*often assumed to be 0.5), while

*r*is 0 when two sets of measurements are independent.

Finally, we note that, by using the basic properties of variance and the Delta method, one can derive sampling variance for most effect size measures. For instance, these methods allowed us to derive sampling variance for our ‘new’ effect sizes (e.g. ln*Q*_{10}; see ‘Comparing changes in mean physiological rates, *Q*_{10}’) and were used in Nakagawa et al. (2015) and Senior et al. (2020). We also refer the reader to Nakagawa et al. (2017a) for additional details on the Delta method.

## Acknowledgements

We would like to thank the Editorial board (Craig Franklin, Michaela Handel and Charlotte Rutledge) for inviting us to contribute to this special issue along with feedback on previous drafts. We would also like to thank Megan Head and two anonymous reviewers for very helpful comments on previous drafts.

## Footnotes

**Author contributions**

Conceptualization: D.W.A.N., S.N.; Methodology: D.W.A.N., S.N.; Software: D.W.A.N., S.N., M.L., P.P., R.E.O., S.D., S.B.; Validation: M.L., P.P., R.E.O., S.D., S.B.; Formal analysis: D.W.A.N., S.N., M.L., P.P., R.E.O., S.D., S.B.; Resources: D.W.A.N., S.N.; Data curation: M.L., P.P., R.E.O., S.D., S.B.; Writing - original draft: D.W.A.N., S.N.; Writing - review & editing: M.L., P.P., R.E.O., S.D., S.B.; Visualization: D.W.A.N., S.N., M.L., P.P., R.E.O., S.D., S.B.; Project administration: D.W.A.N., S.N.; Funding acquisition: D.W.A.N., S.N.

**Funding**

D.W.A.N., S.M.D., S.N., M.L. and R.E.O. were supported by Australian Research Council Discovery Projects (DP210101152 to D.W.A.N., DE180100202 to S.M.D., and DP200100367 to S.N., M.L. and D.W.A.N.). P.P. and S.B. were supported by a University of New South Wales Scientia PhD Scholarship. This work was supported by the Australian National University.

**Data availability**

Supplementary information, data, code and results associated with this paper can all be found at: https://daniel1noble.github.io/nuisance_heterogeneity/.

## References

*Trends Ecol. Evol.*

*Introduction to Meta-Analysis*

*:*

*Nat. Rev. Neurosci.*

*Hum. Commun. Res.*

*EcoEvoRxiv*

*The Handbook of Research Synthesis and Meta-Analysis*

*Method. Ecol. Evol.*

*Biol. Rev.*

*Am. Psychol.*

*Data Analysis Using Regression and Multilevel/Hierachical Models*

*Res. Synth. Methods*

*Ecology*

*Nature*

*Funct. Ecol.*

*Ecology*

*Stat. Med.*

*Funct. Ecol.*

*Behav. Ecol.*

*Handbook of Meta-Analysis in Ecology and Evolution*

*Handbook of Meta-Analysis in Ecology and Evolution*

*Ecology*

*Ecology*

*Ecology*

*J. Evol. Biol.*

*Evol. Ecol.*

*Biol. Rev.*

*Method. Ecol. Evol.*

*Method. Ecol. Evol.*

*J. R. Soc. Interface*

*BMC Biol.*

*Methods Ecol. Evol.*

*Res. Synth. Method.*

*Ecology*

*Mol. Ecol.*

*Biol. Rev.*

*Biol. Rev.*

*Biol. Rev.*

*Funct. Ecol.*

*J. Exp. Biol.*

*Method. Ecol. Evol.*

*Handbook of Meta-Analysis*

*Nat. Clim. Chang.*

*Ecology*

*Res. Synth. Methods*

*Ecology*

*Biol. Lett.*

*Proc. Natl. Acad. Sci. USA*

*Am. Nat.*

*Biol. Lett.*

*Glob. Change Biol.*

**Competing interests**

The authors declare no competing or financial interests.