## SUMMARY

We compiled published values of mammalian maximum oxygen consumption during exercise (*V̇*_{O2,max}) and supplemented these data with new measurements of *V̇*_{O2,max} for the largest rodent (capybara), 20 species of smaller-bodied rodents, two species of weasels and one small marsupial. Many of the new data were obtained with running-wheel respirometers instead of the treadmill systems used in most previous measurements of mammalian *V̇*_{O2,max}. We used both conventional and phylogenetically informed allometric regression models to analyze *V̇*_{O2,max} of 77 ‘species’ (including subspecies or separate populations within species) in relation to body size, phylogeny, diet and measurement method. Both body mass and allometrically mass-corrected *V̇*_{O2,max} showed highly significant phylogenetic signals (i.e. related species tended to resemble each other). The Akaike information criterion corrected for sample size was used to compare 27 candidate models predicting *V̇*_{O2,max} (all of which included body mass). In addition to mass, the two best-fitting models (cumulative Akaike weight=0.93) included dummy variables coding for three species previously shown to have high *V̇*_{O2,max} (pronghorn, horse and a bat), and incorporated a transformation of the phylogenetic branch lengths under an Ornstein–Uhlenbeck model of residual variation (thus indicating phylogenetic signal in the residuals). We found no statistical difference between wheel- and treadmill-elicited values, and diet had no predictive ability for *V̇*_{O2,max}. Averaged across all models, the allometric scaling exponent was 0.839, with 95% confidence limits of 0.795 and 0.883, which does not provide support for a scaling exponent of 0.67, 0.75 or unity.

## INTRODUCTION

The scaling of mammalian energy metabolism to body size has been a subject of scientific study for more than a century (Rubner, 1883) (for reviews, see Schmidt-Nielsen, 1975; Savage et al., 2004), and that history has included considerable controversy over both the scaling relationship *per se* and its mechanistic causality. In mammals, the range of metabolic measures is generally bracketed by basal metabolic rate (BMR) at the lower limit (Lovegrove, 2006) and maximum oxygen consumption elicited during whole-body exercise (*V̇*_{O2,max}) at the upper limit (Seeherman et al., 1981; Jones and Lindstedt, 1993; Levine, 2008; Hillman et al., 2012; Spurway et al., 2012). Although *V̇*_{O2,max} is not a measure of locomotor performance *per se* (Careau and Garland, 2012), it does set an upper limit to the intensity of work that can be sustained aerobically, and hence might often be subject to natural or sexual selection in the wild.

Across a wide range of mammalian diversity, it is clear that body mass (*M*_{b}) is the most important ‘global’ factor affecting physiological traits in general and metabolic rates in particular. *M*_{b} and metabolic rate scale non-linearly in good accord with a power function: metabolism=*a*×*M*_{b}^{X}, where *X* is commonly referred to as the allometric scaling exponent. For many decades the allometry of BMR and *V̇*_{O2,max} has sparked enormous interest and been much debated. Originally, a scaling exponent of 0.67 was suggested for the relationship between BMR and *M*_{b}, based on a comparison of domestic dogs of various sizes (Rubner, 1883). Scaling to *M*_{b}^{0.67} has also been proposed as theoretically appropriate given the geometry of surface–volume relationships, a concept often termed ‘geometric similarity’ (Heusner, 1985). Kleiber's analysis of an interspecific data set including birds and mammals pointed to a scaling exponent of ~3/4 for BMR (Kleiber, 1932). Since that pioneering paper, numerous other studies have suggested a scaling exponent of 0.75 (referred to by some as the ‘universal scaling exponent’) for metabolic traits. The emphasis on ‘three-quarter’ power scaling has been based on empirical findings, consideration of how surface areas and body composition change with body size, biomechanical ‘elastic similarity’ concepts and, most recently, mathematical modeling of the fractal nature of nutrient distribution ‘trees’ (Kleiber, 1932; McMahon, 1973; Weibel et al., 1991; Weibel et al., 1992; Weibel et al., 2004; West et al., 1997; Savage et al., 2004). Most of this work (particularly empirical testing of scaling theory) has focused on BMR or resting metabolic rate (RMR).
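Because the power function becomes linear after log transformation (log_{10} metabolism = log_{10} *a* + *X*×log_{10} *M*_{b}), the exponent can be estimated as the slope of a log–log regression. The following is a minimal sketch of that idea using ordinary least squares only; the function name and the synthetic data are illustrative assumptions and are not taken from this study.

```python
import math

def fit_power_law(masses, rates):
    """Fit rate = a * mass**X by OLS on log10-transformed values.

    Returns (a, X). Hypothetical helper for illustration only.
    """
    xs = [math.log10(m) for m in masses]
    ys = [math.log10(r) for r in rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx                     # slope = scaling exponent X
    intercept = mean_y - exponent * mean_x   # intercept = log10(a)
    return 10 ** intercept, exponent

# Synthetic, noise-free data generated from rate = 5 * mass**0.8 (not real data)
masses = [10.0, 100.0, 1000.0, 10000.0]
rates = [5.0 * m ** 0.8 for m in masses]
a, X = fit_power_law(masses, rates)
```

With noise-free input the fit recovers the generating coefficient and exponent; real comparative data would additionally require the phylogenetic models described in the Materials and methods.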

The maximal aerobic performance limit represented by *V̇*_{O2,max} has been less systematically examined, in part because its measurement often entails considerable technical challenges (see Thomas and Suthers, 1972; Seeherman et al., 1981) and hence the dataset is considerably smaller than that for BMR or RMR. Nevertheless, several studies have found that mammalian *V̇*_{O2,max} scaling deviates from the ‘universal’ 0.75 exponent, with exponents as high as 0.87 reported for exercise-induced *V̇*_{O2,max} (Taylor et al., 1981; Weibel et al., 1991; Weibel et al., 2004; Weibel and Hoppeler, 2005; White and Seymour, 2005; White et al., 2008). Despite the substantial number of theoretical and experimental estimates, there is still no agreed-upon allometric scaling exponent for exercise *V̇*_{O2,max} (Darveau et al., 2002; Hochachka et al., 2003; White and Seymour, 2005). Given that the evolution of enzymatic, cellular and/or organ-system flux capacities, distribution networks, etc., will be affected by the maximal demands upon the animal, greater attention to the scaling of *V̇*_{O2,max} is appropriate.

In studies of the upper limit to the intensity of aerobic power production, *V̇*_{O2,max} has become something of a benchmark (Wagner, 2012). Numerous studies, especially in the exercise physiology literature, have attempted to explain the mechanistic foundations of, and limiting factors for, *V̇*_{O2,max} in mammals [e.g. Jones, 1998; Hochachka et al., 2003; Rezende et al., 2006 (and references therein); White et al., 2008; Longworth et al., 1989; Spurway et al., 2012] and other taxa (Hillman et al., 2012). Various potential constraints, such as aerobic enzyme catalytic capacity, muscle mitochondrial density, microvasculature, capillary volume, hematocrit, cardiac output and lung O_{2} diffusion capacities have been considered in investigations of *V̇*_{O2,max} (Weibel et al., 2004; White and Seymour, 2005). Theoretical limitations, including O_{2} flux, network design and temperature effects have also been discussed (West et al., 1997; White et al., 2008; Hillman et al., 2012). Despite a long list of proposed explanatory factors and measurements from species spanning a wide mass range (e.g. Taylor et al., 1981; Garland and Huey, 1987), few solid conclusions regarding either scaling exponents or the functional factors limiting *V̇*_{O2,max} have been reached (but see Noakes, 2011; Wagner, 2011; Hillman et al., 2012; Spurway et al., 2012).

Another problematic issue in studies of *V̇*_{O2,max} is the historical reliance on conventional statistical methods (least-squares regression and analysis of covariance) to estimate mass-scaling exponents. An inherent, usually implicit, assumption of conventional regression models is that all species in the analysis are equally related and hence provide statistically independent data points. Incorporating phylogenetic relationships accounts for the biological reality that all species are not equally related (Garland et al., 2005; Nunn, 2011; Rezende and Diniz-Filho, 2012) and that shared evolutionary history may affect differences among species. Various phylogenetic statistical methods have been incorporated into a number of comparative studies of physiology, biochemistry, morphology and behavior, including a recent analysis of mammalian basal metabolic rate (White et al., 2009). However, no comprehensive study of mammalian *V̇*_{O2,max} has employed phylogenetically aware statistics.

To date, most tests of exercise *V̇*_{O2,max} have been performed on treadmills but some recent studies have utilized running wheel respirometry chambers (e.g. Chappell and Bachman, 1995; Chappell et al., 2004; Chappell and Dlugosz, 2009; Dlugosz et al., 2009; Dlugosz et al., 2012). Different methodologies can provide unequal *V̇*_{O2,max} values in treadmill tests (e.g. Kemi et al., 2002) and it is possible that different values could be elicited by exercise in treadmills *versus* wheels. Hence it is appropriate to perform robust comparisons of the two methods. Here, we include new data on a number of species tested in wheel respirometers and provide the first large multispecies analysis of wheel- *versus* treadmill-elicited *V̇*_{O2,max} in small mammals.

Beyond studying mass scaling and comparing wheel and treadmill methods, we had several additional goals. We included new data for 21 rodent species (including the largest rodent, the capybara *Hydrochoerus hydrochaeris*), the small marsupial *Dromiciops gliroides* and the smallest member of the order Carnivora, the least weasel *Mustela nivalis*, in addition to data compiled from the literature, for a total of 77 ‘species’ (including some subspecies or separate populations within species). We searched for phylogenetic signal (the tendency of related species to resemble each other) in body mass-corrected *V̇*_{O2,max}, and in particular tested whether three species previously reported to have particularly high *V̇*_{O2,max} (the bat *Phyllostomus hastatus*, the domestic horse *Equus caballus* and the pronghorn *Antilocapra americana*) (Thomas and Suthers, 1972; Lindstedt et al., 1991; Jones and Lindstedt, 1993; Weibel and Hoppeler, 2005) are statistically unusual when accounting for phylogenetic relationships. We tested for an effect of diet type (five categories) on *V̇*_{O2,max}, as has been reported for basal metabolism (e.g. McNab, 1992; Muñoz-Garcia and Williams, 2005; White and Kearney, 2013). Finally, we estimated the allometric scaling exponent for mammalian *V̇*_{O2,max} through a model-averaging approach.

## MATERIALS AND METHODS

### Data collection

We used literature searches (Google, Google Scholar, Web of Science, PubMed) for ‘*V̇*_{O2,max}’ and ‘aerobic capacity’ to compile published values of mammalian exercise-induced *V̇*_{O2,max} and body mass for 54 ‘species’ [including some subspecies or separate populations within species (Nowak, 1999)] (supplementary material Table S1) (Segrem and Hart, 1967; Pasquis et al., 1970; Wunder, 1970; Hoppeler et al., 1973; Lutton and Hudson, 1980; Hiernaux and Ghesquiere, 1981; Elsner and Ashwell-Erikson, 1982; Williams, 1983; Chappell and Snyder, 1984; Chappell et al., 1995; Chappell et al., 2004; Chappell et al., 2007; MacMillen and Hinds, 1992; Ong, 1993; Williams et al., 1993; Dohm et al., 1994; Schaeffer et al., 2001; Dawson et al., 2004; Sadowska et al., 2005; Sadowska, 2009; Gaustad et al., 2010). We chose to use the data for horses (*Equus caballus*) from Kayar et al. (Kayar et al., 1989) in order to maximize comparability with the data for steer (*Bos taurus*) included in the same study. Since that study, several studies have published data on horse *V̇*_{O2,max} that indicate higher values (e.g. Nieto et al., 2009; Ohmura et al., 2010). Use of these alternative values would not change our overall statistical results, as the horse turns out to be a large, positive statistical outlier in any case (see Results).

We also collected new data for an additional 24 species (Table 1), as described below. At least two individuals of each species were tested. For agoutis (*Dasyprocta cristata*) and capybaras, standard and maximal rates of O_{2} consumption (*V̇*_{O2}) were measured using a flow-through system calibrated using the N_{2} dilution technique (Fedak et al., 1981). Exercise metabolism in both species was measured in enclosed boxes resting on motorized treadmills (agoutis: 1.5 m long×0.3 m wide box; capybaras: 2.1 m long×0.8 m wide box). Agoutis wore a lightweight plastic mask that captured expired gas. Air was pulled through the mask at 90 l min^{−1}. Capybaras wore no mask; instead, air was pulled through the respirometry box (flow rate 250–300 l min^{−1}; volume ~180 l after excluding the animal's body volume). *V̇*_{O2} was measured in agoutis over a period of 1–2 h while they rested within the box; in capybaras, it was measured over a period of 6–12 h at different times of the day and after the animals had been kept off feed for 12–24 h. After a warm-up, animals were run at high speed for 1–2 min (agoutis) or 2–3 min (capybaras) while *V̇*_{O2} was recorded. Only one high-speed run was performed per day. The *V̇*_{O2,max} was identified as the highest *V̇*_{O2} when further increases in speed elicited no increase in *V̇*_{O2}, when the respiratory exchange ratio exceeded 1.0 (capybaras) and at running speeds (1.8 m s^{−1} for agoutis, <3 m s^{−1} for capybaras) at which plasma lactate accumulation rates approached 7–10 mmol l^{−1} min^{−1} [as suggested previously (Seeherman et al., 1981)].

In addition to capybaras and agoutis, we measured *V̇*_{O2,max} in 22 species of small mammals (Table 1). Wild specimens were live-trapped at several locations in western USA (including Boyd Deep Canyon Research Center, the Motte Rimrock Reserve and the University of California, Riverside campus, all in Riverside County, CA, USA, and the Sierra Nevada Aquatic Research Laboratory in Mono County, CA, USA), northeastern Poland (the Mammal Research Institute in Bialowieza, and the University of Bialystok's Gugny field station) and central and southern Chile (Santiago and Valdivia). Captured animals were immediately transported to the laboratory and their *V̇*_{O2,max} was measured within 24 h of capture (for most species, within 4 h of capture). Food, water, bedding and shelter were available to animals prior to tests. Wild-caught individuals were released at the site of capture.

Capture, handling and measurement protocols were approved by the University of California, Riverside, University of California, Davis, and Harvard University Institutional Animal Care and Use Committee, the California Department of Fish and Game, the Polish Nature Conservancy authorities (permit nos DOLPiK-po/ogiz-4200/IV-6.1/2979/8742/07/aj, LKE42/2008 and LKE28/2009) and the Chilean Servicio Agricola y Ganadero (permit 444-2007), and conform to US National Institutes of Health Guidelines (NIH publication 78-23) and US, Polish and Chilean laws.

### Wheel respirometry

We used forced exercise in enclosed running wheel respirometers to elicit *V̇*_{O2,max} in the 22 species of small mammals mentioned above (Chappell and Bachman, 1995). Wheels were supplied with dry air under positive pressure from upstream mass flow controllers (Tylan, Bedford, MA, USA or Sensirion, Staefa, Switzerland) calibrated against an accurate dry volume meter. Flow entered and exited the wheels through airtight axial bearings. A dispersive manifold on the incurrent side helped ensure even perfusion of the wheel volume, and mixing was also assisted by animal motion and the rotation of the wheel. A subsample of excurrent air (about 150 ml min^{−1}) was dried with Drierite, scrubbed of CO_{2} with soda lime, redried, and flowed through an oxygen analyzer (Sable Systems Oxzilla or FC-10, Las Vegas, NV, USA). Outputs from the instruments were digitized by an A–D converter (Sable Systems UI-2) and recorded every 1.0 s on a computer running LabHelper software (Warthog Systems, www.warthog.ucr.edu).

California ground squirrels (*Otospermophilus beecheyi*; >700 g) were tested in a wheel with an internal volume of 57 l (54 cm diameter×25 cm width) and a flow rate of 23 l min^{−1} at standard temperature and pressure (STP). Species weighing 120–400 g were measured in a 9 l wheel (32 cm diameter×11 cm width) at 5 l min^{−1} STP. A 3.8 l wheel (24 cm diameter×8.5 cm width) and flow rates of 2–2.5 l min^{−1} STP were used for 60–120 g animals, and for smaller species we used a 1.5 l wheel (16.5 cm diameter×7 cm width) at flow rates of 1.5–2 l min^{−1}. The two large wheels were padded with carpet to provide traction and prevent injury; smaller wheels were lined with friction tape. Flow was measured upstream of the wheels and CO_{2} was absorbed prior to O_{2} measurements, so we computed oxygen consumption as:

*V̇*_{O2} = *V̇* × (*F*I_{O2} − *F*E_{O2}) / (1 − *F*E_{O2})

Here, *V̇* is flow rate (STP), *F*I_{O2} is the fractional content of O_{2} in incurrent gas (0.2095), and *F*E_{O2} is the fractional content of O_{2} in excurrent gas. Because tests were generally too brief to attain steady-state conditions, we used the ‘instantaneous’ adjustment (Bartholomew et al., 1981) to accurately measure rapid changes in *V̇*_{O2}. Effective volumes (derived from washout kinetics during rotation) for the four wheels at the flow rates described above were 56, 8.3, 3.1 and 0.9 l, respectively.
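The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' software: the oxygen-consumption function uses the standard form for upstream flow measurement with CO_{2} scrubbed before the analyzer, and the washout correction uses the commonly cited first-order form of the ‘instantaneous’ adjustment, in which each reading is extrapolated to its equilibrium value using the chamber's effective volume. Function and variable names are hypothetical.

```python
import math

def vo2(flow, fe_o2, fi_o2=0.2095):
    """O2 consumption (same volume/time units as flow).

    Assumes flow is measured upstream and CO2 is absorbed before O2 analysis.
    """
    return flow * (fi_o2 - fe_o2) / (1.0 - fe_o2)

def instantaneous_fe(fe_series, flow, effective_volume, dt):
    """Washout-corrected excurrent O2 fractions (one fewer than the input).

    flow and effective_volume share volume units; dt uses flow's time unit.
    Assumes a well-mixed chamber with first-order washout kinetics.
    """
    z = 1.0 - math.exp(-flow * dt / effective_volume)
    corrected = []
    for prev, curr in zip(fe_series, fe_series[1:]):
        # Extrapolate the change over one interval to its equilibrium value.
        corrected.append(prev + (curr - prev) / z)
    return corrected
```

For example, for the smallest wheel one might pass `flow=2.0` (l min^{−1}), `effective_volume=0.9` (l) and `dt=1/60` (min) to match the 1.0 s sampling interval, then feed the corrected fractions to `vo2`.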

To measure *V̇*_{O2,max}, we weighed animals (±0.1 g; ±1 g for California ground squirrels) and placed them into the chamber. A reference reading of unbreathed air was obtained, after which we recorded ‘resting’ *V̇*_{O2} for several minutes with the wheel locked. Wheel rotation began at low RPM. After animals oriented appropriately, we increased rotation speed approximately every 30 s while monitoring behavior and *V̇*_{O2}. Rotation was halted when animals were no longer able to maintain position or *V̇*_{O2} did not increase with further speed increases. We recorded *V̇*_{O2} for several minutes during the recovery period and then took a second reference reading. After applying the ‘instantaneous correction’, we calculated *V̇*_{O2,max} as the highest 1 min running average of *V̇*_{O2} during the period of forced exercise. All tests were performed at room temperature (18–23°C). Whenever possible, repeated measurements were taken on the same individual, and the highest of the measurements was used in analyses.
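The ‘highest 1 min running average’ criterion is straightforward to express in code. The sketch below assumes a trace sampled once per second (as stated for the recording system), so that a 1 min window is 60 samples; the function name is hypothetical.

```python
def max_running_mean(samples, window=60):
    """Return the highest mean over any `window` consecutive samples."""
    if len(samples) < window:
        raise ValueError("trace is shorter than the averaging window")
    running = sum(samples[:window])      # sum of the first window
    best = running
    for i in range(window, len(samples)):
        # Slide the window forward by one sample.
        running += samples[i] - samples[i - window]
        best = max(best, running)
    return best / window
```

A sliding sum avoids recomputing each window from scratch, which matters little for a trace of a few minutes but keeps the logic explicit.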

### Phylogenetic tree construction

We constructed the phylogenetic tree using Mesquite (Maddison and Maddison, 2009) and phylogenetic hypotheses from several previously published studies (Table 2). The basic tree structure is from Meredith et al. (Meredith et al., 2011). Within-clade structure was determined according to the sources listed in Table 2. Arbitrary branch lengths were set according to the method of Pagel (Pagel, 1992). Electronic versions of the tree are available in supplementary material Tables S3 and S4.

### Statistical analyses

We first tested for phylogenetic signal in log_{10} body mass and in log_{10} mass-corrected *V̇*_{O2,max} following Blomberg et al. [see pp. 720–721 (Blomberg et al., 2003)]. In brief, *V̇*_{O2,max} was divided by body mass raised to the scaling exponent determined from a regression performed with phylogenetically independent contrasts; this quantity was then log-transformed. We used the PHYSIG_LL.m Matlab program provided by Blomberg et al. (Blomberg et al., 2003) to calculate the *K*-statistic, the randomization test for phylogenetic signal based on the mean squared error, and the likelihood of the specified phylogenetic tree (see ‘Phylogenetic tree construction’ above), and an assumed model of Brownian motion-like trait evolution, *versus* the likelihood of a star phylogeny. The *K*-statistic is measured on the interval of zero to infinity and indicates the amount of phylogenetic signal relative to a Brownian motion expectation of 1.00 (Blomberg et al., 2003; Revell et al., 2008). Values below unity indicate less tendency for related species to resemble each other than expected under a Brownian motion model of character evolution, whereas values above one indicate more than expected. It is important to note that values substantially below unity can still be associated with statistically significant phylogenetic signal, based on the randomization test developed by Blomberg and colleagues (Blomberg et al., 2003).

We computed (multiple) regressions in three ways (reviewed in Garland et al., 2005; Lavin et al., 2008): conventional, non-phylogenetic, ordinary least squares (OLS); phylogenetic generalized least squares (PGLS); and regression in which the residuals are modeled as having evolved *via* an Ornstein–Uhlenbeck process (RegOU), which is intended to mimic stabilizing selection on the specified phylogenetic tree. Values of *r*^{2} were calculated using eqn 2.3.16 (p. 32) in Judge et al. (Judge et al., 1985) [see p. 546 of Lavin et al. (Lavin et al., 2008)]. These three models form a continuum between assuming a star phylogeny with no hierarchical structure (OLS), a specified phylogeny (PGLS) and a phylogeny whose branch lengths are altered such that it can take on values intermediate between the star and the original phylogeny, or even become more strongly hierarchical than the original tree (RegOU). The RegOU model contains an additional parameter, *d*, that estimates the transformation of the phylogenetic tree (Blomberg et al., 2003; Lavin et al., 2008). Hence, its fit can be compared with the OLS or PGLS models by a ln maximum likelihood ratio test, where twice the difference in the ln maximum likelihood is assumed (asymptotically) to be distributed as a χ^{2} with 1 d.f., for which the critical value at α=0.05 is 3.841. Similar tests can be used to compare the fit of models within the OLS, PGLS or RegOU classes when they contain nested subsets of independent variables (e.g. Lavin et al., 2008; Gartner et al., 2010).
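The likelihood ratio test described above reduces to simple arithmetic. The following sketch assumes two nested models differing by one parameter (for example RegOU *versus* PGLS); the function name and the example values are hypothetical. For 1 d.f., the χ^{2} upper-tail probability can be written with the complementary error function, which avoids any dependency beyond the standard library.

```python
import math

CHI2_CRIT_1DF = 3.841  # critical value at alpha=0.05 with 1 d.f., as quoted in the text

def likelihood_ratio_test(lnml_full, lnml_reduced):
    """Return (statistic, P-value, significant) for models differing by 1 d.f."""
    stat = 2.0 * (lnml_full - lnml_reduced)
    # Upper-tail probability of a chi-squared variate with 1 d.f.
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value, stat > CHI2_CRIT_1DF

# Hypothetical ln maximum likelihoods, for illustration only
stat, p_value, significant = likelihood_ratio_test(-10.0, -13.0)
```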

All regression models included log_{10} body mass as a predictor of log_{10} *V̇*_{O2,max}. Additional candidate independent variables were wheel *versus* treadmill measurement (WHEEL), a variable coding for clade membership (CLADE2), which subdivided the tree into a total of 12 monophyletic groups (see color coding in Fig. 1), and thus involved 11 dummy variables, and three dummy variables coding for species previously shown to have a high *V̇*_{O2,max} for their body size (pronghorn, horse and bat). Although it has been noted that domestic dogs have a relatively high *V̇*_{O2,max} (e.g. Weibel and Hoppeler, 2005), they were not singled out for analysis here because the data set includes three other canids, and canids in general have relatively high *V̇*_{O2,max} for their body mass (Weibel et al., 1983; Garland and Huey, 1987; Longworth et al., 1989; Bicudo et al., 1996).
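As an illustration of how such indicator (dummy) variables enter a regression, the sketch below builds one row of a design matrix containing an intercept, log_{10} body mass and the three species indicators. The column order, function name and the example mass are assumptions made for this example and do not reproduce the authors' Matlab input files.

```python
import math

INDICATORS = ("BAT", "PRONG", "HORSE")  # single-species dummy variables

def design_row(mass_g, flags=()):
    """Return [intercept, log10 mass, BAT, PRONG, HORSE] for one species."""
    dummies = [1.0 if name in flags else 0.0 for name in INDICATORS]
    return [1.0, math.log10(mass_g)] + dummies

# Hypothetical masses, for illustration only
ordinary_species = design_row(1000.0)
flagged_species = design_row(1000.0, flags={"HORSE"})
```

A categorical variable with 12 levels, such as CLADE2, would be coded the same way with 11 indicator columns, the omitted level serving as the reference group.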

We also considered testing for differences between wild-caught *versus* domestic forms. However, we found it problematic to code some species because of a lack of information concerning their origin (e.g. fox, *Vulpes vulpes*) or ambiguity as to the appropriate category (e.g. some human populations). In the first large-scale comparative study of mammalian *V̇*_{O2,max}, Taylor et al. [see p. 35 (Taylor et al., 1981)] noted that ‘… domestic animals provide the extremes of adaptation for oxygen demand within a size class, and it is interesting to note that the wild animals generally fall midway between these extremes … For this reason, including the domestic animals in the allometric relationship between *V̇*_{O2,max} and [body mass] of wild animals changes the scaling factor little, but does increase the range of the 95% confidence interval for the coefficient and the exponent of the equation.’ Although we did not attempt the comparison in this paper, all of the raw data are presented, so readers can perform such comparisons as desired.

We utilized a model selection approach to objectively choose among models corresponding to different combinations of adaptive hypotheses for each of the regression methods (OLS, PGLS and RegOU) (e.g. see Gartner et al., 2010). For each model, we report the ln maximum likelihood, Akaike information criterion [AIC=(−2×ln maximum likelihood)+(2×number of parameters)], and AIC corrected for small sample size [AICc=(−2×ln maximum likelihood)+(2×*p*×*n*)/(*n*−*p*−1)], where *p* is the number of parameters and *n* is the sample size (in these formulations, smaller numbers indicate better-fitting models) (see Burnham and Anderson, 2002). Because AICc converges to AIC with larger sample sizes, it is recommended to always use AICc to select the best model (Burnham and Anderson, 2002). When comparing a series of models, nested or not, the one with the lowest AICc is considered to be the most parsimonious. Note that maximum likelihoods are used for computing AIC and likelihood ratio tests, whereas restricted maximum likelihood (REML) is used for estimating coefficients in the model, such as the allometric scaling exponent. REML estimates of the OU transformation parameter, *d*, are also reported. All of the regression models were computed using the Matlab Regressionv2.m program (Lavin et al., 2008).
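The two criteria can be written directly from the formulas given above. This is a minimal sketch with hypothetical function names; the example likelihood and parameter count are invented for illustration.

```python
def aic(ln_ml, p):
    """Akaike information criterion from ln maximum likelihood and p parameters."""
    return -2.0 * ln_ml + 2.0 * p

def aicc(ln_ml, p, n):
    """AIC corrected for small sample size n, in the form given in the text."""
    if n - p - 1 <= 0:
        raise ValueError("AICc is undefined unless n > p + 1")
    return -2.0 * ln_ml + 2.0 * p * n / (n - p - 1)

# Hypothetical model: ln ML = 24.0, 5 parameters, 77 data points
example_aic = aic(24.0, 5)
example_aicc = aicc(24.0, 5, 77)
```

The penalty term 2*pn*/(*n*−*p*−1) is algebraically identical to the more familiar 2*p*+2*p*(*p*+1)/(*n*−*p*−1), so AICc always exceeds AIC by a margin that shrinks as *n* grows.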

We also calculated an Akaike weight (*w*_{i}), or model probability, for each model. For the whole set of models considered, the *w*_{i} sum to 1. For these models, *w*_{i} is the probability that model *i* would be selected as the best-fitting model if the data were collected again under identical circumstances. As another way to quantify the strength of evidence, we calculated the evidence ratio (ER) between the best model and each model in the set:

ER = *w*_{best} / *w*_{i}

ER has a ‘raffle ticket’ interpretation, in the sense that an ER value of 8 means that the best model has 8 tickets, whereas the other model only has one (Anderson, 2008). Another way to say this is that the ER indicates the relative amount of evidence favoring the best model over the others.
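Akaike weights and evidence ratios follow from the AICc values alone. The sketch below uses the standard definitions from Burnham and Anderson (2002), in which each weight is proportional to exp(−Δ_{i}/2) and Δ_{i} is the difference between a model's AICc and the smallest AICc in the set; function names and the example values are hypothetical.

```python
import math

def akaike_weights(aicc_values):
    """Return Akaike weights (summing to 1) for a list of AICc values."""
    best = min(aicc_values)
    rel_likelihoods = [math.exp(-0.5 * (v - best)) for v in aicc_values]
    total = sum(rel_likelihoods)
    return [r / total for r in rel_likelihoods]

def evidence_ratios(weights):
    """Return the weight of the best model divided by each model's weight."""
    top = max(weights)
    return [top / w for w in weights]

# Hypothetical AICc values for three candidate models
weights = akaike_weights([100.0, 102.0, 110.0])
ratios = evidence_ratios(weights)
```

In this example the second model lies 2 AICc units above the best, giving an evidence ratio of e (about 2.7) in favor of the best model.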

Because the *w*_{i} are probabilities, it is possible to sum these for models containing given variables to identify the variables that are more strongly represented across all well-supported models and are thus more likely to have important predictive value (Burnham and Anderson, 2002). For instance, if one considers diet (here coded as five categories in the DIET1 variable: 1=carnivore, 2=herbivore, 3=omnivore, 4=insectivore, 5=piscivore), one can calculate the sum of the *w*_{i} of all the models including diet, and this is the probability that, of the variables considered, diet would be included in the best approximating model were the data collected again under identical circumstances. This procedure allows one to compare the relative importance of candidate independent variables.

Finally, we used model averaging to estimate the allometric exponent of *M*_{b}. For every model, estimated coefficients and standard errors (s.e.) are conditional on the model being correct, but if we are unsure about the model structure (because more than one model has a relatively low AICc), then these estimates should incorporate this source of uncertainty. Therefore, we weighted the allometric exponent estimates from alternative candidate models by the evidence for the respective models (e.g. measured as *w*_{i}), and averaged across models (Burnham and Anderson, 2002).
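Both procedures, summing weights by variable and averaging a coefficient across models, can be sketched as follows. The data structure, function names and all numbers are hypothetical. The standard error uses the unconditional estimator described by Burnham and Anderson (2002), which adds between-model variation to each model's conditional variance; whether the authors used exactly this estimator is an assumption.

```python
import math

def summed_weight(models, variable):
    """Relative importance: total Akaike weight of models containing `variable`."""
    return sum(m["weight"] for m in models if variable in m["variables"])

def model_average(models, key="exponent", se_key="se"):
    """Akaike-weighted mean of a coefficient and its unconditional s.e."""
    estimate = sum(m["weight"] * m[key] for m in models)
    se = sum(
        m["weight"] * math.sqrt(m[se_key] ** 2 + (m[key] - estimate) ** 2)
        for m in models
    )
    return estimate, se

# Hypothetical candidate set (weights sum to 1); not the study's results
models = [
    {"variables": {"MASS", "BAT"}, "weight": 0.5, "exponent": 0.80, "se": 0.02},
    {"variables": {"MASS", "BAT", "WHEEL"}, "weight": 0.3, "exponent": 0.82, "se": 0.02},
    {"variables": {"MASS"}, "weight": 0.2, "exponent": 0.70, "se": 0.03},
]
estimate, se = model_average(models)
wheel_importance = summed_weight(models, "WHEEL")
```

Because the candidate exponents disagree, the averaged standard error is larger than any single model's conditional value, which is the intended effect of incorporating model-selection uncertainty.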

## RESULTS

The *K*-statistic (Blomberg et al., 2003) indicated a high level of phylogenetic signal for log_{10} body mass (*K*=1.322), and the randomization test for statistical significance based on the mean squared error was highly significant (*P*<0.0005). The ln maximum likelihood for the specified phylogeny was −88.40 *versus* −135.65 for a star phylogeny, thus indicating a much better fit of the former to the data for log_{10} body mass. For log_{10} mass-corrected *V̇*_{O2,max}, the *K*-statistic was substantially lower (*K*=0.532), but the randomization test again indicated strong statistical significance (*P*<0.0005). The ln maximum likelihood for *V̇*_{O2,max} on the specified phylogeny was 24.38 *versus* 11.79 for a star phylogeny, indicating a better fit of the former.

In all 27 regression models (see supplementary material Table S2 for full results), AICc indicated that the PGLS and RegOU models fit the data better than their OLS (non-phylogenetic) counterparts. Likelihood ratio tests indicated that in all cases, the RegOU models fit the data significantly better than the OLS models (all *P*<0.03). Stated another way, a model that assumed a star phylogeny was never the most appropriate. These results highlight the importance of accounting for hypothesized phylogenetic relationships in the statistical analyses.

The top three models that accounted for 98% of the cumulative evidence (cumulative *w*_{i}) are all RegOU models (Table 3). In addition to log_{10} body mass (included in all models), the most influential independent variables were the indicators coding for the bat (BAT), pronghorn (PRONG) and horse (HORSE). Likewise, *P*-values associated with BAT, PRONG and HORSE in the top three models were always statistically significant (all *P*<0.02; and see Fig. 2). These three species have long been viewed as having very high *V̇*_{O2,max} for their body mass, but this has not been subjected to formal statistical tests that incorporate phylogenetic information (Thomas and Suthers, 1972; Lindstedt et al., 1991; Jones and Lindstedt, 1993; Weibel and Hoppeler, 2005). Our results with phylogenetically aware analyses confirm that the bat, pronghorn and horse have unusually high aerobic capacity.

The independent variable WHEEL (indicating wheel *versus* treadmill data collection method) was found in the second-ranked model, but its cumulative weight across all models was small (only 0.247; supplementary material Table S2). Additionally, the *P*-value for WHEEL in this model did not approach statistical significance (*P*=0.48). DIET1 and CLADE2 did not appear in the top three models. The cumulative weight of DIET1 across all models was 0.005 and the cumulative weight of CLADE2 (which, in order of AICc values, does not appear until model 20 of 27) was only 0.000005.

Finally, values from all models were averaged to estimate the allometric scaling exponent: 0.839±0.022 (±s.e.). The corresponding 95% confidence interval is 0.795 to 0.883 (based on degrees of freedom provided by the most inclusive model: d.f.=64).
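The reported interval can be checked by hand from the estimate, its standard error and the degrees of freedom. The sketch below uses the values stated above; the critical *t* value for 64 d.f. is a rounded tabulated constant supplied here as an assumption, and the variable names are hypothetical.

```python
T_CRIT_64DF = 1.998  # two-tailed 5% critical t for 64 d.f. (rounded, tabulated)

def confidence_interval(estimate, se, t_crit):
    """Return (lower, upper) limits of estimate +/- t_crit * se."""
    half_width = t_crit * se
    return estimate - half_width, estimate + half_width

# Model-averaged exponent and s.e. as reported in the Results
lower, upper = confidence_interval(0.839, 0.022, T_CRIT_64DF)

# Candidate theoretical exponents discussed in the text
candidates = {"geometric (0.67)": 0.67, "three-quarter (0.75)": 0.75, "unity": 1.0}
inside = {name: lower <= value <= upper for name, value in candidates.items()}
```

The limits round to 0.795 and 0.883, and none of the three theoretical exponents falls inside the interval, in agreement with the statement in the Summary.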

If a single equation is desired to predict log_{10} *V̇*_{O2,max} (in ml O_{2} [STPD] h^{−1}), then the first one in Table 3 can be used, where body mass is in grams.

## DISCUSSION

Our findings strongly underscore the importance of utilizing phylogenetically informed statistics when evaluating comparative data. Using three versions of a given statistical model (i.e. a given set of independent variables), we determined which model (OLS, PGLS or RegOU) best fitted our dataset, based on the Akaike information criterion corrected for small sample size (AICc). log_{10} Body mass was used in all models; it has been previously shown to have high phylogenetic signal (the tendency for related species to resemble each other), and our estimate of the *K*-statistic (*K*=1.322) agrees with this general result. Importantly, mass-corrected *V̇*_{O2,max} also had highly significant phylogenetic signal (*P*<0.001), but at a lower level (*K*=0.532), also consistent with previous studies of other physiological traits, such as BMR (Blomberg et al., 2003; Rezende et al., 2004; Rezende and Diniz-Filho, 2012).

The best model (smallest AICc) included body mass, BAT, PRONG and HORSE as independent variables. To identify their relative importance, all independent variables used in the analyses were weighted according to the predictive values of the models (i.e. Akaike weights) in which they appeared, and summed across all models (Burnham and Anderson, 2002) (see supplementary material Table S2). The top three independent variables (not including body mass, which has a relative importance of 100% because it is present in all models) were PRONG, BAT and HORSE. For each of these, there is >95% probability that of all the models considered, these variables will be present in the best model (if data were collected again under identical circumstances). Thus, the very high *V̇*_{O2,max} of bats, pronghorns and horses is substantially different from that of the rest of the included species, supporting the conclusions of several previous studies (e.g. Thomas and Suthers, 1972; Lindstedt et al., 1991; Jones and Lindstedt, 1993; Weibel and Hoppeler, 2005). Although these results might be viewed as largely confirmatory, it is important to note that in conventional statistical analyses, potentially unusual species (coded as dummy variables) are compared with all other species in the data set in an equally weighted fashion, ignoring phylogenetic position. With the methods we used, however, as with phylogenetically independent contrasts (see Garland and Janis, 1993; Garland and Adolph, 1994; Garland and Ives, 2000; Rezende and Diniz-Filho, 2012), a focal species is, in effect, compared most directly with its nearest relative in the data set, and less directly with species that are further removed in terms of phylogenetic relationships. In other words, the statistical comparisons are performed in a phylogenetically principled fashion.

Note that other models, not presented here, may also be of interest. For example, the original reason for studying the capybara (the largest extant rodent) was to test whether the scaling of *V̇*_{O2,max} might differ between rodents and other mammals. Conventional statistical analyses of our data set for 77 species indicate that the scaling exponent is not statistically different between rodents and other mammals, but that rodents on average have a lower *V̇*_{O2,max} (whether or not BAT, PRONG and HORSE are included as dummy variables). For Canidae, conventional statistical analyses indicate a higher average *V̇*_{O2,max} as compared with other mammals (also whether or not BAT, PRONG and HORSE are included as dummy variables).

Out of the 27 models, the top four were RegOU models. REML *d* values associated with each model were considerably greater than zero, indicating that some intermediate transformation of the branch lengths (between a star phylogeny and the specified phylogeny) yielded the best fit of the regression model to the data. This finding demonstrates phylogenetic signal in mass-corrected V̇O2max. Even after accounting for three species with a particularly high V̇O2max (bat, pronghorn, horse) – two of which are dummy-coded individually for the CLADE2 variable because each is the only member of its lineage included in our data set – CLADE2 was not among the most important variables in the models. As the cumulative weight of CLADE2 is extremely low (<0.001%) and the top three models all indicate the presence of phylogenetic signal in the residuals, we conclude that phylogenetic signal in V̇O2max cannot be adequately accounted for by modeling differences among the 12 major clades identified for these analyses. Rather, the tendency for related species to have a somewhat similar V̇O2max is distributed throughout the phylogeny. It is also important to note that all models including only body mass had very low support, and the worst model of all was an OLS regression with only body mass as an independent variable (see supplementary material Table S2).

Diet has been implicated in interspecific differences in metabolic rates, particularly in BMR (e.g. McNab, 1992; Muñoz-Garcia and Williams, 2005; White and Kearney, 2013). This effect is proposed to derive from differences in the nutrient content, digestibility, or predictability and abundance of various food types, which somehow lead to variation in the selective regime that acts on BMR. For example, diets of low caloric value might lead to selection that favors reduced BMR because it lowers the overall energy requirements of an animal. In addition, selection related to diet variation could lead to the evolution of other traits (e.g. gut size, body composition, circulating hormone concentrations) (Konarzewski and Książek, 2013) that affect whole-animal BMR. Diet is also thought to be associated with a suite of other traits, including home range size, which may also show strong phylogenetic signal (Garland et al., 1993; Muñoz-Garcia and Williams, 2005; Rezende et al., 2004). However, we found diet to be of little importance in our analysis of V̇O2max, considering all models (Table 3). It is not clear why diet might be directly related to exercise V̇O2max, but diet is likely related to body size (Rezende et al., 2004), which is accounted for in every model in our analysis. Unsurprisingly, ANOVA of body mass in relation to diet in the 77 ‘species’ in our study indicates statistically significant differences in mass among diet types, irrespective of the assumed model of evolution (Table 4).

Intuitively, differences in measurement technique might be expected to influence such performance measures as V̇O2max (e.g. mice running uphill on treadmills had higher V̇O2max than when running on the level) (Kemi et al., 2002). The species in our study were tested with either traditional treadmill exercise techniques or forced running in enclosed wheel respirometers. In terms of practicality, wheel respirometers offer several advantages over treadmills for measuring exercise V̇O2max in small (<1 kg) animals. Wheels are smaller, simpler and less expensive than typical treadmills. Most individuals require no training to exercise intensively in a wheel, and many species, particularly rodents, spontaneously run in them. Wheels lack a treadmill's characteristic interface between rigid walls and moving tread, and hence are less likely to cause injury, particularly at the high speeds typically necessary to elicit V̇O2max. These benefits are substantial, but it is important to verify whether results from wheels are similar to those from treadmills. In our analysis, the variable WHEEL (coding for measurements made using a wheel *versus* a treadmill) was included in the second-best model (Table 3), but the significance level was not close to 0.05, and the relative importance of WHEEL averaged across all models was quite low. Therefore, we conclude that V̇O2max was not significantly affected by testing using treadmills *versus* running wheels. We caution, however, that wheel respirometers have thus far been used only for mammals of relatively small body size (1 kg or less) and would likely be too unwieldy for use with large mammals.

We conclude by briefly considering our findings in the context of the decades-old and continuing controversy about the manner in which energy metabolism is – or should be – related to body mass among species of animals (e.g. Kleiber, 1932; McMahon, 1973; Weibel, 1973; Schmidt-Nielsen, 1975; Taylor et al., 1981; Heusner, 1985; McNab, 1988; West et al., 1997; Darveau et al., 2002; West et al., 2002; Chown et al., 2007; Hochachka et al., 2003; Agutter and Wheatley, 2004; Savage et al., 2004; Weibel et al., 2004; Weibel and Hoppeler, 2005; White and Seymour, 2005; White et al., 2009; White and Kearney, 2014). In both vertebrates and invertebrates, the majority of theoretical models of mass scaling of energy metabolism yield a predicted scaling exponent between 2/3 and 3/4, and most of the empirical tests of these models have relied on the very extensive datasets for endotherm BMR (which include hundreds of species of birds and mammals) (White et al., 2006; White et al., 2007) or standard metabolic rate (SMR) in ectotherms. Most analyses of these large BMR or SMR datasets do report interspecific scaling consistent with mass exponents of 2/3 to 3/4 (e.g. Rezende et al., 2004; McKechnie and Wolf, 2004; Chown et al., 2007; White and Kearney, 2013) (but see White et al., 2009).

In contrast, we found a significantly higher mass scaling exponent for mammalian exercise V̇O2max, averaging 0.839 (±0.022 s.e.) across all models. The species in our study span almost 5 orders of magnitude in mass (7.2 g to 475 kg) and include much of the size range in the mammalian lineage, with the caveat that data are lacking for the largest mammals (e.g. elephants, whales). The most conservative estimate of a 95% confidence interval (CI) for the mass exponent (based on 64 degrees of freedom in the most inclusive model) is 0.795–0.883. That CI excludes the ‘universal scaling constant’ of 0.75, but includes the value of 0.809 reported for the large mass range of species analyzed by Taylor et al. (Taylor et al., 1981) (see also Garland and Huey, 1987) and is also consistent with several more recent studies of exercise V̇O2max in mammals (Weibel et al., 2004; Weibel and Hoppeler, 2005; White and Seymour, 2005; White et al., 2008), all of which report scaling exponents significantly greater than 0.75. In other words, the empirical findings are not consistent with the most commonly considered theoretical constructs of relationships between body mass and metabolic rate. These include geometric similarity (predicting scaling to mass^{2/3}) (Heusner, 1985), elastic similarity (derived from biomechanical properties and predicting scaling to mass^{3/4}) (McMahon, 1973), and the recent and widely discussed concept of scaling imposed by the 3-dimensional fractal geometry of nutrient distribution networks (e.g. West et al., 1997; West et al., 2002; Savage et al., 2004), which also predicts scaling to mass^{3/4} across multiple levels of biological organization. It must also be noted that none of our analyses incorporated estimates of measurement error in the independent variables, and so all of them likely underestimate the true allometric scaling exponent (Ives et al., 2007). Empirically, among species of mammals, the allometric variation of V̇O2max is directly related to the scaling of the total effective surface areas of mitochondria and capillaries [Weibel et al., 2004; Weibel and Hoppeler, 2005 (and references therein)].
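As an arithmetic check on the interval quoted above, the following minimal Python sketch rebuilds the 95% CI from the reported mean exponent, standard error and degrees of freedom, and then asks which candidate exponents fall inside it. The critical value of Student's t for 64 degrees of freedom (about 1.998, two-tailed) is taken from standard tables and is an assumption of this sketch, as are the variable names.

```python
import math

# Values quoted in the text
exponent = 0.839      # model-averaged allometric exponent
std_error = 0.022     # its standard error
deg_freedom = 64      # degrees of freedom of the most inclusive model

# Assumption: two-tailed 95% critical value of Student's t for 64 d.f.,
# taken from standard tables rather than computed here
T_CRIT_95 = 1.998

half_width = T_CRIT_95 * std_error
lower = exponent - half_width
upper = exponent + half_width
print(f"95% CI: {lower:.3f} to {upper:.3f}")

# Candidate exponents discussed in the text
candidates = {
    "geometric similarity (2/3)": 2.0 / 3.0,
    "elastic similarity / fractal networks (3/4)": 0.75,
    "Taylor et al. (1981)": 0.809,
    "isometry (1.0)": 1.0,
}
inside = {name: lower <= value <= upper for name, value in candidates.items()}
for name, is_inside in inside.items():
    print(f"{name}: {'inside' if is_inside else 'outside'} the CI")

# Mass range of the data set, in orders of magnitude (7.2 g to 475 kg)
orders_of_magnitude = math.log10(475_000 / 7.2)
print(f"Mass range spans {orders_of_magnitude:.2f} orders of magnitude")
```

Under these assumptions the sketch gives an interval that matches the one reported in the text to three decimal places, with only the 0.809 estimate lying inside it.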

Although the idea of a ‘universal scaling constant’ is appealing, our analyses (and several others cited above) indicate that no theoretical model, including the recent fractal geometry concept, adequately describes the scaling of maximal aerobic metabolism in mammals. Moreover, we concur with Darveau et al. (Darveau et al., 2002) that the allometry of V̇O2max is a substantially more rigorous test of scaling ‘laws’ than analyses based on BMR, as the selective factors and sub-organismal mechanisms responsible for whole-animal aerobic metabolism seem far more likely to be driven by the maximal demands upon the system represented by V̇O2max (which sets an upper limit to the intensity of work that can be sustained aerobically) than by the ‘idling’ power requirements represented by BMR. Indeed, it is arguable that BMR is largely an epiphenomenon, unlikely to experience direct selection except in unusual and rare circumstances (e.g. those emphasizing fasting endurance in resting animals in thermoneutrality, or minimization of heat production of resting animals in hot environments) (see also Careau and Garland, 2012) (but see Lovegrove, 2006).

## ACKNOWLEDGEMENTS

We thank M. Springer for an electronic version of the Meredith et al. (Meredith et al., 2011) tree, and both the Mammal Research Institute in Bialowieza, Poland, and the Instituto de Ciencias Ambientales y Evolutivas, Universidad Austral de Chile in Valdivia, Chile, for hosting M.A.C. during part of the study.

## FOOTNOTES

**FUNDING**

This study was supported in part by the University of California, Riverside Academic Senate, a Biomedical Research Support Grant from University of California, Davis to J.H.J., a Marie Curie Transfer of Knowledge project BIORESC within the European Commission's 6th Framework Programme (contract no. MTKD-CT-2005-029957) to M.A.C., a University of São Paulo and Conselho Nacional de Desenvolvimento Científico e Tecnológico – CNPq, Brazil to J.E.P.W.B., the Polish Committee for Scientific Research (2 P04F 01329 to P.A.S.), and National Science Foundation (NSF) IOS-1121273 to T.G.

**COMPETING INTERESTS**

No competing interests declared.