ABSTRACT
Addressing many fundamental questions in biology begins with empirical estimation of simple monotonic rates of underlying biological processes. Across a variety of disciplines, ranging from physiology to biogeochemistry, these rates are routinely estimated from non-linear and noisy time series data using linear regression and ad hoc manual truncation of non-linearities. Here, we introduce the R package LoLinR, a flexible toolkit to implement local linear regression techniques to objectively and reproducibly estimate monotonic biological rates from non-linear time series data, and demonstrate possible applications using metabolic rate data. LoLinR provides methods to easily and reliably estimate monotonic rates from time series data in a way that is statistically robust, facilitates reproducible research and is applicable to a wide variety of research disciplines in the biological sciences.
INTRODUCTION
Living organisms, the communities they form and the environments they inhabit are all temporally dynamic systems. Consequently, many fundamental questions in biology begin with estimating the rates of underlying biological processes. When non-linearity is of biological interest and accurately represents the rate of interest, then non-linear, function-valued approaches may be most appropriate (Marshall et al., 2013; Stinchcombe et al., 2012). Non-linear approaches can be further generalized using function regression if the question also requires accounting for change in parameters as a function of other variables (Yen et al., 2015). However, in many cases, there is a putatively linear rate that we wish to estimate free of artifactual non-linear regions (e.g. Fig. 1). In these cases, researchers often reduce experimental complexity by holding state variables constant in order to estimate monotonic rates, which are then used in subsequent analyses. For example, the metabolic theory of ecology (MTE) seeks to explain ecological patterns by scaling the size dependence and temperature dependence of metabolic rates from individuals to ecosystems (Brown et al., 2004; Gillooly et al., 2001). The study of MTE therefore begins with estimates of metabolic rate at standardized temperatures (e.g. Barneche et al., 2014; Brown et al., 2004; Gillooly et al., 2001; White et al., 2011a,b). Similarly, predicting the impacts of climate change on ecosystem processes, such as community primary productivity, begins with estimates of O2 production rates (Tanaka et al., 2013; Yvon-Durocher et al., 2015). Other examples include estimates of leaf respiration rate (Shapiro et al., 2004), ecosystem functioning (Ross et al., 2013), and components of biogeochemical cycles including denitrification (Song et al., 2011), CO2 and CH4 gas emissions (e.g. Larmola et al., 2013). Thus, accurate estimates of biological rates provide the foundation for many branches of biology but, surprisingly, there are few systematic approaches to estimating monotonic rates from biological data.
Biological rates are routinely estimated from non-linear or noisy time series using linear regression. For example, many studies in physiology, ecosystem ecology and biogeochemistry monitor reactant consumption in closed chambers at standardized temperatures, and use the resulting, often non-linear, time series to estimate the rate of interest (e.g. Larmola et al., 2013; Ross et al., 2013; Song et al., 2011; Tanaka et al., 2013; White et al., 2011a,b; Yvon-Durocher et al., 2015). There are several problems with this approach. First, the data used rarely meet the criteria for linear regression to be an appropriate analytic tool. The first measures in a time series are often noisy as equipment/samples/organisms equilibrate after setup, while at the end of the time series, rates can change because of saturation effects or the exhaustion of a limiting resource (e.g. Fig. 1). Consequently, naive linear regression of a full time series can conflate the biological rate of interest with undesired effects. Second, common ad hoc methods to ameliorate this problem, such as manually truncating non-linear portions of the time series, introduce subjectivity into the analysis, and may reduce statistical power by removing useful data. Last, published studies rarely provide both the raw data and the specific methods necessary to reproduce reported results. This makes it difficult or impossible to evaluate the appropriateness of the methods, and is particularly problematic in a new era that demands scientific transparency and reproducibility (Fang et al., 2012; Grieneisen and Zhang, 2012). The need to estimate monotonic rates from time series data will only increase as technological advances continue to make collection of high-resolution data easy and cost effective. This presents biologists with a non-trivial challenge: to reliably estimate biological rates in a way that is statistically robust and fully reproducible.
Here, we introduce the LoLinR package for R (R Development Core Team, 2016), which provides a suite of simple functions to implement local linear regressions to estimate monotonic rates from time series data. We describe the general approach to reproducible and statistically robust estimation of monotonic rates, and the specific methods used in the package. We then walk through two example analyses to illustrate the utility of the package, as well as important analytic considerations and pitfalls (all computer code necessary to reproduce the analyses and figures in this article is available in Scripts S1 and S2; additional examples are available through https://github.com/colin-olito/LoLinR).
MATERIALS AND METHODS
We first describe the base function around which the LoLinR package is built, then detail the component linearity metrics underpinning the function. The methods provided in the LoLinR package are a modification of traditional Loess regression techniques built around the wrapper function rankLocReg. While Loess techniques are designed primarily for data interpolation and visualization, rankLocReg stores relevant data from all possible local regressions of ordered subsets (adjacent points) of a full time series, given a minimum window size. The function returns an object of class rankLocReg, which includes a ranked list of the most linear subsets of the data, as well as corresponding data for each local regression. A call to rankLocReg implements three steps: (1) define the minimum window size, (2) fit local regressions and (3) rank local regressions.
The only user-defined constraint imposed on the analysis is alpha, which defines the minimum window size used to fit local regressions, expressed as a proportion of the total number of observations in the full data set (analogous to a traditional Loess smoothing parameter). At a minimum, alpha must take into account the total number of observations in the full data set, N, such that (alpha×N)≥15 (Harrell, 2001). Ideally, (alpha×N) should also represent a biologically meaningful interval for the given data set. A call to rankLocReg fits all possible local regressions with n≥(alpha×N) adjacent observations using ordinary least squares.
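As a minimal illustration of this constraint (a hedged sketch, not part of the package itself): any alpha satisfying (alpha×N)≥15 is admissible, and the smallest admissible value is simply 15/N.
# Illustrative only: choosing alpha for a time series with N = 110 observations
N <- 110              # total number of observations in the full time series
15 / N                # smallest admissible alpha (approximately 0.14)
alpha <- 0.2          # a more conservative choice; gives a minimum window of alpha * N = 22 adjacent points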
When used with rankLocReg objects, the plot function generates several diagnostic plots to help determine the most appropriate method for a given analysis. Users can examine results from alternative L metrics by using the reRank function. Fig. 1 provides a schematic overview of a typical workflow using LoLinR to estimate biological rates. Crucially, analyses using LoLinR can be fully reproduced from (1) the time series data and (2) any one of the following: summary plots, summary tables or the R code used to perform the analysis. All are easily included as appendices or supplementary material to published articles, making LoLinR analyses extremely easy to reproduce.
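For orientation, a minimal sketch of this diagnostic workflow is given below, assuming a fitted object named fit and a hypothetical data frame myData; the reRank argument name and metric label shown here are assumptions for illustration rather than a definitive description of the package interface.
fit <- rankLocReg(xall=myData$time, yall=myData$O2, alpha=0.2, method="eq")  # myData and its column names are hypothetical
summary(fit)                            # ranked list of the most linear local regressions
plot(fit)                               # diagnostic plots for comparing the L metrics
fit2 <- reRank(fit, newMethod="pc")     # re-rank using an alternative L metric (argument name and "pc" label assumed)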
RESULTS AND DISCUSSION
Larval metabolic rate
The first example is from a study of allometric scaling of metabolic rate during larval development in two bryozoan species (Bugula neritina and Watersipora subtorquata; Pettersen et al., 2015). Metabolic rate was estimated for individual larvae from O2 saturation time series collected using closed chambers. Fig. 2A provides an example of the analytical challenge presented by these data. The full time series is clearly non-linear: the rate of O2 consumption initially decelerates as the chamber and larva equilibrate after handling. There is also a subtle acceleration towards the end of the time series, probably resulting from a physiological response by the larva to declining O2 availability (Lagos et al., 2015), or the accumulation of a bacterial biofilm that began to consume oxygen. Any estimate of O2 consumption rate that includes these non-linearities would be conflated with these other processes. However, truncating the data to exclude these non-linearities is subjective and difficult, especially for the subtle curvature towards the end of the data set. Ultimately, we wish to identify the region where the relationship between O2 concentration and time is most linear, and estimate its slope.
Using the LoLinR package, we can analyse this data set with the call:
library(LoLinR)
data(BugulaData)
BugulaRegs <- rankLocReg(xall=BugulaData$Time.s, yall=BugulaData$D1, alpha=0.2, method="eq")
which implements the rankLocReg function with a minimum window size set by alpha=0.2, and uses the linearity metric Leq to rank local regressions. This alpha value results in a minimum window size of (alpha×N)=22 for this data set. This call returns an object of class rankLocReg, which includes a ranked list of all possible local regressions, the number of local regressions fitted and several summary statistics. Examination of the summary output and the distribution of local regression slopes (Fig. 2D, density plot) indicates that both the LZ and Leq weighting methods return the same rank 1 local regression, while the L% method returns a slightly different result. However, all three methods identify local regressions in the latter half of the time series, where the rate of O2 consumption has stabilized. The L% rank 1 local regression includes a larger subset of the data (n=44 observations) than the other two methods (n=26), and all three rank 1 local regressions have nearly identical slopes (LZ, Leq: β1=−0.00133; L%: β1=−0.00132). Given the similarity of the results, we recommend the L% method in this case because it provides greater statistical power for the estimation of β1, the parameter of interest. Inspection of the chosen local regression and the accompanying residual plots also suggests that, apart from some autocorrelation, which is expected in time series data, there are no major concerns (Fig. 2C–F).
A comparison of this result with common alternative approaches highlights the usefulness of the methods. Naive linear regression of the full time series yields an estimate of β1=−0.00119, indicating that non-linearities, particularly early in the time series, result in under-estimation of the metabolic rate. Estimates obtained by linear regression of manually truncated subsections of these data (with the same window size of n=22 observations) can range from β1=−0.00212 to β1=−0.00067, more than a threefold difference in the estimate of metabolic rate. In addition to being methodologically opaque and conflating the desired rate with other experimental and biological processes, these common ad hoc methods can give highly inaccurate estimates.
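For comparison, the naive estimate quoted above can be reproduced with an ordinary linear regression of the full time series (a minimal sketch using the column names from the call above):
naiveFit <- lm(D1 ~ Time.s, data=BugulaData)   # naive regression of the full time series
coef(naiveFit)["Time.s"]                       # approximately -0.00119; conflates the rate with the non-linear regions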
Flow-through respirometry
The second example is a study of the metabolic costs of living in the Arctic for great cormorants (Phalacrocorax carbo) (White et al., 2011a). In this study, metabolic rate was estimated for individual birds using flow-through respirometry protocols (see supplementary material in White et al., 2011a). These techniques generate time series of the rate of O2 consumption (V̇O2, ml O2 kg−1 min−1) rather than O2 saturation or concentration. For these data, the analytic goal was to estimate resting metabolic rate, which should correspond to the subset of the time series where V̇O2 is lowest and most linear. Conventional methods for estimating V̇O2 from flow-through respirometry data are based on analysis of the distribution of sequences of adjacent data points, and the minimum running average of subsets of adjacent points with varying numbers of included observations (e.g. Withers, 2001). Here, we illustrate how rankLocReg can be used to estimate resting metabolic rate from these data by leveraging the statistical framework of local linear regression and examining the distribution of standardized residuals.
We analyse a representative V̇O2 time series for an individual cormorant. The time series is non-linear, with large spikes occurring when the animal is physically active inside the chamber, but there appears to be a region of relative stability between 2.5 and 5.25 h (Fig. 3). We analyse the thinned data with a call to rankLocReg using alpha=0.1. This ensures that the minimum window size corresponds to approximately 30 min, and a minimum of 15 observations for the local regressions. Although we are not necessarily interested in the slopes of the local regressions (β1), an examination of the distribution of β1 highlights the fact that the L% metric returns a different rank 1 local regression from the other two L metrics (Fig. 3A). For these data, the LZ and Leq metrics misidentify the most stable subset (a consequence of the strongly skewed empirical distribution of β1) (Fig. 3B; Fig. S1). However, L% identifies a period of approximately 2 h where V̇O2 is most stable (Fig. 3C). The average (or median) V̇O2 during this period, which provides the estimate of resting metabolic rate, is easily recovered from the summary of this analysis. This estimate is similar to those obtained using conventional methods (V̇O2=33.90 and 35.57 ml kg−1 min−1; Fig. S2; see Withers, 2001, for detailed methods), as well as the median of the full time series (36.71 ml kg−1 min−1), but has two distinct advantages. First, the L metrics provide an objective measure of linearity to identify periods of stability in the V̇O2 time series. Second, the L metrics do not preferentially select the smallest possible subset of the time series to estimate resting metabolic rate, resulting in an estimate that is based on more observations, and therefore has more statistical power (n=47 using LoLinR; n=11 or 18 using conventional methods).
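A sketch of how this analysis might be set up is shown below; the data frame and column names are hypothetical (the cormorant data set is not named in the text), and the metric label passed to method is an assumption:
cormRegs <- rankLocReg(xall=cormorantData$Time.h, yall=cormorantData$VO2, alpha=0.1, method="pc")  # hypothetical names; "pc" assumed to select L%
summary(cormRegs)   # the mean (or median) VO2 over the rank 1 window gives the resting metabolic rate estimate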
Appendix
Quantifying LoLinR performance using simulated data
An informative comparison between the methods provided in LoLinR and common alternatives is deceptively difficult for at least two reasons. First, the most common alternative methods (e.g. eyeballing the data to select a linear region and then running a linear regression, or the methods described in Withers, 2001) simply cannot be reliably compared with LoLinR because they are not objectively reproducible. Second, generating appropriate simulated data for the types of analyses LoLinR is designed to assist with (i.e. where some subset of a time series is indeed linear, or expected to be linear, but the rest is arbitrarily non-linear) is a non-trivial problem. This is particularly problematic when the specific nature of the non-linearity has a strong influence on the behaviour of the methods that LoLinR might be compared against (e.g. naive linear regression of full data sets). As a first step towards providing objective validation and comparison of our methods, we provide functions to generate simulated data and analyse the performance of rankLocReg (see Script S1). However, we emphasize that this is not a comprehensive sensitivity analysis, but rather a starting point for validating our methods and making future comparisons with other reproducible methods. Here, we briefly describe how simulated data were generated, and how the performance of rankLocReg was assessed.
We generated simulated data that roughly resemble O2 consumption data (similar to the ‘Larval metabolic rate’ example in Results and Discussion), with an initial phase of acceleration/deceleration that then stabilizes into a straight line. Simulated data sets are composed of 100 observations. The first 50 observations are non-linear, following a sine wave from the apex (or trough) at ±π/2 to the inflection point at ±π. The second 50 observations are linear, following y=β0±xβ1, where β0 is equal to the 50th observation (the last of the non-linear subset), and the regression slope is uniformly distributed on the intervals β1∈[−0.028, −0.004] ∪ [0.004, 0.028]. We add a small amount of normally distributed noise to the entire data set [ε∼N(0, 0.05)] to simulate the random variation present in real time series data. We analyse each randomly generated data set using rankLocReg, with alpha=0.2.
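A minimal sketch of how one such data set might be generated and analysed is given below; it mirrors the description above, but the x-scaling, sign conventions, noise scale (taken as the s.d.) and seed are assumptions, and the exact implementation used in Script S1 may differ.
library(LoLinR)
set.seed(1)
b1  <- sample(c(-1, 1), 1) * runif(1, min=0.004, max=0.028)   # slope of the linear half
yNL <- sin(seq(pi/2, pi, length.out=50))                      # non-linear half: sine from apex to inflection point
yL  <- yNL[50] + b1 * (1:50)                                  # linear half, continuing from the 50th observation
y   <- c(yNL, yL) + rnorm(100, mean=0, sd=0.05)               # add normally distributed noise (s.d. assumed 0.05)
simRegs <- rankLocReg(xall=1:100, yall=y, alpha=0.2, method="eq")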
To quantify the performance of rankLocReg, we use three simple metrics. (1) The difference between the actual slope of the linear subset of the simulated data and the slope of the rank 1 local regression identified by rankLocReg under each L method (Δi=βi−βreal), where i indicates the L method used (i∈[Z, eq, %]). As each of the three L metrics performs differently for different data sets, we also compare the difference between the real regression slope and the best of the three local regressions identified by rankLocReg (i.e. the one with the smallest Δi), which we designate Δbest. (2) The proportion of the truly linear subset of the data that is correctly included in the local regression identified by rankLocReg. (3) The proportion of the observations included in the identified local regression that actually fall within the truly linear subset of the data.
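To make these metrics concrete, a small sketch of how they could be computed for a single simulated data set is shown below; the numerical values and the rank 1 window shown are hypothetical placeholders, not results.
beta_real <- -0.016                                        # known slope of the linear half
beta_hat  <- c(Lz=-0.0181, Leq=-0.0181, Lpc=-0.0158)       # hypothetical rank 1 slopes for each L metric
delta      <- beta_hat - beta_real                         # metric (1): Delta_i for each method
delta_best <- delta[which.min(abs(delta))]                 # Delta_best: smallest absolute difference
linear_idx <- 51:100                                       # indices of the truly linear subset
window_idx <- 47:92                                        # hypothetical rank 1 window
length(intersect(window_idx, linear_idx)) / length(linear_idx)   # metric (2): proportion of the linear subset recovered
length(intersect(window_idx, linear_idx)) / length(window_idx)   # metric (3): proportion of the chosen window that is truly linear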
Results summary
For this particular type of simulated test data, rankLocReg performs remarkably well, particularly in comparison with a naive linear regression of the full data sets. The result of this specific comparison is not surprising, however, as the curvature in the first half of the simulated data results in systematic bias of βnaive towards 0. This is reflected in Fig. S3A, where the distribution of Δnaive is right skewed with a thicker tail than the distribution of Δbest. In contrast, Δbest is tightly distributed about 0, with a few outliers in the right tail. Each of the three L metrics performs similarly, although the local regressions identified using the L% method generally recover regression slopes closer to βreal (Fig. S3B). As expected, the absolute value of βreal also influences the performance of rankLocReg. Overall, rankLocReg performs better when βreal is further from 0 (Fig. S3C). Specifically, when βreal is negative but close to 0, rankLocReg tends to choose regressions that include the non-linear portion of the data, with slopes that are more steeply negative than βreal. When βreal is positive but close to 0, rankLocReg tends to choose local regressions with more steeply positive slopes than βreal for the same reasons. This behaviour makes sense because the L metrics used by rankLocReg are calculated from the 95% CI for β1, which becomes increasingly inflated as βreal approaches 0.
rankLocReg also does a reasonably good job of correctly identifying the truly linear subset of these simulated data. This is encouraging because the curvature of the simulated data in this example should make the task rather difficult: the second half of the non-linear portion of the simulated data becomes increasingly linear (with a slope of β1≈±0.016 on the x-scale used for this analysis) as it approaches the inflection point of the sine wave at ±π. Thus, it should be difficult for rankLocReg to distinguish between the end of the non-linear subset of the data and the truly linear subset. Nevertheless, at least one of the L methods implemented by rankLocReg (i.e. the ‘best’ local regression with the smallest Δ value) generally included a large fraction of the truly linear subset of the data (Fig. S3D). However, there was considerable variability in the performance of each of the L methods (Fig. S3E). The LZ and Leq methods in particular either performed very well or very poorly at identifying the truly linear subset of the data. In contrast, the L% method generally identified at least half of the truly linear subset (Fig. S3E). The ability of rankLocReg to correctly identify the truly linear subset of the data was again sensitive to the absolute value of βreal. Specifically, even the ‘best’ local regression misidentified the truly linear subset more frequently as βreal approached 0 (Fig. S3F).
rankLocReg also performed well at identifying local regressions that correctly include the truly linear subset of the data. For a large majority of simulated data sets, more than half of the observations included in the ‘best’ local regression identified by rankLocReg were part of the truly linear subset of the data (Fig. S3G). However, there was again substantial variability in the performance of each of the three L methods (Fig. S3H). The L% method clearly performed the best in this respect, generally choosing local regressions with the majority of observations falling within the truly linear subset of the data (Fig. S3H). In contrast, the LZ method generally performed very poorly, choosing local regressions that badly misidentified the linear subset of the data, or local regressions of which only half of the included observations were actually part of the truly linear subset (Fig. S3H). The Leq method also performed poorly, but was better than LZ at identifying local regressions with the majority of observations falling within the truly linear subset of the data (Fig. S3H). Once again, the performance of even the ‘best’ local regression became worse as βreal approached 0. The variability in the performance of the three L methods is almost certainly a direct consequence of the difficulty in distinguishing between the end of the non-linear and the beginning of the linear subsets of these simulated data. This is particularly clear for L%, which often identified local regressions that spanned both the non-linear and linear portions of the data.
Four main conclusions emerge from this limited test of the performance of rankLocReg against simulated data. First, rankLocReg performs better than naive linear regression of full time series at estimating the slope of a linear subset of the time series. Second, this analysis strongly supports our recommendations in the Materials and Methods and Results and Discussion sections that L% be used as the preferred weighting method. L% is generally more robust and does a better job than the other methods of choosing local regressions that both accurately estimate βreal and correctly include the truly linear subset of these simulated data. Third, rankLocReg becomes progressively better at accurately estimating βreal values that are further from 0 (and presumably further from 1 as well). Fourth, this analysis highlights that while there can be significant variation in the performance of each of the three L metric weighting methods, it is rare that all three mis-identify the truly linear subset of the time series. Taken together, these results indicate that the methods provided in LoLinR perform well for their intended purpose for the type of data simulated here.
Acknowledgements
The authors thank A. Pettersen and G. Yvon-Durocher for graciously providing data, and two anonymous reviewers.
Author contributions
Conceptualization: C.O. and D.R.B.; Methodology: C.O. and D.R.B.; Software: C.O. and D.R.B.; Writing – original draft preparation: C.O.; Writing – review and editing: C.R.W., D.J.M. and D.R.B.; Funding acquisition: D.J.M. and C.R.W.
Funding
This work was funded by a Monash University Dean's International Postgraduate Student Scholarship to C.O., Australian Research Council grants to D.J.M. and C.R.W., and a Monash University Centre for Geometric Biology Post-Doctoral Fellowship to D.R.B.
Data accessibility
All data used in this study are included in the LoLinR package and available at https://github.com/colin-olito/LoLinR.
Competing interests
The authors declare no competing or financial interests.