If this post had a single take-home message, it would be this: whatever statistical method you use, make sure that you use and report it appropriately.
I’m writing a chapter on statistical inference for a forthcoming book on Ecological Statistics (eds Gordon Fox, Simoneta Negrete-Yankelevich, and Vinicio Sosa, and to be published by Oxford University Press). This post is essentially a draft of my chapter. In addition to the editors arranging the review process, I thought I’d take the opportunity to share the chapter more widely (a sort of open access peer-review!) – so please feel free to comment. Have I missed something important? Or have I misrepresented something?
The intended readership of the book is ecologists. My chapter is the first of the book after the introduction by the editors, so some parts are very introductory. Other chapters will give much more detail about several aspects. So, in particular, I’d be interested to know if anyone finds some parts unclear – I want this to be accessible to all ecologists. Please comment below, or send me an email. And many thanks for reading!
1 Introduction
Statistical inference is needed in ecology because the natural world is variable. A quote, commonly attributed to Ernest Rutherford, one of the world’s greatest scientists, is “If your experiment needs statistics, you ought to have done a better experiment.” Of course, such a quote is relevant to deterministic systems or easily replicated experiments. Ecology faces variable data and replication constrained by costs and logistics.
Ecology, as the study of the distribution and abundance of organisms and how these are influenced by the environment and interactions among those organisms (Begon et al. 2005), requires measuring quantities and studying relationships. However, data are imperfect. Species fluctuate somewhat unpredictably over time and space. Fates of individuals, even in the same location, differ due to different genetic composition, individual history or chance encounters with food, diseases and predators.
Further to these intrinsic sources of uncertainty, measurement error limits the ability of ecologists to know the true state of the environment under study. Species and individuals are commonly detected imperfectly (Pollock et al. 1990; Parris et al. 1999; Kery 2002; Tyre et al. 2003), so the true composition of communities and the abundance of species need to be estimated. Environmental variables are usually only measured for a subset of the total environment, are subject to measurement error, and are often proximal to the direct drivers of distribution and abundance.
The various sources of error and the complexity of ecological systems mean that statistical inference is required to distinguish between the signal and the noise. Statistical inference uses logical and repeatable methods to extract information from noisy data, so it plays a central role in ecological sciences.
While statistical inference plays an extremely important role, it can be daunting. The choice of statistical method can also appear to be controversial (e.g., Dennis 1996; Anderson et al. 2000; Burnham and Anderson 2002; Stephens et al. 2005). This chapter outlines the range of approaches to statistical inference that are used in ecology. I take a pluralistic view; if a logical method is applied and interpreted appropriately, then it should be acceptable. However, I also identify common major errors in the application of the various statistical methods in ecology, and note some strategies for how to avoid them.
Mathematics underpins statistical inference. While anxiety about mathematics is painful (Lyons and Beilock 2012), without mathematics, ecologists need to follow Rutherford’s advice, and do a better experiment. However, designing a better experiment is often prohibitively expensive or otherwise impossible, so statistics, and mathematics more generally, are critical to ecology. I limit the complexity of mathematics in this chapter. However, some mathematics is critical to understanding statistical inference. I am asking you, the reader, to meet me halfway. If you can put aside any mathematical anxiety, you might find it less painful. Please work at any mathematics that you find difficult; it is important for a proper understanding of your science.
2 A short overview of some probability and sampling theory
Ecological data are variable – that is the crux of why we need to use statistics in ecology. Probability is a powerful way to describe unexplained variability (Jaynes 2003). One of probability’s chief benefits is its logical consistency. That logic is underpinned by mathematics, which is a great strength because it imparts repeatability and precise definition. Here I introduce, as briefly as I can, some of the key concepts and terms used in probability that are most relevant to statistical inference.
All ecologists will have encountered the normal distribution, which also goes by the name of the Gaussian distribution, named for Carl Friedrich Gauss who first described it (Fig. 1). The probability density function (Box 1) of the normal distribution can be written as:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$
The probability density at x is defined by two parameters μ and σ (note that π is the mathematical constant given by the ratio of a circle’s circumference to its diameter, which is approximately 3.1416). In this formulation of the normal distribution, the mean is equal to μ and the standard deviation is equal to σ.
Many of the examples in this chapter will be based on assuming data are drawn from a normal distribution. This is primarily for the sake of consistency and because of its prevalence in ecological statistics. However, the same basic concepts apply when considering data generated by other distributions.
Box 1. Probability density
Consider a discrete random variable that takes values of non-negative integers (0, 1, 2, …), perhaps being the number of individuals of a species within a field site. We could use a distribution to define the probability that the number of individuals is 0, 1, 2, etc. Let Y be the random variable, then for any y in the set of numbers {0, 1, 2, …}, we could define the probability that Y takes that number; Pr(Y = y). This is known as the probability mass function, with the sum of Pr(Y = y) over all possible values of y being equal to 1.
Probability mass functions cannot be used for continuous probability distributions, such as the normal distribution, because the random variable can take any one of infinitely many possible values. Instead, continuous random variables can be defined in terms of probability density.
Let f(x) be the probability density function of a continuous random variable X, which describes how the probability density of the random variable changes across its range. The probability that X will occur in the interval [x, x+dx] approaches dx×f(x) as dx becomes small. More precisely, the probability that X will fall in the interval [x, x+dx] is given by the integral of the probability density function, $\int_{x}^{x+dx} f(u)\,du$.
The cumulative distribution function F(x) is the probability that the random variable X is less than x. Hence, $F(x) = \int_{-\infty}^{x} f(u)\,du$, and $f(x) = dF(x)/dx$.
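The relationship between probability density and probability can be checked numerically. The following is a minimal sketch using Python's standard library; the point x = 1.5 and the interval width dx = 0.001 are arbitrary choices for illustration, and the mean of 2 and standard deviation of 1 match the example used later in the chapter.

```python
from statistics import NormalDist

# Normal distribution with mean 2 and standard deviation 1.
dist = NormalDist(mu=2.0, sigma=1.0)

# Arbitrary point and small interval width, chosen only for illustration.
x, dx = 1.5, 0.001

# Exact probability of falling in [x, x + dx], from the cumulative distribution function.
exact = dist.cdf(x + dx) - dist.cdf(x)

# Approximation dx * f(x), which improves as dx becomes small.
approx = dx * dist.pdf(x)

print(f"F(x + dx) - F(x) = {exact:.8f}")
print(f"dx * f(x)        = {approx:.8f}")
```

Making dx smaller brings the two numbers closer together, which is the sense in which f(x) is a density rather than a probability.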
The behavior of random variables can be explored through simulation. Consider a normal distribution with mean 2 and standard deviation 1 (Fig. 2). If we generate a sample of 10 values from this distribution, the mean and standard deviation of the data will not equal 2 and 1 exactly (Fig. 2). The mean of the 10 values is named the sample mean. If we repeat this procedure multiple times, the sample mean will sometimes be greater than the true mean, and sometimes less (Fig. 3). Similarly, the standard deviation of the data in each sample will vary around the true standard deviation. Statistics such as the sample mean and sample standard deviation are referred to as sample statistics.
The 10 different sample means have their own distribution; they vary around the mean of the distribution that generated them (Fig. 3). These sample means are much less variable than the data; that is the nature of averages. The standard deviation of a sampling statistic, such as a sample mean, is usually called a standard error. Using the property that, for any distribution, the variance of the sum of independent variables is equal to the sum of their variances, it can be shown that the variance of the mean of n independent values is σ^{2}/n, so the standard error of the mean is given by se = σ/√n, where n is the sample size.
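The claim that sample means vary with a standard deviation of σ/√n can be explored by simulation. This is a minimal sketch using Python's standard library; the number of replicates (20000) and the random seed are arbitrary assumptions, while the mean, standard deviation and sample size follow the example in the text.

```python
import math
import random
import statistics

random.seed(1)  # arbitrary seed, used only so the run is repeatable

mu, sigma, n = 2.0, 1.0, 10  # values from the example in the text
replicates = 20000           # assumed number of hypothetical replicate samples

# Draw many replicate samples of size n and record each sample mean.
sample_means = [
    statistics.fmean([random.gauss(mu, sigma) for _ in range(n)])
    for _ in range(replicates)
]

# The standard deviation of the sample means estimates the standard error.
empirical_se = statistics.stdev(sample_means)
theoretical_se = sigma / math.sqrt(n)

print(f"standard deviation of sample means: {empirical_se:.4f}")
print(f"sigma / sqrt(n):                    {theoretical_se:.4f}")
```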
While standard errors are often used to measure uncertainty about sample means, they can be calculated for other sampling statistics such as variances, regression coefficients, correlations, or any other value that is derived from a sample of data.
Wainer (2007) describes the equation for the standard error of the mean, first noted by de Moivre, as “the most dangerous equation”. Why? Not because it is dangerous to use, but because ignorance of it causes huge waste and misunderstanding. The standard error of the mean indicates how different the true population mean might be from the sample mean. This makes the standard error very useful for determining how reliably the sample mean estimates the population mean. De Moivre’s equation indicates that uncertainty declines with the square root of the sample size; to halve the standard error one needs to quadruple the sample size. This provides a simple but useful rule of thumb about how much data would be required to achieve a particular level of precision in an estimate.
These aspects of probability (the meaning of probability density, the concept of sampling statistics, and precision of estimates changing with sample size) are key concepts underpinning statistical inference. With this introduction complete, I now describe different approaches to statistical inference.
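The rule of thumb follows from rearranging se = σ/√n to n = (σ/se)^2. A minimal sketch of that arithmetic is below; the function names are hypothetical, and the standard deviation is assumed known, which is rarely true in practice.

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for a known standard deviation sigma."""
    return sigma / math.sqrt(n)

def sample_size_for(sigma, target_se):
    """Smallest sample size whose standard error is at most target_se."""
    return math.ceil((sigma / target_se) ** 2)

# Quadrupling the sample size (10 -> 40) halves the standard error.
ratio = standard_error(1.0, 40) / standard_error(1.0, 10)
print(f"se(n=40) / se(n=10) = {ratio:.3f}")

# Halving the target standard error quadruples the required sample size.
print(sample_size_for(1.0, 0.25), sample_size_for(1.0, 0.125))
```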
3 Approaches to statistical inference
The two main approaches to statistical inference are frequentist methods and Bayesian methods. The former is based on determining the probability of obtaining the observed data, given that particular conditions exist. The latter is based on determining the probability that particular conditions exist given the data that have been collected. They both offer powerful approaches for estimation and considering the strength of evidence in favor of hypotheses.
There is some controversy about the legitimacy of these two approaches. In my opinion, the importance of the controversy has sometimes been overstated. The controversy has also seemingly distracted attention from, or completely overlooked, more important issues such as the misinterpretation and misreporting of statistical methods, regardless of whether Bayesian or frequentist methods are used. I address the misuse of different statistical methods at the end of the chapter, although I touch on aspects earlier. First, I introduce the range of approaches that are used.
Frequentist methods are so named because they are based on thinking about the frequency with which an outcome (e.g., the data, or the mean of the data, or a parameter estimate) would be observed if a particular model had truly generated those data. They use the notion of hypothetical replicates of the data collection and method of analysis. Probability is defined as the proportion of these hypothetical replicates that generate the observed data. That probability can be used in several different ways, which defines the type of frequentist method.
3.1 Sample statistics and confidence intervals
I previously noted de Moivre’s equation, which defines the relationship between the standard error of the mean and the standard deviation of the data. Therefore, if we knew the standard deviation of the data, we would know how variable the sample means from replicate samples would be. With this knowledge, and assuming that the sample means from replicate samples have a particular probability distribution, it is possible to calculate a confidence interval for the mean.
A confidence interval is calculated such that if we collected many replicate sets of data and built a Z% confidence interval for each case, those intervals would encompass the true value of the parameter Z% of the time (assuming the assumptions of the statistical model are true). Thus, a confidence interval for a sample mean indicates the reliability of a sample statistic.
Note that the limits to the interval are usually chosen such that the confidence interval is symmetric around the sample mean $\bar{x}$, especially when assuming a normal distribution, so the confidence interval would be $[\bar{x} - \varepsilon, \bar{x} + \varepsilon]$. When data are assumed to be drawn from a normal distribution, the value of $\varepsilon$ is given by $\varepsilon = z\sigma/\sqrt{n}$, where σ is the standard deviation of the data, and n is the sample size. The value of z is determined by the cumulative distribution function for a normal distribution (Box 1) with mean of 0 and standard deviation of 1. For example, for a 95% confidence interval, z = 1.96, while for a 70% confidence interval, z = 1.04.
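The values of z quoted above can be recovered from the inverse of the standard normal cumulative distribution function. The sketch below is one way to do this with Python's standard library; the function names are hypothetical, and the worked interval assumes a sample mean of 1.73 from 10 data points with σ known to be 1.

```python
import math
from statistics import NormalDist

def z_value(level):
    """Two-sided critical value of the standard normal for a confidence level in (0, 1)."""
    return NormalDist().inv_cdf(0.5 + level / 2)

def ci_known_sigma(xbar, sigma, n, level=0.95):
    """Confidence interval for the mean when the standard deviation is known."""
    half_width = z_value(level) * sigma / math.sqrt(n)
    return xbar - half_width, xbar + half_width

print(round(z_value(0.95), 2))  # 1.96
print(round(z_value(0.70), 2))  # 1.04

# Assumed inputs: sample mean 1.73, known sigma 1.0, sample size 10.
low, high = ci_known_sigma(1.73, 1.0, 10)
print(f"95% CI: [{low:.2f}, {high:.2f}]")
```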
Of course, we rarely will know σ exactly, but will have the sample standard deviation s as an estimate. Typically, any estimate of σ will be uncertain. Uncertainty about the standard deviation increases uncertainty about the variability in the sample mean. When assuming a normal distribution for the sample means, this inflated uncertainty can be incorporated by using a t-distribution to describe the variation. The degree of extra variation due to uncertainty about the standard deviation is controlled by an extra parameter known as “the degrees of freedom” (Box 2). For this example of estimating the mean, the degrees of freedom equals n–1.
When the standard deviation is estimated, the difference between the mean and the limits of the confidence interval is $\varepsilon = t_{n-1}s/\sqrt{n}$, where the value of $t_{n-1}$ is derived from the t distribution. The value of $t_{n-1}$ approaches the corresponding value of z as the sample size increases. This makes sense; if we have a large sample, the sample standard deviation will provide a reliable estimate of σ so the value of $t_{n-1}$ should approach a value that is based on assuming σ is known.
However, in general for a particular percentage confidence interval, $t_{n-1} > z$, which inflates the confidence interval. For example, when n = 10, as for the data in Fig. 3, we require $t_{9} = 2.262$ for a 95% confidence interval. The resulting 95% confidence intervals for each of the data sets in Fig. 3 differ, but they are somewhat similar (Fig. 4).
As well as indicating the likely true mean, each interval is quite good at indicating how different one confidence interval is from another. Thus, confidence intervals are valuable for communicating the likely value of a parameter, but they can also foreshadow how replicable the results of a particular study might be.
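The definition of a confidence interval in terms of replicate data sets can be checked by simulation. In this minimal sketch, many samples of size 10 are drawn from a normal distribution with mean 2 and standard deviation 1, a 95% interval is built for each using the value t_9 = 2.262 quoted above, and the proportion of intervals that include the true mean is counted. The seed and the number of replicates are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(42)  # arbitrary seed, used only so the run is repeatable

mu, sigma, n = 2.0, 1.0, 10  # true values and sample size from the text
t9 = 2.262                   # t value for a 95% interval with 9 degrees of freedom
replicates = 10000           # assumed number of hypothetical replicate data sets

covered = 0
for _ in range(replicates):
    data = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = statistics.fmean(data)
    s = statistics.stdev(data)  # sample standard deviation (divisor n - 1)
    half_width = t9 * s / math.sqrt(n)
    if xbar - half_width <= mu <= xbar + half_width:
        covered += 1

coverage = covered / replicates
print(f"proportion of intervals containing the true mean: {coverage:.3f}")
```

The proportion should be close to 0.95, although it will vary a little from run to run.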
Box 2. Degrees of freedom
The degrees of freedom parameter reflects the number of data points in an estimate that are free to vary. For calculating a sample standard deviation, this is n–1 where n is the sample size (number of data points).
The “–1” term arises because the standard deviation relies on a particular mean; the standard deviation is a measure of deviation from this mean. Usually this mean is the sample mean of the same data used to calculate the standard deviation; if this is the case, once n–1 data points take their particular values, then the nth (final) data point is defined by the mean. Thus, this nth data point is not free to vary, so the degrees of freedom is n–1.
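The constraint described above can be seen directly in a small example. The data in this sketch are hypothetical; the point is only that deviations from the sample mean sum to zero, so the final data point can be recovered from the others once the sample mean is fixed.

```python
import statistics

data = [1.2, 2.9, 2.4, 0.8, 2.7]  # hypothetical data, for illustration only
n = len(data)
xbar = statistics.fmean(data)

# Deviations from the sample mean always sum to zero (up to rounding error).
deviations = [x - xbar for x in data]
print(f"sum of deviations: {sum(deviations):.2e}")

# Given the sample mean and the first n - 1 values, the nth value is not free to vary.
last = n * xbar - sum(data[:-1])
print(f"recovered final value: {last:.2f} (actual: {data[-1]})")

# The sample variance therefore divides the sum of squares by n - 1.
s2 = sum(d * d for d in deviations) / (n - 1)
print(f"sample variance: {s2:.4f}")
```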
3.2 Null hypothesis significance testing
Null hypothesis significance testing is another type of frequentist analysis. It has close relationships to confidence intervals, and is used in a clear majority of ecological manuscripts, yet it is rarely used well (Fidler et al. 2006). It works in the following steps:
1) Define a null hypothesis (and a complementary alternative hypothesis);
2) Collect some data that are related to the null hypothesis;
3) Use a statistical model to determine the probability of obtaining those data or more extreme data when assuming the null hypothesis is true (this is the p-value); and
4) If those data are unusual given the null hypothesis (if the p-value is sufficiently small), then reject the null hypothesis and accept the alternative hypothesis.
Note, there is no “else” statement here. If the data are not unusual (i.e., if the p-value is large), then we do not “accept” the null hypothesis; we simply fail to reject it. Null hypothesis significance testing is confined to rejecting null hypotheses, so a reasonable null hypothesis is needed in the first place.
Unfortunately, generating reasonable and useful hypotheses in ecology is difficult, because null hypothesis significance testing requires a precise prediction. Let me illustrate that point by using the species area relationship that defines species richness S as a power function of the area of vegetation (A) such that S = cA^{z}. Taking logarithms, we have log(S) = log(c) + zlog(A), which might be analyzed by linear regression. The parameter c is the constant of proportionality and z is the scaling coefficient. The latter is typically in the approximate range 0.15 – 0.4 (Durrett and Levin 1996).
A null hypothesis cannot be simply “we expect a positive relationship between the logarithm of species richness and the logarithm of area”. The null hypothesis would need to be precise about that relationship, for example, specifying that the coefficient z is equal to a particular value. We could choose z = 0 as our null hypothesis, but logic and the wealth of previous studies tell us that z must be greater than zero. The null hypothesis z = 0 would be a nil null; such hypotheses are relatively common in ecology. Unfortunately rejecting a nil null is uninformative, because we already know it to be false.
A much more useful null hypothesis would be one derived from a specific theory. There are ecological examples where theory can make specific predictions about particular parameters, including for species-area relationships (Durrett and Levin 1996). For example, models in metabolic ecology predict how various traits, such as metabolic rate, scale with body mass (Kooijman 2010, West et al. 1997). Rejecting a null hypothesis based on these models is informative, at least to some extent, because it would demonstrate that the model made predictions that did not match data. Subsequently, we might investigate the nature of that mismatch, and seek to understand the failure of the model (or the data).
Of course, there are degrees by which the data will depart from the prediction of the null hypothesis. In null hypothesis testing, the probability of generating the data or more extreme data is calculated assuming the null hypothesis is true. This is the p-value, which measures departure from the null hypothesis. A small p-value suggests the data are unusual given the null hypothesis.
How is a p-value calculated? Look at the data in Fig. 2, and assume we have a null-hypothesis that the mean is 1.0 and that the data are drawn from a normal distribution. The sample mean is 1.73, marked by the cross. We then ask “What is the probability of obtaining, just by chance alone, a sample mean from 10 data points that is 0.73 units away from the true mean?” That probability depends on the variation in the data. If the standard deviation were known to be 1.0, we would know that the sample mean would have a normal distribution with a standard deviation (the standard error of the mean) equal to 1/√10. We could then calculate the probability of obtaining a deviation larger than that observed.
But we don’t really know the true standard deviation of the distribution that generated the data. The p-value in this case is calculated by assuming the distribution of the sample mean around the null hypothesis is defined by a t-distribution, which accounts for uncertainty in the standard deviation. We then determine the probability that a deviation as large as the sample mean would occur by chance alone, which is the area under the relevant tails of the distribution (sum of the two grey areas in Fig. 5). In this case, the area is 0.04, which is the p-value.
Note that we have done a “two-tailed test”. This implies the alternative hypothesis is “the true mean is greater than or less than 1.0”; more extreme data are defined as deviations in either direction from the null hypothesis. If the alternative hypothesis was that the mean is greater than the null hypothesis, then only the area under the right-hand tail is relevant. In this case, more extreme data are defined only by deviations that exceed the sample mean, and the p-value would be 0.02 (the area of the right-hand tail). The other one-sided alternative hypothesis, that the mean is less than 1.0 would only consider deviations that are less than the sample mean, and the p-value would be 1 – 0.02 = 0.98. The point here is that the definition of “more extreme data” needs to be considered carefully by clearly defining the alternative hypothesis when the null hypothesis is defined.
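The relationship between one-tailed and two-tailed p-values can be illustrated with the simpler case in which the standard deviation is assumed known to be 1.0, so that the sample mean has a normal distribution with standard error 1/√10. This is a sketch of that simplified case only; because it uses the normal distribution rather than the t-distribution, the p-values differ somewhat from those quoted above.

```python
import math
from statistics import NormalDist

# Sample mean and null hypothesis from the text; sigma is assumed known here.
xbar, null_mean, sigma, n = 1.73, 1.0, 1.0, 10

se = sigma / math.sqrt(n)    # standard error of the mean
z = (xbar - null_mean) / se  # deviation in units of standard errors

std_normal = NormalDist()

# Alternative "mean is greater than 1.0": area of the right-hand tail.
p_greater = 1 - std_normal.cdf(z)

# Alternative "mean is less than 1.0": area to the left of the observed deviation.
p_less = std_normal.cdf(z)

# Two-tailed alternative: both tails beyond a deviation of the observed size.
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))

print(f"z = {z:.3f}")
print(f"two-tailed p = {p_two_sided:.4f}")
print(f"one-tailed p (greater) = {p_greater:.4f}")
print(f"one-tailed p (less) = {p_less:.4f}")
```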
We can think of the p-value as the degree of evidence against the null hypothesis; the evidence mounts as the p-value declines. However, it is important to note that p-values are typically variable. Cumming (2011) describes this variability as “the dance of the p-values”. Consider the 10 datasets in Fig. 3. Testing a null hypothesis that the mean is 1.0 leads to p-values that vary from 0.00018 to 0.11, even though the process generating the data is identical in all cases (Fig. 6). Further, the magnitude of any one p-value does not indicate how different other p-values, generated by the same process, might be. In Fig. 6, the p-values vary across almost three orders of magnitude, despite the data being generated by the same process. How much more variable might p-values be when data are collected from real systems?
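The variability of p-values can be explored by simulation. The sketch below generates 10 data sets of 10 points from a normal distribution with mean 2 and standard deviation 1, and tests the null hypothesis that the mean is 1.0 for each. It will not reproduce the particular values in Fig. 6, because the random draws differ. The two-sided p-value is obtained by numerically integrating the density of the t-distribution with Simpson's rule, which is one assumption-light way of avoiding external libraries; the function names and the seed are hypothetical choices.

```python
import math
import random
import statistics

def t_pdf(t, df):
    """Probability density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: one minus the area between -|t| and |t| (Simpson's rule)."""
    b = abs(t)
    if b == 0:
        return 1.0
    h = b / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_zero_to_b = total * h / 3
    return max(0.0, 1 - 2 * area_zero_to_b)

random.seed(3)  # arbitrary seed, used only so the run is repeatable
mu, sigma, n, null_mean = 2.0, 1.0, 10, 1.0

p_values = []
for _ in range(10):
    data = [random.gauss(mu, sigma) for _ in range(n)]
    se = statistics.stdev(data) / math.sqrt(n)
    t = (statistics.fmean(data) - null_mean) / se
    p_values.append(two_sided_p(t, n - 1))

for p in sorted(p_values):
    print(f"{p:.5f}")
```

Changing the seed gives a different set of p-values from exactly the same data-generating process.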
Rather than simply focusing on the p-value as a measure of evidence (variable as it is), ecologists seem to perceive a need to make a dichotomous decision about whether the null hypothesis can be rejected or not. Whether a dichotomous decision is needed is often debatable, but assuming that it is, a threshold p-value is required. If the p-value is less than this particular threshold, which is known as the type I error rate, then the null hypothesis is rejected. The type I error rate is the probability of falsely rejecting the null hypothesis when it is true. The type I error rate is almost universally set at 0.05, although this is largely a matter of convention and is rarely based on logic.
Null hypothesis significance tests with a type I error rate of α are closely related to 100(1−α)% confidence intervals. Note that the one case where the 95% confidence interval overlaps the null hypothesis of 1.0 (Fig. 4) is the one case in which the p-value is greater than 0.05. More generally, a p-value for a two-sided null hypothesis significance test will be less than α when the 100(1−α)% confidence interval does not overlap the null hypothesis. Thus, null hypothesis significance testing is equivalent to comparing the range of a confidence interval to the null hypothesis.
While the type I error rate specifies the probability of falsely rejecting a true null hypothesis, such a dichotomous decision also entails the risk of failing to reject a false null hypothesis. The probability of this occurring is known as the type II error rate. For example, in the 10 datasets shown in Fig. 3, only 9 of them lead to p-values that are less than the conventional type I error rate of 0.05. Thus, we would reject the null hypothesis in only 9 of the 10 cases.
The type I and type II error rates are related, such that one increases as the other declines. For example, if the type I error rate in Fig. 6 were set at 0.01, then the null hypothesis would be rejected for only 7 of the 10 datasets.
The type II error rate also depends on the difference between the null hypothesis and the true value. If the null hypothesis were equal to 0 (a difference of 2 units from the truth), then all the datasets in Fig. 3 would generate p-values less than 0.05. In contrast, a null hypothesis of 1.5 (0.5 units from the truth) would be rejected (with a type I error rate of 0.05) in only 4 of the 10 datasets. The type II error rate also changes with variation in the data and the sample size. Less variable data and larger sample sizes both decrease the type II error rate; they increase the chance of rejecting the null hypothesis if it is false.
In summary, the type II error rate depends on the type of statistical analysis being conducted (e.g., a difference between means, a linear regression, etc.), the difference between the truth and the null hypothesis, the chosen type I error rate, the variation in the data, and the sample size. Calculating type II error rates is not straightforward to do by hand, but software for the task exists (e.g., G*Power). Because the truth is not known, the type II error rate is usually calculated for different possible truths. These calculations would indicate the size of the deviation from the null hypothesis that might be reliably detected with a given analysis and sample size.
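Type II error rates can also be approximated by simulation, which makes their dependence on the difference between the truth and the null hypothesis easy to see. The following sketch assumes the same setting as the examples above (true mean 2, standard deviation 1, n = 10, type I error rate 0.05) and rejects the null hypothesis whenever the absolute t statistic exceeds t_9 = 2.262. The function name, seed and number of replicates are assumptions, and the function is only valid for n = 10 because the critical value is fixed.

```python
import math
import random
import statistics

random.seed(7)  # arbitrary seed, used only so the run is repeatable

T9 = 2.262  # two-sided 5% critical value of the t-distribution with 9 degrees of freedom

def rejection_rate(null_mean, true_mean=2.0, sigma=1.0, replicates=5000):
    """Proportion of simulated data sets (n = 10) in which the null hypothesis is rejected."""
    n = 10  # fixed, because T9 applies only to 9 degrees of freedom
    rejections = 0
    for _ in range(replicates):
        data = [random.gauss(true_mean, sigma) for _ in range(n)]
        se = statistics.stdev(data) / math.sqrt(n)
        t = (statistics.fmean(data) - null_mean) / se
        if abs(t) > T9:
            rejections += 1
    return rejections / replicates

for null_mean in (0.0, 1.0, 1.5):
    power = rejection_rate(null_mean)
    print(f"null = {null_mean}: power = {power:.3f}, type II error rate = {1 - power:.3f}")
```

The type II error rate grows as the null hypothesis moves closer to the true mean, as the text describes.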
The type II error rate (β), or its complement power (1–β), is clearly important in ecology. What is the point of designing an expensive experiment to test a theory if that experiment has little chance of identifying a false null hypothesis? Ecologists who practice null hypothesis testing should routinely calculate type II error rates, but the evidence is that they do not. In fact, they almost never calculate them (Fidler et al. 2006). The focus of ecologists on the type I error rate and failure to account for the type II error rate might reflect the greater effort required to calculate the latter. This is possibly compounded by practice, which seems to accept ignorance of type II error rates. If type II error rates are hard to calculate, and people can publish papers without them, why would one bother? The answer about why one should bother is discussed later.
3.3 Likelihood
An alternative approach to frequentist statistical methods is based on the concept of “likelihood”. Assume we have collected sample data of size n. Further, we will assume these data were generated according to a statistical model. For example, we might assume that the sample data are random draws from a normal distribution with two parameters (the mean and standard deviation). In this case, a likelihood analysis would proceed by determining the likelihood that the available data would be observed if the true mean were μ and the true standard deviation were σ. Maximum likelihood estimation finds the parameter values (μ and σ in this case) that were most likely to have generated the observed data (i.e., the parameter values that maximize the likelihood).
The likelihood of observing each data point can simply equal the probability density, f(x), for each; likelihood need only be proportional to probability. The likelihood of observing the first data point x_{1} is f(x_{1}). The likelihood of observing the second data point x_{2} is f(x_{2}), etc. In general, the likelihood of observing the ith data point x_{i} is f(x_{i}). If we assume that each data point is generated independently of each other, the likelihood of observing all n data points is simply the product of the n different values of f(x_{i}). Thus:
$$L = \prod_{i=1}^{n} f(x_i).$$
For various reasons, it is often simpler to use the logarithm of the likelihood. Thus,
$$\ln L = \sum_{i=1}^{n} \ln f(x_i).$$
Note that by expressing the equation in terms of the log likelihood, a sum has replaced the product operator, which can be easier to manipulate mathematically. Further, because lnL is a monotonic function of L, maximizing lnL is equivalent to maximizing L. Thus, maximum likelihood estimation usually involves finding the parameter values (μ and σ in the case of a normal distribution) that maximize lnL.
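The equivalence between the product form of the likelihood and the sum form of the log-likelihood can be verified numerically. The data below are hypothetical, and the two candidate parameter sets are arbitrary; this is a sketch of the calculation, not an estimation procedure.

```python
import math
from statistics import NormalDist

data = [1.2, 2.9, 2.4, 0.8, 2.7]  # hypothetical data, for illustration only

def likelihood(data, mu, sigma):
    """Product of the normal probability densities of independent data points."""
    model = NormalDist(mu, sigma)
    return math.prod(model.pdf(x) for x in data)

def log_likelihood(data, mu, sigma):
    """Sum of the log probability densities; equals the log of the product."""
    model = NormalDist(mu, sigma)
    return sum(math.log(model.pdf(x)) for x in data)

# Two arbitrary candidate parameter sets (mu, sigma).
for mu, sigma in [(2.0, 1.0), (0.0, 1.0)]:
    L = likelihood(data, mu, sigma)
    lnL = log_likelihood(data, mu, sigma)
    print(f"mu={mu}, sigma={sigma}: L={L:.3e}, ln(L)={math.log(L):.4f}, lnL={lnL:.4f}")
```

Because the logarithm is monotonic, the candidate with the larger likelihood also has the larger log-likelihood.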
While it is possible to derive the maximum likelihood estimators for the normal model (Box 3) and some other statistical models, often such expressions do not exist for other statistical models. In these cases, the likelihood needs to be maximized numerically.
Box 3. Maximum likelihood estimation and the normal distribution.
For the case of the normal distribution, the log-likelihood function is given by:
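$$\ln L = -n\ln\sigma - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}(x_i - \mu)^{2} \qquad (1)$$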
Note that by expressing the equation in terms of the log likelihood, we have avoided the exponential terms for the normal probability density function, simplifying the expression for the likelihood substantially.
For the case of a normal distribution, it is possible to obtain mathematical expressions for the values of μ and σ (known as the maximum likelihood estimators) that maximize lnL. Inspecting equation (1) reveals that the value of μ that maximizes lnL is the value that minimizes
$$S_x = \sum_{i=1}^{n}(x_i - \mu)^{2}$$
because μ does not appear in the other terms. This term is the sum of squares, so the value of μ that maximizes the likelihood is the same as the value that minimizes the sum of squares. Thus, the maximum likelihood estimate of μ is the same as the least squares estimate in the case of a normal distribution. Differentiating S_{x} with respect to μ, setting the derivative to zero and solving for μ, gives the value of μ that minimizes S_{x}. This procedure shows that the maximum likelihood estimator for μ is the sample mean $\bar{x}$ because this maximizes lnL.
The maximum likelihood estimate of σ can be obtained similarly. Note that μ = $\bar{x}$ when lnL is maximized, so at this point $S_x = (n-1)s^{2}$, where s^{2} is the sample variance. Thus, the value of σ that maximizes lnL is the one that maximizes $-n\ln\sigma - (n-1)s^{2}/(2\sigma^{2})$. Taking the derivative of this expression with respect to σ, setting it to zero and solving for σ yields its maximum likelihood estimate. This procedure reveals that the maximum likelihood estimator of σ is $s\sqrt{(n-1)/n}$, which is slightly smaller than the sample standard deviation s and approaches it as n increases.
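The estimators can be checked numerically by confirming that no nearby parameter values give a higher log-likelihood. This sketch uses hypothetical data and the standard normal log-likelihood, −n ln σ − (n/2) ln(2π) − Σ(x_i − μ)²/(2σ²). It also compares the maximizing value of σ with the sample standard deviation s; solving the derivative gives a divisor of n rather than n − 1, so the two differ by the factor √((n−1)/n).

```python
import math
import statistics

def log_likelihood(data, mu, sigma):
    """Log-likelihood of independent normal data."""
    n = len(data)
    sum_sq = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma) - n / 2 * math.log(2 * math.pi) - sum_sq / (2 * sigma ** 2)

data = [1.2, 2.9, 2.4, 0.8, 2.7]  # hypothetical data, for illustration only
n = len(data)

# Closed-form maximum likelihood estimates.
mu_hat = statistics.fmean(data)
sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)

# Sample standard deviation, which uses n - 1 in the divisor.
s = statistics.stdev(data)

best = log_likelihood(data, mu_hat, sigma_hat)

# Log-likelihood at small perturbations around the estimates.
steps = (-0.05, 0.0, 0.05)
neighbours = [
    log_likelihood(data, mu_hat + dm, sigma_hat + ds)
    for dm in steps for ds in steps
    if (dm, ds) != (0.0, 0.0)
]

print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}, s = {s:.3f}")
print(f"lnL at estimates = {best:.4f}; largest nearby lnL = {max(neighbours):.4f}")
```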
Maximum likelihood estimation can also be used to place confidence intervals on the estimates. A Z% confidence interval is defined by the values of the parameters for which values of lnL are within χ^{2}_{1−Z/100}/2 units of the maximum, where χ^{2}_{1−Z/100} is the chi-squared value with 1 degree of freedom corresponding to a p-value of 1−Z/100.
For example, in the case of the normal distribution, the 95% confidence interval based on the likelihood method reduces to the expression $\bar{x} \pm 1.96\,\sigma/\sqrt{n}$, which is a standard frequentist confidence interval.
Maximum likelihood estimation might appear a convoluted way of estimating the mean, standard deviation and confidence intervals that could be obtained using conventional methods when data are generated by a normal distribution. However, the power of maximum likelihood estimation is that it can be used to estimate parameters for probability distributions other than the normal using the same procedure of finding the parameter values under which the likelihood of generating the data is maximized (Box 4).
Box 4. Maximum likelihood estimation of a proportion.
Assume that we wish to estimate the probability of occurrence (p) within quadrats of a species in a particular vegetation type. With the species observed in y of n surveyed quadrats (and ignoring imperfect detectability), the likelihood of observing the data is proportional to p^{y}(1−p)^{n−y}. That is, the species occurred in y quadrats, an outcome that has likelihood p for each, and it was absent from n−y quadrats, an outcome that has likelihood (1−p) for each.
The log-likelihood in this case is yln(p) + (n−y)ln(1−p). The derivative of this with respect to p is y/p − (n−y)/(1−p), which equals zero at the maximum likelihood estimate of p. Some simple algebra yields the maximum likelihood estimator for p as y/n.
A Z% confidence interval can be obtained by finding the values of p such that the log-likelihood is within χ^{2}_{1–Z/100}/2 units of the maximum. The maximum log-likelihood is yln(y/n) + (n−y)ln(1− y/n), so the limits of the confidence interval are obtained by solving
yln(y/n) + (n−y)ln(1−y/n) − yln(p) − (n−y)ln(1−p) = χ^{2}_{1–Z/100}/2.
When y=0 or y=n, the terms beginning with y or (n−y) are zero, respectively, so analytical solutions are possible. In the former case, the confidence interval is [0, 1 − exp(−χ^{2}_{1–Z/100}/(2n))], while in the latter case it is [exp(−χ^{2}_{1–Z/100}/(2n)), 1]. In other cases, a numerical solution is required. For example, for y=1 and n=10, the 95% confidence interval is [0.006, 0.37].
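The numerical solution can be sketched with a simple bisection search on either side of the maximum likelihood estimate. The function names below are hypothetical, and the chi-squared value with 1 degree of freedom is obtained as the square of the corresponding standard normal value (1.96² ≈ 3.84 for a 95% interval), which avoids the need for an external statistics library.

```python
import math
from statistics import NormalDist

def log_lik(p, y, n):
    """Binomial log-likelihood (up to a constant); zero counts contribute nothing."""
    out = 0.0
    if y > 0:
        out += y * math.log(p)
    if n - y > 0:
        out += (n - y) * math.log(1 - p)
    return out

def bisect(func, lo, hi, tol=1e-10):
    """Root of func between lo and hi, assuming func changes sign on the interval."""
    f_lo = func(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        f_mid = func(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid
        else:
            hi = mid
    return (lo + hi) / 2

def likelihood_ci(y, n, level=0.95):
    """Likelihood-based confidence interval for a proportion."""
    chi2 = NormalDist().inv_cdf(0.5 + level / 2) ** 2  # chi-squared, 1 degree of freedom
    p_hat = y / n
    max_ll = log_lik(p_hat, y, n)

    def gap(p):
        # Zero where the log-likelihood is chi2 / 2 units below its maximum.
        return max_ll - log_lik(p, y, n) - chi2 / 2

    eps = 1e-12  # keeps the search away from log(0)
    lower = 0.0 if y == 0 else bisect(gap, eps, p_hat)
    upper = 1.0 if y == n else bisect(gap, p_hat, 1 - eps)
    return lower, upper

lower, upper = likelihood_ci(1, 10)
print(f"y=1, n=10: [{lower:.3f}, {upper:.2f}]")  # [0.006, 0.37]
```

For y = 0 the same function returns an upper limit that matches the analytical expression 1 − exp(−χ²/(2n)).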
Maximum likelihood estimation also extends generally to other statistical models. If we think of the data as being generated by a particular probability distribution, and relate the parameters of that distribution to explanatory variables, we have various forms of regression analysis. For example, if we assume the mean of a normal distribution is a linear function of explanatory variables, while the standard deviation is constant, we have standard linear regression. In this case, maximum likelihood methods would no longer estimate the mean; instead they would estimate the regression coefficients of the relationship between the mean and the explanatory variables. Assuming a non-linear relationship leads to non-linear regression. Change the assumed probability distribution, and we have a generalized linear model (McCullagh and Nelder 1989). Include both stochastic and deterministic components in the relationships between the parameters and the explanatory variables and we have mixed models (Gelman and Hill 2007). Thus, maximum likelihood estimation provides a powerful general approach to statistical inference.
3.4 Information theoretic methods
In general, adding parameters to a statistical model will improve its fit. Inspecting Fig. 7 might suggest that a 3 or 4 parameter function is sufficient to describe the relationship in the data. While the fit of the 10-parameter function is “perfect” in the sense that it intersects every point, it fails to capture what might be the main elements of the relationship (Fig. 7). In this case, using 10 parameters leads to over-fitting.
As well as failing to capture the apparent essence of the relationship, the 10-parameter function might make poor predictions. For example, the prediction when the explanatory variable equals 1.5 might be wildly inaccurate (Fig. 7). So while providing a very good fit to one particular set of data, an over-fitted model might both complicate understanding and predict poorly. In contrast, the two-parameter function might under-fit the data, failing to capture a non-linear relationship. Information theoretic methods address the trade-off between over-fitting and under-fitting.
Information theoretic methods use information theory, which measures uncertainty in a random variable by its entropy (Kullback 1959, Burnham and Anderson 2002, Jaynes 2003). Over the range of a random variable with probability density function f(x), entropy is measured by −∫ f(x) ln f(x) dx. Note that this is simply the negative of the expected value of the log-likelihood. If we think of f(x) as the true probability density function for the random variable, and we have an estimate (g(x)) of that, then the difference between the information content of the estimate and the truth is the Kullback-Leibler divergence, or the relative entropy (Kullback 1959):
D_{KL}(f, g) = ∫ f(x) ln[f(x)/g(x)] dx.
The Kullback-Leibler divergence can measure the relative distance of different possible models from the truth. When comparing two estimates of f(x), we can determine which departs least from the true density function, and use that as the best model because it minimizes the information lost relative to the truth.
Of course, we rarely know f(x). Indeed, an estimate of f(x) is often the purpose of the statistical analysis. Overcoming this issue is the key contribution of Akaike (1973), who derived an estimate of the relative amount of information lost or gained by using one model to represent the truth, compared to another, when only a sample of data is available to estimate f(x). This relative measure of information loss, known as Akaike’s Information Criterion (AIC), is asymptotically (for large sample sizes)
AIC = −2lnL + 2k,
where lnL is the value of the log-likelihood at its maximized value and k is the number of estimated parameters in the model. Thus, there is a close correspondence between maximum likelihood estimation and information theoretic methods based on AIC.
AIC is a biased estimate of the relative information loss when the sample size (n) is small, in which case a bias-corrected approximation can be used (Hurvich and Tsai 1989):
AIC_{c} = AIC + 2k(k+1)/(n−k−1).
AIC is based on an estimate of information loss, so a model with the lowest AIC is predicted to lose the least amount of information relative to the unknown truth. The surety with which AIC selects the best model (best in the sense of losing the least amount of information) depends on the difference in AIC between the models. The symbol ΔAIC is used to represent the difference in AIC between one model and another, usually expressed relative to the model with the smallest AIC for a particular dataset. Burnham and Anderson (2002) suggest rules of thumb to compare the relative support for the different models using ΔAIC.
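As a minimal sketch of these calculations, the code below computes AIC, the small-sample correction in its usual form, AIC_{c} = AIC + 2k(k+1)/(n−k−1), and ΔAIC relative to the best model. The log-likelihoods, parameter counts and sample size are hypothetical values chosen for illustration, not those behind Fig. 7.

```python
def aic(log_lik, k):
    """AIC = -2 lnL + 2k, with lnL the maximized log-likelihood."""
    return -2.0 * log_lik + 2.0 * k


def aicc(log_lik, k, n):
    """Small-sample corrected AIC: AIC + 2k(k + 1) / (n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("AICc requires n > k + 1")
    return aic(log_lik, k) + 2.0 * k * (k + 1) / (n - k - 1)


def delta(values):
    """Differences from the smallest value in the set (the best model)."""
    best = min(values)
    return [v - best for v in values]


# Hypothetical maximized log-likelihoods and parameter counts for three models
# fitted to the same n = 10 observations.
n_obs = 10
fits = [(-14.0, 2), (-9.0, 3), (-8.8, 4)]

aicc_values = [aicc(ll, k, n_obs) for ll, k in fits]
for (ll, k), d in zip(fits, delta(aicc_values)):
    print(f"k = {k}: lnL = {ll}, delta AICc = {d:.2f}")
```

Note how the correction term grows quickly as k approaches n, so that with few data the penalty for extra parameters is considerably larger than 2k.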
For example, the ΔAIC_{c} values indicate that the 3-parameter (quadratic) function has the most support of those in Fig. 7 (Table 1). This is perhaps reassuring given that these data were actually generated using a quadratic function with an error term added.
Table 1. The ΔAIC_{c} values for the functions shown in Fig. 7, assuming normal distributions of the residuals. The clearly over-fitted 10-parameter function is excluded; in this case it fits the data so closely that the deviance −2lnL approaches negative infinity. AIC_{c} weights (w_{i}) are also shown.
Number of parameters | ΔAIC_{c} | w_{i}
2 | 8.47 | 0.0002
3 | 0 | 0.977
4 | 3.74 | 0.023
The term −2lnL is known as the deviance, which increases as the likelihood L declines. Thus, AIC increases with the number of parameters and declines as the fit to the data improves, capturing the trade-off between under-fitting and over-fitting the data. While the formula for AIC is simple and implies a direct trade-off between lnL and k, it is important to note that this trade-off is not arbitrary. Akaike (1973) did not simply decide to weight lnL and k equally in the trade-off. Instead, the trade-off arises from an estimate of the information lost when using a model to approximate an unknown truth.
Information theoretic methods provide a valuable framework for determining an appropriate choice of statistical models when aiming to parsimoniously describe variation in a particular dataset. In this sense, a model with a lower AIC is likely to predict a replicate set of data better, as measured by relative entropy, than a model with a higher AIC.
Use of AIC extends to weighting the support for different models. For example, with a set of m candidate models, the weight assigned to model i is (Burnham and Anderson 2002):
w_{i} = exp(−ΔAIC_{i}/2) / Σ_{j=1}^{m} exp(−ΔAIC_{j}/2).
Standardizing by the sum in the denominator, the weights sum to 1 across the m models. In addition to assessing the support for individual models, the support for including different parameters can be evaluated by summing the weights of those models that contain the parameter.
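A minimal sketch of the weighting, using the Burnham and Anderson (2002) form in which each model's weight is proportional to exp(−ΔAIC/2). The candidate models, their explanatory variables ("rain", "temp") and their ΔAIC values are hypothetical and serve only to show the arithmetic.

```python
import math


def akaike_weights(deltas):
    """Weights proportional to exp(-delta / 2), standardized to sum to 1."""
    relative = [math.exp(-0.5 * d) for d in deltas]
    total = sum(relative)
    return [r / total for r in relative]


# Hypothetical candidate set: the terms each model contains and its delta AIC.
candidates = [
    {"terms": {"rain"}, "delta": 0.0},
    {"terms": {"rain", "temp"}, "delta": 1.2},
    {"terms": {"temp"}, "delta": 6.0},
    {"terms": set(), "delta": 9.5},
]

weights = akaike_weights([c["delta"] for c in candidates])


def summed_weight(term):
    """Support for a term: sum of the weights of the models that contain it."""
    return sum(w for c, w in zip(candidates, weights) if term in c["terms"])


for c, w in zip(candidates, weights):
    print(sorted(c["terms"]), round(w, 3))
print("rain:", round(summed_weight("rain"), 3))
print("temp:", round(summed_weight("temp"), 3))
```

Summed weights are only comparable among terms when each term appears in a similar number of candidate models, so the design of the candidate set matters.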
Relative support as measured by AIC is relevant to the particular dataset being analyzed. A variable is not demonstrated to be unimportant simply because a set of models might hold little support for a variable as measured by AIC. Instead, a focus on estimated effects is important. Consider the case of a sample of data that is used to compare one model in which the mean is allowed to differ from zero (and the mean is estimated from the data) and another model in which the mean is assumed equal to zero (Fig. 8). An information theoretic approach might conclude, in this case, that there is at most only modest support for a model in which the mean differs from zero (ΔAIC = 1.34 in this case).
Values in the second dataset are much more tightly clustered around the value of zero (Fig. 8). One might expect that the second dataset would provide much greater support for the model in which the mean is zero. Yet the relative support for this model, as measured by AIC, is the same for both. The possible value of the parameter is better reflected in the confidence interval for each dataset (Fig. 8), which suggests that the estimate of the mean in dataset 2 is much more clearly close to zero than in dataset 1.
This is a critical point when interpreting results using information theoretic methods. The possible importance of a parameter, as measured by the width of its confidence interval, is not necessarily reflected in the AIC value of the model that contains it as an estimated parameter. For example, if a mean of 2 or more was deemed a biologically important effect, then dataset 2 provides good evidence that the effect is not biologically important, while dataset 1 is somewhat equivocal with regard to this question. Unless referenced directly to biologically important effect sizes, AIC does not indicate biological importance.
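The point can be illustrated with a small sketch. It assumes normally distributed data with the standard deviation estimated in both models, and it assumes that the second dataset is simply the first divided by 10; both datasets are hypothetical and are not the data in Fig. 8. Under these assumptions the difference in AIC between the two models is identical for the two datasets, while the confidence interval for the mean is ten times narrower in the second.

```python
import math


def max_log_lik_normal(data, mean):
    """Maximized normal log-likelihood for a given mean, variance at its MLE."""
    n = len(data)
    var = sum((x - mean) ** 2 for x in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)


def delta_aic(data):
    """AIC(mean fixed at zero) minus AIC(mean estimated from the data)."""
    xbar = sum(data) / len(data)
    aic_free = -2.0 * max_log_lik_normal(data, xbar) + 2 * 2  # mean and sd
    aic_zero = -2.0 * max_log_lik_normal(data, 0.0) + 2 * 1   # sd only
    return aic_zero - aic_free


def half_width_95(data, t_crit):
    """Half-width of the 95% confidence interval for the mean."""
    n = len(data)
    xbar = sum(data) / n
    sd = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))
    return t_crit * sd / math.sqrt(n)


T_CRIT_9DF = 2.262  # 0.975 quantile of Student's t with 9 df (entered by hand)

# Hypothetical data: dataset 2 is dataset 1 on a ten-fold smaller scale.
dataset1 = [3.1, -4.2, 5.0, -1.3, 2.4, 6.2, -2.8, 1.9, 4.4, -0.7]
dataset2 = [x / 10.0 for x in dataset1]

for name, data in (("dataset 1", dataset1), ("dataset 2", dataset2)):
    print(name, "delta AIC:", round(delta_aic(data), 3),
          "CI half-width:", round(half_width_95(data, T_CRIT_9DF), 3))
```

The AIC difference depends only on the ratio of the mean to the standard deviation, which is why rescaling the data leaves it unchanged.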
3.5 Bayesian methods
If a set of data estimated the annual adult survival rate of a population of bears to be 0.5, but with a wide 95% confidence interval of [0.11, 0.89] (e.g., two survivors from four individuals monitored for a year; Box 4) what should I conclude? Clearly, more data would be helpful, but what if waiting for more data and a better estimate were undesirable?
Being Australian, I have little personal knowledge of bears, even drop bears (Janssen 2012), but theory and data (e.g., Haroldson 2006, Taylor et al. 2005, McCarthy et al. 2008) suggest that mammals with large body masses are likely to have high survival rates. Using relationships between annual survival and body mass of mammals, and accounting for variation among species, among studies and among taxonomic orders, the survival rate for carnivores can be predicted (Fig. 9). For a large bear of 245 kg (the approximate average body mass of male grizzly bears, Nagy and Haroldson 1990), the 95% prediction interval is [0.72, 0.98].
This prediction interval can be thought of as my expectation of the survival rate of a large bear. Against this a priori prediction, I would think that the relatively low estimate of 0.5 from the data (with 95% confidence interval of [0.11, 0.89]) might be due to (bad) luck. But now I have two estimates, one based on limited data from a population in which I am particularly interested, and another based on global data for all mammal species.
Bayesian methods can combine these two pieces of information to form a coherent estimate of the annual survival rate (McCarthy 2007). Bayesian inference is derived from a simple re-arrangement of conditional probability. Bayes’ rule states that the probability of a parameter value (e.g., the annual survival rate of the bear, s) given a set of new data (D) is
Pr(s | D) = Pr(D | s) Pr(s) / Pr(D),
where Pr(D | s) is the probability of the new data given a particular value for survival (this is simply the likelihood, so Pr(D | s) = L(D | s)), Pr(s) is the probability of the parameter value unconditioned by the new data, and Pr(D) is the probability of the new data unconditioned by the survival rate.
Pr(s), being independent of the data, represents the prior understanding about the values of the parameter s. A probability density function f(s) can represent this prior understanding. A narrow density function indicates that the parameter is already estimated quite precisely, while a wide density function indicates that there is little prior information.
To make Pr(D) independent of a particular value of s, it is necessary to integrate over the possible values of s. Thus, for continuous values of s, Pr(D) = ∫ L(D | s) f(s) ds, and Bayes’ rule becomes
Pr(s | D) = L(D | s) f(s) / ∫ L(D | s) f(s) ds.
When the parameter values are discrete, the integral in the denominator is replaced by a summation, but it is otherwise identical. The probability distribution f(s) is known as the prior distribution or simply the “prior”. The posterior distribution Pr(s | D) describes the estimate of s that includes information from the prior, the data and the statistical model.
In the case of the bear example, the prior (from Fig. 9) combines with the data and statistical model to give the posterior (Fig. 10). The posterior is a weighted average of the prior and the likelihood, and is weighted more toward whichever of the two is more precise (in Fig. 10, the prior is more precise).
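The calculation can be sketched numerically by evaluating the prior and the likelihood over a fine grid of survival rates and approximating the integral in the denominator of Bayes' rule by a sum. The prior here is a hypothetical Beta(15, 2) shape that places most weight on high survival; it stands in for, and is not, the prior derived from Fig. 9. The data are the two survivors from four monitored bears.

```python
def grid_posterior(prior_density, likelihood, n_grid=2000):
    """Posterior density on a grid over (0, 1), using the midpoint rule."""
    step = 1.0 / n_grid
    grid = [(i + 0.5) * step for i in range(n_grid)]
    unnormalized = [prior_density(s) * likelihood(s) for s in grid]
    # This sum approximates the integral in the denominator of Bayes' rule.
    total = sum(unnormalized) * step
    return grid, [u / total for u in unnormalized], step


def prior(s):
    """Hypothetical Beta(15, 2) prior, left unnormalized (constants cancel)."""
    return s ** 14 * (1.0 - s)


def likelihood(s):
    """Binomial likelihood (up to a constant) of 2 survivors from 4 bears."""
    return s ** 2 * (1.0 - s) ** 2


def quantile(grid, density, step, q):
    """Smallest grid value at which the cumulative probability reaches q."""
    cumulative = 0.0
    for s, d in zip(grid, density):
        cumulative += d * step
        if cumulative >= q:
            return s
    return grid[-1]


grid, posterior, step = grid_posterior(prior, likelihood)
posterior_mean = sum(s * d for s, d in zip(grid, posterior)) * step
interval = (quantile(grid, posterior, step, 0.025),
            quantile(grid, posterior, step, 0.975))
print(round(posterior_mean, 3), [round(v, 2) for v in interval])
```

With this prior the posterior mean lies between the estimate from the data alone (0.5) and the prior mean (15/17), and closer to the prior because the prior is the more precise of the two.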
Difficulties of calculating the denominator of Bayes’ rule partly explain why Bayesian methods, despite being first described 250 years ago (Bayes 1763), are only now becoming more widely used. Computational methods to calculate the posterior distribution, particularly Markov chain Monte Carlo (MCMC) methods, coupled with sufficiently fast computers and available software are making Bayesian analysis of realistically complicated models feasible. Indeed, the methods are sufficiently advanced that arbitrarily complicated statistical models can be analyzed.
Previously, statistical models were limited to those provided in computer packages. Bayesian MCMC methods mean that ecologists can now easily develop and analyze their own statistical models. For example, linear regression is based on four assumptions: a linear relationship for the mean, residuals being drawn from a normal distribution, equal variance of the residuals along the regression line, and no dependence among those residuals. Bayesian MCMC methods allow you to relax any number of those assumptions in your statistical model.
Posterior distributions contain all the information about parameter estimates from Bayesian analyses. These are often summarized by calculating various statistics. The mean or median of a posterior distribution can indicate its central tendency. Its standard deviation indicates the uncertainty of the estimate; it is analogous to the standard error of a statistic in frequentist analysis. Inner percentile ranges are used to calculate credible intervals. For example, the range of values bounded by the 2.5 percentile and the 97.5 percentile of the posterior distribution is commonly reported as a 95% credible interval.
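To make these summaries concrete, the sketch below draws samples from a posterior using a random-walk Metropolis algorithm, one simple form of MCMC, and then summarizes them. The target is the bear survival example with a hypothetical Beta(15, 2) prior (not the prior from Fig. 9) and two survivors from four individuals; the starting value, step size, burn-in length and seed are arbitrary choices of mine.

```python
import math
import random
import statistics


def log_posterior(s):
    """Unnormalized log-posterior: Beta(15, 2) prior times binomial likelihood."""
    if s <= 0.0 or s >= 1.0:
        return -math.inf  # survival rates outside (0, 1) are impossible
    return 16.0 * math.log(s) + 3.0 * math.log(1.0 - s)


def metropolis(log_target, start, n_iter, step_sd, rng):
    """Random-walk Metropolis sampler with normal proposals."""
    samples = []
    current, current_lp = start, log_target(start)
    for _ in range(n_iter):
        proposal = current + rng.gauss(0.0, step_sd)
        proposal_lp = log_target(proposal)
        # Accept with probability min(1, posterior ratio); only the ratio is
        # needed, so the denominator of Bayes' rule never has to be calculated.
        if rng.random() < math.exp(min(0.0, proposal_lp - current_lp)):
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


rng = random.Random(1)
burn_in = 2000
draws = metropolis(log_posterior, 0.8, 22000, 0.1, rng)[burn_in:]

ordered = sorted(draws)
post_mean = statistics.mean(draws)
post_sd = statistics.stdev(draws)
lower = ordered[int(0.025 * len(ordered))]   # 2.5 percentile
upper = ordered[int(0.975 * len(ordered))]   # 97.5 percentile
print(round(post_mean, 2), round(post_sd, 2), round(lower, 2), round(upper, 2))
```

In practice one would also check convergence, for example by running several chains from different starting values, before trusting such summaries.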
Credible intervals of Bayesian analyses are analogous to confidence intervals of frequentist analyses, but they differ. Because credible intervals are based on posterior distributions, we can say that the probability is 0.95 that the true value of a parameter occurs within its 95% credible interval (conditional on the prior, data and the statistical model). In contrast, confidence intervals are based on the notion of replicate sampling and analysis; if we conducted this study a large number of times, the true value of a parameter would be contained in a Z% confidence interval constructed in this particular way Z% of the time (conditional on the data and the statistical model). In most cases, the practical distinction between the two definitions of intervals is inconsequential because they are similar (see below).
The relative influence of the prior and the data on the posterior is well illustrated by estimates of annual survival of female European dippers based on mark-recapture analysis. As for the mammals, a relationship between annual survival and body mass of European passerines can be used to generate a prior for dippers (McCarthy and Masters 2005).
Three years of data (Marzolin 1988) are required to estimate survival rate in mark-recapture models that require joint estimation of survival and recapture probabilities. If only the first three years of data were available, the estimate of annual survival would be very imprecise. In the relatively short time it takes to compile and analyze existing data (about half a day with ready access to a library), a prior estimate can be generated that is noticeably more precise (left-most interval in Fig. 11).
Three years of data might be the limit of what could be collected during a PhD project. If you are a PhD student at this point, you might be a bit depressed that a more precise estimate can be obtained by simply analyzing existing data compared with enduring the trials (and pleasures) of field work for three years.
However, since you are reading this, you clearly have an interest in Bayesian statistics. And hopefully you have already realized that you can use my analysis of previous data as a prior, combine it with the data, and obtain an estimate that is even more precise. The resulting posterior is shown by the credible interval at year 3 (Fig. 11). Note that because the estimate based only on the data is much less precise than the prior, the posterior is very similar to the estimate based only on the prior. In fact, five years of data are required before the estimate based only on the data is more precise than the prior. Thus, the prior is initially worth approximately 4-5 years of data, as measured by the precision of the resulting estimate.
In contrast, the estimate based only on seven years of data has approximately the same precision as the estimate using both the prior and six years of data (Fig. 11). Thus, with this much data, the prior is worth about one year of data. The influence of the prior on the posterior in this case is reduced because the estimate based on the data is more precise than the prior. Still, half a day of data compilation and analysis seems a valuable investment when it is worth another year of data collection in the field.
In Bayes’ rule, probability is being used as a measure of how much a rational person should “believe” that a particular value is the true value of the parameter, given the information at hand. In this case, the information consists of the prior knowledge of the parameter as represented by f(s), and the likelihood of the data for the different possible values of the parameter. As for any statistical model, the likelihood is conditional on the model being analyzed, so it is relatively uncontroversial. Nevertheless, uncertainty about the best choice of model remains, so this question also needs to be addressed in Bayesian analyses.
The priors for annual survival of mammals and European passerines are derived from an explicit statistical model of available data. In this sense, the priors are no more controversial than the choice of statistical model for data analysis; it is simply a judgement about whether the statistical model is appropriate. Controversy arises, however, because I extrapolated from previous data, different species and different study areas to generate a prior for a unique situation. I attempted to account for various factors in the analysis by including random effects such as those for studies, species, taxonomic orders and particular cases within studies. However, a lingering doubt will persist; is this new situation somehow unique such that it lies outside the bounds of what has been recorded previously? This doubt is equivalent to questions about whether a particular data point in a sample is representative of the population that is the intended focus of sampling. However, the stakes with Bayesian priors can be higher when the prior contains significant amounts of information relative to the data.
Controversy in the choice of the prior essentially reflects a concern that the prior will bias the estimates if it is unrepresentative. Partly in response to this concern, partly because using informative priors is rare in ecology, and partly because prior information might have little influence on the results (consider using 7 years of data in Fig. 11), most ecologists use Bayesian methods with what are known as “uninformative”, “vague” or “flat” priors.
Note that in Bayes’ rule, the numerator is the prior multiplied by the likelihood. The denominator of Bayes’ rule re-calibrates this product so the posterior conforms to probability (i.e., the area under the probability density function equals 1). Therefore, the posterior is simply proportional to the product of the prior and the likelihood. If the prior is flat across the range of the likelihood function, then the posterior will have the same shape as the likelihood. The consequence is that parameter estimates based on uninformative priors are very similar to parameter estimates based only on the likelihood function (i.e., a frequentist analysis). For example, a Bayesian analysis of the data in Fig. 3 with uninformative priors produces 95% credible intervals that are so similar to the confidence intervals in Fig. 4 that it is not worth reproducing them.
In this example, I know the Bayesian prior is uninformative because the resulting credible intervals and confidence intervals are the same. In essence, close correspondence between the posterior and the likelihood is the only surety that the prior is indeed uninformative. However, if the likelihood function can be calculated directly, why bother with the Bayesian approach? In practice, ecologists using Bayesian methods tend to assume that particular priors are uninformative, or use a range of different reasonable priors. The former is relatively safe for experienced users of standard statistical models, who might compare the prior and the posterior to be sure the prior has little influence. The latter is a form of robust Bayesian analysis, whereby a robust result is one that is insensitive to the often arbitrary choice of prior (Berger 1985).
Why would an ecologist bother to use Bayesian methods when informative priors are rarely used in practice, when uninformative priors provide answers that are essentially the same as those based on likelihood analysis, and when priors are only surely non-informative when the posterior can be compared with the likelihood? The answer is the convenience of fitting statistical models that conform to the data. Hierarchical models represent one class of such models.
While frequentist methods can also be used, hierarchical models in ecology are especially well suited to Bayesian analyses (Clark 2005, Gelman and Hill 2007). Hierarchical models consider responses at more than one level in the analysis. For example, they can accommodate nested data (e.g., one level modeling variation among groups, and another modeling variation within groups), random coefficient models (regression coefficients themselves being modeled as a function of other attributes), or state-space models. State-space models include, for example, a model of the underlying (but unobserved) ecological process overlaid by a model of the data collection, which allows inference about the underlying process and not just the observed data (McCarthy 2011).
Because prior and posterior distributions represent the degree of belief in the true value of a parameter, an analyst can base priors on subjective judgements. The advantage of using Bayesian methods in this case is that these subjective judgements are updated logically as data are analyzed. Use of subjective priors with Bayesian analyses might, therefore, be useful for personal judgements. However, such subjective judgements of an individual might be of little interest to others, and might have little role in wider decisions or scientific consensus (unless that individual were particularly influential, but even then such influence might be undesirable).
In contrast, when priors reflect the combined judgments of a broad range of relevant people and are compiled in a repeatable and unbiased manner (Martin et al. 2005), combining them with data via Bayes’ rule can be extremely useful. In this case, Bayesian methods provide a means to combine a large body of expert knowledge with new data. While the expert knowledge might be wrong (Burgman 2005), the important aspect of Bayesian analysis is that its integration with data is logical and repeatable.
Additionally, priors that are based on compilation and analysis of existing data are also valuable. Such compilation and analysis is essentially a form of meta-analysis. Indeed, Bayesian methods are often used for meta-analysis. Discussion sections of publications often compare and seek to integrate the new results with existing knowledge. Bayesian methods do this formally using coherent and logical methods, moving that integration into the methods and results of the paper, rather than confining the integration to subjective assessment in the discussion. If ecology aims to have predictive capacity beyond particular case studies, then Bayesian methods with informative priors will be used more frequently.
3.6 Non-parametric methods
This chapter emphasizes statistical analyses that are founded on probabilistic models. These require an assumption that the data are generated according to a specified probability distribution. Non-parametric methods have been developed to avoid the need to pre-specify a probability distribution. Instead, the distribution of the collected data is used to define the sampling distribution. So while non-parametric methods are sometimes described as being “distribution-free”, this simply means that the analyst does not choose a distribution; rather the data are used to define the distribution.
A wide range of non-parametric methods exist (Conover 1998). Instead of describing them all here, I will focus on only one method as an example. Non-parametric methods often work by iterative re-sampling of the data, calculating relevant statistics of each sub-sample, and then defining the distribution of the sample statistics by the distribution of the statistics of the sub-samples.
Bootstrapping is one such re-sampling method. Assume that we have a sample of size n, for which we want to calculate a 95% confidence interval but are unable or unwilling to assume a particular probability distribution for the data. We can use bootstrapping to calculate a confidence interval by randomly re-sampling (with replacement) the values of the original sample, and generate a new sample of size n. We then calculate the relevant sample statistic (e.g., the mean), and record the value. This procedure is repeated many times. Percentiles of the resulting distribution of sample statistics are used to define a confidence interval. For example, the 2.5 percentile and 97.5 percentile of the distribution of re-sampled statistics would define a 95% confidence interval. For the data in Fig. 3, the resulting bootstrapped confidence intervals for the mean, while narrower than those derived assuming a normal distribution, are largely similar (Fig. 12).
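The procedure can be sketched as follows, using only the Python standard library. The sample is hypothetical (it is not the data in Fig. 3), and the function name, the number of re-samples and the seed are illustrative choices.

```python
import random
import statistics


def bootstrap_ci(data, statistic=statistics.mean, n_boot=10000,
                 level=0.95, seed=1):
    """Percentile bootstrap confidence interval for a sample statistic."""
    rng = random.Random(seed)
    n = len(data)
    # Re-sample n values with replacement, n_boot times, recording the
    # statistic of each re-sample.
    stats = sorted(statistic(rng.choices(data, k=n)) for _ in range(n_boot))
    alpha = (1.0 - level) / 2.0
    lower = stats[int(alpha * n_boot)]
    upper = stats[min(n_boot - 1, int((1.0 - alpha) * n_boot))]
    return lower, upper


# Hypothetical sample of 10 measurements.
sample = [12.1, 9.8, 14.3, 11.0, 10.4, 13.7, 8.9, 12.6, 11.8, 10.1]

print(bootstrap_ci(sample))                               # interval for the mean
print(bootstrap_ci(sample, statistic=statistics.median))  # interval for the median
```

The same function works for other statistics, such as the median, which is part of the appeal of the approach; more refined bootstrap intervals exist but are beyond this sketch.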
Non-parametric methods tend to be used when analysts are unwilling to assume a particular probabilistic model for their data. This reluctance was greatest when statistical models based on the normal distribution were most common. With greater use of statistical models that use other distributions (e.g., generalized linear models, McCullagh and Nelder 1989), the impetus to use non-parametric methods is reduced.
4 Appropriate use of statistical methods
With such a broad array of approaches to statistical inference in ecology, which approach should you choose? The literature contains debates about this (e.g., Dennis 1996; Anderson et al. 2000; Burnham and Anderson 2002; Stephens et al. 2005). To a small extent, I have contributed to those debates. For example, my book on Bayesian methods (McCarthy 2007) was partly motivated by misuses of statistics. I thought greater use of Bayesian methods would reduce that misuse. Now, I am less convinced. And the debates seem to distract from more important issues.
The key problem with statistical inference in ecology is not resolving which statistical framework to choose, but appropriate reporting of the analyses. Consider the confidence and credible intervals for the data in Fig. 3; the intervals, representing estimates of the mean, are very similar regardless of the method of statistical inference (Figs 4 and 12). In these cases, the choice of statistical “philosophy” to estimate parameters is not very important. Yes, the formal interpretation and meaning of a confidence interval and a credible interval differ. However, assume that I constructed a confidence interval using likelihood methods, and interpreted that confidence interval as if it were a Bayesian credible interval formed with a flat prior. Strictly, this is not correct. Practically, it makes no difference because I would have obtained the same numbers however I constructed the intervals.
Understanding the relatively infrequent cases when credible intervals differ from confidence intervals (Jaynes 2003) is valuable. For example, there is a difference between the probability of recording a species as being present at a site, and the probability that the species is present at a site given it is recorded (or not). The latter, quite rightly, requires a prior probability and Bayesian analysis (Wintle et al. 2012). However, the choice of statistical model and appropriate reporting and interpretation of the results are much more important matters. Here I list and briefly discuss some of the most important problems with the practice of statistical inference in ecology, and conclude with how to help overcome these problems.
Null hypothesis significance testing is frequently based on nil nulls, which leads to trivial inference (Anderson et al. 2000, Fidler et al. 2006). Nil nulls are hypotheses that we know, a priori, have no hope of being true. Some might argue that null hypothesis significance testing conforms with Popperian logic based on falsification. But Popper requires bold conjectures, so the null hypothesis needs to be plausibly true. Rejecting a nil null that is already known to be false is unhelpful, regardless of whether or not Popperian falsification is relevant in the particular circumstance. Ecologists should avoid nil nulls, and if using null hypothesis significance testing they should base the nulls on sound theory or empirical evidence of important effects. If a sensible null hypothesis cannot be constructed, which will be frequent in ecology, then null hypothesis significance testing should be abandoned and the analysis limited to estimation of effect sizes.
Null hypothesis significance testing aims to reject the null hypothesis. Failure to reject a null hypothesis is often incorrectly used as evidence that the null hypothesis is true (Fidler et al. 2006). This is especially important because power is often low (Jennions and Møller 2003), it is almost never calculated in ecology (Fidler et al. 2006), and ecologists tend to overestimate statistical power when they judge it subjectively (Burgman 2005). Low statistical power means that the null hypothesis is unlikely to be rejected even if it were false. Given the preceding, failure to reject the null should never be reported as evidence in favor of the null unless power is known to be high.
A confidence interval or credible interval for a parameter that overlaps zero is often used incorrectly as evidence that the associated effect is biologically unimportant. This is analogous to equating failure to reject a null hypothesis with a biologically unimportant effect. Users of all statistical methods are vulnerable to this fallacy. For example, low AIC weights or high ΔAIC values are sometimes used to infer that a parameter is biologically unimportant. Yet AIC values are not necessarily sensitive to effect sizes (Fig. 8).
P-values are often viewed as being highly replicable, when in fact they are typically variable (Cumming 2011). Further, the size of the p-value does not necessarily indicate how different the p-value from a new replicate might be. In contrast, confidence intervals are less variable, and also indicate the magnitude of possible variation that might occur in a replicate of the experiment (Cumming 2011). They should be used and interpreted much more frequently.
Effect sizes, and associated measures of precision such as confidence intervals, are often not reported. This is problematic for several reasons. Firstly, the size of the effect is often very informative. While statistical power is rarely calculated in ecology, the precision of an estimate conveys information about power (Cumming 2011). Many ecologists might not have the technical skills to calculate power, but all ecologists should be able to estimate and report effect sizes with confidence intervals. Further, failure to report effect sizes hampers meta-analysis because the most informative meta-analyses are based on them. Meta-analysis is extremely valuable for synthesizing and advancing scientific research (ref to Gurevitch chapter), so failure to report effect sizes directly hampers science.
I listed the failure to report effect sizes last because addressing it is relatively easy, and doing so overcomes many of the other problems. Reporting effect sizes with intervals invites an interpretation of biological importance. If variables are scaled by the magnitude of variation in the data, then effect sizes reflect the predicted range of responses in that dataset. For example, Parris (2006) reported regression coefficients in terms of how much the predicted species richness changed across the range of the explanatory variables in Poisson regression models (Fig. 13). This illustrates that more than ten-fold changes in expected species richness are possible across the range of some variables (e.g., road cover) but such large effects are unlikely for other variables (e.g., fringing vegetation). Nevertheless, all the intervals encompass possible effects that are larger than a doubling of expected species richness regardless of the particular statistical model. These results quantify how precisely the parameters are estimated in this particular study. They also permit direct comparison with effect sizes in similar studies, either informally, or by using meta-analysis.
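To make the scaling idea concrete, here is a minimal sketch of how a coefficient from a log-link (Poisson) regression can be expressed as a multiplicative change across the observed range of an explanatory variable. The coefficient, its interval and the variable's range are hypothetical numbers chosen for illustration. They are not the estimates of Parris (2006), and `fold_change` is a helper name introduced here.

```python
import math


def fold_change(beta, x_min, x_max):
    """Multiplicative change in the expected count across [x_min, x_max]
    for a coefficient beta from a log-link (Poisson) regression."""
    return math.exp(beta * (x_max - x_min))


# Hypothetical coefficient (per unit of the explanatory variable) and 95% limits
beta_hat, beta_low, beta_high = -0.03, -0.05, -0.01
x_min, x_max = 0.0, 80.0   # hypothetical observed range of the variable

estimate = fold_change(beta_hat, x_min, x_max)
interval = (
    fold_change(beta_low, x_min, x_max),
    fold_change(beta_high, x_min, x_max),
)
print(f"expected richness changes by a factor of {estimate:.3f}")
print(f"95% interval for that factor: ({interval[0]:.3f}, {interval[1]:.3f})")
```

A factor below 1 is a decline: a value near 0.09 corresponds to a more than ten-fold reduction across the range.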
Confidence intervals can still be interpreted poorly by an author. For example, authors might still interpret a confidence interval that encompasses zero as evidence that the associated variable is unimportant. However, reporting them properly is critical, because they can still be interpreted appropriately by readers.
These problems in the use of statistical inference are not unique to ecology. However, looking beyond ecology is important to understand how other disciplines have improved. Some disciplines have largely overcome these problems (Fidler et al. 2006), while others are making progress by recommending reporting of effect sizes (Cumming 2011). The key to change is a concerted effort across a discipline. This needs to involve authors and reviewers, but editors, as the final arbiters of what constitutes acceptable scientific practice, are particularly influential.
Statistical inference is critical in ecology because data are variable and replication is often difficult. While statistical methods are becoming more complex, it is important that statistical practices are founded on sound principles of interpretation and reporting. A greater emphasis in ecology on basic estimation, reporting and interpretation of effect sizes is critical for the discipline.
References
Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B N Petrov and F Csaki, eds. Second International Symposium on Information Theory, pp. 267-281. Akademiai Kiado, Budapest.
Anderson, D.R., Burnham, K.P. and Thompson, W.L. (2000). Null hypothesis testing: problems, prevalence, and an alternative. Journal of Wildlife Management, 64, 912-923.
Bayes, T.R. (1763). An essay towards solving a problem in the doctrine of chances. Philosophical Transactions, 53, 370-418.
Begon, M., Townsend, C.R. and Harper, J.L. (2005). Ecology: From Individuals to Ecosystems, 4th Edition. Wiley-Blackwell, Malden, MA, USA.
Berger, J.O. (1985). Statistical Decision Theory and Bayesian Analysis. Springer-Verlag, New York, USA.
Buckland, S.T., Anderson, D.R., Burnham, K.P. and Laake, J.L. (1993). Distance Sampling: Estimating Abundance of Biological Populations. Chapman and Hall, London, UK.
Burgman, M. (2005). Risks and Decisions for Conservation and Environmental Management. Cambridge University Press, Cambridge, UK.
Burnham, K.P. and Anderson, D.R. (2002). Model Selection and Multimodel Inference: a Practical Information-Theoretic Approach. Springer-Verlag, New York.
Clark, J.S. (2005). Why environmental scientists are becoming Bayesians. Ecology Letters, 8, 2-15.
Conover, W.J. (1998). Practical Nonparametric Statistics. Wiley, New York.
Cumming, G. (2011). Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-analysis. Routledge, New York.
Dennis, B. (1996). Discussion: should ecologists become Bayesians? Ecological Applications, 6, 1095-1103.
Durrett, R. and Levin, S. (1996). Spatial models for species-area curves. Journal of Theoretical Biology, 179, 119-127.
Fidler, F., Burgman, M., Cumming, G., Buttrose, R. and Thomason, N. (2006). Impact of criticism of null hypothesis significance testing on statistical reporting practices in conservation biology. Conservation Biology, 20, 1539-1544.
Gelman, A. and Hill, J. (2007). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, Cambridge, UK.
Haroldson, M.A., Schwartz, C.C. and White, G.C. (2006). Survival of independent grizzly bears in the greater Yellowstone ecosystem, 1983-2001. Wildlife Monographs, 161, 33-43.
Hurvich, C.M. and Tsai, C-L. (1989). Regression and time series model selection in small samples. Biometrika, 76, 297-307.
Janssen, V. (2012). Indirect tracking of drop bears using GNSS technology. Australian Geographer, 43, 445-452.
Jaynes, E.T. (2003). Probability Theory: The Logic of Science. Cambridge University Press, Cambridge, UK.
Jennions, M.D. and Møller, A.P. (2003). A survey of the statistical power of research in behavioral ecology and animal behavior. Behavioral Ecology, 14, 438-445.
Kery, M. (2002). Inferring the absence of a species: a case study of snakes. Journal of Wildlife Management, 66, 330-338.
Kooijman, S.A.L.M. (2010). Dynamic Energy Budget Theory for Metabolic Organisation, 3rd edition. Cambridge University Press.
Kullback, S. (1959). Information Theory and Statistics. Wiley, New York.
Lyons, I.M. and Beilock, S.L. (2012). When math hurts: math anxiety predicts pain network activation in anticipation of doing math. PLoS ONE, 7(10), e48076. doi:10.1371/journal.pone.0048076
Martin, T.G., Kuhnert, P.M., Mengersen, K. and Possingham, H.P. (2005). The power of expert opinion in ecological models using Bayesian methods: impact of grazing on birds. Ecological Applications, 15, 266-280.
Marzolin, G. (1988). Polygynie du Cincle plongeur (Cinclus cinclus) dans les côtes de Lorraine. L’Oiseau et la Revue Française d’Ornithologie, 58, 277-286.
McCarthy, M.A. (2007). Bayesian Methods for Ecology. Cambridge University Press, Cambridge.
McCarthy, M.A. (2011). Breathing some air into the single-species vacuum: multi-species responses to environmental change. Journal of Animal Ecology, 80, 1-3.
McCarthy, M.A. and Masters, P. (2005). Profiting from prior information in Bayesian analyses of ecological data. Journal of Applied Ecology, 42, 1012-1019.
McCarthy, M.A., Citroen, R. and McCall, S.C. (2008). Allometric scaling and Bayesian priors for annual survival of birds and mammals. American Naturalist, 172, 216-222.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Second Edition. Chapman and Hall/CRC, Boca Raton.
Nagy, J.A. and Haroldson, M.A. (1990). Comparisons of some home range and population parameters among four grizzly bear populations in Canada. In L M Darling and W R Archibald, eds. Proceedings of the 8th International Conference on Bear Research and Management, pp. 227-235. International Association for Bear Research and Management, Vancouver.
Parris, K.M. (2006). Urban amphibian assemblages as metacommunities. Journal of Animal Ecology, 75, 757-764.
Parris, K.M., Norton, T.W. and Cunningham, R.B. (1999). A comparison of techniques for sampling amphibians in the forests of south-east Queensland, Australia. Herpetologica, 55, 271-283.
Pollock, K.H., Nichols, J.D., Brownie, C. and Hines, J.E. (1990). Statistical inference for capture-recapture experiments. Wildlife Society Monographs No. 107, 3-97.
Stephens, P.A., Buskirk, S.W., Hayward, G.D. and Martínez del Rio, C. (2005). Information theory and hypothesis testing: a call for pluralism. Journal of Applied Ecology, 42, 4-12.
Taylor, M.K., Laake, J., McLoughlin, P.D., Born, E.W., Cluff, H.D., Ferguson, S.H., Rosing-Asvid, A., Schweinsburg, R. and Messier, F. (2005). Demography and viability of a hunted population of polar bears. Arctic, 58, 203-214.
Tyre, A.J., Tenhumberg, B., Field, S.A., Niejalke, D., Parris, K. and Possingham, H.P. (2003). Estimating false negative error rates for presence/absence data: improving precision and reducing bias in biological surveys. Ecological Applications, 13, 1790-1801.
Wainer, H. (2007). The most dangerous equation. American Scientist, 95, 249-256.
West, G., Brown, J. and Enquist, B. (1997). A general model for the origin of allometric scaling laws in biology. Science, 276, 122-126.
Wintle, B.A., Walshe, T.V., Parris, K.M. and McCarthy, M.A. (2012). Designing occupancy surveys and interpreting non-detection when observations are imperfect. Diversity and Distributions, 18, 417-424.
A message from Eli Gurarie via email, with a link to a site about likelihoods:
Dear Mick,
I enjoyed this chapter, and appreciate you posting it.
A factual quibble (but worth correcting, or clarifying) – the maximum likelihood estimator for sigma^2 in the normal distribution is (1/n) sum(x-x.bar)^2, not (1/(n-1)) sum(x-x.bar)^2. The latter is what’s usually called the sample variance [I'm assuming that's what you're referring to in the box] – and has the advantage of being unbiased, and being obtainable via method of moments. But the former (sometimes called the population variance?) is definitely the MLE … there’s nowhere to lose the 1 off the n when maximizing the likelihood.
I have found it interesting and unfortunate (having taught statistical ecology at several levels) that the idea of a “likelihood” is so rarely introduced or taught before the graduate level. The basic inversion that equates a probability statement about an observation given parameters to a statement about the likelihood of a parameter given observations is rarely dwelt upon, but it is a very empowering conceptual leap.
Here is a link to a document I wrote when I was co-teaching a course that attempts to “dwell” on the likelihood concept (with the example of the standard deviation, and with R code) a little:
http://faculty.washington.edu/eliezg/StatR201/CommentOnLikelihoods.html
Best,
Eli
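The point in the message above can be checked numerically. The following is a minimal sketch, using a small set of made-up observations and a helper name (`normal_log_likelihood`) introduced here for illustration. It evaluates the normal log-likelihood at the sample mean for the two variance estimators, and over a small grid of candidate variances around the 1/n estimate.

```python
import math
from statistics import mean, pvariance, variance


def normal_log_likelihood(data, mu, sigma2):
    """Log-likelihood of independent normal observations with mean mu
    and variance sigma2."""
    n = len(data)
    ss = sum((x - mu) ** 2 for x in data)
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)


# Made-up observations, for illustration only
data = [4.1, 5.3, 3.8, 6.0, 5.1, 4.7]
n = len(data)
x_bar = mean(data)

var_mle = pvariance(data)        # divides the sum of squares by n
var_unbiased = variance(data)    # divides the sum of squares by n - 1

ll_mle = normal_log_likelihood(data, x_bar, var_mle)
ll_unbiased = normal_log_likelihood(data, x_bar, var_unbiased)
print(f"1/n estimate:     {var_mle:.4f}, log-likelihood {ll_mle:.4f}")
print(f"1/(n-1) estimate: {var_unbiased:.4f}, log-likelihood {ll_unbiased:.4f}")

# Grid of candidate variances, expressed as multiples of the 1/n estimate
factors = [0.8, 0.9, 1.0, 1.1, 1.2]
best_factor = max(
    factors, key=lambda f: normal_log_likelihood(data, x_bar, f * var_mle)
)
print(f"log-likelihood is highest at {best_factor} times the 1/n estimate")
```

The 1/(n-1) estimator gives a slightly lower log-likelihood, which is the price paid for its unbiasedness.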
For another introduction to statistics for ecologists, check out Bob O’Hara’s two blog posts:
http://deepthoughtsandsilliness.blogspot.de/2007/07/statistical-modelling-bits-pt-1.html
http://deepthoughtsandsilliness.blogspot.de/2007/08/statistical-modelling-bits-pt-2.html
“In my opinion, the importance of the controversy has sometimes been overstated. The controversy has also seemingly distracted attention from, or completely overlooked, more important issues such as the misinterpretation and misreporting of statistical methods, regardless of whether Bayesian or frequentist methods are used.”
– I really like this! And completely agree.
“Additionally, priors that are based on compilation and analysis of existing data are also valuable. Such compilation and analysis is essentially a form of meta-analysis. Indeed, Bayesian methods are often used for meta-analysis.”
– First mention of MA comes under Bayes. A little weird.
“I thought greater use of Bayesian methods would reduce that misuse. Now, I am less convinced. And the debates seem to distract from more important issues.”
—I’m less convinced too. Or rather, can now see that it is a cognitive question, and that we’ve been proceeding without the relevant evidence.
“The key problem with statistical inference in ecology is not resolving which statistical framework to choose, but appropriate reporting of the analyses. Consider the confidence and credible intervals for the data in Fig. 3; the intervals, representing estimates of the mean, are very similar regardless of the method of statistical inference (Figs 4 and 12). In these cases, the choice of statistical “philosophy” to estimate parameters is not very important. Yes, the formal interpretation and meaning of a confidence interval and a credible interval differ..”
–A natural extension of this position might be to change the structure of the chapter to 1. Estimation and Uncertainty, 2. Testing, 3. Model Selection, 4. Data Accumulation and within each section address about Freq, Bayes, Info Theoretic (to the extent that it applies). That takes the emphasis away from the debate and refocuses on the important research questions (e.g., how much? how sure are we? what’s the best representation? where does this leave the current state of knowledge?).
Thanks Fiona. I’ll think about the idea of re-structuring. I think that would work, because that structure would also allow the different approaches to be introduced in a logical order.
And for anyone reading this, you really should look up Fiona Fidler’s research. Lots of people have opinions about the merits of different statistical approaches. Fiona has opinions too, but her opinions tend to be backed by experimental and observational evidence.
Expanding on this point: “Also, AICc should be the default estimator for many ecological datasets (with limited sample size). AIC will give overfitted models with poor predictive ability and complicated biological interpretation”
Because AICc is estimating the strength of evidence for a given model being the K-L best model (i.e. closest to ‘reality’) within a given set, under the principle that this will yield the best predictive model, it can lead to heavily parameterised models when n is large. This is because as n increases, the precision with which ‘tapering effects’ can be estimated increases, and so including more of these will reduce model bias more than it will decrease precision. This is great when n is moderate, but when n is large it can leave one unsatisfied, because a saturated model is often ranked highest. One common situation in which this occurs is when using AIC to judge the best statistical model for approximating a simulation experiment, where n is arbitrary (and large). In such situations, I think BIC is preferable, because it focuses on main effects.
The other place where I often see AIC and other ICs abused is in stepwise selection. This is a terrible way to use them, IMO, because one can easily stop at models that are either far too simple or far too complex, given your data, depending on whether forward or backward stepwise selection is used. Models of intermediate complexity will be missed if those far above or below them in complexity are inferior (in a KL sense).
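The contrast between AIC-type criteria and BIC raised in the comment above comes down to how each penalises an extra parameter. The sketch below uses the standard formulas with hypothetical log-likelihoods (the two "fits" are invented numbers, not results from any analysis). It shows a case where an extra parameter is favoured by AIC and AICc but not by BIC when n is large, and how the small-sample correction in AICc shrinks as n grows.

```python
import math


def aic(log_lik, k):
    """Akaike's information criterion for k estimated parameters."""
    return -2.0 * log_lik + 2.0 * k


def aicc(log_lik, k, n):
    """AIC with the small-sample correction (requires n > k + 1)."""
    return aic(log_lik, k) + (2.0 * k * (k + 1)) / (n - k - 1)


def bic(log_lik, k, n):
    """Bayesian (Schwarz) information criterion."""
    return -2.0 * log_lik + k * math.log(n)


# Hypothetical fits: one extra parameter raises the maximised
# log-likelihood by 1.5 units
n = 1000
simple = {"log_lik": -520.0, "k": 3}
richer = {"log_lik": -518.5, "k": 4}

for name, model in (("simple", simple), ("richer", richer)):
    print(
        f"{name}: AIC = {aic(model['log_lik'], model['k']):.2f}, "
        f"AICc = {aicc(model['log_lik'], model['k'], n):.2f}, "
        f"BIC = {bic(model['log_lik'], model['k'], n):.2f}"
    )

# Size of the AICc correction for k = 4 at a small and a large sample size
correction_small_n = aicc(0.0, 4, 20) - aic(0.0, 4)
correction_large_n = aicc(0.0, 4, 1000) - aic(0.0, 4)
print(f"AICc correction: n = 20 -> {correction_small_n:.3f}, "
      f"n = 1000 -> {correction_large_n:.3f}")
```

Lower values indicate more support, so in this invented case AIC and AICc rank the richer model first while BIC ranks the simple model first.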
Dear Mick,
Thanks for sharing this. I think the chapter gives a great overview of the different approaches currently used for statistical inference in ecology. A few comments:
1. In the very first paragraph, I would include ethics as another frequent limitation for doing experiments in ecology, together with costs and logistics. e.g. see Farnsworth & Rosovsky Cons Biol 1993 (https://www.mtholyoke.edu/~efarnswo/ethics.pdf)
2. When talking about model selection using AIC, I wonder if you could warn against using AIC to select among too many models (sort of all subsets selection), i.e. without first proposing a limited set of candidate models with biological basis. Problems with this practice have been raised before (e.g. http://warnercnr.colostate.edu/~anderson/PDF_files/Pitfalls.pdf) but it still seems to be too common. Also, AICc should be the default estimator for many ecological datasets (with limited sample size). AIC will give overfitted models with poor predictive ability and complicated biological interpretation (e.g. Link & Barker Ecology 2006).
3. I very much like the emphasis on reporting effect sizes and confidence intervals (CI) over p-values. When you warn against interpreting a CI overlapping zero as evidence of biologically unimportant effects (section 4), it could perhaps be useful to provide some hint about what to do in that -quite common- situation. For instance, if the CI includes zero because it’s very broad (dataset 1 in Fig. 8), we should probably go back to the field and collect more data – with the current dataset we cannot conclude much about this effect.
4. Also, I wonder if you could introduce Andrew Gelman’s type S (sign) and type M (magnitude) errors (http://www.johndcook.com/blog/2008/04/21/four-types-of-errors/), after discussing the problems of type I and type II errors. I think the former are very helpful to promote a paradigm shift from null hypothesis testing to thinking about properly estimating effect sizes.
Hope it’s useful! Thanks again for sharing.
Paco Rodriguez-Sanchez
@frod_san
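The type S (sign) and type M (magnitude) errors mentioned in point 4 above can be illustrated by simulation. This is a rough sketch under stated assumptions: the estimate is normally distributed around a small true effect with a known standard error, and both numbers are hypothetical. It keeps only the "statistically significant" estimates and asks how often they have the wrong sign and by how much they exaggerate the true effect on average.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(7)

# Hypothetical low-power setting: small true effect, large standard error
TRUE_EFFECT, SE, DRAWS = 0.1, 0.25, 20000
z_crit = NormalDist().inv_cdf(0.975)

# Keep only estimates that would be declared significant at the 5% level
significant = []
for _ in range(DRAWS):
    estimate = rng.gauss(TRUE_EFFECT, SE)
    if abs(estimate) > z_crit * SE:
        significant.append(estimate)

power = len(significant) / DRAWS

# Type S: a significant estimate with the wrong sign (true effect is positive)
type_s_rate = sum(e < 0 for e in significant) / len(significant)

# Type M: average factor by which significant estimates overstate the effect
exaggeration = mean([abs(e) for e in significant]) / TRUE_EFFECT

print(f"power: {power:.3f}")
print(f"type S rate among significant results: {type_s_rate:.3f}")
print(f"average exaggeration among significant results: {exaggeration:.1f}x")
```

When power is this low, the estimates that pass the significance filter are necessarily large in magnitude, which is why they overstate a small true effect.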
Thanks for the comments. They are very helpful. Cheers, Mick
Cool. Thanks for directing me to this Michael and Fiona. I’ll check out the paper.
Hi Dan,
Thanks. I agree about the value of reporting all effect sizes – I’ll mention that. However, Fiona Fidler just pointed out this paper to me:
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0066463
I haven’t read it properly yet, but the title and abstract suggest the file drawer problem is overstated. I wonder if I will be convinced.
Hi Michael,
This is a great read. Thanks very much for giving us the opportunity to have a look at it. Overall, I think that this is a very good overview of frequentist and Bayesian statistics and the tools they use to estimate effect sizes etc… I like the way you focus on fundamentals and how you create links between likelihood, Bayesian and frequentist approaches, while opening a window into how they are working. This is something most textbooks seem to miss.
In regards to the equations. I myself have been slightly resistant to these in the past, but I am coming to realize that these are fundamental and the more one uses them and reads them the easier they become to understand and the more likely you are to make connections between different statistical techniques. I personally agree with you that they need to stay. On that note, I do agree that they can be difficult to read and maybe visually depicting how they work or spending more time walking the reader through what each part of the equations mean would be very helpful. I think most of the equations are really easy to deal with, but when it comes to integrals, you may want to spend more time on these. Maybe consider, if you have space, to add a box explaining these in more detail that way you can stay focused on the text which reads really well.
You focus a lot on proper statistical reporting and in particular reporting effect sizes and confidence intervals, regardless of their p-values. I agree with this and I think this is a very important point, but there seems to be some uncertainty in how to deal with this in the context of model selection. For example, you may choose to measure four variables you hypothesize are associated with a particular response, develop possible models and conduct model selection using AIC or AICc. However, inferences are generally done on the top-supported model (if there is one) and the effect sizes for the other hypothesized variables are not reported. I have often just given the effect sizes of the other variables anyway if they seem to be excluded, as they can be useful for meta-analyses. This may be something you could discuss further? I think this is discussed in more detail in Forstmeier & Schielzeth (2010) Cryptic hypotheses testing in linear models: overestimated effect sizes and the winner’s curse. Behav. Ecol. Sociobiol. 65:47-55. Anyway, just a thought and I don’t know how much space you have or whether this is off topic. One cool thing to maybe present is a “check-list” of reporting. The meta-analysis people have got a great system on what should be reported for meta-analyses (i.e. PRISMA) and you could maybe apply a similar approach here.
Anyway, thanks for writing this and I look forward to reading the final draft.
Dan
Perhaps ditching the equations is going too far, you’re right! I think people with a lot of maths experience will see an equation and it will help to summarise the narrative text for them, so it’s useful and adds value. However, people who haven’t ever met those equations properly before will see them as a distraction, so much so that they stop reading the text. It’s really hard to balance the two, I think. I often wonder whether considering how the page is laid out will help to reduce the level of distraction.
Hi, I’ve done a blog post at http://biomathed.wordpress.com/2013/06/25/more-equations-fewer-citations-part-two/ which was prompted by this but deals with the wider issue of whether to include equations and, if so, how.
I liked the narrative and thought that the explanations of the concepts were really clear and natural – taking away much of the jargon, which was great. However, I think you’re relying on the reader being familiar with mathematical symbols and being able to read the maths. For example, you give the equation for the probability density function of the normal distribution, but I think many biologists will blank that out and not make any sense of it. However, if you put a graph alongside then they will get it straight away. I realise the graph is in there further down the page but I would put it higher up and have the equation in second place.
I think it’s helpful to remember that many biologists have no training in calculus so don’t recognise the integral sign – or even the summation notation. I think quite a few won’t recognise the product operator – even though the text in the maximum likelihood section explains it really well. I was wondering whether, in section 3.3, you could just leave out the equations? Do they really add anything? The text explains it quite well anyway.
I hope this is helpful – these are issues I’m struggling with a lot and I’m a molecular pharmacologist so completely at the other end of biology!
Thanks – I just saw your post – good thinking. I’m not sure about ditching the equations – I’d much rather try to help readers understand them. Your idea of using a diagram makes a lot of sense, but I’ll also think about which equations are really necessary. Many thanks!
This is an interesting comment – just the sort of thing I was after. This quote sums it up: “If a biologist hasn’t studied calculus then including an equation such as this is the equivalent of putting a quote in another language”. The suggested use of a diagram to illustrate the concept is helpful – thanks.
It reads nice … a few comments / questions
* “An alternative approach to frequentist statistical methods is based on the concept of “likelihood”” -> I often see people equate NHST with frequentism and treat MLE as a separate thing, but I think it’s pretty uncontroversial that MLE in its traditional interpretation is a frequentist method
* Given the focus, it’s weird that only AIC model selection is discussed, and no Bayesian model selection methods
* I was missing the mentioning of simulation-based approaches such as null-models or ABC, but that may be my personal bias.
I liked the discussion about which things to report in section 4.
Thanks Florian! I’ve seen a similar comment that MLE is not frequentist. I think I had better check that, but I’m with you on that – I’ve always thought of it as a frequentist method.
There are different flavours of Bayesian analysis too, which I haven’t really drawn out.
You’re right about the extra focus on AIC model selection. The book will have an entire chapter based heavily on AIC, so I might trim that part and boost the other aspects of model selection.
I’m already over the word limit, so adding extra detail (e.g., null models and ABC as you mention) will require trimming elsewhere.
Thanks again for sharing your thoughts.
ah, now I get it, you mean “MLE is another frequentist approach” … I misunderstood your sentence as “MLE provides an alternative to frequentist methods”, that’s why I was bringing it up … now that I read it again it’s more clear, but maybe still reformulate?
Thanks Mick for posting this. I’m going to have a read in the next few days and I’ll shoot you some feedback. I look forward to seeing how you’ve introduced the topic. From a quick skim, I like how you’ve approached it.
JOHN
Thanks John. I look forward to hearing your thoughts.