Percentile-based grain size distribution analysis tools (GSDtools) – estimating confidence limits and hypothesis tests for comparing two samples

Most studies of gravel bed rivers present at least one bed surface grain size distribution, but there is almost never any information provided about the uncertainty of the percentile estimates. We present a simple method for estimating the grain size confidence intervals about sample percentiles derived from standard Wolman or pebble count samples of bed surface texture. The width of a grain size confidence interval depends on the confidence level selected by the user (e.g., 95%), the number of stones sampled to generate the cumulative frequency distribution, and the shape of the frequency distribution itself. For a 95% confidence level, the computed confidence interval would include the true grain size parameter in 95 out of 100 trials, on average. The method presented here uses binomial theory to calculate a percentile confidence interval for each percentile of interest, then maps that confidence interval onto the cumulative frequency distribution of the sample in order to calculate the more useful grain size confidence interval. The validity of this approach is confirmed by comparing the predictions using binomial theory with estimates of the grain size confidence interval based on repeated sampling from a known population. We also developed a two-sample test of the equality of a given grain size percentile (e.g., D 50 ), which can be used to compare different sites, sampling methods or operators. The test can be applied with either individual or binned grain size data. These analyses are implemented in the freely available GSDtools package, written in the R language. A solution using the normal approximation to the binomial distribution is implemented in a spreadsheet that accompanies this paper.
Applying our approach to various samples of grain size distributions in the field, we find that the standard sample size of 100 observations is typically associated with uncertainty estimates ranging from about ±15% to ±30%, which may be unacceptably large for many applications. In comparison, a sample of 500 stones produces uncertainty estimates ranging from about ±9% to ±18%. In order to help workers develop appropriate sampling approaches that produce the desired level of precision, we present simple equations that approximate the proportional uncertainty associated with the 50th and 84th percentiles of the distribution as a function of sample size and sorting coefficient; the true uncertainty of any sample depends on the shape of the sample distribution, and can only be accurately estimated once the sample has been collected.


Introduction
A common task in geomorphology is to estimate one or more percentiles of a particle size distribution, denoted D P , where D represents the particle diameter (mm) and the subscript P indicates the percentile of interest. Such estimates are typically used in calculations of flow resistance, sediment transport, and channel stability; they are also used to track changes in bed condition over time, and to compare one site to another. In fluvial geomorphology, commonly used percentiles include D 50 (which is the median) and D 84 . In practice, sampling uncertainty for the estimated grain sizes is almost never considered during data analysis and interpretation. This paper presents a simple approach based on binomial theory for calculating grain size confidence intervals, and for testing whether or not the grain size percentiles from two samples are statistically different.
Various methods for measuring bed surface sediment texture have been reviewed by previous researchers (Church et al., 1987; Bunte and Abt, 2001b; Kondolf et al., 2003). While some approaches have focused on semi-qualitative techniques such as facies mapping (e.g., Buffington and Montgomery, 1999), or visual estimation procedures (e.g., Latulippe et al., 2001), the most common means of characterizing the texture of a gravel bed surface is still the cumulative frequency analysis of some version of the pebble count (Wolman, 1954; Leopold, 1970; Kondolf and Li, 1992; Bunte and Abt, 2001a). Pebble counts are sometimes completed by using a random walk approach, wherein the operator walks along the bed of the river, sampling those stones that are under the toe of each boot and recording the b-axis diameter. In other cases, a regular grid is superimposed upon the sedimentological unit to be sampled, and the b-axis diameter of all the particles under each vertex is measured. In still other cases, computer-based photographic analysis identifies the b-axis of all particles in an image of the bed surface. Data are typically reported as cumulative grain size distributions for 0.5φ size intervals (e.g., 8 to 11.3 mm, 11.3 to 16 mm, 16 to 22.6 mm, 22.6 to 32 mm, and so on), from which the grain sizes corresponding to various percentiles are extracted.
Operator error and the technique used to randomly select bed particles have frequently been identified as important sources of uncertainty in bed surface samples (Hey and Thorne, 1983; Marcus et al., 1995; Olsen et al., 2005; Bunte et al., 2009), but the largest source of uncertainty in many cases is likely to be associated with sample size, particularly for standard pebble counts of about 100 stones. Unfortunately, the magnitude of the confidence interval bounding an estimated grain size is seldom calculated and/or reported, and the implications of this uncertainty are, we believe, generally under-appreciated. To address this issue, we believe that it should become standard practice to calculate and graphically present the confidence intervals about surface grain size distributions.
For the most part, attempts to characterize the uncertainty of pebble counts have focused on estimating the uncertainty of D 50 , and have typically assumed that the underlying distribution is log normal (e.g., Hey and Thorne, 1983). Attempts to characterize the uncertainty associated with other percentiles besides the median have relied on empirical analysis of extensive field data sets (Marcus et al., 1995; Rice and Church, 1996; Green, 2003; Olsen et al., 2005), which cannot be easily applied to pebble counts from other gravel bed rivers having a different population of grain sizes. Perhaps because of the complexity involved in extending the grain size confidence intervals about the median to the rest of the distribution, researchers almost never present confidence intervals on cumulative frequency distribution plots, or constrain comparisons of one distribution to another by any estimate of statistical significance.
While others have recognized the limitations of relatively small sample sizes (Hey and Thorne, 1983; Rice and Church, 1996; Petrie and Diplas, 2000; Bunte and Abt, 2001b), it still seems to be standard practice to rely on surface samples of about 100 observations. Fripp and Diplas (1993) do present a means of generating confidence intervals bounding a grain size distribution. They present a method for determining the minimum sample size required to achieve a desired level of sample precision using the normal approximation to the binomial distribution, wherein uncertainty is expressed in terms of the percentile being estimated (i.e., they estimate the percentile confidence interval), but not in terms of actual grain sizes (i.e., the grain size confidence interval). Petrie and Diplas (2000) demonstrate that the percentile confidence interval predicted by Fripp and Diplas (1993) is similar to the empirical estimates produced by Rice and Church (1996), who repeatedly sub-sampled a known population of grain size measurements in order to quantify the confidence interval; they also recommend plotting the confidence intervals on the standard cumulative distribution plots as an easy way of visualizing the implications of sampling uncertainty. It is worth noting that the primary focus of the previous analyses has been directed toward determining the sample size necessary to achieve a given level of sample precision; it has not been adapted to the analysis and interpretation of surface distribution samples once they have been collected.
A number of studies have compared grain size distributions for two or more samples to assess differences among sites, sampling methods or operators (Hey and Thorne, 1983; Marcus et al., 1995; Bunte and Abt, 2001a; Olsen et al., 2005; Bunte et al., 2009; Daniels and McCusker, 2010). A simple approach would be to construct confidence intervals for the two estimates. If the confidence intervals do not overlap, one can conclude that the estimates are significantly different at the confidence level used to compute the intervals (e.g., 95%); and if a percentile estimate from one sample falls within the confidence interval for the other sample, then one cannot reject the null hypothesis that the percentile values are the same. However, the conclusion is ambiguous when the confidence intervals overlap but do not include both estimates; even for populations with significantly different percentile values, it is possible for the confidence intervals to overlap. Therefore, there is a need for a method to allow two-sample hypothesis tests of the equality of percentile values.
The objective of this note is to introduce robust, distribution-free approaches to (a) computing percentile confidence intervals and then mapping them onto a given cumulative frequency distribution from a standard pebble count in order to estimate the grain size confidence interval for the sample, and (b) conducting two-sample hypothesis tests of the equality of grain size percentile values. The approaches can be applied not only in cases in which individual grain diameters are measured, but also to the common situation in which grain diameters are recorded within phi-based classes, so long as the number of stones sampled to derive the cumulative distribution is also known.
The primary purpose of this work is to guide the analysis and interpretation of grain size samples. While grain size confidence intervals are most applicable when comparing two samples to ascertain whether or not they are statistically different, we also demonstrate how knowledge of grain size uncertainty could be applied in a management context, where flood return period is linked to channel instability (for example). As we demonstrate in the paper, percentile uncertainty is distribution-free, and can be estimated using standard look-up tables similar to those used for t-tests, or using the normal approximation to the binomial distribution referred to by Fripp and Diplas (1993) (see Appendix A). Translating percentile confidence intervals to grain size confidence intervals requires information about the grain size distribution, but is essentially a mapping exercise, not a statistical one. We implement both the estimation of a percentile confidence interval and the mapping of it onto a grain size confidence interval using: (1) a spreadsheet that we provide, which uses the normal approximation to the binomial distribution described by Fripp and Diplas (1993); and (2) an R package called GSDtools that we have written for this purpose, which uses the statistical approach described in this paper. A demonstration is available online at https://bceaton.github.io/GSDtools_demo_2019.nb.html, which provides instructions for installing and using the GSDtools package; the demonstration is also included in the data repository associated with this paper. Finally, we use both existing data sets and the results from a Monte Carlo simulation to develop recommendations regarding the sample sizes required to achieve a pre-determined precision for estimates of the D 50 and the D 84 .

The key to our approach is that the estimation of any grain size percentile can be treated as a binomial experiment, much like predicting the outcome of a coin-flipping experiment.
For example, we could toss a coin 100 times and count the number of times the coin lands head-side up. For each toss (of a fair coin, at least), the probability (p) of obtaining a head is 0.50. The number of times that we get heads during repeated experiments comprising 100 coin tosses will vary about a mean value of 50, following the binomial distribution (see Fig. 1).

The probability of getting a specific number of heads (B k ) can be computed from the binomial distribution:

B k = [n! / (k! (n − k)!)] p^k (1 − p)^(n − k)    (1)

for which k is the number of successes (in this case, the number of heads) observed during n trials for which the probability of success is p. The probabilities of obtaining between 40 and 60 heads calculated using Eq. 1 are shown in Fig. 1. The sum of all the probabilities shown in the figure is 0.96, which represents the coverage probability, P c , associated with the interval from 40 to 60 successes.

We can apply this approach to a bed surface grain size sample. Imagine that we are sampling a population of surface sediment sizes like that shown in Fig. 2a, for which the true median grain size of the population (D 50 ) is known (the population shown is defined by 3411 measurements of bed surface b-axis diameters at randomly selected locations in the wetted channel of a laboratory experiment performed by the authors, and has a median surface size of 1.7 mm). We know that half of the surface grains are smaller than the D 50 , so for each stone that we select, the probability of it being smaller than the D 50 is 0.50. If we measure 100 stones and compare them to the D 50 , then binomial sampling theory tells us that the probability of selecting exactly 50 stones that are less than D 50 is just 0.08, but that the probability of selecting between 40 and 60 stones less than D 50 is 0.96 (see Fig. 1). Figure 2b shows a random sample of 100 stones taken from the population shown in Fig. 2a. Each circle represents a measured b-axis diameter, and all 100 measurements are plotted as a cumulative frequency distribution; the median surface size of the sample, d 50 , is 1.5 mm. There are clear differences between the distribution of the sample and the underlying population, which is to be expected.
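The coin-toss arithmetic above is easy to verify. The following is a minimal sketch in Python using only the standard library (the paper's own implementation is the GSDtools R package):

```python
from math import comb

def binom_pmf(k, n, p):
    """Eq. 1: probability of exactly k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def coverage(l, u, n, p):
    """Sum of the binomial probabilities over the interval l..u successes."""
    return sum(binom_pmf(k, n, p) for k in range(l, u + 1))

p_exact_50 = binom_pmf(50, 100, 0.5)   # probability of exactly 50 heads, ~0.08
p_cov = coverage(40, 60, 100, 0.5)     # probability of 40 to 60 heads, ~0.96
```

For a fair coin and 100 tosses, this reproduces the values quoted above: roughly 0.08 for exactly 50 heads, and a coverage probability of about 0.96 for the interval from 40 to 60 heads.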
The first step in calculating a grain size confidence interval that is likely to contain the true median value of the population is to choose a confidence level; in this example, we set the confidence level to 0.96, corresponding to the coverage probability shown in Fig. 1. As a result, the true value of the D 50 will fall between the sample d 40 and the sample d 60 96% of the time. This represents the percentile confidence interval (see Fig. 2c), and it does not depend on the shape of the grain size distribution. For reference, a set of percentile confidence interval calculations are presented in Appendix B.
Once a confidence level has been chosen and the percentile confidence interval has been identified, a grain size confidence interval can be estimated by mapping the percentile confidence interval onto the sampled grain size distribution, as indicated graphically in Fig. 2d. Unlike the percentile confidence interval, the grain size confidence interval depends on the shape of the cumulative frequency distribution, and can only be calculated once the sample has been collected.
The approach demonstrated above for the median size can be applied to all other grain size percentiles by varying the probability p in Eq. 1, accordingly. For example, the probability of picking up a stone smaller than the true D 84 of a population is 0.84, while the probability of picking up a stone smaller than the true D 16 is just 0.16. If we define P to be the percentile of interest for the population being sampled, then the probability of selecting a stone smaller than that percentile is p = P/100, meaning that there is a direct correspondence between the grain size percentile and the probability of encountering a grain smaller than that percentile. As we show in the next section, the binomial distribution can be used to derive grain size confidence intervals for any estimate of d P for a sample that can be expected to contain the true value of D P for the entire population.

(Figure 3 caption: in Panel (a), blue circles indicate individual grain size measurements (d (i) ), and the red line is the cumulative frequency distribution for binned data using the standard 0.5φ bins. In Panel (b), the interpolated upper and lower percentile confidence bounds for the binned data are shown as horizontal lines, and the associated 95% grain size confidence interval containing the true D 84 for the population is shown in grey.)

Statistical basis
In order to illustrate our approach for estimating confidence intervals in detail, we will use a sample of 200 measurements of b-axis diameters from our laboratory population of 3411 observations. These data are sorted in rank order and then used to compute the quantiles of the sample distribution. The difference between the cumulative distribution of raw data (based on 200 measurements of b-axis diameters) and the standard 0.5φ binned data (which is typical for most field samples) is illustrated in Figure 3. While the calculated d 84 value for the binned data shown in Fig. 3a is not identical to that from the original data, the difference is small compared with the grain size confidence interval associated with a sample size of 200, shown in Fig. 3b. We first develop a method to apply to samples comprising n individual measurements of grain diameter, and then describe an approximation that can be applied to the more commonly encountered 0.5φ binned cumulative grain size distributions.

Exact solution for a confidence interval
Suppose we wish to compute a confidence interval containing the population percentile, D P , from our sample of 200 b-axis diameter measurements. The first step is to generate order statistics, d (i) , by sorting the measurements into rank order from lowest to highest (such that d (1) ≤ d (2) ≤ ... ≤ d (n) ). Figure 3a plots d (i) against the ratio (i − 1)/n, which is a direct representation of the proportion of the distribution that is finer than that grain size.
To define a confidence interval, we first specify the confidence level, usually expressed as 100 · (1 − α)%. For 95% confidence, α = 0.05. Following Meeker et al. (2017), we then find lower and upper values of the order statistics (d (l) and d (u) , respectively) that determine the percentile confidence interval, such that the coverage probability is as close as possible to 1 − α, but no smaller. Note that, in our example of 100 coin tosses from the previous section, we made a calculation by setting l = 40 and u = 60, which gave us a coverage probability of 96%. Coverage probability is defined as:

P c = Σ_{k = l}^{u} B k    (2)

where B k is the binomial probability distribution for k successes in n trials for probability p, defined in Eq. 1. The goal, then, is to find integer values l and u that satisfy the condition that P c ≥ 1 − α, with the additional condition that l and u be approximately symmetric about the expected value of k (i.e., n · p). The lower grain size confidence bound for the estimate of D P is then mapped to grain size measurement d (l) and the upper bound is mapped to d (u) .
Obviously, this approach cannot be applied to the binned data usually collected in the field, but is intended for the increasingly common automated, image-based techniques that retain individual grain size measurements. We have created an R function (QuantBD) that determines the upper and lower confidence bounds, and returns the coverage probability, which is included in the GSDtools package. Our function is based on a script published online by W. Huber, which follows the approach described in Meeker et al. (2017). For n = 200, p = 0.84 and α = 0.05 (i.e., a 95% confidence level), l = 159 and u = 180, with a coverage probability (0.953) that is only slightly greater than the desired value of 0.95. This implies that the number of particles in a sample of 200 measurements that would be smaller than the true D 84 should range from 159 particles to 180 particles, 95% of the time. This in turn implies that the true D 84 could correspond to sample estimates ranging from the 80th percentile (i.e., 159/200) to the 90th percentile (i.e., 180/200). We can translate the percentile confidence bounds into corresponding grain size confidence bounds using our ranked grain size measurements: the lower bound of 159 corresponds to a measurement of 2.8 mm, and the upper bound corresponds to a measurement of 3.6 mm.
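The reported interval can be checked by evaluating the coverage probability directly for l = 159 and u = 180. A standard-library Python sketch (not the QuantBD code itself):

```python
from math import comb

def binom_pmf(k, n, p):
    # Eq. 1: probability of exactly k successes in n trials
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def coverage(l, u, n, p):
    # coverage probability of the interval l..u successes
    return sum(binom_pmf(k, n, p) for k in range(l, u + 1))

# Bounds reported above for n = 200, p = 0.84 (95% confidence level)
pc = coverage(159, 180, 200, 0.84)   # slightly greater than 0.95
```

The computed coverage probability exceeds the desired 0.95, consistent with the value reported in the text.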

Approximate solution for equal-area tails
One disadvantage of the exact solution described above is that the areas under the tails of the binomial distribution differ (Fig. 4), such that the expected value is not located in the center of the confidence interval. Meeker et al. (2017) described an alternative approach based on interpolation for finding lower or upper limits for one-sided intervals (i.e., confidence intervals pertaining to a one-tailed hypothesis test). This approach can be applied to find two-sided intervals by finding one-sided intervals, each with a confidence level of 1 − α/2, which results in a confidence interval that is symmetric about the expected value (see the dashed lines in Fig. 4). By interpolating between the integer values of k, we can find real numbers for which the binomial distribution has values of α/2 and 1 − α/2, which we refer to as l e and u e . The corresponding grain sizes can be found by interpolating between measured diameters whose ranked order brackets the real numbers l e and u e .
The values of l e and u e are indicated on Fig. 4 by dashed vertical lines. As can be seen, the values of l and u generated using the equal-tail approximation are shifted to the left of those found by the exact approach, resulting in a symmetrical confidence interval. The corresponding grain sizes representing the confidence interval are 2.7 mm and 3.4 mm, which are similar to the exact solution presented above.
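The interpolation idea can be sketched as follows (standard-library Python; a simplified stand-in for the GSDtools implementation, assuming linear interpolation of the binomial CDF between adjacent integers):

```python
from math import comb

def binom_cdf(k, n, p):
    """Cumulative binomial probability P(X <= k), built from Eq. 1."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k + 1))

def interp_bound(target, n, p):
    """Real-valued number of successes at which the binomial CDF crosses
    `target`, found by linear interpolation between adjacent integers."""
    for k in range(n + 1):
        cur = binom_cdf(k, n, p)
        if cur >= target:
            prev = binom_cdf(k - 1, n, p) if k > 0 else 0.0
            return (k - 1) + (target - prev) / (cur - prev)
    return float(n)

alpha = 0.05
l_e = interp_bound(alpha / 2, 200, 0.84)      # lower equal-tail bound, ~157
u_e = interp_bound(1 - alpha / 2, 200, 0.84)  # upper equal-tail bound, ~177
```

For n = 200, p = 0.84 and α = 0.05, this yields real-valued bounds near 157 and 177, consistent with the equal-tail bounds used for the binned-data approximation below.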

Approximate solution for binned data
We have adapted the approximate solution described above to allow estimation of confidence limits for binned data, which is accomplished by our R function called WolmanCI in the GSDtools package. Just as before, we use the equal-area approximation of the binomial distribution to compute upper and lower limits (l e and u e ), but then we transform these ordinal values into percentiles by normalizing by the number of observations. Using our sample data, the ordinal confidence bounds l e = 157.03 and u e = 177.36 thus become the percentile confidence bounds d 79 and d 89 , respectively. Next, we simply interpolate from the binned cumulative frequency distribution to find the corresponding grain sizes that define the grain size confidence interval. Note that the linear interpolation is applied to log 2 (d), and that the interpolated values are then transformed to diameters in mm. This interpolation procedure is represented graphically in Fig. 3b; the horizontal lines represent the percentile confidence interval (defined by l e /n and u e /n), while the grey box indicates the associated grain size confidence interval. Our binned sample data yield a grain size confidence interval for the D 84 that ranges from 2.8 mm to 3.5 mm.
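The mapping step is a short interpolation exercise. A sketch in standard-library Python (the binned distribution here is invented for illustration; GSDtools implements this in the WolmanCI R function):

```python
from math import log2

def interp(x, xp, fp):
    """Piecewise-linear interpolation for increasing xp (like numpy.interp)."""
    for i in range(1, len(xp)):
        if x <= xp[i]:
            t = (x - xp[i - 1]) / (xp[i] - xp[i - 1])
            return fp[i - 1] + t * (fp[i] - fp[i - 1])
    return fp[-1]

def grain_size_ci(edges_mm, cum, p_lo, p_hi):
    # interpolate in log2(d), then back-transform the bounds to mm
    psi = [log2(d) for d in edges_mm]
    return 2 ** interp(p_lo, cum, psi), 2 ** interp(p_hi, cum, psi)

# Invented binned CFD: upper bin edges (mm) and cumulative proportion finer
edges = [1.0, 1.4, 2.0, 2.8, 4.0, 5.6, 8.0]
cum = [0.05, 0.20, 0.45, 0.70, 0.90, 0.98, 1.00]

# Map the percentile confidence bounds (0.79, 0.89) onto grain sizes
d_lo, d_hi = grain_size_ci(edges, cum, 0.79, 0.89)
```

Because both percentile bounds fall in the 2.8 to 4.0 mm bin of this invented distribution, the interpolated grain size confidence interval lies within that bin.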
The binomial probability approach uses the sample cumulative frequency distribution to calculate the grain size confidence interval. This makes it difficult to predict the precision achieved by a given sample size, n, prior to collecting the sample. However, the approach can be applied to any previously collected distribution, provided the number of observations used to generate the distribution is known.
Two-sample hypothesis tests

When individual grain diameters are available

Suppose we have two samples for which individual grain diameters have been measured (e.g., two sites, two operators, two sampling methods). The values in the two samples are denoted as X i (where i ranges from 1 to n x ) and Y j (j = 1 to n y ), where n x and n y are the number of grains in each sample. In this case, one can use a resampling method (specifically the bootstrap) to develop a hypothesis test. A straightforward approach is based on the percentile bootstrap (Efron, 2016), and involves the following steps:

1. Take a random sample of n x diameters, with replacement, from the set of values of X i . This bootstrap sample is denoted as x k , k = 1 to n x .

2. Take a random sample of n y diameters, with replacement, from the set of values of Y j . This bootstrap sample is denoted as y l , l = 1 to n y .

3. Determine the desired percentile value from each sample, (d P ) x and (d P ) y , and compute the difference: ∆d P = (d P ) x − (d P ) y .

4. Repeat steps 1 to 3 n r times (e.g., n r = 1000), each time storing the value of ∆d P .

5. Determine a confidence interval for ∆d P by computing the quantiles corresponding to α/2 and 1 − α/2, where α is the desired significance level for the test (e.g., α = 0.05).

6. If the confidence interval determined in step 5 does not overlap 0, then one can reject the null hypothesis that the sampled populations have the same value of D P .

This analysis is implemented with the function CompareRAWs in the GSDtools package. The required inputs are two vectors listing the measured b-axis diameters for each sample.
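The steps above can be sketched in standard-library Python (an illustration of the percentile bootstrap, not the CompareRAWs code; the sample data are invented):

```python
import random
import statistics

def bootstrap_test(x, y, percentile=50, n_r=1000, alpha=0.05, seed=42):
    """Percentile-bootstrap test of equality of a grain size percentile.
    Returns the confidence interval on the difference and the verdict."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_r):
        bx = rng.choices(x, k=len(x))   # step 1: resample X with replacement
        by = rng.choices(y, k=len(y))   # step 2: resample Y with replacement
        qx = statistics.quantiles(bx, n=100)[percentile - 1]
        qy = statistics.quantiles(by, n=100)[percentile - 1]
        diffs.append(qx - qy)           # step 3: difference in d_P
    diffs.sort()                        # step 4 done; step 5: CI quantiles
    lo = diffs[int(n_r * alpha / 2)]
    hi = diffs[int(n_r * (1 - alpha / 2)) - 1]
    return lo, hi, not (lo <= 0 <= hi)  # step 6: reject if CI excludes 0

# Invented samples whose median sizes clearly differ
rng = random.Random(1)
x = [rng.lognormvariate(1.0, 0.5) for _ in range(200)]   # median near 2.7 mm
y = [rng.lognormvariate(2.0, 0.5) for _ in range(200)]   # median near 7.4 mm
ci_lo, ci_hi, reject = bootstrap_test(x, y)
```

With two samples whose medians differ this strongly, the bootstrap confidence interval on the difference lies entirely below zero and the null hypothesis is rejected.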

When only binned data are available and sample size is known
For situations in which only the cumulative frequency distribution is available, an approach similar to parametric bootstrapping can be applied, which employs the inverse transform approach (see Chapter 7 in Wicklin, 2013) to convert a set of random uniform numbers in the interval (0, 1) to a random sample of grain diameters by interpolating from the binned cumulative frequency distribution, similar to the procedure described above for determining confidence intervals for binned data.
The approach involves the following steps:

1. Generate a set of n x uniform random numbers, u i , i = 1 to n x . Transform these into a corresponding set of grain diameters x i by using the cumulative frequency distribution for one sample.

2. Generate a set of n y uniform random numbers, u j , j = 1 to n y . Transform these into a corresponding set of grain diameters y j by using the cumulative frequency distribution for the second sample.

3. Determine the desired grain size percentile from each sample, (d P ) x and (d P ) y , and compute the difference: ∆d P = (d P ) x − (d P ) y .

4. Repeat steps 1 to 3 n r times (e.g., n r = 1000), each time storing the value of ∆d P .

5. Determine a confidence interval for ∆d P by computing the quantiles corresponding to α/2 and 1 − α/2, where α is the desired significance level for the test (e.g., α = 0.05).

6. If the confidence interval determined in step 5 does not overlap 0, then one can reject the null hypothesis that the sampled populations have the same value of D P .

This analysis is implemented in the CompareCFDs function. It requires that the user provide the cumulative frequency distribution for each sample (as a data frame), as well as the number of measurements upon which each distribution is based.
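The inverse transform step can be sketched as follows (standard-library Python; the binned distribution is invented for illustration, and the real implementation is the CompareCFDs R function):

```python
import random
import statistics
from math import log2

def sample_from_cfd(edges_mm, cum, n, rng):
    """Inverse-transform sampling: map uniform random numbers through a
    binned cumulative frequency distribution, interpolating in log2(d)."""
    psi = [log2(d) for d in edges_mm]
    out = []
    for _ in range(n):
        u = rng.random()
        for i in range(1, len(cum)):
            if u <= cum[i]:   # find the bin whose cumulative range brackets u
                t = (u - cum[i - 1]) / (cum[i] - cum[i - 1])
                out.append(2 ** (psi[i - 1] + t * (psi[i] - psi[i - 1])))
                break
    return out

# Invented binned CFD: upper bin edges (mm) and cumulative proportion finer
edges = [1.0, 1.4, 2.0, 2.8, 4.0, 5.6, 8.0]
cum = [0.0, 0.20, 0.45, 0.70, 0.90, 0.98, 1.00]

rng = random.Random(7)
diam = sample_from_cfd(edges, cum, 500, rng)
med = statistics.median(diam)   # close to the d50 implied by the CFD
```

Each simulated diameter falls within the range of the bin edges, and the median of a large synthetic sample approaches the d 50 implied by the binned distribution.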

Confidence interval testing
We can test whether or not our approach successfully predicts the uncertainty associated with a given sample size using our known population of 3411 measurements from the lab. The effect of sample size on the spread of the data is demonstrated graphically in Fig. 5. In Fig. 5a, 25 random samples of 100 stones selected from the population are plotted, along with the 95% grain size confidence interval bracketing the true grain size population, calculated using our binomial approach. In Fig. 5b, random samples of 400 stones are plotted, along with the corresponding confidence interval. A comparison of the two plots shows that sample size (i.e., 100 vs. 400 stones) has a strong effect on the variability of the sampled distributions. It is also clear that the variability of the samples is well predicted by the binomial approach, since the sample data generally fall within the confidence interval for the population.
In order to more formally test the binomial approach, we collected 10,000 random samples (with replacement) from our population of 3411 observations, calculated sample percentiles ranging from the d 5 to the d 95 for each sample, and used the distribution of estimates to determine the grain size confidence interval. This resampling analysis was conducted twice: once for samples of 100 stones and again for samples of 400 stones. This empirical approximation of the grain size confidence interval is the same technique used by Rice and Church (1996). The advantage of a resampling approach is that it replicates the act of sampling, and therefore does not introduce any additional assumptions or approximations. The accuracy of the resampling approach is limited only by the number of samples collected, and the degree to which the individual estimates of a given percentile reproduce the distribution that would be produced by an infinite number of samples. The only drawback of this approach is that the results are only strictly applicable to the population to which the resampling analysis has been applied (Petrie and Diplas, 2000). While it is an ideal way to assess the effect of sample size on variability for a known population, resampling confidence intervals cannot be calculated for individual samples drawn from an unknown grain size population.
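The resampling check can be sketched in standard-library Python (a synthetic log-normal population stands in for the authors' 3411 laboratory measurements, and fewer repetitions than the paper's 10,000 are used for brevity):

```python
import random
import statistics

rng = random.Random(0)

# Synthetic log-normal population standing in for the 3411 lab measurements
population = [rng.lognormvariate(0.5, 0.6) for _ in range(3411)]
true_d50 = statistics.median(population)

# Draw repeated samples of 100 stones and record each sample median
estimates = []
for _ in range(2000):
    sample = rng.choices(population, k=100)   # sampling with replacement
    estimates.append(statistics.median(sample))

# Empirical 95% confidence interval for the d50, as in Rice and Church (1996)
estimates.sort()
lo_ci = estimates[int(0.025 * len(estimates))]
hi_ci = estimates[int(0.975 * len(estimates)) - 1]
```

The empirical 95% interval of the sample medians brackets the true population median, which is the behaviour the binomial confidence intervals are meant to predict.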
In Fig. 6, the resampling estimates of the 95% grain size confidence intervals for D 5 to D 95 based on samples of 100 stones are plotted as red circles, and those based on samples of 400 stones are plotted as blue circles. For comparison, the confidence intervals predicted using our binomial approach are plotted using dashed lines. There is close agreement between the resampling confidence intervals and the binomial confidence intervals, indicating that our implementation of binomial sampling theory captures the effects of sample size that we have numerically simulated using the resampling approach. We have also calculated the statistics of a 1:1 linear fit between the upper and lower bounds of the confidence intervals predicted by binomial theory and those calculated using the resampling approach for sample sizes of 100 and 400 stones. In order to confirm that the size of the original population did not affect our comparison of the resampling and binomial confidence bound estimates, we repeated the entire analysis using a simulated log-normal grain size population. The close match between the grain size confidence intervals predicted using binomial theory and those estimated using the resampling analysis supports the validity of the proposed approach for computing confidence intervals.

Reassessing previous analyses
In order to demonstrate the importance of understanding the uncertainty, we have reanalyzed the results of previous papers that have compared bed surface texture distributions, but which have not considered uncertainty associated with sampling variability. In some cases, these re-analyses confirm the authors' interpretations, and strengthen them by highlighting which parts of the distributions are different and which are similar, thus allowing for a more nuanced understanding. In others, they demonstrate that the observed differences do not appear to be statistically significant, and suggest that the interpretations and explanations of those differences are not supported by the authors' data.
In either case, we believe that adding information about the grain size confidence intervals is a valuable step that should be included in every surface grain size distribution analysis.
The data published by Bunte et al. (2009) include pebble counts of about 400 stones for different channel units in two mountain streams (see Fig. 7). Adding the grain size confidence intervals to the distributions emphasizes the differences and similarities between the distributions. Based on the data in Fig. 7, it seems that clear differences in bed texture exist when comparing pools, runs, and riffles for the fraction of sediment less than about 22.6 mm; the distributions of sediment coarser than this are quite similar. Using the CompareCFDs function to compare percentiles ranging from D 5 to D 95 (in increments of 5), we found that the differences in the samples from Willow Creek for percentiles greater than D 65 are significant for α = 0.05, but not for α = 0.01 (i.e., for a 99% confidence interval).
For North St. Vrain Creek, there are significant differences at α = 0.05 for percentiles finer than D20, and for the D80 and D85, though none of the differences for the coarser part of the distribution are significant for α = 0.01.
The relative similarity of pool and run/riffle sediment textures for the coarser part of the distribution suggests that the most noticeable differences in bed surface texture are likely due to the deposition of finer bed-load sediment in pools on the waning limb of the previous flood hydrograph (as suggested by Beschta and Jackson, 1979; Hilton, 1992, 1999), and that the bed surface texture of both kinds of mainstem units could be quite similar during flood events. The analysis also clearly demonstrates that the size distributions of the exposed channel bars in these two streams are statistically different from both the pools and the runs/riffles. From these plots we can conclude that the bed roughness (which is typically indexed by the bed surface D50 or by sediment coarser than that) is similar for the mainstem units (i.e., pools and runs/riffles), but that exposed bar surfaces in these two streams are systematically less rough. These kinds of inferences could have important implications for decisions about the spatial resolution of roughness estimates required to build 2D or 3D flow models; it is possible to reach the same conclusions based on the original data plots in Bunte et al. (2009), but the addition of confidence bands supports the robustness of the inference.

A more fundamental motivation for plotting the binomial confidence bands is illustrated in Fig. 8, which compares the bed surface texture estimated by two different operators using the standard heel-to-toe technique to sample more than 400 stones from the same sedimentological unit. These data were published by Bunte and Abt (2001a) (see their Fig. 7). Based on their original representation of the two distributions (Fig. 8a), Bunte and Abt (2001a) concluded that "operators produced quite different sampling results . . . operator B sampled more fine particles and fewer cobbles . . . than operator A and produced thus a generally finer distribution." However, once the grain size confidence intervals are plotted (Fig. 8b), it is clear that the differences are generally not statistically significant. Using the CompareCFDs function to compare each percentile from D5 to D95, we found no statistically significant differences for any percentile at α = 0.01; at α = 0.05, only the differences for the D80, D85 and D95 are significant. When comparing distributions, it is common practice to apply the Bonferroni correction, in which α is replaced by α/m, where m is the number of metrics being compared. Applying this correction, there is no statistical difference between the two samples for α = 0.05. The value of considering sampling variability in the analysis is that it supports a more nuanced interpretation of differences in grain size distributions.
A similar analysis of the heel-to-toe sampling method and the sampling frame method advocated by Bunte and Abt (2001a) shows that the distributions produced by the two methods are not generally statistically different, either (Fig. 9). The CompareCFDs function only found significant differences for grain size percentiles coarser than D70 for α = 0.05, and between D75 and D90 for α = 0.01. Once the Bonferroni correction is applied, none of the differences between the two samples would be considered significant at α = 0.05.
In both cases, the uncertainty associated with sampling variability appears to be greater than the difference between operators or between sampling methods, and thus one cannot claim these differences as evidence for statistically significant effects. It may well be that there are real differences among operators or between sampling methods, but larger sample sizes would be required to reduce the magnitude of sampling variability in order to identify those differences.
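The two-sample comparisons above can be approximated without the GSDtools CompareCFDs function. The sketch below is not the package's algorithm; it is a percentile analogue of Mood's median test, which tests whether two samples share a common P-th percentile by comparing the proportions of each sample finer than the pooled percentile estimate:

```python
import math
from statistics import NormalDist

def two_sample_percentile_test(x, y, P=50):
    """Two-sided p-value for equality of the P-th percentile of two samples,
    via a two-proportion z-test at the pooled percentile estimate."""
    pooled = sorted(list(x) + list(y))
    d_p = pooled[max(int(len(pooled) * P / 100) - 1, 0)]  # pooled D_P estimate
    k1, k2 = sum(v <= d_p for v in x), sum(v <= d_p for v in y)
    n1, n2 = len(x), len(y)
    p_hat = (k1 + k2) / (n1 + n2)                         # pooled proportion
    se = math.sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0
    z = (k1 / n1 - k2 / n2) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))
```

With m = 19 percentiles compared (D5 to D95 in steps of 5), the Bonferroni-corrected significance threshold becomes α/m = 0.05/19 ≈ 0.0026.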
Indeed, Hey and Thorne (1983) found that operator errors were difficult to detect for small sample sizes (wherein the sampling uncertainties were comparatively large), but became evident as sample size increased, so the issue at hand is not whether there are important differences between operators, but whether the differences in Fig. 8 are statistically significant. Interestingly, Hey and Thorne (1983) were able to detect operator differences at sample sizes of about 300 stones, whereas Bunte and Abt (2001a) did not detect statistical differences for samples of about 400 stones, indicating either that Hey and Thorne (1983) had larger operator differences than did Bunte and Abt (2001a), or smaller sampling uncertainties due to the nature of the sediment size distribution.

Determining sample size
As we demonstrated in the previous section, grain size confidence intervals can be constructed and plotted for virtually all existing surface grain size distributions (provided that the number of stones that were measured is known, which is almost always the case), and future sampling efforts need not be modified in any way in order to take advantage of our method. While the primary purpose of our paper is to demonstrate the importance of calculating grain size confidence intervals when analyzing grain size data, our method can also be adapted to predict the sample size required to achieve a desired level of sampling precision, prior to collecting the sample.
While the percentile confidence interval for any percentile of interest can be calculated based on the sample size, n, and the desired confidence level, α (see Appendix B, for example), it cannot be mapped onto the grain size confidence interval before the cumulative distribution has been generated. This problem is well recognized, and has been approached in the past by making various assumptions about the distribution shape (Hey and Thorne, 1983; Church et al., 1987; Bunte and Abt, 2001a, b), or by using empirical approximations (Marcus et al., 1995; Rice and Church, 1996; Green, 2003; Olsen et al., 2005), but in all cases it is still necessary to know something about the spread of the distribution, regardless of its assumed shape, in order to assess the implications of sample size for the precision of the resulting grain size estimates. It is perhaps the difficulty of predicting sample precision that has led to the persistent use of the standard 100-stone sample. Here we provide a simple means of determining the appropriate sample size: first we use existing data to calculate the uncertainty of estimates of D50 and D84, and then we use simulated log-normal grain size distributions to quantify the effect of the spread of the distribution on uncertainty.

Uncertainty based on field data
Here, we demonstrate the effect of sample size on uncertainty. We begin by calculating the uncertainty of estimates of D50 and D84 for all the surface samples used in this paper, for eight samples collected by BGC Engineering from gravel bed channels in the Canadian Rocky Mountains, and for samples from two locations on Cheakamus River, British Columbia, collected by undergraduate students from the Department of Geography at The University of British Columbia. The number of stones actually measured to create these distributions is irrelevant, since it is the shape of the cumulative distribution that determines how the known percentile confidence interval maps onto the grain size confidence interval. Since these distributions come from a wide range of environments and have a range of distribution shapes, they are a reasonable representation of the range of grain size confidence intervals that could be associated with a given percentile confidence interval. Uncertainty (ε) in the grain size estimate is calculated as follows:

ε_P = (d_upper − d_lower) / (2 d_P),

where d_upper is the upper bound of the grain size confidence interval, d_lower is the lower bound, and d_P is the estimated grain size of the percentile of interest. As a result, ε50 represents the half-width of the grain size confidence interval about the median grain size (normalized by d_50), and ε84 represents the half-width of the normalized grain size confidence interval for the d_84. For example, for n = 500, ε50 = 0.11 and ε84 = 0.09. This analysis transforms the predictable, distribution-free contraction of the percentile confidence interval as sample size increases into the distribution-dependent contraction of the grain size confidence interval. Clearly there is a wide range of cumulative frequency distribution shapes in our data set, resulting in large differences in ε50 and ε84 for the same sample size (and therefore the same percentile confidence interval).
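As a worked instance of the uncertainty calculation (the helper name is ours; the numerical values are taken from the Fishtrap Creek example later in the paper):

```python
def grain_size_uncertainty(d_lower, d_upper, d_p):
    """Normalized half-width of the grain size confidence interval:
    eps_P = (d_upper - d_lower) / (2 * d_P)."""
    return (d_upper - d_lower) / (2.0 * d_p)

# e.g., a 95% CI of 46-64 mm about a D50 of 55 mm gives eps_50 of about 0.16
```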

Uncertainty for log-normal distributions
In order to quantify the effect of distribution shape on the grain size confidence interval, we conducted a modelling exercise using simulated log-normal grain size distributions. The sorting index (si_φ) is defined as the difference between the 84th and 16th percentile grain sizes, both expressed in φ units.
The term φ84 refers to the 84th percentile grain size (in φ units), and φ16 refers to the 16th percentile. As a point of comparison, we estimated si_φ for the samples analyzed in the previous section. For those samples, the sorting index ranges from 1.5φ to 5.6φ, with a median value of 2.5φ. The largest values of si_φ were associated with samples from channels on steep gravel bed fans and on bar top surfaces, while samples characterizing the beds of typical gravel bed streams had values close to the median value.
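In practice the sorting index can be computed directly from the two percentile grain sizes in millimetres; the helper below (our naming) returns the magnitude of the spread on the log2 (φ) scale:

```python
import math

def sorting_index_phi(d16_mm, d84_mm):
    """Spread between the 16th and 84th percentile grain sizes in phi units
    (magnitude of the difference on the log2 scale)."""
    return abs(math.log2(d84_mm) - math.log2(d16_mm))
```

For example, D16 = 16 mm and D84 = 64 mm give si_φ = 2.0φ.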
We simulated 3000 log-normal grain size distributions with D50 ranging from 22.6 mm to 90.5 mm, n ranging from 50 to 1000 stones, and si_φ ranging from 1φ to 5φ. For each simulated sample, we calculated the uncertainty of D50 and D84 using Eq. 3. The calculated values of ε50 and ε84 are plotted in Fig. 11. Using the data shown in the figure, we fit least-squares regression models of the form

log(ε_P) = a + b log(n) + c log(si_φ),

where a, b, and c are the estimated coefficients. The empirical model predicting ε50 has an adjusted R² value of 0.95, with the variable n explaining about 43% of the total variance, and si_φ explaining 51% of the variance. The model for ε84 has an adjusted R² value of 0.91, with the variables n and si_φ explaining similar proportions of the total variance as they do in the ε50 model (41% and 50%, respectively).
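The simulation logic can be sketched as follows. This is our reconstruction, not the GSDtools code: it assumes a log-normal population (normal on the φ scale) with φ-scale standard deviation si_φ/2, and uses the normal approximation to the binomial to pick the 95% CI order-statistic ranks; the default values are ours:

```python
import math
import random
from statistics import median

def eps50_montecarlo(n, si_phi, d50_mm=45.0, reps=200, seed=42):
    """Monte Carlo estimate of eps_50 for a simulated log-normal bed."""
    rng = random.Random(seed)
    sigma_phi = si_phi / 2.0                     # assumed phi-scale std. dev.
    half = 1.96 * math.sqrt(n * 0.25)            # z * sqrt(n p (1 - p)), p = 0.5
    lo = max(int(n / 2 - half), 0)               # lower order-statistic rank
    hi = min(int(math.ceil(n / 2 + half)), n - 1)
    eps = []
    for _ in range(reps):
        phi = [rng.gauss(math.log2(d50_mm), sigma_phi) for _ in range(n)]
        d = sorted(2.0 ** v for v in phi)
        eps.append((d[hi] - d[lo]) / (2.0 * d[n // 2]))
    return median(eps)
```

For si_φ = 2.0φ this returns values on the order of 0.17 at n = 100 and 0.08 at n = 500, consistent with the sampling uncertainties quoted elsewhere in the paper.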
After back-transforming from logarithms, ε50 can be expressed as a power function of sample size (Eq. 6), with a coefficient A that depends on the sorting index (Eq. 7). The corresponding equations for ε84 (Eqs. 8 and 9) take the same form, with a coefficient B in place of A. Table 1 provides values of A and B for a range of sorting indices.

[Table 1: Coefficient values for estimating uncertainty in D50 and D84 as a function of si_φ using Eqs. (6) and (8).]

Practical implications of uncertainty
The implications of uncertainty can be important in a range of practical applications. As an example, we translate grain size confidence intervals into confidence intervals for the critical discharge for significant morphologic change using data for Fishtrap Creek, a gravel bed stream in British Columbia that has been studied by the authors (Phillips and Eaton, 2009; Eaton et al., 2010a, b). The estimated bed surface D50 for Fishtrap Creek is about 55 mm, which we estimate becomes entrained at a shear stress of 40 Pa, corresponding to a discharge of about 2.5 m³ s⁻¹ (Eaton et al., 2010b); the threshold discharge is based on visual observation of tracer stone movement, and corresponds to a critical dimensionless shear stress of approximately 0.045. If we assume that significant channel change can be expected when D50 becomes fully mobile (which occurs at about twice the entrainment threshold, according to Wilcock and McArdell, 1993), then we would expect channel change to occur at a shear stress of 80 Pa, which corresponds to a critical discharge of 8.3 m³ s⁻¹, based on the stage-discharge relations published by Phillips and Eaton (2009).
Since we used the standard technique of sampling 100 stones to estimate D50, and since the sorting index of the bed surface is about 2.0φ, we can assume that the uncertainty will be about ±17%, based on Eqs. 6 and 7, which in turn suggests that the actual surface D50 could be as small as 46 mm or as large as 64 mm. This range of D50 values translates to shear stresses producing full mobility that range from 67 Pa to 94 Pa, which in turn translates to critical discharge values for morphologic change ranging from 5.9 m³ s⁻¹ to 11.2 m³ s⁻¹; these correspond to return periods of about 1.5 years and 7.4 years, based on the flood frequency analysis presented in Eaton et al. (2010b). Specifying a critical discharge for morphologic change that lies somewhere between a flood that occurs virtually every year and one that occurs about once a decade, on average, is of little practical use, and highlights the cost of relatively imprecise sampling techniques.
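The propagation of grain size uncertainty into the full-mobility shear stress can be reproduced in a few lines. The sediment density of 2650 kg m⁻³ is our assumption, chosen because it reproduces the 40 Pa entrainment threshold quoted in the text:

```python
# Constants: sediment density, water density (kg/m^3), gravity (m/s^2)
RHO_S, RHO, G = 2650.0, 1000.0, 9.81
THETA_C = 0.045        # critical dimensionless shear stress (from the text)
FULL_MOBILITY = 2.0    # full mobility at ~2x entrainment (Wilcock and McArdell, 1993)

def full_mobility_stress(d50_m):
    """Shear stress (Pa) at which D50 is expected to be fully mobile."""
    return FULL_MOBILITY * THETA_C * (RHO_S - RHO) * G * d50_m

d50, eps = 0.055, 0.17  # D50 in metres, +/-17% uncertainty for n = 100
tau_lo = full_mobility_stress(d50 * (1 - eps))   # ~67 Pa
tau_mid = full_mobility_stress(d50)              # ~80 Pa
tau_hi = full_mobility_stress(d50 * (1 + eps))   # ~94 Pa
```

The resulting 67-94 Pa range would then be converted to discharges via the stage-discharge relations, as in the text.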
If we had taken a sample of 500 stones, we could assert that the true value of D50 would likely fall between 51 mm and 59 mm, assuming an uncertainty of ±7%. The estimates of the critical discharge would range from 7.2 m³ s⁻¹ to 9.5 m³ s⁻¹, which in turn correspond to return periods of 2 years and 4.1 years, respectively. This constrains the problem more tightly, and is of much more practical use for managing the potential geohazards associated with channel change.
Operationally, it takes about 20 minutes for a crew of two or three people to sample 100 stones from a typical dry bar in a gravel bed river, and a bit over an hour to sample 500 stones, so the effort required to sample the larger number of stones is often far from prohibitive. In less ideal conditions or when working alone, it may take upwards of 5 hours to collect a 500-stone sample, but as we have demonstrated, the uncertainty of the data increases quickly as sample size declines (see Figs. 10 and 11), which may make the extra effort worthwhile in many situations.
Furthermore, computer-based analyses using photographs of the channel bed may be able to identify virtually all of the particles on the bed surface, and generate even larger samples. The statistical advantages of the potential increase in sample size are obvious, and in our opinion justify further concerted development of these computer-based methods.

Conclusions
Based on the statistical approach presented in this paper, we developed a suite of functions in the R language that can be used to first calculate the percentile confidence interval and then translate that into the grain size confidence interval for typical pebble count samples (see the supplemental material for the source code). We also provide a spreadsheet which uses the normal approximation to the binomial distribution to estimate the grain size confidence interval. The approach presented in this paper uses binomial theory to calculate the percentile confidence interval for any percentile of interest (e.g., P = 50 or P = 84), and then maps that confidence interval onto the cumulative grain size distribution based on pebble count data to estimate the grain size confidence interval. As a result, the approach requires only that the total number of stones used to generate the distribution is known in order to generate grain size distribution plots that visually indicate the precision of the sample distribution (e.g., Fig. 7). We have developed statistical approaches that can be used for samples in which individual grain sizes are known and for samples in which data are binned (e.g., into φ classes).
When comparing two or more distributions, this means of displaying grain size distribution data highlights which distributions appear statistically different, and which do not.
Our analysis of various samples collected in the field demonstrates that the grain size confidence interval depends on the shape of the distribution, with more widely graded sediments having wider grain size confidence intervals than narrowly graded ones. Our analysis also suggests that typical gravel bed river channels have a similar gradation, and that the typical uncertainty of the D50 varies from ±25% for a sample size of 100 observations to about ±11% for 500 observations.
When designing a bed sampling program, it is useful to estimate the precision of the sampling strategy and to select the sample size accordingly; to do so, we must first assume something about the spread of the data (e.g., by assuming a log-normal distribution), and then verify the uncertainty after collecting the samples. Simple equations for predicting uncertainty (as a percent of the estimate) are presented here to help workers select the appropriate sample size for the intended purpose of the data.

Appendix A: Normal approximation
While it is difficult to determine the percentile confidence interval using Eq. 1 without using a scripting approach similar to the one we implement in the GSDtools package, we can approximate the percentile confidence interval analytically, and use the approximating equations in spreadsheet calculations. As Fripp and Diplas (1993) point out, the percentile of interest (P) can be approximated by a normally distributed variable with a standard deviation calculated as follows:

σ = 100 √(p(1 − p)/n)

The term n refers to the number of stones being measured, and p refers to the probability of a single stone being finer than the grain size for the percentile of interest, D_P (recall from above that p = P/100, such that p = 0.84 for D84).
The standard deviation for n = 100 and P = 84 would be 3.7. This means that the true D84 would be expected to fall between the sampled d80.3 and d87.7 for a sample of 100 observations approximately 68% of the time, and would fall outside that range 32% of the time.
More generally, we can use the normal approximation to calculate the percentile confidence interval for any chosen confidence level (α). We simply need to find the appropriate value of the z statistic for the chosen value of α, and calculate the percentile confidence interval using the following confidence bounds:

P_lower = P − zσ, P_upper = P + zσ

The use of a normal distribution to approximate the binomial distribution is generally assumed to be valid for p values in the range 5/n ≤ p ≤ 1 − 5/n, although some have recommended the more stringent range 20/n ≤ p ≤ 1 − 20/n (e.g., Fripp and Diplas, 1993). For a sample size of 100 stones, the first of these limits corresponds to the 5th and 95th percentiles of the distribution.
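The spreadsheet calculation amounts to the following (a direct transcription of the normal approximation; the function name is ours):

```python
import math
from statistics import NormalDist

def percentile_ci_normal(P, n, alpha=0.05):
    """Normal-approximation confidence bounds (in percentile units) for the
    P-th percentile of a sample of n stones."""
    p = P / 100.0
    sigma = 100.0 * math.sqrt(p * (1.0 - p) / n)
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)   # 1.96 for alpha = 0.05
    return P - z * sigma, P + z * sigma
```

For n = 100 and P = 84, σ ≈ 3.7, so the 95% interval spans roughly P77 to P91; those percentile bounds must then be mapped onto the sample cumulative frequency distribution to obtain grain sizes.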
For ease of reference, Table A1 presents σ values for P ranging from 10 (i.e., the D10) to 90 (the D90) and for n ranging from 50 observations to 3200 observations. For α = 0.10, z = 1.64; for α = 0.05, z = 1.96; and for α = 0.01, z = 2.58. The table can be used to estimate the approximate percentile confidence intervals for common values of α, P and n. However, the user will have to manually translate the percentile confidence intervals into grain size confidence intervals using the cumulative frequency distribution for their sample.
A spreadsheet (see supplemental material) implementing these calculations has also been developed. That spreadsheet maps the percentile confidence interval onto the user's grain size distribution sample in order to estimate the grain size confidence interval. The source code for the package can be found in the online data repository associated with this paper.
These percentile confidence bounds do not depend on the characteristics of the grain size distribution, since they are determined by binomial sampling theory. Estimating the corresponding grain size confidence bounds requires the user to map the percentile confidence interval onto the grain size distribution in order to find the grain size confidence interval. The GSDtools package will automatically estimate the grain size confidence interval.

Author contributions. B.C. Eaton drafted the manuscript, created the figures and tables, and wrote the code for the associated modelling and analysis in the manuscript; R.D. Moore developed the statistical basis for the approach, wrote the code to execute the error calculations, reviewed and edited the manuscript, and helped conceptualize the paper; and L.G. MacKenzie collected the laboratory data used in the paper, tested the analysis methods presented in this paper, and reviewed and edited the manuscript.
Competing interests. The authors declare that they have no conflict of interest.
Acknowledgements. The field data used in this paper were provided by BGC Engineering, by undergraduate students in a field course run by one of the authors, and by an NSERC-funded research project conducted by two of the authors. The paper benefited greatly from the input of one anonymous reviewer, and from Dr. K. Bunte. We would also like to acknowledge the contributions of numerous graduate students at UBC who tested our R package and commented on the revised version of the manuscript.