Estimating confidence intervals for gravel bed surface grain size distributions: Reply to reviewer comments

Most studies of gravel bed rivers present at least one bed surface grain size distribution, but there is almost never any information provided about the uncertainty of the percentile estimates. We present a simple method for estimating the grain size confidence intervals about the sample percentiles derived from standard Wolman or pebble count samples of bed surface texture. Our approach uses binomial probability theory to generate confidence intervals for all grain sizes in the distribution. The width of a grain size confidence interval depends on the confidence level selected by the user (e.g., α = 0.05 for a 95% confidence interval), the number of stones sampled to generate the cumulative frequency distribution, and the shape of the frequency distribution itself. For a 95% confidence interval, the true grain size of the underlying population will fall within the confidence interval for the sample 95% of the time. The method uses binomial theory to calculate a percentile confidence interval for each percentile of interest, then maps that confidence interval onto the cumulative frequency distribution of the sample in order to calculate the more useful grain size confidence interval.
The validity of this approach is confirmed by comparing the predictions using binomial theory with estimates of the grain size confidence interval based on repeated sampling from a known population. We also developed a two-sample test of the equality of a given grain size percentile (e.g., D50), which can be used to compare different sites, sampling methods, or operators. The test can be applied with either individual or binned grain size data. These analyses are implemented in the freely available GSDtools package, written in the R language. A solution using the normal approximation to the binomial distribution is implemented in a spreadsheet. Applying our approach to various samples of grain size distributions in the field, we find that the standard sample size of 100 observations is typically associated with uncertainty estimates ranging from about ±15% to ±30%, which may be unacceptably large for many applications. In comparison, a sample of 500 stones produces uncertainty estimates ranging from about ±9% to ±18%.
In order to help workers develop appropriate sampling approaches that produce the desired level of precision, we present simple equations that approximate the proportional uncertainty associated with the 50th and 84th percentiles of the distribution as a function of the sample size and the sorting coefficient, assuming that the underlying distribution is log-normal. However, the true uncertainty of any sample depends on the shape of the sample distribution, and can only be accurately estimated once the sample has been collected.


reply by authors
This is very useful feedback for us. Our intention is indeed to provide a user-friendly tool that implements binomial statistical theory to calculate confidence bands about grain size distributions to prevent type 1 statistical errors. The revised manuscript now provides an overview of the confidence interval calculation procedure, and then lays out the precise statistical basis for the calculations for different kinds of data (i.e. raw observations and binned data). We have also written a new pair of functions to perform sample to sample comparisons to determine whether sample grain sizes for a percentile of interest are statistically different.
We have re-written the introduction and statistical basis sections of the paper, and we have added an overview section to better explain (1) how the binomial distribution can be applied both to raw data comprising n measurements of b-axis diameters and to the typical binned data collected in the field; and (2) how binomial theory can be used to generate confidence intervals about an estimate of a given grain size percentile.
The process is summarized in the new overview section, which describes how to estimate the percentile confidence interval (a term we introduce and use throughout the revised paper), and how to map that onto the sample cumulative frequency distribution to estimate the associated grain size confidence interval. The distinction between these two things is at the root of much of the confusion generated by our original manuscript.
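The two-step procedure described above, a percentile confidence interval from binomial theory, mapped onto the sample's cumulative frequency distribution to give a grain size confidence interval, can be sketched as follows. This is our own minimal Python illustration, not the GSDtools implementation, and the sample data are hypothetical:

```python
import math
import random

def binom_ppf(q, n, p):
    """Smallest k such that the Binomial(n, p) CDF at k is >= q."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
        if cdf >= q:
            return k
    return n

def grain_size_ci(sample, pct=0.50, alpha=0.05):
    """Percentile CI from binomial theory, mapped onto the sorted sample
    (its empirical cumulative frequency distribution) to give a grain size CI."""
    s = sorted(sample)
    n = len(s)
    lo_rank = binom_ppf(alpha / 2, n, pct)      # lower bounding order statistic
    hi_rank = binom_ppf(1 - alpha / 2, n, pct)  # upper bounding order statistic
    return s[max(lo_rank - 1, 0)], s[min(hi_rank, n - 1)]

# Hypothetical pebble count: 100 b-axis diameters (mm), log-normal population
random.seed(1)
sample = [random.lognormvariate(math.log(45), 0.8) for _ in range(100)]
lo_mm, hi_mm = grain_size_ci(sample, pct=0.50, alpha=0.05)
print(f"95% grain size CI for D50: {lo_mm:.1f}-{hi_mm:.1f} mm")
```

Note how the grain size interval inherits its width from the local slope of the cumulative curve: the same rank interval maps to a wide grain size interval where the curve is flat and a narrow one where it is steep.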
In addition, we have written an appendix to the paper that describes how to use the simpler normal approximation to the binomial distribution to calculate the confidence interval, as well as a spreadsheet implementation of that approach.
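For reference, the spreadsheet-friendly normal approximation (after Fripp and Diplas, 1993) reduces to a single formula for the bounding ranks of the percentile confidence interval. This is a generic sketch of that approximation, not the appendix's exact layout:

```python
import math

def normal_approx_ranks(n, pct, z=1.96):
    """Bounding ranks of the percentile confidence interval using the normal
    approximation to the binomial distribution: n*p +/- z*sqrt(n*p*(1-p))."""
    half_width = z * math.sqrt(n * pct * (1 - pct))
    return n * pct - half_width, n * pct + half_width

# For the D50 of a 100-stone count, the 95% rank interval is about 40 to 60
lo, hi = normal_approx_ranks(n=100, pct=0.50)
print(f"ranks {lo:.1f} to {hi:.1f}")  # ranks 40.2 to 59.8
```

In practice the ranks would be rounded to the nearest order statistics and the corresponding grain sizes read off the sorted sample.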
We have also created an appendix containing reference tables of the percentile confidence interval bounds for a range of percentiles of interest (i.e. D10, D15, D20, ..., D90), sample size (n), and acceptable confidence limit (α).

Comment 2
Computation of confidence bands around grain-size distributions without assuming an underlying distribution type is not a new idea. Fripp and Diplas (1993) presented a binomial approach to compute the relation between sample size and error around individual percentiles. The study by Church and Rice (1996) applied a bootstrap approach to a large pebble count of 3500 particles and computed error bands around various percentiles of the grain size distribution. The grain-size distributions did not fit a particular distribution type, but the bootstrap confidence limits were reasonably close to those computed assuming an underlying skewed log-normal distribution. Petrie and Diplas (2002) cautioned that "...the binomial distribution considers only two possibilities for each particle sampled: (1) the particle is within a specific size class (e.g., smaller than a certain size) or (2) the particle is not within the specified size class. The binomial distribution is then inadequate to use for representing entire size distributions." To overcome this limitation, and to compute confidence bands around the cumulative frequency distribution from a pebble count with data binned into size classes while considering the characteristics of the distribution, Petrie and Diplas (2000) developed a multinomial approach.

reply by authors
This is also very useful information for us, and we have read the papers with interest. The work by Diplas and colleagues is particularly relevant and strengthens our paper. The analysis by Fripp and Diplas (1993) is now used as a jumping-off point for our analysis: we have re-written our manuscript to use that paper as the basis from which we start, we describe that approach in the appendix, and we have implemented a version of it in a spreadsheet that accompanies this paper.
The paper by Rice and Church (1996) was the inspiration for the re-sampling analysis that we presented in our original paper. However, we have clearly not done justice to the analysis presented therein, so we have expanded that section.
The work by Petrie and Diplas with multinomial theory is primarily focused on determining the sample size required for a given level of accuracy for estimating the shape and relative position of the cumulative grain size distribution, using binned data. Our approach and intent are different: we develop our statistical theory using individual measurements of b-axis diameters, and we develop confidence bounds to be plotted when comparing distributions to avoid type 1 and 2 statistical errors. In this context, the binomial approach is most appropriate. Our implementation of binomial theory is based on the interpretation that a measured stone is either (a) greater than a percentile of interest for the population, or (b) less than or equal to the percentile of interest, with no reference to or limitation imposed by having binned data. In this context, the estimation of j percentiles involves the execution of j independent binomial experiments with assumed probabilities corresponding to the percentile of interest. To test the difference between our approach and the traditional binned data, we use the scheme described in our paper to directly compare the distributions based on all measurements, and the binned data.
While current practice in the field is still to collect binned data, the automated techniques for grain size analysis that are standard practice in most experimental laboratories, and which are increasingly being deployed in the field, promise to deliver much more data than can be collected manually, and will obviate the need for binned data. Our methodology is best leveraged in that context, using the automated data analysis possible in languages like R and Python. Therefore, our differentiation between binned data and the underlying b-axis diameter measurements is not simply a technical one; it is based on our perceptions of the data types that will commonly be used in the future.

Comment 3
While the study presented by Eaton et al. (2019) is successful in raising awareness that the n=100 sample size is too low to attain reasonable accuracy for pebble counts in most gravel beds and that sample sizes of 400 or 500 particles are required to enable statistical evaluations about sameness or difference, the study does not succeed in presenting its computational approach in an easy to understand way. Providing computer code in R-language is not helpful for most users, hence the authors' computations cannot be repeated or applied by users who are not expert statisticians but are seeking to determine confidence limits around their sampled grain-size distributions. The authors display the confidence bands that they drew with their binomial approach around grain-size distributions sampled in other studies (Kondolf, 1992; Bunte et al., 2009; Bunte and Abt, 2011) and go on to discuss whether the now-drawn confidence bands warrant the interpretations made in the original studies. In the final sections of the study, the authors show general relations between sampling error, as computed with their binomial approach, and sample size as well as distribution sorting.

reply by authors
We are very grateful for the feedback about the relative difficulty in understanding our approach, and about the need for additional means of implementing our tools for estimating the confidence bands. We have responded to the first point by re-writing the section of the paper presenting the method, and to the second by developing reference materials in two appendices, as well as a spreadsheet implementing the normal approximation to our solution, as described by Fripp and Diplas (1993).

Recommendations for improving the paper
The reviewer made several helpful suggestions for improving the paper, listed below:

Reference prior work and build on it: Eaton et al. (2019) should discuss prior studies that likewise compute errors around percentiles without assuming an underlying distribution type and explain the improvements and advantages offered in the study presented. What reason is there for a user to select the authors' approach if the authors do not explain WHY their approach constitutes an improvement?
We have improved the links between our paper and the previous work. We also re-iterate in the revised paper that our main purpose is to produce a user-friendly introduction to the basic method for estimating confidence bounds using binomial theory. We point out that our approach is statistically conventional, has precedents in the literature, and is consistent with empirical analyses. We also more strongly articulate our key message: that all grain size curves ought to be plotted with confidence intervals, particularly when two distributions are being compared.
Provide explanations and instructions: In order for readers to apply the binomial approach to their own data, the authors need to provide a step-by-step explanation of how to use their approach rather than referring to a book on statistics, pointing to a website, and offering computer code in R-language. Offering a reader access to computer code is a courtesy, but not a substitute for a step-by-step explanation, especially not for a very hands-on and applied topic like monitoring bed-material changes.
With this particular comment in mind, we have re-written the manuscript and generated various reference materials.
Comparison of results to those from prior work: How do percentile errors computed from the authors' binomial approach compare to percentile errors computed from other approaches? Apart from a similarity of sampling errors around the D50 and D84 that the authors computed from their binomial as well as a bootstrap approach for an asymmetrical grain-size distribution (the authors' flume experiment), the authors do not show how their binomial approach to computing confidence bands relates to confidence bands computed from other approaches. The authors should apply their binomial approach together with the approaches suggested by Fripp and Diplas (1993), Petrie and Diplas (2000), and Rice and Church (1996), as well as simple sample-size equations for an error around the mean, to a few pebble-count distributions that differ in their sorting and skewness (esp. the extent of a fine tail) and then assess differences and similarities between results.
In our revised paper, we make the links to the cited literature clear, and we replicate the approach described by Rice and Church, and then compare it to the binomial methodology we describe.
Explain whether or how confidence intervals computed from the binomial approach are affected by sorting and skewness of a sampled grain-size distribution: While the authors show that confidence bands increase in width with a distribution's sorting coefficient, the authors do not explain how exactly sorting (and skewness) of a sampled grain-size distribution (e.g., a tail of fines) flows into the computation of confidence intervals based on the binomial approach. The binomial approach introduced by Fripp and Diplas (1993) does not seem to involve sorting or skewness of the sampled distribution, suggesting that confidence intervals from a binomial approach are similar for all percentiles within a sampled grain-size distribution with a known sample size and number of size classes.
The revised text and several new figures address this point.
Have a user in mind and offer a procedure that is reasonably easy for the user to apply: The authors provide a study that is of interest to users who are concerned with the relation of sample size to error. However, the study is geared towards a statistically expert audience rather than the needs of non-expert potential users. If the authors' work is to be applied for monitoring purposes by staff from environmental agencies or consulting firms, and by those whose main interest is not statistical but who need to apply such relations, then the authors need to provide detailed explanation and instruction. A spreadsheet implementation of their computations of a percentile error would be considerably more helpful than code in R-language.
We have developed additional resources that address this point, and we are particularly thankful for this feedback, since our main purpose is to make it easy for people to use our approach.
Editing suggestions: Figures provided by the authors are generally fine, but considering that the study discusses plotted details of whether or not confidence bands overlap, a larger figure size would be helpful. It would also be helpful to place the figures below their first mention in the text, not simply at the top of the page with a mention somewhere below on the page. With respect to writing style and typos (etc.), the manuscript is well written and clean.

We have re-worked many of our figures, but will leave it to the editorial staff to properly place the figures in the final version of the manuscript.

Reviewer 1: Specific comments
The reviewer also provides a list of specific comments that helped improve the paper. Those comments are quoted below, along with our responses to them.

p. 2, l. 15: "...but the largest source of uncertainty in many cases is likely to be sampling variability, which is a function of sample size." How do the authors know that sampling variability (do they mean statistical uncertainty due to a poorly sorted channel bed?) rather than methodological differences (e.g., measurements of particle sizes, spatial heterogeneity, differences in the sampled channel width, or leaving poorly accessible stream locations unsampled) is the most likely factor causing uncertainty? The comparative study by Bunte et al. (2009) showed that differences in sampling outcomes due to methodological variability can be huge.
In order to avoid confusion, we have rewritten the sentence to read "but the largest source of uncertainty in many cases is likely to be associated with sample size, particularly for standard pebble counts of about 100 stones."

p. 3, l. 3: "...since we preserve each measurement rather than grouping them into size classes, the data can be treated as a binomial experiment..." Does that mean that the binomial computation is not applicable to field data binned in 0.5-phi units, which result from measuring particle size using a 0.5-phi template?
We have hopefully addressed this question more clearly in the new section presenting an overview of the method we use and in the revised section where we discuss how to apply binomial theory to binned data. (In any case, the section containing this sentence has been rewritten to improve clarity.)
In Eq. 1, Pr and p are not defined.

This equation is now introduced (and defined) in the overview section to improve clarity. It is used first in an example of the standard coin-toss binomial experiment, and then in the directly analogous problem of estimating the bed surface D50.
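Eq. 1 is the standard binomial probability mass function. A minimal sketch of the coin-toss example and its D50 analogue (our own illustration, with assumed numbers, not the paper's worked example):

```python
import math

def binom_pmf(k, n, p):
    """Pr of exactly k successes in n independent trials,
    each succeeding with probability p."""
    return math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))

# Coin toss: probability of exactly 5 heads in 10 fair tosses
print(round(binom_pmf(5, 10, 0.5), 4))  # 0.2461

# D50 analogue: each stone either falls at or below the true population
# median (p = 0.5) or above it, so the number of 'smaller' stones in a
# 100-stone Wolman count follows Binomial(100, 0.5)
print(round(binom_pmf(50, 100, 0.5), 4))
```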
p. 4, line 10-19: The description of the methodology is too vague. To allow a reader to replicate the computations, authors need to provide step-by-step guidance. Reference to websites and other studies is not su cient for a paper that would like to introduce a new approach to computing confidence bands.
The new overview section and the re-written statistical basis section hopefully address this point. We have also adopted the terms percentile confidence interval and grain size confidence interval throughout the text to more clearly explain how binomial theory can be used to address the uncertainty associated with sampling (i.e. the percentile confidence interval), and how the shape of the cumulative frequency curve determines the uncertainty for a given grain size percentile estimate (i.e. the grain size confidence interval). In the overview section, we use a new figure to explain the relation between percentile and grain size confidence intervals. The sentence now reads "One disadvantage of the exact solution described above is that the areas under the tails of the binomial distribution differ". The figure shows the binomial distribution, so the link between the figure and the text is now more explicit. We have also modified the figure caption and legend labels to explicitly identify the distribution tails.

p. 5, line 1-5: Again, step-by-step instructions are needed to allow a reader to replicate the authors' approach.
We have tried to address this confusion by creating the overview section that precedes the admittedly rather dense description of the statistical basis for our approach. The precise mathematical approach is laid out in the code behind our functions in the GSDtools package (note: we have changed the name of the R package to reflect its more general nature since the addition of two hypothesis testing tools); the underlying calculations, which are described in the text, can be viewed by installing the package and then typing WolmanCI at the command line prompt. We have also included the source code for the functions in the online archive of code and data associated with this paper. The purpose of publishing an R package is to make our exact code and methodology available for both scrutiny and practical use. We have implemented the simpler normal approximation used by Fripp and Diplas (1993) in a spreadsheet version, and we have described the basis for this approximation in a new appendix to the paper. Hopefully, these additions will help potential users replicate our approach.
Also, we have included step-by-step instructions for the two new functions we have created to test hypotheses about differences between two samples.

p. 6, line 5-6: "...Based on the overlap in confidence intervals for the eight samples, the distributions do not appear to be statistically different (see Fig. 3)..." 1) Confidence bands plotted by the authors for their stream table sediment overlap for samples 2 and 3, but not for samples 1 and 4 (Fig. 3, panel A). 2) With respect to their multinomial approach, Petrie and Diplas (2000) stated that error bands are identical for all particle-size distributions as long as the value for alpha (e.g., 0.05) and the number of sampled size classes remain the same. For the authors' 8 samples from the stream table sediment surface, I assume that the same number of size classes were collected in each of the 8 samples and that the same alpha value was applied to all computed confidence bands. If the statement by Petrie and Diplas (2000) were true for the error bands computed by the authors, then why do the error bands plotted in Fig. 3 differ between samples? 3) The authors use as the basis for their analyses a sand-rich sand-gravel mixture with a D50 near 1.5 mm. The lengths of b-axes appear to have been determined to a precision of two decimals (e.g., 0.53 mm). It is difficult to imagine how a pebble count was performed and particle sizes were measured on sediment this small.

This section has been completely re-written, and the text and figure referred to have been removed. In summary, though: the data collected were not binned into size classes, individual grain diameters were recorded; the error bands referred to by Petrie and Diplas are percentile confidence intervals, not grain size confidence intervals (an issue we explain in our new overview section); and the measurements were made from digital photographs of the bed taken 15 cm above the bed with a pixel resolution of about 50 microns.
Obviously this introduces the possibility of grains being partially hidden in the photo, but this effect is far less pronounced in laboratory sediments because, due to scaling issues, sediment finer than the field equivalent of 10 mm grains is not included in the bulk mixture (i.e. there are relatively few 'fine' grains that can fill in pores and obscure the larger grains the way they can in the field). In addition, the purpose of these data is simply to represent a known population of grains from which to draw samples, not to actually represent the bed surface GSD of the experiment accurately.

p. 7, Fig. 4: 1) While the box of box and whisker plots typically shows the quartiles, there is less standardization of what the whiskers represent. Please indicate what the whiskers in this plot represent. It can't be the overall spread because "outliers" are plotted as dots. Please define. 2) What parameter is plotted on the y-axis? Please clarify. 3) It would have been useful to show the 95% confidence bands.

We have abandoned this figure, and instead used a different approach to test the binomial predictions against bootstrap error estimates for a much wider range of percentiles. The new figure plots the predicted and bootstrap errors on a typical grain size distribution curve, and we evaluate their goodness of fit using a 1:1 model (i.e. a model of perfect agreement) and the Nash-Sutcliffe goodness of fit statistic (which is basically the same as an R² value, where 1 equals a perfect model). The completely re-written section on confidence interval testing now engages with previous approaches more explicitly and is more extensive. Note that we replicated the entire confidence interval testing using a different population of grain sizes, defined by 1,000,000 observations drawn from a log-normal distribution, with virtually the same results.
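The bootstrap benchmark referred to here can be sketched generically as follows. This is a standard percentile bootstrap, not the paper's exact code, and the sample data are hypothetical:

```python
import math
import random
import statistics

def bootstrap_d50_ci(sample, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the sample median (D50): resample with
    replacement, recompute the median each time, take empirical quantiles."""
    rng = random.Random(seed)
    medians = sorted(
        statistics.median(rng.choices(sample, k=len(sample)))
        for _ in range(n_boot)
    )
    lo = medians[int((alpha / 2) * n_boot)]
    hi = medians[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical pebble count: 100 b-axis diameters (mm)
random.seed(2)
sample = [random.lognormvariate(math.log(45), 0.8) for _ in range(100)]
lo, hi = bootstrap_d50_ci(sample)
print(f"bootstrap 95% CI for D50: {lo:.1f}-{hi:.1f} mm")
```

Because it simply replicates the act of sampling, the bootstrap introduces no distributional assumptions, which is what makes it a useful check on the binomial predictions.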
p. 6, line 9-19: The authors state that they found a close match between the confidence bands computed from the binomial approach and a bootstrap approach (Fig. 4) for an unskewed grain-size distribution (i.e., their stream table sediment). The comparison plot by Petrie and Diplas (2000) for a pebble count from the Mamquam River shows that the confidence bands computed with the approach by Fripp and Diplas (1993) are between 0.02 and 0.06 phi-units higher than those from the bootstrap approach computed by Rice and Church (1996). Is the binomial approach by Fripp and Diplas (1993) similar or different to the authors' binomial approach? Does a binomial approach yield wider confidence bands than a bootstrap approach?
We address all of these points in revisions to the introduction (where we discuss the Fripp and Diplas approach), and in the confidence interval testing section. We write in the revised paper: "The advantage of a bootstrap approach is that it replicates the act of sampling, and therefore does not introduce any additional assumptions or approximations. The accuracy of the bootstrap approach is limited only by the number of samples collected, and the degree to which the individual estimates of a given percentile reproduce the distribution that would be produced by an infinite number of samples." The differences observed by Petrie and Diplas are presumably due to their use of the normal approximation of the binomial distribution.

p. 7, lines 11-19: In the authors' reassessment of particle-size distributions from Kondolf (1997) and from Bunte et al. (2009), the authors need to clearly state to what percentage confidence the plotted confidence bands refer. I assume they are 95% confidence bands. Please clarify.
Figure captions all now clearly indicate that the polygons represent 95% confidence intervals.

p. 8, Fig. 6: The study by Eaton et al. (2019) has drawn confidence limits around grain-size distributions from three Rocky Mountain gravel-bed streams sampled by Bunte et al. (2009) and Bunte and Abt (2001). 1) Based on visual examination of the error bands plotted in Fig. 6, I'd say that for Willow Creek, the error bands for riffles and pools are different except for the narrow range between 20 and 50 mm within which they cross. 2) The plotted confidence intervals for Willow Creek and the St. Vrain are jagged around the sampled distribution and seem to widen notably for the flatter sections of the cumulative size distribution but neck down for the steeper sections. The authors offer no explanation for this phenomenon.
The observed changes in the width of the grain size confidence interval do indeed correlate with the shape of the cumulative frequency curve. This effect is due to the mapping of the percentile confidence interval onto the grain size confidence interval. We have added a new figure and an overview section to better explain this point. Comparing samples to determine whether a given percentile of interest is different, or whether the samples can be considered different as a whole, can only be done approximately using a visual interpretation of the confidence intervals. We have developed two new functions to rigorously compare samples; these functions (and the step-by-step instructions for how to conduct the analysis) are presented in the statistical basis section; they are also used in the reanalysis section; and they are included in the online demonstration of how to use the GSDtools package.
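A two-sample comparison in this spirit can be sketched as follows. The function names here (`percentile_ci`, `same_percentile`) are hypothetical illustrations, not the GSDtools API, and the riffle/pool samples are simulated; the crude overlap rule shown is a simplification of a proper two-sample test:

```python
import math
import random

def binom_ppf(q, n, p):
    """Smallest k with Binomial(n, p) CDF >= q."""
    cdf = 0.0
    for k in range(n + 1):
        cdf += math.comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
        if cdf >= q:
            return k
    return n

def percentile_ci(sample, pct, alpha=0.05):
    """Grain size CI for one percentile via binomial order statistics."""
    s = sorted(sample)
    n = len(s)
    lo = binom_ppf(alpha / 2, n, pct)
    hi = binom_ppf(1 - alpha / 2, n, pct)
    return s[max(lo - 1, 0)], s[min(hi, n - 1)]

def same_percentile(a, b, pct=0.50, alpha=0.05):
    """Hypothetical helper: treat the two samples' percentiles as
    indistinguishable when their grain size CIs overlap."""
    a_lo, a_hi = percentile_ci(a, pct, alpha)
    b_lo, b_hi = percentile_ci(b, pct, alpha)
    return a_lo <= b_hi and b_lo <= a_hi

# Simulated riffle and pool counts with very different median sizes (mm)
random.seed(3)
riffle = [random.lognormvariate(math.log(60), 0.7) for _ in range(300)]
pool = [random.lognormvariate(math.log(30), 0.7) for _ in range(300)]
print(same_percentile(riffle, pool))    # distinct populations
print(same_percentile(riffle, riffle))  # a sample vs itself -> True
```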
p. 10, line 9-14: The authors write: "Our method for estimating uncertainty requires only the cumulative distribution and the number of measurements used to construct the distribution. Therefore, confidence intervals can be constructed and plotted for virtually all existing surface grain size distributions (provided that the number of stones that were measured is known, which is almost always the case)..." If computation of the width of the confidence interval for any percentile of interest requires only knowledge of the sampled distribution and sample size n, and if the computation is conducted for each percentile individually, then how does the spread or sorting of the sampled distribution influence the computed confidence interval? Please CLARIFY!

This is explained in the overview section, and relates to the difference between the percentile confidence interval and the grain size confidence interval.
p. 11, Fig. 9 and p. 12, Fig. 10: 1) The units in which the error is computed need to be clearly stated. Only further down in the text does the reader get a hint that the error pertains to a percentage error in mm units.
2) The finding that percentile errors decrease with sample size and with distribution sorting is in and of itself nothing new. What is new here is that the error is computed from the authors' binomial approach (assuming an underlying log-normal distribution for Fig. 10). To allow a reader to see whether there is a difference between errors computed from the authors' binomial approach and other approaches (e.g., Fripp and Diplas (1993), or simply errors around a mean), the computed relations between errors and n should be compared to errors computed with other approaches. 3) For comparison with other studies that compute percentile errors in terms of the absolute ± error in phi-units, it would be helpful if the error-n relations in Fig. 9 had a second y-axis with the error in terms of the absolute ± error in phi-units. 4) It would be useful if the relation of error to n was also provided for the error around the D16.
The intention of this section is to provide the user with some guidance on the sample size required to reach a specified level of precision. As should now be clear in the revised paper, the grain size confidence interval cannot actually be estimated until the sample is collected. As a result, we have compared our results to those from others in the confidence interval testing section. This section has been edited to better emphasize that the analysis is only meant to guide sample size estimation, and does not obviate the necessity of calculating the grain size confidence intervals once the sample has been collected. With respect to the units, Eq. 3 is now written so as to make it clear that we are calculating a normalized difference, which is by definition dimensionless.

p. 12, line 8: The authors state that for a given n and sorting, errors are largest for steep gravel fans and bar top surfaces and smaller for typical gravel beds with a sorting near 1. That is a useful comment. It would be even more useful to elaborate a little here on what kind of sorting values to expect for different morphological or sedimentary channel units, and hence what a user needs to expect in terms of the error-sample size relation.
We agree with this comment, which is what motivated us to model the effect of grain size distribution spread on uncertainty using log-normal grain size distributions (the following section). Unfortunately, our data do not support even finer resolution of the issue on a unit-by-unit sedimentary basis.
p. 14, line 12-13: I am afraid that the authors' time estimates refer to dry deposits of mainly midsized gravels. The time requirement for a 500-particle pebble count increases to about 5 hours when sampling in poorly wadeable conditions, in the presence of abundant algae and large woody debris, under overhanging bushes, and with particles being next to irretrievable from the bed because they are tightly wedged within neighboring particles, or small particles placed in tiny pockets between large clasts. The necessity for a large sample size remains, but users and their funding agencies need to commit to realistic time requirements.
We have incorporated the reviewer's time estimate for more arduous samples in a sentence that reads "In less ideal conditions or when working alone, it may take upwards of 5 hours to collect a 500 stone sample, but as we have demonstrated, the uncertainty of the data increases quickly as sample size declines (see Figs. 10 and 11), which may make the extra effort worthwhile in many situations."

Typos etc. p. 2, l. 5: The value should be 22.6 (= 2^0.5 × 16), not 22.7. p. 3, l. 5: ". . . compute the quantiles of the (Fig. 1)." Something is off in that sentence. p. 4, Footnote: The access date is in the future.
We have fixed all of these smaller issues.

Reviewer 2: General Comments
The comments provided by Reviewer 2 are presented below, along with our responses. Many of the points have been addressed in our reply to Reviewer 1 above, but these comments were equally helpful in re-shaping the paper, particularly in those instances when Reviewer 2 has identified the same points raised by Reviewer 1.

Comment 1
The submitted paper focuses on estimating uncertainties in measured grain size distributions using statistical analysis of grain size data from experiments, field measurements and synthetic data. I think that the authors make an important main point, which is that uncertainties in grain size distributions should be reported especially when used to assess grain size changes over time or in space. Although I am supportive of the overall goals, topics, and messages of this manuscript, I think that there are many details missing from the methods. This makes it difficult to evaluate how this calculation is actually applied, the assumptions involved, and finally how it compares to previously published studies on uncertainties in grain sizes. I suggest adding these details such that your paper can be understood by a broader audience.

reply by authors
To address these concerns, we have re-written much of the paper and generated additional figures that we hope better describe how our approach actually works. The revised paper also includes an expanded results section that clarifies the links to previous work, as well as reference appendices providing supporting information. We also now provide a spreadsheet that implements the normal approximation to our technique (as described by Fripp and Diplas, 1993) for estimating percentile confidence limits. Finally, we added two functions for explicitly comparing two samples to determine whether differences in the grain size estimate for a given percentile are actually significant. We appreciate all of the suggestions that are made in this review, and we are confident that the revised version will reach a broader audience.
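To illustrate the kind of comparison such functions enable, here is a minimal Python sketch (our own hypothetical code, not the spreadsheet functions supplied with the paper) that flags a significant difference when the grain size confidence intervals for a given percentile, built from normal-approximation rank bounds, do not overlap:

```python
# Hypothetical illustration: compare two pebble-count samples by checking
# whether their grain size confidence intervals for a percentile overlap.
# Rank bounds use the normal approximation to the binomial distribution.
import math

def gsd_ci(sizes_mm, p=0.5, z=1.96):
    """Approximate grain size confidence interval (mm) for percentile p."""
    s = sorted(sizes_mm)
    n = len(s)
    half = z * math.sqrt(n * p * (1 - p))       # rank half-width
    lo = s[max(int(math.floor(n * p - half)), 0)]
    hi = s[min(int(math.ceil(n * p + half)), n - 1)]
    return lo, hi

def significantly_different(sample_a, sample_b, p=0.5):
    """True when the two confidence intervals do not overlap."""
    a_lo, a_hi = gsd_ci(sample_a, p)
    b_lo, b_hi = gsd_ci(sample_b, p)
    return a_hi < b_lo or b_hi < a_lo
```

Non-overlap of two intervals is a conservative criterion; it is used here only to convey the idea of the comparison.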

Comment 2
I would really like to see a more detailed review of what previous studies have done to quantify uncertainties in the D50 and other percentiles of the grain size distributions. Do approaches without an assumed grain size distribution exist? If so, what is wrong with these approaches that motivates this current study? I'm a bit confused because in the introduction you state that there is no easy way to estimate the required sample size. In the abstract you also write that you propose a simple approach to estimate sample size, but this also relies on assuming a log-normal distribution as in previous studies highlighted on p 2 lines 8-9. What is the difference between your approach that assumes a log normal distribution to estimate sample size and other log normal approaches? It is not entirely clear to me in reading the introduction what is new in this study compared to previous approaches. A more in depth review of previous approaches and a statement of how this new approach is different would really help.

reply by authors
We have extended our discussion of previous approaches by re-writing the paper to leverage the previous work by Diplas and colleagues as the starting point, and we describe in more detail how we replicated the bootstrap approach of Rice and Church to estimate the uncertainty of samples with various sizes drawn from our population of 3411 b-axis measurements. Basically, we believe that our approach is entirely consistent with that proposed by Fripp and Diplas (1993), and replicates the empirical results presented by Rice and Church (1996). The main issue that we try to address in this paper is not that previous methods are flawed, but rather that we as a community have failed to use those approaches to quantify sampling uncertainty (despite the precedents in the literature). As a result, there are published results that are clearly not statistically defensible, and it is our impression that many people continue to collect relatively small samples with limited appreciation of what that means in terms of uncertainty.
In our revisions, we will also emphasize what we think are the main contributions of this paper, which are:
• to describe clearly how surface sampling can be described as a binomial experiment, analogous to a traditional coin toss experiment;
• to present a simple set of tools based on binomial theory with which anybody can easily calculate the grain size confidence interval about any sample percentile that will contain the population percentile size;
• to demonstrate the importance of considering uncertainty when comparing samples of the bed surface, or when making calculations based on those samples; and finally
• to make some assumptions about distribution shape so that we can provide some general guidance on the sample size required to reach a desired level of sampling precision.
This last point involves making assumptions about the underlying distribution (i.e. we assume a log-normal grain size curve), but that is simply to generate synthetic data with which to model the effect of sample size and the spread of the distribution on the precision of a percentile estimate. We will make it clear that any distribution form could have been used, but that we chose a log-normal distribution because (1) it is the simplest to describe (i.e. it can be described by a mean and standard deviation), (2) it has been used previously by others, and (3) many gravel beds are approximately log-normal. We more clearly emphasize our central message in our revisions, and de-emphasize the point about sample size.
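The binomial calculation at the heart of the approach can be sketched in a few lines of Python (a stdlib illustration of the idea, not the authors' R code): the number of sampled stones finer than the true p-th percentile of the population follows a Binomial(n, p) distribution, so the ranks bounding a percentile confidence interval can be read directly off the binomial CDF.

```python
# Sketch of the binomial logic: rank bounds for a percentile CI.
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def percentile_rank_ci(n, p, alpha=0.05):
    """Ranks (lower, upper) bounding the p-th percentile at level 1 - alpha."""
    lower = max(k for k in range(n + 1) if binom_cdf(k - 1, n, p) <= alpha / 2)
    upper = min(k for k in range(n + 1) if 1 - binom_cdf(k, n, p) <= alpha / 2)
    return lower, upper

# For a 100-stone count, the 95% CI on the median spans roughly
# the 40th to 60th ranked stones:
lo, hi = percentile_rank_ci(100, 0.5)  # -> (40, 60)
```

Mapping those ranks onto the sorted sample then yields the grain size confidence interval, which is why the interval width depends on the shape of the cumulative frequency distribution.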

Comment 4
The reviewer made several comments about our calculations that we would like to address: In section 2.1, how is equation (1) used? Please provide a step-wise explanation of how someone would perform these calculations and what information is needed. Right now it is somewhat difficult to understand how equation (2) is actually solved. Although I appreciate the inclusion of the R code that is part of this paper, a simple explanation of your detailed methodology is really needed in the main text to properly evaluate your methods. What are "successes"? Please define. I am also somewhat confused about the definition of p; earlier you state it is the percentile of a distribution but on P 4 L 6 it is called a probability.
We have completely re-written the statistical basis, including an overview section that walks the user through the idea of a binomial experiment, the probabilities of a particular outcome (and the relation of those probabilities to the grain size percentiles for the population being sampled), and the relation between percentile confidence intervals and grain size confidence intervals.
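To make the final mapping step concrete, here is a hedged Python sketch (ours, using the normal approximation of Fripp and Diplas (1993) for the rank bounds rather than the exact binomial calculation) of how a percentile confidence interval is mapped onto the sorted sample to yield the grain size confidence interval:

```python
# Sketch: map rank bounds for a percentile onto the sorted sample of
# b-axis diameters to obtain a grain size confidence interval (mm).
import math

def grain_size_ci(sizes_mm, p=0.5, z=1.96):
    """Grain size CI for percentile p via normal-approximation rank bounds."""
    s = sorted(sizes_mm)
    n = len(s)
    half = z * math.sqrt(n * p * (1 - p))       # approximate rank half-width
    lo_rank = max(int(math.floor(n * p - half)), 0)
    hi_rank = min(int(math.ceil(n * p + half)), n - 1)
    return s[lo_rank], s[hi_rank]
```

Because the interval is read off the sample's own cumulative frequency distribution, the same percentile confidence interval translates into a narrow grain size interval for a well-sorted bed and a wide one for a poorly sorted bed.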
In section 2.2, please also provide more details on this approach; one brief sentence on interpolation really does not make this calculation clear.
We have re-written the entire section to improve clarity.

Section 3 and Figure 4: How many times did you create a sample with 100 grains to make these distributions in Figure 4? It seems like the results could really vary with the number of 100-grain samples. Also, some explanation of the boxplots is needed to evaluate the results. What are the horizontal lines at the top and bottom ends of the distributions? This information is needed to validate that the two predictions actually provide similar results. Can you provide the actual numeric values of the 99% confidence interval bounds for the two methods in the figures to enable quantitative comparisons?
We have re-written the entire section with these comments in mind. We repeat the kind of bootstrap error estimates presented by Rice and Church (1996), and make a more extensive comparison of the binomial predictions and the bootstrap estimates. We ended up taking 5000 samples from the population to ensure that the distributions of estimates stabilized. In addition, the entire analysis was repeated using samples from a synthetic log normal population of 1,000,000 observations; the re-analysis yielded nearly identical results.
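The bootstrap procedure referred to above can be sketched as follows (a simplified Python illustration of the approach, not the script used in the paper): draw repeated samples of size n from the population of b-axis measurements and examine the spread of the resulting D50 estimates.

```python
# Simplified bootstrap of the sampling distribution of D50.
import random
import statistics

def bootstrap_d50(population, n=100, reps=5000, seed=42):
    """Empirical 95% interval of D50 estimates from reps samples of size n."""
    rng = random.Random(seed)
    d50s = [statistics.median(rng.choices(population, k=n)) for _ in range(reps)]
    d50s.sort()
    return d50s[int(0.025 * reps)], d50s[int(0.975 * reps)]
```

Comparing this empirical interval with the binomial prediction for the same n is the kind of test described in the revised comparison section.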

Reviewer 2: Specific comments
The reviewer also provides a list of specific comments that will improve the paper, listed below, each followed by our response.

P 1 L 21-22 For facies mapping, my understanding of the Buffington approach is not that it is meant to be purely qualitative as implied here. They have visual classification of patches that are then verified by numerous pebble counts on the patches. So their approach likely provides a more accurate representation of the grain size distribution because they use many pebble counts in a single reach.

This is a good point, and we now refer to semi-qualitative methodologies to avoid the issue.
This text has been deleted during the revisions.
P4 L 12-16 Please state if this text is for a specific sample (e.g. the data shown in Figure 1), right now it seems to be written as if it applies to all grain size measurements but I don't think that is actually the case?
It is in fact true for the percentile confidence interval for all samples, but not for the grain size confidence interval (which depends on the shape of the cumulative frequency distribution for the sample in question). We have made extensive edits to the existing sections, and the new overview section now better explains how the bounds of the percentile confidence intervals are determined.
P4 L 21-23 Stating that the area under the tails differs is pretty vague. Do you mean tails of the distribution? How are the tails of the distribution defined? Please state why these different areas are problematic. Similarly, upper and lower limits of what exactly? What do you mean by a one-sided interval and how does this relate to your calculations? I can guess what you mean but the lack of language specificity here makes your text somewhat difficult to follow.

This section has been re-written to improve clarity, and is also augmented by Appendix A, which describes how the confidence interval bounds are determined using the more familiar normal approximation to the binomial distribution. We have also added text to the caption of the figure that explicitly references the distribution tails. We also define one-tailed intervals (though admittedly it remains a technical, statistical definition).

Figure 3 More details are needed as to how the grain size data were collected: through a random sample or grid count? Were the samples in different locations on the stream table and using the same or different operators? It is a little difficult to see the confidence bounds in this figure to assess overlap of the various distributions; I am not sure, though, how you can easily address this problem.

This figure has been deleted during the re-write of the paper. The main point is that we have a population of 3411 measurements that we can use to replicate the bootstrap error calculations performed by Rice and Church (1996). Since the time and space distribution of the sub-samples used to generate this population is never referred to in the rest of the paper, we chose to delete the figure and simplify the text. Where this population is first introduced, we provide a bit more information about the sampling, as requested.
The sentence in the paper reads "the population shown is defined by 3411 measurements of bed surface b-axis diameters at randomly selected locations in the wetted channel of a laboratory experiment performed by the authors."

P 7 L 8 typo here

Fixed.

Figure 5. I appreciate this reanalysis but I don't think that you can say that the distributions are statistically similar or different without a similar confidence bound on the bulk sample data. Previous studies have demonstrated that bulk samples also have considerable uncertainty depending on the size of the actual bulk sample and the portion of the sample that is occupied by the largest grain sizes. So the bulk sample is also not free from uncertainties and this needs to be acknowledged.

This is a fair point. Given that we have added new sections and figures to the paper, and that we have extended the comparison of our method to previous methods, we chose to remove this figure and the associated analysis.

P 8 L 3-5 The statement that fine sediment would be deposited preferentially in the pool rather than in the run/riffle during the waning limb of the preceding hydrograph needs some references to support it.
We have added references to some of the seminal work on this topic.

P 12 L 6-7 Please explain why you are assuming the standard deviation of the distribution is related to log D84 - log D50.
To make the paper clearer and to improve the comparability of the field data and the results of the log-normal simulations, we now use a sorting index based on D84 and D16 to quantify the spread of the distribution. This is, we think, a clearer way of conveying what we did without introducing unnecessary complications.

P 12 L 10-12 I do not entirely understand why you are simulating log-normal samples with this given range of D50 values and SDlog values. How were these distributions simulated by defining D50 and SDlog beforehand? Figure 10 does not seem to be referenced or explained anywhere in the text.
We have added edits at various points in this section to make a few points related to this comment. We point out that the purpose of this section of the paper is simply to provide some guidance for choosing an appropriate sample size, and that this is a secondary objective of the paper (the primary objective being the articulation of the importance and relative ease of generating confidence intervals about bed surface grain size distributions). We also now clearly state that we approach this problem first using a set of field data to estimate the grain size uncertainty associated with different sample sizes, and second by using log-normal distributions to quantify the effect of data spread, indexed by standard deviation. We generated the log-normal distributions using the rnorm function in R (e.g. GSD = 2^rnorm(n = 352, mean = 5.6, sd = 1.3)).
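For readers working in Python rather than R, an equivalent sketch of the quoted one-liner (our translation, with an arbitrary seed) is:

```python
# Generate a synthetic log-normal grain size distribution: grain sizes are
# 2 raised to a normally distributed phi-like exponent, so the median
# should fall near 2**5.6, i.e. roughly 48.5 mm.
import random
import statistics

def synthetic_gsd(n=352, mean=5.6, sd=1.3, seed=1):
    rng = random.Random(seed)
    return [2 ** rng.gauss(mean, sd) for _ in range(n)]

sample = synthetic_gsd()
d50 = statistics.median(sample)   # expected to lie near 48.5 mm
```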
P 13 L. 14-22 More details are needed as to how you estimated that this grain size is entrained at a certain shear stress and discharge. Did you use Shields equation? What critical Shields stress did you assume? How did you then translate this shear stress into a discharge beyond using a stage-discharge relation; did you have a measured channel bed slope and are you assuming stage is equivalent to the average flow depth in a reach? What is the basis of the assumption that D50 becomes fully mobile at twice the shear stress needed to initiate D50 movement? Some rationale and supporting references are needed to support this argument. I am also a little confused about this uncertainty in grain size because all of these sizes (46, 55, 64 mm) are essentially in the same half-phi bin. I may be mistaken but if you have binned your data into half-phi intervals for this analysis, wouldn't you expect a similar, although likely smaller, level of uncertainty in the D50 anyway? This uncertainty would occur because you are determining the measured stream bed D50 value (55 mm) by interpolation between the two percentiles straddling the 50th percentile value, and these two bounding percentiles correspond to grain size bins 45 and 64 mm. But you do not actually have any grain size resolution finer than half-phi bin size. So when you calculate a median grain size of 55 mm, you are interpolating this grain size to a finer resolution than you actually have data for. Doesn't this already seem to imply that your uncertainty in D50 might be somewhere within a half-phi bin size when you only have binned data, depending of course on how the actual grain sizes are distributed within that half-phi bin?
We now explain how we determined the entrainment threshold (visual observation of painted tracers, confirmed to occur at a dimensionless shear stress of about 0.045). The other details of the methodology used to estimate shear stress are described in the referenced papers. We have added a reference supporting full mobility at twice the entrainment threshold. The issue of interpolation using binned data, and the accuracy of that kind of data relative to individual measurements of b-axis diameters, is now addressed in the overview section and in the re-written statistical basis section. In particular, our new Fig. 3 demonstrates that the differences between binned data and interpolations from cumulative data are small compared to the sampling confidence interval, which means that, in practice, binned data can be treated as if they were not binned.

P 15 L 12-13 Although I certainly agree that having more than 100 sampled particles would be better for uncertainties in most studies, these time estimates assume a team of people performing pebble counts. Having conducted a very large number of pebble counts on my own, these can take much longer than 20 minutes. The time also really depends on whether you are binning grain sizes or measuring individual b-axes. Finally, setting up and finding grains on a grid also adds to the pebble count time, so I would argue that this 20 minute estimate is a minimum.
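The Shields-type entrainment calculation referred to in our response can be sketched as follows (our illustration only; the densities are typical assumed values, not parameters taken from the paper):

```python
# Critical shear stress from the Shields criterion,
#   tau_c = tau*_c * (rho_s - rho) * g * D,
# using the dimensionless threshold of about 0.045 noted in the reply.
RHO_S = 2650.0   # sediment density (kg/m^3), assumed typical value
RHO = 1000.0     # water density (kg/m^3)
G = 9.81         # gravitational acceleration (m/s^2)

def critical_shear_stress(d_m, shields=0.045):
    """Critical shear stress (Pa) for entrainment of grain size d_m (m)."""
    return shields * (RHO_S - RHO) * G * d_m

# Propagate the D50 grain size confidence interval (46-64 mm) into
# a shear stress interval:
tau_lo = critical_shear_stress(0.046)   # roughly 33 Pa
tau_hi = critical_shear_stress(0.064)   # roughly 47 Pa
```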

Introduction
A common task in geomorphology is to estimate one or more percentiles of a particle size distribution, denoted D_P, where D represents the particle diameter (mm) and the subscript P indicates the percentile of interest. Such estimates are typically used in calculations of flow resistance, sediment transport, and channel stability; they are also used to track changes in bed condition over time, and to compare one site to another. In fluvial geomorphology, commonly used percentiles include D50
derive grain size confidence intervals for any estimate of d_P for a sample that can be expected to contain the true value of D_P for the entire population.
The values of l_e and u_e are indicated on Fig. ?? by dashed vertical lines. As can be seen, the values of l and u generated using the equal tail approximation are shifted to the left of those found by the exact approach. Consequently, the approximate confidence limits are also shifted to the left of the exact approach, resulting in a symmetrical confidence interval. The corresponding grain sizes representing the confidence interval are 2.7 mm and 3.4 mm, which are similar to the exact solution presented above.

Reassessing previous analyses
In order to demonstrate the importance of understanding the uncertainty, we have reanalyzed the results of several previous papers that have compared bed surface texture distributions, but which have not considered uncertainty associated with sampling variability. In some cases, these re-analyses confirm the authors' interpretations, and strengthen them by highlighting which parts of the distributions are different and which are similar, thus allowing for a more nuanced understanding. In others, they demonstrate that the observed differences do not appear to be statistically significant, and suggest that the interpretations and explanations of those differences are not supported by the authors' data.
In either case, we believe that adding information about the grain size confidence intervals is a valuable step that should be included in every surface grain size distribution analysis.
A more fundamental motivation for plotting the binomial confidence bands is illustrated in Fig. ??, which compares the bed surface texture estimated by two different operators using the standard heel-to-toe technique to sample more than 400 stones from the same sedimentological unit. These data were published by ? (see their Fig. 7). Based on their original representation of the two distributions (Fig. ??, Panel a), ? concluded that "operators produced quite different sampling results . . . operator B sampled more fine particles and fewer cobbles . . . than operator A and produced thus a generally finer distribution."
In both cases, the uncertainty associated with sampling variability appears to be greater than the difference between operators or between sampling methods, and thus one cannot claim these differences as evidence for statistically significant effects. It is likely the case that there are significant differences among operators or between sampling methods, but larger sample sizes would be required to reduce the magnitude of sampling variability in order to identify those differences.
Indeed, ? found that operator errors were difficult to detect for small sample sizes (wherein the sampling uncertainties were comparatively large), but became evident as sample size increased, so the issue at hand is not whether there are important differences between operators, but whether the differences in Fig. ?? are statistically significant. Interestingly, ? were able to detect operator differences at sample sizes of about 300 stones, whereas ? did not detect statistical differences for samples of about 400 stones, indicating either that ? had larger operator differences than did ?, or smaller sample uncertainties due to the nature of the sediment size distribution.
Figure 9. Comparing sampling methods for the same bed surface and operator. The data plotted were published by ?, and were collected by operator B. Panel A shows the traditional grain size distribution representation. Panel B uses the 95% grain size confidence intervals calculated for the pebble count to demonstrate that the two distributions do not appear to be statistically different.
Since we used the standard technique of sampling 100 stones to estimate D50, and since the sorting index of the bed surface distribution is about 2.0, we can assume that the uncertainty will be about ±16 Pa. This in turn translates to critical discharge values for morphologic change ranging from 5.9 m³ s⁻¹ to 11.2 m³ s⁻¹, which correspond to return periods of about 1.5 years and 7.4 years, based on the flood frequency analysis presented in ?.
Specifying a critical discharge for morphologic change that lies somewhere between a flood that occurs virtually every year and one that occurs about once a decade, on average, is of little practical use, and highlights the cost of relatively imprecise sampling techniques.

If we had taken a sample of 500 stones, we could assert that the true value of D50 would likely fall between 51 mm and 59 mm, assuming an uncertainty of ±7%. The estimates of the critical discharge would range from 7.2 m³ s⁻¹ to 9.5 m³ s⁻¹, which in turn correspond to return periods of 2 years and 4.1 years, respectively. This constrains the problem more tightly, and is of much more practical use for managing the potential geohazards associated with channel change.
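The interval quoted here follows directly from the stated ±7% uncertainty (a quick arithmetic check, not code from the paper):

```python
# Check: a +/-7% uncertainty on a 55 mm D50 brackets the true median
# between about 51 and 59 mm, as stated in the text.
d50 = 55.0
lo, hi = d50 * (1 - 0.07), d50 * (1 + 0.07)   # about 51.2 mm and 58.9 mm
```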
When designing a bed sampling program, it is useful to estimate the precision of the sampling strategy and to select the sample size accordingly; to do so, we must first assume something about the spread of the data (assuming a log-normal distribution), and then verify the uncertainty after collecting the samples. Simple equations for predicting uncertainty (as a percent of the estimate) are presented here to help workers select the appropriate sample size for the intended purpose of the data.
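As a rough guide to that sample size choice, the following Python sketch (an illustration under the binomial model, not the prediction equations presented in the paper) shows how the relative width of the 95% rank interval for the median shrinks with n, roughly as 1/sqrt(n):

```python
# Relative width of the binomial 95% rank interval for the median,
# as a function of sample size n.
from math import comb

def binom_cdf(k, n, p=0.5):
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def rank_ci_width(n, p=0.5, alpha=0.05):
    lower = max(k for k in range(n + 1) if binom_cdf(k - 1, n, p) <= alpha / 2)
    upper = min(k for k in range(n + 1) if 1 - binom_cdf(k, n, p) <= alpha / 2)
    return (upper - lower) / n   # CI width as a fraction of the sample

widths = {n: rank_ci_width(n) for n in (100, 200, 500)}
# widths[100] is 0.20; increasing n to 500 roughly halves the relative width
```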
Author contributions. B.C. Eaton drafted the manuscript, created the figures and tables, and wrote the code for the associated modelling and analysis in the manuscript; R.D. Moore developed the statistical basis for the approach, wrote the code to execute the error calculations, reviewed and edited the manuscript, and helped conceptualize the paper; and L.G. MacKenzie collected the laboratory data used in the paper, tested the analysis methods presented in this paper, and reviewed and edited the manuscript.
Competing interests. The authors declare that they have no conflict of interest.