This work is distributed under the Creative Commons Attribution 4.0 License.
Short Communication: Evaluating the accuracy of binary classifiers for geomorphic applications
Matthew William Rossi
Abstract. Airborne lidar has revolutionized our ability to map fine-scale (~1 m) topographic features at watershed to landscape scales. As our ‘vision’ of the land surface has improved, so has our need for more robust quantification of the accuracy of the geomorphic maps we derive from these data. One broad class of mapping challenges is binary classification, where remote sensing data are used to identify the presence or absence of a given feature. Fortunately, there is a large suite of metrics developed in the data sciences that are well suited to quantifying the pixel-level accuracy of binary classifiers. In this paper, I focus on the challenge of identifying bedrock from lidar topography, though the insights gleaned from this analysis apply to any task where there is a need to quantify how the number and extent of landforms are expected to vary as a function of environmental forcing. Using a suite of synthetic maps, I show how the most widely used pixel-level accuracy metric, the F1-score, is particularly poorly suited to quantifying accuracy for this kind of application. Well-known biases toward imbalanced data are exacerbated by methodological strategies that attempt to calibrate and validate classifiers across a range of geomorphic settings where feature abundances vary. The Matthews Correlation Coefficient largely removes this bias, such that the sensitivity of accuracy scores to geomorphic setting instead embeds information about the error structure of the classification. To this end, I examine how the scale of features (e.g., the typical sizes of bedrock outcrops) and the type of error (e.g., random versus systematic) manifest in pixel-level scores. The normalized version of the Matthews Correlation Coefficient is relatively insensitive to feature scale if error is random and if large enough areas are mapped. In contrast, a strong sensitivity to feature size and shape emerges when classifier error is systematic.
My findings highlight the importance of choosing appropriate pixel-level metrics when evaluating topographic surfaces where feature abundances strongly vary. It is necessary to understand how pixel-level metrics are expected to perform as a function of scene-level properties before interpreting empirical observations.
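The F1-score's structural blindness to true negatives, which drives the imbalance bias described in the abstract, can be illustrated with a minimal sketch. The pixel counts below are hypothetical and not taken from the study; they simply show that adding correctly classified background (soil) pixels leaves F1 unchanged while MCC responds:

```python
import math

def f1_score(tp, fp, fn, tn):
    """F1 = harmonic mean of precision and recall; note tn never appears."""
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    """Matthews Correlation Coefficient over the full 2x2 confusion matrix."""
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Two scenes with identical classifier behaviour on the bedrock (positive)
# class, but very different amounts of correctly mapped soil (tn):
print(f1_score(80, 20, 20, 80), mcc(80, 20, 20, 80))    # F1 = 0.80, MCC = 0.60
print(f1_score(80, 20, 20, 880), mcc(80, 20, 20, 880))  # F1 = 0.80, MCC ~ 0.78
```

Because F1 ignores `tn` entirely, comparing F1 across scenes with different bedrock fractions conflates classifier skill with feature abundance; MCC uses all four cells of the confusion matrix.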
Status: closed
RC1: 'Comment on esurf-2022-51', Stuart Grieve, 05 Jan 2023
I'd like to thank the author for this contribution, which feels particularly relevant as machine learning approaches become increasingly common in geomorphology research. This manuscript explores the evaluation of the accuracy of binary classification of raster datasets, using mapping of bedrock tors as a case study. There is some discussion and presentation of the methods used in classifying real data, distinguishing between bedrock and soil within forest canopy gaps. A series of synthetic datasets are generated with known error properties and bedrock to soil ratios. These synthetic datasets are used to identify trends in two accuracy metrics, F1 Score and Matthews Correlation Coefficient, at a range of bedrock fractions when errors are random, systematic or both. A key result is the elegant demonstration of the unsuitability of using F1 Score as an accuracy metric in most settings, with the alternative Matthews Correlation Coefficient performing more robustly. There is then analysis of changes in bedrock tor shape and size with systematic error, demonstrating the sensitivity of these metrics to shape and size variation.
I believe that this Short Communication fits well into the Earth Surface Dynamics remit, and will be a valuable addition to the broader computational geomorphology literature. Overall, I am in favour of this manuscript being published, and look forward to seeing a final version in due course.
General comments
Overall, this is a well written and presented manuscript which is methodologically and theoretically sound. I do not have recommendations or requests for additional analysis, but have some observations on the presentation and structure of the manuscript that I hope the author will find useful.
Taken as a whole, for a short communication, the manuscript feels quite long and in places dense. One of the things that I struggled with when reading this manuscript, was whether it mattered that this was a manuscript about bedrock tors and the emergence of bedrock. Indeed, the paper's title does not mention bedrock at all. I wonder if it would streamline the paper to remove much of the discussion of bedrock tors, in favour of speaking more generically about binary classification of geomorphic data. I realise there is a fine line to tread here to keep the geomorphic relevance and novelty of the work, but I feel that there are a lot of tangential issues that could be raised around this paper regarding the correct mapping of tors, fuzzy boundaries, mixed pixels, etc, none of which are relevant to the evaluation of binary classifications. Particularly when approaching the impact that feature shape has on classification and error - this is really cool and could have broad implications for a range of applications but risks being bogged down in discussion of bedrock structure and emergence, rather than the more theoretical advancements being focused on.
A potentially less extreme restructure of the manuscript could be to present all of the synthetic landscapes first, without any reference to bedrock, simply setting the work up as an evaluation of binary classification metrics. Then the final section of the manuscript can bring in the bedrock mapping as a case study, thereby tying the synthetic data and theoretical work to a geomorphic context, without introducing as much ambiguity and complexity into the earlier sections of the manuscript.
When presenting the results of the comparison between MCC and F1 scores, in the case of asymmetric metrics (eg Table 2), more could be made of this result. This may be a well known issue in the data science literature, but I think it can be highlighted in a disciplinary context here, and the implications of (mis-)use of F1 scores explored in a geomorphic context.
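The asymmetry the reviewer points to can be made concrete with a minimal sketch (hypothetical counts, not from the manuscript's Table 2): relabelling which class counts as 'positive' swaps tp with tn and fp with fn, which changes F1 but leaves MCC untouched:

```python
import math

def f1_score(tp, fp, fn, tn):
    # F1 depends only on the positive class: tn is never used.
    return 2 * tp / (2 * tp + fp + fn)

def mcc(tp, fp, fn, tn):
    # MCC is symmetric under exchanging the two classes.
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# Treating soil rather than bedrock as the 'positive' class:
tp, fp, fn, tn = 80, 20, 20, 880
print(f1_score(tp, fp, fn, tn), f1_score(tn, fn, fp, tp))  # 0.80 vs ~0.98
print(mcc(tp, fp, fn, tn), mcc(tn, fn, fp, tp))            # identical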
The work on tor shape in Section 6.2 and Appendix B is very exciting. I would prefer to see it all within the main manuscript rather than split out into an Appendix. There are a lot of interesting factors at play here, where binary classifications have to contend with systematic errors, data resolution and feature size. As data resolution improves, features such as bedrock tors are represented by more pixels, creating the potential for more complex shapes with increased edge to area ratios, and therefore these results are critical in beginning our understanding of how to best classify and evaluate those classifications.
A final recommendation for this manuscript would be the addition of a clear section of recommendations for the types of evaluation method to use under different generalised circumstances. I think this would help with the impact of the manuscript, by giving readers clear direction which they can feed back into their own work.
Line by line comments
In addition to the comments above, I have some more general minor line by line comments:
Line 54 - Missing 'as' between 'long their'.
Line 108-110 - This sentence needs a citation.
Code
It is great to see the code associated with this manuscript available online, with the author detailing how they will archive it once the manuscript is accepted. I have gone through the code on github and it is well written and structured, and after some brief testing it appears to do what is described in the manuscript. It would be ideal if the repository had a licence file included, to ensure that people can use the code in the future. If you need help with software licences, you can look at https://choosealicense.com/ to guide you through the process. To aid reproducibility it would also be helpful to record the numpy, scipy, matplotlib and python versions you are using within the readme, in case future upgrades break things in your code.
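The reviewer's version-pinning suggestion can be sketched with a small stdlib-only helper (the package names come from the comment above; this is an illustrative snippet, not part of the manuscript's repository):

```python
import sys
from importlib import metadata

def environment_report(packages=("numpy", "scipy", "matplotlib")):
    """Return Python and package versions, one per line, for a README."""
    lines = [f"python {sys.version.split()[0]}"]
    for pkg in packages:
        try:
            lines.append(f"{pkg} {metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            lines.append(f"{pkg} (not installed)")
    return "\n".join(lines)

print(environment_report())
```

Running this once and pasting the output into the README records the exact environment the results were produced with.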
-- Stuart Grieve
Citation: https://doi.org/10.5194/esurf-2022-51-RC1
RC2: 'Comment on esurf-2022-51', Anonymous Referee #2, 21 Feb 2023
Review of the manuscript “Short Communication: Evaluating the accuracy of binary classifiers for geomorphic applications” by Rossi, submitted to Earth Surface Dynamics.
I would like to thank the author for this very nice paper. This paper attempts to evaluate the accuracy of binary classifiers commonly used on raster datasets, in this case applied to the mapping of rock formations. In this study, synthetic datasets (with known errors) were used to test two accuracy metrics, F1 score and Matthews correlation coefficient, on a range of bedrock fractions when errors are random, systematic, or both, as well as with variable shape and size, showing that the Matthews correlation coefficient performs more robustly.
I think this short communication fits well into the Earth Surface Dynamics journal. However, I believe that this type of work would fit better in a broader computational journal (e.g. IEEE) as it makes a valuable contribution as machine learning approaches become increasingly common in geomorphological research.
General comments:
The paper is well written and presented. I don't have any further recommendations or additional analysis in the methodology and results section. However, I have very few concerns, mainly related to the way this paper is presented.
The title refers to a "short communication". However, the text is quite long. There are some places where it is not easy to read, mainly because the focus is on the binary classifiers (which could be very ambiguous) and very little on the geomorphological implications (i.e. bedrock). I think this paper would be better suited as a research paper rather than a short communication. I would suggest reorganizing the different sections, showing the methodology first and then presenting the geomorphological applications as a case study.
I suggest changing the verb conjugation in some sentences, especially where it says 'I did' or 'I presented', to more neutral third-person constructions (e.g. 'This study', 'In this work', etc.).
Appendix B should appear in the discussion section. This is a valuable contribution to the newer remote sensing products (e.g. higher resolution) from which the patterns presented in Figure B1 originate. Finally, I suggest adding a recommendation section regarding the perspectives of this very nice work, especially regarding the limitations.
Citation: https://doi.org/10.5194/esurf-2022-51-RC2
AC1: 'Comment on esurf-2022-51', Matthew Rossi, 21 Feb 2023
Much thanks to both reviewers for their very helpful and constructive comments. Both picked up on similar weaknesses in the presentation of this analysis. In considering this feedback, I see three main weaknesses in the current version of the manuscript.

The first main weakness is that the manuscript is too dense and/or long in places. In part, this is due to jumping between binary classification in general and the specific considerations of the highlighted application, bedrock mapping. I agree with this assessment. To address this, it seems the general consensus is that this manuscript is better written as a regular Research Article. With respect to the focus on bedrock classification, both reviewers suggest that the specific application to bedrock mapping should be presented later in the manuscript so that the general challenge of binary classification can be more cleanly articulated from the outset. I know the comment period is about to end but would be curious to hear the reviewers’ reaction to removing the focus on bedrock mapping entirely. I am still keen to motivate how the problem of binary classification matters to geomorphic problems. Perhaps I could replace Figures 1 and 2 (and their associated elaboration in the text) with a multi-panel figure that illustrates three example geomorphic tasks that use binary classification (i.e., where bedrock mapping would be one of three examples in the motivation, Section 2 disappears, and synthetic examples are the focus).

The second main weakness is that Appendix B, which focuses on the importance of feature shape, should be elevated into the main text. I agree with this assessment as well and will re-organize the Discussion to reflect this.

The third main weakness is that both reviewers were looking for a section that clearly articulates practical recommendations for using accuracy metrics like F1-score and MCC, and their associated limits, in analyzing feature classifiers. The simplest way to address this is to carve out a subsection in the Discussion to explicitly address this, which I will now do.
Detailed responses to each of the reviewers’ concerns will be provided during revisions. That said, I want to again thank the reviewers for their thoughtful and valuable feedback.
Citation: https://doi.org/10.5194/esurf-2022-51-AC1
Data sets
Rossi et al. (2020) Bedrock Maps Rossi, Matthew W. https://github.com/mwrossi/cfr_extremes
Model code and software
Synthetic Bedrock Mapping Code Rossi, Matthew W. https://github.com/mwrossi/bedrock-mapping-accuracy