Sandy coasts are constantly changing environments governed by complex, interacting processes. Permanent laser scanning is a promising technique to monitor such coastal areas and to support analysis of geomorphological deformation processes. This novel technique delivers 3-D representations of the coast at hourly temporal and centimetre spatial resolution and allows us to observe small-scale changes in elevation over extended periods of time. These observations have the potential to improve understanding and modelling of coastal deformation processes. However, to be of use to coastal researchers and coastal management, an efficient way to find and extract deformation processes from the large spatiotemporal data set is needed.
To enable automated data mining, we extract time series of surface elevation and use unsupervised learning algorithms to derive a partitioning of the observed area according to change patterns. We compare three well-known clustering algorithms (k-means, agglomerative clustering and DBSCAN).

Coasts are constantly changing environments that are essential to the protection of the hinterland from the effects of climate change and, at the same time, are among the areas most affected by it. Long-term and small-scale processes in particular prove difficult to monitor but can have large impacts

The resulting spatiotemporal data set consists of a series of point cloud representations of a section of the coast. The high temporal resolution and long duration of data acquisition in combination with high spatial resolution (on the order of centimetres) provides a unique opportunity to capture a near-continuous representation of ongoing deformation processes, for example, storm and subsequent recovery, on a section of the coast. As reported by

The PLS data set is large (on the order of hundreds of gigabytes), and to be relevant, the information on deformation processes has to be extracted concisely and efficiently. Currently, there are no automated methods for this purpose and studies focus on one or several two-dimensional cross-sections through the data (for example,

One example of spatiotemporal segmentation on our data set from permanent laser scanning was recently developed by

Time series data sets are also used to assess patterns of agricultural land use by

The goal of the present study is to evaluate the application of clustering algorithms on a high-dimensional spatiotemporal data set without specifying deformation patterns in advance. Our objectives in particular are

to analyse and compare the limits and advantages of three clustering algorithms for separating and identifying change patterns in high-dimensional spatiotemporal data, and

to detect specific deformation on sandy beaches by clustering time series from permanent laser scanning.

We compare the three clustering algorithms and expect that a suitable algorithm fulfils the following criteria:

A majority of the observation area is separated into distinct regions.

Each cluster shows a change pattern that can be associated with a geomorphic deformation process.

Time series contained in each cluster roughly follow the mean change pattern.

We use the different clustering approaches on a small area of the beach at the bottom of a footpath, where sand accumulated after a storm, and a bulldozer subsequently cleared the path and formed a pile of sand. We determine the quality of the detection of this process for all three algorithms and compare them in terms of standard deviation within the clusters and area of the beach covered by the clustering. We compare and evaluate the resulting clusters using these criteria as a first step towards the development of a method to mine the entire data set from permanent laser scanning for deformation processes.

The data set from permanent laser scanning is acquired within the CoastScan project at a typical urban beach in Kijkduin, the Netherlands

For the present study, a subset of the available data is used to develop the methodology. This subset consists of 30 daily scans taken at low tide over a period of 1 month (January 2017). It covers a section of the beach and dunes in Kijkduin and is displayed in top view in Fig.

Top view of a point cloud representing the observation area at low tide on 1 January 2017. The laser scanner is located at the origin of the coordinate system (not displayed). The point

Riegl VZ-2000 laser scanner mounted on the roof of a hotel facing the coast of Kijkduin, the Netherlands. The scanner is covered with a protective case to shield it from wind and rain.

The data are extracted from the laser scanner output format and converted into a file that contains

Each point cloud is chosen at the time of lowest tide between 18:00 and 06:00 LT, in order to avoid people and dogs on the beach, with the exception of two instances where only very few scans were available due to maintenance activities. The data from 9 January 2017 are entirely removed from the data set because of poor visibility due to fog. This leads to the 30 d data set, numbered from 0 to 29. Additionally, all points above 14.5 m elevation are removed to filter out points representing the balcony of the hotel and the flag posts along the paths. In this way, most reflections from particles in the air, birds or raindrops are also removed, although some of these might still be present at lower heights close to the beach.

Since the data are acquired from a fixed and stable position, we assume that consecutive scans are aligned. Nevertheless, the orientation of the scanner may change slightly due to strong wind, sudden changes in temperature or maintenance activities. The internal inclination sensor of the scanner measures these shifts while it is scanning, and we apply a correction for large deviations (more than 0.01

The remaining error in elevation is estimated as the standard error and the 95th percentile of deviations from the mean elevation over all grid cells included in the stable paved area. As the stable surface, we chose part of the paved paths on top of the dunes leading to the beach in the northern and southern directions, as indicated in Fig.
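These stability statistics can be sketched as follows, assuming the gridded elevations of the stable area are arranged as a (cells × days) array; the array shape and the 1 cm noise level of the synthetic example are assumptions for illustration, not measured values:

```python
import numpy as np

def stability_stats(elev):
    """Stability statistics for gridded elevations on the paved area.

    elev: array of shape (n_cells, n_days), one elevation time series
    per grid cell, all assumed stable over the month. Returns the mean
    elevation, the mean standard error and the average 95th percentile
    of absolute deviations from the per-cell mean.
    """
    dev = np.abs(elev - elev.mean(axis=1, keepdims=True))
    std_err = elev.std(axis=1, ddof=1) / np.sqrt(elev.shape[1])
    p95 = np.percentile(dev, 95, axis=1)
    return float(elev.mean()), float(std_err.mean()), float(p95.mean())

# synthetic stable area: 100 cells, 30 daily values, ~1 cm noise
rng = np.random.default_rng(0)
elev = 14.0 + rng.normal(0.0, 0.01, size=(100, 30))
mean_z, se, p95 = stability_stats(elev)
```

With real data, `se` and `p95` bound the elevation noise that any detected change on the beach must exceed to be considered significant.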

Test statistics of the gridded elevation values on the paved area, which is assumed to be stable throughout the observation period of 1 month. Values are calculated per time series and averaged over the entire stable area, which results in mean elevation, standard error and an average 95th percentile of deviations from the mean.

Time series of elevation at location

To derive coastal deformation processes from clusters based on change patterns, we follow three steps: extraction of time series, clustering of time series with three different algorithms and derivation of geomorphological deformation processes. To cluster time series, we need a measure of distance (or similarity) between two time series, and its definition is not immediately obvious. We discuss two options (Euclidean distance and correlation) to define distances between time series, with different effects on the clustering results. The rest of this section is organized as follows: we focus on time series extraction in Sect.

Time series of surface elevation are extracted from the PLS data set by using a grid in Cartesian

Before defining a grid on our observed area, we rotate the observation area to make sure that the coastline is parallel to the

In this way, we extract around 40 000 grid cells that contain complete elevation time series for the entire month. The point density per grid cell varies depending on the distance to the laser scanner. For example, a grid cell on the paved path (at about 80 m range) contains about 40 points (i.e. time series at
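A minimal sketch of this extraction step, assuming each daily scan is available as a NumPy array of x, y, z coordinates already rotated into the grid frame, and using the mean elevation per cell as the representative value (an assumption for illustration; the study's exact per-cell statistic is not restated here):

```python
import numpy as np

def extract_time_series(scans, cell=1.0):
    """Grid-based extraction of elevation time series.

    scans: list of (N, 3) arrays (x, y, z), one rotated point cloud per
    day. cell: grid spacing in metres (illustrative value). Returns a
    dict mapping (ix, iy) grid indices to one elevation per scan; only
    cells covered in every scan are kept, so each retained cell holds a
    complete time series.
    """
    per_scan = []
    for pts in scans:
        ix = np.floor(pts[:, 0] / cell).astype(int)
        iy = np.floor(pts[:, 1] / cell).astype(int)
        cells = {}
        for i, j, z in zip(ix, iy, pts[:, 2]):
            cells.setdefault((i, j), []).append(z)
        # one representative elevation per cell (here: the mean)
        per_scan.append({k: float(np.mean(v)) for k, v in cells.items()})
    complete = set(per_scan[0])
    for day in per_scan[1:]:
        complete &= set(day)
    return {k: [day[k] for day in per_scan] for k in complete}

# toy example: two daily scans over a 2 m x 1 m patch
rng = np.random.default_rng(1)
scans = [np.column_stack([rng.uniform(0, 2, 200),
                          rng.uniform(0, 1, 200),
                          np.full(200, z)]) for z in (3.0, 3.1)]
series = extract_time_series(scans)
```

Discarding incomplete cells mirrors the requirement that each grid cell contributes a full month of elevation values.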

We consider two different distance metrics for our analysis: the Euclidean distance as the default for the k-means and agglomerative clustering algorithms, and the correlation distance for DBSCAN.

The most common and obvious choice is the Euclidean distance metric, defined for two time series x = (x_1, ..., x_n) and y = (y_1, ..., y_n) as d(x, y) = sqrt(sum_i (x_i - y_i)^2).

Another well-known distance measure is correlation distance, defined as 1 minus the Pearson correlation coefficient (see, for example,

For a comparison of the two distances for some example time series, see Fig.

Example of three pairs of time series that are “similar” to each other in different ways. The Euclidean distance would sort the differences as follows:

Neither the Euclidean distance nor the correlation distance takes the order of the values within each time series into account. For example, two identical time series that are offset by a constant elevation are seen as “similar” by the correlation distance but not by the Euclidean distance, and would not be considered identical by either of them (see Fig.
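A small numerical illustration of the two distance measures on synthetic sine series (not data from the study): a constant elevation offset leaves the correlation distance near zero while the Euclidean distance grows, and a shift in time makes both measures report dissimilarity:

```python
import numpy as np

def euclidean_dist(x, y):
    """Euclidean distance between two equally long time series."""
    return float(np.linalg.norm(x - y))

def correlation_dist(x, y):
    """1 minus the Pearson correlation coefficient."""
    return float(1.0 - np.corrcoef(x, y)[0, 1])

t = np.linspace(0.0, 2.0 * np.pi, 50)
base = np.sin(t)
offset = base + 1.0            # same pattern, offset in elevation
shifted = np.sin(t - np.pi)    # same pattern, shifted in time

d_corr_offset = correlation_dist(base, offset)  # ~0: "similar"
d_eucl_offset = euclidean_dist(base, offset)    # large: "dissimilar"
d_corr_shift = correlation_dist(base, shifted)  # large for both measures
d_eucl_shift = euclidean_dist(base, shifted)
```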

Clustering methods for time series can be divided into two categories: feature-based and raw-data-based approaches (see, for example,

Example of clustering of data with two clusters with different variance: the

For the implementation of all three algorithms, we make use of the Scikit-learn package in Python (see

The k-means algorithm partitions the data into a predefined number of clusters. Starting from an initial set of cluster centroids, it iterates over the following steps until convergence:

assign each point to the cluster with the closest centroid;

move centroid to the mean of each cluster;

calculate the sum of squared distances over all clusters (Eq.

Note that minimizing the sum of squared distances over all clusters coincides with minimizing the sum of all within-cluster variances weighted by cluster size. Convergence to a local minimum can be shown when the Euclidean distance is used (see, for example,
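The iteration above is the k-means algorithm as provided by Scikit-learn, which this study uses for its implementation; the synthetic accreting and eroding series, the noise level and the choice of two clusters below are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

# illustrative data: 100 accreting and 100 eroding elevation time
# series over 30 d (trend magnitudes and noise level are assumptions)
rng = np.random.default_rng(2)
n_days = 30
accretion = np.linspace(0.0, 0.5, n_days) + rng.normal(0, 0.02, (100, n_days))
erosion = np.linspace(0.0, -0.5, n_days) + rng.normal(0, 0.02, (100, n_days))
X = np.vstack([accretion, erosion])
# remove the per-series median; it is only added back for visualization
X = X - np.median(X, axis=1, keepdims=True)

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_               # cluster index per time series
centroids = km.cluster_centers_   # mean change pattern per cluster
```

Each row of `centroids` is the mean change pattern of one cluster; linking such a pattern to a geomorphic process remains an interpretation step.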

There are variations of the

Agglomerative clustering is one form of hierarchical clustering: it starts with each point in a separate cluster and iteratively merges clusters until a certain stopping criterion is met. There are different variations of agglomerative clustering using different input parameters and stopping criteria (see, for example,

Loop through all pairs of clusters:

form new clusters by merging two neighbouring clusters into one and

calculate the sum of squared distances (Eq.

Keep the clusters with the minimal sum of squared distances.

In this way, we use agglomerative clustering with a similar approach to the
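A sketch of this approach with Scikit-learn's AgglomerativeClustering; Ward linkage, which merges the pair of clusters giving the smallest increase in total within-cluster variance, is the closest built-in match to the merge criterion described above (the linkage choice and the synthetic data are assumptions):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# illustrative data: 80 stable and 80 eroding elevation time series
rng = np.random.default_rng(3)
n_days = 30
stable = rng.normal(0.0, 0.02, (80, n_days))
eroding = np.linspace(0.0, -0.4, n_days) + rng.normal(0, 0.02, (80, n_days))
X = np.vstack([stable, eroding])

# start from singleton clusters and merge until two clusters remain
agg = AgglomerativeClustering(n_clusters=2, linkage="ward").fit(X)
labels = agg.labels_
```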

DBSCAN is a classical example of clustering based on the maximal allowed distance to neighbouring points that automatically derives the number of clusters from the data. It was introduced in 1996 by

Determine neighbourhood of each point and identify core points.

Form clusters out of all neighbouring core points.

Loop through all non-core points and add to cluster of neighbouring core point if within maximal distance; otherwise, classify as noise.

In this way, clusters are formed that truly represent a dense collection of “similar” points. Since we choose to use correlation as a distance metric, each cluster will contain correlated time series in our case. All points that cannot be assigned to the neighbourhood of a core point are classified as noise or outliers.
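A sketch of DBSCAN on a precomputed correlation-distance matrix; the values of eps (maximum distance) and min_samples, and the synthetic series, are illustrative assumptions, not the parameters used in the study:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# illustrative data: two groups of correlated series plus one
# uncorrelated outlier series
rng = np.random.default_rng(4)
n_days = 30
trend = np.linspace(0.0, 1.0, n_days)
rising = trend + rng.normal(0, 0.05, (60, n_days))
falling = -trend + rng.normal(0, 0.05, (60, n_days))
outlier = rng.normal(0.0, 1.0, (1, n_days))
X = np.vstack([rising, falling, outlier])

corr = np.corrcoef(X)                 # pairwise Pearson correlation
D = np.clip(1.0 - corr, 0.0, None)    # correlation distance matrix
db = DBSCAN(eps=0.15, min_samples=5, metric="precomputed").fit(D)
labels = db.labels_                   # -1 marks noise/outliers
```

Each resulting cluster contains mutually correlated time series, and the uncorrelated series is labelled as noise.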

To determine whether an algorithm is suitable, we expect it to fulfil the previously defined criteria:

A majority of the observation area is separated into distinct regions.

Each cluster shows a change pattern that can be associated with a geomorphic deformation process.

Time series contained in each cluster roughly follow the mean change pattern.

In order to evaluate these criteria, we compare the three clustering algorithms, as well as two choices for the number of clusters

The clustered data are visualized in a top view of the observation area, where each point represents the location of a grid cell. Each cluster is associated with its cluster centroid, the mean elevation time series of all time series in the respective cluster. For visualization purposes, we add the median elevation back to the cluster centroids, even though it is not taken into account during the clustering. We subsequently derive change processes visually from the entire clustered area. We establish which kinds of deformation patterns can be distinguished, estimate rates of change in elevation and link them to the underlying processes.

We use the following criteria to compare the respective clustering and grid generation methods quantitatively:

percentage of entire area clustered;

minimum and maximum within-cluster variation;

percentage of correctly identified change in test area with bulldozer work.

The percentage of the area that is clustered differs depending on the algorithm. DBSCAN in particular discards points that are too far away (i.e. too dissimilar) from all others as noise. This percentage is measured over the entire observation area, with the number of all complete time series counting as 100 %.

Each cluster has a mean centroid time series and all other time series deviate from that to a certain degree. We calculate the average standard deviation over the entire month per cluster and report on the minimum and maximum values out of all realized clusters.
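The first two comparison criteria can be sketched as follows (the test-area score requires the manually labelled bulldozer area and is omitted); the array shapes and toy values are illustrative:

```python
import numpy as np

def clustering_report(labels, X):
    """Percentage clustered and min/max within-cluster spread (sketch).

    labels: cluster label per time series, with -1 meaning
    noise/unclustered. X: array of shape (n_series, n_days).
    """
    clustered = labels >= 0
    pct = 100.0 * clustered.sum() / len(labels)
    spreads = []
    for c in np.unique(labels[clustered]):
        members = X[labels == c]
        # per-day std around the cluster centroid, averaged over the month
        spreads.append(float(members.std(axis=0).mean()))
    return pct, min(spreads), max(spreads)

# toy example: two tight clusters and one unclustered series
labels = np.array([0] * 5 + [1] * 5 + [-1])
X = np.vstack([np.zeros((5, 3)), np.ones((5, 3)), np.full((1, 3), 9.0)])
pct, min_spread, max_spread = clustering_report(labels, X)
```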

To allow for a comparison of the clusters with a form of ground truth, we selected a test area at the bottom of the footpath. In this area, a pile of sand was accumulated by a bulldozer after the entrance to the path was covered with large amounts of sand during a period of rough weather conditions (8–16 January, corresponding to days 7–14 in our time series), as reported by

Test area for the comparison of clusters generated with three different algorithms. The test area is located where the northern access path meets the beach (see Fig.

The stable cluster consists of cluster 0, the largest cluster when using

The results are presented in two parts. First, we compare two different choices of the parameter

For the

With the

Summary of comparison of

On the test area the

The agglomerative clustering algorithm is set up, as the

On the test area, the detection of negative and positive changes is more balanced and leads to an overall score of 88 % correctly identified points. Agglomerative clustering clearly separates the path that was cleared by the bulldozer and identifies it as eroding.

When we use the DBSCAN algorithm on the same data set, with minimum number of points

The intertidal zone cannot be separated clearly from the “noise” part of the observation area, nor can we distinguish the stable path area or the upper part of the beach. In the test area, the sand pile is not represented by a separate cluster and positive changes in elevation are not found, which results in an overall worse percentage of correctly identified points. However, two clusters represent areas which are relatively stable throughout the month, except for a sudden peak in elevation on a single day. These peaks are caused by a van parked on the path on top of the dunes and by people passing by, not by actual deformation; compare Fig.

Mean time series per cluster found with the DBSCAN algorithm. Outliers or points that are not clustered are represented by the blue mean time series. The two most prominent time series (clusters 5 and 6, light green and light blue) are located on the path on top of the dunes. The peaks are caused by a group of people and a van, on 5 and 6 January, respectively, illustrated by the point clouds in the middle of the plot.

On the test area, the DBSCAN algorithm performs worse than both other algorithms. In total, 79 % of points are correctly classified into “stable” or “significant negative change”. As stable points, we count in this case all points that are classified either as noise or as belonging to cluster 1 (orange). The reason for this is that the mean of all time series that are not clustered appears relatively stable, while cluster 1 describes very slow erosion of less than 0.15 cm per day.

Considering the clusters found by the

Observation area partitioned into clusters by the

We successfully applied the presented methods to a data set from permanent laser scanning and demonstrated the identification of deformation processes from the resulting clusters. Here, we discuss our results on distance measures, clustering methods and the choice of their respective input parameters, as well as the derivation of change processes.

Possible distance measures for use in time series clustering are analysed, among others, by

We chose the correlation distance for the DBSCAN algorithm because correlation in principle represents a more intuitive way of comparing time series (see Fig.

Centring the time series and scaling them with their respective standard deviations for the use of the Euclidean distance would make the two distance measures equivalent. However, this did not improve our results using
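This equivalence can be checked numerically: after centring and scaling to unit standard deviation, the squared Euclidean distance between two series of length n equals 2n(1 − r), where r is the Pearson correlation coefficient, so the two measures rank pairs of series identically:

```python
import numpy as np

def znorm(x):
    """Centre a series and scale it to unit standard deviation."""
    return (x - x.mean()) / x.std()

rng = np.random.default_rng(5)
x = rng.normal(size=30)
y = rng.normal(size=30)
n = len(x)
r = np.corrcoef(x, y)[0, 1]

# squared Euclidean distance of the z-normalized series ...
d_sq = float(np.sum((znorm(x) - znorm(y)) ** 2))
# ... equals 2 n (1 - r), i.e. 2 n times the correlation distance
identity_holds = np.isclose(d_sq, 2.0 * n * (1.0 - r))
```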

Neither of the two distance measures analysed here can deal with gaps in the time series, although handling such gaps would be of great interest for further analysis of the intertidal area and sand banks in particular. Additionally, neither distance measure allows us to identify identical elevation patterns that are shifted in time as similar. An alternative distance measure suited to dealing with these issues would be DTW, which accounts for similarity in patterns even though they might be shifted in time (
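DTW is not applied in this study, but a minimal, unconstrained O(nm) implementation illustrates how it treats the same elevation pattern shifted in time (the step series below is a synthetic example):

```python
import numpy as np

def dtw_distance(x, y):
    """Minimal dynamic time warping distance (unconstrained, O(n*m))."""
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

t = np.arange(30)
step = np.where(t < 10, 0.0, 1.0)        # e.g. sand arriving on day 10
late_step = np.where(t < 15, 0.0, 1.0)   # the same pattern 5 d later

d_dtw = dtw_distance(step, late_step)    # warping aligns the two steps
d_eucl = float(np.linalg.norm(step - late_step))  # large by comparison
```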

The use of both k-means and agglomerative clustering fulfils the previously defined criteria:

a majority of the observation area is separated into distinct regions,

each cluster shows a change pattern that can be associated with a geomorphic deformation process, and

time series contained in each cluster roughly follow the mean change pattern.

However, the computational effort needed to loop through all possible combinations of merging clusters for agglomerative clustering is considerably higher. Of the three algorithms that were used in this study, agglomerative clustering is the only one that regularly ran into memory errors. This is a disadvantage considering the possible extension of our method to a data set with longer time series.

One of the disadvantages of the k-means and agglomerative clustering algorithms is that the number of clusters has to be chosen in advance.

To avoid this issue, we also compare both approaches with the DBSCAN algorithm. It is especially suitable for distinguishing anomalies and unexpected patterns in data, as demonstrated by

DBSCAN selection of input parameters: number of clusters versus input parameter maximum distance within clusters, and minimum number of points and percentage of total points in clusters (not classified as noise or outliers). The choice of an “optimal” set of parameters is not obvious. We indicate our selection with a red circle in both plots.
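The selection plot can be reproduced in outline by sweeping the input parameters and recording the number of clusters and the percentage of clustered points; the eps grid, the min_samples value and the synthetic data below are assumptions for illustration:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# two well-separated groups of synthetic time series
rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0.0, 0.05, (50, 30)),
               rng.normal(1.0, 0.05, (50, 30))])

results = []
for eps in (0.1, 0.5, 1.0, 2.0):   # maximum distance within clusters
    labels = DBSCAN(eps=eps, min_samples=5).fit(X).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    pct = 100.0 * (labels >= 0).sum() / len(labels)
    results.append((eps, n_clusters, pct))
```

Plotting `n_clusters` and `pct` against `eps` reproduces the trade-off in the figure: a very small eps leaves most points as noise, while a very large eps merges distinct regions.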

An alternative clustering approach for time series based on fuzzy

A similar approach would be to use our clustering results and identified change patterns as input to the region-growing approach of

As shown in Fig.

The DBSCAN algorithm successfully identifies parts of the beach that are dominated by a prominent peak in the time series (caused by a van and a small group of people). Out of the three algorithms that we compare, it is most sensitive to these outliers in the form of people or temporary objects in the data. It was not our goal for this study to detect people or objects on the beach, but this ability could be a useful application of the DBSCAN algorithm to filter the data for outliers in a pre-processing step.

We compared three different clustering algorithms (k-means, agglomerative clustering and DBSCAN).

The most promising results are found using

Our key findings are summarized as follows:

Both k-means and agglomerative clustering fulfil our criteria for a suitable algorithm.

Predominant deformation patterns of sandy beaches are detected automatically and without prior knowledge using these methods. The level of detail of the detected deformation processes increases with the number of clusters.

Change processes on sandy beaches, which are associated with a specific region and time span, are detected in a spatiotemporal data set from permanent laser scanning with the presented methods.

Our results demonstrate a successful method to mine a spatiotemporal data set from permanent laser scanning for predominant change patterns. The method is suitable for application in an automated processing chain to derive deformation patterns and regions of interest from a large spatiotemporal data set. It allows such a data set to be partitioned in space and time according to specific research questions, targeting phenomena such as the interaction of human activities and natural sand transport during storms, recovery periods after a storm event or the formation of sand banks. The presented methods enable the use of an extensive time series data set from permanent laser scanning to support research on long-term and small-scale processes on sandy beaches and to improve the analysis and modelling of these processes. In this way, we expect to contribute to an improved understanding and management of these vulnerable coastal areas.

The data set used for this study is available via the 4TU Centre for Research Data:

MK carried out the investigation, developed the methodology and software, realized all visualizations and wrote the original draft. RL supervised the work and contributed to the conceptualization and the writing by reviewing and editing. SV developed the instrumental setup of the laser scanner and collected the raw data set that was used for this research.

The authors declare that they have no conflict of interest.

The authors would like to thank Casper Mulder for the work on his bachelor's thesis “Identifying Deformation Regimes at Kijkduin Beach Using DBSCAN”.

This research has been supported by the Netherlands Organization for Scientific Research (NWO, grant no. 16352) as part of the Open Technology Programme.

This paper was edited by Andreas Baas and reviewed by two anonymous referees.