Machine learning applied to the prediction of rockfall slope probability

The objective of this work is to propose a predictive model of rockfall probability in rock slopes using the K-Nearest Neighbors (KNN) method. A dataset composed of 220 rock slopes was used, whose variables are related to the presence of water, characteristics of the rock mass, and degree of overhang, among others. For each slope in the dataset, the rockfall probability (high, medium, or low) is known, having been determined by cluster analysis. The number of nearest neighbors (k) ranged from 1 to 20. The average accuracy of the tested predictive models on the test sample was 78.4%. The models produced satisfactory results in the prediction of rockfall probability, with an average area under the ROC curve of 0.80. The best model was selected as the one whose k value gave the highest accuracy and the highest area under the ROC curve. The selected model had k equal to 7.

those with the aim of finding dataset patterns. The identification of these patterns makes it possible to predict the behavior of new individuals in the model. Mascarenhas et al. (2020) applied machine learning algorithms to propose an automatic classification system of Specialized Knowledge of Physics Teachers based on a pre-classified database. Ossani et al. (2020) applied unsupervised learning techniques to find clustering patterns of specialty coffees and compared the obtained clusters with the original ones. Subsequently, Ossani et al. (2021) used supervised machine learning techniques to classify specialty coffees and compared the performance of each technique. Silva et al. (2021) used artificial neural networks and linear regression to build a tool for predicting the spatio-temporal distribution of viruses transmitted by Aedes aegypti. Fernandes et al. (2021) compared different artificial neural network architectures to evaluate their behavior in predicting loads in an electrical system. Pessoa et al. (2021) used artificial neural networks to predict the load capacity of foundations.
In geotechnical engineering, several methodologies are used worldwide for rock mass classification and excavation stability analysis, such as Rock Mass Rating (RMR) (Bieniawski, 1989), Q-System (Barton et al., 1974), Slope Mass Rating (SMR) (Romana, 1985), and Q-Slope (Bar & Barton, 2017). However, assessment models and rock mass classification systems often have a high degree of uncertainty and subjectivity, since they are based only on field survey experience and general empirical rules.
Taking into account the successful application of machine learning and multivariate statistical techniques to create prediction models, Santos et al. (2021) applied these techniques to predict the class of a rock mass according to a modified Rock Mass Rating (RMR). Only the relevant variables, identified through multivariate factor analysis, were considered in the model, which reduces the subjectivity inherent to rock mass classification problems.
Subsequently, Santos et al. (2022) compared machine learning techniques for predicting rock mass classes using the same dataset as Santos et al. (2021). Regarding the use of multivariate statistical techniques and machine learning for rock slope stability analysis, Santos et al. (2019) and Naghadehi et al. (2013) proposed models to predict the stability condition classification of rock mine slopes.
Rockfall is a complex slope mass movement that is difficult to predict. Unlike soil failures, for which precipitation is a common trigger, a rockfall does not always require a trigger. Monitoring of geological risk areas is common in periods of high precipitation, but such monitoring does not always cover rockfalls, which can result in catastrophic events in urban areas, on highways, and in mining operations.
Some methodologies have been developed to assess rockfall hazard, such as the Rockfall Hazard Rating System (RHRS) (Pierson & Van Vickle, 1993) and the Colorado Rockfall Hazard Rating System (CRHRS) (Santi et al., 2009). These methodologies were proposed for highway slopes, but they are not able to predict rockfall probability. Instead, they rank the evaluated slopes from more hazardous to less hazardous according to the sum of the scores attributed to variables related to rock mass and traffic conditions, with the goal of identifying the slopes where intervention is most urgent.
Therefore, methodologies capable of predicting rockfall probability are needed to solve the aforementioned problems.
However, because of the uncertainties inherent in field surveys and rockfall movements, these methodologies must be optimized, be as accurate as possible, and be able to quantify their prediction errors. The objective of this research is to propose a classification model of rockfall probability using K-Nearest Neighbors (KNN). The number of nearest neighbors (k) was varied, and the best classification model for predicting the class of any rock slope was proposed. A dataset of 220 slopes was used, whose rockfall probability classification (high, medium, or low) was determined through cluster analysis, an unsupervised multivariate statistical technique.

Cluster Analysis
Cluster analysis is an unsupervised multivariate statistical technique used to group individuals into homogeneous clusters without any prior labeling of the individuals. Among the various clustering techniques presented in the literature, one example is the non-hierarchical k-medoids method (Kaufman & Rousseeuw, 1990). The Partitioning Around Medoids (PAM) algorithm can be used to perform cluster analysis through the k-medoids method.
According to Kassambara (2017), the steps of the PAM algorithm are as follows. First, the algorithm randomly selects k individuals to become the medoids; the medoid is the representative individual of each group, so the number of groups is equal to k. Second, the dissimilarity matrix is calculated and every individual in the dataset is assigned to the cluster of its closest medoid. Third, if any individual in any cluster is able to reduce the dissimilarity coefficient, this individual becomes the new medoid of that cluster and the algorithm repeats the steps above; if not, the algorithm ends.
The dissimilarity matrix can be computed using any statistical distance, such as the Euclidean distance or the Manhattan distance.
The Manhattan distance may be applied when the database contains outliers, since it is less sensitive to them than the Euclidean distance.
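The three PAM steps described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the implementation used in this study (which relied on R): the function names pam, total_cost and manhattan, the fixed random seed, and the six toy points are all hypothetical, and the swap search is a simple greedy variant.

```python
import random

def manhattan(a, b):
    # Manhattan (city-block) distance between two score vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def total_cost(points, medoids, dist):
    # Sum of the distances from every point to its closest medoid
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k, dist=manhattan, seed=0):
    rng = random.Random(seed)
    # Step 1: pick k individuals at random as the initial medoids
    medoids = rng.sample(range(len(points)), k)
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        # Step 3: accept any medoid/non-medoid swap that lowers the total cost
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial, dist)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    # Step 2: assign every individual to the cluster of its closest medoid
    labels = [min(range(k), key=lambda j: dist(p, points[medoids[j]]))
              for p in points]
    return medoids, labels

# Hypothetical toy data: two well-separated groups of three points each
points = [(1, 1), (1, 2), (2, 1), (4, 4), (4, 3), (3, 4)]
medoids, labels = pam(points, k=2)
print(sorted(medoids), labels)
```

On this toy set the search ends with the two central points, (1, 1) and (4, 4), as medoids, whatever the random start.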

K-Nearest Neighbors
K-Nearest Neighbors (KNN) is a supervised machine learning technique used to predict the class of an individual according to the similarities between this individual and individuals pre-classified into specific classes. One way to evaluate the similarity between individuals is through statistical distance measures: an individual is assigned to the class that predominates among its k nearest neighbors (labeled individuals), i.e., those at the smallest distances from it. The distance measure can be the Euclidean distance, the Minkowski distance, or the Mahalanobis distance (Kubat, 2017). The number of nearest neighbors to be evaluated, k, must be provided. If k is equal to 1, the individual is automatically assigned to the class of its nearest neighbor. If k is equal to 3, the distances between the new individual and its three closest neighbors are evaluated, and the new individual is classified according to the majority class among these neighbors. In Figure 1, when k is equal to 1, the new individual is classified in Class B; when k is equal to 3, the individual is classified in Class A.
Research, Society and Development, v. 11, n. 10, e89111032603, 2022 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v11i10.32603
To obtain the best value for k, models with different values of k must be tested. The optimal k is the one that yields the model with the best validation metrics, such as the apparent error (Equation 1) and the accuracy of the model (Equation 2). Another important and widely used metric for evaluating model performance is the area under the ROC curve (AUC). An AUC equal to 1 represents a perfect model, without errors; therefore, the closer the AUC is to 1, the better the model.
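The KNN voting rule can be illustrated with a minimal Python sketch. The four labeled points and the query point below are hypothetical; they were chosen only so that the outcome mirrors the situation described for Figure 1 (Class B for k equal to 1, Class A for k equal to 3). The function name knn_predict is likewise an assumption, and the Euclidean distance is one of the options cited above.

```python
import math
from collections import Counter

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(train_X, train_y, query, k, dist=euclidean):
    # Rank the labeled individuals by distance to the query point
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], query))
    # Majority vote among the k nearest neighbors
    votes = Counter(train_y[i] for i in order[:k])
    label, count = votes.most_common(1)[0]
    # Also return the share of the k votes won by the predicted class
    return label, count / k

# Hypothetical labeled individuals and a new individual at the origin
train_X = [(0.5, 0.0), (1.0, 0.0), (0.0, 1.2), (3.0, 3.0)]
train_y = ["B", "A", "A", "B"]
query = (0.0, 0.0)

print(knn_predict(train_X, train_y, query, k=1))  # nearest neighbor is in Class B
print(knn_predict(train_X, train_y, query, k=3))  # two of three neighbors are in Class A
```

The vote share returned alongside the label is one simple way to express how confident the classifier is in its prediction.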

Dataset
The dataset used in this research is part of the dataset used by Santi et al. (2009) to develop the CRHRS. It is composed of 220 rock slopes, and the variables were surveyed on highway slopes in Colorado (USA). Although the database refers to highway slopes in the state of Colorado, all the variables and characteristics considered in this study could easily be surveyed on rock slopes anywhere in the world. These parameters are traditionally used in rock mass classification and slope stability analysis, and some of the variables included in this research are evaluated in a similar way in classification methodologies already established in geotechnical engineering practice, such as the RMR (Bieniawski, 1989).
All variables related to the rock mass in the CRHRS method were considered in this study, except the number of discontinuity sets and the weathering degree of the intact rock. The number of sets does not vary in the dataset used to propose the model. The weathering degree of the intact rock was not considered, as rockfalls occur even if the intact rock is fresh; the weathering degree and infilling of the joints, however, were considered. Table 1 shows the eight independent variables (P1 to P8) used in the proposed model, where each variable received a score from 1 to 4 according to its characteristics (1 represents a safe condition and 4 a critical condition).
As KNN is a supervised machine learning technique, the status or dependent variable must be known. The dependent variable was determined by cluster analysis: the obtained clusters were labeled, and each group was classified as high, medium, or low rockfall probability. The non-hierarchical k-medoids method was used, carried out through the PAM algorithm with the factoextra package (Kassambara & Mundt, 2020) in R (R Core Team, 2020).
Cluster analysis grouped the dataset into three clusters based on the sum of scores assigned to the variables. As there are eight variables and the score ranges from 1 to 4, the minimum possible final score is 8, representing a safe slope, with low rockfall probability. The maximum possible score is 32, representing an unsafe slope, with high rockfall probability. Table 2 presents the range of the scores for each class, used to classify the 220 slopes of the dataset. Table 3 presents a part of the dataset with scores ranging from 1 to 4, assigned to the independent variables, and the dependent variable, obtained through k-medoids.
P1  P2  P3  P4  P5  P6  P7  P8  Rockfall probability
2   3   3   3   2   4   3   1   high
3   4   3   3   4   4   3   1   high
3   3   2   2   4   3   3   2   high
2   3   4   3   2   4   3   1   high
1   3   3   2   2   4   3   1   medium
1   3   3   2   2   4   3   1   medium
Source: Authors.

Applied methodology
The developed methodology is summarized in the flowchart shown in Figure 2; each step is explained below.
Before applying machine learning techniques, variables must be standardized to solve scale problems, especially when they have different measurement units or large differences in magnitude (Kubat, 2017). As the dataset used in this work is ordinal and all variables can only take integer values between 1 and 4, this problem does not occur. However, according to Laurence (1992), when artificial neural networks (ANN) and other machine learning techniques are applied to ordinal data, it is convenient to convert the data to a percentile to keep the values below 1. Therefore, the dataset was normalized to a scale between 0 and 1. Table 4 shows part of the pre-processed dataset, ready for application of the technique.
To validate the proposed rockfall probability model, a random subsampling of the dataset was carried out: the 220 samples were randomly divided into 70% for training and 30% for testing.
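The pre-processing described above can be sketched as follows. The text does not give the exact normalization formula, so min-max scaling of the 1 to 4 scores is used here purely as an assumption; the function names, the seed, and the use of slope indices in place of the real records are also hypothetical. The example row is the first row listed in Table 3.

```python
import random

def min_max_scale(score, lo=1, hi=4):
    # Assumed normalization: map an ordinal score in [lo, hi] onto [0, 1]
    return (score - lo) / (hi - lo)

def train_test_split(rows, test_fraction=0.30, seed=42):
    # Shuffle a copy of the rows and hold out test_fraction of them for testing
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# First row of Table 3 (scores P1 to P8)
row = (2, 3, 3, 3, 2, 4, 3, 1)
scaled_row = [min_max_scale(s) for s in row]
print([round(v, 3) for v in scaled_row])

# Stand-in for the 220 slopes: only indices, not the real records
train, test = train_test_split(range(220))
print(len(train), len(test))
```

With 220 slopes and a 30% hold-out, the split yields 154 training slopes and 66 test slopes, matching the test sample size reported in the results.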
The KNN algorithm was applied using the class package in R (Venables & Ripley, 2002; R Core Team, 2020).
To use the KNN algorithm, the number of neighbors k must be predetermined. The value of k was varied between 1 and 20, and the model with the best metrics (apparent error, accuracy, and AUC) was chosen. The performance was evaluated on both the training and test samples, in order to detect overfitted models and choose the most suitable model to predict rockfall probability.
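The sweep over k can be sketched in Python as below. Everything in this example is hypothetical: the 100 random score vectors, the two-class toy labeling rule (sum of scores of at least 20), the 70/30 slicing, and the function names. The real study used three classes, the 220 surveyed slopes, and also considered the AUC when choosing k. Raw scores are used instead of normalized ones, which does not change the neighbor ranking when every variable shares the same 1 to 4 scale.

```python
import random
from collections import Counter

def knn_label(train, query, k):
    # train is a list of (features, label) pairs; squared Euclidean distance
    # gives the same neighbor ordering as the Euclidean distance
    nearest = sorted(
        train,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], query)),
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def error_rate(train, sample, k):
    # Fraction of individuals in `sample` that are misclassified
    wrong = sum(knn_label(train, x, k) != y for x, y in sample)
    return wrong / len(sample)

# Hypothetical toy data: eight scores from 1 to 4 and an invented labeling rule
rng = random.Random(1)

def toy_row():
    scores = tuple(rng.randint(1, 4) for _ in range(8))
    return scores, ("high" if sum(scores) >= 20 else "low")

data = [toy_row() for _ in range(100)]
train, test = data[:70], data[70:]

# Training and test error for every k from 1 to 20
results = {
    k: (error_rate(train, train, k), error_rate(train, test, k))
    for k in range(1, 21)
}
# Keep the smallest k among those with the lowest test error
best_k = min(results, key=lambda k: results[k][1])
for k, (train_err, test_err) in results.items():
    print(k, round(train_err, 3), round(test_err, 3))
print("selected k:", best_k)
```

Note that the training error for k equal to 1 is always zero here, because each training point is its own nearest neighbor; this is one reason why the training error alone cannot be used to choose k.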
Once the most suitable rockfall probability model was chosen, it was applied to two new slopes in order to determine their rockfall probability. These two slopes are located in a quartzite mine in the city of São Thomé das Letras, Minas Gerais State (Brazil).

Determination of the best (suitable) model
Each point in Figure 3 represents the apparent error obtained for one value of k; the upper and lower dashed horizontal lines represent, respectively, the largest and the smallest error, and the vertical lines indicate the k values with the smallest errors. The lowest error rate (16.7%) was obtained for k equal to 1, 3, and 7 (Figure 3): 11 of the 66 slopes in the test sample (30% of the dataset) were incorrectly classified. The highest error rate was 30.3%, for k equal to 19.
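The reported rates can be checked with a short calculation, taking the apparent error as the fraction of misclassified slopes and the accuracy as its complement (the usual definitions, assumed here because the equations themselves are not reproduced in the text). The count of 20 misclassified slopes for k equal to 19 is not stated in the text; it is inferred as the only whole number out of 66 that rounds to 30.3%.

```python
def apparent_error(n_wrong, n_total):
    # Fraction of individuals that were misclassified
    return n_wrong / n_total

n_slopes = 220
n_test = round(0.30 * n_slopes)           # size of the test sample

lowest = apparent_error(11, n_test)       # k = 1, 3 and 7
accuracy = 1 - lowest                     # accuracy is the complement of the error
highest = apparent_error(20, n_test)      # k = 19; the count of 20 is inferred

print(n_test)
print(f"{lowest:.1%} {accuracy:.1%} {highest:.1%}")
```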
It is important to emphasize that a value of k equal to 1 is not statistically reliable, because the algorithm uses only one sample to classify the individual, forcing it into the class of its nearest neighbor. According to Kubat (2017), the performance of the KNN classifier should improve for k > 1, because the effect of noisy nearest neighbors may be eliminated.
Depending on the number of neighbors selected, overfitting can occur. Overfitting occurs when the error rate of the test sample is high while the error rate of the training sample is low. Thus, the error rates of the test and training samples were compared to check for overfitting. The error values obtained by applying KNN to the test and training samples are presented in Figure 4. Overfitting was not observed for k between 2 and 15. For k values between 16 and 20, the error rate in the test sample increases (Figure 4). For k equal to 3 and 7, the error rates of the test and training samples were small and close: the training sample presented an error rate of 11% for k equal to 3 and 12.3% for k equal to 7. Thus, considering the error rates, both models are suitable, as they present acceptable and similar error rates in the training and test samples, with an accuracy of 83.3% in the test sample. For k equal to 7, the error rates in the training and test samples are closer to each other than for k equal to 3.
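The comparison between the two candidate models amounts to the gap between test and training error, which can be worked out from the figures quoted above. The snippet below is only a worked check of that arithmetic; the dictionary names are illustrative.

```python
# Error rates (in %) reported in the text for the two candidate models
train_error = {3: 11.0, 7: 12.3}
test_error = {3: 16.7, 7: 16.7}

# Gap between test and training error: a smaller gap suggests less overfitting
gap = {k: test_error[k] - train_error[k] for k in train_error}
closest_k = min(gap, key=gap.get)

for k in gap:
    print(f"k = {k}: gap of {gap[k]:.1f} percentage points")
print("smallest gap at k =", closest_k)
```

The gap is 5.7 percentage points for k equal to 3 and 4.4 for k equal to 7, which is the sense in which the error rates are closer for k equal to 7.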
The area under the ROC curve (AUC) was also analyzed to identify the most suitable model. Figure 5 shows the AUC results when KNN was applied to the test and training samples. As Figure 5 shows, the model with the highest AUC in the test sample was the one with k equal to 7 (0.858). Therefore, considering both the error rates and the AUC values, the most suitable model is the one with 7 neighbors.
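For readers unfamiliar with the metric, the AUC of a two-class problem equals the probability that a randomly chosen positive individual receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below computes it this way for hypothetical scores. It is not the procedure used in this study: the text does not state how the AUC was obtained for the three classes, and treating one class against the rest would be one common option.

```python
def binary_auc(scores, labels):
    # labels: 1 for the positive class, 0 for the negative class
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive/negative pairs ranked correctly; a tie counts as half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical scores, e.g. the share of neighbors voting for the positive class
perfect = binary_auc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0])
mixed = binary_auc([0.9, 0.8, 0.3, 0.2], [1, 0, 1, 0])
print(perfect, mixed)
```

In the second example three of the four positive/negative pairs are ranked correctly, so the AUC is 0.75.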
In general, the models obtained through the KNN method to determine rockfall probability were satisfactory. In the test sample, the average error of the models was 21.6% (accuracy of 78.4%). In the training sample, the average error of the models was 15% (accuracy of 85%). The average AUC values for the test and training samples were 0.80 and 0.86, respectively. Considering the uncertainties arising from geotechnical field surveys and the difficulty of predicting rockfall probability, the obtained prediction models for rockfall probability classes are quite satisfactory.

Validation of rockfall probability classes
After obtaining the best model (k equal to 7), the behavior of this model regarding its errors was examined, since an error of 16.7% is acceptable but not negligible, and it is therefore necessary to know the type of error the model makes. Thus, the slopes incorrectly classified by the KNN model in the test sample were analyzed. Table 5 shows these slopes and their classifications. All incorrect classifications obtained through the KNN model are related to the transition zone (medium probability). In addition, the probability of the sample belonging to the class determined by the KNN model is smaller than 70% in 8 of these 11 samples.
Therefore, the KNN model confirms that there are uncertainties in the classification of slopes whose sum of scores ranges from 17 to 21, and in the borderline zone around this range.
Among the eleven slopes in Table 5, the classification provided by the KNN model was less conservative than the classification proposed in Table 2 in ten cases, i.e., a medium rockfall probability was classified by KNN as low, or a high rockfall probability was classified as medium. This type of error means underestimating the rockfall probability, so that slopes in need of intervention or monitoring might not receive the proper treatment.
Despite the errors shown in Table 5, the KNN model can be considered adequate, as an error of 16.7% draws attention to the uncertainties of the variables, which were scored according to a situation observed in the field or a range of values. Their values are associated with the description of the situation that best represents the rock mass behavior, from the point of view of the geologist or the engineer.

Determination of the rockfall probability for 2 new slopes
After validating the optimal model with k equal to 7 and understanding the type of classification errors of KNN in rockfall probability, the model was used to classify two new slopes whose probability classes were unknown. These slopes are located in a quartzite mine in São Thomé das Letras city, Minas Gerais State (Brazil).
Slope 1 (Figure 6a) is composed of a homogeneous fresh quartzite. There is one discontinuity set (the quartzite foliation), whose persistence is higher than 3 m and whose spacing varies between 3 cm and 20 cm, with the smallest spacing predominant. The aperture is in the 0.1 mm to 1 mm range, with granular infilling. The foliation is practically perpendicular to the slope face, with an average orientation of 11/240. The slope face has an orientation of 86/206 (dip/dip direction) and the surface is regular, without overhangs; no evidence of rockfall or sliding was observed. Water dripping on the slope face and in the discontinuities was observed.
Slope 2 (Figure 6b) is also composed of a homogeneous quartzite. There are three sets of discontinuities; one of them is the foliation (set 1). The foliation is practically perpendicular to the slope face, with an average orientation of 08/265.
Set 2 is the most critical set, because it daylights out of the slope and can cause rock sliding; its average orientation is 64/120. The slope face has an orientation of 75/138 (dip/dip direction) and the surface is irregular, with a degree of overhang ranging from 0.6 to 1.2 m; evidence of rock sliding was observed (Figure 6c). The persistence of the discontinuities is higher than 3 m. The spacing of the critical set varies between 20 cm and 80 cm, and this set is planar. Discontinuities with an aperture larger than 1 cm and granular infilling were observed. The slope was dry during the field surveys, and operations on this front had been halted due to evidence of rockfall hazard. Table 6 presents the scores of the variables P1 to P8 for each slope, according to the described characteristics.
Slope  P1  P2  P3  P4  P5  P6  P7  P8
1      3   1   1   1   2   2   3   3
2      1   1   3   3   4   4   3   3
Source: Authors.
The sum of the scores of Slope 1 is 16; thus, according to Table 2, the expected rockfall probability is low. The sum of the scores of Slope 2 is 22, so the expected rockfall probability is high. The rockfall probability predicted by KNN for Slope 1 is low, and the probability of this slope belonging to the low class is 87.50%. The rockfall probability predicted by KNN for Slope 2 is high, and the probability of this slope belonging to the high class is 77.80%. Thus, for these two slopes, the algorithm made correct predictions, consistent with the observations in the field.
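The score sums quoted above follow directly from the Table 6 scores, as the short check below shows. The second part is a conjecture and is not stated in the text: the reported class probabilities of 87.50% and 77.80% coincide with the vote shares 7/8 and 7/9, which would be consistent with the neighborhood growing beyond k equal to 7 when several labeled slopes are tied at the same distance, something that can easily happen with ordinal scores.

```python
# Scores P1 to P8 for the two new slopes, as listed in Table 6
slopes = {
    1: (3, 1, 1, 1, 2, 2, 3, 3),
    2: (1, 1, 3, 3, 4, 4, 3, 3),
}
totals = {slope: sum(scores) for slope, scores in slopes.items()}
print(totals)

# Conjecture only: vote shares that match the reported class probabilities
share_slope_1 = 7 / 8   # 7 of 8 neighbors voting for the predicted class
share_slope_2 = 7 / 9   # 7 of 9 neighbors voting for the predicted class
print(f"{share_slope_1:.2%} {share_slope_2:.2%}")
```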

Conclusion
This article presented a complete assessment of the performance of KNN for predicting rockfall probability in rock slopes, through the analysis of error, accuracy, and AUC for different values of k. The choice of the optimal model considered the error rates, overfitting trends, and AUC; the most suitable model is the one that presented the best metrics.
The most suitable model is the one with 7 neighbors. This model presented an apparent error of 16.7%, an accuracy of 83.3%, and an AUC of 0.858, the highest AUC among all the models tested. The average error rate across all tested models was 21.6% and the average AUC was 0.80, which shows that, in general, KNN, a simple machine learning technique, gives good results in predicting rockfall probability in rock slopes.
Analyzing the 11 slopes of the test sample incorrectly classified by the best KNN model, it was observed that all errors involved the medium class of rockfall probability. It was also possible to verify that, in most of these cases, KNN assigned a probability of less than 70% that these slopes belonged to the predicted class, indicating that there is an uncertainty or transition zone in this type of analysis. As the classification errors were concentrated in the borderline zone of the classes, the trained KNN model can be considered suitable to predict rockfall probability.
In view of the results presented and the efficiency of KNN in predicting rockfall probability, it is suggested that the research continue with more robust machine learning techniques, such as Artificial Neural Networks, Decision Trees, and Random Forest, in order to compare their results with those of KNN and to understand which variables have the greatest impact on the results and which have little impact.