An in-depth assessment of convolutional neural networks for rail surface defect detection

The consistent monitoring of rails relies on correctly identifying defects to support corrective measures. Recently, convolutional neural networks (CNNs), a deep learning method, have provided outstanding results for the automatic detection of defects. However, several aspects of CNN-based approaches, such as network architecture, transfer learning, and processing time, remain not fully understood. In this work, we performed an in-depth assessment of ten widely used CNN models with the objective of finding the one with the best performance in identifying defects in rail surface images. The classification results are promising, reaching an average accuracy of 83.7% in the detection of mild defects and squats. The Inceptionv3 network provided the best results, correctly identifying 92% of the images with severe squat defects.


Introduction
The propagation of cracks in railway tracks gives rise to fractures that could lead to catastrophic events. Therefore, safety inspections should be conducted to detect crack formation and propagation before a fracture occurs. Moreover, consistent monitoring of railways is required to provide reliable data to maintenance teams for planning future corrective actions, thereby achieving operational safety and eliminating existing defects (MRS, 2008).
Recently, automated defect detection in railway tracks has been increasingly studied, driven by advances in computer vision and by the need for an alternative to manual monitoring, which is slow, exhausting, subjective, and costly (Yanan et al., 2018). Among the automated detection methods, those using railway images and convolutional neural networks (CNNs) are promising (Faghih-Roohi et al., 2016). CNNs constitute a class of deep artificial neural networks that rely on local linear operations (convolutions) followed by non-linear transformations, creating different representations of the input data. The convolutional layers act as filters that extract low-level features (e.g., object edges) and high-level features (e.g., object shapes) while considering the spatial context. A non-linear activation function is usually applied to the output of a convolutional layer, followed by a pooling (downsampling) operation to reduce its dimension. After several convolutional and pooling layers, a fully connected (FC) layer might be included to exploit the high-level features learned. The FC layers can be seen as the hidden layers of a multilayer perceptron (MLP) network. Finally, the last layer is often a softmax classifier that outputs a membership probability for each class. A comprehensive overview of CNNs and deep learning can be found in Ponti et al. (2017).
For rail surface defect detection, some studies have employed CNN-based methods for scene classification and object detection. Scene classification aims to identify whether a defect is present in the rail given an image as input (usually a grayscale photograph), whereas object detection approaches also locate the defect by drawing a bounding box around it. Faghih-Roohi et al. (2016), for example, proposed three deep CNNs (DCNNs) for scene classification to identify defects in railway tracks. The authors successfully classified normal rail, small defects, and squats with almost 92% accuracy.
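The layer sequence described above (convolution, non-linear activation, pooling, fully connected layer, softmax) can be illustrated with a minimal sketch in plain Python. It is not one of the networks assessed in this work: the 6×6 input, the edge kernel, and the FC weights are hypothetical values chosen only to make the data flow visible.

```python
import math

def conv2d_valid(image, kernel):
    """Slide the kernel over the image with stride 1 and no padding.
    As in most deep learning frameworks, the kernel is not flipped."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]

def relu(fmap):
    """Non-linear activation: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    """Downsample by keeping the maximum of each 2x2 block."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]) - 1, 2)]
        for r in range(0, len(fmap) - 1, 2)
    ]

def fully_connected(features, weights, biases):
    """One output score per class: weighted sum of the features plus a bias."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, biases)]

def softmax(scores):
    """Turn class scores into probabilities that sum to one."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical 6x6 grayscale patch with a vertical edge in the middle.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
# Hypothetical 3x3 kernel that responds to dark-to-bright vertical edges.
edge_kernel = [[-1.0, 0.0, 1.0]] * 3

pooled = max_pool_2x2(relu(conv2d_valid(image, edge_kernel)))
features = [v for row in pooled for v in row]  # flatten to a vector

# Hypothetical FC weights for three classes (one row per class).
fc_weights = [[0.1] * 4, [0.0] * 4, [-0.1] * 4]
fc_biases = [0.0, 0.0, 0.0]
probabilities = softmax(fully_connected(features, fc_weights, fc_biases))
print(probabilities)
```

In a trained CNN, the kernel values and FC weights are learned from labeled images rather than set by hand, and many such filters are stacked across several layers.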
Similarly, Jamshidi et al. (2017) developed a DCNN model to classify images representing normal rail, trivial defects (seed squats), and squats. For a binary classification problem (squat vs. normal), the authors obtained a classification accuracy of 96.9%. Object detection networks such as YOLOv3 (Redmon and Farhadi, 2018) have been employed to locate rail surface defects in grayscale images. Yanan et al. (2018) detected defects in rails with 97% accuracy using YOLOv3. Yuan et al. (2019) combined the YOLOv3 network with MobileNetV2 (Sandler et al., 2018) to detect three types of rail surface defects. The MobileNetV2 architecture is used as a backbone network to extract image features, whereas YOLOv3 performs the regression prediction. The experiments showed that the combination of MobileNetV2 with YOLOv3 achieves higher detection accuracy and robustness than YOLOv3 alone, reaching an average accuracy of 87.40%. Rodrigues (2020) used the supervised machine learning algorithm SVM (Support Vector Machine) to detect defects on the surface of the billet through images. After training, the model was able to identify tracks in grinding condition and tracks with severe damage. The model achieved more than 95% accuracy in image classification.
Efforts have also been made to detect irregularities in rail components such as fasteners. In a recent study, Yuan et al. (2021) designed a one-dimensional CNN to inspect fasteners using time-domain signals recorded by an accelerometer on a rail with fasteners in different degradation conditions. The authors report that the model achieves high detection accuracy and good robustness to noise.
Some studies employed CNNs for semantic segmentation, assigning a label to each image pixel and thus performing pixel-level classification of defects. Liang et al. (2018) proposed an image processing pipeline based on the SegNet (Badrinarayanan et al., 2017) semantic segmentation architecture and obtained a 100% detection rate. More recently, Kim et al. (2020) modified the AlexNet (Krizhevsky et al., 2012) and the visual geometry group (VGG) (Simonyan & Zisserman, 2014) networks for semantic segmentation and achieved 99% accuracy.
Given the outstanding results of CNN-based methods for automatically detecting rail surface defects in recent years, an in-depth assessment of the most used architectures is needed. Such an assessment may provide valuable insights for the real-world application of CNNs aiming at reliable and fast detection of rail defects. In this work, we assess ten CNN architectures. We focused on identifying a severe type of defect called squat, which is caused by rolling contact fatigue at the wheel-rail interface and is characterized by the shattering of the gauge corner (MRS, 2008). We trained the networks with and without transfer learning and analyzed features such as computation time, accuracy, and the number of model parameters. To the best of our knowledge, this is the first work to assess different CNNs for identifying rail surface defects. Moreover, most studies have focused on detecting the presence or absence of a defect, with its classification remaining a challenge.

Image database and pre-processing
The images used in this work were collected from a critical section of Barra do Piraí's railway line in Rio de Janeiro, Brazil. The railway line is under the concession of MRS Logística S.A, which captured the images using a railway inspection vehicle (RIV) (Figure 1). The grayscale images captured by the RIV have a dimension of 1600 × 1200 pixels. A total of 244 images were collected, of which 80 were taken from rails without any defects, while the others presented some flaws. The image labeling was performed by an expert from the company according to the fracture and defect identification guidelines for railways from MRS (2008).
Research, Society and Development, v. 11, n. 8, e12211830252, 2022 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v11i8.30252
The dataset contains several defect types, such as head checking, flaking, spalling, and squat. The railway lines without defects were grouped in a class called "Normal" (Figure 2a). The less severe superficial defects were grouped in a class called "Mild Defects" (Figure 2b). Images with squat defects were grouped in a class called "Squat" (Figure 2c). As shown in Figure 2b, the "Mild Defects" class is represented by a slight loss of billet material due to the high stresses of the wheel-rail contact. The "Squat" class (Figure 2c) is represented by cracks and holes in a large area on the rail surface, caused by contact fatigue and rail irregularities, such as weld and billet widening (MRS, 2008).
We cropped 300 pixels from each side of the original images to avoid processing areas without the railway. The cropped images have a dimension of 1001 × 1200 pixels, which we reduced by half, i.e., to 500 × 600 pixels, to shorten the processing time. Moreover, we improved the contrast of the images to highlight the railway. We performed a simple linear contrast enhancement, excluding the bottom 1% and the top 1% of all pixel values. The original and final images are shown in Figure 3. Figure 3b shows that the contrast change made the light tones of the image lighter and the dark tones darker, improving the visual quality of the image and, consequently, facilitating the identification of the defect by the neural network.
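The pre-processing steps above (side cropping, halving the resolution, and a linear contrast stretch that saturates the bottom and top 1% of pixel values) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the MATLAB code used in this work: the nearest-neighbour downsampling, the nearest-rank percentile, and the synthetic test image are all assumptions, and the exact cropped width depends on how the crop bounds are counted.

```python
def crop_sides(image, margin):
    """Drop `margin` columns from the left and from the right of every row."""
    return [row[margin:len(row) - margin] for row in image]

def downsample_by_two(image):
    """Halve both dimensions by keeping every second pixel
    (nearest-neighbour; the interpolation method is an assumption)."""
    return [row[::2] for row in image[::2]]

def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in [0, 100]."""
    index = round(q / 100 * (len(sorted_values) - 1))
    return sorted_values[index]

def stretch_contrast(image, low_q=1.0, high_q=99.0):
    """Linearly map [low percentile, high percentile] to [0, 255],
    saturating the pixels that fall outside that range."""
    values = sorted(v for row in image for v in row)
    lo, hi = percentile(values, low_q), percentile(values, high_q)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in image]

    def rescale(v):
        clipped = min(max(v, lo), hi)
        return round(255 * (clipped - lo) / (hi - lo))

    return [[rescale(v) for v in row] for row in image]

# Synthetic low-contrast stand-in for a 1600-pixel-wide grayscale image
# (only 4 rows to keep the example small); values lie between 100 and 149.
raw = [[100 + (c % 50) for c in range(1600)] for _ in range(4)]

cropped = crop_sides(raw, 300)
small = downsample_by_two(cropped)
enhanced = stretch_contrast(small)

print(len(cropped[0]), len(small), len(small[0]))
print(min(map(min, enhanced)), max(map(max, enhanced)))
```

After the stretch, the narrow range of input gray levels spans the full 0 to 255 range, which is the effect described for Figure 3b.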

CNN Architectures
We assessed ten CNN architectures widely used in computer vision and image classification tasks. Table 1 summarizes the characteristics of each architecture, including depth (number of layers), size, parameters, image input size, and the reference work (source: MathWorks, 2020).

Experimental setup
We first evaluated the use of transfer learning, which refers to initializing the weights of the CNN models with values obtained after training them for a different classification problem. Transfer learning has proved helpful in reducing computation time and improving accuracy in several image classification tasks (Shin et al., 2016). We initialized the weights of the networks in two ways: with values pre-trained on the ImageNet database (Deng et al., 2009) and with random values following a uniform distribution, which we refer to as training from "scratch." We used a desktop computer with an Intel Core i7-8700 3.2GHz CPU, 24GB of main memory, and an NVIDIA® GeForce Titan V GPU with 12GB of dedicated memory for training and inference. We implemented all image processing procedures in the MATLAB® environment.
Due to the small number of images, we used image augmentation on the training dataset, which comprises 80% of the 244 images. The augmentation methods comprised image rotations between -10° and 10° and a translation of three pixels along the x- and y-axes. The augmented dataset was composed of 1025 images. Details of the dataset and samples that went through the process are shown in Table 2 and Figure 4, respectively. (Demuth, 2000).
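The augmentation described above can be sketched as a random rotation about the image centre followed by a translation. The sketch below is an illustration in plain Python, not the procedure used in this work: it assumes that the shift is drawn uniformly from -3 to 3 pixels on each axis, and it uses nearest-neighbour resampling with zero fill, neither of which is stated in the text.

```python
import math
import random

def sample_augmentation(rng, max_angle_deg=10.0, max_shift_px=3):
    """Draw a rotation angle in degrees and an integer shift per axis.
    The uniform ranges are an assumption of this sketch."""
    angle = rng.uniform(-max_angle_deg, max_angle_deg)
    dx = rng.randint(-max_shift_px, max_shift_px)
    dy = rng.randint(-max_shift_px, max_shift_px)
    return angle, dx, dy

def transform_point(x, y, angle_deg, dx, dy, cx, cy):
    """Rotate (x, y) about the centre (cx, cy), then translate by (dx, dy)."""
    theta = math.radians(angle_deg)
    xr = cx + (x - cx) * math.cos(theta) - (y - cy) * math.sin(theta)
    yr = cy + (x - cx) * math.sin(theta) + (y - cy) * math.cos(theta)
    return xr + dx, yr + dy

def augment(image, angle_deg, dx, dy, fill=0):
    """Warp the image with inverse mapping and nearest-neighbour sampling."""
    h, w = len(image), len(image[0])
    cx, cy = (w - 1) / 2, (h - 1) / 2
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Undo the translation, then the rotation, to find the source pixel.
            sx, sy = transform_point(x - dx, y - dy, -angle_deg, 0, 0, cx, cy)
            sx, sy = round(sx), round(sy)
            if 0 <= sx < w and 0 <= sy < h:
                out[y][x] = image[sy][sx]
    return out

rng = random.Random(0)  # fixed seed so the example is repeatable
tiny = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]  # stand-in for a rail image
angle, dx, dy = sample_augmentation(rng)
augmented = augment(tiny, angle, dx, dy)
```

Applying several independently sampled transforms to each training image is one way to enlarge a small dataset; the labels are unchanged because small rotations and shifts do not alter the defect class.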
We tested all CNN models with and without transfer learning. To use transfer learning, we replaced the FC layer with a new one whose number of outputs matches the number of classes in our dataset. Initial tests performed with the ResNet-18 network revealed that low initial learning rates (< 0.001) produced overfitting. The best accuracy for this network was achieved using an initial learning rate of 0.01, a decay factor of 0.5 after 90 epochs, a mini-batch size of 12, and 200 training epochs. We trained all networks with these settings. To avoid biased training due to class imbalance, we used class weights.
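The training settings above can be made concrete with a short sketch. Two points are assumptions rather than statements from the text: the learning rate is taken to drop by the decay factor every 90 epochs (a single drop after epoch 90 is the other possible reading), and the class weights are computed by inverse class frequency, since the weighting formula is not given. In the example counts, only the 80 "Normal" images and the total of 244 come from the text; the even split of the remaining 164 images is a placeholder.

```python
def learning_rate(epoch, initial=0.01, drop_factor=0.5, drop_period=90):
    """Piecewise-constant schedule for a 1-indexed epoch.
    Assumes the rate is multiplied by `drop_factor` every `drop_period` epochs."""
    return initial * drop_factor ** ((epoch - 1) // drop_period)

def inverse_frequency_weights(class_counts):
    """Weight each class by N / (K * n_c), so rarer classes weigh more.
    Inverse-frequency weighting is an assumption of this sketch."""
    total = sum(class_counts.values())
    n_classes = len(class_counts)
    return {name: total / (n_classes * count)
            for name, count in class_counts.items()}

# Placeholder class counts: 80 "Normal" images out of 244 are reported;
# the split of the other 164 images is hypothetical.
counts = {"Normal": 80, "Mild Defects": 82, "Squat": 82}
weights = inverse_frequency_weights(counts)

for epoch in (1, 90, 91, 180, 181, 200):
    print(epoch, learning_rate(epoch))
print(weights)
```

With these weights, every class contributes the same total weight to the loss, which is one common way to counter class imbalance.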

Accuracy assessment
We performed the accuracy assessment with the testing images that were not used to train the models. We computed the mean accuracy and F1-score for each class. To evaluate the network accuracy in a binary classification, the classes "Mild defects" and "Squats" were considered one group.
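The metrics named above can be sketched in a few lines of plain Python. The per-class F1-score is computed as 2TP / (2TP + FP + FN), and the binary evaluation merges "Mild defects" and "Squat" into a single defect group before computing accuracy. The six labels in the example are made up for illustration and are not results from this work; the group name "Defect" is also an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of images whose predicted class matches the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def f1_per_class(y_true, y_pred, labels):
    """One-vs-rest F1-score for each class: 2TP / (2TP + FP + FN)."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def to_binary(labels):
    """Merge every non-"Normal" class into a single "Defect" group."""
    return ["Normal" if label == "Normal" else "Defect" for label in labels]

CLASSES = ["Normal", "Mild defects", "Squat"]

# Made-up ground truth and predictions for six test images.
y_true = ["Normal", "Normal", "Mild defects", "Mild defects", "Squat", "Squat"]
y_pred = ["Normal", "Mild defects", "Squat", "Mild defects", "Squat", "Squat"]

multi_class_accuracy = accuracy(y_true, y_pred)
f1_scores = f1_per_class(y_true, y_pred, CLASSES)
binary_accuracy = accuracy(to_binary(y_true), to_binary(y_pred))
print(multi_class_accuracy, f1_scores, binary_accuracy)
```

In this toy case, a mild defect predicted as a squat counts as an error in the multi-class setting but as a correct detection in the binary one, which is why binary accuracy is never lower than multi-class accuracy.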

Results and Discussion
Table 3 and Figure 5 show the multi-class classification results of the trained CNNs. The versions trained from scratch (without pre-trained weights) are identified by the suffix "Scratch". Inceptionv3 obtained the best median classification accuracy, F1-score for the "Normal" class, and F1-score for the "Squat" class on the test set, reaching 83.67%, 86.49%, and 91.89%, respectively. Its training time was 4,419 seconds, 5.5 times more than the fastest network, SqueezeNet. On the other hand, the network with the worst performance was Mobilenetv2_Scratch, with 40% accuracy in classifying railway lines with mild defects, 63.15% accuracy for normal tracks, and 40% for the ones with squat defects. Figure 8 shows that the networks were not able to identify images containing railway lines with mild defects. Conversely, railway lines with squat defects were easier to identify.
The median accuracy versus depth plot, illustrated in Figure 9, arranges the CNNs in descending order according to the number of layers. The line representing the average accuracies obtained by the networks in the classification of the test images does not decrease as the number of layers decreases. The highest average accuracy was achieved by the 48-layer Inceptionv3, and the lowest average accuracy, although not far from that of Inceptionv3, was achieved by the 18-layer SqueezeNet. Finally, Table 4 presents the binary classification accuracy for "Normal railway lines" and "Railway lines with defects". In the binary classification, the methodology correctly predicted 88.48% of the images using Inceptionv3.

Conclusion
We analyzed the performance of ten artificial neural network architectures, both trained from scratch and pre-trained, in detecting rails without defects, with mild defects, and with squat. The pre-trained networks had been trained with more than 1 million images from the ImageNet dataset, learning features to classify 1000 object classes. An augmentation technique was employed to increase the number of images used during training. Additionally, the images underwent pre-processing to crop the unimportant regions and improve the contrast.
The best median accuracy was 83.67%, obtained with Inceptionv3, which also achieved the highest F1-score for the "Squat" class (91.89%). Mobilenetv2_Scratch showed the lowest performance, with a median accuracy of 48.98%, whereas ResNet-18 offered an excellent trade-off between time and accuracy, with a median accuracy of 81.63% and a relative run time of 1.2.
Defect detection should be fast and accurate. From our experiments, we conclude that simple networks and transfer learning can be applied effectively to the detection of rail surface defects such as squat and can achieve promising results in the detection of mild defects. In future work, we suggest increasing the number of examples of mild defects so that the methodology can be used as a preventive maintenance tool.