Wavelets in the analysis of seed image similarity: an approach using the Hurst directional exponent

Modernization is present in all fields of knowledge. The wavelet transform and the Hurst exponent are tools of fundamental importance in many of these advances. In the present study, the wavelet decomposition technique was combined with the Hurst exponent calculation to analyze X-ray images of seeds and thus classify them as full, slightly damaged or damaged. To calculate the Hurst exponent, the mean and the median were used as measures of position. A support vector machine was used to validate the proposed method. For the full, damaged and slightly damaged seed groups, the average accuracy of the method using the mean as the measure of position was 74.5%, and using the median it was 57.05%. For the full and damaged seed groups, the average accuracy using the mean was 99.76%, and using the median it was 80.93%. For the slightly damaged and damaged seed groups, the average accuracy using the mean as the measure of position was 99.26%, and using the median it was 76.22%. When analyzing seeds with slight damage, we observed a decrease in accuracy because the classification of the X-rays was subjective. Therefore, for the image database used in this study, the proposed methodology is efficient for automatic classification.


Introduction
Technological advances are increasing worldwide in different areas of knowledge. In Brazil, agriculture is an area undergoing constant research and technological advancement. One important agricultural activity is the production of grains, especially sunflowers (Helianthus annuus L.). This crop has expanded and consolidated as an economically viable crop because of its resistance to drought, pests and diseases, in addition to other factors. Among annual crops, sunflowers account for 16% of the world's edible oil production, making them the third largest source of edible oil worldwide (EMBRAPA, 2018). To improve the quality of final products that have sunflowers as a raw material, the best grains for planting are selected. Among the various tools used for this purpose, wavelets and machine learning algorithms have gained prominence. These tools allow the images of the seeds to be analyzed without destroying the seeds when evaluating the quality of the produced lots, which greatly speeds up the production process of these grains. The objective of this study was to use the non-decimated discrete wavelet transform together with the Hurst directional exponent to extract characteristics of X-ray images of seeds and classify them (as full, slightly damaged or damaged) with the aid of a machine learning algorithm. Two statistics were used as location measures to calculate the Hurst exponent: the mean and the median.
The accuracy of these two methods was compared. Because this method is noninvasive, it avoids the destruction of seed samples for the quantification of sowing quality and increases the range of analysis options. Thus, we expect to reduce the time necessary for seed classification and selection and improve the overall method.
In many of these studies, the Hurst exponent is used as a key measure for decision making. The Hurst exponent was created in 1951 by Harold Edwin Hurst in order to study temporal data series of rainfall, temperature, and river levels. For example, suppose that a river drains into a reservoir of infinite capacity, and an annual amount must be released to meet the seasonal needs of the plantations. Next, consider a period of $N$ years and suppose that the amount released annually is the mean volume that flowed from the river into the reservoir during that period. In this context, Hurst found a linear relationship between $\log(R/\sigma)$ and $\log(N)$, where $R$ is the sum, in absolute value, of the maximum accumulated stock and the maximum deficit in the period $N$, and $\sigma$ is the standard deviation of annual discharges. Hurst defined his coefficient as the slope of the linear regression line between these 2 variables (Hurst, 1956).
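The reservoir argument above can be sketched numerically. The following Python sketch is illustrative only and is not the procedure used in this study: it assumes the classical rescaled-range construction, computes the ratio of the range to the standard deviation on non-overlapping windows of several lengths, and takes the slope of the log-log regression as the exponent. The function names (`rescaled_range`, `hurst_rs`), the window sizes and the simulated white-noise series are assumptions introduced for illustration.

```python
import math
import random


def rescaled_range(series):
    """Range of the cumulative deviations from the mean, divided by the standard deviation."""
    n = len(series)
    mean = sum(series) / n
    cumulative, running = [], 0.0
    for x in series:
        running += x - mean              # stock relative to the mean annual release
        cumulative.append(running)
    value_range = max(cumulative) - min(cumulative)   # maximum stock plus maximum deficit
    sigma = math.sqrt(sum((x - mean) ** 2 for x in series) / n)
    return value_range / sigma


def ols_slope(xs, ys):
    """Slope of the least-squares line through the points (xs, ys)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx


def hurst_rs(series, window_sizes):
    """Slope of log(R/sigma) against log(N), averaged over non-overlapping windows."""
    log_n, log_rs = [], []
    for n in window_sizes:
        chunks = [series[i:i + n] for i in range(0, len(series) - n + 1, n)]
        mean_rs = sum(rescaled_range(c) for c in chunks) / len(chunks)
        log_n.append(math.log(n))
        log_rs.append(math.log(mean_rs))
    return ols_slope(log_n, log_rs)


# Hypothetical example: simulated uncorrelated noise (not data from this study).
random.seed(0)
noise = [random.gauss(0.0, 1.0) for _ in range(4096)]
h_noise = hurst_rs(noise, [16, 32, 64, 128, 256, 512])
print(round(h_noise, 3))
```

For uncorrelated noise the estimated slope is expected to lie near 0.5, although the rescaled-range estimate is known to be biased for short windows.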
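Research, Society and Development, v. 11, n. 14, e297111436211, 2022 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v11i14.36211
There are several ways to calculate the Hurst exponent (Feng & Vidakovic, 2017; Lyashenko, et al., 2016; Silva, et al., 2021), one of which is to use the energy spectrum of the wavelet transform. In this study, the spectrum was calculated in two ways. First, the logarithm of the mean of the squares of the detail coefficients was obtained by means of the discrete wavelet transform (Nicolis, et al., 2011). Considering a fractional Brownian field $B_H(\mathbf{u})$, the wavelet coefficients are given by
$$d^{i}_{j,\mathbf{k}} = 2^{j} \int B_H(\mathbf{u})\, \psi^{i}(2^{j}\mathbf{u} - \mathbf{k})\, d\mathbf{u}, \tag{1}$$
in which the domain of the integral is $\mathbb{R}^2$ and $i$ denotes the direction (horizontal, vertical or diagonal). These coefficients are random variables with zero mean and variance given by
$$E\left[|d^{i}_{j,\mathbf{k}}|^{2}\right] = 2^{2j} \iint \psi^{i}(2^{j}\mathbf{u}-\mathbf{k})\, \psi^{i}(2^{j}\mathbf{v}-\mathbf{k})\, E\left[B_H(\mathbf{u}) B_H(\mathbf{v})\right] d\mathbf{u}\, d\mathbf{v} \tag{2}$$
(Nicolis, et al., 2011). From (2), the following can be derived:
$$E\left[|d^{i}_{j,\mathbf{k}}|^{2}\right] = \frac{\sigma_H^{2}}{2}\, V_{\psi^{i}}(H)\, 2^{-(2H+2)j}, \tag{3}$$
in which $\sigma_H^{2}$ is the variance parameter of the field and $V_{\psi^{i}}(H)$ is a constant that depends on the wavelet $\psi^{i}$ and on $H$. Applying the logarithm in base 2 on both sides of expression (3) results in the following equation:
$$\log_2 E\left[|d^{i}_{j,\mathbf{k}}|^{2}\right] = -(2H+2)\, j + C_i, \tag{4}$$
where $C_i = \log_2\left[\frac{\sigma_H^{2}}{2} V_{\psi^{i}}(H)\right]$ (Nicolis, et al., 2011). Equation (4) was used to calculate the Hurst exponent ($H$) in each direction.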
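As an illustration of how a log-linear spectrum yields the exponent, the sketch below assumes the relation $\log_2 E[d_j^2] = -(2H+2)j + C$ reported by Nicolis, et al. (2011), fits a least-squares line to the spectrum across decomposition levels, and solves the slope for $H$. The function names and the synthetic spectrum are hypothetical, and the sign of the slope depends on how the levels are indexed in the software that produces the coefficients.

```python
def ols_slope(levels, values):
    """Slope of the least-squares line through the points (levels, values)."""
    mx = sum(levels) / len(levels)
    my = sum(values) / len(values)
    sxy = sum((x - mx) * (y - my) for x, y in zip(levels, values))
    sxx = sum((x - mx) ** 2 for x in levels)
    return sxy / sxx


def hurst_from_spectrum(levels, log2_energy):
    """Solve slope = -(2H + 2) for H (assumed 2D relation)."""
    slope = ols_slope(levels, log2_energy)
    return -(slope + 2.0) / 2.0


# Hypothetical spectrum built with H = 0.3, i.e. slope -(2 * 0.3 + 2) = -2.6.
levels = [1, 2, 3, 4]
spectrum = [5.0 - 2.6 * j for j in levels]
h_estimate = hurst_from_spectrum(levels, spectrum)
print(round(h_estimate, 6))
```

In practice one such regression would be fitted per direction (horizontal, vertical, diagonal), giving three exponents per image.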
For the second method, the median of the logarithm of the squared detail coefficients was used (Kang & Vidakovic, 2017). The median is a more robust alternative in the presence of outliers that may arise after the logarithmic transformation of the squares of the coefficients.
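The difference between the two location measures can be illustrated with a small sketch. It assumes that the ordinate of the spectrum at one level is either the base-2 logarithm of the mean of the squared coefficients or the median of the base-2 logarithms of the squared coefficients. The function names and the coefficient values are hypothetical and chosen only to show the effect of a single extreme coefficient.

```python
import math
import statistics


def log2_mean_energy(coeffs):
    """Logarithm (base 2) of the mean of the squared detail coefficients."""
    return math.log2(sum(d * d for d in coeffs) / len(coeffs))


def median_log2_energy(coeffs):
    """Median of the base-2 logarithms of the squared detail coefficients.

    Assumes no coefficient is exactly zero, since log2(0) is undefined.
    """
    return statistics.median(math.log2(d * d) for d in coeffs)


# Hypothetical detail coefficients at one level, with and without an extreme value.
clean = [0.5, -1.0, 2.0, -0.25, 1.0]
contaminated = clean + [64.0]

print(log2_mean_energy(clean), log2_mean_energy(contaminated))
print(median_log2_energy(clean), median_log2_energy(contaminated))
```

In this toy case the mean-based ordinate shifts by several units when the extreme coefficient is added, while the median-based ordinate is unchanged.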
The Hurst exponent can be used as a measure of self-similarity along the scales of image decomposition. This decomposition was followed by a simulation of the data using the support vector machine learning algorithm to validate the proposed method.
The X-ray images were analyzed by a seed specialist, and the seeds were separated into 3 categories: full, slightly damaged and damaged.
A total of 175 images of full seeds, 130 images of slightly damaged seeds and 140 images of damaged seeds were considered.
Figure 1 shows an X-ray sheet used for analysis and classification. Because the images had different sizes, we decided to work with subimages, which allows the "same part" of all images to be observed, namely, a 64 × 64 pixel matrix. This crop was performed automatically using a function in the R software. Figure 2 shows the described procedure. Wavelet decomposition was initially applied to all images. The wavelet coefficients were obtained through the non-decimated discrete wavelet transform using the wavelet proposed by Daubechies (1992). Nicolis, et al. (2011) studied several wavelet families, including Daubechies, and concluded that the differences between the parametric values were not significant, except for the Haar basis. We chose to work with the Daubechies wavelet with 4 vanishing moments and 4 levels of decomposition.
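The extraction of a fixed-size subimage can be sketched as follows. The text does not state which region of each image was retained, so the centered window used here is an assumption, as is the function name `center_crop`; the image is represented as a plain list of pixel rows.

```python
def center_crop(image, size=64):
    """Return the centered size x size block of a 2D list of pixel rows.

    The centered position is an assumption made for illustration only.
    """
    rows, cols = len(image), len(image[0])
    if rows < size or cols < size:
        raise ValueError("image is smaller than the requested subimage")
    top = (rows - size) // 2
    left = (cols - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]


# Hypothetical 100 x 80 image whose pixel value encodes its own position.
image = [[r * 1000 + c for c in range(80)] for r in range(100)]
sub = center_crop(image)
print(len(sub), len(sub[0]), sub[0][0])
```

Cropping every image to the same window keeps the number of coefficients per level equal across seeds, so the resulting spectra are directly comparable.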
Subsequently, the wavelet spectrum $S(j)$ was computed using the mean and the median as measures of location. Following the definitions given above, the ordinate of the spectrum at level $j$ in direction $i$ is
$$S^{i}_{\text{mean}}(j) = \log_2\left(\overline{(d^{i}_{j,\mathbf{k}})^{2}}\right) \quad \text{or} \quad S^{i}_{\text{median}}(j) = \operatorname{median}_{\mathbf{k}}\left[\log_2 (d^{i}_{j,\mathbf{k}})^{2}\right],$$
where $d^{i}_{j,\mathbf{k}}$ are the detail coefficients at level $j$ in direction $i$ and the bar denotes the mean over the positions $\mathbf{k}$.
The objective of the decomposition was to verify the Hurst directional exponent characteristics for the different types of seeds. These exponents then formed a data set for training and a data set for validation with a machine learning algorithm. More specifically, we used a support vector machine with a linear kernel function. Other kernel functions were tested, e.g., Gaussian, but the best result was found with the linear kernel. For both the training and validation sets, the decompositions were used in the horizontal, vertical and diagonal directions of each image. Decomposition was performed using the waveslim package (Whitcher, 2015) for the R software. Four groups were considered in this analysis: all seeds, full and slightly damaged seeds, full and damaged seeds, and slightly damaged and damaged seeds.
For each seed group, a random selection of 75% of the seeds was used to train the algorithm. The remaining 25% were used to verify the learning, or accuracy, of the algorithm by constructing a confusion matrix. This procedure was repeated 1000 times in each seed group. Other ratios for training and validation were tested, but the ratio that provided the best result was 75% and 25%, respectively. The R software packages used in this simulation were caTools (Tuszynski, 2014), e1071 (Meyer, et al., 2017) and caret (Kuhn, 2017).
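The repeated 75%/25% validation scheme described above can be sketched as follows. This is a minimal illustration under stated assumptions and not the code used in the study: the analysis was run in R with a linear-kernel support vector machine, whereas this dependency-free sketch substitutes a nearest-centroid classifier as a stand-in. All function names and the simulated three-direction exponents are hypothetical.

```python
import random


def split_75_25(samples, rng, train_fraction=0.75):
    """Randomly split the samples into training and validation sets."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


def fit_centroids(train):
    """Stand-in classifier (not an SVM): mean feature vector per class."""
    sums, counts = {}, {}
    for features, label in train:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict(centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    def dist2(center):
        return sum((a - b) ** 2 for a, b in zip(features, center))
    return min(centroids, key=lambda lab: dist2(centroids[lab]))


def repeated_holdout(samples, labels, repeats=1000, seed=0):
    """Return the mean accuracy and the mean confusion matrix over the repeats."""
    rng = random.Random(seed)
    index = {lab: i for i, lab in enumerate(labels)}
    total = [[0.0] * len(labels) for _ in labels]   # rows: true class, columns: predicted
    accuracies = []
    for _ in range(repeats):
        train, test = split_75_25(samples, rng)
        centroids = fit_centroids(train)
        correct = 0
        for features, truth in test:
            guess = predict(centroids, features)
            total[index[truth]][index[guess]] += 1
            correct += guess == truth
        accuracies.append(correct / len(test))
    mean_matrix = [[count / repeats for count in row] for row in total]
    return sum(accuracies) / repeats, mean_matrix


# Hypothetical directional exponents (horizontal, vertical, diagonal) for two classes.
data_rng = random.Random(1)
samples = (
    [([data_rng.gauss(-0.5, 0.05) for _ in range(3)], "full") for _ in range(40)]
    + [([data_rng.gauss(0.5, 0.05) for _ in range(3)], "damaged") for _ in range(40)]
)
mean_accuracy, mean_confusion = repeated_holdout(samples, ["full", "damaged"])
print(mean_accuracy)
print(mean_confusion)
```

The main diagonal of the averaged matrix holds the mean number of correct classifications per class and the off-diagonal entries hold the mean number of confusions, which is how the tables in the next section are read.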

Results and Discussion
First, we present descriptive statistics of the Hurst exponents obtained with the wavelet decomposition for the mean and median as location measurements. Subsequently, the results from a thousand simulations are presented, with the objective of validating the proposed method. The results of the simulations are presented for each group, namely, all seeds, full and damaged seeds, slightly damaged and damaged seeds, and full and slightly damaged seeds. Figure 3 shows the distribution of the Hurst coefficient, with the mean (a) and median (b) as measures of location in the horizontal, vertical and diagonal directions. Each subfigure contains the empirical distribution for the full, slightly damaged and damaged seeds. Regarding the mean value, the full and slightly damaged seed distributions overlapped in all cases.
However, there was an increase in the variability of the exponent values in the diagonal decomposition of the slightly damaged seeds, resulting in a flattening of the distribution curve. We also observed a slight shift to the right in the mean value of the Hurst coefficients for the slightly damaged seed group. The distribution of the damaged seed exponents was quite different from the distributions of the other seeds in the 3 directions. We found a considerable shift to the right in the mean of the Hurst coefficients of the damaged seeds. This indicates that the technique is effective in extracting characteristics from each type of image. We also observed that most Hurst exponent values for all seeds were negative. This result may be related to the data from Jeon, et al. (2014), which indicated a correlation in the residuals.
The replacement of the mean with the median changed the results, albeit less than expected. Figure 3 (b) shows plots of the sample distribution of the Hurst coefficients in the 3 directions. The mean-to-median substitution caused an "overlap" in the empirical directional distribution of the coefficients for all seeds. In the horizontal direction, however, this overlap was smaller due to a gradual displacement to the right of the mean of the slightly damaged and damaged seed distributions, respectively. In general, there was a shift to the right in the distribution of the coefficients for all types of seeds, indicating a decrease in the correlation of the residuals, as shown by Jeon, et al. (2014).
The simulation showed that the mean rate of correct classification in the full and damaged seed groups was 99.76%, using the mean as a measure of location. This rate was 80.93% when the median was used. For this group of seeds, we found a clear increase in the classification power when the mean was used rather than the median. In addition, the correct classification rate when using the mean was much higher than the rates found with the methodology proposed by Sáfadi, et al. (2016). In that study, the authors observed 64% accuracy using slopes s1 and s2 and 82% accuracy using slopes s1, s2, s3 and s4. When the median was used as a measure of location, the rate was slightly lower than that obtained by Sáfadi, et al. (2016) with the 4 slopes. It is notable that for this group of seeds, 813 of the 1000 simulations yielded correct classifications. In Table 1, the elements of the main diagonal represent the mean number of correct classifications. The other elements represent the mean number of classification errors (confusion). We observed almost no confusion in the classification of images using the mean as a measure of location. A mean of 0.19 damaged seeds were classified as full seeds by the proposed method. Using the median as a measure of location, there was confusion both for full seeds that were classified as damaged and for damaged seeds that were classified as full. On average, 8.26 damaged seeds were classified as full and 6.81 full seeds were classified as damaged.
The results were similar for the slightly damaged and damaged seed groups. Considering the mean as a measure of location, the mean accuracy was 99.26%, and with the median the accuracy was 76.22%. Using the 4 slopes for this seed group, Sáfadi, et al. (2016) found an accuracy rate of 63%. In this study, both the mean and median provided better results, although using the mean resulted in a considerably better outcome than using the median. Table 2 presents the mean confusion matrices for the 1000 simulations. The behavior of the group composed of the full and damaged seeds was maintained. There was little confusion in classification when the mean was used. An average of 0.47 slightly damaged seeds were classified as full, and 0.03 damaged seeds were classified as slightly damaged. Using the median as a measure of location, these values were 7.27 and 8.90, respectively.
When the group of full and slightly damaged seeds was considered, there was a considerable decrease in the accuracy when either the mean or the median was used. When the mean was used as a measure of location, the mean rate of accuracy was 63.09%, and with the median the accuracy was 58.28%. This rate is much lower than that observed by Sáfadi, et al. (2016), who obtained a rate of 89%.
It is important to note the subjectivity and difficulty in characterizing seeds as full or slightly damaged, as they are very similar groups (Leite, et al., 2013). Interestingly, the method used by Sáfadi, et al. (2016) was superior in detecting such differences. Table 3 presents the mean confusion matrices for the 1000 simulations for the full and slightly damaged seed group. We observed (Table 3) considerable confusion in the classification of slightly damaged seeds that were considered full seeds. Using the mean as a measure of location, we found a mean of 25.43 slightly damaged seeds that were classified incorrectly. When the median was used as a measure of location, the mean was 26.11.
For the group of all seeds, the mean rate of correct classification, using the mean as a measure of location, was 74.15%. When the median was used, this rate was 57.05%. We found increased accuracy of the classification algorithm when the mean was used instead of the median. The rate when using the median was similar to that observed by Sáfadi, et al. (2016), which was 57%. For the group of all seeds, we can compare our results with the work of Leite, et al. (2013). These authors, using independent components, achieved a rate of approximately 80% accuracy in seed classification. According to the variation in the number of independent components, this rate varied slightly. However, the authors did not use a simulation process to estimate the mean accuracy rate, which complicates the comparison between the methodology proposed in this study and that used in Leite, et al. (2013).
In Table 4, the mean confusion matrices are presented for the 1000 simulations for the group of all seeds. Again, we observed confusion in the classification of slightly damaged seeds. We found a mean of 25.58 slightly damaged seeds that were classified as full when the mean was used as the measure of location. Using the median as a measure of location, this number was 24.36. In the same seed group, 80.1% were classified incorrectly using the mean. Using the median, the classification error percentage was 95%. This high confusion rate is a result of the overlapping distributions of the Hurst coefficients for the group of slightly damaged and full seeds in the 3 directions of decomposition (Figure 3).

Conclusion
The use of the Hurst exponent in combination with the non-decimated wavelet transform was efficient for seed classification. The use of the mean as a measure of location in the calculation of the Hurst exponent produced a superior result compared to the median for the considered set of images.
The technique proposed in this study presented good results in most scenarios. The subjective difficulty in distinguishing between full and slightly damaged seeds was reflected in the proposed method, which exhibited reduced accuracy when this group of seeds was considered.