Classification of specialty coffees using machine learning techniques

Specialty coffees are of great economic importance, and their sensory quality is valued by both the productive sector and the market. Research has been constantly carried out in search of better blends in order to add value and differentiate prices according to product quality. To accomplish that, new methodologies must be explored, taking into consideration factors that might differentiate the particularities of each consumer and/or product. Thus, this article proposes the use of machine learning techniques in the construction of supervised classification and identification models. In a sensory evaluation test for consumer acceptance using four classes of specialty coffees, applied to four groups of trained and untrained consumers, attributes such as flavor, body, sweetness and general grade were evaluated. The use of machine learning is viable because it allows the classification and identification of specialty coffees produced at different altitudes and by different processing methods.


Introduction
The coffee business is of great social importance because of its capacity to create jobs and its role in Brazilian socioeconomic development. In addition, domestic consumption keeps growing (Fehr, et al., 2012).
Coffee consumers in Brazil have changed their consumption habits and acquired new perceptions regarding the beverage.
Thus, new strategies aim at adding value to products with remarkable attributes, both tangible and intangible. Therefore, competition happens not only through prices but also through products with innovative features (Nicoleli & Moller, 2006).
The differentiated coffee segment is the one with the highest growth, and purchase of the product is connected to the attributes brand and flavor, which are linked to earlier experiences stored in the sensory memory of each consumer, associating brand loyalty with favorite flavor. Thus, there is evidence that Brazilian consumers are ready to buy quality coffees, since their preferences differ among the coffee segments. These characteristics should be addressed through marketing strategies that involve differentiation standards, which would increase quality and add value to the consumer's satisfaction (Spers, et al., 2004). Thus, the sensory study of consumers is an important tool for identifying the motivations behind coffee purchases in the different segments of this market.
Preference and acceptance tests must be considered in a sensory test that focuses on the taster's evaluation in order to differentiate the sensory quality of a product from that of others. Some external factors are inherent to the formation of the sensory panel, such as individual preferences, panel training, and taster experience, and these might cause statistical problems arising from measurement errors when the sensory evaluation sheet is filled in, or even during data analysis (Ossani, et al., 2017).
According to Figueiredo et al. (2018), there is a growing participation and appreciation of specialty coffees in the international market. A study was carried out with Bourbon genotypes in different environments, relating the chemical composition of the grains to their sensory profile. It was observed that the genotypes Bourbon Amarelo IAC J9 and Bourbon Amarelo / SSP were the most suitable for the production of specialty coffees. The caffeine content made it possible to differentiate the coffees by beverage quality, with higher-quality coffees having the lowest caffeine content.
Consequently, there is a large field of research aimed at new approaches that bring more precise results to acceptance analyses and to the discrimination of sensory quality in coffees.
In the work carried out by Borem et al. (2009), logistic regression and correspondence analysis were applied to environmental aspects such as latitude, longitude, altitude and slope, as well as to coffee varieties and processing methods in consecutive harvests, with the objective of determining the sensory quality of the cultivated coffee. The results suggest that the quality did not correspond to the sample discrimination between the direction of the slope face and the sensory profile of the coffee.
In the work presented by Liska et al. (2015), Fisher's conventional linear discriminant analysis (LDA) and discriminant analysis via a boosting algorithm (AdaBoost) were used as a proposal for a classification rule to discriminate between trained and untrained tasters. The authors concluded that the boosting method applied to the discriminant analysis showed a higher sensitivity rate in the trained panel.
Multiple Factor Analysis for Contingency Tables (MFACT) was used by Ossani et al. (2017) on categorized data obtained from sensory experiments carried out with different consumer groups, investigating similarities among four specialty coffees. The technique was viable since it allowed the discrimination of specialty coffees produced in different environments (altitudes) and by different processing methods, taking into consideration the heterogeneity of the consumers involved in the sensory analysis.
Unsupervised classification techniques have also been applied to specialty coffees, obtaining groupings that were in line with the original groups, with very satisfactory results for the algorithms used in the process.
Research, Society and Development, v. 10, n. 5, e13110514732, 2021 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v10i5.14732
Supervised classification and the identification of data represented in classes deserve special attention, since factors that are generally unknown but related to sensory quality can be identified. Among the many techniques proposed for data classification, machine learning is characterized by allowing the classification of groups of variables of different sizes and distinct nature.
According to Amaral (2016), machine learning is the application of computational techniques to find hidden patterns in data in order to produce algorithms capable of making computers learn and not only run algorithms.
This technique is closely connected to statistics and artificial intelligence and is directly related to data mining.
Classification techniques have been employed in other situations; for example, the supervised classification of unconventional vegetables obtained excellent results with few attributes (see also Zamora et al., 2020).
There are many supervised data classification techniques covered by machine learning, each with its own specificities. They can generate different results depending on the inherent structure of the analyzed database. This is easily verified by first classifying the data with several techniques in order to choose the one that best models the data.
The classification and identification of coffees play an important role in the consumer's choice of a better-quality product that meets their economic and taste requirements, besides allowing better marketing targeted at specific segments. Thus, the choice of a good classifier that models the sensory data accurately with respect to consumer characteristics becomes an excellent quality tool for the products, ensuring better results in consumer satisfaction.
The current work was carried out with the objective of proposing the use of machine learning techniques in the construction of algorithms for the supervised classification and identification of specialty coffees produced with different processing methods and at different altitudes, taking into consideration trained and untrained consumers in a sensory analysis experiment.

Data description
In line with the proposed objectives, the data considered come from a sensory experiment (Ossani, et al., 2017) on the acceptance of specialty coffees produced in Serra da Mantiqueira, characterized according to the specifications given in Table 1. Ossani et al. (2017) describe the full methodology of the sensory experiment performed with the tasters and the way the coffees were treated for the sensory analyses.
The structure of the juxtaposed table, considering the sensory grades obtained in the classes for the attributes body, acidity, sweetness and general grade, together with the other attributes, is presented in the layout of Table 2.
Notes to Table 2: Proc = Processing; Groups = 1, 2, 3, 4 (groups of individuals); Cls = A, B, C, D (classes of coffees). Each entry is the i-th observation (instance) in group G and coffee C. Source: Authors.
Groups G = 1 and 2 were formed by consumers who were trained for the sensory evaluations, with 52 and 47 individuals, respectively, while the other groups (G = 3 and 4) were not trained. However, the individuals in these last groups were technicians or researchers in the coffee field, with 32 and 43 individuals, respectively. In total there were 696 instances (observations), with 174 instances for each coffee class.
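The instance counts follow directly from the group sizes, since every individual graded all four coffee classes. As a quick check:

```latex
\[
n_{\text{class}} = 52 + 47 + 32 + 43 = 174, \qquad
n_{\text{total}} = 4 \times 174 = 696 .
\]
```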
The individuals graded all the attributes of all the coffee classes with values in the range [0; 10], with 10 as the maximum grade.

Projection pursuit
Projection pursuit is a technique for exploratory analysis of multivariate data that searches for low-dimensional linear projections of high-dimensional data. Such projections are obtained through the optimization of an objective function called the projection index. In this work, the technique is used to find groupings in the analyzed data, indicating the existence of separation among the analyzed coffees in the process of supervised classification.
Therefore, according to the number of variables (Table 2), the Legendre and PDA indices were applied with the purpose of investigating the formation of groupings. The Legendre index is based on the $L_2$ distance between the density of the projected data and the standard bivariate normal density. It is built by inverting the density through the normal cumulative distribution function with the transformations $y_1 = 2\Phi(z_1) - 1$ and $y_2 = 2\Phi(z_2) - 1$, in which $\Phi$ is the standard normal distribution function and $z_1$, $z_2$ are the coordinates of the projected data, and by using Legendre polynomial terms for the expansion (Martinez and Martinez 2007). Meanwhile, the PDA index is based on a penalized version of the LDA index and is applied in situations with many highly correlated predictors when classification is necessary. The LDA index, in turn, is obtained through linear discriminant analysis with the objective of searching for linear projections with the highest separation among classes and the lowest intraclass dispersion (Espezua, et al., 2015).
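To make the idea of a projection index concrete, the sketch below evaluates one common form of the LDA index for a one-dimensional projection, $1 - SSW/(SSW + SSB)$, where $SSW$ and $SSB$ are the within-class and between-class sums of squares of the projected data. The choice of this form, the function names and the toy data are assumptions made for illustration; this is not the MVar implementation used in the study.

```python
def project(points, direction):
    """Project each point onto a direction vector (dot product)."""
    return [sum(p * d for p, d in zip(point, direction)) for point in points]


def lda_index(points, labels, direction):
    """LDA projection index for a one-dimensional projection.

    Computes 1 - SSW / (SSW + SSB) on the projected data, where SSW and SSB
    are the within-class and between-class sums of squares.
    """
    z = project(points, direction)
    grand_mean = sum(z) / len(z)
    ssw = 0.0
    ssb = 0.0
    for cls in set(labels):
        zc = [v for v, lab in zip(z, labels) if lab == cls]
        mean_c = sum(zc) / len(zc)
        ssw += sum((v - mean_c) ** 2 for v in zc)
        ssb += len(zc) * (mean_c - grand_mean) ** 2
    return 1.0 - ssw / (ssw + ssb)


# Toy data (hypothetical): two classes separated along the first axis only.
points = [(0.0, 1.0), (0.2, 3.0), (0.1, 2.0), (5.0, 1.0), (5.2, 3.0), (5.1, 2.0)]
labels = ["A", "A", "A", "B", "B", "B"]

index_good = lda_index(points, labels, (1.0, 0.0))  # separating direction
index_bad = lda_index(points, labels, (0.0, 1.0))   # uninformative direction
```

A projection pursuit optimizer would search over directions to maximize such an index; here the separating direction scores close to 1 and the uninformative one scores 0.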

Classification methods
For comparison with the dimension-reduction procedure carried out with projection pursuit (section 2.2), the classification methods below were applied to the data in Table 2.

Bayes models
There are many algorithms based on the Bayes rule

$$P(C_j \mid x) = \frac{P(x \mid C_j)\,P(C_j)}{P(x)} \qquad (1)$$

with $C_j$ the j-th class and $x$ the instance. The Bayes classifier chooses the class with the highest posterior probability; thus, it chooses $C_j$ if $P(C_j \mid x) = \max_k P(C_k \mid x)$, that is, the class with the highest value (Alpaydin, 2010). In this work, supervised discretization is not used to convert numeric attributes into nominal ones.
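The decision rule above can be sketched in a few lines. The priors and likelihoods below are hypothetical numbers chosen only to show the computation; in practice they would be estimated from the training data.

```python
def posteriors(priors, likelihoods):
    """Apply Bayes' rule: P(C_j | x) = P(x | C_j) P(C_j) / P(x).

    priors      -- dict mapping class label to P(C_j)
    likelihoods -- dict mapping class label to P(x | C_j) for one instance x
    """
    joint = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(joint.values())  # P(x), the normalizing constant
    return {c: joint[c] / evidence for c in joint}


def bayes_classify(priors, likelihoods):
    """Return the class with the highest posterior probability."""
    post = posteriors(priors, likelihoods)
    return max(post, key=post.get)


# Hypothetical numbers: four equally likely coffee classes A-D.
priors = {"A": 0.25, "B": 0.25, "C": 0.25, "D": 0.25}
likelihoods = {"A": 0.10, "B": 0.40, "C": 0.30, "D": 0.20}

post = posteriors(priors, likelihoods)
predicted = bayes_classify(priors, likelihoods)
```

With equal priors the posterior is proportional to the likelihood, so class B is chosen here.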
The Naive Bayes algorithm assumes independence among the attributes used in the construction of the models. It evaluates how much each attribute helps in the classification of the instance by building a probability table, and the precision values of the numeric estimator are chosen based on the training data (Alpaydin, 2010).
The Bayes Net algorithm uses graphs to represent the conditional probability relations, allowing the data classification (Alpaydin, 2010). In this work, a simple estimator is used to estimate the conditional probability tables once the structure has been learned.
There are also algorithms based on Bayesian networks that provide excellent results in data classification, represented here by Naive Bayes Multinomial and Naive Bayes Multinomial Updateable (Alpaydin, 2010).
In this work, the Naive Bayes Multinomial algorithm ignores the words that do not occur at least the minimum number of times in the training data. A lemmatization algorithm is also applied to the words.
In the Naive Bayes Multinomial Updateable algorithm, supervised discretization is not used to convert numeric attributes into nominal ones.

Function models
Logistic regression can be used as a very efficient classification algorithm. If there are $k$ classes for $n$ instances with $m$ attributes, with $B$ the parameter matrix of order $m \times (k - 1)$, then the probability for class $j$, with the exception of the last class, is given by (Landwehr, et al., 2006)

$$P_j(X_i) = \frac{\exp(X_i B_j)}{1 + \sum_{l=1}^{k-1} \exp(X_i B_l)},$$

and the last class has probability

$$1 - \sum_{j=1}^{k-1} P_j(X_i) = \frac{1}{1 + \sum_{l=1}^{k-1} \exp(X_i B_l)}.$$

Support Vector Machine (SVM) is a classification method that builds an optimal separating hyperplane, maximizing the margin to the nearest instances. It reduces overfitting and supports many attributes. The input vector is mapped non-linearly into a high-dimensional space and, in this space, a linear decision surface is built (Keerthi & Shevade, 2001). In this work, the logistic function was used as calibrator and the data were normalized.
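As an illustration of the logistic model with a reference (last) class, the sketch below computes the class probabilities for one instance from $k - 1$ coefficient vectors. The coefficient values are hypothetical, and the intercept is omitted for brevity; it could be handled by appending a constant 1 to the attribute vector.

```python
import math


def class_probabilities(x, B):
    """Class probabilities for a multinomial logistic model.

    x -- list of m attribute values for one instance
    B -- list of k - 1 coefficient vectors, each of length m
         (the last class is the reference class and has no coefficients)
    Returns a list of k probabilities; the last entry is the reference class.
    """
    scores = [math.exp(sum(xi * bi for xi, bi in zip(x, b_j))) for b_j in B]
    denom = 1.0 + sum(scores)
    probs = [s / denom for s in scores]
    probs.append(1.0 / denom)  # last class: 1 - sum of the other probabilities
    return probs


# Hypothetical coefficients for k = 3 classes and m = 2 attributes.
B = [[0.5, -0.2], [0.0, 0.0]]
probs = class_probabilities([1.0, 2.0], B)
```

The probabilities always sum to one, and a class whose coefficients are all zero receives the same probability as the reference class.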
The multilayer perceptron artificial neural network is used as a classification algorithm, working as a non-parametric estimator in which the perceptron is given by

$$y = \sum_{j=1}^{d} w_j x_j + w_0,$$

with $x_j \in \mathbb{R}$, $j = 1, \cdots, d$, the network inputs, $w_j$ the synaptic weights, $y$ the output and $w_0$ the intercept value that makes the model more general (Alpaydin, 2010). In this work, a neural network with three layers was used in the classification process.

Lazy models
According to Nicoletti (2005), the algorithms of the Instance Based Learning (IBL) family are considered an extension of the Nearest Neighbor (NN) algorithm, overcoming some limitations associated with the NN model.
In these algorithms, the instances are represented as points in the n-dimensional space defined by the attributes that describe them, with the training instances stored in memory.
There are many instance-based algorithms in the IBL family. Mitchell (1997) points out that the k-Nearest Neighbor (KNN) algorithm is the most basic instance-based method. This algorithm assumes that all instances correspond to points in the n-dimensional space $\mathbb{R}^n$.
The nearest neighbors of an instance are defined based on the Euclidean distance: given two instances $x_i$ and $x_j$ with $i \neq j$,

$$d(x_i, x_j) = \sqrt{\sum_{r=1}^{n} \left( a_r(x_i) - a_r(x_j) \right)^2},$$

with $a_r(x)$ the r-th attribute of the instance $x$. Other metrics can be used (Nicoletti, 2005). In this work, $k = 1$ nearest neighbor with the Euclidean distance was used in the classification process.
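A minimal sketch of nearest-neighbor classification with the Euclidean distance follows. The training instances (general grade and altitude in kilometers) are hypothetical and only illustrate the mechanics; this is not the Weka implementation.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two instances given as attribute tuples."""
    return math.sqrt(sum((ar - br) ** 2 for ar, br in zip(a, b)))


def knn_predict(train_x, train_y, query, k=1):
    """Predict the class of `query` by majority vote among its k nearest neighbors."""
    order = sorted(range(len(train_x)), key=lambda i: euclidean(train_x[i], query))
    votes = {}
    for i in order[:k]:
        votes[train_y[i]] = votes.get(train_y[i], 0) + 1
    return max(votes, key=votes.get)


# Hypothetical instances: (general grade, altitude in km).
train_x = [(8.5, 1.2), (8.0, 1.1), (6.0, 0.9), (5.5, 0.8)]
train_y = ["A", "A", "B", "B"]

label = knn_predict(train_x, train_y, (7.9, 1.15), k=1)
```

Because the distance mixes attributes on different scales, normalizing the attributes before computing distances is usually advisable.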
The LWL algorithm is instance-based, but it considers the instances locally in order to classify them using the Naive Bayes algorithm or linear regression (Frank, et al., 2003).
The Kstar algorithm is also an instance-based classifier, but it differs in that it uses a distance function based on entropy (Cleary & Trigg, 1995).

Rules models
The Rules models are characterized by the use of rules in the classification of instances, and there are many rule-based algorithms.
The Jrip algorithm, proposed by Cohen (1995), makes use of propositional rules in the classification process. The value 2 was used as the minimum total weight of instances in a rule.
The Decision Table algorithm uses decision tables as the hypothesis space in the classification process. In this work, the search method applied to find good combinations of attributes for the decision table was BestFirst, which searches the space of attribute subsets by greedy hill climbing augmented with a backtracking facility.
According to Frank and Witten (1998), the PART algorithm creates a partial decision tree in each iteration and turns the best leaf into a classification rule. In this work, the minimum description length (MDL) correction was used when finding splits on numeric attributes.
The OneR algorithm discretizes the numeric attributes and uses the minimum-error attribute for prediction (Holte 1993).
The minimum bucket size employed in this work to discretize attributes was 6.

Tree models
There are many algorithms based on decision trees that generate excellent classifiers.
The REPTree algorithm generates multiple decision trees in different iterations based on the information gain with entropy, and minimizes the error resulting from the variance. Then, it chooses the best of all the trees generated (Lakshmi, 2015). In this work, the value 2 was used as the minimum total weight of the instances in a leaf, and the minimum proportion of the variance in all the data that must be present at a node for a split to be carried out in regression trees was 0.001.
The Hoeffding Tree algorithm exploits the fact that a small sample might be sufficient for choosing an attribute, and it also assumes that the distribution generating the examples does not change over time. The Hoeffding bound quantifies the number of examples necessary to estimate how good an attribute is (Hulten, et al., 2001). In this work, adaptive Naive Bayes was employed as the leaf prediction strategy. The number of instances (or total weight of instances) that a leaf must observe between split attempts was 200. The threshold below which a split is forced in order to break ties was 0.05.
The J48 algorithm creates a binary tree using the C4.5 decision tree. Next, the algorithm is applied to each tuple in the database, resulting in its classification (Quinlan, 1993). In this work, the MDL correction was used when finding splits on numeric attributes.
The Decision Stump algorithm consists of one-level decision trees. The prediction is made based on the value of a single input feature (Oliver & Hand, 1994). In this work, regression based on the mean squared error or classification based on entropy was used.
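To illustrate a one-level tree with an entropy criterion, the sketch below fits a stump on a single numeric attribute by choosing the threshold that minimizes the weighted entropy of the two branches. The candidate thresholds (midpoints between consecutive distinct values) and the toy grades are assumptions for illustration and do not reproduce the Weka implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def fit_stump(values, labels):
    """Find the threshold on one numeric attribute that minimizes weighted entropy.

    Returns (threshold, left_label, right_label); instances with
    value <= threshold receive left_label, the others right_label.
    """
    best = None
    candidates = sorted(set(values))
    for lo, hi in zip(candidates, candidates[1:]):
        t = (lo + hi) / 2.0  # midpoint between consecutive distinct values
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if best is None or score < best[0]:
            majority_left = Counter(left).most_common(1)[0][0]
            majority_right = Counter(right).most_common(1)[0][0]
            best = (score, t, majority_left, majority_right)
    if best is None:
        raise ValueError("need at least two distinct attribute values")
    return best[1], best[2], best[3]


def predict_stump(stump, value):
    """Classify one value with a fitted stump."""
    threshold, left_label, right_label = stump
    return left_label if value <= threshold else right_label


# Hypothetical grades for two classes.
grades = [5.0, 5.5, 6.0, 8.0, 8.5, 9.0]
classes = ["B", "B", "B", "A", "A", "A"]
stump = fit_stump(grades, classes)
```

On these data the split falls halfway between 6.0 and 8.0, where both branches are pure.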
The Random Forest algorithm uses a combination of decision tree predictors in such a way that each tree depends on the values of a random vector sampled independently and with the same distribution for every tree (Breiman, 2001). In this work, 100 trees were used in the random forest.
The LMT algorithm uses logistic regression functions in the leaves of the decision trees (Landwehr, et al., 2005). In this work, the value 15 was used as the minimum number of instances at which a node is considered for splitting.
The Random Tree algorithm considers k randomly chosen attributes at each node in the construction of the decision tree (Hall, et al., 2009). The value 1 was employed as the minimum total weight of instances in a leaf, and the minimum proportion of the variance in all the data that must be present at a node for a split to be carried out in regression trees was 0.001.

Meta models
The Meta models are characterized by algorithms made of multiple learners that complement each other so that, when combined, they may obtain higher precision since, according to Alpaydin (2010), no algorithm is always the most precise.
Therefore, the Meta models are used to improve the performance of the classification algorithms.
The Bagging model uses a voting method in which the classifiers are differentiated by training them on slightly different training sets (Breiman, 1996). In this work, it was used to improve the performance of the REPTree algorithm, with 10 iterations carried out.
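The bagging idea can be sketched as follows: each base learner is trained on a bootstrap sample of the training set, and the final class is decided by majority vote. The base learner here is a one-dimensional nearest-neighbor rule used as a stand-in (the study used REPTree), and the data, seed and function names are hypothetical.

```python
import random
from collections import Counter


def nearest_neighbor(train, value):
    """Base learner (stand-in): label of the closest training value."""
    return min(train, key=lambda pair: abs(pair[0] - value))[1]


def bagging_predict(train, value, n_models=10, seed=0):
    """Majority vote over base learners fitted on bootstrap samples.

    train -- list of (value, label) pairs
    Each bootstrap sample draws len(train) pairs with replacement.
    """
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_models):
        sample = [rng.choice(train) for _ in train]
        votes[nearest_neighbor(sample, value)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated data: ten instances per class.
train = [(float(v), "B") for v in range(1, 11)] + [(100.0 + v, "A") for v in range(1, 11)]
vote_high = bagging_predict(train, 105.0)
vote_low = bagging_predict(train, 4.0)
```

The 10 models mirror the 10 iterations mentioned above; with a fixed seed the result is reproducible.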
The AdaBoost model uses a training set to build a set of classifiers. Since it is a meta-algorithm, it is used to improve the performance of other classifiers (Freund & Schapire, 1996). In this work, it was used to improve the performance of the Decision Stump algorithm, carrying out 10 iterations.
As cited by Alpaydin (2010), Stacking is a technique proposed by Wolpert (1992) that uses a voting method in which the outputs of the classifiers are combined. In this work, it was used to improve the performance of the Naive Bayes algorithm.
The Random SubSpace algorithm is based on decision trees to build a classifier, improving the precision as complexity increases (Ho, 1998). In this work, it was used to improve the performance of the REPTree algorithm, and 10 iterations were carried out.
The CV Parameter Selection algorithm uses cross-validation to select parameters for any classifier.
In this work, it was used to improve the performance of the J48 algorithm.
The Logit Boost algorithm uses logistic regression in the classification process when dealing with multiple classes (Friedman, et al., 2000). In this work, it was used to improve the performance of the Decision Stump algorithm, and 10 iterations were carried out.
In the Classification Via Regression algorithm, the classes are binarized, and a regression model is built for each class value. Then, it uses the decision tree in the classification process (Frank, et al., 1998).

Procedure for validation of the proposed model
With the objective of validating the proposed model, and taking into consideration the coffee classes in Table 2, the procedure described in the following steps was adopted:
1) the k-fold cross-validation method was used, in which the original data set was subdivided into $K$ subsets. Next, $K - 1$ subsets were used for training and the remaining subset for testing. This procedure was repeated $K$ times, with each instance used the same number of times for training and testing. In this work, $K = 10$ was used. Using bootstrap, the instances for training and testing are random samples with replacement.
2) after fitting the machine learning models in (1), the validation error rate of the proposed model was given by

$$E = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{n_k} \sum_{j=1}^{n_k} I\left(y_j \neq \hat{y}_j\right),$$

in which $y_j$ and $\hat{y}_j$ denote, respectively, the observed and predicted classes for the j-th instance, and $n_k$ is the number of observations in the k-th test set. It was established that, if the classes were equal, $I(y_j \neq \hat{y}_j) = 0$, and $1$ otherwise.
3) next, it was verified whether there was a good fit, considering a validation error rate under 30%.
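The validation steps above can be sketched as follows. The fold assignment by position, the stand-in nearest-neighbor classifier and the toy data are assumptions for illustration; the study itself ran the procedure in Weka.

```python
def error_rate(observed, predicted):
    """Fraction of instances whose predicted class differs from the observed one."""
    wrong = sum(1 for y, y_hat in zip(observed, predicted) if y != y_hat)
    return wrong / len(observed)


def k_fold_error(xs, ys, fit_predict, k=10):
    """Average test error rate over k folds.

    fit_predict(train_x, train_y, test_x) must return predicted labels for test_x.
    Fold membership is assigned by position (index mod k); shuffling is omitted
    to keep the sketch deterministic. Requires len(xs) >= k.
    """
    fold_errors = []
    for fold in range(k):
        test_idx = [i for i in range(len(xs)) if i % k == fold]
        train_idx = [i for i in range(len(xs)) if i % k != fold]
        predicted = fit_predict(
            [xs[i] for i in train_idx],
            [ys[i] for i in train_idx],
            [xs[i] for i in test_idx],
        )
        fold_errors.append(error_rate([ys[i] for i in test_idx], predicted))
    return sum(fold_errors) / k


def one_nn(train_x, train_y, test_x):
    """Stand-in classifier: 1-nearest neighbor on a single numeric attribute."""
    return [
        train_y[min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))]
        for x in test_x
    ]


# Hypothetical, well-separated data: 20 instances per class.
xs = [float(v) for v in range(20)] + [100.0 + v for v in range(20)]
ys = ["A"] * 20 + ["B"] * 20
cv_error = k_fold_error(xs, ys, one_nn, k=10)
acceptable = cv_error < 0.30  # the 30% criterion of step 3
```

Any classifier can be plugged in through `fit_predict`, so the same loop serves to compare several algorithms on equal terms.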
The supervised classification analyses were carried out using the software Waikato Environment for Knowledge Analysis (Weka) version 3.9.4 (Hall, et al., 2009).

Results and Discussions
The projection pursuit technique was applied to the quantitative variables (Table 2), using the Legendre index on sphered data and the grand tour simulated annealing optimization algorithm, through the package MVar 2.1.4 of the R software (R Development Core Team 2020), with the objective of verifying the presence of groupings. It is possible to observe, in Figure 1, that the data in each class are very dispersed. Also, there is no separation among the coffee classes, that is, no groupings are formed in the analyzed sample. It is important to highlight that other indices were used in the search for grouping formations, without success. When the supervised classification techniques (section 2.3) are applied to the quantitative variables (Table 2), Table 3 is obtained with the results of the classification error rates.
Although the sensory attributes are relevant in the discrimination of a coffee, and since the specialty coffees are of higher quality than the commercial coffees, the grades given by the tasters do not differ among the groups formed. Thus, since the coffees are indeed different, there are non-numeric attributes that characterize them at the moment of the sensory research and that might be used in the classification. Therefore, the variables "Sex", "Processing", and "Groups" are added to the classification.
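Non-numeric attributes such as these are typically passed to numeric methods as dummy (0/1) variables. A minimal sketch of that encoding is given below; the category labels are placeholders and not the coding used in the study.

```python
def one_hot(values):
    """Encode a list of category labels as 0/1 dummy columns.

    Returns (column_names, rows), with one column per distinct category,
    in sorted order.
    """
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Hypothetical records; the category labels are placeholders, not the study's coding.
processing = ["natural", "peeled", "peeled", "natural"]
columns, dummies = one_hot(processing)
```

Each row has exactly one entry equal to 1, marking the category of that instance.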
Based on the use of dummy variables, Figure 2 was generated through the projection pursuit technique, now using the PDA index on sphered data, with the grand tour simulated annealing optimization algorithm, through the package MVar 2.1.4 of the R software (R Development Core Team, 2020). Other indices were used in the search for grouping formations; this was the one that best represented the groupings (Figure 2; source: authors).
It is possible to observe in Figure 2 that, unlike the data presented in Figure 1, there was a distinction among classes, suggesting the existence of machine learning algorithms capable of classifying and identifying the specialty coffees studied based on the quantitative and qualitative attributes.
With the objective of creating parsimonious models, that is, models with few variables (attributes) capable of explaining the variability contained in the model with all the variables, the variable selection method ReliefFAttributeEval (Kira & Rendell, 1992), implemented in the WEKA software version 3.9.4 (Hall, et al., 2009), was applied. The variables general grade, altitude and processing were enough in the classification process of the specialty coffees, with the results presented in Table 4. In Table 4, it is possible to observe that the use of the qualitative variable processing alongside the quantitative variables general grade and altitude was efficient in the supervised classification process, although some algorithms did not reach a good fit, which is justified by their inherent specificities. This shows that quantitative information alone was not enough to differentiate classes with a high degree of similarity in their structure in the n-dimensional space in which they are inserted.
Although in Table 4 there are classifiers that obtained validation error rates of 0%, their use is not advisable since, even though they were analyzed via k-fold cross-validation, there is the possibility of overfitting, that is, the model might have excellent precision in the development environment and poor performance on new data.
The suggested classifiers are the ones with a validation error rate under 30% and over 0%. Thus, the Decision Table and Random SubSpace classifiers are the ones best suited to the task of classifying these data.
It is important to highlight that there were improvements and declines in the classifiers of the parsimonious models presented in Table 4 when compared to the models presented in Table 3. The results of the algorithms OneR, Decision Stump and Stacking stayed the same, while the result of the Naive Bayes Multinomial algorithm worsened. With the exclusion of possible overfitting, the rest of the classifiers showed a large improvement in the classification error rates. Most classifiers shown in Table 4 obtained validation error rates under 1%, which are excellent results. This shows that classifying and identifying specialty coffees is viable through machine learning techniques using only the general grade given by the consumers, trained or untrained, the altitude where the coffees were produced and the processing methods.
The high number of classifiers with excellent results makes clear the intrinsic differentiation of each specialty coffee, adding to the results observed by Borem et al. (2019), Silveira and Pinheiro (2016) and Taveira et al. (2011), which allowed the separation among classes in a way that meets the specificities of many classifiers and strongly characterizes these specialty coffees.
According to Silveira and Pinheiro (2016), the factors altitude, slope exposure and fruit color influence the sensory quality of the coffee when analyzed separately or in their interactions (altitude x slope exposure, altitude x fruit color, and slope exposure x fruit color), with altitude being the major factor influencing the sensory quality of the coffee. At higher altitudes, the coffee plants take longer to complete the cycle, making the grain-filling period longer and allowing a higher accumulation of starch in the coffee fruits. Thus, the period for carbohydrate production becomes sufficient to accumulate substances such as sugars, some acids, and amino acids, contributing to a more pleasant flavor.
In Taveira et al. (2011), there are reports that the altitude and the slope face are empirically known as factors that favor the quality of the coffee. These factors allow the formation of a mild microclimate, and lower temperatures are pointed to as responsible for slowing fruit maturation, allowing a higher accumulation of flavor and aroma precursors.
According to Borem et al. (2019), the sensory properties of coffees depend directly on the cultivation environment, on the genetic characteristics inherent to the varieties, and on the technology used for post-harvest processing. Besides environmental factors, genetic factors, and factors associated with the management of the coffee crop, the differences in the quality of coffee beverages are directly associated with the changes in the coffee grains during the different processing stages.
A study by Benedito et al. (2020), which evaluated the acceptance of coffee by consumers using olfactory sensory analysis, observed that the samples most accepted by consumers are associated with coffees classified as hard and soft.
The peeled coffees present a more desirable acidity when compared to the natural coffees. However, it is important to point out that the grains produced via dry processing are exposed to the drying conditions for longer than the ones produced via wet processing, which causes irreversible damage to the grains, decreasing their physiological quality and changing the beverage (Taveira, et al., 2010).

Conclusions
In accordance with the proposed objectives and methodology, it is possible to conclude that machine learning is viable for the supervised classification and identification of specialty coffees.
When working with quantitative data only, it was not possible to find good classification models, nor to show the inherent distinction of each specialty coffee studied. However, the addition of the qualitative variable processing allowed the classification and identification of the coffees studied.
Excellent results were obtained using only the attributes general grade, altitude, and processing, with validation error rates under 1% in most of the classifiers used.
As future research, we suggest the use of only qualitative variables in the validation process, since their impact in this study was relevant. Another theme of research interest would be the development of other classification methods derived from data dimension reduction techniques such as projection pursuit.