Identification of cocoa bean quality by near infrared spectroscopy and multivariate modeling

Cocoa is a commodity responsible for the income of millions of people and for the manufacture of several important products for the food, pharmaceutical, and cosmetic industries. Its quality is associated with several factors involved in the processing steps, mainly fermentation and drying. The objective of this study was to evaluate the application of near-infrared spectroscopic data associated with multivariate analysis to classify cocoa beans according to their quality by PLS-DA and to predict attributes such as pH and total acidity by PLS. The pH values (4.4-6.7) and total acidity (6.12-29.9) were determined by conventional methods. PLS-DA proved effective in differentiating cocoa samples of superior and inferior quality, correctly classifying 100% of the Inferior Quality samples and 71.43% of the Higher Quality samples in validation. The PLS models presented satisfactory parameters, being classified as having moderate practical utility and excellent predictive capacity for pH, and moderate practical utility and reasonable predictive capacity for total acidity. Thus, NIRS technology associated with chemometrics showed potential and efficiency for the classification of cocoa beans and the prediction of their attributes.


Introduction
The cacao tree is a plant of the family Sterculiaceae, genus Theobroma, of the species Theobroma cacao L. that grows in tropical regions around the world. Although this species is native to central and northern South America, almost 70% of the world's crop is currently produced in West Africa (Diomande et al., 2015). The fruit of the cacao tree is cocoa and it is divided into a shell, pulp, and seeds. The seed is the part of greatest interest to the chocolate manufacturing industries (Sunoj et al., 2016). Cocoa is a commodity responsible for the income of millions of people and the manufacture of several important products for the food, pharmaceutical, and cosmetics industries, highlighting the need to promote discussions about the quality of cocoa beans as a way to foster alternatives that increase the profitability of cocoa farms and valorize the raw material (Beg et al., 2017).
The process from cocoa beans to the production of their derivatives is complex. Before the cocoa beans can be marketed and processed into their final industrial products, they undergo post-harvest processing that comprises the steps of pod opening and berry removal, fermentation for 6 to 7 days, and drying to a moisture content of 7% (Kadow et al., 2013). The quality of cocoa is associated with several factors involved in the processing stages. When these stages are controlled, especially fermentation and drying, it is possible to obtain a quality product. This quality is due to the occurrence of some changes, such as acidification and temperature increase, which are necessary for the reactions to proceed satisfactorily and promote the development of characteristics pertinent to cocoa and its derivatives (Krähmer et al., 2015; Oliveira et al., 2016).
During the early fermentation stages, the cocoa beans are purple due to the presence of anthocyanin pigments. During fermentation, these pigments are broken down and form other condensation products, which result in a brown colour (Afoakwa et al., 2013). The Fermentation Index (FI) is associated with kernel quality and is obtained by cutting the kernels lengthwise and counting the proportion of purple and brown coloured kernels, among other external and internal characteristics of a representative dry sample of 300 kernels. This process is also known as the cut test (CoEx, 2017). The FI comprises the percentage of brown and partly brown cocoa beans and is an indication of the quality of the beans in relation to the conduct of fermentation and consequently the content of bioactive compounds in cocoa (Crafack et al., 2014). The higher the FI, the higher the technological quality of the cocoa beans.
Several conventional methods are available to determine the biomarkers and proximate composition of cocoa beans and to indicate the quality of this product, resulting in a large number of analyses to be performed and statistically evaluated (Lohumi et al., 2015). These methods are time-consuming, expensive, and destructive to the sample, highlighting the need for a fast, inexpensive, and non-destructive tool to evaluate these quality parameters. Fourier transform near-infrared spectroscopy (FT-NIR), associated with chemometrics, may be a suitable strategy: it is considered fast, sensitive, non-invasive, non-destructive, and relatively low cost, and it provides a large amount of information from a single test while requiring little or no sample pre-treatment. It is also clean, since it eliminates the use of chemical reagents and avoids the generation of waste that is harmful to the environment, and the equipment is relatively easy to handle (Lima et al., 2020). Infrared spectroscopy and chemometrics are commonly combined because the spectra have wide, overlapping bands, and multivariate analysis becomes a valuable tool to extract the information they contain, allowing the identification and quantification of various parameters in different matrices (Souza, 2013).
Research, Society and Development, v. 10, n. 15, e641101522732, 2021 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v10i15.22732
NIR combined with chemometrics has been commonly used to determine the composition of cocoa products. Hernández-Hernández et al. (2021) highlighted NIR as a fast and easy method to identify cocoa genotypes according to the content of bioactive compounds. Veselá et al. (2007) used NIR to determine the fat, nitrogen, and moisture content of cocoa powder. Whitacre et al. (2003) detected protein, fat, starch, and proanthocyanidins in cocoa. Barbin et al. (2018) highlighted the potential of NIR for the classification and differentiation of cocoa varieties and for predicting chemical composition from the spectra obtained for intact and ground cocoa beans. Álvarez et al. (2012) detected theobromine and epicatechin in cocoa using NIR. However, the feasibility of employing NIR to classify cocoa beans according to their quality has not yet been fully explored.
The objective of this study was to evaluate the application of near-infrared spectroscopic data associated with multivariate analysis to classify cocoa beans according to their quality and predict attributes such as pH and titratable acidity by PLS-DA and PLS, respectively.

Cocoa bean samples
A total of 78 samples of bulk cocoa beans (approximately 5 kg), collected from April to August in the cocoa producing regions of Bahia, were used in conducting this research.

Chemical and physical-chemical characterization of cocoa beans
A sample (300 cocoa beans) was collected for evaluation by the longitudinal cut test for commercial classification, as recommended by CoEx (2017) and the Technical Regulation of Cocoa Almond. The samples were then classified into two groups according to their quality: Higher Quality (HQ) and Inferior Quality (IQ). To be considered HQ, a sample had to have a fermentation index (percentage of brown and partially brown beans) equal to or greater than 60%, defect limits of 4% for mold and 5% for slate, and no smoke. Samples that did not meet these criteria were classified as IQ.
For the physical-chemical and near-infrared spectroscopic analyses, samples were dried and ground in a portable grain grinder (model DCG 20). Physicochemical analyses were performed according to AOAC (2010) methods: pH by method 981.12 and total acidity by method 942.15.
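The HQ/IQ rule described above can be sketched as a small function. This is only an illustration: the names `CutTestResult` and `classify_quality` are hypothetical, and treating the 4% and 5% defect limits as inclusive upper bounds is an assumption, since the text does not state how boundary values were handled.

```python
from dataclasses import dataclass


@dataclass
class CutTestResult:
    fermentation_index: float  # % of brown and partially brown beans
    mold: float                # % of moldy beans
    slate: float               # % of slaty beans
    smoke: bool                # True if smoke was detected


def classify_quality(result: CutTestResult) -> str:
    """Return 'HQ' or 'IQ' following the criteria described in the text.

    Assumption: the defect limits (4% mold, 5% slate) are inclusive.
    """
    is_hq = (
        result.fermentation_index >= 60.0
        and result.mold <= 4.0
        and result.slate <= 5.0
        and not result.smoke
    )
    return "HQ" if is_hq else "IQ"


print(classify_quality(CutTestResult(72.0, 2.0, 3.0, False)))
print(classify_quality(CutTestResult(55.0, 1.0, 1.0, False)))
```

The first call satisfies every criterion, while the second fails on the fermentation index alone.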

Near-infrared spectroscopy (NIRS)
Spectra were collected in an NIR spectrometer (SpectraStar 2500XL, Unity Scientific, Brookfield, CT, USA) equipped with a tungsten halogen lamp as light source and an indium-gallium-arsenide (InGaAs) detector. The signals were generated in reflectance (% R) mode and transformed into absorbance as log(1/R). Samples were poured into the sample cup and scanned over the range of 1100-2500 nm at 1 nm intervals. During collection, the laboratory temperature was kept at about 25°C. The Unity InfoStar V3.11.3 software was used for spectrometer configuration, control, and data acquisition (Lima et al., 2020; Santos et al., 2021). The spectral regions and the main functional groups of the samples analyzed by NIRS were identified from the literature (Sunoj et al., 2016).
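The reflectance-to-absorbance transformation can be illustrated with a minimal sketch. It assumes that R is the reflectance expressed as a fraction (% R divided by 100) and that the logarithm is base 10, which is the usual convention but is not stated explicitly in the text. The function name is hypothetical.

```python
import math


def reflectance_to_absorbance(percent_r):
    """Convert reflectance readings (% R) to absorbance, A = log10(1/R).

    Assumption: R = %R / 100 and a base-10 logarithm.
    """
    return [math.log10(100.0 / r) for r in percent_r]


# Wavelength axis matching the scan described above: 1100-2500 nm at 1 nm steps.
wavelengths_nm = list(range(1100, 2501))

print(len(wavelengths_nm))
print(reflectance_to_absorbance([100.0, 10.0]))
```

With these settings each spectrum has one absorbance value per nanometre across the scanned range.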

Chemometric analysis
The data obtained were analyzed using Chemoface software version 1.63 (Nunes et al., 2012) for Partial Least Squares-Discriminant Analysis (PLS-DA) and Partial Least Squares (PLS). For statistical analysis, the NIR spectral data were used without pre-processing and with two mathematical pre-processing methods: Standard Normal Variate (SNV) and Multiplicative Scatter Correction (MSC). The data were pre-processed using the Chemoface program.
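For readers unfamiliar with the two pre-processing methods, the following sketch shows their textbook forms using only the Python standard library. It is not the Chemoface implementation; details such as the use of the sample standard deviation in SNV and the mean spectrum as the MSC reference are assumptions.

```python
from statistics import mean, stdev


def snv(spectra):
    """Standard Normal Variate: centre and scale each spectrum individually."""
    corrected = []
    for s in spectra:
        m, sd = mean(s), stdev(s)  # assumption: sample SD (n - 1)
        corrected.append([(v - m) / sd for v in s])
    return corrected


def msc(spectra, reference=None):
    """Multiplicative Scatter Correction against a reference spectrum.

    Each spectrum is fitted as s = a + b * reference and corrected to (s - a) / b.
    Assumption: the mean spectrum is used when no reference is given.
    """
    n_points = len(spectra[0])
    if reference is None:
        reference = [mean(s[j] for s in spectra) for j in range(n_points)]
    ref_mean = mean(reference)
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = mean(s)
        slope = sum((r - ref_mean) * (v - s_mean)
                    for r, v in zip(reference, s)) / ref_ss
        intercept = s_mean - slope * ref_mean
        corrected.append([(v - intercept) / slope for v in s])
    return corrected


base = [0.1, 0.4, 0.2, 0.8]
shifted = [2.0 * v + 0.5 for v in base]  # same shape, different offset and scale
print(snv([base, shifted]))
print(msc([base, shifted]))
```

Both methods map the two toy spectra onto the same curve, because they differ only by an additive offset and a multiplicative factor, the kind of scatter effect these corrections are designed to remove.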

Partial Least Squares-Discriminant Analysis (PLS-DA)
The PLS-DA chemometric analysis is based on the conventional PLS regression method, in which two matrices are used: a predictor matrix X (I × J) and a response matrix Y (I × K) that contains the categorical variables describing class membership. K is the number of classes; if only two classes are considered, the matrix Y is reduced to a vector y (I × 1). Once the PLS regression has been developed, the response value Ypred is predicted for a new sample. The decision is made by comparing Ypred with the categorical values in Y, and the sample is assigned to the class with the minimum distance between Y and Ypred (Cruz-Tirado et al., 2020).
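The minimum-distance decision step can be sketched as follows. The class coding (0 for IQ, 1 for HQ) and the function name `assign_class` are assumptions made for illustration; the text does not report which coding was used.

```python
def assign_class(y_pred, class_codes):
    """Assign a sample to the class whose code is closest to the PLS prediction.

    y_pred: list of predicted response values for one sample.
    class_codes: dict mapping class label -> code vector (same length as y_pred).
    """
    def squared_distance(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    return min(class_codes,
               key=lambda label: squared_distance(y_pred, class_codes[label]))


# Hypothetical two-class coding: y reduces to a single column.
codes = {"IQ": [0.0], "HQ": [1.0]}

print(assign_class([0.82], codes))
print(assign_class([0.31], codes))
```

With this coding, the rule is equivalent to thresholding the predicted value at 0.5.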
PLS-DA was performed using Chemoface software version 1.63 (Nunes et al., 2012). The data were divided into two groups using the Kennard-Stone (KS) algorithm (Lopes et al., 2022). A set consisting of 70% of the samples from each class was used to optimize the classification model by a single cross-validation. The other 30% of the data was used as an external validation set.
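The Kennard-Stone selection mentioned above can be sketched in a few lines. This is a generic version of the algorithm using Euclidean distance, not the routine used by the authors; the function name is hypothetical and the sketch assumes 2 <= n_select <= number of samples.

```python
import math


def kennard_stone(X, n_select):
    """Return the indices of n_select rows of X chosen by Kennard-Stone.

    Assumption: Euclidean distance and 2 <= n_select <= len(X).
    """
    n = len(X)
    dist = [[math.dist(X[i], X[j]) for j in range(n)] for i in range(n)]
    # Start with the two samples that are farthest apart.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = [k for k in range(n) if k not in (first, second)]
    while len(selected) < n_select:
        # Add the sample whose nearest selected neighbour is farthest away.
        nxt = max(remaining, key=lambda k: min(dist[k][s] for s in selected))
        selected.append(nxt)
        remaining.remove(nxt)
    return selected


X = [[0.0], [1.0], [2.0], [10.0]]
calibration = kennard_stone(X, 3)
validation = [i for i in range(len(X)) if i not in calibration]
print(calibration, validation)
```

To reproduce a 70/30 split per class, the function would be applied separately to the HQ and IQ spectra with `n_select` set to about 70% of each class size; the samples not selected form the external validation set.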

Partial Least Squares (PLS)
Partial least squares (PLS) is the multivariate tool commonly used to develop a calibration model to relate the information of interest of a sample to its spectrum. The data obtained with NIR were used to build the regression model, which used the information from the data matrix X (full spectrum) and the response matrix Y (pH and total acidity values) to obtain new variables, which were called latent variables, components or factors. The absorbance values from the full spectrum were used as variables (Anyidoho et al., 2021;Hayati et al., 2021;Santos et al., 2021).
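How latent variables are built from X and a single response can be illustrated with a compact NIPALS-style PLS1 sketch in plain Python. It is a teaching example under stated assumptions (mean-centred data, one response at a time, hypothetical function names), not the Chemoface implementation used in this study.

```python
def pls1_fit(X, y, n_components):
    """Fit a single-response PLS model with the NIPALS algorithm."""
    n, n_vars = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(n_vars)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(n_vars)] for row in X]  # centred X
    f = [v - y_mean for v in y]                                      # centred y
    components = []
    for _ in range(n_components):
        # Weight vector: direction in X most covariant with the y residual.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(n_vars)]
        norm = sum(v * v for v in w) ** 0.5
        if norm < 1e-12:  # nothing left to explain
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(n_vars)) for i in range(n)]  # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(n_vars)]
        q = sum(f[i] * t[i] for i in range(n)) / tt
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * p[j] for j in range(n_vars)] for i in range(n)]
        f = [f[i] - q * t[i] for i in range(n)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict the response for one new spectrum x."""
    x_mean, y_mean, components = model
    e = [xj - mj for xj, mj in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = sum(ej * wj for ej, wj in zip(e, w))
        y_hat += q * t
        e = [ej - t * pj for ej, pj in zip(e, p)]
    return y_hat


# Toy data: the response is an exact linear function of two variables.
X = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 6.0]]
y = [3.0 * a - 2.0 * b + 1.0 for a, b in X]
model = pls1_fit(X, y, n_components=2)
print(pls1_predict(model, [2.0, 2.0]))
```

In a real calibration, X would hold the absorbance values of the full spectrum and y the reference pH or total acidity values, with the number of latent variables chosen from validation results.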
The number of latent variables and the performance of the obtained PLS models were evaluated considering the correlation coefficient (R), the Root-Mean-Square Error (RMSE), the ratio of performance to deviation (RPD) and the range error ratio (RER) (Lima et al., 2020). The analysis was performed using Chemoface software version 1.63.

Results and Discussion
The spectra presented in Figure 1 are the mean NIR absorbance spectra of the higher-quality and inferior-quality cocoa beans. It can be observed that the spectral behavior of the samples is similar. What differentiates them at each peak is the absorbance, which indicates a greater or lesser amount of the compound responsible for the vibration generated.
The PLS-DA technique is capable of generating functions or models that classify the samples into their relevant classes. The performance of the developed models was evaluated by their ability to separate the samples into their classes (higher or inferior). Table 1 presents the percentage of cocoa samples correctly classified by the developed model with eight latent variables.
It can be seen that there was an excellent classification rate of the samples in their respective classes in the training stage. In the validation stage, the generalization capacity of the model to correctly classify new samples can be verified. The technique can be used to discriminate new cocoa samples and to define the best destination for their application. Cocoa beans in the high-quality group can be used to make chocolate, the most demanded cocoa product, which requires properly processed cocoa beans, with low acidity, bitterness, and astringency. Inferior quality cocoa beans, on the other hand, will present higher intensity in these aspects and thus provide increased nutritional value due to the amount of bioactive antioxidant compounds in their composition.
Table 2 shows the average pH and total acidity values of the cocoa beans. The fermentation stage promotes an increase in pH and consequently a reduction in acidity due to biochemical reactions which occur in this process. Thus, pH values can be used as an indication of the quality of cocoa processing. Sunoj et al. (2016) found values between 4.26 and 6.13, which were similar to those obtained in the present study. Both results were close to pH 5.0, which indicates that the samples were subjected to the proper fermentation process (Barrientos et al., 2019).
Variability was observed in the total acidity of the cocoa samples, thus showing the representativeness of these values for multivariate modeling. During fermentation, microorganisms produce acids that raise acidity and lower the pH values. The values obtained in the present study are similar to those found by Peláez et al. (2016) and Sunoj et al. (2016).
The Partial Least Squares (PLS) technique was used to develop mathematical models able to predict the total acidity and pH of cocoa samples as a function of near-infrared spectroscopy. These models were fitted without pre-processing and with two mathematical pre-processing methods: Standard Normal Variate (SNV) and Multiplicative Scatter Correction (MSC). The fit parameters of the optimized models for predicting pH and total acidity are presented in Table 3. Different numbers of latent variables were tested and the parameters (Table 3) were evaluated to choose the best model. The choice was based on the external validation data. R values close to 1 and low RMSE values indicate good performance of the model in predicting pH and titratable acidity. Another very relevant fit parameter is the ratio of performance to deviation (RPD): models with RPD > 2 can be considered excellent, models with 1.4 < RPD < 2 reasonable, and models with RPD < 1.4 unreliable (Lima et al., 2020). The ratio of an analyte's concentration range to the root mean square error (RER) is also a relevant parameter: a model with RER < 3 shows low predictive ability, models with RER between 3 and 10 have moderate practical utility, and RER > 10 indicates good model utility (Lima et al., 2020).
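The fit parameters and their interpretation can be sketched as follows. RPD is computed here as the standard deviation of the reference values divided by the RMSE, and RER as the range of the reference values divided by the RMSE. The use of the sample standard deviation and the handling of values falling exactly on a threshold are assumptions, and the numbers are invented for illustration only.

```python
from statistics import mean, stdev


def rmse(y_true, y_pred):
    """Root-mean-square error between reference and predicted values."""
    return mean((a - b) ** 2 for a, b in zip(y_true, y_pred)) ** 0.5


def rpd(y_true, y_pred):
    """Ratio of performance to deviation: SD of reference values / RMSE."""
    return stdev(y_true) / rmse(y_true, y_pred)  # assumption: sample SD (n - 1)


def rer(y_true, y_pred):
    """Range error ratio: range of reference values / RMSE."""
    return (max(y_true) - min(y_true)) / rmse(y_true, y_pred)


def rpd_label(value):
    # Thresholds from the text; boundary handling is an assumption.
    if value > 2.0:
        return "excellent"
    if value > 1.4:
        return "reasonable"
    return "unreliable"


def rer_label(value):
    # Thresholds from the text; boundary handling is an assumption.
    if value > 10.0:
        return "good utility"
    if value >= 3.0:
        return "moderate practical utility"
    return "low predictive ability"


# Invented reference and predicted pH values, for illustration only.
reference = [4.0, 5.0, 6.0, 7.0]
predicted = [4.5, 4.5, 6.5, 6.5]
print(rmse(reference, predicted))
print(rpd_label(rpd(reference, predicted)), "/", rer_label(rer(reference, predicted)))
```

Both ratios scale the prediction error by the spread of the reference data, so a model calibrated on a narrow range of values is penalised even when its absolute error is small.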
According to this classification, the models generated by the PLS technique can be considered reasonable for predicting the quality attributes. However, the model generated from SNV-treated data with 20 LV has moderate practical utility and excellent predictive ability for pH. For total acidity, the model generated from MSC-treated data with 18 LV has moderate utility and reasonable predictive ability. These were considered the best models to predict these quality parameters.
The scatter plots show the correlation between the experimental reference values and the values predicted by the optimized models for pH (Figure 2) and total acidity (Figure 3). It is possible to observe a high correlation between the actual and predicted data for all models, which may indicate that the models generated from data with higher sample dispersion over the model range have an easier fit. Therefore, the evaluation parameters of the models cited above indicate the possibility of satisfactorily determining the acidity and pH of the cocoa bean.
The evaluation parameters of the models indicate the possibility of simultaneously determining two quality attributes of cocoa beans by using NIR spectroscopy associated with multivariate calibration models. Thus, chemometric models can be successfully used as a routine analysis to evaluate the quality of cocoa bean samples.

Conclusion
This work provides a strategic alternative for the quick and simple analysis and classification of cocoa beans. PLS-DA was effective in differentiating the classes of higher and inferior quality samples, determining to which class an unknown sample belonged based on the information provided to the system. PLS-DA obtained over 86% accuracy in classifying the samples into their respective classes. The application of PLS to the NIR dataset provided models with moderate practical utility and excellent predictive ability for pH, and moderate utility and reasonable predictive ability for total acidity.
The authors recommend the use of the models calibrated by PLS-DA and PLS with NIR data as a form of quality control for the classification of higher and inferior quality samples, in addition to providing values of quality attributes that are essential for deciding the proper industrial application of cocoa beans.