MBL-2 gene polymorphisms in pediatric Burkitt lymphoma: an approach based on machine learning techniques

Introduction: Burkitt lymphoma belongs to the group of non-Hodgkin lymphomas. Although curable in 80% of cases at less advanced stages, it presents at advanced stages in about 75% of cases in Brazil’s Northeast region, requiring urgent and intensive care early in treatment. Objectives: This study aimed to verify the participation of MBL2 gene polymorphisms in the development of Burkitt lymphoma. Methods: Computational approaches based on Machine Learning were used: the Random Forest and KMeans algorithms were implemented to classify patterns of individuals diagnosed with the disease and thereby differentiate them from healthy individuals. A group of 56 patients aged 0 to 18 years with Burkitt lymphoma, from a reference hospital for the treatment of childhood cancer, was evaluated together with a control group of 150 samples. All samples were tested for polymorphisms in exon 1 and in the -221 and -550 regions of the MBL2 gene. Results: First, an unsupervised classification was performed, which identified two as the number of groups that best represents the data, reaching 72.81% accuracy in the separation of patients and controls. Then, the supervised classification reached 75% accuracy when performing a cross validation. Conclusion: It was not yet possible to reach a conclusion about the participation of the evaluated polymorphisms in the development of BL; however, the computational techniques used proved to be very promising for studies of this nature.


Introduction
Burkitt lymphoma (BL) is a malignant and extremely aggressive non-Hodgkin lymphoma of mature B cells, presenting the highest cell proliferation rates among all neoplasms, with a doubling time of 24 to 48 hours. It represents almost half of all childhood lymphoma cases, with a higher incidence rate in Caucasian and male children (Aydin et al., 2019; Derinkuyu et al., 2016; Swerdlow et al., 2016). Its main characteristic is the presence of mature B cells and monomorphic lymphocytes, presenting a reciprocal translocation involving the MYC proto-oncogene (Hecht & Aster, 2000; Vardiman et al., 2008). Currently, three clinical forms of BL are recognized: endemic, sporadic (non-endemic) and the form associated with immunodeficiency. Although these forms of BL are histologically identical and have similar clinical behavior, they have different epidemiological, clinical and genetic characteristics (Freedman et al., 2018).
Regarding the factors involved in the genesis of BL, genetic mechanisms (such as the reciprocal translocation involving the MYC proto-oncogene) and the participation of infectious agents (Hsu & Glaser, 2000), mainly the Epstein-Barr Virus (EBV), can be highlighted (Molyneux et al., 2012). In addition, other genetic factors have been associated with genetic susceptibility, such as polymorphisms in the promoters of the interleukin-10 and Tumor Necrosis Factor (TNF) genes, especially in children without EBV infection (White, 2004). The primary deficiency of Mannose-Binding Lectin (MBL), usually caused by polymorphisms in the MBL-2 gene (Kilpatrick, 2002a), can lead to immunodeficiencies and increased susceptibility to various infectious diseases (Da Cruz et al., 2013), cancers of lymphoid origin and autoimmune diseases (Martín-Mateos & Piquer Gibert, 2016).
Research, Society and Development, v. 10, n. 12, e444101220561, 2021 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v1012.20561
In the promoter region of the MBL-2 gene there are, among others, two well-studied polymorphic sites: -550 H/L (rs11003125) and -221 X/Y (rs7096206), both resulting from the exchange of guanine for cytosine (G → C) (Bouwman et al., 2006). In the structural region, three point mutations have been described in exon 1 of the MBL-2 gene: at codon 52 (Arg → Cys, D allele), codon 54 (Gly → Asp, B allele), and codon 57 (Gly → Glu, C allele), each resulting in an amino acid exchange (Bouwman et al., 2006; Moslem et al., 2015). These variants are collectively referred to as "O" (the mutant allele), while "A" denotes the wild-type allele (Martín-Mateos & Piquer Gibert, 2016). Carriers of the O allele show reduced MBL expression, both heterozygotes (A/O) and especially homozygotes (O/O).
The polymorphic sites of the promoter region are closely linked and are associated with different serum MBL levels independently of the variant alleles. Because of linkage disequilibrium, these polymorphisms combine into a limited number of only seven or eight haplotypes (Petersen et al., 2001). Among them, the HYA and LYA haplotypes are most often associated with high plasma concentrations of MBL (Soltani et al., 2014), and LXA, HYO and LYO are associated with low concentrations (Bouwman et al., 2006). In this group, the X allele variant is the one that most negatively affects serum MBL production (Hansen et al., 2004). Consequently, the determination of the Y/X polymorphism in the promoter region is important for a functional reading of MBL (Mendonça et al., 2010). In the promoter region of the MBL-2 gene there is also another polymorphic site in the +4 region, the P/Q locus, which is also associated with a decrease in MBL levels (Madsen et al., 1995). According to Boldt et al. (2006), haplotypes are associated with progressively lower concentrations of serum MBL in the following sequence: HYPA > LYQA > LYPA > LXPA >> HYPO = LYPO = LYQO.
The search for new data analysis methodologies that are efficient and at the lowest possible cost has intensified, and it is precisely in this context that Machine Learning has been inserted. The term "Machine Learning" (ML) can be used to refer to algorithms that give computers the ability to learn without being explicitly programmed (learning from experience) (Van Der Aalst, 2016). To learn and adapt, a model is built from input data (instead of using fixed routines) and the constantly evolving model is used to make predictions or make decisions that are considered to be the most correct according to its experience (Van Der Aalst, 2016). Therefore, the present study proposed another method, which is not as traditional as statistical methods used to analyze polymorphism data. The main objective is to adapt and use computational models based on ML with the purpose of verifying the participation of MBL-2 gene polymorphisms in the susceptibility to the development of BL, and mainly to be able to classify patients and controls efficiently in the future.

Methodology
A total of 56 patients with BL, aged 0 to 18 years, were examined between 2012 and 2017. Patients were recruited by spontaneous demand, both in the prospective and in the retrospective follow-up, at Hospital Universitário Oswaldo Cruz (HUOC). The clinical and biological data of the patients were collected by consulting the medical records of the Pediatric Oncohematology Center of HUOC (CEONHPE). The control group consisted of 150 randomly selected individuals in the same age group as the patients, with no history of cancer, provided by the DNA Bank of the Human Molecular Genetics Laboratory of the Genetics Department of the Federal University of Pernambuco. The data collected were: staging, recurrence, gender, age, treatment time, and patient status (alive or dead).
It must be highlighted that the study design and experimental protocols were approved by the Human Research and Ethics Committee of the Hospital Complex, HUOC / PROCAPE, under the number CAAE: 02044612.3.0000.5198.

Genetic polymorphisms analysis
Somatic genetic material was used to study the gene polymorphisms in the prospective follow-up patients. It was extracted from peripheral blood collected in an ethylenediaminetetraacetic acid (EDTA) tube, followed by processing and DNA extraction using the mini salting-out technique (Miller et al., 1988), and stored at -20 °C. DNA extracted from paraffinized material was used to evaluate the retrospective segment, using the QIAGEN® QIAamp® DNA FFPE Tissue Kit (Hladnik et al., 2002), whereas the promoter region polymorphisms were detected by using Taqman® SNP (Single Nucleotide Polymorphism). PCR conditions for genotyping are described in Table 1.

Experimental setup
In this study, models based on Machine Learning techniques were implemented to analyze patterns related to the groups of patients and controls and to find new insights about the development of BL. For this purpose, the code development environment provided by Google Colab was used with the Python Scikit-learn library. To verify whether there is a statistically significant difference between the expression groups, Fisher's exact test was used with the aid of the GraphPad Prism® V5.0 program. The first actions aimed to analyze the database, identifying the distribution of patient and control data. Then, a Principal Component Analysis (PCA) was performed (Niitsuma & Okada, 2007), aiming to reduce dimensionality and select relevant attributes. The main reasons for keeping the dimensionality as small as possible are measurement cost and classifier precision. When the feature space contains only the most salient characteristics (those with the best explanatory capacity), the classifier will be faster and will take up less memory. When the set of training examples is not very large, a small feature space can mitigate the curse of dimensionality and provide small error rates for the classifier (Watanabe, 1985). Following that, the analysis was divided into unsupervised and supervised analysis (Sathya & Abraham, 2013).
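The PCA step described above can be illustrated with a minimal Scikit-learn sketch. It is not the authors' code: the four variable names and the integer coding (0, 1, 2) are assumptions made for illustration, and the values are random numbers rather than study data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in for the study table: 206 individuals and four
# integer-coded variables. The names and the random values are assumptions.
feature_names = ["exon1", "mbl2_221", "mbl2_550", "expr"]
X = rng.integers(0, 3, size=(206, len(feature_names))).astype(float)

# Standardise so that every variable contributes on the same scale.
X_scaled = StandardScaler().fit_transform(X)

# Keep the two principal components that explain the most variance.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)

# components_ has one row per component and one column per original variable.
# A larger absolute loading means the variable weighs more on that component.
for name, loading in zip(feature_names, pca.components_.T):
    print(f"{name}: PC1={loading[0]:+.3f}  PC2={loading[1]:+.3f}")

print("explained variance ratio:", pca.explained_variance_ratio_)
```

Variables with large absolute loadings are the ones that would appear farthest from the center of a loading plot.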
When the database has a large amount of data, the best approach is to divide it into three parts: training, validation and testing. When the data set is small, it is more suitable to use a resampling technique, which approximates the validation set by reusing observations from the training set. One such technique is k-fold cross-validation, which consists of randomly dividing the training set into k equal parts. Then, k-1 parts compose the training data used to fit the models, and the remaining part is reserved for estimating performance; this process is repeated until every part has been used both in training and in validation. This is expected to increase the accuracy of these estimates (Hastie et al., 2009; Kuhn & Johnson, 2013). For this reason, our database was divided as follows: 70% of the data was used for training and 30% for testing.
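A 70%/30% division of 206 individuals can be sketched with `train_test_split`. The feature matrix below is random and purely illustrative, and stratifying by class is an assumption: the text does not say whether the split preserved the patient-to-control ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical coded genotypes (random values, not study data).
X = rng.integers(0, 3, size=(206, 4))
# Class labels with the group sizes reported in the text: 1 = patient, 0 = control.
y = np.array([1] * 56 + [0] * 150)

# 30% of the individuals are held out for testing. stratify=y (an assumption)
# keeps the patient/control proportion similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

print("training individuals:", len(y_train))
print("test individuals:", len(y_test))
print("patients in the test part:", int(y_test.sum()))
```

With 206 individuals, a 30% test fraction gives 62 test individuals and 144 training individuals.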
For the unsupervised analysis, the K-means algorithm was used (Xu & Wunsch, 2005). This algorithm uses the "clustering" technique, which is the process of dividing the data set by similarity into groups called "clusters", where individuals in one group are more similar to each other than to individuals in other groups (Jain & Dubes, 1988; Sharma, 2019). Building on the clustering idea, the algorithm is used mainly to create a model such that, when new data are added, they are automatically assigned to a given cluster. From this, it is possible to deduce their characteristics from the similarities with the other members of that cluster. In order to improve the dissimilarity between data points in the low-dimensional embedding space, we used the dimensionality reduction technique t-SNE (t-distributed stochastic neighbor embedding), in which the preservation of structure is non-linear, unlike Multidimensional Scaling (MDS) or PCA (Li et al., 2017; Van der Maaten & Hinton, 2008). This has many applications, being able to aid in the diagnosis or to determine a prognosis of a patient, for example. In order to interpret and validate the clusters obtained, the silhouette technique was applied (Rousseeuw, 1987).
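The clustering workflow above (K-means, silhouette validation, t-SNE embedding) might look roughly like the following sketch. The data are two synthetic Gaussian groups of 56 and 150 points, an assumption chosen only to make the example self-contained; the range of k values and the perplexity are likewise illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Hypothetical data: two Gaussian groups in 4 dimensions (not study data).
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(56, 4)),
    rng.normal(loc=3.0, scale=1.0, size=(150, 4)),
])

# Silhouette score for several candidate numbers of clusters.
silhouettes = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    silhouettes[k] = silhouette_score(X, labels)
best_k = max(silhouettes, key=silhouettes.get)
print("silhouette by k:", silhouettes, "-> best k:", best_k)

# A fitted K-means model assigns a new individual to the nearest cluster.
model = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit(X)
new_individual = rng.normal(loc=3.0, scale=1.0, size=(1, 4))
print("cluster of the new individual:", model.predict(new_individual))

# Non-linear 2-D embedding with t-SNE, then K-means on the embedded points.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
embedded_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embedding)
```

One design point worth noting: Scikit-learn's `TSNE` offers `fit_transform` but no `transform` for unseen points, so in this sketch the assignment of new individuals is done by the K-means model fitted in the original feature space.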
For the supervised technique, the Random Forest (RF) algorithm was used to analyze the data pattern, the same way it was done by Lins et al., (2017). RF is a supervised learning algorithm that creates multiple decision trees and combines them to obtain a more accurate and stable prediction (J. C. da Silva, 2018). Then, with the purpose of increasing the algorithm's predictive power and making the forecasting model more flexible (Kuhn & Johnson, 2013), two adjustment parameters (hyperparameters) were introduced. A parameter relates to the number of trees built by the algorithm before making a decision or making an average of predictions (Estimators) and the other parameter indicates the depth that trees should or could reach (Max Depth). These parameters were assigned different settings, as shown in table 2.
In this study, the cross-validation technique was also implemented (Refaeilzadeh et al., 2009). This technique uses the partitioning of the dataset into subsets (in our case, k = 10 folds) by randomly separating a subset for testing and the others for training. This process is repeated k times, until all parts are used for training and testing, thus more reliably evaluating the classifier predictive power.
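The 10-fold procedure can be sketched with `cross_val_score`. The features are random placeholders, and both the stratified folds and the 100-tree forest are assumptions for illustration rather than the configuration used in the study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)

# Hypothetical coded genotypes and labels (1 = patient, 0 = control).
X = rng.integers(0, 3, size=(206, 4))
y = np.array([1] * 56 + [0] * 150)

# k = 10 folds: each fold serves once as the test part while the other
# nine are used for training. Stratification is an assumption.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=100, random_state=0)

fold_accuracy = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
print("accuracy per fold:", np.round(fold_accuracy, 3))
print("mean accuracy:", fold_accuracy.mean())
```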
Several configurations were tested with a GridSearch algorithm to identify the best configuration to be applied to the supervised technique (Bergstra & Bengio, 2012), and the results were expressed through a confusion matrix and the accuracy, precision and recall metrics (Bland, 2015). To assess the efficiency of class separation, the Area Under the Receiver Operating Characteristic Curve (AUC-ROC) was used, which is very useful for evaluating domains in which there is a large disproportion between classes (Khan & Rana, 2019; Prati et al., 2008), as shown in graphs 11 and 12.
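A grid search over the two hyperparameters named above, followed by the reported metrics, could be sketched as follows. The grid values are hypothetical (Table 2 is not reproduced here), the data are random placeholders, and the stratified split is an assumption.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score, roc_auc_score)
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(0)

# Hypothetical coded genotypes and labels (1 = patient, 0 = control).
X = rng.integers(0, 3, size=(206, 4))
y = np.array([1] * 56 + [0] * 150)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=0
)

# Illustrative grid: number of trees and maximum tree depth.
param_grid = {"n_estimators": [10, 50], "max_depth": [2, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=10,               # 10-fold cross-validation on the training part
    scoring="accuracy",
)
search.fit(X_train, y_train)
print("best configuration:", search.best_params_)

# Evaluate the refitted best model on the held-out part.
y_pred = search.predict(X_test)
y_score = search.predict_proba(X_test)[:, 1]   # estimated probability of "patient"

cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
auc = roc_auc_score(y_test, y_score)
print("confusion matrix (rows = true, columns = predicted):\n", cm)
print("accuracy:", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred, zero_division=0))
print("recall:", recall_score(y_test, y_pred, zero_division=0))
print("AUC-ROC:", auc)
```

Because the placeholder features carry no information about the labels, the printed metrics only demonstrate the mechanics and say nothing about the study's results.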

Biological characteristics analysis
A total of 56 patients with BL were analyzed, 64% male and 36% female. The individuals who participated in the study were aged 0 to 18 years (mean age of 6.7 years at diagnosis, with a standard deviation of 6.4295 ± 6.9705), and 44% were under 5 years of age (Table 3). The main site of tumor involvement was the abdomen, in 80% of cases, and about 7% had mandibular/jaw involvement. In total, 70% of patients were cured, with a period of more than 5 years without the disease.
The proportional difference between genders in this study was approximately 2 males to 1 female in the patients group, and approximately 1 male to 1 female in the control group (p = 0.1161).
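A comparison of this kind can be reproduced in outline with SciPy's Fisher's exact test, used here as a stand-in for the GraphPad Prism program employed in the study. The counts are partly assumed: the patient row follows from the reported 64%/36% split of 56 patients, while the even 75/75 control split is a guess based on "approximately 1 male to 1 female", so the resulting p-value is not expected to match the reported one.

```python
from scipy.stats import fisher_exact

# 2x2 contingency table: rows = group, columns = (male, female).
# Patient counts are derived from 64% / 36% of 56; control counts are ASSUMED.
table = [
    [36, 20],   # patients
    [75, 75],   # controls (hypothetical even split)
]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, two-sided p = {p_value:.4f}")
```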
Analyses of the MBL-2 gene polymorphisms were performed and the results are shown in Table 4. Graph 1 summarizes the distribution of the components with greater variability within the database. Among the polymorphisms present in the general study population (patients and controls), a total of 15 combinations were obtained, as described in Figure 2. Meanwhile, Figure 3 shows the comparison between the patient and control groups in terms of haplotypes.
In order to verify whether there is a significant difference between the haplotypes of the patients and the control group, we grouped the study population into three groups: high expression, intermediate expression, and low expression of MBL, as shown in Figure 4. However, no significant difference was found between the patients and controls (p = 0.9142).
Figure 4: Graph shows no statistical difference between patients and controls when comparing haplotypes by expression groups. Source: Authors

Machine learning models experiments
Initially, the database was analyzed with a focus on class balance, in order to assess the risk of "overfitting". This phenomenon happens when the model adapts very well to the data it is trained on but does not generalize to new data. In this context, the classifier "learns" much more about one group of data than another because of the large disparity in the number of members between them. In this study, the patient and control groups are imbalanced (56 patients versus 150 controls).
From the PCA technique, analyses were performed to indicate the two main components that best explain the variability of the data.
According to the analyzed parameters, the PCA technique identified the MBL expression groups (high, intermediate or low) and the polymorphism present in exon 1 of the MBL-2 gene as the two main variables that explain the variability of our data, followed by the MBL2 -221 polymorphism. Figure 6 shows the most salient variables on the x-axis and on the y-axis. The most important variables are those that are furthest from the center of the graph. "Expr" refers to the MBL protein expression group; "Éxon 1" refers to the MBL-2 exon 1 polymorphism.

Source: Authors
According to the data, these variables had the highest factor loadings, thus, it was verified whether the use of these variables alone could be sufficient for the development of the classifier, along with the inherent benefits of dimensionality reduction, as discussed above.
After this phase, unsupervised clustering techniques were performed in order to verify the best number of clusters for these data. Through association analysis, it was identified that two clusters best represent our dataset, showing that, for the training set, the algorithm managed to satisfactorily separate patients and controls, as can be seen in Figure 7.
Aiming to obtain a better interpretation and validation of the cluster analysis, the silhouette technique was used. Table 5 shows the silhouette values obtained and Figure 8 shows the so-called "elbow curve", which indicates the point from which there are no gains from increasing the number of clusters. After this first analysis, part of our data was used to test the model's hit rate when a new dataset was introduced. The classifier's hit rate was 49.51%; however, when the analysis was performed using t-SNE, the model accuracy rose to 72.81%.
In the face of overfitting, the supervised analysis was performed using the Random Forest (RF) algorithm because, as is known, this avoids overfitting more efficiently than decision trees, in addition to obtaining greater accuracy and being more stable (J. C. da Silva, 2018).
The RF was first applied to the complete database, which yielded the confusion matrix shown in Figure 9. From the portion randomly selected for testing, 45 individuals classified as controls were actually controls, 17 classified as controls were actually patients, no control was classified as a patient, and no patient was correctly classified as a patient.
Figure 9: Confusion matrix for the portion randomly selected for testing. Source: Authors.
According to this matrix, it was possible to calculate that the classifier obtained an accuracy of 72.58%, with a precision of 72.58% and a recall of 100%. However, as the confusion matrix shows, due to the strong imbalance in the data, the model learned much more about controls than about patients.
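These values follow directly from the confusion matrix counts given in the text when "control" is taken as the positive class, which is an interpretation assumed here because it is the one that reproduces the reported figures:

```python
# Counts taken from the confusion matrix described in the text,
# treating "control" as the positive class (an assumed interpretation).
tp = 45   # controls classified as controls
fn = 0    # controls classified as patients
fp = 17   # patients classified as controls
tn = 0    # patients classified as patients

accuracy = (tp + tn) / (tp + tn + fp + fn)   # 45 / 62
precision = tp / (tp + fp)                   # 45 / 62
recall = tp / (tp + fn)                      # 45 / 45

# Share of the 17 test patients that the model identified as patients.
patient_recall = tn / (tn + fp)              # 0 / 17

print(f"accuracy  = {accuracy:.2%}")
print(f"precision = {precision:.2%}")
print(f"recall    = {recall:.2%}")
print(f"patients correctly identified = {patient_recall:.2%}")
```

The last line makes the imbalance explicit: the 100% recall refers to controls, while none of the 17 patients in the test portion was recognized as a patient.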
Following this analysis, we used the RF again, in its standard configuration, but this time we used only the most salient variables according to the PCA, and in this phase, we obtained an accuracy value of 68.77%.
After this, the hyperparameters were applied in the RF algorithm along with a cross validation. The multiple configurations were assembled using a GridSearch-type technique, which achieved an accuracy of 75% in the best configuration found by the algorithm; the average accuracy across configurations was 72.93%, with a precision of 72.58% and a recall of 100%.
To verify the model efficiency in separating classes (patients and controls) two AUC-ROC curves were plotted (Khan & Rana, 2019;Prati et al., 2008). The first plot was made using the complete database with all variables and using the RF at default settings; and the second plot was made with the complete database, but after the introduction of the hyperparameters.
The first curve, shown in Figure 10, presented an AUC of 46%, a value considered low. The second, shown in Figure 11, presented an AUC of 43%. Both values are below the 50% expected from a classifier that assigns classes at random, demonstrating that overfitting caused an unsatisfactory data separation.

Discussion
This study was intended to analyze a possible risk factor for BL using the Machine Learning approach. It is thus focused on bringing a new and promising analysis tool that can assist in several future analyses, besides being able to help in the diagnosis and prognosis of a great number of diseases.
Like most types of lymphoma, BL is more prevalent in males, with a male-to-female ratio of 3-4:1 according to world statistics (Dozzo et al., 2017). In our study, a higher incidence in males was also observed, although in a slightly lower proportion (about 2:1). This ratio is lower than that found by W. F. da Silva et al. (2020), approximately 4:1, when studying patients with BL who were carriers of the human immunodeficiency virus in Brazil, and by Rodrigues-Fernandes et al. (2020), 3:1, in a systematic review that compiled data on pediatric patients from several countries.
Regarding the average age, Hassan et al. (2008) observed an average of 7.8 ± 3.7 years in Brazil, very similar to the 7.4 years observed by Rodrigues-Fernandes et al. (2020). Our data, although showing a slightly lower average age (6.7 years at diagnosis, with a standard deviation of 6.4295 ± 6.9705), corroborate Hassan et al. (2008) and Rodrigues-Fernandes et al. (2020), showing that there is little variation in the age most affected by this pathology.
When performing a simple comparison, it is possible to observe the haplotype differences that exist between the population in our study and the population studied by Kilpatrick (2002b). However, in this study, the haplotype frequencies remained similar when comparing the patient group and the control group, demonstrating that there were no statistically significant differences between them. This suggests that there is no participation of MBL-2 polymorphisms in the development of Burkitt lymphoma, unlike what happens in susceptibility to bacterial and viral infections (Eisen & Minchinton, 2003), atherosclerosis (Rugonfalvi-Kiss et al., 2002), autoimmune disease (for example, type 1 diabetes) (Tsutsumi et al., 2003) and rheumatoid arthritis (Graudal et al., 2000), in which their participation is well described in the literature.
In view of the results obtained, we implemented some algorithms in our database. First, the PCA test was performed and it was possible to identify the variables that best explained the variability of the data, in other words, those that had the best explanatory capacity for the problem. It must be considered that the variables referring to the MBL-2 mutations were highlighted in this test, which leads us to think that there is a participation of these polymorphisms in BL. Possibly this is due to a greater infection rate and maintenance of the EBV in patients with low expression genotypes, leading to a low concentration of serum MBL.
Another hypothesis is that these variables can facilitate the creation of a classifier capable of separating patients and controls. The K-means algorithm is a very popular approach to finding clusters due to its simplicity of implementation and quick execution (Davidson, 2002). Some of its applications are already documented in the literature: Salma (2016) used a variation of K-means (fast K-means) to select the most relevant features from a high-dimensional breast cancer data set, reaching an accuracy of 99.39%. In this context, it is also worth mentioning the study of Kakushadze and Yu (2017), who used 1389 published samples of 14 types of cancer and found that 3 types (liver cancer, lung cancer and renal cell carcinoma) stand out from the others and had no cluster-like structure. In our study, using this same algorithm, we identified that two groups have the best explanatory capacity for our data, dividing them between patients and controls with a hit rate that reached 72.81% when analyzed using t-SNE.
In the context of supervised analysis, some authors have been using the cross-validation strategy to classify some types of cancer, for example Lee et al., (2019), who used the RF with 10-fold cross-validation to classify 31 types of cancer and managed to reach 84% accuracy, reaching up to 94% for the 6 most common types of cancer. As mentioned, RF was used in our study because it avoids overfitting more efficiently than decision trees, in addition to obtaining greater accuracy and being more stable (J. C. da Silva, 2018). Thus, by using this algorithm, it was possible to reach a reasonable accuracy level, even for a small number of individuals, especially when we use cross-validation as well as Lee et al., (2019). Therefore, we can conclude that the option of cross-validation allowed a variability of the training data and a better test than the standard RF approach based on percentages. It is notable that when using the variables highlighted by the PCA, slightly less accuracy was obtained. However, this strategy can be considered when dealing with a large amount of data, due to the lower computational costs required and the reduction in execution time.
It is also worth noting that, as the distribution analysis showed, our data are extremely unbalanced. We have a low number of patients, which contributes to a strong bias toward the control data in the results. Therefore, we understand that a higher number of patients could be evaluated and other polymorphisms related to innate immunity could be included. We confirmed this fact through the AUC-ROC curve, which evidenced a low separation efficiency in this model. This low efficiency can be explained in two ways: the database is unbalanced and the algorithm found no differences between patients and controls, and/or the studied polymorphisms of the MBL-2 gene do not exert a significant influence on the pathogenesis of BL, so that the variables present in this study are insufficient for the creation of a classifier.

Conclusion
Machine Learning techniques have brought new expectations to the medicine field, mainly for the diagnosis and prognosis of diseases. These are extremely promising techniques to assist in the analysis of biomedical data, as they make it possible to extract new insights from data sets that are often previously analyzed or too large to be analyzed by methods that are more conventional.
In this article, we used a Machine Learning model to verify the participation of MBL-2 polymorphisms in the development of BL, as well as verifying whether with only the aforementioned parameters, it would be possible to, satisfactorily, classify patients and controls.
Thus, it was possible to observe that the dimensionality reduction techniques should be considered, especially when dealing with large databases, because the losses in hit rates in our study were low, compared to the benefits that these techniques provide.
The two algorithms proved to be quite efficient in classifying individuals, especially when using their variations (72.81% for KMeans and 75% for Random Forest), showing that the refinements implemented actually improve their performance. Nevertheless, it was not possible to reach a conclusion about the participation of the MBL-2 polymorphisms in the development of Burkitt lymphoma. We believe that this is due to the low number of individuals present in our database (56 patients and 150 controls) and the imbalance between groups, facts that led to overfitting.
Even though it is a disease with a relatively low incidence, the results encourage us, as we believe that with the cooperation of several reference centers in the treatment of childhood BL and the creation of a unified digital medical record, approved in Brazil by Bill 3814/ 2020, it will be possible to significantly increase the robustness of our database. Soon, it will be possible to create a classifier containing the main components related to the disease that will serve as a decision support tool, based on a computational intelligence and Machine Learning algorithm.
For this reason, new, larger and more robust studies, especially regarding the number of patients, are needed to obtain this answer. Another suggestion would be to use other algorithms and variations and to compare their efficiency with those used in this study.