Machine learning and automatic selection of attributes for the identification of Chagas disease from clinical and sociodemographic data

Objective: To evaluate the potential of machine learning and automatic attribute selection for discriminating individuals with and without Chagas disease based on clinical and sociodemographic data. Method: After several learning algorithms were evaluated, the Multilayer Perceptron (MLP) neural network and Linear Regression (LR) were chosen and compared to determine which performs best in predicting the diagnosis of Chagas disease, using sensitivity, specificity, accuracy and area under the ROC curve (AUC) as criteria. The models generated with the automatic attribute selection methods Forward Selection, Backward Elimination and genetic algorithm were also compared. Results: The best results were achieved using the genetic algorithm, and the MLP presented an accuracy of 95.95%, a sensitivity of 78.30%, a specificity of 75.00% and an AUC of 0.861. Conclusion: The performance was considered very interesting, given the nature of the data used for classification and its use in public health. The approach is relevant to the medical field because it enables an approximation of prevalence that justifies actively searching for individuals with Chagas disease for treatment and prevention.


Introduction
Chagas disease (CD), also known as American trypanosomiasis, is a disease whose etiologic agent is the protozoan Trypanosoma cruzi. It is currently considered one of the greatest public health problems in the Americas: it is estimated that about 8 million people have CD and that it causes on average 10,000 deaths per year (WHO, 2020).
The disease presents in two distinct phases, an acute phase followed by a chronic phase, and both may remain asymptomatic in some individuals. The acute phase lasts 6-8 weeks and, when symptomatic, can be characterized by fever, tachycardia, splenomegaly and edema. The chronic phase is asymptomatic in most individuals; however, some may show signs and symptoms 20, 30 or more years after infection, characterized by impaired cardiac and digestive function (Gunter et al., 2017).
Computational innovation, in turn, develops new techniques to drive improvements in various human activities (Martínez-Torres, 2013). In medical informatics, for example, new technologies provide health professionals with computational information that facilitates patient care. It thus becomes possible to issue specialized opinions based on information, electronic records and medical images without resorting to unnecessary, risky, uncomfortable or expensive procedures, and to assist other health professionals by adding information, facilitating understanding and supporting diagnosis (Forshyth et al., 2019).
Machine learning is an area of artificial intelligence (AI) whose object is the automatic construction of computational models that recognize complex patterns between variables, describing them or enabling decisions based on recorded experience (Mitchell, 1997). This tool has multiple benefits: collected data are processed into information from which it is possible, for example, to gain knowledge about epidemics and their relationship with the environment, as well as to aid diagnosis (Traore et al., 2016).
AI tools have contributed to the diagnosis of CD in the 21st century, especially in evaluating damage to the cardiovascular system, for example through the analysis of heart rate variability (Moncayo Á, Silveira AC et al., 2017) and through Kohonen topological maps used to differentiate individuals with CD and heart disease from indeterminate (asymptomatic) individuals with CD and from normal individuals (Neto et al., 2013).
In the field of machine learning there are several classification algorithms, which differ in the method used to induce knowledge. Notable examples are neural networks, support vector machines, decision trees, Bayesian networks, k-nearest neighbors and linear regression, among others (Spatti et al., 2019).
The objective of this work was to evaluate the potential of using machine learning and automatic selection of attributes in the discrimination of chagasic and non-chagasic individuals based on clinical and sociodemographic data.

Methodology
A cross-sectional field study was conducted to build the database and the computational models. It involved the population of the rural area of the town of Itabaianinha/SE (villages of Fundão and Piabas), located in the Northeast region of Brazil. The town has a dry and sub-humid climate, with an average annual temperature of 24.2 °C, average annual precipitation of 976.9 mm and a rainy season between March and August. The countryside is divided into 72 villages, which include 38.0% of the population. Its economy is based on citrus crops, the raising of large and small animals and the production of ceramics. To build the database, data on sociodemographic characteristics, symptoms, clinical findings and swallowing were collected from residents of the region.
The project was approved by the Ethics Committee for Research with Human Beings of Tiradentes University, in Aracaju/SE, under case number 190610R. The collected data were used exclusively for the purposes provided in the protocol.
People over 18 years old who agreed to the Free and Informed Consent Form (TCLE), lived in the study area and were available at the time of data collection were included. All individuals with any clinical or physical incapacity were excluded.
With the aid of Community Health Agents (CHA), all residents of the study area were invited to participate in the research, but only 143 individuals over 18 years old took part. Home visits were carried out beforehand to inform and guide the public about the date and locations of the research procedures. All participants were first oriented through a reading of the TCLE and were clearly informed about the goals and procedures of the research.
Individuals who signed the form answered the same forms used by the CD control program (CDCP), which contain information on gender, age, level of education, kind of housing, therapeutic treatment, handling of or contact with triatomines, earlier diagnosis of CD and history related to the cardiovascular and digestive systems (Silva et al., 2003).
After data collection, each participant underwent a specific clinical evaluation of swallowing, performed in two steps: an indirect assessment and a direct one (Levy et al., 2003; Silva, 2004). The protocol used was based on protocols described in the literature.
For the serological evaluation, 5 ml of blood was collected by peripheral venipuncture, along with a drop of blood on filter paper. The diagnostic methods used to determine CD were ELISA and indirect immunofluorescence (IFI). The ELISA serological analyses were performed in the serology laboratory of the coordinating blood center in Aracaju/SE (HEMOSE) and repeated at LACEN. The IFI diagnostic techniques were performed in the CD laboratory at Paulista State University Julio de Mesquita Filho, in Araraquara, SP, Brazil.
The following algorithms were tested, each with the configuration settings listed, in search of the best configuration: linear regression, logistic regression (using the RapidMiner defaults), C4.5 decision tree (varying the criterion), support vector machines (SVM) (varying the kernel type), radial basis function (RBF) networks (varying the number of clusters from 2 to 16) and the multilayer perceptron (MLP) neural network. Performance was measured with and without genetic algorithms, forward selection and backward elimination, on a database from the rural area of the town of Itabaianinha, a recognized CD transmission area. Considering the features and complexity of the models evaluated, as well as the performance achieved, linear regression and the MLP were selected for the experiments because they were the only ones whose results were indicative, since the other approaches did not yield converging results. Linear regression represents the model as a linear function, whereas the MLP is a universal function approximator that can deal with problems that are not linearly separable (Cybenko, 1989; Hornik et al., 1989).
Research, Society and Development, v. 10, n. 4, e19310413879, 2021 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v10i4.13879
All models were built and evaluated using k-fold cross-validation with a total of 10 subsets (Kohavi, 1995). Automatic attribute selection algorithms were also used for each model built, namely forward selection, backward elimination and genetic algorithms (Guyon & Elisseeff, 2003). The number of neurons was variable: the present study used an automatic configuration, which determines the best setting for the number of intermediate layers.
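The validation and selection procedure described above can be illustrated with a minimal Python sketch. It is not the RapidMiner implementation used in the study: the function names `k_fold_indices`, `cross_validated_score` and `forward_selection`, the shuffling seed and the stopping rule (stop as soon as no remaining attribute improves the score) are assumptions made for illustration only.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validated_score(train_and_score, n_samples, columns, k=10):
    """Average a user-supplied score over k train/test splits for one attribute subset.

    train_and_score(train_idx, test_idx, columns) is a hypothetical callback that
    fits a model on the training rows and returns its score on the test rows.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx, columns))
    return sum(scores) / k


def forward_selection(n_attributes, evaluate):
    """Greedy forward selection over attribute indices 0..n_attributes-1.

    At each step the attribute that gives the highest score is added; the loop
    stops when no remaining attribute improves on the best score so far.
    """
    selected, best = [], float("-inf")
    while True:
        candidates = [a for a in range(n_attributes) if a not in selected]
        scored = [(evaluate(selected + [a]), a) for a in candidates]
        if not scored:
            break
        score, attr = max(scored)
        if score <= best:
            break
        selected.append(attr)
        best = score
    return selected, best
```

In practice, `evaluate` would wrap `cross_validated_score` so that every candidate subset is scored on the same 10 folds. Backward elimination follows the same pattern in reverse, starting from all attributes and removing one at a time.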
An MLP neural network has one or more intermediate layers of neurons and uses non-linear activation functions such as the sigmoid function. Each neuron performs a specific function, and its output influences or is combined with those of the neurons connected to it (Faceli et al., 2015).
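A small Python sketch may make the layered structure concrete. It shows only the forward pass of a fully connected network with sigmoid activations; the function names, the layer representation as `(weights, biases)` pairs and any weight values passed in are hypothetical, and training (for example by backpropagation) is deliberately left out.

```python
import math


def sigmoid(z):
    """Logistic sigmoid, mapping any real input to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """One fully connected layer: each neuron applies a sigmoid to its weighted input sum."""
    return [
        sigmoid(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def mlp_forward(inputs, layers):
    """Propagate the inputs through each (weights, biases) layer in turn."""
    activation = inputs
    for weights, biases in layers:
        activation = layer_forward(activation, weights, biases)
    return activation
```

With a single output neuron, the returned value can be read as a score between 0 and 1 and compared against a cut-off to assign a class.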
Multiple linear regression was used as the predictive statistical method for the output variable. It combines the input variables linearly, with coefficients generally estimated by minimizing the error, a problem that can be formulated and solved by quadratic or linear programming. The method also indicates the influence of one variable on another, which characterizes that variable as a factor in the output (Yang et al., 2016). In addition, it is popular for dichotomous outcomes and is a potential auxiliary classification tool for predicting clinical diagnosis (Upadhyaya et al., 2013).
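For reference, the standard least-squares formulation of multiple linear regression can be written as follows. This is the textbook form, not a description of the software configuration used in the study; the symbols (n individuals, p attributes, coefficients \beta, cut-off \tau) and the use of a cut-off on the fitted value to obtain a dichotomous prediction are assumptions introduced here.

```latex
% Fitted value for individual i from p input attributes x_{i1}, ..., x_{ip}
\hat{y}_i = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij}

% Coefficients estimated by minimizing the sum of squared errors over n individuals,
% with y_i \in \{0, 1\} coding the serological result
\hat{\beta} = \arg\min_{\beta} \sum_{i=1}^{n} \Bigl( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^2

% Dichotomous prediction obtained by comparing the fitted value with a cut-off \tau
\hat{c}_i = \begin{cases} 1 & \text{if } \hat{y}_i \ge \tau \\ 0 & \text{otherwise} \end{cases}
```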
The predictive models were created and validated with RapidMiner Studio 7.3, with settings optimized for the best performance: the MLP used a learning rate of 0.05 and 1000 training cycles, and the genetic algorithm optimization used a population of 20. The other parameters kept their default values. The figures and the DeLong test of the statistical significance of the difference between areas under the curve (AUC) were produced with MedCalc 17.9.7. Confusion matrices, ROC curves and dot diagrams were used to visualize the data.
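The performance measures named in this section follow directly from the confusion matrix and from the ranking of model scores. The Python sketch below shows one common way to compute them. It is illustrative only: the function names are hypothetical, the class coding (1 for seropositive, 0 for seronegative) is an assumption, and the AUC is computed with the rank-based (Mann-Whitney) formulation rather than with MedCalc. Both classes are assumed to be present, otherwise the divisions fail.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for labels coded 1 (positive) and 0 (negative)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),  # proportion of true positives detected
        "specificity": tn / (tn + fp),  # proportion of true negatives detected
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Under cross-validation, these measures would be computed on each test fold and then averaged, which is how a value such as an accuracy with a standard deviation arises.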

Results
A total of 143 individuals older than 18 years, of both sexes, took part in this research. They belonged to the families of children studied in previous research who were seronegative for CD (Talbot et al., 2014). Serological diagnosis of CD by ELISA showed a reactivity rate of 16.7% (n = 24). Seropositive individuals were aged 38 to 68 years. Among the subjects surveyed, 75.5% (n = 108) were female, with no statistically significant association between gender and positive serology (Table 1).
The analyses for each machine learning algorithm used the sociodemographic and clinical variables and the direct and indirect swallowing tests. The MLP achieved the best performance in discriminating between individuals with and without Chagas disease, reaching a sensitivity of 78.33%, a specificity of 75.0% and an accuracy of 95.95% ± 5.36% (Table 3). The MLP with the genetic algorithm presented the greatest specificity, 95.8%, while Linear Regression with the genetic algorithm presented a specificity of 91.6%. Among all the tests, the MLP with the genetic algorithm performed best in distinguishing individuals with and without Chagas disease (Table 4). Because the best performance measures were associated with the genetic algorithm, the other attributes were discarded.
Moreover, the performance of the MLP and Linear Regression was compared by the area under the ROC (Receiver Operating Characteristic) curve (AUC), with and without the genetic algorithm. The difference between the AUCs was not statistically significant (p > 0.05), indicating the similarity between the MLP and Linear Regression when both were linked to the genetic algorithm (Figure 1).

Figure 1.
Performance measures compared by ROC curve, with and without the genetic algorithm.

Source: Authors.
In addition, the data for both curves linked to the genetic algorithm show an AUC of 0.861 for the MLP and 0.893 for LR, with no significant difference between the two curves (p = 0.5830) (Table 5). However, although the models do not differ significantly in AUC, the boxplot diagrams of the two models indicate that the MLP (Figure 2) separates the classified instances better than Linear Regression (Figure 3).

Discussion
The use of AI has allowed significant advances in several areas of knowledge. It supports the understanding of epidemiological data through hypothesis testing, data collection and information processing, and it can establish patterns in the dynamics of diseases, which in turn influences their detection (Esfandiari et al., 2014). Computational methods have been studied as a complementary aid for more accurate diagnosis and classification of patients affected by CD, with the aim of assessing damage to the cardiovascular system, for example through the analysis of heart rate variability (Moncayo Á, Silveira AC et al., 2017). However, to our knowledge, no work similar to the one presented here has yet been carried out, that is, work building computational models based on machine learning and automatic attribute selection to identify CD from clinical and sociodemographic data. Neto et al. (2013) developed Kohonen topological maps to compare the ability of indicators extracted from electrocardiogram signals and fed into neural networks to discriminate CD patients with heart disease, indeterminate CD patients and normal individuals. The search for techniques that help solve problems related to CD or other diseases quickly and efficiently is therefore essential for enhancing monitoring strategies and health promotion actions, which will contribute more effectively to understanding the epidemiological variables involved in these diseases.
The results of the models used in this study were obtained by cross-validation, which Ishibuchi & Nojima (2013) used to evaluate the accuracy of tests. Validation thus provided the accuracy, sensitivity and specificity of the MLP and LR, with the MLP showing the best results (sensitivity: 75%; specificity: 95.8%; accuracy: 96%).
A study by Kurt et al. (2008) on the prediction of coronary artery disease compared the MLP, LR and other techniques and found that the MLP gave the best results for the purpose of that study.
The results of Kurt et al. (2008) are similar to those of the present study, which obtained 78.33% for sensitivity, 75% for specificity and 95.95% ± 5.36% for accuracy, with the MLP showing the best results. However, Shoostari & Gholamalifard (2015) reported better performance for LR than for the MLP.
Moreover, introducing the genetic algorithm for automatic attribute selection improved the models on the performance measures used, increasing their accuracy, specificity, sensitivity and AUC, as shown in the tables and figures. The study of Tao et al. (2017) also points to the relevance of genetic algorithms, which outperformed the other techniques compared in that study.
The ROC curve is used in biomedical applications to summarize the discriminatory accuracy of one or more classifiers with respect to a diagnosis and to compare models through the simultaneous analysis of sensitivity and specificity, based on how each model classifies instances (Tang & Chi, 2005). For this reason, the ROC curve was used to compare the MLP and Linear Regression models. It was used in the same way by Shoostari & Gholamalifard (2015), who sought to predict land cover change and quantify landscape change in the Neka River basin, in northern Iran. In this manner, the ROC curve assesses the correlation between variables and transitions in performance measures (Shoostari & Gholamalifard, 2015). In the present study, the DeLong method showed no statistically significant difference between the areas under the ROC curve (AUC) of the models, despite the numerical difference between the areas of the MLP and LR models. However, the boxplot diagrams show that the MLP separates the classified instances better, which indicates better performance and makes it easier to choose a cut-off point, because the region of overlap between Chagas disease patients and non-Chagas individuals is smaller.

Conclusion
Predictive methods have been used more and more frequently to capture transmission dynamics and patterns of disease symptoms. With that objective, the present work presents a unique and unprecedented study that assessed a variety of machine learning and automatic attribute selection algorithms for CD. After this comparison, the MLP and Linear Regression algorithms were selected and evaluated in more detail with the Forward Selection, Backward Elimination and genetic algorithm methods to better achieve the aim of this study.
The performance of the models was evaluated by cross-validation and presented using the classical measures of accuracy, sensitivity, specificity and area under the ROC curve. The MLP with the genetic algorithm (GA) showed the best performance, although it was statistically close to LR with GA; the behavioral differences between the two were demonstrated by the boxplot diagrams presented. The performance achieved by the models was considered interesting for CD prediction given the nature of the data collected, which require no sample of biological fluids and are easily accessible to health professionals, since the person who collects this information need not be a physician. The generated models are therefore easy to use and useful, offering a new alternative for screening new and unknown cases, leading to faster diagnosis and earlier treatment, and influencing new preventive methods.