Forecasting Dengue, Chikungunya and Zika cases in Recife, Brazil: a spatio-temporal approach based on climate conditions, health notifications and machine learning

Dengue has become a challenge for many countries. Arboviruses transmitted by Aedes aegypti spread rapidly over the last decades. The emergence chikungunya fever and zika in South America poses new challenges to vector monitoring and control. This situation got worse from 2015 and 2016, with the rapid spread of chikungunya, causing fever and muscle weakness, and Zika virus, related to cases of microcephaly in newborns and the occurrence of Guillain-Barret syndrome, an autoimmune disease that affects the nervous system. The objective of this work was to construct a tool to forecast the distribution of arboviruses transmitted by the mosquito Aedes aegypti by implementing dengue, zika and chikungunya transmission predictors based on machine learning, focused on multilayer perceptrons neural networks, support vector machines and linear regression models. As a case study, we investigated forecasting models to predict the spatio-temporal distribution of cases from primary health notification data and climate variables (wind velocity, temperature and pluviometry) from Recife, Brazil, from 2013 to 2016, including 2015’s outbreak. The use of spatio-temporal analysis over multilayer perceptrons and support vector machines results proved to be very effective in predicting the distribution of arbovirus cases. The models indicate that the southern and western regions of Recife were very susceptible to outbreaks in the period under investigation. The proposed approach could be useful to support health managers and epidemiologists to prevent outbreaks of arboviruses transmitted by Aedes aegypti and promote public policies for health promotion and sanitation.


Introduction
Prevention and control of dengue fever, chikungunya fever and zika has been a major public health challenge for many countries. Since 2015 other arboviruses have interacted with the dengue virus, which has spread rapidly over the past two decades (de Lima et al., 2016; Bhatt et al., 2013. It is estimated that around 390 million new cases of dengue occur each year. However, problems such as misdiagnosis and inaccurate reporting or absence of case reporting in many regions can contribute to the underestimation of the impact of dengue and other arboviruses transmitted by the mosquito Aedes aegypti . The emergence of other arboviruses, such as chikungunya fever and zika, especially in South America, poses new challenges to vector monitoring and control. This situation worsens from 2015 and 2016, with the rapid spread of chikungunya, causing fever and muscle weakness, among other symptoms, and the emergence of Zika virus, partially related to cases of microcephaly in newborns and directly related to the occurrence of Guillain-Barret syndrome, an autoimmune disease that affects the nervous system, ranging from muscle weakness to paralysis (Cao-Lormeau et al., 2016).
Dengue is a viral infection transmitted to humans through mosquitoes, and is spreading rapidly around the world. Its primary vector is the mosquito Aedes aegypti, a species well adapted to urban areas and distributed mainly in tropical and subtropical regions, but also operating in North America and Europe. Evidence indicates that a secondary vector, the mosquito Aedes albopictus, has also been expanding its geographic range (de Lima et al., 2016; Bhatt et al., 2013. The risk of arbovirus outbreaks and their endemic presence is higher in tropical and subtropical regions, but is also increasingly present in North America and Europe, due to the presence of mosquitoes Aedes and the introduction of viruses (de Lima et al., 2016; Bhatt et al., 2013.
The transmission of arboviruses is a complex process that involves the interaction of multiple agents: human populations, mosquitoes and viruses conditioned by climatic and environmental factors in a very heterogeneous space. The space in which these interactions take place is complex enough that the study of arboviral transmission is fraught with challenges.
Arborovirus pandemics have been favored by a combination of several factors: the global mobility of human populations and mosquito circulation; the swelling of overcrowded urban areas; the difficulty of access by urban populations, especially the economically disadvantaged sectors, to basic sanitation, regular water supply, and the public health system; environmental and climatic factors, such as temperature and rainfall, which measure rainfall density and occurrence; and, finally, the inefficiency of vector control strategies (de Lima et al., 2016; Gubler, 2011; Mohammed & Chadee, 2011.
Several research groups have been dedicated to building risk maps and estimating the global distribution of arboviruses and their correlation with environmental data. Despite the importance of these efforts to map the distribution of these diseases, it is also important to understand the dynamics of arboviruses on a local scale, which is done through mathematical and computational models (Padmanabhan et al., 2017; Jindal & Rao, 2017; de Lima et al., 2016. Local climatic conditions, such as temperature, rainfall and humidity, interfere with vector development, from hatching to mosquito life and dispersal, and other aspects of arboviral transmission (de Lima et al., 2016; Gubler, 2011. The advancement of Digital Epidemiology and geoprocessing technologies, coupled with the development of Data Mining and Machine Learning techniques, have provided rapid monitoring, control and simulation of disease spread, assisting public health systems in controlling epidemics and of the environmental and behavioral factors that favor the vectors of these diseases (Salathe et al., 2012; Beltrán et al., 2018; Musah et al., 2019; Rubio-Solis et al., 2019; Kostkova et al., 2019. In Brazil, arboviruses have received special attention from the Unified Health System through public health policies and campaigns (Pessanha et al., 2009). In Recife, the Recife Municipal Health Secretariat, through its Open Data Portal, distributes the mapping of diseases and symptoms by health unit and the patient's neighborhood of origin since 2015. The Pernambuco Water and Climate Agency, APAC also provides a geographic information system where the daily and monthly rainfall series are published since 2006, by city and, in the case of Recife, by neighborhood.
Machine learning techniques have been shown to be useful to support the diagnosis and prediction of prognosis of different diseases based on biomedical signs and images and different clinical parameters (Commowick et al., 2018; S. M. de Lima et al., 2016; Santana et al, 2018; Cordeiro et al., 2016; Barbosa et al., 2021; de Souza et al., 2021; Pereira et al., 2021.
Additionally, machine learning-based regression techniques have been successfully used for temporal and spatiotemporal prediction of contagious diseases such as Covid-19 (da Silva et al., 2021; de Lima et al., 2020. We believe that this set of techniques can achieve good accuracy results when adapted to arboviruses, including not only the spatial and temporal windows of the number of cases, but also climatic and environmental variables. The objective of this work was to construct a tool to forecast the distribution of arboviruses transmitted by the mosquito Aedes aegypti by implementing dengue, zika and chikungunya transmission predictors based on machine learning, focused on multilayer perceptrons neural networks, support vector machines and linear regression models. As a case study, we investigated forecasting models to predict the spatio-temporal distribution of cases from primary health notification data and climate variables (wind velocity, temperature and pluviometry) from Recife, Brazil, from 2013 to 2016, including 2015's outbreak. Multiplayer perceptrons demonstrated to be the most adequate models, reaching considerable high correlation coefficient values and percentual errors lower than 5%.

Proposed Method
In this work, we propose a prototype of a system for spatio-temporal prediction of the distribution of cases of arboviruses, i.e. dengue, chikungunya and zika. The main hypothesis of this work is that the monthly average measurements of temperature and wind speed, and the number of arbovirus cases per two months by geographic location, considering a 12-month prediction window, can be used to predict the spatial and temporal distribution of dengue, chikungunya and Zika cases. As a case study, we used the climatic variables obtained from the national meteorological systems, and the case information by Therefore, this research is characterized as a quali-quanti case study, ie a case study that combines qualitative aspects (visual analysis from the spatial distributions generated by the geographic information system) and quantitative aspects (regression evaluation indices and other statistics of interest) in your analysis. The following subsections present in detail the databases, data preparation and pre-processing, machine learning methods used to build the predictive models, the geographic information system, and quality indices.

Area under study
The area delimited for this study was the City of Recife (  Source: Authors.

Mapping of arbovirus cases
Data on arbovirus cases were obtained through the Open Data Portal of Recife City (http://dados.recife.pe.gov.br/), which contains the records of the number of cases of Dengue, Zika. and chinkugunya from 2013 to 2016. For each two months of each year, the number of arboviral cases in each of the 94 districts of Recife was counted separately. From the information on the number of cases in each neighborhood, a vector layer of points was generated, shapefile (.shp), geographically locating the number of cases to each neighborhood of the city geographically, as can be observed in the map on the right in Figure 2. In order to estimate the distribution of arboviral cases throughout the municipality, the QGIS interpolation tool was used, in which the interpolation method selected was the inverse distance interpolation. As a result of the interpolation, we obtained a raster image Research, Society and Development, v. 10, n. 12, e452101220804, 2021 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v10i12.20804 6 (.tif) that can be observed in the map on the left in Figure 2. Rasters were generated for each quarter from 2013 to 2016, where each raster represents the distribution map of the cases of arboviruses.

Mapping of climate variables
Climatic factors such as rainfall and temperature are among the causes of an increase in arboviruses. Mosquito behavior is determined by weather conditions. This is because rainfall, temperature, and humidity affect the interaction of biological and viral vectors throughout life, mating age, spread, feeding, and faster viral replication (Morin et al., 2013; LaDeau et al., 2015. . (1) For the monthly accumulated rainfall, the maximum value considered was the maximum value recorded between the three monitoring stations, while the average considered was the average of the accumulated rainfall between the three monitoring stations. With information on climate variables in all neighborhoods of Recife, the shapefiles were generated for each of the variables for each month from 2013 to 2016. Finally, the inverse distance interpolation tool was used to estimate the spatial distribution of climatic variables throughout the Recife territory.

Linear Regression
The linear regression is the simplest method to predict numeric values. In this method, it is assumed that the data has a linear behavior, and that the prediction variable can be represented as a linear combination of the attributes with their pre-determined weights (Witten & Frank, 2005). Thus, the general model of linear regression is represented by the Equation 2.
where y is the prediction variable; x1,x2,...,xn, represent the values of the attributes and w0,w1,w2,...,wn represent the weights of each attribute. The idea of the linear regression algorithm is, then, to find the optimal weights that best represent the problem.
One of the ways to find the optimal weights is to minimize the sum of the squared difference between the predicted value and the actual value (Witten & Frank, 2005). The sum of the squared difference is calculated by Equation 3: (3)

Artificial Neural Networks
Artificial neural networks (ANN), consists in a machine learning technique based on the behavior of the human brain (Siriyasatien et al., 2018). The neural networks consist of smaller units, artificial neurons, which are fundamental to their functioning. The artificial neurons contains the following elements: (1.) a set of synapses or connectors where a signal xi at the entrance to the synapse j connected to the k neuron is multiplied by the synaptic weight wk,j (2.) an adder to add the input signals, weighted by the respective neuron synapses; (3.) an activation function to limit the output of a neuron (Haykin, 2001).
Mathematically, an artificial neuron is represented by the Equation 4 and by the Equation 5: wherein x1,x2,...,xn represent the input signals; wk,1,wk,2,...,wk,n represent the synaptic weights of the input signals xi for the k-th neuron; bk, is the term bias and φ is a neuron activation function. In regression applications, the inputs x1,x2,...,xn of the input layer correspond to the forecasting window. For instance, in case of temporal forecasting, the inputs are observed time window of the time series.
The network architecture used in this work was the Multilayer Perceptron (MLP). In this configuration, the neural network has an input layer, two or more hidden layers and an output layer (Haykin, 2001). ANNs have also been widely used to predict disease cases. For example, in the prediction of dengue cases in the city of São Paulo, Brazil (Baquero et al., 2018). They were also used to predict dengue outbreaks in the northeastern coast of Yucatán, Mexico, and in San Juan, Puerto Rico (Laureano-Rosario et al., 2018). Moreover, the ANNs were applied to model cases of infection by Salmonella in the state of Mississippi, USA (Akil and Ahmad, 2016).

Support Vector Regression
The support vector regression is a supervised machine learning technique for data analysis and pattern recognition. The idea of the SVR algorithm is to find the best hyperplane defined by Vapnik's ε-insensitivity loss function. When this hyperplane is found, a linear regression is applied to the corresponding hyperplane. In situations where the problem is linearly separable, the best hyperplane is given by the equation: where w = (w1,w2,...,wn) T is the vector of weights, x = (x1,x2,...,xn) T is the feature vector, and b is the bias. For problems that are not linearly separable, the data is mapped to a hyperplane in a larger dimension. Thereupon, the algorithm seeks to solve the problem by applying the linear regression of the equation 6 in the corresponding hyperplane. For nonlinearly separable problems, SVR machines use kernel functions, K : R×R → R. Then, the SVR output assumes the following expression: where the kernel function can be polynomial, sigmoidal, Gaussian, or even assume other mathematical expressions (Drucker et al., 1997;Witten and Frank, 2005;Smola and Schölkopf, 2004).

Metrics
The main metrics we adopted to evaluate the models are the following: the correlation coefficient and the Relative Quadratic Error (RMSE percentage). The correlation coefficient is a statistical measure between expected and forecasted values.
This value varies from -1 to 1. When it approaches 1, it indicates a strong positive correlation. Conversely, when the correlation coefficient is close to -1, it indicates that the variables have a strong negative correlation. When the correlation coefficient is close to zero, it indicates that there is no correlation between the variables (Witten and Frank, 2005).
The Pearson's Correlation Coefficient R is defined as follows: where p¯ and a¯ are the sample average values for the sets of predicted and actual values, respectively. Similarly, the Spearman's Rank Correlation Coefficient ρ is defined as following: where R(pi) and R(ai) are the ranks of pi and ai, whilst R¯(p) and R¯(a) are the sample averages of the ranks of pi and ai, respectively.
The Kendall's Rank Correlation τ is given as follows: where n is the number of observations and 1 ≤ i,j ≤ n. The signal function, sign, is defined as following: , for x ∈ R.

Forecasting Set
The prediction sets were assembled from the distribution maps of arboviruses cases and climatic variables for each two months. The bimonthly prediction model was chosen due to the fact that the Brazilian Unified Health System (SUS) is planning to combat arbovirus outbreaks considering the bimester cycle. The attribute vectors of the prediction sets were assembled by simultaneously scanning the spatial distribution maps pixel by pixel and concatenating latitude and longitude with the following information, in the following order: distribution of arbovirus cases, and for each month of the bimester, the temperature distribution, rainfall and wind speed. Each prediction vector contains information from the six quarters preceding the prediction quarter. Therefore, 18 prediction sets were assembled, each with 15,553 instances and 44 attributes, in which the output of each prediction set is the pixel value of the arbovirus case distribution at the corresponding coordinate. The 15,553 instance sets were established as test sets.
The Weka machine learning environment (Frank et al., 2004;Hall et al., 2009), version 3.8, was used to assemble the training set from the resample tool. This tool allows a new database to be created with random values for instances, but with the same statistical characteristics as the original database. The number of instances of the new base must be specified. In this case, the training sets were generated by applying resample to each of the prediction sets with the number of instances equivalent to 30% the original set. Sets containing 15,553 instances were used to test the models created by the best regressor.
From the training set, we investigated the best regression architectures for predicting the distribution of arbovirus cases, namely: Linear Regression, Support Vector Machine (SVM) and Multilayer Perceptron (MLP) with a single hidden layer. For the SVM regressor, evaluations were performed with the following configurations: C = 0.1 and linear (or degree 1), 2 and 3degree polynomial kernels, and RBF kernel. Regarding single layer MLP, we investigated architectures with 10, 20, 30 and 40 neurons in the hidden layer.

Results and Discussion
We evaluated each of the regressors in 30 rounds using 10-fold cross-validation. For the quantitative evaluation, we calculated the Correlation Coefficient (R), the Absolute Mean Error (MAE), the Mean Square Error (RMSE) and the Percent Relative Quadratic Error (RMSE percentage). However, the data were analyzed considering only the correlation coefficient, as global quality, and relative quadratic error as local quality metric. The detailed results of R, RMSE% and training time of each regressor are shown in Tables 1, 2 and 3. In this paper, we consider a high correlation coefficient to be above 0.9 and a low relative squared error to be below 5%. Best results are highlighted in red.
In Table 1, the results show that the linear regression presents satisfactory values for the correlation coefficient R, with average 0.97 and standard deviation of 0.03, and for the training time, as average of 0.05 and standard deviation. 0.03 and is therefore considered a very fast prediction method. On the other hand, the relative square error RMSE% presents a considerably high value, with an average of 21.23% and standard deviation of 12.11%.  with averages around 0.999 and 1, and standard deviation of 0.001. Regarding the relative quadratic error, we observed that the 10-neuron configuration has a fairly low RMSE% with an average of 4.15% and this value decreases as the number of neurons in the hidden layer increases, reaching a minimum value of 3.29% in the 30-neuron configuration, followed by a considerable increase to 3.67% in the 40-neuron configuration. The behavior of training time shows an increase as the number of neurons in the hidden layer increases. Thus, considering the evaluation metrics, the best network configuration among the ones evaluated was the network with 30 neurons because of its high correlation coefficient, relative square error satisfactorily below the established limit of 5% and having a training time reasonably low.  configurations were quite slow for 2-and 3-degree polynomial kernels, with a major disadvantage of 3-degree polynomials in its speed and stability. In contrast, for the linear and RBF kernels, the results showed that they are relatively fast compared to the training time of neural networks. However, they are considerably slower compared to linear regression. Taking into consideration the evaluation metrics, the 2-degree polynomial kernel is the best SVM configuration due to its high correlation coefficient R, low RMSE% and shorter training time among configurations that meet the requirements of R and RMSE% set for this task. Overall, by evaluating all tested architectures, we can observe that the multilayer perceptron is a regressor that meets the needs of the prediction problem in question. As mentioned in this section, the configuration with 30 hidden layer neurons reached the best evaluation because it has a high correlation coefficient R, RMSE% below 5%, and reasonably short training time when compared to other regressors. Although training time is not critical in this type of problem, it was adopted as an important criterion to select the best regressor. After all, despite having very good correlation coefficients and very low RMSE%, 2-and 3-degree polynomial kernels SVM settings achieved very high values for training time. On the other hand, configuring SVM with RBF kernel has proved to be quite unsuitable for solving this problem. The training time for this configuration was considerably shorter than the MLP training time with 30 hidden layer neurons. However, as seen in Table 3, the RMSE% reached very high values, far above the established value as adequate. Table 4 presents the results for MLP with 30 neurons in the hidden layer, considering the training set with 4,665 instances and the test set with 15,553 instances. This table also presents results for RBF-kernel SVM, considering these same sets. The metrics used to quantitatively evaluate the models are the correlation coefficient R and the relative square error (RMSE%). The qualitative evaluation of the models generated the distribution maps of arboviruses cases on the Recife map for each two months of 2014, 2015 and 2016.
For qualitative analysis, we generated the prediction images from the results obtained in the model validations. Figures   3, 4 and 5 correspond to the bimonthly predictions using MLP with 30 hidden layer neurons in 2014, 2015 and 2016, respectively. Table 4: Validation results of the prediction models created by the multilayer perceptron, with a single layer and 30 neurons in the hidden layer.
The warmer regions represent the areas with the highest concentrations of arbovirus cases, while the colder areas represent low case rates. The numeric labels of Recife's neighborhoods are shown in Table 5. Research, Society andDevelopment, v. 10, n. 12, e452101220804, 2021 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v10i12.20804 13 In Figure 3a, we can see that there was a higher concentration of arbovirus cases in the southern region of Recife. The most affected neighborhood in this area was Cohab, followed by Ibura and Boa Viagem. In the west of the city, the neighborhood with the highest rate of cases was Várzea. In the northernmost region of the city, the most affected neighborhoods were Dois Unidos and Casa Amarela. In the second quarter of the same year, Figure 3b, we can observe that the southern region of the city remains the most affected region. However, there is an increase in arbovirus cases in the Boa Viagem neighborhood and a decrease in cases in Cohab. We can also perceive an increase of cases in the neighborhood of Imbiribeira towards Pina. In the northern part of the city, it is possible to identify a significant increase in the neighborhoods of Dois Unidos, Linha do Tiro, Macaxeira, Nova Descoberta, and Vasco da Gama. In the third quarter of 2014 (Figure 3c), in the south zone, there is a considerable decrease in cases in Boa Viagem, Ipsep and Imbiribeira. However, the opposite occurs in the neighborhood of Ibura and Cohab. In the west of the city, Várzea also shows a remarkable reduction of cases. Cohab. In the second and third bimesters, Figures 4b and 4c, respectively, the points with the highest incidence are the neighborhoods of the southern zone, especially Cohab, Ibura and Boa Viagem. In the west, there are also cases in the Várzea neighborhood and, more to the southwest of the city, in Jardim São Paulo as well. In the fourth bimester of 2015, the situation is more controlled, where cases were concentrated only in Ibura and Cohab (see Figure 4d). During the 5th and 6th bimesters (Figures 4e and 4f, respectively), the situation worsens again. In the case of the 5th bimester, the neighborhoods located more in the center of the city had a significant increase in cases, especially the districts of  In the southern zone, the highest concentrations of cases occurred in Cohab and Ibura, with less intensity in the Ipsep, Boa Viagem and Imbiribeira neighborhoods. In the west of the city, the most affected neighborhoods were Várzea and Iputinga. In the third bimester of 2015 (Figure 5c), there was a considerable decrease in cases in the southern region of the city, focusing only on the neighborhoods of Cohab and Ibura. In the west of the city, it is also possible to notice a reduction of cases in the Várzea neighborhood. However, cases in the neighborhoods of Iputinga, Cordeiro and Torrões intensified. In the next two months (Figure 5d), the situation gets even worse. In the 4th bimester, the cases were mainly concentrated in the south towards the west of the city. The most affected neighborhoods were Cohab, Imbiribeira, Boa Viagem, Areias, Varzea, Iputinga, and Cordeiro.
In the fifth bimester of 2015 (Figure 5e), the behavior of the distribution of arbovirus cases was quite similar to the previous bimester. There is a decrease in cases in some neighborhoods of the southern region such as Ibura and Imbiribeira. In contrast, Boa Viagem showed a significant increase in the number of cases from the fourth to the fifth bimester. Comparing the map of Figure 5e with the map of Figure 5f, we can see that, from the southern region of the city towards the western region, the neighborhoods located in this region showed an increase in arbovirus cases. Finally, in the last two months of 2016 (Figure 5f), the prediction showed an improvement in the situation. The southern zone showed a significant reduction in arboviruses, except for the Cohab neighborhood. In the western region of the city, we can also notice a significant decrease in cases, except for Várzea and Iputinga. Overall, the prediction maps using MLP with 30 neurons showed that the main regions of Recife with high concentration cases are the west and south regions. In the western region, the neighborhood that appears most frequently with regard to the highest concentration of cases is the Várzea neighborhood. In the southern region, the neighborhoods that appear most frequently at the highest concentration of cases are the neighborhoods of Cohab, Ibura, Imbiribeira, and Boa Viagem. The prediction maps presented are similar to the actual distribution maps of arbovirus cases. This corroborates the quantitative analysis metrics of the chosen regression method.

Conclusion
The use of machine learning predictors proved to be very effective in predicting the distribution of arbovirus cases.
According to the qualitative results presented in the section 3, the regions in which arbovirus outbreaks transmitted by Aedes aegypti predominate are the southern and western regions of Recife. In the western region, the neighborhood that appears most frequently with regard to the highest concentration of cases is the Várzea neighborhood. In the southern region, the neighborhoods that appear most frequently at the highest concentration of cases are the neighborhoods of Cohab, Ibura, Imbiribeira and Boa Viagem. Already the northern region of the city appears with a high concentration of cases in the first two quarters of the year. Although there are cases throughout the year, it was also observed that arbovirus cases usually occur predominantly in the warmer months of the year (October to March).
Finally, the approach using spatio-temporal analysis provided a broader assessment of those regions where more or less arboviral outbreaks occur. From the qualitative results it was possible to differentiate in the heat maps the regions with very high concentration of cases from the regions with low concentration and the regions that are in the transition range. This type of approach is very relevant in supporting health managers and epidemiologists in the planning of short and medium term actions to prevent outbreaks of arboviruses transmitted by Aedes aegypti, and may also support the development of public policies for health promotion and sanitation.
As future work, we intend to evaluate new learning machines: statistical learning methods, random forests, classifier committees, meta-classifiers, and approaches based on hybrid architectures combining deep learning and linear regression methods. We also intend to expand the period observed until 2020, even considering that, in 2020, the data are underreported due to the Covid19 pandemic and the overload of the public health system in Recife. Additionally, we intend to use the proposed methodology to also predict in time and space the location of potential breeding sites for the Aedes aegypti mosquito.