Cluster analysis applied to the Human Development Index (HDI) of Brazilian States

This study compares the performance of hierarchical and non-hierarchical clustering methods in grouping the 27 Brazilian states according to their HDI components, and determines how many states fall in each group formed, in order to identify which technique best represents the data. Data from Atlas Brasil 2013, referring to the 2010 HDI, were used. For the hierarchical analysis, the Mahalanobis distance matrix was used with the single, complete, average, and Ward linkage methods; for the non-hierarchical analysis, the k-means method was applied. The cophenetic correlation coefficient was used to measure the degree of fit between the original similarity matrix and the matrix resulting from the simplification produced by each clustering method. The method that best represented the data was complete linkage. The grouping of the states considered the similarity between the HDI-R, HDI-L, and HDI-E variables, and this relationship produced similar groups across the linkage methods, with states from different regions of Brazil.


Introduction
The Human Development Index (HDI) was created in 1990 by two economists, the Pakistani Mahbub ul Haq and the Indian Amartya Sen, at the United Nations Development Programme (UNDP). The HDI is a summary measure of the basic conditions of a population, focusing on education, income, and quality of life. Published in Brazil for the first time in 1998, the HDI has gradually become a reference in several places around the world (Braga, 2017).
The Human Development Index in Brazil can be consulted through the Atlas platform on human development, which covers the 27 states, the Municipal Human Development Index (MDI), and the metropolitan regions, and presents more than 200 indicators of demography, education, income, work, housing, and vulnerability.
According to data from the 2013 Atlas, the Brazilian states show a considerable discrepancy in HDI, with values ranging from 0.631 to 0.824. The index can assume values between 0 and 1; the closer the HDI of a country is to 1, the more developed it is. This difference can be partly explained by the characteristics that distinguish one state (or country) from another in its geographical, economic, and infrastructure aspects (Costa, 2019).
Cluster analysis, also known as clustering, is a multivariate technique that segments data into categories or groups, placing observations with homogeneous characteristics in the same group and observations with heterogeneous characteristics in distinct groups. The technique groups data for interpretation using methods that seek mutually exclusive groups in order to summarize the information in a data set (Campos, 2019). When states are compared through this technique, the groups formed can be far fewer in number than the original set of states. Because it is a widely used technique that generates groups of states with similar characteristics, it gives a broad view of which states have similar HDI.
Cluster analysis can be carried out with hierarchical and non-hierarchical methods. The hierarchical method forms a hierarchical decomposition of the data set, producing a tree structure, whereas in the non-hierarchical approach the most widely used method is k-means (Nascimento, 2019).
Other important methods are the linkage methods: single, complete, average, and Ward. Single linkage defines the distance between two groups by their two most similar elements; complete linkage does the opposite, using the two least similar elements; average linkage treats the distance between two conglomerates as the arithmetic mean of the dissimilarities between all pairs of elements formed with one element from each of the conglomerates being compared; Ward linkage forms its groups by maximizing the homogeneity within the groups, that is, by minimizing the total within-group sum of squares (Costa, 2019).
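The four linkage criteria can be illustrated with a minimal sketch. It assumes Euclidean distance and two hypothetical one-dimensional toy groups (not HDI data); the function names are introduced here only for illustration.

```python
from itertools import product
from math import dist

def pairwise(a, b):
    """All Euclidean distances between one point of group a and one of group b."""
    return [dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    # Distance between the two most similar (closest) elements.
    return min(pairwise(a, b))

def complete_linkage(a, b):
    # Distance between the two least similar (farthest) elements.
    return max(pairwise(a, b))

def average_linkage(a, b):
    # Arithmetic mean of the distances over all cross-group pairs.
    d = pairwise(a, b)
    return sum(d) / len(d)

def ward_increase(a, b):
    """Increase in the within-group sum of squares if a and b were merged."""
    ca = [sum(col) / len(a) for col in zip(*a)]  # centroid of group a
    cb = [sum(col) / len(b) for col in zip(*b)]  # centroid of group b
    return len(a) * len(b) / (len(a) + len(b)) * dist(ca, cb) ** 2

# Hypothetical toy groups, one variable each (assumption, not HDI values).
group_a = [(0.0,), (1.0,)]
group_b = [(3.0,), (5.0,)]

print(single_linkage(group_a, group_b))
print(complete_linkage(group_a, group_b))
print(average_linkage(group_a, group_b))
print(ward_increase(group_a, group_b))
```

At each step of an agglomerative procedure, the pair of groups with the smallest value of the chosen criterion is merged, which is why the same data can yield different dendrograms under different linkages.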
Thus, this study aims to compare the performance of hierarchical and non-hierarchical methods in grouping the 27 Brazilian states by their HDI components through cluster analysis, and to determine how many states fall in each group formed, in order to identify which technique best represents the data (Barroso & Artes, 2003).

Materials and Method
Data on the 2010 HDI of the Brazilian states, taken from Atlas Brasil 2013 (Table 1), were used. The index is calculated from three dimensions, Income, Longevity, and Education, and can vary between 0 and 1: the closer to 1, the more developed the state, and the closer to 0, the less developed. For the cluster analysis we used a dissimilarity measure based on the Mahalanobis distance (D²), one of the most widely used distances (Johnson & Wichern, 2002), which can be calculated according to the following expression:

$$D^2 = (X_i - X_j)' \, \Sigma^{-1} \, (X_i - X_j)$$
where: $X_i$ is the vector of observation i; $X_j$ is the vector of observation j; $\Sigma^{-1}$ is the inverse of the residual covariance matrix of X; and $(X_i - X_j)'$ is the transpose of the difference between $X_i$ and $X_j$. $D^2$ has the property of being invariant under any non-singular linear transformation.
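A minimal worked example of this distance for two variables is sketched below. The covariance matrix and the observation vectors are hypothetical values chosen for easy arithmetic, not estimates from the HDI data, and `mahalanobis_sq` is a name introduced here. The second call rescales the first variable by 10 to illustrate the invariance property.

```python
def mahalanobis_sq(xi, xj, cov):
    """Squared Mahalanobis distance for two variables (2x2 covariance matrix)."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det == 0:
        raise ValueError("covariance matrix is singular")
    # Inverse of a 2x2 matrix.
    inv = [[d / det, -b / det], [-c / det, a / det]]
    diff = [xi[0] - xj[0], xi[1] - xj[1]]
    # inv * (Xi - Xj)
    tmp = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
           inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # (Xi - Xj)' * inv * (Xi - Xj)
    return diff[0] * tmp[0] + diff[1] * tmp[1]

# Hypothetical covariance matrix and observations (assumptions for illustration).
cov = [[4.0, 0.0], [0.0, 1.0]]
d2 = mahalanobis_sq((2.0, 1.0), (0.0, 0.0), cov)

# Same data with the first variable multiplied by 10: its variance scales by 100.
cov_scaled = [[400.0, 0.0], [0.0, 1.0]]
d2_scaled = mahalanobis_sq((20.0, 1.0), (0.0, 0.0), cov_scaled)

print(d2, d2_scaled)  # both are about 2, unlike the squared Euclidean distances 5 and 401
```

With an identity covariance matrix the expression reduces to the squared Euclidean distance, so the covariance term is what removes the effect of scale and correlation between variables.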
Because they are widely used in practice and readily available in statistical software, the following clustering algorithms were used: single linkage, in which the distance between two groups is defined by their two most similar elements; complete linkage, in which it is defined by their two least similar elements; average linkage, which treats the distance between two conglomerates as the average of the distances between all pairs of elements that can be formed with the elements of the two conglomerates being compared; and Ward linkage, which forms the groups by maximizing the homogeneity within the groups, that is, by minimizing the total within-group sum of squares (Costa, 2019). The results are presented as dendrograms.
The cophenetic correlation was used to measure the degree of fit between the original similarity matrix and the matrix resulting from the simplification provided by the clustering method, where Cij is the similarity value between individuals i and j obtained from the cophenetic matrix, and Sij is the similarity value between individuals i and j obtained from the similarity matrix.
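The cophenetic correlation coefficient is commonly written as the Pearson correlation between the corresponding entries of the two matrices. Assuming that standard form, with Cij and Sij as defined above and with $\bar{C}$ and $\bar{S}$ denoting the means of the respective entries over all pairs i < j (notation introduced here):

$$r_{coph} = \frac{\sum_{i<j} (C_{ij} - \bar{C})(S_{ij} - \bar{S})}{\sqrt{\sum_{i<j} (C_{ij} - \bar{C})^2}\,\sqrt{\sum_{i<j} (S_{ij} - \bar{S})^2}}$$

Values close to 1 indicate that the dendrogram preserves the pairwise structure of the original matrix well.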

Results and Discussion
In the cluster analysis, differences were observed between the methods applied, both hierarchical and non-hierarchical, each with its advantages and disadvantages. The hierarchical method has the advantage of allowing different dissimilarity measures; its disadvantage is the need to reduce the influence of outliers. The non-hierarchical method has the advantage of handling very large data sets and being less affected by outliers, but its disadvantage is that the initial centroids are chosen at random, which makes the hierarchical method superior to it.
When the Mahalanobis distance matrix was analyzed with the single linkage, complete linkage, average linkage, and Ward methods, a small change in the levels of the grouped elements was observed; that is, elements within the same group can be joined in a different order when the method changes, although the grouping structures are generally quite similar.
When analyzing the dendrograms (Figures 1 to 4; see also Table 2), the presence of five groups was verified without using a cut-off criterion. Each dendrogram presented grouping structures of homogeneous states. The dendrograms differ in some aspects for each group of states: some groups are identical across methods, and the numbers of states in each group are similar.
The most appropriate hierarchical method was complete linkage applied to the Mahalanobis distance matrix, judged by the cophenetic correlation coefficient (CCC) found for each hierarchical method (Table 3); a coefficient greater than 0.70 indicates a better distinction of the groups. For the complete linkage dendrogram the CCC was 0.728, which corresponds to 72.8% consistency in the clustering. States within the same group can be grouped in other ways when the method is changed, and the number of states in each cluster differs between methods, but the overall cluster structure is quite similar, with only minor changes in the grouped states.
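How a cophenetic correlation of this kind can be obtained is sketched below. This is an illustration under stated assumptions, not the authors' computation: it uses a hypothetical 4 x 4 dissimilarity matrix instead of the Mahalanobis matrix of the states, a naive complete-linkage procedure, and helper names introduced here.

```python
from itertools import combinations
from math import sqrt

def cophenetic_complete(d):
    """Naive complete-linkage clustering on a dissimilarity matrix d.

    Returns a dict mapping each pair (i, j), i < j, to its cophenetic
    distance, i.e. the height at which i and j first share a cluster.
    """
    clusters = [[i] for i in range(len(d))]
    coph = {}
    while len(clusters) > 1:
        best = None
        for a, b in combinations(range(len(clusters)), 2):
            # Complete linkage: largest distance between members of a and b.
            height = max(d[i][j] for i in clusters[a] for j in clusters[b])
            if best is None or height < best[0]:
                best = (height, a, b)
        height, a, b = best
        # Every pair joined by this merge gets the merge height.
        for i in clusters[a]:
            for j in clusters[b]:
                coph[(min(i, j), max(i, j))] = height
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return coph

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / sqrt(sxx * syy)

def cophenetic_correlation(d):
    coph = cophenetic_complete(d)
    pairs = sorted(coph)
    original = [d[i][j] for i, j in pairs]
    fitted = [coph[p] for p in pairs]
    return pearson(original, fitted)

# Hypothetical symmetric dissimilarity matrix for four objects (not HDI data).
toy = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 3.0, 6.0],
    [4.0, 3.0, 0.0, 2.0],
    [5.0, 6.0, 2.0, 0.0],
]
ccc = cophenetic_correlation(toy)
print(round(ccc, 3))
```

Repeating the calculation with another linkage criterion in place of the `max` gives the coefficient for that method, which is the kind of comparison summarized in Table 3.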
For a clearer result, each method is analyzed individually below, taking into account each group formed and the average of each variable.

Single linkage:
Composed of five groups, it presents three unitary groups and two groups with states from different regions of Brazil (Figure 1). In this method the states that stood out were:
• Rondônia: Located in the North region, with an HDI = 0.690 and an estimated population of 1,815,278; its main economic activities are agriculture, cattle breeding, the food industry, and plant and mineral extraction.
• Amazonas: With 4,144,597 inhabitants and an HDI = 0.674, its economy is based on the primary sector, with extractive activities (animal, mineral, and plant) as the highlight.
• Mato Grosso do Sul: Located in the Central-West region of Brazil, with 2,778,986 inhabitants and an HDI = 0.729; its main economic activity is still agriculture and cattle raising.
Research, Society and Development, v. 11, n. 2, e18011225747, 2022 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v11i2.25747

Complete linkage:
Composed of five groups, it presents one unitary group and four groups with states from different regions of Brazil (Figure 2). In group two, the states that stood out in this method were:
• São Paulo: Located in the Southeast region of the country, it has an estimated population of 46,649,132 inhabitants and an HDI = 0.783, the second highest among the states. The state has a diversified economy responsible for about one-third of Brazil's GDP.
• Alagoas: Located in the Northeast region, with an estimated population of 3,365,351 inhabitants, its economy spans agriculture (pineapple, coconut, sugar cane, beans, etc.), industry (construction, food, etc.), and tourism, which has increased in recent years; even so, its HDI = 0.631 is the lowest in the country.
• Tocantins: The economy of Tocantins is based on an aggressive expansionist model of agro-exports, that is, agricultural products (rice, soy, pineapple, corn, and others).
• Pernambuco: Located in the Northeast region of the country, with an HDI = 0.673; its industrial production is among the largest in the North-Northeast, in the naval, automobile, chemical, metallurgical, flat glass, electro-electronic, non-metallic minerals, textile, and food sectors, and it is also known today as the largest producer of guava and acerola.
• Ceará: Also located in the Northeast region of the country, with an HDI = 0.682 and an estimated population of 9,240,580 inhabitants; its economy has been growing and stands out in agriculture (beans, corn, rice, herbaceous cotton, tree cotton, cashew nuts, sugar cane, cassava, castor beans, tomatoes, bananas, oranges, coconuts, and, more recently, grapes) and in industry (clothing, food, metallurgy, textiles, chemicals, and footwear).
• Distrito Federal: Home to Brasília, the capital of Brazil, it has the highest HDI = 0.824. Located in the Central-West region of the country, it is an important economic center whose main economic activity results from its administrative function. The estimated population is 3,094,325 people.
• Mato Grosso: Located in the Central-West region, with an estimated population of 3,567,234 people; the state is the largest national producer of grains and one of the main producers and exporters of soybeans in Brazil, and it also produces other crops (beans, sugar cane, corn, cotton, sunflower, cassava).
• Rio de Janeiro: Situated in the Southeast region of the country, with an estimated population of 17,463,349 people; a large part of the state's economy is based on services, with a significant share of industry and little influence from the agricultural sector.
• Espírito Santo: Located in the Southeast region, with an estimated population of 4,108,508 people; its economy stands out in agriculture, cattle breeding, and mining, with agricultural production of sugar cane, oranges, and coffee. The main industrial sectors are oil and natural gas extraction, extraction of metallic minerals, construction, metallurgy, and industrial public utility services such as electric power and water.
• Santa Catarina: Located in the South region of the country, with an estimated population of 7,338,473 people; its economy is based on industry (especially agro-industry, textiles, ceramics, and metal-mechanics), mineral extraction, and cattle raising.
• Paraná: The main economic activities are agriculture (sugar cane, corn, soy, wheat, coffee, tomatoes, manioc), industry (agro-industry, automobile, paper and cellulose), and plant extraction (wood and yerba mate). It has an estimated population of 11,597,484 people.
• Pará: Located in the North region, with an estimated population of approximately 8,602,865 inhabitants; its economy is based on the provision of services and on commercial and agricultural activities.
• Maranhão: Located in the Northeast region, with approximately 7,075,181 inhabitants; its economy is based on agricultural, industrial, and mineral activities.
• Rio Grande do Norte: Also in the Northeast region, with approximately 3,506,853 inhabitants; its HDI = 0.684 is considered the highest in its region, and its economy is based on commerce, the textile industry, agribusiness, tourism, and the extraction and processing of oil.
• Rio Grande do Sul: Located in the South region of Brazil, with a population of approximately 11,377,239 inhabitants; its main source of income is agribusiness and farming.
In the non-hierarchical analysis, the k-means method was applied to obtain new groups, and five groups were obtained (Figure 5; source: authors).
The five groups found in this scatter diagram are used to explain the k-means method, as shown in Table 4.
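The k-means procedure can be sketched as follows. This is a minimal illustration, assuming Euclidean distance, randomly chosen initial centroids drawn from the data, and a hypothetical two-group toy data set rather than the state-level HDI values; `k_means` and its parameters are names introduced here.

```python
import random
from math import dist

def k_means(points, k, seed=0, max_iter=100):
    """Basic k-means (Lloyd's algorithm) with random initial centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids picked at random
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [min(range(k), key=lambda c: dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels, centroids

# Hypothetical, well-separated toy observations (not HDI data).
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centroids = k_means(toy_points, k=2)
print(labels)
print(centroids)
```

Because the starting centroids are random, different seeds can lead to different partitions on less clearly separated data, which is the disadvantage of the non-hierarchical method noted earlier.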

Conclusion
In this work, cluster analysis was used because it is an important tool that allows individuals to be classified based on the similarity observed between the variables under study.
The study evaluated hierarchical and non-hierarchical methods, observing the behavior of each Brazilian state according to the Human Development Index (HDI) available in Atlas Brasil 2013. The cluster analysis considered the variables HDI-R, HDI-L, and HDI-E. With the methods used, five groups were identified for both the hierarchical and the non-hierarchical techniques, in each of the methods, with states from different Brazilian regions.
One similarity observed in the hierarchical technique is that, across all the linkage methods evaluated, the state of Rondônia remained isolated in its own cluster. Another is that, regardless of their HDI, the states of São Paulo and Acre remained in the same group. In the non-hierarchical technique, with the k-means method, what drew most attention was that the states with the highest HDI remained together in the same group, as did the states with the lowest HDI.

Support and Acknowledgments
The authors thank the Institutional Scientific Initiation Scholarship Program (PIBIC) of UEPB, which provided an initiation scholarship for this work.