Comparison between similarity coefficients with application in forest sciences

Multivariate statistics have been used in divergence studies of plant species. Analysis of the similarity or distance among individuals is an important tool in population studies. This paper presents the main similarity and dissimilarity coefficients, their properties, and the importance of the axioms satisfied by the complement of a similarity for cluster analysis methods. We evaluated the changes produced by five different similarity coefficients in the clustering of 11 plots and 17 species. We tested the Jaccard, Sorensen-Dice, simple matching, Russell and Rao, and Rogers and Tanimoto coefficients, comparing them through cophenetic correlations, the Rand and adjusted Rand indices and the stress between the distances obtained from these coefficients, as well as through dendrograms (visual inspection), projection efficiency in two-dimensional space and the groups formed by the average linkage method. The results showed that the use of different similarity coefficients caused few changes in the assignment of plots to groups and in the validation obtained between similar plots. Although these coefficients caused few changes in the structure of the most dissimilar groups, they altered some relationships between plots with high similarity.


Introduction
Multivariate statistical techniques have been widely used in forestry studies that involve climate, soil, relief and vegetation variables simultaneously. These techniques are used for ordination, to determine the influence of environmental factors on the composition and productivity of the site, and for grouping, for the purpose of classification.
When the objective is to form groups, a large number of (dis)similarity coefficients are found in the literature (Jaccard, Sorensen-Dice, simple matching, Russell and Rao, Rogers and Tanimoto), and different coefficients can be observed in use for the same or different purposes. However, not all authors justify the choice of a given coefficient; that is, the choice is subjective and can compromise the nature of the analysis.
Similarity measures are numerical quantities that quantify the degree of association between pairs of objects (individuals, items, etc.). A function s is considered a measure of similarity if, for all xi and xj, it satisfies the following properties: s(xi, xj) = s(xj, xi) (symmetry); 0 ≤ s(xi, xj) ≤ 1; and s(xi, xi) = 1. The numerical result of the analysis is obtained from the association matrix, which does not necessarily reflect all the information originally contained in the data matrix, as the objects or descriptors are represented in a reduced space. This underscores the importance of choosing an appropriate measure of association for the purpose of the analysis. Therefore, the following considerations must be taken into account: 1. The nature of the study (i.e., the initial question and the hypothesis) determines the type of structure that must be evidenced through an association matrix and, consequently, the type of (dis)similarity measure to be used.
2. Measures are represented by different mathematical equations and, in association matrix analysis, coefficients with specific mathematical properties are often required.
3. It is also necessary to consider the computational aspect, and, therefore, the choice of coefficient often depends on its availability in the computational package or on the user's ease in programming it.
Consider the comparison of a pair of elements (i and j) based on the results of q binary variables, each coded so that it can assume the values 0 or 1 (for example, 0 for the absence of a certain species and 1 for its presence). Thus, for each variable, one of the following configurations is observed: 0-0, 0-1, 1-0 or 1-1, the first value referring to observation i and the second to observation j. Denoting the counts of these configurations by a (1-1), b (1-0), c (0-1) and d (0-0), the binary coefficients discussed below are all functions of a, b, c and d.
Coefficients for these variables normally focus on measuring (dis)similarity based on counting the agreements (positive or negative) between the elements; some coefficients instead use the number of disagreements (positive or negative) as the main element of their measure.
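As an illustration (not from the original paper), the sketch below counts the four configurations for a pair of binary presence/absence vectors; the function name and example data are ours, and all code sketches in this section use Python:

```python
# A minimal sketch: counting the four agreement/disagreement frequencies
# for a pair of 0/1 vectors of equal length q.
import numpy as np

def binary_counts(x, y):
    """Return (a, b, c, d): a = 1-1 matches, b = 1-0, c = 0-1, d = 0-0."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = np.sum(x & y)
    b = np.sum(x & ~y)
    c = np.sum(~x & y)
    d = np.sum(~x & ~y)
    return int(a), int(b), int(c), int(d)

# Example: presence/absence of q = 6 species in two plots.
print(binary_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0]))  # (2, 1, 1, 2)
```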
In general, measures of (dis)similarity are interrelated and easily transformed into one another, and a large number of similarity and/or dissimilarity coefficients for binary characters are available in the literature. Similarity coefficients can be easily converted into dissimilarity coefficients: if the similarity is denoted s, the dissimilarity can be taken as its complement, for example d = 1 - s or d = sqrt(1 - s), when s is a similarity coefficient. Most cluster analysis methods require a measure of (dis)similarity between the elements to be clustered, usually expressed as a distance function or a metric. A function can only be considered a similarity or a dissimilarity if it satisfies certain properties or axioms.
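A minimal sketch of this conversion, assuming the similarity is already scaled to [0, 1]; the square-root variant is the form commonly used to make the complement of several coefficients satisfy the triangle inequality:

```python
# Sketch: converting a similarity s in [0, 1] into a dissimilarity.
# d = 1 - s is the simple complement; d = sqrt(1 - s) is the variant
# often needed for the resulting distance to behave as a metric.
import numpy as np

def to_distance(s, use_sqrt=False):
    s = np.asarray(s, dtype=float)
    return np.sqrt(1.0 - s) if use_sqrt else 1.0 - s
```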
Distances are used, like similarities, to measure the association between objects. Distance coefficients can be subdivided into three groups: metrics, semimetrics and nonmetrics.
The first group satisfies all of the following properties (the dissimilarity axioms): 1. d(i, j) ≥ 0 (non-negativity); 2. d(i, i) = 0; 3. d(i, j) = d(j, i) (symmetry); 4. d(i, k) ≤ d(i, j) + d(j, k) (triangle inequality). Semimetrics satisfy the first three properties but may violate the fourth; nonmetrics may violate others.

Joint Absence
The similarity coefficients can be divided into two groups: those that consider the joint absence (symmetric coefficients) and those that do not consider the joint absence (asymmetric coefficients). Some similarity coefficients that consider the joint absence are presented below, emphasizing that it is indicated by the letter d (double zero) in the expressions.
An attribute is symmetric if both of its states are equally important and have the same weight. In these cases, the zero-zero and one-one correspondences are completely equivalent and must both be included in the similarity coefficient.
Similarity based on symmetric attributes is called invariant similarity, as the result does not change when some or all of the attributes are coded the other way around. For invariant similarity, the best-known coefficient for evaluating the similarity between objects xi and xj is the coefficient of Sokal and Michener (1958) (simple matching). The Rogers and Tanimoto (1960), Russell and Rao (1940) and Gower and Legendre coefficients are other examples of symmetric similarity coefficients that treat positive (a) and negative (d) matches in the same way. The coefficients differ in the weights they assign to matches and mismatches.
These are similarity coefficients that consider the joint absence and are metrics, as their complements have all the properties of the dissimilarity axioms. Using these coefficients assumes that there is no difference in meaning between presence (double 1) and absence (double 0).

Sokal and Michener, also called simple matching
• No undetermined value
• It is a special case of the proportion of agreement for two nominal variables
• Members of the parameter family are interchangeable with respect to an ordinal comparison
• After correction for chance it becomes Cohen's kappa
• D = 1 - S satisfies the triangle inequality
• Two multivariate generalizations satisfy a strong generalization of the triangle inequality
When a, b, c and d are expressed as proportions (a + b + c + d = 1), the simple matching coefficient is SSM = a + d; in general, SSM = (a + d)/(a + b + c + d). It can be interpreted as the number of 1s and 0s shared by the variables in the same positions, divided by the total length of the variables. It has been used, for example, to compare two clustering algorithms and to measure the agreement of two psychologists classifying people into undefined categories. In the same family as Sokal and Michener and Rogers and Tanimoto, Sokal and Sneath (1963) proposed the coefficient S = 2(a + d)/(2(a + d) + b + c), which gives twice as much weight to the quantity (a + d) as to (b + c). Sokal and Sneath also proposed the coefficient S = (a + d)/(b + c); this last coefficient does not behave well when the sum a + d is greater than b + c, since it then becomes greater than 1, violating the axioms for similarity coefficients. As alternatives that do not include the quantity d there are the coefficients of Kulczynski (1927) and Ochiai (1957). The Russell and Rao coefficient is called hybrid by Sokal and Sneath, since it includes the quantity d in the denominator but not in the numerator.
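The symmetric coefficients discussed above have simple closed forms; the sketch below writes them out directly from the standard formulas (function names are ours, with a, b, c, d as counted earlier):

```python
# Sketch of symmetric (joint-absence) coefficients in standard form.

def simple_matching(a, b, c, d):
    return (a + d) / (a + b + c + d)            # Sokal and Michener (1958)

def rogers_tanimoto(a, b, c, d):
    return (a + d) / (a + d + 2 * (b + c))      # double weight to mismatches

def sokal_sneath_1(a, b, c, d):
    return 2 * (a + d) / (2 * (a + d) + b + c)  # double weight to matches

def russell_rao(a, b, c, d):
    return a / (a + b + c + d)                  # "hybrid": d only in denominator
```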

Russell and Rao, also called positive concordance
• No undetermined value
• D = 1 - S satisfies the triangle inequality
• The coefficient matrix is totally positive of order 2
• The first eigenvector of the coefficient matrix reflects the ordering of a stochastic model.
Other similarity coefficients consider the joint absence but are semimetrics, as they do not meet the fourth property of the dissimilarity axioms (the triangle inequality): Anderberg (1973); Gower 2 (1985); Ochiai II; Sneath and Sokal.

Disregard the joint absence
Given two asymmetric binary attributes, the agreement of two 1s (a positive match) is considered more significant than the agreement of two 0s (a negative match). Similarity based on such attributes is called non-invariant similarity, for which the best-known coefficient is the Jaccard coefficient (1901), in which the number of negative matches, d, is not considered important and is therefore ignored in the calculation. The Jaccard similarity index indicates the similarity between two communities by comparing the number of species unique to each area with the number of species common to both.

Jaccard, also known as the community coefficient
• Members of the parameter family are interchangeable with respect to an ordinal comparison.
• Bounded below by the correlation of proportions.
• Satisfies the triangle inequality.
• A multivariate generalization satisfies a strong generalization of the triangle inequality.
Some similarity coefficients that disregard the joint absence are metrics, such as Jaccard; others, such as Sorensen-Dice and Anderberg, disregard the joint absence and are semimetrics.
Sørensen (1948) proposed the coefficient S = 2a/(2a + b + c). The Kulczynski I coefficient, a/(b + c), is not a metric: when the positive agreements exceed the disagreements (a > b + c), it becomes greater than 1 and fails the similarity properties.
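In their standard forms (helper names ours, taking the a, b, c, d counts sketched earlier), these asymmetric coefficients can be written as:

```python
# Sketch of the asymmetric coefficients, which ignore joint absences (d).

def jaccard(a, b, c):
    return a / (a + b + c)            # Jaccard (1901)

def sorensen_dice(a, b, c):
    return 2 * a / (2 * a + b + c)    # Sørensen (1948); double weight to a

def kulczynski_1(a, b, c):
    # Not a metric and unbounded: exceeds 1 whenever a > b + c.
    return a / (b + c)
```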

Association coefficients
Such coefficients show how pairs of individuals are associated. They generally range from -1, when a change in one variable is accompanied by an equal-magnitude change in the other but in the opposite direction, to +1, when a change in one variable is accompanied by an equal-magnitude change in the other in the same direction. Examples are the Hamann and Yule coefficients, which measure the strength of agreements relative to disagreements: the closer to +1, the greater the similarity between the elements through agreement; the closer to -1, the greater the association through disagreement. There are also association coefficients that vary over the range [-∞, +∞], such as the chi-square. The correlation coefficient has been used successfully precisely when it is intended that the classification results not be affected by differences in the dispersion and scale of the variables.
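A short sketch of two association coefficients in their standard forms (function names are ours):

```python
# Sketch of two association coefficients on the [-1, +1] scale.

def hamann(a, b, c, d):
    return ((a + d) - (b + c)) / (a + b + c + d)

def yule_q(a, b, c, d):
    # Undefined when a*d + b*c == 0; +1 means agreement dominates,
    # -1 means disagreement dominates.
    return (a * d - b * c) / (a * d + b * c)
```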
Studies have been carried out comparing similarity coefficients in species studies, which may help in choosing among them. Duarte et al. (1999) compared different similarity coefficients in studies with beans and concluded that the Sorensen coefficient is the most suitable for the study of genetic divergence in this species when RAPD markers are used. Meyer et al. (2004), using RAPD and AFLP markers in maize to compare similarity coefficients, showed that, in that situation, the Jaccard, Sorensen, Anderberg and Ochiai coefficients can all be used, since the results for these coefficients showed little variation.
The same authors also point out that this result confirms the greater use of the Jaccard index in the analysis of genetic divergence, although it is not the most suitable for all species.
In all of these articles, a single clustering method, UPGMA, was applied, together with ten coefficients available in the programs NTSYS (Numerical Taxonomy and Multivariate Analysis System) version 1.7 (Rohlf, 1992) and GENES (Cruz, 2001).
Among these ten coefficients, simple matching, Rogers and Tanimoto, Russell and Rao and Jaccard are metrics, and Sorensen, Ochiai and Kulczynski are semimetrics; there are also three association coefficients: Hamann, Yule and Phi (Pearson). It is important to pay attention to the association coefficients, as their values range from -1 to +1. It is therefore essential to check the correlation values, because when they assume negative values they do not represent a metric, although it is also possible to associate a distance function with the correlation coefficient; Mulvey and Crowder (1979) define a correlation "metric" based on a transformation of the correlation coefficient. In the articles reviewed, all similarity coefficients were transformed and analyzed as distance measures without their properties being verified, that is, without checking whether they represent a metric. Other methods should be applied so that their behavior can be analyzed and verified.
The choice of clustering method must also be judicious, since different methods can produce different results on the same data. The authors of the articles typically proceed as follows: they compute the similarity coefficients and transform them into dissimilarities without checking the properties of similarity and dissimilarity. However, some of these coefficients, such as Sorensen, do not meet the triangle inequality property (being not a metric but a semimetric), and the correlation coefficients, which vary from -1 to +1, cannot undergo the transformations used for the other coefficients. Gower and Legendre (1986) define two parameter families in which all members are linear in the numerator and denominator, distinguishing between coefficients that do and do not include the quantity d. The first family, for presence and absence data, is determined by Sθ = (a + d)/(a + d + θ(b + c)).

Parameter Families
where θ > 0; negative values of θ are excluded, as they could produce negative similarities. Members of the family include θ = 1 (simple matching), θ = 2 (Rogers and Tanimoto) and θ = 1/2 (Sokal and Sneath), the last giving more weight to the agreements (a + d).
With presence and absence data this is regularly the case. When the positive matches are scarce, that is, when a is much smaller than (b + c), members with smaller θ give more weight to a; similar arguments apply in the opposite case. All members of the family are bounded by 0 and 1. The second family, for coefficients that exclude the quantity d, can be written in the form Tθ = a/(a + θ(b + c)), with θ > 0. The distance corresponding to the Sorensen coefficient (θ = 1/2) was described by Barros et al. (2020) under the name of the non-metric coefficient, used to compare the dissimilarity of two samples: this variant gives double weight to presence, "because the presence of a species can be considered to be more informative than its absence". Absence may be due to various factors, as discussed above, and does not necessarily reflect differences in the environment; double presence, on the other hand, is a strong indication of similarity.
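Under the reconstruction above, a short sketch of the two families; the parameter values listed recover the named coefficients:

```python
# Sketch of the Gower and Legendre (1986) parameter families; theta > 0
# weights the disagreements (b + c).

def s_theta(a, b, c, d, theta):
    """Family that includes joint absences d."""
    return (a + d) / (a + d + theta * (b + c))

def t_theta(a, b, c, theta):
    """Family that excludes joint absences d."""
    return a / (a + theta * (b + c))

# theta = 1   -> simple matching (S family) and Jaccard (T family)
# theta = 2   -> Rogers and Tanimoto (S) and Sokal and Sneath II (T)
# theta = 1/2 -> Sokal and Sneath I (S) and Sorensen-Dice (T)
```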

The family of coefficients
Note that Sorensen is monotonic with respect to Jaccard. This property means that if the similarity of a pair of objects calculated with Jaccard is greater than that of another pair of objects, the same will be true when using Sorensen; in other words, Jaccard and Sørensen differ only in their scales (weightings).
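This monotonicity can be seen algebraically: dividing the numerator and denominator of the Sorensen-Dice formula by (a + b + c) gives SD = 2SJ/(1 + SJ), an increasing function of SJ. A quick numerical sketch of the check:

```python
# Sketch: Sorensen-Dice is a monotone (increasing) transform of Jaccard,
# S_D = 2*S_J / (1 + S_J), so the two coefficients rank pairs identically.
for a, b, c in [(5, 1, 2), (3, 3, 3), (8, 0, 1)]:
    s_j = a / (a + b + c)               # Jaccard
    s_d = 2 * a / (2 * a + b + c)       # Sorensen-Dice
    assert abs(s_d - 2 * s_j / (1 + s_j)) < 1e-12
```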
The choice of (dis)similarity coefficients for analyzing the results of an experiment must follow criteria, so that the results presented are reliable. Each (dis)similarity coefficient has its own characteristics, which must be taken into account together with the individuals or variables studied.
Little research has been carried out to determine the advantages and disadvantages of each of the (dis)similarity coefficients, and in general much of the published work does not justify the choice of the coefficients used. For greater reliability, studies should justify their choice of (dis)similarity coefficients and clustering methods.

Material and Methods
Data come from a survey of the vegetation of the silviculture forest (Table 2) of the Federal University of Viçosa, in Viçosa, MG, taken from Albuquerque et al. (2006). The clustering coefficients used were Jaccard, Sorensen-Dice, simple matching, Russell and Rao, and Rogers and Tanimoto, with the average linkage (mean distance) method. These coefficients were chosen because they are the most used in practice and are easy to find in a wide range of computer programs.

Average distance
This method consists of grouping the two most similar objects first and then using the arithmetic mean of the distances of the objects in each group to build the new distance matrix.
The average similarity between the individuals or group to be joined and an existing group is used.
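A minimal sketch of this pipeline with scipy, assuming a made-up 11 x 17 presence/absence matrix in place of the Table 2 data (which are not reproduced here); pdist computes the Jaccard dissimilarities and linkage(..., method="average") implements the average linkage (UPGMA) described above:

```python
# Sketch: Jaccard dissimilarity + average linkage (UPGMA) on a random
# 11 plots x 17 species presence/absence matrix (stand-in for Table 2).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(11, 17)).astype(bool)  # 11 plots, 17 species

D = pdist(X, metric="jaccard")    # condensed Jaccard dissimilarities
Z = linkage(D, method="average")  # UPGMA on the precomputed distances
# dendrogram(Z)                   # visual inspection of the groups
```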

Cophenetic correlation
For each of the clustering coefficients used, the respective cophenetic matrices resulting from the simplification provided by the dendrogram were obtained. Based on the original and cophenetic matrices, the cophenetic correlation was obtained according to the expression (Albuquerque et al., 2006):
rcof = Σi<j (cij - c̄)(dij - d̄) / sqrt[Σi<j (cij - c̄)² Σi<j (dij - d̄)²]
where: cij = dissimilarity value between individuals i and j, obtained from the cophenetic matrix; dij = dissimilarity value between individuals i and j, obtained from the dissimilarity matrix; and c̄ and d̄ are the respective means over all pairs i < j.
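Continuing the clustering sketch above, scipy's cophenet returns exactly this correlation between the cophenetic distances cij and the original dissimilarities dij:

```python
# Sketch (continues the previous block): cophenetic correlation between
# the original Jaccard dissimilarities D and the cophenetic distances
# implied by the UPGMA dendrogram Z.
from scipy.cluster.hierarchy import cophenet

r_coph, coph_dists = cophenet(Z, D)
print(round(float(r_coph), 2))
```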

Rand
The adjusted Rand index determines the similarity between two partitions P1 and P2 by examining the group membership of pairs of species in the two partitions. If two species belong to the same group in both P1 and P2, the index value increases; on the other hand, if the two species belong to the same group in P1 but to different groups in P2, the index value decreases. The adjusted Rand index is the normalized version of the Rand index:
ARI = [Σij C(nij, 2) - Σi C(ni., 2) Σj C(n.j, 2)/C(n, 2)] / [(1/2)(Σi C(ni., 2) + Σj C(n.j, 2)) - Σi C(ni., 2) Σj C(n.j, 2)/C(n, 2)]
where C(x, 2) = x(x - 1)/2 is the number of pairs; k1 and k2 are the numbers of groups in partitions P1 and P2; n is the number of elements in the initial set; ni. is the number of species in group i of P1; n.j is the number of species in group j of P2; and nij is the number of species that belong to both group i of P1 and group j of P2, that is, the number of species common to the two groups.
Values close to 0 for the adjusted Rand index indicate random partitions, which reveal little about the relationship between species, while values close to 1 are obtained for the most consistent partitions.
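As a sketch of this comparison (continuing the clustering example), the Rand and adjusted Rand indices are available in scikit-learn; the reference partition below is hypothetical:

```python
# Sketch: Rand and adjusted Rand between the dendrogram groups and a
# hypothetical reference partition of the 11 plots.
from scipy.cluster.hierarchy import fcluster
from sklearn.metrics import rand_score, adjusted_rand_score

p1 = fcluster(Z, t=3, criterion="maxclust")   # cut dendrogram into 3 groups
p2 = [1, 1, 2, 2, 2, 3, 3, 1, 2, 3, 3]        # hypothetical reference labels

print(rand_score(p1, p2), adjusted_rand_score(p1, p2))
```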

Stress
This stress statistic (standardized residual sum of squares) was proposed by Kruskal (1964). It is a parameter that measures the distortion between the original matrix and the one obtained after construction of the dendrogram, and can be written as Stress = sqrt[Σi<j (dij - cij)² / Σi<j dij²], with dij and cij as defined above.
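Under the reconstruction above (stress-1 applied to the original and cophenetic distances), a short sketch continuing the clustering example:

```python
# Sketch: Kruskal's stress-1 between the original dissimilarities D and
# the cophenetic distances coph_dists from the previous blocks.
import numpy as np

def kruskal_stress(d, c):
    d = np.asarray(d, dtype=float)
    c = np.asarray(c, dtype=float)
    return np.sqrt(np.sum((d - c) ** 2) / np.sum(d ** 2))

print(kruskal_stress(D, coph_dists))
```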

Although the general structure of the groupings is very similar, small changes can be observed in the levels at which the plots are grouped; that is, plots within the same group can be grouped in a different order when the coefficient is changed. However, this causes few practical problems. It is important to highlight that the fact that this type of analysis does not offer an objective criterion for identifying the groups makes interpreting the results very difficult. The cophenetic correlation coefficients for the five similarity coefficients were all moderate, demonstrating that there is a reasonable association between the original data and the dendrograms, regardless of the coefficient used and the number of groups (Table 4): 0.54 for Jaccard, 0.57 for Sorensen-Dice, 0.50 for simple matching, 0.67 for Russell and Rao and 0.44 for Rogers and Tanimoto. The cophenetic correlation therefore does not allow a clear distinction between the coefficients with respect to the dendrograms obtained.
The stress values for the five coefficients (Table 4) were of low magnitude, ranging from 25% for the Sorensen-Dice coefficient to 38% for the Rogers and Tanimoto coefficient: 33% for Jaccard, 29% for simple matching and 32% for Russell and Rao.
The Rand index takes values in the range [0, 1]. The maximum value (Rand = 1) corresponds to a situation where the two classifications coincide, with no pairs placed in the same group in one partition and in different groups in the other.
Since in this case the grouping of the 11 plots based on the 17 species is known, it is possible to compare the groupings obtained through cluster analysis with this division of plots. The adjusted Rand index for the grouping resulting from each coefficient was 0.91 for Jaccard, 0.96 for Sorensen, 0.84 for simple matching, 0.76 for Russell and Rao and 0.84 for Rogers and Tanimoto, while the Rand index values were similar for all similarity coefficients and higher than the adjusted Rand values. This suggests that any of the similarity coefficients can be used when comparing the coefficients under the average linkage method. These analyses were performed independently for each coefficient.

Conclusion
The practical conclusion is that, in most applications, one must verify that the properties of the similarity and dissimilarity coefficients are met; the choice of an appropriate coefficient for binary variables can probably be limited to the following five coefficients: Jaccard, Sorensen, Russell and Rao, Sokal and Michener, and Rogers and Tanimoto.