Techniques for assessing the goodness of fit of statistical models: evaluation of probability distributions using production data from laying quails

The goal of our study was to evaluate the quality of fit of different probability distributions for continuous data. For this, performance and egg quality traits of quails in the production of nutraceutical eggs were used as the source of continuous data. The data were collected over 42 days; the experimental design was completely randomized with 7 treatments and 6 repetitions, with 252 animals allocated in 36 cages. The distributions for continuous data were the exponential, gamma, Gaussian, and lognormal. The R Open Source and SAS® University Edition software were used to perform the analyses. The graphical analysis of the traits was based on predicted versus observed values, the Cumulative Distribution Function (CDF), and skewness-kurtosis. The fits were also evaluated by the Akaike information criterion (AIC), Bayesian information criterion (BIC), Conditional Model Adjusted R-Square ($R^2_{ac}$), Conditional Model Adjusted Concordance Correlation ($r_{ac}$), Kolmogorov-Smirnov test (KS), Cramer-von Mises test (CvM), Anderson-Darling test (AD), Watanabe-Akaike Information Criterion (WAIC), and Leave-one-out cross-validation (LOO). All the tests indicated the Gaussian distribution as the most suitable, and they excluded the exponential distribution for all the evaluated characteristics.


Introduction
Instead of adapting our data to classical statistics (Analysis of Variance, ANOVA), we should use approaches according to the characteristics of the observations, such as mixed models for situations involving fixed and random effects or generalized models when we have non-normal data (Bolker et al., 2009).
Once the data type has been determined (discrete, continuous, sample space, etc.), visual assessment can be used to verify the model and to find the distribution most appropriate to the data, seeking the greatest likelihood. Scatterplots, for example, provide information about the presence of outliers, the relationship between variables (correlations), and the behavior of the data over time (Vonesh, 2014; Sher et al., 2017). Silva et al. (2020) demonstrated the importance and ease of use of computational tools to analyze data, fit distributions, and verify the model, which makes them useful in reducing errors, costs, and processing time. Like other authors, they used statistical modeling techniques to forecast corn crops in the state of Mato Grosso, as they recognized the need to predict events such as rain or drought based on past situations (Silva et al., 2019).
Therefore, tools such as SAS® and R Open Source are used to identify the models that best describe reality. The %GLIMMIX_GOF macro of the SAS® software provides the predicted and observed values (Vonesh & Chinchilli, 1996). The fitdistrplus package of the R software produces skewness-kurtosis and Cumulative Distribution Function (CDF) graphs (Muller et al., 2015).
The choice between the Akaike (AIC) (Akaike, 1974) and Bayesian (BIC) (Schwarz, 1978) information criteria for model selection must be based on their principles of statistical inference and on the nature of the data (Burnham & Anderson, 2004). The information generated by the AIC is useful when there are not too many samples or when there is a need to identify the number of model parameters, since it penalizes the most complex models according to their number of parameters. The values of AIC and BIC are similar, since both require the estimation of parameters; however, the second considers Bayesian and deviance information (Deviance Information Criterion, DIC) (Bolker et al., 2009). Brito et al. (2020) verified the importance of using statistical models to identify those that best represent and help predict human diseases (diabetic retinopathy), and used criteria such as AIC and BIC to evaluate the various distributions studied.
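As a minimal sketch of how the two criteria penalize complexity, the block below computes AIC = 2k − 2 ln L and BIC = k ln(n) − 2 ln L for a Gaussian and an exponential fit. The sample values and function names are hypothetical illustrations, not data or code from this study, and the closed-form maximum likelihood estimates are an assumption of the sketch.

```python
import math


def gaussian_loglik(x):
    """Maximized Gaussian log-likelihood and its number of parameters (mu, sigma)."""
    n = len(x)
    mu = sum(x) / n
    var = sum((v - mu) ** 2 for v in x) / n  # maximum likelihood variance (divides by n)
    return -0.5 * n * (math.log(2 * math.pi * var) + 1), 2


def exponential_loglik(x):
    """Maximized exponential log-likelihood and its number of parameters (rate)."""
    n = len(x)
    rate = n / sum(x)  # maximum likelihood rate
    return n * math.log(rate) - rate * sum(x), 1


def aic(loglik, k):
    return 2 * k - 2 * loglik


def bic(loglik, k, n):
    return k * math.log(n) - 2 * loglik


# Hypothetical sample, e.g. egg mass in g/(quail×day); not data from the experiment
sample = [10.2, 10.8, 11.1, 9.7, 10.5, 10.9, 10.1, 10.6]
n = len(sample)

criteria = {}
for name, fit in (("gaussian", gaussian_loglik), ("exponential", exponential_loglik)):
    loglik, k = fit(sample)
    criteria[name] = {"AIC": aic(loglik, k), "BIC": bic(loglik, k, n)}
    print(name, criteria[name])
```

For both criteria the smaller value indicates the preferred distribution. The BIC penalty grows with ln(n), so it exceeds the AIC penalty once n is larger than about 7.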
The Kolmogorov-Smirnov (KS) (Massey, 1951), Cramer-von Mises (CvM) (Darling, 1957), and Anderson-Darling (AD) tests do not consider the complexity of the models. Researchers should therefore use them only when the parameters are known and constant, to avoid favoring more complex models and to respect the principle of parsimony (Muller & Dutang, 2015).
The brms package of the R software provides the Watanabe-Akaike Information Criterion (WAIC) (Gelman, 2014) and Leave-one-out cross-validation (LOO). With a flexible structure, the package offers several distributions, allows the link functions to be changed, and allows the use of prior information already available about the data. WAIC can be an improvement on DIC (Bürkner, 2017).
In this study, our goal was to elucidate the available techniques for assessing the fit quality of statistical models, identifying the most appropriate distributions for the analyzed traits.

Methodology
The choice of methodology in research serves to describe the hypothesis and confront it, to obtain information that contributes to learning (Pereira et al., 2018). To obtain reliable, high-quality results, Lüdke & André (1986) described, in the form of a guide, all the planning to be carried out in research, from data collection to analysis and interpretation of results. Based on previous studies in animal science, we developed our research to maintain the reliability of our results (Pan, 2001; Bürkner, 2017; Muller & Dutang, 2015; Vonesh et al., 1996).

Data sampling
The data were obtained from an experiment carried out in accordance with the institutional committee for the use of animals (protocol 0059/2013), at the Experimental Poultry Farm, in the county of Itaguaí, Rio de Janeiro State, Brazil. We used 252 female Japanese quails (Coturnix japonica) of the Fujikura lineage. The animals were 90 days old, with an average weight of 188.3 ± 4.0 g and an average laying rate of 90%. The quails were allocated in 36 laying cages with dimensions of 33 × 25 × 20 cm.
The experimental design was completely randomized with 7 treatments and 6 repetitions, with 6 animals per cage (experimental unit). The experimental diets were obtained from a control diet with increasing levels of organic selenium (0.10; 0.20; 0.30 and 0.40 mg), and 200 mg of DL-α-tocopheryl acetate were added per kilogram of feed as a source of vitamin E. Thus, the diets were:
We used the nutritional requirements of Japanese quails described by NRC (1994) to formulate the diet. The exceptions were the protein and calcium requirements, which were based on the recommendations of Oliveira et al. (1999) and Barreto et al. (2007), respectively. The vitamin and mineral supplement was produced without selenium or tocopherols, so that we did not overestimate their concentrations.
The performance traits related to egg production were feed intake (Fi; g/(quail×day)), egg mass (EM; g/(quail×day)), daily egg yield per quail (DEy; eggs/(quail×day)), and feed conversion (FC; dimensionless, i.e., intake mass/egg mass). We also evaluated egg quality using yolk mass (YM; g/egg), albumen mass (AM; g/egg), eggshell mass (ESM; g), yolk ratio (Yr; %), eggshell ratio (ESr; %), and albumen ratio (Ar; %).
Research, Society and Development, v. 10, n. 11, e278101119317, 2021 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v10i11.19317
At the beginning of the experiment, an adaptation phase was carried out for seven days, in which the control diet was offered to all animals. Subsequently, the experimental diets were offered ad libitum. The animals were subjected to a photoperiod of 17 hours, with the light controlled by a timer, and the temperature and relative humidity were recorded inside the house. The egg production trait and Fi were measured weekly. The traits YM, AM, and ESM were determined by collecting 3 eggs from each repetition (daily). As a reference, on day "zero", 80 eggs were collected randomly and subjected to the same protocol.
After 42 days of supplementation, the nutraceutical effects of tocopherol and selenium were assessed by analytically determining the concentration of their metabolic indicator, malondialdehyde, in the yolk (MDAY; mmol/g) of quail eggs, according to the methodologies described by Shahryar et al. (2010) and Enkvetchakul et al. (1995).

Data analysis
We used the software SAS® University Edition and R Open Source version 3.6.3 to perform the analyses. The machine ran the Linux elementary OS operating system, with 6 GB RAM, 500 GB HD, and an Intel® Core™ i7 processor. For all the studied traits, we evaluated the fit of the Gaussian, lognormal, gamma, and exponential distributions, which are suitable for continuous data.

Goodness-of-fit analysis
The SAS® %GLIMMIX_GOF macro was used to generate the R-Square Type Goodness-of-Fit Information and Model Fitting Information tables for each distribution and variable.
From the GLIMMIX procedure, we evaluated for each distribution the relationship on the Cartesian plane between observed values (y-axis) and predicted values (Pred, x-axis). From the %GLIMMIX_GOF macro, the Conditional Model Adjusted R-Square ($R^2_{ac}$) and the Conditional Model Adjusted Concordance Correlation ($r_{ac}$) were obtained, corrected for fixed and random factors and adjusted for the number of parameters. The use of this macro followed the methodology described by Vonesh et al. (1996).

Fit analysis through tests
The R Open Source software, with the gofstat function of the fitdistrplus package, was used to evaluate the candidate distributions (Gaussian, lognormal, gamma, and exponential) by means of the Kolmogorov-Smirnov (KS), Cramer-von Mises (CvM), and Anderson-Darling (AD) tests, according to the methodology described by Muller & Dutang (2015). Other outputs of the gofstat function, such as the Akaike (AIC) and Bayesian (BIC) information criteria, were also recorded.
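To make the three test statistics concrete, the sketch below computes them from a fitted cumulative distribution function using their standard computational formulas. It is a plain Python illustration under stated assumptions, not the gofstat implementation: the sample is hypothetical, the fitting helpers use closed-form maximum likelihood estimates, and the gamma distribution is omitted because its fit requires numerical optimization.

```python
import math


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_gaussian(x):
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in x) / n)
    return lambda v: normal_cdf(v, mu, sigma)


def fit_lognormal(x):
    logs = [math.log(v) for v in x]
    n = len(logs)
    mu = sum(logs) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in logs) / n)
    return lambda v: normal_cdf(math.log(v), mu, sigma)


def fit_exponential(x):
    rate = len(x) / sum(x)
    return lambda v: 1.0 - math.exp(-rate * v)


def gof_statistics(x, cdf):
    """KS, CvM and AD statistics of a sample against a fitted CDF."""
    xs = sorted(x)
    n = len(xs)
    F = [cdf(v) for v in xs]
    # Largest gap between the empirical and the fitted CDF
    ks = max(max((i + 1) / n - F[i], F[i] - i / n) for i in range(n))
    # Integrated squared gap, computational form
    cvm = 1.0 / (12 * n) + sum((F[i] - (2 * i + 1) / (2 * n)) ** 2 for i in range(n))
    # Same idea with extra weight on the tails
    ad = -n - sum((2 * i + 1) * (math.log(F[i]) + math.log(1.0 - F[n - 1 - i]))
                  for i in range(n)) / n
    return {"KS": ks, "CvM": cvm, "AD": ad}


# Hypothetical sample; not data from the experiment
sample = [10.2, 10.8, 11.1, 9.7, 10.5, 10.9, 10.1, 10.6]
fitters = {"gaussian": fit_gaussian, "lognormal": fit_lognormal,
           "exponential": fit_exponential}
results = {name: gof_statistics(sample, fit(sample)) for name, fit in fitters.items()}
for name, stats in results.items():
    print(name, stats)
```

For all three statistics, smaller values indicate a closer agreement between the fitted and the empirical distribution.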

Bayesian method for information criteria
The brm function of the brms package, which uses the Stan language (Stan Development Team, 2017; Carpenter, 2017), was used to describe the model and the distribution. The values of the Watanabe-Akaike Information Criterion (WAIC) and Leave-one-out cross-validation (LOO) were obtained with the waic and loo functions (Bürkner, 2017).
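The following sketch illustrates how a WAIC value can be obtained from pointwise log-likelihoods evaluated over posterior draws, using the usual definition WAIC = −2 (lppd − p_waic). It does not reproduce brms or Stan: the sample is hypothetical, the posterior draws come from simple conjugate models chosen for illustration, and the Gaussian model fixes sigma at the sample standard deviation as a simplifying assumption. LOO is not implemented here.

```python
import math
import random
import statistics


def waic(loglik_draws):
    """WAIC on the deviance scale from loglik_draws[s][i] (draw s, observation i)."""
    n_draws = len(loglik_draws)
    n_obs = len(loglik_draws[0])
    lppd = 0.0    # log pointwise predictive density
    p_waic = 0.0  # effective number of parameters
    for i in range(n_obs):
        column = [draw[i] for draw in loglik_draws]
        top = max(column)  # log-sum-exp trick for numerical stability
        lppd += top + math.log(sum(math.exp(v - top) for v in column) / n_draws)
        p_waic += statistics.variance(column)
    return -2.0 * (lppd - p_waic)


rng = random.Random(1)  # fixed seed so the sketch is reproducible
y = [10.2, 10.8, 11.1, 9.7, 10.5, 10.9, 10.1, 10.6]  # hypothetical trait values
n, n_draws = len(y), 2000

# Gaussian likelihood; simplification: sigma fixed at the sample sd, flat prior on mu
sigma = statistics.stdev(y)
mu_draws = [rng.gauss(statistics.mean(y), sigma / math.sqrt(n)) for _ in range(n_draws)]
ll_gaussian = [[-0.5 * math.log(2 * math.pi * sigma ** 2) - (v - mu) ** 2 / (2 * sigma ** 2)
                for v in y] for mu in mu_draws]

# Exponential likelihood; a prior proportional to 1/rate gives a Gamma(n, sum(y)) posterior
rate_draws = [rng.gammavariate(n, 1.0 / sum(y)) for _ in range(n_draws)]
ll_exponential = [[math.log(r) - r * v for v in y] for r in rate_draws]

waic_gaussian = waic(ll_gaussian)
waic_exponential = waic(ll_exponential)
print(waic_gaussian, waic_exponential)
```

As with AIC and BIC, the distribution with the smaller WAIC is preferred.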

Coefficients ($R^2_{ac}$ and $r_{ac}$)
According to the results of the SAS® %GLIMMIX_GOF macro, all traits had the highest $R^2_{ac}$ and $r_{ac}$ with the Gaussian distribution (panels (a), Figures 1 to 6). The variable MDAY was the only one, among the 11 traits evaluated, that presented values of $R^2_{ac}$ and $r_{ac}$ with the gamma (c) and exponential (d) distributions equal to those of the Gaussian one.

Tests
Regardless of which of the three tests was used, the lack of fit of the exponential distribution was evident for ten of the eleven traits analyzed, with values much higher than those of the other distributions (Figures 7 to 12).
In the Kolmogorov-Smirnov test, the smallest values for the traits EM, YM, ESM, Ar, ESr, and MDAY were obtained with the lognormal distribution, while for DEy, Fi, FC, AM, and Yr the Gaussian distribution gave the smallest values.
In the Cramer-von Mises test, the lognormal distribution showed the lowest values for the traits MDAY, YM, ESM, ESr, and Ar. For DEy, Fi, FC, AM, and Yr the Gaussian distribution was the best. In this same test, the EM trait showed the same value for the gamma and Gaussian distributions.

Information criteria (AIC and BIC)
The values of AIC and BIC were similar, which led to equal decisions for all traits except for the variable MDAY (Figures 13 and 14).
Similar to what was observed in the tests, the exponential distribution was clearly worse than the other distributions (Figures 13 and 14), with the exception of the variable MDAY (Figure 14, panel e), for which the Gaussian had the highest values of AIC and BIC.

Information criteria (WAIC and LOO)
The behavior of the values of WAIC and LOO (Figures 15 and 16)

Discussion
The most feasible distribution is chosen by comparing how closely the observed data approximate each candidate distribution. A nonzero skewness indicates a lack of symmetry of the distribution with respect to the observed data (Muller & Dutang, 2015). For the MDAY characteristic, the data were equally distant from all distributions, indicating no difference between them. For the others, the exponential distribution was the one that converged least.
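The skewness-kurtosis comparison can be sketched as follows: the sample skewness and kurtosis are computed from central moments and set against theoretical reference values, where the Gaussian sits at (0, 3), the exponential at (2, 9), and the gamma family lies on the line kurtosis = 3 + 1.5 × skewness². The function names are hypothetical, and the simple moment estimators below are an assumption of this sketch (packaged tools may apply bias corrections).

```python
def skewness_kurtosis(x):
    """Moment-based sample skewness and (non-excess) kurtosis."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2


def gamma_line_kurtosis(skewness):
    """Kurtosis of a gamma distribution that has the given skewness."""
    return 3.0 + 1.5 * skewness ** 2


# Theoretical (skewness, kurtosis) reference points
REFERENCE_POINTS = {"gaussian": (0.0, 3.0), "exponential": (2.0, 9.0)}

# Hypothetical sample; not data from the experiment
sample = [10.2, 10.8, 11.1, 9.7, 10.5, 10.9, 10.1, 10.6]
skew, kurt = skewness_kurtosis(sample)
print("sample:", skew, kurt)
print("gamma kurtosis at this skewness:", gamma_line_kurtosis(skew))
```

A sample whose point lies near (0, 3) is compatible with a Gaussian distribution, while a point far from (2, 9) argues against the exponential.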
Evaluating the data with the Akaike and Bayes information criteria is the most recommended approach, but the criteria should be used with caution, because the choice of criterion depends on the characteristics of the data, the number of observations, and missing observations, among others. When the number of observations is relatively large, both the AIC and BIC will tend to select the same model (Burnham & Anderson, 2004). In our work, both criteria had the smallest values for the same distributions, and the exponential distribution had the greatest values; thus, it is not recommended for the evaluated characteristics.
The $R^2_{ac}$ and $r_{ac}$ coefficients were used in their conditional forms (considering fixed and random effects) and corrected for the number of model parameters. $R^2_{ac}$ indicates the degree to which the predicted value is associated with the observed value; it usually ranges between 0 and 1, where 1 indicates a perfect fit and 0 indicates no correlation. Therefore, the higher the $R^2_{ac}$ value, the stronger the relationship between the independent variables (X1, X2, …, Xn) and the dependent variable (Y) (Vonesh et al., 1996). When the number of variables in the model increases, $R^2_{ac}$ can increase, decrease, or even become negative (Gujarati, 2009). In this study, negative values of $R^2_{ac}$ were observed (Figures 2 to 5). Another way to explain these negative values is that the sum of squares of the model (x-axis) is much higher than that of the observed data (y-axis), i.e., the model using a given distribution fits the data of a given variable very poorly.
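A small numerical sketch shows how an R-square computed as 1 − SSE/SST becomes negative when the predictions lie far from the observations. The values and function names are hypothetical, and the adjustment shown is the textbook form 1 − (1 − R²)(n − 1)/(n − k), which is an assumption here and may differ from the adjustment applied by the %GLIMMIX_GOF macro.

```python
def r_squared(observed, predicted):
    """1 - SSE/SST; negative when predictions are worse than the observed mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / sst


def adjusted_r_squared(r2, n, k):
    """Textbook adjustment for k estimated parameters (intercept included)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k)


observed = [10.0, 11.0, 12.0, 13.0]          # hypothetical observations
good_prediction = [10.1, 10.9, 12.2, 12.8]   # close to the observations
poor_prediction = [20.0, 22.0, 24.0, 26.0]   # badly scaled predictions

r2_good = r_squared(observed, good_prediction)
r2_poor = r_squared(observed, poor_prediction)
print(r2_good, adjusted_r_squared(r2_good, n=4, k=2))
print(r2_poor, adjusted_r_squared(r2_poor, n=4, k=2))
```

In the second case the squared prediction errors are much larger than the variation of the observations around their mean, which drives the coefficient below zero.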
The $r_{ac}$ measures the degree of association between variables, that is, how X1, X2, …, Xn and Y are related: their covariance and the similarity between variables (Vonesh et al., 1996). For $r_{ac} = 1$ the relationship is positive and perfect; for $r_{ac} = -1$ the relationship is negative and perfect; and for $r_{ac} = 0$ there is no relationship, or the correlation is not linear (Mukaka, 2012). Both $R^2_{ac}$ and $r_{ac}$ are used to inform the correlation between variables, and not the fit quality of the model (Vonesh, 1996).
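The difference between a concordance correlation and an ordinary correlation can be illustrated with a short sketch. It assumes the usual unadjusted definition, one minus the sum of squared differences divided by the two sums of squares plus n times the squared difference in means; the parameter-adjusted conditional version reported by the %GLIMMIX_GOF macro is not reproduced. The data and function names are hypothetical.

```python
import math


def pearson_correlation(observed, predicted):
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sxy = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    sxx = sum((o - mo) ** 2 for o in observed)
    syy = sum((p - mp) ** 2 for p in predicted)
    return sxy / math.sqrt(sxx * syy)


def concordance_correlation(observed, predicted):
    """Agreement with the identity line, not just linear association."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    denom = (sum((o - mo) ** 2 for o in observed)
             + sum((p - mp) ** 2 for p in predicted)
             + n * (mo - mp) ** 2)
    return 1.0 - sse / denom


observed = [10.0, 11.0, 12.0, 13.0]      # hypothetical observations
shifted = [v + 5.0 for v in observed]    # perfectly correlated but biased predictions

print(pearson_correlation(observed, shifted))      # linear association only
print(concordance_correlation(observed, shifted))  # penalizes the constant shift
```

Predictions that are perfectly correlated with the observations but shifted by a constant keep a Pearson correlation of 1 while the concordance correlation drops, which is why the two coefficients can rank candidate distributions differently.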
The Kolmogorov-Smirnov, Cramer-von Mises, and Anderson-Darling tests should be used when the number of model parameters is known; they do not allow results to be compared when the number of parameters differs (Muller & Dutang, 2015), which is opposed to the principle of parsimony (Burnham & Anderson, 2004). The Anderson-Darling statistic is important when emphasizing the tails of candidate distributions (Muller & Dutang, 2015). As with the AIC and BIC criteria, the lower the value, the better the fit of the distribution to the data (D'Agostino & Stephens, 1986; Muller & Dutang, 2015). The results of the tests corroborate those obtained for AIC and BIC, indicating the Gaussian and lognormal distributions as the most appropriate and suggesting that both evaluations, by tests or by criteria, are efficient for choosing the distribution for the evaluated characteristics.
WAIC and LOO have great flexibility in their use. In addition to serving as information criteria for multi-model selection, the Bayesian approach fits the complete experimental model, including all effects, simultaneously with the evaluation of the likelihood function that best fits the data. A wide variety of probability distributions is available (gamma, binomial, exponential, Gaussian, among others), and users can enter information they already have about their variables as well as include a regression model that is flexible and simple to change manually. Both indexes are less biased and more robust than the Deviance Information Criterion (DIC), which is also widely used in model selection by the Bayesian method (Bürkner, 2017).
For MDAY, the WAIC and LOO criteria had the smallest values with the gamma distribution. However, these values are close to those for the lognormal distribution, converging with the results obtained with the Akaike and Bayes criteria and the AD, KS, and CvM tests.
In the evaluation of the characteristics DEy, Fi, EM, FC, YM, ESM, AM, Yr, ESr, and Ar, the criteria and tests (AIC, BIC, AD, KS, CvM, WAIC, and LOO) indicated the lognormal, gamma, and Gaussian as possible distributions, with little variation between their values, and the exponential distribution as the worst fitted.

Conclusion
Preliminary assessments that aim to identify the statistical model that best represents the reality of the data are essential to ensure the quality of subsequent statistical analyses, to avoid under or overestimation of results, and to reduce errors resulting from the use of an inappropriate distribution.
For future work, we suggest further studies on the Stan Bayesian approach, in which a complete model with all variables and distributions together can be evaluated.