Comparison of Machine Learning predictive methods to diagnose the Attention Deficit/Hyperactivity Disorder levels using SPECT

ADHD (attention deficit hyperactivity disorder) is a neurodevelopmental disorder characterized by harmful levels of inattention, disorganization, and/or hyperactivity-impulsivity. In childhood, these symptoms often overlap with those of other disorders, and they tend to persist into adulthood, interfering with relationships and academic and work life. Diagnosis, traditionally made by assessing the patient, i.e., testing and listening to relatives and teachers, has already been aided by neuroimaging. However, the visual analysis of such images to make a psychiatric diagnosis is a complex and sometimes time-consuming task. For this reason, computer-aided diagnostic tools have evolved that, when combined with machine learning (ML) techniques, can accelerate, facilitate, and improve the accuracy of diagnoses. Nevertheless, research evaluating ML models for classifying ADHD, including its severity, using brain SPECT (Single Photon Emission Computed Tomography) images is still very sparse. For this reason, this article evaluates the performance of five ML methods in the classification of ADHD: k-NN (k-Nearest Neighbors), Naive Bayes, Decision Tree, MLP (Multilayer Perceptron), and SVM (Support Vector Machine). The main goal of this analysis is to check whether the subjects have the disorder or not, and to classify the severity of those who have it using SPECT images. A database was created from SPECT images and diagnostic reports. After pre-processing these data, the best hyperparameters for the ML methods were searched, and the models were trained, tested, and finally statistically compared. The best results were obtained with SVM and k-NN, with 98% accuracy. Although ADHD diagnosis by neuroimaging is not yet a standard clinical procedure, we argue that this study can contribute to ADHD diagnosis research and support methods for the development of CAD (computer-aided diagnosis) systems.


Introduction
According to the CDC (CDC, Data and Statistics About ADHD, 2021), ADHD (attention-deficit/hyperactivity disorder) is one of the most common neurodevelopmental disorders in childhood, often persisting into adulthood. This disorder is characterized by problems in directing attention, controlling impulsive behavior, or being overly active. It occurs in most cultures and affects about 5% of children and 2.5% of adults. The Diagnostic and Statistical Manual of Mental Disorders or DSM (American Psychiatric Association, 2013) establishes three possible levels of "current severity" of ADHD: mild, moderate, and severe, depending on the degree of impairment or symptoms observed in the patient. According to De Silva, et al. (2019), early diagnosis minimizes long-term effects and helps in the development of intellectual skills. Clinical diagnosis involves several steps, and there is no single test. It usually includes a checklist to assess symptoms and consideration of the child's history from the perspective of parents and teachers (CDC, What is ADHD? 2021). Kautzky, et al. (2020) warn of a challenge to ADHD diagnosis: it is based primarily on behavioral symptoms rather than objective biomarkers, and these symptoms overlap with those of other disorders. Another difficulty is studying ADHD in adults and trying to retrospectively assess symptoms in childhood. One solution lies in neuroimaging: the use of EEG (electroencephalogram) and MRI (magnetic resonance imaging) in ADHD diagnosis, also in combination with machine learning (ML) methods, is well established (Pulini, et al., 2019). ADHD significantly affects quality of life, especially from a parent's perspective. Individuals with ADHD have lifelong impairment in psychosocial, educational, and neuropsychological functioning. It is very important that the disorder is identified early so that it can be treated effectively and the quality of life of those affected can be improved (Biederman, et al., 2012; Danckaerts, et al., 2010).
Regarding the areas of the brain whose functions are impaired by ADHD, Amen and Blake (1997), using brain SPECT (Single Photon Emission Computed Tomography) images, demonstrated that blood flow to the prefrontal cortex is decreased in children and adolescents with ADHD. Kaya, et al. (2003) demonstrated that individuals with ADHD may have significant hypoperfusion in the medial and lateral temporal cortex in the right hemisphere. Goldberg, et al. (1999) found that ADHD is associated with significant dysfunction of the frontal and temporal lobes. Santra and Kumar (2014) also observed hypoperfusion in these regions and found evidence of normalization of prefrontal activity in scans taken after successful treatment. Several other brain regions are also being studied in ADHD patients, such as the parietal lobe, cingulate gyrus, cerebellum, caudate, and thalamus. Nevertheless, the frontal region predominates in research (Santra & Kumar, 2014).
The nuclear medicine imaging modalities PET (positron emission tomography) and SPECT have been used to diagnose ADHD (Kautzky, et al., 2020; Vázquez-Abad, et al., 2020). Nuclear medicine involves administering a radiopharmaceutical to the patient, the radioactive energy of which is sufficient to penetrate the patient's body and reach a radiation detector that converts it into images (O'Malley, et al., 2020). According to Jales and Santos-Filho (2020), PET and SPECT are powerful tools for the professionals who use them. According to Daniel Amen, MD, adding neuroimaging to patients' medical histories leads to more targeted treatments (Amen, 2012). Technetium-99m is commonly used in SPECT imaging due to its relatively short half-life, in the form of the radiopharmaceuticals HMPAO (hexamethylpropyleneamine oxime) and ECD (ethyl cysteinate dimer).
For all of these modalities, evaluating every image detail by simple visual analysis in order to make a diagnosis is not a trivial task. In Alzheimer's disease (AD), for example, visual diagnosis in early stages is a difficult task that requires experienced specialists (Chaves, et al., 2009). In Parkinson's disease (PD), very subtle changes can be detected on magnetic resonance imaging of the brain by visual analysis (Haller, et al., 2012). According to Illán, et al. (2012), semiquantitative parameters can be used in studies of PD with SPECT images, where the accuracy of quantification is important to monitor disease progression and therapeutic effects. All these aspects motivate the development of automated quantification techniques that provide performance similar to that of human operators. Therefore, to assist physicians in their diagnostic work with medical images, it is essential to use a computer-aided diagnosis (CAD) technique. In CAD systems, ML methods have been increasingly used in clinical research with promising results (Dubreuil-Vall, et al., 2020).
Several studies evaluate automatic diagnostic methods based on the analysis of brain SPECT images (Horn, et al., 2009; Segovia, et al., 2010; Illán, et al., 2012). According to Horn, et al. (2009), this type of tool can help physicians in their daily practice, especially when visual assessment is inconclusive. Segovia, et al. (2010) emphasize that CAD tools are desirable. Illán, et al. (2012) believe that the requirements for a diagnostic tool are potentially met by machine learning techniques for developing CAD systems.
Machine learning (ML) is a set of methods that can automatically detect patterns in data and use them to predict future data (Murphy, 2012). It is a process that induces a function approximation (hypothesis) from a set of sample data (experience) provided to it (Faceli, et al., 2021). After the methods have been trained and have induced a hypothesis, they need to be evaluated. There are two major ML approaches: predictive (supervised) and descriptive (unsupervised). In this study, only predictive methods were used: k-Nearest Neighbors (k-NN), Naive Bayes (NB), Decision Trees (DT), Artificial Neural Networks (ANN), and Support Vector Machines (SVM). The k-NN algorithm classifies a new object based on the training examples closest to it. NB computes class probabilities under the assumption that the attribute values of an instance are independent of one another given its class. DT recursively divides a decision problem into subproblems whose solutions can be combined in the form of a tree. ANNs are distributed computing systems consisting of densely connected units (neurons) arranged in layers that compute mathematical functions. SVM searches for a hyperplane that best separates instances by their classes. For more information on these ML methods, see Murphy (2022).
Research, Society and Development, v. 11, n. 8, e54811831258, 2022 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v11i8.31258
Whether diagnosing ADHD or another psychiatric disorder based on neuroimaging and CAD, research faces similar challenges but does not always address them appropriately. Some difficulties can be highlighted: the small number of available samples relative to the high dimensionality of the image attributes, the imbalance between the frequencies of the classes, and the lack of appropriate methods to validate the results. In particular, there is a lack of previously published studies evaluating ML methods for automatic ADHD diagnosis, including classification of the severity levels of the disorder, using brain SPECT images, which suggests that this topic requires further investigation. Therefore, this is the central topic of this study. Our aim was to evaluate and compare the performance of different ML methods, using the SPECT imaging modality, in classifying ADHD patients and determining severity for those who have the disorder.
The article is organized as follows: The next subsections provide an overview of related research; Section 2 explains relevant details of our methodology; Section 3 discusses the results found; and Section 4 concludes.

Related research
Research related to this study is analyzed next, first in the context of diagnosing various psychiatric disorders and then specifically in the context of ADHD. Table 1 lists several studies that use classic ML methods to investigate automated diagnosis of various psychiatric disorders from brain SPECT images. There are studies on Alzheimer's disease, frontotemporal dementia, Parkinson's disease, cocaine addiction, autism, and disorders with amnestic symptoms. These studies were analyzed and some key aspects are highlighted in Table 1: the disease studied, the size of the dataset, the distribution of classes, the ML methods used, and whether a statistical test was used to validate the results.

Mental illnesses classification by ML methods and SPECT images
From the "Diseases" column in Table 1, it appears that the scientific community has paid more attention to Parkinson's disease (PD) and Alzheimer's disease (AD). This is understandable considering that AD is among the 10 diseases that cause the most deaths in the United States of America (Xu, et al., 2010), and the number of PD cases is expected to reach 1,238,000 by 2030 (Marras, et al., 2018). However, this should not be allowed to crowd out research into other diseases that affect mental health, as there is a wide variety of diseases and each one affects and limits patients' quality of life.
The number of samples from the datasets used in research, represented by "Number of subjects" in Table 1, is a relevant factor for accurate ML methods (Martinez-Murcia, et al., 2017). For this reason, this aspect was analyzed, and it was found that the number of examples is generally low, which can be justified by the fact that the dataset is protected or has limited access. When only a small number of instances are available, it is advisable to use a data augmentation technique (Goodfellow, et al., 2016) to systematically create "artificial data" and add them to the training set. This is because the more data available for training, the better the generalization of the model. None of the 26 papers analyzed used data augmentation, even those with fewer than 100 instances.
From "Class Distribution" in Table 1, it appears that the research generally dealt with unbalanced databases in which some classes were more frequent than others, such as in Tagare, et al. (2017), where there were 68% PD subjects but only 32% healthy subjects. This is a common issue in classification problems that needs to be addressed so that the performance of the ML algorithms is not compromised. If ignored, it biases the classification of new data toward the classes with a larger number of samples (majority classes). The papers in Table 1 were therefore analyzed, and it was found that at least half of them, even those with a relevant number of subjects, ignored this problem. This may have affected the interpretation of their results.
Regarding the ML techniques listed in the "ML Methods" column of Table 1, a wide use of SVM (Support Vector Machine) for the classification of mental illness can be seen. According to Cascianelli, et al. (2016), this could be due to its good generalizability or to the fact that its results are comparable to and often superior to those of other ML methods (Faceli, et al., 2021). Nevertheless, we advocate studies that compare the performance of more than one classifier type. This would help expand the study of ML techniques for diagnosing mental illness with SPECT neuroimaging.
Regarding the performance of the classifiers, the papers were analyzed and those that used a statistical test were identified in the column "Statistical Validation" (Table 1). It was noted that some of the studies use a statistical test only as a pre-processing technique for feature selection. Nevertheless, when comparing two or more classifiers, it is necessary to check whether there are statistically relevant differences between their results. In this regard, it was found that many studies had not statistically validated their results. Others had only tested the statistical significance of accuracy using a permutation test (Wilcox, 2017).

ADHD classification by ML methods and neuroimaging
In this section, Table 2 presents a selection of papers that have examined models for automating ADHD diagnosis based on neuroimaging, machine learning (ML), and deep learning (DL). DL is a particular subtype of ML in which the model learns to represent the world as a nested hierarchy of concepts (Goodfellow, et al., 2016).
When analyzing the image type according to the "Image Modality" column in Table 2, it is noticeable that the use of functional magnetic resonance (fMRI) and structural magnetic resonance (sMRI) is widespread. We hypothesize that the availability of public datasets such as ADHD-200 and ENIGMA-ADHD is an important factor in the choice of this imaging modality. All studies in Table 2 that used MRI with more than 500 samples ("Number of Subjects" column) used one of these two databases. sMRI images allow analysis of brain volume and anatomy, whereas fMRI measures brain activity by detecting fluctuations in blood oxygenation (Sen, et al., 2018). The paucity of work in Table 2 using nuclear medicine suggests that further investigation into automated ADHD diagnosis using ML/DL methods, PET, or SPECT imaging is needed.
It should be noted again how important it is to work with a dataset in which the distribution of classes is balanced. The "Class Distribution" column in Table 2 shows that some studies did not solve this problem, which leads to strange [...]. A technique such as SMOTE (Chawla, et al., 2002) may have helped to solve this problem.
As in Table 1, the "ML or DL Classifier" column in Table 2 shows that SVM was the most commonly used ML method. However, Peng, et al. (2013) and Qureshi, et al. (2017) evaluated the performance of SVM and ELM (extreme learning machine), a variant of a single-layer artificial neural network (Huang, et al., 2006). ELM performed better than SVM. Looking now at the performance of the ML and DL methods (column "Accuracy" in Table 2), given the discrepancies in the number of samples and the different methodological approaches in research, it is not possible to state which method is better, ML or DL, for automatic diagnosis of ADHD using neuroimaging. Finally, the column "Statistical Validation" in Table 2 shows that most studies did not report or ignored the statistical tests used to evaluate the performance of their classifier models.

Methodology
This section details our methodology, which follows a sequence adopted from KDD, Knowledge Discovery from Data (Fayyad, et al., 1996). The steps were: data acquisition, database preparation, pre-processing, finding the best hyperparameters, training and testing models, and statistical validation of results. Our algorithms were implemented in the Python programming language from the Anaconda distribution, using the scikit-learn machine learning library.

Data acquisition
Brain SPECT scans were recorded in 236 patients after an intravenous injection of 20 mCi of 99mTc-HMPAO. The method was similar to that described by Mena (2009), in which images were acquired using a Siemens Corp. dual-head ECAM system. Two image reconstructions were performed using Segami Corporation's OASIS software. The first, without attenuation correction, was used for images of the lateral, anteroposterior, and superior cerebral cortex. The second, with attenuation correction using a Chang coefficient of 0.1, was applied to parasagittal images and images of the inferior brain, from which the cerebellum was removed. This allowed us to easily examine the lower aspects of the occipital and temporal lobes and the basal ganglia.
The result of the acquisition was a set of 2D images and a diagnostic report prepared by a specialist for each of the 236 patients. Figure 1 shows an example of a brain SPECT, acquired from a patient with bipolar disorder, ADHD, obsessive-[...]

Database preparation
Of the 236 subjects, 81 had an ADHD diagnosis and 155 had no ADHD diagnosis. The original distribution of the data is shown in Figure 2, where you can see the percentages of: (a) subjects who have the disorder and those who do not; (b) severity of ADHD, with those who do not have it classified as "Not Applicable"; (c) ADHD by sex; and (d) ADHD by age group.
The pixels of the SPECT images were used as the input attributes of our database. For the output, a single attribute named ADHD LEVEL was created with four discrete values (classes): 0 for individuals not diagnosed with ADHD; 1 for individuals with mild ADHD; 2 for moderate ADHD; and 3 for severe ADHD. Our strategy was to perform two tasks simultaneously: classifying ADHD and assigning the severity level to individuals who have the disorder. The labeling was based on the diagnostic reports.
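The class encoding described above can be sketched as a small mapping. This is only an illustration: the severity strings used as keys and the helper names are hypothetical, and only the integer codes 0 to 3 come from the text.

```python
# Hypothetical severity strings, as they might be read from a diagnostic report.
# Only the integer codes 0-3 are taken from the article.
ADHD_LEVEL = {"none": 0, "mild": 1, "moderate": 2, "severe": 3}


def encode_adhd_level(severity: str) -> int:
    """Map a severity string to the ADHD LEVEL class (0 = no ADHD diagnosis)."""
    key = severity.strip().lower()
    if key not in ADHD_LEVEL:
        raise ValueError(f"unknown severity: {severity!r}")
    return ADHD_LEVEL[key]


def has_adhd(level: int) -> bool:
    """Binary view of the same label: any non-zero level means ADHD was diagnosed."""
    return level != 0


labels = [encode_adhd_level(s) for s in ["none", "Mild", "severe ", "moderate"]]
diagnosed = [has_adhd(level) for level in labels]
```

A single four-valued label lets one classifier answer both questions at once: whether the disorder is present (level greater than 0) and how severe it is.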
Finally, as mentioned earlier, studies differ on which brain regions are functionally affected by ADHD. The only region that is consistent across the research is the frontal region. Therefore, only the 3rd view (anterior view) was cropped from this group of images (Figure 1), leaving only one image per patient (Figure 3). The pixels of the resulting image were then pre-processed before being used in training/testing.

Pre-processing
Some data quality issues may affect hypothesis induction in ML methods. Therefore, it is necessary to apply pre-processing techniques such as data cleaning, transformations, and dimensionality reduction (Bishop, 2006). We consider that our original database contains too few samples to achieve good generalization of the models. In addition, the percentages show that the classes were originally unbalanced (Figure 2-b). A pre-processing strategy was therefore needed to solve both problems, i.e., to increase the number of samples and to balance the classes. For this purpose, SMOTE, the Synthetic Minority Over-sampling Technique (Chawla, et al., 2002), was used. We chose to increase the number of instances of each minority class to match that of the majority class. After applying SMOTE, every class had a frequency of 25% and the dataset grew from 236 to 620 samples.
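To make the oversampling step concrete, the following is a simplified, pure-Python sketch of the interpolation idea behind SMOTE. It is not the implementation used in this study; the function name, the toy data, and the neighbor count are assumptions chosen for illustration. It also checks the arithmetic above: with a majority class of 155 subjects and four classes, balancing yields 4 x 155 = 620 samples.

```python
import random


def smote_like_oversample(samples, target_count, k=5, seed=0):
    """Grow one minority class to target_count samples by interpolating between
    a randomly chosen sample and one of its k nearest same-class neighbours."""
    if len(samples) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    out = [list(s) for s in samples]
    while len(out) < target_count:
        base = rng.choice(samples)
        # Same-class neighbours of base, nearest first (squared Euclidean distance).
        others = [s for s in samples if s is not base]
        others.sort(key=lambda s: sum((a - b) ** 2 for a, b in zip(s, base)))
        neighbour = rng.choice(others[:k])
        lam = rng.random()  # interpolation factor in [0, 1)
        out.append([a + lam * (b - a) for a, b in zip(base, neighbour)])
    return out


majority_count = 155                     # subjects without ADHD (from the text)
n_classes = 4                            # ADHD LEVEL values 0-3
total_after = majority_count * n_classes

# Toy two-attribute minority class (illustrative values only).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
balanced = smote_like_oversample(minority, target_count=10, k=2)
```

Because every synthetic point lies on a segment between two real samples of the same class, the new data stay inside the region already occupied by that class.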
However, we recommend that images that need to be standardized or normalized be processed before applying SMOTE. Therefore, pixel standardization and normalization in our algorithms were performed immediately after loading the images, before SMOTE was applied. The next step was feature selection, because each image had a size of 611 × 519 pixels, which would result in 317,109 input attributes to be processed by the ML algorithms. Consequently, it was necessary to reduce the number of attributes to improve the performance of the induced model, reduce the computational cost, and make the results more understandable. Because our dataset was too large to fit in memory and had to be processed in parts, Incremental Principal Component Analysis (IPCA) was used instead of PCA, Principal Component Analysis (Murphy, 2012, p. 387). IPCA creates a low-rank approximation of the input data using an amount of memory that is independent of the number of input samples. After applying IPCA, the dimensionality dropped to 100 input attributes for each of the 620 samples.
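The dimensionality reduction step can be sketched with scikit-learn's IncrementalPCA, fitted batch by batch. This is a minimal sketch under stated assumptions: the data are random stand-ins, the number of input attributes is reduced to 500 to keep the example light (the real images have 611 × 519 = 317,109 pixels), and the batch size of 155 is an arbitrary choice, not a value reported in this study.

```python
import numpy as np
from sklearn.decomposition import IncrementalPCA

rng = np.random.default_rng(0)
n_samples, n_features = 620, 500     # 500 is a stand-in for 317,109 pixel attributes
X = rng.random((n_samples, n_features))

n_components = 100                   # dimensionality reported in the text
batch_size = 155                     # assumed; each batch needs >= n_components samples

ipca = IncrementalPCA(n_components=n_components)
for start in range(0, n_samples, batch_size):
    # Only one batch is processed per call, so memory use does not grow with n_samples.
    ipca.partial_fit(X[start:start + batch_size])

# Project the data batch by batch as well.
X_reduced = np.vstack([
    ipca.transform(X[start:start + batch_size])
    for start in range(0, n_samples, batch_size)
])
```

In a real pipeline each batch would be loaded from disk inside the loop instead of being sliced from an in-memory array.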

Search for best hyperparameters
Most ML algorithms have hyperparameters that affect their performance, processing time, or memory consumption. The choice of these parameters, which can be manual or automatic, affects the quality of the model and its ability to generalize to new inputs (Goodfellow, et al., 2016).
Automatic selection was used in this study. The GridSearchCV class from the scikit-learn library was instantiated with the following parameters: a classifier object, a parameter grid specific to each classifier type, the performance evaluation strategy, and a StratifiedKFold object for the stratified cross-validation strategy. This process was performed for each of the ML methods to determine the best parameter configuration. The nested cross-validation strategy (Cawley & Talbot, 2010) was used, and then the accuracy and the F-measure, a harmonic mean of precision and recall, were calculated (Olson & Delen, 2008, p. 138). The percentage of instances and the image resolution rate were also varied to investigate how data reduction might affect the results.
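The search procedure described above can be sketched as follows. This is an illustration only: the synthetic dataset, the parameter grid, and the numbers of inner and outer folds are assumptions, not the values evaluated in this study (those are listed in Table 3).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic four-class stand-in for the pre-processed SPECT attributes.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

param_grid = {"kernel": ["linear", "rbf"], "C": [1, 10]}   # illustrative grid only
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: GridSearchCV picks the best configuration on each training split.
search = GridSearchCV(SVC(), param_grid, scoring="accuracy", cv=inner_cv)

# Outer loop: the tuned model is scored on data not used for tuning (nested CV).
nested_scores = cross_val_score(search, X, y, scoring="accuracy", cv=outer_cv)

# Fitting on all data exposes the selected configuration.
search.fit(X, y)
best_params = search.best_params_
```

Keeping the tuning inside the inner folds means the outer score is not inflated by hyperparameters that were chosen on the test data.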

Training and testing
Once the best hyperparameter settings for each classifier type were determined, the training phase began. A procedure was implemented to train each ML method. To evaluate performance, cross-validation was chosen, a learning strategy in which the dataset is divided into a fixed number of subsets (folds) and each fold is used once as test data while the rest is used for training. 10 subsets were used (tenfold cross-validation), but in such a way that each subset maintains an approximately balanced distribution of classes (stratified cross-validation).
In addition, the training and cross-validation were performed in a loop with 30 iterations. According to Witten, et al. (2011), this generally helps mitigate biases caused by the particular sample selected for validation. In the end, the average of the results was calculated using the metrics of accuracy and F-measure. Demšar (2006) reported that the ML community is increasingly aware of the need for statistical validation of published results. He reasoned that this is due to the maturity of the field, the more frequent use of ML in applications, and the availability of new ML frameworks that facilitate the implementation and comparison of algorithms. García and Herrera (2008) also explored the use of statistical comparisons of classifiers across multiple datasets, extending Demšar's work.
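The repeated stratified tenfold cross-validation described in this subsection can be sketched with scikit-learn. This is one possible way to express the 30-iteration loop, not necessarily the procedure used in our algorithms; the synthetic data, the k-NN setting, and the macro-averaged F-measure are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Synthetic four-class stand-in for the pre-processed SPECT attributes.
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# 10 stratified folds, repeated 30 times with different shuffles: 300 train/test runs.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=30, random_state=0)

results = cross_validate(
    KNeighborsClassifier(n_neighbors=3),   # assumed setting, for illustration
    X, y, cv=cv,
    scoring=["accuracy", "f1_macro"],      # macro averaging is an assumption
)

mean_accuracy = results["test_accuracy"].mean()
mean_f_measure = results["test_f1_macro"].mean()
```

Averaging over all runs reduces the influence of any single train/test partition on the reported score.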

Statistical validation
Given our problem context, where the same dataset was used for all algorithms (paired samples), and the measurement level of the performance variables (ordinal), the Friedman test (Friedman, 1937), a nonparametric statistical test, was applied. The null hypothesis H0 was "there are no differences between the samples", meaning that there are no statistically relevant differences between the performance of the models. If H0 is rejected, the alternative hypothesis is adopted. This leads to a pairwise post hoc test, for example, comparing each pair of classifiers using the Nemenyi test (Nemenyi, 1963).
To facilitate analysis of the results of the paired tests, the CD diagram was used (Demšar, 2006), a simple diagram that shows the critical difference for the Nemenyi test. This diagram presents the order of the algorithms, the magnitude of the differences between them (in terms of ranks), and the significance of the observed differences.
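The quantities behind the Friedman test and the CD diagram can be sketched in plain Python, following the formulas in Demšar (2006). The accuracy values below are toy numbers, not results from this study, and N = 30 is an assumption based on the 30 iterations mentioned earlier. The constant 2.728 is the Nemenyi critical value for five classifiers at a significance level of 0.05, as tabulated by Demšar.

```python
import math


def average_ranks(scores):
    """scores[i][j] is the accuracy of classifier j on run i.
    Rank 1 is the best classifier of a run; tied classifiers share the mean rank."""
    n, k = len(scores), len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1        # mean of 1-based positions pos+1..end+1
            for j in order[pos:end + 1]:
                totals[j] += shared
            pos = end + 1
    return [t / n for t in totals]


def friedman_statistic(mean_ranks, n):
    """Friedman chi-square: 12N / (k(k+1)) * (sum of R_j^2 - k(k+1)^2 / 4)."""
    k = len(mean_ranks)
    return 12 * n / (k * (k + 1)) * (sum(r * r for r in mean_ranks) - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n, q_alpha):
    """Critical difference: CD = q_alpha * sqrt(k(k+1) / (6N))."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n))


# Toy accuracies for five classifiers over three runs (illustrative only).
toy_scores = [[0.98, 0.97, 0.90, 0.85, 0.80]] * 3
ranks = average_ranks(toy_scores)
chi2_f = friedman_statistic(ranks, n=len(toy_scores))

cd = nemenyi_cd(k=5, n=30, q_alpha=2.728)   # n=30 is an assumption
```

Two classifiers are considered significantly different when their mean ranks differ by at least the critical difference; in the CD diagram, classifiers whose ranks are closer than that are connected by a bar.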

Results
The discussion begins with the best hyperparameters found automatically for our data. Table 3 lists the evaluated values and the best configuration found for each method. Although Naive Bayes has no hyperparameters to tune, two scikit-learn implementations were tested, GaussianNB and BernoulliNB, and the best result was obtained with GaussianNB. In this parameter search, the percentage of instances and the image resolution rate were varied, and it was found that the more instances, the better the results. Decreasing the image resolution only slightly improved the results for most methods, except for k-NN. Therefore, 100% of the instances and the original resolution of the images were chosen for training/testing and statistical validation. To reduce as much as possible the problems of our dataset, namely the small number of instances and the unbalanced classes, SMOTE was added to the pre-processing phase. However, to better understand how this technique affects the results, the models were first trained with the original data distribution, and then SMOTE was applied to evaluate performance using the metrics of accuracy and F-measure.
The results are presented in Table 4. From the "Accuracy" and "F-measure" columns, it can be seen that the use of SMOTE improves the accuracy of the methods by about 33% on average. MLP, for example, improved its accuracy by 39%.
For this reason, only the results for classes balanced by SMOTE were included in our analyses. It can also be observed that the differences between the values of the "Accuracy" and "F-measure" columns (right side of Table 4) are extremely small. This could be explained by the balance of classes achieved by SMOTE. Accordingly, accuracy was chosen as the performance measure for comparing the classifiers.
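The relationship between accuracy and F-measure noted above can be illustrated with two toy confusion matrices. The numbers are invented for the example and are not taken from Table 4, and the macro averaging of the per-class F-measure is an assumption, since the averaging scheme is not stated in the text.

```python
def accuracy(cm):
    """Fraction of correctly classified instances (rows: true class, columns: predicted)."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def macro_f_measure(cm):
    """Unweighted mean over classes of the harmonic mean of precision and recall."""
    k = len(cm)
    f_values = []
    for c in range(k):
        tp = cm[c][c]
        predicted = sum(cm[r][c] for r in range(k))   # column sum
        actual = sum(cm[c])                           # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f_values.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f_values) / k


# Balanced toy case: four classes with 10 instances each.
balanced_cm = [[9, 1, 0, 0],
               [1, 9, 0, 0],
               [0, 0, 10, 0],
               [0, 0, 0, 10]]

# Unbalanced toy case: the minority class is never predicted.
unbalanced_cm = [[95, 0],
                 [5, 0]]

acc_bal, f_bal = accuracy(balanced_cm), macro_f_measure(balanced_cm)
acc_unb, f_unb = accuracy(unbalanced_cm), macro_f_measure(unbalanced_cm)
```

In the balanced case both metrics equal 0.95, whereas in the unbalanced case the accuracy stays at 0.95 while the macro F-measure falls below 0.5, which is consistent with the explanation given above.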
The methods are ranked by "Accuracy" (bold values) in Table 4. The three best performances were obtained by SVM, k-NN, and MLP classifiers, followed by Decision Tree (DT) and Naive Bayes (NB). It is believed that the first three methods have better generalization than the other two. Using the Friedman test, it was confirmed that the observed differences in accuracy were statistically significant.
Therefore, based on the accuracy, the mean ranks for each classifier were calculated and the CD diagram (Critical Differences diagram) was created for the Nemenyi test. This diagram helped to describe the differences more accurately and to understand them better. From the CD diagram (Figure 4), it can be seen that the connected classifiers (SVM and k-NN; MLP and DT; and DT and NB) are not significantly different. In any case, from the CD diagram, it can be concluded that the performances of SVM and k-NN are statistically superior to those of MLP, DT, and NB. Thus, the importance of statistical validation as a suitable approach for comparing the results of different ML methods has been confirmed.
Regarding the two best classifiers, SVM and k-NN, it can be seen in Table 3 that the best SVM version for our experiments was the one using the radial basis function (RBF) kernel (Lazzaro & Montefusco, 2002). It can also be seen from Table 4 and Figure 4 that the SVM and k-NN results are very close. In fact, k-NN and SVM with RBF kernel perform similarly. However, SVM is more memory-efficient than k-NN at prediction time because SVM only needs to retain the examples that make up the support vectors, whereas k-NN must keep the entire training set.

Final considerations
In this work, the performance of five machine learning (ML) methods was evaluated: k-NN, Decision Trees (DT), Naive Bayes (NB), MLP, and SVM in classifying ADHD patients using brain SPECT images. There is strong evidence that the ML methods can provide satisfactory models that can help in ADHD diagnosis, including severity level classification. SVM and k-NN were the best ML methods for this problem. The high accuracy rates of the models generated may reflect correct decisions about our methodological procedures. We addressed the data problems, searched for the best hyperparameter settings, and applied training/testing procedures appropriate for our problem context. Analogous to Vázquez-Abad, et al. (2020), we can suggest that ADHD severity level classification applications developed with our models would potentially be more accurate, faster, and less expensive, which would improve patient treatment.
It is expected that this work will be useful for research on computational methods for building CAD applications for automatic diagnosis of ADHD. In addition to comparing the performance of different machine learning methods, it offers methodological directions, such as hyperparameter values. Nevertheless, much more research and refinement is needed. Future improvements could include classifying ADHD subtypes, considering the comorbidities of ADHD, and accounting for the heterogeneity of this disorder (Pulini, et al., 2019).