Affective computing in the context of music therapy: a systematic review

Music therapy is an effective tool to slow down the progress of dementia, since interaction with music may evoke emotions that stimulate the brain areas responsible for memory. This therapy is most successful when therapists provide adequate and personalized stimuli for each patient. Such personalization is often hard, so Artificial Intelligence (AI) methods may help in this task. This paper presents a systematic review of the literature in the field of affective computing in the context of music therapy. We particularly aim to assess AI methods that perform automatic emotion recognition applied to Human-Machine Musical Interfaces (HMMI). To perform the review, we conducted an automatic search in five of the main scientific databases in the fields of intelligent computing, engineering, and medicine. We searched all papers published from 2016 to 2020 whose metadata, title, or abstract contained the terms defined in the search string. The systematic review protocol resulted in the inclusion of 144 works from the 290 publications returned by the search. Through this review of the state of the art, it was possible to list the current challenges in automatic emotion recognition, to recognize the potential of automatic emotion recognition for building non-invasive assistive solutions based on human-machine musical interfaces, and to map the artificial intelligence techniques in use for emotion recognition from multimodal data. Thus, machine learning for the recognition of emotions from different data sources can be an important approach to optimize the clinical goals to be achieved through music therapy.


Motivation
With the aging of the world population, several countries have been facing considerable changes in the last decades (UN, 2020; EC, 2020). The elderly population has increased, and so has the prevalence of diseases associated with old age, such as osteoporosis, hypertension, and dementia (NRC, 2001; WHO, 2017, 2019; Ricci, 2019). Alzheimer's Disease and cerebrovascular ischemia are the two most important causes of dementia worldwide (Nichols et al., 2019; Rizzi et al., 2014).
Several studies show that music therapy has the ability to slow down the progress of dementia through musical stimuli and music education (Gallego & Garcia, 2017; de Souza et al., 2017; Fang et al., 2017; King et al., 2019). Interaction with music stimulates the areas of the brain responsible for memory through emotions. However, the effectiveness of music therapy is closely related to the recognition and correct stimulation of emotions in the patient by the therapist. This task of recognizing emotions is often arduous, especially for less experienced professionals.
Artificial Intelligence (AI) algorithms have been shown to be effective in solving complex classification problems, including the ones related to emotion recognition (Poria et al., 2017; Cambria, 2016). Thus, emotion recognition performed by AI methods has the potential to contribute to the construction of an interface capable of helping music therapists in determining musical genres, styles, and rhythms. This interface may be of great help to optimize intervention in the context of the treatment of Alzheimer's disease and mild cognitive impairment.
Thus, this document proposes a systematic review of the literature in the field of Computational Intelligence applied to emotion recognition in Electroencephalography (EEG), voice signals, and facial expressions. From this review, we particularly aim to identify AI methods to perform automatic emotion recognition applied to human-machine musical interfaces. In particular, this review seeks to answer the following research questions: (1) What are the current challenges in automatic emotion recognition? (2) How useful is automatic emotion recognition for non-invasive assistive solutions based on human-machine musical interfaces? (3) Which AI techniques are being used in emotion recognition? (4) Which deep network architectures are used for music recommendation based on emotion recognition from multimodal data?
The next section presents background information regarding aging, cognitive impairment, affective computing, and music therapy. In the Methodology section, we describe the systematic search protocol. Then, the results are presented and discussed.
Finally, in the last section, we highlight some conclusions and limitations of this work.

Background
This section presents the main theoretical references that form the basis for carrying out and understanding this research. We explore the concept of cognitive deficits, focusing on Alzheimer's Disease. Then, we give a brief introduction to emotions and to emotion recognition, considering the forms of and stimuli for identification, emphasizing the field of study of Affective Computing. Finally, we present a brief review of therapies and interventions.
Additionally, we discuss how they are being assisted by technologies and enhanced with musical techniques.
To introduce the discussion concerning cognitive deficits and dementia, some aspects need to be outlined. It is important to clarify that cognition refers to the way our brain learns, perceives, remembers, and also processes the information absorbed by the different senses (Sheffield et al., 2018). Therefore, cognitive deficit is characterized as an obstacle in this development, especially with respect to learning and intellectual limitations.
Also known as mild cognitive impairment, cognitive deficit can cause slight loss of memory and attention, and difficulties in logical reasoning; however, it does not amount to dementia (Hamdan, 2008). In the elderly, these symptoms are often confused with natural effects of aging. Even so, they are warning signs for both individuals and their families, since in some cases they can evolve into, or already be, the early stages of Alzheimer's disease (AD) (Pais et al., 2020).
Despite the similarities, the differentiation between cognitive decline and AD is essential to avoid confusion between the two conditions. AD is the most common type of dementia, a general term used to describe conditions that occur when the brain can no longer function properly. In the literature, AD is classified as a chronic, degenerative, and progressive dementia (Lourinho & Ramos, 2019). According to Caetano et al. (2017), neurodegenerative diseases are those which cause irreversible neuron degeneration. According to research carried out by Bertazone et al. (2016), Alzheimer's Disease is characterized by changes in memory; nonetheless, memory change is rarely the most detectable symptom. Failures in cognition, motor skills, and language can also appear as the first symptoms and tend to worsen as the disease progresses.
Although there is no cure for AD, other forms of intervention have been developed to promote an improvement in the Alzheimer's patient's quality of life (Caetano et al., 2017). In order to contemplate the individual in a holistic way, the treatment requires a multidisciplinary team, with an interdisciplinary approach, to combine pharmacological and non-pharmacological measures. de Souza et al. (2017) cite psycho-corporal and biological therapies, and music therapy as examples of these interventions. Nonetheless, several other approaches have also been emerging in the fields of affective computing and computational intelligence.
Defining emotions is not a trivial task, as this term is frequently used in different contexts and is present in everyday situations (Paxiuba & Lima, 2020). It is worth emphasizing that emotions play an essential role in the social formation of any human being. Studies clarify that emotional expressions are composed of variables that can be directly related to cognitive aspects (Le & Provost, 2013; Izard, 1977). Hence, emotions can manifest in different ways in each individual, including sensations, facial expressions, and body movements. For this reason, emotions are among the most important experiences, because they guide choices, motivations, and decisions, among other aspects. Additionally, they are essential for the process of verbal and non-verbal communication (Marosi-Holczberger et al., 2013; Dorneles et al., 2020).
As already mentioned, for each emotion, there is a definition and a peculiar way of manifesting itself in each individual.
When it comes to recognizing these emotions, humans are often able to feel and/or identify each other's emotional state. This detection may come naturally to people, but it is still a difficult task for computers. It is in this specific context that we highlight a sub-area of Artificial Intelligence which studies emotions in computers, called Affective Computing (AC) (Paxiuba & Lima, 2020).
The term "Affective Computing" was proposed by Rosalind Picard in 1997(Nalepa et al., 2019, and refers to a field of research that is totally interdisciplinary with other areas of knowledge (such as Biomedical Engineering, Psychology, and Computer Science). This field of research seeks to develop computational and emotion recognition methods for diverse purposes and applications. Briefly, AC studies how computers can recognize, model and express emotions (and other human psychological aspects), and how they respond to them (Picard, 1997).
In this scenario, there are several ways to investigate the recognition of emotions, all of which depend on data. According to González & McMullen (2020), data can come from different sources, such as voice, facial expressions, and physiological signals.
For emotion classification through the voice signal, cultural and language patterns can also be considered in model development. Some attributes are considered more relevant and are well established for this type of prediction. Among them, we can mention pitch (Sondhi, 1968), energy or intensity (Ingale & Chaudhari, 2012), formants (Goudbeek et al., 2009), and mel-frequency cepstral coefficients (MFCC) (Han & Chan, 2006), along with the characteristics common to most waveforms.
In the literature on audio-based recognition, Convolutional Neural Networks (CNN), Support Vector Machines (SVM) (Sonawane et al., 2017), and Generative Adversarial Networks (GAN) (Chatziagapi et al., 2019) are quite recurrent and are considered the leading models for this type of classification. It is also worth emphasizing the popularity of the MFCC as one of the most used attributes. According to Han & Chan (2006), the MFCC is a parametric representation of the frequency spectrum of the voice signal, on a scale very close to that of the human auditory system, which behaves non-linearly in frequency.
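To make these attributes concrete, the sketch below extracts MFCCs, pitch, and energy from a voice recording and summarizes them into a single feature vector, as is commonly done before feeding a classifier such as an SVM. This is a minimal illustration assuming the librosa library; the file path and parameter values are hypothetical, not taken from any reviewed study.

```python
# Minimal voice-feature extraction sketch (hypothetical file path).
import librosa
import numpy as np

# Load the recording; sr=16000 resamples to a common speech rate.
signal, sr = librosa.load("sample_voice.wav", sr=16000)

# 13 MFCCs per frame: a compact, perceptually motivated
# representation of the short-term spectrum (mel scale).
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

# Pitch (fundamental frequency) and RMS energy, two other attributes
# commonly cited for emotion prediction from speech.
f0 = librosa.yin(signal, fmin=50, fmax=400, sr=sr)
energy = librosa.feature.rms(y=signal)

# Summarize frame-level features into one fixed-length vector
# (mean and standard deviation per attribute).
features = np.concatenate([
    mfcc.mean(axis=1), mfcc.std(axis=1),
    [f0.mean(), f0.std()],
    [energy.mean(), energy.std()],
])
print(features.shape)  # (30,)
```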
Another way to perform human affective recognition is through physiological signals. From the Peripheral Nervous System, the most evaluated signals are the Galvanic Skin Response (GSR), the respiration rate, the skin temperature, and the Electrocardiogram (ECG). From the Central Nervous System, the signal evaluated is the Electroencephalogram, commonly known as EEG. In this type of approach, models which combine the analysis of more than one signal tend to achieve better performance. Knowing that physiological and cognitive responses can be significantly impacted by emotions (Brosch et al., 2013), combined classifiers that use multimodal analysis become very reliable for the identification of these sensations. They are also an alternative to voice- and image-based recognition methods.
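As a concrete illustration of such a combined classifier, the sketch below performs simple feature-level fusion of EEG, GSR, and ECG descriptors and trains an SVM, one of the most recurrent models in the reviewed literature. The arrays are synthetic placeholders, and the feature choices are assumptions for illustration only.

```python
# Feature-level (early) fusion sketch with synthetic placeholder data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_trials = 200
eeg = rng.normal(size=(n_trials, 32))  # e.g., band-power features
gsr = rng.normal(size=(n_trials, 4))   # e.g., tonic/phasic statistics
ecg = rng.normal(size=(n_trials, 6))   # e.g., heart-rate variability
y = rng.integers(0, 2, size=n_trials)  # e.g., low vs. high arousal

# Concatenate the modalities into one feature vector per trial.
X = np.hstack([eeg, gsr, ecg])

# Standardize and classify with an RBF-kernel SVM.
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
scores = cross_val_score(clf, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```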
Another popular method for emotion recognition is through facial expressions (Y. Wang & Kosinski, 2018). This method is somewhat more direct, intuitive, and easy to identify. AI models based on this perspective almost always rely on attributes that can be visually identified, such as dark hair or light eyes. However, these features are extremely difficult to standardize and to extract. Thus, studies that take into account varieties of physiognomy, age, gender, and demographic groups are a trend in recent research (Buolamwini & Gebru, 2018).
AI models focused on images and videos have a high computational cost, due to the large amount of data handled throughout the various pre-processing and classification stages. However, this cost can be offset by better performance and accuracy in day-to-day applications (Jeong & Ko, 2018) where other approaches would not be feasible or practical.
Lastly, it is worth taking into account the fact that emotions are brief psychophysiological phenomena. Thereby, collecting data on emotions spontaneously is still a challenging task. For this reason, in research environments, it is often necessary to place the individual in situations designed to evoke certain emotions. Emotions can be evoked through odors and visual or auditory stimuli (Meska et al., 2020). The visual and auditory stimuli can be images, videos, songs (Vicencio-Martínez & Garay-Jiménez, 2019), affective scenes using Virtual Reality (T. Teo & Chia, 2018), and others.
Considering what has been exposed so far, it is important to contextualize these concepts in interdisciplinary fields that include music, education, health, and technology. These fields, when combined, have potential for therapeutic applications and diversified interventions. The interventions include games, affective entertainment, and health rehabilitation assisted by Virtual Reality (Bulagang et al., 2021). Music, for example, is a resource that has the ability to stimulate and develop the brain (Kirana et al., 2018). It has a variety of applications for therapeutic purposes, such as emotion regulation, social interaction (Agres et al., 2021), and motor function rehabilitation (Dechenaud et al., 2019). Music can also be applied in physical therapy: since the exercises are often monotonous and repetitive for the patients, music can increase their motivation and involvement in the session (Colombo et al., 2019).
With regard to technologies for therapeutic purposes, we highlight games, which add entertainment to the therapy moment, and improve patient engagement (Fonteles et al., 2018). This is possible because games offer rewards and different levels of challenges. These characteristics are advantageous because they involve the individuals in the rehabilitation process.
Simultaneously, they increase the individual's motivation during the exercises (Agres et al., 2021). Games can have different classifications depending on the goals of the rehabilitation process. For example, exergames aim to improve the health and well-being of elderly people through stimuli to perform physical activities (Crespo et al., 2016). Another category is serious games. According to Agres et al. (2021), serious games can be an alternative to train the skills of patients with Parkinson's Disease. They can also assist in treating traumatic brain injury and in training patients with dementia.
Virtual reality (VR) applications are pointed out by therapists as a highly immersive and non-threatening method for patients to practice multisensory integration in a real context (Lubetzky et al., 2019). As shown in the research carried out by Geraets et al. (2021), VR can make therapies more available and cost-effective for a larger group of patients. In addition, it can increase the intensity of treatment through home exercises that complement face-to-face therapies. The authors emphasize that, with VR, innovative strategies can not only improve therapeutic interventions but can also be used to investigate mechanisms involved in the persistence of mental disorders. As in games, VR applications are also diverse. As a more engaging and realistic approach, VR contributes to the stimulation of the senses and of thought (T. Teo & Chia, 2018). These stimuli in VR can even be customized according to the purpose of the rehabilitation, as in the study by Cameirão et al. (2017), where the stimuli were customized for the rehabilitation of stroke survivors with mild cognitive impairment (MCI).
Still in the context of technologies, therapies, and interventions, affective and social robotics emerges as an opportunity to carry out personalized therapies (Agres et al., 2021). These technologies can contribute to making the rehabilitation process less dependent on therapeutic expertise (Kikuchi et al., 2018), thus reducing the burden on the physicians who perform the interventions (Agres et al., 2021). In general, a robotic system can be used for many purposes. It can be a bridge for interaction with autistic children, improving their social skills, assist in the treatment of children with cancer (Ranjkar et al., 2019), and contribute to wrist (English & Howard, 2017a) and upper limb (Kikuchi et al., 2018) rehabilitation. Furthermore, it can contribute to therapeutic interventions that support people with visual and intellectual disabilities (Wingerden et al., 2020).
It is worth reiterating that the studies mentioned in this section confirm that the inclusion of games, VR, or robotics for therapeutic purposes, when integrated with musical strategies, can be a powerful tool. In fact, music makes a real difference in the rehabilitation process and also influences the user's affinity with the systems. For this reason, the search for musical solutions for applications in healthcare grows continuously. A point that cannot be overlooked is the importance of properly trained professionals to assist in the development of these technologies, so that their effectiveness is truly beneficial.

Methodology
This systematic review selected primary studies based on keywords, search period, and both inclusion and exclusion criteria. As sources, we chose five of the main scientific databases in the fields of intelligent computing, engineering, and medicine (i.e., IEEE Xplore, MedLine/PubMed, SCOPUS, Science Direct, and Springer Link). The systematic review was conducted with an automatic search in these databases using keywords and period as filters. In particular, the scope of this study was to focus on the methodological aspects of state-of-the-art works, especially issues related to the most explored computational and artificial intelligence methods.
Therefore, we searched all papers published from 2016 to 2020 whose metadata, title, or abstract contained the terms defined in the following search string: ("Artificial Intelligence" OR "Deep Learning" OR "Machine Learning" OR "Computational Intelligence" OR "Neural Network" OR "Deep Kernel") AND ("Electroencephalography" OR "EEG" OR "Neural Signals" OR "Brain Signal") AND ("Voice Signals" OR "Speech") AND ("Emotion Recognition" OR "Recognition of Emotion" OR "Affective") AND ("Alzheimer" OR "Dementia" OR "Degenerative Disorder" OR "Neurodegenerative Disease" OR "Neurodegenerative Disorder" OR "Cortical Disorder") AND ("Human-Machine Musical Interfaces" OR "Affective Music" OR "Music Therapy" OR "Music Biofeedback").
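For clarity, the sketch below shows how such a Boolean string can be interpreted programmatically: a record matches only if every AND-group is satisfied by at least one of its OR-terms and the publication year falls within the review window. This is an illustrative reimplementation under our own assumptions, not the query engine of any of the databases used.

```python
# Illustrative filter equivalent to the Boolean search string:
# each inner list is an OR-group; all groups must match (AND).
SEARCH_GROUPS = [
    ["artificial intelligence", "deep learning", "machine learning",
     "computational intelligence", "neural network", "deep kernel"],
    ["electroencephalography", "eeg", "neural signals", "brain signal"],
    ["voice signals", "speech"],
    ["emotion recognition", "recognition of emotion", "affective"],
    ["alzheimer", "dementia", "degenerative disorder",
     "neurodegenerative disease", "neurodegenerative disorder",
     "cortical disorder"],
    ["human-machine musical interfaces", "affective music",
     "music therapy", "music biofeedback"],
]

def matches(record_text: str, year: int) -> bool:
    """True if the metadata/title/abstract text satisfies all AND-groups
    and the paper falls within the 2016-2020 search period."""
    text = record_text.lower()
    return 2016 <= year <= 2020 and all(
        any(term in text for term in group) for group in SEARCH_GROUPS
    )
```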
Article selection was performed in five phases (Figure 1). In the first phase, we identified the number of papers from each scientific database. Then, in the second phase, we assessed their suitability against four exclusion criteria (EC). As EC, we excluded duplicate works, studies that did not include any computational tool, documents classified as posters, tutorials, editorials, books, annals, or reviews, and studies based on invasive techniques.
The third phase consisted of an evaluation of each paper's introduction and conclusion, in order to select studies that met at least one of the following inclusion criteria (IC): IC1) studies with computational tools applied to EEG signals; IC2) studies with computational methods applied to voice signals; IC3) studies with computational tools applied to physiological data (Galvanic Skin Response, Respiratory Frequency, Heart Rate, Electrocardiogram, Electrooculogram); IC4) studies with computational tools applied to image or video processing; IC5) studies that use emotion recognition in EEG signals; IC6) studies with emotion recognition in voice signals; IC7) studies with emotion recognition in physiological signals; IC8) studies that use emotion recognition in images or videos; IC9) studies that use Human-Machine Musical Interfaces (HMMI).
After reading the remaining papers, in the fourth phase, we assigned scores to each of them based on the quality checklist in Table 1. To assign these quality scores, the articles were carefully read and evaluated for their formal aspects.
Thus, each article was assigned a score for each question presented in Table 1. The score could be 0 (zero) if the text did not meet the criterion, 0.5 if it was partially met, or 1 if the criterion was well presented in the text. Finally, in the fifth step, we grouped the studies according to their content.
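As a worked example of this scoring scheme, the snippet below computes the average quality score (AQS) of a hypothetical article; the criterion names are placeholders, since the actual questions are those listed in Table 1.

```python
# Hypothetical checklist answers: 0 = not met, 0.5 = partially met,
# 1 = well presented. The AQS is the mean over all criteria.
scores = {
    "goals clearly stated": 1.0,
    "methodology described": 1.0,
    "contribution discussed": 0.5,
    "limitations discussed": 0.0,
}
aqs = sum(scores.values()) / len(scores)
print(f"AQS = {aqs:.2f}")  # AQS = 0.62
```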

Results and Discussion
Four reviewers assessed the papers. It is important to mention that, during the search process, we needed to reduce the length of the search strings used in Science Direct and Springer Link, since these databases did not support the original number of Boolean operators. Thus, for Science Direct we used the terms: ("EEG") AND ("Speech") AND ("Emotion Recognition") AND ("Dementia") AND ("Music"). In Springer Link, we used: ("Machine Learning") AND ("EEG") AND ("Speech") AND ("Emotion Recognition") AND ("Dementia") AND ("Music").
In the last phase of the review, we grouped the selected studies according to the division shown in Figure 3. The studies were combined into six groups: Games, Virtual Reality, Affective Robotics and Therapies (G1); Physiological and behavioral responses induced by acoustic signals (G2); Emotion recognition associated with acoustic stimuli (G3); Non-invasive assistive solutions based on human-machine musical interfaces (G4); Musical composition, recommendation and customization (G5); and other approaches (G6). This last group consists of studies that do not fit any of the other groups.
As shown in Figure 3, a minority of the studies (8%) are in the fourth group, regarding non-invasive assistive solutions based on human-machine musical interfaces. G1 and G3 together concentrate almost half of the studies (47%). Thus, it is possible to see that in recent years many studies have been conducted in the areas of games, virtual reality, affective robotics, therapies, and emotion recognition associated with acoustic stimuli. 19% of the papers are related to physiological and behavioral responses induced by acoustic signals (G2). Finally, G5 and G6 each account for 13% of the studies.
All selected studies were evaluated according to the previously defined quality criteria. Then, we calculated their average quality score (AQS) as the mean value of the scores from all criteria. Figure 4a shows the average score of all quality criteria for each group. Most of the studies performed well in describing their goals and methodology; however, few studies provided information regarding their contribution and limitations. The AQS for all groups is shown in Figure 4b.
Overall, all groups showed an AQS above 0.7. The studies in G3 achieved the best quality scores, while the lowest-quality papers were found in G5. This may indicate that higher-quality papers are related to studies on emotion recognition associated with acoustic stimuli, while there are still few relevant studies on the customization of musical content.
Regarding the year of publication, Figure 5 shows the distribution of papers published over the last five years for each group. We observe an increase in the number of studies in the last two years (2019 and 2020), especially in groups G2 and G3. This increase shows a growing interest in the effects of music and other acoustic stimuli on human beings.

Games, virtual reality, affective robotics and therapies
In the present study, it was possible to identify that music brings several health benefits and contributes to the healthy development of the brain. In addition to music, we identified other gamified approaches, such as games, Virtual Reality (VR), and affective robotics. Such approaches were proposed by the articles listed in Table 2. All the aforementioned approaches, when applied to therapies for rehabilitation or treatment purposes, show excellent therapeutic potential. This occurs because they contribute to the patient's involvement in the exercises and help increase their motivation.
It is evident that emotion recognition approaches mostly use music, videos, and images (static and dynamic) to stimulate emotions in individuals. Yet, some studies already use affective scenes in Virtual Reality to stimulate and induce emotions, and then perform emotion recognition through electroencephalographic signals (T. Teo & Chia, 2018). In another case, the authors use VR scenes along with music as stimuli to induce emotional responses in individuals (Sra et al., 2017). It is worth emphasizing that, although VR was not designed for this purpose, it stands out positively when compared to traditional approaches. One of the reasons is that, with VR, the individuals' involvement can be higher, since VR is able to bring them closer to reality in an immersive way. These aspects can be even more promising when combined with music.
Affective virtual spaces have gained prominence in VR applications, covering areas ranging from education, entertainment, and art to well-being and health. Hence, VR applications are being widely used for therapeutic purposes, for example, to assist in the rehabilitation of people with vestibular disorders (Lubetzky et al., 2019) and the rehabilitation of stroke-affected people with mild cognitive impairment (Cameirão et al., 2017). Other applications include psychological relaxation therapies, and support for emotional self-awareness (Bermúdez i Badia et al., 2019). These therapies use audiovisual scenes with music that induces relaxation, and feedback with VR-oriented affective states, respectively. Thus, considering the aforementioned therapies, researchers are working towards the development of tools that assess the user experience, taking into account both the therapeutic context and the individual's affective state (Krüger et al., 2020).
Regarding therapies, musical sonification appears as a promising technique that has been helping in treatment and rehabilitation. The gamified approaches that have been developed include an adapted guitar for upper-limb rehabilitation of stroke-affected individuals (Dechenaud et al., 2019) and gesture recognition tools for motor rehabilitation (Murad et al., 2017). Researchers have also investigated, with good results, therapies based on vibro-acoustics, appraising their effects on the physiological signals of young people and adults; the authors evaluated biosignals to assess the therapy's potential to induce a relaxation state after a stress situation (Delmastro et al., 2018; Cavallo et al., 2020).
Given the advances in therapies that use musical strategies, different applications arise. Among these applications, we can cite: (1) the development of a system to measure the reaction of autistic children during music therapies, considering parameters such as frequency and intensity of sound; (2) protocol validation for training and assessment of manual function rehabilitation; (3) development of intelligent systems to detect stress through EEG signals, leading individuals to relaxation; and (4) music therapy applications to aid in memory recovery.
Physical therapy exercises can often cause discomfort, boredom, or fatigue in patients, due to repetitive and monotonous activities. Games with musical techniques and VR applications have become an alternative strategy to involve patients in their activities. Musical games stand out for contributions such as (1) promoting upper-limb motor rehabilitation in stroke-affected individuals (Sanders et al., 2020); (2) exercising motor and cognitive skills by touching sequences (English & Howard, 2017b); and (3) supporting hand exercises through the process of learning rhythms and melodic structures (Fonteles et al., 2018).
VR-based exergames are being used to treat low back pain (Ortegon-Sarmiento et al., 2020) and to assist in therapies for the elderly by encouraging arm movement (Crespo et al., 2016). Also in the context of games, other approaches can be highlighted. The use of affective robotics, in particular, has gained a lot of acceptance and prominence due to its diverse uses. It has been used mainly for therapeutic purposes, for example, in robotic systems for upper limb and wrist motor rehabilitation using musical integration (Kikuchi et al., 2018; English & Howard, 2017a). Affective robots have also been used in health interventions to assist individuals with intellectual disabilities (Shukla et al., 2019). Musical and social robots are also adopted for interaction with autistic children and children undergoing cancer treatment (Ranjkar et al., 2019), respectively.
Studies of affective robots using musical techniques report success in the rehabilitation process. These robots are being developed to act as personal companions with a focus on the mood of individuals, and they can interact through scents and music to generate positive emotions. Affective robots can also take the form of robot dogs that cooperate in music therapies and in the identification of pneumonia in patients suffering from dementia (Lyu & Yuan, 2020). Moreover, they can act as a singing robot that communicates through music and evokes and assesses emotional responses in humans (Wolfe et al., 2020).

Physiological and behavioral responses induced by acoustic signals
From this study, we observed that almost 19% of the selected works deal with physiological and/or behavioral responses induced by sound signals (Table 3). In addition, we noted that there has been an increase in the number of publications on this topic over the years, as most articles were published in 2019 and 2020. Only 9 of the 27 works in this area were published between 2016 and 2018. This demonstrates a growth in scientific interest in this area of research.
We also observed that such studies have been carried out in several countries, especially on the European, Asian, and American continents. As for the data sources, almost all the works were developed from their own databases. The data were collected locally, with little dissemination to the scientific community. Another aspect is that the vast majority of the returned studies are in the experimental phase; however, three of them propose the design and validation of a tool.
The returned works also presented a great diversity in the investigation of the effects of sound signals on the human organism. Most of these works sought to better understand the interventions of music and auditory stimuli in human brain waves (Kanehira et al., 2018; Leslie et al., 2019; Ramdinwawii & Mittal, 2017; Stappen et al., 2019; Rushambwa & Mythili, 2017; Plut & Pasquier, 2019; Soysal et al., 2020).
Regarding the use of Artificial Intelligence (AI) tools, only nine works related to this topic made use of them (Yamada & Ono, 2019; Huang & Benjamin Knapp, 2017; Wang et al., 2016; Liu et al., 2020; Dutta et al., 2020; Bhargava et al., 2020; Q. Li et al., 2020; Ibrahim et al., 2019). Such computational approaches were mainly applied in works that sought to analyze cerebral and hemodynamic biosignals. Both supervised and unsupervised learning methods were used; among them, it is worth highlighting that Support Vector Machines (SVM) were applied in the majority of the studies. The works that did not use AI tools commonly used cognitive tests or performed purely qualitative analyses through observations and interviews with participants.

Emotion recognition associated with acoustic stimuli
Among the 144 studies selected through this systematic review, 35 are related to emotion recognition (listed in Table 4). Considering the years of publication, we noted that most of the works were published in 2019 and 2020, each year presenting nine studies.
Considering the types of approach for the recognition of emotions, we observed that a significant number of studies use the analysis of physiological signals to achieve this objective. Thus, studies using biosignals such as EEG (Bankar et al., 2018; Shen et al., 2020; Dutta et al., 2020; Marimpis et al., 2020; Bo et al., 2017; Rahman et al., 2020), ECG (Hsu et al., 2020), and the skin's electrodermal activity (Rahman et al., 2019) were returned. Other studies combined more than one parameter, in addition to biosignals, to assess emotion. In this context, we highlight studies using more than one biosignal for emotional assessment. In Ramírez et al. (2020)'s approach, the authors combined functional magnetic resonance and EEG parameters. Daly et al. (2019), on the other hand, combined the ECG signal and the skin's electrodermal activity for the emotion recognition task. In works concerning the evaluation of biosignals, we also observe that, in the vast majority, the authors collect their own data.
Although they described the processes in the papers, the lack of availability of open datasets makes it difficult to reproduce and improve the models which were tested.
Several works also used parameters other than biosignals for emotion recognition. Among them, we can highlight the groups that assess emotion through audio analysis (Greer et al., 2020; Lv et al., 2018; Mo & Niu, 2019; Lopes et al., 2019; Kumar et al., 2016; Panda et al., 2020; Chapaneri & Jayaswal, 2018). Textual aspects also have great potential in the field of emotion recognition in music. Thus, some works explore the use of song lyrics (Malheiro et al., 2018; Matsumoto & Sasayama, 2018), as well as their association with audiovisual content to improve the performance of models in this field of study (Nemati & Naghsh-Nilchi, 2017). As for data, most studies that involve only audio analysis use public databases such as Soundtracks, the MTV database, and MediaEval. In Panda et al. (2020)'s study, the authors made the database they elaborated available to the scientific community. For studies that use video-only or audiovisual content, the authors assembled their own datasets, which affects the studies' reproducibility.

Non-invasive assistive solutions based on human-machine musical interfaces
Musical and sound stimuli are increasingly being explored in interfaces of the most diverse orders. In this review, 12 of the 144 articles returned by the search deal with solutions based on human-machine musical interfaces (Table 5). Seven of them are in the testing phase and five in the implementation and clinical validation phase. All these studies used their own databases, collected locally in the field or in the laboratory. Most of the works were developed in European and Asian countries, with special emphasis on Japan, which showed a great advance in the development and application of human-computer interfaces compared to other countries. Regarding the use of AI, six works did not use or did not make explicit the use of these techniques. Among the studies involving AI, we noticed a preference for supervised learning methods, especially SVMs. In addition, there is also a predominance of technologies using ARM-type microcontrollers and sensors with remote communication.
Most of the works in this group propose the development of interfaces for monitoring and modulating mental and affective states through music (LingHu & Shu, 2018; Mideska et al., 2016; Shan et al., 2018; Daly et al., 2016; S. K. Ehrlich et al., 2019; Kobayashi & Fujishiro, 2016; Daly et al., 2020). Desai et al. (2018) present an approach in the opposite direction.
In their study, they propose a brain-computer music interface (BCMI) which modulates musical parameters, such as harmony, rhythm, and melody, from the biofeedback captured by the user's EEG signal. The main goal is to adapt the music according to the user's emotional state. Thus, instead of using music to modulate the emotional state, the authors propose to alter the music according to the user's brain activity.
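To illustrate this kind of biofeedback loop, the sketch below derives a crude arousal estimate from EEG band power and maps it to two musical parameters. The band definitions, the beta/alpha arousal proxy, and the mapping rules are our own illustrative assumptions, not the actual parameters of Desai et al. (2018).

```python
# Toy EEG-to-music mapping sketch (illustrative assumptions only).
import numpy as np
from scipy.signal import welch

def band_power(eeg, fs, lo, hi):
    """Mean spectral power of a single-channel segment in [lo, hi] Hz."""
    freqs, psd = welch(eeg, fs=fs, nperseg=int(fs * 2))
    return float(psd[(freqs >= lo) & (freqs <= hi)].mean())

def music_parameters(eeg, fs=256.0):
    # Beta/alpha power ratio as a simplified arousal proxy.
    alpha = band_power(eeg, fs, 8.0, 13.0)
    beta = band_power(eeg, fs, 13.0, 30.0)
    arousal = beta / (alpha + 1e-12)
    return {
        "tempo_bpm": 60 + min(arousal, 2.0) * 40,       # 60-140 bpm
        "mode": "major" if arousal < 1.0 else "minor",  # toy rule
    }

segment = np.random.randn(int(256 * 4))  # 4 s of synthetic EEG
print(music_parameters(segment))
```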
Some other studies search for technologies to improve the performance of activities of daily living for people with disabilities. An interface based on sound signals was proposed by Boumpa et al. (2018) to help people with dementia in their daily activities. Hasan et al. (2020) use the emotional response to musical stimuli to identify movement intention; such a solution can be incorporated into assistive BCIs. Other works seek to apply these musical interfaces to improve communication and interaction between people (Syed et al., 2018; Yi-Hsiang et al., 2018). In general, there is growing interest among researchers in human-machine interfaces, including those that make use of sound and musical aspects to achieve their goals. Despite this, studies of this nature are concentrated in regions with greater economic development, as they generally involve the use of high-tech, high-cost devices for capturing and processing signals. It was also observed that the use of AI in this context is still timid, possibly due to the difficulty of integrating this technology into hardware with remote communication systems and real-time responses.

Musical composition, recommendation and customization
This group has 18 of the 144 studies returned by the search string. They deal with approaches to the composition, personalization, or recommendation of music, and are presented in Table 6. The vast majority of these works were elaborated by groups located in Eastern countries, mainly in China. Works in this area were mostly published in 2016 and 2020, with no work made available in 2019. Considering the origin of the data used in the studies, the majority of the groups used their own databases. Three of them used the Facial Expression Recognition 2013 (FER-2013) dataset to identify emotions through facial expressions.
Furthermore, two of them used the DEAP dataset, which is a dataset for emotion analysis using EEG, physiological, and video signals.
From these works, we observed that composition, personalization, and musical recommendation are generally carried out using multimodal signals. Most works combine physiological signals, such as galvanic skin response, heart and respiratory rates, and facial expressions. Some of these works also use electroencephalographic signals or functional images to analyze brain activity. As these are multimodal signals that are often difficult to interpret, most studies use AI methods to analyze the data. In general, deep learning techniques are more frequently used, especially LSTM and CNN configurations. Some works also use shallow learning approaches, such as SVM and K-means. Works that do not use AI techniques commonly use computational methods of signal filtering and threshold identification to process the data.
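As an illustration of these recurrent LSTM-based setups, the PyTorch sketch below maps a sequence of multimodal feature frames (e.g., EEG band powers plus peripheral signals per time step) to a discrete emotion class that could drive music recommendation. The dimensions and the four-class output are assumptions for illustration, not the architecture of any specific reviewed work.

```python
# Minimal LSTM emotion-classification sketch (assumed dimensions).
import torch
import torch.nn as nn

class EmotionLSTM(nn.Module):
    def __init__(self, n_features=40, n_classes=4):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size=64,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, x):
        # x: (batch, time, features); keep the final hidden state.
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])

model = EmotionLSTM()
frames = torch.randn(8, 128, 40)  # 8 sequences of 128 time steps
logits = model(frames)            # (8, 4), e.g., quadrants of the
print(logits.shape)               # valence-arousal plane
```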
Some of the works propose systems and approaches still in the computational testing phase, while others are in the validation and implementation phase with real users. Among these studies, two present proposals for systems for automatic emotion detection (González & McMullen, 2020; Vinayagasundaram et al., 2016). Three other studies use techniques for generating affective or personalized music (Savery et al., 2019; Y.-C. Wu & Chen, 2016; Chen et al., 2020).
These studies of musical generation combine several aspects related to music composition and affectivity, and musical taste factors. Generally, these studies focus on the therapeutic context, to favor user engagement.

Other approaches
The approach used in a work, in most cases, determines the conduct of the research. In this section, however, 19 articles could not be grouped, since they do not share a common field (Table 7). It can be said that these studies are somewhat unique in their perspective or application. Among them, 12 works were published in 2019 and 2020, representing 63% of the total articles found since 2016. This increase may also indicate a growing trend towards new approaches involving emotion recognition (Zhang et al., 2018; Mehta et al., 2019; Al-Qazzaz et al., 2020; Aydın, 2020; Rizos & Schuller, 2019; Thammasan et al., 2017) and the use of music combined with affective computing.
Geographically, 58% of the works were published on the Asian continent, mostly in China and Japan, with two publications each. As for the databases used, almost 42% were created exclusively for the study itself. This may indicate a scarcity of material for approaches that generally require very detailed and sufficiently diverse databases. In this context, studies that focus on the creation of databases under stimuli (Suhaimi et al., 2018), and even alternative perspectives, are crucial. Regarding the study phases, eight of the 19 articles are classified as being in the experimental phase; however, there is also a well-balanced, and often mixed, distribution across the modeling, testing, and concept design phases. An important observation is that in seven studies the authors did not use any Artificial Intelligence (AI) techniques.
This fact is somewhat coherent, since this group deals with alternative approaches. However, very similarly to the works already presented in this section, these publications also involve brain signal capture (Barnstaple et al., 2020), database creation (Suhaimi et al., 2018), music generation, and physiological (Lui & Grunberg, 2019; Nalepa et al., 2019) or affective responses to acoustic stimuli. Among the studies that present AI methods, the most widely used algorithm is the SVM, present in six publications. Nonetheless, alongside classical approaches, other algorithms were also investigated in most cases, such as LSTM, KNN, and deep neural networks.
It is interesting to note that some of these studies, from the emotion prediction perspective, investigate the similarities between the emotional content of images and music (Xing et al., 2019; Verma et al., 2019; Parra et al., 2019). In these articles, the identification of visual and sound attributes, and their relationship with emotional stimulation, helps to understand two striking points: emotion recognition itself, and the characteristics present in these media. These characteristics can be used in both acoustic (Saha et al., 2016) and visual (Suhaimi et al., 2018) stimulation to obtain an affective response.
In contrast, other works use well-known attributes to create affective sentences from images (Konno et al., 2018) and to generate synthetic music from an original database (Herremans & Chew, 2019). Although the articles present promising approaches, it is important to draw attention to some aspects, such as the need for larger datasets, better description of the evaluation metrics, and a more meticulously detailed methodology. Significant improvement in these aspects may bring more precise answers and substantially enhance the proposed models.

Conclusion
This review aimed to identify AI methods to perform automatic emotion recognition applied to human-machine musical interfaces. We were interested in mapping studies related to emotions perceived through facial expressions, speech, EEG or physiological signals. Innovative therapeutic approaches were also of interest, especially those that use musical stimuli and computational methods.
From these publications, 144 remained in this review after the qualitative and content analysis.
These studies were grouped into six groups based on their content. Most of them had highly interdisciplinary content.
Associations of therapy issues with human-computer interfaces, music, and robotics were common. The studies in this review showed that the areas of affective computing and music therapy are increasingly combined in the state of the art. While interdisciplinarity proved to be a feature of this field of study, it was also a challenge during the grouping process, often making it hard to categorize a study into only one of the groups.
The review also demonstrated that most databases used in the included studies are not public. This makes it difficult for other groups to access the data collected and used in many of the works. The lack of dissemination of the databases compromises the reproducibility of the studies and serves as an obstacle to the accessibility of the proposed solutions. Therefore, we encourage the dissemination of more databases in the area. This dissemination can not only give the authors' work greater reach but can also benefit researchers from places without adequate infrastructure to carry out data acquisition.
This study also found that the use of AI in the context of affective computing and music therapy is growing over the years. The papers that did not use AI techniques commonly performed interviews, applied forms, and ran cognitive-behavioral tests to extract and assess information. The use of Support Vector Machines and decision tree architectures is still widespread in the literature. However, in many cases, deep learning methods are also incorporated, mainly Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). In the case of CNNs, the authors have been investing in networks with more and more layers, as data has become increasingly complex and multimodal. As for RNNs, the most explored in the studies has been the Long Short-Term Memory (LSTM), which can provide information in the temporal dimension. Temporality is an essential factor in speech processing and musical composition. Overall, we found that studies using AI and other computational techniques often do not provide a well-described methodology and/or parameters, which compromises the quality and reproducibility of these studies.
Many of the included studies that used AI applied it to music processing and emotion recognition. In music processing, AI was used to recognize patterns and use them to recommend personalized content or even compose music.
Regarding emotion recognition, most works use deep networks and multimodal data. Most of these studies succeed in recognizing primary emotions (i.e., anger, fear, surprise, disgust, contempt, happiness, and sadness) but are not efficient in identifying secondary emotions such as love, frustration, shame, and relief. Another challenge for emotion recognition is how to differentiate felt from expressed emotions. Some studies seek to overcome this challenge by assessing multimodal data (e.g., combining EEG, facial expressions, and physiological signals).
The importance of multimodal data was a recurrent theme in the works. This also points to the relevance of these studies being carried out by multidisciplinary teams: the combination of professionals from different areas is beneficial for the development of robust and useful solutions.
Regarding the development of Human-Machine Musical Interfaces (HMMIs), the importance of using music in an appropriate, thought-out, programmed, and personalized way to achieve non-musical goals was perceived. Over the years, there has been an increase in the demand for innovations, including therapeutic ones. In this review, many works already propose HMMIs to provide personalized feedback based on the users' biomedical data. Unfortunately, there are still relatively few studies in this field, which are mostly concentrated in more developed countries.
From this systematic literature review, we hope to provide theoretical foundations to encourage the development of research in affective computing combined with music therapy. This is a very promising area where there is still much to be explored. Artificial Intelligence tools applied to emotion recognition can optimize music therapy processes and thus enhance their effects. Finally, we believe that the popularization of therapeutic approaches such as music therapy has great potential to improve the quality of life of people with cognitive, motor, and behavioral disorders.