A systematic literature review on Machine Learning Model evaluation on healthcare applications

Machine Learning (ML) models have been applied to solve problems in various fields, which necessarily involves proper evaluation of models to ensure performance. Once deployed, ML models are subject to performance issues, such as those related to changes in data (drift). This type of issue has prompted efforts in model analysis and maintenance, as well as in continual learning, which seeks the ability to continuously learn from a (continuous) stream of data. Therefore, it's important to understand and develop methodologies that can be used to evaluate ML models, making their use in real-world environments feasible. Amongst current areas of application for ML, one that stands out, in particular, is Machine Learning for Healthcare, especially in conjunction with Software for Decision Support of Medical Applications, which presents specific challenges for the evaluation and monitoring of models, particularly


Introduction
Recently, Artificial Intelligence (AI) has consolidated itself as one of the go-to alternatives for solving complex problems in any given field of knowledge. It has become increasingly common to hear about or even find systems that make use of AI techniques (e.g. Machine Learning, Expert Systems, Deep Learning, among others) to solve everyday problems.
Healthcare, an area of high social impact, has been the subject of several studies that use Machine Learning (ML) techniques to solve problems. Some studies, for instance, applied ML techniques to predict patient outcomes during the COVID-19 pandemic (Malki et al., 2021; Arowolo et al., 2022). Others tried to predict the risk of death for ICU patients with heart failure (Luo et al., 2022). Given the severity of the issues addressed, the use of ML techniques in healthcare applications faces particularly tough modelling, analysis and validation challenges (Ghassemi et al., 2020). Solving them requires close collaboration between data scientists and healthcare experts to make sure that ML models are designed to solve real problems in the field and are interpretable and explainable to the clinical community.
Outcomes and performance of ML models are closely related to the data used for training and testing them (Gopal, 2019). Therefore, it becomes difficult to generalize results obtained with data from specific locations and patient characteristics to other locations and populations. Another aspect that makes it difficult to analyze and validate model results in healthcare applications is the need for continuous monitoring and specialist feedback, which is difficult to incorporate due to the demanding day-to-day routine of healthcare professionals. The traditional statistical analysis of results may not be as efficient when it comes to models in production and applied to situations that can mean life or death for patients, underscoring the need for research and development in ML model evaluation and monitoring for healthcare applications.
Research, Society and Development, v. 12, n. 6, e5412642042, 2023 (CC BY 4.0) | ISSN 2525-3409 | DOI: http://dx.doi.org/10.33448/rsd-v12i6.42042
Based on this context, it is possible to state that studies related to evaluating and maintaining ML models applied to health are of great relevance. Despite this, the literature in the area does not present many works that discuss the limitations and provide clear paths for the described problem. Thus, this article presents a Systematic Literature Review (SLR) on evaluating and monitoring real-world ML models for healthcare applications.
The review follows the Systematic Review methodology defined by Kitchenham (Kitchenham & Charters, 2007) and reflects the current literature on evaluating and maintaining ML models in health. This type of study has limitations related to time, as it only observes works published up to the date on which it was conducted. On the other hand, it is an easily reproducible study since it is based on a formal literature review protocol.
It is important to mention that the works listed in the review were analyzed considering the entire life cycle of an ML model applied to a real context in the health area, which comprises performance evaluation, model monitoring, and maintenance. In addition, this work also presents, as a secondary objective, an approach for analyzing and evaluating the performance of ML models in health, proposed based on the results of the review and the observations made.
The next sections are organized as follows: Section 2 discusses related concepts; Section 3 presents the review methodology; Section 4 analyzes outcomes from the systematic review; Section 5 presents the discussion and outlines a research proposal; and, finally, Section 6 discusses conclusions and future work.

Related Concepts
This work is related to the monitoring and evaluation of ML models in the healthcare context. In this sense, this section briefly explains some important aspects of evaluating and continuously observing the results and performance of an ML model.

ML Model Evaluation
Building a Machine Learning model involves the following steps: pre-processing, which includes data collection and handling; processing, which amounts to running ML methods over the pre-processed data; and post-processing, with model performance metric collection and analysis (Mitchell et al., 2007). Traditionally, post-processing includes testing, which means running the trained model over a data sample to collect performance metrics. Another activity, called validation, is usually performed after testing as part of the post-processing step. This activity involves verifying model performance against different data samples kept specifically for that purpose. After that, the model is serialized and embedded in its target application to fulfil its role in solving the proposed problem (Gopal, 2019).
This context raises a question. If model validation occurs before delivery and effective use against real-world data, can performance monitoring and evaluation in actual operation (in production) be called validation too? If so, how can one be differentiated from the other? Current literature seems to give little consideration to that matter. Validation and evaluation usually refer both to the final steps of building the model (post-processing) and to evaluating that same model after it is effectively in use. That makes researching model monitoring and evaluation challenging, given the lack of consensus on terminology. In this research, validation, performance evaluation, monitoring, and maintenance refer to models already built and effectively in use, not those still under development.
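As an illustration of the pre-deployment testing and validation samples described above, the following minimal sketch partitions a dataset into three disjoint subsets. The function name `split_dataset`, the 70/15/15 proportions, and the fixed seed are assumptions made for the example and do not come from the reviewed works.

```python
import random

def split_dataset(records, train_frac=0.70, test_frac=0.15, seed=42):
    """Shuffle records and split them into train, test, and validation subsets.

    The fractions are illustrative assumptions; whatever remains after the
    train and test portions becomes the validation sample.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = round(len(shuffled) * test_frac)
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    validation = shuffled[n_train + n_test:]  # kept aside for validation only
    return train, test, validation

train, test, validation = split_dataset(range(100))
```

None of these subsets contains data produced after deployment, which is why this research treats evaluation in production as a separate concern.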

Continuous Monitoring and Evaluation
There are considerable challenges to ML for Healthcare inherent to the clinical context. For instance: dealing with large volumes of data, data complexity, unstructured data, and patient privacy concerns, not to mention critical requirements regarding accuracy, since mistakes can result in life-threatening situations for patients. Those factors can become dealbreakers to ML model effectiveness and usefulness. Therefore, continuous monitoring and performance evaluation for Healthcare ML applications is a critical necessity.
Machine Learning Operations (MLOps), which adapts DevOps principles to the ML model lifecycle, intends to manage the Intelligence Cycle for ML models so that people can work together to imagine, develop, deploy, operate, monitor, and improve machine learning systems on an ongoing basis (Treveil et al., 2020).
Getting models into production is just part of the process, not the end of it. Once a model is in operation, production data should be collected and monitored continuously to close the feedback loop. That way, new data can be selected and labelled into new training datasets and used to improve ML models, allowing them to adapt and improve continuously.
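One possible way to picture the monitoring half of this feedback loop is a rolling window over labelled production predictions. The sketch below is only an assumption-laden illustration: the class `RollingAccuracyMonitor`, the use of accuracy as the monitored metric, the window size, and the tolerance are all hypothetical choices, not a method reported by the reviewed works.

```python
from collections import deque

class RollingAccuracyMonitor:
    """Track accuracy over the most recent labelled production predictions."""

    def __init__(self, baseline_accuracy, window_size=100, tolerance=0.05):
        self.baseline_accuracy = baseline_accuracy  # measured before deployment
        self.tolerance = tolerance                  # hypothetical allowed drop
        self.outcomes = deque(maxlen=window_size)   # 1 = correct, 0 = wrong

    def record(self, prediction, label):
        """Store whether one production prediction matched its later label."""
        self.outcomes.append(1 if prediction == label else 0)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        """True once the window is full and accuracy fell below the tolerance."""
        if len(self.outcomes) < self.outcomes.maxlen:
            return False
        return self.accuracy() < self.baseline_accuracy - self.tolerance

monitor = RollingAccuracyMonitor(baseline_accuracy=0.90, window_size=10)
for prediction, label in [(1, 1)] * 7 + [(1, 0)] * 3:
    monitor.record(prediction, label)
```

In a clinical setting the labels would arrive with delay and depend on specialist feedback, so a real monitor would need to account for that latency.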
Factors inherent to business and product aspects can affect ML models' lifecycle, such as implementation cost and model impact (Wiens et al., 2019). Misalignment between model and business metrics can lead to undesirable effects on model performance. A statistically accurate model that fails to meet business expectations is doomed to failure. Therefore, studies about continuous model monitoring and validation are essential. That is especially true in contexts such as ML for healthcare.

Methodology
According to Kitchenham and Charters (2007), a Systematic Review is a study that aims at identifying research works related to a specific topic and addresses broader questions regarding research evolution. Therefore, conducting a Systematic Literature Review (SLR) is a good fit for this work, which seeks to understand current state-of-the-art regarding healthcare model evaluation, monitoring, and maintenance. This process utilizes a quantitative approach to collect and organize the selected data and a qualitative analysis to compare the established quality criteria to understand the current model evaluation and monitoring landscape. The research process occurs in three stages: planning, execution, and data extraction, as detailed in the following sections.

Research Planning
The Systematic Literature Review begins with methodological planning to reduce errors and biases in study selection and analysis. Planning defines the research objective, questions, search engine, search string, inclusion, exclusion, and quality criteria. Those are necessary for the execution phase.

Research Objective and Research Questions
This review's main objective is to establish current state-of-the-art regarding healthcare model evaluation, monitoring, and maintenance. The following Research Questions (RQ) account for that:
• RQ1: Which methods and techniques evaluate machine learning models' performance in real-world applications?
• RQ2: What are their main characteristics, and how are they described?
• RQ3: Are there specificities for ML model evaluation in Healthcare applications?
• RQ4: How is model update handled considering system operation, and how does domain data quality assurance happen?
• RQ5: What are the main challenges and opportunities in evaluating ML models in healthcare applications?

Search Engine, Inclusion and Exclusion Criteria
The Scopus search engine, from Elsevier, was chosen as the platform for the research, as it indexes the most relevant databases for the areas of computer science and machine learning, such as ACM Digital Library, IEEE Xplore, ScienceDirect, and SpringerLink. The inclusion and exclusion criteria, which determine which studies should be included in or excluded from a systematic review, were defined as follows.
• Inclusion criteria:
○ English-written studies only;
○ The studies must propose or analyze the evaluation process of machine learning models in healthcare applications.
• Exclusion criteria:
○ Grey literature (books, technical reports, non-scientific articles);
○ Duplicated results;
○ Same-author or same-research works;
○ Studies not related to healthcare;
○ Studies not related to Machine Learning;
○ Studies that do not address real-world operation;
○ Studies unavailable for download;
○ Studies that do not address any of the research questions;
○ Studies published prior to 2010.

Quality Criteria
The Quality Criteria (QC) evaluate the work's adherence to the research objective and research questions. In other words, research questions establish what should be investigated, and quality criteria objectively quantify how valuable the works are to the research. The following quality criteria were established:
• QC1: Does the work address the evaluation of machine learning models already in use in a real-world operation (i.e., in a "production environment")?
• QC2: Does the work clearly detail the evaluation procedure for one or more machine learning models in production?
• QC3: Are there any particularities related to the management of machine learning models in healthcare applications?
• QC4: Are data-related change management choices detailed along with their motivations?
• QC5: Are model management choices detailed along with their motivations?
• QC6: Are limitations and opportunities described for machine learning model evaluation in production?
• QC7: Does the work describe or propose a framework for production model evaluation in a structured and reproducible manner?
• QC8: Does the work go beyond statistical techniques for model evaluation, taking into account domain experts' opinions and/or specific protocols for the application area?
The quality criteria are measured for each work using a scale. After being read, each work receives a score indicating how well it addresses each quality criterion. The following scale was used: 0, when it does not address the quality criterion; 0.5, when it partially meets the criterion; and 1.0, when it fully meets it.
According to Kitchenham and Charters (2007), a search string must be refined in an iterative process of trial, observation, and refactoring that aims at returning works as coherent as possible to the research subject. The search string was based on the research questions and keywords widely used in Machine Learning for Healthcare applications. The following search string resulted from that process:
"health" AND ("machine learning" OR "ML OPS" OR "MLOPS" OR "machine learning operation") AND ("continuous improvement" OR "continuous deployment" OR "continuous learning" OR "model drift" OR "data drift" OR "target drift" OR "concept drift" OR "model decay" OR "feedback loop" OR " model health" OR "machine learning health" OR "model validation" OR "model evaluation" OR "machine learning evaluation" OR "machine learning validation")
After defining the string, the search was performed in the chosen search engine, considering the works' title, abstract, and keywords. The collected data and notes referring to the stages of the research execution (to be described below) are available in an electronic spreadsheet accessible through the link: https://bit.ly/3XktPfB. Extracted data include the year of publication; work title; list of authors; keywords; work type; and link (URL).
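Since the string is refined iteratively, it can be convenient to assemble it from term groups instead of editing it by hand. The sketch below rebuilds the string above from two lists (with stray whitespace inside the quoted terms trimmed). The helper `or_group` is hypothetical, and the `TITLE-ABS-KEY(...)` wrapper is an assumption about how a title, abstract, and keyword search would be entered in Scopus advanced search, not something stated by the protocol.

```python
def or_group(terms):
    """Quote each term and join the terms with OR inside parentheses."""
    return "(" + " OR ".join(f'"{term}"' for term in terms) + ")"

# Term groups copied from the search string reported in the text.
ML_TERMS = ["machine learning", "ML OPS", "MLOPS", "machine learning operation"]
EVALUATION_TERMS = [
    "continuous improvement", "continuous deployment", "continuous learning",
    "model drift", "data drift", "target drift", "concept drift", "model decay",
    "feedback loop", "model health", "machine learning health",
    "model validation", "model evaluation", "machine learning evaluation",
    "machine learning validation",
]

query = '"health" AND ' + or_group(ML_TERMS) + " AND " + or_group(EVALUATION_TERMS)

# Assumed Scopus field restriction to title, abstract, and keywords.
scopus_query = f"TITLE-ABS-KEY({query})"
```

Keeping the groups as lists makes each refinement round a one-line change that is easy to record alongside the number of results it returned.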

Execution
The Systematic Review protocol followed in this research divides the execution into three successive stages: [1] initially, the title and abstract of each work are read; [2] then, the introduction and conclusion of the selected ones are read; [3] and finally, the filtered ones deemed adherent to the research are read in full. The inclusion and exclusion criteria are observed during the readings of the first two stages. When an article does not meet all inclusion criteria or meets any exclusion criterion, it is removed and will not be read in the last stage. In the final stage, the articles remaining from stages 1 and 2 are fully read, and quality criteria are measured. Figure 1 describes the search process. Two researchers analyzed each work for stages 1 and 2. To avoid bias, each researcher separately indicated whether the work should be excluded or kept for the final stage, based on inclusion and exclusion criteria. In case of disagreement, the researchers discussed the work until they reached a consensus on whether it should remain. In the final stage, only one researcher per work was involved. Table 1 details the initial number of works at each step, how many were removed, and how many remained.
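The screening rule used in stages 1 and 2 can be summarized in a few lines. The sketch below is an illustrative reading of the protocol, with hypothetical names (`Vote`, `screen`): a reviewer keeps a work only if it meets all inclusion criteria and no exclusion criterion, and diverging votes are sent to discussion.

```python
from dataclasses import dataclass

@dataclass
class Vote:
    """One reviewer's independent assessment of a single work."""
    meets_all_inclusion: bool
    hits_any_exclusion: bool

    @property
    def keep(self):
        # A work is kept only if every inclusion criterion is met
        # and no exclusion criterion applies.
        return self.meets_all_inclusion and not self.hits_any_exclusion

def screen(vote_a, vote_b):
    """Return 'keep', 'remove', or 'discuss' given two independent votes."""
    if vote_a.keep != vote_b.keep:
        return "discuss"  # disagreement is settled by consensus
    return "keep" if vote_a.keep else "remove"

decision = screen(Vote(True, False), Vote(True, True))
```

Here the two reviewers disagree, so the work would be discussed before a final decision is recorded.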

Results
After the execution of the first two iterations (stages 1 and 2), twenty-seven (27) works were selected for a full reading.
In stage 3, quality criteria evaluation took place for each. Research questions were then analyzed using quality measurements and data extracted from reading each work. This section details some of that analysis.
Stage 3 works were categorized according to their publisher. Figure 2 shows those on the left, making it clear that diverse publishers were involved. The right side of Figure 2 demonstrates a predominance of journals in terms of publication type, amounting to about 89% of the works read in full.
The values measured for the quality criteria show, criterion by criterion, how well the articles read in full met them. This visualization brings an important perspective on the maturity of the works in terms of each criterion. Figure 5 presents the average values reached by articles read in full in each quality criterion. The overall average, drawn as an orange dashed line, is 0.336; all criteria obtained averages below 0.7, and only two reached 0.5 or higher. The list below presents the average values achieved by each quality criterion, followed by a brief discussion.
• QC1: the average achieved in this criterion was 0.333. That value indicates that evaluating and monitoring healthcare models in production have not been consistently approached by the works.
• QC2: the selected articles obtained an average of 0.352 for this criterion, which indicates that clarity and depth are lacking in the description of evaluation procedures for models in production.
• QC3: this criterion had the highest average amongst the works read, 0.611. This value indicates that the works identify particularities of ML for healthcare to some degree. Even so, the value shows that there is room for deepening the discussion on these particularities.
• QC4: unlike the previous criterion, the average value obtained by the works in this criterion was only 0.185, the lowest amongst all quality criteria. This result shows that data-related change management decisions are reported only briefly and that their reporting can be significantly improved.
• QC5: in this criterion, the works reached an average of 0.278, denoting that the choices made for model management are reported only superficially.
• QC6: related to the limitations and opportunities in evaluating a model in production, the average score reached by the works, 0.5, indicates that these are addressed, but that there may still be a need to go deeper into this matter.
• QC7: works reached an average of 0.204 in this criterion. Thus, it is observable that research effort for establishing frameworks for evaluating ML models is limited.
• QC8: the last criterion presented 0.222 as an average obtained by the works. This value indicates that those have not prioritized the opinions of domain experts or used area-of-application-specific protocols for evaluating and monitoring models.
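The overall average can be checked directly from the per-criterion averages listed above. The short sketch below does that arithmetic. The implied score totals are a derived illustration that assumes every average was computed over all 27 works on the 0 / 0.5 / 1.0 scale; they are not values reported in the review.

```python
# Per-criterion averages as reported in the list above.
QC_AVERAGES = {
    "QC1": 0.333, "QC2": 0.352, "QC3": 0.611, "QC4": 0.185,
    "QC5": 0.278, "QC6": 0.500, "QC7": 0.204, "QC8": 0.222,
}
N_WORKS = 27  # works read in full

# Mean of the eight criterion averages.
overall = sum(QC_AVERAGES.values()) / len(QC_AVERAGES)

# Criteria whose average is at least 0.5.
at_or_above_half = [qc for qc, avg in QC_AVERAGES.items() if avg >= 0.5]

# Sum of scores implied by each average, rounded to the nearest 0.5
# (assumption: each average covers all N_WORKS works).
implied_totals = {
    qc: round(avg * N_WORKS * 2) / 2 for qc, avg in QC_AVERAGES.items()
}
```

Rounded to three decimals, `overall` matches the 0.336 shown in Figure 5, and only QC3 and QC6 reach 0.5.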
From the content analysis of the works read and the individual results of the quality criteria, it was possible to observe how each answered the Research Questions.

Discussion
This section presents a brief discussion of the review findings. It approaches some perspectives for each research question, using both results from the previous section and the content of the works read. Quality criteria measurements will also be used as a basis for the discussion since they came from the research questions.
RQ1 investigates which methods and techniques are used to evaluate ML model performance in real-world applications. The average values obtained by the articles in quality criteria 1, 2, and 8 (0.333, 0.352, and 0.222, respectively) indicate that the works detail the techniques used to validate ML models in the real world only superficially.
That becomes an even bigger issue in a context such as healthcare, where errors can lead to life-threatening situations for patients, and it can end up being a barrier to machine learning adoption in clinical environments and overall healthcare contexts.
It is observable in the works that there is a lack of concrete data, metrics, and best practices for evaluating models in production, that is, ML models already deployed and in operation in real-world systems. Most of the articles reviewed only presented experimental reports, focusing mainly on the statistical evaluation of model performance during their construction, as is the case of (Van Helvoort et al., 2020; Johri et al., 2021; Qasim et al., 2021; Sun et al., 2022). Some articles reported tests carried out in real-world environments with patients. However, they did not detail their evaluation procedures on production models (Lam et al., 2022; Birkenbihl et al., 2020; Kamran et al., 2022; The RADAR-CNS Consortium et al., 2021). It is also noticeable that there is little information about the metrics and best practices for model evaluation in production for healthcare applications.
RQ1 analysis is highly related to RQ2, which deals with the characteristics of methods and techniques used to evaluate ML models in the real world. Therefore, given the scarcity of responses related to practices for evaluating ML models in production, there is little documentation on the characteristics of the methods and techniques used. Despite that, some works mention the need for special care in the statistical evaluation of the training data of the models. When the groups that originate the training data (patients from a specific hospital or people from certain geographical regions, for instance) have distinct characteristics (data-wise), applying that same model to other groups can lead to low model performance (Sun et al., 2022; Rafiq et al., 2020). There are also comments about the need for specialized professionals to participate in model construction and validation to promote better reliability (Wojtusiak, 2021; Risman et al., 2021; Harris et al., 2022; Rojas et al., 2022). Specialists can help in processing and making sense of the data, in model performance testing, and in defining evaluation methods, thus helping to ensure that the resulting models are accurate and reliable.
Another issue pointed out by some works is the need for good model interpretability (Rafiq et al., 2020; Harris et al., 2022; Li et al., 2022; Duckworth et al., 2021). ML model interpretability and explainability can help ensure that ML-enabled applications provide coherent and reliable decisions. Explainability is especially important in healthcare, as it allows the interpretation of model results and facilitates data collection for model evaluation or processes such as auditing. In this context, communication and collaboration should also be prioritized when validating machine learning models in production. This corroborates the need to go deeper into this matter, as the answers presented for quality criteria 1, 2, and 8 are superficial.
RQ3 searches for specificities of the evaluation process for healthcare ML models. It is directly related to QC3, in which works obtained an average of 0.611, the highest score among all quality criteria. When reading the articles, it is noticeable that a relevant part of them mentions problems or specificities related to model evaluation in healthcare applications (Shickel et al., 2020; Rafiq et al., 2020; Rojas et al., 2022; Fries et al., 2019). One of the most critical issues mentioned is the need to keep data up-to-date to provide input for continuous and consistent updating of ML models. Therefore, it is necessary to establish metrics that can identify changes in data distribution and trigger model retraining when those are detected (Birkenbihl et al., 2020; Rojas et al., 2022).
A second aspect pertains to regulatory and ethical concerns, critical issues for ML model management in healthcare applications (Carolan et al., 2022; Wojtusiak, 2021). In healthcare, ethical and regulatory questions concerning data confidentiality, traceability, and explainability of the (model) decision process were already strongly present long before the recent pushes for data access rights and data privacy laws by initiatives such as the General Data Protection Law (LGPD) in Brazil, or the California Consumer Privacy Act (CCPA) in the US, among others (Harris et al., 2022; Rojas et al., 2022). Though these regulatory concerns are not specific to the healthcare context, they affect this area dramatically, given that many of the best healthcare practices relate to the personalization of clinical decisions and the humanization of processes.
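As a hedged illustration of a metric that identifies changes in data distribution and triggers retraining, the sketch below computes the Population Stability Index (PSI) between a reference sample and a recent sample of one numeric feature. PSI is a choice made for this example and is not attributed to the cited works; the function names, the equal-width binning, and the 0.2 threshold (a common rule of thumb) are all assumptions.

```python
import math

def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two numeric samples, with equal-width bins on the reference range."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 for a constant feature

    def proportions(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            idx = min(max(idx, 0), n_bins - 1)  # clamp values outside the range
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(sample), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))

def should_retrain(reference, current, threshold=0.2):
    """Flag retraining when PSI exceeds the (assumed) threshold."""
    return population_stability_index(reference, current) > threshold

reference = [i / 100 for i in range(1000)]  # feature values seen at training time
unchanged = list(reference)                 # production data with no drift
shifted = [x + 5.0 for x in reference]      # production data after a large shift
```

With these samples, `should_retrain(reference, unchanged)` is false and `should_retrain(reference, shifted)` is true; in practice the threshold would need to be agreed with domain experts for each feature.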
Finally, although there is a reasonable discussion about the particularities relevant to the management of ML models in healthcare applications, there is only a superficial discussion about possible solutions to the problems faced by model management due to these particularities. That is, it is observable that the works describe existing problems but do not discuss structured solutions to them (or only do it superficially).
QC4 and QC5, in which the articles obtained averages (respectively) of 0.185 and 0.278, are tightly related to RQ4, which seeks to describe ways to update the ML model during system operation and the quality assumptions observed on the domain data. The values obtained for the QCs indicate that details on the decisions taken regarding model updates are scarce. It is worth mentioning that, given the critical performance requirements of healthcare applications, it is vital to understand how to manage ML model updates when input data distribution changes, concepts deviate, or the very model is no longer a feasible solution for the problem at hand (Vieira et al., 2021).
RQ5 and the related QC6 address the challenges and opportunities related to ML model evaluation and monitoring in healthcare applications. The works obtained an average of 0.500 in QC6. This value indicates some level of depth in discussing challenges and opportunities. Challenges mentioned include real-time data acquisition, data scarcity, maintenance of existing systems, quantifying the comparability of validation data (from new patients) against training data, data accessibility and continuity, standardization of models, data imbalance, and those related to the clinical routine and specialist availability. For example, models trained on data derived from a single health institution may not generalize well to multi-institutional scenarios. A variation on this problem is patient selection biases (regional, socioeconomic, and institutional) (Van Helvoort et al., 2020; Carolan et al., 2022; Lam et al., 2022; Birkenbihl et al., 2020; Kamran et al., 2022; Risman et al., 2021; Shickel et al., 2020; Bellocchio et al., 2021; Rafiq et al., 2020; Harris et al., 2022; The RADAR-CNS Consortium et al., 2021; Li et al., 2022; Lin et al., 2022; Rojas et al., 2022; Yang et al., 2021; Fries et al., 2019).
Such challenges may impact the feasibility of ML model evaluation and monitoring for healthcare applications.
Despite that, the ongoing discussions about these topics can favor the emergence of approaches that can provide solutions or ways to mitigate risks, as well as new businesses and healthcare services. Other challenges are related to Continuous Learning in healthcare, which presents different limitations.
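The single-institution generalization challenge mentioned above can be made visible by reporting performance per institution instead of one pooled figure. The sketch below is a minimal illustration with made-up records; the function `accuracy_by_site`, the site labels, and the choice of accuracy as the metric are assumptions for the example only.

```python
from collections import defaultdict

def accuracy_by_site(records):
    """Compute accuracy per site from (site, prediction, label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for site, prediction, label in records:
        total[site] += 1
        correct[site] += int(prediction == label)
    return {site: correct[site] / total[site] for site in total}

# Made-up records: the model was (hypothetically) trained on hospital_A data.
records = (
    [("hospital_A", 1, 1)] * 9 + [("hospital_A", 1, 0)] * 1
    + [("hospital_B", 1, 1)] * 6 + [("hospital_B", 0, 1)] * 4
)
per_site = accuracy_by_site(records)
```

A pooled accuracy of 0.75 would hide the gap between the two sites (0.9 and 0.6 in this made-up data).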
Regarding the opportunities presented in the selected works, there are mentions of the creation of international standards and guides to deal with the regulatory challenges of ML in healthcare applications. Carolan et al. (2022) describe the need for better automation technologies to improve the efficiency of algorithms. There are also opportunities for expert management and monitoring (Algorithmic Stewardship), with projections of the near-future creation of MLOps departments for healthcare services and hospitals (Harris et al., 2022). Other possibilities include integrating equity in the ML lifecycle, removing biases, collecting feedback from experts and other stakeholders to bring human knowledge into the learning process (Human-in-the-Loop Learning), and going beyond statistical metrics in evaluating model performance, using domain-oriented approaches to measure the usefulness and commercial value of the models (Rojas et al., 2022; Yang et al., 2021). Finally, there are opportunities for real-world applications supported by live data where teams can iteratively build and test at the bedside, continuous delivery (CD) MLOps platforms, design and oversight by people with AI security expertise, continuous assessment using randomization to avoid bias, and use of data flows with the HL7-FHIR protocol (Harris et al., 2022).
Based on those observations, it is noticeable that there is a need for improvement and deepening of research related to ML model evaluation and monitoring in healthcare applications. QC7 searches for works that discuss and propose solutions for evaluating ML models in a structured and reproducible way. The general average in this criterion was 0.204. In addition, of the 27 articles read, only three (3) fully meet this criterion (Carolan et al., 2022; Kamran et al., 2022; Fries et al., 2019), which reinforces the need for research that defines, discusses, and improves ML model evaluation and maintenance methods, especially in critical applications such as healthcare. Therefore, the main observation for QC7 is the need for a methodological approach to ML model evaluation, monitoring, and maintenance in healthcare applications once in real-world operation (production).

Conclusions and Future Research
This work presents the result of a systematic literature review that sought to understand the current state of Machine Learning model evaluation, monitoring, and maintenance in healthcare applications. Following Kitchenham's protocol (Kitchenham & Charters, 2007), twenty-seven (27) papers underwent complete analysis. The gathered results and the discussions that ensued (presented in previous sections) indicate the need for further research involving ML model evaluation, monitoring, and maintenance in real-world healthcare applications. That said, reasonable documentation of problems and limitations is available, which can provide a starting point for future research.
The struggle to find studies that go beyond the experimental report and effectively evaluate ML models in real-world operation suggests that considerable emphasis has been placed on model construction and experimental validation. However, these efforts do not seem to continue once models enter system operation. As a result, accounting for model operation on real-world data has not been consistently addressed. Healthcare applications demand continuous monitoring, validation, and maintenance of the models due to the very criticality of the domain and the services involved.
Therefore, although the importance of ongoing model evaluation and monitoring is acknowledged, the literature still needs practical studies and detailed methodologies for continuous ML model evaluation in healthcare applications. It is essential to continue researching and developing effective methods for evaluating, monitoring, and maintaining ML models to guarantee that they are safe, reliable, and useful for healthcare applications.
The results of the systematic review suggest the need for a change management workflow for developers and managers of ML models. This process, to be proposed in future work, should include the following activities: [1] Obtaining available documentation (for example, baseline model performance, experimental design decisions), [2] Definition of evaluation criteria and parameters based on expert opinion, real-world statistical performance of models (quantitative metrics), and product, business, and area-of-application-specific protocols (
Other future work could establish a methodological approach for assessing the level of maturity of ML models, once in real-world use, based on good practices and concerns that permeate the entire lifecycle of the models.