Applying Text Mining and Natural Language Processing to Electronic Medical Records for extracting and transforming texts into structured data




Text Mining; Natural Language Processing; Electronic Medical Record; Anamnesis.


Healthcare providers usually record patient data in electronic patient records (EPRs) using free-text fields, which allow the same information to be described in many different ways (e.g., with abbreviations or varying terminology). In such scenarios, retrieving data from the text with SQL (Structured Query Language) queries becomes infeasible. We therefore present a tool that applies Text Mining and Natural Language Processing techniques to extract comprehensible, standardized patient data from unstructured text. Our main goal is to automate the extraction, cleaning and structuring of data obtained from the EPRs of pregnant patients at the Januario Cicco maternity hospital in Natal, Brazil. We used 3,000 EPRs written in Portuguese, from 2016 and 2020, to compare data retrieved manually by health professionals (e.g., doctors and nurses) with data retrieved by our tool. We applied the Kruskal-Wallis test to statistically evaluate the results of the manual and automatic processes. The test showed no statistically significant difference between the two retrieval processes, which makes the results promising.
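As an illustration only, the statistical comparison described above can be sketched in a few lines of Python. This is a minimal, standard-library version of the Kruskal-Wallis H test with tie correction, not the code used in the study. The function names (`average_ranks`, `kruskal_wallis`) and the sample values are hypothetical, and the p-value shortcut assumes exactly two groups (manual and automatic), where the chi-square approximation has one degree of freedom.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis(*groups):
    """Return (H, p). H is tie-corrected; p is computed only for two groups."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of (rank sum squared / group size) over the groups.
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (n^3 - n).
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all observations are identical")
    h = max(h / correction, 0.0)  # guard against tiny negative rounding error

    # With two groups, H is approximately chi-square with 1 degree of
    # freedom, whose survival function is erfc(sqrt(H / 2)).
    p = math.erfc(math.sqrt(h / 2)) if len(groups) == 2 else None
    return h, p


# Hypothetical values of one field (e.g., a count per record) as retrieved
# manually and automatically; these numbers are invented for illustration.
manual = [2, 1, 3, 0, 2, 1, 4]
automatic = [2, 1, 3, 0, 2, 2, 4]
h_stat, p_value = kruskal_wallis(manual, automatic)
print(f"H = {h_stat:.3f}, p = {p_value:.3f}")
```

A large p-value here would be read the same way as in the abstract: no evidence of a difference between the two retrieval processes. For more than two groups, or for production use, an established implementation such as `scipy.stats.kruskal` would be the safer choice.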






How to Cite

BENÍCIO, D. H. P.; XAVIER JUNIOR, J. C.; PAIVA, K. R. S. de; CAMARGO, J. D. de A. S. Applying Text Mining and Natural Language Processing to Electronic Medical Records for extracting and transforming texts into structured data. Research, Society and Development, [S. l.], v. 11, n. 6, p. e37711629184, 2022. DOI: 10.33448/rsd-v11i6.29184. Accessed on: 25 May 2022.



Exact and Earth Sciences