Mining Techniques
Ismail Alaqta
[Assignment]
[University/Department]
[University Location]
Abstract
INTRODUCTION
Because electronic health record systems allow physicians to create reports about every medical examination from birth onward, the size of a patient's medical history keeps growing. However, not all of that history is crucial information for medical users. Furthermore, every single patient report must be read in order to confirm that a medication or medical procedure is safe for the patient. This review can take considerable time when a medical user must decide whether to give particular medications or perform a procedure during an emergency. To reduce the time spent reviewing all of a patient's medical history reports, we propose a patient medical multi-report summarization system based on free text, as shown in Fig. 1-1.
Figure 1-1. []
Given the number of patient medical reports, methods are needed that enable medical users to quickly understand and assess the content of those reports. Summarization is one approach that can help physicians quickly determine the main points of reports. “A summary can be loosely defined as a text that is produced from one or more texts, that conveys important information in the original text(s), and that is no longer than half of the original text(s) and usually significantly less than that” [4]. Fig. 1-2 [3] gives a quick view of the most important summarization factors: input, purpose, and output. Applied to the multiple reports of a patient, the input factor has three properties. First, in the case of medical reports, the input consists of multiple documents. Second, it is mono-lingual, using the English language only. Third, the input type is text, which matches our case. The purpose factor presents three distinctions. The first is informative versus indicative summaries; in our case the summary is informative, meaning it conveys the key content of the reports rather than merely indicating what they are about. The second is generic versus user-oriented summaries; we use a generic purpose, so the user receives only the most important information contained in the patient's medical reports. The third is general-purpose versus domain-specific summaries; in our case there is a specific domain, namely medicine. The last summarization factor is the form of the output. In our case it is extractive, meaning the system extracts sentences from the patient's reports.
Figure 1-2. []
I. THE IMPORTANCE OF SUMMARIZING A PATIENT'S MULTIPLE REPORTS
In this section, we briefly show why summarization is significant for physicians accessing patient information. The growing number of clinicians from different specialties in health organisations has greatly expanded the number of patient reports. In Fig. 1-3, there are 17,198 patients, the majority of whom have multiple reports. Medical users therefore strongly need an automatic text summarizer in order to retrieve the most relevant patient information, such as chronic diseases and medication interactions, within seconds. That can help physicians quickly make differential diagnoses and therefore easily prescribe medications or identify emergency situations without reading every single report.
Figure 1-3. []
II. METHODS
This project approaches the selection of the most relevant and important sentences for the summary of patient multi-reports by adapting PatTexSum [5]. As shown in Fig. 2-1, we describe how this model represents the patient's reports and selects the most relevant information while excluding redundant sentences.
Figure 2-1. []
III. REPORT REPRESENTATION
Before the reports can be represented, several tasks common to text processing and information retrieval are necessary: stop-word removal, stemming, and indexing. Stop words occur very frequently in text and carry little information [1]. In our case, we use a specific collection of stop words, a medical stop-word list. Stemming is another important pre-processing step; it reduces the size of the index (built in the next step) and raises the weight of terms within their context. After these steps, the reports are ready to be indexed.
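The pre-processing steps above can be sketched as follows. This is a minimal illustration: the stop-word list here is a generic placeholder (the project uses a medical stop-word list, which is not reproduced), and the suffix-stripping rules are a crude stand-in for a real stemmer such as Porter's.

```java
import java.util.*;
import java.util.stream.*;

// Illustrative pre-processing: stop-word removal followed by naive
// suffix-stripping. Both the stop-word set and the stemming rules are
// placeholders, not the ones used in the project.
public class Preprocess {
    static final Set<String> STOP_WORDS =
            Set.of("the", "is", "of", "and", "a", "to", "in", "was");

    // Very rough stand-in for a real stemmer (e.g. Porter).
    static String stem(String w) {
        if (w.endsWith("ing") && w.length() > 5) return w.substring(0, w.length() - 3);
        if (w.endsWith("s") && w.length() > 3) return w.substring(0, w.length() - 1);
        return w;
    }

    static List<String> preprocess(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(w -> !w.isEmpty() && !STOP_WORDS.contains(w))
                .map(Preprocess::stem)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // -> [patient, recover, take, medication] modulo the crude stemming rules
        System.out.println(preprocess("The patient is recovering and takes medications"));
    }
}
```

The output of this stage (lists of stemmed, stop-word-free terms per sentence) feeds the indexing step.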
There are two methods to represent reports/sentences. The first one is the traditional bag-of-words (BOW). The second one is the transactional data format (TDF).
Let D = {r_1, ..., r_n} be a patient report collection, where each report r_k consists of sentences S_k = {s_1^k, ..., s_z^k}. Each sentence is composed of stemmed words. The BOW representation of the patient report collection D is the set of all terms it contains. The TDF representation is sentence-based: each sentence represents a transaction whose items are taken from the BOW, tr_j^k = {w_1, ..., w_l} where tr_j^k ⊆ s_j^k.
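The two representations can be sketched directly from these definitions: the BOW is the set of all terms, and each TDF transaction is the set of terms of one sentence (duplicates within a sentence collapse, since a transaction is a set).

```java
import java.util.*;

// Sketch of the two report/sentence representations: bag-of-words (BOW)
// and transactional data format (TDF). Input sentences are assumed
// already stemmed and stop-word filtered.
public class Representations {
    // BOW: the set of all distinct terms across the given sentences.
    static Set<String> bagOfWords(List<List<String>> sentences) {
        Set<String> bow = new TreeSet<>();
        sentences.forEach(bow::addAll);
        return bow;
    }

    // TDF: each sentence becomes a transaction, i.e. the set of its terms.
    static List<Set<String>> toTransactions(List<List<String>> sentences) {
        List<Set<String>> transactions = new ArrayList<>();
        for (List<String> s : sentences) transactions.add(new TreeSet<>(s));
        return transactions;
    }

    public static void main(String[] args) {
        List<List<String>> sents = List.of(
                List.of("patient", "fever", "fever"),
                List.of("patient", "aspirin"));
        System.out.println(bagOfWords(sents));      // [aspirin, fever, patient]
        System.out.println(toTransactions(sents));  // [[fever, patient], [aspirin, patient]]
    }
}
```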
The statistical term frequency-inverse document frequency (tf-idf) model associates a weight with each term in the whole collection of patient reports. The tf-idf values are stored in a table TC in which each row represents a distinct term of the patient collection and each column corresponds to a report. Each matrix element tc_ik is computed by the following formula:

tc_ik = ( n_ik / Σ_{q : w_q ∈ r_k} n_qk ) · log( |D| / |{r ∈ D : w_i ∈ r}| )

where n_ik is the number of occurrences of the i-th term w_i in the k-th report r_k, D is the collection of reports, Σ_{q : w_q ∈ r_k} n_qk is the total number of term occurrences in the k-th report r_k, and log( |D| / |{r ∈ D : w_i ∈ r}| ) is the inverse report frequency of term w_i.
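The tf-idf weight defined above can be computed as a direct transcription of the formula. This sketch recomputes the value on demand instead of filling the whole TC table, purely for illustration.

```java
import java.util.*;

// Sketch of the tf-idf weighting: tf is the count of term w_i in report
// r_k normalised by the total number of term occurrences in r_k; idf is
// log(|D| / number of reports containing w_i).
public class TfIdf {
    // reports: each report is a list of (stemmed) terms.
    static double tfIdf(String term, int k, List<List<String>> reports) {
        List<String> report = reports.get(k);
        long nIk = report.stream().filter(term::equals).count();
        double tf = (double) nIk / report.size();
        long df = reports.stream().filter(r -> r.contains(term)).count();
        double idf = Math.log((double) reports.size() / df);
        return tf * idf;
    }

    public static void main(String[] args) {
        List<List<String>> reports = List.of(
                List.of("fever", "cough"),
                List.of("fever", "aspirin"));
        System.out.println(tfIdf("cough", 0, reports)); // appears only in report 0
        System.out.println(tfIdf("fever", 0, reports)); // appears in every report -> idf = 0
    }
}
```

Note that a term occurring in every report gets weight zero, since its inverse report frequency is log(1) = 0.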
3.1 The Pattern-based Model Generation
The frequent itemset is considered one of the most significant models discovered and refined by the database and data mining community. Its aim is to discover the most informative itemsets and the relationships among them [2]. Let T be the patient report collection in transactional data format. The set of transactions covered by an itemset I is D_I = {tr_j^k ∈ T : I ⊆ tr_j^k}, where I is a set of distinct items and |I| is the length of the itemset. The support of I can then be evaluated by the following formula:

sup(I) = |D_I| / |T|
The pattern-based model can be generated for a given minimum support threshold min_sup and model size p using the itemset summarization method of “Tell me what I need to know: Succinctly summarizing data with itemsets” [6].
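The support measure underlying the model can be sketched as below. This shows only sup(I) and a min_sup filter over candidate itemsets; the actual model generation uses the succinct itemset summarization algorithm of Mampaey et al. [6], which is not reproduced here.

```java
import java.util.*;

// Sketch of itemset support over the transactional patient data:
// sup(I) = |D_I| / |T|, where D_I is the set of transactions that
// contain itemset I.
public class ItemsetSupport {
    static double support(Set<String> itemset, List<Set<String>> transactions) {
        long covered = transactions.stream()
                .filter(t -> t.containsAll(itemset))
                .count();
        return (double) covered / transactions.size();
    }

    // Keep only the candidate itemsets whose support reaches minSup.
    static List<Set<String>> frequent(List<Set<String>> candidates,
                                      List<Set<String>> transactions, double minSup) {
        List<Set<String>> out = new ArrayList<>();
        for (Set<String> c : candidates)
            if (support(c, transactions) >= minSup) out.add(c);
        return out;
    }

    public static void main(String[] args) {
        List<Set<String>> t = List.of(
                Set.of("fever", "cough"), Set.of("fever", "aspirin"), Set.of("cough"));
        System.out.println(support(Set.of("fever"), t)); // 2 of 3 transactions
    }
}
```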
3.2 Sentence Evaluation and Selection
The most relevant sentences to be included in the summary need to be weighed and chosen. Sentence evaluation and selection rely on two measures: a sentence importance score, which weighs a sentence by summing the tf-idf values associated with each of its terms, and the sentence coverage of the extracted pattern-based model.
3.3 Sentence Importance Score
The importance score of a sentence is calculated from the BOW representation. It is computed as the sum of the tf-idf values of each term in the sentence over the whole patient collection of reports, normalised by the number of distinct terms:

SR(s_j^k) = ( Σ_{i : w_i ∈ s_j^k} tc_ik ) / |t_j^k|

where |t_j^k| is the number of distinct terms appearing in s_j^k, and Σ_{i : w_i ∈ s_j^k} tc_ik is the sum of the tf-idf values of the terms in s_j^k.
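The score is a length-normalised sum, which a short sketch makes concrete. Here the tf-idf values for the report are assumed precomputed and passed in as a map (a column of the TC table).

```java
import java.util.*;

// Sketch of the sentence importance score SR: the sum of the tf-idf
// values of the distinct terms in the sentence, divided by the number
// of distinct terms. tfIdfForReport stands in for one column of TC.
public class SentenceScore {
    static double score(Set<String> sentenceTerms, Map<String, Double> tfIdfForReport) {
        double sum = 0;
        for (String w : sentenceTerms)
            sum += tfIdfForReport.getOrDefault(w, 0.0);
        return sum / sentenceTerms.size();
    }

    public static void main(String[] args) {
        Map<String, Double> tc = Map.of("fever", 0.4, "cough", 0.2);
        System.out.println(score(Set.of("fever", "cough"), tc)); // (0.4 + 0.2) / 2
    }
}
```

Normalising by sentence length keeps long sentences from dominating purely by containing more terms.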
3.4 Sentence Model Coverage

After the pattern-based model has been generated, a binary vector SC_j^k = (sc_1, ..., sc_p) is associated with each sentence, where p is the size of the pattern-based model. Each entry sc_i = 1_{tr_j^k}(I_i) indicates whether itemset I_i is contained in tr_j^k. More formally:

1_{tr_j^k}(I_i) = 1 if I_i ⊆ tr_j^k, 0 otherwise
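The coverage vector is a direct application of the indicator function above, one entry per itemset in the model:

```java
import java.util.*;

// Sketch of the sentence model coverage: for a pattern-based model of p
// itemsets, each sentence transaction gets a binary vector whose i-th
// entry is 1 iff itemset I_i is contained in the transaction.
public class Coverage {
    static int[] coverage(Set<String> transaction, List<Set<String>> model) {
        int[] sc = new int[model.size()];
        for (int i = 0; i < model.size(); i++)
            sc[i] = transaction.containsAll(model.get(i)) ? 1 : 0;
        return sc;
    }

    public static void main(String[] args) {
        List<Set<String>> model = List.of(Set.of("fever"), Set.of("fever", "aspirin"));
        // The sentence covers the first itemset but not the second.
        System.out.println(Arrays.toString(coverage(Set.of("fever", "cough"), model))); // [1, 0]
    }
}
```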
IV. IMPLEMENTATION
The data we worked on was stored in XML files. However, the files were not well structured and some were corrupted. We therefore wrote a piece of Java code to repair more than 17 thousand files, restructuring them into well-formed XML and removing the corrupted parts.
At this point, the XML files are valid and ready to be read using the XOM library [8]. Because we only need the body of each patient text report, regardless of date, name, and ID, we extracted the report text from among the other information included in the XML file. Each text report was then put into an array list whose size equals the number of patient text reports.
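The extraction step looks roughly like the sketch below. The project uses the XOM library [8]; to stay self-contained, this sketch uses the JDK's built-in DOM parser instead, and the element names `<patient>`/`<report>` are hypothetical, since the real schema is not shown in the report.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Sketch of pulling only the report bodies out of a patient XML file,
// ignoring metadata such as id, name, and date. The <patient>/<report>
// element names are illustrative assumptions.
public class ReportExtractor {
    static List<String> extractReportTexts(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("report");
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++)
                texts.add(nodes.item(i).getTextContent().trim());
            return texts;
        } catch (Exception e) {
            throw new RuntimeException("malformed XML", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<patient><id>42</id><report>Patient stable.</report>"
                   + "<report>Discharged today.</report></patient>";
        System.out.println(extractReportTexts(xml)); // [Patient stable., Discharged today.]
    }
}
```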
Since the sentence is the unit to be included in the summary, sentence segmentation is required to break the report text into meaningful sentences. To break sentences down reliably, the Stanford CoreNLP tools [9] have been used.
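As a self-contained illustration of this step, the sketch below uses the JDK's `BreakIterator` rather than CoreNLP. `BreakIterator` applies locale-aware sentence-boundary rules and is a weaker stand-in for CoreNLP's `ssplit` annotator, especially on clinical text full of abbreviations.

```java
import java.text.BreakIterator;
import java.util.*;

// Illustrative sentence segmentation with the JDK's BreakIterator.
// The project itself uses Stanford CoreNLP [9] for this step.
public class SentenceSplitter {
    static List<String> split(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> sentences = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String s = text.substring(start, end).trim();
            if (!s.isEmpty()) sentences.add(s);
        }
        return sentences;
    }

    public static void main(String[] args) {
        System.out.println(split("Patient admitted with fever. Aspirin was given."));
        // [Patient admitted with fever., Aspirin was given.]
    }
}
```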
The sentences must then be converted into transactional data format. A component has been built that associates each term in the collection of patient reports with a number, as required for the pattern-based model extraction [6].
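The term-to-number association is a simple dictionary encoding, which might look like this sketch: each distinct term is assigned an integer id on first sight, and each sentence transaction is re-expressed over those ids.

```java
import java.util.*;

// Sketch of dictionary encoding: every distinct term in the collection
// gets an integer id, so sentence transactions can be emitted in the
// numeric transactional format expected by the itemset miner [6].
public class TermEncoder {
    private final Map<String, Integer> ids = new LinkedHashMap<>();

    // Assign the next free id on first sight; reuse it afterwards.
    int idOf(String term) {
        return ids.computeIfAbsent(term, t -> ids.size());
    }

    List<Set<Integer>> encode(List<Set<String>> transactions) {
        List<Set<Integer>> out = new ArrayList<>();
        for (Set<String> t : transactions) {
            Set<Integer> enc = new TreeSet<>();
            for (String w : t) enc.add(idOf(w));
            out.add(enc);
        }
        return out;
    }

    public static void main(String[] args) {
        TermEncoder enc = new TermEncoder();
        // "fever" keeps the same id across transactions.
        System.out.println(enc.encode(List.of(Set.of("fever"), Set.of("fever", "cough"))));
    }
}
```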
V. EVALUATION
The evaluation has been designed as a comparison between a human manual summarizer and our automated summarizer system. The manual summarizer is sentence-oriented, so both the manual and the automated summarizer use sentences as the output unit. The manual summarizer system was implemented so that medical experts can choose the most important and appropriate sentences to include in the summary of a patient's reports, as shown in Fig. 3-1.
Figure 3-1. []
Three medical experts were selected to work with the manual summarizer system. Each was given five patients, with the number of reports per patient varying from two to six. Each expert needed 15 to 20 minutes to summarize one patient's reports. In contrast, our automated summarizer system needs only a few seconds to summarize a patient's reports.
Figure 3-2. []
The Wilcoxon test at the 0.95 confidence level (significance level α = 0.05) was applied to measure the difference between the number of sentences selected by the users and by the system. The test showed no significant difference between the number of sentences selected by the users and those selected by the system (p = 0.394).
Figure 3-3. []
VI. CONCLUSION AND FUTURE WORK
Over the past two decades, there has been increasing interest in text summarization and in building automated systems able to generate accurate and sufficient summaries of specific texts. Different automated text summarization systems have been developed along different dimensions, such as abstractive versus extractive summarization. However, these approaches do not consider deeper aspects of meaning, including semantic content, emotional information, and the physician's intention. Thus, the quality of the summaries produced by these systems is still inadequate. We believe that rhetoric can improve automated text summarization systems by providing these deeper aspects of meaning through rhetorical figuration metrics.
In this project, we presented a patient multi-report summarization system using information retrieval methods and data mining techniques.
REFERENCES
[1] B. Liu, "Information Retrieval and Web Search," pp. 211-268, 2011.
[2] B. Liu, "Association Rules and Sequential Patterns," pp. 17-62, 2011.
[3] S. D. Afantenos, V. Karkaletsis, and P. Stamatopoulos, "Summarization from medical documents: a survey," arXiv preprint cs/0504061, 2005.
[4] D. R. Radev, E. Hovy, and K. McKeown, "Introduction to the special issue on summarization," Computational Linguistics, vol. 28, pp. 399-408, 2002.
[5] E. Baralis, L. Cagliero, A. Fiori, and S. Jabeen, "PatTexSum: A pattern-based text summarizer," in Proceedings of The Workshop on Mining Complex Patterns, p. 14.
[6] M. Mampaey, N. Tatti, and J. Vreeken, "Tell me what i need to know: succinctly summarizing data with itemsets," in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, 2011, pp. 573-581.
[7] E. Hatcher, O. Gospodnetic, and M. McCandless, "Lucene in action," ed: Manning Publications Greenwich, CT, 2004.
[8] E. R. Harold. (2002-2013, 11/10/2012). XOM. Available: http://www.xom.nu/
[9] The Stanford Natural Language Processing Group. (2012). Stanford CoreNLP. Available: http://nlp.stanford.edu/software/corenlp.shtml
LIST OF FIGURES
Figure 1-1. []
Figure 1-2. []
Figure 1-3. []
Figure 2-1. []
Figure 3-1. []
Figure 3-2. []
Figure 3-3. []