Research Article - (2017) Volume 7, Issue 3
In addition to pain, nausea and vomiting persist as the most frequent complaints of patients receiving patient-controlled analgesia (PCA) after surgery. Many patients find postoperative nausea and vomiting (PONV) even more distressing than postoperative pain. Although many studies have evaluated the correlation of patient characteristics with PONV and identified several risk factors, there is little research into constructing models for PONV prediction. In this study, we analyzed patient behaviors and applied machine learning methods to PONV prediction. We evaluated different learning algorithms and investigated several data preprocessing techniques. We performed a thorough comparative study of machine learning techniques, and the experimental results suggest that applying machine learning to PONV prediction is feasible and promising.
Keywords: PONV (post-operative nausea and vomiting), PCA (patient-controlled analgesia), Machine learning, PCA demand behaviors, Feature selection, Data cleaning
With the advance of medical science, people have gradually become aware of the importance of pain management, because pain can degrade the quality of health care and, when it becomes intolerable, can do more harm than the illness itself. According to previous studies, PCA (patient-controlled analgesia) is one of the most effective techniques for postoperative analgesia [1,2].
Although IV-PCA (intravenous PCA) is widely used in hospitals for its effectiveness and safety in acute postoperative pain management, it is also commonly accompanied by post-operative nausea and vomiting (PONV), which complicates recovery from surgery and decreases patient satisfaction [3,4]. In some studies, patients were on average willing to pay an extra $56 to avoid PONV; the figure increased to $73 and $100 in patients who had experienced postoperative nausea or vomiting, respectively [5,6].
Most previous studies of PONV focused on identifying risk factors, using regression techniques or proposing probabilistic models [7-10]. A recent study applied an artificial neural network to predict postoperative vomiting [11]. In this study, we investigated patient PCA demand behaviors and derived demand pattern attributes for PONV prediction by clustering demand profiles. In addition, we proposed a neighborhood-based data cleaning technique to clarify the class boundary. Lastly, we compared various machine learning classifiers to identify the best feature set and classifiers for PONV prediction. Our goal is to improve PONV prediction, and thereby patient satisfaction, by applying machine learning methods and analyzing the demand behaviors of IV-PCA patients.
Study subjects
We collected and analyzed IV-PCA usage profiles from the records of bone surgery patients treated at Changhua Christian Hospital from 2009 to 2014. The Abbott Pain Management Provider (Abbott Lab, Chicago, IL, USA) was used for IV-PCA treatment. After excluding incomplete IV-PCA log files and patient records with missing values, we obtained 392 patient records. Of these 392 subjects, 121 had PONV and the remaining 271 showed no PONV. Each patient received at least 24 h of IV-PCA medication without any antiemetic drugs. Each subject is represented by a total of 28 basic attributes divided into 5 categories: (a) demographic, (b) biomedical, (c) operation-related, (d) opioid-related, and (e) PCA-related attributes.
PCA demand behavior pattern attributes
In addition to commonly studied demographic and physiological factors relevant to analgesic consumption, IV-PCA related attributes, such as the number of demands per hour, have been shown to correlate significantly with analgesic consumption prediction [11,12]. These findings suggest that these demand behavior-related attributes are likely to correlate with incidences of postoperative nausea and vomiting. To generate behavior pattern attributes for PONV prediction, we considered two types of pattern attributes based on time domain and frequency domain, respectively.
For time-based behavior pattern attributes, we first characterized the different IV-PCA demand behaviors over time. We retrieved the IV-PCA demand data from each patient’s IV-PCA treatment log file and derived three types of IV-PCA profiles based on (a) the number of successful IV-PCA demands in each time unit, (b) the number of failed IV-PCA demands in each time unit, and (c) the IV-PCA dose for each time unit. Four different time units were used in this study: 60 min, 45 min, 30 min and 20 min. We show a sample IV-PCA time-based dose profile in Figure 1.
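The binning step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(minutes since PCA start, success flag, dose)` and the function name `build_profiles` are assumptions, and the actual pump log format is not given in the article.

```python
# Hypothetical log record: (minutes since PCA start, demand succeeded?, dose delivered).
# The real pump log format is not described in the article.

def build_profiles(log, unit_min, horizon_min):
    """Bin PCA demands into fixed time units and return three profiles:
    successful demands, failed demands, and dose per time unit."""
    n_bins = horizon_min // unit_min
    success = [0] * n_bins
    failed = [0] * n_bins
    dose = [0.0] * n_bins
    for minute, ok, amount in log:
        idx = int(minute // unit_min)
        if not 0 <= idx < n_bins:
            continue  # ignore demands outside the observation window
        if ok:
            success[idx] += 1
            dose[idx] += amount
        else:
            failed[idx] += 1
    return success, failed, dose


# Toy log with five demands over three hours, binned with a 60 min time unit.
log = [(5, True, 1.0), (12, False, 0.0), (70, True, 1.0), (75, True, 1.0), (130, False, 0.0)]
success, failed, dose = build_profiles(log, unit_min=60, horizon_min=180)
print(success, failed, dose)
```

Calling the same function with `unit_min` set to 45, 30 or 20 would give the finer-grained profiles mentioned in the text.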
From a time-based behavior pattern we can observe the change in the number of PCA demands and the amount of analgesic consumption; however, we cannot distinguish the distributions of PCA demands in different frequencies. Therefore, we also applied Fourier transform to time-based profiles to obtain a frequency-based profile. A sample frequency-based IV-PCA dose profile transformed from Figure 1 is shown in Figure 2.
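To illustrate the transformation, the sketch below computes the magnitude spectrum of a time-based profile with a naive discrete Fourier transform written in plain Python. The article does not say which Fourier implementation or which part of the spectrum was used, so the function `magnitude_spectrum` and the choice to keep magnitudes only are assumptions.

```python
import cmath


def magnitude_spectrum(profile):
    """Naive discrete Fourier transform of a real-valued profile.
    Returns the magnitudes for frequency indices 0 .. n // 2."""
    n = len(profile)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(profile))
        spectrum.append(abs(coeff))
    return spectrum


# Toy dose profile: a dose every second time unit (a strictly periodic demand).
hourly_dose = [2.0, 0.0, 2.0, 0.0, 2.0, 0.0, 2.0, 0.0]
spectrum = magnitude_spectrum(hourly_dose)
print([round(v, 6) for v in spectrum])
```

For this toy profile the energy concentrates in the zero-frequency term (the total dose) and in the highest frequency index, which is the kind of periodic demand structure a time-based profile alone does not make explicit.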
After generating the various IV-PCA profiles, we applied k-medoids clustering to them to identify significant demand patterns among the study patients. Figure 3 shows the four patterns identified in the time-based IV-PCA dose profiles of the 392 patients over a 12 h period [13].
The demand profiles grouped into a cluster demonstrated similar demand behaviors, and the medoid of a cluster represented the behavior pattern for that cluster over time. By applying k-medoids to different IV-PCA demand profiles, we generated different demand pattern attributes. We expected the inclusion of demand patterns from the first few hours of IV-PCA usage to improve PONV prediction.
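The clustering step can be sketched with a simple alternating k-medoids procedure (assign each profile to its nearest medoid, then re-pick each medoid as the member with the smallest total distance to the others). The article does not specify the k-medoids variant, the distance measure, or the initialization, so the Euclidean distance and the "first k profiles" start below are assumptions made for illustration.

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def assign(profiles, medoids):
    """Label each profile with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: euclidean(p, profiles[medoids[c]]))
            for p in profiles]


def k_medoids(profiles, k, max_iter=100):
    """Alternating k-medoids; returns (medoid indices, cluster label per profile)."""
    medoids = list(range(k))  # simplistic deterministic start: the first k profiles
    for _ in range(max_iter):
        labels = assign(profiles, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:  # keep the old medoid if a cluster became empty
                new_medoids.append(medoids[c])
                continue
            # The new medoid minimizes the total distance to its cluster members.
            best = min(members,
                       key=lambda i: sum(euclidean(profiles[i], profiles[j]) for j in members))
            new_medoids.append(best)
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(profiles, medoids)


# Toy dose profiles: three with decreasing demand, two with increasing demand.
profiles = [[3, 2, 1, 0], [3, 2, 1, 1], [0, 1, 2, 3], [0, 0, 2, 3], [3, 3, 1, 0]]
medoids, labels = k_medoids(profiles, k=2)
print(medoids, labels)
```

The cluster label of each patient is what would then serve as a categorical demand pattern attribute, with the medoid profile acting as the representative pattern of the cluster.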
Feature selection
We used 28 basic patient attributes, classified in 5 categories, to describe each study subject. In addition, we derived a number of PCA demand pattern attributes from various PCA demand profiles, based on different time units, different demand references (e.g. dose or successful demands), and various values of k for k-medoids clustering. Though these attributes can characterize patient behaviors, they may also interact negatively with the 28 basic attributes. To avoid negative interaction among the features, we selected important features according to their information gain and used only the selected features to represent each patient. We show the feature selection process in Figure 4.
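Information gain for a discrete feature is the entropy of the class label minus the class entropy that remains once the feature value is known. The sketch below ranks two toy features this way; the feature names, the toy values, and the absence of any discretization step for continuous attributes are assumptions, since the article does not describe those details.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Class entropy minus the class entropy conditioned on a discrete feature."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature_values):
        subset = [y for x, y in zip(feature_values, labels) if x == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional


# Toy data: 1 = PONV, 0 = no PONV. "ward" is a made-up, uninformative feature.
ponv = [1, 1, 0, 0, 1, 0, 0, 0]
features = {
    "sex": ["F", "F", "M", "M", "F", "M", "M", "F"],
    "ward": ["A", "B", "A", "B", "A", "B", "A", "B"],
}
ranked = sorted(features, key=lambda name: information_gain(features[name], ponv), reverse=True)
print(ranked)
```

Keeping only the top-ranked features, or those above a gain threshold, gives the reduced representation used in the later experiments; the cut-off itself is not stated in the article.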
Data cleaning
Nausea and vomiting are the most common adverse effects of IV-PCA, with a reported incidence of 3.1% to 34% [14,15]. From a machine learning standpoint, prediction of nausea and vomiting is a classification problem in an imbalanced class domain. Conventional machine learning algorithms are typically biased toward the majority class and produce poor predictive accuracy for the minority class. In addition to the unequal class distribution, instances sparsely scattered in the data space make the prediction of a minority class even more difficult. We applied a neighborhood-based data cleaning approach to remove spurious data points of the majority class. It first identifies the k-nearest neighbors of each instance of the minority class and considers any majority class neighbor as “dirty.” After examining each instance in the minority class and its neighbors, the approach removes those “dirty” instances. The rationale behind this process is that the nearest majority class neighbors of a minority class member are likely to mislead learning algorithms. Without them, learning algorithms can more easily recognize the minority class boundary. We illustrate the concept in Figure 5. Figure 5a shows an imbalanced data set before removing “dirty” instances. The rectangles in this figure represent the decision regions of the minority class, which also contain several majority class examples. The approach first locates the k-nearest neighbors (e.g. k=3) of each minority class example and links them to that example (Figure 5b), and then crosses out the “dirty” majority class neighbors (Figure 5c). Removing the “dirty” examples produces the “clean” decision regions of the minority class (Figure 5d).
Figure 5: The X- and Y-axes represent two attributes in the feature space. The minority class examples are denoted by black circles and the majority class examples are denoted by white circles. Black rectangles indicate the axis-parallel decision regions of the minority class. (a) An imbalanced data set with sparse minority class examples. The decision regions of the minority class contain majority class examples. (b) To identify the “dirty” examples that may mislead learning, the proposed method locates the k-nearest neighbors (k is 3 in this example) of each minority class example. The 3-nearest neighbors of a minority class example are indicated by links. (c) A black cross marks each “dirty” example. (d) After the “dirty” examples are removed, the decision regions are “clean” (i.e., they contain only minority class examples). Using these clean decision regions, learning algorithms can more easily recognize the correct boundary between classes.
Performance measures
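The cleaning procedure illustrated in Figure 5 can be sketched in a few lines. This is an illustrative reading of the description, not the authors' implementation: the squared Euclidean distance, the function name `clean_majority`, and the toy coordinates are assumptions.

```python
def clean_majority(points, labels, k=3, minority=1):
    """Remove every majority-class point that is among the k nearest
    neighbors of at least one minority-class point."""
    dirty = set()
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        # Rank all other points by squared Euclidean distance to p.
        others = [j for j in range(len(points)) if j != i]
        others.sort(key=lambda j: sum((a - b) ** 2 for a, b in zip(p, points[j])))
        # Any majority-class point among the k nearest is marked "dirty".
        dirty.update(j for j in others[:k] if labels[j] != minority)
    keep = [i for i in range(len(points)) if i not in dirty]
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy data: two minority points (label 1) with one majority point sitting
# between them, plus a distant group of majority points (label 0).
points = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.5), (5.0, 5.0), (6.0, 5.0), (5.0, 6.0), (6.0, 6.0)]
labels = [1, 1, 0, 0, 0, 0, 0]
cleaned_points, cleaned_labels = clean_majority(points, labels, k=2)
print(cleaned_points, cleaned_labels)
```

Only the majority point lying between the two minority points is removed here; the distant majority group is untouched, so the class ratio shifts toward the minority class while the bulk of the majority data is preserved. In a cross-validation setting one would normally apply such cleaning to the training folds only; the article does not state how this was handled.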
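We evaluated prediction performance using several measures: accuracy, F-score, and the Matthews correlation coefficient (MCC). Table 1 lists the definitions of these measures, where TP, TN, FP, and FN denote the numbers of true positives, true negatives, false positives, and false negatives, respectively.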
Performance Measure | Definition |
---|---|
Recall | TP/(TP+FN) |
Precision | TP/(TP+FP) |
F-score | 2 × Recall × Precision/(Recall+Precision) |
MCC | (TP×TN - FP×FN)/√((TP+FP)(TP+FN)(TN+FP)(TN+FN)) |
Accuracy | (TP+TN)/(TP+TN+FP+FN) |
Table 1: Performance measures.
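As a worked example of these measures, the sketch below computes them from the four cells of a confusion matrix. The counts are made up for illustration and are not results from this study; the MCC line uses the standard definition of the Matthews correlation coefficient, and the zero-denominator guards are an added convention.

```python
from math import sqrt


def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero (a convention, not from the article)."""
    return num / den if den else 0.0


def scores(tp, fp, tn, fn):
    recall = safe_div(tp, tp + fn)
    precision = safe_div(tp, tp + fp)
    f_score = safe_div(2 * recall * precision, recall + precision)
    mcc = safe_div(tp * tn - fp * fn,
                   sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    accuracy = safe_div(tp + tn, tp + tn + fp + fn)
    return {"recall": recall, "precision": precision,
            "f_score": f_score, "mcc": mcc, "accuracy": accuracy}


# Hypothetical confusion matrix (not study data).
result = scores(tp=6, fp=2, tn=8, fn=4)
print({name: round(value, 3) for name, value in result.items()})
```

Unlike accuracy, both F-score and MCC drop sharply when the minority class is predicted poorly, which is why they are the more informative measures for an imbalanced problem.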
For PONV prediction, a high true positive rate is more desirable than high overall accuracy, because nausea and vomiting are, in addition to pain, the complaints that most often reduce patient satisfaction, and because the number of patients with PONV is much smaller than the number without (reported incidence of 3.1% to 34%). Therefore, our goal is to apply machine learning techniques to obtain the highest F-score rather than the highest overall accuracy. We show the other performance measures for reference.
Machine learning classifiers in comparison
We tested 9 classifiers on PONV prediction. These classifiers can be grouped into six categories: (a) decision-based, (b) instance-based, (c) probabilistic, (d) neural network, (e) feature-based, and (f) ensemble. We list the classifiers in Table 2. These classifiers have different design philosophies and applicability. There is little research into applications of machine learning to postoperative nausea and vomiting prediction. Through a comparative study, we intended to identify the superior classifiers and the appropriate patient features for PONV prediction.
Classifier | Category |
---|---|
PART [16] | Decision-based |
LADTree [17] | Decision-based |
K* [18] | Instance-based |
K-NN [19] | Instance-based |
Logistic Regression [20] | Probabilistic |
Bayes Net [21] | Probabilistic |
ANN [22] | Neural network |
VFI [23] | Feature-based |
Random Forest [24] | Ensemble |
Table 2: List of classifiers.
The goal of this study is twofold: (1) to compare the effects of different types of patient features as PONV risk factors, and (2) to evaluate the performance of different machine learning techniques for predicting PONV. To conduct a comparative study of risk factors, we divided patient features into 3 groups: (a) basic patient features, including demographic, biomedical, operation-related, and analgesics-related attributes, (b) the basic features plus PCA-related attributes, and (c) the complete feature set with behavior pattern attributes added. We performed experiments to evaluate the feasibility of various machine learning techniques, namely feature selection, data cleaning, and classification, and verified the synergy of combining these techniques. The experiments were conducted by performing stratified 10-fold cross-validation on the 392 study subjects.
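Stratified cross-validation keeps the PONV to no-PONV ratio roughly constant across folds. The sketch below shows one simple way to build such folds for the cohort sizes reported above (121 PONV, 271 no PONV); the round-robin assignment and the fixed random seed are assumptions, as the article does not describe how the folds were generated.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of sample indices, each with roughly the overall class ratio."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the shuffled indices of each class out to the folds in turn.
        for pos, i in enumerate(indices):
            folds[pos % n_folds].append(i)
    return folds


labels = [1] * 121 + [0] * 271  # class sizes reported for the study cohort
folds = stratified_folds(labels)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print([len(fold) for fold in folds], positives_per_fold)
```

Each fold then serves once as the test set while the remaining nine form the training set.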
Experiment of classifiers using different groups of patient features
We tested the classifiers listed in Table 2, using different groups of patient features. We present the results in Table 3. The results for each classifier are presented in the order of group (a), group (b), group (c) based on the time domain, and group (c) based on the frequency domain. For each classifier, we also performed a paired t-test between the results using group (a) and those using each of the other feature groups. A significant difference (p < 0.05) is indicated by a star symbol.
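Classifier | Feature group | F-score | MCC | Accuracy |
---|---|---|---|---|
PART | (a) | 0.368 | 0.090 | 0.611 |
PART | (b) | 0.369 | 0.104 | 0.628 |
PART | (c) time | 0.342 | 0.066 | 0.607 |
PART | (c) frequency | 0.373 | 0.138 | 0.648 |
LADTree | (a) | 0.294 | 0.072 | 0.641 |
LADTree | (b) | 0.391 | 0.147 | 0.646 |
LADTree | (c) time | 0.316 | 0.072 | 0.628 |
LADTree | (c) frequency | 0.336 | 0.077 | 0.615 |
K* | (a) | 0.333 | 0.023 | 0.571 |
K* | (b) | 0.333 | 0.063 | 0.605 |
K* | (c) time | 0.321 | 0.019 | 0.581 |
K* | (c) frequency | 0.371 | 0.072 | 0.592 |
K-NN | (a) | 0.341 | 0.078 | 0.619 |
K-NN | (b) | 0.344 | 0.093 | 0.630 |
K-NN | (c) time | 0.256 | 0.021 | 0.620 |
K-NN | (c) frequency | 0.315 | 0.039 | 0.602 |
Logistic Regression | (a) | 0.369 | 0.168 | 0.671 |
Logistic Regression | (b) | 0.385 | 0.127 | 0.633* |
Logistic Regression | (c) time | 0.366 | 0.019* | 0.538* |
Logistic Regression | (c) frequency | 0.356 | 0.019* | 0.546* |
Bayes Net | (a) | 0.516 | 0.291 | 0.689 |
Bayes Net | (b) | 0.501 | 0.268 | 0.676* |
Bayes Net | (c) time | 0.395* | 0.032* | 0.513* |
Bayes Net | (c) frequency | 0.435 | 0.098* | 0.546* |
ANN | (a) | 0.440 | 0.185 | 0.648 |
ANN | (b) | 0.363 | 0.097 | 0.617 |
ANN | (c) time | 0.353 | 0.078 | 0.605 |
ANN | (c) frequency | 0.383 | 0.122 | 0.628 |
VFI | (a) | 0.513 | 0.251 | 0.638 |
VFI | (b) | 0.517 | 0.254 | 0.641 |
VFI | (c) time | 0.402* | 0.019* | 0.495* |
VFI | (c) frequency | 0.433 | 0.095* | 0.543* |
Random Forest | (a) | 0.260 | 0.076 | 0.656 |
Random Forest | (b) | 0.270 | 0.130 | 0.681 |
Random Forest | (c) time | 0.166 | -0.044 | 0.620 |
Random Forest | (c) frequency | 0.255 | 0.069 | 0.645 |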
*Indicates a significant difference (p-val<0.05) from group (a)
Table 3: Results of classifiers using different features.
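The significance marks in Table 3 come from paired t-tests. A minimal sketch of such a test is given below, assuming the pairing is over the ten cross-validation folds (the article does not state what was paired) and using made-up per-fold F-scores. With 10 pairs there are 9 degrees of freedom, for which the two-sided critical value at the 0.05 level is 2.262.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(a, b):
    """t statistic of the paired differences b - a (degrees of freedom: len(a) - 1)."""
    diffs = [y - x for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical per-fold F-scores for two feature groups (not study data).
group_a = [0.50, 0.52, 0.48, 0.55, 0.51, 0.49, 0.53, 0.50, 0.52, 0.47]
group_c = [0.53, 0.56, 0.50, 0.60, 0.53, 0.52, 0.58, 0.52, 0.56, 0.50]

t_stat = paired_t(group_a, group_c)
T_CRIT_DF9 = 2.262  # two-sided critical value for alpha = 0.05 and 9 degrees of freedom
significant = abs(t_stat) > T_CRIT_DF9
print(round(t_stat, 2), significant)
```

An exact p-value would require the cumulative distribution function of Student's t distribution, which a statistics library can supply; comparing against the critical value is enough to decide significance at the 0.05 level.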
Table 3 shows that the addition of more features (PCA-related and behavior pattern attributes) had little effect on most of the classifiers under study. On the other hand, we observed that these extra features could hinder the learning of particular types of classifiers, such as probabilistic learners and feature-based learners. This suggests that the interactions introduced by additional features significantly affect some classifiers. This finding reconfirms that these learning algorithms have distinct characteristics and different applicability.
Experiment of feature selection
We hypothesized that the lack of improvement from the extra features in the first experiment was mainly due to adverse feature interactions. To verify this hypothesis, we first selected important features based on their information gain and then re-ran the experiment using the selected features. We show the results after feature selection in Table 4. As in Table 3, we present the results for each classifier in the order of group (a), group (b), group (c) based on the time domain, and group (c) based on the frequency domain. Numbers presented in italics indicate no improvement, or a decrease, in performance after feature selection compared with Table 3.
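Classifier | Feature group | F-score | MCC | Accuracy |
---|---|---|---|---|
PART | (a) | 0.427 | 0.240 | 0.696 |
PART | (b) | 0.478 | 0.263 | 0.689 |
PART | (c) time | 0.425 | 0.218 | 0.684 |
PART | (c) frequency | 0.415 | 0.183 | 0.656 |
LADTree | (a) | 0.403 | 0.148 | 0.636 |
LADTree | (b) | 0.403 | 0.148 | 0.636 |
LADTree | (c) time | 0.394 | 0.188 | 0.666 |
LADTree | (c) frequency | 0.352 | 0.105 | 0.628 |
K* | (a) | 0.448 | 0.206 | 0.648 |
K* | (b) | 0.422 | 0.208 | 0.669 |
K* | (c) time | 0.464 | 0.242 | 0.679 |
K* | (c) frequency | 0.458 | 0.248 | 0.687 |
K-NN | (a) | 0.426 | 0.192 | 0.663 |
K-NN | (b) | 0.399 | 0.162 | 0.655 |
K-NN | (c) time | 0.434 | 0.222 | 0.679 |
K-NN | (c) frequency | 0.418 | 0.193 | 0.663 |
Logistic Regression | (a) | 0.518 | 0.313 | 0.699 |
Logistic Regression | (b) | 0.518 | 0.313 | 0.699 |
Logistic Regression | (c) time | 0.498 | 0.316 | 0.707 |
Logistic Regression | (c) frequency | 0.455 | 0.272 | 0.707 |
Bayes Net | (a) | 0.523 | 0.306 | 0.697 |
Bayes Net | (b) | 0.523 | 0.303 | 0.694 |
Bayes Net | (c) time | 0.519 | 0.310 | 0.702 |
Bayes Net | (c) frequency | 0.521 | 0.306 | 0.700 |
ANN | (a) | 0.537 | 0.304 | 0.682 |
ANN | (b) | 0.531 | 0.311 | 0.694 |
ANN | (c) time | 0.470 | 0.221 | 0.659 |
ANN | (c) frequency | 0.472 | 0.258 | 0.686 |
VFI | (a) | 0.553 | 0.322 | 0.682 |
VFI | (b) | 0.553 | 0.322 | 0.682 |
VFI | (c) time | 0.555 | 0.316 | 0.684 |
VFI | (c) frequency | 0.611★ | 0.404★ | 0.699 |
Random Forest | (a) | 0.401 | 0.157 | 0.648 |
Random Forest | (b) | 0.424 | 0.184 | 0.656 |
Random Forest | (c) time | 0.486★ | 0.267★ | 0.689 |
Random Forest | (c) frequency | 0.448 | 0.231 | 0.679 |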
★Indicates a significant difference (p-val<0.05) from group (a)
Table 4: Results of applying feature selection.
Table 4 clearly confirms the merits of feature selection. The F-score and MCC increased substantially for all classifiers after feature selection, which indicates that feature selection resolves the feature interaction problem. As for the comparison between feature group (a) and the others, groups (b) and (c) show no significantly lower performance than group (a). On the contrary, we identified several significant positive results after feature selection. For example, the addition of PCA-related features and behavior-derived patterns increased F-score and MCC significantly for VFI and Random Forest.
We list the top-6 features in Table 5. The top-3 features are the demographic attributes, among which sex has been reported as a strong factor for PONV in several studies, and our study reconfirmed this finding. In addition, we identified patient height as an important factor, which agrees with a similar finding reported in a survival analysis [25-27]. The remaining features are pattern features that characterize PCA patient demand behaviors.
Significant Features |
---|
Surgery Size |
Sex |
Patient Height |
Frequency-based Behavior Patterns, using 20 min time units |
Frequency-based Behavior Patterns, using 60 min time units |
Time-based Behavior Patterns, using 60 min time units |
Table 5: Significant patient features as risk factors.
Experiment of data cleaning
While nausea and vomiting are the most common adverse effects of IV-PCA, the incidence of PONV is relatively low, which makes the machine learning task an imbalanced classification problem. In this study, we proposed to apply a neighborhood-based data cleaning method to better balance the classes and reveal a clearer class boundary by removing redundant data points of the majority class.
After feature selection, we performed data cleaning. We compared the effects of data cleaning for feature groups (a), (b) and (c). We show the results in Table 6. Numbers presented in italics indicate no improvement, or a decrease, in performance after data cleaning compared with Table 4. From Table 6 we notice that data cleaning improved F-score and MCC for most of the classifiers, except that the MCCs of Bayes Net and VFI decreased. Nevertheless, it is worth noting that when frequency-based behavior pattern features were used, data cleaning increased both F-score and MCC for all classifiers, and VFI produced the highest F-score and MCC. In contrast to F-score and MCC, the accuracy of all the classifiers decreased to varying degrees after data cleaning, in exchange for higher F-score and MCC. For an imbalanced class prediction problem such as PONV, F-score and MCC are more appropriate measures than accuracy, and these results indicate that our data cleaning method can improve performance on those measures.
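Classifier | Feature group | F-score | MCC | Accuracy |
---|---|---|---|---|
PART | (a) | 0.526 | 0.249 | 0.597 |
PART | (b) | 0.560 | 0.305 | 0.633 |
PART | (c) time | 0.521 | 0.244 | 0.585 |
PART | (c) frequency | 0.553 | 0.294 | 0.574 |
LADTree | (a) | 0.544 | 0.273 | 0.602 |
LADTree | (b) | 0.544 | 0.273 | 0.602 |
LADTree | (c) time | 0.549 | 0.294 | 0.638 |
LADTree | (c) frequency | 0.546 | 0.273 | 0.551★ |
K* | (a) | 0.519 | 0.219 | 0.508 |
K* | (b) | 0.535 | 0.275 | 0.631★ |
K* | (c) time | 0.535 | 0.268 | 0.618★ |
K* | (c) frequency | 0.576★ | 0.344★ | 0.630★ |
K-NN | (a) | 0.536 | 0.267 | 0.525 |
K-NN | (b) | 0.552 | 0.296 | 0.597★ |
K-NN | (c) time | 0.531 | 0.286 | 0.657★ |
K-NN | (c) frequency | 0.566 | 0.320 | 0.628 |
Logistic Regression | (a) | 0.561 | 0.317 | 0.635 |
Logistic Regression | (b) | 0.561 | 0.317 | 0.635 |
Logistic Regression | (c) time | 0.561 | 0.317 | 0.635 |
Logistic Regression | (c) frequency | 0.577 | 0.352 | 0.671 |
Bayes Net | (a) | 0.539 | 0.272 | 0.526 |
Bayes Net | (b) | 0.550 | 0.290 | 0.554★ |
Bayes Net | (c) time | 0.544 | 0.284 | 0.564★ |
Bayes Net | (c) frequency | 0.591 | 0.366★ | 0.671★ |
ANN | (a) | 0.563 | 0.322 | 0.649 |
ANN | (b) | 0.558 | 0.313 | 0.644 |
ANN | (c) time | 0.558 | 0.301 | 0.564★ |
ANN | (c) frequency | 0.548 | 0.279 | 0.584★ |
VFI | (a) | 0.557 | 0.307 | 0.618 |
VFI | (b) | 0.557 | 0.307 | 0.618 |
VFI | (c) time | 0.557 | 0.307 | 0.618 |
VFI | (c) frequency | 0.613★ | 0.406★ | 0.697 |
Random Forest | (a) | 0.554 | 0.298 | 0.594 |
Random Forest | (b) | 0.516★ | 0.230★ | 0.582 |
Random Forest | (c) time | 0.549 | 0.291 | 0.605 |
Random Forest | (c) frequency | 0.561 | 0.317 | 0.602 |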
★Indicates a significant difference (p-val<0.05) from group (a)
Table 6: Results of data cleaning.
Despite advancements in postoperative pain management, postoperative patient satisfaction remains inadequate in a large fraction of hospitalized patients. In addition to pain, nausea and vomiting have been the most distressing side effects of IV-PCA. Significant efforts have been focused on identifying and analyzing risk factors for PONV [8-10], whereas few previous studies have tested the predictive strength of the identified factors. Unlike most previous research, which mainly adopted statistical approaches, we not only applied machine learning methods to PONV prediction, but also made a thorough comparison of their performance. In addition, we proposed to consider patient PCA demand behaviors to improve PONV prediction. We conducted stratified 10-fold cross-validation, and the results confirmed the feasibility of applying machine learning to pain management.
This work is partially supported by the Ministry of Science and Technology of Taiwan (MOST 106-2221-E-009-184). The authors thank the Department of Anesthesiology at Changhua Christian Hospital for providing the IV-PCA patient data and for participating in this study.