ETL Summary:

A Jupyter notebook was used to clean the original data set. Once the file was loaded into the notebook, the data was read into a Pandas DataFrame. Each input variable was inspected for missing data, and any row missing an input value was dropped from the DataFrame. This produced a DataFrame with 13 categorical and numerical input variables and one categorical output variable indicating whether the patient was healthy or sick with heart disease. The cleaning process condensed the original file from 303 entries down to 296. The cleaned data was then saved to a new CSV file for reference.
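
A condensed sketch of that cleaning step is shown below; the file names are placeholders, since the notebook itself is not reproduced in this summary.

    import pandas as pd

    # Load the original data set into a Pandas DataFrame (placeholder file name)
    df = pd.read_csv("heart_disease_original.csv")

    # Drop any row with a missing value in any of the input variables
    df = df.dropna()

    # 303 rows before cleaning, 296 after dropping incomplete rows
    print(len(df))

    # Save the cleaned data to a new CSV file for reference
    df.to_csv("heart_disease_cleaned.csv", index=False)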

As part of the ETL process to develop machine learning models, the 13 input variables were evaluated through several feature selection techniques to better understand their potential influence on the output variable, i.e., whether a patient was healthy or sick.

First, the numerical inputs, such as age and maximum heart rate, were plotted as histograms to check whether any of the input data in the full data set was skewed. The only numerical input variable that appeared skewed was “old peak,” which refers to a finding on an electrocardiogram related to ST depression induced by exercise. The majority of patients have an old peak value of 0, and a negative old peak value is not possible, so the histogram takes on a right-skewed appearance. Overall, the numerical inputs did not appear to be inappropriately skewed.

Figure 1. Histograms of the Max Heart Rate and Old Peak numerical input data
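
These histograms can be drawn directly from the cleaned DataFrame, roughly as follows; the column names are assumptions based on the feature names used later in this summary.

    import matplotlib.pyplot as plt

    # Histograms of the numerical inputs to check for skew
    # (column names are assumed; df is the cleaned DataFrame from above)
    numeric_cols = ["age", "trestBps", "cholesterol", "maxHeartRate", "oldPeak"]
    df[numeric_cols].hist(bins=20, figsize=(10, 6))
    plt.tight_layout()
    plt.show()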

As a next step, the full data set was split into train and test data sets using the scikit-learn train_test_split function. A random seed of 42 was used so that the full data set was not split into different groups each time the notebook was run. To verify that the test and train data were representative of one another, the numerical inputs of the test and train data were plotted against each other to confirm they overlapped and that nothing was skewed by the random split. Graphing the data showed that the test and train data overlapped nicely, with the test data making up a smaller subset of the full data set.
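
A sketch of the split and the overlap check is shown below; the test-set fraction is not stated in this summary and is left at the library default, and the plotted column is just one example.

    from sklearn.model_selection import train_test_split
    import matplotlib.pyplot as plt

    # Split once with a fixed seed so reruns of the notebook give the same split
    train_df, test_df = train_test_split(df, random_state=42)

    # Overlay train and test histograms for one numerical input to check for skew
    plt.hist(train_df["maxHeartRate"], bins=20, alpha=0.5, label="train")
    plt.hist(test_df["maxHeartRate"], bins=20, alpha=0.5, label="test")
    plt.xlabel("Max Heart Rate")
    plt.legend()
    plt.show()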


Figure 2. An example of graphing the split test and train cholesterol input data to identify any skewed splitting in the data

Statistical functions from SciPy were also used to test whether there was a statistically significant difference between the means of the test and train numerical input data. A one-way ANOVA, a two-sided t-test with unequal variances, and a Kruskal-Wallis H-test were each run at a 95% confidence level. The p-values comparing the test and train populations were used to determine statistical significance. The majority of numerical inputs had p-values greater than 0.05, implying there was no statistically significant difference between the means of the test and train data. The one exception was cholesterol, due to a potential outlier in the test data from a patient with cholesterol above 500 (see Figure 2). Because the full data set was limited in size, the outlier was not removed from the test data set.
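
The three tests can be reproduced with SciPy roughly as follows, reusing the train and test frames from the split above.

    from scipy import stats

    numeric_cols = ["age", "trestBps", "cholesterol", "maxHeartRate", "oldPeak"]
    for col in numeric_cols:
        train_vals, test_vals = train_df[col], test_df[col]
        # One-way ANOVA on the two groups
        _, p_anova = stats.f_oneway(train_vals, test_vals)
        # Two-sided t-test without assuming equal variances
        _, p_ttest = stats.ttest_ind(train_vals, test_vals, equal_var=False)
        # Kruskal-Wallis H-test (non-parametric)
        _, p_kruskal = stats.kruskal(train_vals, test_vals)
        print(f"{col}: {p_anova:.3f}, {p_ttest:.3f}, {p_kruskal:.3f}")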

Table 1. P-values of the statistical tests comparing the means of the test and train split numerical input data

Numerical Input    One-Way ANOVA    Two-Sided T-Test    Kruskal-Wallis
Age                0.703            0.720               0.687
BPS                0.977            0.976               0.634
Cholesterol        0.021            0.033               0.005
Max Heart Rate     0.832            0.845               0.939
Old Peak           0.276            0.242               0.357

Next, scikit-learn feature selection functions were used to evaluate the importance of the different inputs. To use these functions, the data had to be prepared the same way it would be fed to the models. Therefore, a function was created to split, encode, and scale the data. As mentioned previously, the train_test_split function was used with a random seed to split the full data set. Numerical inputs were scaled with the scikit-learn MinMaxScaler, while categorical input data was encoded using the Pandas get_dummies function. Next, the categorical output data was encoded with the scikit-learn LabelEncoder. Finally, the test and train input and output variables were saved into separate CSV files for the models to use.
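
A condensed sketch of that preprocessing function is shown below; the output column name and the exact column groupings are assumptions, since they are not spelled out in this summary.

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import MinMaxScaler, LabelEncoder

    def prepare_data(df, output_col="diagnosis"):  # output column name assumed
        numeric_cols = ["age", "trestBps", "cholesterol", "maxHeartRate", "oldPeak"]
        categorical_cols = [c for c in df.columns if c not in numeric_cols + [output_col]]

        # One-hot encode the categorical inputs
        X = pd.get_dummies(df.drop(columns=[output_col]), columns=categorical_cols)
        # Encode the categorical output (healthy / sick) as integers
        y = LabelEncoder().fit_transform(df[output_col])

        # Split with the same fixed seed used earlier
        X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

        # Scale only the numerical inputs to the 0-1 range
        scaler = MinMaxScaler()
        X_train[numeric_cols] = scaler.fit_transform(X_train[numeric_cols])
        X_test[numeric_cols] = scaler.transform(X_test[numeric_cols])

        # (The notebook then saved these frames to separate CSV files.)
        return X_train, X_test, y_train, y_test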

The first scikit-learn feature selection function used was SelectKBest, which identifies significant input features based on univariate statistical tests. Since the output is categorical, the chi2 scoring function was passed in so that chi-squared statistics were applied to the data set. The function returns a score for each input, which was ranked from highest to lowest. Because the categorical inputs were encoded, each level of a categorical input variable was given its own score. For example, the “thal” input had the choices of “rev”, “norm”, and “fix”; “thal_rev” ranked high with a score of 34.2, while “thal_fix” ranked low with a score of 0.74. Overall, the most unintuitive result was cholesterol, which ranked near the bottom; common sense would lead the average person to assume cholesterol is a more significant contributor to heart disease.
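
A sketch of how the scores in Table 2 could be produced, reusing the prepare_data sketch above:

    import pandas as pd
    from sklearn.feature_selection import SelectKBest, chi2

    X_train, X_test, y_train, y_test = prepare_data(df)

    # Score every encoded input against the output using chi-squared statistics
    selector = SelectKBest(score_func=chi2, k="all")
    selector.fit(X_train, y_train)

    scores = pd.DataFrame({"Specs": X_train.columns, "Score": selector.scores_})
    print(scores.sort_values("Score", ascending=False))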

Table 2. Chi-Square Score Rank of the Input Features

Input ID   Specs                        Score
27         thal_rev                     34.200294
17         exerciseInducedAngina_true   33.418728
9          chestPain_asympt             32.604218
26         thal_norm                    28.482451
21         vesselsColored_0.0           21.540903
20         slope_up                     18.872001
19         slope_flat                   15.823077
16         exerciseInducedAngina_fal    14.756322
7          chestPain_abnang             14.126050
10         chestPain_notang             13.687783
23         vesselsColored_2.0           12.695136
5          sex_fem                      12.124316
22         vesselsColored_1.0           11.352036
4          oldPeak                      10.745139
24         vesselsColored_3.0           7.004118
6          sex_male                     5.700838
13         ecg_abn                      3.529412
8          chestPain_angina             2.826471
15         ecg_norm                     2.605742
3          maxHeartRate                 1.955725
14         ecg_hyp                      1.548633
0          age                          0.975826
25         thal_fix                     0.741422
18         slope_down                   0.684007
1          trestBps                     0.355538
2          cholesterol                  0.136939
12         bloodSugar_true              0.022368
11         bloodSugar_fal               0.004620

The next feature selection function used was the Extra Trees Classifier. It essentially applies the same logic as a decision tree in machine learning to assign an importance score to each input. The top 15 inputs with the highest scores were plotted in a horizontal bar chart to help visualize their scores relative to one another. Results from this function mirrored those from the SelectKBest function.
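
The importances behind Figure 3 can be obtained along these lines, again reusing the prepared training data; the number of trees is left at the scikit-learn default.

    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.ensemble import ExtraTreesClassifier

    # Fit the randomized tree ensemble and pull out its feature importances
    model = ExtraTreesClassifier(random_state=42)
    model.fit(X_train, y_train)

    importances = pd.Series(model.feature_importances_, index=X_train.columns)
    # Horizontal bar chart of the 15 highest-scoring inputs
    importances.nlargest(15).plot(kind="barh")
    plt.show()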


Figure 3. Top 15 scored classifier inputs by the Extra Trees Classifier function


The final feature selection technique used on the input data was a correlation matrix with a heat map to help visualize the results. This technique shows how the input variables relate to each other and helps identify whether any of the inputs are dependent on one another and may be confounded. Because it only works on numerical data, only five features in the data set were evaluated with this approach. The correlation function and a Seaborn heat map were used to create the graph. Overall, the correlations between the tested features are relatively low, with the strongest being between maximum heart rate and age at -0.4.
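
A sketch of the heat map, assuming the Pandas corr method was the correlation function used:

    import seaborn as sns
    import matplotlib.pyplot as plt

    numeric_cols = ["age", "trestBps", "cholesterol", "maxHeartRate", "oldPeak"]
    # Pairwise correlations between the numerical inputs
    corr = df[numeric_cols].corr()
    sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
    plt.show()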


Figure 4. Input correlation results with Seaborn heat map


When deciding what data to include in the machine learning models, the team decided not to eliminate any features from the original data set. The original data set that was cleaned for this ETL work had already been pared down by other individuals from a larger set with around 48 inputs.