A Jupyter notebook was used to clean the original data set. Once the file was loaded into the notebook, the data was read into a Pandas dataframe. Each input variable was inspected for missing data, and any row missing an input value was dropped from the dataframe. This produced a dataframe with 13 categorical and numerical input variables and one categorical output variable indicating whether the patient was healthy or sick with heart disease. The cleaning process condensed the original file from 303 entries down to 296. The cleaned data was then saved to a new CSV file for reference.
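A minimal sketch of this cleaning step, assuming the raw file is a CSV with missing entries already read in as NaN (file names here are placeholders), might look like the following:

```python
import pandas as pd

# Load the original data set (file name assumed for illustration).
df = pd.read_csv("heart_disease_original.csv")

# Drop any row that is missing one or more input values;
# this reduced the data set from 303 entries to 296.
df = df.dropna()

# Save the cleaned data to a new CSV file for reference.
df.to_csv("heart_disease_cleaned.csv", index=False)
```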
As part of the ETL process to develop the machine learning models, the 13 input variables were evaluated with several feature selection techniques to better understand their potential influence on the output variable indicating whether a patient was healthy or sick.
First, the numerical inputs, such as age and maximum heart rate, were plotted on histograms to check whether any of the input data in the full data set was skewed. The only numerical input variable that appeared skewed was “old peak,” which refers to an electrocardiogram finding related to ST depression induced by exercise. The majority of patients had an old peak value of 0, and since a negative old peak value is not possible, the histogram shows a right-skewed distribution. Overall, the numerical inputs did not appear to be inappropriately skewed.
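As an illustrative sketch, with column names assumed to match those listed later in Table 2, the histograms could be produced with:

```python
import matplotlib.pyplot as plt

# Numerical input columns (names assumed for illustration).
numerical_inputs = ["age", "trestBps", "cholesterol", "maxHeartRate", "oldPeak"]

# One histogram per numerical input to check for skewed distributions.
df[numerical_inputs].hist(bins=20, figsize=(10, 6))
plt.tight_layout()
plt.show()
```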
As a next step, the full data set was split into train and test data sets using the scikit-learn “train_test_split” function. A random seed of 42 was used so that the split was reproducible each time the notebook was run. To verify the test and train data were split appropriately, the numerical inputs of the test and train sets were plotted against one another to ensure they overlapped and that nothing was skewed by the random splitting. Graphing the data demonstrated that the test and train distributions overlapped nicely, with the test data making up a smaller subset of the full data set.
Figure 2. An example of graphing the split test and train cholesterol input data to identify any skewed splitting in the data
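A minimal sketch of the split and the overlap check shown in Figure 2, where the 80/20 split ratio and the cholesterol column name are assumptions for illustration:

```python
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

# Fixed random seed so the split is reproducible across notebook runs.
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# Overlay train and test histograms for one numerical input to confirm
# the random split did not skew either subset (see Figure 2).
plt.hist(train_df["cholesterol"], bins=20, alpha=0.5, label="train")
plt.hist(test_df["cholesterol"], bins=20, alpha=0.5, label="test")
plt.xlabel("cholesterol")
plt.legend()
plt.show()
```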
Statistical hypothesis tests from SciPy were also used to check whether there was a statistically significant difference between the test and train numerical input data. A one-way ANOVA, a two-sided t-test with unequal variances (Welch’s t-test), and a Kruskal-Wallis H-test were each run at a 95% confidence level, and the resulting p-values comparing the test and train populations were used to judge statistical significance. Most numerical inputs had p-values greater than 0.05, implying no statistically significant difference between the test and train data. The one exception was cholesterol, driven by a potential outlier in the test data: a patient with cholesterol above 500 (see Figure 2). Because the full data set was already limited in size, the outlier was not removed from the test data set.
Table 1. P-values comparing the train and test numerical input data

Numerical Input | One-Way ANOVA | Two-Sided T-Test | Kruskal-Wallis
Age             | 0.703         | 0.720            | 0.687
BPS             | 0.977         | 0.976            | 0.634
Cholesterol     | 0.021         | 0.033            | 0.005
Max Heart Rate  | 0.832         | 0.845            | 0.939
Old Peak        | 0.276         | 0.242            | 0.357
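A minimal sketch of these comparisons for a single input, assuming the train/test splits from above and using SciPy’s f_oneway, ttest_ind, and kruskal functions:

```python
from scipy import stats

train_col = train_df["cholesterol"]
test_col = test_df["cholesterol"]

# One-way ANOVA, two-sided t-test with unequal variances (Welch), and
# Kruskal-Wallis H-test comparing the train and test distributions.
_, p_anova = stats.f_oneway(train_col, test_col)
_, p_ttest = stats.ttest_ind(train_col, test_col, equal_var=False)
_, p_kruskal = stats.kruskal(train_col, test_col)

# p-values above 0.05 suggest no statistically significant difference.
print(p_anova, p_ttest, p_kruskal)
```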
Next, scikit-learn feature selection functions were used to evaluate the importance of the different inputs. To use these functions, the data had to be prepared in the same way it would be fed to the models, so a function was created to split, encode, and scale the data. As mentioned previously, the “train_test_split” function was used with a fixed random seed to split the full data set. Numerical inputs were scaled with the scikit-learn MinMaxScaler, while categorical input data was encoded using the Pandas get_dummies function. The categorical output data was then encoded with the scikit-learn LabelEncoder. Finally, the test and train inputs and output variables were saved into separate CSV files for the models to use.
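A sketch of such a preparation function, assuming hypothetical column names, an assumed output column called “diagnosis”, and an 80/20 split:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, LabelEncoder

# Column and file names are assumptions for illustration.
NUMERICAL = ["age", "trestBps", "cholesterol", "maxHeartRate", "oldPeak"]
OUTPUT = "diagnosis"

def split_encode_scale(df):
    # One-hot encode the categorical inputs with get_dummies.
    X = pd.get_dummies(df.drop(columns=[OUTPUT]))
    # Encode the categorical output (healthy/sick) as 0/1.
    y = LabelEncoder().fit_transform(df[OUTPUT])

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    X_train, X_test = X_train.copy(), X_test.copy()

    # Scale the numerical inputs to the [0, 1] range; the scaler is fit on
    # the training data only and then applied to the test data.
    scaler = MinMaxScaler()
    X_train[NUMERICAL] = scaler.fit_transform(X_train[NUMERICAL])
    X_test[NUMERICAL] = scaler.transform(X_test[NUMERICAL])

    # Save the prepared splits to separate CSV files for the models to use.
    X_train.to_csv("X_train.csv", index=False)
    X_test.to_csv("X_test.csv", index=False)
    pd.Series(y_train, name=OUTPUT).to_csv("y_train.csv", index=False)
    pd.Series(y_test, name=OUTPUT).to_csv("y_test.csv", index=False)
    return X_train, X_test, y_train, y_test
```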
The first scikit-learn feature selection function used was SelectKBest, which identifies significant input features based on univariate statistical tests. Because the output is categorical, the “chi2” scoring function was passed in to apply chi-squared statistics to the data set. The function returned a score for each input feature, which were then ranked from highest to lowest. Because the categorical inputs were one-hot encoded, each option of a categorical input variable received its own score. For example, the “thal” input had the choices “rev”, “norm”, and “fix”: “thal_rev” ranked high with a score of 34.2, while “thal_fix” ranked low with a score of 0.74. The most unintuitive result was cholesterol, which ranked near the bottom; common sense would lead most people to assume cholesterol is a more significant contributor to heart disease.
Table 2. Chi-Square Score Rank of the Input Features
Input ID | Feature                    | Chi-Square Score
27       | thal_rev                   | 34.200294
17       | exerciseInducedAngina_true | 33.418728
9        | chestPain_asympt           | 32.604218
26       | thal_norm                  | 28.482451
21       | vesselsColored_0.0         | 21.540903
20       | slope_up                   | 18.872001
19       | slope_flat                 | 15.823077
16       | exerciseInducedAngina_fal  | 14.756322
7        | chestPain_abnang           | 14.126050
10       | chestPain_notang           | 13.687783
23       | vesselsColored_2.0         | 12.695136
5        | sex_fem                    | 12.124316
22       | vesselsColored_1.0         | 11.352036
4        | oldPeak                    | 10.745139
24       | vesselsColored_3.0         | 7.004118
6        | sex_male                   | 5.700838
13       | ecg_abn                    | 3.529412
8        | chestPain_angina           | 2.826471
15       | ecg_norm                   | 2.605742
3        | maxHeartRate               | 1.955725
14       | ecg_hyp                    | 1.548633
0        | age                        | 0.975826
25       | thal_fix                   | 0.741422
18       | slope_down                 | 0.684007
1        | trestBps                   | 0.355538
2        | cholesterol                | 0.136939
12       | bloodSugar_true            | 0.022368
11       | bloodSugar_fal             | 0.004620
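A minimal sketch of this ranking, assuming the encoded and scaled X_train / y_train produced by the preparation function above:

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2

# Score every encoded input feature against the output with the chi-squared
# statistic; k="all" keeps all features so their scores can simply be ranked.
selector = SelectKBest(score_func=chi2, k="all")
selector.fit(X_train, y_train)

scores = pd.Series(selector.scores_, index=X_train.columns)
print(scores.sort_values(ascending=False))
```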
The next feature selection function used was the ExtraTreesClassifier. It fits an ensemble of randomized decision trees and assigns an importance score to each classifier input, applying essentially the same logic as a decision tree in machine learning. The 15 inputs with the highest scores were plotted in a horizontal bar chart to help visualize their scores relative to one another. Results from this function mirrored those from the SelectKBest function.
Figure 3. Top 15 scored classifier inputs by the ExtraTreesClassifier function
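A sketch of this step, again assuming the X_train / y_train splits from the preparation function, with hyperparameters left at scikit-learn defaults:

```python
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import ExtraTreesClassifier

# Fit the ensemble of randomized trees and read its feature importances.
model = ExtraTreesClassifier(random_state=42)
model.fit(X_train, y_train)

importances = pd.Series(model.feature_importances_, index=X_train.columns)

# Horizontal bar chart of the 15 highest-scoring inputs (see Figure 3).
importances.nlargest(15).plot(kind="barh")
plt.show()
```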
The final feature selection technique applied to the input data was a correlation matrix, visualized with a heat map. This technique shows how input variables relate to each other and helps identify whether any inputs are dependent on one another and may be confounded. Because it only works on numerical data, only five features in the data set were evaluated with this approach. The correlation function and a Seaborn heat map were used to create the graph. Overall, the correlations between the tested features are relatively low, with the strongest being between maximum heart rate and age at -0.4.
Figure 4. Input correlation results with Seaborn heat map
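A minimal sketch of the heat map, assuming the same hypothetical column names used above and the Pandas corr() function:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix over the five numerical inputs only
# (column names assumed for illustration).
numerical_inputs = ["age", "trestBps", "cholesterol", "maxHeartRate", "oldPeak"]
corr = df[numerical_inputs].corr()

# Heat map of the pairwise correlations (see Figure 4).
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.show()
```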
When deciding what data to include in the machine learning models, the team chose not to eliminate any inputs from the original data set. The original data set cleaned for this ETL work had already been pared down by other individuals from a larger set with around 48 inputs.