15 May, 2022

Heart Disease Prediction using Matlab

Introduction
Cardiology is one of the most important specialties in medicine and also one of the most difficult to practice well. If heart disease is not detected in its early stages, it can be fatal. According to the Centers for Disease Control and Prevention (CDC), over 610,000 people in the United States die from heart disease each year, accounting for one out of every four deaths, and this figure is increasing. The CDC also reported that men accounted for more than half of the heart disease deaths recorded in 2009.
Visualization
The response variable, target, is binary: 1 indicates that the individual has heart disease, and 0 indicates that the individual does not.


Because the dataset contains a large number of independent variables, we chose to focus on four of them: cholesterol, heart rate, chest pain, and the resting ECG test. These are the variables that physicians pay the most attention to when diagnosing the condition. Based on the data, all four appeared to follow a normal distribution. To visualize the relationships, target was plotted against all other variables in a correlation matrix. The following is a representation of the correlation matrix:

Although doctors focus on those four indicators, the matrix shows that cholesterol, resting heart rate, and the ECG result do not have a significant correlation with the dependent variable.
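
As a rough illustration of what one column of such a correlation matrix contains, here is a minimal Python sketch that computes the Pearson correlation between each predictor and target. The column names (chol, thalach, cp, restecg) and every value in it are made up for the example and are not taken from the dataset used in this post.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical rows for illustration only, NOT the post's data.
columns = {
    "chol": [233, 250, 204, 236, 354, 192, 294, 263],
    "thalach": [150, 187, 172, 178, 163, 148, 153, 173],
    "cp": [3, 2, 1, 1, 0, 1, 1, 1],
    "restecg": [0, 1, 0, 1, 1, 1, 0, 1],
}
target = [1, 1, 1, 1, 0, 0, 0, 1]

# One "target" column of the correlation matrix, strongest correlation first.
correlations = {name: pearson(values, target) for name, values in columns.items()}
for name, r in sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True):
    print(f"{name:8s} r = {r:+.3f}")
```

A value close to 0 means the predictor has little linear association with target, which is the pattern described above for cholesterol, heart rate, and the ECG result.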
Prediction
Logistic regression model
Logistic regression was used to model the relationship between the categorical response variable and a number of categorical independent variables. Approximately 75% of the dataset was used for training, and the remaining 25% was used for testing.
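
The 75/25 split and the model fit can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names (train_test_split, fit_logistic, predict), the gradient-descent settings, and the synthetic two-feature data are all hypothetical and are not the code or data behind the results in this post.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def fit_logistic(rows, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n_features = len(rows[0][0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for features, label in rows:
            p = sigmoid(bias + sum(w * x for w, x in zip(weights, features)))
            error = p - label
            for j, x in enumerate(features):
                grad_w[j] += error * x
            grad_b += error
        weights = [w - learning_rate * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(rows)
    return weights, bias


def predict(model, features):
    """Return 1 when the predicted probability is at least 0.5, else 0."""
    weights, bias = model
    return 1 if sigmoid(bias + sum(w * x for w, x in zip(weights, features))) >= 0.5 else 0


# Synthetic data (an assumption): the first feature drives the label, the second is noise.
rng = random.Random(42)
rows = []
for _ in range(40):
    signal = rng.uniform(-2.0, 2.0)
    noise = rng.uniform(-1.0, 1.0)
    rows.append(([signal, noise], 1 if signal > 0 else 0))

train, test = train_test_split(rows)  # 75% train, 25% test
model = fit_logistic(train)
test_accuracy = sum(predict(model, f) == y for f, y in test) / len(test)
print("weights:", model[0], "bias:", model[1], "test accuracy:", test_accuracy)
```

A statistics package would normally also report p-values for each coefficient, which is what the significance discussion below relies on; this sketch only shows the fitting and splitting steps.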

According to the model results, gender, chest pain, the number of visible major arteries, and the maximum heart rate all have an influence on the development of heart disease. We therefore created a new dataset that includes only the statistically significant variables.

Before fitting the model, the variance inflation factor (VIF) can be used to check whether there is multicollinearity in the data.

The VIF values are modest, so the data does not appear to be multicollinear, and we can proceed to fit the logit model on the dataset of significant variables.
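
For readers who want to see how a VIF is obtained, each predictor is regressed on all of the other predictors, and its VIF is 1 / (1 - R^2) of that regression. The Python sketch below does this with ordinary least squares. The helper names and the sample values are assumptions for illustration, not the computation or the data used in this post. A common rule of thumb treats values above roughly 5 to 10 as a sign of multicollinearity.

```python
def solve(matrix, rhs):
    """Solve a square linear system by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def r_squared(y, predictors):
    """R^2 from regressing y on the predictor columns plus an intercept."""
    n = len(y)
    design = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[i] * row[j] for row in design) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * y[r] for r, row in enumerate(design)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(columns):
    """Return {name: VIF}, where VIF_j = 1 / (1 - R_j^2)."""
    result = {}
    for name, col in columns.items():
        others = [c for other, c in columns.items() if other != name]
        result[name] = 1.0 / (1.0 - r_squared(col, others))
    return result


# Hypothetical predictor values for illustration only, NOT the post's data.
predictors = {
    "chol": [233, 250, 204, 236, 354, 192, 294, 263, 199, 168],
    "thalach": [150, 187, 172, 178, 163, 148, 153, 173, 162, 174],
    "cp": [3, 2, 1, 1, 0, 1, 1, 1, 2, 2],
    "restecg": [0, 1, 0, 1, 1, 1, 0, 1, 1, 1],
}
vif_values = vif(predictors)
for name, value in vif_values.items():
    print(f"{name:8s} VIF = {value:.2f}")
```

A VIF of exactly 1 means a predictor is uncorrelated with the others, and the value grows without bound as a predictor becomes a linear combination of the rest.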

As can be seen, all of the predictors are significant, which is beneficial for making predictions. Let's have a look at how each one relates to the outcome.

According to the figure above, type 2 chest pain is the type of chest pain most strongly associated with heart disease. A person with chest pain may exhibit any or all of the signs of cardiac illness. In this model, exercise-induced angina and a larger number of visible major arteries are associated with a lower predicted risk of heart disease. Although the variance inflation factor results were good, we still used cross-validation via the trainControl() method to guard against overfitting.
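
The idea behind cross-validation can be sketched in Python as follows. This is a generic k-fold illustration, not a reproduction of the trainControl() call used here. The function names, the majority-class stand-in model, and the 30 made-up rows are all assumptions.

```python
import random


def k_fold_indices(n_rows, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every row is held out exactly once."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx


def cross_validate(rows, fit, score, k=5):
    """Average the score of a model fitted on k-1 folds and scored on the held-out fold."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(rows), k):
        model = fit([rows[i] for i in train_idx])
        scores.append(score(model, [rows[i] for i in test_idx]))
    return sum(scores) / len(scores)


def fit_majority(rows):
    """Stand-in model (an assumption): always predict the most common training label."""
    labels = [label for _, label in rows]
    return 1 if sum(labels) * 2 >= len(labels) else 0


def accuracy(model, rows):
    """Share of rows whose label equals the constant prediction."""
    return sum(1 for _, label in rows if label == model) / len(rows)


# Made-up rows of (features, label) for illustration only.
rows = [([i], 1 if i % 3 else 0) for i in range(30)]
mean_accuracy = cross_validate(rows, fit_majority, accuracy, k=5)
print("mean held-out accuracy:", round(mean_accuracy, 4))
```

Because every score comes from rows the model did not see during fitting, the average is a more honest estimate of performance on new data than the training accuracy. Any model with a fit and a score function can be plugged in instead of the majority-class stand-in.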

The accuracy of the model is 0.8421, or 84.21%.
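
Accuracy is simply the share of test cases whose predicted class matches the actual class. The short Python sketch below shows the arithmetic on a handful of made-up labels. These labels are hypothetical and are not the test set behind the 84.21% figure.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for binary 0/1 labels."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for a, p in zip(actual, predicted):
        if a == 1 and p == 1:
            counts["tp"] += 1
        elif a == 0 and p == 0:
            counts["tn"] += 1
        elif a == 0 and p == 1:
            counts["fp"] += 1
        else:
            counts["fn"] += 1
    return counts


def accuracy(actual, predicted):
    """Fraction of predictions that match the actual labels."""
    counts = confusion_counts(actual, predicted)
    return (counts["tp"] + counts["tn"]) / len(actual)


# Hypothetical labels for illustration only.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

counts = confusion_counts(actual, predicted)
print(counts)
print(f"accuracy = {accuracy(actual, predicted):.4f}")
```

Looking at the four counts separately is useful in a medical setting, because a missed case of heart disease (a false negative) is usually more costly than a false alarm.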
Conclusion
The prediction model built in this study has an accuracy of 84.21%, and it can indicate whether we should take steps to lower our risk of heart disease and stroke. We used logistic regression to model the relationship between the categorical response variable and a number of categorical independent variables. Better results could likely be obtained by considering additional characteristics and by broadening the scope of the data collection.
