This content originally appeared on DEV Community and was authored by Vishnu Ajit
Excited to share my second tutorial, along with the Python notebook I made for experimenting with machine learning algorithms! This time we are exploring a project using LogisticRegression. It loads the dataset from a CSV file (obtained from Kaggle) and lets us predict the probability of a patient having a heart attack 🧑💻📊
Concepts Used Include:
- LogisticRegression🌀
- StandardScaler from the sklearn.preprocessing library 🎯
- fit_transform() method ➖
- train_test_split() 🌟
- model.predict() 🔄
- model.predict_proba() 🌟
- classification_report() 🌟
- roc_auc_score() 🎯
Why This Notebook:
The main goal of this notebook is to understand, hands-on, how to use logistic regression, one of the core machine learning algorithms. Using the beauty of the Python programming language, we try to predict from a patient's hospital data whether they might have a heart attack in the future.
I've included a link to my notebook below to guide you through it.
The link to the notebook: https://github.com/ruforavishnu/Project_Machine_Learning/blob/master/project-supervised-learning-logistic-regression-heart-disease-prediction.ipynb
The link to the dataset: https://github.com/ruforavishnu/Project_Machine_Learning/blob/master/heart-disease-prediction.csv (dataset obtained from Kaggle)
Kaggle URL for the same dataset: https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression
What’s Next:
Over the next week, I'll be posting more of my notebooks covering other machine learning concepts, following this roadmap: https://www.kaggle.com/discussions/getting-started/554563 [Machine Learning Engineer Roadmap for 2025]
We'll especially be looking at Supervised Learning and Unsupervised Learning to get our feet wet before we begin to walk towards the shores of greater Artificial Intelligence.
Who's This For:
For anybody who loves Python and has been telling themselves, "I'm gonna learn Machine Learning one day." This is Day 2 for them! Let's learn Machine Learning together :) Yesterday we looked at Linear Regression. Today we are exploring the concept called Logistic Regression.
Feel free to explore the notebook and try out your own machine learning models! 🚀
Kaggle References: https://www.kaggle.com/discussions/getting-started/554563 [Machine Learning Engineer Roadmap for 2025]
Now, let's begin coding, shall we? :)
Step 1. Load the dataset from our CSV file
import pandas as pd

# load the Kaggle CSV into a DataFrame and peek at the first five rows
data = pd.read_csv('heart-disease-prediction.csv')
print(data.head())
and we get the output
male age education currentSmoker cigsPerDay BPMeds prevalentStroke \
0 1 39 4.0 0 0.0 0.0 0
1 0 46 2.0 0 0.0 0.0 0
2 1 48 1.0 1 20.0 0.0 0
3 0 61 3.0 1 30.0 0.0 0
4 0 46 3.0 1 23.0 0.0 0
prevalentHyp diabetes totChol sysBP diaBP BMI heartRate glucose \
0 0 0 195.0 106.0 70.0 26.97 80.0 77.0
1 0 0 250.0 121.0 81.0 28.73 95.0 76.0
2 0 0 245.0 127.5 80.0 25.34 75.0 70.0
3 1 0 225.0 150.0 95.0 28.58 65.0 103.0
4 0 0 285.0 130.0 84.0 23.10 85.0 85.0
TenYearCHD
0 0
1 0
2 0
3 1
4 0
Step 2. Let's explore the data ourselves first
Notice the last column in the output above, TenYearCHD. That is our target: 1 means the patient went on to develop coronary heart disease within ten years, 0 means they did not.
We try running data.info() on our dataset
print(data.info())
and we get the output as
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4238 entries, 0 to 4237
Data columns (total 16 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 male 4238 non-null int64
1 age 4238 non-null int64
2 education 4133 non-null float64
3 currentSmoker 4238 non-null int64
4 cigsPerDay 4209 non-null float64
5 BPMeds 4185 non-null float64
6 prevalentStroke 4238 non-null int64
7 prevalentHyp 4238 non-null int64
8 diabetes 4238 non-null int64
9 totChol 4188 non-null float64
10 sysBP 4238 non-null float64
11 diaBP 4238 non-null float64
12 BMI 4219 non-null float64
13 heartRate 4237 non-null float64
14 glucose 3850 non-null float64
15 TenYearCHD 4238 non-null int64
dtypes: float64(9), int64(7)
memory usage: 529.9 KB
None
Step 3. Now what do we do with missing data?
What do we do with columns in our dataset that have missing values? And how do we find them?
print(data.isnull().sum())
and we get the output
male 0
age 0
education 105
currentSmoker 0
cigsPerDay 29
BPMeds 53
prevalentStroke 0
prevalentHyp 0
diabetes 0
totChol 50
sysBP 0
diaBP 0
BMI 19
heartRate 1
glucose 388
TenYearCHD 0
dtype: int64
Oh, so there are a few columns that have null or NaN values.
The fillna() method comes to our rescue.
# fill each column's NaNs with that column's mean
data.fillna(data.mean(), inplace=True)
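(As an aside, filling with the mean is just one choice. Filling with the median is a common alternative that is less sensitive to outliers. A minimal sketch, not what the notebook does:)

# alternative: median imputation, more robust to extreme values
data.fillna(data.median(numeric_only=True), inplace=True)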
Hmmm, did that work? How do we check? Oh! Let's run data.isnull().sum() once again.
print(data.isnull().sum())
and we get the output
male 0
age 0
education 0
currentSmoker 0
cigsPerDay 0
BPMeds 0
prevalentStroke 0
prevalentHyp 0
diabetes 0
totChol 0
sysBP 0
diaBP 0
BMI 0
heartRate 0
glucose 0
TenYearCHD 0
dtype: int64
Yes, it worked!
Step 4. Now we need to preprocess the data, don't we?
How do we do that? Let's see. Okay, so what columns do we have?
data.columns to the rescue
data.columns
and we get the output
Index(['male', 'age', 'education', 'currentSmoker', 'cigsPerDay', 'BPMeds',
'prevalentStroke', 'prevalentHyp', 'diabetes', 'totChol', 'sysBP',
'diaBP', 'BMI', 'heartRate', 'glucose', 'TenYearCHD'],
dtype='object')
Okay, that's a lot of columns!! Kaggle has provided us with plenty. We don't want all of them, do we?
Let's pick and choose:
['age', 'totChol', 'sysBP', 'diaBP', 'cigsPerDay', 'BMI', 'glucose']
Aha, now we build our two friends, the only ones who hold the keys to Logistic Regression. One is a DataFrame and the other is a Series.
Let's call them capital X and lowercase y.
# X holds our chosen feature columns, y holds the target we want to predict
X = data[['age', 'totChol', 'sysBP', 'diaBP', 'cigsPerDay', 'BMI', 'glucose']]
y = data['TenYearCHD']
Hmmm, let's see what we have now
X.head()
and we get the output
| | age | totChol | sysBP | diaBP | cigsPerDay | BMI | glucose |
|---|---|---|---|---|---|---|---|
| 0 | 39 | 195.0 | 106.0 | 70.0 | 0.0 | 26.97 | 77.0 |
| 1 | 46 | 250.0 | 121.0 | 81.0 | 0.0 | 28.73 | 76.0 |
| 2 | 48 | 245.0 | 127.5 | 80.0 | 20.0 | 25.34 | 70.0 |
| 3 | 61 | 225.0 | 150.0 | 95.0 | 30.0 | 28.58 | 103.0 |
| 4 | 46 | 285.0 | 130.0 | 84.0 | 23.0 | 23.10 | 85.0 |
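While we're here, it's worth checking how balanced our target is; this will matter when we read the evaluation metrics later. A quick check (a minimal sketch; the takeaway is that patients who developed heart disease are the clear minority, roughly 15% of rows):

# how many patients actually developed heart disease within ten years?
print(y.value_counts())  # class 0 (no CHD) heavily outnumbers class 1 (CHD)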
Step 5: We need to normalize for better model performance
What is a standard scaler?
Simple explanation: A standard scaler lets you compare two items that are currently on different scales by bringing both of them onto a similar scale, so they can be compared against each other.
For example: two friends are arguing about how fast a Ferrari goes versus a Porsche, but one is quoting m/s and the other km/h. It's difficult to tell which is faster, right? So we convert both into either m/s or km/h, and the comparison becomes easy.
And here comes the StandardScaler to our rescue
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)
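Under the hood, StandardScaler replaces each value x with z = (x - mean) / std, computed per column. A quick sanity check to see that it worked (a minimal sketch; note that X is now a NumPy array):

import numpy as np

# after scaling, every feature column should have mean ~0 and std ~1
print(np.round(X.mean(axis=0), 2))
print(np.round(X.std(axis=0), 2))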
Step 6: Now we need to split the data we have into two segments.
The first segment is for training the machine learning model. The second segment is for testing the model we trained on the first segment, to check whether it really works.
Simple explanation: it's kind of like asking a student who learnt from one textbook the questions from another textbook, just to check whether the student really understood the concept or merely memorized the whole thing.
And how do we do that? By using train_test_split()
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Note that we have used the test_size parameter to reserve 20% (0.2 means 20%) of the available data for testing. That means the remaining 80% is used as training data. (A quick aside: stricter pipelines fit the scaler on the training split only, so that no test-set statistics leak into the scaling. Something to try as an exercise!)
On successful completion of train_test_split we get 4 variables:
X_train: the training features. It has 80% of the rows from our feature matrix X (remember capital letter X?)
X_test: the testing features. It has the remaining 20% of the rows from X
y_train: the training labels. It has the matching 80% of our Series y (remember lowercase y?)
y_test: the testing labels. It has the matching 20% of y
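You can confirm the split with a quick shape check (a minimal sketch; with 4,238 rows, a 20% test split works out to 848 test rows):

# each split keeps the same 7 feature columns; rows are divided 80/20
print(X_train.shape, X_test.shape)   # (3390, 7) (848, 7)
print(y_train.shape, y_test.shape)   # (3390,) (848,)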
Step 7. Finally we arrive at our biggest milestone: training the LogisticRegression model
Let's train our model using LogisticRegression. (That is technical lingo for saying: let's use the power of machine learning, along with the beautiful Python programming language, to create an Artificial Intelligence model. An AI model that can predict what we want it to predict.)
How do we do that? Oh, just three lines of code :) 🤯🤯
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)   # learn the weights from the training data
Oh, you can sit back and pause. It's alright. That is it. Just 3 lines of Python code, and we have created an Artificial Intelligence model for ourselves. Ain't it a beauty?? 💛 💛
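If you're curious what the model actually learned, you can peek at its coefficients. A minimal sketch (since our features are standardized, the weights are roughly comparable; positive weights nudge the prediction toward heart disease, negative weights away from it):

# one learned weight per feature, in the same order as our feature list
features = ['age', 'totChol', 'sysBP', 'diaBP', 'cigsPerDay', 'BMI', 'glucose']
for name, coef in zip(features, model.coef_[0]):
    print(f'{name}: {coef:.3f}')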
Step 8. Let's evaluate the machine learning model we just created
We save the model's predictions to a variable called y_pred.
y_pred = model.predict(X_test)
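Before scoring, it can help to peek at what sits behind those hard 0/1 predictions. model.predict_proba() returns one probability per class; column 1 is the probability of heart disease (a minimal sketch, the printed values will depend on the data):

# probability of class 1 (heart disease) for the first five test patients
probs = model.predict_proba(X_test)[:, 1]
print(probs[:5].round(3))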
We need to evaluate our model.
We use two methods for that
- classification_report()
- roc_auc_score()
Let's run that:
from sklearn.metrics import classification_report , roc_auc_score
print(classification_report(y_test, y_pred))
print('ROC-AUC-score:', roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))
and we get the output as
precision recall f1-score support
0 0.86 0.99 0.92 724
1 0.55 0.05 0.09 124
accuracy 0.85 848
macro avg 0.70 0.52 0.51 848
weighted avg 0.81 0.85 0.80 848
ROC-AUC-score: 0.695252628764926
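One thing worth noticing before we celebrate: recall for class 1 is only 0.05, so the model catches very few of the patients who actually develop heart disease. That is the classic symptom of class imbalance (724 negatives versus 124 positives in the test set). A confusion matrix makes it visible (a minimal sketch):

from sklearn.metrics import confusion_matrix

# rows = actual class, columns = predicted class
print(confusion_matrix(y_test, y_pred))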
Step 9. We are done. The project is over. 💯✅✅ Tada.
Let's test the machine learning model we created with a real person's data, shall we?
# a new patient's values, in the same feature order as our training data
patient2 = [[45, 210, 130, 85, 10, 25.1, 95]]
patient2_df = pd.DataFrame(patient2, columns=['age', 'totChol', 'sysBP', 'diaBP', 'cigsPerDay', 'BMI', 'glucose'])

# scale with the SAME scaler we fitted earlier (never refit on new data)
patient2_scaled = scaler.transform(patient2_df)
We give the model our scaled data and store the result in a variable called prediction.
prediction = model.predict(patient2_scaled)
Finally, let's check it with our old-fashioned print() statement ✅✅
# 1=Heart Disease, 0=No Heart Disease
if prediction[0] == 1:
    print('The chances the patient might have a heart disease in the future is: True')
else:
    print('The chances the patient might have a heart disease in the future is: False')
and we get the output
The chances the patient might have a heart disease in the future is: True
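And since we listed model.predict_proba() among our concepts, we can also ask for the actual probability instead of a plain yes/no (a minimal sketch):

# probability of heart disease for this patient (column 1 = class 1)
prob = model.predict_proba(patient2_scaled)[0, 1]
print(f'Estimated probability of future heart disease: {prob:.1%}')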
And it feels beautiful to know that we have completed learning one more machine learning concept, doesn't it? :) 💯✅✅
Yes, it does 💛 💛
Homework: Now, here are a few other patient data for you to check on your own.
patient3 = [[65, 250, 155, 100, 15, 32.0, 150]]
patient4 = [[55, 240, 140, 90, 10, 29.5, 110]]
patient5 = [[70, 300, 160, 105, 20, 34.0, 180]]
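If you want to check your answers afterwards, here is a minimal sketch that runs all three patients through the same scale-then-predict steps (it assumes the scaler and model from above are still in scope):

cols = ['age', 'totChol', 'sysBP', 'diaBP', 'cigsPerDay', 'BMI', 'glucose']
for i, patient in enumerate([patient3, patient4, patient5], start=3):
    scaled = scaler.transform(pd.DataFrame(patient, columns=cols))
    print(f'patient{i}: prediction={model.predict(scaled)[0]}, '
          f'probability={model.predict_proba(scaled)[0, 1]:.1%}')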
Now go! 💨 🏃🏃 Go, open Visual Studio Code and start coding 🤖🤖. And don't forget to come back here tomorrow for our next project. Like somebody once said: you never know what the tide might bring tomorrow 🌊 🔮🖥️