Most Common Issues with Real-Life Data: How to Check for Them, and How to Fix Them

This content originally appeared on DEV Community and was authored by GharamElhendy

1. Missing Data

How to Check?

import pandas as pd

df = pd.read_csv('name_of_csv_file.csv')
df.info()


The RangeIndex line shows the total number of rows, and beside each column you'll find its non-null count. If a column's count is lower than the total, that column has missing data.
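
To see the gaps per column directly, rather than comparing counts by eye, the same check can be sketched as follows. The small DataFrame and its column names (`user_id`, `duration`) are made up for illustration; `isna()` and `sum()` are standard pandas methods.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: two of the five durations are missing.
df = pd.DataFrame({
    "user_id": [1, 2, 3, 4, 5],
    "duration": [12.0, np.nan, 7.5, np.nan, 3.0],
})

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Names of the columns that contain at least one missing value.
columns_with_gaps = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_gaps)
```
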

How to Deal with It?

The right approach depends on the situation: why is the data missing, and do the gaps seem random or do they follow a pattern?

One option is to fill in the missing values with the mean of the column.

For example, suppose you have missing values for the time a user spent viewing a product on your website, stored in a variable named "duration".

mean = df['duration'].mean()
df['duration'] = df['duration'].fillna(mean)


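The second line can also be written as an in-place call on the DataFrame:

df.fillna({'duration': mean}, inplace=True)


Both versions store the filled-in values in the original data set. Avoid calling fillna(mean, inplace=True) on df['duration'] itself: recent pandas versions warn about this pattern, because the selected column may be a copy and df can be left unchanged.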
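
As a quick end-to-end illustration of mean imputation, here is a minimal sketch. The sample values are made up; only the column name `duration` comes from the text above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the column name "duration" follows the text above.
df = pd.DataFrame({"duration": [10.0, np.nan, 20.0, np.nan, 30.0]})

# mean() skips missing values by default, so this averages 10, 20 and 30.
mean = df["duration"].mean()

# Assign the filled column back so the change is stored in df.
df["duration"] = df["duration"].fillna(mean)

print(mean)
print(df["duration"].tolist())
```

Filling with the mean keeps the column average unchanged but makes the values less spread out than the real data probably was, so it is worth checking whether that matters for your analysis.
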

2. Duplicates

How to Check?

df.duplicated()


This displays "False" next to every row that isn't a duplicate, and "True" next to every row that repeats an earlier one.

That is, the first occurrence of a row is marked "False", and each later copy of it is marked "True".

You can also count the duplicates:

df.duplicated().sum()


This is more practical for larger data sets, because it returns the number of duplicate rows instead of one flag per row.

How to Deal with It?

df.drop_duplicates(inplace=True)


Again, inplace=True applies the change to the original data set.
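
The check and the fix can be tried together on a small example. This is a sketch with made-up data (the columns `user_id` and `product` are hypothetical); it uses the non-in-place form of `drop_duplicates` so that the original DataFrame stays available for comparison.

```python
import pandas as pd

# Hypothetical sample data: rows 1 and 3 repeat row 0.
df = pd.DataFrame({
    "user_id": [1, 1, 2, 1],
    "product": ["book", "book", "pen", "book"],
})

# True marks a row that repeats an earlier row; the first occurrence stays False.
flags = df.duplicated()
print(flags.tolist())

# Total number of duplicate rows.
n_duplicates = int(flags.sum())
print(n_duplicates)

# Keep only the first occurrence of each row.
deduplicated = df.drop_duplicates()
print(len(deduplicated))
```
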

3. Incorrect Data Types

How to Check?

df = pd.read_csv('name_of_csv_file.csv')
df.info()


For example, if you find "object" beside the variable "timestamp", your data set is treating the timestamp as a string (str), which is not ideal. The proper representation is a datetime.

In this case, we'll use:

df['timestamp'] = pd.to_datetime(df['timestamp'])


Note: Data type corrections aren't saved in the CSV file. The next time you load the file, make sure to apply them again.
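
One way to avoid repeating the conversion by hand is to ask `read_csv` to parse the column while it loads the file, using its `parse_dates` parameter. A minimal sketch, assuming a column named `timestamp`; the inline CSV text is made up and stands in for a real file.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for a file on disk.
csv_text = (
    "timestamp,duration\n"
    "2021-04-28 07:48:03,12.5\n"
    "2021-04-29 09:15:00,7.0\n"
)

# Without parse_dates, the column is read as plain strings.
raw = pd.read_csv(io.StringIO(csv_text))

# With parse_dates, the column is converted while the file is being read.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])
print(parsed.dtypes)
```
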




GharamElhendy | Sciencx (2021-04-28T07:48:03+00:00) Most Common Issues with Real-Life Data: How to Check for Them, and How to Fix Them. Retrieved from https://www.scien.cx/2021/04/28/most-common-issues-with-real-life-data-how-to-check-for-them-and-how-to-fix-them/
