Effective Data Cleaning Strategies in Python: A Comprehensive Guide
Chapter 1: Introduction to Data Cleaning
Data cleaning is a fundamental yet often tedious aspect of data analysis.
In fact, managing real-world datasets can be quite labor-intensive. They frequently come with a host of issues such as misleading or abbreviated column names, missing entries, improper data types, and multiple pieces of information packed into a single column. Before delving into data processing, it is vital to rectify these problems. Clean data not only enhances productivity but also facilitates the generation of precise insights.
To help you navigate this process, I've outlined three essential data cleaning techniques you should master when using Python. For demonstration purposes, I will be utilizing an extended version of the Titanic dataset created by Pavlo Fesenko, which is freely available under a CC license.
This dataset consists of 1,309 rows and 21 columns. Below, you'll find numerous examples illustrating how to extract the most value from this data. Let's dive in! 🚀
First, import pandas and load the CSV file into a pandas DataFrame. It's a good idea to use the .info() method to gain a comprehensive overview of the dataset's size, column names, and their corresponding data types.
import pandas as pd
df = pd.read_csv("Complete_Titanic_Extended_Dataset.csv")
df.info()
Chapter 2: Initial Data Cleaning Steps
We begin with some straightforward cleaning tasks that can save both time and memory as you continue processing the data.
Section 2.1: Eliminating Unused Columns
The Titanic dataset features 21 columns, but you will not necessarily require all of them for your analysis. Identify and retain only the columns that are pertinent to your objectives. For example, suppose you decide that the columns PassengerId, SibSp, Parch, WikiId, Name_wiki, and Age_wiki are unnecessary. You can create a list of these column names and apply the df.drop() function as illustrated below:
columns_to_drop = ['PassengerId', 'SibSp',
'Parch', 'WikiId',
'Name_wiki', 'Age_wiki']
df.drop(columns_to_drop, inplace=True, axis=1)
df.head()
By checking memory consumption with the memory_usage="deep" argument of the .info() method, you will observe that the trimmed dataset consumes only 834 KB compared to the 1,000 KB of the original DataFrame, roughly a 17% reduction. These savings may seem minor here, but they become significant when working with larger datasets.
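As a quick sanity check, you can also compute the totals directly; the sketch below simply re-reads the original file for comparison:

# Re-load the untouched file so the two sizes can be compared
df_original = pd.read_csv("Complete_Titanic_Extended_Dataset.csv")

# deep=True also counts the contents of object (string) columns
before = df_original.memory_usage(deep=True).sum()
after = df.memory_usage(deep=True).sum()
print(f"before: {before / 1024:.0f} KB, after: {after / 1024:.0f} KB")
print(f"reduction: {1 - after / before:.0%}")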
Keep in mind that using the inplace=True option with the .drop() method modifies the original DataFrame. If you wish to retain the original DataFrame, consider assigning the output of df.drop() (without inplace) to a new variable:
df1 = df.drop(columns_to_drop, axis=1)
Alternatively, if you need to retain only a select few columns, you can use df.copy() to create a new DataFrame with just those columns. For instance, if you want to keep only the Name, Sex, Age, and Survived columns, you can subset the original dataset as follows:
df1 = df[["Name", "Age", "Sex", "Survived"]].copy()
Section 2.2: Addressing Missing Values
Almost every dataset will require you to tackle missing values, which can be one of the more challenging aspects of data cleaning. If you intend to use this data for machine learning, it's crucial to understand that most models do not handle missing values well.
So, how can you identify missing data? Here are four common techniques to pinpoint where values are lacking:
- Using the `.info()` method: This provides a quick overview of which columns contain missing values.
df.info()
Compare each column's non-null count with the total number of rows: ideally, every column in this dataset would contain 1,309 values, but the output reveals that many columns hold fewer.
- Visualizing Missing Values: Create a heatmap of the Boolean mask returned by the .isna() method (True where a value is missing, False where it is present):
import seaborn as sns
import matplotlib.pyplot as plt

sns.heatmap(df.isna())
plt.show()
The X-axis displays column names, while the Y-axis represents index numbers, helping you understand where missing data lies.
- Missing Data Percentage: Pandas does not report this directly, but you can calculate the percentage of missing values in each column using the .isna() method:
import numpy as np

print("Amount of missing values in - ")
for column in df.columns:
    percentage_missing = np.mean(df[column].isna())
    print(f'{column} : {round(percentage_missing*100)}%')
With this approach, you can identify how much data is missing from individual columns, which is vital for handling these gaps effectively.
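If you prefer a vectorized one-liner over the loop, the same percentages can be computed like this:

# Per-column share of missing values, largest first
missing_pct = df.isna().mean().mul(100).round(1)
print(missing_pct.sort_values(ascending=False))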
- Handling Missing Data: After identifying the missing data, you have several options to address it:
- Drop Records: Remove an entire record if a key column has a missing value. Be cautious, as this can significantly reduce the dataset size if many records are affected.
- Drop Columns: Investigate how important a column is to your analysis before deciding to drop it entirely.
- Impute Missing Data: Replace missing values with the mean, median, or mode of the respective column (see the sketch after this list).
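As a minimal sketch of the first and third options: the numeric Age column is filled with its median and the categorical Embarked column with its mode. These fill values are illustrative choices, not recommendations:

# Impute numeric Age with its median, categorical Embarked with its mode
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Embarked"] = df["Embarked"].fillna(df["Embarked"].mode()[0])

# Dropping records in action: remove rows whose Survived label is missing
df = df.dropna(subset=["Survived"])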
Chapter 3: Correcting Data Types
In addition to missing values, incorrect data types can compromise data quality. Each column should have the appropriate data type to facilitate future transformations. When using read_csv or similar functions in pandas, the library attempts to infer the data type for each column. This is usually accurate, but some columns may require manual adjustments.
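If you already know the intended type of a column, you can also hint it at load time rather than converting afterwards; the category choices below are illustrative:

# Tell read_csv the intended dtypes up front instead of relying on inference
df = pd.read_csv(
    "Complete_Titanic_Extended_Dataset.csv",
    dtype={"Sex": "category", "Embarked": "category"},
)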
For example, in the Titanic dataset, you can view data types with the .info() method:
df.info()
You might find that the Age and Survived columns are classified as float64, whereas Age should be an integer and Survived should only hold binary values (0 or 1). This happens because NumPy's int64 dtype cannot represent NaN, so pandas upcasts columns containing missing values to float.
To illustrate, you can sample five rows from these columns:
df[["Name", "Sex", "Survived", "Age"]].sample(5)
To convert these columns to an integer type, you must either handle the missing values first or, depending on your pandas version, use the nullable integer dtypes (such as Int64), which can represent missing values directly.
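A minimal sketch using the nullable dtype; rounding Age first is an assumption on my part, since the dataset records fractional ages for infants:

# Nullable Int64 can hold missing values, unlike plain int64
df["Survived"] = df["Survived"].astype("Int64")
df["Age"] = df["Age"].round().astype("Int64")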
Additional cleaning techniques may also be necessary based on your specific use case, including:
- Replacing Values: Transform values like True/False or Yes/No into 1/0 for better compatibility with machine learning applications (sketched after this list).
- Removing Outliers: Carefully evaluate outliers, as they may not always warrant removal.
- Eliminating Duplicates: Use the .drop_duplicates() method to remove any duplicate entries in the dataset.
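For the first and last of these, a couple of one-liners; the male/female mapping is just one possible encoding:

# Encode the Sex column numerically (male -> 0, female -> 1)
df["Sex"] = df["Sex"].map({"male": 0, "female": 1})

# Remove exact duplicate rows, keeping the first occurrence
df = df.drop_duplicates()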
In conclusion, data cleaning is a critical yet often overlooked aspect of data analysis. By implementing these strategies, you can effectively address common data issues and improve your analysis outcomes.
The first video titled "Data Cleaning in Pandas | Python Pandas Tutorials" dives deeper into various data cleaning techniques using Pandas.
The second video, "How to Do Data Cleaning (step-by-step tutorial on real-life dataset)," provides a detailed, practical guide on cleaning a real-world dataset.
Thank you for reading! If you found this article informative, consider subscribing to my email list for more insights and updates.