Mastering Data Cleaning in Python: An Engaging Guide to Success
“Uncovering mistakes during data cleaning can make you question your life choices.”
After taking a breather, we're diving into the next phase of our data analysis journey—data cleaning! Buckle up, as this third step is where the excitement truly begins. If you're new to this series, you can catch up here. The data we've scraped for this project is available here.
Eager to start? The complete Python code and explanations are accessible on my GitHub repository.
So, what does data cleaning entail? It involves correcting errors, eliminating inconsistencies, and preparing your data for effective analysis.
Why is this necessary? Clean data serves as the cornerstone of quality analysis. Imagine trying to construct a strong house on a swampy foundation; without proper data cleaning, any insights you derive will be unstable and unreliable. Here’s the scoop:
- Poor data leads to poor decisions: Typos, missing values, and inconsistencies can skew results, resulting in misguided decisions for your business or project.
- Garbage in, garbage out: No matter how sophisticated your analysis techniques are, they can’t salvage fundamentally flawed data.
- Time invested wisely: Data cleaning may seem monotonous, but it saves a significant amount of trouble later, ensuring that decisions are based on sound information.
In this hands-on tutorial, we’ll apply our skills using real job listings collected from indeed.com. As usual, I'll be working within a Jupyter notebook. Let’s begin by loading the essential Python libraries and reading our file into a Python DataFrame, then examining the initial rows using the .head() method.
# Import the necessary libraries
import pandas as pd
import numpy as np
import re

# Read the scraped job listings into a DataFrame
job = pd.read_csv('job_data.csv')

# Take a first look at the data
job.head()
The .head() method provides a first glimpse of our data. Now, let’s dig deeper to understand its structure:
# Display information about our dataset
job.info()
Here’s what we discovered from the output:
- Size: Our dataset consists of 203 rows, equating to 203 job listings.
- Columns: There are 7 columns, likely representing job title, company, location, and more.
- Data types: Currently, everything is categorized as an “object” (string) datatype. This means that 'Salary' is treated as text, and 'Date Posted' does not behave like a standard date. We’ll need to rectify this.
- Missing data: Some listings appear to have gaps in the information. That’s another aspect we need to clean!
These issues will be the focus of our upcoming tasks.
Data Cleaning
Before we get into the coding, let’s outline precisely what needs scrubbing, polishing, and fixing in this dataset:
- Address Missing Values: We must identify and resolve gaps in ‘Salary’, ‘Company’, ‘Job Type’, and any other crucial columns.
- Extract Numeric Values from Salary: The ‘Salary’ column is currently a jumbled mix. We’ll extract the numbers, remove currency symbols and extraneous text, and manage salary ranges.
- Standardize Values in ‘Job Type’: A clean format in this column will facilitate smoother analysis.
- Convert ‘Date Posted’: We need to transform this into a proper date format.
- Standardize ‘Summary’ Column: By eliminating random whitespace and correcting odd characters, we can simplify working with this text data.
Why is this crucial? A clean dataset translates to dependable results. We definitely don’t want flawed data leading us to erroneous conclusions!
Handle Missing Values
Before we can rectify missing data, we need to identify where it’s present and the extent of it. Let’s check:
# Identify missing values
missing_values = job.isnull().sum()
print("Missing Values:n", missing_values)
Key Observations:
- The Offenders: The ‘Company’, ‘Salary’, and ‘Job Type’ columns contain missing information.
- Major Issue: The ‘Salary’ column is missing a staggering 170 values, which requires careful attention.
Now that we’ve identified the problem areas, we’ll explore various strategies to tackle these missing values.
Let’s start with the less complicated issues:
# Fill missing Company values with a default value
job['Company'] = job['Company'].fillna('Unknown')

# Fill missing Job Type values with a default value
job['Job Type'] = job['Job Type'].fillna('Unknown')
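If you want to confirm the fills took effect (a quick sanity check, not part of the original steps), a null count on those two columns does the trick:

# Confirm 'Company' and 'Job Type' no longer contain nulls
print(job[['Company', 'Job Type']].isnull().sum())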
Next, we’ll address the ‘Salary’ column in six steps:
1. Retrieve the unique values from the ‘Salary’ column to understand the inconsistencies we’re dealing with.
2. Clean the ‘Salary’ column by removing currency symbols, commas, whitespace, and non-numeric characters with regular expressions.
3. Convert salary ranges into their average values for a more accurate representation.
4. Cast the cleaned salary values to float so the column becomes numeric.
5. Fill any remaining missing values in the ‘Salary’ column with the median salary.
6. Retrieve the unique values again to validate the cleaning and conversion.
# This gives us an idea of the inconsistencies we'll need to handle.
job["Salary"].unique()
# Remove currency symbols, commas, whitespace and non-numeric characters
job['Salary'] = job['Salary'].str.replace(r'[?,a-zA-Z\s]', '', regex=True)
# Convert values in range to average
def convert_to_average(salary):
    if isinstance(salary, str) and '-' in salary:
        lower, upper = map(int, salary.split('-'))
        return (lower + upper) / 2
    else:
        return salary

job['Salary'] = job['Salary'].apply(convert_to_average)
# Convert to numeric
job['Salary'] = job['Salary'].astype(float)
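# Aside (an assumption on my part, not from the original walkthrough): if any
# stray non-numeric strings survive the regex step, astype(float) will raise a
# ValueError. pd.to_numeric with errors='coerce' is a more forgiving fallback:
# job['Salary'] = pd.to_numeric(job['Salary'], errors='coerce')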
# Impute missing salary values with the median of salary
job['Salary'] = job['Salary'].fillna(job['Salary'].median())
job["Salary"].unique()
After resolving the missing values, we rerun the previous code, and now we have no missing values left.
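Concretely, that recheck is just the same call from earlier:

# Verify that no missing values remain
print("Missing Values:\n", job.isnull().sum())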
Replace Inconsistent Values in the ‘Job Type’ Column
It seems the ‘Job Type’ column has a few variations that could disrupt our analysis. We need to standardize entries such as “Full-time” and “Full-time +1.” First, let’s examine the inconsistencies:
job['Job Type'].unique()
Output:
array(['Unknown', 'Contract +1', 'Full-time', 'Temporary', 'Full-time +1',
'Internship', 'Contract', 'Permanent'], dtype=object)
Can you spot the variations that might need cleaning? Next, we’ll address them with careful replacements.
# Replace inconsistent values in the 'Job Type' column
job['Job Type'] = job['Job Type'].replace({'Contract +1': 'Contract', 'Full-time +1': 'Full-time'})
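To confirm the replacements took effect, list the unique values once more:

# Confirm the standardized categories
job['Job Type'].unique()

If all went well, ‘Contract +1’ and ‘Full-time +1’ no longer appear in the output.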
Parse Date Posted
Currently, the ‘Date Posted’ column contains valuable information but in a disorganized format. Let’s break down the cleanup process:
# Remove Unnecessary Words
job['Date Posted'] = job['Date Posted'].str.replace('Posted', '') # Remove "Posted" phrase
job['Date Posted'] = job['Date Posted'].str.replace('EmployerActive', '') # Remove "EmployerActive" phrase
# Standardize 'Today'
job['Date Posted'] = job['Date Posted'].str.replace('Today', '0 days ago') # Replace "Today" with 0 days ago
# Define custom function to handle "30+ days ago"
def convert_to_date(value):
    if '+' in value:
        return pd.Timestamp.now().normalize() - pd.Timedelta(days=30)
    else:
        return pd.Timestamp.now().normalize() - pd.Timedelta(days=int(value.split()[0]))

# Apply the Function
job['Date Posted'] = job['Date Posted'].apply(convert_to_date)
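As a quick sanity check (an extra step, not shown in the original walkthrough), we can confirm the column now holds real datetime values rather than strings:

# The column should report datetime64[ns] if the parsing ran cleanly
print(job['Date Posted'].dtype)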
Standardize the ‘Summary’ Column
We’ve cleaned up numbers and dates; now it’s time to refine our text data in the ‘Summary’ column. Our goal is to eliminate distracting characters and enhance analyzability.
# Function to remove special characters and non-letter characters
def clean_summary(summary):
    # Define regular expression pattern to match non-letter characters
    pattern = r'[^a-zA-Z\s]'
    # Replace non-letter characters with a space
    cleaned_summary = re.sub(pattern, ' ', summary)
    # Remove extra whitespace
    cleaned_summary = ' '.join(cleaned_summary.split())
    return cleaned_summary
# Apply the clean_summary function to the Summary column
job['Summary'] = job['Summary'].apply(clean_summary)
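A quick peek at the results never hurts:

# Preview a few cleaned summaries
job['Summary'].head()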
The Results of Our Hard Work
After all that cleaning and transformation, let’s take a fresh look at our dataset. We should see a much more polished and analysis-ready DataFrame.
It’s always a good idea to check the structure of our dataset again after cleaning. Let’s see what those changes did to the data types and if any missing values slipped through:
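Here is that recheck, reusing the same calls from earlier:

# Re-inspect the structure and data types after cleaning
job.info()

# Confirm no missing values slipped through
print(job.isnull().sum())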
Key Questions to Consider:
- Data Types: Are columns like ‘Salary’ and ‘Date Posted’ now in the correct formats for analysis?
- Missing Values: Did any gaps remain after our cleanup efforts? If so, we might need additional strategies to handle them.
Now that our dataset is sparkling clean, let’s ensure we save all our hard work for future analysis:
# Save the cleaned DataFrame to a CSV file
job.to_csv("cleaned_job_data.csv", index=False)
Conclusion
Throughout this hands-on project, we’ve witnessed firsthand why data cleaning is an essential step in the data analysis process. By addressing missing values, inconsistent formatting, and messy text, we’ve transformed raw job listings into a dataset that is now reliable and ready for exploration.
Where Can We Go From Here?
With our cleaned data in hand, the opportunities are exciting:
- Visualizations: Create charts and graphs to uncover patterns in salary distribution, job type trends, and more.
- Statistical analysis: Dive into hypothesis testing and calculations to answer key questions about the dataset.
- Text Analysis: Explore the ‘Summary’ column for insights about in-demand skills and employer requirements.
Remember: Clean data is the key to unlocking accurate and impactful insights. The skills you’ve acquired here will greatly benefit you in all your future data analysis endeavors!
Our next project in this series will focus on Exploratory Data Analysis. Stay tuned.
Feel free to reach out with questions, share your findings, or suggest something new we can collaborate on.