Harnessing Hidden Data: My Journey from Web Scraping to Employment
Web scraping is an invaluable skill for data professionals, and I want to share how I transformed an overlooked data source into a compelling portfolio project using just a single line of Python code.
A New Approach to Data Science Projects
When I commenced my data science job search in 2021, I created a set of criteria for the personal projects I would undertake. These guidelines were essential for showcasing my skills effectively:
- Each project must include visual elements like graphs, dashboards, or notebooks.
- No standard datasets (goodbye, Titanic).
- Above all, no starting from CSV files.
The importance of this last rule cannot be overstated. It’s a common sentiment among instructors and peers that a portfolio reliant solely on CSV data often indicates a novice in the field. In my experience, I usually generate CSVs after several data cleaning steps, rarely starting with pristine data—though that would certainly make my job easier.
When seeking project inspiration, my go-to suggestion is simple: web scraping.
Scraping Data from Wikipedia
While many might have been cautioned that Wikipedia is not a reliable source (a notion that’s become quite outdated), it actually hosts some of the most structured and accessible data online. Articles often feature tables, making them ripe for data extraction, as seen in the example of U.S. presidents.
The tables found on Wikipedia are incredibly easy to scrape using Pandas. With a single line of code, read_html grabs every table marked with the "wikitable" HTML class:
```python
import pandas as pd

# The Switch best-sellers article; any page containing class="wikitable" tables works
url = "https://en.wikipedia.org/wiki/List_of_best-selling_Nintendo_Switch_video_games"

switch_wiki = pd.read_html(url, attrs={'class': 'wikitable'})
```
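Since read_html returns a list of DataFrames, the first matching table is simply switch_wiki[0].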
Once you master the read_html function, you won’t need extensive tutorials to continue your web scraping journey. If you’re eager to dive in, consider these Wikipedia articles that feature tables ideal for extraction:
- Formula One Venues
- Wimbledon Champions
- Highest Grossing Films
- Nintendo Switch Best-Sellers
Focusing on the last article, I’ll illustrate how I leveraged three tables of Wikipedia data to create a standout project that eventually helped me land data analysis and engineering roles post-graduation.
My project utilized Python and SQLite to track global game sales across three Nintendo consoles:
- Nintendo Wii
- Nintendo Switch
- Nintendo 3DS
Below, I will outline the key steps in my process, including code snippets, outputs, and visualizations that demonstrate how I transformed data from a wiki page into a polished presentation.
Note: For more insights on Python, SQL, and cloud computing, follow **Pipeline: Your Data Engineering Resource**.
If you want to scrape multiple tables without manual effort, here's a neat trick:
Efficiently Scraping Multiple Wikipedia Pages
You can scrape over 200 Wikipedia tables into Pandas data frames using a single Pandas function and just two loops.
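Here's a minimal sketch of that pattern. The two URLs below are stand-ins (any articles containing wikitables will do), and read_html plus two loops does all the work:

```python
import pandas as pd

# Illustrative list of article URLs; swap in any pages that contain wikitables
urls = [
    "https://en.wikipedia.org/wiki/List_of_best-selling_Nintendo_Switch_video_games",
    "https://en.wikipedia.org/wiki/List_of_best-selling_Wii_video_games",
]

tables = []
for url in urls:                                  # loop 1: one request per article
    for df in pd.read_html(url, attrs={"class": "wikitable"}):
        tables.append(df)                         # loop 2: collect every wikitable

print(f"Scraped {len(tables)} tables")
```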
Data Cleaning
Generally, Wikipedia data is ready to go straight away, thanks to Pandas. However, the most common task involves cleaning up stray footnotes or superscripts, which can be tackled using regex.
When filtering, joining, or matching on strings, strip stray whitespace first; exact comparisons in Pandas fail silently when values carry hidden leading or trailing spaces.
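To make that concrete, here's a small sketch; the column name and footnote patterns below are illustrative, not from my original notebook:

```python
import pandas as pd

# Toy frame mimicking a scraped wikitable: footnote markers plus stray spaces
df = pd.DataFrame({"Title ": ["Mario Kart 8 Deluxe[a]", "Wii Sports [1]"]})

df.columns = df.columns.str.strip()               # spaces in headers break lookups
df["Title"] = (
    df["Title"]
    .str.replace(r"\[\w+\]", "", regex=True)      # drop [1]- and [a]-style footnotes
    .str.strip()                                  # drop leading/trailing spaces
)
```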
Data Loading
Instead of merely visualizing the data frame, I opted to convert the output into a CSV file, which I then used to create a SQLite table. This process helped me better grasp the Extract-Load (EL) process I routinely execute at work.
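In code, that EL step looks roughly like this; the file and table names are hypothetical, and df is the cleaned frame from the previous step:

```python
import sqlite3
import pandas as pd

# Stage the cleaned DataFrame as a CSV (the "extract" artifact)
df.to_csv("switch_sales.csv", index=False)

# Load it into a SQLite table (hypothetical database and table names)
conn = sqlite3.connect("nintendo_sales.db")
pd.read_csv("switch_sales.csv").to_sql(
    "switch_sales", conn, if_exists="replace", index=False
)
conn.close()
```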
I find SQL operations more intuitive for analytics compared to the functions in Pandas, which is another reason for my choice.
Data Analysis
The analysis phase concentrated on aggregate metrics such as sums, averages, and percentages. Even though my project revolved around video game data, I aimed to demonstrate its business relevance during interviews. Stakeholders typically prefer high-level insights over minute details, so I focused on key indicators like moving averages and percent changes.
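As one example of the kind of query I mean, here's a year-over-year percent change computed with a window function; the release_year and copies_sold columns are assumptions about the schema, not the exact fields from the wiki tables:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("nintendo_sales.db")

# Assumed columns: release_year, copies_sold; adjust to the real schema
query = """
SELECT
    release_year,
    SUM(copies_sold) AS yearly_sales,
    ROUND(
        100.0 * (SUM(copies_sold) - LAG(SUM(copies_sold)) OVER (ORDER BY release_year))
              / LAG(SUM(copies_sold)) OVER (ORDER BY release_year),
        1
    ) AS pct_change
FROM switch_sales
GROUP BY release_year
ORDER BY release_year;
"""
yearly = pd.read_sql(query, conn)
conn.close()
```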
Visualization
This is where the fun begins. A strong visualization component is essential for any effective project. Visualizations are not only shareable and appealing but also showcase your ability to narrate a data-driven story.
Even though I appreciate seeing the code behind visualizations, it’s challenging to create quality visuals from poor data. Clean, engaging visualizations are far more appealing to recruiters than a poorly documented GitHub repository filled with hastily compiled projects.
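For illustration, a chart like this takes only a few lines, again assuming the hypothetical switch_sales table from the loading step:

```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("nintendo_sales.db")
top10 = pd.read_sql(
    "SELECT title, copies_sold FROM switch_sales ORDER BY copies_sold DESC LIMIT 10",
    conn,
)
conn.close()

plt.barh(top10["title"], top10["copies_sold"])
plt.gca().invert_yaxis()                      # put the best-seller on top
plt.xlabel("Copies sold (millions)")          # assumed unit from the wiki table
plt.title("Top 10 Best-Selling Switch Games")
plt.tight_layout()
plt.show()
```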
Revisit your portfolio and ensure it reflects your current skills. While I might critique some of my past decisions, this project exemplifies how unconventional yet accessible data can craft a compelling narrative that led to a data engineering job.
If you're interested in exploring or adapting the data, here are the links:
- Switch games
- Wii games
- 3DS games
You can also view the complete notebook on my GitHub.