Harnessing Hidden Data: My Journey from Web Scraping to Employment

Web scraping is an invaluable skill for data professionals, and I want to share how I transformed an overlooked data source into a compelling portfolio project using just a single line of Python code.

A New Approach to Data Science Projects

When I commenced my data science job search in 2021, I created a set of criteria for the personal projects I would undertake. These guidelines were essential for showcasing my skills effectively:

  • Each project had to include visual elements such as graphs, dashboards, or notebooks.
  • No project could use a standard dataset (goodbye, Titanic).
  • Above all, no project could start from a CSV file.

The importance of this last rule cannot be overstated. It’s a common sentiment among instructors and peers that a portfolio reliant solely on CSV data often indicates a novice in the field. In my experience, I usually generate CSVs after several data cleaning steps, rarely starting with pristine data—though that would certainly make my job easier.

When seeking project inspiration, my go-to suggestion is simple: web scraping.

Scraping Data from Wikipedia

While many might have been cautioned that Wikipedia is not a reliable source (a notion that’s become quite outdated), it actually hosts some of the most structured and accessible data online. Articles often feature tables, making them ripe for data extraction, as seen in the example of U.S. presidents.

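The tables found on Wikipedia are incredibly easy to scrape using Pandas. With just one line of code (and Pandas imported as pd), you can select every table on a page that carries the "wikitable" class. The call returns a list of data frames, one per matching table:

switch_wiki = pd.read_html(url, attrs={'class': 'wikitable'})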
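To make the return value concrete, here is a minimal offline sketch. The HTML string, the game titles, and the sales figures are made-up placeholders rather than real Wikipedia content, and the example assumes an HTML parser such as lxml is installed, since read_html depends on one.

```python
import io

import pandas as pd

# Hypothetical stand-in for a downloaded page: one unrelated table
# and one table carrying the "wikitable" class.
SAMPLE_HTML = """
<table class="infobox">
  <thead><tr><th>Key</th><th>Value</th></tr></thead>
  <tbody><tr><td>Platform</td><td>Switch</td></tr></tbody>
</table>
<table class="wikitable">
  <thead><tr><th>Title</th><th>Sales</th></tr></thead>
  <tbody>
    <tr><td>Game A</td><td>12.5</td></tr>
    <tr><td>Game B</td><td>8.25</td></tr>
  </tbody>
</table>
"""

# read_html returns a list of DataFrames, one per matching <table>.
tables = pd.read_html(io.StringIO(SAMPLE_HTML), attrs={"class": "wikitable"})
games = tables[0]  # pick the table you want by position

print(len(tables))
print(games)
```

Passing a URL instead of the StringIO object works the same way; the result is still a list, so indexing into it is usually the first step.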

Once you master the read_html function, you won’t need extensive tutorials to continue your web scraping journey. If you’re eager to dive in, consider these Wikipedia articles that feature tables ideal for extraction:

  • Formula One Venues
  • Wimbledon Champions
  • Highest Grossing Films
  • Nintendo Switch Best-Sellers

Focusing on the last article, I’ll illustrate how I leveraged three tables of Wikipedia data to create a standout project that eventually helped me land data analysis and engineering roles post-graduation.

My project utilized Python and SQLite to track global game sales across three Nintendo consoles:

  • Nintendo Wii
  • Nintendo Switch
  • Nintendo 3DS

Below, I will outline the key steps in my process, including code snippets, outputs, and visualizations that demonstrate how I transformed data from a wiki page into a polished presentation.

Note: For more insights on Python, SQL, and cloud computing, follow **Pipeline: Your Data Engineering Resource**.

If you want to scrape multiple tables without manual effort, here's a neat trick:

Efficiently Scraping Multiple Wikipedia Pages

You can scrape over 200 Wikipedia tables into Pandas data frames using a single Pandas function and just two loops.
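The linked write-up is not reproduced here, so the following is only a sketch of what "one function and two loops" could look like. The helper name scrape_tables, the source_url column, and the injectable reader argument are assumptions made for illustration, not the author's code.

```python
import pandas as pd


def scrape_tables(urls, reader=pd.read_html):
    """Collect every "wikitable" from each URL into one flat list of DataFrames."""
    frames = []
    for url in urls:  # loop 1: one pass per page
        try:
            tables = reader(url, attrs={"class": "wikitable"})
        except ValueError:
            # read_html raises ValueError when a page has no matching tables.
            continue
        for table in tables:  # loop 2: one pass per table on that page
            # Record where each table came from so it can be traced later.
            frames.append(table.assign(source_url=url))
    return frames
```

Calling scrape_tables with a list of article URLs returns the combined list of data frames. The reader parameter exists only so the function can be exercised without network access.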

Data Cleaning

Generally, Wikipedia data is ready to go straight away, thanks to Pandas. However, the most common task involves cleaning up stray footnotes or superscripts, which can be tackled using regex.

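Before running any operations on a data frame, strip leading and trailing whitespace from string values and column names. Pandas treats "Wii " and "Wii" as different values, so stray spaces quietly break filters, joins, and group-bys.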
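A minimal sketch of both cleanup steps, footnote markers and stray whitespace, might look like this. The sample rows and the FOOTNOTE pattern are assumptions chosen for illustration; real pages may need a different pattern.

```python
import pandas as pd

# Hypothetical rows resembling a freshly scraped sales table.
raw = pd.DataFrame(
    {
        " Title ": ["Game A[a]", " Game B ", "Game C[12]"],
        "Sales[b]": ["12.5[1]", "8.25", " 3.0 "],
    }
)

FOOTNOTE = r"\[[^\]]*\]"  # bracketed markers such as [1], [a], [note 2]

clean = raw.copy()
# Clean the column labels first: drop footnote markers, then trim whitespace.
clean.columns = clean.columns.str.replace(FOOTNOTE, "", regex=True).str.strip()
# Apply the same cleanup to every text column.
for col in clean.columns:
    clean[col] = clean[col].str.replace(FOOTNOTE, "", regex=True).str.strip()
# With the markers gone, the sales column converts cleanly to numbers.
clean["Sales"] = clean["Sales"].astype(float)

print(clean)
```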

Data Loading

Instead of merely visualizing the data frame, I opted to convert the output into a CSV file, which I then used to create a SQLite table. This process helped me better grasp the Extract-Load (EL) process I routinely execute at work.

I find SQL operations more intuitive for analytics compared to the functions in Pandas, which is another reason for my choice.
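The CSV-to-SQLite load described above can be sketched roughly as follows. The table name game_sales, the column names, and the sample rows are placeholders, and the in-memory database stands in for whatever file the real project used.

```python
import os
import sqlite3
import tempfile

import pandas as pd

# Hypothetical cleaned frame standing in for the scraped sales table.
games = pd.DataFrame(
    {
        "title": ["Game A", "Game B"],
        "console": ["Switch", "Wii"],
        "sales_millions": [12.5, 8.25],
    }
)

# Extract: write the frame out as a CSV file.
csv_path = os.path.join(tempfile.mkdtemp(), "games.csv")
games.to_csv(csv_path, index=False)

# Load: read the CSV back and write it into a SQLite table.
conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
pd.read_csv(csv_path).to_sql("game_sales", conn, index=False, if_exists="replace")

rows = conn.execute(
    "SELECT title, console, sales_millions FROM game_sales ORDER BY title"
).fetchall()
print(rows)
```

Going through a CSV is not strictly required, since to_sql can write a data frame directly, but keeping the file makes the extract and load steps separately inspectable.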

Data Analysis

The analysis phase concentrated on aggregate metrics such as sums, averages, and percentages. Even though my project revolved around video game data, I aimed to demonstrate its business relevance during interviews. Stakeholders typically prefer high-level insights over minute details, so I focused on key indicators like moving averages and percent changes.
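As an illustration of the aggregate metrics mentioned above, here is a small SQL sketch run through Python's built-in sqlite3 module. The schema and the sales figures are invented placeholders, not the project's real data, and the query covers only sums, averages, and share of total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE game_sales (title TEXT, console TEXT, sales_millions REAL)"
)
# Placeholder rows; real values would come from the scraped tables.
conn.executemany(
    "INSERT INTO game_sales VALUES (?, ?, ?)",
    [
        ("Game A", "Switch", 30.0),
        ("Game B", "Switch", 20.0),
        ("Game C", "Wii", 40.0),
        ("Game D", "3DS", 10.0),
    ],
)

# Per-console total, average, and percentage of all sales.
query = """
SELECT console,
       SUM(sales_millions) AS total_sales,
       AVG(sales_millions) AS avg_sales,
       ROUND(100.0 * SUM(sales_millions)
             / (SELECT SUM(sales_millions) FROM game_sales), 1) AS pct_of_total
FROM game_sales
GROUP BY console
ORDER BY total_sales DESC
"""
summary = conn.execute(query).fetchall()
for row in summary:
    print(row)
```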

Visualization

This is where the fun begins. A strong visualization component is essential for any effective project. Visualizations are not only shareable and appealing but also showcase your ability to narrate a data-driven story.

Even though I appreciate seeing the code behind visualizations, it’s challenging to create quality visuals from poor data. Clean, engaging visualizations are far more appealing to recruiters than a poorly documented GitHub repository filled with hastily compiled projects.

Revisit your portfolio and ensure it reflects your current skills. While I might critique some of my past decisions, this project exemplifies how unconventional yet accessible data can craft a compelling narrative that led to a data engineering job.

If you're interested in exploring or adapting the data, here are the links:

  • Switch games
  • Wii games
  • 3DS games

You can also view the complete notebook on my GitHub.
