Harnessing Hidden Data: My Journey from Web Scraping to Employment
Web scraping is an invaluable skill for data professionals, and I want to share how I transformed an overlooked data source into a compelling portfolio project using just a single line of Python code.
A New Approach to Data Science Projects
When I commenced my data science job search in 2021, I created a set of criteria for the personal projects I would undertake. These guidelines were essential for showcasing my skills effectively:
- Each project must include visual elements like graphs, dashboards, or notebooks.
- No standard datasets (goodbye, Titanic).
- Above all, no starting from CSV files.
The importance of this last rule cannot be overstated. It’s a common sentiment among instructors and peers that a portfolio reliant solely on CSV data often indicates a novice in the field. In my experience, I usually generate CSVs after several data cleaning steps, rarely starting with pristine data—though that would certainly make my job easier.
When seeking project inspiration, my go-to suggestion is simple: web scraping.
Scraping Data from Wikipedia
While many might have been cautioned that Wikipedia is not a reliable source (a notion that’s become quite outdated), it actually hosts some of the most structured and accessible data online. Articles often feature tables, making them ripe for data extraction, as seen in the example of U.S. presidents.
The tables found on Wikipedia are incredibly easy to scrape using Pandas. With a single line of code, read_html grabs every table marked with the "wikitable" HTML class:
```python
import pandas as pd

# The Switch best-sellers article; any page containing class="wikitable" tables works
url = "https://en.wikipedia.org/wiki/List_of_best-selling_Nintendo_Switch_video_games"

switch_wiki = pd.read_html(url, attrs={'class': 'wikitable'})
```
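Since read_html returns a list of DataFrames, the first matching table is simply switch_wiki[0].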
Once you master the read_html function, you won’t need extensive tutorials to continue your web scraping journey. If you’re eager to dive in, consider these Wikipedia articles that feature tables ideal for extraction:
- Formula One Venues
- Wimbledon Champions
- Highest Grossing Films
- Nintendo Switch Best-Sellers
Focusing on the last article, I’ll illustrate how I leveraged three tables of Wikipedia data to create a standout project that eventually helped me land data analysis and engineering roles post-graduation.
My project utilized Python and SQLite to track global game sales across three Nintendo consoles:
- Nintendo Wii
- Nintendo Switch
- Nintendo 3DS
Below, I will outline the key steps in my process, including code snippets, outputs, and visualizations that demonstrate how I transformed data from a wiki page into a polished presentation.
Note: For more insights on Python, SQL, and cloud computing, follow **Pipeline: Your Data Engineering Resource**.
If you want to scrape multiple tables without manual effort, here's a neat trick:
Efficiently Scraping Multiple Wikipedia Pages
You can scrape over 200 Wikipedia tables into Pandas data frames using a single Pandas function and just two loops.
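Here's a minimal sketch of that pattern. The two URLs below are stand-ins (any articles containing wikitables will do), and read_html plus two loops does all the work:

```python
import pandas as pd

# Illustrative list of article URLs; swap in any pages that contain wikitables
urls = [
    "https://en.wikipedia.org/wiki/List_of_best-selling_Nintendo_Switch_video_games",
    "https://en.wikipedia.org/wiki/List_of_best-selling_Wii_video_games",
]

tables = []
for url in urls:                                  # loop 1: one request per article
    for df in pd.read_html(url, attrs={"class": "wikitable"}):
        tables.append(df)                         # loop 2: collect every wikitable

print(f"Scraped {len(tables)} tables")
```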
Data Cleaning
Generally, Wikipedia data is ready to go straight away, thanks to Pandas. However, the most common task involves cleaning up stray footnotes or superscripts, which can be tackled using regex.
When filtering, joining, or matching on strings, strip stray whitespace first; exact comparisons in Pandas fail silently when values carry hidden leading or trailing spaces.
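To make that concrete, here's a small sketch; the column name and footnote patterns below are illustrative, not from my original notebook:

```python
import pandas as pd

# Toy frame mimicking a scraped wikitable: footnote markers plus stray spaces
df = pd.DataFrame({"Title ": ["Mario Kart 8 Deluxe[a]", "Wii Sports [1]"]})

df.columns = df.columns.str.strip()               # spaces in headers break lookups
df["Title"] = (
    df["Title"]
    .str.replace(r"\[\w+\]", "", regex=True)      # drop [1]- and [a]-style footnotes
    .str.strip()                                  # drop leading/trailing spaces
)
```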
Data Loading
Instead of merely visualizing the data frame, I opted to convert the output into a CSV file, which I then used to create a SQLite table. This process helped me better grasp the Extract-Load (EL) process I routinely execute at work.
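In code, that EL step looks roughly like this; the file and table names are hypothetical, and df is the cleaned frame from the previous step:

```python
import sqlite3
import pandas as pd

# Stage the cleaned DataFrame as a CSV (the "extract" artifact)
df.to_csv("switch_sales.csv", index=False)

# Load it into a SQLite table (hypothetical database and table names)
conn = sqlite3.connect("nintendo_sales.db")
pd.read_csv("switch_sales.csv").to_sql(
    "switch_sales", conn, if_exists="replace", index=False
)
conn.close()
```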
I find SQL operations more intuitive for analytics compared to the functions in Pandas, which is another reason for my choice.
Data Analysis
The analysis phase concentrated on aggregate metrics such as sums, averages, and percentages. Even though my project revolved around video game data, I aimed to demonstrate its business relevance during interviews. Stakeholders typically prefer high-level insights over minute details, so I focused on key indicators like moving averages and percent changes.
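As one example of the kind of query I mean, here's a year-over-year percent change computed with a window function; the release_year and copies_sold columns are assumptions about the schema, not the exact fields from the wiki tables:

```python
import sqlite3
import pandas as pd

conn = sqlite3.connect("nintendo_sales.db")

# Assumed columns: release_year, copies_sold; adjust to the real schema
query = """
SELECT
    release_year,
    SUM(copies_sold) AS yearly_sales,
    ROUND(
        100.0 * (SUM(copies_sold) - LAG(SUM(copies_sold)) OVER (ORDER BY release_year))
              / LAG(SUM(copies_sold)) OVER (ORDER BY release_year),
        1
    ) AS pct_change
FROM switch_sales
GROUP BY release_year
ORDER BY release_year;
"""
yearly = pd.read_sql(query, conn)
conn.close()
```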
Visualization
This is where the fun begins. A strong visualization component is essential for any effective project. Visualizations are not only shareable and appealing but also showcase your ability to narrate a data-driven story.
Even though I appreciate seeing the code behind visualizations, it’s challenging to create quality visuals from poor data. Clean, engaging visualizations are far more appealing to recruiters than a poorly documented GitHub repository filled with hastily compiled projects.
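For illustration, a chart like this takes only a few lines, again assuming the hypothetical switch_sales table from the loading step:

```python
import sqlite3
import pandas as pd
import matplotlib.pyplot as plt

conn = sqlite3.connect("nintendo_sales.db")
top10 = pd.read_sql(
    "SELECT title, copies_sold FROM switch_sales ORDER BY copies_sold DESC LIMIT 10",
    conn,
)
conn.close()

plt.barh(top10["title"], top10["copies_sold"])
plt.gca().invert_yaxis()                      # put the best-seller on top
plt.xlabel("Copies sold (millions)")          # assumed unit from the wiki table
plt.title("Top 10 Best-Selling Switch Games")
plt.tight_layout()
plt.show()
```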
Revisit your portfolio and ensure it reflects your current skills. While I might critique some of my past decisions, this project exemplifies how unconventional yet accessible data can craft a compelling narrative that led to a data engineering job.
If you're interested in exploring or adapting the data, here are the links:
- Switch games
- Wii games
- 3DS games
You can also view the complete notebook on my GitHub.