Quick $300 Web Scraping Opportunity: A Step-by-Step Guide
Chapter 1: Discovering the Opportunity
While browsing Craigslist, I stumbled upon a request for someone to download files from the Texas Department of Transportation's website and extract data from them. The client was seeking a manual approach.
Seeing this, an idea struck me: I could use Python to automate the process and finish the job in roughly an hour. I replied by email, the client got back to me the following day, and we arranged a call. I assured him I could automate the necessary steps and retrieve the data efficiently. He initially suggested hourly pay, but I explained that I typically charge per project, and he accepted my proposal.
Section 1.1: Initial Steps
The first task was to download the files from the txdot.gov site. I adapted an existing Jupyter notebook and had this working in about 15 minutes. For each year, I handled the extraction manually. Tip: not every step needs automation. Rather than overthink it, I simply edited one line of code as I moved through the different pages (a sketch of this appears after the driver setup below).
To begin, I loaded the required libraries:
import os
import re
import urllib.request
import requests
import pandas as pd
from datetime import datetime, date, timedelta
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.chrome.service import Service  # unused here; left over from a Chrome setup
Next, I identified the URL for the txdot.gov site, initiated the Firefox driver, and accessed the URL:
# url is the txdot.gov listing page identified above
driver = webdriver.Firefox()  # geckodriver must be on the PATH (or pass its location)
driver.implicitly_wait(10)
driver.get(url)
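As a concrete version of the earlier tip about editing one line per page: assuming each year's listing lives at its own URL (the pattern below is hypothetical; the real path isn't shown here), rerunning with a different year is a one-line edit:

# hypothetical URL pattern: edit the year and rerun for each listing page
year = 2017
url = f'https://www.txdot.gov/listing-page-for-{year}'  # placeholder, not the real path
driver.get(url)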
Section 1.2: The Download Process
I took a somewhat relaxed approach to gathering the necessary information. Knowing there were fewer than 100 URLs for a given year, I set the download loop to run 1,000 times and let the code error out once it ran past the last row. That was fine: every iteration before the crash had already downloaded its file. The urllib.request.urlretrieve call handles each download with two arguments: the file's URL on the website and the local path (including filename) where it should be saved.
for row in range(1, 1000):
    # grab the href from each table row; this raises once the rows run out,
    # which was acceptable since every earlier pass had already saved its file
    location = driver.find_element(
        By.XPATH,
        f'/html/body/div[2]/div[3]/main/div[3]/div[1]/div[3]/div/div/div[7]/div/div/table/tbody/tr[{row}]/td[1]/a'
    ).get_attribute('href')
    print(location)
    urllib.request.urlretrieve(location, 'c:/users/denni/downloads/txdot/' + location.split('/')[-1])
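Since each failed lookup would otherwise sit out the 10-second implicit wait and then raise, a slightly tidier variant (my suggestion here, not what I actually ran for the gig) catches Selenium's NoSuchElementException and breaks out once the table runs out of rows:

from selenium.common.exceptions import NoSuchElementException

for row in range(1, 1000):
    try:
        location = driver.find_element(
            By.XPATH,
            f'/html/body/div[2]/div[3]/main/div[3]/div[1]/div[3]/div/div/div[7]/div/div/table/tbody/tr[{row}]/td[1]/a'
        ).get_attribute('href')
    except NoSuchElementException:
        break  # ran past the last table row; stop cleanly
    urllib.request.urlretrieve(location, 'c:/users/denni/downloads/txdot/' + location.split('/')[-1])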
I still needed clarification from the client on how he wanted the data extracted. Having worked with PDFs frequently, I estimated the initial extraction at around 15 minutes and the PDF parsing at under 45 minutes, which put me at $300 for about an hour of straightforward work.
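I hadn't written the parsing code yet at this point, but for PDFs whose data sits in plain tables, a typical starting point (a sketch assuming the pdfplumber library, which is not what I ultimately used) looks like this:

import os
import pdfplumber
import pandas as pd

folder = 'c:/users/denni/downloads/txdot/'
rows = []
for name in os.listdir(folder):
    if not name.lower().endswith('.pdf'):
        continue
    with pdfplumber.open(folder + name) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                rows.extend(table)  # each row is a list of cell strings

df = pd.DataFrame(rows)  # clean up headers and types from here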
Chapter 2: Completing the Task
The rest of the job was straightforward. The client asked me to compile the listings from the pages; with only seven years of data, copying and pasting them into Excel took less than 5 minutes. The final step was exporting the parsed results to Excel. I started coding it in Python, then remembered the PDFElement Pro software I had bought for PDF conversions, which turned out to be the ideal tool: what I expected to take 10 minutes to code took about 3 minutes to run. I zipped the results and sent them to the client.
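For reference, the Python route I started down would have been short anyway. A sketch of the export-and-package step, assuming the parsed results sit in a pandas DataFrame named df (and that openpyxl is installed for Excel output):

import shutil
import pandas as pd

# write the compiled results to a spreadsheet for the client
df.to_excel('c:/users/denni/downloads/txdot/results.xlsx', index=False)

# bundle the whole output folder into one archive to send over
shutil.make_archive('c:/users/denni/downloads/txdot_results', 'zip',
                    'c:/users/denni/downloads/txdot/')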
Total time spent: under 60 minutes. Compensation = $300.
Interestingly, the client later presented additional requests. I clarified that these changes would incur extra charges.
It's essential to show clients what automation makes possible. They are often willing to pay a premium for fast results; this client could have hired someone to do the job manually for $200 or less.