Strategic Steps to Mitigate Data Downtime: A Personal Insight
Anticipating Data Downtime
Data engineers must be prepared to navigate and recover from unavoidable downtimes and API failures.
In March 2024, nearly 3 billion individuals experienced a collective information blackout as Facebook and Instagram faced significant outages. Unlike other applications such as Reddit, which went completely offline, Facebook's downtime was misleading. By maintaining a login flow, it created the illusion that everything was functioning normally, trapping users in a frustrating login loop that ultimately led to error messages.
Since I don't work for Meta/Facebook/Instagram, I can only speculate about the frenzy engineers went through to restore the world's largest social network before the lunch hour. I can, however, attest to the feeling of having all systems operational, only to see visibility drop to zero.
With the resources available to large corporations, including strong DevOps teams and extensive cloud infrastructure, incidents like this are unlikely in my professional setting. Fortunately, this outage didn't touch my day job.
My experience arose from a logistical oversight in my personal setup that forced me to shut down several automated processes just 10 minutes before their scheduled daily operations.
By detailing my triage and contingency approach, I hope to illustrate the flexibility required to manage a data outage on a smaller, less intense scale, as well as the steps necessary to deprecate or "sunset" underutilized or problematic pipelines.
The Beginning of the Outage
My experience began with a billing error, much like a utility company terminating service due to non-payment. The paid API service that powers not only my personal infrastructure but also the publication I co-edit notified me that I had exhausted 100% of my tokens for the month. For context, I have only surpassed this limit once before, and it was costly. With three days remaining in my billing cycle, I was reluctant to incur extra charges for my existing service.
Rather than scrambling to maintain data flow, I made the unilateral decision to halt all operations. For the first time since launching my suite of automated reporting tools in June 2023, I found myself without insights into my content's performance or subscriber activity on ConvertKit.
Unfortunately, shutting down my data flows was far from straightforward; it resembled untangling a mess of cables plugged into a power strip. To bring everything to a clean stop, I had to:
- Pause my cloud scheduler job for Pipeline
- Pause my cloud scheduler job for Learning SQL
- Cancel automated deliveries for my daily dashboard
- Cancel automated deliveries for Learning SQL's weekly dashboard
- Set all views to reflect the last updated date
- Reschedule cloud jobs for the day after billing (the 18th of the month)
- Reschedule dashboard deliveries for a date after the billing cycle
In a corporate data team, there would ideally be regular billing reports and a formal deprecation strategy to ease the transition from visibility to downtime. Instead, this burden rested solely on my shoulders. Thankfully, I was the only end-user for at least one of my products.
Managing the Shutdown
When faced with a data outage or planning to deprecate an existing product, consider the following steps:
Assess Costs
For data sources with high costs, like API token overages, it's crucial to evaluate the expense of unplugging. To understand the ROI of either a) deprecation or b) continued use, some basic data engineering calculations are necessary.
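As a rough illustration, a back-of-the-envelope query like the one below can put the API bill in context. This is only a sketch: the table path, the upload_date field, and the $150 monthly fee are hypothetical stand-ins for your own source and billing figures.
-- Rough cost-per-row estimate; the $150 monthly API fee is a hypothetical figure
SELECT
  DATE_TRUNC(DATE(upload_date), MONTH) AS billing_month,
  COUNT(1) AS rows_ingested,
  ROUND(150.00 / COUNT(1), 4) AS est_cost_per_row
FROM project.dataset.table
GROUP BY 1
ORDER BY 1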
Even at premium prices, the data provided by an API may be integral to your organization’s data strategy and could support functions like sales and marketing. In such cases, management might decide to continue ingestion if the benefits outweigh the costs.
Identify Downstream Usage
Most of this step can be automated. If you're utilizing a shared SQL environment like BigQuery, you can query table metadata. While metadata can be overwhelming, your goal is to query a combination of a dataset’s INFORMATION_SCHEMA and __TABLES__ views.
For deprecation purposes, start by pulling the column metadata for the table in question with a query like:
-- List the column metadata for the table slated for deprecation
SELECT * FROM project.dataset.INFORMATION_SCHEMA.COLUMNS
WHERE table_name = "deprecated_table"
You can also query INFORMATION_SCHEMA's job history, filtered to the specific tables in question, to find the last user who queried or modified the source.
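For example, BigQuery's INFORMATION_SCHEMA.JOBS_BY_PROJECT view records recent job history. A minimal sketch, assuming your jobs run in the US region, that "deprecated_table" stands in for the real table name, and that you have permission to read project job history:
-- Who referenced the candidate table last, and when?
-- Job history is only retained for a limited window (typically ~180 days)
SELECT
  user_email,
  MAX(creation_time) AS last_referenced
FROM `region-us`.INFORMATION_SCHEMA.JOBS_BY_PROJECT,
  UNNEST(referenced_tables) AS ref
WHERE ref.table_id = "deprecated_table"
GROUP BY user_email
ORDER BY last_referenced DESC
If nobody has referenced the table inside that retention window, it's a strong candidate for the deletion list described next.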
Create a Deletion List
Given that my work's BigQuery environment is shared with numerous users and connected to a visualization platform utilized by all levels of the organization, I exercise caution with deletions.
Even for less contentious deprecation efforts, like removing rarely used cloud functions, I compile and share a spreadsheet. If this is merely a maintenance task, I circulate it among my immediate team; otherwise, I distribute it to the larger tech organization. Often, my choices are confirmed, and I proceed with deprecation, but sometimes a colleague identifies a hidden dependency I'm unaware of.
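To seed that spreadsheet, the __TABLES__ metadata mentioned earlier is a convenient starting point. A minimal sketch, assuming a single dataset and a 90-day staleness threshold (both placeholders you'd tune to your own environment):
-- Candidate deletion list: tables not modified in the last 90 days (threshold is a placeholder)
SELECT
  table_id,
  row_count,
  TIMESTAMP_MILLIS(last_modified_time) AS last_modified
FROM project.dataset.__TABLES__
WHERE TIMESTAMP_MILLIS(last_modified_time) < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 90 DAY)
ORDER BY last_modified
Exporting the result to a sheet gives colleagues a concrete artifact to veto before anything actually disappears.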
Implement a Pause
Even if I ultimately plan to delete deployed code, I typically pause the process for several days first. This precautionary step lets me monitor for downstream breakage without dismantling established infrastructure, and makes it easy to resume data delivery if a hidden dependency surfaces.
Keeping Data Flowing
If you couldn't persuade management to continue with a costly API, you can't simply cut your losses and ignore requests for that data.
Whenever feasible, aim to do two things:
- Find a stopgap or temporary solution.
- Identify missing data and perform a backfill.
Temporary Solutions
During my three-day outage, I sought to salvage the situation and maintain reporting. I created a temporary view that captured all data up until the last complete day before the outage.
To achieve this, I modified a dynamic date field to:
WHERE DATE(upload_date) = "2024-04-16"
Although I wouldn't receive daily reports, I at least had historical data to reference, knowing the outage would last just 2–3 days.
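For illustration, the frozen view ends up looking something like this; the view and table names are placeholders, and the date is the last complete day before my outage:
-- Temporary stopgap: pin the reporting view to the last complete day before the outage
CREATE OR REPLACE VIEW project.dataset.daily_report AS
SELECT *
FROM project.dataset.table
WHERE DATE(upload_date) = "2024-04-16"
Once the billing cycle resets, the filter can go back to its dynamic form and daily reporting picks up where it left off.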
Backfill Data
Once the data issue is resolved—or the bill is paid—it’s essential to quickly recover and ingest any data lost during the downtime. I skipped this step because my ETL pipeline doesn't utilize date fields (though my view does).
To identify gaps, a simple query like the following can be useful:
-- Count rows per day to spot dates that are missing or suspiciously light
SELECT DATE(upload_date) AS uploadDate, COUNT(1) AS rowCount
FROM project.dataset.table
GROUP BY 1 ORDER BY 1
After identifying the days needing replacement, revisit your code and replace filter or date field references with explicit past dates.
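As a hedged sketch of what that can look like, assuming the raw records land in a hypothetical staging table with the same schema as the target, and that the gap query flagged April 17–18 as missing:
-- Backfill only the missing days; the dates and staging table are illustrative placeholders
-- Assumes staging_table shares the target table's schema and column order
INSERT INTO project.dataset.table
SELECT *
FROM project.dataset.staging_table
WHERE DATE(upload_date) BETWEEN "2024-04-17" AND "2024-04-18"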
Regardless of how you retroactively load your data, ensure that you understand the expected number of rows and the nature of the data returned, so you're genuinely filling in gaps rather than creating new issues.
Knowing how to manage (and ideally prevent) data outages is a mark of a maturing data engineer. Uptime and downtime are critical metrics in data engineering, and delivering accurate, reliable data is a baseline expectation in nearly every job description, management ask, and stakeholder requirement.
Like Facebook's 3 billion users, you may not always be able to predict when downtime will end, but you can be sure someone will let you know when it begins. Be prepared.
I need your feedback! Please take a moment to complete a 3-question survey to let me know how I can assist you beyond this blog. All respondents will receive a complimentary gift.