Optimizing Backup Failure Remediation

The Bocada Team | December 30, 2019

Fighting backup failure fires has historically been part of every backup administrator’s life. It’s stressful, it’s time-intensive, and, ironically, it doesn’t always focus attention on the most important issue: protecting valuable data.

With limited insight into the underlying causes of backup failures, backup admins have generally been forced to manually look at each and every failure, identify critical issues, and then develop a remediation plan. It’s a time-intensive process that can result in critical failures being deprioritized and left unattended for too long.

Yet by adding value-rich pieces of information to each backup failure record, organizing similar failure types together, and adding automation through the entire remediation process, admins can vastly simplify and prioritize their workflow. Data stays protected while order is finally brought to the failure correction process.

The Standard Backup Performance Report

Organizations pulling backup failure data manually from backup products, or using simple reporting tools, will pull reports that generally include the following data:

Backup server
Backup client
Backup owner
Backup success or failure status
Last successful backup date

When run across an enterprise-sized backup environment, these reports produce an output file with thousands upon thousands of rows, often with limited diagnostic information.

Sorting by backup failure begins to segment the data, but it is still rife with noise. Admins will know the job failed. But without knowing the underlying cause, they are left unsure whether it needs to be addressed immediately or whether it’s a temporary hiccup that will resolve itself with a re-run. Without additional diagnostic capabilities, backup admins are left spending hours manually looking at each and every failure.

A Backup Report That Separates The Noise From The Critical Failures

To truly optimize backup failure remediation, additional levels of detail must be made available to backup administrators.

In addition to the details mentioned above, optimized reports also include:

Error code: These are backup failure codes that come directly from the backup product, usually an alphanumeric sequence.
Error message: This is the text-based message that comes directly from the backup job log, intended to give additional information beyond what the error code offers.
Error set: These are groups of error codes that have been assigned to a particular type of error (e.g., file locked error, backup configuration error). Tools like Bocada automatically map error codes into error sets, making it easier to identify the underlying issue by bringing standardized error naming conventions across backup products.
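
To make the idea of an error set concrete, here is a minimal Python sketch of mapping product-specific error codes to standardized error sets. The product names, error codes, and the FailureRecord fields are hypothetical placeholders chosen for illustration. They are not real codes from any backup product, and this is not how Bocada implements its mapping.

```python
from dataclasses import dataclass

# Hypothetical (product, error code) -> error set lookup.
# Real codes differ by backup product; these values are placeholders.
ERROR_SET_BY_CODE = {
    ("product_a", "E1001"): "backup file locked",
    ("product_a", "E2040"): "media error",
    ("product_b", "0x84"): "media error",
    ("product_b", "0x13"): "file error",
}

UNMAPPED = "unmapped"  # fallback label for codes not yet assigned to a set


@dataclass(frozen=True)
class FailureRecord:
    product: str
    client: str
    error_code: str
    error_message: str


def error_set_for(record: FailureRecord) -> str:
    """Return the standardized error set for a failure record."""
    return ERROR_SET_BY_CODE.get((record.product, record.error_code), UNMAPPED)


records = [
    FailureRecord("product_a", "db-01", "E2040", "Tape drive reported a write fault"),
    FailureRecord("product_b", "app-07", "0x13", "Source file not found"),
    FailureRecord("product_b", "web-02", "0x99", "Unknown failure"),
]

# Pair each client with its error set so failures from different
# products can be compared under one naming convention.
labeled = [(r.client, error_set_for(r)) for r in records]
```

Keeping an explicit "unmapped" label, rather than dropping unknown codes, is one way to make sure new or unrecognized failures still surface for review.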

When these fields are combined, you’ll see a report that looks like this:

A simple review of this table, and specifically the error sets, can pinpoint failures that must be addressed immediately, like “media error” and “backup window error.” Meanwhile, error sets like “backup file locked” or “file error” identify jobs that failed due to file usage or files that are no longer present. These are “noise.” That is, these are backups that failed for legitimate reasons but don’t represent actual data protection concerns. They can be cleaned up after more critical failures are tended to.

To further optimize the time spent identifying and prioritizing job failures, we recommend splitting the report above into two separate reports based on error sets:

Backup Failures Due To Configuration & File Level Errors
Backup Failures Not Due To Configuration & File Level Errors

The first report would capture the noise, such as clients that are now obsolete and don’t require a backup. These failures are fairly insignificant and can be addressed when looking to clean up your backup environment. The second report includes all of your high-severity failures. This becomes the punch list that your team works through to tackle high-priority issues that could result in data protection risks.
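
The two-report split described above can be sketched in a few lines of Python. This is an illustrative sketch only: the row fields and the choice of which error sets count as configuration or file-level are assumptions drawn from the examples in this article, and should be adjusted to your own environment.

```python
# Error sets treated as configuration or file-level "noise".
# This membership list is an assumption for illustration.
CONFIG_AND_FILE_ERROR_SETS = {
    "backup file locked",
    "file error",
    "backup configuration error",
}


def split_failure_reports(failures):
    """Split failure rows into (config_and_file_level, high_priority) lists."""
    config_and_file_level, high_priority = [], []
    for row in failures:
        if row["error_set"] in CONFIG_AND_FILE_ERROR_SETS:
            config_and_file_level.append(row)
        else:
            high_priority.append(row)
    return config_and_file_level, high_priority


# Hypothetical sample rows; server and client names are made up.
failures = [
    {"server": "bk-01", "client": "db-01", "error_set": "media error"},
    {"server": "bk-01", "client": "old-ws-3", "error_set": "file error"},
    {"server": "bk-02", "client": "app-07", "error_set": "backup window error"},
    {"server": "bk-02", "client": "hr-share", "error_set": "backup file locked"},
]

cleanup_report, punch_list = split_failure_reports(failures)
```

Anything not explicitly listed as low priority lands on the punch list, so an unfamiliar error set is treated as potentially critical until someone classifies it.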

Introducing Automation Into Your Entire Backup Remediation Process

While the process above certainly improves failure remediation, consider introducing automation throughout the process. This ensures not only that the information you need to correct failures efficiently is collected in the right way, but also that it is ready when you and your team need it.

Data Aggregation & Normalization: Automating the pulling of backup performance metadata from your entire on-prem and cloud environment simplifies a highly manual task. This includes pulling backup server, client, owner, failure status, and other relevant data points and then normalizing them in one core dashboard.
Report Creation & Distribution: With readily available data on hand, also consider automating the creation and distribution of custom reports that include value-rich supplemental information like backup error set. Consistent report formats sent at regular intervals to relevant parties ensure not only that failures are being worked on by the right parties but also that your team has all of the information it needs to fix the failures efficiently.
Batching Failure Types: Taking the items above a step further, you can automate batching same-type failures together and generating reports on just those failures at the server level, business unit level, or other relevant dimension. This process can show where single culprits are causing the greatest number of errors, helping prioritize areas where one correction can have significant impact.
Ticketing: Automating the creation of tickets for high-priority failures means less time is wasted on the ticketing process itself, freeing up admins to actually address the underlying failure causes.
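
As a rough illustration of the batching step, the Python sketch below counts failures by error set and by a chosen dimension (server here) to surface the single largest culprit. The row fields and sample values are hypothetical, and a real pipeline would read rows from your aggregated backup data rather than an in-memory list.

```python
from collections import Counter


def batch_failures(failures, dimension="server"):
    """Count failures per (error set, dimension value) pair."""
    return Counter((row["error_set"], row[dimension]) for row in failures)


# Hypothetical sample rows; names are made up for illustration.
failures = [
    {"server": "bk-01", "business_unit": "finance", "error_set": "media error"},
    {"server": "bk-01", "business_unit": "finance", "error_set": "media error"},
    {"server": "bk-01", "business_unit": "sales", "error_set": "media error"},
    {"server": "bk-02", "business_unit": "sales", "error_set": "backup window error"},
]

by_server = batch_failures(failures, dimension="server")
by_unit = batch_failures(failures, dimension="business_unit")

# The largest batch points at where one correction could clear the most failures.
top_batch, top_count = by_server.most_common(1)[0]
```

In this sample, three of the four failures trace back to media errors on one server, which is the kind of pattern that suggests fixing a single root cause first.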

This automated process means your “noise-free” data will be available in near real time. You’ll be able to eliminate hours of manual data collection work and increase performance by jumping on high-priority failures. Data stays protected while backup teams focus their efforts on critical issues only.