Picture the scene: the cockpit of a commercial jet, where the co-pilot is worried:
“Captain, what if the engines fail?”
“We’ll just glide to safety.”
“What if the wings fall off?”
“That won’t happen but if it did, we’d parachute down safely.”
“And what if the parachutes fail?”
“We’d aim to land on a haystack, to cushion our fall.”
“But Captain, what if there’s a pitchfork in the haystack, pointing upwards…?”
While it may be frustrating, it’s always wise to consider the various scenarios of potential failure, and to plan for safety. Indeed, it makes sense to continue considering the possibilities: over time, the risk assessment can change, with previously unusual causes of failure becoming more likely. There may be entirely new risks to consider. Those hidden pitchforks.
It’s the same when you’re considering the health and wellbeing of your mainframe infrastructure. Resilience, recovery and cybersecurity are more important than ever.
In the mainframe world, we have many decades’ experience of planning and mitigation exercises. We can look to world-class Continuous and High Availability technologies to protect our customers and data during all manner of failures. And yet… ransomware is much in the news. And data corruption from disgruntled insiders is commonly documented; potentially, 70% of unauthorised activity can be traced to insider involvement. Stolen or sold credentials might lead to a few pennies added here and there to a balance, or to escalated privileges and access to some of the most powerful encryption and key-management facilities on the planet. Suddenly, your parachute has failed and you’re fast approaching the haystack, or, in our case, a serious logical data corruption.
Disaster recovery planning scenarios have typically centred around crises initiated by hardware failure, say, or an application outage or geographical disruption; a lengthy city-wide power outage, perhaps, or a configuration error. Failover to backup infrastructure and backup data is usually automated. So far, so good. Or has a degree of complacency crept in?
What if the initiation is a warning of data corruption? Or of data or service access denial? A message delivered: “Access Denied!” where access should most definitely not be denied. Or a deluge of customer complaints about account errors? Or overnight balances failing to equalise? And what if the integrity of the restore copy of the database or file is in doubt, and in any case is inextricably linked to many other restores that must happen simultaneously, some of which are also in doubt? In such scenarios, where does your hardware copy/restore leave you? Potentially, with a restore of equally corrupted data, and failure after hours or days of effort: days your organisation may no longer have the luxury of.
The thing is, your instant data replication capability may instantly replicate your logical corruption. While conventional monitoring can detect system and application outages, there’s typically no validation of logical integrity. A current single point of recovery can itself be compromised, and with systems, storage and tape pools participating in a single logical system structure, backups of these may be as vulnerable as the primary sources. Your scope of recovery may be as narrow as “system wide”: logical recovery will require additional forensic or surgical recovery capabilities.
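The distinction matters enough to be worth spelling out. Here is a deliberately simplified, hypothetical sketch (plain Python dictionaries standing in for storage volumes; none of this is an IBM API) of why synchronous replication faithfully mirrors a corrupting write, while an isolated point-in-time copy taken beforehand still holds the clean data:

```python
# Hypothetical illustration: replication copies every write, good or bad,
# while an isolated point-in-time snapshot preserves the earlier state.

primary = {"acct_1001": 250.00, "acct_1002": 980.50}

# Take an isolated point-in-time copy BEFORE any corruption occurs.
# This stands in for an air-gapped, safeguarded copy.
snapshot = dict(primary)

replica = dict(primary)

def replicated_write(key, value, replica):
    """Write to primary and mirror it instantly to the replica."""
    primary[key] = value
    replica[key] = value  # replication faithfully copies bad data too

# A corrupting write (malicious or accidental) hits the primary...
replicated_write("acct_1001", -999999.99, replica)

# ...and the replica now carries the same corruption:
assert replica["acct_1001"] == -999999.99

# Only the earlier point-in-time snapshot still holds the clean value:
assert snapshot["acct_1001"] == 250.00
```

The point of the sketch: a replica protects you from losing the hardware, not from trusting the data; only a copy isolated in time (and ideally in access) survives a logical attack.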
My point is, do you have a plan for recovery from data corruption? Is that plan capable of restoring a clean, trusted copy on which your customers can depend? If you’re taking data copies, are they air-gapped: held somewhere safe and isolated, at a known point in time? Do you know how long it would take you to recover your systems and applications after an attack?
Okay, so I’ve raised this spectre and asked lots of questions, so I ought to start providing some answers. The IBM® Cyber Vault solution for Z ticks a great many boxes in this area. Designed to enable continuous data protection, it combines Z hardware and software, storage and integrated services, using a trusted air-gapped copy of data to help enterprises recover fast from outages due to corrupted logical data such as entities (tables), attributes (columns/fields) and relationships (keys).
Key to Cyber Vault is the provision of: data validation, early and often; forensic analysis to identify recovery actions; surgical recovery to extract data from the copy and logically restore back to the production environment; catastrophic recovery for worst-case scenarios, when the entire environment has to be restored back to the point in time of the copy; and offline backup for extra protection.
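The “validation, early and often” idea is the linchpin: a protected copy is only trustworthy if you check its logical integrity before you depend on it. As a hedged illustration only (a toy sketch in plain Python, not the Cyber Vault implementation; the table and field names are invented), checks at all three levels the article mentions might look like this:

```python
# Hypothetical sketch: validating the logical integrity of a protected
# copy by checking attributes (field values) and relationships (keys)
# across entities (tables) before trusting it for recovery.

orders = [
    {"order_id": 1, "customer_id": 10, "amount": 99.95},
    {"order_id": 2, "customer_id": 11, "amount": 12.50},
]
customers = [{"customer_id": 10}, {"customer_id": 11}]

def validate_copy(orders, customers):
    """Return a list of logical-integrity problems found in the copy."""
    problems = []
    known_customers = {c["customer_id"] for c in customers}
    for row in orders:
        # Attribute check: amounts should never be negative.
        if row["amount"] < 0:
            problems.append(f"order {row['order_id']}: negative amount")
        # Relationship check: every order must reference a real customer.
        if row["customer_id"] not in known_customers:
            problems.append(f"order {row['order_id']}: orphaned customer ref")
    return problems

# A clean copy passes validation...
assert validate_copy(orders, customers) == []

# ...while a logically corrupted copy is flagged BEFORE it is used
# for recovery, rather than discovered days into a failed restore.
orders_bad = orders + [{"order_id": 3, "customer_id": 99, "amount": -5.0}]
assert len(validate_copy(orders_bad, customers)) == 2
```

Running such checks against each fresh copy, rather than at restore time, is what turns a backup into a known-good recovery point.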
With new and increased cyber threats emerging during the 2020 pandemic and lockdown, this approach can help mainframe shops to identify cyber attacks aimed at logical data, as well as to respond to and recover from breaches more quickly. Complementing other existing security and High Availability/Disaster Recovery solutions and infrastructure, you probably have many of the key components already in place: z14 or z15, DS8000 Storage, IBM GDPS and IBM Security Guardium.
As I mentioned, it makes a lot of sense to continue scanning the horizon and considering the various risks and emerging possibilities, and plan accordingly. Some versions of the aircraft joke have all those unfortunate events actually occurring and end with this punchline: luckily, they missed the pitchfork; unluckily, they also missed the haystack… The more awareness we have of the possible risks and what’s available to mitigate them, the better prepared, better protected and more resilient we can be.
Andy Coulson has collaborated with GSE UK for many years. An IBM Redbook author, a presenter on mainframe security and technology, and a vlogger on the ‘Mainframe in 5 Minutes’ YouTube channel, he is passionate about the technology that lies within all things mainframe. Andy works for IBM UK, although he wishes to point out that all views expressed are his own, not necessarily those of IBM or GSE.