Lessons on resilience as tech outage creates global disruption

Faulty code resulted in one of the most widespread tech outages in recent years for companies using Windows.

In what is being acknowledged as the largest IT outage in history, a defective update to tech company CrowdStrike’s Falcon Sensor software caused Windows computer systems to crash across the globe last week, leading to massive disruptions of critical functions across multiple industries.

The events highlight the importance of having strong regulatory directives that compel businesses to undertake resiliency and disaster response initiatives in the cybersecurity arena. If you don’t think we need these rules, consider the revelations about just how many businesses still don’t use multi-factor authentication.

The latest outages affected tens of thousands of airline flights, hit the London Stock Exchange, banks, Ticketmaster, the United Parcel Service, Starbucks and McDonald’s, and some medical facilities in Boston, New York City and elsewhere, which had to cancel or delay surgeries.

Visa, Zelle, TD Bank, JPMorgan Chase Bank, and Bank of America also had issues with money transfers on Friday, according to DownDetector. Some courthouses, jails and other municipal institutions were affected. Employees at JPMorgan and the Japanese bank Nomura reportedly had trouble accessing their workstations, in some cases leading to delays in trading.

In the last few months, cybersecurity-related incidents have affected a number of large companies and provided business leaders with a lot to absorb in terms of the adequacy of their organizations’ cyber resilience, testing protocols, contractor relationships (and dependency), and business continuity planning.

The CrowdStrike crash

The crash did not involve any cyber intrusion, but the faulty CrowdStrike software update has hindered customers whose virtual machines are running Microsoft’s Windows Client and Windows Server. The Texas-based company’s software is designed to offer endpoint detection and response and generally protect against malware, but this time a defect in a routine content update for Windows hosts had ripple effects that could only be called gigantic.

CrowdStrike was considered the worldwide leader in endpoint security sales, providing services to more than half of Fortune 500 companies (a figure that excludes smaller customers serviced through its partners), 43 of the 50 US states, seven of the top 10 manufacturers, and eight of the top 10 financial services firms, according to a report from Canalys.

The company issued statements early, assuming responsibility, describing its corrective actions and explaining how affected customers could seek further information.

“The system was sent an update and that update had a software bug in it that caused an issue with the Microsoft operating system,” co-founder and CEO George Kurtz said. “As systems come back online, as they’re rebooted, they’re coming up and they’re working.

“We’re deeply sorry for the impact that we’ve caused to customers, to travelers, to anyone affected by this, including our company,” he said.

Implications

There is no totally foolproof software, and we have to accept that fact, but businesses must also craft business continuity/disaster response plans that can at least ensure a bad situation does not turn worse.

The SEC’s rules on cybersecurity risk management and incident disclosure and the EU’s Digital Operational Resilience Act are designed to compel businesses to take the appropriate steps to manage cyber risk within their own organizations and, importantly, within their supply chains.

While CrowdStrike has issued workarounds and fixes for the issue, it admits in its public statements that recovery requires manual intervention on each affected device, which could lead to a long recovery time.

Organizations need to have planned for this already, prioritizing the systems that are most critical to their business and recovering them in order of priority.
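As a rough illustration of that kind of prioritization, the sketch below sorts a made-up system inventory so that the most critical systems are recovered first. The `System` class, the tier scale (1 = most critical), the tie-break on fleet size and every inventory entry are assumptions invented for this example; they are not drawn from any real recovery plan or from CrowdStrike guidance.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    criticality: int   # hypothetical tier: 1 = most critical to the business
    device_count: int  # devices that would need manual intervention


def recovery_order(systems):
    """Sort by tier (most critical first); within a tier, smaller fleets first."""
    return sorted(systems, key=lambda s: (s.criticality, s.device_count))


# Illustrative inventory only; every entry here is invented.
inventory = [
    System("internal wiki", 3, 4),
    System("payment processing", 1, 120),
    System("email", 2, 40),
    System("patient scheduling", 1, 35),
]

for rank, system in enumerate(recovery_order(inventory), start=1):
    print(f"{rank}. {system.name} (tier {system.criticality}, "
          f"{system.device_count} devices)")
```

The tie-break is only one possible choice. An organization might instead rank systems within a tier by revenue at risk or by regulatory deadlines.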

Congressional hearings will very likely examine what went wrong with CrowdStrike’s testing and quality assurance processes, with the aim of preventing a repeat of the issue. But businesses cannot assume any evaluation will prevent the next interruption, so every critical system interruption needs to be approached with a clear business continuity and recovery plan.

Cybersecurity domino effect

The domino effect of any cybersecurity issue is a key factor to consider. Hundreds of organizations have faced service disruptions this year due to either a single attack or system failure (such as a routine software update) involving a third-party vendor.

In the global tech supply chain, there’s a lot of concentration and over-reliance on a few providers, especially if you look just at the ones servicing particular industry sectors, such as automobile dealers, medical facilities, and many others. Put simply, one tech vendor, but tens of thousands of ripple effects.

Indeed, only 150 companies account for 90% of the technology products and services that global companies are using in their systems, according to research from SecurityScorecard and McKinsey & Co.

It is essential to appreciate that dependency and to maintain relationships with a variety of tech vendors that can provide information, solutions and even backup assistance when you need them.

It is also important to appreciate what your organization’s impact on others downstream is or could be, taking note of how both your interruption and recovery plan will affect other stakeholders for whom you’re a vendor or provider of services.
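One way to make that downstream view concrete is to record who depends on whom and walk the map outward from a single provider. The sketch below is a minimal illustration under stated assumptions: the `serves` map, the party names and the `downstream` helper are all hypothetical, and a real dependency inventory would be far larger and messier.

```python
from collections import deque

# Hypothetical map: each key provides a service to the parties in its list.
serves = {
    "security vendor": ["airline", "bank"],
    "airline": ["travelers", "cargo customers"],
    "bank": ["retail customers", "trading desk"],
    "trading desk": ["exchange"],
}


def downstream(provider, serves):
    """Return every party reachable from provider, in breadth-first order."""
    seen = {provider}  # guards against cycles in the dependency map
    order = []
    queue = deque([provider])
    while queue:
        current = queue.popleft()
        for dependent in serves.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order


for party in downstream("bank", serves):
    print(party)
```

Running the same walk from your own organization's entry shows which stakeholders an interruption, or a slow recovery, could reach.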

Any system outage or chaotic situation is ripe for manipulation by bad actors seeking to capitalize on the tremendous distraction, resource reallocation and need for assistance organizations face at those times. Business might have returned to normal for you by the time you read this, but dialing up surveillance protocols in the midst of a business disruption, so as not to fall prey to malicious actors, is always a go-to action item.